Data Representation

Board Coverage

| Board | Paper | Notes |
| --- | --- | --- |
| AQA | Paper 1 | Measures of location and spread, coding |
| Edexcel | P1 | Similar |
| OCR (A) | Paper 1 | Includes outlier detection |
| CIE (9709) | P1, P6 | Data handling in P1; further statistics in P6 |
info

You must know when to use the sample variance formula (dividing by $n-1$) versus the population variance formula (dividing by $n$). Edexcel and OCR use $n-1$ for sample data.


1. Measures of Central Tendency

1.1 Mean

Definition. The mean of $n$ values $x_1, x_2, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$

1.2 The mean minimises the sum of squared deviations

Theorem. The function $S(a) = \displaystyle\sum_{i=1}^{n}(x_i - a)^2$ is minimised when $a = \bar{x}$.

Proof. Expand $S(a)$:

$$S(a) = \sum(x_i^2 - 2ax_i + a^2) = \sum x_i^2 - 2a\sum x_i + na^2$$

Differentiate with respect to $a$:

$$\frac{dS}{da} = -2\sum x_i + 2na$$

Setting $\dfrac{dS}{da} = 0$: $2na = 2\sum x_i \implies a = \dfrac{\sum x_i}{n} = \bar{x}$.

Check: $\dfrac{d^2S}{da^2} = 2n > 0$, so this is a minimum. $\blacksquare$

Intuition. The mean is the "centre of mass" of the data. It is the single value that best represents all the data points in the sense of least squares — no other value produces a smaller total squared error. This is why the mean is the foundation of regression and estimation theory.
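The least-squares property can be checked numerically. The following is a minimal Python sketch; the function name `sum_sq_dev` and the sample values are illustrative assumptions, not part of the notes.

```python
from statistics import fmean

def sum_sq_dev(data, a):
    """Return S(a), the sum of squared deviations of the data from a."""
    return sum((x - a) ** 2 for x in data)

data = [3, 5, 7, 2, 8, 4, 6, 5]   # illustrative values
x_bar = fmean(data)               # arithmetic mean of the data

# Evaluate S(a) at the mean and at a few nearby candidate values of a.
candidates = [x_bar - 1, x_bar - 0.1, x_bar, x_bar + 0.1, x_bar + 1]
best = min(candidates, key=lambda a: sum_sq_dev(data, a))

print(x_bar, best, sum_sq_dev(data, x_bar))
```

Among the candidates tried, the smallest value of $S(a)$ occurs at $a = \bar{x}$. Checking a handful of candidates is only a spot check, of course; the calculus argument above is what proves the result for every $a$.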

1.3 Median

The median is the middle value when data are arranged in order. For $n$ values:

  • If $n$ is odd: median = $\dfrac{n+1}{2}$-th value.
  • If $n$ is even: median = average of $\dfrac{n}{2}$-th and $\left(\dfrac{n}{2}+1\right)$-th values.
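The position rule above can be sketched in Python as follows. The function name `median` is an illustrative choice here (Python's standard `statistics.median` does the same job); note that the rule counts positions from 1 while Python lists are indexed from 0.

```python
def median(values):
    """Median by the position rule: the middle value when n is odd,
    or the mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # (n+1)/2-th value in 1-based counting is index n//2 in 0-based counting
        return ordered[mid]
    # n/2-th and (n/2 + 1)-th values are indices mid-1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 3, 5, 7, 8, 11, 14, 18, 23]))  # odd n
print(median([3, 5, 7, 2, 8, 4, 6, 5]))         # even n
```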

1.4 Mode

The mode is the most frequently occurring value. A dataset can be unimodal, bimodal, or have no mode.

1.5 Comparing measures

  • The mean uses all data values but is affected by outliers.
  • The median is robust to outliers but ignores the magnitude of extreme values.
  • The mode is useful for categorical data.
warning

A few extreme values can pull the mean far from the centre of the data.


2. Variance and Standard Deviation

2.1 Definition

The variance of $x_1, \ldots, x_n$ is

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The standard deviation is $\sigma = \sqrt{\sigma^2}$.

2.2 Computational formula

Theorem. $\sigma^2 = \dfrac{\sum x_i^2}{n} - \bar{x}^2$

Proof.

$$\begin{aligned} \sigma^2 &= \frac{1}{n}\sum(x_i - \bar{x})^2 = \frac{1}{n}\sum(x_i^2 - 2\bar{x}x_i + \bar{x}^2) \\ &= \frac{1}{n}\left[\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2\right] \\ &= \frac{\sum x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2 \\ &= \frac{\sum x_i^2}{n} - \bar{x}^2 \quad \blacksquare \end{aligned}$$
tip

This formula is computationally more efficient and is the one you should use in exams. Just remember: "mean of squares minus square of mean."

2.3 Sample variance

For sample data, the unbiased estimator of the population variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1}$$

The division by $n-1$ (Bessel's correction) accounts for the fact that $\bar{x}$ is estimated from the same data, losing one degree of freedom.
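To see the two divisors side by side, here is a minimal Python sketch using illustrative data. The variable names are assumptions made for this example; `pvariance` and `variance` are the population and sample variance functions in Python's standard `statistics` module, used here only as a cross-check.

```python
from statistics import pvariance, variance

data = [4, 8, 6, 5, 3, 7, 9, 2]   # illustrative values
n = len(data)
mean = sum(data) / n
sum_sq = sum(x * x for x in data)

# Population variance: "mean of squares minus square of mean" (divide by n)
sigma2 = sum_sq / n - mean ** 2

# Sample variance with Bessel's correction (divide by n - 1)
s2 = (sum_sq - n * mean ** 2) / (n - 1)

print(sigma2, pvariance(data))
print(s2, variance(data))
```

The sample variance is always the larger of the two, by the factor $n/(n-1)$, and the difference shrinks as $n$ grows.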


3. Quartiles, IQR, and Box Plots

3.1 Quartiles

  • $Q_1$ (lower quartile): the median of the lower half of the data.
  • $Q_2$ (median): the median of all the data.
  • $Q_3$ (upper quartile): the median of the upper half.

The interquartile range (IQR) is $Q_3 - Q_1$.

3.2 Box plots

A box plot displays:

  • Minimum and maximum values (or whisker endpoints)
  • $Q_1$, $Q_2$ (median), $Q_3$
  • The box spans from $Q_1$ to $Q_3$
  • The median is marked inside the box

3.3 Outlier detection

An outlier is a value that lies more than $1.5 \times \mathrm{IQR}$ below $Q_1$ or above $Q_3$:

$$\text{Lower fence} = Q_1 - 1.5 \times \mathrm{IQR}$$

$$\text{Upper fence} = Q_3 + 1.5 \times \mathrm{IQR}$$

Values outside these fences are potential outliers.
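The fence calculation can be sketched in Python as below. The helper names (`median`, `quartiles`, `iqr_outliers`) are illustrative assumptions. The quartiles follow the median-of-halves rule from Section 3.1, with the middle value left out of both halves when $n$ is odd; other quartile conventions exist and can give slightly different fences.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) by the median-of-halves rule.
    Assumes at least two data values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def iqr_outliers(values, k=1.5):
    """Return ((lower fence, upper fence), values outside the fences)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (lo, hi), [x for x in values if x < lo or x > hi]

fences, flagged = iqr_outliers([4, 6, 8, 10, 12, 13, 15, 16, 48])
print(fences, flagged)
```

The multiplier `k` is left as a parameter so that the same function covers conventions other than 1.5.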

warning

Conventions vary: some use $1.5 \times \mathrm{IQR}$, others use different multipliers.


4. Coding Data

4.1 Linear coding

Definition. Coding transforms data using $y = \dfrac{x - a}{c}$ where $a$ and $c$ are constants ($c \neq 0$).

4.2 Effect on summary statistics

If $y_i = \dfrac{x_i - a}{c}$, then:

$$\bar{y} = \frac{\bar{x} - a}{c}, \qquad \sigma_y = \frac{\sigma_x}{|c|}$$

Proof.

$$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{n}\sum\frac{x_i - a}{c} = \frac{1}{c}\left(\frac{\sum x_i}{n} - a\right) = \frac{\bar{x} - a}{c}$$

$$\sigma_y^2 = \frac{1}{n}\sum(y_i - \bar{y})^2 = \frac{1}{n}\sum\left(\frac{x_i - a}{c} - \frac{\bar{x}-a}{c}\right)^2 = \frac{1}{c^2}\cdot\frac{1}{n}\sum(x_i-\bar{x})^2 = \frac{\sigma_x^2}{c^2}$$

Hence $\sigma_y = \sigma_x/|c|$. $\blacksquare$

tip

Coding makes computation easier when data values are large. Always work with coded data to find the mean and standard deviation, then decode back. Remember: adding a constant shifts the mean but does not affect the spread.
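The code-then-decode workflow can be illustrated with a short Python sketch. The raw data and the constants `a` and `c` are made up for this example; `fmean` and `pstdev` are the mean and population standard deviation from Python's standard `statistics` module.

```python
from statistics import fmean, pstdev

x = [1010, 1025, 1040, 1055, 1070]   # illustrative raw data
a, c = 1000, 5                        # illustrative coding constants
y = [(xi - a) / c for xi in x]        # coded data

# Work with the (smaller) coded values...
y_bar, sigma_y = fmean(y), pstdev(y)

# ...then decode: x_bar = a + c * y_bar and sigma_x = |c| * sigma_y
x_bar = a + c * y_bar
sigma_x = abs(c) * sigma_y

# Cross-check against statistics computed directly from the raw data
print(x_bar, fmean(x))
print(sigma_x, pstdev(x))
```

Decoding simply inverts the coding: the shift $a$ affects only the mean, while the scale factor $c$ affects both the mean and the standard deviation.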


5. Frequency Tables and Grouped Data

5.1 Discrete frequency data

For data values $x_1, x_2, \ldots, x_k$ with frequencies $f_1, f_2, \ldots, f_k$:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}, \qquad \sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2$$

5.2 Grouped continuous data

Use the midpoint of each class as the representative value. This introduces an approximation since we lose information about the distribution within each class.
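A minimal Python sketch of both cases follows. The function name `freq_stats` is an illustrative assumption, and the two frequency tables are example data; for grouped data each class is replaced by its midpoint, so the results are estimates only.

```python
def freq_stats(values, freqs):
    """Return (mean, population variance) from a frequency table."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    var = sum(f * x * x for x, f in zip(values, freqs)) / n - mean ** 2
    return mean, var

# Discrete data: the values are used directly.
goals_mean, goals_var = freq_stats([0, 1, 2, 3, 4], [3, 7, 5, 3, 2])

# Grouped data: each class (lower, upper) is represented by its midpoint.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 12, 18, 8, 4]
midpoints = [(lo + hi) / 2 for lo, hi in classes]
est_mean, est_var = freq_stats(midpoints, freqs)

print(goals_mean, goals_var)
print(est_mean, est_var)
```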


6. Skewness

6.1 Definition

Skewness measures the asymmetry of a distribution about its centre. A distribution is:

  • Positively skewed (right-skewed): the right tail is longer; mean $>$ median.
  • Negatively skewed (left-skewed): the left tail is longer; mean $<$ median.
  • Symmetric: mean = median (and mode, for unimodal distributions).

6.2 Pearson's coefficient of skewness

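Pearson's first coefficient, as labelled in these notes, uses the mean, median, and standard deviation. (Many textbooks call this median-based form Pearson's second coefficient and reserve "first" for the mode-based form $(\bar{x} - \text{mode})/\sigma$.)

$$S_1 = \frac{3\left(\bar{x} - Q_2\right)}{\sigma}$$

Pearson's second coefficient, as labelled here, uses only the quartiles. (It is more widely known as the quartile, or Bowley, coefficient of skewness.)

$$S_2 = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

Interpretation:

  • $S > 0$: positive skew (right tail longer).
  • $S < 0$: negative skew (left tail longer).
  • $S = 0$: symmetric distribution.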
info

The quartile-based coefficient is useful when quartiles are already known and the standard deviation has not been calculated. The two coefficients usually agree in sign but may differ in magnitude.
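Both coefficients are easy to compute once the summary statistics are known. The sketch below is a minimal Python illustration; the function names are assumptions made for this example, the standard deviation is the population form (dividing by $n$), and the quartiles passed to the second function were found by the median-of-halves rule.

```python
from statistics import fmean, median, pstdev

def skew_mean_median(data):
    """S_1 = 3 * (mean - median) / sigma, with population sigma."""
    return 3 * (fmean(data) - median(data)) / pstdev(data)

def skew_quartile(q1, q2, q3):
    """S_2 = (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

data = [3, 5, 6, 7, 8, 9, 10, 12, 45]   # one large value in the right tail
s1 = skew_mean_median(data)
s2 = skew_quartile(5.5, 8, 11)

print(round(s1, 2), round(s2, 2))
```

For this dataset both values are positive, but the mean-based coefficient is far larger because the single extreme value moves the mean much more than it moves the quartiles.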

6.3 Relationship between measures of central tendency

For a unimodal distribution, the following ordering typically holds:

  • Symmetric: mean = median = mode.
  • Positively skewed: mode $<$ median $<$ mean.
  • Negatively skewed: mean $<$ median $<$ mode.

The mean is pulled in the direction of the longer tail, while the mode remains at the peak and the median lies between them.


7. Outliers in Depth

7.1 The IQR method — mild and extreme outliers

As introduced in Section 3.3, the $1.5 \times \mathrm{IQR}$ rule defines fences. Some boards further distinguish between mild and extreme outliers:

  • Mild outlier: a value between $1.5 \times \mathrm{IQR}$ and $3 \times \mathrm{IQR}$ from the nearest quartile.
  • Extreme outlier: a value more than $3 \times \mathrm{IQR}$ from the nearest quartile.

$$\text{Extreme lower fence} = Q_1 - 3 \times \mathrm{IQR}$$

$$\text{Extreme upper fence} = Q_3 + 3 \times \mathrm{IQR}$$

7.2 The modified z-score method

The modified z-score uses the median absolute deviation (MAD). For a dataset $x_1, x_2, \ldots, x_n$ with median $\tilde{x}$:

$$\mathrm{MAD} = \mathrm{median}\left(|x_i - \tilde{x}|\right)$$

The modified z-score for each observation is:

$$M_i = \frac{0.6745\left(x_i - \tilde{x}\right)}{\mathrm{MAD}}$$

An observation is flagged as an outlier if $|M_i| > 3.5$.

tip

The modified z-score is built from the median and the MAD, which are themselves resistant to outliers. The factor $0.6745$ is the $0.75$-quantile of the standard normal distribution, so the modified z-score is on a comparable scale to the standard z-score for normally distributed data.
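The method can be sketched in a few lines of Python. The function name `modified_z_scores` and the guard against a zero MAD are assumptions made for this example; the data are illustrative.

```python
from statistics import median

def modified_z_scores(data):
    """Return the modified z-score of each observation."""
    med = median(data)
    mad = median(abs(x - med) for x in data)
    if mad == 0:
        # More than half the values equal the median, so the score is undefined.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in data]

data = [4, 6, 8, 10, 12, 13, 15, 16, 48]
scores = modified_z_scores(data)
flagged = [x for x, m in zip(data, scores) if abs(m) > 3.5]

print(flagged)
```

Here the median is 12 and the MAD is 4, so only the value 48 exceeds the threshold of 3.5.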

7.3 Choosing an outlier method

| Method | Strengths | Limitations |
| --- | --- | --- |
| IQR ($1.5 \times$) | Standard at A-level; easy to apply from quartiles | Less effective with very small samples |
| Modified z-score | Robust to multiple or clustered outliers | Requires computing the MAD, less common |

8. Box Plots — Drawing and Interpreting

8.1 Drawing a box plot

To construct a box plot:

  1. Draw a horizontal (or vertical) number line covering the range of the data.
  2. Draw a rectangular box from $Q_1$ to $Q_3$.
  3. Mark the median $Q_2$ as a line inside the box.
  4. Extend a whisker from $Q_1$ to the smallest data value at or above the lower fence, and from $Q_3$ to the largest data value at or below the upper fence.
  5. Plot any values outside the fences as individual points (these are the outliers).
warning

The whiskers end at the most extreme data values inside the fences, not at the fences themselves. If no values lie outside the fences, the whiskers extend to the minimum and maximum of the dataset.
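The whisker rule is easy to get wrong, so here is a minimal Python sketch of it. The function name `box_plot_parts` is an illustrative assumption, and the quartiles are supplied by the caller (in the example they were found by the median-of-halves rule).

```python
def box_plot_parts(data, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers)
    for given quartiles, using fences at k * IQR."""
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    # Whiskers stop at actual data values, not at the fences.
    return min(inside), max(inside), outliers

data = [4, 6, 8, 10, 12, 13, 15, 16, 48]
low, high, outliers = box_plot_parts(data, q1=7, q3=15.5)
print(low, high, outliers)
```

In this example the upper fence is 28.25, but the upper whisker stops at 16, the largest data value below the fence, and 48 is plotted as a separate point.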

8.2 Interpreting skewness from a box plot

Compare the distances from $Q_2$ to each quartile:

  • $Q_3 - Q_2 > Q_2 - Q_1$: the upper half is more spread out, indicating positive skew.
  • $Q_3 - Q_2 < Q_2 - Q_1$: the lower half is more spread out, indicating negative skew.
  • $Q_3 - Q_2 \approx Q_2 - Q_1$: the distribution is approximately symmetric.

Outliers on one side also indicate skewness in that direction.

8.3 Comparing box plots

When two or more box plots are drawn on the same scale, compare:

  • Location: which distribution has the higher median?
  • Spread: which has the larger IQR or total range?
  • Skewness: do the distributions differ in shape?
  • Outliers: does one distribution have more extreme values?
warning

A statement such as "distribution A has a higher median" is incomplete without also addressing how the spreads compare.


9. Comparing Distributions

9.1 Back-to-back stem-and-leaf diagrams

A back-to-back stem-and-leaf diagram places two distributions on either side of a shared stem, enabling direct visual comparison of shape, spread, and outliers.

Example. Comparing test scores of two classes (Class A | Stem | Class B):

8 7 5 | 5 | 3 4 6
4 3 1 | 6 | 0 2 7 9
6 2 0 | 7 | 1 3 5 8
9 5 | 8 | 2 4
| 9 | 1 7

Reading from the diagram: Class A has scores 55, 57, 58, 61, 63, 64, ... while Class B has 53, 54, 56, 60, 62, 67, ... Both classes share the stem (tens digit), with Class A on the left and Class B on the right.

9.2 Cumulative frequency curves

A cumulative frequency curve (ogive) plots cumulative frequency against the upper class boundary of each group. To compare two distributions:

  1. Plot both ogives on the same axes.
  2. Read off medians, quartiles, and percentiles from each curve.
  3. Compare location (medians), spread (IQR), and shape (skewness).
tip

To read off a quantile, draw a horizontal line from the required cumulative frequency to the curve, then drop a vertical line to the $x$-axis. The reverse process gives the cumulative frequency for a given $x$-value.
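The reverse reading (from an $x$-value to a cumulative frequency) can be sketched in Python. This is a minimal illustration under a stated simplification: the plotted points are joined by straight lines, giving a cumulative frequency polygon rather than a smooth hand-drawn curve. The class boundaries, frequencies, and the function name `cf_at` are assumptions made for this example.

```python
from itertools import accumulate

upper_bounds = [10, 20, 30, 40, 50]   # upper class boundaries
freqs = [5, 12, 18, 8, 4]             # class frequencies

# Points to plot: (upper boundary, cumulative frequency),
# starting from the lowest class boundary with cumulative frequency 0.
points = [(0, 0)] + list(zip(upper_bounds, accumulate(freqs)))

def cf_at(x, points):
    """Cumulative frequency at x, joining plotted points with straight lines."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the plotted range")

print(points)
print(cf_at(25, points))
```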

9.3 Structuring a comparison

When asked to compare two distributions in an exam, structure your response around four points:

  1. Average: compare means or medians, stated in the context of the data.
  2. Spread: compare standard deviations or IQRs, stated in context.
  3. Shape: compare skewness where apparent.
  4. Outliers: mention any unusual values and their effect.

Always relate numerical comparisons to the original context of the data.


10. Interpolation from Grouped Data

10.1 Linear interpolation formula

When data are grouped into classes, quantiles are estimated using linear interpolation. For the $p$-th percentile:

$$x_p = L + \left(\frac{p \cdot n}{100} - c_f\right) \cdot \frac{w}{f}$$

where:

  • $L$ = lower class boundary of the class containing the $p$-th percentile
  • $n$ = total frequency
  • $c_f$ = cumulative frequency of all classes below $L$
  • $w$ = class width
  • $f$ = frequency of the class containing the $p$-th percentile

For the median ($p = 50$):

$$Q_2 = L + \left(\frac{n}{2} - c_f\right) \cdot \frac{w}{f}$$

For quartiles ($p = 25$ and $p = 75$):

$$Q_1 = L + \left(\frac{n}{4} - c_f\right) \cdot \frac{w}{f}, \qquad Q_3 = L + \left(\frac{3n}{4} - c_f\right) \cdot \frac{w}{f}$$

10.2 Worked example

Find the median from the following grouped frequency distribution:

| Class | Frequency |
| --- | --- |
| $0 < x \le 10$ | 5 |
| $10 < x \le 20$ | 12 |
| $20 < x \le 30$ | 18 |
| $30 < x \le 40$ | 8 |
| $40 < x \le 50$ | 4 |

$n = 47$. The median position is $n/2 = 23.5$.

Cumulative frequencies: 5, 17, 35, 43, 47. The 23.5th value falls in the class $20 < x \le 30$.

$$Q_2 = 20 + \left(23.5 - 17\right) \cdot \frac{10}{18} = 20 + 6.5 \cdot \frac{10}{18} = 20 + \frac{65}{18} \approx 23.6$$

info

Linear interpolation gives only an approximation; the true quantile may differ if the data are not uniformly spread within the class.
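The interpolation formula of Section 10.1 can be sketched as a small Python function. The name `grouped_percentile` and its argument layout are assumptions made for this example; classes are given as (lower boundary, upper boundary) pairs in increasing order.

```python
def grouped_percentile(classes, freqs, p):
    """Estimate the p-th percentile of grouped data by linear interpolation."""
    n = sum(freqs)
    target = p * n / 100          # position of the percentile
    cum = 0                       # cumulative frequency below the current class
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cum + f >= target:
            # L + (position - c_f) * w / f
            return lower + (target - cum) * (upper - lower) / f
        cum += f
    raise ValueError("p must be between 0 and 100")

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 12, 18, 8, 4]
print(round(grouped_percentile(classes, freqs, 50), 1))
```

The same call with `p=25` or `p=75` gives the quartile estimates, so the IQR follows from two further calls.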


Problem Set

Problem 1 For the dataset $\{3, 5, 7, 2, 8, 4, 6, 5\}$, find the mean, median, and mode.

Solution 1 Ordered: $\{2, 3, 4, 5, 5, 6, 7, 8\}$. $n = 8$.

Mean: $\bar{x} = (3+5+7+2+8+4+6+5)/8 = 40/8 = 5$.

Median: average of 4th and 5th values = $(5+5)/2 = 5$.

Mode: 5 (appears twice).

If you get this wrong, revise: Measures of Central Tendency — Section 1.

Problem 2 Find the variance and standard deviation of $\{4, 8, 6, 5, 3, 7, 9, 2\}$ using the computational formula.

Solution 2 $\sum x = 44$, $n = 8$, $\bar{x} = 44/8 = 5.5$.

$\sum x^2 = 16 + 64 + 36 + 25 + 9 + 49 + 81 + 4 = 284$.

$\sigma^2 = 284/8 - 5.5^2 = 35.5 - 30.25 = 5.25$.

$\sigma = \sqrt{5.25} \approx 2.29$.

If you get this wrong, revise: Computational Formula — Section 2.2.

Problem 3 Data is coded using $y = (x - 100)/5$. The coded data has mean 12 and variance 9. Find the original mean and standard deviation.

Solution 3 $\bar{y} = (\bar{x} - 100)/5 = 12 \implies \bar{x} - 100 = 60 \implies \bar{x} = 160$.

The coded variance is 9, so $\sigma_y = 3$. Then $\sigma_y = \sigma_x/5 = 3 \implies \sigma_x = 15$.

If you get this wrong, revise: Coding Data — Section 4.

Problem 4 For the ordered dataset $\{2, 3, 5, 7, 8, 11, 14, 18, 23\}$, find $Q_1$, $Q_2$, $Q_3$, and the IQR. Identify any outliers.

Solution 4 $n = 9$ (odd). $Q_2 = 5$th value $= 8$.

Lower half: $\{2, 3, 5, 7\}$. $Q_1 = (3+5)/2 = 4$. Upper half: $\{11, 14, 18, 23\}$. $Q_3 = (14+18)/2 = 16$.

$\mathrm{IQR} = 16 - 4 = 12$.

Lower fence: $4 - 1.5(12) = -14$. Upper fence: $16 + 1.5(12) = 34$.

All values are within $[-14, 34]$, so no outliers.

If you get this wrong, revise: Quartiles, IQR, and Box Plots — Section 3.

Problem 5 The following frequency table shows the number of goals scored in 20 football matches. Find the mean and variance.

| Goals | Frequency |
| --- | --- |
| 0 | 3 |
| 1 | 7 |
| 2 | 5 |
| 3 | 3 |
| 4 | 2 |

Solution 5 $\sum f = 20$.

$\sum fx = 0 + 7 + 10 + 9 + 8 = 34$. $\bar{x} = 34/20 = 1.7$.

$\sum fx^2 = 0 + 7 + 20 + 27 + 32 = 86$.

$\sigma^2 = 86/20 - 1.7^2 = 4.3 - 2.89 = 1.41$.

If you get this wrong, revise: Frequency Tables — Section 5.1.

Problem 6 Prove that $\displaystyle\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.

Solution 6

$$\sum(x_i - \bar{x}) = \sum x_i - n\bar{x} = \sum x_i - n \cdot \frac{\sum x_i}{n} = \sum x_i - \sum x_i = 0 \quad \blacksquare$$

If you get this wrong, revise: Mean — Section 1.1.

Problem 7 Two datasets A and B have the same mean but A has standard deviation 5 while B has standard deviation 2. What does this tell you about the two datasets?

Solution 7 Both datasets are centred at the same point (same mean), but dataset A is more spread out (larger standard deviation). The values in A are more dispersed from the mean, while B's values cluster more tightly around the mean.

If you get this wrong, revise: Variance and Standard Deviation — Section 2.

Problem 8 Given that $\bar{x} = 20$ and $\sum(x_i - 20)^2 = 360$ for $n = 10$ observations, find $\sigma$ and the sample variance $s^2$.

Solution 8 $\sigma^2 = 360/10 = 36$, so $\sigma = 6$.

$s^2 = 360/9 = 40$.

If you get this wrong, revise: Sample Variance — Section 2.3.

Problem 9 A dataset $\{x_i\}$ has mean 10 and standard deviation 4. A new dataset is formed by adding 5 to each value and then multiplying by 3. Find the new mean and standard deviation.

Solution 9 Adding 5 shifts mean by 5: mean becomes 15. SD unchanged at 4. Multiplying by 3 scales mean by 3: mean becomes 45. SD scales by 3: SD becomes 12.

New mean = 45, new SD = 12.

If you get this wrong, revise: Coding Data — Section 4.2.

Problem 10 Explain why the median is preferred to the mean for measuring average income in a country.

Solution 10 Income distributions are typically right-skewed: a small number of very high earners pull the mean upward. The median, being the middle value, is largely unaffected by extreme values and gives a more representative "typical" income. For example, if one billionaire lives in a village of 1000 people earning $30\,000$, the mean would be vastly inflated while the median would remain close to $30\,000$.

If you get this wrong, revise: Comparing Measures — Section 1.5.

Problem 11 For the dataset $\{2, 4, 5, 6, 7, 8, 12, 15, 28\}$, find $Q_1$, $Q_2$, $Q_3$, $\bar{x}$, and $\sigma$. Hence calculate Pearson's first coefficient of skewness and interpret the result.

Solution 11 $n = 9$. Ordered data: $\{2, 4, 5, 6, 7, 8, 12, 15, 28\}$.

$Q_2 = 5$th value $= 7$.

Lower half: $\{2, 4, 5, 6\}$. $Q_1 = (4+5)/2 = 4.5$. Upper half: $\{8, 12, 15, 28\}$. $Q_3 = (12+15)/2 = 13.5$.

$\bar{x} = (2+4+5+6+7+8+12+15+28)/9 = 87/9 \approx 9.67$.

$\sum x^2 = 4 + 16 + 25 + 36 + 49 + 64 + 144 + 225 + 784 = 1347$.

Using the unrounded mean to avoid rounding error: $\sigma^2 = 1347/9 - (87/9)^2 \approx 149.67 - 93.44 = 56.22$, so $\sigma \approx 7.50$.

Pearson's first coefficient:

$$S_1 = \frac{3(9.67 - 7)}{7.50} \approx \frac{8.00}{7.50} \approx 1.07$$

Since $S_1 > 0$, the distribution is positively skewed. This is consistent with the right tail produced by the value 28.

If you get this wrong, revise: Skewness — Section 6.

Problem 12 A box plot shows: minimum = 5, $Q_1 = 12$, $Q_2 = 18$, $Q_3 = 25$, maximum = 34, with one outlier at 42. Calculate the IQR, the upper fence, and describe the skewness of the distribution.

Solution 12 $\mathrm{IQR} = Q_3 - Q_1 = 25 - 12 = 13$.

Upper fence $= Q_3 + 1.5 \times \mathrm{IQR} = 25 + 1.5 \times 13 = 25 + 19.5 = 44.5$. Note that $42 < 44.5$, so under the $1.5 \times \mathrm{IQR}$ rule the value 42 would not be classed as an outlier, even though the plot marks it as one.

Skewness: $Q_3 - Q_2 = 25 - 18 = 7$ and $Q_2 - Q_1 = 18 - 12 = 6$.

Since $7 > 6$ (and the largest value, 42, sits well above the upper whisker), the distribution is positively skewed, though only slightly so from the quartiles alone.

If you get this wrong, revise: Box Plots — Drawing and Interpreting — Section 8.

Problem 13 Two classes sat the same maths test. Their results are summarised in back-to-back stem-and-leaf form:

Class A | Stem | Class B
9 7 3 | 4 | 1 2 5
8 5 4 | 5 | 0 3 6 8
6 2 0 | 6 | 1 4 7
4 | 7 | 2 5 9
| 8 | 3 6

Compare the distributions of the two classes.

Solution 13 Class A: $\{43, 47, 49, 54, 55, 58, 60, 62, 66, 74\}$. $n = 10$. Median $= (55 + 58)/2 = 56.5$. Range $= 74 - 43 = 31$.

Class B: $\{41, 42, 45, 50, 53, 56, 58, 61, 64, 67, 72, 75, 79, 83, 86\}$. $n = 15$. Median $= 8$th value $= 61$. Range $= 86 - 41 = 45$.

Comparison:

  1. Average: Class B has the higher median (61 vs 56.5), so on average Class B performed better.
  2. Spread: Class B has a wider range (45 vs 31), suggesting greater variability in scores.
  3. Shape: Both distributions are roughly symmetric. Class B extends further in both directions.
  4. Outliers: No obvious outliers in either class.

If you get this wrong, revise: Comparing Distributions — Section 9.

Problem 14 Estimate the median and interquartile range from the following grouped frequency distribution using linear interpolation:

| Class | Frequency |
| --- | --- |
| $10 < x \le 20$ | 8 |
| $20 < x \le 30$ | 15 |
| $30 < x \le 40$ | 22 |
| $40 < x \le 50$ | 10 |
| $50 < x \le 60$ | 5 |

Solution 14 $n = 60$.

Median ($n/2 = 30$th value). Cumulative frequencies: 8, 23, 45, 55, 60. The 30th value falls in the class $30 < x \le 40$.

$$Q_2 = 30 + \left(\frac{60}{2} - 23\right) \cdot \frac{10}{22} = 30 + (30 - 23) \cdot \frac{10}{22} = 30 + 7 \cdot \frac{10}{22} = 30 + \frac{70}{22} \approx 33.18$$

Lower quartile ($n/4 = 15$th value). The 15th value falls in $20 < x \le 30$.

$$Q_1 = 20 + \left(15 - 8\right) \cdot \frac{10}{15} = 20 + 7 \cdot \frac{10}{15} = 20 + \frac{70}{15} \approx 24.67$$

Upper quartile ($3n/4 = 45$th value). The 45th value falls in $30 < x \le 40$.

$$Q_3 = 30 + \left(45 - 23\right) \cdot \frac{10}{22} = 30 + 22 \cdot \frac{10}{22} = 30 + 10 = 40$$

$\mathrm{IQR} = Q_3 - Q_1 = 40 - 24.67 = 15.33$.

If you get this wrong, revise: Interpolation from Grouped Data — Section 10.

Problem 15 The ordered dataset $\{4, 6, 8, 10, 12, 13, 15, 16, 48\}$ has median 12 and MAD = 4. Use the modified z-score method to determine whether the value 48 is an outlier.

Solution 15 $\tilde{x} = 12$ (median). $\mathrm{MAD} = 4$.

For $x = 48$:

$$M = \frac{0.6745(48 - 12)}{4} = \frac{0.6745 \times 36}{4} = \frac{24.282}{4} \approx 6.07$$

Since $|M| \approx 6.07 > 3.5$, the value 48 is classified as an outlier by the modified z-score method.

If you get this wrong, revise: Outliers in Depth — Section 7.2.

Problem 16 A grouped frequency distribution has class $50 < w \le 60$ with frequency 14. The cumulative frequency below this class is 52, and the total frequency is 80. Use linear interpolation to estimate $Q_3$.

Solution 16 $Q_3$ is at position $3n/4 = 3 \times 80 / 4 = 60$.

The cumulative frequency reaches $52 + 14 = 66$ at the top of the class, and $52 < 60 \le 66$, so the 60th value falls in the class $50 < w \le 60$.

$L = 50$, $c_f = 52$, $f = 14$, $w = 10$.

$$Q_3 = 50 + \left(60 - 52\right) \cdot \frac{10}{14} = 50 + 8 \cdot \frac{10}{14} = 50 + \frac{80}{14} \approx 55.71$$

If you get this wrong, revise: Interpolation from Grouped Data — Section 10.1.

Problem 17 For the dataset $\{3, 5, 6, 7, 8, 9, 10, 12, 45\}$, compute both Pearson's first and second coefficients of skewness. Do they agree on the direction of skewness?

Solution 17 $n = 9$. $Q_2 = 5$th value $= 8$.

Lower half: $\{3, 5, 6, 7\}$. $Q_1 = (5+6)/2 = 5.5$. Upper half: $\{9, 10, 12, 45\}$. $Q_3 = (10+12)/2 = 11$.

$\bar{x} = (3+5+6+7+8+9+10+12+45)/9 = 105/9 \approx 11.67$.

$\sum x^2 = 9 + 25 + 36 + 49 + 64 + 81 + 100 + 144 + 2025 = 2533$.

Using the unrounded mean to avoid rounding error: $\sigma^2 = 2533/9 - (105/9)^2 \approx 281.44 - 136.11 = 145.33$, so $\sigma \approx 12.06$.

Pearson's first coefficient:

$$S_1 = \frac{3(11.67 - 8)}{12.06} \approx \frac{11.00}{12.06} \approx 0.91$$

Pearson's second coefficient:

$$S_2 = \frac{11 + 5.5 - 2 \times 8}{11 - 5.5} = \frac{16.5 - 16}{5.5} = \frac{0.5}{5.5} \approx 0.09$$

Both coefficients are positive, so they agree on positive skew. However, $S_1$ is much larger because the mean (11.67) is strongly pulled by the outlier 45, whereas $S_2$ depends only on the quartiles, which are less affected by that extreme value.

If you get this wrong, revise: Skewness — Section 6.



tip

Diagnostic Test Ready to test your understanding of Data Representation? The diagnostic test contains the hardest questions within the A-Level specification for this topic, each with a full worked solution.

Unit tests probe edge cases and common misconceptions. Integration tests combine Data Representation with other topics to test synthesis under exam conditions.

See Diagnostic Guide for instructions on self-marking and building a personal test matrix.