Data Representation
Board Coverage
| Board | Paper | Notes |
|---|---|---|
| AQA | Paper 1 | Measures of location and spread, coding |
| Edexcel | P1 | Similar coverage to AQA |
| OCR (A) | Paper 1 | Includes outlier detection |
| CIE (9709) | P1, P6 | Data handling in P1; further statistics in P6 |
You must know when to use the sample variance formula (dividing by $n - 1$) versus the population variance formula (dividing by $n$). Edexcel and OCR use $n - 1$ for sample data.
1. Measures of Central Tendency
1.1 Mean
Definition. The mean of $n$ values $x_1, x_2, \ldots, x_n$ is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
1.2 The mean minimises the sum of squared deviations
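Theorem. The function $f(a) = \sum_{i=1}^{n} (x_i - a)^2$ is minimised when $a = \bar{x}$.
Proof. Expand $f(a)$:
$$f(a) = \sum_{i=1}^{n} x_i^2 - 2a\sum_{i=1}^{n} x_i + na^2$$
Setting $f'(a) = 0$: $-2\sum_{i=1}^{n} x_i + 2na = 0 \implies a = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$.
Check: $f''(a) = 2n > 0$, so this is a minimum.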
Intuition. The mean is the "centre of mass" of the data. It is the single value that best represents all the data points in the sense of least squares — no other value produces a smaller total squared error. This is why the mean is the foundation of regression and estimation theory.
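The least-squares property can be checked numerically. The sketch below uses a small made-up dataset (the values are illustrative, not taken from these notes) and compares the sum of squared deviations at the mean with the sum at a few nearby values.

```python
from statistics import mean


def sum_squared_deviations(data, a):
    """Return the sum of (x - a)^2 over all data values."""
    return sum((x - a) ** 2 for x in data)


data = [2, 4, 4, 5, 10]  # hypothetical illustrative values
x_bar = mean(data)

# Evaluate the sum at the mean and at values slightly either side of it.
f_at_mean = sum_squared_deviations(data, x_bar)
nearby = [sum_squared_deviations(data, x_bar + d) for d in (-1, -0.5, 0.5, 1)]

print(x_bar, f_at_mean, nearby)
```

Every nearby value gives a larger total than the mean does, as the theorem predicts.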
1.3 Median
The median is the middle value when data are arranged in order. For $n$ values:
- If $n$ is odd: median = $\frac{n+1}{2}$-th value.
- If $n$ is even: median = average of the $\frac{n}{2}$-th and $\left(\frac{n}{2} + 1\right)$-th values.
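The position rule for the median translates directly into code. This is a minimal sketch; the function name `median_by_position` is chosen here for illustration.

```python
def median_by_position(values):
    """Median using the odd/even position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1) / 2-th value (index mid when counting from 0).
        return ordered[mid]
    # Even n: average of the n/2-th and (n/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([3, 1, 2]))
print(median_by_position([4, 1, 3, 2]))
```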
1.4 Mode
The mode is the most frequently occurring value. A dataset can be unimodal, bimodal, or have no mode.
1.5 Comparing measures
- The mean uses all data values but is affected by outliers.
- The median is robust to outliers but ignores the magnitude of extreme values.
- The mode is useful for categorical data.
Warning. Check for outliers before relying on the mean. A few extreme values can pull the mean far from the centre of the data.
2. Variance and Standard Deviation
2.1 Definition
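The variance of $x_1, x_2, \ldots, x_n$ is
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The standard deviation is $\sigma = \sqrt{\sigma^2}$.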
2.2 Computational formula
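Theorem.
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
Proof.
$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$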
This formula is computationally more efficient and is the one you should use in exams. Just remember: "mean of squares minus square of mean."
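As a quick check of "mean of squares minus square of mean", the sketch below computes the population variance both ways on a small made-up dataset (the values are illustrative only).

```python
def variance_from_definition(data):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / n


def variance_computational(data):
    """Population variance: mean of the squares minus the square of the mean."""
    n = len(data)
    x_bar = sum(data) / n
    return sum(x * x for x in data) / n - x_bar ** 2


data = [3, 7, 7, 19]  # hypothetical illustrative values
print(variance_from_definition(data), variance_computational(data))
```

Both functions return the same value for this dataset. With non-integer data the two results can differ by a tiny floating-point rounding error.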
2.3 Sample variance
For sample data, the unbiased estimator of the population variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The division by $n - 1$ (Bessel's correction) accounts for the fact that the population mean is estimated by $\bar{x}$ from the same data, losing one degree of freedom.
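Python's standard `statistics` module exposes both divisors: `pvariance` divides by $n$ and `variance` divides by $n - 1$. The sketch below, on made-up data, shows the two results and the factor $\frac{n}{n-1}$ that links them.

```python
from statistics import pvariance, variance

data = [3, 7, 7, 19]  # hypothetical illustrative values
n = len(data)

population_var = pvariance(data)  # divides the sum of squared deviations by n
sample_var = variance(data)       # divides the same sum by n - 1

print(population_var, sample_var)
# The sample variance is the population variance scaled by n / (n - 1).
print(population_var * n / (n - 1))
```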
3. Quartiles, IQR, and Box Plots
3.1 Quartiles
- $Q_1$ (lower quartile): the median of the lower half of the data.
- $Q_2$ (median): the median of all the data.
- $Q_3$ (upper quartile): the median of the upper half.
The interquartile range (IQR) is $Q_3 - Q_1$.
3.2 Box plots
A box plot displays:
- Minimum and maximum values (or whisker endpoints)
- $Q_1$, $Q_2$ (median), $Q_3$
- The box spans from $Q_1$ to $Q_3$
- The median is marked inside the box
3.3 Outlier detection
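An outlier is a value that lies more than $1.5 \times \text{IQR}$ below $Q_1$ or above $Q_3$:
$$\text{lower fence} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{upper fence} = Q_3 + 1.5 \times \text{IQR}$$
Values outside these fences are potential outliers.
Warning. Outlier conventions differ between exam boards. Some use $1.5 \times \text{IQR}$, others use different multipliers.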
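The quartile and fence calculations can be sketched as follows. Two assumptions are made here: the middle value is excluded from both halves when the number of values is odd (conventions vary, so check your board), and the multiplier defaults to 1.5. The dataset is made up for illustration.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def fences(values, k=1.5):
    """Return (lower fence, upper fence) for the k x IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [2, 4, 5, 7, 8, 9, 11, 12, 30]  # hypothetical illustrative values
low, high = fences(data)
outliers = [x for x in data if x < low or x > high]
print(quartiles(data), (low, high), outliers)
```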
4. Coding Data
4.1 Linear coding
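Definition. Coding transforms data using $y = ax + b$ where $a$ and $b$ are constants ($a \neq 0$).
4.2 Effect on summary statistics
If $y = ax + b$, then:
$$\bar{y} = a\bar{x} + b, \qquad \sigma_y^2 = a^2\sigma_x^2$$
Proof.
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (ax_i + b) = a\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + b = a\bar{x} + b$$
$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} (ax_i + b - a\bar{x} - b)^2 = a^2\cdot\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = a^2\sigma_x^2$$
Hence $\sigma_y = |a|\,\sigma_x$.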
Coding makes computation easier when data values are large. Always work with coded data to find the mean and standard deviation, then decode back. Remember: adding a constant shifts the mean but does not affect the spread.
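The code-then-decode workflow can be sketched as below. The code used here, $y = \frac{x - 1000}{10}$, and the dataset are assumptions chosen for illustration. It is a linear code, so the mean is decoded by reversing both steps, while the standard deviation is only rescaled.

```python
from statistics import mean, pstdev

SHIFT = 1000  # hypothetical constant subtracted from each value
SCALE = 10    # hypothetical constant each shifted value is divided by


def code(x):
    """Apply the assumed linear code y = (x - SHIFT) / SCALE."""
    return (x - SHIFT) / SCALE


data = [1010, 1020, 1030, 1040, 1050]  # hypothetical large raw values
coded = [code(x) for x in data]

coded_mean = mean(coded)
coded_sd = pstdev(coded)

# Decode: undo the scaling and the shift for the mean,
# but only the scaling for the standard deviation.
decoded_mean = SCALE * coded_mean + SHIFT
decoded_sd = SCALE * coded_sd

print(coded, decoded_mean, decoded_sd)
```

The decoded values match the mean and standard deviation computed directly from the raw data.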
5. Frequency Tables and Grouped Data
5.1 Discrete frequency data
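For data values $x_1, x_2, \ldots, x_k$ with frequencies $f_1, f_2, \ldots, f_k$:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}, \qquad \sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2$$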
5.2 Grouped continuous data
Use the midpoint of each class as the representative value. This introduces an approximation since we lose information about the distribution within each class.
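The midpoint method can be sketched in a few lines. The class boundaries and frequencies below are made up for illustration, and the variance uses the population form (dividing by the total frequency).

```python
def grouped_mean_variance(classes, freqs):
    """Estimate the mean and population variance from grouped data.

    classes: list of (lower boundary, upper boundary) pairs
    freqs:   list of class frequencies, in the same order
    """
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    mean_of_squares = sum(f * x * x for f, x in zip(freqs, midpoints)) / n
    return mean, mean_of_squares - mean ** 2


classes = [(0, 10), (10, 20), (20, 30)]  # hypothetical class boundaries
freqs = [2, 5, 3]                        # hypothetical frequencies
est_mean, est_var = grouped_mean_variance(classes, freqs)
print(est_mean, est_var)
```

Because every value in a class is replaced by its midpoint, both results are estimates rather than exact values.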
6. Skewness
6.1 Definition
Skewness measures the asymmetry of a distribution about its centre. A distribution is:
- Positively skewed (right-skewed): the right tail is longer; mean $>$ median.
- Negatively skewed (left-skewed): the left tail is longer; mean $<$ median.
- Symmetric: mean = median (and mode, for unimodal distributions).
6.2 Pearson's coefficient of skewness
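Pearson's first coefficient uses the mean, median, and standard deviation:
$$\text{Sk}_1 = \frac{3(\bar{x} - \text{median})}{\sigma}$$
Pearson's second coefficient uses only the quartiles (this quartile form is also known as the quartile or Bowley coefficient of skewness):
$$\text{Sk}_2 = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$$
Interpretation:
- $\text{Sk} > 0$: positive skew (right tail longer).
- $\text{Sk} < 0$: negative skew (left tail longer).
- $\text{Sk} = 0$: symmetric distribution.
Info. The second coefficient is useful when quartiles are already known and the standard deviation has not been calculated. The two coefficients typically give the same sign of skewness but may differ in magnitude.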
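Both coefficients can be computed side by side. The sketch below assumes the forms $3(\bar{x} - \text{median})/\sigma$ and $(Q_3 - 2Q_2 + Q_1)/(Q_3 - Q_1)$, the population standard deviation, and quartiles found as medians of the two halves with the middle value excluded for odd $n$. The dataset is made up for illustration.

```python
from statistics import mean, median, pstdev


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def skewness_coefficients(values):
    """Return (first, second) skewness coefficients under the stated assumptions."""
    q1, q2, q3 = quartiles(values)
    first = 3 * (mean(values) - q2) / pstdev(values)
    second = (q3 - 2 * q2 + q1) / (q3 - q1)
    return first, second


data = [1, 2, 3, 4, 6, 9, 17]  # hypothetical right-skewed values
sk1, sk2 = skewness_coefficients(data)
print(sk1, sk2)
```

Both values are positive for this dataset, and the first is noticeably larger because the mean and standard deviation both respond to the large value 17.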
6.3 Relationship between measures of central tendency
For a unimodal distribution:
- Symmetric: mean = median = mode.
- Positively skewed: mode $<$ median $<$ mean.
- Negatively skewed: mean $<$ median $<$ mode.
The mean is pulled in the direction of the longer tail, while the mode remains at the peak and the median lies between them.
7. Outliers in Depth
7.1 The IQR method — mild and extreme outliers
As introduced in Section 3.3, the $1.5 \times \text{IQR}$ rule defines fences. Some boards further distinguish between mild and extreme outliers:
- Mild outlier: a value between $1.5 \times \text{IQR}$ and $3 \times \text{IQR}$ from the nearest quartile.
- Extreme outlier: a value more than $3 \times \text{IQR}$ from the nearest quartile.
7.2 The modified z-score method
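The modified z-score uses the median absolute deviation (MAD). For a dataset with median $\tilde{x}$:
$$\text{MAD} = \text{median}\left(|x_i - \tilde{x}|\right)$$
The modified z-score for each observation is:
$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}}$$
An observation is flagged as an outlier if $|M_i| > 3.5$.
Tip. The modified z-score is built from the median and the MAD, which are themselves resistant to outliers. The factor $0.6745$ is the $0.75$-quantile of the standard normal distribution, so the modified z-score is on a comparable scale to the standard z-score for normally distributed data.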
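A minimal sketch of the method follows, assuming the usual constant 0.6745 and cutoff 3.5. The dataset is made up for illustration.

```python
from statistics import median


def modified_z_scores(data, constant=0.6745):
    """Return the modified z-score of each value, based on the median and MAD."""
    m = median(data)
    mad = median([abs(x - m) for x in data])
    if mad == 0:
        # More than half the values equal the median, so the score is undefined.
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [constant * (x - m) / mad for x in data]


data = [10, 11, 12, 13, 14, 48]  # hypothetical values with one suspect point
scores = modified_z_scores(data)
flagged = [x for x, z in zip(data, scores) if abs(z) > 3.5]
print(flagged)
```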
7.3 Choosing an outlier method
| Method | Strengths | Limitations |
|---|---|---|
| IQR ($1.5 \times \text{IQR}$ rule) | Standard at A-level; easy to apply from quartiles | Less effective with very small samples |
| Modified z-score | Robust to multiple or clustered outliers | Requires computing the MAD, less common |
8. Box Plots — Drawing and Interpreting
8.1 Drawing a box plot
To construct a box plot:
- Draw a horizontal (or vertical) number line covering the range of the data.
- Draw a rectangular box from $Q_1$ to $Q_3$.
- Mark the median as a line inside the box.
- Extend a whisker from $Q_1$ to the smallest data value within the lower fence, and from $Q_3$ to the largest data value within the upper fence.
- Plot any values outside the fences as individual points (these are the outliers).
Warning. The whiskers end at actual data values, not at the fences themselves. If no values lie outside the fences, the whiskers extend to the minimum and maximum of the dataset.
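The numbers needed to draw the plot can be gathered in one function. This sketch assumes the $1.5 \times \text{IQR}$ fences and the median-of-halves quartile convention (middle value excluded for odd $n$). The dataset and the name `box_plot_summary` are illustrative.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def box_plot_summary(values, k=1.5):
    """Return the box, whisker ends, and outliers for a dataset of 2+ values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    lower_fence = q1 - k * (q3 - q1)
    upper_fence = q3 + k * (q3 - q1)
    # Whiskers stop at the most extreme data values inside the fences.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {"box": (q1, q2, q3), "whiskers": (inside[0], inside[-1]), "outliers": outliers}


summary = box_plot_summary([1, 3, 4, 6, 7, 9, 10, 25])  # hypothetical values
print(summary)
```

Here the upper fence is 18.5, but the upper whisker stops at 10, the largest data value inside the fence, and 25 is plotted separately.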
8.2 Interpreting skewness from a box plot
Compare the distances from $Q_2$ to each quartile:
- $Q_3 - Q_2 > Q_2 - Q_1$: the upper half is more spread out, indicating positive skew.
- $Q_3 - Q_2 < Q_2 - Q_1$: the lower half is more spread out, indicating negative skew.
- $Q_3 - Q_2 \approx Q_2 - Q_1$: the distribution is approximately symmetric.
Outliers on one side also indicate skewness in that direction.
8.3 Comparing box plots
When two or more box plots are drawn on the same scale, compare:
- Location: which distribution has the higher median?
- Spread: which has the larger IQR or total range?
- Skewness: do the distributions differ in shape?
- Outliers: does one distribution have more extreme values?
Warning. Always comment on both location and spread. A statement such as "distribution A has a higher median" is incomplete without also addressing how the spreads compare.
9. Comparing Distributions
9.1 Back-to-back stem-and-leaf diagrams
A back-to-back stem-and-leaf diagram places two distributions on either side of a shared stem, enabling direct visual comparison of shape, spread, and outliers.
Example. Comparing test scores of two classes (Class A | Stem | Class B):
8 7 5 | 5 | 3 4 6
4 3 1 | 6 | 0 2 7 9
6 2 0 | 7 | 1 3 5 8
9 5 | 8 | 2 4
| 9 | 1 7
Reading from the diagram: Class A has scores 55, 57, 58, 61, 63, 64, ... while Class B has 53, 54, 56, 60, 62, 67, ... Both classes share the stem (tens digit), with Class A on the left and Class B on the right.
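Reading values off the diagram can be sketched in code. The rows below transcribe the example above, with the stem as the tens digit and each leaf as a units digit. The tuple layout is a choice made for this illustration.

```python
# Each row: (Class A leaves, stem, Class B leaves), transcribed from the diagram.
rows = [
    ("8 7 5", 5, "3 4 6"),
    ("4 3 1", 6, "0 2 7 9"),
    ("6 2 0", 7, "1 3 5 8"),
    ("9 5", 8, "2 4"),
    ("", 9, "1 7"),
]


def expand(rows):
    """Rebuild both datasets as sorted lists of values."""
    left_values, right_values = [], []
    for left, stem, right in rows:
        left_values += [10 * stem + int(leaf) for leaf in left.split()]
        right_values += [10 * stem + int(leaf) for leaf in right.split()]
    return sorted(left_values), sorted(right_values)


class_a, class_b = expand(rows)
print(class_a)
print(class_b)
```

With the values in order, medians and quartiles for each class follow from the position rules in Sections 1.3 and 3.1.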
9.2 Cumulative frequency curves
A cumulative frequency curve (ogive) plots cumulative frequency against the upper class boundary of each group. To compare two distributions:
- Plot both ogives on the same axes.
- Read off medians, quartiles, and percentiles from each curve.
- Compare location (medians), spread (IQR), and shape (skewness).
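Tip. To read a quantile from an ogive, draw a horizontal line from the required cumulative frequency on the $y$-axis to the curve, then drop a vertical line to the $x$-axis. The reverse process gives the cumulative frequency for a given $x$-value.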
9.3 Structuring a comparison
When asked to compare two distributions in an exam, structure your response around four points:
- Average: compare means or medians, stated in the context of the data.
- Spread: compare standard deviations or IQRs, stated in context.
- Shape: compare skewness where apparent.
- Outliers: mention any unusual values and their effect.
Always relate numerical comparisons to the original context of the data.
10. Interpolation from Grouped Data
10.1 Linear interpolation formula
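When data are grouped into classes, quantiles are estimated using linear interpolation. For the $k$-th percentile:
$$P_k = L + \frac{\frac{k}{100}n - F}{f} \times w$$
where:
- $L$ = lower class boundary of the class containing the $k$-th percentile
- $n$ = total frequency
- $F$ = cumulative frequency of all classes below that class
- $w$ = class width
- $f$ = frequency of the class containing the $k$-th percentile
For the median ($k = 50$):
$$Q_2 = L + \frac{\frac{n}{2} - F}{f} \times w$$
For quartiles ($k = 25$ and $k = 75$):
$$Q_1 = L + \frac{\frac{n}{4} - F}{f} \times w, \qquad Q_3 = L + \frac{\frac{3n}{4} - F}{f} \times w$$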
10.2 Worked example
Find the median from the following grouped frequency distribution:
| Class | Frequency |
|---|---|
| | 5 |
| | 12 |
| | 18 |
| | 8 |
| | 4 |
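$n = 47$. The median position is $\frac{n}{2} = 23.5$.
Cumulative frequencies: 5, 17, 35, 43, 47. The 23.5th value falls in the third class, which has frequency 18 and cumulative frequency 17 below it, so the median is estimated as $L + \frac{23.5 - 17}{18} \times w$, where $L$ is that class's lower boundary and $w$ its width.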
Info. Linear interpolation assumes the values are evenly spread within each class, so the result is an approximation; the true quantile may differ if the data are not uniformly spread within the class.
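The interpolation steps can be sketched as a function. The frequencies below are those of the worked example, but the class boundaries (0, 10, 20, ..., 50) are hypothetical, chosen only so the sketch runs.

```python
def interpolate_percentile(boundaries, freqs, k):
    """Estimate the k-th percentile of grouped data by linear interpolation.

    boundaries: class boundaries in increasing order (one more than freqs)
    freqs:      class frequencies
    """
    n = sum(freqs)
    target = k / 100 * n  # position of the required value
    cumulative = 0        # cumulative frequency below the current class
    for i, f in enumerate(freqs):
        if f > 0 and cumulative + f >= target:
            lower = boundaries[i]
            width = boundaries[i + 1] - boundaries[i]
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("k must lie between 0 and 100")


boundaries = [0, 10, 20, 30, 40, 50]  # hypothetical class boundaries
freqs = [5, 12, 18, 8, 4]             # frequencies from the worked example

est_median = interpolate_percentile(boundaries, freqs, 50)
est_q1 = interpolate_percentile(boundaries, freqs, 25)
est_q3 = interpolate_percentile(boundaries, freqs, 75)
print(est_median, est_q1, est_q3)
```

With these assumed boundaries the median lands in the 20 to 30 class, a fraction $\frac{23.5 - 17}{18}$ of the way across it.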
Problem Set
Problem 1
For the dataset , find the mean, median, and mode.
Solution 1
Ordered: . .
Mean: .
Median: average of 4th and 5th values = .
Mode: 5 (appears twice).
If you get this wrong, revise: Measures of Central Tendency — Section 1.
Problem 2
Find the variance and standard deviation of using the computational formula.
Problem 3
Data is coded using . The coded data has mean 12 and variance 9. Find the original mean and standard deviation.
Problem 4
For the ordered dataset , find $Q_1$, $Q_2$, $Q_3$, and the IQR. Identify any outliers.
Solution 4
(odd). th value .
Lower half: . . Upper half: . .
.
Lower fence: . Upper fence: .
All values are within , so no outliers.
If you get this wrong, revise: Quartiles, IQR, and Box Plots — Section 3.
Problem 5
The following frequency table shows the number of goals scored in 20 football matches. Find the mean and variance.
| Goals | Frequency |
|---|---|
| 0 | 3 |
| 1 | 7 |
| 2 | 5 |
| 3 | 3 |
| 4 | 2 |
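Solution 5
A worked sketch of the calculation, assuming the population form of the variance from Section 2.2 is intended (divide by $n - 1$ instead if the sample variance is required):
$$\sum f = 20, \qquad \sum fx = 0(3) + 1(7) + 2(5) + 3(3) + 4(2) = 34, \qquad \bar{x} = \frac{34}{20} = 1.7$$
$$\sum fx^2 = 0(3) + 1(7) + 4(5) + 9(3) + 16(2) = 86, \qquad \sigma^2 = \frac{86}{20} - 1.7^2 = 4.3 - 2.89 = 1.41$$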
Problem 6
Prove that .
Problem 7
Two datasets A and B have the same mean but A has standard deviation 5 while B has standard deviation 2. What does this tell you about the two datasets?
Solution 7
Both datasets are centred at the same point (same mean), but dataset A is more spread out (larger standard deviation). The values in A are more dispersed from the mean, while B's values cluster more tightly around the mean.
If you get this wrong, revise: Variance and Standard Deviation — Section 2.
Problem 8
Given that and for observations, find $\bar{x}$ and the sample variance $s^2$.
Problem 9
A dataset has mean 10 and standard deviation 4. A new dataset is formed by adding 5 to each value and then multiplying by 3. Find the new mean and standard deviation.
Solution 9
Adding 5 shifts the mean by 5: the mean becomes 15 and the SD is unchanged at 4. Multiplying by 3 scales both by 3: the mean becomes 45 and the SD becomes 12.
New mean = 45, new SD = 12.
If you get this wrong, revise: Coding Data — Section 4.2.
Problem 10
Explain why the median is preferred to the mean for measuring average income in a country.
Solution 10
Income distributions are typically right-skewed: a small number of very high earners pull the mean upward. The median, being the middle value, is unaffected by extreme values and gives a more representative "typical" income. For example, if one billionaire lives in a village of 1000 people earning , the mean would be vastly inflated while the median would remain close to .
If you get this wrong, revise: Comparing Measures — Section 1.5.
Problem 11
For the dataset , find $Q_1$, $Q_2$, $Q_3$, $\bar{x}$, and $\sigma$. Hence calculate Pearson's first coefficient of skewness and interpret the result.
Solution 11
. Ordered data: .th value .
Lower half: . . Upper half: . .
.
.
, so .
Pearson's first coefficient:
Since the coefficient is positive, the distribution is positively skewed. This is consistent with the right tail produced by the value 28.
If you get this wrong, revise: Skewness — Section 6.
Problem 12
A box plot shows: minimum = 5, , , , maximum = 34, with one outlier at 42. Calculate the IQR, the upper fence, and describe the skewness of the distribution.
Solution 12
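IQR $= Q_3 - Q_1$.
Upper fence $= Q_3 + 1.5 \times \text{IQR}$.
Skewness: compare $Q_3 - Q_2$ with $Q_2 - Q_1$.
Since $Q_3 - Q_2 > Q_2 - Q_1$ (and there is an outlier at 42 on the upper side), the distribution is positively skewed, though only slightly so from the quartiles alone.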
If you get this wrong, revise: Box Plots — Drawing and Interpreting — Section 8.
Problem 13
Two classes sat the same maths test. Their results are summarised in back-to-back stem-and-leaf form:
Class A | Stem | Class B
9 7 3 | 4 | 1 2 5
8 5 4 | 5 | 0 3 6 8
6 2 0 | 6 | 1 4 7
4 | 7 | 2 5 9
| 8 | 3 6
Compare the distributions of the two classes.
Solution 13
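Class A: 43, 47, 49, 54, 55, 58, 60, 62, 66, 74. $n = 10$. Median $= \frac{55 + 58}{2} = 56.5$. Range $= 74 - 43 = 31$.
Class B: 41, 42, 45, 50, 53, 56, 58, 61, 64, 67, 72, 75, 79, 83, 86. $n = 15$. Median $=$ 8th value $= 61$. Range $= 86 - 41 = 45$.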
Comparison:
- Average: Class B has the higher median (61 vs 56.5), so on average Class B performed better.
- Spread: Class B has a wider range (45 vs 31), suggesting greater variability in scores.
- Shape: Both distributions are roughly symmetric. Class B extends further in both directions.
- Outliers: No obvious outliers in either class.
If you get this wrong, revise: Comparing Distributions — Section 9.
Problem 14
Estimate the median and interquartile range from the following grouped frequency distribution using linear interpolation:
| Class | Frequency |
|---|---|
| | 8 |
| | 15 |
| | 22 |
| | 10 |
| | 5 |
Solution 14
$n = 60$.
Median ($\frac{n}{2} = 30$th value). Cumulative frequencies: 8, 23, 45, 55, 60. The 30th value falls in the third class.
Lower quartile ($\frac{n}{4} = 15$th value). The 15th value falls in the second class.
Upper quartile ($\frac{3n}{4} = 45$th value). The 45th value falls in the third class.
.
If you get this wrong, revise: Interpolation from Grouped Data — Section 10.
Problem 15
The ordered dataset has median 12 and MAD = 4. Use the modified z-score method to determine whether the value 48 is an outlier.
Solution 15
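Median $= 12$. $\text{MAD} = 4$.
For $x = 48$:
$$\frac{0.6745\,(48 - 12)}{4} = \frac{24.282}{4} \approx 6.07$$
Since $6.07 > 3.5$, the value 48 is classified as an outlier by the modified z-score method.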
If you get this wrong, revise: Outliers in Depth — Section 7.2.
Problem 16
A grouped frequency distribution has class with frequency 14. The cumulative frequency below this class is 32, and the total frequency is 80. Use linear interpolation to estimate .
Solution 16
is at position .
The 60th value falls in the class .
, , , .
If you get this wrong, revise: Interpolation from Grouped Data — Section 10.1.
Problem 17
For the dataset , compute both Pearson's first and second coefficients of skewness. Do they agree on the direction of skewness?
Solution 17
. th value .
Lower half: . . Upper half: . .
.
.
, so .
Pearson's first coefficient:
Pearson's second coefficient:
Both coefficients are positive, so they agree on positive skew. However, the first coefficient is much larger because the mean (11.67) is strongly pulled by the outlier 45, whereas the second coefficient depends only on the quartiles, which are less affected by that extreme value.
If you get this wrong, revise: Skewness — Section 6.
Diagnostic Test
Ready to test your understanding of Data Representation? The diagnostic test contains the hardest questions within the A-Level specification for this topic, each with a full worked solution.
Unit tests probe edge cases and common misconceptions. Integration tests combine Data Representation with other topics to test synthesis under exam conditions.
See Diagnostic Guide for instructions on self-marking and building a personal test matrix.