Data Representation — Diagnostic Tests
Unit Tests
Tests edge cases, boundary conditions, and common misconceptions for data representation.
UT-1: Outlier Effect on Measures of Central Tendency and Spread
Question:
A botanist records the heights (in cm) of 12 sunflower plants from a controlled growth experiment; the 12 values are listed in ascending order and include a reading of 180 cm.
(a) Calculate the mean, median, and mode of the full dataset.
(b) The value 180 cm is identified as a measurement error (the actual height was 58 cm). Recalculate the mean, median, and mode after correcting this value.
(c) The interquartile range and standard deviation are both measures of spread. Without calculating the standard deviation of the original (uncorrected) dataset, determine which measure of spread is more affected by the outlier. Justify your answer using the properties of each measure.
(d) A student argues: "Since the median barely changed, we should always use the median instead of the mean." Construct a counterexample with a small dataset where the median gives a misleading measure of central tendency.
[Difficulty: hard. Tests understanding of when each measure is preferred and requires constructing a counterexample.]
Solution:
(a) There are $n = 12$ data values.
Mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{\sum x}{12}$, where $\sum x$ is the total of the 12 recorded heights.
Median: Since $n = 12$ is even, the median is the average of the 6th and 7th values. The data is already given in ascending order, so no re-ordering is needed.
Mode: Every value appears exactly once, so there is no mode.
(b) Replacing 180 with 58 gives the corrected dataset (re-ordered so that 58 sits in its correct ascending position).
Mean: the total $\sum x$ decreases by $180 - 58 = 122$, so the mean decreases by $\dfrac{122}{12} \approx 10.2$ cm.
Median: again the average of the 6th and 7th values of the re-ordered data. Because 180 was an extreme value and 58 lies within the body of the data, the middle values shift by at most one position, so the median barely changes.
Mode: Still no mode.
(c) The standard deviation is more affected by the outlier. This is because the standard deviation involves squaring the deviations from the mean:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
The outlier 180 is very far from the mean (about 111 cm away), so its squared deviation contributes enormously to the sum of squared deviations. The interquartile range (IQR), by contrast, only uses $Q_1$ and $Q_3$, which depend on the middle 50% of the data. The outlier at 180 does not affect $Q_1$ or $Q_3$ at all, so the IQR is completely unchanged.
This is the fundamental advantage of the IQR over the standard deviation for skewed data or data with outliers: it is resistant (robust) to extreme values.
(d) Consider the dataset: $1,\ 1,\ 1,\ 1,\ 100$.
- Mean $= \dfrac{1 + 1 + 1 + 1 + 100}{5} = \dfrac{104}{5} = 20.8$
- Median $= 1$ (the 3rd of the 5 ordered values)
The median of 1 suggests the "typical" value is 1, which is true for 4 out of 5 observations. However, the value 100 is a genuine part of the data (not an error), and the mean of 20.8 better reflects the overall level of the data. The median is misleading here because it completely ignores the magnitude of the upper tail.
This shows that the choice between mean and median depends on the context: the median is robust to outliers (good for error detection), but it discards information about the tails of the distribution.
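As a quick check, the counterexample's summary statistics can be verified with Python's `statistics` module (a minimal sketch using the dataset constructed above):

```python
# Verifying the counterexample from part (d): four 1s and one genuine
# large value of 100.
from statistics import mean, median

data = [1, 1, 1, 1, 100]
print(mean(data))    # 20.8
print(median(data))  # 1
```

The mean is dragged up by the single large value, while the median reports only the middle observation, exactly as argued above.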
UT-2: Misleading Histograms with Unequal Class Widths
Question:
The frequency distribution below shows the daily commuting times (in minutes) for 200 employees at a large company:
| Commuting time $t$ (min) | Frequency |
|---|---|
| $0 \le t < 10$ | 12 |
| $10 \le t < 20$ | 38 |
| $20 \le t < 35$ | 56 |
| $35 \le t < 60$ | 65 |
| $60 \le t < 90$ | 29 |
(a) A student draws a histogram using the frequency on the vertical axis and the commuting time on the horizontal axis, making all bars the same width. Explain why this histogram is misleading, and state the correct quantity to plot on the vertical axis.
(b) Calculate the frequency density for each class and estimate the mean commuting time.
(c) An employee claims "the most common commuting time is between 35 and 60 minutes." Determine whether this claim is supported by the data, carefully distinguishing between the class with the highest frequency and the class with the highest frequency density.
(d) Estimate the proportion of employees who commute for more than 50 minutes, using the assumption that values are uniformly distributed within each class.
[Difficulty: hard. Tests the critical distinction between frequency and frequency density with unequal class widths.]
Solution:
(a) The class widths are not equal: 10, 10, 15, 25, and 30 minutes respectively. In a histogram the area of each bar, not its height, must be proportional to the frequency, which means the height must be the frequency density. If every bar is drawn with the same width and the raw frequency is plotted as the height, the wide classes (whose frequencies are inflated simply because they span more minutes) are visually over-represented, and the narrow classes are under-represented.
The correct quantity for the vertical axis is the frequency density, defined as:
$$\text{frequency density} = \frac{\text{frequency}}{\text{class width}}$$
(b) Frequency densities:
| Class | Width | Frequency | Frequency density |
|---|---|---|---|
| $0 \le t < 10$ | 10 | 12 | 1.2 |
| $10 \le t < 20$ | 10 | 38 | 3.8 |
| $20 \le t < 35$ | 15 | 56 | 3.73 |
| $35 \le t < 60$ | 25 | 65 | 2.6 |
| $60 \le t < 90$ | 30 | 29 | 0.97 |
To estimate the mean, we use the midpoint of each class:
| Class | Midpoint $x$ | Frequency $f$ | $fx$ |
|---|---|---|---|
| $0 \le t < 10$ | 5 | 12 | 60 |
| $10 \le t < 20$ | 15 | 38 | 570 |
| $20 \le t < 35$ | 27.5 | 56 | 1540 |
| $35 \le t < 60$ | 47.5 | 65 | 3087.5 |
| $60 \le t < 90$ | 75 | 29 | 2175 |

$$\bar{t} \approx \frac{\sum fx}{\sum f} = \frac{7432.5}{200} \approx 37.2 \text{ minutes}$$
(c) The $35 \le t < 60$ class has the highest frequency (65), so more employees fall in this class than any other. However, the class with the highest frequency density is $10 \le t < 20$ (density 3.8), meaning the data is most concentrated (per unit time interval) in the 10--20 minute range.
The employee's claim is supported in the sense that the largest number of employees commute for 35--60 minutes. But the claim could be misleading if interpreted as "the most common single commuting time is in this range," since the density is highest in the 10--20 minute range. The wide class width of the 35--60 minute class inflates its frequency.
(d) We need the proportion with $t > 50$.
- In the class $35 \le t < 60$ (width 25, frequency 65): the fraction of the class beyond 50 minutes is $\dfrac{60 - 50}{25} = 0.4$. Estimated frequency with $t > 50$: $0.4 \times 65 = 26$.
- In the class $60 \le t < 90$: all 29 employees commute for more than 50 minutes.
Total estimated frequency with $t > 50$: $26 + 29 = 55$, so the proportion is $\dfrac{55}{200} = 0.275$.
So approximately 27.5% of employees commute for more than 50 minutes.
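Parts (b) and (d) can be cross-checked numerically. A minimal Python sketch, assuming the class boundaries shown in the frequency table (0--10, 10--20, 20--35, 35--60, 60--90 minutes):

```python
# (lower bound, upper bound, frequency) for each commuting-time class.
classes = [(0, 10, 12), (10, 20, 38), (20, 35, 56), (35, 60, 65), (60, 90, 29)]

# Frequency density = frequency / class width.
densities = [f / (b - a) for a, b, f in classes]
print([round(d, 2) for d in densities])  # [1.2, 3.8, 3.73, 2.6, 0.97]

total = sum(f for _, _, f in classes)    # 200 employees
mean_est = sum((a + b) / 2 * f for a, b, f in classes) / total
print(round(mean_est, 1))                # 37.2 minutes

# P(T > 50) under the uniform-within-class assumption: take the fraction
# of each class interval that lies above 50 minutes.
over_50 = sum(f * max(0.0, min(b, 90) - max(a, 50)) / (b - a)
              for a, b, f in classes)
print(over_50 / total)                   # 0.275
```

The uniform-within-class assumption is exactly the one stated in part (d); only the two classes overlapping $(50, 90)$ contribute.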
UT-3: Data Coding and Its Effect on Summary Statistics
Question:
The temperatures $x$ (in degrees Celsius) at a weather station at noon on 15 consecutive days are recorded, and the mean $\bar{x}$ and standard deviation $s_x$ of the raw data are calculated.
The data is coded using a formula of the form $y = \dfrac{x - c}{5}$, where $c$ is a constant.
(a) Find the mean and standard deviation of the coded data $y$.
(b) A student claims that the standard deviation of $y$ equals the standard deviation of $x$ divided by 5. Another student claims it equals the standard deviation of $x$ divided by $\sqrt{25}$, so they agree. A third student says the standard deviation of $y$ equals the standard deviation of $x$ divided by $|a|$, where $a$ is the coefficient of $x$ in the coding, and asks: "Does it matter whether $a$ is positive or negative?" Resolve this dispute with a clear explanation.
(c) A second weather station uses the coding $w = 2x + 3$. Without recalculating from the raw data, find the mean and variance of $w$. Show that the variance of $w$ is the same as the variance of $z$, where $z = -2x + k$ for a constant $k$, and explain why this is the case.
[Difficulty: hard. Tests the precise effect of coding on variance, particularly the role of $a$ vs $a^2$.]
Solution:
(a) For the raw data, the mean $\bar{x}$ and standard deviation $s_x$ are as given.
For the coded data $y = \dfrac{x - c}{5}$, the mean is $\bar{y} = \dfrac{\bar{x} - c}{5}$.
For the standard deviation: if $y = \dfrac{x - c}{5}$, then $s_y = \dfrac{s_x}{5}$.
Alternatively, via the variance: $\operatorname{Var}(y) = \dfrac{1}{25}\operatorname{Var}(x)$, so $s_y = \sqrt{\operatorname{Var}(y)} = \dfrac{s_x}{5}$.
(b) The key fact is:
$$\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X), \qquad \text{so} \qquad \operatorname{SD}(aX + b) = |a| \operatorname{SD}(X).$$
The constant $b$ (the additive shift) has no effect on the variance. The multiplicative factor $a$ scales the variance by $a^2$.
For $y = \dfrac{x - c}{5} = \dfrac{1}{5}x - \dfrac{c}{5}$, we have $a = \dfrac{1}{5}$, so $\operatorname{Var}(y) = \dfrac{1}{25}\operatorname{Var}(x)$.
So $s_y = \dfrac{s_x}{5}$.
The question about whether $a$ being positive or negative matters: it does not. Since the variance scales by $a^2$, and $a^2 = (-a)^2$, the sign of $a$ is irrelevant. If we had used the multiplier $-\dfrac{1}{5}$ instead, the variance would be the same. The mean would flip in sign (the $\dfrac{\bar{x}}{5}$ term becomes $-\dfrac{\bar{x}}{5}$), but the spread is identical.
The first two students are correct that $\operatorname{SD}(y) = \operatorname{SD}(x)/5$. The third student is correct to ask the question, and the answer is: it does not matter, because the variance depends on $a^2$, not $a$.
(c) For $w = 2x + 3$:
$$\bar{w} = 2\bar{x} + 3, \qquad \operatorname{Var}(w) = 2^2 \operatorname{Var}(x) = 4\operatorname{Var}(x).$$
For $z = -2x + k$:
$$\operatorname{Var}(z) = (-2)^2 \operatorname{Var}(x) = 4\operatorname{Var}(x).$$
The variances are equal: $\operatorname{Var}(w) = \operatorname{Var}(z)$.
This is because variance depends on the square of the scaling factor. Since both $w$ and $z$ use a scaling factor of magnitude 2, and $2^2 = (-2)^2 = 4$, the variances are the same. The additive constant (3 or $k$) and the sign of the multiplier only affect the mean, not the spread.
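The sign-irrelevance argument can be demonstrated empirically. A minimal Python sketch with illustrative data (the temperature values below are made up for the demonstration, not the station's actual readings):

```python
# Empirical check that Var(aX + b) = a^2 Var(X), so the sign of the
# multiplier does not affect the spread.
from statistics import pvariance

x = [3, 5, 2, 7, 4, 6]               # illustrative raw values only
w = [2 * v + 3 for v in x]           # coding w = 2x + 3
z = [-2 * v + 3 for v in x]          # flipped sign, same |a| = 2

print(pvariance(w) == pvariance(z))  # True
print(pvariance(w) / pvariance(x))   # 4.0, i.e. a^2
```

Integer data is used so the population variances are computed without rounding noise; the same identity holds for any dataset.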
Integration Tests
Tests synthesis of data representation with other topics. Requires combining concepts from multiple units.
IT-1: Probability Distribution from Grouped Data (with Probability)
Question:
A factory produces bolts, and the length $L$ mm of each bolt is measured. The grouped frequency distribution for 500 bolts is:
| Length $L$ (mm) | Frequency |
|---|---|
| $24.0 \le L < 24.5$ | 20 |
| $24.5 \le L < 25.0$ | 85 |
| $25.0 \le L < 25.5$ | 160 |
| $25.5 \le L < 26.0$ | 145 |
| $26.0 \le L < 26.5$ | 75 |
| $26.5 \le L < 27.0$ | 15 |
A bolt is classified as defective if its length is less than 24.5 mm or greater than 26.5 mm.
(a) Estimate the probability that a randomly selected bolt is defective.
(b) Bolts are packed in boxes of 10. Assuming the probability of a bolt being defective is independent between bolts and equal to your estimate from part (a), find the probability that a randomly selected box contains at least one defective bolt.
(c) The factory claims that the mean length of bolts is 25.5 mm. Using the midpoints of the classes, test this claim at the 5% significance level. You may assume the distribution of sample means is approximately normal. [The standard deviation of bolt lengths is estimated to be 0.60 mm.]
[Difficulty: hard. Combines grouped data estimation, binomial probability, and hypothesis testing.]
Solution:
(a) Defective bolts are those in the classes $24.0 \le L < 24.5$ and $26.5 \le L < 27.0$:
$$P(\text{defective}) \approx \frac{20 + 15}{500} = \frac{35}{500} = 0.07$$
(b) Let $X$ be the number of defective bolts in a box of 10. Then $X \sim B(10,\ 0.07)$.
Using the complement:
$$P(X \ge 1) = 1 - P(X = 0) = 1 - 0.93^{10} \approx 1 - 0.4840 = 0.516 \text{ (3 s.f.)}$$
There is approximately a 51.6% chance that a box contains at least one defective bolt.
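The complement calculation is a one-liner to verify (a minimal sketch, with $p = 35/500$ as estimated in part (a)):

```python
# P(at least one defective in a box of 10) via the binomial complement.
p_def = 35 / 500                       # 20 bolts below 24.5 mm, 15 above 26.5 mm
p_at_least_one = 1 - (1 - p_def) ** 10
print(round(p_at_least_one, 3))        # 0.516
```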
(c) We test $H_0: \mu = 25.5$ against $H_1: \mu \neq 25.5$.
Estimate the sample mean using class midpoints:
| Class | Midpoint $x$ | Frequency $f$ | $fx$ |
|---|---|---|---|
| $24.0 \le L < 24.5$ | 24.25 | 20 | 485 |
| $24.5 \le L < 25.0$ | 24.75 | 85 | 2103.75 |
| $25.0 \le L < 25.5$ | 25.25 | 160 | 4040 |
| $25.5 \le L < 26.0$ | 25.75 | 145 | 3733.75 |
| $26.0 \le L < 26.5$ | 26.25 | 75 | 1968.75 |
| $26.5 \le L < 27.0$ | 26.75 | 15 | 401.25 |

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{12732.5}{500} = 25.465$$
The test statistic under $H_0$:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{25.465 - 25.5}{0.60 / \sqrt{500}} \approx \frac{-0.035}{0.0268} \approx -1.30$$
For a two-tailed test at the 5% level, the critical values are $z = \pm 1.96$.
Since $|{-1.30}| < 1.96$, the test statistic does not fall in the critical region.
Conclusion: There is insufficient evidence to reject . The data is consistent with the factory's claim that the mean bolt length is 25.5 mm.
IT-2: Regression Residuals and Box Plot Analysis (with Correlation and Regression)
Question:
A scientist investigates the relationship between the dosage (in mg) of a drug and the response time (in seconds) for 8 patients. The data is:
| $x$ (mg) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| $y$ (s) | 12.1 | 9.8 | 8.2 | 7.5 | 6.1 | 5.3 | 5.0 | 4.9 |
The regression line of $y$ on $x$ is $\hat{y} = 12.08 - 0.953x$.
(a) Calculate the residuals for each patient and display them as a stem-and-leaf diagram.
(b) Construct a box plot of the residuals. Use the box plot to assess whether the linear model is appropriate. In your answer, identify any potential outlier and discuss whether removing it would change the slope of the regression line.
(c) The scientist fits a second regression line after removing the data point with the largest absolute residual. Without recalculating, predict whether the new regression line will have a steeper or less steep slope. Justify your answer by considering how the least squares criterion works.
[Difficulty: hard. Combines residual analysis, box plot construction, and understanding of leverage in regression.]
Solution:
(a) Residuals $e_i = y_i - \hat{y}_i$:
| $x$ | Observed $y$ | Predicted $\hat{y}$ | Residual $e$ |
|---|---|---|---|
| 1 | 12.1 | 11.127 | 0.973 |
| 2 | 9.8 | 10.174 | -0.374 |
| 3 | 8.2 | 9.221 | -1.021 |
| 4 | 7.5 | 8.268 | -0.768 |
| 5 | 6.1 | 7.315 | -1.215 |
| 6 | 5.3 | 6.362 | -1.062 |
| 7 | 5.0 | 5.409 | -0.409 |
| 8 | 4.9 | 4.456 | 0.444 |
Rounded to 2 d.p.: 0.97, $-0.37$, $-1.02$, $-0.77$, $-1.22$, $-1.06$, $-0.41$, 0.44.
Stem-and-leaf of the residuals rounded to 1 d.p. (key: $-1 \mid 2$ means $-1.2$):
-1 | 2 1 0
-0 | 8 4 4
 0 | 4
 1 | 0
(b) Ordered residuals: $-1.22$, $-1.06$, $-1.02$, $-0.77$, $-0.41$, $-0.37$, 0.44, 0.97. ($n = 8$)
$Q_1$ = median of lower half $= \dfrac{-1.06 + (-1.02)}{2} = -1.04$
$Q_3$ = median of upper half $= \dfrac{-0.37 + 0.44}{2} = 0.035$
Median $= \dfrac{-0.77 + (-0.41)}{2} = -0.59$
$\text{IQR} = 0.035 - (-1.04) = 1.075$
Lower fence $= Q_1 - 1.5 \times \text{IQR} = -1.04 - 1.6125 \approx -2.65$
Upper fence $= Q_3 + 1.5 \times \text{IQR} = 0.035 + 1.6125 \approx 1.65$
No residuals fall outside the fences, so there are no outliers. The box plot would show:
- Whiskers from $-1.22$ to 0.97
- Box from $-1.04$ to 0.035
- Median line at $-0.59$
The residuals show a pattern: negative residuals are clustered for the middle dosages ($x = 2$ to $x = 7$) while positive residuals appear at the extremes ($x = 1$ and $x = 8$). This U-shaped pattern in the residuals suggests the relationship may be slightly curved rather than perfectly linear, though the deviation is small.
(c) The largest absolute residual is at $x = 5$ (residual $-1.215$). The observed value 6.1 is below the predicted value of 7.32, meaning the model over-predicts at this point.
Removing this point would likely make the slope slightly less steep (closer to zero). Here is why: the least squares regression line minimises the sum of squared residuals. The point at $x = 5$ pulls the line down (it lies below the line). Removing it reduces the "pull downward" at the centre of the data range, allowing the line to sit slightly higher in the middle. Since the endpoint observations ($x = 1$ and $x = 8$) lie close to the line, the line will adjust to be slightly flatter.
More precisely: the point at $x = 5$ is below the current regression line, so it contributes a large negative residual. The least squares method adjusts to reduce this residual, which it does by pulling the line downward near $x = 5$. Without this point, there is less need for the line to dip in the middle, so the slope becomes slightly less negative (i.e., less steep in magnitude).
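The residuals and box-plot fences from parts (a) and (b) can be reproduced directly (a minimal sketch, using the fitted line $\hat{y} = 12.08 - 0.953x$ from the table of predicted values):

```python
# Residuals e = y - y_hat for the 8 patients, then quartiles and fences.
xs = list(range(1, 9))
ys = [12.1, 9.8, 8.2, 7.5, 6.1, 5.3, 5.0, 4.9]

resid = sorted(round(y - (12.08 - 0.953 * x), 3) for x, y in zip(xs, ys))
print(resid)
# [-1.215, -1.062, -1.021, -0.768, -0.409, -0.374, 0.444, 0.973]

q1 = (resid[1] + resid[2]) / 2           # median of lower half (n = 8)
q3 = (resid[5] + resid[6]) / 2           # median of upper half
iqr = q3 - q1
print(round(q1, 4), round(q3, 4))        # -1.0415 0.035

# 1.5 * IQR fences: no residual lies outside, so no outliers.
print(q1 - 1.5 * iqr < min(resid) and max(resid) < q3 + 1.5 * iqr)  # True
```

The unrounded quartiles ($Q_1 = -1.0415$) differ from the worked solution ($-1.04$) only because the solution uses residuals pre-rounded to 2 d.p.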
IT-3: Continuous Probability Density Function and the Median (with Integration)
Question:
A continuous random variable $X$ has probability density function:
$$f(x) = \begin{cases} \dfrac{3x^2}{64}, & 0 \le x \le 4 \\ 0, & \text{otherwise.} \end{cases}$$
(a) Verify that $f$ is a valid probability density function.
(b) Find the median of $X$, giving your answer to 3 significant figures.
(c) Find the interquartile range of $X$.
(d) The values of $X$ are recorded as a grouped frequency distribution using the classes $0 \le x < 1$, $1 \le x < 2$, $2 \le x < 3$, $3 \le x < 4$. Estimate the mean and standard deviation from this grouped data, and compare your answers with the true values $E(X) = 3$ and $\sigma = \sqrt{0.6} \approx 0.775$. Comment on the accuracy of the grouped estimates.
[Difficulty: hard. Combines integration of a PDF, quartile calculation, and comparison of grouped vs exact statistics.]
Solution:
(a) A valid PDF must satisfy $f(x) \ge 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Since $3x^2 \ge 0$, we have $f(x) \ge 0$ on $[0, 4]$ and $f(x) = 0$ elsewhere. For the integral:
$$\int_0^4 \frac{3x^2}{64}\,dx = \left[\frac{x^3}{64}\right]_0^4 = \frac{64}{64} = 1.$$
(b) The median $m$ satisfies $F(m) = \frac{1}{2}$:
$$\frac{m^3}{64} = \frac{1}{2} \implies m^3 = 32 \implies m = \sqrt[3]{32} \approx 3.17 \text{ (3 s.f.)}$$
(c) $Q_1$ satisfies $F(Q_1) = \frac{1}{4}$:
$$\frac{Q_1^3}{64} = \frac{1}{4} \implies Q_1^3 = 16 \implies Q_1 = \sqrt[3]{16} \approx 2.52$$
$Q_3$ satisfies $F(Q_3) = \frac{3}{4}$:
$$\frac{Q_3^3}{64} = \frac{3}{4} \implies Q_3^3 = 48 \implies Q_3 = \sqrt[3]{48} \approx 3.63$$
$$\text{IQR} = Q_3 - Q_1 \approx 3.634 - 2.520 = 1.11 \text{ (3 s.f.)}$$
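Since $F(x) = x^3/64$ on $[0, 4]$, each quantile is a cube root, which is easy to verify numerically (a minimal sketch):

```python
# Quantiles of F(x) = x^3 / 64: solve x^3 = 64p for p = 0.25, 0.5, 0.75.
m  = (64 * 0.5) ** (1 / 3)    # median: x^3 = 32
q1 = (64 * 0.25) ** (1 / 3)   # lower quartile: x^3 = 16
q3 = (64 * 0.75) ** (1 / 3)   # upper quartile: x^3 = 48

print(round(m, 3))            # 3.175
print(round(q3 - q1, 3))      # 1.114
```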
(d) First, the true expected value:
$$E(X) = \int_0^4 x \cdot \frac{3x^2}{64}\,dx = \frac{3}{64}\left[\frac{x^4}{4}\right]_0^4 = \frac{3}{64} \times 64 = 3.$$
Now the grouped estimates. We need the class frequencies. Since we are modelling from the PDF, the expected frequency in each class (out of a large sample) is proportional to the class probability:
$$p_k = P(k - 1 \le X < k) = \frac{k^3 - (k - 1)^3}{64}, \qquad k = 1, 2, 3, 4.$$
| Class | Midpoint $m$ | $p$ | $mp$ | $m^2 p$ |
|---|---|---|---|---|
| $0 \le x < 1$ | 0.5 | $1/64 = 0.01563$ | 0.00781 | 0.00391 |
| $1 \le x < 2$ | 1.5 | $7/64 = 0.10938$ | 0.16406 | 0.24609 |
| $2 \le x < 3$ | 2.5 | $19/64 = 0.29688$ | 0.74219 | 1.85547 |
| $3 \le x < 4$ | 3.5 | $37/64 = 0.57813$ | 2.02344 | 7.08203 |
Estimated mean:
$$\hat{\mu} = \sum mp = 0.00781 + 0.16406 + 0.74219 + 2.02344 = 2.9375$$
The true mean is 3. The grouped estimate underestimates it by 0.0625, or roughly 2%. This error arises because the midpoint approximation assumes a uniform distribution within each class, but the PDF is increasing, so the midpoint systematically underestimates the class mean for each class.
Estimated variance:
$$\hat{\sigma}^2 = \sum m^2 p - \hat{\mu}^2 = 9.1875 - 2.9375^2 = 0.5586, \qquad \hat{\sigma} \approx 0.747.$$
The true SD is $\sqrt{0.6} \approx 0.775$, so the grouped estimate underestimates it by about 3.6%. The grouped frequency approach loses precision because it replaces the continuous distribution with a discrete approximation within each class.
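The whole of part (d) reduces to a short computation over the class probabilities $p_k = (k^3 - (k-1)^3)/64$ (a minimal sketch):

```python
# Grouped-midpoint estimates of the mean and SD for f(x) = 3x^2/64 on [0, 4].
from math import sqrt

probs = [(k**3 - (k - 1) ** 3) / 64 for k in range(1, 5)]  # 1/64, 7/64, 19/64, 37/64
mids  = [0.5, 1.5, 2.5, 3.5]

mean_est = sum(m * p for m, p in zip(mids, probs))
var_est  = sum(m**2 * p for m, p in zip(mids, probs)) - mean_est**2

print(mean_est)                 # 2.9375  (true mean is 3)
print(round(sqrt(var_est), 3))  # 0.747   (true SD is sqrt(0.6) ~ 0.775)
```

Both estimates land below the true values, consistent with the midpoint rule under-weighting an increasing density.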