Correlation and Regression — Diagnostic Tests
Unit Tests
Tests edge cases, boundary conditions, and common misconceptions for correlation and regression.
UT-1: PMCC vs Spearman's Rank and Coding Invariance
Question:
An economist collects data on the annual income (in thousands of pounds) and annual savings (in hundreds of pounds) for 7 households:
| Income ($x$, in £1000s) | 15 | 22 | 30 | 35 | 42 | 55 | 68 |
|---|---|---|---|---|---|---|---|
| Savings ($y$, in £100s) | 3 | 8 | 12 | 18 | 22 | 35 | 48 |
(a) Calculate the product moment correlation coefficient (PMCC) for this data.
(b) The data is coded using $u = ax + b$ and $v = cy + d$, where $a$ and $c$ are positive constants. A student claims that the PMCC of $u$ and $v$ will be different from the PMCC of $x$ and $y$ because "the units have changed." Determine whether this claim is correct, and explain your reasoning.
(c) Calculate Spearman's rank correlation coefficient for the original data.
(d) The economist considers fitting a regression line. Explain why the PMCC is the more appropriate measure of correlation here rather than Spearman's rank coefficient, and state one scenario where Spearman's rank would be preferred.
[Difficulty: hard. Tests the invariance property of PMCC and the distinction between the two correlation measures.]
Solution:
(a) We need $\sum x = 267$, $\sum y = 146$, $\sum x^2 = 12247$, $\sum y^2 = 4554$, $\sum xy = 7324$.
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 7324 - \frac{267 \times 146}{7} \approx 1755.14$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 12247 - \frac{267^2}{7} \approx 2062.86$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 4554 - \frac{146^2}{7} \approx 1508.86$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1755.14}{\sqrt{2062.86 \times 1508.86}} \approx 0.995$ (3 s.f.)
(b) The student's claim is incorrect. The PMCC is invariant under linear coding of the form $u = ax + b$ and $v = cy + d$ (where $a, c > 0$). Here $u$ and $v$ are linear transformations of $x$ and $y$, so the PMCC is unchanged.
To see why: the PMCC is defined as $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$. Under the coding, $S_{uv} = ac\,S_{xy}$, $S_{uu} = a^2 S_{xx}$ and $S_{vv} = c^2 S_{yy}$, so:
$r_{uv} = \frac{ac\,S_{xy}}{\sqrt{a^2 S_{xx}\,c^2 S_{yy}}} = \frac{ac\,S_{xy}}{ac\,\sqrt{S_{xx} S_{yy}}} = r_{xy}.$
The factors of $a$ and $c$ cancel out completely, so the PMCC is unchanged by scaling or shifting.
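This invariance can be checked numerically with the household data from the question. The particular coding below ($u = 1000x + 5$, $v = 100y - 2$) is an arbitrary illustrative choice, not taken from the question:

```python
# Sketch: verify that the PMCC is unchanged by linear coding u = ax + b,
# v = cy + d (a, c > 0), using the income/savings data from the question.
import math

def pmcc(xs, ys):
    n = len(xs)
    sxy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(a * a for a in xs) - sum(xs) ** 2 / n
    syy = sum(b * b for b in ys) - sum(ys) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

x = [15, 22, 30, 35, 42, 55, 68]
y = [3, 8, 12, 18, 22, 35, 48]

r_xy = pmcc(x, y)  # about 0.995
r_uv = pmcc([1000 * a + 5 for a in x], [100 * b - 2 for b in y])
print(round(r_xy, 4), round(r_uv, 4))  # the two values agree
```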
(c) Ranking the data:
| $x$ | Rank $x$ | $y$ | Rank $y$ | $d$ | $d^2$ |
|---|---|---|---|---|---|
| 15 | 1 | 3 | 1 | 0 | 0 |
| 22 | 2 | 8 | 2 | 0 | 0 |
| 30 | 3 | 12 | 3 | 0 | 0 |
| 35 | 4 | 18 | 4 | 0 | 0 |
| 42 | 5 | 22 | 5 | 0 | 0 |
| 55 | 6 | 35 | 6 | 0 | 0 |
| 68 | 7 | 48 | 7 | 0 | 0 |
Spearman's rank correlation coefficient is $r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - 0 = 1$ (perfect rank correlation).
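A minimal sketch of the rank-based calculation for the data above (with no tied values, so simple integer ranks suffice):

```python
# Sketch: Spearman's coefficient from the rank-difference formula
# r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)), using the household data.

def ranks(vals):
    # Rank 1 = smallest value; assumes no ties, as in this dataset.
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    r = [0] * len(vals)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [15, 22, 30, 35, 42, 55, 68]
y = [3, 8, 12, 18, 22, 35, 48]
print(spearman(x, y))  # 1.0 — both variables increase together
```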
(d) The PMCC is more appropriate here because:
- The data appears to follow an approximately linear relationship (PMCC $\approx 0.995$), so the PMCC correctly captures the strength of this linear association.
- The PMCC uses the actual data values, giving a more precise measure of the strength of the linear relationship. Spearman's rank discards information about the magnitude of differences between values.
Spearman's rank would be preferred when:
- The relationship is monotonic but not linear (e.g., exponential or logarithmic).
- The data contains extreme outliers whose values would distort the PMCC but whose rank positions are not affected.
- The data is ordinal (ranked categories) rather than interval/ratio.
UT-2: Regression Line Properties and Extrapolation Risk
Question:
The regression line of $y$ on $x$ for a dataset is given, together with the regression line of $x$ on $y$. The dataset has $n$ observations with means $\bar{x}$ and $\bar{y}$.
(a) Verify that the point $(\bar{x}, \bar{y})$ lies on both regression lines.
(b) Calculate the PMCC for the dataset.
(c) A student uses the regression line of $y$ on $x$ to predict $y$ at a value of $x$ far outside the range of the observed data. Explain why this prediction may be unreliable, identifying the specific statistical concept that is violated.
(d) Show that the two regression lines intersect at the point $(\bar{x}, \bar{y})$, and explain geometrically why this must always be the case for any bivariate dataset.
[Difficulty: hard. Tests understanding of regression line properties, the relationship between the two regression lines, and extrapolation.]
Solution:
(a) On the regression line of $y$ on $x$: substituting $x = \bar{x}$ gives $y = \bar{y}$, so $(\bar{x}, \bar{y})$ lies on the regression line of $y$ on $x$.
On the regression line of $x$ on $y$: substituting $y = \bar{y}$ does not return $x = \bar{x}$.
This discrepancy means the stated regression coefficients are not simultaneously consistent with the given means for both lines. A least squares line of $x$ on $y$ must pass through $(\bar{x}, \bar{y})$, so its intercept is fixed by $\bar{x} = a' + b'\bar{y}$; the stated intercept differs from this value, so the given regression lines are not consistent.
For the remainder of this solution, the intercept of the $x$-on-$y$ line is corrected so that the line passes through $(\bar{x}, \bar{y})$, keeping its gradient $b'$ as given.
(b) From the regression line of $y$ on $x$, the gradient is the regression coefficient $b = \frac{S_{xy}}{S_{xx}}$; for the line of $x$ on $y$ it is $b' = \frac{S_{xy}}{S_{yy}}$.
We can use the relationship between the two regression coefficients:
$b \times b' = \frac{S_{xy}}{S_{xx}} \times \frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2, \qquad \text{so } r = \pm\sqrt{b\,b'},$
where $b'$ is the gradient of the (corrected) regression line of $x$ on $y$.
We take the positive root because both regression coefficients are positive, indicating a positive association.
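The identity $b \times b' = r^2$ holds for any dataset, and can be checked numerically. The data below is an arbitrary illustrative sample, not taken from the question:

```python
# Sketch: the product of the two least squares gradients equals r^2.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical sample
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.7]

n = len(x)
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n

b_yx = sxy / sxx                      # gradient of y on x
b_xy = sxy / syy                      # gradient of x on y
r = sxy / math.sqrt(sxx * syy)

print(math.isclose(b_yx * b_xy, r ** 2))  # True
```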
(c) The prediction is an extrapolation beyond the range of the data: the observed $x$ values span a much smaller range around $\bar{x}$. The regression line is only valid for interpolation within (or close to) the range of the observed data. Extrapolating assumes that the linear relationship continues to hold well beyond the observed range, which is often not the case in practice.
The specific statistical concept violated is that the least squares regression model assumes the relationship is linear over the relevant range; there is no evidence that linearity holds so far outside the observed data.
(d) The regression line of $y$ on $x$ can be written as:
$y - \bar{y} = b(x - \bar{x})$
Setting $x = \bar{x}$ gives $y = \bar{y}$, confirming the line passes through $(\bar{x}, \bar{y})$.
Similarly, the regression line of $x$ on $y$ can be written as:
$x - \bar{x} = b'(y - \bar{y})$
Setting $y = \bar{y}$ gives $x = \bar{x}$, confirming this line also passes through $(\bar{x}, \bar{y})$.
Geometrically, the point of means $(\bar{x}, \bar{y})$ is the "centre of gravity" of the data. The least squares criterion minimises the sum of squared vertical distances (for $y$ on $x$) or horizontal distances (for $x$ on $y$) from the line. The line of best fit must pass through the centre of gravity because shifting the line away from $(\bar{x}, \bar{y})$ would increase the total sum of squared residuals. Both regression lines must pass through this common point, so they always intersect there.
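This property can be verified on any sample by fitting both lines and substituting the means. The data below is an arbitrary illustrative sample:

```python
# Sketch: both least squares lines pass through (x-bar, y-bar).
import math

x = [2.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical sample
y = [1.5, 3.1, 4.0, 5.2, 7.3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

# y on x: y = a1 + b1*x ;  x on y: x = a2 + b2*y
b1, a1 = sxy / sxx, ybar - (sxy / sxx) * xbar
b2, a2 = sxy / syy, xbar - (sxy / syy) * ybar

print(math.isclose(a1 + b1 * xbar, ybar))  # True: line 1 through the means
print(math.isclose(a2 + b2 * ybar, xbar))  # True: line 2 through the means
```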
UT-3: Residual Analysis and Least Squares Justification
Question:
A teacher believes there is a linear relationship between the number of hours a student revises ($x$) and their exam score ($y$). She collects data for 10 students and fits a least squares regression line. The residuals are:
| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Residual | | | | | | | | | | |
(a) Verify that the sum of the residuals is zero, and explain why this must always be the case for a least squares regression line.
(b) Plot the residuals against $x$ and describe the pattern. What does this pattern suggest about the appropriateness of the linear model?
(c) The teacher considers fitting a quadratic model instead. Without fitting the model, explain what change in the residual pattern you would expect to see if the quadratic model were a better fit.
(d) A student claims: "Since the least squares line minimises the sum of squared residuals, it is the 'best' line in every sense." Explain why vertical distances (not perpendicular distances) are minimised, and give one situation where minimising vertical distances is inappropriate.
[Difficulty: hard. Tests deep understanding of the least squares method and residual diagnostics.]
Solution:
(a) Adding the tabulated values, the sum of the residuals is zero.
This must always be the case because the least squares regression line of $y$ on $x$ is $y = a + bx$ with:
$a = \bar{y} - b\bar{x}, \qquad b = \frac{S_{xy}}{S_{xx}}$
The sum of residuals is:
$\sum e_i = \sum (y_i - a - bx_i) = \sum y_i - na - b\sum x_i$
Substituting $a = \bar{y} - b\bar{x}$:
$\sum e_i = n\bar{y} - n(\bar{y} - b\bar{x}) - bn\bar{x} = 0$
The residuals always sum to zero because the regression line passes through $(\bar{x}, \bar{y})$, and the deviations from the mean sum to zero.
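The derivation above can be confirmed numerically. The revision-hours/exam-score data below is hypothetical, chosen only to illustrate the property:

```python
# Sketch: residuals from a least squares fit sum to (numerically) zero.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # hypothetical revision hours
y = [35, 41, 50, 58, 62, 66, 68, 69, 71, 74]  # hypothetical exam scores

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
     / sum((a - xbar) ** 2 for a in x))
a0 = ybar - b * xbar                        # intercept: line through the means

residuals = [c - (a0 + b * a) for a, c in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)           # True, up to rounding
```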
(b) Plotting the residuals from the table against $x$:
The residual plot shows a clear curved pattern: the residuals are negative at low $x$, rise to positive values in the middle of the range, then become negative again at higher $x$. This curved (U-shaped or S-shaped) pattern in the residuals indicates that the relationship between $x$ and $y$ is not purely linear. A linear model is systematically under-predicting at the extremes and over-predicting in the middle (or vice versa), which is the hallmark of a non-linear relationship.
If the residuals were randomly scattered around zero with no discernible pattern, this would support the linearity assumption.
(c) If a quadratic model were a better fit, the residuals from the quadratic model should show no systematic pattern when plotted against . Specifically:
- The curved pattern should disappear.
- The residuals should be randomly scattered above and below zero.
- The magnitude of the residuals should be roughly constant across the range of (homoscedasticity).
The quadratic model would capture the curvature in the data, so the remaining variation would be random rather than systematic.
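The contrast described in (b) and (c) can be illustrated with a small sketch. The data below is hypothetical, generated from an exactly quadratic relationship; a straight-line fit then leaves residuals with a systematic sign pattern (positive at the extremes, negative in the middle):

```python
# Sketch: fit a straight line to exactly quadratic data; the residuals
# show the systematic curved pattern described above.
x = list(range(11))              # 0..10
y = [(v - 5) ** 2 for v in x]    # quadratic, minimum at x = 5

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((v - xbar) ** 2 for v in x)
sxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
b = sxy / sxx                    # zero here, by symmetry of the data
a = ybar - b * xbar

res = [w - (a + b * v) for v, w in zip(x, y)]
print(res[0] > 0, res[5] < 0, res[10] > 0)  # True True True
```

A quadratic fit to the same data would leave all residuals at zero, removing the pattern entirely.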
(d) The least squares regression line of on minimises the sum of squared vertical distances because it treats as the independent (explanatory) variable and as the dependent (response) variable. The assumption is that is measured without error (or with negligible error) and all the random variation is in . This is appropriate when we want to predict from .
Minimising perpendicular distances would be inappropriate when:
- The two variables have different units or scales (e.g., predicting exam score from hours of revision). A perpendicular distance would mix the two units in a way that has no meaningful interpretation.
- One variable is clearly the predictor and the other the response. In prediction contexts, we care about vertical error (prediction error in ), not the geometric distance from the line.
- There is substantial measurement error in both variables: then neither "vertical" nor "perpendicular" least squares is ideal, and techniques like Deming regression would be more appropriate, though this is beyond A-Level.
Integration Tests
Tests synthesis of correlation and regression with other topics. Requires combining concepts from multiple units.
IT-1: Significance Test for Correlation Coefficient (with Hypothesis Testing)
Question:
A researcher investigates whether there is a correlation between the number of hours of sleep ($x$) and performance on a cognitive test ($y$) for a random sample of 12 adults, and calculates the PMCC, $r$, for her sample.
(a) Stating your hypotheses clearly, test at the 5% significance level whether the population correlation coefficient is positive. Use the fact that the critical value for a one-tailed test with $n = 12$ at the 5% level is 0.497.
(b) A colleague suggests using a two-tailed test instead. Without recalculating, state whether the conclusion would change and explain why.
(c) The researcher then collects data for a larger sample of 30 adults and calculates the PMCC again. Using the fact that the critical value for a one-tailed test with $n = 30$ at the 5% level is 0.306, test whether there is evidence of positive correlation. Compare your conclusion with part (a) and explain the role of sample size.
(d) Explain why the PMCC can be tested for significance using a critical value table, but Spearman's rank correlation coefficient requires a different table.
[Difficulty: hard. Combines correlation with formal hypothesis testing and understanding of sample size effects.]
Solution:
(a) Hypotheses:
$H_0: \rho = 0$ (no correlation in the population)
$H_1: \rho > 0$ (positive correlation in the population)
Test: One-tailed test at the 5% significance level.
Critical value: For $n = 12$ (so degrees of freedom $n - 2 = 10$), the critical value is 0.497.
Test statistic: the sample PMCC, $r$.
Since the observed $r$ exceeds 0.497, the test statistic exceeds the critical value.
Conclusion: There is sufficient evidence to reject $H_0$ and conclude that there is a positive correlation between hours of sleep and cognitive test performance in the population.
(b) For a two-tailed test at the 5% level, the critical value would be higher (for $n = 12$ it is approximately 0.576).
Since the observed $r$ also exceeds 0.576, the conclusion would not change: there would still be sufficient evidence to reject $H_0$.
However, note that the two-tailed test is more conservative: it requires stronger evidence because it tests for any non-zero correlation (positive or negative) rather than just positive correlation.
(c) Hypotheses:
$H_0: \rho = 0$, $H_1: \rho > 0$
Critical value: For $n = 30$ at the 5% level (one-tailed), the critical value is 0.306.
Test statistic: the PMCC of the new sample.
Since the observed value exceeds 0.306, we reject $H_0$.
Comparison: Both tests lead to rejection of $H_0$, but in part (a) the correlation was stronger with a smaller sample ($n = 12$), while in part (c) the correlation is weaker but the larger sample ($n = 30$) provides more evidence. The critical value decreases as $n$ increases because, with more data, even a weak correlation becomes statistically significant. This illustrates that statistical significance depends on both the strength of the correlation and the sample size: a small sample needs a stronger correlation to achieve significance.
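The tabulated critical values can be reproduced from one-tailed 5% quantiles of the $t$-distribution via $r_{\text{crit}} = t / \sqrt{t^2 + (n - 2)}$. A sketch using standard table values of $t$ (1.812 for 10 degrees of freedom, 1.701 for 28):

```python
# Sketch: PMCC critical values from t-distribution quantiles,
# r_crit = t / sqrt(t^2 + df) with df = n - 2.
import math

def r_crit(t, n):
    df = n - 2
    return t / math.sqrt(t * t + df)

print(round(r_crit(1.812, 12), 3))  # 0.497, matching the n = 12 table value
print(round(r_crit(1.701, 30), 3))  # 0.306, matching the n = 30 table value
```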
(d) The PMCC follows a known sampling distribution (related to the $t$-distribution) under the null hypothesis $\rho = 0$, assuming the population is bivariate normal. This allows the construction of critical value tables specific to the PMCC.
Spearman's rank correlation coefficient, $r_s$, is based on the ranks of the data rather than the raw values. Its sampling distribution under $H_0$ is different from that of the PMCC. For small samples, $r_s$ has a discrete distribution (since ranks are integers), and the exact critical values are tabulated separately. The PMCC critical values cannot be used for Spearman's test because the distributions are different.
IT-2: Expected Value of Regression Predictions (with Probability)
Question:
The regression line of exam mark, $y$, on hours of revision, $x$, for a large population of students is of the form $y = a + bx$.
The number of revision hours, $X$, for a randomly selected student follows a given discrete probability distribution.
(a) Find $E(X)$ and $\mathrm{Var}(X)$.
(b) Find $E(Y)$ using the regression line and the distribution of $X$.
(c) The variance of the actual exam marks around the regression line is $\sigma^2 = 25$. Find $\mathrm{Var}(Y)$.
(d) A student studies for exactly 3 hours. The regression line predicts a mark of 35. Explain why the actual exam mark is unlikely to be exactly 35, and calculate the probability that a student who revises for 3 hours scores above 40, assuming the residuals are normally distributed.
[Difficulty: hard. Combines regression with probability distributions and properties of expectation/variance.]
Solution:
(a) $E(X) = \sum x\,P(X = x)$ and $\mathrm{Var}(X) = \sum x^2\,P(X = x) - [E(X)]^2$, evaluated from the given distribution of $X$.
(b) Using the linearity of expectation: $E(Y) = E(a + bX) = a + b\,E(X)$.
Alternatively, using the law of total expectation: the predicted mark for each value of $x$ is $a + bx$, and averaging these predictions weighted by the probability of each $x$ value gives the same result.
(c) Writing $Y = a + bX + \varepsilon$, where $\varepsilon$ represents the random scatter (residuals) about the line and is assumed independent of $X$, the total variance of $Y$ has two components:
- Variance due to the variation in $X$ across students: $b^2\,\mathrm{Var}(X)$.
- Variance due to the scatter around the regression line: $\sigma^2$.
Hence $\mathrm{Var}(Y) = b^2\,\mathrm{Var}(X) + \sigma^2$.
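A numerical check of this decomposition, under stated assumptions: the distribution of $X$ below is hypothetical, as are the gradient $b = 5$ and $\sigma^2 = 25$:

```python
# Sketch: Var(a + bX + eps) = b^2 Var(X) + sigma^2 when eps is
# independent of X. The distribution and parameters are illustrative.
xs = [1, 2, 3, 4]            # possible revision hours (hypothetical)
ps = [0.2, 0.3, 0.3, 0.2]    # hypothetical probabilities (sum to 1)
b, sigma2 = 5.0, 25.0        # hypothetical gradient and residual variance

ex = sum(x * p for x, p in zip(xs, ps))        # E(X)
ex2 = sum(x * x * p for x, p in zip(xs, ps))   # E(X^2)
var_x = ex2 - ex ** 2                          # Var(X)

var_y = b ** 2 * var_x + sigma2                # decomposition of Var(Y)
print(var_x, var_y)
```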
(d) The regression line gives the expected (mean) exam mark for a given number of revision hours. Individual students will score above or below this prediction due to other factors not captured by the model (natural ability, exam difficulty, etc.). The residual for a student who revises 3 hours is the difference between their actual mark and the predicted mark of 35.
Assuming the residuals are normally distributed with mean 0 and standard deviation $\sigma = 5$:
$P(Y > 40 \mid X = 3) = P\!\left(Z > \frac{40 - 35}{5}\right) = P(Z > 1) = 1 - \Phi(1) \approx 0.1587$
The probability is approximately 0.159 (15.9%).
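The normal tail probability can be computed without tables using the complementary error function, since $P(Z > z) = \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$:

```python
# Sketch: P(Z > 1) for a standard normal via math.erfc.
import math

def normal_tail(z):
    # P(Z > z) = 0.5 * erfc(z / sqrt(2)) for Z ~ N(0, 1)
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(normal_tail((40 - 35) / 5), 4))  # 0.1587
```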
IT-3: Outlier Removal and Recalculation (with Data Representation)
Question:
The data below shows the temperature ($x$, in °C) and ice cream sales ($y$, in £) at a shop on 8 days:
| $x$ (°C) | 12 | 15 | 18 | 20 | 22 | 25 | 28 | 30 |
|---|---|---|---|---|---|---|---|---|
| $y$ (£) | 45 | 80 | 120 | 150 | 170 | 250 | 310 | 50 |
The regression line of $y$ on $x$ is $y = 12.34x - 102.9$, with PMCC $r = 0.507$.
(a) Identify the outlier in the data by examining the residuals. Explain why the value $y = 50$ when $x = 30$ is anomalous.
(b) After removing the outlier, the new summary statistics are: $n = 7$, $\sum x = 140$, $\sum y = 1125$, $\sum x^2 = 2986$, $\sum y^2 = 232825$, $\sum xy = 25570$. Calculate the new PMCC and regression line.
(c) Calculate the percentage change in the PMCC after removing the outlier. Comment on the sensitivity of the PMCC to outliers compared to Spearman's rank correlation coefficient.
(d) Using the original data (with the outlier), a box plot of the residuals shows a value far below the lower whisker. Explain why removing outliers based solely on residual box plots can be dangerous, and suggest a more principled approach.
[Difficulty: hard. Combines outlier detection, recalculation of regression statistics, and critical evaluation of outlier removal.]
Solution:
(a) Calculating residuals from the regression line $y = 12.34x - 102.9$:
| $x$ | $y$ (observed) | $\hat{y}$ (predicted) | Residual |
|---|---|---|---|
| 12 | 45 | 45.2 | $-0.2$ |
| 15 | 80 | 82.2 | $-2.2$ |
| 18 | 120 | 119.2 | $0.8$ |
| 20 | 150 | 143.9 | $6.1$ |
| 22 | 170 | 168.6 | $1.4$ |
| 25 | 250 | 205.6 | $44.4$ |
| 28 | 310 | 242.7 | $67.3$ |
| 30 | 50 | 267.3 | $-217.3$ |
The residual for $x = 30$ is $-217.3$, which is enormously negative. The predicted sales for $30\,$°C are around £267, but the observed value is only £50. This is almost certainly a data entry error (perhaps a digit was mistyped, or the temperature was recorded incorrectly). The residual of $-217.3$ is an extreme outlier compared to all other residuals (which range from $-2.2$ to $67.3$).
(b) After removing the outlier $(30, 50)$, with $n = 7$:
$S_{xy} = 25570 - \frac{140 \times 1125}{7} = 3070, \quad S_{xx} = 2986 - \frac{140^2}{7} = 186, \quad S_{yy} = 232825 - \frac{1125^2}{7} \approx 52021.4$
New PMCC: $r = \frac{3070}{\sqrt{186 \times 52021.4}} \approx 0.987$
Regression coefficient: $b = \frac{S_{xy}}{S_{xx}} = \frac{3070}{186} \approx 16.5$
Intercept: $a = \bar{y} - b\bar{x} = \frac{1125}{7} - 16.5 \times 20 \approx -169$
New regression line: $y = 16.5x - 169$ (to 3 s.f.)
(c) Original PMCC: $r \approx 0.507$. New PMCC: $r \approx 0.987$.
The PMCC increased by roughly 95%. This demonstrates that the PMCC is sensitive to outliers because it uses the actual data values (and their deviations from the mean): a single extreme value can substantially reduce the PMCC.
Spearman's rank correlation coefficient limits the outlier's influence to its rank displacement. In the original data, the ranks of $x$ and $y$ agree in order except for the outlier: the point with $x = 30$ has rank 8 for $x$ but its $y$ value of 50 has only rank 2. This rank inversion reduces $r_s$, but the outlier contributes only through its rank positions, not through the full magnitude of its deviation, which is why Spearman's coefficient is the more robust measure.
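The recalculation in (b) and (c) can be checked directly from the raw data:

```python
# Sketch: PMCC for the ice cream data with and without the outlier (30, 50).
import math

def pmcc(xs, ys):
    n = len(xs)
    sxy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(a * a for a in xs) - sum(xs) ** 2 / n
    syy = sum(b * b for b in ys) - sum(ys) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

x = [12, 15, 18, 20, 22, 25, 28, 30]
y = [45, 80, 120, 150, 170, 250, 310, 50]

r_all = pmcc(x, y)               # with the outlier
r_clean = pmcc(x[:-1], y[:-1])   # outlier (30, 50) removed
print(round(r_all, 3), round(r_clean, 3))  # 0.507 0.987
```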
(d) Removing outliers based solely on residual box plots can be dangerous because:
- The outlier may be a genuine data point that reflects a real phenomenon (e.g., the shop was closed early that day, explaining low sales despite high temperature). Removing genuine outliers biases the analysis.
- The residual box plot itself depends on the regression line, which is influenced by the outlier. If the outlier pulls the regression line, the residuals of other points change, potentially creating false outliers or masking real ones.
- Confirmation bias: researchers may be tempted to remove outliers that contradict their hypothesis, leading to inflated correlation coefficients and misleading conclusions.
A more principled approach:
- Investigate the source of the outlier (data entry error, equipment malfunction, etc.).
- If the outlier is confirmed as an error, correct or remove it.
- If the outlier is genuine, consider using robust methods (e.g., Spearman's rank) or report results both with and without the outlier.
- Use domain knowledge to determine whether the value is plausible.