
Correlation and Regression — Diagnostic Tests

Unit Tests

Tests edge cases, boundary conditions, and common misconceptions for correlation and regression.

UT-1: PMCC vs Spearman's Rank and Coding Invariance

Question:

An economist collects data on the annual income (in thousands of pounds) and annual savings (in hundreds of pounds) for 7 households:

| Income ($x$, in £1000) | 15 | 22 | 30 | 35 | 42 | 55 | 68 |
|---|---|---|---|---|---|---|---|
| Savings ($y$, in £100) | 3 | 8 | 12 | 18 | 22 | 35 | 48 |

(a) Calculate the product moment correlation coefficient (PMCC) for this data.

(b) The data is coded using $u = \frac{x}{1000}$ and $v = \frac{y}{100}$. A student claims that the PMCC of $u$ and $v$ will be different from the PMCC of $x$ and $y$ because "the units have changed." Determine whether this claim is correct, and explain your reasoning.

(c) Calculate Spearman's rank correlation coefficient for the original data.

(d) The economist considers fitting a regression line. Explain why the PMCC is the more appropriate measure of correlation here rather than Spearman's rank coefficient, and state one scenario where Spearman's rank would be preferred.

[Difficulty: hard. Tests the invariance property of PMCC and the distinction between the two correlation measures.]

Solution:

(a) We need $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$ and $\sum xy$.

$$\sum x = 15 + 22 + 30 + 35 + 42 + 55 + 68 = 267$$

$$\sum y = 3 + 8 + 12 + 18 + 22 + 35 + 48 = 146$$

$$\sum x^2 = 225 + 484 + 900 + 1225 + 1764 + 3025 + 4624 = 12247$$

$$\sum y^2 = 9 + 64 + 144 + 324 + 484 + 1225 + 2304 = 4554$$

$$\sum xy = 45 + 176 + 360 + 630 + 924 + 1925 + 3264 = 7324$$

$$S_{xx} = 12247 - \frac{267^2}{7} = 12247 - \frac{71289}{7} = 12247 - 10184.143 = 2062.857$$

$$S_{yy} = 4554 - \frac{146^2}{7} = 4554 - \frac{21316}{7} = 4554 - 3045.143 = 1508.857$$

$$S_{xy} = 7324 - \frac{267 \times 146}{7} = 7324 - \frac{38982}{7} = 7324 - 5568.857 = 1755.143$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{1755.143}{\sqrt{2062.857 \times 1508.857}}$$

$$= \frac{1755.143}{\sqrt{3112556.7}} = \frac{1755.143}{1764.24} = 0.9948 \text{ (4 d.p.)}$$
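
As a quick numerical check (an illustrative Python sketch, not part of the required working), the summary-statistic calculation can be reproduced directly:

```python
import numpy as np

# Income (in £1000) and savings (in £100) for the 7 households
x = np.array([15, 22, 30, 35, 42, 55, 68])
y = np.array([3, 8, 12, 18, 22, 35, 48])

# The same S_xx, S_yy, S_xy summary statistics used in the working above
n = len(x)
Sxx = (x**2).sum() - x.sum()**2 / n
Syy = (y**2).sum() - y.sum()**2 / n
Sxy = (x * y).sum() - x.sum() * y.sum() / n

r = Sxy / np.sqrt(Sxx * Syy)
print(round(r, 4))                        # 0.9948
print(round(np.corrcoef(x, y)[0, 1], 4))  # same value from NumPy directly
```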

(b) The student's claim is incorrect. The PMCC is invariant under linear coding of the form $u = ax + b$ and $v = cy + d$ (where $a, c \neq 0$). Here $u = \frac{1}{1000}x$ and $v = \frac{1}{100}y$, which are linear transformations.

To see why: the PMCC is defined as $r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}$. Under coding (the shifts $b$ and $d$ cancel inside each $S$ term, so it suffices to consider the scalings $u = ax$, $v = cy$):

$$S_{uv} = \sum uv - \frac{(\sum u)(\sum v)}{n} = ac\sum xy - ac \cdot \frac{(\sum x)(\sum y)}{n} = ac \cdot S_{xy}$$

$$S_{uu} = a^2 S_{xx}, \quad S_{vv} = c^2 S_{yy}$$

$$r_{uv} = \frac{ac \cdot S_{xy}}{\sqrt{a^2 S_{xx} \cdot c^2 S_{yy}}} = \frac{ac \cdot S_{xy}}{|ac|\sqrt{S_{xx} \cdot S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = r_{xy}$$

The factors of $a$ and $c$ cancel completely (here $a, c > 0$, so $ac = |ac|$), so the PMCC is unchanged by scaling or shifting. If exactly one of $a$, $c$ were negative, only the sign of $r$ would flip.
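
This invariance is easy to confirm numerically; a minimal sketch using the data above (the coding from the question, plus an arbitrary linear coding for illustration):

```python
import numpy as np

x = np.array([15, 22, 30, 35, 42, 55, 68], dtype=float)
y = np.array([3, 8, 12, 18, 22, 35, 48], dtype=float)

u = x / 1000        # the coding from the question
v = y / 100
w = 3 * x + 7       # any linear coding with a > 0 behaves the same way

r_xy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_xy, np.corrcoef(u, v)[0, 1]))  # True
print(np.isclose(r_xy, np.corrcoef(w, y)[0, 1]))  # True
```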

(c) Ranking the data:

| $x$ | Rank $R_x$ | $y$ | Rank $R_y$ | $d = R_x - R_y$ | $d^2$ |
|---|---|---|---|---|---|
| 15 | 1 | 3 | 1 | 0 | 0 |
| 22 | 2 | 8 | 2 | 0 | 0 |
| 30 | 3 | 12 | 3 | 0 | 0 |
| 35 | 4 | 18 | 4 | 0 | 0 |
| 42 | 5 | 22 | 5 | 0 | 0 |
| 55 | 6 | 35 | 6 | 0 | 0 |
| 68 | 7 | 48 | 7 | 0 | 0 |

$$\sum d^2 = 0$$

$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - 0 = 1$$

Spearman's rank correlation coefficient is 1 (perfect rank correlation).
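
For reference, the same value comes from SciPy (an illustrative check; both sequences are strictly increasing, so the ranks agree exactly):

```python
from scipy.stats import spearmanr

x = [15, 22, 30, 35, 42, 55, 68]
y = [3, 8, 12, 18, 22, 35, 48]

rs, _ = spearmanr(x, y)  # returns (coefficient, p-value)
print(rs)                # 1.0
```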

(d) The PMCC is more appropriate here because:

  1. The data appears to follow an approximately linear relationship (PMCC $= 0.995$), so the PMCC correctly captures the strength of this linear association.
  2. The PMCC uses the actual data values, giving a more precise measure of the strength of the linear relationship. Spearman's rank discards information about the magnitude of differences between values.

Spearman's rank would be preferred when:

  • The relationship is monotonic but not linear (e.g., exponential or logarithmic).
  • The data contains extreme outliers whose values would distort the PMCC but whose rank positions are not affected.
  • The data is ordinal (ranked categories) rather than interval/ratio.

UT-2: Regression Line Properties and Extrapolation Risk

Question:

The regression line of $y$ on $x$ for a dataset is given by $y = 4.2x - 3.1$, and the regression line of $x$ on $y$ is $x = 0.21y + 1.8$. The dataset has $n = 20$ observations with $\bar{x} = 5.0$ and $\bar{y} = 17.9$.

(a) Verify that the point $(\bar{x}, \bar{y})$ lies on both regression lines.

(b) Calculate the PMCC for the dataset.

(c) A student uses the regression line of $y$ on $x$ to predict $y$ when $x = 15$, obtaining $\hat{y} = 59.9$. Explain why this prediction may be unreliable, identifying the specific statistical concept that is violated.

(d) Show that the two regression lines intersect at the point $(\bar{x}, \bar{y})$, and explain geometrically why this must always be the case for any bivariate dataset.

[Difficulty: hard. Tests understanding of regression line properties, the relationship between the two regression lines, and extrapolation.]

Solution:

(a) On the line $y = 4.2x - 3.1$: when $x = 5.0$, $y = 4.2(5.0) - 3.1 = 21.0 - 3.1 = 17.9$. This matches $\bar{y}$. So $(\bar{x}, \bar{y}) = (5.0, 17.9)$ lies on the regression line of $y$ on $x$.

On the line $x = 0.21y + 1.8$: when $y = 17.9$, $x = 0.21(17.9) + 1.8 = 3.759 + 1.8 = 5.559$. This does not equal $\bar{x} = 5.0$.

This discrepancy means the stated coefficients are not simultaneously consistent with $\bar{x} = 5.0$ and $\bar{y} = 17.9$ for both lines. To check: if the regression line of $y$ on $x$ passes through $(5, 17.9)$, then $17.9 = 4.2(5) + c$, so $c = 17.9 - 21 = -3.1$, which matches the given intercept.

For the regression line of $x$ on $y$ to pass through $(5, 17.9)$: $5 = 0.21(17.9) + c$, so $c = 5 - 3.759 = 1.241$. But the given line has intercept $1.8$, not $1.241$, so the given regression lines are not consistent.

For the remainder of this solution, we use the corrected regression line of $x$ on $y$: $x = 0.21y + 1.241$. Note that the gradient $0.21$ is unchanged by this correction, and only the gradient is needed in part (b).

(b) From the regression line of $y$ on $x$: $y = 4.2x - 3.1$, the regression coefficient is $b_{yx} = 4.2$.

The formula for the regression coefficient is:

$$b_{yx} = r \cdot \frac{s_y}{s_x}$$

The standard deviations $s_x$ and $s_y$ are not given, so instead we use the relationship between the two regression coefficients:

$$b_{yx} \cdot b_{xy} = r^2$$

where $b_{xy}$ is the gradient of the regression line of $x$ on $y$. From the corrected line $x = 0.21y + 1.241$: $b_{xy} = 0.21$.

$$r^2 = 4.2 \times 0.21 = 0.882$$

$$r = \sqrt{0.882} = 0.939 \text{ (3 d.p.)}$$

We take the positive root because both regression coefficients are positive, indicating a positive association.
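
The identity $b_{yx} \cdot b_{xy} = r^2$ holds exactly for any least squares fit, which can be checked on synthetic data (a sketch with made-up values, not the dataset in the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5, 2, 200)
y = 4.2 * x - 3.1 + rng.normal(0, 3, 200)  # linear trend plus noise

b_yx = np.polyfit(x, y, 1)[0]  # gradient of the regression of y on x
b_xy = np.polyfit(y, x, 1)[0]  # gradient of the regression of x on y
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(b_yx * b_xy, r**2))  # True
```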

(c) The prediction at $x = 15$ is an extrapolation beyond the range of the data: the mean is $\bar{x} = 5.0$, so $x = 15$ is likely to lie well outside the observed range. The regression line is only valid for interpolation within (or close to) the range of the observed data. Extrapolating assumes that the linear relationship continues to hold well beyond the observed range, which is often not the case in practice.

The specific statistical concept violated is that the least squares regression model assumes the relationship is linear over the relevant range. There is no evidence that linearity holds at $x = 15$.

(d) The regression line of $y$ on $x$ is:

$$y - \bar{y} = b_{yx}(x - \bar{x})$$

Setting $x = \bar{x}$ gives $y = \bar{y}$, confirming the line passes through $(\bar{x}, \bar{y})$.

Similarly, the regression line of $x$ on $y$ is:

$$x - \bar{x} = b_{xy}(y - \bar{y})$$

Setting $y = \bar{y}$ gives $x = \bar{x}$, confirming this line also passes through $(\bar{x}, \bar{y})$.

Geometrically, the point of means $(\bar{x}, \bar{y})$ is the "centre of gravity" of the data. The least squares criterion minimises the sum of squared vertical distances (for $y$ on $x$) or horizontal distances (for $x$ on $y$) from the line. The line of best fit must pass through the centre of gravity because shifting the line away from $(\bar{x}, \bar{y})$ would increase the total sum of squared residuals. Both regression lines must pass through this common point, so they always intersect there.
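
This property can also be confirmed numerically on any dataset (an illustrative sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 25)
y = 2 * x + 1 + rng.normal(0, 4, 25)

b, a = np.polyfit(x, y, 1)                     # fitted line y = b*x + a
print(np.isclose(a + b * x.mean(), y.mean()))  # True: (x-bar, y-bar) lies on the line
```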


UT-3: Residual Analysis and Least Squares Justification

Question:

A teacher believes there is a linear relationship between the number of hours a student revises ($h$) and their exam score ($s$). She collects data for 10 students and fits the regression line $s = 12.5h + 28$. The residuals are:

| $h$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Residual | $-8$ | $-3$ | 1 | 2 | 3 | 4 | 2 | $-1$ | $-5$ | 5 |

(a) Verify that the sum of the residuals is zero, and explain why this must always be the case for a least squares regression line.

(b) Plot the residuals against $h$ and describe the pattern. What does this pattern suggest about the appropriateness of the linear model?

(c) The teacher considers fitting a quadratic model instead. Without fitting the model, explain what change in the residual pattern you would expect to see if the quadratic model were a better fit.

(d) A student claims: "Since the least squares line minimises the sum of squared residuals, it is the 'best' line in every sense." Explain why vertical distances (not perpendicular distances) are minimised, and give one situation where minimising vertical distances is inappropriate.

[Difficulty: hard. Tests deep understanding of the least squares method and residual diagnostics.]

Solution:

(a) Sum of residuals:

$$(-8) + (-3) + 1 + 2 + 3 + 4 + 2 + (-1) + (-5) + 5 = 0 \checkmark$$

This must always be zero because the least squares regression line of $y$ on $x$ satisfies $\hat{y} = a + bx$ where:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

$$a = \bar{y} - b\bar{x}$$

The sum of residuals is:

$$\sum e_i = \sum(y_i - \hat{y}_i) = \sum y_i - \sum(a + bx_i) = \sum y_i - na - b\sum x_i$$

Substituting $a = \bar{y} - b\bar{x}$:

$$= \sum y_i - n(\bar{y} - b\bar{x}) - b\sum x_i = \sum y_i - n\bar{y} + nb\bar{x} - b\sum x_i$$

Since $\sum y_i = n\bar{y}$ and $\sum x_i = n\bar{x}$:

$$= n\bar{y} - n\bar{y} + nb\bar{x} - nb\bar{x} = 0$$

The residuals always sum to zero because the regression line passes through $(\bar{x}, \bar{y})$, and the deviations from the mean sum to zero.
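
A quick numerical check of this property on arbitrary data (an illustrative sketch matching the proof above; any least squares fit with an intercept behaves this way):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 30)
y = 12.5 * x + 28 + rng.normal(0, 5, 30)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
print(np.isclose(residuals.sum(), 0.0, atol=1e-8))  # True
```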

(b) Plotting residuals against $h$:

The residuals are $-8, -3, 1, 2, 3, 4, 2, -1, -5, 5$ for $h = 1, 2, \dots, 10$ respectively.

The residual plot shows a clear systematic pattern: the residuals are negative at low $h$, rise to positive values around $h = 5$ to $7$, dip negative again at $h = 8$ and $9$, then rise at $h = 10$. This curved, wave-like pattern indicates that the relationship between $h$ and $s$ is not purely linear: the linear model over-predicts at low $h$ (negative residuals) and under-predicts in the middle of the range (positive residuals), which is the hallmark of a non-linear relationship.

If the residuals were randomly scattered around zero with no discernible pattern, this would support the linearity assumption.
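
A residual plot along these lines could be produced as follows (a minimal sketch using the residuals from the question):

```python
import matplotlib.pyplot as plt

h = range(1, 11)
residuals = [-8, -3, 1, 2, 3, 4, 2, -1, -5, 5]  # from the table above

plt.scatter(h, residuals)
plt.axhline(0, color="grey", linestyle="--")  # reference line at zero
plt.xlabel("Hours revised (h)")
plt.ylabel("Residual")
plt.title("Residuals vs h")
plt.show()
```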

(c) If a quadratic model were a better fit, the residuals from the quadratic model should show no systematic pattern when plotted against $h$. Specifically:

  • The curved pattern should disappear.
  • The residuals should be randomly scattered above and below zero.
  • The magnitude of the residuals should be roughly constant across the range of $h$ (homoscedasticity).

The quadratic model would capture the curvature in the data, so the remaining variation would be random rather than systematic.

(d) The least squares regression line of $y$ on $x$ minimises the sum of squared vertical distances because it treats $x$ as the independent (explanatory) variable and $y$ as the dependent (response) variable. The assumption is that $x$ is measured without error (or with negligible error) and all the random variation is in $y$. This is appropriate when we want to predict $y$ from $x$.

Minimising perpendicular distances would be inappropriate when:

  • The two variables have different units or scales (e.g., predicting exam score from hours of revision). A perpendicular distance would mix the two units in a way that has no meaningful interpretation.
  • One variable is clearly the predictor and the other the response. In prediction contexts, we care about vertical error (prediction error in $y$), not the geometric distance from the line.

Conversely, minimising vertical distances is inappropriate when there is substantial measurement error in both variables: then neither "vertical" nor "perpendicular" least squares is ideal, and techniques such as Deming regression would be more appropriate, though this is beyond A-Level. The contrast between the two criteria is sketched below.
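
The following sketch fits both criteria to made-up data with error in both variables (orthogonal regression via the first principal component; Deming regression with equal error variances reduces to this):

```python
import numpy as np

rng = np.random.default_rng(1)
x_true = np.linspace(0, 10, 50)
x = x_true + rng.normal(0, 1.0, 50)           # measurement error in x
y = 3 * x_true + 2 + rng.normal(0, 1.0, 50)   # measurement error in y

# OLS of y on x: minimises squared vertical distances
b_ols = np.polyfit(x, y, 1)[0]

# Orthogonal regression: the first principal component of the centred
# data minimises squared perpendicular distances
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(X, full_matrices=False)
b_orth = vt[0, 1] / vt[0, 0]

print(f"true slope 3.0, OLS {b_ols:.2f}, orthogonal {b_orth:.2f}")
# The error in x attenuates the OLS slope towards zero; the orthogonal
# fit is less affected when the two error variances are equal.
```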

Integration Tests

Tests synthesis of correlation and regression with other topics. Requires combining concepts from multiple units.

IT-1: Significance Test for Correlation Coefficient (with Hypothesis Testing)

Question:

A researcher investigates whether there is a correlation between the number of hours of sleep ($s$) and performance on a cognitive test ($p$) for a random sample of 12 adults. She calculates the PMCC to be $r = 0.58$.

(a) Stating your hypotheses clearly, test at the 5% significance level whether the population correlation coefficient $\rho$ is positive. Use the fact that the critical value for a one-tailed test with $n = 12$ at the 5% level is 0.497.

(b) A colleague suggests using a two-tailed test instead. Without recalculating, state whether the conclusion would change and explain why.

(c) The researcher then collects data for a larger sample of 30 adults and finds $r = 0.35$. Using the fact that the critical value for a one-tailed test with $n = 30$ at the 5% level is 0.306, test whether there is evidence of positive correlation. Compare your conclusion with part (a) and explain the role of sample size.

(d) Explain why the PMCC can be tested for significance using a critical value table, but Spearman's rank correlation coefficient requires a different table.

[Difficulty: hard. Combines correlation with formal hypothesis testing and understanding of sample size effects.]

Solution:

(a) Hypotheses:

$H_0: \rho = 0$ (no correlation in the population)

$H_1: \rho > 0$ (positive correlation in the population)

Test: One-tailed test at the 5% significance level.

Critical value: For $n = 12$ (so degrees of freedom $= 10$), the critical value is 0.497.

Test statistic: $r = 0.58$

Since $0.58 > 0.497$, the test statistic exceeds the critical value.

Conclusion: There is sufficient evidence to reject $H_0$ and conclude that there is evidence of a positive correlation between hours of sleep and cognitive test performance in the population.

(b) For a two-tailed test at the 5% level, the critical value would be higher (the two-tailed 5% critical value for $n = 12$ is approximately 0.576).

Since $r = 0.58 > 0.576$, the test statistic still exceeds the critical value, so the conclusion would not change. There would still be sufficient evidence to reject $H_0$.

However, note that the two-tailed test is more conservative: it requires stronger evidence because it tests for any non-zero correlation (positive or negative) rather than just positive correlation.

(c) Hypotheses:

$H_0: \rho = 0$, $H_1: \rho > 0$

Critical value: For $n = 30$ at the 5% level (one-tailed), the critical value is 0.306.

Test statistic: $r = 0.35$

Since $0.35 > 0.306$, we reject $H_0$.

Comparison: Both tests lead to rejection of $H_0$, but in part (a) the correlation was stronger ($r = 0.58$) with a smaller sample ($n = 12$), while in part (c) the correlation is weaker ($r = 0.35$) but the larger sample ($n = 30$) provides more evidence. The critical value decreases as $n$ increases because, with more data, even a weak correlation becomes statistically significant. Statistical significance therefore depends on both the strength of the correlation and the sample size: a small sample needs a stronger correlation to achieve significance.
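
For context (beyond what the question requires), these tabulated critical values can be reproduced from the $t$-distribution using the standard relation $t = r\sqrt{(n-2)/(1-r^2)}$ under $H_0$; a sketch:

```python
from math import sqrt
from scipy.stats import t

def pmcc_critical(n, alpha=0.05):
    """One-tailed critical value for the PMCC under H0: rho = 0."""
    tc = t.ppf(1 - alpha, df=n - 2)   # critical t with n - 2 degrees of freedom
    return tc / sqrt(tc**2 + n - 2)   # invert t = r * sqrt((n-2)/(1-r^2))

print(round(pmcc_critical(12), 3))  # 0.497, as used in part (a)
print(round(pmcc_critical(30), 3))  # 0.306, as used in part (c)
```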

(d) The PMCC follows a known sampling distribution (related to the $t$-distribution) under the null hypothesis $H_0: \rho = 0$, assuming the population is bivariate normal. This allows the construction of critical value tables specific to the PMCC.

Spearman's rank correlation coefficient, $r_s$, is based on the ranks of the data rather than the raw values. Its sampling distribution under $H_0: \rho_s = 0$ is different from that of the PMCC. For small samples, $r_s$ has a discrete distribution (since ranks are integers), and the exact critical values are tabulated separately. The PMCC critical values cannot be used for Spearman's test because the distributions are different.


IT-2: Expected Value of Regression Predictions (with Probability)

Question:

The regression line of exam mark $M$ on hours of revision $H$ for a large population of students is:

$$M = 5H + 20$$

The number of revision hours $H$ for a randomly selected student follows the distribution:

$$P(H = h) = \begin{cases} 0.1 & h = 1 \\ 0.2 & h = 2 \\ 0.3 & h = 3 \\ 0.25 & h = 4 \\ 0.15 & h = 5 \end{cases}$$

(a) Find $\mathrm{E}(H)$ and $\mathrm{Var}(H)$.

(b) Find $\mathrm{E}(M)$ using the regression line and the distribution of $H$.

(c) The variance of the actual exam marks around the regression line is $\sigma^2 = 25$. Find $\mathrm{Var}(M)$.

(d) A student studies for exactly 3 hours. The regression line predicts $M = 35$. Explain why the actual exam mark is unlikely to be exactly 35, and calculate the probability that a student who revises for 3 hours scores above 40, assuming the residuals are normally distributed.

[Difficulty: hard. Combines regression with probability distributions and properties of expectation/variance.]

Solution:

(a)

$$\mathrm{E}(H) = \sum h \cdot P(H = h) = 1(0.1) + 2(0.2) + 3(0.3) + 4(0.25) + 5(0.15)$$

$$= 0.1 + 0.4 + 0.9 + 1.0 + 0.75 = 3.15$$

$$\mathrm{E}(H^2) = 1(0.1) + 4(0.2) + 9(0.3) + 16(0.25) + 25(0.15)$$

$$= 0.1 + 0.8 + 2.7 + 4.0 + 3.75 = 11.35$$

$$\mathrm{Var}(H) = \mathrm{E}(H^2) - [\mathrm{E}(H)]^2 = 11.35 - 3.15^2 = 11.35 - 9.9225 = 1.4275$$

(b) Using the linearity of expectation:

$$\mathrm{E}(M) = \mathrm{E}(5H + 20) = 5\mathrm{E}(H) + 20 = 5(3.15) + 20 = 15.75 + 20 = 35.75$$

Alternatively, using the law of total expectation: the predicted mark for each value of $H$ is $5h + 20$, and averaging these predictions weighted by the probability of each $h$ gives the same result.

(c) The total variance of $M$ has two components:

  1. Variance due to the variation in $H$ across students.
  2. Variance due to the scatter around the regression line.

$$\mathrm{Var}(M) = \mathrm{Var}(5H + 20 + \varepsilon) = \mathrm{Var}(5H) + \mathrm{Var}(\varepsilon)$$

where $\varepsilon$ represents the random scatter (residuals), assumed independent of $H$.

$$\mathrm{Var}(5H) = 5^2\,\mathrm{Var}(H) = 25 \times 1.4275 = 35.6875$$

$$\mathrm{Var}(\varepsilon) = \sigma^2 = 25$$

$$\mathrm{Var}(M) = 35.6875 + 25 = 60.6875$$

$$\mathrm{SD}(M) = \sqrt{60.6875} \approx 7.79$$
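
Parts (a) to (c) can be checked with a few lines of arithmetic (an illustrative sketch):

```python
import numpy as np

h = np.array([1, 2, 3, 4, 5])
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

EH = (h * p).sum()                 # E(H) = 3.15
VarH = (h**2 * p).sum() - EH**2    # Var(H) = 1.4275

EM = 5 * EH + 20                   # E(M) = 35.75
VarM = 5**2 * VarH + 25            # Var(M) = 60.6875, with sigma^2 = 25
print(EH, VarH, EM, VarM, np.sqrt(VarM))
```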

(d) The regression line gives the expected (mean) exam mark for a given number of revision hours. Individual students will score above or below this prediction due to other factors not captured by the model (natural ability, exam difficulty, etc.). The residual for a student who revises 3 hours is the difference between their actual mark and the predicted mark of 35.

Assuming the residuals are normally distributed with mean 0 and standard deviation $\sigma = 5$:

$$P(M > 40 \mid H = 3) = P\left(\frac{M - 35}{5} > \frac{40 - 35}{5}\right) = P(Z > 1)$$

$$= 1 - \Phi(1) = 1 - 0.8413 = 0.1587$$

The probability is approximately 0.159 (15.9%).
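
The tail probability can be confirmed directly (a one-line check with SciPy):

```python
from scipy.stats import norm

# P(M > 40 | H = 3) with M ~ N(35, 5^2) under the stated assumptions
print(norm.sf(40, loc=35, scale=5))  # 0.1587 (= 1 - Phi(1))
```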


IT-3: Outlier Removal and Recalculation (with Data Representation)

Question:

The data below shows the temperature ($T$, in °C) and ice cream sales ($S$, in £) at a shop on 8 days:

| $T$ | 12 | 15 | 18 | 20 | 22 | 25 | 28 | 30 |
|---|---|---|---|---|---|---|---|---|
| $S$ | 45 | 80 | 120 | 150 | 170 | 250 | 310 | 50 |

The regression line of $S$ on $T$ is $S = 12.34T - 102.9$, with PMCC $r = 0.893$.

(a) Identify the outlier in the data by examining the residuals. Explain why the value $S = 50$ when $T = 30$ is anomalous.

(b) After removing the outlier, the new summary statistics are: $n = 7$, $\bar{T} = 20$, $\bar{S} = 160.7$, $S_{TT} = 238$, $S_{SS} = 46247$, $S_{TS} = 3292$. Calculate the new PMCC and regression line.

(c) Calculate the percentage change in the PMCC after removing the outlier. Comment on the sensitivity of the PMCC to outliers compared to Spearman's rank correlation coefficient.

(d) Using the original data (with the outlier), a box plot of the residuals shows a value far below the lower whisker. Explain why removing outliers based solely on residual box plots can be dangerous, and suggest a more principled approach.

[Difficulty: hard. Combines outlier detection, recalculation of regression statistics, and critical evaluation of outlier removal.]

Solution:

(a) Calculating residuals from $S = 12.34T - 102.9$:

| $T$ | $S$ (observed) | $S$ (predicted) | Residual |
|---|---|---|---|
| 12 | 45 | 45.2 | $-0.2$ |
| 15 | 80 | 82.2 | $-2.2$ |
| 18 | 120 | 119.2 | $0.8$ |
| 20 | 150 | 143.9 | $6.1$ |
| 22 | 170 | 168.6 | $1.4$ |
| 25 | 250 | 205.6 | $44.4$ |
| 28 | 310 | 242.6 | $67.4$ |
| 30 | 50 | 267.3 | $-217.3$ |

The residual for $T = 30$ is $-217.3$, which is enormously negative. The predicted sales for $T = 30\ ^\circ$C are around £267, but the observed value is only £50. This is almost certainly a data entry error (perhaps the sales were £350, or the temperature was recorded incorrectly). The residual of $-217.3$ is an extreme outlier compared with all the other residuals, which range from $-2.2$ to $67.4$.
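
The residual calculation is easy to automate (an illustrative sketch using the stated regression line):

```python
import numpy as np

T = np.array([12, 15, 18, 20, 22, 25, 28, 30])
S = np.array([45, 80, 120, 150, 170, 250, 310, 50])

predicted = 12.34 * T - 102.9   # the stated regression line
residuals = S - predicted
print(residuals.round(1))
# -0.2, -2.2, 0.8, 6.1, 1.4, 44.4, 67.4, -217.3
# The final residual is an order of magnitude larger than the rest.
```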

(b) After removing the outlier ($T = 30$, $S = 50$):

$$r = \frac{S_{TS}}{\sqrt{S_{TT} \cdot S_{SS}}} = \frac{3292}{\sqrt{238 \times 46247}}$$

$$= \frac{3292}{\sqrt{11006786}} = \frac{3292}{3317.65} = 0.9923 \text{ (4 d.p.)}$$

Regression coefficient:

$$b = \frac{S_{TS}}{S_{TT}} = \frac{3292}{238} = 13.832$$

Intercept:

$$a = \bar{S} - b\bar{T} = 160.7 - 13.832(20) = 160.7 - 276.64 = -115.94$$

New regression line: $S = 13.8T - 116$ (coefficients to 3 s.f.).
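
Recomputing part (b) directly from the given summary statistics (a short verification sketch):

```python
import numpy as np

n, T_bar, S_bar = 7, 20, 160.7
S_TT, S_SS, S_TS = 238, 46247, 3292

r = S_TS / np.sqrt(S_TT * S_SS)   # 0.9923
b = S_TS / S_TT                   # 13.832
a = S_bar - b * T_bar             # -115.94
print(round(r, 4), round(b, 3), round(a, 2))
```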

(c) Original PMCC: $r = 0.893$. New PMCC: $r = 0.992$.

$$\text{Percentage change} = \frac{0.992 - 0.893}{0.893} \times 100 = \frac{0.099}{0.893} \times 100 = 11.1\%$$

The PMCC increased by approximately 11.1%. This demonstrates that the PMCC is sensitive to outliers because it uses the actual data values (and their deviations from the mean). A single extreme value can substantially reduce the PMCC.

Spearman's rank correlation coefficient responds differently because it uses only rank positions, not the magnitudes of the deviations. In the original data, the ranks of $T$ and $S$ agree except at the outlier: $T = 30$ has rank 8 for $T$, but $S = 50$ has rank 2 for $S$. The enormous size of the residual ($-217.3$) therefore counts only as a rank displacement, so Spearman's coefficient bounds the influence of how extreme the value is. With only $n = 8$ points, however, a single large displacement still matters: here $\sum d^2 = 42$, giving $r_s = 1 - \frac{6(42)}{8(63)} = 0.5$.

(d) Removing outliers based solely on residual box plots can be dangerous because:

  1. The outlier may be a genuine data point that reflects a real phenomenon (e.g., the shop was closed early that day, explaining low sales despite high temperature). Removing genuine outliers biases the analysis.

  2. The residual box plot itself depends on the regression line, which is influenced by the outlier. If the outlier pulls the regression line, the residuals of other points change, potentially creating false outliers or masking real ones.

  3. Confirmation bias: Researchers may be tempted to remove outliers that contradict their hypothesis, leading to inflated correlation coefficients and misleading conclusions.

A more principled approach:

  • Investigate the source of the outlier (data entry error, equipment malfunction, etc.).
  • If the outlier is confirmed as an error, correct or remove it.
  • If the outlier is genuine, consider using robust methods (e.g., Spearman's rank) or report results both with and without the outlier.
  • Use domain knowledge to determine whether the value is plausible.