
Correlation and Regression

Board Coverage

| Board | Paper | Notes |
|---|---|---|
| AQA | Paper 1 | PMCC, regression lines |
| Edexcel | P1 | Includes Spearman's rank |
| OCR (A) | Paper 1 | Similar |
| CIE (9709) | P1, P6 | Correlation and regression in P1/P6 |
info

The formula booklet gives the formula for PMCC and the least squares regression line. You must be able to interpret these and understand their limitations.


1. Pearson's Product Moment Correlation Coefficient (PMCC)

1.1 Definition

Definition. For bivariate data $(x_1,y_1),\ldots,(x_n,y_n)$, the PMCC is

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

where

$$S_{xx} = \sum(x_i-\bar{x})^2 = \sum x_i^2 - n\bar{x}^2$$
$$S_{yy} = \sum(y_i-\bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$
$$S_{xy} = \sum(x_i-\bar{x})(y_i-\bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$
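
As a quick check of the formulas above, here is a minimal Python sketch that computes $r$ via the summary statistics (the function name `pmcc` is illustrative; the data are taken from Problem 1 in the problem set below):

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r computed via S_xx, S_yy and S_xy."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    return s_xy / sqrt(s_xx * s_yy)

# Data from Problem 1 below: (1,2), (2,3), (3,5), (4,4), (5,7)
r = pmcc([1, 2, 3, 4, 5], [2, 3, 5, 4, 7])
print(round(r, 3))  # 0.904
```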

1.2 Properties

  • $-1 \leq r \leq 1$
  • $r = 1$: perfect positive linear correlation
  • $r = -1$: perfect negative linear correlation
  • $r = 0$: no linear correlation (but there may be a non-linear relationship)
  • $r$ measures the strength of a linear relationship only
warning

Correlation does not imply causation. Two variables may be strongly correlated because they are both influenced by a third (confounding) variable, or by coincidence.

1.3 Real-World Applications

Economics: GDP per capita and life expectancy across countries typically show $r \approx 0.7$ to $0.85$. The relationship is strong but non-linear at high income levels (diminishing returns). The PMCC captures the overall linear trend but underestimates the strength of the relationship at lower incomes.

Medical studies: Dose-response relationships often yield strong positive PMCC values. A clinical trial might find $r = 0.92$ between drug dosage and reduction in blood pressure, suggesting a strong linear dose-response. However, biological systems typically have thresholds and saturation points where linearity breaks down.

Psychology: Study hours and exam scores often show moderate positive correlation ($r \approx 0.4$ to $0.7$). The PMCC captures the linear trend, but individual variation means prediction is imprecise: a student studying 10 hours could score anywhere on a wide range. This illustrates that even a moderate $r$ does not guarantee accurate individual predictions.


2. Spearman's Rank Correlation Coefficient

2.1 Definition

When data are ranked, Spearman's coefficient is

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

where $d_i$ is the difference in ranks for the $i$-th pair.
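
A direct translation of this formula into Python (a sketch that assumes no tied values, so simple ranks $1$ to $n$ suffice; the data come from Problem 3 in the problem set below):

```python
def spearman(xs, ys):
    """Spearman's r_s via the 6*sum(d^2) formula (assumes no ties)."""
    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]
    n = len(xs)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Data from Problem 3 below: ranks of y come out as 1, 3, 2, 4, 5
r_s = spearman([10, 20, 30, 40, 50], [15, 25, 18, 35, 42])
print(r_s)  # 0.9
```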

2.2 When to use

  • Data is ordinal (ranked categories)
  • The relationship is monotonic but not necessarily linear
  • There are outliers that would distort the PMCC

2.3 Handling tied ranks

When values are tied, assign the average of the ranks they would have occupied. The simplified formula above does not account for ties — a correction factor is needed for tied data.
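
The average-rank rule can be sketched as a small helper (a naive, illustrative implementation; the resulting ranks would then feed a tie-corrected formula rather than the simplified one above):

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the ranks they occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank this value occupies
        count = ordered.count(v)              # number of tied copies
        rank_of[v] = first + (count - 1) / 2  # average of the occupied ranks
    return [rank_of[v] for v in values]

# The two 3s occupy ranks 1 and 2, so each receives rank 1.5:
print(average_ranks([7, 3, 3, 9]))  # [3.0, 1.5, 1.5, 4.0]
```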

2.4 PMCC vs. Spearman's Rank: When to Use Which

| Criterion | PMCC | Spearman's |
|---|---|---|
| Data type | Continuous (interval/ratio) | Ordinal or continuous |
| Relationship type | Linear only | Any monotonic |
| Sensitivity to outliers | High | Low (ranks reduce impact) |
| Distribution assumption | Bivariate normal | None |
| Power (when assumptions met) | Higher | Lower |

Key point: If the data has a strong linear relationship and no extreme outliers, PMCC is preferred as it uses more information from the data. If the relationship is clearly monotonic but curved, or if outliers are present, Spearman's is more appropriate.

Example. Consider judge rankings in a competition. The data is inherently ordinal, so Spearman's rank is the natural choice regardless of whether PMCC could technically be computed. Similarly, in a psychology study measuring agreement between two raters on a Likert scale, Spearman's is the standard choice.


3. Least Squares Regression

3.1 Derivation

Problem. Find the line $y = a + bx$ that minimises

$$S(a,b) = \sum_{i=1}^{n}(y_i - a - bx_i)^2$$

3.2 Derivation using partial derivatives

Setting $\dfrac{\partial S}{\partial a} = 0$ and $\dfrac{\partial S}{\partial b} = 0$:

$$\frac{\partial S}{\partial a} = -2\sum(y_i - a - bx_i) = 0 \implies \sum y_i = na + b\sum x_i \tag{1}$$

$$\frac{\partial S}{\partial b} = -2\sum x_i(y_i - a - bx_i) = 0 \implies \sum x_i y_i = a\sum x_i + b\sum x_i^2 \tag{2}$$

From (1): $a = \bar{y} - b\bar{x}$.

Substituting into (2):

$$\sum x_i y_i = (\bar{y}-b\bar{x})\sum x_i + b\sum x_i^2 = n\bar{x}\bar{y} - bn\bar{x}^2 + b\sum x_i^2$$

$$\sum x_i y_i - n\bar{x}\bar{y} = b\left(\sum x_i^2 - n\bar{x}^2\right)$$

$$S_{xy} = b\,S_{xx}$$

$$\boxed{b = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}}$$

$$\boxed{a = \bar{y} - b\bar{x}}$$
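
The boxed results translate directly into code. A minimal sketch (the function name `least_squares` is illustrative; the data reuse Problems 1 and 2 from the problem set below):

```python
def least_squares(xs, ys):
    """Coefficients (a, b) of the least squares line y = a + bx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b = s_xy / s_xx        # gradient: b = S_xy / S_xx
    a = y_bar - b * x_bar  # intercept: a = y_bar - b * x_bar
    return a, b

# Data from Problems 1-2 below: the fitted line is y = 0.9 + 1.1x
a, b = least_squares([1, 2, 3, 4, 5], [2, 3, 5, 4, 7])
print(round(a, 3), round(b, 3))  # 0.9 1.1
```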


4. The Regression Line Passes Through $(\bar{x}, \bar{y})$

Theorem. The least squares regression line $y = a + bx$ passes through the point $(\bar{x}, \bar{y})$.

Proof. Substituting $x = \bar{x}$:

$$y = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}$$

So $(\bar{x}, \bar{y})$ lies on the regression line. $\blacksquare$

Intuition. The regression line passes through the "centre of mass" of the data. This makes sense — the best-fit line should balance the data around it, just as the mean balances a univariate dataset.


5. Interpreting Regression

5.1 Residuals

The residual for the $i$-th data point is $e_i = y_i - (a + bx_i)$.

Properties:

  • $\sum e_i = 0$ (the residuals sum to zero)
  • The least squares line minimises $\sum e_i^2$
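
Both properties can be checked numerically. A sketch reusing the summary-statistic formulas (illustrative data):

```python
def residuals(xs, ys):
    """Residuals e_i = y_i - (a + b*x_i) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

es = residuals([1, 2, 3, 4, 5], [2, 3, 5, 4, 7])
print(abs(sum(es)) < 1e-9)  # True: the residuals sum to zero
```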

5.2 Extrapolation

warning

The regression line should only be used for interpolation (predicting within the range of the data). Extrapolation (predicting outside the data range) is unreliable because the linear relationship may not hold.

5.3 Regression of $y$ on $x$ vs. $x$ on $y$

The regression line of $y$ on $x$ minimises vertical residuals ($y_i - \hat{y}_i$). The regression line of $x$ on $y$ minimises horizontal residuals ($x_i - \hat{x}_i$).

These are different lines unless $r = \pm 1$. The two regression lines intersect at $(\bar{x}, \bar{y})$.
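
A numerical check that both lines pass through $(\bar{x}, \bar{y})$ — fitting $x$ on $y$ is just the same calculation with the roles swapped (illustrative data; `fit` is an assumed helper name):

```python
def fit(xs, ys):
    """Intercept and gradient of the least squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

xs, ys = [1, 2, 3, 4, 5], [2, 3, 5, 4, 7]
a1, b1 = fit(xs, ys)  # y on x (vertical residuals)
a2, b2 = fit(ys, xs)  # x on y (roles swapped: horizontal residuals)
x_bar, y_bar = sum(xs) / 5, sum(ys) / 5
print(abs(a1 + b1 * x_bar - y_bar) < 1e-9)  # True
print(abs(a2 + b2 * y_bar - x_bar) < 1e-9)  # True
```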

5.4 Residual Plots

A residual plot graphs the residuals $e_i$ against the fitted values $\hat{y}_i$ (or against $x_i$).

What to look for:

  • Random scatter around zero: the linear model is appropriate.
  • Curved pattern (e.g., U-shape): the relationship is non-linear; a linear model is unsuitable.
  • Funnel shape (increasing spread): the variance is not constant (heteroscedasticity); predictions are less reliable at extremes.

Residual plots are a diagnostic tool — they reveal whether the assumptions of linear regression are met. In A Level exams, you may be asked to comment on a residual plot to assess whether the regression line is a good model.

5.5 Outliers and Influential Points

An outlier is a point with a large residual: it falls far from the regression line. An influential point is an outlier with high leverage, meaning its $x$-value is far from $\bar{x}$. Influential points can pull the regression line significantly toward themselves.

Effect on PMCC: A single influential point can dramatically change $r$. For example, adding an extreme point to a dataset with $r = 0.3$ could push $r$ to $0.8$ or change its sign entirely. This is why it is essential to inspect scatter plots alongside numerical summaries.
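
The effect is easy to demonstrate with made-up numbers (the values below are illustrative, not from a real dataset):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the summary statistics."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    return s_xy / sqrt(s_xx * s_yy)

# A weakly correlated cloud of six points:
xs, ys = [1, 2, 3, 4, 5, 6], [3, 1, 4, 2, 5, 3]
r_before = pearson_r(xs, ys)
# Add one extreme point with high leverage (x far from x_bar):
r_after = pearson_r(xs + [20], ys + [30])
print(round(r_before, 2), round(r_after, 2))  # 0.38 0.97
```

A single added point turns a weak correlation into an apparently very strong one, which is exactly why the scatter plot must be inspected.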

Example. In a study of height vs. salary across 50 people, most data shows weak positive correlation ($r \approx 0.2$). If one NBA player earning millions is included, the PMCC may jump to $r \approx 0.6$, giving a misleading impression. In such cases, Spearman's rank is more robust because ranking reduces the disproportionate influence of extreme values.

5.6 Dangers of Extrapolation

Economic example: A regression model based on UK inflation data from 2010–2020 (rates between 0% and 3%) might predict negative inflation under certain conditions. Extrapolating to predict 2022 inflation (which reached 11.1%) would produce wildly inaccurate results because the underlying economic conditions changed entirely.

Medical example: A linear dose-response model calibrated for doses of 0–50 mg might predict $y = -3$ for a dose of 0 mg, which is physically impossible (negative response). The model is only valid within its calibration range. Biological systems typically exhibit thresholds and saturation effects that linear models cannot capture.

General principle: Always state the range of the original data and note that predictions outside this range are unreliable. In exam questions, you will typically lose marks if you extrapolate without commenting on the limitation.


6. Coding in Regression

If we code $u = \dfrac{x-p}{q}$ and $v = \dfrac{y-r}{s}$, and find the regression line $v = c + du$, then:

  • The gradient in terms of original variables: $b = \dfrac{s}{q}d$
  • The intercept: $a = r + sc - bp$
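
The two conversion rules can be wrapped in a small helper (`decode_line` is an illustrative name; the numbers come from the worked example in Section 6.2 below):

```python
def decode_line(c, d, p, q, r, s):
    """Convert the coded line v = c + d*u, where u = (x-p)/q and
    v = (y-r)/s, back to y = a + b*x in the original variables."""
    b = (s / q) * d        # gradient rule: b = (s/q) * d
    a = r + s * c - b * p  # intercept rule: a = r + s*c - b*p
    return a, b

# Section 6.2 example: u = x/10, v = y/1000, coded line v = 2.3 + 0.7u
a, b = decode_line(c=2.3, d=0.7, p=0, q=10, r=0, s=1000)
print(round(a, 6), round(b, 6))  # 2300.0 70.0
```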

6.1 Effect of Coding on Correlation

Coding does not change the PMCC or Spearman's rank correlation coefficient.

Why? PMCC is based on standardised quantities. Coding $x \mapsto u = (x-p)/q$ is a linear transformation (shift by $p$, scale by $1/q$), and $r$ is unchanged by such transformations of either variable, provided the scale factors are positive (a negative scale factor flips the sign of $r$). Similarly, Spearman's uses ranks, which are unaffected by any increasing monotonic transformation, including linear coding.

Effect on regression: Coding changes the gradient and intercept of the regression line, as the conversion rules above show, but the underlying relationship between the variables is unchanged. The coefficient of determination $r^2$ is also invariant under coding.
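
A quick numerical confirmation of the invariance (the data and coding constants are illustrative):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the summary statistics."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    return s_xy / sqrt(s_xx * s_yy)

xs, ys = [1, 2, 3, 4, 5], [2, 3, 5, 4, 7]
us = [(x - 3) / 2 for x in xs]    # coded x: u = (x-3)/2
vs = [(y - 4) / 0.5 for y in ys]  # coded y: v = (y-4)/0.5
print(abs(pearson_r(xs, ys) - pearson_r(us, vs)) < 1e-9)  # True
```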

6.2 Worked Example: Coding with Economic Data

An economist records quarterly revenue and advertising spend. To simplify calculations, she codes $u = x/10$ (where $x$ is advertising in GBP) and $v = y/1000$ (where $y$ is revenue in GBP).

If the coded regression line is $v = 2.3 + 0.7u$, then in original variables:

$$\frac{y}{1000} = 2.3 + 0.7\left(\frac{x}{10}\right)$$

$$y = 2300 + 70x$$

The gradient $b = 70$ means each additional GBP spent on advertising is associated with an increase of GBP 70 in revenue. The PMCC calculated from the coded data would be identical to the PMCC from the original data.


Problem Set


Problem 1 Calculate the PMCC for the data: $(1,2)$, $(2,3)$, $(3,5)$, $(4,4)$, $(5,7)$.


Solution 1 $n=5$, $\bar{x}=3$, $\bar{y}=4.2$.

$\sum x^2 = 55$, $\sum y^2 = 103$, $\sum xy = 74$.

$S_{xx} = 55 - 5(9) = 10$. $S_{yy} = 103 - 5(17.64) = 103 - 88.2 = 14.8$. $S_{xy} = 74 - 5(3)(4.2) = 74 - 63 = 11$.

$r = \dfrac{11}{\sqrt{10 \times 14.8}} = \dfrac{11}{\sqrt{148}} = \dfrac{11}{12.166} \approx 0.904$.

If you get this wrong, revise: Pearson's PMCC — Section 1.


Problem 2 Find the equation of the regression line of $y$ on $x$ for the data in Problem 1.


Solution 2 $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{11}{10} = 1.1$.

$a = \bar{y} - b\bar{x} = 4.2 - 1.1(3) = 4.2 - 3.3 = 0.9$.

Regression line: $y = 0.9 + 1.1x$.

If you get this wrong, revise: Least Squares Regression — Section 3.


Problem 3 For the data below, calculate Spearman's rank correlation coefficient.

| $x$ | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| $y$ | 15 | 25 | 18 | 35 | 42 |


Solution 3 Ranks of $x$: 1, 2, 3, 4, 5. Ranks of $y$: 1, 3, 2, 4, 5.

Differences $d$: 0, -1, 1, 0, 0.

$\sum d^2 = 0 + 1 + 1 + 0 + 0 = 2$.

$r_s = 1 - \dfrac{6 \times 2}{5(25-1)} = 1 - \dfrac{12}{120} = 1 - 0.1 = 0.9$.

If you get this wrong, revise: Spearman's Rank Correlation — Section 2.


Problem 4 Prove that $\sum e_i = 0$, where $e_i = y_i - (a + bx_i)$ are the residuals of the least squares regression line.


Solution 4 $\sum e_i = \sum y_i - na - b\sum x_i = n\bar{y} - n(\bar{y} - b\bar{x}) - bn\bar{x} = n\bar{y} - n\bar{y} + nb\bar{x} - nb\bar{x} = 0 \quad \blacksquare$

If you get this wrong, revise: Residuals — Section 5.1.


Problem 5 Data is coded using $u = x - 10$ and $v = y/2$. The coded regression line is $v = 0.5 + 0.3u$. Find the regression line of $y$ on $x$.


Solution 5 Original: $y/2 = 0.5 + 0.3(x-10)$.

$y = 1 + 0.6(x-10) = 1 + 0.6x - 6 = 0.6x - 5$.

So $y = -5 + 0.6x$.

If you get this wrong, revise: Coding in Regression — Section 6.


Problem 6 A student finds $r = 0.95$ between ice cream sales and drowning deaths. The student concludes ice cream causes drowning. Explain the flaw.


Solution 6 Correlation does not imply causation. Both ice cream sales and drowning deaths are influenced by a confounding variable: hot weather. In summer, more people buy ice cream and more people swim, leading to more of both. The correlation is real but the causal claim is not supported.

If you get this wrong, revise: Properties — Section 1.2.


Problem 7 Given $S_{xx} = 80$, $S_{yy} = 200$, and $S_{xy} = 100$, find $r$, $b$ (gradient of $y$ on $x$), and the proportion of variance in $y$ explained by $x$.


Solution 7 $r = \dfrac{100}{\sqrt{80 \times 200}} = \dfrac{100}{\sqrt{16000}} = \dfrac{100}{126.49} \approx 0.791$.

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{100}{80} = 1.25$.

Proportion of variance explained $= r^2 = 0.625$ (62.5%).

If you get this wrong, revise: Least Squares Regression — Section 3.


Problem 8 The regression line of $y$ on $x$ is $y = 2 + 3x$ with $\bar{x} = 5$. What is $\bar{y}$?


Solution 8 Since the regression line passes through $(\bar{x}, \bar{y})$:

$\bar{y} = 2 + 3(5) = 17$.

If you get this wrong, revise: The Regression Line Passes Through $(\bar{x}, \bar{y})$ — Section 4.


Problem 9 A residual plot shows a clear U-shaped pattern. What does this suggest about the regression model, and what would be a more appropriate approach?

Solution 9

A U-shaped residual plot indicates the relationship between the variables is non-linear (likely quadratic). The linear regression model is inappropriate because it fails to capture the curvature. A more appropriate approach would be to fit a quadratic model $y = a + bx + cx^2$, or to apply a transformation (e.g., taking logarithms) to linearise the relationship.

If you get this wrong, revise: Residual Plots — Section 5.4.


Problem 10 Two datasets have the same PMCC of $r = 0.85$. Dataset A has $n = 10$ observations; Dataset B has $n = 100$ observations. Explain why Dataset B provides stronger evidence of a real association.

Solution 10

With a larger sample size, the PMCC is estimated more precisely (smaller standard error). For $n = 10$, the PMCC must exceed approximately 0.632 to be significant at the 5% level (two-tailed). For $n = 100$, the threshold is approximately 0.197. While both datasets show the same correlation, Dataset B provides far stronger statistical evidence because random fluctuations are much less likely to produce $r = 0.85$ with 100 observations.

If you get this wrong, revise: Properties — Section 1.2.


Problem 11 Eight students were ranked by two teachers for a presentation. The rankings are:

| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Teacher 1 | 2 | 5 | 1 | 7 | 3 | 8 | 4 | 6 |
| Teacher 2 | 1 | 6 | 2 | 8 | 4 | 7 | 3 | 5 |

Calculate Spearman's rank correlation coefficient and interpret the result.

Solution 11

The data is already ranked, so:

| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| $d_i$ | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 |
| $d_i^2$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

$\sum d_i^2 = 8$.

$r_s = 1 - \dfrac{6 \times 8}{8(64 - 1)} = 1 - \dfrac{48}{504} = 1 - 0.0952 = 0.905$ (3 s.f.).

This indicates very strong positive agreement between the two teachers' rankings, suggesting consistent assessment standards.

If you get this wrong, revise: Spearman's Rank Correlation — Section 2.


Problem 12 A medical researcher collects data on blood pressure ($x$ mmHg) and cholesterol level ($y$ mg/dL) for 12 patients. She finds $\bar{x} = 132$, $\bar{y} = 218$, $S_{xx} = 3600$, $S_{yy} = 28900$, $S_{xy} = 8100$.

(a) Calculate the PMCC and interpret it. (b) Find the regression line of $y$ on $x$. (c) Predict the cholesterol level for a patient with blood pressure of 150 mmHg. Comment on the reliability.

Solution 12

(a) $r = \dfrac{8100}{\sqrt{3600 \times 28900}} = \dfrac{8100}{\sqrt{104040000}} = \dfrac{8100}{10200} \approx 0.794$.

This indicates a strong positive linear correlation between blood pressure and cholesterol level.

(b) $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{8100}{3600} = 2.25$.

$a = 218 - 2.25 \times 132 = 218 - 297 = -79$.

Regression line: $y = -79 + 2.25x$.

(c) When $x = 150$: $y = -79 + 2.25(150) = -79 + 337.5 = 258.5$ mg/dL.

This prediction is reasonably reliable since 150 is within (or close to) the range of the data. However, n=12n = 12 is a small sample, so there is considerable uncertainty. The prediction should not be treated as precise.

If you get this wrong, revise: Least Squares Regression — Section 3, and Extrapolation — Section 5.2.


Problem 13 Data is coded using $u = (x - 20)/5$ and $v = (y - 100)/10$. The coded PMCC is $r = 0.64$ and the coded regression line of $v$ on $u$ is $v = 1.2 + 0.8u$.

Find: (a) The PMCC for the original data. (b) The regression line of $y$ on $x$ in original variables.

Solution 13

(a) Coding does not change the PMCC, so $r = 0.64$ for the original data.

(b) Start from the coded line:

$$\frac{y - 100}{10} = 1.2 + 0.8 \times \frac{x - 20}{5}$$

$$\frac{y - 100}{10} = 1.2 + 0.16(x - 20)$$

$$\frac{y - 100}{10} = 1.2 + 0.16x - 3.2$$

$$\frac{y - 100}{10} = 0.16x - 2.0$$

$$y - 100 = 1.6x - 20$$

$$y = 80 + 1.6x$$

If you get this wrong, revise: Coding in Regression — Section 6, and Effect of Coding on Correlation — Section 6.1.


Problem 14 A dataset of 15 observations has regression line $y = 5 + 2x$ with $\bar{x} = 10$. A 16th observation $(25, 70)$ is added. Without recalculating, explain qualitatively how this point would affect: (a) The gradient of the regression line. (b) The PMCC.

Solution 14

The point $(25, 70)$ has $x = 25$, which is far from $\bar{x} = 10$, so it has high leverage. Its predicted $y$-value from the current line would be $\hat{y} = 5 + 2(25) = 55$, but the actual value is $70$. The residual is $70 - 55 = 15$, which is positive and large.

(a) Since the point lies above the regression line and has high leverage, it will increase the gradient (pull the line upward at the right side).

(b) Since the point lies close to the general positive trend (above the line in the same direction as the overall slope), it will likely increase the PMCC slightly. However, if the point were below the trend, it could decrease $r$ significantly; a single influential point can change $r$ by a large amount.

If you get this wrong, revise: Outliers and Influential Points — Section 5.5.



tip

Ready to test your understanding of Correlation and Regression? The diagnostic test contains the hardest questions within the A-Level specification for this topic, each with a full worked solution.

Unit tests probe edge cases and common misconceptions. Integration tests combine Correlation and Regression with other topics to test synthesis under exam conditions.

See Diagnostic Guide for instructions on self-marking and building a personal test matrix.