Correlation and Regression
Board Coverage
| Board | Paper | Notes |
|---|---|---|
| AQA | Paper 1 | PMCC, regression lines |
| Edexcel | P1 | Includes Spearman's rank |
| OCR (A) | Paper 1 | Similar |
| CIE (9709) | P1, P6 | Correlation and regression in P1/P6 |
The formula booklet gives the formula for PMCC and the least squares regression line. You must be able to interpret these and understand their limitations.
1. Pearson's Product Moment Correlation Coefficient (PMCC)
1.1 Definition
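Definition. For bivariate data $(x_1, y_1), \ldots, (x_n, y_n)$, the PMCC is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}, \quad S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \quad S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}.$$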
1.2 Properties
- $r = 1$: perfect positive linear correlation
- $r = -1$: perfect negative linear correlation
- $r = 0$: no linear correlation (but there may be a non-linear relationship)
- $r$ measures the strength of a linear relationship only
Correlation does not imply causation. Two variables may be strongly correlated because they are both influenced by a third (confounding) variable, or by coincidence.
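The PMCC can be computed directly from the summary statistics $S_{xx}$, $S_{yy}$ and $S_{xy}$. The following is a minimal Python sketch; the function name `pmcc` and the small datasets are illustrative assumptions, not part of any exam board's materials.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r computed from S_xx, S_yy and S_xy."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data illustrating the three boundary cases
r_up = pmcc([1, 2, 3, 4], [2, 4, 6, 8])             # exact line, positive gradient
r_down = pmcc([1, 2, 3, 4], [8, 6, 4, 2])           # exact line, negative gradient
r_curve = pmcc([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2, symmetric about 0
print(r_up, r_down, r_curve)
```

The third dataset follows $y = x^2$ exactly, yet its PMCC is zero: a value of $r = 0$ rules out a linear relationship only.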
1.3 Real-World Applications
Economics: GDP per capita and life expectancy across countries typically show a strong positive PMCC. The relationship is strong but non-linear at high income levels (diminishing returns). The PMCC captures the overall linear trend but underestimates the strength of the relationship at lower incomes.
Medical studies: Dose-response relationships often yield strong positive PMCC values. A clinical trial might find a high PMCC between drug dosage and reduction in blood pressure, suggesting a strong linear dose-response. However, biological systems typically have thresholds and saturation points where linearity breaks down.
Psychology: Study hours and exam scores often show moderate positive correlation. The PMCC captures the linear trend, but individual variation means prediction is imprecise: a student studying 10 hours could score anywhere in a wide range. This illustrates that even a moderate $r$ does not guarantee accurate individual predictions.
2. Spearman's Rank Correlation Coefficient
2.1 Definition
When data are ranked, Spearman's coefficient is
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference in ranks for the $i$-th pair.
2.2 When to use
- Data is ordinal (ranked categories)
- The relationship is monotonic but not necessarily linear
- There are outliers that would distort the PMCC
2.3 Handling tied ranks
When values are tied, assign the average of the ranks they would have occupied. The simplified formula above does not account for ties — a correction factor is needed for tied data.
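A minimal Python sketch of average ranking follows. Rather than applying a correction factor, it handles ties with a different but commonly used technique: computing the PMCC of the ranks themselves, which agrees with the simplified formula when there are no ties. All function names are illustrative assumptions.

```python
from math import sqrt

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_simple(xs, ys):
    """Simplified d^2 formula; exact only when there are no tied ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def spearman_ranks_pmcc(xs, ys):
    """PMCC of the ranks; usable with or without ties."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    s_xx = sum((a - mx) ** 2 for a in rx)
    s_yy = sum((b - my) ** 2 for b in ry)
    return s_xy / sqrt(s_xx * s_yy)

tied = average_ranks([10, 20, 20, 30])   # the two 20s share ranks 2 and 3
xs = [10, 20, 30, 40, 50]                # data from Problem 3 (no ties)
ys = [15, 25, 18, 35, 42]
print(tied, spearman_simple(xs, ys), spearman_ranks_pmcc(xs, ys))
```

On the untied Problem 3 data the two functions agree; on tied data only the rank-PMCC version should be trusted.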
2.4 PMCC vs. Spearman's Rank: When to Use Which
| Criterion | PMCC | Spearman's |
|---|---|---|
| Data type | Continuous (interval/ratio) | Ordinal or continuous |
| Relationship type | Linear only | Any monotonic |
| Sensitivity to outliers | High | Low (ranks reduce impact) |
| Distribution assumption | Bivariate normal (for hypothesis tests on $r$) | None |
| Power (when assumptions met) | Higher | Lower |
Key point: If the data has a strong linear relationship and no extreme outliers, PMCC is preferred as it uses more information from the data. If the relationship is clearly monotonic but curved, or if outliers are present, Spearman's is more appropriate.
Example. Consider judge rankings in a competition. The data is inherently ordinal, so Spearman's rank is the natural choice regardless of whether PMCC could technically be computed. Similarly, in a psychology study measuring agreement between two raters on a Likert scale, Spearman's is the standard choice.
3. Least Squares Regression
3.1 Derivation
Problem. Find the line $y = a + bx$ that minimises
$$S = \sum_{i=1}^{n} (y_i - a - bx_i)^2.$$
3.2 Derivation using partial derivatives
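Setting $\frac{\partial S}{\partial a} = 0$ and $\frac{\partial S}{\partial b} = 0$:
$$\frac{\partial S}{\partial a} = -2\sum(y_i - a - bx_i) = 0 \implies \sum y_i = na + b\sum x_i \tag{1}$$
$$\frac{\partial S}{\partial b} = -2\sum x_i(y_i - a - bx_i) = 0 \implies \sum x_i y_i = a\sum x_i + b\sum x_i^2 \tag{2}$$
From (1): $a = \bar{y} - b\bar{x}$.
Substituting into (2):
$$\sum x_i y_i = (\bar{y} - b\bar{x})\sum x_i + b\sum x_i^2 \implies b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}}$$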
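As a numerical check on the derivation, the normal equations (1) and (2) can be solved for a small dataset. This is a sketch under the assumption that the line is written $y = a + bx$; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    if s_xx == 0:
        raise ValueError("all x-values are equal, so the gradient is undefined")
    b = s_xy / s_xx               # gradient from solving (1) and (2) together
    a = sum_y / n - b * sum_x / n # intercept from equation (1)
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)

# Both normal equations should hold up to rounding error
eq1_gap = sum(ys) - (len(xs) * a + b * sum(xs))
eq2_gap = sum(x * y for x, y in zip(xs, ys)) - (a * sum(xs) + b * sum(x * x for x in xs))
print(a, b, eq1_gap, eq2_gap)
```

For this dataset $S_{xy} = 6$ and $S_{xx} = 10$, so the gradient is $0.6$ and the intercept is $4 - 0.6 \times 3 = 2.2$.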
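4. The Regression Line Passes Through $(\bar{x}, \bar{y})$
Theorem. The least squares regression line $y = a + bx$ passes through the point $(\bar{x}, \bar{y})$.
Proof. Substituting $x = \bar{x}$ and using $a = \bar{y} - b\bar{x}$:
$$y = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}.$$
So $(\bar{x}, \bar{y})$ lies on the regression line.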
Intuition. The regression line passes through the "centre of mass" of the data. This makes sense — the best-fit line should balance the data around it, just as the mean balances a univariate dataset.
5. Interpreting Regression
5.1 Residuals
The residual for the $i$-th data point is $e_i = y_i - \hat{y}_i = y_i - (a + bx_i)$.
Properties:
- $\sum e_i = 0$ (the residuals sum to zero)
- The least squares line minimises $\sum e_i^2$
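Both residual properties can be checked numerically. The sketch below uses made-up data and hypothetical helper names; it fits the line, sums the residuals, and compares the sum of squared residuals with that of a few slightly perturbed lines.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

def sum_sq_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)                    # should be zero up to rounding
best = sum_sq_residuals(xs, ys, a, b)

# Any nearby line should have a strictly larger sum of squared residuals
nearby = [
    sum_sq_residuals(xs, ys, a + da, b + db)
    for da in (-0.1, 0.1)
    for db in (-0.05, 0.05)
]
print(total, best, min(nearby))
```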
5.2 Extrapolation
The regression line should only be used for interpolation (predicting within the range of the data). Extrapolation (predicting outside the data range) is unreliable because the linear relationship may not hold.
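5.3 Regression of $y$ on $x$ vs. $x$ on $y$
The regression line of $y$ on $x$ minimises vertical residuals ($\sum (y_i - \hat{y}_i)^2$). The regression line of $x$ on $y$ minimises horizontal residuals ($\sum (x_i - \hat{x}_i)^2$).
These are different lines unless $r = \pm 1$. The two regression lines intersect at $(\bar{x}, \bar{y})$.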
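The two lines can be compared numerically. The sketch below relies on a standard result not stated above: if $y = a + bx$ is the line of $y$ on $x$ and $x = c + dy$ is the line of $x$ on $y$, then $bd = r^2$, so the lines coincide only when $r^2 = 1$. The data and names are illustrative assumptions.

```python
from math import sqrt

def summary(xs, ys):
    """Return (x_bar, y_bar, S_xx, S_yy, S_xy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, s_xx, s_yy, s_xy

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar, s_xx, s_yy, s_xy = summary(xs, ys)

# y on x: y = a + b*x (minimises vertical residuals)
b = s_xy / s_xx
a = y_bar - b * x_bar

# x on y: x = c + d*y (minimises horizontal residuals)
d = s_xy / s_yy
c = x_bar - d * y_bar

r = s_xy / sqrt(s_xx * s_yy)
print(f"y on x: y = {a:.3f} + {b:.3f}x")
print(f"x on y: x = {c:.3f} + {d:.3f}y")
print(f"b*d = {b * d:.3f}, r^2 = {r * r:.3f}")
```

Here $b \times d = 0.6 < 1$, so the two lines differ, but both pass through $(\bar{x}, \bar{y}) = (3, 4)$.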
5.4 Residual Plots
A residual plot graphs the residuals against the fitted values (or against $x$).
What to look for:
- Random scatter around zero: the linear model is appropriate.
- Curved pattern (e.g., U-shape): the relationship is non-linear; a linear model is unsuitable.
- Funnel shape (increasing spread): the variance is not constant (heteroscedasticity); predictions are less reliable at extremes.
Residual plots are a diagnostic tool — they reveal whether the assumptions of linear regression are met. In A Level exams, you may be asked to comment on a residual plot to assess whether the regression line is a good model.
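The curved-pattern case can be seen without drawing a plot. The sketch below fits a straight line to made-up data that follows $y = x^2$ exactly and prints the sign of each residual in order of $x$; the helper name is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]          # a purely quadratic relationship

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Positive at both ends and negative in the middle: a U-shaped residual plot
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(residuals, signs)
```

The residuals still sum to zero, as they must for any least squares line, but their systematic sign pattern shows that the linear model is unsuitable.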
5.5 Outliers and Influential Points
An outlier is a point with a large residual: it falls far from the regression line. An influential point is an outlier with high leverage, meaning its $x$-value is far from $\bar{x}$. Influential points can pull the regression line significantly toward themselves.
Effect on PMCC: A single influential point can dramatically change $r$. For example, adding one extreme point to a dataset could raise $r$ substantially or change its sign entirely. This is why it is essential to inspect scatter plots alongside numerical summaries.
Example. In a study of height vs. salary across 50 people, most data shows weak positive correlation. If one NBA player earning millions is included, the PMCC may jump sharply, giving a misleading impression. In such cases, Spearman's rank is more robust because ranking reduces the disproportionate influence of extreme values.
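The effect of one extreme point can be illustrated with a small numerical experiment. The data below is invented purely for this sketch (it is not the height and salary study), and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r from centred sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def ranks(values):
    """Ranks from 1 (smallest). Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's coefficient as the PMCC of the ranks."""
    return pmcc(ranks(xs), ranks(ys))

# Six points with almost no linear association
base_x = [1, 2, 3, 4, 5, 6]
base_y = [4, 2, 6, 1, 5, 3]

# The same data plus one extreme, high-leverage point
out_x = base_x + [20]
out_y = base_y + [30]

r_before, r_after = pmcc(base_x, base_y), pmcc(out_x, out_y)
rs_before, rs_after = spearman(base_x, base_y), spearman(out_x, out_y)
print(f"PMCC:     {r_before:.3f} -> {r_after:.3f}")
print(f"Spearman: {rs_before:.3f} -> {rs_after:.3f}")
```

In this example the single added point moves the PMCC from roughly zero to above 0.9, while Spearman's coefficient rises far less because the extreme point contributes only one extra rank.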
5.6 Dangers of Extrapolation
Economic example: A regression model based on UK inflation data from 2010--2020 (rates between 0% and 3%) might predict negative inflation for certain conditions. Extrapolating to predict 2022 inflation (which reached 11.1%) would produce wildly inaccurate results because the underlying economic conditions changed entirely.
Medical example: A linear dose-response model calibrated for doses of 0--50 mg might predict a negative response at a dose of 0 mg, which is physically impossible. The model is only valid within its calibration range. Biological systems typically exhibit thresholds and saturation effects that linear models cannot capture.
General principle: Always state the range of the original data and note that predictions outside this range are unreliable. In exam questions, you will typically lose marks if you extrapolate without commenting on the limitation.
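One way to make the general principle routine is to carry the data range alongside the fitted line and flag any prediction outside it. The sketch below is a hypothetical helper, with a made-up line and range chosen for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line y = a + b*x.

    x_min and x_max are the smallest and largest x-values in the original data.
    """
    is_extrapolation = not (x_min <= x <= x_max)
    return a + b * x, is_extrapolation

# Hypothetical fitted line y = 2.2 + 0.6x from data with 1 <= x <= 5
inside = predict(2.2, 0.6, 3, x_min=1, x_max=5)
outside = predict(2.2, 0.6, 12, x_min=1, x_max=5)

for value, flagged in (inside, outside):
    note = "extrapolation: treat with caution" if flagged else "interpolation"
    print(f"{value:.2f} ({note})")
```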
6. Coding in Regression
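If we code $u = \frac{x - x_0}{h}$ and $v = \frac{y - y_0}{k}$, and find the regression line $v = \alpha + \beta u$, then:
- The gradient in terms of original variables: $\frac{k\beta}{h}$
- The intercept: $y_0 + k\alpha - \frac{k\beta x_0}{h}$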
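Decoding a regression line is mechanical once the coding is written down. The sketch below assumes a coding of the general form $u = (x - x_0)/h$, $v = (y - y_0)/k$ and a coded line $v = \alpha + \beta u$; these symbols, the function name, and the numbers are assumptions chosen for illustration. Substituting gives $y = y_0 + k\alpha + \frac{k\beta}{h}(x - x_0)$.

```python
def decode_line(alpha, beta, x0, h, y0, k):
    """Convert the coded line v = alpha + beta*u into y = intercept + gradient*x.

    Assumes the coding u = (x - x0) / h and v = (y - y0) / k.
    """
    gradient = k * beta / h
    intercept = y0 + k * alpha - gradient * x0
    return intercept, gradient

# Hypothetical coding: u = (x - 100) / 10, v = (y - 50) / 5, coded line v = 2 + 3u
intercept, gradient = decode_line(alpha=2, beta=3, x0=100, h=10, y0=50, k=5)

# Cross-check one point by going through the coded variables
x = 120
u = (x - 100) / 10
v = 2 + 3 * u
y_via_coding = 50 + 5 * v
y_direct = intercept + gradient * x
print(intercept, gradient, y_via_coding, y_direct)
```

Checking one point through both routes, as in the last lines, is a quick way to catch sign errors when decoding by hand.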
6.1 Effect of Coding on Correlation
Coding does not change the PMCC or Spearman's rank correlation coefficient.
Why? PMCC is based on standardised quantities. Coding is a linear transformation (a shift followed by a positive rescaling), and $r$ is invariant under such transformations of either variable. Similarly, Spearman's $r_s$ uses ranks, which are unaffected by any increasing monotonic transformation, including linear coding with a positive scale factor.
Effect on regression: Coding changes the gradient and intercept of the regression line, as shown in Section 6, but the underlying relationship between the variables is unchanged. The coefficient of determination $r^2$ is also invariant under coding.
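The invariance can be confirmed numerically. The sketch below uses made-up data and an arbitrary coding; it also shows the one caveat, namely that a negative scale factor reverses the sign of $r$.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's r from centred sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up original data
xs = [1000, 1200, 1500, 1700, 2100]
ys = [52, 61, 58, 70, 75]

# An arbitrary coding with positive scale factors
us = [(x - 1000) / 100 for x in xs]
vs = [(y - 50) / 5 for y in ys]

r_original = pmcc(xs, ys)
r_coded = pmcc(us, vs)

# A negative scale factor on one variable flips the sign of r
r_flipped = pmcc([-u for u in us], vs)
print(r_original, r_coded, r_flipped)
```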
6.2 Worked Example: Coding with Economic Data
An economist records quarterly revenue and advertising spend. To simplify calculations, she codes (where is advertising in GBP) and (where is revenue in GBP).
If the coded regression line is , then in original variables:
The gradient of 70 means each additional GBP spent on advertising is associated with an increase of GBP 70 in revenue. The PMCC calculated from the coded data would be identical to the PMCC from the original data.
Problem Set
Problem 1
Calculate the PMCC for the data: , , , , .
Problem 2
Find the equation of the regression line of $y$ on $x$ for the data in Problem 1.
Solution 2
..
.
Regression line: .
If you get this wrong, revise: Least Squares Regression — Section 3.
Problem 3
For the data below, calculate Spearman's rank correlation coefficient.
| $x$ | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| $y$ | 15 | 25 | 18 | 35 | 42 |
Solution 3
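Ranks of $x$: 1, 2, 3, 4, 5. Ranks of $y$: 1, 3, 2, 4, 5.
| Rank of $x$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Rank of $y$ | 1 | 3 | 2 | 4 | 5 |
| $d$ | 0 | -1 | 1 | 0 | 0 |
$\sum d^2 = 0 + 1 + 1 + 0 + 0 = 2$.
$r_s = 1 - \frac{6 \times 2}{5(5^2 - 1)} = 1 - \frac{12}{120} = 0.9$.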
If you get this wrong, revise: Spearman's Rank Correlation — Section 2.
Problem 4
Prove that $\sum e_i = 0$, where $e_i$ are the residuals of the least squares regression line.
Problem 5
Data is coded using and . The coded regression line is . Find the regression line of $y$ on $x$.
Problem 6
A student finds a strong positive correlation between ice cream sales and drowning deaths. The student concludes ice cream causes drowning. Explain the flaw.
Solution 6
Correlation does not imply causation. Both ice cream sales and drowning deaths are influenced by a confounding variable: hot weather. In summer, more people buy ice cream and more people swim, leading to more of both. The correlation is real but the causal claim is not supported.
If you get this wrong, revise: Properties, Section 1.2.
Problem 7
Given , , and , find $r$, $b$ (gradient of $y$ on $x$), and the proportion of variance in $y$ explained by $x$.
Solution 7
..
Proportion of variance explained: $r^2 = 0.625$ (62.5%).
If you get this wrong, revise: Least Squares Regression — Section 3.
Problem 8
The regression line of $y$ on $x$ is with . What is ?
Solution 8
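Since the regression line passes through $(\bar{x}, \bar{y})$, the required mean is found by substituting the known mean into the equation of the line.
If you get this wrong, revise: The Regression Line Passes Through $(\bar{x}, \bar{y})$, Section 4.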
Problem 9
A residual plot shows a clear U-shaped pattern. What does this suggest about the regression model, and what would be a more appropriate approach?
Solution 9
A U-shaped residual plot indicates the relationship between the variables is non-linear (likely quadratic). The linear regression model is inappropriate because it fails to capture the curvature. A more appropriate approach would be to fit a quadratic model $y = a + bx + cx^2$, or to apply a transformation (e.g., taking logarithms) to linearise the relationship.
If you get this wrong, revise: Residual Plots — Section 5.4.
Problem 10
Two datasets have the same PMCC. Dataset A has $n = 10$ observations; Dataset B has $n = 100$ observations. Explain why Dataset B provides stronger evidence of a real association.
Solution 10
With a larger sample size, the PMCC is estimated more precisely (smaller standard error). For $n = 10$, the PMCC must exceed approximately 0.632 to be significant at the 5% level (two-tailed). For $n = 100$, the threshold is approximately 0.197. While both datasets show the same correlation, Dataset B provides far stronger statistical evidence because random fluctuations are much less likely to produce a correlation of that size with 100 observations.
If you get this wrong, revise: Properties — Section 1.2.
Problem 11
Eight students were ranked by two teachers for a presentation. The rankings are:
| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Teacher 1 | 2 | 5 | 1 | 7 | 3 | 8 | 4 | 6 |
| Teacher 2 | 1 | 6 | 2 | 8 | 4 | 7 | 3 | 5 |
Calculate Spearman's rank correlation coefficient and interpret the result.
Solution 11
The data is already ranked, so:
| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| $d$ | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 |
| $d^2$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
$\sum d^2 = 8$.
$r_s = 1 - \frac{6 \times 8}{8(8^2 - 1)} = 1 - \frac{48}{504} = 0.905$ (3 s.f.).
This indicates very strong positive agreement between the two teachers' rankings, suggesting consistent assessment standards.
If you get this wrong, revise: Spearman's Rank Correlation — Section 2.
Problem 12
A medical researcher collects data on blood pressure ($x$ mmHg) and cholesterol level ($y$ mg/dL) for 12 patients. She finds , , , , .
(a) Calculate the PMCC and interpret it.
(b) Find the regression line of $y$ on $x$.
(c) Predict the cholesterol level for a patient with blood pressure of 150 mmHg. Comment on the reliability.
Solution 12
(a) .
This indicates a strong positive linear correlation between blood pressure and cholesterol level.
(b) .
.
Regression line: .
(c) When : mg/dL.
This prediction is reasonably reliable since 150 is within (or close to) the range of the data. However, $n = 12$ is a small sample, so there is considerable uncertainty. The prediction should not be treated as precise.
If you get this wrong, revise: Least Squares Regression — Section 3, and Extrapolation — Section 5.2.
Problem 13
Data is coded using and . The coded PMCC is and the coded regression line of on is .
Find:
(a) The PMCC for the original data.
(b) The regression line of $y$ on $x$ in original variables.
Solution 13
(a) Coding does not change the PMCC, so $r$ for the original data equals the coded PMCC.
(b) Start from the coded line:
If you get this wrong, revise: Coding in Regression — Section 6, and Effect of Coding on Correlation — Section 6.1.
Problem 14
A dataset of 15 observations has regression line with . A 16th observation is added. Without recalculating, explain qualitatively how this point would affect:
(a) The gradient of the regression line.
(b) The PMCC.
Solution 14
The point has an $x$-value that is far from $\bar{x}$, so it has high leverage. Its actual $y$-value is well above the value predicted by the current line, so its residual is positive and large.
(a) Since the point lies above the regression line and has high leverage, it will increase the gradient (pull the line upward at the right side).
(b) Since the point lies close to the general positive trend (above the line in the same direction as the overall slope), it will likely increase the PMCC slightly. However, if the point were below the trend, it could decrease $r$ significantly; a single influential point can change $r$ by a large amount.
If you get this wrong, revise: Outliers and Influential Points — Section 5.5.
Ready to test your understanding of Correlation and Regression? The diagnostic test contains the hardest questions within the A-Level specification for this topic, each with a full worked solution.
Unit tests probe edge cases and common misconceptions. Integration tests combine Correlation and Regression with other topics to test synthesis under exam conditions.
See Diagnostic Guide for instructions on self-marking and building a personal test matrix.