Correlation and Regression (Extended Treatment)
This document covers scatter diagrams, the product moment correlation coefficient, Spearman's rank
correlation, least squares regression, and residual analysis.
Correlation measures the strength of a linear association. It does not imply causation, and it
does not capture non-linear relationships. Always plot your data before interpreting correlation
values.
1. Scatter Diagrams
1.1 Interpretation
A scatter diagram (scatter plot) displays pairs of values $(x_i, y_i)$ as points on a coordinate grid. Visual inspection reveals:
- The direction of association (positive, negative, or none).
- The strength of association (strong, moderate, weak).
- The shape of the relationship (linear, curved, clustered).
- The presence of outliers.
1.2 Types of correlation
| Pattern | Description |
|---|---|
| Strong + | Points lie close to an upward-sloping line |
| Moderate + | General upward trend with more scatter |
| Weak + | Slight upward tendency, much scatter |
| No correlation | No discernible pattern |
| Strong - | Points lie close to a downward-sloping line |
| Non-linear | Clear pattern but not a straight line |
1.3 Outliers
An outlier is a data point that lies far from the general pattern. Outliers can:
- Be genuine extreme values.
- Result from measurement errors.
- Significantly affect the correlation coefficient and regression line.
Common Pitfall
A single outlier can dramatically change the value of the correlation coefficient. Always examine
your scatter diagram before relying on numerical measures.
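To see this numerically, here is a minimal sketch (the data values are invented for illustration, not taken from the text) that computes $r$ for a near-perfect linear data set and again after corrupting a single point. It uses `numpy.corrcoef`, which returns the correlation matrix of its inputs.

```python
import numpy as np

# Near-perfect linear data (invented for illustration).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = 2 * x + np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1])

r_clean = np.corrcoef(x, y)[0, 1]

# Corrupt one point, as a measurement error might.
y_bad = y.copy()
y_bad[-1] = -10.0
r_bad = np.corrcoef(x, y_bad)[0, 1]

print(f"r without outlier: {r_clean:.3f}")  # close to 1
print(f"r with outlier:    {r_bad:.3f}")    # drops dramatically (near zero here)
```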
2. Product Moment Correlation Coefficient (PMCC)
2.1 Definition
The product moment correlation coefficient (also called Pearson's correlation coefficient)
for a sample of $n$ pairs $(x_i, y_i)$ is:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
where:
$$S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$

$$S_{xx} = \sum(x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$$

$$S_{yy} = \sum(y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$
2.2 Properties
- $-1 \leq r \leq 1$.
- $r = 1$: perfect positive linear correlation.
- $r = -1$: perfect negative linear correlation.
- $r = 0$: no linear correlation (but there may be a non-linear relationship).
- $r$ is independent of the units of measurement.
- $r$ is unchanged if both variables are transformed linearly ($x' = ax + b$, $y' = cy + d$ with $a, c > 0$).
2.3 Proof that $|r| \leq 1$
Proof. By the Cauchy-Schwarz inequality:
$$\left(\sum a_i b_i\right)^2 \leq \left(\sum a_i^2\right)\left(\sum b_i^2\right)$$

Setting $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$:

$$S_{xy}^2 \leq S_{xx}\,S_{yy}$$

$$r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} \leq 1 \implies |r| \leq 1 \quad \blacksquare$$
2.4 Worked example
Problem. Find the PMCC for the following data:
| $x$ | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| $y$ | 3 | 5 | 4 | 7 | 9 |

$n = 5$, $\bar{x} = 6$, $\bar{y} = 5.6$.

$\sum x_i y_i = 6 + 20 + 24 + 56 + 90 = 196$

$S_{xy} = 196 - 5(6)(5.6) = 196 - 168 = 28$

$\sum x_i^2 = 4 + 16 + 36 + 64 + 100 = 220$, so $S_{xx} = 220 - 5(36) = 40$

$\sum y_i^2 = 9 + 25 + 16 + 49 + 81 = 180$, so $S_{yy} = 180 - 5(31.36) = 180 - 156.8 = 23.2$

$$r = \frac{28}{\sqrt{40 \times 23.2}} = \frac{28}{\sqrt{928}} = \frac{28}{30.46} \approx 0.919$$
This indicates strong positive linear correlation.
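The same calculation can be scripted directly from the $S_{xy}$, $S_{xx}$, $S_{yy}$ formulas of Section 2.1. A minimal sketch using the five data points of this example:

```python
from math import sqrt

xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]

n = len(xs)
x_bar = sum(xs) / n                      # 6.0
y_bar = sum(ys) / n                      # 5.6

# Summary sums from Section 2.1.
S_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar  # 28.0
S_xx = sum(x * x for x in xs) - n * x_bar ** 2                 # 40.0
S_yy = sum(y * y for y in ys) - n * y_bar ** 2                 # 23.2

r = S_xy / sqrt(S_xx * S_yy)
print(f"r = {r:.3f}")  # 0.919
```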
2.5 Coding data
When data values are large, coding simplifies calculations. Use $u = \dfrac{x - a}{c}$ and $v = \dfrac{y - b}{d}$, where $a, b$ are shift values and $c, d$ are (positive) scaling values.

The PMCC is unchanged by coding: $r_{xy} = r_{uv}$.
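A quick numerical check of this invariance (the data values are invented for illustration):

```python
import numpy as np

x = np.array([1020, 1040, 1060, 1080, 1100], dtype=float)
y = np.array([5.03, 5.05, 5.04, 5.07, 5.09])

# Coded variables: u = (x - a)/c, v = (y - b)/d with c, d > 0.
u = (x - 1000) / 20    # a = 1000, c = 20
v = (y - 5.0) / 0.01   # b = 5.0,  d = 0.01

r_xy = np.corrcoef(x, y)[0, 1]
r_uv = np.corrcoef(u, v)[0, 1]
print(np.isclose(r_xy, r_uv))  # True: coding leaves r unchanged
```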
3. Spearman's Rank Correlation Coefficient
3.1 Definition
Spearman's rank correlation coefficient $r_s$ measures the strength of the monotonic relationship between two variables:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)$ is the difference in ranks for the $i$-th pair.
3.2 When to use Spearman's rank
- Data is ordinal (ranked categories).
- The relationship is monotonic but not necessarily linear.
- There are significant outliers that would distort the PMCC.
- The data contains tied ranks.
3.3 Handling tied ranks
When values are tied, assign the average rank to all tied values. For example, if two values
are tied for ranks 3 and 4, both receive rank 3.5.
When ties exist, the simplified formula is only approximate. A more accurate value is obtained by applying the PMCC formula

$$r_s = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

to the rank data.
3.4 Worked example
Problem. Two judges rank 6 competitors:
| Competitor | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Judge 1 | 1 | 3 | 2 | 5 | 4 | 6 |
| Judge 2 | 2 | 1 | 3 | 6 | 5 | 4 |

$\sum d_i^2 = 1 + 4 + 1 + 1 + 1 + 4 = 12$

$$r_s = 1 - \frac{6 \times 12}{6(35)} = 1 - \frac{72}{210} = 1 - 0.343 = 0.657$$
This indicates moderate positive agreement between the judges.
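A short sketch reproducing this calculation from the simplified formula (valid here because neither judge's ranking contains ties):

```python
judge1 = [1, 3, 2, 5, 4, 6]
judge2 = [2, 1, 3, 6, 5, 4]

n = len(judge1)
# Sum of squared rank differences.
d_squared = sum((a - b) ** 2 for a, b in zip(judge1, judge2))  # 12

r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(f"r_s = {r_s:.3f}")  # 0.657
```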
3.5 Worked example with ties
Problem. Find $r_s$ for the following data:

Ranks of $x$: 1, 2.5, 2.5, 4, 5 (two values tied at 20).

Ranks of $y$: 1, 2, 3, 4, 5.

$d_i$: 0, 0.5, $-0.5$, 0, 0.

$\sum d_i^2 = 0 + 0.25 + 0.25 + 0 + 0 = 0.5$

$$r_s = 1 - \frac{6 \times 0.5}{5 \times 24} = 1 - \frac{3}{120} = 1 - 0.025 = 0.975$$
Very strong positive monotonic relationship.
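Because of the ties, the shortcut value 0.975 is only approximate. A sketch of the exact calculation (assuming `scipy` is available; `scipy.stats.spearmanr` applies the PMCC to the ranks, as described in Section 3.3):

```python
from scipy.stats import spearmanr

# Ranking a list of ranks returns the same ranks, so we can pass them directly.
x_ranks = [1, 2.5, 2.5, 4, 5]   # tied values share the average rank
y_ranks = [1, 2, 3, 4, 5]

r_exact, _ = spearmanr(x_ranks, y_ranks)
print(f"exact r_s = {r_exact:.4f}")  # 0.9747, vs 0.975 from the shortcut
```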
4. Least Squares Regression
4.1 The regression line of $y$ on $x$

The least squares regression line of $y$ on $x$ is the line $y = a + bx$ that minimises the sum of squared residuals:

$$S = \sum_{i=1}^{n}(y_i - a - bx_i)^2$$
Setting $\dfrac{\partial S}{\partial a} = 0$ and $\dfrac{\partial S}{\partial b} = 0$ gives:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

$$a = \bar{y} - b\bar{x}$$

Key property: The regression line always passes through the point $(\bar{x}, \bar{y})$.
4.2 Derivation of the normal equations
$$\frac{\partial S}{\partial a} = -2\sum(y_i - a - bx_i) = 0 \implies na + b\sum x_i = \sum y_i$$

$$\frac{\partial S}{\partial b} = -2\sum x_i(y_i - a - bx_i) = 0 \implies a\sum x_i + b\sum x_i^2 = \sum x_i y_i$$

These are the normal equations. Dividing the first by $n$ gives $\bar{y} = a + b\bar{x}$, confirming that the line passes through the mean point.
4.3 Worked example
Using the data from Section 2.4:
$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{28}{40} = 0.7$

$a = \bar{y} - b\bar{x} = 5.6 - 0.7(6) = 5.6 - 4.2 = 1.4$

$$y = 1.4 + 0.7x$$

To predict $y$ when $x = 7$: $y = 1.4 + 4.9 = 6.3$.
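A sketch computing the same line and prediction in code, reusing the sums from Section 2.4:

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

S_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar  # 28.0
S_xx = sum(x * x for x in xs) - n * x_bar ** 2                 # 40.0

b = S_xy / S_xx          # 0.7
a = y_bar - b * x_bar    # 1.4

print(f"y = {a:.1f} + {b:.1f}x")
print(f"prediction at x = 7: {a + b * 7:.1f}")  # 6.3
```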
4.4 The regression line of $x$ on $y$

The regression line of $x$ on $y$ (used when predicting $x$ from $y$) is:

$$x = \bar{x} + \frac{S_{xy}}{S_{yy}}(y - \bar{y})$$

Important: The two regression lines are different unless $|r| = 1$. The line of $y$ on $x$ minimises vertical residuals; the line of $x$ on $y$ minimises horizontal residuals.
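A numerical illustration with the Section 2.4 data: rewriting the $x$-on-$y$ line in the form $y = \ldots$ gives slope $S_{yy}/S_{xy}$, which differs from the $y$-on-$x$ slope $S_{xy}/S_{xx}$ unless $|r| = 1$.

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
S_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar  # 28.0
S_xx = sum(x * x for x in xs) - n * x_bar ** 2                 # 40.0
S_yy = sum(y * y for y in ys) - n * y_bar ** 2                 # 23.2

slope_y_on_x = S_xy / S_xx   # 0.700: the y-on-x line
slope_x_on_y = S_yy / S_xy   # ~0.829: the x-on-y line rewritten as y = ...
print(f"{slope_y_on_x:.3f} vs {slope_x_on_y:.3f}")  # different, since |r| < 1
```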
4.5 Restrictions on using regression
- Interpolation (predicting within the data range) is generally reliable.
- Extrapolation (predicting outside the data range) is unreliable: the relationship may not hold beyond the observed data.
- The regression line assumes a linear relationship.
- The model assumes the residuals are independent and normally distributed with constant variance (homoscedasticity).
Warning

Do not use the regression line of $y$ on $x$ to predict $x$ from a given $y$, or vice versa. Use the appropriate regression line for the direction of prediction.
5. Residuals
5.1 Definition
The residual for the $i$-th data point is the difference between the observed value and the predicted value:

$$e_i = y_i - \hat{y}_i = y_i - (a + bx_i)$$
5.2 Properties of residuals
- $\sum e_i = 0$: the residuals sum to zero, so their mean is also zero.
- $\sum x_i e_i = 0$: the residuals are uncorrelated with $x$.
5.3 Residual analysis
Plotting residuals against $x$ (or against $\hat{y}$) reveals:

- Random scatter around zero: the linear model is appropriate.
- Curved pattern: a non-linear model would be better.
- Funnel shape: the variance is not constant (heteroscedasticity).
- Large outliers: individual points with unusually large residuals.
5.4 Worked example: residual calculation
Using the data and regression line $y = 1.4 + 0.7x$ from Section 4.3:

| $x$ | $y$ | $\hat{y}$ | Residual $e$ |
|---|---|---|---|
| 2 | 3 | 2.8 | 0.2 |
| 4 | 5 | 4.2 | 0.8 |
| 6 | 4 | 5.6 | $-1.6$ |
| 8 | 7 | 7.0 | 0.0 |
| 10 | 9 | 8.4 | 0.6 |

Check: $\sum e = 0.2 + 0.8 - 1.6 + 0 + 0.6 = 0$.

The residual at $x = 6$ is relatively large ($-1.6$), suggesting this point deviates most from the linear model.
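A sketch computing the residual column and verifying both properties from Section 5.2:

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
a, b = 1.4, 0.7  # fitted line from Section 4.3

# Residuals e_i = y_i - (a + b x_i).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)                                   # [0.2, 0.8, -1.6, 0.0, 0.6]

print(sum(residuals))                              # ~0 (up to floating point)
print(sum(x * e for x, e in zip(xs, residuals)))   # ~0: uncorrelated with x
```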
6. Practice Problems
Problem 1
Find the PMCC for the data: (1, 2), (2, 3), (3, 5), (4, 4), (5, 7), (6, 8).
Solution. $n = 6$, $\bar{x} = 3.5$, $\bar{y} = 4.833$.

$S_{xx} = 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 = 17.5$.

$S_{yy} = 8.028 + 3.361 + 0.028 + 0.694 + 4.694 + 10.028 = 26.83$.

$S_{xy} = (1 - 3.5)(2 - 4.833) + \ldots = (-2.5)(-2.833) + (-1.5)(-1.833) + (-0.5)(0.167) + (0.5)(-0.833) + (1.5)(2.167) + (2.5)(3.167)$

$= 7.083 + 2.750 - 0.083 - 0.417 + 3.250 + 7.917 = 20.5$

$$r = \frac{20.5}{\sqrt{17.5 \times 26.83}} = \frac{20.5}{\sqrt{469.6}} = \frac{20.5}{21.67} \approx 0.946$$
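A quick check of this value (a sketch, assuming `numpy` is available):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 3, 5, 4, 7, 8], dtype=float)

# corrcoef returns the correlation matrix; the off-diagonal entry is r.
print(np.corrcoef(x, y)[0, 1])  # ~0.946
```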
Problem 2
Find the equation of the regression line of $y$ on $x$ for the data in Problem 1, and predict $y$ when $x = 7$.
Solution. $b = \dfrac{20.5}{17.5} \approx 1.171$, $a = 4.833 - 1.171(3.5) \approx 0.733$.

$$y = 0.733 + 1.171x$$

When $x = 7$: $y = 0.733 + 8.200 = 8.933 \approx 8.9$.
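A check with `numpy.polyfit` (a sketch; for degree 1 it returns the least squares slope, then the intercept):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 3, 5, 4, 7, 8], dtype=float)

b, a = np.polyfit(x, y, 1)            # slope b, intercept a
print(f"y = {a:.3f} + {b:.3f}x")      # y = 0.733 + 1.171x
print(f"at x = 7: {a + 7 * b:.1f}")   # 8.9
```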
Problem 3
Two teachers rank 8 students by exam performance. Calculate Spearman's rank correlation coefficient.
| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Teacher 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Teacher 2 | 3 | 1 | 4 | 2 | 6 | 5 | 8 | 7 |
Solution. $d_i$: $-2$, 1, $-1$, 2, $-1$, 1, $-1$, 1.

$\sum d_i^2 = 4 + 1 + 1 + 4 + 1 + 1 + 1 + 1 = 14$.

$$r_s = 1 - \frac{6 \times 14}{8 \times 63} = 1 - \frac{84}{504} = 1 - 0.1667 = 0.833$$
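A sketch checking this with `scipy.stats.spearmanr` (assumes `scipy` is available; since both sequences are already ranks with no ties, the result equals the simplified-formula value):

```python
from scipy.stats import spearmanr

teacher1 = [1, 2, 3, 4, 5, 6, 7, 8]
teacher2 = [3, 1, 4, 2, 6, 5, 8, 7]

r_s, _ = spearmanr(teacher1, teacher2)
print(f"r_s = {r_s:.3f}")  # 0.833
```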