
Correlation and Regression (Extended Treatment)

This document covers scatter diagrams, the product moment correlation coefficient, Spearman's rank correlation, least squares regression, and residual analysis.

info

Correlation measures the strength of a linear association. It does not imply causation, and it does not capture non-linear relationships. Always plot your data before interpreting correlation values.


1. Scatter Diagrams

1.1 Interpretation

A scatter diagram (scatter plot) displays pairs of values $(x_i, y_i)$ as points on a coordinate grid. Visual inspection reveals:

  • The direction of association (positive, negative, or none).
  • The strength of association (strong, moderate, weak).
  • The shape of the relationship (linear, curved, clustered).
  • The presence of outliers.

1.2 Types of correlation

| Pattern | Description |
| --- | --- |
| Strong + | Points lie close to an upward-sloping line |
| Moderate + | General upward trend with more scatter |
| Weak + | Slight upward tendency, much scatter |
| No correlation | No discernible pattern |
| Strong - | Points lie close to a downward-sloping line |
| Non-linear | Clear pattern but not a straight line |

1.3 Outliers

An outlier is a data point that lies far from the general pattern. Outliers can:

  • Be genuine extreme values.
  • Result from measurement errors.
  • Significantly affect the correlation coefficient and regression line.
warning

Common pitfall: A single outlier can dramatically change the value of the correlation coefficient. Always examine your scatter diagram before relying on numerical measures.


2. Product Moment Correlation Coefficient (PMCC)

2.1 Definition

The product moment correlation coefficient (also called Pearson's correlation coefficient) for a sample of $n$ pairs $(x_i, y_i)$ is:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

where:

$$ S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} $$

$$ S_{xx} = \sum(x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 $$

$$ S_{yy} = \sum(y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2 $$

2.2 Properties

  • $-1 \leq r \leq 1$.
  • $r = 1$: perfect positive linear correlation.
  • $r = -1$: perfect negative linear correlation.
  • $r = 0$: no linear correlation (but there may be a non-linear relationship).
  • $r$ is independent of the units of measurement.
  • $r$ is unchanged if both variables are transformed linearly ($x' = ax + b$, $y' = cy + d$ with $a, c > 0$).

2.3 Proof that $|r| \leq 1$

Proof. By the Cauchy-Schwarz inequality:

$$ \left(\sum a_i b_i\right)^2 \leq \left(\sum a_i^2\right)\left(\sum b_i^2\right) $$

Setting $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$:

$$ S_{xy}^2 \leq S_{xx}\,S_{yy} $$

$$ r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} \leq 1 \implies |r| \leq 1 \quad \blacksquare $$

2.4 Worked example

Problem. Find the PMCC for the following data:

| $x$ | 2 | 4 | 6 | 8 | 10 |
| --- | --- | --- | --- | --- | --- |
| $y$ | 3 | 5 | 4 | 7 | 9 |

$n = 5$, $\bar{x} = 6$, $\bar{y} = 5.6$.

$$ \sum x_i y_i = 6 + 20 + 24 + 56 + 90 = 196 $$

$$ S_{xy} = 196 - 5(6)(5.6) = 196 - 168 = 28 $$

$$ \sum x_i^2 = 4 + 16 + 36 + 64 + 100 = 220, \qquad S_{xx} = 220 - 5(36) = 40 $$

$$ \sum y_i^2 = 9 + 25 + 16 + 49 + 81 = 180, \qquad S_{yy} = 180 - 5(31.36) = 180 - 156.8 = 23.2 $$

$$ r = \frac{28}{\sqrt{40 \times 23.2}} = \frac{28}{\sqrt{928}} = \frac{28}{30.46} \approx 0.919 $$

This indicates strong positive linear correlation.
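
The worked example can be checked numerically. The following sketch (plain Python, standard library only) recomputes $r$ from the raw data using the summary-statistic formulas above:

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient: r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)

# Data from the worked example in Section 2.4
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
print(round(pmcc(xs, ys), 3))  # 0.919
```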

2.5 Coding data

When data values are large, coding simplifies calculations. Use $u = \dfrac{x - a}{c}$ and $v = \dfrac{y - b}{d}$, where $a, b$ are shift values and $c, d$ are (positive) scaling values.

The PMCC is unchanged by coding: $r_{xy} = r_{uv}$.
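
A quick numerical check of this invariance, as a sketch with made-up data (the shift and scale values below are arbitrary):

```python
import math

def pmcc(xs, ys):
    """PMCC via the summary statistics S_xy, S_xx, S_yy."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)

# Illustrative data with large values (hypothetical)
xs = [1020, 1040, 1060, 1080, 1100]
ys = [3.02, 3.08, 3.05, 3.14, 3.20]

# Coded values: u = (x - 1000) / 20, v = (y - 3) / 0.01
us = [(x - 1000) / 20 for x in xs]
vs = [(y - 3) / 0.01 for y in ys]

print(abs(pmcc(xs, ys) - pmcc(us, vs)) < 1e-9)  # True
```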


3. Spearman's Rank Correlation Coefficient

3.1 Definition

Spearman's rank correlation coefficient $r_s$ measures the strength of the monotonic relationship between two variables:

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$

where $d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)$ is the difference in ranks for the $i$-th pair.

3.2 When to use Spearman's rank

  • Data is ordinal (ranked categories).
  • The relationship is monotonic but not necessarily linear.
  • There are significant outliers that would distort the PMCC.
  • The data contains tied ranks.

3.3 Handling tied ranks

When values are tied, assign the average rank to all tied values. For example, if two values are tied for ranks 3 and 4, both receive rank 3.5.
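
Average-rank assignment can be sketched in a few lines of plain Python (`average_ranks` is an illustrative helper name, not a standard library function):

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        positions = [i + 1 for i, s in enumerate(ordered) if s == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]

# The two values tied for ranks 2 and 3 both receive rank 2.5
print(average_ranks([10, 20, 20, 30, 40]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```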

When ties exist, the simplified formula is only approximate. A more accurate formula uses:

$$ r_s = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

applied to the rank data.

3.4 Worked example

Problem. Two judges rank 6 competitors:

| Competitor | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| Judge 1 | 1 | 3 | 2 | 5 | 4 | 6 |
| Judge 2 | 2 | 1 | 3 | 6 | 5 | 4 |
| $d_i$ | -1 | 2 | -1 | -1 | -1 | 2 |

$$ \sum d_i^2 = 1 + 4 + 1 + 1 + 1 + 4 = 12 $$

$$ r_s = 1 - \frac{6 \times 12}{6(35)} = 1 - \frac{72}{210} = 1 - 0.343 = 0.657 $$

This indicates moderate positive agreement between the judges.
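
As a check, the judges' calculation in a short script (the data are already ranks, so the $d^2$ formula applies directly):

```python
judge1 = [1, 3, 2, 5, 4, 6]
judge2 = [2, 1, 3, 6, 5, 4]
n = len(judge1)

# Sum of squared rank differences, then Spearman's formula
d_squared = sum((a - b) ** 2 for a, b in zip(judge1, judge2))
r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(d_squared, round(r_s, 3))  # 12 0.657
```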

3.5 Worked example with ties

Problem. Find $r_s$ for the following data:

| $x$ | 10 | 20 | 20 | 30 | 40 |
| --- | --- | --- | --- | --- | --- |
| $y$ | 5 | 8 | 12 | 15 | 20 |

Ranks of $x$: 1, 2.5, 2.5, 4, 5 (tied at 20).

Ranks of $y$: 1, 2, 3, 4, 5.

$d_i$: 0, 0.5, -0.5, 0, 0.

$$ \sum d_i^2 = 0 + 0.25 + 0.25 + 0 + 0 = 0.5 $$

$$ r_s = 1 - \frac{6 \times 0.5}{5 \times 24} = 1 - \frac{3}{120} = 1 - 0.025 = 0.975 $$

Very strong positive monotonic relationship.
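
The same result can be reproduced by assigning average ranks first and then applying the $d^2$ formula (a sketch; `average_ranks` is an illustrative helper):

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their positions."""
    ordered = sorted(values)
    return [sum(i + 1 for i, s in enumerate(ordered) if s == v)
            / ordered.count(v) for v in values]

xs = [10, 20, 20, 30, 40]
ys = [5, 8, 12, 15, 20]

rx, ry = average_ranks(xs), average_ranks(ys)
n = len(xs)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(round(r_s, 3))  # 0.975
```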


4. Least Squares Regression

4.1 The regression line of $y$ on $x$

The least squares regression line of $y$ on $x$ is the line $y = a + bx$ that minimises the sum of squared residuals:

$$ S = \sum_{i=1}^{n}(y_i - a - bx_i)^2 $$

Setting $\dfrac{\partial S}{\partial a} = 0$ and $\dfrac{\partial S}{\partial b} = 0$:

$$ b = \frac{S_{xy}}{S_{xx}} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} $$

$$ a = \bar{y} - b\bar{x} $$

Key property: The regression line always passes through the point $(\bar{x}, \bar{y})$.

4.2 Derivation of the normal equations

$$ \frac{\partial S}{\partial a} = -2\sum(y_i - a - bx_i) = 0 \implies na + b\sum x_i = \sum y_i $$

$$ \frac{\partial S}{\partial b} = -2\sum x_i(y_i - a - bx_i) = 0 \implies a\sum x_i + b\sum x_i^2 = \sum x_i y_i $$

These are the normal equations. Dividing the first by $n$ gives $\bar{y} = a + b\bar{x}$, confirming the line passes through the mean point.
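
The normal equations can be solved directly as a 2×2 linear system; the sketch below does this for the data from Section 2.4:

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Normal equations:
#   n*a     + sum_x*b  = sum_y
#   sum_x*a + sum_xx*b = sum_xy
# Solved by Cramer's rule:
det = n * sum_xx - sum_x ** 2
a = (sum_y * sum_xx - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det
print(a, b)  # 1.4 0.7
```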

4.3 Worked example

Using the data from Section 2.4:

$$ b = \frac{S_{xy}}{S_{xx}} = \frac{28}{40} = 0.7 $$

$$ a = \bar{y} - b\bar{x} = 5.6 - 0.7(6) = 5.6 - 4.2 = 1.4 $$

$$ y = 1.4 + 0.7x $$

To predict $y$ when $x = 7$: $y = 1.4 + 4.9 = 6.3$.
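
The same fit via the $S_{xy}/S_{xx}$ formulas, including the prediction at $x = 7$ (a plain Python sketch):

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
s_xx = sum(x * x for x in xs) - n * x_bar ** 2

b = s_xy / s_xx        # gradient
a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
print(round(a, 1), round(b, 1))  # 1.4 0.7
print(round(a + b * 7, 1))       # 6.3
```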

4.4 The regression line of $x$ on $y$

The regression line of $x$ on $y$ (used when predicting $x$ from $y$) is:

$$ x = \bar{x} + \frac{S_{xy}}{S_{yy}}(y - \bar{y}) $$

Important: The two regression lines are different unless $|r| = 1$. The line of $y$ on $x$ minimises vertical residuals; the line of $x$ on $y$ minimises horizontal residuals.
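
A numerical illustration with the Section 2.4 data: rewriting the $x$-on-$y$ line in the form $y = \ldots$ gives a different gradient from the $y$-on-$x$ line, and the product of the two gradients recovers $r^2$.

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
s_yy = sum(y * y for y in ys) - n * y_bar ** 2

b_yx = s_xy / s_xx  # gradient of the y-on-x line
b_xy = s_xy / s_yy  # coefficient in x = x_bar + b_xy * (y - y_bar)

print(round(b_yx, 3), round(1 / b_xy, 3))  # 0.7 0.829 -- the lines differ
print(round(b_yx * b_xy, 3))               # 0.845, which equals r^2
```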

4.5 Restrictions on using regression

  1. Interpolation (predicting within the data range) is generally reliable.
  2. Extrapolation (predicting outside the data range) is unreliable: the relationship may not hold.
  3. The regression line assumes a linear relationship.
  4. The model assumes the residuals are independent and normally distributed with constant variance (homoscedasticity).
warning

Do not use the regression line of $y$ on $x$ to predict $x$ from a given $y$, or vice versa. Use the appropriate regression line for the direction of prediction.


5. Residuals

5.1 Definition

A residual for the $i$-th data point is the difference between the observed value and the predicted value:

$$ e_i = y_i - \hat{y}_i = y_i - (a + bx_i) $$

5.2 Properties of residuals

  1. $\sum e_i = 0$ (the residuals sum to zero).
  2. $\sum x_i e_i = 0$ (residuals are uncorrelated with $x$).
  3. The mean of the residuals is zero (this follows from property 1).

5.3 Residual analysis

Plotting residuals against $x$ (or against $\hat{y}$) reveals:

  • Random scatter around zero: the linear model is appropriate.
  • Curved pattern: a non-linear model would be better.
  • Funnel shape: the variance is not constant (heteroscedasticity).
  • Large outliers: individual points with unusually large residuals.

5.4 Worked example: residual calculation

Using the data and regression line $y = 1.4 + 0.7x$ from Section 4.3:

| $x$ | $y$ | $\hat{y}$ | Residual $e$ |
| --- | --- | --- | --- |
| 2 | 3 | 2.8 | 0.2 |
| 4 | 5 | 4.2 | 0.8 |
| 6 | 4 | 5.6 | -1.6 |
| 8 | 7 | 7.0 | 0.0 |
| 10 | 9 | 8.4 | 0.6 |

Check: $\sum e = 0.2 + 0.8 - 1.6 + 0 + 0.6 = 0$.

The residual at $x = 6$ is relatively large ($-1.6$), suggesting this point deviates most from the linear model.
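
The residual table can be generated directly (a sketch; the small `+ 0.0` normalises any `-0.0` produced by floating-point rounding):

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
a, b = 1.4, 0.7  # regression line from Section 4.3

# e_i = y_i - (a + b * x_i)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 1) + 0.0 for e in residuals])  # [0.2, 0.8, -1.6, 0.0, 0.6]

# The residuals sum to zero (up to floating-point error)
print(abs(sum(residuals)) < 1e-9)  # True
```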


6. Practice Problems

Problem 1

Find the PMCC for the data: (1, 2), (2, 3), (3, 5), (4, 4), (5, 7), (6, 8).

Solution

$n = 6$, $\bar{x} = 3.5$, $\bar{y} = 4.833$.

$S_{xx} = 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 = 17.5$.

$S_{yy} = 8.028 + 3.361 + 0.028 + 0.694 + 4.694 + 10.028 = 26.83$.

$S_{xy} = (1 - 3.5)(2 - 4.833) + \ldots = (-2.5)(-2.833) + (-1.5)(-1.833) + (-0.5)(0.167) + (0.5)(-0.833) + (1.5)(2.167) + (2.5)(3.167)$

$= 7.083 + 2.750 - 0.083 - 0.417 + 3.250 + 7.917 = 20.5$

$$ r = \frac{20.5}{\sqrt{17.5 \times 26.83}} = \frac{20.5}{\sqrt{469.6}} = \frac{20.5}{21.67} \approx 0.946 $$

Problem 2

Find the equation of the regression line of yy on xx for the data in Problem 1, and predict yy when x=7x = 7.

Solution

$b = \dfrac{20.5}{17.5} = 1.171$, $a = 4.833 - 1.171(3.5) = 4.833 - 4.100 = 0.734$.

$y = 0.734 + 1.171x$.

When $x = 7$: $y = 0.734 + 8.200 = 8.934 \approx 8.9$.

Problem 3

Two teachers rank 8 students by exam performance. Calculate Spearman's rank correlation coefficient.

| Student | A | B | C | D | E | F | G | H |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Teacher 2 | 3 | 1 | 4 | 2 | 6 | 5 | 8 | 7 |
Solution

$d_i$: -2, 1, -1, 2, -1, 1, -1, 1.

$$ \sum d_i^2 = 4 + 1 + 1 + 4 + 1 + 1 + 1 + 1 = 14 $$

$$ r_s = 1 - \frac{6 \times 14}{8 \times 63} = 1 - \frac{84}{504} = 1 - 0.1667 = 0.833 $$
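
A short sketch verifying the answer:

```python
teacher1 = [1, 2, 3, 4, 5, 6, 7, 8]
teacher2 = [3, 1, 4, 2, 6, 5, 8, 7]
n = len(teacher1)

d_squared = sum((a - b) ** 2 for a, b in zip(teacher1, teacher2))
r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(d_squared, round(r_s, 3))  # 14 0.833
```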