Correlation and Regression (Extended Treatment)
This document covers scatter diagrams, the product moment correlation coefficient, Spearman’s rank correlation, least squares regression, and residual analysis.
:::info Correlation measures the strength of a linear association. It does not imply causation, and it does not capture non-linear relationships. Always plot your data before interpreting correlation values. :::
1. Scatter Diagrams
1.1 Interpretation
A scatter diagram (scatter plot) displays pairs of values $(x_i, y_i)$ as points on a coordinate grid. Visual inspection reveals:
- The direction of association (positive, negative, or none).
- The strength of association (strong, moderate, weak).
- The shape of the relationship (linear, curved, clustered).
- The presence of outliers.
1.2 Types of correlation
| Pattern | Description |
|---|---|
| Strong + | Points lie close to an upward-sloping line |
| Moderate + | General upward trend with more scatter |
| Weak + | Slight upward tendency, much scatter |
| No correlation | No discernible pattern |
| Strong - | Points lie close to a downward-sloping line |
| Non-linear | Clear pattern but not a straight line |
1.3 Outliers
An outlier is a data point that lies far from the general pattern. Outliers can:
- Be genuine extreme values.
- Result from measurement errors.
- Significantly affect the correlation coefficient and regression line.
:::caution Common Pitfall A single outlier can dramatically change the value of the correlation coefficient. Always examine your scatter diagram before relying on numerical measures. :::
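The sensitivity to a single outlier can be checked numerically. The sketch below uses made-up data (five points lying exactly on a line, plus one hypothetical outlier) and a small helper, `pmcc`, whose name is illustrative; it computes Pearson's coefficient from deviations about the means.

```python
import math

def pmcc(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Illustrative data only: five points on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pmcc(xs, ys)

# Add one hypothetical outlier at (6, 0).
r_outlier = pmcc(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))  # 1.0 0.143
```

One added point takes the coefficient from a perfect 1 to about 0.14, which is why the scatter diagram should be inspected first.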
2. Product Moment Correlation Coefficient (PMCC)
2.1 Definition
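The product moment correlation coefficient (also called Pearson’s correlation coefficient) for a sample of $n$ pairs $(x_i, y_i)$ is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
Where:
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$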
2.2 Properties
- $-1 \le r \le 1$.
- $r = 1$: perfect positive linear correlation.
- $r = -1$: perfect negative linear correlation.
- $r = 0$: no linear correlation (but there may be a non-linear relationship).
- $r$ is independent of the units of measurement.
- $r$ is unchanged if both variables are transformed linearly ($u = a + bx$, $v = c + dy$ with $bd > 0$).
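2.3 Proof that $-1 \le r \le 1$
Proof. By the Cauchy-Schwarz inequality, for any real numbers $a_i$ and $b_i$:
$$\left(\sum a_i b_i\right)^2 \le \left(\sum a_i^2\right)\left(\sum b_i^2\right)$$
Setting $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$, so that $\sum a_i b_i = S_{xy}$, $\sum a_i^2 = S_{xx}$ and $\sum b_i^2 = S_{yy}$:
$$S_{xy}^2 \le S_{xx} S_{yy} \implies r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} \le 1 \implies -1 \le r \le 1$$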
2.4 Worked example
Problem. Find the PMCC for the following data:
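| $x$ | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| $y$ | 3 | 5 | 4 | 7 | 9 |
$n = 5$, $\bar{x} = 6$, $\bar{y} = 5.6$.
$\sum x_i y_i = 6 + 20 + 24 + 56 + 90 = 196$, $S_{xy} = 196 - 5(6)(5.6) = 196 - 168 = 28$.
$\sum x_i^2 = 4 + 16 + 36 + 64 + 100 = 220$, $S_{xx} = 220 - 5(36) = 40$.
$\sum y_i^2 = 9 + 25 + 16 + 49 + 81 = 180$, $S_{yy} = 180 - 5(31.36) = 180 - 156.8 = 23.2$.
$$r = \frac{28}{\sqrt{40 \times 23.2}} = \frac{28}{\sqrt{928}} \approx 0.919$$
This indicates strong positive linear correlation.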
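The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using the computational forms $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$, $S_{xx} = \sum x_i^2 - n\bar{x}^2$ and $S_{yy} = \sum y_i^2 - n\bar{y}^2$; the function name `pmcc_from_sums` is illustrative.

```python
import math

def pmcc_from_sums(xs, ys):
    """Pearson's r via the sums-of-products forms of S_xy, S_xx, S_yy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)

# Data from the worked example.
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
r = pmcc_from_sums(xs, ys)
print(round(r, 3))  # 0.919
```

The sums-of-products forms are convenient by hand, but for large values the deviation forms $\sum (x_i - \bar{x})(y_i - \bar{y})$ are usually less prone to rounding error in floating-point arithmetic.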
2.5 Coding data
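When data values are large, coding simplifies calculations. Use $u = \dfrac{x - a}{b}$ and $v = \dfrac{y - c}{d}$, where $a, c$ are shift values and $b, d$ are scaling values.
The PMCC is unchanged by coding: $r_{uv} = r_{xy}$ (provided $b$ and $d$ have the same sign).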
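To illustrate the invariance, the sketch below codes a hypothetical data set with $u = (x - 1000)/10$ and $v = (y - 500)/1$ and compares the two coefficients. The raw values and the helper name `pmcc` are assumptions made for the illustration.

```python
import math

def pmcc(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical raw data with large values.
xs = [1020, 1040, 1060, 1080, 1100]
ys = [503, 505, 504, 507, 509]

# Coding: shift by a = 1000, c = 500; scale by b = 10, d = 1.
us = [(x - 1000) / 10 for x in xs]
vs = [(y - 500) / 1 for y in ys]

r_xy = pmcc(xs, ys)
r_uv = pmcc(us, vs)
print(math.isclose(r_xy, r_uv))  # True
```

The coded values are small integers (2, 4, 6, 8, 10 and 3, 5, 4, 7, 9), which is the practical benefit of coding when working by hand.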
3. Spearman’s Rank Correlation Coefficient
3.1 Definition
Spearman’s rank correlation coefficient measures the strength of the monotonic relationship between two variables:
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
Where $d_i$ is the difference in ranks for the $i$-th pair and $n$ is the number of pairs.
3.2 When to use Spearman’s rank
- Data is ordinal (ranked categories).
- The relationship is monotonic but not necessarily linear.
- There are significant outliers that would distort the PMCC.
- The data contains tied ranks.
3.3 Handling tied ranks
When values are tied, assign the average rank to all tied values. For example, if two values are tied for ranks 3 and 4, both receive rank 3.5.
When ties exist, the simplified formula is only approximate. A more accurate formula uses:
$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
applied to the rank data (that is, the PMCC of the ranks).
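Average ranking is easy to get wrong by hand, so a small sketch may help. The function `average_ranks` below is a hypothetical helper: it ranks from smallest (rank 1) to largest and gives each group of tied values the mean of the positions they occupy.

```python
def average_ranks(values):
    """Rank values from smallest (1) upward, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Illustrative data: the two 12s occupy positions 3 and 4, so both get 3.5.
ranks = average_ranks([5, 8, 12, 12, 20])
print(ranks)  # [1.0, 2.0, 3.5, 3.5, 5.0]
```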
3.4 Worked example
Problem. Two judges rank 6 competitors:
| Competitor | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Judge 1 | 1 | 3 | 2 | 5 | 4 | 6 |
| Judge 2 | 2 | 1 | 3 | 6 | 5 | 4 |
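| $d_i$ | -1 | 2 | -1 | -1 | -1 | 2 |
| $d_i^2$ | 1 | 4 | 1 | 1 | 1 | 4 |
$\sum d_i^2 = 12$, so
$$r_s = 1 - \frac{6(12)}{6(36 - 1)} = 1 - \frac{72}{210} \approx 0.657$$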
This indicates moderate positive agreement between the judges.
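As a check on this example, the simplified formula $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$ can be applied directly to the two judges' ranks. A minimal sketch follows; the function name `spearman_simplified` is illustrative, and the formula is exact only when there are no ties.

```python
def spearman_simplified(ranks_1, ranks_2):
    """Spearman's r_s from rank differences (exact when there are no ties)."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

judge_1 = [1, 3, 2, 5, 4, 6]
judge_2 = [2, 1, 3, 6, 5, 4]
r_s = spearman_simplified(judge_1, judge_2)
print(round(r_s, 3))  # 0.657
```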
3.5 Worked example with ties
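Problem. Find $r_s$ for the following data:
| $x$ | 10 | 20 | 20 | 30 | 40 |
|---|---|---|---|---|---|
| $y$ | 5 | 8 | 12 | 15 | 20 |
Ranks of $x$: 1, 2.5, 2.5, 4, 5 (tied at 20).
Ranks of $y$: 1, 2, 3, 4, 5.
$d_i$: 0, 0.5, -0.5, 0, 0.
$\sum d_i^2 = 0.5$, so $r_s = 1 - \dfrac{6(0.5)}{5(24)} = 1 - 0.025 = 0.975$.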
Very strong positive monotonic relationship.
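Because this example contains a tie, the simplified formula is only approximate. The sketch below compares it with the PMCC computed on the ranks; both function names are illustrative, and the rank lists are those given in the example.

```python
import math

def spearman_simplified(rx, ry):
    """r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)); approximate when ties exist."""
    n = len(rx)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n * n - 1))

def pmcc(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

rank_x = [1, 2.5, 2.5, 4, 5]
rank_y = [1, 2, 3, 4, 5]

approx = spearman_simplified(rank_x, rank_y)
exact = pmcc(rank_x, rank_y)
print(round(approx, 4), round(exact, 4))  # 0.975 0.9747
```

With a single tied pair the two values agree to three decimal places; the gap grows as ties become more numerous.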
4. Least Squares Regression
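4.1 The regression line of $y$ on $x$
The least squares regression line of $y$ on $x$ is the line $y = a + bx$ that minimises the sum of squared residuals:
$$SS = \sum (y_i - a - bx_i)^2$$
Setting $\dfrac{\partial SS}{\partial a} = 0$ and $\dfrac{\partial SS}{\partial b} = 0$:
$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
Key property: The regression line always passes through the point $(\bar{x}, \bar{y})$.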
4.2 Derivation of the normal equations
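Differentiating $SS = \sum (y_i - a - bx_i)^2$ with respect to $a$ and $b$ and setting each derivative to zero:
$$\frac{\partial SS}{\partial a} = -2\sum (y_i - a - bx_i) = 0 \implies \sum y_i = na + b\sum x_i$$
$$\frac{\partial SS}{\partial b} = -2\sum x_i (y_i - a - bx_i) = 0 \implies \sum x_i y_i = a\sum x_i + b\sum x_i^2$$
These are the normal equations. Dividing the first by $n$ gives $\bar{y} = a + b\bar{x}$, confirming the line passes through the mean point.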
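The normal equations $\sum y_i = na + b\sum x_i$ and $\sum x_i y_i = a\sum x_i + b\sum x_i^2$ form a pair of simultaneous linear equations in $a$ and $b$. As a rough sketch, they can be solved directly by Cramer's rule; the function name `solve_normal_equations` is an assumption for illustration, and the data are those of Section 2.4.

```python
def solve_normal_equations(xs, ys):
    """Solve the two normal equations for intercept a and slope b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of the 2x2 system
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

a, b = solve_normal_equations([2, 4, 6, 8, 10], [3, 5, 4, 7, 9])
print(a, b)  # 1.4 0.7
```

The result agrees with $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$, since $n S_{xx}$ equals the determinant used here.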
4.3 Worked example
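Using the data from Section 2.4: $\bar{x} = 6$, $\bar{y} = 5.6$, $S_{xy} = 196 - 5(6)(5.6) = 28$, $S_{xx} = 40$.
$b = \dfrac{28}{40} = 0.7$, $a = 5.6 - 0.7(6) = 1.4$, so $\hat{y} = 1.4 + 0.7x$.
To predict $y$ for a given value of $x$, substitute that value into this equation.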
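4.4 The regression line of $x$ on $y$
The regression line of $x$ on $y$ (used when predicting $x$ from $y$) is:
$$x = a' + b'y, \qquad b' = \frac{S_{xy}}{S_{yy}}, \qquad a' = \bar{x} - b'\bar{y}$$
Important: The two regression lines are different unless $r = \pm 1$. The line of $y$ on $x$ minimises vertical residuals; the line of $x$ on $y$ minimises horizontal residuals.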
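To see that the two lines really differ, the sketch below fits both to the data of Section 2.4, taking the slope of the $x$-on-$y$ line as $S_{xy}/S_{yy}$. The variable names (`b_yx`, `b_xy` and so on) are illustrative. A useful check is that the product of the two slopes equals $r^2$.

```python
import math

xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Line of y on x: y = a_yx + b_yx * x (minimises vertical residuals).
b_yx = s_xy / s_xx
a_yx = y_bar - b_yx * x_bar

# Line of x on y: x = a_xy + b_xy * y (minimises horizontal residuals).
b_xy = s_xy / s_yy
a_xy = x_bar - b_xy * y_bar

r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(b_yx, 3), round(a_yx, 3))       # 0.7 1.4
print(round(b_xy, 3), round(a_xy, 3))       # 1.207 -0.759
print(math.isclose(b_yx * b_xy, r_squared))  # True
```

Drawn on the same axes, the $x$-on-$y$ line has gradient $1/b' \approx 0.829$, steeper than the 0.7 of the $y$-on-$x$ line; the two coincide only when $r^2 = 1$.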
4.5 Restrictions on using regression
- Interpolation (predicting within the data range) is generally reliable.
- Extrapolation (predicting outside the data range) is unreliable — the relationship may not hold.
- The regression line assumes a linear relationship.
- The model assumes the residuals are independent and normally distributed with constant variance (homoscedasticity).
:::caution Warning Do not use the regression line of $y$ on $x$ to predict $x$ from a given $y$, or vice versa. Use the appropriate regression line for the direction of prediction.
5. Residuals
5.1 Definition
A residual for the $i$-th data point is the difference between the observed value and the predicted value:
$$e_i = y_i - \hat{y}_i = y_i - (a + bx_i)$$
5.2 Properties of residuals
- $\sum e_i = 0$ (the residuals sum to zero).
- $\sum x_i e_i = 0$ (residuals are uncorrelated with $x$).
- The mean of the residuals is zero.
5.3 Residual analysis
Plotting residuals against $x$ (or against $\hat{y}$) reveals:
- Random scatter around zero: the linear model is appropriate.
- Curved pattern: a non-linear model would be better.
- Funnel shape: the variance is not constant (heteroscedasticity).
- Large outliers: individual points with unusually large residuals.
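The curved-pattern case can be illustrated with a small sketch. It fits a least squares line to hypothetical data generated from $y = x^2$ (so the true relationship is deliberately non-linear) and lists the residuals in order of $x$.

```python
# Hypothetical data following y = x^2 exactly.
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

b = s_xy / s_xx          # slope of the fitted line
a = y_bar - b * x_bar    # intercept of the fitted line

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. This systematic U-shape, rather than random scatter about zero, is the signal that a straight line is the wrong model.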
5.4 Worked example: residual calculation
Using the data and regression line from Section 4.3:
| $x$ | $y$ | $\hat{y}$ | Residual |
|---|---|---|---|
| 2 | 3 | 2.8 | 0.2 |
| 4 | 5 | 4.2 | 0.8 |
| 6 | 4 | 5.6 | -1.6 |
| 8 | 7 | 7.0 | 0.0 |
| 10 | 9 | 8.4 | 0.6 |
Check: $\sum e_i = 0.2 + 0.8 - 1.6 + 0.0 + 0.6 = 0$.
The residual at $x = 6$ is relatively large ($-1.6$), suggesting this point deviates most from the linear model.
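The table and the check can be reproduced as follows. This is a minimal sketch that refits the line $\hat{y} = a + bx$ to the same five points, computes each residual $y_i - \hat{y}_i$, and verifies that both $\sum e_i$ and $\sum x_i e_i$ vanish up to rounding; the variable names are illustrative.

```python
xs = [2, 4, 6, 8, 10]
ys = [3, 5, 4, 7, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual = observed y minus fitted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(xs, residuals))
print(abs(sum_e) < 1e-9, abs(sum_xe) < 1e-9)  # True True

# x value of the point with the largest absolute residual.
worst_x = max(zip(xs, residuals), key=lambda pair: abs(pair[1]))[0]
print(worst_x)  # 6
```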
6. Practice Problems
Problem 1
Find the PMCC for the data: (1, 2), (2, 3), (3, 5), (4, 4), (5, 7), (6, 8).
Solution
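$n = 6$, $\bar{x} = 3.5$, $\bar{y} = \tfrac{29}{6} \approx 4.833$.
$\sum x_i y_i = 122$, $\sum x_i^2 = 91$, $\sum y_i^2 = 167$, so $S_{xy} = 122 - 6(3.5)\left(\tfrac{29}{6}\right) = 20.5$, $S_{xx} = 91 - 6(3.5)^2 = 17.5$, $S_{yy} = 167 - 6\left(\tfrac{29}{6}\right)^2 \approx 26.83$.
$r = \dfrac{20.5}{\sqrt{17.5 \times 26.83}} \approx 0.946$.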
Problem 2
Find the equation of the regression line of $y$ on $x$ for the data in Problem 1, and use it to predict $y$ for a given value of $x$.
Solution
$b = \dfrac{20.5}{17.5} = 1.171$, $a = 4.833 - 1.171(3.5) = 4.833 - 4.100 = 0.733$.
$\hat{y} = 0.733 + 1.171x$.
To predict $y$, substitute the required value of $x$ into this equation.
Problem 3
Two teachers rank 8 students by exam performance. Calculate Spearman’s rank correlation coefficient.
| Student | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Teacher 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Teacher 2 | 3 | 1 | 4 | 2 | 6 | 5 | 8 | 7 |
Solution
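$d_i$: -2, 1, -1, 2, -1, 1, -1, 1.
$\sum d_i^2 = 4 + 1 + 1 + 4 + 1 + 1 + 1 + 1 = 14$.
$r_s = 1 - \dfrac{6(14)}{8(64 - 1)} = 1 - \dfrac{84}{504} \approx 0.833$.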
Common Pitfalls
- Losing marks by not showing sufficient working: always write out each step, especially in proof questions.
- Forgetting to check that solutions satisfy the original equation (especially after squaring both sides or dividing by variables).
- Misreading the question, particularly with ‘hence’ vs ‘hence or otherwise’: the former requires using previous work.
- Incorrectly applying integration by parts by choosing $u$ and $\dfrac{dv}{dx}$ the wrong way around.
Worked Examples
Example 1: PMCC and Regression Line
Problem. Given the data:
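| $x$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $y$ | 3 | 5 | 6 | 8 | 11 |
Calculate the PMCC and the equation of the regression line of $y$ on $x$.
Solution. $n = 5$, $\bar{x} = 3$, $\bar{y} = 6.6$.
$\sum x_i y_i = 118$. $S_{xy} = 118 - 5(3)(6.6) = 19$.
$\sum x_i^2 = 55$. $S_{xx} = 55 - 5(9) = 10$.
$\sum y_i^2 = 255$. $S_{yy} = 255 - 5(43.56) = 37.2$.
PMCC: $r = \dfrac{19}{\sqrt{10 \times 37.2}} \approx 0.985$.
Regression: $b = \dfrac{19}{10} = 1.9$, $a = 6.6 - 1.9(3) = 0.9$, so $\hat{y} = 0.9 + 1.9x$.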
Example 2: Spearman’s Rank with Ties
Problem. Two judges score 5 contestants:
| Contestant | A | B | C | D | E |
|---|---|---|---|---|---|
| Judge 1 | 7 | 9 | 5 | 9 | 3 |
| Judge 2 | 6 | 8 | 7 | 10 | 4 |
Calculate Spearman’s rank correlation coefficient.
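Solution. Rank the scores from lowest (rank 1) to highest. Judge 1 scores sorted: 3 (E), 5 (C), 7 (A), 9 (B), 9 (D). B and D are tied for ranks 4 and 5, so each receives rank 4.5.
Judge 1 ranks: E=1, C=2, A=3, B=4.5, D=4.5.
Judge 2 ranks: E=1, A=2, C=3, B=4, D=5.
| Contestant | A | B | C | D | E |
|---|---|---|---|---|---|
| $d_i$ | 1 | 0.5 | -1 | -0.5 | 0 |
$\sum d_i^2 = 1 + 0.25 + 1 + 0.25 + 0 = 2.5$, so $r_s = 1 - \dfrac{6(2.5)}{5(24)} = 1 - 0.125 = 0.875$.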
Summary
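- PMCC: $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ measures linear correlation; $-1 \le r \le 1$.
- Regression line of $y$ on $x$: $y = a + bx$ where $b = \dfrac{S_{xy}}{S_{xx}}$, $a = \bar{y} - b\bar{x}$.
- Spearman’s rank: $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$ measures monotonic correlation.
- Residuals: $e_i = y_i - \hat{y}_i$; random scatter supports linearity, curved pattern suggests non-linearity.
- Correlation does not imply causation; always inspect scatter diagrams before interpreting $r$.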
:::