In short
Correlation measures how strongly two numerical variables move together. The correlation coefficient r is a number between -1 and +1: close to +1 means the two variables increase together, close to -1 means one increases as the other decreases, and close to 0 means they do not move together in any straight-line way. The sign tells you the direction; the magnitude tells you the strength; the picture — a scatter diagram — tells you everything the number leaves out.
A researcher collects data on 50 students from a coaching institute. For each student she notes two numbers: how many hours per week the student spent solving problems, and the marks the student scored on the end-of-term mock test.
She draws a graph. On the horizontal axis: hours of practice. On the vertical axis: marks. Each student becomes one dot. She ends up with 50 dots scattered across the page.
The dots are not random. They lean — from the lower-left to the upper-right. Students who practised more tended to score higher. Not always, not perfectly, but the trend is visible.
Now she wants to put a number on that trend. How strong is the leaning? If she handed a newspaper a statement like "practice hours and exam marks are strongly related in this dataset," she would want that word strongly to correspond to something measurable. Something that would give the same answer if she sent the dataset to a colleague.
The number she wants is the correlation coefficient — a single number between -1 and +1 that captures exactly how strongly two columns of data move together, and which way.
Scatter diagrams: the picture before the number
Before you compute anything, draw the data.
A scatter diagram (also called a scatter plot) is the most basic tool in bivariate statistics. Each observation is a pair (x_i, y_i) and you draw a dot at that point on a pair of axes. The result — a cloud of dots — tells you almost everything you need to know about how the two variables relate to each other.
Three things are worth looking for in a scatter diagram:
- Direction. Do the dots rise from left to right (positive association) or fall (negative association)?
- Strength. Are the dots packed tightly around a line, or do they form a loose cloud?
- Shape. Is the trend straight, or does it bend?
The picture is the primary evidence. The correlation coefficient is one number extracted from that picture — and extracting one number from a whole scatter plot loses information. That is why a scatter diagram always comes first and the coefficient always comes second.
Building the correlation coefficient
Now build the number. Start from a simple question and let the formula emerge.
The question. Given two variables x and y and n pairs of observations (x_i, y_i), is there a single number that captures "how strongly do x and y move together in a straight-line way?"
The first idea. Look at each data point and ask: does it support the claim that x and y move together?
Compute the mean \bar{x} of the x-values and the mean \bar{y} of the y-values. These define a "centre" of the scatter cloud. Draw a vertical line at x = \bar{x} and a horizontal line at y = \bar{y}. These two lines divide the scatter plot into four quadrants.
- Upper-right quadrant (x_i > \bar{x} and y_i > \bar{y}): the point is above average in both variables.
- Lower-left quadrant (x_i < \bar{x} and y_i < \bar{y}): the point is below average in both.
- Upper-left quadrant (x_i < \bar{x} and y_i > \bar{y}): above average in y, below average in x.
- Lower-right quadrant (x_i > \bar{x} and y_i < \bar{y}): above average in x, below average in y.
Points in the upper-right and lower-left quadrants support a positive relationship: when x is higher than usual, y is too. Points in the other two quadrants support a negative relationship.
Now assign each point a signed contribution. For each point, compute

(x_i - \bar{x})(y_i - \bar{y})
In the upper-right quadrant, both factors are positive, so the product is positive. In the lower-left, both factors are negative, so the product is positive again. In the upper-left and lower-right, one factor is positive and one is negative, so the product is negative.
Sum these products across all points:

S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})
If most of the data sits in the positive quadrants, S_{xy} is a large positive number. If most of it sits in the negative quadrants, S_{xy} is a large negative number. If the points are balanced across all four quadrants, S_{xy} is near zero. That is exactly the behaviour you want for a measure of association.
The sum S_{xy} is almost the correlation — but not quite. It has a problem. If you measure x in metres instead of centimetres, every (x_i - \bar{x}) shrinks by a factor of 100, and S_{xy} shrinks with it. The strength of the relationship has not changed — only the units — but the number has. That is a deal-breaker. You want a measure that does not depend on the unit of measurement.
The fix. Divide by the natural "scale" of each variable so the units cancel. The natural scale of the x-values is how spread out they are around their mean, which is measured by

S_{xx} = \sum (x_i - \bar{x})^2
Similarly S_{yy} = \sum (y_i - \bar{y})^2 measures the spread of the y-values.
Define

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
This is the Karl Pearson correlation coefficient, usually just called the correlation coefficient. It is the standard way to summarise a linear relationship between two variables.
Two things are worth checking.
Check 1: it is unit-free. If you multiply every x_i by a constant c > 0, then every (x_i - \bar{x}) gets multiplied by c, so S_{xy} gets multiplied by c and S_{xx} gets multiplied by c^2. The denominator has \sqrt{c^2 \cdot S_{xx} \cdot S_{yy}} = c \sqrt{S_{xx} S_{yy}}, and the c cancels. So r is unchanged. You get the same correlation whether you measure heights in centimetres or feet.
Check 2: it lies between -1 and +1. This is a consequence of the Cauchy-Schwarz inequality (a general statement about sums of products), and the proof is set out in the "Going deeper" section below. Intuitively: the numerator |S_{xy}| can never be bigger than the geometric mean of the two denominators.
Karl Pearson correlation coefficient
For n pairs of observations (x_i, y_i), the correlation coefficient is

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}
where \bar{x} and \bar{y} are the means of the x-values and y-values.
r always satisfies -1 \leq r \leq +1.
- r = +1: perfect positive linear relationship (all points on a rising straight line)
- r = -1: perfect negative linear relationship (all points on a falling straight line)
- r = 0: no linear relationship
An equivalent formula, easier to compute
The definition above is clean but needs you to compute the means first and then subtract them from every observation. When doing the calculation by hand, an equivalent form is easier:

r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2} \, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}
This form needs only five sums: \sum x_i, \sum y_i, \sum x_i^2, \sum y_i^2, \sum x_i y_i. You compute them all in a single pass through the data and then plug into the formula. That is how correlations were computed for a century before calculators existed, and it is still the form most Indian textbooks use for exam problems.
The two formulas give identical answers — they are algebraically the same thing after expanding (x_i - \bar{x}) and collecting terms. Choose whichever is more convenient.
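The agreement between the two formulas is easy to check in code. Below is a minimal Python sketch; the function names and the small dataset are illustrative, not from the text:

```python
import math

def corr_definition(xs, ys):
    """r from the definition: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def corr_shortcut(xs, ys):
    """r from the five sums, accumulated in a single pass."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

# Illustrative data, chosen small enough to check by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(corr_definition(xs, ys), 4))  # 0.7746
print(round(corr_shortcut(xs, ys), 4))    # 0.7746
```

Both routines return the same value, as the algebra promises; the shortcut needs only one pass over the data.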
Worked examples
Example 1: practice hours and exam marks
Six students report how many hours per week they spent solving practice problems, and their marks on a 50-mark mock exam.
| Student | x (hours) | y (marks) |
|---|---|---|
| A | 2 | 18 |
| B | 4 | 24 |
| C | 6 | 28 |
| D | 8 | 36 |
| E | 10 | 40 |
| F | 12 | 46 |
Compute the correlation coefficient.
Step 1. Compute the five sums.

\sum x_i = 42, \quad \sum y_i = 192, \quad \sum x_i^2 = 364, \quad \sum y_i^2 = 6696, \quad \sum x_i y_i = 1540
Why: these five sums are all the raw material the formula needs. Compute them once, reuse them twice.
Step 2. Apply the shortcut formula with n = 6.

n \sum x_i y_i - \sum x_i \sum y_i = 6(1540) - (42)(192) = 9240 - 8064 = 1176
Why: this is the numerator — it measures how much the data leans toward a positive (or negative) line relative to the product of the marginal sums.
Step 3. Compute the two denominator terms.

n \sum x_i^2 - \left(\sum x_i\right)^2 = 6(364) - (42)^2 = 2184 - 1764 = 420

n \sum y_i^2 - \left(\sum y_i\right)^2 = 6(6696) - (192)^2 = 40176 - 36864 = 3312
Why: each of these measures the variability of one of the variables, rescaled by a factor of n (which cancels in the final ratio).
Step 4. Plug into the formula.

r = \frac{1176}{\sqrt{420 \times 3312}} = \frac{1176}{\sqrt{1391040}} \approx \frac{1176}{1179.42} \approx 0.997
Why: take the square root of the product of the two denominator terms and divide. The result is a unit-free number in [-1, 1].
Result: r \approx 0.997.
Read the number. r = 0.997 is extremely close to +1, which is the maximum possible. That means this particular dataset shows an almost-perfect positive linear relationship: more practice hours line up almost exactly with higher marks, in a straight-line way. In real data you would rarely see something this clean; this dataset was constructed to be close to a line so the arithmetic works out cleanly.
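As a sanity check, Example 1's arithmetic can be reproduced with NumPy's np.corrcoef, which returns a 2x2 correlation matrix whose off-diagonal entry is r (this assumes NumPy is available):

```python
import numpy as np

hours = np.array([2, 4, 6, 8, 10, 12])
marks = np.array([18, 24, 28, 36, 40, 46])

# np.corrcoef returns [[1, r], [r, 1]]; take the off-diagonal entry.
r = np.corrcoef(hours, marks)[0, 1]
print(round(r, 3))  # 0.997
```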
Example 2: age of a car and its resale price
Five cars of the same model are sold in the used market. For each car, record the age (in years) and the resale price (in lakhs of rupees).
| Car | x (age) | y (price) |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 3 | 6 |
| 3 | 5 | 4 |
| 4 | 7 | 3 |
| 5 | 9 | 2 |
Step 1. Compute the sums.
The pairs are (1,7), (3,6), (5,4), (7,3), (9,2). Compute each product x_i y_i: 7, 18, 20, 21, 18. Then:

\sum x_i = 25, \quad \sum y_i = 22, \quad \sum x_i^2 = 165, \quad \sum y_i^2 = 114, \quad \sum x_i y_i = 84
Why: computing each product explicitly before summing makes it impossible to lose track of which term goes with which. Work the table left-to-right, top-to-bottom.
Step 2. Apply the shortcut formula with n = 5.
Numerator:

n \sum x_i y_i - \sum x_i \sum y_i = 5(84) - (25)(22) = 420 - 550 = -130
The numerator is negative. That already tells you r will be negative — the dots lean the other way.
Why: the sign of the numerator is the sign of r. Here, the sign is negative, matching the intuition that older cars tend to be cheaper.
Step 3. Compute the denominator terms.

n \sum x_i^2 - \left(\sum x_i\right)^2 = 5(165) - (25)^2 = 825 - 625 = 200

n \sum y_i^2 - \left(\sum y_i\right)^2 = 5(114) - (22)^2 = 570 - 484 = 86
Why: both denominator terms are always non-negative, because each is a scaled variance.
Step 4. Plug in.

r = \frac{-130}{\sqrt{200 \times 86}} = \frac{-130}{\sqrt{17200}} \approx \frac{-130}{131.15} \approx -0.991
Why: the magnitude is close to 1, so the relationship is strong; the sign is negative, so it is an inverse relationship.
Result: r \approx -0.991.
The number -0.991 tells you the resale price has an almost-perfect negative linear relationship with age — each additional year of age is associated with a very predictable drop in price.
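Example 2 can be verified the same way, this time writing the five-sums shortcut out directly in Python (a sketch; the variable names are illustrative):

```python
import math

ages   = [1, 3, 5, 7, 9]
prices = [7, 6, 4, 3, 2]
n = len(ages)

sx, sy = sum(ages), sum(prices)                 # 25, 22
sxx = sum(x * x for x in ages)                  # 165
syy = sum(y * y for y in prices)                # 114
sxy = sum(x * y for x, y in zip(ages, prices))  # 84

r = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))
print(round(r, 3))  # -0.991
```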
Properties of the correlation coefficient
The correlation coefficient has several properties that are worth knowing — some help you sanity-check calculations, others help you interpret the result.
- Bounded: -1 \leq r \leq +1 always. If your calculation gives r = 1.3, the calculation is wrong.
- Symmetric: r_{xy} = r_{yx}. The correlation between x and y is the same as the correlation between y and x. Unlike regression (next article), there is no "explanatory" and "response" variable — correlation treats them equally.
- Unaffected by change of origin: if you add a constant a to every x_i and a constant b to every y_i, the correlation r does not change. Shifting the data left, right, up, or down does not change how tightly the dots hug a line.
- Unaffected by change of scale: if you multiply every x_i by a positive constant and every y_i by a positive constant, r does not change. Measuring heights in inches instead of centimetres does not change the correlation. If one of the multipliers is negative, the sign of r flips, but the magnitude stays the same.
- Measures linear relationship only: r near zero does not mean the variables are unrelated. It means they have no straight-line relationship. Two variables can be perfectly related by a curve and still have r = 0. See the common-confusions section.
- Dimensionless: the numerator and denominator carry the same units, so they cancel. r is a pure number — it has no units.
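The invariance properties are easy to confirm numerically. A sketch using NumPy on synthetic data (the seed, sample size, and multipliers are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # roughly linear, with noise

r         = np.corrcoef(x, y)[0, 1]
r_shifted = np.corrcoef(x + 50, y - 3)[0, 1]       # change of origin
r_scaled  = np.corrcoef(2.54 * x, 1000 * y)[0, 1]  # positive rescaling
r_flipped = np.corrcoef(-x, y)[0, 1]               # one negative multiplier

print(np.isclose(r, r_shifted))   # True: shift does nothing
print(np.isclose(r, r_scaled))    # True: positive scale does nothing
print(np.isclose(r, -r_flipped))  # True: negative multiplier flips the sign
```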
Common confusions
- "Correlation implies causation." The single most common mistake in statistics. High correlation between two variables does not mean one causes the other. Ice cream sales and drowning deaths are strongly correlated, but ice cream does not cause drowning — both go up in summer, because hot weather increases both swimming and ice cream consumption. A lurking third variable (temperature) explains both. Always ask: could there be a hidden common cause?
- "r = 0 means the variables are independent." False. r only measures linear association. Take y = x^2 with x ranging over -5, -4, \ldots, 4, 5. The two variables are perfectly determined by each other (given x, you know y exactly) — but r = 0, because the relationship is a parabola, not a line, and the parabola is symmetric around x = 0 so the positive and negative contributions cancel.
- "A strong correlation means the line is steep." No. The steepness of the relationship is measured by the regression coefficient (next article), not by r. A correlation of 0.9 can come from a dataset where y increases by 0.1 per unit of x (very shallow line) or by 100 per unit of x (very steep line). The correlation measures how tightly the points cluster around whatever line, not the slope of that line.
- "Correlation tells you about individual predictions." A correlation of 0.8 is strong, but it does not mean every individual point is close to the line. It is a population-level summary. For any single observation, the "prediction error" can still be substantial.
- "r^2 is just r squared, with no new meaning." Mathematically it is — but r^2 has a specific interpretation called the coefficient of determination. It is the fraction of the variability in y that is "explained" by a linear fit to x. If r = 0.7, then r^2 = 0.49, so roughly 49\% of the variability in y is accounted for by x alone. This interpretation becomes central in the article on regression.
Going deeper
If you are comfortable computing r, interpreting its sign and magnitude, and remembering that correlation is not causation, you have the core idea. The rest of this section proves that r is always in [-1, 1] and shows you the covariance formulation that generalises cleanly to probability.
Why |r| \leq 1: the Cauchy-Schwarz proof
The claim is that

\left( \sum (x_i - \bar{x})(y_i - \bar{y}) \right)^2 \leq \sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2
Write u_i = x_i - \bar{x} and v_i = y_i - \bar{y} to tidy the notation. Then the claim is

\left( \sum u_i v_i \right)^2 \leq \sum u_i^2 \cdot \sum v_i^2
which is the Cauchy-Schwarz inequality for sums. There is a slick proof. For any real number t,

\sum (u_i - t v_i)^2 \geq 0
because it is a sum of squares. Expand:

\left( \sum v_i^2 \right) t^2 - 2 \left( \sum u_i v_i \right) t + \sum u_i^2 \geq 0
This is a quadratic in t that is never negative. A quadratic at^2 + bt + c is never negative exactly when its discriminant b^2 - 4ac is at most zero (and a \geq 0, which is true here because \sum v_i^2 \geq 0). So

4 \left( \sum u_i v_i \right)^2 - 4 \sum v_i^2 \cdot \sum u_i^2 \leq 0
Simplify:

\left( \sum u_i v_i \right)^2 \leq \sum u_i^2 \cdot \sum v_i^2
Taking square roots and dividing gives |r| \leq 1. Equality — meaning r = \pm 1 — holds exactly when u_i = t v_i for all i for some constant t, which is to say when y_i - \bar{y} = t(x_i - \bar{x}) for all i. That is exactly the condition that all the data points lie on a single straight line. So r = \pm 1 if and only if the scatter plot is perfectly collinear.
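The bound can be spot-checked numerically: on random data, |S_{xy}| never exceeds \sqrt{S_{xx} S_{yy}}, no matter the draw. A sketch assuming NumPy is available (seed, sample size, and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
for _ in range(1000):
    x = rng.normal(size=20)
    y = rng.normal(size=20)
    u = x - x.mean()
    v = y - y.mean()
    # Cauchy-Schwarz: (sum u*v)^2 <= (sum u^2)(sum v^2).
    # The tiny tolerance absorbs floating-point rounding.
    assert np.sum(u * v) ** 2 <= np.sum(u ** 2) * np.sum(v ** 2) + 1e-12
print("Cauchy-Schwarz held in all 1000 trials")
```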
Covariance, and the language of expectation
The numerator of r has a name of its own. The quantity

\text{Cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})

is called the covariance of x and y. (The factor of 1/n reappears in the standard deviations below, so it cancels in r.) Covariance measures how much two variables vary together — same sign when they move together, opposite sign when they move oppositely. It is the direct generalisation of variance: the variance of x is the covariance of x with itself.
In this language, the correlation coefficient is

r = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}
where \sigma_x and \sigma_y are the standard deviations of x and y. The correlation is the standardised covariance — the covariance divided by the standard deviations to remove the units.
When you meet probability distributions later, the sample covariance \text{Cov}(x, y) becomes the population covariance \mathbb{E}[(X - \mu_X)(Y - \mu_Y)], and the sample correlation becomes the population correlation \rho = \text{Cov}(X, Y) / (\sigma_X \sigma_Y). The formulas look nearly identical — the sums become expectations, but the structure is the same.
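The covariance formulation can be verified against np.corrcoef, reusing Example 1's data. One subtlety worth flagging: NumPy's np.cov defaults to the n-1 normalisation while np.std defaults to n, so ddof=0 is set explicitly on both to make the ratio come out right (a sketch, assuming NumPy is available):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([18.0, 24.0, 28.0, 36.0, 40.0, 46.0])

# (1/n) * sum of products of deviations — the off-diagonal of the
# covariance matrix, with ddof=0 forcing the 1/n normalisation.
cov = np.cov(x, y, ddof=0)[0, 1]
r = cov / (np.std(x, ddof=0) * np.std(y, ddof=0))

print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True
```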
The anti-example: Anscombe's quartet
Four datasets, each with 11 points, were constructed by Frank Anscombe in 1973 to make a single point. All four have:
- the same mean of x
- the same mean of y
- the same variance of x
- the same variance of y
- the same correlation coefficient (r \approx 0.816)
- the same best-fit line
And yet the four scatter plots look completely different: one is a clean linear cloud, one is a smooth parabola, one has a single outlier that forces the correlation up, one has all the x values identical except for a single point that pulls the line through it.
The moral is the one you already met: summaries like r are useful but not sufficient. They compress a scatter plot down to a single number, and that compression hides information. Always plot the data before trusting the number. That is why this article began with scatter diagrams and not with the formula.
Where this leads next
You know how to measure the strength of a linear relationship between two variables. The obvious next question is: given that relationship, can you use it to predict y from x for a new observation?
- Regression — the line that best fits the data, derived by minimising the sum of squared errors, along with how to use it for prediction.
- Covariance — the unstandardised version of correlation, and the bridge to the formal theory of random variables.
- Sampling — if your data comes from a sample of a larger population, how reliable is the sample correlation as an estimate of the true population correlation?
- Chi-Square and Rank Correlation — how to measure association when one or both variables are ordinal (ranks) rather than measurements.
- Introduction to Inference — deciding whether an observed correlation is "real" or could plausibly be due to sampling noise.