In short
Regression is the line that best fits a scatter plot. The regression line of y on x is the straight line that makes the total squared vertical error as small as possible. Its slope, called the regression coefficient b_{yx}, measures how much y changes per unit change in x. Given the best-fit line, you can predict y for any new x by reading off the line. There are always two regression lines — one for predicting y from x and a different one for predicting x from y — and they coincide only when |r| = 1.
A company wants to estimate how many air conditioners it will sell next summer as a function of the maximum temperature in June. It has 10 years of data: for each year, the June maximum temperature and the number of air conditioners sold that month. Each year becomes one dot on a scatter plot.
The dots show a clear positive trend — hotter Junes mean more sales. But the trend is not perfect; there is scatter around it. The question is: is there a way to draw a single straight line through that cloud that captures the trend as well as possible, and can you then use that line to answer the practical question "if June 2027 has a maximum of 44°C, how many ACs should we stock?"
The answer is yes. The mathematical method is called least-squares regression. It produces the specific line that fits the data best in a precise sense, and once you have that line, you can plug in any x and read off the predicted y.
This article walks through where that line comes from, why the formula is what it is, and what the numbers mean once you have them.
What does "best fit" even mean?
Look at a scatter plot and draw any straight line through it. For every data point, there is a vertical gap between the actual point and the line — the point is either above the line, on it, or below it. Call the signed vertical gap the residual for that point:

$$e_i = y_i - \hat{y}_i,$$

where y_i is the actual y-value and \hat{y}_i is the value predicted by the line at x = x_i.
If you move the line up or rotate it slightly, the residuals change — some shrink, others grow. The "best" line should make the residuals small as a whole. But how do you measure "small as a whole" when some are positive and some are negative?
- Summing the residuals directly does not work: positive and negative errors cancel, and a wildly wrong line can still have \sum e_i = 0.
- Summing |e_i| avoids cancellation but is mathematically hard to work with (the absolute value has a corner).
- Summing e_i^2 avoids cancellation and is smooth — it is differentiable, which means you can find its minimum using calculus.
The last option is the one that wins, both for tradition and for pleasant mathematics. The least-squares line is the line that minimises

$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2.$$
Every regression calculation in the rest of this article is a consequence of this one choice.
Deriving the regression line of y on x
Assume the line has equation \hat{y} = a + bx where a and b are unknown. The job is to choose a and b to minimise

$$S(a, b) = \sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2.$$
S is a smooth function of two variables, and it has a minimum where both partial derivatives vanish. Set them up one at a time.
Derivative with respect to a. Differentiate S treating b as a constant:

$$\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left(y_i - a - b x_i\right).$$
Set this equal to zero:

$$\sum_{i=1}^{n} \left(y_i - a - b x_i\right) = 0.$$
Split the sum:

$$\sum y_i - na - b \sum x_i = 0.$$
Divide by n:

$$\bar{y} = a + b\bar{x} \qquad\Longrightarrow\qquad a = \bar{y} - b\bar{x}.$$
That is the first normal equation. It says: the best-fit line passes through the point (\bar{x}, \bar{y}). The centre of the data is always on the line.
Derivative with respect to b. Differentiate S treating a as a constant:

$$\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left(y_i - a - b x_i\right).$$
Set this equal to zero:

$$\sum x_i y_i - a \sum x_i - b \sum x_i^2 = 0.$$
Substitute a = \bar{y} - b \bar{x} from the first equation:

$$\sum x_i y_i - (\bar{y} - b\bar{x}) \sum x_i - b \sum x_i^2 = 0.$$
Use \sum x_i = n \bar{x}:

$$\sum x_i y_i - n\bar{x}\bar{y} + nb\bar{x}^2 - b \sum x_i^2 = 0.$$
Solve for b:

$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}.$$
Factor a little more to get a cleaner form. Using the identities \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x}\bar{y} and \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2:

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}.$$
This is the slope of the best-fit line. It has a name.
Regression coefficient of $y$ on $x$
The regression coefficient of y on x, written b_{yx}, is the slope of the least-squares line predicting y from x:

$$b_{yx} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}.$$
The regression line of y on x is

$$y - \bar{y} = b_{yx}(x - \bar{x}).$$
The equation y - \bar{y} = b_{yx}(x - \bar{x}) just uses the fact that the line passes through (\bar{x}, \bar{y}) with slope b_{yx} — point-slope form of a line.
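The whole derivation collapses into a few lines of code. A minimal sketch in pure Python (the function name `fit_y_on_x` is ours, not standard):

```python
# Fit the least-squares line of y on x exactly as derived above:
# slope b_yx = S_xy / S_xx, intercept a = ybar - b_yx * xbar.

def fit_y_on_x(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    s_xx = sum((x - xbar) ** 2 for x in xs)
    b = s_xy / s_xx           # regression coefficient b_yx
    a = ybar - b * xbar       # first normal equation: line through the mean point
    return a, b

a, b = fit_y_on_x([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
y_hat = a + b * 6             # predict y at a new x by reading off the line
```

Prediction is then just substitution into \hat{y} = a + bx.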
The connection to correlation
Compare the formula for b_{yx} with the formula for the correlation coefficient r from the previous article:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}.$$
The numerators are identical. The denominators differ — and the ratio comes out to

$$b_{yx} = r \cdot \frac{s_y}{s_x},$$
where s_x = \sqrt{\frac{1}{n}\sum(x_i - \bar{x})^2} and s_y = \sqrt{\frac{1}{n}\sum(y_i - \bar{y})^2} are the standard deviations of the two variables. This is a very useful relationship: once you have the correlation and the standard deviations, the regression slope comes for free.
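The identity is easy to check numerically. A sketch in pure Python on sample data (variable names are ours, not standard):

```python
# Numerically checking b_yx = r * (s_y / s_x).
import math

xs = [2, 3, 4, 5, 6, 8]
ys = [15, 18, 22, 24, 27, 34]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
s_xx = sum((x - xbar) ** 2 for x in xs)
s_yy = sum((y - ybar) ** 2 for y in ys)

b_yx = s_xy / s_xx                      # regression slope
r = s_xy / math.sqrt(s_xx * s_yy)       # correlation coefficient
sx = math.sqrt(s_xx / n)                # standard deviation of x
sy = math.sqrt(s_yy / n)                # standard deviation of y

assert abs(b_yx - r * sy / sx) < 1e-12  # the identity holds
```

The 1/n factors inside s_x and s_y cancel in the ratio, which is why the identity holds whichever variance convention you use.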
The other regression line
What if, instead of predicting y from x, you wanted to predict x from y? Run the same derivation with the roles swapped. You end up minimising the sum of squared horizontal distances from the points to a line, and you get a different line — the regression line of x on y.
Regression coefficient of $x$ on $y$
The regression coefficient of x on y, written b_{xy}, is

$$b_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2} = r \cdot \frac{s_x}{s_y}.$$
The regression line of x on y is

$$x - \bar{x} = b_{xy}(y - \bar{y}).$$
Why are there two lines and not one? Because the definition of "best fit" is asymmetric. The line of y on x minimises vertical error — the distances measured parallel to the y-axis. The line of x on y minimises horizontal error — the distances measured parallel to the x-axis. Those are different optimisation problems, and they give different answers.
Both lines pass through the centre (\bar{x}, \bar{y}), so they always intersect at the mean point. They coincide exactly when |r| = 1 — when the data lies perfectly on a single line, both ways of measuring error give the same line. As the correlation weakens, the two lines spread apart.
A useful identity. Multiply the two regression coefficients:

$$b_{yx} \cdot b_{xy} = \frac{\left[\sum (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2} = r^2.$$
So the product of the two regression coefficients is the square of the correlation coefficient. This gives you another way to compute r — compute both regression coefficients and take the square root of their product (with the sign chosen to match the common sign of b_{yx} and b_{xy}, which is always the same because both depend on the numerator \sum(x_i - \bar{x})(y_i - \bar{y})).
Worked examples
Example 1: advertising spend and sales
A retailer has data for 6 months on advertising spend (in lakh rupees) and sales (in lakh rupees).
| Month | x (ad spend) | y (sales) |
|---|---|---|
| 1 | 2 | 15 |
| 2 | 3 | 18 |
| 3 | 4 | 22 |
| 4 | 5 | 24 |
| 5 | 6 | 27 |
| 6 | 8 | 34 |
Find the regression line of sales on advertising spend, and predict the sales in a month with an advertising spend of 7 lakh.
Step 1. Compute the means.

$$\bar{x} = \frac{2+3+4+5+6+8}{6} = \frac{28}{6} \approx 4.667, \qquad \bar{y} = \frac{15+18+22+24+27+34}{6} = \frac{140}{6} \approx 23.333.$$
Why: the regression line passes through (\bar{x}, \bar{y}), so you always need the means first.
Step 2. Compute the deviations and their products.
| x_i | y_i | x_i - \bar{x} | y_i - \bar{y} | (x_i - \bar{x})(y_i - \bar{y}) | (x_i - \bar{x})^2 |
|---|---|---|---|---|---|
| 2 | 15 | -2.667 | -8.333 | 22.222 | 7.111 |
| 3 | 18 | -1.667 | -5.333 | 8.889 | 2.778 |
| 4 | 22 | -0.667 | -1.333 | 0.889 | 0.444 |
| 5 | 24 | 0.333 | 0.667 | 0.222 | 0.111 |
| 6 | 27 | 1.333 | 3.667 | 4.889 | 1.778 |
| 8 | 34 | 3.333 | 10.667 | 35.556 | 11.111 |
| Sum | | | | 72.667 | 23.333 |
Why: computing the deviations once and reusing them keeps the arithmetic organised. The two column totals are all you need for the slope.
Step 3. Compute b_{yx}.

$$b_{yx} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{72.667}{23.333} \approx 3.114.$$
Why: this is the slope of the best-fit line. Each extra lakh of advertising is associated with roughly 3.11 lakh of additional sales.
Step 4. Write the regression line.

$$y - 23.333 = 3.114\,(x - 4.667) \qquad\Longrightarrow\qquad \hat{y} = 3.114\,x + 8.800.$$
Why: expanding the point-slope form gives the slope-intercept form y = mx + c, which is what you use for prediction.
Step 5. Predict y at x = 7.

$$\hat{y} = 3.114 \times 7 + 8.800 \approx 21.8 + 8.8 = 30.6.$$
Why: prediction is just substitution. Plug the new x into the equation of the line and read off \hat{y}.
Result: Regression line \hat{y} = 3.114 x + 8.800. Predicted sales at ad spend 7 is approximately 30.6 lakh.
The prediction \hat{y} = 30.6 is what the regression line says sales will be given an ad spend of 7 lakh, assuming the relationship continues to hold. It is not a guarantee — actual sales in any specific month will differ from the line by some residual.
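The whole example can be checked end to end in a few lines. A sketch in pure Python, using only the formulas derived above:

```python
# Example 1: fit sales (y) on ad spend (x), then predict at x = 7.

xs = [2, 3, 4, 5, 6, 8]          # ad spend, lakh rupees
ys = [15, 18, 22, 24, 27, 34]    # sales, lakh rupees
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)   # slope b_yx, roughly 3.114
a = ybar - b * xbar                      # intercept, roughly 8.800

y_hat = a + b * 7                        # predicted sales, roughly 30.6
```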
Example 2: both regression lines
Take a smaller dataset and compute both regression lines — y on x and x on y.
| x | y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
| 4 | 4 |
| 5 | 6 |
Step 1. Compute sums and means.

$$\sum x_i = 15, \quad \sum y_i = 20, \quad \sum x_i^2 = 55, \quad \sum y_i^2 = 90, \qquad \bar{x} = \frac{15}{5} = 3, \quad \bar{y} = \frac{20}{5} = 4.$$
The products x_i y_i are 2, 6, 15, 16, 30. Sum them: \sum xy = 2 + 6 + 15 + 16 + 30 = 69.
Why: compute the five sums in one pass and derive the means by a single division each.
Step 2. Compute the building blocks.

$$S_{xx} = \sum x_i^2 - n\bar{x}^2 = 55 - 5(9) = 10, \qquad S_{yy} = \sum y_i^2 - n\bar{y}^2 = 90 - 5(16) = 10, \qquad S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 69 - 5(12) = 9.$$
Why: these three numbers — S_{xx}, S_{yy}, S_{xy} — are all you need for both regression lines and the correlation.
Step 3. Compute both regression coefficients.

$$b_{yx} = \frac{S_{xy}}{S_{xx}} = \frac{9}{10} = 0.9, \qquad b_{xy} = \frac{S_{xy}}{S_{yy}} = \frac{9}{10} = 0.9.$$
By coincidence (because S_{xx} = S_{yy}), the two coefficients are numerically equal here — but they mean different things. b_{yx} = 0.9 says "if x goes up by 1, the line of y on x predicts y to go up by 0.9"; b_{xy} = 0.9 says "if y goes up by 1, the line of x on y predicts x to go up by 0.9".
Why: the two coefficients are defined differently — one divides by S_{xx}, the other by S_{yy}. They only happen to agree when the two variables have equal dispersion.
Step 4. Write both regression lines.
Regression line of y on x:

$$y - 4 = 0.9\,(x - 3) \qquad\Longrightarrow\qquad y = 0.9x + 1.3.$$
Regression line of x on y:

$$x - 3 = 0.9\,(y - 4) \qquad\Longrightarrow\qquad x = 0.9y - 0.6.$$
Why: both pass through the common mean point (\bar{x}, \bar{y}) = (3, 4). That is why they intersect at (3, 4).
Step 5. Compute r.

$$r^2 = b_{yx} \cdot b_{xy} = 0.9 \times 0.9 = 0.81 \qquad\Longrightarrow\qquad r = \pm 0.9.$$
Both regression coefficients are positive, so r is positive: r = 0.9.
Why: the identity b_{yx} \cdot b_{xy} = r^2 gives you the correlation as a free by-product of the two regression coefficients.
Result: b_{yx} = b_{xy} = 0.9, r = 0.9, regression lines y = 0.9x + 1.3 and x = 0.9y - 0.6.
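Both coefficients and the correlation can be verified in code. A sketch in pure Python; `math.copysign` attaches the common sign of the coefficients to r:

```python
# Example 2: both regression coefficients, and r recovered from their product.
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # 9.0
s_xx = sum((x - xbar) ** 2 for x in xs)                      # 10.0
s_yy = sum((y - ybar) ** 2 for y in ys)                      # 10.0

b_yx = s_xy / s_xx    # 0.9
b_xy = s_xy / s_yy    # 0.9
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)   # 0.9
```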
The two regression lines open up at the mean point like a pair of scissors. The angle between them is small when the correlation is strong and wider when the correlation is weak. At r = \pm 1 the two lines close together into a single line; at r = 0 they become perpendicular (one horizontal, one vertical).
Properties of regression coefficients
- b_{yx} and b_{xy} have the same sign. Both are proportional to the same numerator \sum(x_i - \bar{x})(y_i - \bar{y}). The sign of that numerator is the sign of r, and both coefficients inherit it.
- b_{yx} \cdot b_{xy} = r^2. Already derived above. Since 0 \leq r^2 \leq 1, the product of the two coefficients is never negative and never exceeds 1; in particular, if one coefficient is greater than 1 in magnitude, the other must be smaller than 1 in magnitude.
- Unaffected by change of origin, affected by change of scale. Shifting all x_i by a constant does not change the slope of either line. Multiplying all x_i by a constant c multiplies b_{yx} by 1/c and multiplies b_{xy} by c.
- The two regression lines intersect at (\bar{x}, \bar{y}). Always, for any dataset. This is a direct consequence of the derivation — both lines are written in point-slope form anchored at the mean point.
- When both regression coefficients are positive, their arithmetic mean is at least |r|. This is the AM-GM inequality applied to the product identity: \frac{b_{yx} + b_{xy}}{2} \geq \sqrt{b_{yx} \cdot b_{xy}} = |r|. (When both are negative, the inequality reverses: the mean is at most -|r|.)
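The origin/scale property is easy to confirm numerically. A sketch in pure Python (the helper `b_yx` is ours):

```python
# Shifting x leaves the slope unchanged; scaling x by c divides b_yx by c.

def b_yx(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
b = b_yx(xs, ys)

assert abs(b_yx([x + 10 for x in xs], ys) - b) < 1e-9      # change of origin: no effect
assert abs(b_yx([2 * x for x in xs], ys) - b / 2) < 1e-9   # scale by c = 2: slope halves
```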
Common confusions
- "You only ever need one regression line." The line depends on which variable you are predicting. Predicting sales from ad spend uses y on x; predicting ad spend from sales uses x on y. They are different lines and they give different numerical answers if you use the wrong one.
- "The regression line is the same as the correlation line." Correlation is a single number; it does not come with a line. The regression line is a line; it does not directly tell you the strength of the association. They are related (via b_{yx} = r \cdot s_y / s_x) but are different objects answering different questions.
- "Prediction far outside the observed data is as reliable as prediction within it." False. A regression line fit to data from x = 20 to x = 50 tells you very little about x = 200. The relationship might bend, flatten, or break down entirely outside the range you observed. Using the line far outside the observed range is called extrapolation and should be done with extreme caution.
- "A steeper line means a stronger correlation." The slope measures the rate of change; the correlation measures the tightness of the fit. A line with slope 100 fit to a very scattered cloud of points can have a weak correlation. A line with slope 0.1 fit to a tight cloud can have a strong one.
- "Residuals sum to zero, so the line fits perfectly." The residuals of a least-squares line always sum to zero — this falls out of the first normal equation. But having \sum e_i = 0 does not mean the line fits well; it just means the errors balance out. The sum of squared residuals, \sum e_i^2, is the actual measure of fit.
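The last confusion is worth checking numerically: the residuals of any least-squares fit sum to (floating-point) zero regardless of how well the line fits. A sketch in pure Python:

```python
# Residuals always balance out; the squared total is the real fit measure.

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-9       # ~0 for every least-squares line
sse = sum(e ** 2 for e in residuals)    # the actual measure of fit
```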
Going deeper
If you can derive the regression line, compute the coefficients, and use the equation to predict, you have the working knowledge of regression. The rest of this section is about what the line is doing in a slightly deeper sense, and how it connects to larger ideas.
The geometry of least squares
Think of the n observations as a single point in n-dimensional space: \mathbf{y} = (y_1, y_2, \ldots, y_n). The n fitted values \hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_n) are another point — one that has to lie on a specific 2-dimensional plane inside the n-dimensional space, because every \hat{y}_i is of the form a + b x_i for some a and b.
What the least-squares method does is this: it drops a perpendicular from \mathbf{y} to that 2-dimensional plane. The foot of the perpendicular is the point \hat{\mathbf{y}} closest to \mathbf{y} in the ordinary Euclidean sense. The vector from \hat{\mathbf{y}} to \mathbf{y} is the vector of residuals \mathbf{e}, and by construction this residual vector is perpendicular to the plane.
This geometric picture is why least squares has so many pleasant properties. Orthogonal projection is the cleanest operation in linear algebra, and once you recognise regression as a projection, you inherit all the theorems about projections for free. This point of view is the starting point for multiple regression (more than one explanatory variable), generalised linear models, and most of modern statistics.
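The orthogonality at the heart of the picture can be verified directly: the residual vector is perpendicular to both vectors spanning the plane of possible fitted values, the all-ones vector and the x-vector. A sketch in pure Python:

```python
# The residual vector e is orthogonal to (1, ..., 1) and to (x_1, ..., x_n).

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

e = [y - (a + b * x) for x, y in zip(xs, ys)]
dot_ones = sum(e)                              # e . (1, ..., 1)
dot_x = sum(ei * x for ei, x in zip(e, xs))    # e . x
assert abs(dot_ones) < 1e-9 and abs(dot_x) < 1e-9
```

The two zero dot products are exactly the two normal equations from the derivation, restated as geometry.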
Coefficient of determination
You saw in the article on correlation that r^2 — the square of the correlation — has an interpretation as the fraction of variability in y that is "explained" by x. Now you can see exactly what that means.
Define three sums of squares:
- Total sum of squares: \text{TSS} = \sum(y_i - \bar{y})^2
- Regression sum of squares: \text{RSS} = \sum(\hat{y}_i - \bar{y})^2
- Error sum of squares: \text{ESS} = \sum(y_i - \hat{y}_i)^2
After some algebra (using the fact that the residuals are perpendicular to the fitted values — the geometric picture above), you get

$$\text{TSS} = \text{RSS} + \text{ESS}.$$
Total variability splits into "variability captured by the line" plus "variability left over in the residuals." The fraction captured is

$$\frac{\text{RSS}}{\text{TSS}} = r^2.$$
So r^2 is literally the fraction of the variance in y that the linear fit to x accounts for. An r of 0.9 means the line explains 81% of the variability; an r of 0.5 means the line only explains 25% — a huge difference from what the raw correlations suggest.
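The decomposition can be confirmed numerically. A sketch in pure Python on the Example 2 data, where r = 0.9 so the line should account for 81% of the variability:

```python
# Verify TSS = RSS + ESS and RSS / TSS = r^2.

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
fitted = [a + b * x for x in xs]

tss = sum((y - ybar) ** 2 for y in ys)                  # total
rss = sum((yh - ybar) ** 2 for yh in fitted)            # captured by the line
ess = sum((y - yh) ** 2 for y, yh in zip(ys, fitted))   # left in the residuals
assert abs(tss - (rss + ess)) < 1e-9
r_squared = rss / tss    # roughly 0.81
```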
Assumptions and their cost
The least-squares line always exists as a mathematical object. Whether it is a useful summary of the data depends on several assumptions about how the data was generated. The basic ones:
- The true relationship is approximately linear. If the real pattern is a parabola, a straight line fit to it will be misleading — even if the fit is numerically the best possible line. Look at the scatter plot before trusting the line.
- The residuals have roughly constant spread. If the points scatter tightly around the line for small x and widely for large x, the regression is still computable, but predictions near one end will be more reliable than predictions at the other end.
- No influential outliers. A single extreme point can pull the regression line a long way. If the scatter plot reveals one or two points far from the mass, the regression should be recomputed both with and without them to see how much they are driving the fit.
These assumptions are not part of the formula. The formula produces a line for any dataset. The assumptions are what determine whether using the line as a predictive tool is a good idea.
Where this leads next
You now have the machinery to fit a line to data and use it for prediction. The real world does not always hand you data where a straight line is the right model — and even when it does, you often want a more sophisticated analysis.
- Correlation — the measure of strength that pairs with regression. Every regression calculation can be restated in terms of correlations and standard deviations.
- Sampling — if the data is a sample from a population, the regression line you compute is an estimate of the true underlying line. Sampling theory tells you how much the estimate can wobble.
- Least Squares — the deeper theory of minimising squared errors, covering multiple predictors, weighted errors, and non-linear models.
- Linear Algebra Preview — regression as orthogonal projection onto a subspace, which generalises effortlessly to multiple predictors.
- Introduction to Inference — how to test whether a regression slope is "real" or could have arisen by chance from a sample.