In short
The normal distribution N(\mu, \sigma^2) has density f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)} — a symmetric bell curve centred at the mean \mu with spread controlled by the standard deviation \sigma. The standard normal Z \sim N(0, 1) is the special case with \mu = 0 and \sigma = 1. Any normal variable can be converted to standard form via the z-score z = (x - \mu)/\sigma. Roughly 68\% of the probability lies within one \sigma of the mean, 95\% within two, and 99.7\% within three.
You take a state-wide mathematics exam. Afterwards you find out that the mean score was 62 and the standard deviation was 10. Your score is 78. How well did you do?
"Sixteen marks above the mean" is true, but it doesn't quite answer the question. On a test where the spread is tiny — where almost everyone scores between 60 and 65 — a score of 78 is off the charts. On a test where the spread is huge, with students scattered from 30 to 95, a score of 78 is solid but not extraordinary. The same raw score means very different things depending on how the whole distribution is shaped.
There is one shape, though, that keeps showing up in real data. Exam scores across many students. Heights of adults. Errors in repeated measurements of a physical constant. Sums of many small random quantities. They all tend to cluster around an average, they tend to be symmetric, and they taper off at the extremes in the same characteristic way: most of the probability near the middle, a little at each tail, nothing at the very far ends. This shape has a name and a formula, and it is the most important object in all of statistics.
The shape of averages
Here is the observation that makes the normal distribution unavoidable.
Roll one die. The distribution of the face is uniform — each of \{1, 2, 3, 4, 5, 6\} is equally likely. No bell curve anywhere.
Now roll two dice and take the average of the two faces. The possible values are 1, 1.5, 2, \ldots, 6, but they are no longer equally likely. An average of 3.5 can come from many pairs ((1,6), (2,5), (3,4), \ldots); an average of 1 requires both dice to show 1. The distribution has a peak in the middle.
Roll ten dice and take the average. Roll a hundred. Roll a thousand. The shape of the distribution of the average gets smoother and more symmetric, and it starts to look like a bell — with a sharp peak at 3.5 and tails that fall off rapidly on both sides. Try this with anything — not just dice. Flip a coin ten thousand times and count the heads. Measure the same pendulum period a thousand times. Average the heights of a thousand random people. The shape you get always ends up looking like the same curve.
That curve is the normal distribution. The phenomenon is the central limit theorem: the sum (or average) of many independent random contributions tends to be normally distributed, no matter what the original contributions look like. You will meet the theorem properly later, but the upshot is that normal distributions show up in real data because real data is almost always a sum of many small effects — genes plus nutrition plus childhood environment determining height, hundreds of test items determining a total score, a thousand small perturbations determining a measurement error.
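You can watch this happen with a few lines of code. The sketch below (Python, standard library only, with an arbitrary seed for reproducibility) simulates 10,000 averages of 100 dice and checks that the averages centre on 3.5 and that about 68\% of them land within one standard deviation of the centre, the signature of the bell curve you will meet in the 68-95-99.7 rule.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is reproducible

def dice_average(n_dice):
    """Average of n_dice fair six-sided dice."""
    return sum(random.randint(1, 6) for _ in range(n_dice)) / n_dice

# 10,000 averages of 100 dice each: by the central limit theorem, the
# distribution of these averages should look approximately normal.
samples = [dice_average(100) for _ in range(10_000)]

mean = statistics.fmean(samples)
sd = statistics.stdev(samples)
within_one_sd = sum(abs(x - mean) <= sd for x in samples) / len(samples)

print(round(mean, 2))           # close to 3.5
print(round(within_one_sd, 2))  # close to 0.68
```

Single die faces are nowhere near normal, yet the fraction of averages within one standard deviation already matches the normal distribution's 68\% closely.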
The formula
Normal distribution
A continuous random variable X has a normal distribution with mean \mu and variance \sigma^2, written X \sim N(\mu, \sigma^2), if its probability density function is

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}
for all real x. The mean of X is \mu, the variance is \sigma^2, and the standard deviation is \sigma.
Reading the formula. The exponent is a negative squared distance from the mean, scaled by \sigma^2. When x = \mu, the exponent is 0 and the density attains its peak value \frac{1}{\sigma\sqrt{2\pi}}. When x is far from \mu, the exponent is a large negative number and the density is essentially 0. The factor \frac{1}{\sigma\sqrt{2\pi}} out front is the normalization constant — whatever value is needed to make the total area under the curve equal to 1. (The fact that you get exactly \sqrt{2\pi} out of the required integral is a small miracle of calculus, proved in the going-deeper section.)
The two parameters \mu and \sigma do two very different jobs:
- \mu shifts the curve horizontally. Changing \mu slides the peak left or right along the x-axis but does nothing to the shape.
- \sigma stretches or squeezes the curve horizontally while rescaling it vertically. A small \sigma gives a tall, narrow curve — probability crammed near the mean. A large \sigma gives a short, wide curve — probability spread out. The total area always remains 1.
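These two claims, that the peak height is \frac{1}{\sigma\sqrt{2\pi}} and that the total area stays 1 whatever \mu and \sigma are, can be checked numerically. The sketch below codes the density directly from the formula and approximates the area with a crude Riemann sum; the function names are illustrative, not from any library.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), straight from the formula."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak height is 1/(sigma * sqrt(2*pi)): halving sigma doubles the peak.
print(round(normal_pdf(0, 0, 1), 4))    # 0.3989
print(round(normal_pdf(0, 0, 0.5), 4))  # 0.7979

# Total area is 1 regardless of mu and sigma (midpoint Riemann sum).
def area(mu, sigma, lo=-50.0, hi=50.0, steps=200_000):
    dx = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) * dx for i in range(steps))

print(round(area(5, 0.5), 4))  # 1.0
print(round(area(-3, 4), 4))   # 1.0
```

A tall narrow curve and a short wide one enclose exactly the same area; only how the probability is distributed changes.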
The standard normal
Every normal distribution is a shifted and rescaled version of one particular distribution — the one with \mu = 0 and \sigma = 1. That distribution is the standard normal, usually written Z \sim N(0, 1), and it has the simplest possible form of the density:

\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}
(The symbol \varphi is the Greek letter "phi.") Its CDF is written \Phi(z), and it is what probability tables and calculator functions compute. There is no elementary closed form for \Phi — it cannot be written in terms of polynomials, roots, exponentials, or trig functions — so the values are either looked up in a table or computed numerically.
Three properties of the standard normal are worth internalising.
Symmetry. Because \varphi(-z) = \varphi(z), the curve is symmetric around z = 0. This gives you

\Phi(-z) = 1 - \Phi(z)
You only need a table from 0 to \infty; the negative-z values come from the symmetry identity.
Peak at z = 0. The density attains its maximum at z = 0, with peak height \frac{1}{\sqrt{2\pi}} \approx 0.3989. That is a density value, not a probability.
Inflection points at z = \pm 1. The curve bends downward between -1 and 1 and bends upward outside that range. The boundary between the two — where the curve is instantaneously straight — is exactly at z = \pm 1. This is not a coincidence; it is \sigma showing up geometrically.
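All three properties can be verified in a few lines. The sketch below codes \varphi from its formula and uses the identity \varphi''(z) = (z^2 - 1)\varphi(z), which follows by differentiating twice, to confirm that the second derivative changes sign at z = 1.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Symmetry: phi(-z) equals phi(z) exactly.
print(phi(-1.3) == phi(1.3))  # True

# Peak height 1/sqrt(2*pi) at z = 0 (a density value, not a probability).
print(round(phi(0), 4))  # 0.3989

# Inflection at z = 1: phi''(z) = (z^2 - 1) * phi(z) is negative just
# inside z = 1 (curve bends downward) and positive just outside.
def phi_second(z):
    return (z * z - 1) * phi(z)

print(phi_second(0.99) < 0 < phi_second(1.01))  # True
```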
Z-scores: converting any normal to the standard normal
The key trick for working with normals is that every normal variable can be turned into a standard normal with one subtraction and one division. If X \sim N(\mu, \sigma^2), then

Z = \frac{X - \mu}{\sigma}
has the standard normal distribution N(0, 1). The number z = \frac{x - \mu}{\sigma} is called the z-score of x — it says how many standard deviations x lies above (positive) or below (negative) the mean.
Why does this work? Subtracting \mu shifts the distribution so its mean is at 0. Dividing by \sigma rescales the distribution horizontally so its standard deviation is 1. Crucially, a shifted and rescaled normal variable is still normal, so the result is not merely some distribution with mean 0 and standard deviation 1 but exactly N(0, 1).
Converting to z-scores is what lets you answer the exam question from the opening. Your score was 78 on an exam with \mu = 62 and \sigma = 10. Your z-score is

z = \frac{78 - 62}{10} = \frac{16}{10} = 1.6
Your score is 1.6 standard deviations above the mean. Now you can look up \Phi(1.6) \approx 0.9452 in a standard normal table, and you learn that roughly 94.5\% of all scores are at or below yours — meaning you are in the top 5.5\%.
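The table lookup can be replaced by a short computation. A standard identity relates \Phi to the error function, \Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right), and Python's math.erf evaluates it; the sketch below reproduces the exam calculation.

```python
import math

def z_score(x, mu, sigma):
    """How many standard deviations x lies above or below the mean."""
    return (x - mu) / sigma

def Phi(z):
    """Standard normal CDF via the identity Phi(z) = (1 + erf(z/sqrt(2))) / 2."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

z = z_score(78, 62, 10)
print(z)                 # 1.6
print(round(Phi(z), 4))  # 0.9452 -- about 94.5% of scores are at or below 78
```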
The 68-95-99.7 rule
A fact worth knowing by heart for any normal distribution, not just the standard normal:

- About 68\% of the probability lies within one standard deviation of the mean: P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.68.
- About 95\% lies within two standard deviations: P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.95.
- About 99.7\% lies within three standard deviations: P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.997.
These numbers come from \Phi(1) - \Phi(-1), \Phi(2) - \Phi(-2), and \Phi(3) - \Phi(-3). They are worth memorising because they let you estimate normal probabilities without a table at all.
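As a quick sanity check, the three probabilities fall out of \Phi directly; the sketch below computes \Phi(k) - \Phi(-k) for k = 1, 2, 3 using the error-function identity.

```python
import math

def Phi(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

for k in (1, 2, 3):
    prob = Phi(k) - Phi(-k)
    print(k, round(prob, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```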
Two worked examples
Example 1: heights of adult men
Heights of adult men in a population are approximately normally distributed with mean \mu = 170 cm and standard deviation \sigma = 7 cm. What is the probability that a randomly chosen man is taller than 180 cm? What height would place someone in the top 10\%?
Step 1. Translate to a z-score. For X = 180:

z = \frac{180 - 170}{7} = \frac{10}{7} \approx 1.43
Why: the normal-distribution machinery all runs on z-scores, so the first move is always "convert to standard form."
Step 2. Compute P(X > 180) = P(Z > 1.43). Use the standard normal table:

P(Z > 1.43) = 1 - \Phi(1.43) \approx 1 - 0.9236 = 0.0764
About 7.6\% of the population is taller than 180 cm.
Why: the CDF \Phi(1.43) is the probability of being at most 1.43 standard deviations above the mean; subtracting from 1 gives the probability of being above that threshold.
Step 3. For the top-10\% question, find z such that \Phi(z) = 0.90. From the table, z \approx 1.28.
Why: you want the cut-off that leaves exactly 10\% of the probability to the right, which is the same as placing 90\% to the left.
Step 4. Convert back to a height. If z = 1.28, then

x = \mu + z\sigma = 170 + 1.28 \times 7 \approx 179 \text{ cm}
A man taller than about 179 cm is in the top 10\%.
Result: P(X > 180) \approx 0.076, and the 90th percentile is about 179 cm.
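Both answers can be checked without a table. Python's standard library ships statistics.NormalDist (Python 3.8+), whose cdf and inv_cdf methods do exactly the forward and inverse lookups used above; the small difference from the table answer comes from using the exact z = 10/7 rather than the rounded 1.43.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=7)

# P(X > 180): one minus the CDF at 180.
p_taller = 1 - heights.cdf(180)
print(round(p_taller, 3))  # 0.077

# 90th percentile: the height with 90% of the probability below it.
cutoff = heights.inv_cdf(0.90)
print(round(cutoff, 1))  # 179.0
```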
Example 2: measurement errors
A lab measures the weight of a chemical sample. The measurements are normally distributed around the true weight with standard deviation \sigma = 0.4 grams. How likely is it that a single measurement is within 0.5 grams of the true weight? Within 1 gram?
Step 1. Set up the problem. Let X be the measurement and \mu be the true weight. The measurement error X - \mu is N(0, \sigma^2) with \sigma = 0.4.
Step 2. Convert the first condition to z-scores. "Within 0.5 grams of the true weight" means |X - \mu| \leq 0.5, which in standard form is |Z| \leq \frac{0.5}{0.4} = 1.25.
Why: dividing by \sigma = 0.4 converts a tolerance in grams into a tolerance in standard-deviation units, which is the currency the z-table uses.
Step 3. Compute the probability.

P(|Z| \leq 1.25) = \Phi(1.25) - \Phi(-1.25) = 2\Phi(1.25) - 1 \approx 2(0.8944) - 1 = 0.7888
About 78.9\% of measurements will be within 0.5 grams of the true weight.
Why: for any symmetric interval [-z, z] around zero, the probability is 2\Phi(z) - 1 — a shortcut worth remembering.
Step 4. Repeat for the second condition. "Within 1 gram" becomes |Z| \leq \frac{1}{0.4} = 2.5:

P(|Z| \leq 2.5) = 2\Phi(2.5) - 1 \approx 2(0.9938) - 1 = 0.9876
About 98.8\% of measurements fall within 1 gram.
Result: A single measurement is within 0.5 g of the true weight 78.9\% of the time and within 1 g about 98.8\% of the time.
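The symmetric-interval shortcut 2\Phi(z) - 1 is easy to package as a function; the sketch below reproduces both answers (the helper name within is illustrative, not a library call).

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return (1 + erf(z / sqrt(2))) / 2

sigma = 0.4  # standard deviation of the measurement error, in grams

def within(tolerance):
    """P(|X - mu| <= tolerance) = 2 * Phi(tolerance / sigma) - 1."""
    return 2 * Phi(tolerance / sigma) - 1

print(round(within(0.5), 3))  # 0.789
print(round(within(1.0), 3))  # 0.988
```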
Common confusions
- "All bell-shaped curves are normal." No. Many distributions are unimodal and roughly symmetric without being normal — the t-distribution, for example, has heavier tails, and the logistic distribution has a slightly different shape. "Normal" specifically means the e^{-(x - \mu)^2 / (2\sigma^2)} formula.
- "The peak of the normal distribution is 1." The peak is \frac{1}{\sigma\sqrt{2\pi}}, which depends on \sigma. For the standard normal it is about 0.399. The peak of a density is not usually 1 — only the total area is required to be 1.
- "A score of +2 standard deviations is twice as good as +1 standard deviation." Not in any linear sense. A z-score of +1 puts you above roughly 84\% of the population; +2 puts you above roughly 97.7\%. The jump from +1 to +2 is much more extreme than it looks because the tails thin out rapidly.
- "The normal distribution is a good model for anything." It is a good model for averages of many independent contributions — that is the central limit theorem's domain. It is a bad model for income (which is right-skewed), for wait times (which are usually exponential), and for counts of rare events (which are Poisson). Match the model to the mechanism.
- "P(X = 170.0000) on a normal distribution is a real, positive number." It is exactly 0. Normal distributions are continuous, so single points carry no probability. You always need an interval.
Going deeper
If you can compute normal probabilities using z-scores and the 68-95-99.7 rule, you have the working toolkit. The rest is for readers who want to see why the normalization constant is \sqrt{2\pi}, how the central limit theorem works, and a few useful identities for combining normal variables.
Why the normalization is \sqrt{2\pi}
The claim is that

\int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}
This is not something you can prove with elementary one-variable integration. The standard trick — due to Poisson — is to compute the square of the integral and then switch to polar coordinates.
Let I = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx. Then

I^2 = \left(\int_{-\infty}^{\infty} e^{-x^2/2}\, dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2/2}\, dy\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\, dx\, dy
The two-dimensional integrand depends only on x^2 + y^2 = r^2, so polar coordinates are natural. Let x = r\cos\theta and y = r\sin\theta. The area element becomes dx\,dy = r\,dr\,d\theta, and the double integral becomes

I^2 = \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta
The \theta-integral contributes a factor of 2\pi. The r-integral is elementary — substitute u = r^2/2, so du = r\, dr:

\int_0^{\infty} e^{-r^2/2}\, r\, dr = \int_0^{\infty} e^{-u}\, du = 1
So I^2 = 2\pi, which gives I = \sqrt{2\pi}. Dividing by \sqrt{2\pi} is what turns e^{-x^2/2} into a probability density. The extra \sigma out front (for general \mu and \sigma) comes from the chain rule when you substitute u = (x - \mu)/\sigma — the du = dx/\sigma drags down a factor of \sigma, which is why the full normalization constant is \frac{1}{\sigma\sqrt{2\pi}}.
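The polar-coordinates argument is exact, but it is reassuring to see the number come out of a direct numerical integration. The sketch below approximates the integral with a midpoint Riemann sum over [-10, 10]; the tails beyond \pm 10 contribute a negligible amount.

```python
import math

# Midpoint-rule approximation of the integral of e^{-x^2/2} over [-10, 10].
steps = 100_000
lo, hi = -10.0, 10.0
dx = (hi - lo) / steps
integral = sum(math.exp(-(lo + (i + 0.5) * dx) ** 2 / 2) * dx for i in range(steps))

print(round(integral, 6))                # ~2.506628
print(round(math.sqrt(2 * math.pi), 6))  # 2.506628
```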
Sums of independent normals
If X \sim N(\mu_X, \sigma_X^2) and Y \sim N(\mu_Y, \sigma_Y^2) are independent, then their sum is also normal:

X + Y \sim N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2)
Note that the variances add, not the standard deviations. If you add ten independent N(0, 1) variables, the sum is N(0, 10), with standard deviation \sqrt{10} \approx 3.16 — not 10. This is why the standard deviation of an average of n variables is \sigma/\sqrt{n}, which shrinks much more slowly than you might first guess.
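A simulation makes the variances-add rule concrete. The sketch below (seeded arbitrarily for reproducibility) draws many sums of ten independent N(0, 1) variables and checks that their standard deviation is near \sqrt{10}, not 10.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is reproducible

# Sum of ten independent N(0, 1) draws, repeated 50,000 times.
sums = [sum(random.gauss(0, 1) for _ in range(10)) for _ in range(50_000)]

sd = statistics.stdev(sums)
print(round(sd, 2))  # near sqrt(10) ~ 3.16, nowhere near 10
```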
The central limit theorem
Informally: let X_1, X_2, \ldots, X_n be independent random variables from some distribution with mean \mu and finite variance \sigma^2. Let \bar{X}_n be the average of the first n. Then

\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \longrightarrow N(0, 1)
in distribution as n \to \infty. In plain language: the z-score of the sample mean converges to a standard normal no matter what the original distribution looked like, as long as it has a finite variance. This is the deep fact that explains why the normal distribution is everywhere in real data: real measurements are usually averages of many small independent effects, and averages of many independent things are approximately normal.
The theorem has a concrete consequence for the binomial distribution. A \text{Binomial}(n, p) variable is a sum of n independent Bernoulli indicators, so for large n it is approximately N(np, np(1-p)). That approximation is what lets you compute "what is the probability of at least 520 heads in 1000 tosses" without summing hundreds of binomial terms — you just use the normal table.
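The coin-toss question makes a good test of the approximation. The sketch below compares the normal approximation N(np, np(1-p)) against the exact binomial tail; a continuity correction (using 519.5 as the cutoff for "at least 520") is a standard refinement, since the binomial is discrete and the normal is continuous.

```python
from statistics import NormalDist
from math import comb

n, p = 1000, 0.5

# Normal approximation: Binomial(n, p) is roughly N(np, np(1-p)).
# Continuity correction: "at least 520 heads" uses the cutoff 519.5.
approx = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5)
p_approx = 1 - approx.cdf(519.5)

# Exact binomial tail, for comparison (481 terms, exact integer arithmetic).
p_exact = sum(comb(n, k) for k in range(520, n + 1)) / 2 ** n

print(round(p_approx, 4))
print(round(p_exact, 4))
```

The two numbers agree to about three decimal places, which is why the normal table is a practical substitute for summing hundreds of binomial terms.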
A note on history
The normal distribution was known to Abraham de Moivre in the 1730s as the limit of binomial distributions, was used by Gauss in the 1800s for the distribution of measurement errors, and was put on its modern statistical footing by Laplace and Quetelet. What made it unavoidable, though, is the central limit theorem: the realisation that the bell curve is not just one distribution among many but the distribution that sums of independent quantities always gravitate toward. That universality is what made it the object at the centre of modern statistics.
Where this leads next
The normal distribution is the gateway to all of inferential statistics. Once you can compute probabilities on a bell curve, you can build confidence intervals, run hypothesis tests, and make quantitative claims about populations from samples.
- Continuous Random Variables — the general framework of densities and CDFs that the normal distribution fits inside.
- Binomial Distribution — the discrete distribution that the normal approximates for large n, via the central limit theorem.
- Measures of Dispersion — where standard deviation comes from and how it is computed for data, as opposed to for a theoretical distribution.
- Introduction to Inference — using the normal distribution to estimate population parameters from samples.