In short

Statistical inference uses sample data to draw conclusions about a population. A confidence interval gives a range of plausible values for a population parameter. A hypothesis test decides whether data is consistent with a specific claim about the population. The p-value measures how surprising the observed data would be if that claim were true.

A pharmaceutical company develops a new drug to reduce blood pressure. They test it on 200 patients. The patients' average blood pressure drops by 8 mmHg. The company claims the drug works.

But here is the catch. Even a sugar pill — a placebo with no active ingredient — would cause some change in the average, just by random variation. Maybe the 200 patients happened to include more people whose blood pressure was going to drop anyway. Maybe 8 mmHg is exactly the kind of fluctuation you would expect from pure chance.

So the real question is not "did the average go down?" — it obviously did, you measured it. The real question is: is the drop large enough that it probably was not caused by chance alone?

This is the question that statistical inference answers. You have a sample. You have a number computed from that sample. You want to know what that number tells you about the population — and how confident you can be.

Two tools do this work. The first, the confidence interval, says "the true value of the parameter probably lies somewhere in this range." The second, the hypothesis test, says "the data is (or is not) consistent with the claim that nothing is happening." They are two sides of the same coin, built from the same mathematics, and both rest on the sampling distribution you met in the article on Sampling.

Confidence intervals: a range, not a point

You survey 100 randomly chosen households in a city and find that their average monthly electricity bill is ₹2,150. That is a single number — a point estimate of the true average. It is your best guess.

But how good is this guess? The true average might be ₹2,100 or ₹2,200 or ₹2,300. A single number gives you no sense of how far off you might be.

A confidence interval fixes this. Instead of reporting a single number, you report a range:

"The average monthly electricity bill is between ₹2,010 and ₹2,290, with 95% confidence."

What does "95% confidence" mean? It does not mean "there is a 95% probability that the true average is in this interval." (That is a common misreading, and we will come back to why it is wrong.) What it means is:

If you repeated this entire procedure — draw a random sample of 100 households, compute the interval — over and over, then 95% of the intervals you construct would contain the true population mean.

The interval is a net, and the method of casting the net catches the true value 95% of the time. Any single interval either contains the true value or it does not — you do not know which — but the procedure has a 95% success rate.
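The claim that the procedure succeeds 95% of the time can be checked by simulation. The sketch below uses an invented population (mean ₹2,150, standard deviation ₹400 — these numbers are assumptions for illustration), draws 1,000 samples of 100 households, builds an interval from each, and counts how often the true mean is captured:

```python
import math
import random

def confidence_interval(sample, z=1.96):
    """95% interval for the mean, using the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

random.seed(0)
TRUE_MU, TRUE_SIGMA = 2150, 400   # invented population values

trials, hits = 1000, 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(100)]
    lo, hi = confidence_interval(sample)
    hits += lo <= TRUE_MU <= hi   # did this net catch the true mean?

print(hits / trials)   # close to 0.95
```

On a typical run the coverage lands near 0.95 — about one miss in twenty, exactly the success rate the procedure promises.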

Twenty samples, twenty 95% confidence intervals. The dashed red line is the true population mean. Nineteen of the twenty intervals (black) capture it. One (red) does not — it happened to come from a sample whose mean was unusually far from the truth. A 95% confidence level means this kind of miss happens about 1 time in 20.

Building a confidence interval

The formula for a 95% confidence interval for the population mean \mu is:

\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}

where \bar{x} is the sample mean, \sigma is the population standard deviation (or the sample standard deviation s if \sigma is unknown), and n is the sample size.

Where does the 1.96 come from? From the standard normal distribution. In a normal distribution, 95% of the area lies within 1.96 standard deviations of the mean. Since the sampling distribution of \bar{x} is approximately normal with standard deviation \sigma/\sqrt{n}, the interval \bar{x} \pm 1.96 \cdot \sigma/\sqrt{n} captures \mu 95% of the time.

For a 99% confidence interval, you use 2.576 instead of 1.96. For 90%, you use 1.645. The wider the interval, the more confident you can be that it contains \mu — but the less precise the estimate.

Confidence interval for the mean

A (1 - \alpha) \times 100\% confidence interval for the population mean \mu is

\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}

where z_{\alpha/2} is the critical value from the standard normal distribution such that P(-z_{\alpha/2} \leq Z \leq z_{\alpha/2}) = 1 - \alpha.
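The critical values need not be memorised: Python's standard-library statistics.NormalDist exposes the inverse CDF, so z_{\alpha/2} can be computed for any confidence level. A small sketch:

```python
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z_{alpha/2} for the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_ci(xbar, sigma, n, confidence=0.95):
    """(1 - alpha) confidence interval for the population mean."""
    half = z_critical(confidence) * sigma / n ** 0.5
    return xbar - half, xbar + half

print(round(z_critical(0.95), 3))   # → 1.96
print(round(z_critical(0.99), 3))   # → 2.576
print(round(z_critical(0.90), 3))   # → 1.645
```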

Hypothesis testing: making a decision

A confidence interval gives a range. Sometimes you need a yes-or-no decision. Does the drug work? Is the coin fair? Is this batch of resistors within specification?

A hypothesis test formalises this kind of decision. You start with two competing claims:

- The null hypothesis, H_0: nothing is happening. The drug has no effect; the coin is fair; the batch is within specification.
- The alternative hypothesis, H_1: something is happening. The effect is real.

The logic of a hypothesis test is proof by contradiction. You assume H_0 is true. You compute how likely it is that you would see data as extreme as what you actually observed, under that assumption. If the answer is "very unlikely," you reject H_0 in favour of H_1. If the answer is "not that unusual," you fail to reject H_0.

The logic of a hypothesis test. You assume nothing is happening (the null hypothesis). You ask how surprising the data is under that assumption. If the data is too surprising, you reject the null.

The test statistic

To measure "how surprising," you compute a test statistic — a single number that summarises how far the observed data is from what H_0 predicts.

For testing whether a population mean \mu equals a claimed value \mu_0, the test statistic is:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

This is the number of standard errors by which the sample mean \bar{x} differs from \mu_0. If H_0 is true, this quantity follows a standard normal distribution — so values near 0 are expected, and values far from 0 are surprising.
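As a sketch, the test statistic is a one-line computation (the numbers below are hypothetical, chosen only to illustrate the formula):

```python
def z_statistic(xbar, mu0, sigma, n):
    """How many standard errors the sample mean lies from the claimed mean mu0."""
    return (xbar - mu0) / (sigma / n ** 0.5)

# hypothetical: sample mean 42, claimed mean 40, sigma 16, n = 64
print(z_statistic(42, 40, 16, 64))  # → 1.0
```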

The p-value: how surprising is the data?

The p-value is the probability of observing a test statistic at least as extreme as the one you actually got, assuming the null hypothesis is true.

p-value

The p-value is

p = P(\text{test statistic at least as extreme as observed} \mid H_0 \text{ is true})

A small p-value means the observed data is unlikely under H_0. A large p-value means the data is consistent with H_0.

The decision rule is simple. You choose a threshold — called the significance level \alpha, typically 0.05 — before looking at the data. If p \leq \alpha, you reject H_0. If p > \alpha, you fail to reject H_0.

Why 0.05? It is a convention, not a law of nature. It means you are willing to accept a 5% chance of rejecting H_0 when it is actually true (a Type I error — a false alarm). Some fields use stricter thresholds: particle physics uses \alpha \approx 0.0000003 (the "5-sigma" standard), because claiming to have discovered a new particle when you have not is catastrophically expensive.
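The tail area can be computed from the error function in Python's math module, so the whole decision rule fits in a few lines. A sketch, using the conventional \alpha = 0.05 and a hypothetical observed statistic:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z):
    """p-value for a two-sided z-test: both tails beyond |z|."""
    return 2 * (1 - normal_cdf(abs(z)))

alpha = 0.05                     # chosen before looking at the data
z_obs = 2.0                      # hypothetical observed test statistic
p = two_sided_p(z_obs)
print(round(p, 4), p <= alpha)   # → 0.0455 True
```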

The p-value is the shaded area: the probability, under the null hypothesis, of getting a test statistic as extreme as or more extreme than the one you observed. A small shaded area means the data is hard to explain by chance alone.

What the p-value is NOT

The p-value is one of the most misunderstood numbers in science. Here are the things it does not mean:

- It is not the probability that H_0 is true. It is computed assuming H_0 is true, so it cannot measure the probability of that assumption.
- It is not the probability that the result happened "by chance."
- It is not the probability that a repeat experiment would give the same result.
- A small p-value does not mean the effect is large or important; with a big enough sample, even a trivial effect produces a tiny p-value.

The distinction is subtle but critical. A p-value lives in the world where H_0 is true and asks about the data. It does not live in the real world and ask about H_0.

Worked examples

Example 1: Confidence interval for average commute time

A city planner surveys 64 randomly selected commuters and finds a mean commute time of \bar{x} = 42 minutes with a sample standard deviation of s = 16 minutes. Construct a 95% confidence interval for the true mean commute time.

Step 1. Identify the quantities: \bar{x} = 42, s = 16, n = 64, confidence level = 95%.

Why: since \sigma is unknown, use the sample standard deviation s as an estimate. With n = 64, the sample is large enough that this substitution (and the normal approximation) is reliable.

Step 2. Compute the standard error.

\text{SE} = \frac{s}{\sqrt{n}} = \frac{16}{\sqrt{64}} = \frac{16}{8} = 2 \text{ minutes}

Why: the standard error measures the typical distance between a sample mean and the population mean.

Step 3. Compute the margin of error. For 95% confidence, the critical value is z_{0.025} = 1.96.

\text{ME} = 1.96 \times 2 = 3.92 \text{ minutes}

Why: 1.96 standard errors on each side captures 95% of the sampling distribution.

Step 4. Construct the interval.

42 - 3.92 = 38.08 \quad \text{to} \quad 42 + 3.92 = 45.92

Result: The 95% confidence interval for the true mean commute time is [38.08, 45.92] minutes.

The 95% confidence interval stretches 3.92 minutes on each side of the sample mean. The true population mean is somewhere in this range — probably. In 95 out of 100 such samples, the interval would capture it.

The interval says: a commute time between about 38 and 46 minutes is plausible for the true city-wide average. The picture shows how the margin of error creates a range centred on the sample mean.
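The four steps above can be checked in a few lines of Python:

```python
xbar, s, n = 42, 16, 64

se = s / n ** 0.5        # Step 2: standard error, 16 / 8 = 2
me = 1.96 * se           # Step 3: margin of error, 3.92
lo, hi = xbar - me, xbar + me   # Step 4: the interval

print(round(lo, 2), round(hi, 2))  # → 38.08 45.92
```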

Example 2: Testing whether a coin is fair

You flip a coin 100 times and get 60 heads. Is the coin fair, or is it biased?

Step 1. Set up the hypotheses.

H_0: p = 0.5 \quad (\text{fair coin})
H_1: p \neq 0.5 \quad (\text{biased coin})

Why: the null hypothesis claims nothing interesting is happening — the coin is fair. The alternative claims it is biased in either direction (two-sided test).

Step 2. Compute the test statistic. Under H_0, the number of heads in 100 flips has mean np_0 = 50 and standard deviation \sqrt{np_0(1-p_0)} = \sqrt{25} = 5. The observed proportion is \hat{p} = 60/100 = 0.6.

z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.6 - 0.5}{\sqrt{0.5 \times 0.5 / 100}} = \frac{0.1}{0.05} = 2.0

Why: the test statistic measures how many standard errors the observed proportion is from the claimed value. A z of 2.0 means the observed result is 2 standard errors above the null hypothesis prediction.

Step 3. Compute the p-value. For a two-sided test with z = 2.0:

p = 2 \times P(Z > 2.0) = 2 \times 0.0228 = 0.0456

Why: "two-sided" means extreme results in either direction count, so you double the one-tail area. From the standard normal table, P(Z > 2.0) = 0.0228.

Step 4. Compare with \alpha = 0.05. Since p = 0.0456 < 0.05, you reject H_0.

Result: At the 5% significance level, you reject the null hypothesis. The data provides sufficient evidence that the coin is biased.

The two shaded tails represent the p-value: the total probability, under the fair-coin assumption, of getting a result at least as extreme as $z = 2.0$ in either direction. The combined area is 0.0456 — just barely below the 0.05 threshold, so you reject the null hypothesis.

The result is borderline. With 60 heads out of 100, the evidence against a fair coin is statistically significant at the 5% level, but not at the 1% level (where you would need p < 0.01). A single hypothesis test does not prove the coin is biased — it says the data is hard to explain if the coin were fair.
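The coin test can be run end to end in Python, using the error function for the normal CDF (the small difference from the quoted 0.0456 is rounding in the standard normal table):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

heads, n, p0 = 60, 100, 0.5
p_hat = heads / n
se = math.sqrt(p0 * (1 - p0) / n)         # 0.05, the null standard error
z = (p_hat - p0) / se                     # ≈ 2.0
p_value = 2 * (1 - normal_cdf(abs(z)))    # both tails

print(round(z, 2), round(p_value, 4))     # → 2.0 0.0455
```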

Common confusions

- "There is a 95% probability that \mu is in this interval." No: \mu is a fixed number, and any given interval either contains it or not. The 95% describes the long-run success rate of the procedure, not this particular interval.
- "Failing to reject H_0 proves H_0 is true." No: it only means the data is not surprising enough under H_0. The test may simply lack the power to detect a real effect.
- "p = 0.04 means there is a 4% chance the null hypothesis is true." No: the p-value assumes H_0 is true and asks about the data, not the other way round.
- "Statistically significant means practically important." No: with a large enough sample, a tiny, unimportant effect can be statistically significant.

Going deeper

If you came here to understand what confidence intervals and hypothesis tests mean at a conceptual level, you have it — you can stop here. The rest is for readers who want to see the connection between the two, the error types, and the mathematical framework that unifies everything.

The duality between confidence intervals and hypothesis tests

Confidence intervals and hypothesis tests are not separate tools — they are the same tool viewed from two angles.

A 95% confidence interval for \mu is the set of all values \mu_0 that you would not reject in a two-sided hypothesis test at the \alpha = 0.05 level. If \mu_0 is inside the interval, then |\bar{x} - \mu_0| / (\sigma/\sqrt{n}) < 1.96, which means p > 0.05, which means you fail to reject. If \mu_0 is outside the interval, then |\bar{x} - \mu_0| / (\sigma/\sqrt{n}) > 1.96, which means p < 0.05, which means you reject.

So a confidence interval is a visual hypothesis test: every value inside the interval is "not rejected," and every value outside is "rejected."
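This duality can be verified directly. Using the commute-time numbers from Example 1, every candidate value \mu_0 inside the interval survives the two-sided test, and every value outside is rejected (a sketch):

```python
from statistics import NormalDist

# Example 1 numbers: commute times
xbar, sigma, n = 42, 16, 64
se = sigma / n ** 0.5
z_crit = NormalDist().inv_cdf(0.975)             # ≈ 1.96
lo, hi = xbar - z_crit * se, xbar + z_crit * se  # ≈ [38.08, 45.92]

def rejected(mu0):
    """Two-sided z-test of H0: mu = mu0 at alpha = 0.05."""
    z = abs(xbar - mu0) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return p < 0.05

for mu0 in (39, 41, 44, 46):
    # inside the interval <=> not rejected
    print(mu0, lo <= mu0 <= hi, rejected(mu0))
```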

Type I and Type II errors

Two kinds of mistakes are possible:

|                    | H_0 is actually true          | H_0 is actually false          |
|--------------------|-------------------------------|--------------------------------|
| Reject H_0         | Type I error (false positive) | Correct decision               |
| Fail to reject H_0 | Correct decision              | Type II error (false negative) |

The significance level \alpha controls the Type I error rate: the probability of rejecting H_0 when it is true. The power of a test, 1 - \beta, is the probability of correctly rejecting H_0 when it is false. You want \alpha small (few false alarms) and power high (you detect real effects). These two goals pull in opposite directions — reducing \alpha reduces power, and vice versa — which is why choosing \alpha is a judgment call, not a mathematical derivation.

The role of sample size in power

Power increases with sample size. A sample of 1,000 can detect a small effect that a sample of 50 would miss entirely. This is why clinical trials and large surveys specify a minimum sample size before collecting data: they compute how many observations are needed to have, say, 80% power to detect an effect of a given size. The sample size calculation uses the same standard error formula from the sampling article: \text{SE} = \sigma / \sqrt{n}.
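The standard sample size calculation for a two-sided one-sample z-test can be sketched as follows. The planning numbers here are invented for illustration (detecting a 4 mmHg drop when \sigma = 15 mmHg, echoing the drug example):

```python
import math
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n giving the stated power to detect a mean shift of delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
    z_b = NormalDist().inv_cdf(power)           # ≈ 0.84
    return math.ceil(((z_a + z_b) * sigma / delta) ** 2)

# invented planning numbers: detect a 4 mmHg drop when sigma = 15 mmHg
print(required_n(4, 15))   # → 111
```

Note how the formula inverts the standard error: halving the detectable effect delta quadruples the required sample size.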

Where this leads next

You now have the conceptual foundation of statistical inference. The next ideas build specific tools on this foundation: