In short
Statistical inference uses sample data to make conclusions about a population. A confidence interval gives a range of plausible values for a population parameter. A hypothesis test decides whether data is consistent with a specific claim about the population. The p-value measures how surprising the observed data would be if that claim were true.
A pharmaceutical company develops a new drug to reduce blood pressure. They test it on 200 patients. The patients' average blood pressure drops by 8 mmHg. The company claims the drug works.
But here is the catch. Even a sugar pill — a placebo with no active ingredient — would cause some change in the average, just by random variation. Maybe the 200 patients happened to include more people whose blood pressure was going to drop anyway. Maybe 8 mmHg is exactly the kind of fluctuation you would expect from pure chance.
So the real question is not "did the average go down?" — it obviously did, you measured it. The real question is: is the drop large enough that it probably was not caused by chance alone?
This is the question that statistical inference answers. You have a sample. You have a number computed from that sample. You want to know what that number tells you about the population — and how confident you can be.
Two tools do this work. The first, the confidence interval, says "the true value of the parameter probably lies somewhere in this range." The second, the hypothesis test, says "the data is (or is not) consistent with the claim that nothing is happening." They are two sides of the same coin, built from the same mathematics, and both rest on the sampling distribution you met in the article on Sampling.
Confidence intervals: a range, not a point
You survey 100 randomly chosen households in a city and find that their average monthly electricity bill is ₹2,150. That is a single number — a point estimate of the true average. It is your best guess.
But how good is this guess? The true average might be ₹2,100 or ₹2,200 or ₹2,300. A single number gives you no sense of how far off you might be.
A confidence interval fixes this. Instead of reporting a single number, you report a range:
"The average monthly electricity bill is between ₹2,010 and ₹2,290, with 95% confidence."
What does "95% confidence" mean? It does not mean "there is a 95% probability that the true average is in this interval." (That is a common misreading, and we will come back to why it is wrong.) What it means is:
If you repeated this entire procedure — draw a random sample of 100 households, compute the interval — over and over, then 95% of the intervals you construct would contain the true population mean.
The interval is a net, and the method of casting the net catches the true value 95% of the time. Any single interval either contains the true value or it does not — you do not know which — but the procedure has a 95% success rate.
Building a confidence interval
The formula for a 95% confidence interval for the population mean \mu is:

\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}

where \bar{x} is the sample mean, \sigma is the population standard deviation (or the sample standard deviation s if \sigma is unknown), and n is the sample size.
Where does the 1.96 come from? From the standard normal distribution. In a normal distribution, 95% of the area lies within 1.96 standard deviations of the mean. Since the sampling distribution of \bar{x} is approximately normal with standard deviation \sigma/\sqrt{n}, the interval \bar{x} \pm 1.96 \cdot \sigma/\sqrt{n} captures \mu 95% of the time.
For a 99% confidence interval, you use 2.576 instead of 1.96. For 90%, you use 1.645. The wider the interval, the more confident you can be that it contains \mu — but the less precise the estimate.
Confidence interval for the mean
A (1 - \alpha) \times 100\% confidence interval for the population mean \mu is

\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}

where z_{\alpha/2} is the critical value from the standard normal distribution such that P(-z_{\alpha/2} \leq Z \leq z_{\alpha/2}) = 1 - \alpha.
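The formula translates directly into a few lines of code. This is a minimal sketch using only the Python standard library; the function name and the example numbers (mean 100, standard deviation 15, sample size 25) are illustrative, not from the text.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """(1 - alpha) CI for the mean: xbar +/- z_{alpha/2} * sigma / sqrt(n).

    Assumes sigma is known, or n is large enough that s is a good stand-in.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95%, ~2.576 for 99%
    margin = z * sigma / sqrt(n)
    return (xbar - margin, xbar + margin)

# Illustrative numbers: sample mean 100, sigma 15, n = 25
low, high = z_confidence_interval(100, 15, 25)   # roughly (94.12, 105.88)
```

Raising the confidence level widens the interval: `z_confidence_interval(100, 15, 25, confidence=0.99)` returns a wider range, which matches the trade-off described above.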
Hypothesis testing: making a decision
A confidence interval gives a range. Sometimes you need a yes-or-no decision. Does the drug work? Is the coin fair? Is this batch of resistors within specification?
A hypothesis test formalises this kind of decision. You start with two competing claims:
- The null hypothesis H_0: the boring claim. Nothing interesting is happening. The drug has no effect. The coin is fair. The population mean is the standard value.
- The alternative hypothesis H_1 (or H_a): the interesting claim. The drug does reduce blood pressure. The coin is biased. The population mean differs from the standard.
The logic of a hypothesis test is proof by contradiction. You assume H_0 is true. You compute how likely it is that you would see data as extreme as what you actually observed, under that assumption. If the answer is "very unlikely," you reject H_0 in favour of H_1. If the answer is "not that unusual," you fail to reject H_0.
The test statistic
To measure "how surprising," you compute a test statistic — a single number that summarises how far the observed data is from what H_0 predicts.
For testing whether a population mean \mu equals a claimed value \mu_0, the test statistic is:

z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}

This is the number of standard errors by which the sample mean \bar{x} differs from \mu_0. If H_0 is true, this quantity follows a standard normal distribution — so values near 0 are expected, and values far from 0 are surprising.
The p-value: how surprising is the data?
The p-value is the probability of observing a test statistic at least as extreme as the one you actually got, assuming the null hypothesis is true.
p-value
The p-value is

p = P(\text{test statistic at least as extreme as the observed value} \mid H_0 \text{ is true})
A small p-value means the observed data is unlikely under H_0. A large p-value means the data is consistent with H_0.
The decision rule is simple. You choose a threshold — called the significance level \alpha, typically 0.05 — before looking at the data. If p \leq \alpha, you reject H_0. If p > \alpha, you fail to reject H_0.
Why 0.05? It is a convention, not a law of nature. It means you are willing to accept a 5% chance of rejecting H_0 when it is actually true (a Type I error — a false alarm). Some fields use stricter thresholds: particle physics uses \alpha \approx 0.0000003 (the "5-sigma" standard), because claiming to have discovered a new particle when you have not is catastrophically expensive.
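For a z test statistic, the two-sided p-value and the decision rule can be sketched in a few lines of standard-library Python (the function name is illustrative):

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """p = 2 * P(Z >= |z|) for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The decision rule: choose alpha BEFORE seeing the data, then compare
alpha = 0.05
p = two_sided_p_value(1.5)
decision = "reject H0" if p <= alpha else "fail to reject H0"
```

Note that z = 1.96 sits exactly at the 5% boundary: `two_sided_p_value(1.96)` is approximately 0.05, which is why 1.96 is the critical value for a two-sided test at \alpha = 0.05.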
What the p-value is NOT
The p-value is one of the most misunderstood numbers in science. Here are the things it does not mean:
- It is not the probability that H_0 is true. The p-value is computed assuming H_0 is true. It cannot tell you the probability of something it already assumed.
- It is not the probability that the result happened by chance. It is the probability of data this extreme or more extreme under H_0.
- A p-value of 0.03 does not mean "there is only a 3% chance the drug does not work." It means "if the drug truly did not work, there is only a 3% chance of seeing data this extreme."
The distinction is subtle but critical. A p-value lives in the world where H_0 is true and asks about the data. It does not live in the real world and ask about H_0.
Worked examples
Example 1: Confidence interval for average commute time
A city planner surveys 64 randomly selected commuters and finds a mean commute time of \bar{x} = 42 minutes with a sample standard deviation of s = 16 minutes. Construct a 95% confidence interval for the true mean commute time.
Step 1. Identify the quantities: \bar{x} = 42, s = 16, n = 64, confidence level = 95%.
Why: since \sigma is unknown, use the sample standard deviation s as an estimate. With n = 64, this is reliable.
Step 2. Compute the standard error.

\text{SE} = \frac{s}{\sqrt{n}} = \frac{16}{\sqrt{64}} = \frac{16}{8} = 2

Why: the standard error measures the typical distance between a sample mean and the population mean.
Step 3. Compute the margin of error. For 95% confidence, the critical value is z_{0.025} = 1.96.

\text{ME} = 1.96 \times \text{SE} = 1.96 \times 2 = 3.92

Why: 1.96 standard errors on each side captures 95% of the sampling distribution.
Step 4. Construct the interval.

\bar{x} \pm \text{ME} = 42 \pm 3.92 = [38.08, 45.92]
Result: The 95% confidence interval for the true mean commute time is [38.08, 45.92] minutes.
The interval says: a commute time between about 38 and 46 minutes is plausible for the true city-wide average. The picture shows how the margin of error creates a range centred on the sample mean.
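The arithmetic in the four steps above can be checked directly. A minimal sketch, using the example's numbers:

```python
from math import sqrt

# Reproduce Example 1 step by step
xbar, s, n = 42, 16, 64
se = s / sqrt(n)              # standard error: 16 / 8 = 2.0
margin = 1.96 * se            # margin of error: 3.92
interval = (xbar - margin, xbar + margin)  # (38.08, 45.92)
```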
Example 2: Testing whether a coin is fair
You flip a coin 100 times and get 60 heads. Is the coin fair, or is it biased?
Step 1. Set up the hypotheses.

H_0: p = 0.5 \qquad H_1: p \neq 0.5

Why: the null hypothesis claims nothing interesting is happening — the coin is fair. The alternative claims it is biased in either direction (two-sided test).
Step 2. Compute the test statistic. Under H_0, the number of heads in 100 flips has mean np_0 = 50 and standard deviation \sqrt{np_0(1-p_0)} = \sqrt{25} = 5. The observed proportion is \hat{p} = 60/100 = 0.6.

z = \frac{60 - 50}{5} = 2.0

Why: the test statistic measures how many standard errors the observed proportion is from the claimed value. A z of 2.0 means the observed result is 2 standard errors above the null hypothesis prediction.
Step 3. Compute the p-value. For a two-sided test with z = 2.0:

p = 2 \, P(Z > 2.0) = 2 \times 0.0228 = 0.0456

Why: "two-sided" means extreme results in either direction count, so you double the one-tail area. From the standard normal table, P(Z > 2.0) = 0.0228.
Step 4. Compare with \alpha = 0.05. Since p = 0.0456 < 0.05, you reject H_0.
Result: At the 5% significance level, you reject the null hypothesis. The data provides sufficient evidence that the coin is biased.
The result is borderline. With 60 heads out of 100, the evidence against a fair coin is statistically significant at the 5% level, but not at the 1% level (where you would need p < 0.01). A single hypothesis test does not prove the coin is biased — it says the data is hard to explain if the coin were fair.
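The entire coin test can be reproduced in a few lines of standard-library Python, which also makes the borderline nature of the result concrete:

```python
from math import sqrt
from statistics import NormalDist

# Example 2: 60 heads in 100 flips, H0: p = 0.5
n, heads, p0 = 100, 60, 0.5
p_hat = heads / n
se = sqrt(p0 * (1 - p0) / n)                   # SE of the proportion: 0.05
z = (p_hat - p0) / se                           # (0.6 - 0.5) / 0.05 = 2.0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided, ~0.0455
reject_at_5pct = p_value <= 0.05                # True: reject at alpha = 0.05
reject_at_1pct = p_value <= 0.01                # False: survives alpha = 0.01
```

(The exact p-value, 0.0455, differs slightly from the table-based 0.0456 above only because printed z-tables round to four decimal places.)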
Common confusions
- "Failing to reject H_0 means H_0 is true." It does not. It means the data is not strong enough to rule it out. Absence of evidence is not evidence of absence. With a small sample, you might fail to detect a real effect simply because you do not have enough data.
- "A smaller p-value means a bigger effect." No. The p-value depends on both the effect size and the sample size. A tiny, meaningless effect can produce a very small p-value if the sample is huge. A large, important effect can produce a large p-value if the sample is tiny. The p-value measures statistical significance, not practical importance.
- "95% confidence means there is a 95% probability that \mu is in the interval." The true \mu is a fixed number — it is either in the interval or not. The 95% refers to the procedure: if you repeated the sampling and interval construction many times, 95% of the intervals would contain \mu. This is a frequency statement about the method, not a probability statement about any single interval.
- "You can only do hypothesis testing with large samples." For small samples, you use the t-distribution instead of the normal distribution. The logic is exactly the same; only the critical values change. The t-distribution has heavier tails, which makes the test more conservative — harder to reject H_0 — which is appropriate because small samples carry more uncertainty.
- "Statistical significance means the result is important." A drug that lowers blood pressure by 0.1 mmHg might be statistically significant with a sample of 100,000 patients — but clinically, 0.1 mmHg is meaningless. Significance tests detect whether an effect exists; they do not measure whether it matters.
Going deeper
If you came here to understand what confidence intervals and hypothesis tests mean at a conceptual level, you have it — you can stop here. The rest is for readers who want to see the connection between the two, the error types, and the mathematical framework that unifies everything.
The duality between confidence intervals and hypothesis tests
Confidence intervals and hypothesis tests are not separate tools — they are the same tool viewed from two angles.
A 95% confidence interval for \mu is the set of all values \mu_0 that you would not reject in a two-sided hypothesis test at the \alpha = 0.05 level. If \mu_0 is inside the interval, then |\bar{x} - \mu_0| / (\sigma/\sqrt{n}) < 1.96, which means p > 0.05, which means you fail to reject. If \mu_0 is outside the interval, then |\bar{x} - \mu_0| / (\sigma/\sqrt{n}) > 1.96, which means p < 0.05, which means you reject.
So a confidence interval is a visual hypothesis test: every value inside the interval is "not rejected," and every value outside is "rejected."
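The duality can be verified numerically: for any candidate value \mu_0, "inside the interval" and "fails to reject" always give the same answer. A minimal sketch (the function name and sample numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def ci_test_duality(xbar, sigma, n, mu0, alpha=0.05):
    """Return (mu0 inside the CI, two-sided test fails to reject H0: mu = mu0).

    The duality says these two booleans always match.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / sqrt(n)
    inside = (xbar - margin) <= mu0 <= (xbar + margin)
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - nd.cdf(abs(z)))
    return inside, p > alpha

# Sweep candidate values of mu0: the two answers never disagree
for mu0 in range(30, 55):
    inside, not_rejected = ci_test_duality(42, 16, 64, mu0)
    assert inside == not_rejected
```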
Type I and Type II errors
Two kinds of mistakes are possible:
| | H_0 is actually true | H_0 is actually false |
|---|---|---|
| Reject H_0 | Type I error (false positive) | Correct decision |
| Fail to reject H_0 | Correct decision | Type II error (false negative) |
The significance level \alpha controls the Type I error rate: the probability of rejecting H_0 when it is true. The power of a test, 1 - \beta, is the probability of correctly rejecting H_0 when it is false. You want \alpha small (few false alarms) and power high (you detect real effects). These two goals pull in opposite directions — reducing \alpha reduces power, and vice versa — which is why choosing \alpha is a judgment call, not a mathematical derivation.
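The claim that \alpha controls the Type I error rate can be seen by simulation: test a null hypothesis that is actually true many times, and count false alarms. A sketch under the assumption of a known-\sigma z-test (the seed and trial count are arbitrary choices):

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is reproducible
nd = NormalDist()
alpha, n, trials = 0.05, 30, 20_000
false_alarms = 0
for _ in range(trials):
    # Draw from a population where H0 (mu = 0) is TRUE, sigma = 1 known
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / sqrt(n))
    p = 2 * (1 - nd.cdf(abs(z)))
    false_alarms += p <= alpha
type_i_rate = false_alarms / trials  # hovers around alpha = 0.05
```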
The role of sample size in power
Power increases with sample size. A sample of 1,000 can detect a small effect that a sample of 50 would miss entirely. This is why clinical trials and large surveys specify a minimum sample size before collecting data: they compute how many observations are needed to have, say, 80% power to detect an effect of a given size. The sample size calculation uses the same standard error formula from the sampling article: \text{SE} = \sigma / \sqrt{n}.
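For a two-sided z-test with known \sigma, power has a closed form, and computing it makes the sample-size effect concrete. A sketch (the function name and the effect size of 0.2\sigma are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the true mean differs
    from mu_0 by delta (sigma assumed known)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma / sqrt(n))  # true effect in standard-error units
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# A small effect (delta = 0.2 sigma) is usually missed at n = 50
# but almost always detected at n = 1000
small_n_power = power_two_sided_z(0.2, 1.0, 50)     # roughly 0.29
large_n_power = power_two_sided_z(0.2, 1.0, 1000)   # essentially 1.0
```

Inverting this relationship — solving for the smallest n that reaches, say, 80% power — is exactly the pre-study sample size calculation described above.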
Where this leads next
You now have the conceptual foundation of statistical inference. The next ideas build specific tools on this foundation:
- Sampling — the sampling distribution that underlies both confidence intervals and hypothesis tests.
- Data Organization — how to collect, clean, and organise data before any inference.
- Arithmetic Mean — the properties of the mean as a statistic, including when it is and is not the best summary.
- Probability Introduction — the probability theory that makes the p-value and the confidence level mathematically precise.