In short
A population is the entire group you want to study. A sample is a smaller subset you actually measure. If you choose the sample randomly, the sample average is a reliable stand-in for the population average — and the larger the sample, the closer it gets. The pattern of how sample averages spread out is called the sampling distribution.
India has roughly 25 crore households. Suppose the government wants to know the average monthly electricity bill across the entire country. Going door to door and collecting 25 crore electricity bills is not feasible — it would take years and cost a fortune. So instead, a surveyor picks 10,000 households, records their bills, and computes the average of those 10,000 numbers. That average comes out to, say, ₹1,840.
Here is the question: can you trust that number? Those 10,000 households are not the country. They are a tiny sliver of the country — 0.004% of all households. Why should their average tell you anything about the average for everyone?
This is the central problem of sampling. You want to know something about a huge group. You can only measure a small piece. Under what conditions does the small piece faithfully represent the whole?
The answer is surprisingly precise, and it rests on one idea: randomness. If you choose the 10,000 households at random — genuinely at random, not "convenient" households, not "nearby" households — then the mathematics guarantees that their average will be close to the true average. Not exactly equal, but close, and you can even quantify how close.
Population and sample
These two words are the foundation of everything in statistics, so they need careful definitions.
The population is the complete set of individuals or objects you want to study. The sample is a subset of the population that you actually observe and measure.
A parameter is a fixed number describing the population (like the true average electricity bill of all 25 crore households). A statistic is a number computed from the sample (like the average electricity bill of your 10,000 sampled households).
The population is not always "people." If you are testing whether a batch of 50,000 light bulbs meets a quality standard, the population is all 50,000 bulbs. The sample might be 200 bulbs pulled from the production line and tested. If you are studying the heights of teak trees in a forest reserve, the population is every teak tree in that reserve.
The key distinction: a parameter is a fact about the world. It has a single, fixed, true value — you just do not know it. A statistic is something you compute from data you actually collected. It changes every time you draw a new sample. The entire point of sampling is to use the statistic (which you can compute) to estimate the parameter (which you cannot directly observe).
Here is a concrete example. Suppose a school has 1,200 students, and the average height of all 1,200 is exactly 162.3 cm. That is the parameter — you just do not know it yet. You pick 50 students at random and measure their heights. Their average comes out to 163.1 cm. That is the statistic. Pick a different 50 students and you might get 161.8 cm, or 162.7 cm. Each sample gives a slightly different statistic, but they all hover around the true parameter.
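You can watch the parameter/statistic distinction in a few lines of code. This is a minimal sketch with a synthetic population: the 1,200 heights are generated, not real data, so the exact numbers printed are illustrative only.

```python
import random

random.seed(0)

# Hypothetical population: 1,200 student heights (cm), centred near 162.3.
population = [random.gauss(162.3, 7.0) for _ in range(1200)]
true_mean = sum(population) / len(population)  # the parameter: fixed, usually unknown

# Draw three different random samples of 50; each gives a different statistic.
for _ in range(3):
    sample = random.sample(population, 50)
    sample_mean = sum(sample) / len(sample)    # the statistic: changes per sample
    print(f"parameter = {true_mean:.1f} cm, statistic = {sample_mean:.1f} cm")
```

Run it and the three statistics differ from each other, but all hover within a centimetre or two of the parameter.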
Why the sample has to be random
Suppose you want to estimate the average monthly income of families in a city. You stand outside a shopping mall on a Saturday afternoon and survey 200 people. What is wrong with this?
Everything. The people at the mall are not representative of the city. They are disproportionately wealthier (they are shopping), disproportionately urban (they are at a mall), and disproportionately free on Saturday (ruling out many workers). Your sample is biased — it systematically leans in one direction. The average income you compute from this sample will almost certainly be higher than the city's true average, and no amount of increasing the sample size will fix the problem. You could survey 10,000 mall-goers and the bias would still be there.
The cure for bias is random sampling — a procedure where every individual in the population has a known, nonzero chance of being selected.
Simple random sampling
A simple random sample (SRS) of size n from a population of size N is a sample chosen so that every possible subset of n individuals is equally likely to be selected. Equivalently, every individual has the same probability n/N of appearing in the sample.
The word "random" here is doing real work. It does not mean "haphazard" or "whatever feels right." It means you have a specific mechanism — a lottery, a random number generator, a table of random digits — that gives every member of the population an equal shot.
Why does randomness help? Because it removes your biases from the selection. You might unconsciously prefer taller students, richer families, trees near the path. Randomness does not have preferences. Over many possible random samples, the biases cancel out: sometimes you oversample tall students, sometimes short ones, and on average, you get the right answer.
There are other sampling methods beyond simple random sampling — stratified sampling divides the population into subgroups (strata) and samples from each, systematic sampling picks every k-th individual from a list, cluster sampling selects entire groups at once. Each has its uses. But simple random sampling is the baseline: the method against which all others are compared, and the one whose mathematics is cleanest.
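In code, a simple random sample is one library call. The sketch below assumes a population represented as a list of hypothetical household IDs; `random.sample` draws without replacement, with every subset of size n equally likely.

```python
import random

random.seed(42)

# Hypothetical population: ID numbers for N = 10,000 households.
N = 10_000
household_ids = list(range(N))

# random.sample implements simple random sampling without replacement:
# every possible subset of size n is equally likely to be chosen.
n = 200
srs = random.sample(household_ids, n)

assert len(srs) == n
assert len(set(srs)) == n   # no household appears twice
# Each household's chance of inclusion is n/N.
print(f"inclusion probability = {n / N}")
```

This is the "specific mechanism" the definition demands: reproducible, and with a known selection probability n/N for every individual.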
What happens when you sample repeatedly
Here is an experiment you can run in your head. Take the school with 1,200 students whose true average height is 162.3 cm. Draw a random sample of 50 students and compute their average. Write it down. Put those students back, draw another random sample of 50, compute the average. Do this 1,000 times. You now have 1,000 sample averages.
What does the collection of those 1,000 averages look like?
It forms a pattern — a distribution — centred on the true population mean. Most of the sample averages cluster near 162.3 cm. A few land at 160 or 164. Almost none land below 158 or above 167. The shape is a bell curve, symmetric and concentrated.
This distribution of sample averages is called the sampling distribution of the sample mean. It is not a distribution of individual heights — it is a distribution of averages, each computed from a different random sample.
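The thought experiment is easy to run for real. A sketch, again with a synthetic population of 1,200 heights (the specific seed and numbers are assumptions for illustration):

```python
import random
import statistics

random.seed(1)

# Hypothetical school: 1,200 heights with true mean near 162.3 cm.
population = [random.gauss(162.3, 7.0) for _ in range(1200)]
mu = statistics.mean(population)

# Draw 1,000 random samples of size 50; record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(1000)
]

# The 1,000 averages cluster tightly around the population mean.
print(f"population mean             = {mu:.2f}")
print(f"mean of sample means        = {statistics.mean(sample_means):.2f}")
print(f"spread of sample means (SD) = {statistics.stdev(sample_means):.2f}")
```

The printed spread is roughly \sigma/\sqrt{50} — about 1 cm here — far smaller than the spread of individual heights.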
Three facts about the sampling distribution make the entire field of statistics possible.
Fact 1: The centre. The mean of the sampling distribution equals the population mean \mu. If you averaged all 1,000 sample means, you would get (very close to) 162.3 cm. In symbols:

E[\bar{x}] = \mu

where \bar{x} is the sample mean. This says that random sampling is unbiased — on average, the sample mean hits the right target.
Fact 2: The spread. The standard deviation of the sampling distribution — called the standard error — is

\text{SE} = \frac{\sigma}{\sqrt{n}}
where \sigma is the population standard deviation and n is the sample size. The \sqrt{n} in the denominator is the key: as the sample size grows, the spread of sample means shrinks. With n = 50, the standard error is \sigma/\sqrt{50}. With n = 200, it is \sigma/\sqrt{200} — half as wide. Larger samples give more precise estimates.
Fact 3: The shape. Even if the population distribution is skewed or irregular, the sampling distribution of the sample mean is approximately normal (bell-shaped) when n is large enough. This is the Central Limit Theorem — one of the most remarkable results in mathematics. It says that averaging washes out individual quirks: the average of many independent random quantities is approximately normal, whatever their individual distributions look like.
How large is "large enough"? For most populations, n \geq 30 is a reasonable threshold. If the population is strongly skewed, you might need n \geq 50 or more. If the population is already symmetric, even n = 10 can be enough.
The standard error formula, derived
The formula \text{SE} = \sigma / \sqrt{n} deserves a derivation, not just a statement.
Suppose you draw a sample of n values x_1, x_2, \ldots, x_n from a population with mean \mu and standard deviation \sigma. Assume the values are drawn independently (each draw does not affect the next). The sample mean is

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}
Now compute the variance of \bar{x}. Each x_i has variance \sigma^2. Since the draws are independent, the variance of a sum is the sum of the variances:

\text{Var}(x_1 + x_2 + \cdots + x_n) = \sigma^2 + \sigma^2 + \cdots + \sigma^2 = n\sigma^2
The sample mean divides this sum by the constant n. When you divide a random variable by a constant c, the variance gets divided by c^2:

\text{Var}(\bar{x}) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
Take the square root to get the standard deviation:

\text{SE} = \sqrt{\text{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
That is the standard error. The \sqrt{n} does not come from a rule of thumb — it comes from the algebra of variances. Every time you quadruple the sample size, the standard error halves.
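The quadruple-the-sample, halve-the-error relationship is worth checking numerically. A small sketch (the value \sigma = 12 is an arbitrary illustration):

```python
import math

# With population SD sigma, the standard error of the mean is sigma / sqrt(n).
sigma = 12.0

def standard_error(sigma, n):
    """SE of the sample mean for sample size n."""
    return sigma / math.sqrt(n)

se_50 = standard_error(sigma, 50)
se_200 = standard_error(sigma, 200)

print(f"SE at n=50:  {se_50:.3f}")
print(f"SE at n=200: {se_200:.3f}")
# Quadrupling n (50 -> 200) exactly halves the standard error.
assert math.isclose(se_50 / se_200, 2.0)
```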
Worked examples
Example 1: Estimating mean marks from a sample
A coaching centre has 2,000 students. The true mean score on a mock test is \mu = 68 marks with a standard deviation of \sigma = 12 marks. You randomly sample 36 students and compute the sample mean. What is the standard error, and within what range would you expect most sample means to fall?
Step 1. Identify the known quantities.

\mu = 68, \quad \sigma = 12, \quad n = 36
Why: \mu and \sigma describe the population. n is the sample size you chose.
Step 2. Compute the standard error.

\text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2 \text{ marks}
Why: the standard error measures how much sample means typically deviate from the population mean.
Step 3. Find the range where most sample means fall. By the empirical rule for normal distributions, about 95% of sample means lie within \pm 2 standard errors of \mu.

\mu \pm 2\,\text{SE} = 68 \pm 2(2) = 68 \pm 4, \quad \text{i.e. } [64, 72]
Why: the sampling distribution is approximately normal (Central Limit Theorem, n = 36 \geq 30), so the 95% rule applies.
Step 4. Interpret: if you sampled 36 students many times, about 95% of those sample means would land between 64 and 72.
Result: The standard error is 2 marks. About 95% of sample means fall in the interval [64, 72].
Notice what the picture says: even though individual students' marks vary widely (standard deviation 12), the average of 36 students barely moves — its standard deviation is only 2. Averaging compresses randomness.
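The arithmetic of Example 1 can be verified in a few lines:

```python
import math

# Known population quantities from the example.
mu, sigma, n = 68.0, 12.0, 36

se = sigma / math.sqrt(n)              # 12 / 6 = 2 marks
low, high = mu - 2 * se, mu + 2 * se   # approximate 95% range for sample means

print(f"SE = {se} marks; 95% of sample means fall in [{low}, {high}]")
```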
Example 2: How sample size controls precision
A factory produces resistors whose resistance has a population mean of \mu = 100\,\Omega and a standard deviation of \sigma = 5\,\Omega. The quality inspector wants the standard error of the sample mean to be at most 0.5\,\Omega. How many resistors must be sampled?
Step 1. Write the standard error formula and set it equal to the target.

\frac{\sigma}{\sqrt{n}} = 0.5
Why: the inspector's requirement translates directly into a bound on the standard error.
Step 2. Substitute \sigma = 5 and solve for n.

\frac{5}{\sqrt{n}} = 0.5 \quad\Rightarrow\quad \sqrt{n} = \frac{5}{0.5} = 10 \quad\Rightarrow\quad n = 100
Why: algebraically, multiplying both sides by \sqrt{n} and dividing by 0.5 isolates \sqrt{n}. Squaring both sides gives n.
Step 3. Verify: with n = 100, \text{SE} = 5/\sqrt{100} = 5/10 = 0.5\,\Omega. Exactly meets the requirement.
Step 4. Compare with a smaller sample: with n = 25, \text{SE} = 5/\sqrt{25} = 1.0\,\Omega — twice the target. With n = 400, \text{SE} = 5/\sqrt{400} = 0.25\,\Omega — half the target but four times the sample size.
Result: The inspector must sample at least 100 resistors.
The picture reveals a law of diminishing returns. The first 100 resistors buy you a lot of precision. The next 300 buy you only a little more. This is the \sqrt{n} at work: precision improves as the square root of the sample size, not linearly.
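The same calculation, rearranged so the code solves for n directly and then shows the diminishing returns:

```python
import math

sigma = 5.0        # population SD in ohms
target_se = 0.5    # inspector's required standard error

# SE = sigma / sqrt(n) <= target  =>  n >= (sigma / target)**2
n_required = math.ceil((sigma / target_se) ** 2)
print(f"minimum sample size: {n_required}")

# Diminishing returns: each halving of SE costs 4x the sample size.
for n in (25, 100, 400):
    print(f"n = {n:4d}  ->  SE = {sigma / math.sqrt(n):.2f} ohms")
```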
Common confusions
- "A bigger sample is always better." A bigger random sample is always more precise. But a bigger biased sample is not better — it just gives you a more precise wrong answer. A sample of 100,000 mall-goers is worse than a sample of 500 randomly chosen city residents, because the mall sample has systematic bias that no amount of data will fix.
- "The sample should be a fixed fraction of the population." You might think you need to sample 10% of the population, or 1%. In reality, the standard error depends on the absolute sample size n, not on the fraction n/N. A sample of 1,000 from a population of 10 lakh gives nearly the same precision as a sample of 1,000 from a population of 100 crore. The population size barely matters (unless the sample is a large fraction of the population, in which case a small correction applies).
- "Random means haphazard." Picking "whoever is around" or "the first 50 names that come to mind" is not random sampling. Random sampling requires a specific, reproducible mechanism — like a random number generator or drawing lots — where every individual has a known probability of selection.
- "The sampling distribution is the same as the population distribution." These are different things. The population distribution describes individual values (individual heights, individual test scores). The sampling distribution describes averages of samples. The sampling distribution is narrower than the population distribution (by a factor of \sqrt{n}) and, for large n, more symmetric, even when the population itself is not.
- "The Central Limit Theorem says everything is normal." It says the distribution of the sample mean is approximately normal for large n. It does not say the population is normal, and it does not say individual data points are normal.
Going deeper
If you came here to understand what a sample is, why it needs to be random, and how sample means behave — you have it. You can stop here. The rest of this section is for readers who want the mathematical statement of the Central Limit Theorem and a closer look at what "approximately normal" really means.
The Central Limit Theorem, stated precisely
The Central Limit Theorem is one of those results where the statement itself is the surprise.
Let x_1, x_2, \ldots, x_n be independent draws from any distribution with mean \mu and finite variance \sigma^2. Define the standardised sample mean:

Z_n = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
Then as n \to \infty:

P(Z_n \leq z) \to \Phi(z) \quad \text{for every } z,

where \Phi(z) is the cumulative distribution function of the standard normal distribution.
Read what this says. You start with any distribution — it could be skewed, bimodal, discrete, continuous, anything — as long as it has a finite mean and variance. You take the average of n independent draws. The distribution of that average, properly standardised, converges to the standard normal. The original distribution does not matter. Only the mean and variance survive in the limit.
This is why the bell curve appears everywhere in science. Whenever a measurement is the average (or sum) of many small, independent contributions — measurement errors, molecular velocities, exam scores summed over many questions — the Central Limit Theorem predicts a normal distribution. The bell curve is not an assumption; it is a consequence of averaging.
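You can see the theorem act on a deliberately non-normal population. The sketch below draws from an exponential distribution (strongly right-skewed, mean 1, SD 1 — an arbitrary choice for illustration) and checks that the sample means nonetheless land where the CLT predicts: centred at \mu with spread \sigma/\sqrt{n}.

```python
import random
import statistics

random.seed(7)

# A strongly skewed population: exponential with mean 1 and SD 1.
# The CLT says the *sample mean* is still approximately normal.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(2000)]

# Only the mean and variance survive in the limit: check both.
print(f"centre of sample means: {statistics.mean(means):.3f}  (theory: 1.000)")
print(f"spread of sample means: {statistics.stdev(means):.3f}  (theory: {1 / 50 ** 0.5:.3f})")
```

A histogram of `means` would look like a bell curve, even though a histogram of the raw exponential draws is sharply skewed.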
Finite population correction
The standard error formula \text{SE} = \sigma / \sqrt{n} assumes sampling with replacement (or, equivalently, that the population is infinite relative to the sample). When the sample is a significant fraction of the population, you can apply a finite population correction:

\text{SE} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
The factor \sqrt{(N - n)/(N - 1)} is less than 1, so the corrected standard error is smaller than \sigma/\sqrt{n}. This makes intuitive sense: if your sample covers 90% of the population, you know a lot more than the basic formula suggests. In practice, if n/N < 0.05 (the sample is less than 5% of the population), the correction is negligible and you can ignore it.
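A short sketch makes the "when does the correction matter" rule concrete (the values of \sigma, n, and N here are arbitrary illustrations):

```python
import math

def standard_error(sigma, n, N=None):
    """SE of the sample mean; applies the finite population correction if N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

sigma = 12.0

# Sample of 1,000 from a population of 10 lakh: n/N = 0.1%, correction negligible.
print(f"{standard_error(sigma, 1000):.4f} vs {standard_error(sigma, 1000, N=1_000_000):.4f}")

# Sample covering 90% of a population of 1,000: correction shrinks SE dramatically.
print(f"{standard_error(sigma, 900):.4f} vs {standard_error(sigma, 900, N=1000):.4f}")
```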
Stratified sampling and why it can beat SRS
Simple random sampling is the baseline, but it is not always the best. Suppose you want to estimate the average income of a state that has both large cities and small villages. In a simple random sample, you might — by chance — oversample cities or oversample villages. Stratified sampling avoids this by dividing the population into strata (urban vs. rural, or by district), sampling from each stratum separately, and then combining the results. The combined estimate is guaranteed to represent each stratum in the right proportion, and its standard error is often smaller than that of a simple random sample of the same total size.
The mathematics of stratified sampling uses the same variance-of-sums machinery you saw in the standard error derivation, applied stratum by stratum. It is a natural extension, not a different theory.
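A minimal sketch of proportional stratified sampling, with a made-up state of 70% rural and 30% urban households (all incomes synthetic): sampling each stratum in proportion to its size makes the plain mean of the combined sample an unbiased estimate of the population mean.

```python
import random
import statistics

random.seed(3)

# Hypothetical strata with very different incomes (rupees per month).
rural = [random.gauss(20_000, 4_000) for _ in range(7000)]
urban = [random.gauss(60_000, 12_000) for _ in range(3000)]
population = rural + urban
true_mean = statistics.mean(population)

# Stratified sample of 100: allocate proportionally (70 rural, 30 urban),
# guaranteeing each stratum appears in exactly the right proportion.
strat_sample = random.sample(rural, 70) + random.sample(urban, 30)
strat_estimate = statistics.mean(strat_sample)

print(f"true mean           = {true_mean:,.0f}")
print(f"stratified estimate = {strat_estimate:,.0f}")
```

A simple random sample of 100 could, by bad luck, draw 40 urban households; the stratified design rules that out by construction.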
Where this leads next
You now know what a sample is, what makes a sample reliable, and how sample means distribute themselves. The next set of ideas uses the sampling distribution as a foundation:
- Introduction to Inference — using the sampling distribution to build confidence intervals and test hypotheses.
- Data Organization — how to organize, tabulate, and visualize data before computing statistics.
- Arithmetic Mean — a closer look at the mean itself, its algebraic properties, and how it interacts with other measures of central tendency.
- Probability Introduction — the formal probability framework that underpins random sampling and the Central Limit Theorem.