In short
Quartiles are the three values that split a sorted dataset into four equal parts: Q_1 (one-quarter of the data lies below), Q_2 (the median, half the data lies below), and Q_3 (three-quarters of the data lies below). The interquartile range Q_3 - Q_1 is a measure of spread that ignores extreme values. Percentiles are the same idea with finer resolution — the k-th percentile is the value below which k\% of the data lies. A box plot turns these five numbers into one picture.
A school runs a mathematics exam for its 400 students in class 10. The principal wants a short answer to a simple question: how did the class do?
One number is not enough. The class average might be 62 out of 100 — but that alone cannot tell you whether most students were clustered near 62, or whether half the class was failing and half was topping. You want a feel for the whole shape of the results.
Here is one efficient way to describe that shape. Sort the 400 scores from lowest to highest. The score at position 100 (one-quarter of the way through) tells you the cut-off below which the bottom 25% sit. The score at position 200 is the median — half the class did worse, half did better. The score at position 300 tells you the cut-off above which the top 25% sit.
Those three numbers are called the quartiles. Together with the lowest and highest scores, they compress 400 data points into 5 numbers that capture almost everything you want to know about the shape of the distribution: where the middle is, how spread out the data is, whether the top and bottom tails are long or short, whether there are extreme outliers.
That five-number summary — and the picture you draw from it — is the subject of this article.
From median to quartiles
You already know one cut-point. The median of a sorted dataset is the value in the middle: half the data lies below it and half lies above. For the dataset \{3, 7, 8, 12, 15, 17, 22\}, the median is 12 — three values below, three values above.
The median splits the data into two halves. Quartiles do the same thing, but they split it into four quarters.
Quartiles
For a sorted dataset, the quartiles are three values Q_1, Q_2, Q_3 that divide the data into four equal parts.
- Q_1 (the lower quartile or first quartile) is the value below which one-quarter of the data lies.
- Q_2 (the second quartile) is the median.
- Q_3 (the upper quartile or third quartile) is the value below which three-quarters of the data lies.
Equivalently: Q_1 is the median of the lower half, and Q_3 is the median of the upper half.
The "median-of-halves" description is the one you actually use to compute quartiles by hand. Take the data, find the median, split the data into two halves around the median, then find the median of each half.
A small example. Take the sorted dataset \{4, 6, 7, 9, 11, 13, 15, 18, 20\}.
There are 9 values. The middle value is the 5th, so Q_2 = 11. The lower half (before the median) is \{4, 6, 7, 9\} — its median is the average of the two middle values, (6+7)/2 = 6.5. So Q_1 = 6.5. The upper half is \{13, 15, 18, 20\} — its median is (15+18)/2 = 16.5. So Q_3 = 16.5.
The three quartiles are Q_1 = 6.5, Q_2 = 11, Q_3 = 16.5. They divide the nine values into four groups of roughly equal size.
There is one detail that varies between textbooks: when the dataset has an odd number of values, should the median itself be included in the lower and upper halves when computing Q_1 and Q_3? The convention in Indian school mathematics (and the one used throughout this article) is to exclude the median from both halves. It keeps the arithmetic clean and is the standard in NCERT-aligned textbooks.
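The median-of-halves recipe translates directly into a few lines of Python. This is a minimal sketch using the exclusion convention described above:

```python
def median(values):
    """Median of an already-sorted list."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

def quartiles(data):
    """(Q1, Q2, Q3), excluding the overall median from both halves."""
    s = sorted(data)
    n = len(s)
    q2 = median(s)
    lower = s[:n // 2]          # for odd n, this stops short of the median
    upper = s[(n + 1) // 2:]    # for odd n, this starts past the median
    return median(lower), q2, median(upper)

print(quartiles([4, 6, 7, 9, 11, 13, 15, 18, 20]))   # (6.5, 11, 16.5)
```

Running it on the nine-value dataset reproduces the hand computation exactly.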
The interquartile range
Once you have Q_1 and Q_3, you have a natural measure of spread.
Interquartile range
The interquartile range (IQR) of a dataset is the difference between the third and first quartiles:
\text{IQR} = Q_3 - Q_1
The IQR is the width of the middle 50% of the data.
Why do you want such a thing? Because the ordinary range (maximum minus minimum) is easily wrecked by a single outlier. If you are measuring the monthly income of 100 families and one of them happens to be a billionaire, the range is enormous — but it tells you almost nothing about how the other 99 families are doing.
The IQR ignores the top 25% and the bottom 25% entirely. It only looks at the middle half. That makes it a robust measure of spread — one that does not get pulled around by extreme values.
For the dataset above (Q_1 = 6.5, Q_3 = 16.5), the IQR is 16.5 - 6.5 = 10. The middle half of the data is spread across a width of 10 units. If instead you had replaced the value 20 with 2000, the range would explode from 16 to 1996 — but the IQR would not change at all, because the outlier is above Q_3 and the position of Q_3 itself does not depend on what exact value sits at the extreme top.
This robustness is why the IQR is the standard measure of spread in exploratory data analysis. When someone shows you a box plot of salaries or exam scores, the width of the box is the IQR.
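The robustness claim is easy to check numerically. The sketch below computes the range and the IQR (exclusion convention) for the earlier dataset, then again with the top value replaced by an extreme outlier:

```python
def spread(data):
    """Return (range, IQR) using the exclusion convention for quartiles."""
    s = sorted(data)
    n = len(s)

    def med(v):
        m = len(v)
        return v[m // 2] if m % 2 else (v[m // 2 - 1] + v[m // 2]) / 2

    q1 = med(s[:n // 2])           # median of the lower half
    q3 = med(s[(n + 1) // 2:])     # median of the upper half
    return s[-1] - s[0], q3 - q1

print(spread([4, 6, 7, 9, 11, 13, 15, 18, 20]))    # (16, 10.0)
print(spread([4, 6, 7, 9, 11, 13, 15, 18, 2000]))  # (1996, 10.0)
```

The range explodes; the IQR does not move.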
Percentiles: quartiles with finer teeth
Quartiles cut the data into quarters. You can do the same thing with any denominator.
Percentile
The k-th percentile P_k of a sorted dataset is the value below which k\% of the observations lie.
In particular:
- P_{25} = Q_1
- P_{50} = Q_2 = median
- P_{75} = Q_3
Percentiles are the natural scale when you want finer resolution than quartiles give you. A student who scores in the 87th percentile of a national exam did better than 87% of the candidates. That is a more informative number than "above the third quartile" — because "above Q_3" describes the top 25% as a single lump, while "87th percentile" pinpoints the exact position inside that lump.
Computing P_k from sorted data. The standard method: for a sorted dataset of n values, the k-th percentile sits at position
i = \frac{k}{100}(n+1)
If i comes out to a whole number, P_k is exactly the value at position i. If i falls between two positions — say i = 4.7 — you interpolate linearly between the 4th and 5th values.
Take the 9-value dataset from before: 4, 6, 7, 9, 11, 13, 15, 18, 20. Find P_{30}.
Position: i = 0.30 \times 10 = 3.0. So P_{30} is exactly the 3rd value, which is 7.
Find P_{80}. Position: i = 0.80 \times 10 = 8.0. So P_{80} is the 8th value, which is 18.
Find P_{65}. Position: i = 0.65 \times 10 = 6.5. So P_{65} is halfway between the 6th value (13) and the 7th value (15) — that is, P_{65} = 14.
The interpolation rule is a convention that smooths over the fact that a real dataset has a finite number of points — you cannot have exactly 65% of 9 values below any actual data point. Different statistical software packages use slightly different conventions; the (n+1) rule above is the one used in most Indian textbooks and in the PERCENTILE.EXC function in spreadsheets.
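The position-and-interpolate method is short enough to sketch in full. This version uses the (n+1) position rule worked through above, and assumes k is in a range where the position falls inside the data:

```python
def percentile(data, k):
    """k-th percentile via the (n+1) position rule with linear interpolation."""
    s = sorted(data)
    i = k * (len(s) + 1) / 100     # 1-based position, possibly fractional
    lo = int(i)                    # whole part of the position
    frac = i - lo
    if frac == 0:
        return s[lo - 1]           # exact position: return that value
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])   # interpolate between neighbours

data = [4, 6, 7, 9, 11, 13, 15, 18, 20]
print(percentile(data, 30))   # 7
print(percentile(data, 65))   # 14.0
```

The three worked examples (P_{30}, P_{65}, P_{80}) all come out as computed by hand.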
Box plots: the five-number picture
Five numbers — minimum, Q_1, Q_2, Q_3, maximum — are enough to sketch the shape of a dataset. The standard way to draw them is called a box plot (or box-and-whisker plot).
A box plot is just a number line with a rectangle drawn on it. The box stretches from Q_1 to Q_3, so its width is the IQR. A line is drawn inside the box at the median Q_2. Two "whiskers" extend from the ends of the box — the left whisker to the minimum, the right whisker to the maximum.
Example 1: Exam scores for one class
Here are the scores of 15 students in a mathematics exam (out of 50):
12, 18, 22, 25, 27, 28, 30, 32, 34, 35, 36, 38, 40, 42, 48
The data is already sorted. Build a box plot.
Step 1. Find the median Q_2.
With 15 values, the middle is the 8th value. So Q_2 = 32.
Why: the median is always the "middle-position" value in a sorted list. Position = (n+1)/2 = 8 when n = 15.
Step 2. Find Q_1 (the median of the lower half).
The lower half is the first 7 values: 12, 18, 22, 25, 27, 28, 30. Its middle is the 4th value, which is 25. So Q_1 = 25.
Why: excluding the overall median, the lower half is positions 1 through 7. Its own median is at position (7+1)/2 = 4.
Step 3. Find Q_3 (the median of the upper half).
The upper half is the last 7 values: 34, 35, 36, 38, 40, 42, 48. Its middle is the 4th value in that subset, which is 38. So Q_3 = 38.
Why: same logic as Step 2, applied to the upper half.
Step 4. Compute the IQR and the five-number summary.
\text{IQR} = Q_3 - Q_1 = 38 - 25 = 13
The five-number summary is (\text{min}, Q_1, Q_2, Q_3, \text{max}) = (12, 25, 32, 38, 48).
Why: these five numbers together fix the position, width, and skew of the data at a glance.
Step 5. Draw the box plot.
The box runs from 25 to 38, with a line inside at 32. A whisker extends from 25 down to 12. Another whisker extends from 38 up to 48.
Why: the box displays the middle 50% of the data, the internal line marks the median, and the whiskers show how far the extremes stretch out on each side.
Result: Five-number summary (12, 25, 32, 38, 48), IQR = 13.
Read what the picture is showing you. The box is not centred on the line inside it — the median (32) is closer to Q_3 (38) than to Q_1 (25). That is a signal that the data is slightly skewed: the quarter of the data between Q_2 and Q_3 is packed more tightly than the quarter between Q_1 and Q_2, and the lower whisker (25 down to 12) is longer than the upper one (38 up to 48), so the bottom tail stretches further. If the median had sat exactly in the middle of the box, the middle half of the data would have been balanced around it.
This is what a box plot gives you at a glance: not just where the data sits, but how it is shaped.
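The five-number summary can also be checked with Python's standard library. statistics.quantiles with method="exclusive" uses the same (n+1)-position interpolation, so on this dataset it agrees with the median-of-halves values worked out above:

```python
from statistics import quantiles

scores = [12, 18, 22, 25, 27, 28, 30, 32, 34, 35, 36, 38, 40, 42, 48]
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")
print(min(scores), q1, q2, q3, max(scores))   # 12 25.0 32.0 38.0 48
print("IQR =", q3 - q1)                       # IQR = 13.0
```

(Note that the two conventions can differ on other datasets; here the quartile positions land exactly on data points, so they coincide.)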
A second example, with outliers
The power of box plots and the IQR shows up most clearly when the data contains extreme values.
Example 2: Monthly salaries with an outlier
A startup with 11 employees has the following monthly salaries (in thousands of rupees):
25, 28, 30, 32, 35, 38, 40, 42, 45, 50, 350
That last value is the founder's salary. Every other employee earns between 25 and 50 thousand.
Step 1. Locate the five-number summary.
The data is already sorted. With n = 11, the median is the 6th value: Q_2 = 38. The lower half is the first 5 values \{25, 28, 30, 32, 35\} with median 30, so Q_1 = 30. The upper half is the last 5 values \{40, 42, 45, 50, 350\} with median 45, so Q_3 = 45.
Why: the exclusion convention means the overall median is not repeated in either half. With 11 values, this leaves 5 on each side — clean odd counts with clear middles.
Step 2. Compute the IQR.
\text{IQR} = Q_3 - Q_1 = 45 - 30 = 15
Why: the IQR captures the width of the middle half of the data. It has nothing to do with how extreme the extremes are.
Step 3. Identify outliers using the 1.5 \times \text{IQR} rule.
Any data point below Q_1 - 1.5 \times \text{IQR} or above Q_3 + 1.5 \times \text{IQR} is flagged as an outlier. Here the fences are Q_1 - 1.5 \times \text{IQR} = 30 - 22.5 = 7.5 and Q_3 + 1.5 \times \text{IQR} = 45 + 22.5 = 67.5.
The value 350 is far above 67.5, so it is an outlier. No value is below 7.5, so there are no low-end outliers.
Why: the 1.5 \times \text{IQR} rule is a standard cut-off — any point more than one-and-a-half box widths away from the box is considered far enough out to warrant attention.
Step 4. Compare the mean and the median.
The mean of all 11 salaries is
\frac{25 + 28 + 30 + 32 + 35 + 38 + 40 + 42 + 45 + 50 + 350}{11} = \frac{715}{11} = 65
So the mean salary is ₹65,000 — a number that describes none of the employees. The median ₹38,000 is much closer to what the typical employee actually earns.
Why: the mean is pulled toward the outlier because the outlier contributes its full value to the sum. The median does not care how extreme the top value is — only that it is above the middle.
Result: Five-number summary (25, 30, 38, 45, 350), IQR = 15, outlier = 350.
One picture — one narrow box and one faraway dot — tells you the whole story: the employees are clustered tightly, one person earns vastly more than everyone else, and the "average salary" in the usual sense is not a meaningful summary of this company.
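Tukey's fences translate into a short script. A sketch applied to the salary data, using the exclusion convention for the quartiles:

```python
# Salary data in thousands of rupees (from the example above).
salaries = [25, 28, 30, 32, 35, 38, 40, 42, 45, 50, 350]

def med(v):
    m = len(v)
    return v[m // 2] if m % 2 else (v[m // 2 - 1] + v[m // 2]) / 2

s = sorted(salaries)
q1 = med(s[:len(s) // 2])           # median of the lower half: 30
q3 = med(s[(len(s) + 1) // 2:])     # median of the upper half: 45
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in s if x < low or x > high]
print(iqr, (low, high), outliers)   # 15 (7.5, 67.5) [350]
```

Exactly one point lies outside the fences: the founder's salary.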
Common confusions
- "The median of the lower half includes the overall median." Different textbooks use different conventions. The exclusion convention — the one used above — is the Indian school standard. The inclusion convention (also valid, used by some American textbooks) gives slightly different Q_1 and Q_3 for odd-sized datasets. Both are correct as long as you are consistent.
- "Percentile and percentage are the same thing." They are not. If a student scores 70 marks out of 100, that is a percentage of 70%. If that same student is in the 92nd percentile, it means they scored higher than 92% of the other students — which depends on how everyone else did, not just on the raw score. A student can score 70% and be in the 92nd percentile if most students scored below 70.
- "The IQR is a range, so it is the same kind of thing as the ordinary range." The word "range" in both names is misleading. The ordinary range is a single number — maximum minus minimum — that is extremely sensitive to outliers. The IQR is a single number — Q_3 - Q_1 — that is extremely insensitive to outliers. They measure spread, but very differently.
- "Quartiles divide the data into four equal groups, so Q_1 is the smallest 25%." Close but not quite. Q_1 is the value at the 25% cut-off, not the whole first quarter. The first quarter of the data is the set of all values below Q_1; Q_1 itself is a single number marking the boundary.
- "If there is no outlier, the box plot is useless." On the contrary — a box plot of data with no outliers still tells you where the centre is, how wide the middle half is, whether the distribution is symmetric or skewed, and how much the extremes extend beyond the box. It is a general-purpose shape indicator, not just an outlier detector.
Going deeper
If you're happy with quartiles, percentiles, IQR, and box plots as tools for summarizing a dataset, you have what you need — stop here. The rest of this article tightens up the definitions and makes a few connections to topics you will meet later.
Percentile for grouped data
When data comes in grouped form — "how many students scored between 20 and 30, between 30 and 40, ..." — you cannot point at a specific data value for P_k. Instead, you use interpolation within the class that contains the k-th percentile.
The formula, which looks intimidating but just encodes linear interpolation:
P_k = L + \frac{\frac{k}{100} N - F}{f} \times h
where:
- L is the lower boundary of the class containing the k-th percentile
- N is the total number of observations
- F is the cumulative frequency up to (but not including) that class
- f is the frequency of that class
- h is the class width
The logic: you walk up the cumulative frequency until you find the class where the k-th percentile must lie, then assume the data is uniformly spread within that class and pick the point \frac{k}{100} \cdot N - F units into the class (out of f units total), scaled by the class width. That is all the formula is doing.
For Q_1, Q_2, Q_3, the same formula works with k = 25, 50, 75.
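The walk-up-then-interpolate logic is a few lines of code. A sketch of the grouped-data formula; the frequency table is made-up data purely for illustration:

```python
def grouped_percentile(classes, k):
    """P_k for grouped data; classes is a list of (lower_boundary, width, frequency)."""
    N = sum(f for _, _, f in classes)
    target = k / 100 * N            # how many observations lie below P_k
    F = 0                           # cumulative frequency before the current class
    for L, h, f in classes:
        if F + f >= target:         # this class contains the k-th percentile
            return L + (target - F) / f * h
        F += f

# Hypothetical frequency table: 40 scores in classes of width 10.
table = [(0, 10, 4), (10, 10, 8), (20, 10, 12), (30, 10, 10), (40, 10, 6)]
print(grouped_percentile(table, 25))   # Q1 of the grouped data: 17.5
```

For Q_1 the walk stops in the class 10–20 (cumulative frequency reaches 10 there), and the interpolation lands 6/8 of the way through it.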
The 1.5 \times \text{IQR} rule, where it comes from
The cut-off Q_1 - 1.5 \times \text{IQR} and Q_3 + 1.5 \times \text{IQR} for identifying outliers is a convention, not a theorem. It was introduced by John Tukey when he invented the box plot in the 1970s as a rule of thumb.
The factor 1.5 is a compromise. For a normal distribution, roughly 99.3% of the data falls inside Tukey's fences — meaning only about 0.7% of genuinely-normal data gets flagged as outliers. Bigger factors (like 3) give you only extreme outliers and miss the moderately unusual ones; smaller factors flag too much data as suspicious. The value 1.5 catches most of the genuinely-unusual points while rarely falsely flagging normal variation.
The rule is not a formal statistical test. It is a quick visual cue: "these points deserve a second look — are they measurement errors, or are they genuine extremes worth investigating?"
Connection to the cumulative distribution
If you plot the cumulative relative frequency against the value — the "what fraction of data lies at or below x" curve — you get a staircase that rises from 0 to 1. Percentiles are exactly the inverse of this curve: to find P_k, you look up the point where the curve crosses k/100 on the vertical axis and read off the corresponding value on the horizontal axis.
This viewpoint generalises cleanly to continuous probability distributions. The cumulative distribution function F(x) = \mathbb{P}(X \leq x) is the continuous analogue of the cumulative relative frequency. Its inverse F^{-1}(p) gives you the p-quantile — the value below which a fraction p of the probability mass lies. Quartiles and percentiles of a dataset are the sample versions of quantiles of a distribution.
This connects descriptive statistics to probability theory: what quartiles are to a dataset, quantiles are to a distribution.
Where this leads next
You now have the tools to summarise a single variable in five numbers and one picture. The next step is to look at two variables at once and ask how they move together.
- Correlation — when you have pairs of observations (height and weight, study hours and exam marks), correlation measures how strongly the two are related.
- Regression — beyond measuring the relationship, regression gives you an actual line you can use to predict one variable from the other.
- Sampling — in practice you rarely have the whole population. Sampling is how you draw conclusions about a population from the quartiles and percentiles of a subset.
- Measures of Dispersion — variance and standard deviation are the algebraic counterparts to the IQR. They use every data point instead of just the quartiles, which makes them more sensitive to outliers and more connected to the normal distribution.
- Introduction to Inference — the leap from describing data to making claims about the world it came from.