In short
Data organization is the process of turning a raw pile of numbers into a structure you can reason about. The main tools are frequency distributions (counting how often each value appears), grouped frequency tables (lumping values into intervals when the range is wide), and graphical representations (bar graphs, histograms, frequency polygons, pie charts) that let you see the shape of the data at a glance.
A school in Jaipur records the marks scored by 30 students in a math test (out of 50):
Stare at those 30 numbers for ten seconds. Can you tell what the "typical" score was? Can you tell whether most students did well or poorly? Can you spot the highest and lowest marks quickly?
Probably not. The numbers are sitting there in the order they were collected, and that order carries no information at all. The first mark, 38, happened to be recorded first because that student's paper was on top of the pile. Nothing about the mathematics is in that ordering.
This is the fundamental problem of statistics: raw data is useless. You need to organize it before it tells you anything. The entire field of statistics begins with this step — not with formulas, not with probability, but with the mundane and essential act of sorting, counting, and drawing pictures.
Types of data
Before organizing, you need to know what kind of data you have. There are two fundamentally different types, and they require different tools.
Quantitative data is data you can do arithmetic on. The 30 test scores above are quantitative — you can add them, average them, and say things like "this student scored 12 more than that one." Quantitative data can be further split:
- Discrete data takes specific, separate values. The number of siblings a student has is discrete: you can have 0, 1, 2, 3 siblings, but not 2.7 siblings. Test scores (if they must be whole numbers) are also discrete.
- Continuous data can take any value in an interval. A student's height might be 162.3 cm or 162.347 cm — there is no gap between possible values. Time, weight, and temperature are continuous.
Qualitative data (also called categorical data) is data you cannot do arithmetic on. A student's favourite subject — Mathematics, Hindi, Science — is qualitative. You can count how many students prefer each subject, but you cannot average "Mathematics" and "Hindi." Qualitative data is organized differently, typically with bar charts or pie charts, where each category gets its own bar or slice.
The distinction matters because the tools change. You can draw a histogram for quantitative data (the bars touch, because the values flow continuously). You draw a bar chart for qualitative data (the bars have gaps, because the categories are separate). Mixing the two is a common mistake.
Frequency distribution: counting what you have
The simplest form of organization is a frequency distribution: a table that lists each distinct value and how many times it appears.
Go back to those 30 test scores. Sort them first — arranging the data in ascending order is the oldest and most powerful organizational move:

22, 23, 24, 25, 26, 27, 28, 29, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 38, 39, 40, 41, 41, 42, 43, 44, 45, 46, 47, 48
Already better. You can see the lowest score (22) and the highest (48) instantly. You can see that 29, 38, and 41 appear twice each. But with 27 distinct values across 30 data points, an ungrouped frequency table would be enormous and not much more useful than the sorted list. This is where grouping comes in.
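This counting step is easy to mechanize. A minimal Python sketch, using the 30 sorted scores (the same values shown in the stem-and-leaf plot at the end of this section):

```python
from collections import Counter

# The 30 sorted test scores (out of 50)
scores = [22, 23, 24, 25, 26, 27, 28, 29, 29,
          30, 31, 32, 33, 34, 35, 36, 37, 38, 38, 39,
          40, 41, 41, 42, 43, 44, 45, 46, 47, 48]

freq = Counter(scores)           # ungrouped frequency distribution
print(min(scores), max(scores))  # lowest and highest marks: 22 48
print(len(freq))                 # 27 distinct values among 30 observations
print([v for v, c in freq.items() if c == 2])  # appear twice: [29, 38, 41]
```

With 27 distinct values, printing `freq` itself would be nearly as long as the sorted list — which is exactly the argument for grouping that follows.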
Grouped frequency distributions
When data spreads over a wide range — here from 22 to 48, a span of 26 marks — you create class intervals (also called bins) and count how many observations fall in each interval.
Choose intervals of width 5: 20–25, 25–30, 30–35, 35–40, 40–45, 45–50.
A critical convention: each interval includes its lower boundary but excludes its upper boundary. So "25–30" means 25 \leq x < 30. This way every observation falls in exactly one interval, with no ambiguity.
| Class interval | Tally | Frequency |
|---|---|---|
| 20–25 | \|\|\| | 3 |
| 25–30 | \|\|\|\|\|\| | 6 |
| 30–35 | \|\|\|\|\| | 5 |
| 35–40 | \|\|\|\|\|\| | 6 |
| 40–45 | \|\|\|\|\|\| | 6 |
| 45–50 | \|\|\|\| | 4 |
| Total | | 30 |
Now something is visible that was invisible in the raw data. The distribution is fairly even through the middle of the range and thinnest at the extremes. Most students scored between 25 and 45. Nobody failed catastrophically (the minimum is 22 out of 50), and nobody scored a perfect 50.
Choosing the number of intervals. Too few intervals (say, two intervals of width 15) and you lose detail. Too many (say, intervals of width 1) and you're back to the ungrouped table. A rule of thumb: use 5 to 10 intervals for most datasets. The width should be a "round" number — 5, 10, 20 — so the table is easy to read.
Class mark (midpoint). Each interval has a representative value called its class mark, computed as the average of the boundaries. For the interval 25–30, the class mark is (25 + 30)/2 = 27.5. Class marks become important when you compute the mean from grouped data.
Cumulative frequency
A cumulative frequency table adds a running total: for each class, how many observations fall in that class or any class before it.
| Class interval | Frequency | Cumulative frequency |
|---|---|---|
| 20–25 | 3 | 3 |
| 25–30 | 6 | 9 |
| 30–35 | 5 | 14 |
| 35–40 | 6 | 20 |
| 40–45 | 6 | 26 |
| 45–50 | 4 | 30 |
The cumulative frequency tells you things the plain frequency does not. From this table you can read off: 14 students scored below 35 (that is 14/30 \approx 47\%), and 26 students scored below 45 (about 87%). These are exactly the kind of statements teachers make when they announce results.
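The grouping convention (lower boundary included, upper excluded) and the running total can be sketched in a few lines of Python. The dataset and class boundaries below are hypothetical, chosen only to illustrate the mechanics:

```python
from itertools import accumulate

# Hypothetical small dataset and class boundaries (illustrative only)
data = [12, 17, 20, 20, 23, 25, 28, 31, 34, 38]
edges = [10, 20, 30, 40]  # classes: [10,20), [20,30), [30,40)

# Lower boundary included, upper excluded: lo <= x < hi,
# so 20 falls in [20,30), not [10,20) -- no double counting.
freqs = [sum(lo <= x < hi for x in data)
         for lo, hi in zip(edges, edges[1:])]
cum = list(accumulate(freqs))  # running totals, never decreasing
marks = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]  # class marks

print(freqs)  # [2, 5, 3]
print(cum)    # [2, 7, 10]
print(marks)  # [15.0, 25.0, 35.0]
assert cum[-1] == len(data)  # the total must equal the number of observations
```

The final assertion is the same sanity check the text recommends: the last cumulative frequency must equal the number of observations.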
Graphical representations
Numbers in a table are precise. A picture is fast. The two complement each other — the table is the authority, and the picture is the intuition.
Bar graph
A bar graph draws one bar per category (or per value, for discrete data). The height of each bar represents the frequency. The bars do not touch — gaps between them signal that the categories are separate.
Histogram
A histogram looks like a bar graph, but the bars touch. This is because histograms are for continuous (or grouped) quantitative data, where the intervals are adjacent — the right edge of one interval is the left edge of the next.
Frequency polygon
A frequency polygon connects the midpoints of the tops of histogram bars with straight line segments. Plot each class mark on the horizontal axis and the corresponding frequency on the vertical axis, then join the dots.
The polygon is useful for two reasons: it shows the shape of the distribution more clearly than a histogram (you can see at a glance whether the data is symmetric, skewed, or bimodal), and it lets you overlay two distributions on the same graph for comparison — something hard to do with overlapping histograms.
Pie chart
A pie chart divides a circle into sectors, one per category, where each sector's angle is proportional to the category's share. If Math is chosen by 12 out of 30 students, its sector has angle (12/30) \times 360° = 144°.
Pie charts are best for showing proportions of a whole. They are poor at showing exact values or comparing many categories — bar graphs are better for that. Use a pie chart when you have 3–5 categories and the point is "what fraction of the total does each category occupy."
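The angle computation is one line of arithmetic per category. A Python sketch (the Mathematics count of 12 out of 30 is from the example above; the Hindi and Science counts are hypothetical):

```python
# Sector angle for each category: (count / total) * 360 degrees.
# Mathematics = 12 of 30 is from the text; the other counts are made up.
counts = {"Mathematics": 12, "Hindi": 10, "Science": 8}
total = sum(counts.values())
angles = {subject: count / total * 360 for subject, count in counts.items()}

print(angles["Mathematics"])               # 144.0, as computed in the text
print(round(sum(angles.values()), 9))      # 360.0 -- the sectors fill the circle
```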
Ogive (cumulative frequency curve)
An ogive (pronounced "oh-jive") is the graph of cumulative frequency against the upper boundary of each class. Plot the upper boundary on the horizontal axis and the cumulative frequency on the vertical axis, then connect the dots.
The ogive has a specific use that no other graph provides so directly: you can read off the median (the middle value) and any percentile by drawing a horizontal line at the desired cumulative frequency and seeing where it meets the curve.
Worked examples
Example 1: Building a grouped frequency table from raw data
The heights (in cm) of 20 students in a class are:
Step 1. Find the range. Minimum = 141, Maximum = 172, so range = 172 - 141 = 31 cm.
Why: the range tells you how wide your class intervals need to span. All intervals together must cover at least 31 cm.
Step 2. Choose a class width. With range 31 and a target of about 5–6 intervals, a width of 7 gives 31/7 \approx 4.4, so 5 intervals. Choose intervals: 140–147, 147–154, 154–161, 161–168, 168–175.
Why: 5 intervals of width 7 span 5 \times 7 = 35, which covers the range of 31 comfortably. Starting at 140 (just below the minimum) keeps things clean.
Step 3. Sort the data and tally each observation into its interval. Remember: the convention 140 \leq x < 147 means 147 goes into the next interval.
| Class interval | Heights in this interval | Frequency |
|---|---|---|
| 140–147 | 141, 143, 145, 146 | 4 |
| 147–154 | 148, 149, 151, 153 | 4 |
| 154–161 | 155, 156, 158, 159, 160 | 5 |
| 161–168 | 162, 163, 164, 167 | 4 |
| 168–175 | 168, 170, 172 | 3 |
| Total | | 20 |
Why: the total must equal 20 (the original number of observations). If it doesn't, you've miscounted — go back and check.
Step 4. Compute cumulative frequencies.
| Class interval | Frequency | Cumulative frequency |
|---|---|---|
| 140–147 | 4 | 4 |
| 147–154 | 4 | 8 |
| 154–161 | 5 | 13 |
| 161–168 | 4 | 17 |
| 168–175 | 3 | 20 |
Result: The distribution peaks in the 154–161 cm interval (5 students), with fairly even counts on either side. 13 out of 20 students (65%) are shorter than 161 cm.
The histogram shows a roughly symmetric, slightly left-leaning distribution — most students are of medium height, with a gradual taper on both sides. This is the kind of shape you expect for a biological measurement like height.
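The four steps can be checked mechanically. A Python sketch using the 20 heights from the table above:

```python
# The 20 heights (cm) from the worked example, in sorted order
heights = [141, 143, 145, 146, 148, 149, 151, 153, 155, 156,
           158, 159, 160, 162, 163, 164, 167, 168, 170, 172]

assert max(heights) - min(heights) == 31     # Step 1: the range

edges = list(range(140, 176, 7))             # 140, 147, 154, 161, 168, 175
freqs = [sum(lo <= h < hi for h in heights)  # Step 3: tally with lo <= h < hi
         for lo, hi in zip(edges, edges[1:])]
cum = [sum(freqs[:i + 1]) for i in range(len(freqs))]  # Step 4

print(freqs)  # [4, 4, 5, 4, 3]
print(cum)    # [4, 8, 13, 17, 20]
```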
Example 2: Drawing an ogive and reading the median
The weekly pocket money (in rupees) received by 25 students is recorded in the grouped frequency table below.
| Pocket money (₹) | Frequency |
|---|---|
| 50–100 | 3 |
| 100–150 | 5 |
| 150–200 | 8 |
| 200–250 | 6 |
| 250–300 | 3 |
Step 1. Build the cumulative frequency table.
| Upper boundary | Cumulative frequency |
|---|---|
| 100 | 3 |
| 150 | 8 |
| 200 | 16 |
| 250 | 22 |
| 300 | 25 |
Why: each cumulative frequency is the running total. The last value must equal the total number of students (25).
Step 2. Plot the ogive: upper boundary on the horizontal axis, cumulative frequency on the vertical axis.
Why: an ogive always plots the upper boundary — not the class mark — because the cumulative frequency "up to this value" corresponds naturally to the right edge of each interval.
Step 3. Find the median. For a raw list of 25 values, the median would be the 13th value (since (25 + 1)/2 = 13). For grouped data read from an ogive, the convention is to use N/2 = 25/2 = 12.5, treating the data as continuous. Draw a horizontal line at cumulative frequency 12.5 on the vertical axis and see where it meets the curve.
Why: the ogive is a visual interpolation tool. You enter with a cumulative frequency and read off the corresponding data value on the horizontal axis.
Step 4. The horizontal line at cumulative frequency 12.5 meets the ogive at approximately ₹178. On the segment joining (150, 8) to (200, 16), linear interpolation gives 150 + ((12.5 - 8)/8) \times 50 = 178.125.
Result: The median weekly pocket money is approximately ₹178.
The ogive confirms visually what the table shows numerically: about half the students receive less than ₹178, and half receive more. The steepest part of the curve is in the 150–200 range, which is where the highest frequency (8 students) sits — the curve climbs fastest where the data is most dense.
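Reading the ogive by eye is linear interpolation in disguise. A Python sketch of the same computation, using the grouped table from this example:

```python
# Median from a grouped table by linear interpolation -- the arithmetic
# behind reading the ogive. Pocket-money classes of width 50, from Example 2.
classes = [(50, 3), (100, 5), (150, 8), (200, 6), (250, 3)]  # (lower bound, freq)
width = 50
n = sum(f for _, f in classes)  # 25 students
target = n / 2                  # 12.5

cum = 0
for lower, f in classes:
    if cum + f >= target:       # found the median class
        median = lower + (target - cum) / f * width
        break
    cum += f

print(median)  # 178.125
```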
Common confusions
- "A histogram and a bar graph are the same thing." They are not. A histogram is for continuous/grouped quantitative data — the bars touch because the intervals are adjacent. A bar graph is for categorical data — the bars have gaps because the categories are discrete. Using a bar graph for continuous data, or a histogram for categorical data, is a conceptual error, not just a style choice.
- "The class interval 20–30 includes both 20 and 30." That depends on the convention. In the standard Indian textbook convention, 20 \leq x < 30: the interval includes 20 but excludes 30, so that 30 falls into the next interval 30–40. This avoids double-counting boundary values. Always state which convention you are using.
- "More intervals means a better table." Not necessarily. If you have 30 data points and use 30 intervals of width 1, each interval has at most 1 or 2 entries — the table is just the sorted data written vertically, and no grouping has occurred. The point of grouping is to compress the data into a summary that reveals patterns. Too many intervals defeat this purpose.
- "A pie chart is always a good choice." Pie charts work for 3–5 categories when you want to show proportions. With 10 categories, the slices become too thin to read. For comparing exact values across categories, a bar graph is almost always clearer than a pie chart.
- "Cumulative frequency decreases sometimes." Never. Cumulative frequency is a running total — it can stay the same (if a class has zero frequency) or go up, but it never goes down. If your cumulative frequency decreases at any point, there is an arithmetic error.
Going deeper
If you came here to learn how to organize data into tables and draw the standard graphs, you have it — you can stop here. The rest is for readers who want to understand the subtleties of interval selection, the connection between histograms and probability, and what happens when the class widths are unequal.
Unequal class widths
Sometimes you encounter data where the class intervals have different widths — perhaps because the data is sparse at the extremes and dense in the middle. When the widths differ, the height of the histogram bar should not be the frequency — it should be the frequency density:

frequency density = frequency / class width
The area of each bar then represents the frequency, not the height. This is a subtle but important point. If you draw an ordinary histogram (height = frequency) with unequal widths, a wide interval will look disproportionately large even if its frequency density is low.
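A small Python sketch with hypothetical unequal-width classes makes the distortion concrete:

```python
# Frequency density for unequal class widths: density = frequency / width,
# so the AREA of each bar (density * width) equals the frequency.
# The classes and counts below are hypothetical.
classes = [(0, 10, 5), (10, 30, 8), (30, 40, 6)]  # (lower, upper, frequency)
densities = [f / (hi - lo) for lo, hi, f in classes]
print(densities)  # [0.5, 0.4, 0.6]
# The wide 10-30 class has the largest frequency (8) but the LOWEST
# density: drawn with height = frequency, it would look misleadingly tall.
```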
The connection to probability
As the number of observations grows and the class width shrinks, the frequency polygon approaches a smooth curve. That smooth curve, when scaled so the total area under it is 1, is a probability density function — the foundation of continuous probability distributions like the normal distribution. The histogram is, in a real sense, an approximation to the probability density function. This connection is what makes the histogram not just a display tool but a bridge from descriptive statistics to probability theory.
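To see the scaling concretely, here is a Python sketch using the pocket-money frequencies from Example 2: dividing each frequency by N times the class width makes the total area under the histogram exactly 1, the defining property of a density.

```python
# Scaling a histogram so its total area is 1 -- a discrete step toward a
# probability density function. Frequencies from Example 2, class width 50.
freqs = [3, 5, 8, 6, 3]
width = 50
n = sum(freqs)                               # 25 observations
densities = [f / (n * width) for f in freqs]
area = sum(d * width for d in densities)
print(round(area, 9))  # 1.0 -- total area under the scaled histogram
```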
Stem-and-leaf plots
An older but still useful display is the stem-and-leaf plot. For the test-score data, split each number into a "stem" (the tens digit) and a "leaf" (the units digit):
| Stem | Leaves |
|---|---|
| 2 | 2, 3, 4, 5, 6, 7, 8, 9, 9 |
| 3 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9 |
| 4 | 0, 1, 1, 2, 3, 4, 5, 6, 7, 8 |
The advantage over a histogram: the stem-and-leaf plot preserves the individual data values while still showing the shape of the distribution. You get both the picture and the data.
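Generating the plot is a few lines of Python; a sketch using the sorted scores:

```python
# Build a stem-and-leaf plot: the stem is the tens digit,
# the leaves are the units digits.
scores = [22, 23, 24, 25, 26, 27, 28, 29, 29,
          30, 31, 32, 33, 34, 35, 36, 37, 38, 38, 39,
          40, 41, 41, 42, 43, 44, 45, 46, 47, 48]

stems = {}
for s in scores:
    stems.setdefault(s // 10, []).append(s % 10)

for stem, leaves in sorted(stems.items()):
    print(stem, "|", " ".join(map(str, leaves)))
# 2 | 2 3 4 5 6 7 8 9 9
# 3 | 0 1 2 3 4 5 6 7 8 8 9
# 4 | 0 1 1 2 3 4 5 6 7 8
```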
Where this leads next
Data organization is just the first step. Once the data is sorted, grouped, and drawn, the next questions are all about summarizing it with numbers:
- Measures of Central Tendency — mean, median, and mode: three different ways to answer the question "what is the typical value?"
- Measures of Dispersion — range, variance, and standard deviation: measuring how spread out the data is.
- Quartiles and Percentiles — dividing the sorted data into quarters and hundredths, and the box plot that shows it all at once.
- Correlation — when you have two variables instead of one, how do you tell whether they are related?