In short

A discrete random variable is a rule that assigns a number to every outcome of a random experiment. Once you have that rule, you can summarise the whole experiment with two small tables: the probability mass function, which tells you how likely each value is, and the cumulative distribution function, which tells you how likely you are to stay at or below a given value.

You and a friend are about to roll two ordinary dice and add the pips. Before the roll, you know nothing about the outcome — it could land anywhere from 2 to 12. The question you actually care about is not "which pair of faces will show" but "what total will come up, and how often?".

The natural object to reach for is a number — the sum. That number is not fixed. Every time you roll, a different value could pop out. And yet the pattern of how often each value shows up is completely predictable: over thousands of rolls, a total of 7 appears roughly six times as often as a total of 2.

Something strange is happening here. The outcome of a single roll is random. The long-run behaviour of the outcome is not. A discrete random variable is the tool that lets you hold both of those ideas at once: a number that varies from trial to trial, but whose distribution across all possible values is fixed and computable.

Naming the number before you see it

Toss a coin three times. The sample space — the set of all possible outcomes — contains eight sequences:

\{HHH,\; HHT,\; HTH,\; HTT,\; THH,\; THT,\; TTH,\; TTT\}.

Each of these eight sequences is equally likely, so each has probability \frac{1}{8}.

Now focus on something specific: the number of heads. That number is not one of the eight sequences — it is a summary of a sequence. HHT has two heads. TTT has zero heads. HHH has three heads. You can take every sequence in the sample space and attach a number to it:

Outcome Number of heads
HHH 3
HHT, HTH, THH 2
HTT, THT, TTH 1
TTT 0

You have just done something quietly important. You have built a function, call it X, whose input is an outcome and whose output is a real number. The name for this kind of function is a random variable.
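This function can be written down directly. A minimal Python sketch (the names `S` and `X` mirror the text):

```python
from itertools import product

# Sample space: all 2^3 = 8 sequences of three coin tosses.
S = ["".join(seq) for seq in product("HT", repeat=3)]

# The random variable X: a deterministic rule from outcomes to numbers.
def X(outcome):
    return outcome.count("H")

print(X("HTH"))   # 2, every single time
```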

[Figure: Random variable as a function from outcomes to numbers. Two vertical columns: the left column lists the eight possible sequences of three coin flips; arrows go from each sequence to one of the four numbers zero through three in the right column. Three arrows land on each of one and two, and single arrows land on zero and three.]
The random variable $X$ maps each of the eight outcomes in the sample space to one of the four numbers $\{0, 1, 2, 3\}$. Three different outcomes land on $X = 1$ and three on $X = 2$, which is why those values have mass $\tfrac{3}{8}$.

The word "variable" is a little misleading — X is not a variable the way x is in 2x + 3 = 7. It is a deterministic rule: feed it the sequence HTH and it returns 2, every single time. What is random is the outcome that gets fed in. The randomness of the experiment flows through X and comes out as a random value for X.

Because the possible outputs of this X are \{0, 1, 2, 3\} — a list you can count off on your fingers — X is called a discrete random variable. Later you will meet continuous random variables, whose output can land anywhere on an interval of the real line. For now, stay with the countable case.

From outcomes to a table of probabilities

Here is the move that makes random variables useful. You started with a sample space of eight equally likely sequences. Many of those sequences collapse to the same value of X. You can now forget the sequences and track only the values.

How likely is X = 2? Three sequences — HHT, HTH, THH — give X = 2, and each has probability \frac{1}{8}, so P(X = 2) = \frac{3}{8}. Do the same for every value:

k 0 1 2 3
P(X = k) \tfrac{1}{8} \tfrac{3}{8} \tfrac{3}{8} \tfrac{1}{8}

This little table is the whole distribution of X. Every question you can ask about X — what is the probability of getting at least two heads, what is the average number of heads, what is the variance — can be answered from this table alone. You no longer need the original sample space.

The table is called the probability mass function of X, usually written p_X(k) or just p(k). The word "mass" is deliberate. You can picture each possible value of X as a point on the number line, and at each point you drop a lump of probability-mass proportional to how often that value occurs. The total mass you have to distribute is always 1, because one of the values must occur.
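The collapse from outcomes to this table can be carried out mechanically. A sketch in Python, using `Fraction` so the probabilities stay exact:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = ["".join(seq) for seq in product("HT", repeat=3)]

# Group the 8 outcomes by their value of X, then divide by |S|.
counts = Counter(s.count("H") for s in S)
pmf = {k: Fraction(c, len(S)) for k, c in sorted(counts.items())}

for k, p in pmf.items():
    print(k, p)                  # 0 1/8, 1 3/8, 2 3/8, 3 1/8
assert sum(pmf.values()) == 1    # the total mass is always 1
```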

[Figure: Probability mass function for three coin tosses. A bar chart showing P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, and P(X = 3) = 1/8. The tallest bars are at k = 1 and k = 2.]
The probability mass function for $X = $ number of heads in three coin tosses. The bar heights sum to $1$ because one of the four values must occur.

Notice the shape. The most likely values are in the middle — one head and two heads each have probability \frac{3}{8}. The extreme outcomes — all tails or all heads — are much less likely. This is the first hint of a pattern that will show up again with the binomial distribution: the middle is more crowded than the ends, because there are more ways to get there.

The formal definition

Strip away the story and the dice and you get this.

Definition

A discrete random variable on a sample space S is a function X : S \to \mathbb{R} whose image \{X(s) : s \in S\} is either finite or countably infinite.

The probability mass function of X is the function p_X : \mathbb{R} \to [0, 1] defined by

p_X(k) \;=\; P(X = k) \;=\; P(\{s \in S : X(s) = k\}).

It satisfies two conditions:

  1. p_X(k) \geq 0 for every k.
  2. \displaystyle\sum_{k} p_X(k) = 1, where the sum is over all values in the image of X.

Reading the definition. The first line is saying: a discrete random variable is a rule for attaching a number to each outcome, and the set of numbers it can output is one you can list (even if the list is infinite). The mass function p_X(k) is asking: out of all the outcomes in the sample space, which ones does X send to this particular value k, and what is their combined probability? The two conditions at the end are sanity checks — probabilities cannot be negative, and the total has to be 1 because some value has to come out.

Most of the time the sample space S fades into the background. Once you have the mass function, you rarely go back to the original outcomes.

The running total: cumulative distribution function

There is a second way to package the information in the mass function, and it is often easier to work with for questions of the form "what is the probability that X is at most this much?"

Take the three-coin example and compute, for each value, the probability of getting at most that many heads:

k P(X \leq k)
0 \tfrac{1}{8}
1 \tfrac{1}{8} + \tfrac{3}{8} = \tfrac{4}{8}
2 \tfrac{1}{8} + \tfrac{3}{8} + \tfrac{3}{8} = \tfrac{7}{8}
3 \tfrac{1}{8} + \tfrac{3}{8} + \tfrac{3}{8} + \tfrac{1}{8} = 1

Each entry is a running total of the mass function. The final entry is 1, because by the time you have reached the largest value, every outcome is accounted for.

The function F_X(k) = P(X \leq k) is called the cumulative distribution function — the CDF. Unlike the mass function, which only cares about isolated points, the CDF is defined for every real number k: if you plug in k = 1.7, it returns the probability that X is at most 1.7, which for integer-valued X is the same as P(X \leq 1) = \frac{4}{8}.

Because the mass is concentrated at specific points, the CDF of a discrete random variable is a staircase. It stays flat between jumps, then jumps vertically by exactly p_X(k) at each value k in the image.
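The staircase is easy to compute: sum the mass at every value not exceeding the input. A sketch, assuming the three-coin mass function from above:

```python
from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cdf(x):
    """F_X(x) = P(X <= x): total mass at values k <= x."""
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(1))     # 1/2 (i.e. 4/8)
print(cdf(1.7))   # 1/2 again: the CDF is flat between jumps
print(cdf(3))     # 1
```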

[Figure: Cumulative distribution function for three coin tosses. A staircase graph showing F(k) = 1/8 for k between zero and one, 4/8 between one and two, 7/8 between two and three, and 1 for k at least three. Each step jumps vertically at an integer value.]
The CDF climbs from $0$ to $1$ in four steps, one at each integer. The height of each jump is exactly $p_X(k)$: the jump at $k = 1$ is $\tfrac{3}{8}$, matching the mass at $k = 1$.

The jumps are not a quirk — they are the entire information content. If someone hands you the CDF, you recover the mass function by measuring the size of each jump. So mass function and CDF are two ways of carrying exactly the same data: one as a bar chart, the other as a staircase.

Two worked examples

Example 1: rolling two dice, sum of the faces

Let X be the sum of the two faces when you roll two fair six-sided dice. Find the probability mass function and the cumulative distribution function.

Step 1. List the sample space. Each die has 6 faces, so there are 6 \times 6 = 36 equally likely ordered pairs (i, j).

Why: equally likely outcomes let you compute any probability by counting, which is the easiest case.

Step 2. For each possible sum k, count how many pairs give that sum.

The sum k = 2 comes only from (1, 1) — one pair. The sum k = 7 comes from (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) — six pairs. Work through all 11 possible sums:

k 2 3 4 5 6 7 8 9 10 11 12
pairs 1 2 3 4 5 6 5 4 3 2 1

Why: the counts are symmetric around 7 because the map (i, j) \mapsto (7 - i, 7 - j) turns every pair summing to k into a pair summing to 14 - k, so the sums k and 14 - k are achieved by equally many pairs.

Step 3. Divide each count by 36 to get the mass function.

p_X(k) = \frac{\text{pairs summing to } k}{36}.

So p_X(2) = \frac{1}{36}, p_X(7) = \frac{6}{36} = \frac{1}{6}, p_X(12) = \frac{1}{36}, and so on. Check the total: 1 + 2 + 3 + 4 + 5 + 6 + 5 + 4 + 3 + 2 + 1 = 36, so the sum of p_X is \frac{36}{36} = 1. Good.

Why: verifying that the mass sums to 1 is the one sanity check you always do — if it fails, you miscounted somewhere.

Step 4. Read off the cumulative distribution function.

Add the masses left to right. F_X(2) = \frac{1}{36}. F_X(3) = \frac{1 + 2}{36} = \frac{3}{36}. F_X(7) = \frac{1+2+3+4+5+6}{36} = \frac{21}{36}. F_X(12) = 1.

Why: once you have the mass, the CDF is just a running total — no new thinking needed.

Result: The mass function rises to a peak of \frac{6}{36} at k = 7 and falls symmetrically to \frac{1}{36} at both ends. The CDF is a staircase of 11 steps climbing from \frac{1}{36} to 1.
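The four steps translate directly into code. A sketch in Python, with exact arithmetic via `Fraction`:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Steps 1-2: the 36 equally likely ordered pairs, grouped by their sum.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))

# Step 3: divide by 36 and run the sanity check.
pmf = {k: Fraction(c, 36) for k, c in sorted(counts.items())}
assert sum(pmf.values()) == 1
assert pmf[7] == Fraction(6, 36)

# Step 4: the CDF as a running total.
cdf, total = {}, Fraction(0)
for k, p in pmf.items():
    total += p
    cdf[k] = total
print(cdf[7])   # 7/12 (i.e. 21/36)
```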

[Figure: Probability mass function for the sum of two dice. A symmetric triangular bar chart rising from 1/36 at k = 2, peaking at 6/36 at k = 7, and falling back to 1/36 at k = 12.]
The mass function for the sum of two dice. The peak at $k = 7$ corresponds to the six diagonal pairs $(1,6), (2,5), \ldots, (6,1)$.
[Figure: Cumulative distribution function for the sum of two dice. A staircase graph climbing from 1/36 at k = 2 to 1 at k = 12, with a steeper rise in the middle where the mass function is largest.]
The CDF of the sum of two dice. The staircase climbs from $\tfrac{1}{36}$ at $k = 2$ up to $1$ at $k = 12$. The largest jumps are near $k = 7$, where the mass function peaks — the CDF rises steeply wherever the distribution concentrates probability.

The picture gives you an immediate answer to questions you might have asked about the dice: 7 is the modal sum (the one that happens most often), and the odds of rolling a sum of 2 or 12 are the same and tiny.

Example 2: drawing a card until you get a heart

You shuffle a standard 52-card deck and draw cards one at a time, without replacement, until you see the first heart. Let Y be the number of draws it takes. Find the probability mass function.

Step 1. Identify the possible values of Y. There are 13 hearts and 39 non-hearts in the deck. The worst case is that you draw all 39 non-hearts first, then a heart on draw 40. So Y \in \{1, 2, 3, \ldots, 40\}.

Why: pinning down the range tells you what the mass function's domain is before you compute anything.

Step 2. Compute P(Y = 1). You draw a heart on the very first card: probability \frac{13}{52} = \frac{1}{4}.

Step 3. Compute P(Y = 2). The first card is not a heart (probability \frac{39}{52}) and the second card, drawn from the remaining 51, is a heart (probability \frac{13}{51}). Multiply:

P(Y = 2) = \frac{39}{52} \cdot \frac{13}{51} = \frac{507}{2652} \approx 0.1912.

Why: without replacement, the two draws are not independent, so you have to chain conditional probabilities.

Step 4. Find the general formula. For Y = k, the first k - 1 cards are non-hearts and the k-th card is a heart:

P(Y = k) = \frac{39}{52} \cdot \frac{38}{51} \cdots \frac{41 - k}{54 - k} \cdot \frac{13}{53 - k}.

Plug in k = 3 to check: \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{13}{50} \approx 0.1453.

Result: The mass function is largest at k = 1 and decays rapidly — by k = 5 it is below 0.1, and by k = 10 it is under 0.02. You will almost always see a heart in the first handful of cards.
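The chained product can be checked numerically. A sketch assuming the formula above (the helper name `p_first_heart` is ours):

```python
from fractions import Fraction

def p_first_heart(k, hearts=13, others=39):
    """P(Y = k): k-1 non-hearts in a row, then a heart, without replacement."""
    deck = hearts + others
    p = Fraction(1)
    for i in range(k - 1):
        p *= Fraction(others - i, deck - i)       # another non-heart
    return p * Fraction(hearts, deck - (k - 1))   # finally a heart

assert p_first_heart(1) == Fraction(1, 4)
print(float(p_first_heart(2)))   # ≈ 0.1912
print(float(p_first_heart(3)))   # ≈ 0.1453
assert sum(p_first_heart(k) for k in range(1, 41)) == 1   # no mass lost
```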

[Figure: Probability mass function for the first-heart draw. A decaying bar chart: the tallest bar is at k = 1, about 0.25, and each successive bar is shorter, with the bars nearly invisible by k = 10.]
The mass function for the number of draws until the first heart. The first bar is the largest because a quarter of all cards are hearts, so you most often succeed on the very first draw.

Both examples have the same structure: start with an experiment, name the number you care about, compute how often each value occurs, and package the result as a mass function. The mass function is where the work lives.

Going deeper

If your goal was just to understand what a discrete random variable is and how to write its mass function, you have it. The rest is for readers who want to see the machinery behind the scenes — infinite supports, the link between the CDF and the mass function, and the pre-image picture that makes the definition watertight.

Infinite but countable supports

The three-coin example has a finite list of possible values. But the "draws until first heart" example, if you imagine a version with replacement, has values 1, 2, 3, \ldots stretching to infinity. That is still discrete — the values form a countable list — but the sum of the mass function is now an infinite series.

For example, if you toss a fair coin until the first head, letting Y be the number of tosses, you get p_Y(k) = \left(\tfrac{1}{2}\right)^k for k = 1, 2, 3, \ldots. The total mass is

\sum_{k=1}^{\infty} \left(\tfrac{1}{2}\right)^k = \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots = 1.

This is a geometric series. The fact that it sums to exactly 1 is what lets the experiment be well-defined: with probability 1, a head will eventually show up. No mass has leaked away to infinity.
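You can watch the partial sums close in on 1. A quick sketch:

```python
from fractions import Fraction

# Partial sums of p(k) = (1/2)^k: each term halves the remaining gap to 1.
total = Fraction(0)
for k in range(1, 21):
    total += Fraction(1, 2) ** k
print(total)                               # 1048575/1048576
assert 1 - total == Fraction(1, 2) ** 20   # leftover mass is exactly (1/2)^20
```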

Recovering the mass function from the CDF

The mass function and the CDF carry the same information, and you can convert between them freely. Given the mass, the CDF is a running total:

F_X(k) = \sum_{j \leq k} p_X(j).

Going the other way, the mass at k is the jump in the CDF at k:

p_X(k) = F_X(k) - F_X(k^-),

where F_X(k^-) means the value of F_X just to the left of k. For the three-coin example, F_X(1) = \tfrac{4}{8} and F_X(1^-) = \tfrac{1}{8}, so p_X(1) = \tfrac{4}{8} - \tfrac{1}{8} = \tfrac{3}{8}. Matches.
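Measuring the jumps is a one-pass computation. A sketch, assuming the three-coin CDF tabulated earlier:

```python
from fractions import Fraction

# CDF of the three-coin example at its jump points.
F = {0: Fraction(1, 8), 1: Fraction(4, 8), 2: Fraction(7, 8), 3: Fraction(1)}

# p_X(k) = F(k) - F(k^-): the jump at k, with F = 0 below the smallest value.
prev, pmf = Fraction(0), {}
for k in sorted(F):
    pmf[k] = F[k] - prev
    prev = F[k]
print(pmf[1])   # 3/8, matching the mass at k = 1
```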

The pre-image definition, made precise

What does \{s \in S : X(s) = k\} really mean? It is the set of all outcomes that the function X sends to the value k — the pre-image of \{k\} under X. Probability, at its foundation, is a number attached to subsets of the sample space, not to real numbers. When you write P(X = k), you are quietly defining it as

P(X = k) \;\equiv\; P(X^{-1}(\{k\})) \;=\; P(\{s \in S : X(s) = k\}).

This is the bridge that turns the abstract rule X into a concrete probability table. In the three-coin example, X^{-1}(\{2\}) = \{HHT, HTH, THH\}, and the probability of that set of outcomes is \frac{3}{8}.
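The pre-image is just a filtered subset of the sample space. A minimal sketch:

```python
from fractions import Fraction
from itertools import product

S = ["".join(seq) for seq in product("HT", repeat=3)]

# X^{-1}({2}): the outcomes that X sends to the value 2.
preimage = [s for s in S if s.count("H") == 2]
print(preimage)                          # ['HHT', 'HTH', 'THH']
print(Fraction(len(preimage), len(S)))   # 3/8
```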

For the machinery to work, the pre-image of every set of real numbers you might care about has to be an event you can assign probability to. For discrete random variables this is automatic. For continuous ones it takes more care — that is the content of measure theory, which you will meet if you go further in probability.

Where this leads next

A random variable is only the first step. The next articles take the mass function and extract the two numbers that summarise it best — the average value, and the spread around that average — and then study the shapes of mass functions that arise in the experiments you actually care about.