In short

The geometric distribution answers "how many tries until the first success?" when each try succeeds independently with probability p. Its mean is \tfrac{1}{p} and its variance is \tfrac{1-p}{p^2}. The Poisson distribution answers "how many rare events happened in a fixed interval?" when events occur at an average rate \lambda per interval. Its mean and variance are both equal to \lambda.

You are waiting at a bus stop where buses arrive, on average, twelve per hour. How many buses will turn up in the next ten minutes? Could be zero. Could be three. Could be six.

Or flip it around: you are rolling a die, and you want a 6. How many rolls will it take? Could be one. Could be five. Could, in principle, be fifty — though you would be very unlucky.

Both of these questions have a random-variable answer. Neither of them is binomial. In the die-rolling case, you do not fix the number of trials in advance — you just keep going until you get a success, and count. In the bus-stop case, there is no fixed number of trials at all — events arrive out of a continuum of time, and you count how many landed inside a window.

These two situations are so common that each of them has its own named distribution. The first is the geometric distribution — "first success." The second is the Poisson distribution — "count of rare events."

Geometric: waiting for a success

Return to the die. Each roll is an independent trial. On each roll, the probability of rolling a 6 — call it a success — is p = \tfrac{1}{6}. The probability of failure is q = 1 - p = \tfrac{5}{6}. Let X be the number of rolls up to and including the first 6. So X = 1 means success on the first roll; X = 3 means failure, failure, success.

How likely is each value of X?

X = 1 means success on the first roll. That is just p = \tfrac{1}{6}.

X = 2 means failure, then success. Because the rolls are independent, you multiply: q \cdot p = \tfrac{5}{6} \cdot \tfrac{1}{6} = \tfrac{5}{36}.

X = 3 means failure, failure, success: q \cdot q \cdot p = \tfrac{25}{216}.

The pattern is obvious. For X = k, you need k-1 failures in a row and then a success on trial k:

p_X(k) = q^{k-1} p, \quad k = 1, 2, 3, \ldots

Check the total mass — it should be 1 — by summing the geometric series:

\sum_{k=1}^{\infty} q^{k-1} p = p \sum_{k=0}^{\infty} q^k = p \cdot \frac{1}{1-q} = \frac{p}{p} = 1.

The sum is 1. Every geometric distribution is a legitimate distribution of probability mass over \{1, 2, 3, \ldots\}.
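Both the mass function and the convergence of the partial sums are easy to check numerically. Here is a minimal sketch in Python, using exact fractions so the die example comes out exactly (the helper name `geom_pmf` is just for this illustration):

```python
from fractions import Fraction

def geom_pmf(k, p):
    """P(X = k) = (1 - p)^(k - 1) * p: k - 1 failures in a row, then a success."""
    return (1 - p) ** (k - 1) * p

p = Fraction(1, 6)
print(geom_pmf(1, p), geom_pmf(2, p), geom_pmf(3, p))  # 1/6 5/36 25/216

# Partial sums approach 1; with exact fractions, the missing tail is exactly q^K.
K = 100
partial = sum(geom_pmf(k, p) for k in range(1, K + 1))
print(1 - partial == Fraction(5, 6) ** K)  # True
```

Using `Fraction` instead of floats makes the tail identity exact: the mass not yet accounted for after K trials is precisely q^K, the probability of K failures in a row.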

[Figure: geometric mass function for p = 1/6. A decaying bar chart starting at height 1/6 at k = 1; each bar is 5/6 of the previous one, nearly invisible by k = 12.]
The geometric mass function for rolling a fair die until the first $6$. Each bar is $\tfrac{5}{6}$ the height of the previous one, because getting to the next bar requires one more failure.

The shape is characteristic: every bar is q times the previous one. The most likely single value is always k = 1, because that is the shortest wait. But the average wait — the mean of the distribution — is not 1. It is something bigger.

The mean of a geometric distribution

What is the expected number of rolls to see the first 6? Intuition says: if the probability of success per roll is \tfrac{1}{6}, the average should be 6 rolls. That intuition is right, and you can prove it.

By definition, the expectation is

E[X] = \sum_{k=1}^{\infty} k \cdot p_X(k) = \sum_{k=1}^{\infty} k \cdot q^{k-1} p.

Pull the constant p out front:

E[X] = p \sum_{k=1}^{\infty} k q^{k-1}.

The sum \sum_{k=1}^{\infty} k q^{k-1} is a standard result. Start from the ordinary geometric series \sum_{k=0}^{\infty} q^k = \frac{1}{1-q}. Differentiate both sides with respect to q:

\sum_{k=1}^{\infty} k q^{k-1} = \frac{1}{(1-q)^2} = \frac{1}{p^2}.

Plug back in:

E[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}.

Clean. The expected number of trials to the first success is \tfrac{1}{p}. For the die, that is \tfrac{1}{1/6} = 6.

The variance of a geometric distribution

With more algebra of the same flavour, you can show

\text{Var}(X) = \frac{1 - p}{p^2} = \frac{q}{p^2}.

For the die, \text{Var}(X) = \frac{5/6}{1/36} = 30, so the standard deviation is \sqrt{30} \approx 5.48. That is a lot of spread — and it is why waiting for a 6 feels unpredictable even when the average is only 6. A single wait can easily run to double the average or be over on the very first roll.
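Both formulas are easy to sanity-check by simulation. A quick sketch using only the standard library (the helper `rolls_until_six` is written just for this illustration):

```python
import random

def rolls_until_six(rng):
    """One geometric trial: roll a fair die until the first 6, return the count."""
    count = 1
    while rng.randint(1, 6) != 6:
        count += 1
    return count

rng = random.Random(42)
n = 200_000
waits = [rolls_until_six(rng) for _ in range(n)]

mean = sum(waits) / n
var = sum((w - mean) ** 2 for w in waits) / n
print(round(mean, 2), round(var, 1))  # should land near 1/p = 6 and q/p^2 = 30
```

With 200,000 simulated waits, the sample mean typically lands within a few hundredths of 6, while the sample variance hovers around 30 — the large spread the text describes.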

[Figure: three geometric distributions overlaid, for p = 0.5, 0.2, and 0.1 (means 2, 5, and 10). The smaller p is, the slower the decay and the longer the tail.]
Three geometric distributions for different success probabilities. The smaller $p$ is, the flatter and longer the distribution becomes — and the longer the expected wait $1/p$ for a success.

The formal definition of the geometric distribution

Geometric distribution

A discrete random variable X has a geometric distribution with success probability p, written X \sim \text{Geom}(p), if its probability mass function is

p_X(k) = (1-p)^{k-1}\, p, \qquad k = 1, 2, 3, \ldots

Its mean and variance are

E[X] = \frac{1}{p}, \qquad \text{Var}(X) = \frac{1-p}{p^2}.

Reading the definition. The variable X counts the number of independent Bernoulli trials it takes to get the first success, where each trial succeeds with probability p. The factor (1-p)^{k-1} is the probability of k-1 failures in a row; the extra factor of p is the success at the end.

One warning about conventions: some textbooks define the geometric distribution as the number of failures before the first success, so they start from k = 0 instead of k = 1. That shifts every formula by 1 — the mean becomes \tfrac{1-p}{p}. Both conventions are legitimate. Here, and in most Indian textbooks, you count inclusively: the trial on which the success occurs is counted.
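The shift between the two conventions is easy to see numerically. A small sketch in pure Python (the function names are illustrative; the truncation at k = 2000 leaves a tail far below floating-point precision):

```python
def pmf_trials(k, p):
    """Trials convention: support k = 1, 2, 3, ...; mean 1/p."""
    return (1 - p) ** (k - 1) * p

def pmf_failures(k, p):
    """Failures-before-first-success convention: support k = 0, 1, 2, ...; mean (1-p)/p."""
    return (1 - p) ** k * p

p = 1 / 6
# The two conventions describe the same distribution, shifted by one:
print(pmf_trials(3, p) == pmf_failures(2, p))  # True

# Means by (truncated) summation, differing by exactly 1:
m_trials = sum(k * pmf_trials(k, p) for k in range(1, 2000))
m_failures = sum(k * pmf_failures(k, p) for k in range(0, 2000))
print(round(m_trials, 6), round(m_failures, 6))  # ≈ 6.0 and 5.0
```

Whichever convention a textbook (or a software library) uses, the two random variables differ by exactly 1, so every formula translates mechanically.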

Poisson: counting rare events

Shift gears to the bus-stop scenario. Buses arrive at an average rate of 12 per hour. In a 10-minute window, that works out to an average of \lambda = 2 buses. You want to know the probability of seeing exactly k buses in that window, for k = 0, 1, 2, 3, \ldots

This looks nothing like a binomial problem. There is no fixed number of trials. Buses do not arrive at discrete instants like 1, 2, 3, \ldots minutes — they can show up at any continuous time.

Here is the trick. Chop the 10-minute window into n tiny slots, each of length \tfrac{10}{n} minutes. Make n large enough that each slot is too short for more than one bus to arrive in it. In each slot, a bus either arrives (success) or does not (failure). With an average rate of 2 buses per 10 minutes, the probability of a bus in a single slot of length \tfrac{10}{n} minutes is p = \tfrac{2}{n}.

With this slicing, the number of buses is a binomial random variable with parameters n and p = \tfrac{2}{n}. The probability of exactly k buses is

P(X = k) = \binom{n}{k} \left(\frac{2}{n}\right)^k \left(1 - \frac{2}{n}\right)^{n-k}.

Now let n \to \infty — let the slots get infinitely thin. Write \lambda = np = 2 and work out the limit.

The binomial coefficient is

\binom{n}{k} = \frac{n (n-1)(n-2)\cdots(n-k+1)}{k!}.

Multiply by \left(\tfrac{\lambda}{n}\right)^k:

\binom{n}{k} \left(\frac{\lambda}{n}\right)^k = \frac{\lambda^k}{k!} \cdot \frac{n(n-1)\cdots(n-k+1)}{n^k}.

As n \to \infty the fraction on the right has k factors, each of the form \tfrac{n - j}{n} \to 1. So that piece tends to 1, and you are left with \tfrac{\lambda^k}{k!}.

Now the other factor:

\left(1 - \frac{\lambda}{n}\right)^{n-k} = \left(1 - \frac{\lambda}{n}\right)^n \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}.

As n \to \infty, the first piece approaches e^{-\lambda} (that is the defining limit of the exponential function), and the second piece approaches 1^{-k} = 1.

Multiply everything together and you get

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, 3, \ldots

That is the Poisson probability mass function. It came out of a binomial in the limit where the number of trials explodes to infinity, the success probability per trial shrinks to zero, and their product — the expected number of successes — stays fixed at \lambda. That last condition is the one that matters: whenever you have a process where events are rare but the expected count per unit interval is stable, the Poisson distribution shows up.
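You can watch this limit happen numerically. A sketch comparing the binomial probability of k = 3 events, for increasingly fine slicing, against its Poisson limit (`math.comb` requires Python 3.8 or later; the helper names are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """The limiting Poisson probability."""
    return lam ** k * exp(-lam) / factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 10_000):
    # Slice the window into n slots, each with success probability lam / n.
    print(n, round(binom_pmf(k, n, lam / n), 6))
print("Poisson", round(poisson_pmf(k, lam), 6))
```

As n grows the binomial value settles onto the Poisson value: the slicing gets finer, the per-slot probability \lambda/n shrinks, and the product n \cdot (\lambda/n) = \lambda stays fixed — exactly the limiting regime described above.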

[Figure: Poisson mass function for λ = 2. Tied mode at k = 1 and k = 2 with height about 0.27, falling to near zero by k = 8; the distribution is skewed right.]
The Poisson mass function for $\lambda = 2$ buses in a $10$-minute window. The mode is at $k = 1$ and $k = 2$ — they are tied, because $\lambda$ is an integer. The distribution is skewed right: you sometimes see zero buses, rarely see five, and almost never see more.

The mean and variance of a Poisson

Because the Poisson mass function is p_X(k) = \tfrac{\lambda^k e^{-\lambda}}{k!}, you can compute the mean directly:

E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!}.

The k = 0 term contributes nothing. For k \geq 1, write \tfrac{k}{k!} = \tfrac{1}{(k-1)!}:

E[X] = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda e^{-\lambda} \cdot e^{\lambda} = \lambda.

The mean is \lambda. That should feel right: \lambda was defined as the expected number of events per interval, and the distribution was built to match.

For the variance, the same kind of manipulation gives E[X(X-1)] = \lambda^2, from which E[X^2] = \lambda^2 + \lambda, and then

\text{Var}(X) = E[X^2] - (E[X])^2 = (\lambda^2 + \lambda) - \lambda^2 = \lambda.

The Poisson distribution has the striking property that its mean and variance are equal. That is not a coincidence but a structural constraint, and distributions with this property are relatively rare. It gives you a signature for recognising Poisson data: if the sample mean and sample variance of some count data are both close to the same number, Poisson is a natural first guess.
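You can see the signature in simulated counts. This sketch draws Poisson samples with Knuth's classical multiplicative method (adequate for small \lambda; the sampler is written here purely for illustration):

```python
import random
from math import exp

def poisson_sample(lam, rng):
    """Knuth's method: count uniforms until their running product drops below e^(-lam)."""
    threshold = exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < threshold:
            return k
        k += 1

rng = random.Random(3)
n = 50_000
counts = [poisson_sample(2.0, rng) for _ in range(n)]

mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / n
print(round(mean, 2), round(var, 2))  # both should be close to lambda = 2
```

With 50,000 samples at \lambda = 2, the sample mean and sample variance typically agree to within a few hundredths — the mean-equals-variance signature in action.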

[Figure: Poisson distributions for λ = 1, 3, and 8 overlaid. Small λ is right-skewed and concentrated near zero; larger λ is more spread out and bell-like.]
Poisson shapes for $\lambda = 1$, $3$, and $8$. For small $\lambda$ the distribution piles up near zero and is strongly right-skewed. For large $\lambda$ the distribution spreads out, becomes more symmetric, and starts to resemble a bell curve — a preview of the normal approximation to the Poisson for large $\lambda$.

The formal definition of the Poisson distribution

Poisson distribution

A discrete random variable X has a Poisson distribution with parameter \lambda > 0, written X \sim \text{Poisson}(\lambda), if its probability mass function is

p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, 3, \ldots

Its mean and variance are

E[X] = \lambda, \qquad \text{Var}(X) = \lambda.

The parameter \lambda is the expected number of events in one unit of whatever you are measuring — per hour, per page, per square kilometre. If you change the length of the interval, \lambda scales linearly.

Two worked examples

Example 1: calls to a helpdesk

A small helpdesk receives an average of \lambda = 3 calls per hour during office hours. Assume calls arrive independently. What is the probability of exactly 5 calls in the next hour? What is the probability of at least one call in the next hour?

Step 1. Identify the distribution. The number of calls in a fixed time window, at a stable rate, with independent arrivals — this is Poisson with \lambda = 3.

Why: the three conditions — fixed window, stable rate, independence — are the recognition test for Poisson.

Step 2. Compute P(X = 5) using the formula.

P(X = 5) = \frac{3^5 \, e^{-3}}{5!} = \frac{243 \cdot e^{-3}}{120}.

Numerically, e^{-3} \approx 0.04979. So

P(X = 5) \approx \frac{243 \cdot 0.04979}{120} \approx \frac{12.099}{120} \approx 0.1008.

About a 10\% chance of exactly five calls in the next hour.

Why: plugging into the Poisson formula is mechanical — the only work is the arithmetic.

Step 3. Compute P(X \geq 1). Use the complement — it is much easier to compute one thing than infinitely many.

P(X \geq 1) = 1 - P(X = 0) = 1 - \frac{3^0 e^{-3}}{0!} = 1 - e^{-3} \approx 1 - 0.0498 \approx 0.9502.

Why: P(X \geq 1) = 1 - P(X = 0) because "zero calls" and "at least one call" are complementary events.

Step 4. Interpret. There is roughly a 95\% chance of seeing at least one call in the next hour — nearly certain. And a 10\% chance of seeing exactly five.

Result: P(X = 5) \approx 0.1008 and P(X \geq 1) \approx 0.9502.
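The arithmetic of Steps 2 and 3 can be replayed in a few lines, using Python purely as a calculator (the helper name is illustrative):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

lam = 3
p5 = poisson_pmf(5, lam)              # P(X = 5)
p_at_least_one = 1 - poisson_pmf(0, lam)  # complement of "zero calls"
print(round(p5, 4), round(p_at_least_one, 4))  # 0.1008 0.9502
```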

[Figure: Poisson mass function for λ = 3. Tied mode at k = 2 and k = 3 with height about 0.224; the bar at k = 5 is about 0.10, and the bar at k = 0 about 0.05.]
The Poisson mass function for $\lambda = 3$. The bar at $k = 5$ has height about $0.10$ — the answer to the first question.

Example 2: geometric distribution for a free-throw shooter

A basketball player makes each free throw independently with probability p = 0.8. Let X be the number of shots until the player makes the first basket. Find P(X = 3) and E[X].

Step 1. Identify the distribution. "Number of trials until the first success" with independent trials — geometric, with p = 0.8 and q = 0.2.

Step 2. Compute P(X = 3). You need two misses followed by a make:

P(X = 3) = q^2 p = (0.2)^2 (0.8) = 0.04 \cdot 0.8 = 0.032.

Why: the geometric formula is just the probability of k-1 failures and then one success, multiplied because the trials are independent.

Step 3. Compute E[X]. For a geometric distribution, E[X] = \tfrac{1}{p}:

E[X] = \frac{1}{0.8} = 1.25.

On average, this shooter takes 1.25 attempts to make the first basket.

Why: E[X] = 1/p is the headline result for the geometric distribution — intuitively, the higher the success rate, the shorter the wait.

Step 4. Reality-check. Since P(X = 1) = 0.8, the player succeeds on the very first try 80\% of the time. The probability of needing three or more attempts is only P(X \geq 3) = (0.2)^2 = 0.04 — a tiny 4\%. The mean 1.25 is consistent with this: most of the mass is at X = 1, with small corrections from the longer waits.

Result: P(X = 3) = 0.032 and E[X] = 1.25.
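The same answers, computed mechanically (floating-point arithmetic, so compare with a tolerance rather than exact equality):

```python
p = 0.8
q = 1 - p

p3 = q ** 2 * p           # P(X = 3): miss, miss, make
mean = 1 / p              # E[X] = 1/p
p_three_plus = q ** 2     # P(X >= 3): the first two shots both miss

print(round(p3, 3), mean, round(p_three_plus, 3))  # 0.032 1.25 0.04
```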

[Figure: geometric mass function for p = 0.8. Bars of height 0.8, 0.16, and 0.032 at k = 1, 2, 3, essentially invisible afterwards.]
The geometric mass function for $p = 0.8$. The first bar is enormous — you almost always make the shot on the first attempt. Subsequent bars shrink by a factor of $0.2$ each step.

Going deeper

If you only need to recognise these two distributions and plug numbers into the formulas, you have enough. The rest is for readers who want to see the geometric-as-memoryless characterisation, the continuity from binomial to Poisson, and the rigorous proof of the Poisson variance.

The memoryless property of the geometric

Suppose you have been rolling the die and have already failed ten times in a row, still waiting for a 6. What is the probability you need more than five further rolls?

The answer is the same as if you had just started. Conditional on having failed for ten rolls already, the distribution of additional rolls to the first success is again \text{Geom}(p). In symbols:

P(X > 10 + 5 \mid X > 10) = P(X > 5).

Proof. The event X > k means the first k rolls all failed, so P(X > k) = q^k. Then P(X > 15) = q^{15} and P(X > 10) = q^{10}. Divide:

P(X > 15 \mid X > 10) = \frac{q^{15}}{q^{10}} = q^{5} = P(X > 5).

The past does not help you predict the future — the distribution resets. This is called the memoryless property, and among discrete distributions on \{1, 2, 3, \ldots\}, the geometric is the only distribution with it. It is one of the reasons the geometric appears so often in queueing and reliability problems.
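The memoryless property is easy to verify by simulation: condition on a long streak of failures and check that the remaining wait looks like a fresh geometric. A sketch, where the tail probability being compared is P(X > 5) = (5/6)^5 \approx 0.402 (the helper name is illustrative):

```python
import random

def first_success(rng, p):
    """Number of independent trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(1)
p = 1 / 6
n = 300_000
waits = [first_success(rng, p) for _ in range(n)]

# Conditional tail given ten failures already, vs. the unconditional tail.
past_ten = [w for w in waits if w > 10]
cond = sum(1 for w in past_ten if w > 15) / len(past_ten)
uncond = sum(1 for w in waits if w > 5) / n
print(round(cond, 3), round(uncond, 3))  # both near (5/6)^5 ≈ 0.402
```

The two estimated probabilities agree to within simulation noise: knowing you have already failed ten times tells you nothing about how much longer you will wait.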

Verifying \text{Var}(X) = \lambda for the Poisson

Computing E[X^2] directly is messy, so use the trick of computing E[X(X-1)] instead:

E[X(X-1)] = \sum_{k=0}^{\infty} k(k-1) \cdot \frac{\lambda^k e^{-\lambda}}{k!}.

The k = 0 and k = 1 terms vanish. For k \geq 2, \tfrac{k(k-1)}{k!} = \tfrac{1}{(k-2)!}, so

E[X(X-1)] = e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^k}{(k-2)!} = \lambda^2 e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda^2 e^{-\lambda} \cdot e^{\lambda} = \lambda^2.

Now E[X^2] = E[X(X-1)] + E[X] = \lambda^2 + \lambda, and the variance is

\text{Var}(X) = E[X^2] - (E[X])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.

This little manipulation — replacing k^2 with k(k-1) + k — is the cleanest way to handle second moments for distributions where k! appears in the denominator.
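The factorial-moment trick is easy to confirm numerically for a concrete \lambda. A sketch truncating the sums at K = 60, beyond which the \lambda = 2 tail is negligible:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.0
K = 60  # truncation point; terms beyond K are astronomically small for lam = 2

e_x = sum(k * poisson_pmf(k, lam) for k in range(K))                # E[X]
e_falling = sum(k * (k - 1) * poisson_pmf(k, lam) for k in range(K))  # E[X(X-1)]
var = e_falling + e_x - e_x ** 2                                    # E[X^2] - E[X]^2
print(round(e_x, 6), round(e_falling, 6), round(var, 6))  # ≈ 2, 4, 2
```

The numbers match the algebra: E[X] = \lambda, E[X(X-1)] = \lambda^2, and the variance comes out as \lambda.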

When Poisson fails

The Poisson distribution has assumptions baked into it: events are independent, the rate \lambda is constant over the interval, and no two events can occur at exactly the same instant. When those assumptions break, the fit gets worse.

For example, bus arrivals are not perfectly Poisson — buses tend to bunch up because of traffic and driver spacing, so the real-world variance is often larger than the mean. Data scientists call this overdispersion and reach for a different distribution (like the negative binomial) when it shows up. The signature is exactly that mean-variance mismatch: if the sample variance is much bigger than the sample mean, the Poisson is the wrong model.
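Overdispersion is easy to manufacture: let the rate itself fluctuate. In this sketch each window's rate is either 0.5 or 3.5 with equal probability (mean rate 2), which keeps the mean count at 2 but inflates the variance to about 2 + 1.5^2 = 4.25 (the sampler uses Knuth's method and is written purely for illustration):

```python
import random
from math import exp

def poisson_sample(lam, rng):
    """Knuth's multiplicative method for Poisson sampling."""
    threshold = exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < threshold:
            return k
        k += 1

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

rng = random.Random(7)
n = 50_000
pure = [poisson_sample(2.0, rng) for _ in range(n)]                     # constant rate
mixed = [poisson_sample(rng.choice([0.5, 3.5]), rng) for _ in range(n)]  # fluctuating rate

print(mean_var(pure))   # mean ≈ var ≈ 2: Poisson signature
print(mean_var(mixed))  # mean ≈ 2 but var ≈ 4.25: overdispersed
```

Both samples have the same mean, but only the constant-rate one passes the mean-equals-variance test. The fluctuating-rate sample is exactly the kind of data for which a negative binomial model fits better.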

Where this leads next

Geometric and Poisson are two more pieces of the discrete-distribution toolkit. The articles ahead take you off the discrete line entirely — into probabilities measured by area under curves.