In short
The expected value E(X) of a discrete random variable is the probability-weighted average of all its possible values — the number the average converges to over many repetitions. The variance \text{Var}(X) measures how far the values typically land from that average. Standard deviation is the square root of variance, bringing the spread back to the original units.
A street vendor in Jaipur sells kulfi. On any given evening, demand is uncertain: she sells 10 kulfi with probability 0.3, sells 20 with probability 0.5, or sells 30 with probability 0.2. She needs to decide how many to prepare. If she makes too few, she loses sales. If she makes too many, the leftover melts.
She doesn't know what tonight's demand will be. But she can ask a sharper question: on average, across many evenings, how many kulfi does she sell per night?
This is not asking for the most common outcome (that would be 20). It is asking for the long-run average — the number that the running mean of her nightly sales would settle toward if she tracked it for a hundred evenings, or a thousand.
That number has a name: it is the expected value of her demand. And computing it requires a surprisingly simple idea.
Building the average, one outcome at a time
The vendor's nightly demand X takes three possible values: 10, 20, and 30. If she tracked her sales for 1000 evenings, she would expect roughly 300 evenings with demand 10, roughly 500 evenings with demand 20, and roughly 200 evenings with demand 30. The total kulfi sold across all 1000 evenings would be approximately

300 \times 10 + 500 \times 20 + 200 \times 30 = 3000 + 10000 + 6000 = 19000.
The average per evening would be 19000 / 1000 = 19.
Now look at what you actually computed. Dividing through by 1000:

\frac{19000}{1000} = \frac{300}{1000} \times 10 + \frac{500}{1000} \times 20 + \frac{200}{1000} \times 30 = 0.3 \times 10 + 0.5 \times 20 + 0.2 \times 30 = 19.
Each value got multiplied by its probability, and everything was added up. The 1000 evenings dropped out entirely. The answer depends only on the values and their probabilities — not on how many trials you imagine running.
This is the key insight. The long-run average is computed by weighting each outcome by how likely it is.
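The claim that the running mean settles toward the weighted sum can be checked with a short simulation. A minimal sketch in Python, using the kulfi distribution from above (the seed and sample size are arbitrary choices):

```python
import random

# Kulfi demand distribution from the example above.
values = [10, 20, 30]
probs = [0.3, 0.5, 0.2]

# Probability-weighted average: 10(0.3) + 20(0.5) + 30(0.2) = 19.
expected = sum(v * p for v, p in zip(values, probs))

# Simulate many evenings; the running mean should settle near 19.
random.seed(1)
n = 100_000
sales = random.choices(values, weights=probs, k=n)
running_mean = sum(sales) / n

print(expected)                # 19.0
print(round(running_mean, 1)) # close to 19
```

Note that the simulated mean only approaches 19; the weighted sum gives it exactly, with no trials at all.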
The formal definition
Expected Value
Let X be a discrete random variable that takes values x_1, x_2, \ldots, x_n with probabilities P(X = x_1), P(X = x_2), \ldots, P(X = x_n). The expected value (or mean) of X is

E(X) = \sum_{i=1}^{n} x_i \, P(X = x_i).
It is also written \mu or \mu_X.
The formula says: take each possible value, multiply it by its probability, add them all up. That's it. The expected value is not necessarily a value that X can actually take — in the kulfi example, the vendor never sells exactly 19 kulfi on any single evening. It is the centre of gravity of the probability distribution, the balance point around which all outcomes cluster.
Properties of expectation
Expected value obeys several rules that make it much easier to work with than you might expect. Here are the three most important, each with a proof.
Property 1: Linearity — E(aX + b) = aE(X) + b
If you scale every outcome by a constant a and then shift by b, the expected value scales and shifts the same way.
Proof. If X takes values x_i with probabilities p_i, then aX + b takes values ax_i + b with the same probabilities. So

E(aX + b) = \sum_i (a x_i + b) \, p_i = a \sum_i x_i p_i + b \sum_i p_i.
The first sum is aE(X). The second sum is b \cdot 1 = b, because all probabilities add to 1. Therefore E(aX + b) = aE(X) + b. \square
What this means. If the kulfi vendor charges ₹5 per kulfi, her revenue on a night with demand X is 5X. The expected revenue is E(5X) = 5 \, E(X) = 5 \times 19 = 95 rupees. She doesn't need to recompute the whole sum — she just multiplies.
Property 2: Expectation of a sum — E(X + Y) = E(X) + E(Y)
The expected value of a sum is the sum of the expected values — always, whether or not X and Y are independent.
Proof. Let X take values x_i with probabilities p_i, and Y take values y_j with probabilities q_j. Every outcome is a pair (x_i, y_j) with joint probability P(X = x_i, Y = y_j), and

E(X + Y) = \sum_i \sum_j (x_i + y_j) \, P(X = x_i, Y = y_j).
Split the sum:

E(X + Y) = \sum_i \sum_j x_i \, P(X = x_i, Y = y_j) + \sum_i \sum_j y_j \, P(X = x_i, Y = y_j).
In the first double sum, x_i does not depend on j, so pull it out:

\sum_i x_i \sum_j P(X = x_i, Y = y_j) = \sum_i x_i \, P(X = x_i) = E(X).
The inner sum collapses because summing the joint probability over all j gives the marginal probability of X = x_i. By the same argument, the second double sum equals E(Y). Therefore E(X + Y) = E(X) + E(Y). \square
This is a remarkably powerful result. If you roll two dice and want the expected total, you don't need to enumerate all 36 pairs. Each die has expected value 3.5, so the expected total is 3.5 + 3.5 = 7. Done.
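The two-dice claim is easy to verify by brute force. A sketch comparing the full 36-pair enumeration against the linearity shortcut, using exact fractions to avoid rounding:

```python
from itertools import product
from fractions import Fraction

# Expected value of one fair die: (1 + 2 + ... + 6) / 6 = 3.5.
one_die = Fraction(sum(range(1, 7)), 6)

# Enumerate all 36 equally likely pairs and average the totals.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]
expected_total = Fraction(sum(totals), len(totals))

print(one_die)         # 7/2
print(expected_total)  # 7
```

Linearity predicts E(X + Y) = 3.5 + 3.5 = 7, and the enumeration confirms it.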
Property 3: Expectation of a constant — E(c) = c
A constant has no randomness, so its expected value is just itself. This is a special case of Property 1 with a = 0 and b = c: E(0 \cdot X + c) = 0 \cdot E(X) + c = c.
What the average doesn't tell you
The expected value captures the centre, but two distributions can have the same centre and look completely different.
Consider two cricket batsmen. Batsman A scores 50 runs in every innings. Batsman B scores either 0 or 100, each with probability \frac{1}{2}. Both have the same expected score: E(A) = 50 and E(B) = 0.5 \times 0 + 0.5 \times 100 = 50. But their reliability is worlds apart. A team captain choosing a batsman for a crucial match cares about more than the average — she cares about how much the scores vary.
This is what variance measures.
Variance: measuring the spread
The idea is natural. If you want to know how far outcomes typically land from the mean, compute the average distance from the mean. But there is a subtlety: distances above and below the mean cancel each other out if you just add them. A score of 40 is 10 below the mean and a score of 60 is 10 above it — their deviations sum to zero.
The fix: square the deviations before averaging. Squaring makes every deviation positive, and it also penalises large deviations more than small ones.
Variance
The variance of a discrete random variable X with mean \mu = E(X) is

\text{Var}(X) = E\big[(X - \mu)^2\big] = \sum_i (x_i - \mu)^2 \, P(X = x_i).
It is also written \sigma^2 or \sigma_X^2.
Derivation of the shortcut formula. The definition involves \mu, which can make direct computation messy. There is a cleaner equivalent:

\text{Var}(X) = E(X^2) - [E(X)]^2.
Here is why. Expand the square inside the definition:

\text{Var}(X) = E\big[(X - \mu)^2\big] = E\big(X^2 - 2\mu X + \mu^2\big).
Apply linearity of expectation (Property 1 and Property 2):

\text{Var}(X) = E(X^2) - 2\mu \, E(X) + \mu^2.
But \mu = E(X), so 2\mu \, E(X) = 2\mu^2 and the expression becomes

\text{Var}(X) = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2 = E(X^2) - [E(X)]^2. \square
This shortcut is almost always faster. To find the variance: compute E(X), compute E(X^2), subtract the square of the first from the second.
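Both routes give the same number, which is easy to confirm numerically. A sketch using the kulfi distribution from earlier (values and probabilities as above):

```python
values = [10, 20, 30]
probs = [0.3, 0.5, 0.2]

mu = sum(x * p for x, p in zip(values, probs))        # E(X) = 19
e_x2 = sum(x**2 * p for x, p in zip(values, probs))   # E(X^2) = 410

# Definition: probability-weighted average of squared deviations.
var_definition = sum((x - mu)**2 * p for x, p in zip(values, probs))

# Shortcut: E(X^2) - [E(X)]^2.
var_shortcut = e_x2 - mu**2

# Both are 49 (up to floating-point rounding).
print(round(var_definition, 6), round(var_shortcut, 6))
```

The shortcut needs one pass for E(X) and one for E(X^2); the definition needs E(X) first and then a second pass over the deviations, which is why the shortcut is usually less error-prone by hand.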
Properties of variance
Var(c) = 0: A constant has no spread. Follows directly from the definition — (c - c)^2 = 0.
Var(aX + b) = a^2 \,\text{Var}(X): Scaling by a scales variance by a^2 (because you're squaring deviations), and adding b does not change the spread at all — it just shifts everything.
Proof. Let \mu = E(X). Then E(aX + b) = a\mu + b. So

\text{Var}(aX + b) = E\big[\big(aX + b - (a\mu + b)\big)^2\big] = E\big[a^2 (X - \mu)^2\big] = a^2 \, E\big[(X - \mu)^2\big] = a^2 \, \text{Var}(X). \square
Notice: the shift b vanishes entirely. Moving a distribution left or right does not change how spread out it is.
Standard deviation
Variance has an inconvenient feature: its units are the square of the original units. If X is measured in runs, \text{Var}(X) is in "runs squared" — a quantity nobody can picture.
The fix is simple: take the square root.
Standard Deviation
The standard deviation of X is

\sigma = \sqrt{\text{Var}(X)}.
It measures spread in the same units as X itself.
For Batsman B, \text{Var}(X) = 2500, so \sigma = \sqrt{2500} = 50 runs. This says: on a typical innings, Batsman B's score deviates from the mean by about 50 runs — which matches the picture perfectly, since the only possible scores are 0 and 100, each exactly 50 away from the mean of 50.
Computing one from start to finish
Example 1: Expected value and variance of a loaded die
A loaded die has the following probability distribution:
| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| P(X = x) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 |
Find E(X), \text{Var}(X), and \sigma.
Step 1. Compute E(X).

E(X) = 1(0.1) + 2(0.1) + 3(0.1) + 4(0.1) + 5(0.1) + 6(0.5) = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5
Why: each value times its probability, summed. The heavy loading on 6 pulls the mean above the fair-die value of 3.5.
Step 2. Compute E(X^2).

E(X^2) = 1(0.1) + 4(0.1) + 9(0.1) + 16(0.1) + 25(0.1) + 36(0.5) = 0.1 + 0.4 + 0.9 + 1.6 + 2.5 + 18.0 = 23.5
Why: the shortcut formula needs E(X^2). You apply the same weighted-sum rule, but to x^2 instead of x.
Step 3. Compute \text{Var}(X) using the shortcut.

\text{Var}(X) = E(X^2) - [E(X)]^2 = 23.5 - (4.5)^2 = 23.5 - 20.25 = 3.25
Why: the shortcut E(X^2) - [E(X)]^2 avoids computing each (x_i - \mu)^2 separately.
Step 4. Compute \sigma.

\sigma = \sqrt{3.25} \approx 1.80
Why: the standard deviation brings the answer back to the same units as X — die faces, in this case.
Result: E(X) = 4.5, \text{Var}(X) = 3.25, \sigma \approx 1.80.
The graph confirms the intuition: the distribution is lopsided, piling most of its weight on 6, and the mean has been dragged to 4.5 — a full unit above the fair-die mean of 3.5.
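The four steps above collapse into a few lines of code. A sketch that recomputes the loaded-die answers:

```python
import math

values = range(1, 7)
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

e_x = sum(x * p for x, p in zip(values, probs))      # E(X) = 4.5
e_x2 = sum(x**2 * p for x, p in zip(values, probs))  # E(X^2) = 23.5
var = e_x2 - e_x**2                                  # 3.25
sigma = math.sqrt(var)                               # about 1.80

print(round(e_x, 6), round(var, 6), round(sigma, 2))
```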
Example 2: Revenue from a random number of customers
A small bookshop gets 0, 1, 2, or 3 customers in an hour with probabilities 0.1, 0.3, 0.4, and 0.2 respectively. Each customer spends exactly ₹200. Find the expected hourly revenue and its standard deviation.
Step 1. Let X be the number of customers. Compute E(X).

E(X) = 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 0 + 0.3 + 0.8 + 0.6 = 1.7
Why: weighted average of customer counts. On average, 1.7 customers arrive per hour.
Step 2. Compute E(X^2).

E(X^2) = 0(0.1) + 1(0.3) + 4(0.4) + 9(0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7
Why: needed for the variance shortcut.
Step 3. Compute \text{Var}(X).

\text{Var}(X) = E(X^2) - [E(X)]^2 = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81, \qquad \sigma = \sqrt{0.81} = 0.9
Why: the shortcut E(X^2) - [E(X)]^2 gives variance directly.
Step 4. Revenue is R = 200X. By linearity, E(R) = 200 \times 1.7 = 340 and \text{Var}(R) = 200^2 \times 0.81 = 32400.
Why: scaling by 200 scales the standard deviation by 200 (not 200^2 — that's for variance). So \sigma_R = 200 \times 0.9 = 180.
Result: Expected hourly revenue = ₹340, standard deviation = ₹180.
The graph shows that ₹340 falls between the ₹200 and ₹400 bars — the expected value sits inside the distribution, pulled slightly left by the 10% chance of earning nothing. The ₹180 standard deviation — more than half the mean — tells you this is a high-variability business.
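A sketch that re-runs Example 2, including the linear-scaling step for revenue (the ₹200-per-customer figure is from the problem statement):

```python
import math

counts = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]

e_x = sum(x * p for x, p in zip(counts, probs))                 # 1.7 customers/hour
var_x = sum(x**2 * p for x, p in zip(counts, probs)) - e_x**2   # 0.81

# Revenue R = 200X: linearity scales the mean by 200,
# the variance by 200^2, and the standard deviation by 200.
e_r = 200 * e_x
sigma_r = 200 * math.sqrt(var_x)

print(round(e_r, 6), round(sigma_r, 6))  # 340.0 180.0
```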
Common confusions
- "The expected value is the most likely outcome." Not necessarily. The most likely outcome is the mode — the value with the highest probability. In the loaded-die example, the mode is 6 (probability 0.5) but the expected value is 4.5. The expected value is the long-run average, not the most frequent result.
- "Variance can be negative." It cannot. Every term (x_i - \mu)^2 is a square, so it is non-negative, and p_i is also non-negative. A sum of non-negative terms is non-negative. Variance is zero only when every outcome equals the mean — i.e., there is no randomness at all.
- "E(X^2) = [E(X)]^2." Almost never true. The shortcut formula says \text{Var}(X) = E(X^2) - [E(X)]^2. Since variance is non-negative, this means E(X^2) \geq [E(X)]^2. Equality holds only when \text{Var}(X) = 0 — i.e., the random variable is actually a constant.
- "Standard deviation and variance tell you the same thing." They measure the same quantity but on different scales. Variance is in squared units, standard deviation is in original units. When comparing spread to the mean, standard deviation is the natural choice. When doing algebra (adding independent variances), variance is the natural choice.
- "\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) always." Only when X and Y are independent (or more precisely, uncorrelated). For correlated variables, there is a cross-term involving the covariance. The expectation version E(X + Y) = E(X) + E(Y) holds always, but variance is not as forgiving.
Going deeper
If you're here for the core definitions and how to compute them, you have everything you need. The rest is for readers who want the algebraic identities that unlock harder problems.
Variance of a sum of independent variables
When X and Y are independent, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).
Proof. Independence means E(XY) = E(X) \cdot E(Y). Let \mu_X = E(X) and \mu_Y = E(Y). Apply the shortcut formula to X + Y:

\text{Var}(X + Y) = E\big[(X + Y)^2\big] - \big[E(X + Y)\big]^2.
Expand each piece:

E\big[(X + Y)^2\big] = E(X^2) + 2E(XY) + E(Y^2), \qquad \big[E(X + Y)\big]^2 = \mu_X^2 + 2\mu_X \mu_Y + \mu_Y^2.
Subtract:

\text{Var}(X + Y) = \big[E(X^2) - \mu_X^2\big] + \big[E(Y^2) - \mu_Y^2\big] + 2\big[E(XY) - \mu_X \mu_Y\big].
The first bracket is \text{Var}(X), the second is \text{Var}(Y), and the third bracket is zero by independence. \square
This extends to any number of independent variables: \text{Var}(X_1 + X_2 + \cdots + X_n) = \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n).
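For two independent fair dice, the additivity can be checked by enumerating all 36 pairs. A sketch (the helper function is just the shortcut formula applied to a finite distribution):

```python
from itertools import product

def variance(values, probs):
    """Var(X) = E(X^2) - [E(X)]^2 for a finite distribution."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum(x**2 * p for x, p in zip(values, probs)) - mu**2

faces = list(range(1, 7))

# One fair die: Var = 35/12, about 2.9167.
var_one = variance(faces, [1/6] * 6)

# Total of two independent dice: all 36 pairs equally likely.
totals = [a + b for a, b in product(faces, repeat=2)]
var_total = variance(totals, [1/36] * 36)

# Var(X + Y) should equal Var(X) + Var(Y) = 35/6.
print(round(var_total, 6), round(2 * var_one, 6))
```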
The Chebyshev inequality
Variance gives you a universal bound on how far a random variable can stray from its mean. For any k > 0:

P\big(|X - \mu| \geq k\sigma\big) \leq \frac{1}{k^2}.
This says: the probability of being more than k standard deviations from the mean is at most 1/k^2, regardless of what distribution X has. At k = 2, no more than 25% of the probability can lie more than 2 standard deviations from the mean. At k = 3, no more than about 11%.
The bound is usually loose — for well-behaved distributions like the normal distribution, the actual probabilities are much smaller. But Chebyshev's inequality holds for every distribution with a finite variance, which makes it a powerful theoretical tool.
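A sketch comparing the Chebyshev bound with the exact tail probability for the loaded die of Example 1 (the values of k are arbitrary choices):

```python
import math

values = range(1, 7)
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

mu = sum(x * p for x, p in zip(values, probs))  # 4.5
sigma = math.sqrt(sum(x**2 * p for x, p in zip(values, probs)) - mu**2)

for k in (1.5, 2, 3):
    # Exact probability of landing at least k sigma from the mean.
    actual = sum(p for x, p in zip(values, probs) if abs(x - mu) >= k * sigma)
    bound = 1 / k**2
    print(k, round(actual, 3), round(bound, 3))
    assert actual <= bound + 1e-12  # Chebyshev's guarantee
```

At every k the exact probability sits well below 1/k^2, illustrating how loose the bound typically is.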
Conditional expectation
If you know that some event A has occurred, the expected value of X given A is

E(X \mid A) = \sum_i x_i \, P(X = x_i \mid A).
This is the same formula as E(X), but with conditional probabilities in place of unconditional ones. The law of total expectation ties everything together:

E(X) = \sum_j E(X \mid A_j) \, P(A_j),
where A_1, A_2, \ldots is a partition of the sample space. This is the expected-value analogue of the law of total probability, and it is used constantly in applications.
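A sketch checking the law of total expectation on the loaded die of Example 1, with the sample space partitioned into "even" and "odd" (the partition is my choice, purely for illustration):

```python
values = list(range(1, 7))
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

def cond_expectation(event):
    """Return (E(X | A), P(A)) for the event A given as a predicate."""
    p_a = sum(p for x, p in zip(values, probs) if event(x))
    e_given_a = sum(x * p for x, p in zip(values, probs) if event(x)) / p_a
    return e_given_a, p_a

e_even, p_even = cond_expectation(lambda x: x % 2 == 0)  # 3.6/0.7, about 5.14
e_odd, p_odd = cond_expectation(lambda x: x % 2 == 1)    # 0.9/0.3 = 3.0

# Law of total expectation: weighting by P(A_j) recovers E(X) = 4.5.
total = e_even * p_even + e_odd * p_odd
print(round(total, 10))  # 4.5
```

Conditioning on "even" pulls the mean up (the heavy face 6 is even), conditioning on "odd" pulls it down, and the probability-weighted combination lands exactly back on 4.5.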
Where this leads next
You now have the two fundamental summary numbers — the mean and the variance — for any discrete random variable. The next step is to see them in action on specific distributions:
- Binomial Distribution — when you repeat an experiment n times and count successes. Mean and variance derived from first principles.
- Other Discrete Distributions — the geometric and Poisson distributions, each with their own mean and variance formulas.
- Continuous Random Variables — when the random variable can take any value in an interval, and sums become integrals.
- Normal Distribution — the bell curve, where mean and standard deviation are the only two parameters you need.
- Conditional Probability — Advanced — the total probability theorem and multi-stage problems, where conditional expectation becomes indispensable.