In short
The expected value E(X) of a discrete random variable is the probability-weighted average of all its possible values — the number the average converges to over many repetitions. The variance \text{Var}(X) measures how far the values typically land from that average. Standard deviation is the square root of variance, bringing the spread back to the original units.
A street vendor in Jaipur sells kulfi. On any given evening, demand is uncertain: she sells 10 kulfi with probability 0.3, sells 20 with probability 0.5, or sells 30 with probability 0.2. She needs to decide how many to prepare. If she makes too few, she loses sales. If she makes too many, the leftover melts.
She doesn't know what tonight's demand will be. But she can ask a sharper question: on average, across many evenings, how many kulfi does she sell per night?
This is not asking for the most common outcome (that would be 20). It is asking for the long-run average — the number that the running mean of her nightly sales would settle toward if she tracked it for a hundred evenings, or a thousand.
That number has a name: it is the expected value of her demand. And computing it requires a surprisingly simple idea.
Building the average, one outcome at a time
The vendor's nightly demand X takes three possible values: 10, 20, and 30. If she tracked her sales for 1000 evenings, she would expect roughly 300 evenings with demand 10, roughly 500 evenings with demand 20, and roughly 200 evenings with demand 30. The total kulfi sold across all 1000 evenings would be approximately

300 \times 10 + 500 \times 20 + 200 \times 30 = 3000 + 10000 + 6000 = 19000.
The average per evening would be 19000 / 1000 = 19.
Now look at what you actually computed. Dividing through by 1000:

\frac{19000}{1000} = \frac{300}{1000} \times 10 + \frac{500}{1000} \times 20 + \frac{200}{1000} \times 30 = 0.3 \times 10 + 0.5 \times 20 + 0.2 \times 30 = 19.
Each value got multiplied by its probability, and everything was added up. The 1000 evenings dropped out entirely. The answer depends only on the values and their probabilities — not on how many trials you imagine running.
This is the key insight. The long-run average is computed by weighting each outcome by how likely it is.
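The claim that the running mean settles toward the weighted sum can be checked with a short simulation. A minimal sketch in Python, using the kulfi distribution from above (the seed and sample size are arbitrary choices):

```python
import random

# Kulfi demand distribution from the example above.
values = [10, 20, 30]
probs = [0.3, 0.5, 0.2]

# Probability-weighted average: 10(0.3) + 20(0.5) + 30(0.2) = 19.
expected = sum(v * p for v, p in zip(values, probs))

# Simulate many evenings; the running mean should settle near 19.
random.seed(1)
n = 100_000
sales = random.choices(values, weights=probs, k=n)
running_mean = sum(sales) / n

print(expected)                # 19.0
print(round(running_mean, 1)) # close to 19
```

Note that the simulated mean only approaches 19; the weighted sum gives it exactly, with no trials at all.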
The formal definition
Expected Value
Let X be a discrete random variable that takes values x_1, x_2, \ldots, x_n with probabilities P(X = x_1), P(X = x_2), \ldots, P(X = x_n). The expected value (or mean) of X is

E(X) = \sum_{i=1}^{n} x_i \, P(X = x_i).
It is also written \mu or \mu_X.
The formula says: take each possible value, multiply it by its probability, add them all up. That's it. The expected value is not necessarily a value that X can actually take — in the kulfi example, the vendor never sells exactly 19 kulfi on any single evening. It is the centre of gravity of the probability distribution, the balance point around which all outcomes cluster.
Properties of expectation
Expected value obeys several rules that make it much easier to work with than you might expect. Here are the three most important, each with a proof.
Property 1: Linearity — E(aX + b) = aE(X) + b
If you scale every outcome by a constant a and then shift by b, the expected value scales and shifts the same way.
Proof. If X takes values x_i with probabilities p_i, then aX + b takes values ax_i + b with the same probabilities. So

E(aX + b) = \sum_i (a x_i + b) \, p_i = a \sum_i x_i p_i + b \sum_i p_i.
The first sum is aE(X). The second sum is b \cdot 1 = b, because all probabilities add to 1. Therefore E(aX + b) = aE(X) + b. \square
What this means. If the kulfi vendor charges ₹5 per kulfi, her revenue on a night with demand X is 5X. The expected revenue is E(5X) = 5 \, E(X) = 5 \times 19 = 95 rupees. She doesn't need to recompute the whole sum — she just multiplies.
Property 2: Expectation of a sum — E(X + Y) = E(X) + E(Y)
The expected value of a sum is the sum of the expected values — always, whether or not X and Y are independent.
Proof. Let X take values x_i with probabilities p_i, and Y take values y_j with probabilities q_j. Every outcome is a pair (x_i, y_j) with joint probability P(X = x_i, Y = y_j), and

E(X + Y) = \sum_i \sum_j (x_i + y_j) \, P(X = x_i, Y = y_j).
Split the sum:

E(X + Y) = \sum_i \sum_j x_i \, P(X = x_i, Y = y_j) + \sum_i \sum_j y_j \, P(X = x_i, Y = y_j).
In the first double sum, x_i does not depend on j, so pull it out:

\sum_i x_i \sum_j P(X = x_i, Y = y_j) = \sum_i x_i \, P(X = x_i) = E(X).
The inner sum collapses because summing the joint probability over all j gives the marginal probability of X = x_i. By the same argument, the second double sum equals E(Y). Therefore E(X + Y) = E(X) + E(Y). \square
This is a remarkably powerful result. If you roll two dice and want the expected total, you don't need to enumerate all 36 pairs. Each die has expected value 3.5, so the expected total is 3.5 + 3.5 = 7. Done.
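The two-dice claim is easy to verify by brute force. A sketch comparing the full 36-pair enumeration against the linearity shortcut, using exact fractions to avoid rounding:

```python
from itertools import product
from fractions import Fraction

# Expected value of one fair die: (1 + 2 + ... + 6) / 6 = 3.5.
one_die = Fraction(sum(range(1, 7)), 6)

# Enumerate all 36 equally likely pairs and average the totals.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]
expected_total = Fraction(sum(totals), len(totals))

print(one_die)         # 7/2
print(expected_total)  # 7
```

Linearity predicts E(X + Y) = 3.5 + 3.5 = 7, and the enumeration confirms it.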
Property 3: Expectation of a constant — E(c) = c
A constant has no randomness, so its expected value is just itself. This is a special case of Property 1 with a = 0 and b = c: E(0 \cdot X + c) = 0 \cdot E(X) + c = c.
What the average doesn't tell you
The expected value captures the centre, but two distributions can have the same centre and look completely different.
Consider two cricket batsmen. Batsman A scores 50 runs in every innings. Batsman B scores either 0 or 100, each with probability \frac{1}{2}. Both have the same expected score: E(A) = 50 and E(B) = 0.5 \times 0 + 0.5 \times 100 = 50. But their reliability is worlds apart. A team captain choosing a batsman for a crucial match cares about more than the average — she cares about how much the scores vary.
This is what variance measures.
Variance: measuring the spread
The idea is natural. If you want to know how far outcomes typically land from the mean, compute the average distance from the mean. But there is a subtlety: distances above and below the mean cancel each other out if you just add them. A score of 40 is 10 below the mean and a score of 60 is 10 above it — their deviations sum to zero.
The fix: square the deviations before averaging. Squaring makes every deviation positive, and it also penalises large deviations more than small ones.
Variance
The variance of a discrete random variable X with mean \mu = E(X) is

\text{Var}(X) = E\big[(X - \mu)^2\big] = \sum_i (x_i - \mu)^2 \, P(X = x_i).
It is also written \sigma^2 or \sigma_X^2.
Derivation of the shortcut formula. The definition involves \mu, which can make direct computation messy. There is a cleaner equivalent:

\text{Var}(X) = E(X^2) - [E(X)]^2.
Here is why. Expand the square inside the definition:

\text{Var}(X) = E\big[(X - \mu)^2\big] = E\big(X^2 - 2\mu X + \mu^2\big).
Apply linearity of expectation (Property 1 and Property 2):

\text{Var}(X) = E(X^2) - 2\mu \, E(X) + \mu^2.
But \mu = E(X), so 2\mu \, E(X) = 2\mu^2 and the expression becomes

\text{Var}(X) = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2 = E(X^2) - [E(X)]^2. \square
This shortcut is almost always faster. To find the variance: compute E(X), compute E(X^2), subtract the square of the first from the second.
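Both routes give the same number, which is easy to confirm numerically. A sketch using the kulfi distribution from earlier (values and probabilities as above):

```python
values = [10, 20, 30]
probs = [0.3, 0.5, 0.2]

mu = sum(x * p for x, p in zip(values, probs))        # E(X) = 19
e_x2 = sum(x**2 * p for x, p in zip(values, probs))   # E(X^2) = 410

# Definition: probability-weighted average of squared deviations.
var_definition = sum((x - mu)**2 * p for x, p in zip(values, probs))

# Shortcut: E(X^2) - [E(X)]^2.
var_shortcut = e_x2 - mu**2

# Both are 49 (up to floating-point rounding).
print(round(var_definition, 6), round(var_shortcut, 6))
```

The shortcut needs one pass for E(X) and one for E(X^2); the definition needs E(X) first and then a second pass over the deviations, which is why the shortcut is usually less error-prone by hand.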
Properties of variance
Var(c) = 0: A constant has no spread. Follows directly from the definition — (c - c)^2 = 0.
Var(aX + b) = a^2 \,\text{Var}(X): Scaling by a scales variance by a^2 (because you're squaring deviations), and adding b does not change the spread at all — it just shifts everything.
Proof. Let \mu = E(X). Then E(aX + b) = a\mu + b. So

\text{Var}(aX + b) = E\big[\big(aX + b - (a\mu + b)\big)^2\big] = E\big[a^2 (X - \mu)^2\big] = a^2 \, E\big[(X - \mu)^2\big] = a^2 \, \text{Var}(X). \square
Notice: the shift b vanishes entirely. Moving a distribution left or right does not change how spread out it is.
Standard deviation
Variance has an inconvenient feature: its units are the square of the original units. If X is measured in runs, \text{Var}(X) is in "runs squared" — a quantity nobody can picture.
The fix is simple: take the square root.
Standard Deviation
The standard deviation of X is

\sigma = \sqrt{\text{Var}(X)}.
It measures spread in the same units as X itself.
For Batsman B, \text{Var}(X) = 2500, so \sigma = \sqrt{2500} = 50 runs. This says: on a typical innings, Batsman B's score deviates from the mean by about 50 runs — which matches the picture perfectly, since the only possible scores are 0 and 100, each exactly 50 away from the mean of 50.
Computing one from start to finish
Example 1: Expected value and variance of a loaded die
A loaded die has the following probability distribution:
| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| P(X = x) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 |
Find E(X), \text{Var}(X), and \sigma.
Step 1. Compute E(X).

E(X) = 1(0.1) + 2(0.1) + 3(0.1) + 4(0.1) + 5(0.1) + 6(0.5) = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5
Why: each value times its probability, summed. The heavy loading on 6 pulls the mean above the fair-die value of 3.5.
Step 2. Compute E(X^2).

E(X^2) = 1(0.1) + 4(0.1) + 9(0.1) + 16(0.1) + 25(0.1) + 36(0.5) = 0.1 + 0.4 + 0.9 + 1.6 + 2.5 + 18.0 = 23.5
Why: the shortcut formula needs E(X^2). You apply the same weighted-sum rule, but to x^2 instead of x.
Step 3. Compute \text{Var}(X) using the shortcut.

\text{Var}(X) = E(X^2) - [E(X)]^2 = 23.5 - (4.5)^2 = 23.5 - 20.25 = 3.25
Why: the shortcut E(X^2) - [E(X)]^2 avoids computing each (x_i - \mu)^2 separately.
Step 4. Compute \sigma.

\sigma = \sqrt{3.25} \approx 1.80
Why: the standard deviation brings the answer back to the same units as X — die faces, in this case.
Result: E(X) = 4.5, \text{Var}(X) = 3.25, \sigma \approx 1.80.
The graph confirms the intuition: the distribution is lopsided, piling most of its weight on 6, and the mean has been dragged to 4.5 — a full unit above the fair-die mean of 3.5.
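The four steps above collapse into a few lines of code. A sketch that recomputes the loaded-die answers:

```python
import math

values = range(1, 7)
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

e_x = sum(x * p for x, p in zip(values, probs))      # E(X) = 4.5
e_x2 = sum(x**2 * p for x, p in zip(values, probs))  # E(X^2) = 23.5
var = e_x2 - e_x**2                                  # 3.25
sigma = math.sqrt(var)                               # about 1.80

print(round(e_x, 6), round(var, 6), round(sigma, 2))
```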
Example 2: Revenue from a random number of customers
A small bookshop gets 0, 1, 2, or 3 customers in an hour with probabilities 0.1, 0.3, 0.4, and 0.2 respectively. Each customer spends exactly ₹200. Find the expected hourly revenue and its standard deviation.
Step 1. Let X be the number of customers. Compute E(X).

E(X) = 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 0 + 0.3 + 0.8 + 0.6 = 1.7
Why: weighted average of customer counts. On average, 1.7 customers arrive per hour.
Step 2. Compute E(X^2).

E(X^2) = 0(0.1) + 1(0.3) + 4(0.4) + 9(0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7
Why: needed for the variance shortcut.
Step 3. Compute \text{Var}(X).

\text{Var}(X) = E(X^2) - [E(X)]^2 = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81, \qquad \sigma = \sqrt{0.81} = 0.9
Why: the shortcut E(X^2) - [E(X)]^2 gives variance directly.
Step 4. Revenue is R = 200X. By linearity, E(R) = 200 \times 1.7 = 340 and \text{Var}(R) = 200^2 \times 0.81 = 32400.
Why: scaling by 200 scales the standard deviation by 200 (not 200^2 — that's for variance). So \sigma_R = 200 \times 0.9 = 180.
Result: Expected hourly revenue = ₹340, standard deviation = ₹180.
The graph shows that ₹340 falls between the ₹200 and ₹400 bars — the expected value sits inside the distribution, pulled slightly left by the 10% chance of earning nothing. The ₹180 standard deviation — more than half the mean — tells you this is a high-variability business.
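A sketch that re-runs Example 2, including the linear-scaling step for revenue (the ₹200-per-customer figure is from the problem statement):

```python
import math

counts = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]

e_x = sum(x * p for x, p in zip(counts, probs))                 # 1.7 customers/hour
var_x = sum(x**2 * p for x, p in zip(counts, probs)) - e_x**2   # 0.81

# Revenue R = 200X: linearity scales the mean by 200,
# the variance by 200^2, and the standard deviation by 200.
e_r = 200 * e_x
sigma_r = 200 * math.sqrt(var_x)

print(round(e_r, 6), round(sigma_r, 6))  # 340.0 180.0
```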
Common confusions
- "The expected value is the most likely outcome." Not necessarily. The most likely outcome is the mode — the value with the highest probability. In the loaded-die example, the mode is 6 (probability 0.5) but the expected value is 4.5. The expected value is the long-run average, not the most frequent result.
- "Variance can be negative." It cannot. Every term (x_i - \mu)^2 is a square, so it is non-negative, and p_i is also non-negative. A sum of non-negative terms is non-negative. Variance is zero only when every outcome equals the mean — i.e., there is no randomness at all.
- "E(X^2) = [E(X)]^2." Almost never true. The shortcut formula says \text{Var}(X) = E(X^2) - [E(X)]^2. Since variance is non-negative, this means E(X^2) \geq [E(X)]^2. Equality holds only when \text{Var}(X) = 0 — i.e., the random variable is actually a constant.
- "Standard deviation and variance tell you the same thing." They measure the same quantity but on different scales. Variance is in squared units, standard deviation is in original units. When comparing spread to the mean, standard deviation is the natural choice. When doing algebra (adding independent variances), variance is the natural choice.
- "\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) always." Only when X and Y are independent (or more precisely, uncorrelated). For correlated variables, there is a cross-term involving the covariance. The expectation version E(X + Y) = E(X) + E(Y) holds always, but variance is not as forgiving.
Going deeper
If you're here for the core definitions and how to compute them, you have everything you need. The rest is for readers who want the algebraic identities that unlock harder problems.
Variance of a sum of independent variables
When X and Y are independent, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).
Proof. Independence means E(XY) = E(X) \cdot E(Y). Let \mu_X = E(X) and \mu_Y = E(Y). Apply the shortcut formula to X + Y:

\text{Var}(X + Y) = E\big[(X + Y)^2\big] - \big[E(X + Y)\big]^2.
Expand each piece:

E\big[(X + Y)^2\big] = E(X^2) + 2E(XY) + E(Y^2), \qquad \big[E(X + Y)\big]^2 = \mu_X^2 + 2\mu_X \mu_Y + \mu_Y^2.
Subtract:

\text{Var}(X + Y) = \big[E(X^2) - \mu_X^2\big] + \big[E(Y^2) - \mu_Y^2\big] + 2\big[E(XY) - \mu_X \mu_Y\big].
The first bracket is \text{Var}(X), the second is \text{Var}(Y), and the third bracket is zero by independence. \square
This extends to any number of independent variables: \text{Var}(X_1 + X_2 + \cdots + X_n) = \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n).
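For two independent fair dice, the additivity can be checked by enumerating all 36 pairs. A sketch (the helper function is just the shortcut formula applied to a finite distribution):

```python
from itertools import product

def variance(values, probs):
    """Var(X) = E(X^2) - [E(X)]^2 for a finite distribution."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum(x**2 * p for x, p in zip(values, probs)) - mu**2

faces = list(range(1, 7))

# One fair die: Var = 35/12, about 2.9167.
var_one = variance(faces, [1/6] * 6)

# Total of two independent dice: all 36 pairs equally likely.
totals = [a + b for a, b in product(faces, repeat=2)]
var_total = variance(totals, [1/36] * 36)

# Var(X + Y) should equal Var(X) + Var(Y) = 35/6.
print(round(var_total, 6), round(2 * var_one, 6))
```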
The Chebyshev inequality
Variance gives you a universal bound on how far a random variable can stray from its mean. For any k > 0:

P\big(|X - \mu| \geq k\sigma\big) \leq \frac{1}{k^2}.
This says: the probability of being more than k standard deviations from the mean is at most 1/k^2, regardless of what distribution X has. At k = 2, no more than 25% of the probability can lie more than 2 standard deviations from the mean. At k = 3, no more than about 11%.
The bound is usually loose — for well-behaved distributions like the normal distribution, the actual probabilities are much smaller. But Chebyshev's inequality holds for every distribution with a finite variance, which makes it a powerful theoretical tool.
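A sketch comparing the Chebyshev bound with the exact tail probability for the loaded die of Example 1 (the values of k are arbitrary choices):

```python
import math

values = range(1, 7)
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

mu = sum(x * p for x, p in zip(values, probs))  # 4.5
sigma = math.sqrt(sum(x**2 * p for x, p in zip(values, probs)) - mu**2)

for k in (1.5, 2, 3):
    # Exact probability of landing at least k sigma from the mean.
    actual = sum(p for x, p in zip(values, probs) if abs(x - mu) >= k * sigma)
    bound = 1 / k**2
    print(k, round(actual, 3), round(bound, 3))
    assert actual <= bound + 1e-12  # Chebyshev's guarantee
```

At every k the exact probability sits well below 1/k^2, illustrating how loose the bound typically is.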
Conditional expectation
If you know that some event A has occurred, the expected value of X given A is

E(X \mid A) = \sum_i x_i \, P(X = x_i \mid A).
This is the same formula as E(X), but with conditional probabilities in place of unconditional ones. The law of total expectation ties everything together:

E(X) = \sum_j E(X \mid A_j) \, P(A_j),
where A_1, A_2, \ldots is a partition of the sample space. This is the expected-value analogue of the law of total probability, and it is used constantly in applications.
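A sketch checking the law of total expectation on the loaded die of Example 1, with the sample space partitioned into "even" and "odd" (the partition is my choice, purely for illustration):

```python
values = list(range(1, 7))
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

def cond_expectation(event):
    """Return (E(X | A), P(A)) for the event A given as a predicate."""
    p_a = sum(p for x, p in zip(values, probs) if event(x))
    e_given_a = sum(x * p for x, p in zip(values, probs) if event(x)) / p_a
    return e_given_a, p_a

e_even, p_even = cond_expectation(lambda x: x % 2 == 0)  # 3.6/0.7, about 5.14
e_odd, p_odd = cond_expectation(lambda x: x % 2 == 1)    # 0.9/0.3 = 3.0

# Law of total expectation: weighting by P(A_j) recovers E(X) = 4.5.
total = e_even * p_even + e_odd * p_odd
print(round(total, 10))  # 4.5
```

Conditioning on "even" pulls the mean up (the heavy face 6 is even), conditioning on "odd" pulls it down, and the probability-weighted combination lands exactly back on 4.5.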
Where this leads next
You now have the two fundamental summary numbers — the mean and the variance — for any discrete random variable. The next step is to see them in action on specific distributions:
- Binomial Distribution — when you repeat an experiment n times and count successes. Mean and variance derived from first principles.
- Other Discrete Distributions — the geometric and Poisson distributions, each with their own mean and variance formulas.
- Continuous Random Variables — when the random variable can take any value in an interval, and sums become integrals.
- Normal Distribution — the bell curve, where mean and standard deviation are the only two parameters you need.
- Conditional Probability — Advanced — the total probability theorem and multi-stage problems, where conditional expectation becomes indispensable.