In short
Shannon entropy H(X) = -\sum_x p(x)\log_2 p(x) measures, in bits, the uncertainty of a random variable X before you see its value — equivalently, the average number of yes/no questions needed to pin it down. H is zero for a deterministic variable, maximal (\log_2 n) for a uniform distribution over n outcomes. For a biased coin with head-probability p, the binary entropy H(p) = -p\log_2 p - (1-p)\log_2(1-p) traces a bump from 0 at p = 0 or 1 up to 1 bit at p = 1/2. Joint entropy H(X,Y) measures the total uncertainty of two variables; conditional entropy H(Y|X) = H(X,Y) - H(X) measures how much uncertainty about Y survives once X is known; mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) is the reduction in uncertainty about Y from learning X (symmetric in the two arguments, always \geq 0). Shannon's noisy-channel theorem says that for any channel with input X and output Y, communication with arbitrarily small error probability is possible at every rate below the channel capacity C = \max_{p(x)} I(X;Y), and impossible above it. These five quantities — H, H(X,Y), H(Y|X), I(X;Y), and C — are the classical foundation. The next chapter replaces probability distributions with density operators and every \log with a matrix log, turning Shannon entropy into von Neumann entropy.
You flip a coin. Before the flip, you don't know the outcome — that's uncertainty. After the flip, you do — that's information. The amount of uncertainty you lost is exactly the amount of information you gained. Shannon entropy puts a number on both.
This chapter is the classical warm-up for the quantum version. Every quantum-information quantity in Part 20 — von Neumann entropy, conditional entropy, quantum mutual information, quantum channel capacity — is a direct analogue of a classical Shannon quantity. If the classical story is fluent, the quantum story is mostly a change of notation: p(x) \to \rho, \sum \to \text{tr}, \log \to matrix log. If the classical story is hazy, the quantum chapters will feel like alchemy. So: a clean recap, at the level a JEE-Advanced student can follow, of Shannon's 1948 framework.
What entropy is trying to measure
Before the formula, the intuition. Two scenarios:
- Scenario A. Your friend is about to tell you the result of today's cricket toss at the Wankhede. There are two outcomes, heads or tails, each with probability 1/2. How uncertain are you? Fully uncertain — you have no idea which.
- Scenario B. Your friend is about to tell you whether the Sun rose this morning. There are two outcomes. How uncertain are you? Zero. You already know.
Both scenarios have two possible outcomes, yet the uncertainty is very different. The difference is in the probabilities. Entropy is the function that takes a probability distribution and spits out a single number quantifying how uncertain you are about a random draw from it.
The right number should obey a few common-sense rules:
- Certain outcome → zero uncertainty. If one outcome has probability 1, entropy is 0.
- Uniform → maximum uncertainty. For n equally likely outcomes, the entropy should be as large as it gets for that n.
- More outcomes → more room for uncertainty. A uniform distribution over 4 outcomes should be more uncertain than a uniform distribution over 2.
- Independent variables add. If X and Y are independent random variables, the uncertainty of the pair (X, Y) should be the sum of the individual uncertainties.
Shannon (1948) showed that these requirements — plus a weak continuity condition — force the formula up to an overall constant. That formula is the thing you are about to meet.
Shannon entropy of a single random variable
Shannon entropy
Let X be a random variable taking values in a finite set \mathcal{X} with probability distribution p(x) = \Pr(X = x). The Shannon entropy of X is

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),

measured in bits. The convention 0 \log_2 0 = 0 is used (limit as p \to 0^+). Changing the logarithm base changes the unit: \log_e gives "nats", \log_{10} gives "dits". Bits are the default throughout this article.
Reading the definition. The quantity -\log_2 p(x) is the surprise of outcome x — a rare event (p(x) small) has high surprise, a common event has low surprise. The entropy is the expected surprise: the average value of -\log_2 p(X) under the distribution p. You do not yet know X; H(X) tells you, in bits, how surprised you expect to be once you see it.
Why \log_2 and not any other function of p: the base-2 logarithm makes the unit bits. One bit is the entropy of a fair coin, and it is also the information content of one yes/no answer. The choice of logarithm base is a choice of units, nothing more; \log_2 gives the right units for the rest of computer science.
Three quick sanity checks
Deterministic variable. If X takes a single value with probability 1, then -\sum_x p(x) \log_2 p(x) = -1 \cdot \log_2 1 = 0. Zero bits. No uncertainty.
Fair coin. If X takes two values with probability 1/2 each, then H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = -\tfrac{1}{2}(-1) - \tfrac{1}{2}(-1) = 1. One bit.
Uniform over n outcomes. If X takes n values with probability 1/n each, then H(X) = -n \cdot \tfrac{1}{n}\log_2 \tfrac{1}{n} = \log_2 n. Why this is the maximum: for any distribution on n outcomes, Jensen's inequality applied to the concave function \log_2 gives H(X) = \sum_x p(x) \log_2 \tfrac{1}{p(x)} \leq \log_2 \left( \sum_x p(x) \cdot \tfrac{1}{p(x)} \right) = \log_2 n, with equality iff the distribution is uniform. The uniform distribution spreads the probability as thinly as possible, so no outcome is ever less surprising than the rest.
A six-sided fair die has entropy \log_2 6 \approx 2.585 bits. A fair 52-card draw has entropy \log_2 52 \approx 5.700 bits. You need on average about 5.7 yes/no questions to guess the card.
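These sanity checks take only a few lines to verify numerically. A minimal sketch in Python (the helper name shannon_entropy is mine, not standard notation):

```python
from math import log2

def shannon_entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 * log2(0) = 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Deterministic variable: zero bits.
print(shannon_entropy([1.0]))                  # 0.0
# Fair coin: one bit.
print(shannon_entropy([0.5, 0.5]))             # 1.0
# Fair six-sided die: log2(6) ≈ 2.585 bits.
print(round(shannon_entropy([1/6] * 6), 3))    # 2.585
# Fair 52-card draw: log2(52) ≈ 5.700 bits.
print(round(shannon_entropy([1/52] * 52), 3))  # 5.7
```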
Operational meaning — compression
Shannon entropy is not just an abstract measure of uncertainty. It has a sharp operational interpretation: H(X) is the minimum average number of bits per symbol you need to encode a long string drawn from p, in any lossless compression scheme.
This is the source-coding theorem. A highly repetitive message like "namaste namaste namaste" can be compressed far below its literal bit-length because the underlying p has low entropy. A genuinely random string over \{0, 1\} with p = 1/2 already has H = 1 bit per symbol and cannot be compressed further — every scheme that tries will fail on average.
So H(X) is simultaneously:
- The uncertainty of X before you see it.
- The information gained when you do see it.
- The minimum bits per symbol needed to losslessly store a long stream of draws from p.
All three numbers are the same. This triple-interpretation is why entropy is foundational and why it appears everywhere the word "information" does.
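The compressibility claim can be seen with an off-the-shelf compressor. A rough illustration in Python — zlib is not an optimal code for either source, but the contrast tracks the entropy gap:

```python
import random
import zlib

# A low-entropy source: the same word over and over.
repetitive = ("namaste " * 1000).encode()

# A high-entropy source: fair coin flips packed into bytes, seeded for reproducibility.
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(len(repetitive)))

# Compression ratio: compressed size / original size.
print(len(zlib.compress(repetitive)) / len(repetitive))  # tiny — far below 1
print(len(zlib.compress(noisy)) / len(noisy))            # close to 1: incompressible
```

The repetitive stream shrinks by orders of magnitude; the random stream does not shrink at all, exactly as the source-coding theorem predicts.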
Binary entropy — the bump from 0 to 1
The most-used special case of Shannon entropy is the binary entropy function, which applies when X is a single Bernoulli variable with \Pr(X = 1) = p and \Pr(X = 0) = 1 - p:

H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).
This one-variable function comes up every time a bit is noisy, a qubit is measured, or a binary classifier is evaluated. Memorise its shape.
Three features worth internalising:
- H(0) = H(1) = 0. A coin that always lands heads has no uncertainty.
- H(1/2) = 1 bit. The fair coin is the maximally uncertain binary variable.
- Symmetry. H(p) = H(1 - p). Swapping the labels of the two outcomes does not change the uncertainty.
H(p) is also concave — the reason mixing two distributions never decreases entropy, a fact that will reappear in the quantum chapter as the concavity of von Neumann entropy.
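All three features, plus concavity at a sample point, can be checked numerically. A short Python sketch (binary_entropy is my name for H(p)):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Endpoints: no uncertainty.
print(binary_entropy(0.0), binary_entropy(1.0))  # 0.0 0.0
# Peak: the fair coin.
print(binary_entropy(0.5))                       # 1.0
# Symmetry: H(p) = H(1 - p).
print(binary_entropy(0.1), binary_entropy(0.9))  # equal (up to rounding)
# Concavity at one point: H of a mixture exceeds the mixture of H values.
p, q, lam = 0.2, 0.8, 0.5
assert binary_entropy(lam * p + (1 - lam) * q) >= \
       lam * binary_entropy(p) + (1 - lam) * binary_entropy(q)
```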
Joint, conditional, and mutual information
One random variable is the baseline. Real communication involves pairs: sender X, receiver Y. Three quantities describe the pair.
Joint entropy
Joint entropy
The joint entropy of two random variables X, Y with joint distribution p(x, y) is

H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y).
It measures the uncertainty of the pair (X, Y) considered as a single random variable on the product space \mathcal{X} \times \mathcal{Y}.
If X, Y are independent, p(x,y) = p(x)p(y) and a short calculation gives H(X, Y) = H(X) + H(Y):

H(X, Y) = -\sum_{x, y} p(x)p(y)\left[\log_2 p(x) + \log_2 p(y)\right] = -\sum_x p(x)\log_2 p(x) - \sum_y p(y)\log_2 p(y) = H(X) + H(Y).
Why the sums separate: the log of a product is a sum of logs, and the marginal sums \sum_y p(y) = 1 let the H(X) piece come out as a single factor. Independence makes total uncertainty additive — a sanity check that the formula agrees with intuition.
For correlated variables, H(X, Y) < H(X) + H(Y). The shortfall is the mutual information, which you will meet in a moment.
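Both cases — additive for independent pairs, strictly smaller for correlated ones — can be verified in a few lines. A Python sketch with illustrative distributions of my own choosing:

```python
from math import log2

def H(probs):
    """Shannon entropy of a probability list, in bits."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Independent pair: p(x, y) = p(x) p(y), so H(X, Y) = H(X) + H(Y).
px = [0.5, 0.5]
py = [0.8, 0.2]
joint_indep = [a * b for a in px for b in py]
print(H(joint_indep), H(px) + H(py))  # equal

# Correlated pair: Y is a noisy copy of X (flip probability 0.1).
# p(x, y) for (0,0), (0,1), (1,0), (1,1); both marginals are fair.
joint_corr = [0.45, 0.05, 0.05, 0.45]
print(H(joint_corr))        # ≈ 1.469 — strictly below H(X) + H(Y) = 2
```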
Conditional entropy
Conditional entropy
The conditional entropy of Y given X is

H(Y | X) = \sum_x p(x)\, H(Y | X = x),
where H(Y | X = x) = -\sum_y p(y | x) \log_2 p(y | x). It measures the uncertainty that remains about Y once X is known.
Reading the definition. H(Y | X = x) is the entropy of the conditional distribution p(y | x) — what the distribution of Y looks like after you learn that X took the value x. Averaging over all possible x (with weights p(x)) gives H(Y | X): on average, how much uncertainty about Y is left after observing X. The chain-rule identity H(Y|X) = H(X, Y) - H(X) is the definition's most useful form.
Two sanity checks: H(Y | X) = 0 if Y is a deterministic function of X (knowing X determines Y completely), and H(Y | X) = H(Y) if X, Y are independent (knowing X tells you nothing about Y). Between these extremes sits every real-world correlation.
Crucially, H(Y | X) \geq 0 — always. Classical conditional entropy cannot go negative. This is intuitive: knowing X can only reduce uncertainty about Y, never create it. The quantum analogue, as you will see in the next chapter, breaks this rule — and that break is entanglement.
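Both forms of the definition — the weighted average and the chain rule — can be computed and compared. A Python sketch on a noisy-copy pair (the joint distribution is my illustrative choice):

```python
from math import log2

def H(probs):
    """Shannon entropy of a probability list, in bits."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Joint p(x, y): X is a fair bit, Y = X flipped with probability 0.1.
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

Hxy = H(pxy.values())
px = {x: sum(v for (a, _), v in pxy.items() if a == x) for x in (0, 1)}
Hx = H(px.values())

# Chain rule: H(Y|X) = H(X, Y) - H(X).
H_Y_given_X = Hxy - Hx
print(H_Y_given_X)  # ≈ 0.469 = H(0.1): the surviving uncertainty is the flip

# Direct form: average of H(Y | X = x) over x, weighted by p(x).
direct = sum(px[x] * H([pxy[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1))
assert abs(H_Y_given_X - direct) < 1e-12
assert H_Y_given_X >= 0  # classical conditional entropy is never negative
```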
Mutual information
Mutual information
The mutual information between X and Y is

I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y | X).
It measures how much learning Y reduces your uncertainty about X — equivalently, how much X tells you about Y. It is symmetric, I(X; Y) = I(Y; X), and non-negative, I(X; Y) \geq 0, with equality iff X, Y are independent.
Reading the definition. The first form reads as "total uncertainty if considered separately, minus total uncertainty when considered jointly." The shortfall is whatever redundancy the two variables share — i.e., how much information is held in common. The second form reads as "uncertainty of Y before knowing X, minus uncertainty of Y after knowing X." That reduction is the information X provides about Y.
The Venn picture is a literal accounting. Every identity between H, H(\cdot | \cdot), H(\cdot, \cdot), and I can be read off by treating the circles as sets and entropies as areas. This picture survives almost unchanged into the quantum chapter — with one shocking exception: for entangled states, the overlap can be larger than either circle. That's the thing negative conditional entropy formalises.
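The Venn accounting can be checked numerically: all the forms of I(X; Y) agree. A Python sketch on the same noisy-copy pair as above (my illustrative distribution, not from the text):

```python
from math import log2

def H(probs):
    """Shannon entropy of a probability list, in bits."""
    return sum(-p * log2(p) for p in probs if p > 0)

# X fair, Y = X flipped with probability 0.1; both marginals are fair.
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
px = [0.5, 0.5]
py = [0.5, 0.5]

Hx, Hy, Hxy = H(px), H(py), H(pxy.values())

I1 = Hx + Hy - Hxy      # "separate minus joint"
I2 = Hy - (Hxy - Hx)    # H(Y) - H(Y|X), via the chain rule
I3 = Hx - (Hxy - Hy)    # H(X) - H(X|Y): symmetry in the two arguments
print(I1)  # ≈ 0.531 bits shared between X and its noisy copy
assert abs(I1 - I2) < 1e-12 and abs(I1 - I3) < 1e-12
assert I1 >= 0
```

The value 0.531 bits is no accident: it will reappear as the capacity of the BSC with flip probability 0.1.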
Channels and capacity
A communication channel takes an input symbol X \in \mathcal{X} and produces an output symbol Y \in \mathcal{Y} according to a conditional distribution p(y | x). The channel is noisy when Y is not a deterministic function of X. Two workhorse examples:
- Binary symmetric channel (BSC) with flip probability p. Input X \in \{0, 1\}. Output Y \in \{0, 1\}. Transition law: Y = X with probability 1 - p, Y = 1 - X with probability p. The same model as the classical repetition code channel.
- Binary erasure channel (BEC). Input X \in \{0, 1\}. Output Y \in \{0, 1, \mathtt{e}\} where \mathtt{e} means "erased". The receiver either sees the input exactly, or sees an explicit erasure symbol.
Channel capacity
The capacity C of a channel with input X and output Y is

C = \max_{p(x)} I(X; Y),
the maximum mutual information between input and output, where the max is taken over all input distributions p(x). It is measured in bits per channel use.
Reading the definition. For a fixed channel law p(y | x), the mutual information I(X; Y) depends on how you choose to distribute your input symbols. Channel capacity is what happens if you choose the best possible distribution — the one that lets the most information get through on average per channel use.
Shannon's noisy-channel theorem — the result that founded information theory
Shannon's 1948 masterpiece contains two theorems about capacity, which together are called the noisy-channel coding theorem:
- Achievability. For any rate R < C, there exist codes that allow communication at rate R with arbitrarily small probability of decoding error, provided code length n is large enough.
- Converse. For any rate R > C, every code has a decoding error probability bounded away from zero. Communication at rates exceeding capacity is fundamentally impossible.
The two-sided claim is what makes capacity a fundamental quantity, not just a handy definition. It says there is a sharp number C for every channel, below which everything works and above which nothing does. Shannon did not just define capacity — he proved capacity is the physical limit.
For a BSC with flip probability p, the capacity comes out to a clean formula:

C_{\text{BSC}}(p) = 1 - H(p),
where H(p) is the binary entropy. At p = 0, C = 1 bit per use (noiseless channel, full throughput). At p = 1/2, C = 0 (the channel is pure noise; no information gets through). At p = 0.01, C \approx 1 - 0.0808 \approx 0.919 bits per use — you can still push almost a full bit through a channel that flips one bit in a hundred, provided you use the right code.
The classical repetition code achieves only 1/n bits per use for n-fold repetition — vastly below capacity. Hamming, Reed-Solomon, LDPC, turbo, polar codes all approach C more efficiently. Modern 5G, Wi-Fi, and the Mars rover's downlink all use codes within fractions of a decibel of the Shannon limit.
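The 1/n rate of repetition coding, and its residual error, can be simulated directly. A Monte Carlo sketch in Python — send_through_bsc and the trial count are my choices; the decoded error rate for 3-fold repetition over a BSC with p = 0.1 should land near 3p^2(1-p) + p^3 = 0.028:

```python
import random

random.seed(1)
p, n_trials = 0.1, 100_000

def send_through_bsc(bit, p):
    """One use of the channel: flip the bit with probability p."""
    return bit ^ (random.random() < p)

errors = 0
for _ in range(n_trials):
    bit = random.getrandbits(1)
    received = [send_through_bsc(bit, p) for _ in range(3)]  # 3-fold repetition
    decoded = 1 if sum(received) >= 2 else 0                 # majority vote
    errors += (decoded != bit)

print(errors / n_trials)  # close to 3p^2(1-p) + p^3 = 0.028
```

So repetition buys reliability (0.028 versus the raw 0.1) at the cost of a rate of only 1/3 bits per use — far below the 0.531-bit capacity of this channel.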
Worked examples
Example 1: Binary entropy at $p = 1/2$ versus $p = 1/10$
Setup. Compute H(1/2) and H(1/10) by hand, and interpret the difference.
Step 1 — the p = 1/2 case.

H(1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1 \text{ bit}.
Why \log_2(1/2) = -1: by definition of the logarithm, 2^{-1} = 1/2, so \log_2(1/2) = -1. Any "base-b log of 1/b" equals -1.
A fair coin has one bit of entropy. Exactly.
Step 2 — the p = 1/10 case.

H(1/10) = -\tfrac{1}{10}\log_2\tfrac{1}{10} - \tfrac{9}{10}\log_2\tfrac{9}{10}.
Compute each piece. \log_2(1/10) = -\log_2 10 \approx -3.3219, so -\tfrac{1}{10}\log_2(1/10) \approx 0.3322. For the second piece, \log_2(9/10) = \log_2 9 - \log_2 10 \approx 3.1699 - 3.3219 \approx -0.1520, so -\tfrac{9}{10}\log_2(9/10) \approx 0.1368. Sum: H(1/10) \approx 0.3322 + 0.1368 \approx 0.4690 bits.
Step 3 — interpret. A fair coin carries one bit of uncertainty; a coin that lands heads only one time in ten carries about 0.469 bits — less than half as much. That matches intuition: the biased coin has a strong prior (almost always tails), so each flip is less surprising on average. You would need only about half as many yes/no questions, averaged over many flips, to pin it down.
What this shows. Entropy is sensitive to the shape of the distribution, not just the number of outcomes. The same two outcomes, weighted differently, carry wildly different amounts of uncertainty.
Example 2: Capacity of the binary symmetric channel at $p = 0.1$
Setup. You are transmitting bits over a wireless channel that flips each bit independently with probability p = 0.1. How many bits of information can you reliably send per channel use, in principle? And what does the calculation actually measure?
Step 1 — identify the channel. The channel is a BSC with p = 0.1. You know C_{\text{BSC}}(p) = 1 - H(p) from the text above.
Step 2 — compute H(0.1). From Example 1, H(0.1) \approx 0.469 bits.
Step 3 — compute the capacity.

C = 1 - H(0.1) \approx 1 - 0.469 = 0.531 \text{ bits per channel use}.
Step 4 — interpret. Over the long run, using the best possible error-correcting code, you can transmit at most 0.531 bits of payload per bit sent on the wire. Send 1{,}000 bits through the channel, and the theoretical maximum payload is about 531 bits; the other 469 bits are used up by redundancy to correct the flips. No code, however clever, can do better. Shannon's converse theorem forbids it.
Step 5 — why the capacity maximisation lands at uniform input. For the BSC, the optimal input distribution is the uniform distribution p(0) = p(1) = 1/2. Why uniform is optimal for the BSC: the BSC is symmetric (the two inputs are related by a bit-flip symmetry), and symmetric channels are maximised by symmetric inputs. For a uniform input, the output is also uniform with H(Y) = 1, and H(Y | X) = H(p) (the conditional is a Bernoulli-p variable regardless of X), so I(X; Y) = H(Y) - H(Y|X) = 1 - H(p).
Step 6 — compare with the repetition code. A 3-bit repetition code sends 1 logical bit per 3 physical bits, so its rate is 1/3 \approx 0.333 bits per use. That is well below the capacity C \approx 0.531. You are leaving about 0.198 bits per use on the floor. Modern LDPC codes used in 5G sit at rates like 0.5 or higher for similar noise — within \sim 6\% of Shannon's limit.
What this shows. Channel capacity is a sharp theoretical ceiling that depends on one number (p) for the BSC. Repetition coding is nowhere near optimal. Shannon's theorem is the reason your phone's 4G modem uses LDPC, not repetition.
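The maximisation in Step 5 can also be done by brute force, without invoking symmetry. A Python sketch that grid-searches the input distribution for the BSC (mutual_info_bsc is my helper, built on the H(Y) - H(Y|X) form from the text):

```python
from math import log2

def Hb(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_info_bsc(q, p):
    """I(X;Y) for a BSC with flip probability p and input Pr(X=1) = q."""
    r = q * (1 - p) + (1 - q) * p   # output distribution: Pr(Y=1)
    return Hb(r) - Hb(p)            # H(Y) - H(Y|X); H(Y|X) = H(p) for any input

p = 0.1
best_q = max((i / 1000 for i in range(1001)), key=lambda q: mutual_info_bsc(q, p))
print(best_q)                                 # 0.5 — uniform input is optimal
print(round(mutual_info_bsc(best_q, p), 3))   # 0.531 = 1 - H(0.1)
```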
Common confusions
- "Entropy is uncertainty; information is the opposite." Actually the same thing, in two framings. H(X) is simultaneously how uncertain you are about X before seeing it, and how much information you gain when you do see it. Before and after add to a constant, so they measure the same quantity.
- "Entropy is always a whole number of bits." No. H(1/10) \approx 0.469. The "bit" here is a unit, not an integer. You cannot literally transmit a fractional bit in one channel use, but over many uses the average bits per use can be any non-negative real.
- "Conditional entropy is the entropy of a conditional distribution." Close but not quite. H(Y | X = x) is the entropy of p(y | x) for a specific x. H(Y | X) is the average of H(Y | X = x) over all x, weighted by p(x). These are different objects — the first is a number depending on x, the second is a single number.
- "Mutual information can be negative if the variables are anticorrelated." No. Mutual information is always \geq 0. Anticorrelation is still correlation — it still reduces the joint entropy below the sum of marginals. A computed I(X; Y) < 0 is a red flag for an arithmetic error.
- "Channel capacity is the rate of the best-known code." No, it is the rate of the best possible code. Capacity is a theoretical quantity defined from the channel alone, not from any code. That capacity is achievable is the content of Shannon's theorem; that known codes approach it is a separate engineering question.
- "Bigger alphabet means bigger entropy." Only if the probabilities support it. A uniform distribution over 1024 symbols has H = 10 bits, but a highly peaked distribution over the same 1024 symbols could have H < 1 bit. The alphabet size is the ceiling (H \leq \log_2 n), not the floor.
Going deeper
If you just need the six definitions — H(X), H(p), H(X,Y), H(Y|X), I(X;Y), and the channel capacity C — plus Shannon's noisy-channel theorem, you have the prerequisite. The next chapter builds von Neumann entropy on top of this; you don't need more. The rest of this section sketches the deeper structure — the asymptotic equipartition property, the data-processing inequality, typical sets, and the formal proof of achievability — for readers who want the information-theory background at an undergraduate level.
The asymptotic equipartition property (AEP)
The deepest single fact in Shannon's theory is the AEP. Fix a distribution p on \mathcal{X} and draw n independent samples x_1, \ldots, x_n. The typical set A_\epsilon^{(n)} is the set of sequences whose probability p(x_1) \cdots p(x_n) = \prod_i p(x_i) satisfies

2^{-n(H(X) + \epsilon)} \leq p(x_1) \cdots p(x_n) \leq 2^{-n(H(X) - \epsilon)}.
Three beautiful properties:
- Typical sequences are individually near-uniform in log-probability: each has probability roughly 2^{-nH(X)}.
- There are roughly 2^{nH(X)} typical sequences: the total probability weight is close to 1, and each piece weighs \approx 2^{-nH(X)}, so the count is \approx 2^{nH(X)}.
- Almost all the probability lives on typical sequences as n \to \infty: \Pr(A_\epsilon^{(n)}) \to 1 by the law of large numbers.
The AEP says: long samples from p look, in aggregate, like a uniform distribution on roughly 2^{nH(X)} equally likely sequences. This is why H(X) is the compression limit — you can enumerate the typical sequences in nH(X) bits, and handle the atypical remainder with a negligible fraction of the codebook. Proof of Shannon's source coding theorem is one AEP argument away.
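The concentration behind the AEP is easy to watch numerically. A Python sketch on a small made-up alphabet — the per-symbol log-probability of a sampled sequence drifts toward H(X) as n grows:

```python
import random
from math import log2

random.seed(2)
probs = {"a": 0.7, "b": 0.2, "c": 0.1}   # an illustrative source
H = sum(-p * log2(p) for p in probs.values())
symbols, weights = zip(*probs.items())

def empirical_rate(n):
    """-(1/n) log2 of the probability of one length-n i.i.d. sample."""
    seq = random.choices(symbols, weights=weights, k=n)
    return -sum(log2(probs[s]) for s in seq) / n

# The per-symbol log-probability concentrates on H(X) as n grows.
for n in (10, 100, 10_000):
    print(n, round(empirical_rate(n), 3))
print("H(X) =", round(H, 3))
```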
The data-processing inequality
If X \to Y \to Z is a Markov chain (meaning Z depends on X only through Y), then

I(X; Z) \leq I(X; Y).
Why you cannot gain information by post-processing: Y is a bottleneck. Any function you compute from Y (including random functions like noisy channels) can only destroy information about X, not create new information. This inequality has a one-line proof from the chain rule for mutual information.
The data-processing inequality is used constantly in quantum information too — it is the classical skeleton of the quantum data-processing inequality that governs how much information a noisy quantum channel can preserve.
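The inequality can be checked on a concrete Markov chain. A Python sketch where both arrows are BSCs with flip probability 0.1 (my illustrative choice of channel):

```python
from math import log2

def H(probs):
    """Shannon entropy of a probability collection, in bits."""
    return sum(-p * log2(p) for p in probs if p > 0)

def mutual_info(pxy):
    """I(X;Y) from a joint distribution given as {(x, y): prob}."""
    xs = {x for x, _ in pxy}
    ys = {y for _, y in pxy}
    px = [sum(v for (a, _), v in pxy.items() if a == x) for x in xs]
    py = [sum(v for (_, b), v in pxy.items() if b == y) for y in ys]
    return H(px) + H(py) - H(pxy.values())

# Markov chain X -> Y -> Z: each arrow flips the bit with probability f.
f = 0.1
pxy = {(x, y): 0.5 * (1 - f if x == y else f) for x in (0, 1) for y in (0, 1)}
# Z depends on X only through Y: p(x, z) = sum_y p(x, y) p(z | y).
pxz = {}
for x in (0, 1):
    for z in (0, 1):
        pxz[(x, z)] = sum(pxy[(x, y)] * (1 - f if y == z else f) for y in (0, 1))

print(mutual_info(pxy))  # I(X;Y) ≈ 0.531
print(mutual_info(pxz))  # I(X;Z) — strictly smaller: processing lost information
assert mutual_info(pxz) <= mutual_info(pxy)
```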
The Kullback-Leibler divergence
A finer information-theoretic quantity than entropy is the relative entropy (also called KL divergence):

D_{\text{KL}}(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}.
It measures the "distance" between two distributions p and q — specifically, the number of extra bits you would waste per symbol if you built an optimal code for q but the true distribution is p. It is non-negative (D_{\text{KL}} \geq 0) with equality iff p = q, but it is not symmetric and does not satisfy the triangle inequality. It is the parent of many information-theoretic quantities: entropy, mutual information, and cross-entropy are all special cases of KL divergence.
The quantum analogue is the quantum relative entropy S(\rho \| \sigma) = \text{tr}(\rho \log \rho) - \text{tr}(\rho \log \sigma), which governs every inequality in quantum information theory.
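The non-negativity and the asymmetry are both visible in a tiny example. A Python sketch (the distributions p and q are my illustrative choices):

```python
from math import log2

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # the true distribution: a fair coin
q = [0.9, 0.1]   # the model the code was (wrongly) optimised for

print(kl(p, q))  # extra bits wasted per symbol coding p with q's code
print(kl(q, p))  # a different number — KL is not symmetric
assert kl(p, q) >= 0 and kl(q, p) >= 0
assert kl(p, p) == 0  # zero iff the distributions coincide
```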
A historical note
Claude Shannon's 1948 paper A Mathematical Theory of Communication created the field of information theory single-handedly. Before 1948, "communication" was engineering folklore — circuits and modulation schemes, with no mathematical framework tying performance to the channel's physics. Shannon introduced entropy, capacity, and the two theorems in one 55-page paper, and the field has been living inside that framework ever since.
The Indian connection is via C. R. Rao (Fisher information, Cramér-Rao bound, Rao distance) and Prasanta Mahalanobis (Mahalanobis distance, connections between information and statistics), both of whom developed parallel information-theoretic ideas in statistical inference around the same decades. Modern Indian work in information theory thrives at IISc, TIFR, and the IITs, with strong groups in coding theory, network information theory, and quantum information.
Where this leads next
- Von Neumann entropy — the quantum generalisation, where density operators replace probability distributions and every \log becomes a matrix log.
- Classical repetition codes — the simplest error-correcting code and its failure modes against quantum channels.
- Quantum mutual information — I(A; B) for density operators; always \geq 0, measures total correlations (classical plus quantum).
- Joint and conditional entropy — the quantum version, where S(A | B) can be negative and encodes entanglement.
- Channel capacity — Shannon's capacity for classical channels, and its several quantum analogues (classical, private, quantum capacities).
References
- Claude E. Shannon, A Mathematical Theory of Communication (1948) — Bell System Technical Journal PDF.
- Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (2nd ed., 2006), Chapters 2–8 — Wiley.
- David J. C. MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapters 2–10 — inference.org.uk/mackay/itila.
- Wikipedia, Entropy (information theory) — definitions, properties, and channel-capacity formulas.
- Wikipedia, Shannon's source coding theorem — formal statement and proof sketch of the AEP-based compression bound.
- John Preskill, Lecture Notes on Quantum Computation, Ch. 10 (Quantum information theory) — theory.caltech.edu/~preskill/ph229.