Shannon Entropy

Shannon entropy is a measure of the average uncertainty (or “surprise”) associated with a random variable. For a discrete random variable $X$ with possible outcomes $\{x_1, x_2, \ldots, x_n\}$ and probability mass function $p(x)$, the entropy is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

By convention, $0 \log 0 = 0$ (justified by continuity, since $x \log x \to 0$ as $x \to 0^+$).
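As a minimal Python sketch of this definition (the helper name `shannon_entropy` is my own choice, not from the text):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Skip zero-probability outcomes, implementing the 0 log 0 = 0 convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
print(shannon_entropy([1.0]))       # 0.0 bits: a deterministic outcome
```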

Entropy measures how surprised you expect to be when you learn the outcome of a random variable.

  • If you flip a fair coin, each outcome is equally likely—maximum surprise, maximum entropy.
  • If you flip a biased coin that lands heads 99% of the time, you’re rarely surprised—low entropy.

The key insight: entropy is the answer to “how many yes/no questions do I need, on average, to identify the outcome?”
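To make the coin comparison above concrete, here is a small self-contained computation (the 99%/1% split is the biased coin from the bullets; values in the comments are rounded):

```python
import math

# Expected surprise: sum of -p * log2(p) over the two outcomes.
fair   = -2 * (0.5 * math.log2(0.5))
biased = -(0.99 * math.log2(0.99) + 0.01 * math.log2(0.01))

print(f"fair coin:   {fair:.3f} bits")    # 1.000
print(f"biased coin: {biased:.3f} bits")  # ~0.081
```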

When using $\log_2$, entropy is measured in bits. One bit is the entropy of a fair coin flip:

$$H(\text{fair coin}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1 \text{ bit}$$

Other bases:

  • $\log_e$ (natural log): entropy in nats
  • $\log_{10}$: entropy in hartleys (rarely used)

Conversion: $H_{\text{bits}} = H_{\text{nats}} / \ln 2$
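As a quick illustration of the conversion, the same entropy computed in bits and in nats (the three-outcome distribution is arbitrary, chosen only for the example):

```python
import math

probs = [0.5, 0.3, 0.2]  # arbitrary example distribution

h_bits = -sum(p * math.log2(p) for p in probs)  # base-2 log: bits
h_nats = -sum(p * math.log(p) for p in probs)   # natural log: nats

print(h_bits)                # ~1.485 bits
print(h_nats / math.log(2))  # same value: H_bits = H_nats / ln 2
```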

Entropy has several basic properties:

  1. Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic.

  2. Maximum entropy: For $n$ outcomes, $H(X) \leq \log_2 n$, with equality iff $X$ is uniform.

  3. Additivity for independent variables: $H(X, Y) = H(X) + H(Y)$ when $X \perp Y$.

  4. Concavity: $H(\lambda p + (1-\lambda) q) \geq \lambda H(p) + (1-\lambda) H(q)$ for $0 \leq \lambda \leq 1$.

  5. Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$.
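A quick numerical sanity check of properties 2 and 3, using a small helper like the sketch earlier (an illustration, not a proof; the specific distributions are arbitrary):

```python
import math
from itertools import product

def H(probs):
    # Shannon entropy in bits, with the 0 log 0 = 0 convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Property 2: the uniform distribution over n outcomes attains log2(n).
n = 4
print(H([1 / n] * n), math.log2(n))  # both 2.0

# Property 3: for independent X and Y, the joint entropy splits additively.
px = [0.7, 0.3]
py = [0.2, 0.5, 0.3]
pxy = [a * b for a, b in product(px, py)]  # joint distribution under independence
print(H(pxy), H(px) + H(py))               # equal up to floating-point error
```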

A fair six-sided die has:

$$H(\text{die}) = -6 \cdot \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$

You need about 2.6 yes/no questions on average to identify which face came up.
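The same die calculation in Python, for verification (illustrative):

```python
import math

h_die = -sum((1 / 6) * math.log2(1 / 6) for _ in range(6))
print(h_die, math.log2(6))  # both ~2.585 bits
```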

English text gives another example. If all 26 letters were equally likely: $H_{\text{max}} = \log_2 26 \approx 4.7$ bits/letter

But English has non-uniform letter frequencies and strong dependencies between letters. Accounting for both, Shannon estimated $H_{\text{English}} \approx 1.0$–$1.5$ bits/letter.

This gap ($4.7 - 1.3 \approx 3.4$ bits) is redundancy; it’s why compression works.
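One crude way to see part of this redundancy is to compute the entropy of the empirical single-letter frequencies in a piece of English text. The sketch below uses one short sample sentence of my own and ignores dependencies between letters, so it lands between the 4.7-bit uniform maximum and Shannon’s context-aware figure; it is an illustration, not an estimate of that figure:

```python
import math
from collections import Counter

text = ("information theory treats messages as random variables and "
        "measures the average uncertainty of the source in bits")
letters = [c for c in text.lower() if c.isalpha()]

counts = Counter(letters)
total = len(letters)
h = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(f"empirical single-letter entropy: {h:.2f} bits/letter")  # below log2(26) ~= 4.70
```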

Related

  • Relates to: [[Boltzmann Entropy]], [[Kullback-Leibler Divergence]], [[Mutual Information]]
  • Required for: [[Rate-Distortion Theory]], [[Channel Capacity]], [[Source Coding Theorem]]
  • Generalizes: [[Differential Entropy]] (continuous case)
References

  • Shannon, C. (1948). “A Mathematical Theory of Communication”
  • Cover & Thomas, Elements of Information Theory, Chapter 2
  • MacKay, Information Theory, Inference, and Learning Algorithms, Chapter 2
Open Questions

  • How does the choice of logarithm base affect information-theoretic arguments in the meme framework?
  • What’s the natural “base” for measuring memetic entropy—bits (binary), or something else?