Overview

In physics, entropy is disorder. In information theory, entropy is average surprise. If I tell you “The sun rose today,” that message carries almost no information (zero surprise). If I tell you “I won the lottery,” it carries a great deal (high surprise). Entropy measures how surprising a source’s messages are on average.

Core Idea

Shannon Entropy ($H$): The average amount of information produced by a stochastic source of data. $$ H(X) = - \sum_{x} p(x) \log_2 p(x) $$ Measured in bits when the logarithm is taken base 2.
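
For a fair coin with $p(\text{Heads}) = p(\text{Tails}) = \tfrac{1}{2}$, the formula gives $$ H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}, $$ while a source with only one possible outcome gives $0$ bits.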

Formal Definition

Bit: The amount of information needed to choose between two equally likely alternatives (Yes/No, Heads/Tails).

Intuition

Imagine a coin toss.

  • Fair coin (50/50): Maximum uncertainty. High entropy (1 bit).
  • Rigged coin (100% Heads): Zero uncertainty. Zero entropy (0 bits).
  • Biased coin (90% Heads): Low uncertainty. Low entropy (≈0.47 bits).
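
A minimal sketch of the formula at work on the three coins above, assuming Python (the function name is just an illustrative choice):

```python
import math

def shannon_entropy(probs):
    """H(X) = sum over outcomes of -p(x) * log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(shannon_entropy([1.0, 0.0]))  # rigged coin -> 0.0 bits
print(shannon_entropy([0.9, 0.1]))  # biased coin -> ~0.469 bits
```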

Examples

  • Password Strength: A password like “123456” is drawn from a small pool of predictable choices, so it has low entropy (easy to guess). A password like “x9#mP2!q,” chosen at random from a large character set, has high entropy.
  • English Language: English has redundancy. If I write “Th_ q_ick br_wn f_x,” you can guess the missing letters. This means English carries less entropy per letter than a random string of letters; a rough empirical check follows below.
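
One rough way to see this is to count character frequencies in a snippet of English, assuming Python; the sample sentence is an arbitrary illustration, and a single-character count only captures part of the redundancy:

```python
import math
from collections import Counter

def empirical_entropy(text):
    """Per-character entropy of the observed character frequencies in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "information theory measures the average surprise carried by a message"
print(f"observed:        {empirical_entropy(sample):.2f} bits/char")
print(f"uniform maximum: {math.log2(27):.2f} bits/char")  # 26 letters + space
# The observed value already falls below the uniform ceiling. The fill-in-the-blanks
# kind of redundancy (letters predicting their neighbours) pushes English's true
# per-character entropy much lower still, to roughly 1 bit in Shannon's experiments.
```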

Common Misconceptions

  • “Entropy is chaos.” (It’s a measure of information content. Random noise has maximum entropy because it’s maximally unpredictable).
  • “It’s the same as thermodynamic entropy.” (The formulas share the same mathematical form, but the two concepts arose independently and measure different things, though deep links exist).

Related Concepts

  • Redundancy: How far a source falls short of its maximum possible entropy: $R = 1 - H/H_{\max}$.
  • Mutual Information: How much knowing X tells you about Y.
  • KL Divergence: A measure of how different two probability distributions are.
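
For reference, the standard definitions of the last two, written with the same base-2 logarithm used above: $$ I(X;Y) = \sum_{x,\,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}, \qquad D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} $$ Mutual information is zero exactly when $X$ and $Y$ are independent, and KL divergence is zero exactly when the two distributions are identical.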

Applications

  • Data Compression: No lossless scheme can, on average, represent data in fewer bits per symbol than its entropy (Shannon’s Source Coding Theorem); a rough check appears after this list.
  • Machine Learning: Classification models are typically trained with a cross-entropy loss, which penalizes assigning low probability to the observed outcome.
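
A minimal empirical check of the compression bound, assuming Python's standard zlib as the compressor; the bias, sample size, and variable names are arbitrary choices:

```python
import math
import random
import zlib

p_heads, n = 0.9, 100_000  # biased binary source; entropy ~0.469 bits/symbol

entropy_bound = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))

# Draw a sample and pack 8 symbols per byte, so the compressor sees raw bits
# rather than one whole byte per coin flip.
bits = [1 if random.random() < p_heads else 0 for _ in range(n)]
packed = bytearray()
for i in range(0, n, 8):
    byte = 0
    for b in bits[i:i + 8]:
        byte = (byte << 1) | b
    packed.append(byte)

achieved = len(zlib.compress(bytes(packed), 9)) * 8 / n  # bits per symbol after compression

print(f"entropy bound: {entropy_bound:.3f} bits/symbol")
print(f"zlib achieves: {achieved:.3f} bits/symbol")
# zlib lands above the bound; no lossless compressor can average below it.
```

A purpose-built coder for this source would get closer to the bound than general-purpose zlib, but none can cross it.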

Criticism / Limitations

Shannon entropy assumes we know the probability distribution. In the real world, we often have to estimate it.
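
A small sketch of that estimation problem, assuming Python; the die, the 10-roll sample size, and the number of trials are arbitrary choices:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    """Estimate H by replacing the unknown p(x) with observed frequencies."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

true_h = math.log2(6)  # a fair six-sided die: ~2.585 bits

# Average the estimate over many small experiments of only 10 rolls each.
trials = 2000
avg = sum(plug_in_entropy([random.randint(1, 6) for _ in range(10)])
          for _ in range(trials)) / trials

print(f"true entropy:     {true_h:.3f} bits")
print(f"average estimate: {avg:.3f} bits")
# With so few samples the plug-in estimate is systematically too low: some faces
# simply don't show up, so the estimated distribution looks less uncertain than it is.
```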

Further Reading

  • Claude Shannon, “A Mathematical Theory of Communication” (1948)
  • James Gleick, The Information: A History, a Theory, a Flood (2011)