Overview
In physics, entropy measures disorder. In information theory, entropy measures surprise, averaged over a source of messages. If I tell you “The sun rose today,” that message carries almost no information (zero surprise). If I tell you “I won the lottery,” it carries a lot (high surprise).
Core Idea
Shannon Entropy ($H$): The average amount of information produced by a stochastic source of data. $$ H(X) = - \sum_{x} p(x) \log_2 p(x) $$ Measured in bits when the logarithm is taken base 2.
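To make the formula concrete, here is a minimal sketch in Python (the function name `shannon_entropy` is chosen just for this illustration):

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum p(x) log p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([1/6] * 6))                 # fair six-sided die: log2(6) ≈ 2.585 bits
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # heavily skewed source: far less information per symbol
```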
Formal Definition
Bit: The amount of information needed to choose between two equally likely alternatives (Yes/No, Heads/Tails).
Intuition
Imagine a coin toss.
- Fair coin (50/50): Maximum uncertainty. Maximum entropy (1 bit).
- Rigged coin (100% Heads): Zero uncertainty. Zero entropy (0 bits).
- Biased coin (90% Heads): Low uncertainty. Low entropy (about 0.47 bits; a quick check follows).
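A quick check of those three numbers, using a binary entropy helper (the name `binary_entropy` is chosen here for illustration):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty at all
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # fair coin   -> 1.0 bit
print(binary_entropy(1.0))  # rigged coin -> 0.0 bits
print(binary_entropy(0.9))  # biased coin -> ~0.469 bits
```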
Examples
- Password Strength: A password like “123456” has low entropy (it comes from a small, predictable pool and is easy to guess). A password like “x9#mP2!q”, drawn at random from a large character set, has high entropy; a rough calculation follows this list.
- English Language: English is redundant. If I write “Th_ q_ick br_wn f_x,” you can guess the missing letters. This means English carries fewer bits per letter than a random string of letters (a uniformly random letter carries $\log_2 26 \approx 4.7$ bits; Shannon estimated English at roughly 1 bit per letter).
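A rough way to put numbers on the password example, assuming each character is drawn uniformly and independently from its pool (real users rarely do this, so these figures are upper bounds):

```python
import math

def ideal_password_entropy(length, pool_size):
    """Upper-bound entropy in bits for `length` characters drawn uniformly from `pool_size` symbols."""
    return length * math.log2(pool_size)

print(ideal_password_entropy(6, 10))  # digits only, like "123456": ~19.9 bits
print(ideal_password_entropy(8, 94))  # printable ASCII, like "x9#mP2!q": ~52.4 bits
```

In practice “123456” is far weaker than even 19.9 bits suggests, because attackers model the distribution of real passwords rather than uniform digits.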
Common Misconceptions
- “Entropy is chaos.” (It’s a measure of information content. Random noise has maximum entropy because it’s maximally unpredictable).
- “It’s the same as thermodynamic entropy.” (The formulas share the same mathematical form, but the concepts are distinct; deep links between them do exist, via statistical mechanics).
Related Concepts
- Redundancy: The fraction of a source’s maximum possible entropy that it does not use: $R = 1 - H/H_{\max}$.
- Mutual Information: How much knowing $X$ reduces your uncertainty about $Y$.
- KL Divergence: A measure of how different two probability distributions are (it is asymmetric, so not a true distance); all three are sketched below.
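A small sketch of all three quantities for discrete distributions, in plain Python (the toy joint distribution is invented for this example):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Redundancy of a biased coin relative to the 1-bit maximum: R = 1 - H / H_max
h = entropy([0.9, 0.1])
print(1 - h / 1.0)

# Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) for a toy 2x2 joint distribution
joint = [[0.4, 0.1],   # P(X=0, Y=0), P(X=0, Y=1)
         [0.1, 0.4]]   # P(X=1, Y=0), P(X=1, Y=1)
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
print(entropy(px) + entropy(py) - entropy([p for row in joint for p in row]))

# KL divergence from a biased coin to a fair coin
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```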
Applications
- Data Compression: On average, you cannot losslessly compress data below its entropy (Shannon’s Source Coding Theorem).
- Machine Learning: Cross-entropy is the standard loss function for classification; a sketch follows this list.
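A minimal sketch of a cross-entropy loss for a single classification example (plain Python; the toy probabilities are invented for illustration):

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """H(p, q) = -sum p(x) log2 q(x); confident wrong predictions are punished hard."""
    return -sum(p * math.log2(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

one_hot_label = [0.0, 1.0, 0.0]  # the true class is class 1

print(cross_entropy(one_hot_label, [0.1, 0.8, 0.1]))  # confident and right -> ~0.32 bits
print(cross_entropy(one_hot_label, [0.4, 0.3, 0.3]))  # unsure              -> ~1.74 bits
print(cross_entropy(one_hot_label, [0.8, 0.1, 0.1]))  # confident and wrong -> ~3.32 bits
```

Most machine learning libraries use the natural logarithm instead of base 2, which only changes the units (nats instead of bits).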
Criticism / Limitations
Shannon entropy assumes we know the true probability distribution. In the real world, we usually have to estimate it from data, and naive estimates from small samples tend to be biased low.
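A small sketch of that problem, assuming a fair eight-sided die so the true entropy is exactly 3 bits; the naive “plug-in” estimate from a handful of samples typically comes out low:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    """Entropy estimate from observed frequencies (the plug-in estimator)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
true_entropy = math.log2(8)  # fair 8-sided die: exactly 3 bits

for n in (10, 100, 10_000):
    samples = [random.randrange(8) for _ in range(n)]
    print(n, "samples:", round(plug_in_entropy(samples), 3), "vs true", true_entropy)
```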
Further Reading
- Shannon, C. E., “A Mathematical Theory of Communication” (1948)
- Gleick, J., The Information: A History, a Theory, a Flood (2011)