Overview

In physics, entropy is disorder. In information theory, entropy is average surprise. If I tell you “The sun rose today,” that message carries almost no information (zero surprise). If I tell you “I won the lottery,” it carries a great deal (high surprise). Entropy measures how surprising a source’s messages are on average.

Core Idea

Shannon Entropy ($H$): The average amount of information produced by a stochastic source of data. $$ H(X) = - \sum_{x} p(x) \log_2 p(x) $$ Measured in bits when the logarithm is taken base 2.
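
For a fair coin with $p(\text{Heads}) = p(\text{Tails}) = \tfrac{1}{2}$, the formula gives $$ H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}, $$ while a source with only one possible outcome gives $0$ bits.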

Formal Definition

Bit: The amount of information needed to choose between two equally likely alternatives (Yes/No, Heads/Tails).

Intuition

Imagine a coin toss.

  • Fair coin (50/50): Maximum uncertainty. High entropy (1 bit).
  • Rigged coin (100% Heads): Zero uncertainty. Zero entropy (0 bits).
  • Biased coin (90% Heads): Low uncertainty. Low entropy (≈0.47 bits).
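
A minimal sketch of the formula at work on the three coins above, assuming Python (the function name is just an illustrative choice):

```python
import math

def shannon_entropy(probs):
    """H(X) = sum over outcomes of -p(x) * log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(shannon_entropy([1.0, 0.0]))  # rigged coin -> 0.0 bits
print(shannon_entropy([0.9, 0.1]))  # biased coin -> ~0.469 bits
```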

Examples

  • Password Strength: A password like “123456” is drawn from a small pool of predictable choices, so it has low entropy (easy to guess). A password like “x9#mP2!q,” chosen at random from a large character set, has high entropy.
  • English Language: English has redundancy. If I write “Th_ q_ick br_wn f_x,” you can guess the missing letters. This means English carries less entropy per letter than a random string of letters; a rough empirical check follows below.
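
One rough way to see this is to count character frequencies in a snippet of English, assuming Python; the sample sentence is an arbitrary illustration, and a single-character count only captures part of the redundancy:

```python
import math
from collections import Counter

def empirical_entropy(text):
    """Per-character entropy of the observed character frequencies in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "information theory measures the average surprise carried by a message"
print(f"observed:        {empirical_entropy(sample):.2f} bits/char")
print(f"uniform maximum: {math.log2(27):.2f} bits/char")  # 26 letters + space
# The observed value already falls below the uniform ceiling. The fill-in-the-blanks
# kind of redundancy (letters predicting their neighbours) pushes English's true
# per-character entropy much lower still, to roughly 1 bit in Shannon's experiments.
```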

Common Misconceptions

  • “Entropy is chaos.” (It’s a measure of information content. Random noise has maximum entropy because it’s maximally unpredictable).
  • “It’s the same as thermodynamic entropy.” (The formulas share the same mathematical form, but the two concepts arose independently and measure different things, though deep links exist).

Related Concepts

  • Redundancy: How far a source falls short of its maximum possible entropy: $R = 1 - H/H_{\max}$.
  • Mutual Information: How much knowing X tells you about Y.
  • KL Divergence: A measure of how different two probability distributions are.
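
For reference, the standard definitions of the last two, written with the same base-2 logarithm used above: $$ I(X;Y) = \sum_{x,\,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}, \qquad D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} $$ Mutual information is zero exactly when $X$ and $Y$ are independent, and KL divergence is zero exactly when the two distributions are identical.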

Applications

  • Data Compression: No lossless scheme can, on average, represent data in fewer bits per symbol than its entropy (Shannon’s Source Coding Theorem); a rough check appears after this list.
  • Machine Learning: Classification models are typically trained with a cross-entropy loss, which penalizes assigning low probability to the observed outcome.
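
A minimal empirical check of the compression bound, assuming Python's standard zlib as the compressor; the bias, sample size, and variable names are arbitrary choices:

```python
import math
import random
import zlib

p_heads, n = 0.9, 100_000  # biased binary source; entropy ~0.469 bits/symbol

entropy_bound = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))

# Draw a sample and pack 8 symbols per byte, so the compressor sees raw bits
# rather than one whole byte per coin flip.
bits = [1 if random.random() < p_heads else 0 for _ in range(n)]
packed = bytearray()
for i in range(0, n, 8):
    byte = 0
    for b in bits[i:i + 8]:
        byte = (byte << 1) | b
    packed.append(byte)

achieved = len(zlib.compress(bytes(packed), 9)) * 8 / n  # bits per symbol after compression

print(f"entropy bound: {entropy_bound:.3f} bits/symbol")
print(f"zlib achieves: {achieved:.3f} bits/symbol")
# zlib lands above the bound; no lossless compressor can average below it.
```

A purpose-built coder for this source would get closer to the bound than general-purpose zlib, but none can cross it.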

Criticism / Limitations

Shannon entropy assumes we know the probability distribution. In the real world, we often have to estimate it.
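
A small sketch of that estimation problem, assuming Python; the die, the 10-roll sample size, and the number of trials are arbitrary choices:

```python
import math
import random
from collections import Counter

def plug_in_entropy(samples):
    """Estimate H by replacing the unknown p(x) with observed frequencies."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

true_h = math.log2(6)  # a fair six-sided die: ~2.585 bits

# Average the estimate over many small experiments of only 10 rolls each.
trials = 2000
avg = sum(plug_in_entropy([random.randint(1, 6) for _ in range(10)])
          for _ in range(trials)) / trials

print(f"true entropy:     {true_h:.3f} bits")
print(f"average estimate: {avg:.3f} bits")
# With so few samples the plug-in estimate is systematically too low: some faces
# simply don't show up, so the estimated distribution looks less uncertain than it is.
```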

Further Reading

  • Claude Shannon, “A Mathematical Theory of Communication” (1948)
  • James Gleick, The Information: A History, a Theory, a Flood (2011)