Data Compression

Overview

How do you fit a movie on a DVD? Or a song on an iPod? By throwing away the parts you don’t need (Lossy) or by finding patterns (Lossless).

Core Idea

Redundancy: Most data repeats itself. “AAAAA” can be written as “5A”. Compression removes redundancy.

Formal Definition (if applicable)

Lossless Compression: The original data can be perfectly reconstructed (ZIP, PNG). Used for text and code. Lossy Compression: Some data is lost forever, but the file is much smaller (JPEG, MP3). Used for images and audio where “good enough” is okay.

Intuition

Huffman Coding: Give short codes to common letters (e = 0) and long codes to rare letters (z = 11011). Like Morse Code (E = dot, Q = dash-dash-dot-dash).
Perceptual Coding: The human ear can’t hear quiet sounds right after a loud sound. MP3 deletes the quiet sounds.

Examples

Run-Length Encoding (RLE): “WWWWWWWWWWWWBWWWWWWWWWWWWBBB” -> “12W1B12W3B”.
Lempel-Ziv (LZ77): The algorithm inside ZIP files. It looks for repeated words and replaces them with pointers to the previous occurrence.

Common Misconceptions

“You can compress anything.” (You can’t compress random noise. If a file is already compressed, compressing it again might make it bigger.)
“Lossy looks bad.” (Modern codecs like HEVC or AV1 are almost indistinguishable from the original).

Kolmogorov Complexity: The length of the shortest program that can output a string.
Entropy Rate: The theoretical limit of compression.

Applications

Streaming: Netflix, YouTube, Spotify.
Storage: SSDs compress data on the fly to save space.

Criticism / Limitations

“Generation Loss.” If you save a JPEG as a JPEG over and over, it turns into a blocky mess.

Overview#

Core Idea#

Formal Definition (if applicable)#

Intuition#

Examples#

Common Misconceptions#

Related Concepts#

Applications#

Criticism / Limitations#

Further Reading#