Overview

How do you fit a movie on a DVD? Or a song on an iPod? By throwing away the parts you don’t need (Lossy) or by finding patterns (Lossless).

Core Idea

Redundancy: Most data repeats itself. “AAAAA” can be written as “5A”. Compression removes redundancy.

Formal Definition (if applicable)

Lossless Compression: The original data can be perfectly reconstructed (ZIP, PNG). Used for text and code. Lossy Compression: Some data is lost forever, but the file is much smaller (JPEG, MP3). Used for images and audio where “good enough” is okay.

Intuition

  • Huffman Coding: Give short codes to common letters (e = 0) and long codes to rare letters (z = 11011). Like Morse Code (E = dot, Q = dash-dash-dot-dash).
  • Perceptual Coding: The human ear can’t hear quiet sounds right after a loud sound. MP3 deletes the quiet sounds.

Examples

  • Run-Length Encoding (RLE): “WWWWWWWWWWWWBWWWWWWWWWWWWBBB” -> “12W1B12W3B”.
  • Lempel-Ziv (LZ77): The algorithm inside ZIP files. It looks for repeated words and replaces them with pointers to the previous occurrence.

Common Misconceptions

  • “You can compress anything.” (You can’t compress random noise. If a file is already compressed, compressing it again might make it bigger.)
  • “Lossy looks bad.” (Modern codecs like HEVC or AV1 are almost indistinguishable from the original).
  • Kolmogorov Complexity: The length of the shortest program that can output a string.
  • Entropy Rate: The theoretical limit of compression.

Applications

  • Streaming: Netflix, YouTube, Spotify.
  • Storage: SSDs compress data on the fly to save space.

Criticism / Limitations

“Generation Loss.” If you save a JPEG as a JPEG over and over, it turns into a blocky mess.

Further Reading

  • Sayood, Introduction to Data Compression