Overview
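Big Data is often described as the industrial revolution of information. By one widely quoted estimate, attributed to Google’s Eric Schmidt in 2010, the world now generates as much data every two days as it did in all of history up to 2003. The shift is changing how science, business, and government work.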
Core Idea
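The core idea is “n = all”: instead of drawing a sample, analyze data on the entire population. In this view, associated with Mayer-Schönberger and Cukier, knowing what is happening (correlation) is often enough to act on, even without knowing why (causation).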
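To make the contrast between sampling and analyzing everyone concrete, here is a minimal Python sketch. It is an illustration only: the population is synthetic, and the 0.3 slope, the sample size of 50, and the helper name pearson are assumptions chosen for the example, not figures from any real dataset.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: y depends weakly on x, plus noise.
population_x = [rng.gauss(0, 1) for _ in range(10_000)]
population_y = [0.3 * x + rng.gauss(0, 1) for x in population_x]

# "n = all": compute the correlation over every record.
r_all = pearson(population_x, population_y)

# Traditional approach: estimate it from a small random sample.
idx = rng.sample(range(len(population_x)), 50)
r_sample = pearson([population_x[i] for i in idx],
                   [population_y[i] for i in idx])

print(f"all 10,000 records: r = {r_all:.3f}")
print(f"sample of 50:       r = {r_sample:.3f}")
```

The full-data figure is computed from every record and so carries no sampling error, while the 50-record estimate depends on which records happen to be drawn. Neither number says anything about why x and y move together.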
Formal Definition
Big Data is commonly defined by the three Vs:
- Volume: Huge amounts of data (petabytes and beyond).
- Velocity: Data arrives continuously and is often processed in real time.
- Variety: Messy, unstructured data (text, video, GPS traces), not just spreadsheets.
Intuition
- Moneyball: Using stats to find undervalued baseball players.
- Target: In a widely reported anecdote, the retailer inferred that a teenager was pregnant before her father knew, based on her shopping habits (such as unscented lotion and vitamins).
Examples
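- Google Flu Trends: Estimating flu activity from search queries. (It eventually failed: its estimates drifted well above official figures as search behavior and the search engine itself changed, and it was discontinued in 2015.)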
- Recommendation Engines: Netflix suggesting what you will want to watch next, based on viewing data.
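As a rough illustration of how a recommendation engine can work, the sketch below uses user-based collaborative filtering with cosine similarity. This is a textbook technique, not a description of Netflix's actual system, and the users, titles, and ratings are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings (1-5); users and titles are made up for illustration.
ratings = {
    "ana":   {"Drama A": 5, "Comedy B": 1, "Thriller C": 4},
    "ben":   {"Drama A": 4, "Comedy B": 2, "Thriller C": 5, "Drama D": 5},
    "chris": {"Drama A": 1, "Comedy B": 5, "Comedy E": 4},
}

def cosine(u, v):
    """Cosine similarity over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Rank titles the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for title, rating in other_ratings.items():
            if title in ratings[user]:
                continue  # skip titles the user has already rated
            scores[title] = scores.get(title, 0.0) + sim * rating
            weights[title] = weights.get(title, 0.0) + sim
    predicted = {t: scores[t] / weights[t] for t in scores if weights[t] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)

print(recommend("ana", ratings))
```

With only three users, each unseen title is rated by a single neighbor, so the weighting has little to do. The approach becomes more informative as the number of users grows, which is where the scale of Big Data comes in.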
Common Misconceptions
- Misconception: More data = Better decisions.
- Correction: More data can mean more noise. Without a good model, it’s just “Big Garbage.”
- Misconception: Data is objective.
- Correction: Data is collected by people, and people have biases. Algorithms trained on biased data can reproduce racist or sexist patterns (algorithmic bias).
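The first correction above (more data can mean more noise) can be illustrated with a small simulation. The Python sketch below is a toy setup under stated assumptions: the target and all 1,000 features are pure random noise, the row count of 30 is arbitrary, and pearson is a helper written for this example. It shows that the more unrelated variables you examine, the stronger the best chance correlation looks.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows = 30

# A target that nothing predicts: it is random noise.
target = [rng.gauss(0, 1) for _ in range(n_rows)]

# 1,000 features that are also random noise, unrelated to the target.
abs_r = []
for _ in range(1000):
    noise_feature = [rng.gauss(0, 1) for _ in range(n_rows)]
    abs_r.append(abs(pearson(noise_feature, target)))

# Strongest correlation found among the first 10, 100, and 1,000 features.
best_by_count = {k: max(abs_r[:k]) for k in (10, 100, 1000)}
for k, best in best_by_count.items():
    print(f"best |r| among {k:>4} noise features: {best:.2f}")
```

Because every feature is noise, any correlation found here is spurious. Searching a wider dataset without a model of what should matter simply turns up more impressive-looking coincidences.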
Related Concepts
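- Machine Learning: The main set of techniques used to mine Big Data for patterns.
- Privacy: Combining large datasets makes anonymity hard to preserve.
- Data Science: The discipline of collecting, cleaning, and analyzing data.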
Applications
- Smart Cities: Optimizing traffic flow.
- Genomics: Personalized medicine based on your DNA.
Criticism and Limitations
- Surveillance Capitalism: Shoshana Zuboff’s critique that our experience is being mined for profit.
Further Reading
- Big Data by Viktor Mayer-Schönberger and Kenneth Cukier
- Weapons of Math Destruction by Cathy O’Neil