Overview

Linear regression is the “Hello World” of machine learning and the workhorse of statistical modeling. It tries to draw the “best fit” straight line through a scatterplot of data points.

Core Idea

Find the line $y = mx + b$ that minimizes the error between the line and the actual data points; under ordinary least squares (OLS), this means minimizing the sum of squared vertical distances. This allows you to predict $y$ for a new $x$.
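A minimal sketch of this idea in Python, using NumPy's least-squares polynomial fit (the data here is synthetic, invented purely for illustration):

```python
import numpy as np

# Synthetic data: points scattered around the line y = 2x + 1 (invented for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=x.shape)

# A degree-1 polyfit performs a least-squares fit of y = m*x + b
m, b = np.polyfit(x, y, deg=1)

# Use the fitted line to predict y for a new x
x_new = 7.5
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print(f"prediction at x = {x_new}: {m * x_new + b:.2f}")
```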

Formal Definition

Simple Linear Regression: $$ y = \beta_0 + \beta_1 x + \epsilon $$

  • $y$: Dependent variable (Target)
  • $x$: Independent variable (Feature)
  • $\beta_0$: Intercept
  • $\beta_1$: Slope (Coefficient)
  • $\epsilon$: Error term (random noise; its sample counterpart in a fitted model is the residual)
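Under ordinary least squares, the coefficients have closed-form estimates:

$$ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$

A minimal sketch computing these directly (the five data points are invented for illustration):

```python
import numpy as np

# Tiny invented dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates for simple linear regression
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
```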

Intuition

If you plot “Hours Studied” vs “Exam Score,” you’ll likely see an upward trend. Linear regression draws the line that best summarizes this trend, allowing you to say: “For every extra hour studied, the score increases by 5 points on average.”
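Here is that scenario as a hedged sketch using scikit-learn; the hours and scores below are invented, so the fitted slope comes out near, but not exactly, 5 points per hour:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: exam scores that rise by roughly 5 points per study hour
hours = np.array([[1], [2], [3], [4], [5], [6]])  # scikit-learn expects a 2-D feature matrix
scores = np.array([52, 58, 61, 68, 74, 79])

model = LinearRegression().fit(hours, scores)

# coef_[0] is the estimated gain per extra hour; intercept_ is the baseline score
print(f"extra points per study hour: {model.coef_[0]:.1f}")
print(f"predicted score after 7 hours: {model.predict([[7]])[0]:.0f}")
```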

Examples

  • Real Estate: Predicting house prices based on square footage.
  • Economics: Estimating the impact of interest rates on inflation.

Common Misconceptions

  • “Linear means straight line only”: In simple regression, yes. But you can include polynomial terms ($x^2$) to model curves while still using “linear” regression techniques, because the model stays linear in the parameters; see the sketch after this list.
  • “Correlation is enough”: Correlation only measures the strength of the relationship; regression also gives you the slope (how much $y$ changes per unit of $x$). The two are related by $\hat{\beta}_1 = r \, s_y / s_x$, where $r$ is the correlation and $s_x$, $s_y$ are the sample standard deviations.
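To illustrate the first point, the sketch below adds an $x^2$ column to the design matrix and solves with plain least squares; the fitted curve bends, but the model remains linear in $\beta_0$, $\beta_1$, $\beta_2$ (the data is invented):

```python
import numpy as np

# Invented curved data: points scattered around y = 0.5x^2 - x + 3
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**2 - x + 3 + rng.normal(0, 0.5, size=x.shape)

# Design matrix with columns [1, x, x^2]: still linear in the
# parameters, so ordinary least squares applies unchanged
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"beta_0 = {beta[0]:.2f}, beta_1 = {beta[1]:.2f}, beta_2 = {beta[2]:.2f}")
```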

Applications

Used for forecasting, estimating causal effects (only when the study design supports it, e.g., controlled experiments), and quantifying trends.

Criticism / Limitations

It assumes a linear relationship between $x$ and $y$. If the data is curved (e.g., exponential growth), a straight line is a poor model. Because OLS squares the errors, it is also sensitive to outliers: a single extreme point can noticeably tilt the fitted line, as the sketch below shows.
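A quick sketch of the outlier problem (data invented): corrupting a single point on an otherwise perfect line visibly changes the fitted slope.

```python
import numpy as np

# A perfect line y = 2x, then one corrupted point
x = np.arange(10, dtype=float)
y = 2.0 * x

slope_clean, _ = np.polyfit(x, y, deg=1)

y_bad = y.copy()
y_bad[-1] = 60.0  # one extreme outlier (the true value would be 18)
slope_bad, _ = np.polyfit(x, y_bad, deg=1)

# Squared-error loss lets one bad point dominate the fit
print(f"slope without outlier: {slope_clean:.2f}")  # 2.00
print(f"slope with one outlier: {slope_bad:.2f}")   # noticeably larger (~4.3)
```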

Further Reading

  • “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani