
Simple Linear Regression

Lecture 2

Does an extra year of education raise wages, and if so, by how much?

To answer that, we need a model.

A model lets us summarize a relationship with a single number.
  • How much does Y change, on average, when X increases by one unit?
We start with the simplest case: one outcome, one predictor.
  • Y = wages, X = years of education.
  • This is simple (bivariate) linear regression.
The relationship will not be perfect; there is always noise.
  • Two people with the same education earn different wages. The model must account for that.
The population regression model.
Yi = β0 + β1Xi + εi
Yi is the outcome for observation i.  Xi is the predictor.  εi is the error term, everything else that affects Y.
β0 and β1 are population parameters. They are fixed but unknown. Our job is to estimate them from data.
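
To make the data-generating process concrete, here is a minimal simulation sketch in Python with NumPy. All parameter values are made up for illustration; in practice β0 and β1 are unknown.

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 1.5, 0.08           # hypothetical population parameters
n = 500                            # sample size

X = rng.uniform(8, 20, size=n)     # years of education
eps = rng.normal(0, 0.4, size=n)   # error term: everything else affecting Y
Y = beta0 + beta1 * X + eps        # the population regression model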

What does β1 mean?

β1 is the slope, the effect of X on Y.
A one-unit increase in X is associated with a β1-unit change in Y, on average, holding all else equal.
If X = years of education and Y = log wages, then β1 = 0.08 means one more year of education is associated with roughly an 8% wage increase.
β0 is the intercept, the predicted value of Y when X = 0. Often not directly meaningful, but necessary for the line to fit.
The error term εi is not noise to be ignored.
It captures everything that affects Y other than X: ability, family background, luck, measurement error.
Key assumption: E[εi | Xi] = 0. The error has mean zero, conditional on X.
This says the omitted factors are, on average, unrelated to X. It is a strong assumption, and one we will spend much of this course learning to worry about.

How do we estimate β0 and β1 from data?

Ordinary Least Squares (OLS)

Draw a line through the data.
  • Any line gives us fitted values Ŷi = b0 + b1Xi.
  • The residual is the gap: ei = Yi − Ŷi.
Choose the line that minimizes the sum of squared residuals.
  • min ∑i ei² = ∑i (Yi − b0 − b1Xi)²
  • Squaring penalizes large errors more than small ones.
  • We square (rather than using absolute values) because squaring yields a clean closed-form solution.
The OLS estimators have closed-form solutions.
β̂1 = Cov(X, Y) / Var(X)
β̂0 = Ȳ − β̂1X̄
β̂1 is the sample covariance of X and Y divided by the sample variance of X. It measures how much Y co-moves with X, scaled by how much X varies.
The intercept β̂0 is pinned down by requiring the line to pass through the point (X̄, Ȳ).
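
A sketch of these formulas in code (Python/NumPy; the five data points are made up):

import numpy as np

def ols_simple(X, Y):
    """Closed-form OLS: slope = sample Cov(X, Y) / sample Var(X)."""
    beta1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
    beta0_hat = Y.mean() - beta1_hat * X.mean()   # line passes through (X̄, Ȳ)
    return beta0_hat, beta1_hat

# Five made-up observations (education in years, log wages).
X = np.array([10.0, 12.0, 12.0, 16.0, 18.0])
Y = np.array([2.3, 2.5, 2.4, 2.9, 3.0])
b0, b1 = ols_simple(X, Y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")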

Fitted values and residuals

The fitted value is what the model predicts.
  • Ŷi = β̂0 + β̂1Xi
The residual is the prediction error.
  • ei = Yi − Ŷi
  • Residuals are estimates of the true errors εi, but not the same thing.
Two algebraic facts that always hold by construction.
  • The residuals sum to zero: ∑i ei = 0.
  • The residuals are uncorrelated with X: ∑i Xiei = 0.
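
Both facts can be checked numerically. A quick sketch on simulated data (parameter values made up):

import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(8, 20, size=n)
Y = 1.0 + 0.08 * X + rng.normal(0, 0.4, size=n)

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)            # residuals

print(np.sum(e))                 # ≈ 0 up to floating-point error
print(np.sum(X * e))             # ≈ 0: residuals uncorrelated with X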

Under what conditions is OLS a good estimator?

The Classical Linear Regression Model assumptions.
  • CLM 1 & 2, Linearity & random sampling: The model is Yi = β0 + β1Xi + εi, estimated on a random sample from the population.
  • CLM 3, Variation in X: X is not constant, there is sample variation to exploit.
  • CLM 4, Zero conditional mean: E[εi | Xi] = 0.
Under CLM 1–4, OLS is unbiased.
E[β̂1] = β1
On average, across all possible samples we could have drawn, the OLS estimator hits the true population parameter.
The key is CLM 4. If E[ε | X] ≠ 0, that is, if the error is correlated with X, then β̂1 is biased. This is the omitted variable problem, and we will return to it repeatedly.
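
Unbiasedness is a statement about repeated sampling, so it can be illustrated by simulation: draw many samples, fit OLS in each, and average the slope estimates. A sketch with hypothetical parameters:

import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 1.0, 0.08         # true (hypothetical) parameters
n, reps = 100, 5000

slopes = np.empty(reps)
for r in range(reps):
    X = rng.uniform(8, 20, size=n)
    eps = rng.normal(0, 0.4, size=n)   # satisfies E[eps | X] = 0 by construction
    Y = beta0 + beta1 * X + eps
    slopes[r] = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)

print(slopes.mean())             # ≈ 0.08: the average estimate hits beta1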
A fifth assumption: homoskedasticity.
CLM 5, Homoskedasticity: Var(εi | Xi) = σ² for all i.
The variance of the error is the same for all values of X. It does not grow or shrink as X changes.
This assumption is needed to derive the standard formula for the variance of β̂1. When it fails (heteroskedasticity), the usual standard errors are wrong.
How precise is β̂1?
Var(β̂1) = σ² / ∑i(Xi − X̄)²
The variance of β̂1 is smaller when:
  • The error variance σ² is small, less noise makes the signal clearer.
  • There is more variation in X, a wider spread gives more information about the slope.
  • The sample size n is larger, more data always helps.
The standard error is the estimated standard deviation of β̂1.
We do not know σ², so we estimate it using the residuals:
s² = ∑i ei² / (n − 2)
We divide by n − 2 (not n) because we estimated two parameters (β̂0 and β̂1), losing two degrees of freedom.
The standard error SE(β̂1) is what gets reported next to the coefficient in regression output. It measures estimation uncertainty.
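
A sketch that computes s² and SE(β̂1) from scratch on simulated data (parameter values made up):

import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(8, 20, size=n)
Y = 1.0 + 0.08 * X + rng.normal(0, 0.4, size=n)

# OLS fit
b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)                             # residuals

s2 = np.sum(e**2) / (n - 2)                       # estimate of sigma^2, df = n - 2
se_b1 = np.sqrt(s2 / np.sum((X - X.mean())**2))   # plug into Var(beta1_hat)
print(f"beta1_hat = {b1:.4f}, SE = {se_b1:.4f}")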
The Gauss-Markov theorem.
Under CLM 1–5, OLS is BLUE: the Best Linear Unbiased Estimator.
Among all estimators that are (a) linear functions of Y and (b) unbiased, OLS has the smallest variance.
“Best” means most efficient: you cannot do better without either using a nonlinear estimator or accepting some bias.
This does not require the errors to be normally distributed, just the five assumptions above.

Is the estimated effect real, or could it be sampling noise?

The t-statistic tests whether β1 = 0.
t = β̂1 / SE(β̂1)
Under H0: β1 = 0 and the CLM assumptions, t follows a t distribution with n − 2 degrees of freedom. For large n, this is approximately N(0,1).
Rule of thumb: |t| > 2 gives a p-value below roughly 0.05. We reject H0 and conclude the effect is statistically distinguishable from zero.
The p-value is the probability of observing a |t| this large or larger if H0 were true.
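
A sketch of the test on simulated data, using scipy.stats for the t distribution (parameter values made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200
X = rng.uniform(8, 20, size=n)
Y = 1.0 + 0.08 * X + rng.normal(0, 0.4, size=n)

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)
se_b1 = np.sqrt((np.sum(e**2) / (n - 2)) / np.sum((X - X.mean())**2))

t_stat = b1 / se_b1                                   # tests H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)       # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")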

Reading regression output

The coefficient: β̂1
  • Your estimate of the slope. Interpret it: a one-unit increase in X is associated with a β̂1-unit change in Y.
The standard error: SE(β̂1)
  • How precisely is the coefficient estimated? Smaller SE = more precise.
The t-statistic and p-value
  • Is the coefficient statistically distinguishable from zero?
  • Statistical significance ≠ practical significance. A tiny effect can be highly significant with enough data.

How well does the model fit the data?

R² measures goodness of fit.
R² = 1 − RSS / TSS
TSS = total sum of squares = ∑i(Yi − Ȳ)², the total variation in Y.
RSS = residual sum of squares = ∑i ei², the variation left unexplained.
R² ranges from 0 to 1. A value of 0.3 means X explains 30% of the variation in Y. A high R² does not mean the estimates are unbiased or causally interpretable.
In simple regression, R² is the squared correlation.
R² = [Corr(X, Y)]² = r²
This is a useful check: if you know r, you know the fit of the simple regression.
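
A quick numerical check of both formulas on simulated data (parameter values made up):

import numpy as np

rng = np.random.default_rng(5)
n = 200
X = rng.uniform(8, 20, size=n)
Y = 1.0 + 0.08 * X + rng.normal(0, 0.4, size=n)

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

rss = np.sum(e**2)
tss = np.sum((Y - Y.mean())**2)
r2 = 1 - rss / tss
r = np.corrcoef(X, Y)[0, 1]                 # sample correlation
print(f"R^2 = {r2:.3f}, r^2 = {r**2:.3f}")  # identical in simple regression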
But R² is not everything. A model can have a low R² and still produce an unbiased, precisely estimated, and important causal effect. Much of applied econometrics involves data where individual-level noise is very high.
OLS gives you correlation, not causation.
Even if all the CLM assumptions hold, β̂1 is an estimate of the conditional mean relationship. Causation requires more.
The critical threat: if anything in ε is correlated with X, CLM 4 fails and β̂1 is biased. In the education-wage example, ability is in ε and is correlated with education, classic omitted variable bias.
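A simulation sketch of this bias (all numbers hypothetical): ability raises both education and wages, but the regression includes education alone, so ability sits in the error and CLM 4 fails.

import numpy as np

rng = np.random.default_rng(6)
n = 100_000

ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)      # ability raises education
log_wage = 1.0 + 0.05 * educ + 0.10 * ability + rng.normal(0, 0.3, size=n)

# Short regression of log wage on education alone: ability is in the error,
# and Cov(educ, error) != 0, so beta1_hat is biased.
b1 = np.cov(educ, log_wage, ddof=1)[0, 1] / np.var(educ, ddof=1)
print(f"estimated slope ≈ {b1:.3f}   (true direct effect of education: 0.05)")

With these made-up numbers the estimate lands near 0.09, not 0.05: education gets credit for the ability it is correlated with.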
The rest of this course is largely about diagnosing and fixing violations of CLM 4.