
Multiple Regression

Lecture 3

What goes wrong when we leave out a variable that matters?

The omitted variable problem

We want the effect of education on wages.
  • Simple regression: wagei = β0 + β1educi + εi
  • Ability is in ε. More-able people tend to get more education.
CLM 4 fails: E[ε | educ] ≠ 0.
  • The error is correlated with the regressor, so β̂1 picks up both the education effect and the ability effect.
  • We say β̂1 is biased upward.
The fix: put ability in the model explicitly.
  • That is exactly what multiple regression does.
The multiple regression model.
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi
There are k regressors and k + 1 parameters to estimate (including the intercept).
OLS still minimizes the sum of squared residuals, the same criterion as before, now over all k + 1 parameters simultaneously.
The key payoff: by including controls, we can isolate the effect of one variable while holding the others fixed.
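A minimal sketch of the estimator in Python (simulated data; the variable names and true coefficients are illustrative, not from any lecture dataset):

    # OLS with two regressors: one least-squares fit over all k + 1 parameters.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)    # true betas: 1, 2, -0.5

    X = np.column_stack([np.ones(n), x1, x2])             # k + 1 = 3 columns
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]       # minimizes the sum of squared residuals
    print(beta_hat)                                       # approximately [1, 2, -0.5]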

What does each coefficient mean in a multiple regression?

βj is the effect of Xj on Y, holding all other regressors constant.
This is the ceteris paribus interpretation: all else equal.
Example: in wage = β0 + β1educ + β2exper + ε, the coefficient β1 is the effect of education on wages among workers with the same experience.
This is what makes multiple regression so powerful: it lets us make comparisons we cannot make with simple regression.
But “holding constant” is statistical, not physical. It works only if the model is correctly specified and CLM 4 holds.
The Frisch-Waugh theorem: partialling out.
The OLS estimate β̂1 in a multiple regression is numerically identical to the slope from regressing Y on the residuals of X1 after projecting out all other regressors.
In plain language: multiple regression removes the variation in X1 that is explained by the other controls, then estimates the relationship between what remains and Y.
This is the precise sense in which we “hold other variables constant.” We are not literally fixing them; we are netting out their linear influence.
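A numerical sketch of the theorem (Python, simulated data); the two estimates agree to machine precision:

    # Frisch-Waugh: the multiple-regression coefficient on x1 equals the slope of y
    # on the part of x1 left over after netting out the other regressors.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    x2 = rng.normal(size=n)
    x1 = 0.6 * x2 + rng.normal(size=n)                    # x1 correlated with x2
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]         # full multiple regression

    Z = np.column_stack([np.ones(n), x2])                 # the "other" regressors
    x1_resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

    b_fwl = (x1_resid @ y) / (x1_resid @ x1_resid)        # slope of y on residualized x1
    print(b_full[1], b_fwl)                               # numerically identical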
The CLM assumptions extended to multiple regression.
  • CLM 1–4 carry over unchanged: linearity, random sampling, variation in each X, and E[ε | X1, …, Xk] = 0.
  • CLM 3′, No perfect multicollinearity: No regressor is an exact linear combination of the others.
  • CLM 5, Homoskedasticity: Var(ε | X1, …, Xk) = σ².

Multicollinearity

Perfect multicollinearity: OLS breaks down entirely.
  • If one regressor is an exact linear function of another, the coefficients are not identified: infinitely many combinations give the same fit.
  • Example: including both income in dollars and income in thousands of dollars.
High (but imperfect) multicollinearity: OLS still works, but imprecisely.
  • When regressors are highly correlated, it is hard to disentangle their separate effects.
  • Standard errors inflate. Coefficients become sensitive to small data changes.
  • OLS is still unbiased, only precision suffers.
The fix is more data, not a different estimator.
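A sketch of the precision loss described above (Python; the correlation levels are illustrative):

    # Imperfect multicollinearity: the SE on x1 inflates as corr(x1, x2) grows.
    import numpy as np

    def slope_se(rho, n=200, seed=2):
        rng = np.random.default_rng(seed)
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 + x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        sigma2 = resid @ resid / (n - 3)                  # n - k - 1 residual d.f.
        cov = sigma2 * np.linalg.inv(X.T @ X)             # classical OLS variance
        return np.sqrt(cov[1, 1])                         # SE of the x1 coefficient

    for rho in [0.0, 0.9, 0.99]:
        print(rho, slope_se(rho))                         # SE rises sharply with rho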
Under CLM 1–4, OLS is still unbiased.
E[β̂j] = βj for all j = 0, 1, …, k.
The critical condition remains CLM 4: the error must have mean zero conditional on all included regressors. Including more controls makes this more plausible, but never guaranteed.
Gauss-Markov also extends: under CLM 1–5, OLS is BLUE, the minimum-variance estimator among all linear unbiased estimators of βj.
Omitted variable bias has a precise formula.
Suppose the true model includes X2 but we omit it. The OLS estimate of β1 from the short regression converges to:
β̂1,short → β1 + β2 · δ1
where δ1 is the slope coefficient from regressing X2 on X1.
The bias is β2 · δ1: it is large when the omitted variable has a strong effect on Y (β2 large) and is strongly correlated with X1 (δ1 large).
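A quick simulation sketch of the formula (Python; the coefficients are illustrative):

    # Omitted variable bias: the short regression picks up beta1 + beta2 * delta1.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)                    # delta1 = 0.8
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)    # beta1 = 2, beta2 = 1.5

    X_short = np.column_stack([np.ones(n), x1])           # omits x2
    b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
    print(b_short)    # close to beta1 + beta2 * delta1 = 2 + 1.5 * 0.8 = 3.2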

Using the OVB formula to sign the bias

Education on wages, omitting ability.
  • β2 > 0: ability raises wages.
  • δ̂1 > 0: ability is positively correlated with education.
  • Bias = positive × positive = upward bias. Simple OLS overstates the return to education.
Class size on test scores, omitting poverty.
  • β2 < 0: poverty lowers test scores.
  • δ̂1 > 0: poorer schools tend to have larger classes.
  • Bias = negative × positive = downward bias. Simple OLS overstates the harm from class size.

How do we measure fit when there are multiple regressors?

R² always rises when you add a variable. Adjusted R² does not.
R² = 1 − RSS / TSS, never decreases as regressors are added, even if they are irrelevant.
Adjusted R² (R̄²) penalizes for the number of regressors:
R̄² = 1 − [RSS / (n − k − 1)] / [TSS / (n − 1)]
R̄² can fall when you add a variable whose contribution is smaller than its cost in degrees of freedom. Use it for model comparison, not as a target to maximize.
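A sketch of the contrast, here using statsmodels (an assumed tool choice; any OLS routine works):

    # R-squared never falls when a pure-noise regressor is added; adjusted R-squared can.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 100
    x1 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)
    noise = rng.normal(size=n)                            # irrelevant regressor

    fit1 = sm.OLS(y, sm.add_constant(x1)).fit()
    fit2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit()

    print(fit1.rsquared, fit2.rsquared)                   # the second is never smaller
    print(fit1.rsquared_adj, fit2.rsquared_adj)           # falls whenever noise has |t| < 1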
The F-statistic tests whether all slope coefficients are jointly zero.
H0: β1 = β2 = … = βk = 0, none of the regressors matter.
F = [R² / k] / [(1 − R²) / (n − k − 1)]
Under H0, F ~ F(k, n − k − 1). A large F-statistic (small p-value) means the regressors jointly explain a statistically significant share of the variation in Y.
The F-test can also test any linear restriction, not just all-zero. We will use this for joint significance of subsets of coefficients.
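The formula can be checked directly against packaged output; a sketch (Python with statsmodels, simulated data):

    # The joint F-statistic from the R-squared formula matches the packaged value.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, k = 200, 2
    X = rng.normal(size=(n, k))
    y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    r2 = fit.rsquared
    F = (r2 / k) / ((1 - r2) / (n - k - 1))
    print(F, fit.fvalue)                                  # identical: tests H0: beta1 = beta2 = 0
    print(fit.f_pvalue)                                   # its p-value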

Reading multiple regression output

Each coefficient has the same structure as before.
  • Estimate, standard error, t-statistic, p-value.
  • Interpretation: the effect of Xj on Y, holding all other included variables fixed.
The model-level statistics summarize overall fit.
  • R² and adjusted R² measure how much variation is explained.
  • The F-statistic and its p-value test whether any regressor matters.
  • Degrees of freedom: n − k − 1 residual d.f.
What to look for first.
  • Does the sign of the key coefficient make sense? Is it statistically significant?
  • Do the controls have the expected signs? Do they change the key coefficient?
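A sketch of where each of these pieces lives in statsmodels output (the attribute names are statsmodels'; other packages differ):

    # The components of a regression table, pulled out individually.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    n = 200
    X = sm.add_constant(rng.normal(size=(n, 2)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

    fit = sm.OLS(y, X).fit()
    print(fit.params)                                     # estimates
    print(fit.bse)                                        # standard errors
    print(fit.tvalues, fit.pvalues)                       # t-statistics and p-values
    print(fit.rsquared, fit.rsquared_adj)                 # fit measures
    print(fit.fvalue, fit.df_resid)                       # joint F and n - k - 1 d.f.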

Can adding a control variable ever make things worse?

Yes. Controlling for the wrong variables introduces bias.
A bad control is a variable that is itself caused by the treatment variable X1. Including it blocks part of the causal channel you are trying to measure.
Example: estimating the effect of education on wages while controlling for occupation. But occupation is partly determined by education, so controlling for it removes some of the very effect we want to measure.
Good controls are variables that affect Y and are correlated with X1, but are not caused by X1.
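A sketch of the bad-control problem (simulated; the causal structure is assumed purely for illustration):

    # Occupation is caused by education; controlling for it blocks part of the effect.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    n = 100_000
    educ = rng.normal(size=n)
    occ = 0.7 * educ + rng.normal(size=n)                 # occupation responds to education
    wage = 2.0 * educ + 1.0 * occ + rng.normal(size=n)    # total effect: 2 + 1 * 0.7 = 2.7

    print(sm.OLS(wage, sm.add_constant(educ)).fit().params[1])   # ~2.7, the total effect
    X_bad = sm.add_constant(np.column_stack([educ, occ]))
    print(sm.OLS(wage, X_bad).fit().params[1])            # ~2.0, channel through occ removed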

Extending the model: functional form

Relationships need not be linear in the original variables.
  • They must be linear in the parameters; that is what OLS requires.
Log transformation: ln(wage) = β0 + β1educ + ε.
  • β1 is now a semi-elasticity: a one-unit increase in educ raises wages by approximately 100·β1%.
  • Log-log specifications give elasticities directly.
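A sketch of the log-level reading (simulated; the 8% figure is built into the fake data):

    # Log-level specification: the educ coefficient reads as ~100*beta1 percent per year.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    n = 1_000
    educ = rng.uniform(8, 18, size=n)
    log_wage = 0.5 + 0.08 * educ + 0.2 * rng.normal(size=n)

    fit = sm.OLS(log_wage, sm.add_constant(educ)).fit()
    print(fit.params[1])                                  # ~0.08: about 8% higher wages per year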
Quadratics: β1X + β2X².
  • Allows diminishing (or increasing) returns. Both terms enter the same regression.
  • Marginal effect: β1 + 2β2X, which depends on the level of X.
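A sketch of the quadratic case (simulated data with diminishing returns built in):

    # Quadratic specification: the marginal effect beta1 + 2*beta2*x falls as x grows.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 500
    x = rng.uniform(0, 10, size=n)
    y = 1.0 + 1.5 * x - 0.1 * x**2 + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
    b0, b1, b2 = fit.params
    for x0 in [2.0, 5.0, 8.0]:
        print(x0, b1 + 2 * b2 * x0)                       # marginal effect shrinks with x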

Dummy variables

A dummy variable takes the value 0 or 1.
  • Encodes binary categories: female, union member, treated group.
wage = β0 + β1female + β2educ + ε.
  • β1 is the wage gap between women and men with the same education.
  • The intercept β0 is the baseline (for men, here).
Dummy variable trap: never include all categories.
  • With m categories, include m − 1 dummies. The omitted category is the reference group.
  • Including all creates perfect multicollinearity with the intercept.
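A sketch of both the correct coding and the trap (simulated; the wage equation mirrors the example above):

    # m - 1 dummies estimate fine; adding the m-th makes the design matrix singular.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 400
    female = rng.integers(0, 2, size=n)                   # 1 = female, 0 = male (reference)
    educ = rng.uniform(8, 18, size=n)
    wage = 5.0 - 2.0 * female + 1.2 * educ + rng.normal(size=n)

    X = sm.add_constant(np.column_stack([female, educ]))  # m - 1 = 1 dummy plus intercept
    print(sm.OLS(wage, X).fit().params)                   # approx [5, -2, 1.2]

    male = 1 - female                                     # the trap: female + male = constant
    X_trap = sm.add_constant(np.column_stack([female, male, educ]))
    print(np.linalg.matrix_rank(X_trap))                  # 3 < 4 columns: not identified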
Adding controls does not automatically give you a causal estimate.
Each good control you add removes some omitted variable bias. But you can never observe everything. There is always some unobserved variable that could violate CLM 4.
When we run a regression with many controls, we are making an assumption: conditional on the included variables, the treatment variable is “as good as randomly assigned.” This is called conditional independence or selection on observables.
Whether that assumption is credible is a substantive question, not a statistical one. It requires institutional knowledge about how the data were generated.

What comes next

Heteroskedasticity and robust standard errors.
  • When CLM 5 fails, OLS is still unbiased but the usual standard errors are wrong. Fix: heteroskedasticity-consistent (HC) robust SEs.
Instrumental Variables.
  • When CLM 4 fails and no observed control fixes it, we need an instrument, a variable that shifts X but has no direct effect on Y.
Panel Data.
  • When the same units are observed over time, fixed effects can absorb time-invariant unobservables entirely.