
Maximum Likelihood Estimation

Lecture 7

OLS works for linear models. What do we do when the outcome is binary, a count, or bounded?

The limits of OLS for non-linear outcomes.

Binary outcomes: Y ∈ {0, 1}.
  • OLS can predict probabilities below 0 or above 1, impossible values.
  • Residuals are heteroskedastic by construction, so default standard errors are unreliable.
  • The linear probability model (LPM) is a useful approximation near the mean, but breaks down in the tails.
Count outcomes: Y ∈ {0, 1, 2, ...}.
  • Counts cannot be negative. OLS can predict negative counts.
  • Distribution is often skewed, not normal.
The solution: specify the distribution and maximize the likelihood.
  • MLE is a general estimation principle that works for any correctly specified model.
The likelihood is the probability of observing the data, viewed as a function of the parameters.
L(θ; y) = P(Y = y | θ)
Fix the data y. Vary the parameter θ. The likelihood answers: which value of θ makes the observed data most probable?
For a random sample of n observations, the joint likelihood is the product of individual densities (assuming independence):
L(θ) = ∏ᵢ₌₁ⁿ f(yᵢ | θ)
We maximize the log-likelihood. The log is a monotone transformation, so it has the same argmax.
ℓ(θ) = log L(θ) = ∑ᵢ₌₁ⁿ log f(yᵢ | θ)
Converting a product to a sum makes calculus tractable and prevents numerical underflow (products of many small probabilities can round to zero).
MLE estimator: θ̂_MLE = argmax_θ ℓ(θ). Find the parameter value that maximizes the log-likelihood.
In practice, numerical optimization (Newton-Raphson, gradient ascent) is used because closed-form solutions rarely exist for non-linear models.
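To make the mechanics concrete, here is a minimal sketch of numerical MLE (assuming Python with NumPy and SciPy; the Poisson sample, seed, and starting value are illustrative). It maximizes the log-likelihood of an i.i.d. Poisson sample and checks the result against the closed-form MLE, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated i.i.d. sample: 500 draws from Poisson(mu = 3).
rng = np.random.default_rng(42)
y = rng.poisson(3.0, size=500)

# Negative log-likelihood (optimizers minimize by convention).
# The log(y!) term is dropped: it does not depend on mu.
def neg_loglik(params):
    mu = params[0]
    if mu <= 0:
        return np.inf  # keep the search in the valid parameter region
    return -np.sum(y * np.log(mu) - mu)

res = minimize(neg_loglik, x0=[1.0], method="Nelder-Mead")
print("numerical MLE of mu:", res.x[0])
print("closed-form MLE (sample mean):", y.mean())  # the two agree
```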
OLS is MLE under normality.
Assume Yᵢ = β₀ + β₁Xᵢ + uᵢ with uᵢ ~ N(0, σ²). Then the density of Yᵢ is:
f(yᵢ | β, σ²) = (2πσ²)^(−1/2) exp(−(yᵢ − β₀ − β₁xᵢ)² / (2σ²))
Maximizing the log-likelihood with respect to β is equivalent to minimizing the sum of squared residuals, exactly the OLS criterion.
MLE is a strict generalization of OLS. OLS is the special case when errors are normally distributed and the model is linear. When those assumptions break down, MLE gives us a principled alternative.
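A quick numerical check of this equivalence (a sketch assuming Python with NumPy, SciPy, and statsmodels; the simulated data are illustrative): maximizing the normal log-likelihood over (β₀, β₁, σ) reproduces the OLS coefficients.

```python
import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Negative normal log-likelihood in (b0, b1, log_sigma);
# parameterizing sigma on the log scale keeps it positive.
def neg_loglik(params):
    b0, b1, log_s = params
    s2 = np.exp(2 * log_s)
    resid = y - b0 - b1 * x
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + resid**2 / s2)

mle = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
ols = sm.OLS(y, sm.add_constant(x)).fit()
print("MLE betas:", mle.x[:2])   # same argmax:
print("OLS betas:", ols.params)  # min SSR == max normal log-likelihood
```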

Asymptotic properties of MLE.

Consistency.
  • θ̂_MLE → θ₀ in probability as n → ∞, provided the model is correctly specified (see the simulation sketch after this list).
Asymptotic normality.
  • √n (θ̂_MLE − θ₀) → N(0, I(θ₀)⁻¹) in distribution
  • Where I(θ) is the Fisher information, the expected curvature of the log-likelihood.
Efficiency (Cramér–Rao).
  • Among all consistent, asymptotically normal estimators, MLE achieves the lowest possible asymptotic variance.
  • MLE is asymptotically efficient when the model is correctly specified.
The catch.
  • All of this requires correct model specification. Misspecify the distribution and estimates can be inconsistent.
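A small Monte Carlo sketch can illustrate both consistency and asymptotic normality (assuming Python with NumPy; the sample sizes and true p₀ = 0.3 are illustrative). For the Bernoulli MLE, I(p) = 1/(p(1 − p)), so √n (p̂ − p₀) should have standard deviation √(p₀(1 − p₀)) ≈ 0.458:

```python
import numpy as np

# Bernoulli MLE p-hat = sample mean; Fisher information I(p) = 1/(p(1-p)).
rng = np.random.default_rng(0)
p0 = 0.3
for n in (50, 500, 5000):
    p_hat = rng.binomial(n, p0, size=10_000) / n  # 10,000 replications
    z = np.sqrt(n) * (p_hat - p0)                 # root-n scaled error
    print(f"n={n:5d}  mean(p_hat)={p_hat.mean():.4f}  "
          f"sd(z)={z.std():.3f}  theory={np.sqrt(p0 * (1 - p0)):.3f}")
```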

How do we model P(Y = 1 | X) so the predicted probability always lies between 0 and 1?

The logit model applies the logistic function to a linear index.
P(Yᵢ = 1 | Xᵢ) = Λ(β₀ + β₁Xᵢ) = e^(β₀+β₁Xᵢ) / (1 + e^(β₀+β₁Xᵢ))
Λ(·) is the logistic (sigmoid) function. Its range is (0, 1) for all real inputs, so predicted probabilities are always valid.
Log-odds interpretation. The coefficients give the change in the log-odds of the outcome:
log(p / (1 − p)) = β₀ + β₁X
A one-unit increase in X multiplies the odds by e^β₁. This is the odds ratio.
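A tiny numerical sketch of the odds-ratio interpretation (assuming Python with NumPy; the coefficient values are illustrative). Whatever the starting x, a one-unit increase multiplies the odds by e^β₁:

```python
import numpy as np

# Logistic function: maps any real index into (0, 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.5  # illustrative coefficients
for x in (0.0, 1.0):
    p = logistic(b0 + b1 * x)
    print(f"x={x}: p={p:.3f}, odds={p / (1 - p):.3f}")

# The ratio of the two odds equals e^b1, no matter the starting x.
print("odds ratio:", np.exp(b1))  # ~1.649
```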
The probit model uses the normal CDF instead of the logistic function.
P(Yᵢ = 1 | Xᵢ) = Φ(β₀ + β₁Xᵢ)
Φ(·) is the standard normal CDF. Like the logistic function, its range is (0, 1).
Latent variable interpretation. Suppose there is an unobserved continuous variable Yᵢ* = β₀ + β₁Xᵢ + uᵢ with uᵢ ~ N(0, 1). We observe Yᵢ = 1 if and only if Yᵢ* > 0. That gives the probit model.
Probit is the natural choice when you believe the underlying process is driven by normally distributed shocks. Logit is easier to interpret via odds ratios. In practice, results are nearly identical.
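The latent-variable story is easy to verify by simulation (a sketch assuming Python with NumPy and statsmodels; the coefficients and sample size are illustrative): generate Y* with standard normal shocks, observe only its sign, and probit recovers the parameters.

```python
import numpy as np
import statsmodels.api as sm

# Latent-variable DGP: y* = 0.5 + 1.0*x + u, u ~ N(0, 1); observe y = 1{y* > 0}.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = (0.5 + 1.0 * x + rng.normal(size=n) > 0).astype(int)

res = sm.Probit(y, sm.add_constant(x)).fit(disp=0)
print(res.params)  # recovers (0.5, 1.0) up to sampling error
```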

Logit vs. probit: practical differences.

Logit

Logistic CDF. Slightly heavier tails than probit. Coefficients are log-odds ratios, easy to exponentiate into odds ratios. Widely used in economics and epidemiology.

Probit

Normal CDF. Tails thin out faster. Natural latent-variable interpretation. Coefficients not directly interpretable as odds ratios. Common in structural models.

In practice: both give nearly identical marginal effects and predictions.
  • Logit coefficients ≈ 1.6 × probit coefficients (due to variance normalization).
  • The choice rarely changes substantive conclusions.
  • Prefer logit for odds-ratio tables, prefer probit for structural work.
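The points above can be checked directly (a sketch assuming Python with NumPy and statsmodels; the data-generating process is illustrative): fit both models to the same simulated data and compare the slope ratio and the average marginal effects.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.25 + 0.8 * x))))  # logistic DGP
X = sm.add_constant(x)

logit = sm.Logit(y, X).fit(disp=0)
probit = sm.Probit(y, X).fit(disp=0)

print("slope ratio logit/probit:", logit.params[1] / probit.params[1])  # ~1.6
print("logit AME: ", logit.get_margeff(at="overall").margeff)
print("probit AME:", probit.get_margeff(at="overall").margeff)  # nearly identical
```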
Logit and probit coefficients are not marginal effects. Always compute marginal effects.
The marginal effect of X on P(Y=1) depends on the current value of X because the response function is non-linear:
∂P / ∂X = λ(β₀ + β₁X) ⋅ β₁
where λ(·) is the density (the derivative of the CDF). The weight λ(·) is largest when the index β₀ + β₁X is near zero and shrinks toward the tails.
Average marginal effects (AME): evaluate the derivative at each observation's covariate values and average. This is the most common and most interpretable summary.
Marginal effects at the mean (MEM): evaluate the derivative at the sample mean. Faster but can be misleading when the mean is in the tail of the distribution.
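Here is a sketch computing AME and MEM by hand and, for comparison, via statsmodels' get_margeff (assuming Python with NumPy and statsmodels; the simulated data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * x))))
res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
b0, b1 = res.params

def lam(z):  # logistic density, the derivative of the logistic CDF
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

ame = np.mean(lam(b0 + b1 * x) * b1)  # average derivative over the sample
mem = lam(b0 + b1 * x.mean()) * b1    # derivative at the mean of x
print("by hand: AME =", ame, " MEM =", mem)

print(res.get_margeff(at="overall").margeff)  # AME, matches
print(res.get_margeff(at="mean").margeff)     # MEM, matches
```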
The log-likelihood for a binary outcome model.
Each observation contributes either log pᵢ (if Yᵢ = 1) or log(1 − pᵢ) (if Yᵢ = 0), where pᵢ = P(Yᵢ = 1 | Xᵢ):
ℓ(β) = ∑ᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)]
This is the binary cross-entropy loss, the same objective used in machine-learning classifiers.
There is no closed-form solution. Optimization proceeds numerically. Most software uses IRLS (iteratively reweighted least squares) or Newton-Raphson.
Standard errors come from the Hessian of the log-likelihood: Var(β̂) ≈ −H(β̂)⁻¹, where H is the matrix of second derivatives.
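As an illustration of the numerical machinery, here is a hand-rolled Newton-Raphson for the logit log-likelihood with Hessian-based standard errors (a sketch assuming Python with NumPy; the simulated design and tolerance are illustrative):

```python
import numpy as np

# Newton-Raphson for the logit log-likelihood.
# Gradient: X'(y - p).  Hessian: -X' diag(p(1 - p)) X.
rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.5, -1.0]))))

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    H = -(X * (p * (1 - p))[:, None]).T @ X
    step = np.linalg.solve(H, grad)  # Newton step: H^{-1} gradient
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-H)))  # SEs from the inverse negative Hessian
print("beta:", beta, " SE:", se)
```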

Goodness of fit for MLE models.

Likelihood ratio (LR) test.
  • LR = −2[ℓ(β̂_R) − ℓ(β̂_U)] ~ χ²(q) under the null.
  • Compares a restricted model (fewer parameters) to an unrestricted model.
  • Analogous to the F-test in OLS.
Pseudo-R² (McFadden).
  • 1 − ℓ(β̂) / ℓ(β̂₀), where ℓ(β̂₀) is the intercept-only log-likelihood.
  • Ranges from 0 to 1 but is not directly comparable to OLS R².
  • Values of 0.2–0.4 are considered a good fit for binary models.
Percent correctly predicted and ROC curve.
  • Classification-based metrics. Sensitive to the chosen probability threshold (typically 0.5).
  • ROC/AUC is threshold-free and preferred for prediction tasks.
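A sketch computing the LR statistic and McFadden pseudo-R² from fitted log-likelihoods (assuming Python with NumPy, SciPy, and statsmodels; in the simulated data x2 is truly irrelevant, so the restriction holds):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 1000
x1, x2 = rng.normal(size=(2, n))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.9 * x1))))  # x2 is irrelevant

unres = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
restr = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

lr = -2 * (restr.llf - unres.llf)  # H0: coefficient on x2 is zero (q = 1)
print("LR stat:", lr, " p-value:", chi2.sf(lr, df=1))

print("pseudo-R2 by hand:", 1 - unres.llf / unres.llnull)
print("statsmodels prsquared:", unres.prsquared)  # same number
```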
Poisson regression models count outcomes using MLE.
Assume Yᵢ | Xᵢ ~ Poisson(μᵢ) with μᵢ = exp(β₀ + β₁Xᵢ). The exponential mean function (log link) ensures μᵢ > 0.
Interpretation: a one-unit increase in X multiplies the expected count by e^β₁. Equivalently, β₁ is the change in log μ, an elasticity when X is in logs.
Equidispersion: Poisson requires E[Y] = Var(Y) = μ. If the variance exceeds the mean (overdispersion), Poisson SEs are too small. Use negative binomial regression or quasi-Poisson robust SEs.
Poisson regression is also widely used for difference-in-differences with count outcomes, and is robust to misspecification of the Poisson distribution as long as the conditional mean is correctly specified (Wooldridge 1999).
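A minimal Poisson regression sketch with a crude overdispersion check (assuming Python with NumPy and statsmodels; the simulated data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.3 * x))  # exponential mean keeps mu > 0

res = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print("e^b1 (multiplier per unit of x):", np.exp(res.params[1]))

# Crude overdispersion check: Pearson chi2 / df should be near 1.
# Values well above 1 point to negative binomial or robust SEs.
print("dispersion:", res.pearson_chi2 / res.df_resid)
```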

Three equivalent tests for MLE models.

Wald test.
  • Uses only the unrestricted estimates: (β̂ − β₀)ᵀ [Var(β̂)]⁻¹ (β̂ − β₀) ~ χ².
  • Most common in practice because it requires only one model to be estimated.
Likelihood ratio (LR) test.
  • Requires estimating both the restricted and unrestricted models. Motivated by the Neyman-Pearson lemma, under which likelihood-ratio tests are most powerful for simple hypotheses.
Score (Lagrange multiplier) test.
  • Uses only the restricted model. Useful when the unrestricted model is expensive to estimate.
  • Breusch-Pagan and White heteroskedasticity tests are score tests.
All three are asymptotically equivalent under the null.
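A sketch contrasting the Wald and LR computations on the same simulated data (assuming Python with NumPy, SciPy, and statsmodels; by asymptotic equivalence the two statistics should be close in large samples):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(8)
n = 1500
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.1 + 0.4 * x))))

unres = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
restr = sm.Logit(y, np.ones((n, 1))).fit(disp=0)  # restricted: intercept only

# Wald: uses only the unrestricted fit.
wald = unres.params[1] ** 2 / unres.cov_params()[1, 1]
# LR: needs both fits.
lr = -2 * (restr.llf - unres.llf)

print("Wald:", wald, " LR:", lr)  # close in large samples
print("p-values:", chi2.sf(wald, 1), chi2.sf(lr, 1))
```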

Common mistakes with MLE models.

Interpreting coefficients as marginal effects.
  • Logit/probit coefficients measure effects on the log-odds or the latent index, not on the probability. Always report AMEs.
Comparing coefficients across models.
  • The scale of logit/probit coefficients depends on the residual variance, so adding controls rescales every coefficient. Do not interpret a coefficient growing larger across specifications as evidence that the effect grew.
Using pseudo-R² like OLS R².
  • Pseudo-R² is not a variance-explained measure. It cannot be directly compared across different datasets or outcome definitions.
Ignoring separation.
  • If a covariate perfectly predicts the outcome, the MLE does not exist (coefficient → ±∞). Use penalized MLE or Firth logit. A simple screen is sketched below.
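A simple screen for complete separation on a single covariate (a sketch assuming Python with NumPy; the toy data and the helper name `separates` are hypothetical):

```python
import numpy as np

# Screen for complete separation on one covariate: if the ranges of x
# for y = 0 and y = 1 do not overlap, the logit MLE does not exist.
def separates(x, y):
    return x[y == 0].max() < x[y == 1].min() or x[y == 1].max() < x[y == 0].min()

y = np.array([0, 0, 0, 1, 1, 1])
x_bad = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # perfectly separates y
x_ok = np.array([1.0, 4.0, 2.0, 3.0, 5.0, 6.0])   # ranges overlap

print(separates(x_bad, y))  # True  -> penalized MLE / Firth logit
print(separates(x_ok, y))   # False
```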
MLE is the general principle underlying most estimators in econometrics.
  • OLS is MLE under normally distributed errors. It is a special case.
  • Logit and probit extend MLE to binary outcomes. Report marginal effects, not raw coefficients.
  • Poisson extends MLE to count data. Check for overdispersion before reporting standard errors.