Panel Data

Lecture 6

What is panel data?

The same units observed at multiple points in time.

Units: individuals, firms, countries, states, counties.
N units observed over T periods → up to NT observations.

Balanced vs. unbalanced.

Balanced: every unit is observed in every period.
Unbalanced: some observations are missing. Common in practice.
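
One way to check whether a panel is balanced is to compare each unit's set of observed periods with the full set of periods in the data. The sketch below assumes the data arrive as (unit, period) pairs; the function name is_balanced and the toy rows are illustrative only.

```python
from collections import defaultdict

def is_balanced(rows):
    """rows: iterable of (unit, period) pairs, one per observation."""
    periods_by_unit = defaultdict(set)
    all_periods = set()
    for unit, period in rows:
        periods_by_unit[unit].add(period)
        all_periods.add(period)
    # Balanced means every unit is observed in every period that appears.
    return all(p == all_periods for p in periods_by_unit.values())

balanced = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
unbalanced = [("A", 1), ("A", 2), ("B", 1)]  # unit B is missing period 2

print(is_balanced(balanced), is_balanced(unbalanced))  # True False
```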

Examples.

NLSY: same individuals surveyed annually from 1979 onward.
Compustat: same firms observed across fiscal years.
State-level data: 50 states observed over decades of policy changes.

Panel data lets us control for unobserved, time-invariant confounders.

Recall the fundamental problem: omitted variables that are correlated with X bias our estimates. In cross-section, we can only control for what we observe.

Panel data gives us a different strategy: if the omitted variable does not change over time for a given unit (ability, culture, geography), we can remove it entirely by exploiting within-unit variation.

We are no longer comparing different people to each other. We are comparing the same person to themselves at different points in time.

The unobserved effects model.

Y_it = β₀ + β₁X_it + α_i + ε_it

i indexes units, t indexes time. α_i is the unobserved individual effect: it varies across units but is fixed within a unit over time.

α_i captures everything about unit i that is constant over time: innate ability, culture, geography, management quality.

The key question: is α_i correlated with X_it? The answer determines which estimator to use.

Panel data contains two kinds of variation.

Within variation

How X_it changes for the same unit over time. This is what fixed effects exploits. Free of time-invariant confounders.

Between variation

How the unit average X̄_i differs across units. This is what cross-sectional OLS uses. Susceptible to unobserved unit-level confounders.

Fixed effects uses only within variation. If X barely changes within units over time, FE will be imprecise because there is little variation left to identify β₁.
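
The split into within and between variation can be made concrete with sums of squares: total SS = within SS + between SS. A minimal sketch on a made-up balanced panel of three units and three periods (the numbers are hypothetical, chosen so that most of the variation in X is between units):

```python
from statistics import mean

# Hypothetical panel: X for three units over three periods.
x = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}

grand_mean = mean(v for xs in x.values() for v in xs)

# Within: deviations of each observation from its own unit mean.
within_ss = sum((v - mean(xs)) ** 2 for xs in x.values() for v in xs)
# Between: deviations of unit means from the grand mean, weighted by T_i.
between_ss = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in x.values())
# Total: deviations of each observation from the grand mean.
total_ss = sum((v - grand_mean) ** 2 for xs in x.values() for v in xs)

print(within_ss, between_ss, total_ss)
```

Here the within part is 6 out of a total of 60, so a fixed effects estimator on data like this would be working with a tenth of the variation in X.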

Pooled OLS ignores the panel structure.

Stack all NT observations and run OLS as if they were independent. This treats α_i as part of the error term.

If Cov(α_i, X_it) ≠ 0, pooled OLS is biased. This is exactly the omitted variable problem from cross-section, now with a time dimension.

Even if α_i is uncorrelated with X, conventional pooled OLS standard errors are wrong because α_i sits in the error term and makes observations from the same unit correlated over time.

Verdict: pooled OLS is almost never the right choice with panel data.
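
To see how badly pooled OLS can go wrong, here is a small sketch on hand-built data. Everything in it is hypothetical: the panel is noise-free, the true β₁ is set to 2, and α_i is chosen to be high where X is low.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

# Hypothetical panel: Y_it = 2 * X_it + alpha_i, no noise.
# alpha_i is high where X is low, so Cov(alpha_i, X_it) < 0.
alpha = {"A": 10, "B": 0, "C": -10}
x = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
y = {u: [2 * v + alpha[u] for v in xs] for u, xs in x.items()}

# Pooled OLS: stack every observation and ignore which unit it belongs to.
x_all = [v for u in x for v in x[u]]
y_all = [v for u in x for v in y[u]]
pooled = ols_slope(x_all, y_all)

print(pooled)  # -1.0, even though the true beta_1 is 2
```

In this constructed example the omitted α_i is strong enough to flip the sign of the estimate.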

First differencing: subtract last period from this period.

Take the model in period t and subtract period t − 1:

ΔY_it = β₁ΔX_it + Δε_it

α_i is time-invariant, so it cancels out exactly. We are left with changes in Y regressed on changes in X.
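
Written out step by step, using the unobserved effects model in periods t and t − 1 (note that the intercept β₀ cancels along with α_i):

```latex
\begin{align*}
Y_{it}     &= \beta_0 + \beta_1 X_{it} + \alpha_i + \varepsilon_{it} \\
Y_{i,t-1}  &= \beta_0 + \beta_1 X_{i,t-1} + \alpha_i + \varepsilon_{i,t-1} \\
Y_{it} - Y_{i,t-1} &= \beta_1 (X_{it} - X_{i,t-1}) + (\varepsilon_{it} - \varepsilon_{i,t-1}) \\
\Delta Y_{it} &= \beta_1 \Delta X_{it} + \Delta \varepsilon_{it}
\end{align*}
```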

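With T = 2, first differencing and fixed effects give identical estimates. For longer panels, the fixed effects estimator is more efficient when ε_it is serially uncorrelated and is generally preferred; first differencing is more efficient when ε_it follows a random walk.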

The fixed effects (within) estimator: demean each unit.

Subtract each unit’s time mean from both sides:

(Y_it − Ȳ_i) = β₁(X_it − X̄_i) + (ε_it − ε̄_i)

α_i is absorbed by the unit mean and drops out. Running OLS on the demeaned data gives the FE estimator.

Equivalently: include a dummy variable for each unit. The FE estimator is identical to OLS with N − 1 unit dummies. This is the least squares dummy variable (LSDV) estimator.

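Degrees of freedom: we lose N − 1 for the unit dummies plus one for the intercept, so N in total (equivalently, one per unit mean in the demeaning). With K regressors, the residual degrees of freedom are NT − N − K.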
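
The demeaning step can be sketched in a few lines. The panel below is hypothetical and noise-free (Y_it = 2 X_it + α_i with β₀ = 0), so the within estimator should recover β₁ = 2 exactly; the helper name demean is ours.

```python
def demean(values):
    """Subtract the mean of a list from each of its elements."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical panel: Y_it = 2 * X_it + alpha_i, no noise.
alpha = {"A": 10, "B": 0, "C": -10}
x = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
y = {u: [2 * v + alpha[u] for v in xs] for u, xs in x.items()}

# Within transformation: subtract each unit's own time mean.
x_dm = [v for u in x for v in demean(x[u])]
y_dm = [v for u in x for v in demean(y[u])]

# OLS on the demeaned data (no intercept: both sides have mean zero).
beta_fe = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)

# Implied unit effects: alpha_i_hat = Ybar_i - beta_fe * Xbar_i.
alpha_hat = {u: sum(y[u]) / len(y[u]) - beta_fe * sum(x[u]) / len(x[u]) for u in x}

print(beta_fe)    # 2.0
print(alpha_hat)  # {'A': 10.0, 'B': 0.0, 'C': -10.0}
```

The implied unit effects are what LSDV would report as dummy coefficients if it used a full set of N dummies and no intercept.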

What fixed effects cannot identify

Time-invariant variables are wiped out along with α_i.

Race, sex, country of birth, and other fixed characteristics are collinear with the unit dummies.
FE cannot estimate the effect of anything that does not vary within units.

FE only uses within-unit variation.

If X_it changes little within units (low within variance), FE estimates are imprecise.
Classic example: education rarely changes after a certain age, so FE struggles to identify its effect in adult panels.

FE does not fix endogeneity from time-varying confounders.

Only absorbs time-invariant unobservables. If something changes within units and is correlated with X, FE is still biased.

Random effects: treat α_i as uncorrelated with X.

If Cov(α_i, X_it) = 0, the unobserved effect is just a component of the composite error. We do not need to remove it; we can model it.

The RE estimator uses both within and between variation. It is a weighted average of the FE (within) estimator and the between estimator. This makes it more efficient than FE when its assumption holds.

RE can also estimate coefficients on time-invariant regressors (race, sex, geography), something FE cannot do.

But if Cov(α_i, X_it) ≠ 0, RE is biased. The assumption is often implausible in economic data.
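
One standard way to see how RE sits between pooled OLS and FE is the quasi-demeaning form of the estimator for a balanced panel. The symbols θ, v_it, σ_α² (variance of α_i) and σ_ε² (variance of ε_it) are introduced here for illustration and do not appear elsewhere in these slides.

```latex
\begin{align*}
Y_{it} - \theta \bar{Y}_i &= \beta_0 (1 - \theta) + \beta_1 (X_{it} - \theta \bar{X}_i) + (v_{it} - \theta \bar{v}_i),
\qquad v_{it} = \alpha_i + \varepsilon_{it} \\
\theta &= 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T \sigma_\alpha^2}}
\end{align*}
```

When σ_α² = 0, θ = 0 and RE collapses to pooled OLS. As T σ_α² grows relative to σ_ε², θ approaches 1 and RE approaches the FE estimator.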

The Hausman test: FE or RE?

Both FE and RE are consistent if Cov(α_i, X_it) = 0. Only FE is consistent if it is not.

H₀: RE is consistent, that is, the unobserved effect is uncorrelated with the regressors. If H₀ holds, FE and RE estimates should be similar.

The test statistic is based on the difference in coefficient vectors: H = (β̂_FE − β̂_RE)′ [Var(β̂_FE) − Var(β̂_RE)]⁻¹ (β̂_FE − β̂_RE), distributed χ²_k under H₀, where k is the number of coefficients compared (the time-varying regressors).

A significant Hausman statistic rejects RE in favor of FE. In most applied economics work, FE is the default.
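
For a single time-varying regressor (k = 1) the statistic reduces to a scalar and can be computed by hand. The sketch below assumes that case; the function name hausman_one_coef and the input estimates are hypothetical, not taken from any real study.

```python
import math

def hausman_one_coef(b_fe, b_re, se_fe, se_re):
    """Hausman statistic and p-value when one coefficient is compared (k = 1)."""
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        # Can happen in finite samples; the statistic is then not usable.
        raise ValueError("Var(FE) - Var(RE) is not positive")
    h = (b_fe - b_re) ** 2 / var_diff
    # For a chi-squared variable with 1 df: P(chi2_1 > h) = erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value

# Hypothetical FE and RE estimates with their standard errors.
h, p = hausman_one_coef(b_fe=0.50, b_re=0.30, se_fe=0.10, se_re=0.06)

print(round(h, 2), round(p, 4))  # 6.25 0.0124
```

With these made-up inputs the test rejects H₀ at the 5% level, which would point toward FE.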

Two-way fixed effects

Add time dummies alongside unit dummies.

Y_it = α_i + λ_t + β₁X_it + ε_it
α_i absorbs unit-level time-invariant confounders. λ_t absorbs period-level shocks common to all units (recessions, policy changes, pandemics).

Identification comes from variation in X that is explained by neither the unit effects nor the time effects.

We need variation in X that differs across both units and time, over and above unit averages and time averages.

Two-way FE is the workhorse of applied panel econometrics.

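In Stata (user-written reghdfe package): reghdfe y x, absorb(unit time). In R (fixest package): feols(y ~ x | unit + time, data = df), where df is a placeholder for your data frame.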
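
For a balanced panel, the two-way within transformation subtracts the unit mean and the period mean and adds back the grand mean. The sketch below applies it to a hypothetical noise-free panel with true β₁ = 2; the function double_demean and all numbers are illustrative, and the simple formula does not carry over to unbalanced panels.

```python
def double_demean(m):
    """Two-way within transformation for a balanced panel.

    m[i][t] is the value for unit i in period t.
    Returns m[i][t] - unit mean - period mean + grand mean.
    """
    n, t = len(m), len(m[0])
    unit_mean = [sum(row) / t for row in m]
    time_mean = [sum(m[i][s] for i in range(n)) / n for s in range(t)]
    grand = sum(unit_mean) / n
    return [[m[i][s] - unit_mean[i] - time_mean[s] + grand for s in range(t)]
            for i in range(n)]

# Hypothetical balanced panel: Y_it = alpha_i + lambda_t + 2 * X_it, no noise.
alpha = [10, 0, -10]   # unit effects
lam = [0, 5, -3]       # period effects
x = [[1, 2, 4], [2, 1, 3], [3, 5, 2]]
y = [[alpha[i] + lam[s] + 2 * x[i][s] for s in range(3)] for i in range(3)]

x_dd, y_dd = double_demean(x), double_demean(y)
num = sum(a * b for rx, ry in zip(x_dd, y_dd) for a, b in zip(rx, ry))
den = sum(a * a for rx in x_dd for a in rx)
beta_twfe = num / den

print(round(beta_twfe, 6))  # 2.0
```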

Difference-in-Differences is a special case of two-way FE.

DiD compares the change over time in a treated group to the change over time in a control group.

β̂_DiD = (Ȳ_treat,post − Ȳ_treat,pre) − (Ȳ_ctrl,post − Ȳ_ctrl,pre)

In the canonical case with two groups, two periods, and a balanced panel, this is equivalent to a two-way FE regression with unit and time dummies and a treatment indicator D_it = 1 if unit i is treated in period t.

Key assumption: parallel trends. In the absence of treatment, treated and control units would have followed the same trend. This cannot be tested directly but can be examined with pre-treatment data.
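
A small numerical check of the link between the DiD formula and two-way FE, in the simplest two-group, two-period setting. The group means are made up, and double_demean is an illustrative helper that applies the balanced-panel two-way within transformation.

```python
# Hypothetical group means of the outcome.
y_treat_pre, y_treat_post = 10.0, 15.0
y_ctrl_pre, y_ctrl_post = 8.0, 10.0

# DiD from the four means.
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)

def double_demean(m):
    """m[i][t] minus unit mean minus period mean plus grand mean (balanced panel)."""
    n, t = len(m), len(m[0])
    unit_mean = [sum(row) / t for row in m]
    time_mean = [sum(m[i][s] for i in range(n)) / n for s in range(t)]
    grand = sum(unit_mean) / n
    return [[m[i][s] - unit_mean[i] - time_mean[s] + grand for s in range(t)]
            for i in range(n)]

# Same data as a 2-unit, 2-period panel.
# D_it = 1 only for the treated unit in the post period.
y = [[y_treat_pre, y_treat_post], [y_ctrl_pre, y_ctrl_post]]
d = [[0.0, 1.0], [0.0, 0.0]]

y_dd, d_dd = double_demean(y), double_demean(d)
num = sum(a * b for rd, ry in zip(d_dd, y_dd) for a, b in zip(rd, ry))
den = sum(a * a for rd in d_dd for a in rd)
beta_twfe = num / den

print(did, beta_twfe)  # 3.0 3.0
```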

Always cluster standard errors by unit in panel data.

Observations from the same unit over time are correlated, not independent. Using conventional (or even heteroskedasticity-robust, HC) SEs that assume independence will typically understate uncertainty.

Cluster-robust SEs at the unit level allow for arbitrary serial correlation within units. They are the standard in applied panel work.

If treatment varies at a higher level (e.g., state policy applied to individuals), cluster at the treatment assignment level, not the observation level.

With few clusters (fewer than roughly 30 to 50), cluster-robust SEs can be unreliable. Consider wild bootstrap or other small-sample corrections.
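
The difference between heteroskedasticity-robust and cluster-robust SEs comes down to when the scores x·û are squared: observation by observation, or after summing within each cluster. The sketch below shows this for one centered regressor with no intercept. The data are hypothetical, four clusters is far too few for real inference, and no finite-sample corrections are applied, so statistical packages would report somewhat different numbers.

```python
import math
from collections import defaultdict

def slope_and_ses(x, y, cluster):
    """OLS slope for centered data (no intercept) with HC0 and cluster-robust SEs."""
    sxx = sum(a * a for a in x)
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    resid = [b - beta * a for a, b in zip(x, y)]

    # HC0: treats every observation's score x_i * u_i as independent.
    meat_hc0 = sum((a * u) ** 2 for a, u in zip(x, resid))

    # Cluster-robust: sum the scores within each cluster first, then square.
    score = defaultdict(float)
    for a, u, g in zip(x, resid, cluster):
        score[g] += a * u
    meat_cl = sum(s ** 2 for s in score.values())

    return beta, math.sqrt(meat_hc0) / sxx, math.sqrt(meat_cl) / sxx

# Hypothetical data: 4 clusters of 2 observations. x is constant within a
# cluster and the residuals are perfectly correlated within each cluster.
x = [1, 1, 1, 1, -1, -1, -1, -1]
y = [3, 3, 1, 1, -1, -1, -3, -3]
cluster = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

beta, se_hc0, se_cluster = slope_and_ses(x, y, cluster)

print(beta, round(se_hc0, 3), se_cluster)  # 2.0 0.354 0.5
```

In this constructed case each cluster carries only one independent piece of information, and the clustered SE is larger than the HC0 SE by a factor of √2.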

FE and IV: complementary strategies

FE handles time-invariant unobserved confounders.

If the endogeneity comes from a fixed unit characteristic (ability, culture), FE eliminates it completely.

IV handles time-varying endogeneity.

If the endogeneity comes from something that changes over time and is correlated with X_it, you need an instrument.

You can combine them: FE-IV (or within-IV).

Instrument for X_it using a variable Z_it that varies within units over time, and run 2SLS on the demeaned data.
Handles both time-invariant confounders (via FE) and time-varying endogeneity (via IV).
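
A minimal sketch of within-IV with a single instrument, where 2SLS reduces to a ratio of demeaned cross-products. The panel is built by hand and entirely hypothetical: Y_it = 2 X_it + α_i + e_it, where X_it is correlated with the time-varying error e_it, and the instrument Z_it is exactly uncorrelated with e_it within units.

```python
def demean(values):
    """Subtract the mean of a list from each of its elements."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical panel (2 units, 3 periods) with true beta_1 = 2.
x = {"u1": [5, 3, 7], "u2": [-5, -3, -7]}
y = {"u1": [16, 9, 20], "u2": [-16, -9, -20]}
z = {"u1": [1, 2, 3], "u2": [6, 5, 4]}  # instrument

# Within transformation of outcome, regressor, and instrument.
x_dm = [v for u in x for v in demean(x[u])]
y_dm = [v for u in x for v in demean(y[u])]
z_dm = [v for u in x for v in demean(z[u])]

beta_fe = dot(x_dm, y_dm) / dot(x_dm, x_dm)     # within OLS: still biased
beta_fe_iv = dot(z_dm, y_dm) / dot(z_dm, x_dm)  # within IV, one instrument

print(beta_fe, beta_fe_iv)  # 2.75 2.0
```

FE alone removes α_i but leaves the bias from e_it; instrumenting the demeaned regressor removes that as well in this constructed example.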
