Panel Data

Lecture 6

What is panel data?

The same units observed at multiple points in time.
  • Units: individuals, firms, countries, states, counties.
  • N units observed over T periods → up to NT observations.
Balanced vs. unbalanced.
  • Balanced: every unit is observed in every period.
  • Unbalanced: some observations are missing. Common in practice.
Examples.
  • NLSY79: same individuals surveyed repeatedly since 1979 (annually through 1994, every two years since).
  • Compustat: same firms observed across fiscal years.
  • State-level data: 50 states observed over decades of policy changes.
Panel data lets us control for unobserved, time-invariant confounders.
Recall the fundamental problem: omitted variables that are correlated with X bias our estimates. In cross-section, we can only control for what we observe.
Panel data gives us a different strategy: if the omitted variable does not change over time for a given unit (ability, culture, geography), we can remove it entirely by exploiting within-unit variation.
We are no longer comparing different people to each other. We are comparing the same person to themselves at different points in time.
The unobserved effects model.
Yit = β0 + β1Xit + αi + εit
i indexes units, t indexes time. αi is the unobserved individual effect: it varies across units but is fixed within a unit over time.
αi captures everything about unit i that is constant over time: innate ability, culture, geography, management quality.
The key question: is αi correlated with Xit? The answer determines which estimator to use.
Panel data contains two kinds of variation.

Within variation

How Xit changes for the same unit over time. This is what fixed effects exploits. Free of time-invariant confounders.

Between variation

How the unit mean X̄i differs across units. This is what cross-sectional OLS uses. Susceptible to unobserved unit-level confounders.

Fixed effects uses only within variation. If X barely changes within units over time, FE will be imprecise, because there is little variation left to identify β1.
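The split between the two kinds of variation can be computed directly. The sketch below uses a made-up 3-unit, 3-period panel (the numbers are purely illustrative) and decomposes the total sum of squares of X into its within and between parts.

```python
from statistics import mean

# Hypothetical toy panel: x[i][t] for N = 3 units observed over T = 3 periods.
x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

grand_mean = mean(v for row in x for v in row)
unit_means = [mean(row) for row in x]

# Total variation: deviations of every observation from the grand mean.
total_ss = sum((v - grand_mean) ** 2 for row in x for v in row)

# Within variation: deviations of each observation from its own unit mean.
within_ss = sum((v - m) ** 2 for row, m in zip(x, unit_means) for v in row)

# Between variation: deviations of unit means from the grand mean,
# counted once per observation in the unit.
between_ss = sum(len(row) * (m - grand_mean) ** 2 for row, m in zip(x, unit_means))

within_share = within_ss / total_ss
print(f"within SS = {within_ss}, between SS = {between_ss}, "
      f"within share = {within_share:.2f}")
```

In this toy panel only 10% of the variation in X is within-unit, which is the situation where FE has little to work with.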
Pooled OLS ignores the panel structure.
Stack all NT observations and run OLS as if they were independent. This treats αi as part of the error term.
If Cov(αi, Xit) ≠ 0, pooled OLS is biased. This is exactly the omitted variable problem from cross-section, now with a time dimension.
Even if αi is uncorrelated with X, pooled OLS standard errors are wrong because observations from the same unit are correlated over time.
Verdict: pooled OLS is almost never the right choice with panel data.
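A small numerical illustration of the bias, under assumptions chosen for clarity: a noise-free toy panel with true β1 = 2, in which units with a larger αi also have larger X. All values and the helper `ols_slope` are hypothetical.

```python
from statistics import mean

TRUE_BETA = 2.0

# Hypothetical unobserved unit effects; units with high alpha also have high x.
alpha = [0.0, 10.0, 20.0]
x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# Noise-free outcomes: y_it = 1 + 2 * x_it + alpha_i
y = [[1.0 + TRUE_BETA * v + a for v in row] for row, a in zip(x, alpha)]


def ols_slope(xs, ys):
    """Slope from a bivariate OLS regression of ys on xs (with intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


# Pooled OLS: stack all NT observations and ignore which unit they came from.
x_stacked = [v for row in x for v in row]
y_stacked = [v for row in y for v in row]
beta_pooled = ols_slope(x_stacked, y_stacked)

print(f"true beta = {TRUE_BETA}, pooled OLS = {beta_pooled}")
```

Pooled OLS returns 5 rather than 2 here, because αi sits in the error term and moves with X.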
First differencing: subtract last period from this period.
Take the model in period t and subtract period t − 1:
ΔYit = β1ΔXit + Δεit
αi is time-invariant, so it cancels out exactly. We are left with changes in Y regressed on changes in X.
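With T = 2, first differencing and fixed effects give identical estimates. For longer panels they differ: FE is more efficient when εit is serially uncorrelated, while first differencing is more efficient when εit follows a random walk. In practice, the fixed effects estimator is generally preferred.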
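A minimal sketch of first differencing on a made-up, noise-free panel with true β1 = 2. The helper `first_differences` and all data values are illustrative assumptions. Because the differenced model has no intercept, the regression is run through the origin.

```python
# Hypothetical toy panel with unit effects that are correlated with x.
alpha = [0.0, 10.0, 20.0]
x = [
    [1.0, 2.0, 4.0],
    [4.0, 5.0, 7.0],
    [7.0, 9.0, 10.0],
]
# Noise-free outcomes: y_it = 1 + 2 * x_it + alpha_i
y = [[1.0 + 2.0 * v + a for v in row] for row, a in zip(x, alpha)]


def first_differences(panel):
    """Within each unit, subtract period t - 1 from period t."""
    return [[row[t] - row[t - 1] for t in range(1, len(row))] for row in panel]


dx = [d for row in first_differences(x) for d in row]
dy = [d for row in first_differences(y) for d in row]

# OLS through the origin: the intercept and alpha_i both cancel in differences.
beta_fd = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

print(f"first-difference estimate = {beta_fd}")
```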
The fixed effects (within) estimator: demean each unit.
Subtract each unit’s time mean from both sides:
(Yit − Ȳi) = β1(Xit − X̄i) + (εit − ε̄i)
αi is absorbed by the unit mean and drops out. Running OLS on the demeaned data gives the FE estimator.
Equivalently: include a dummy variable for each unit. The FE estimator is identical to OLS with N − 1 unit dummies. This is the least squares dummy variable (LSDV) estimator.
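Degrees of freedom: the unit dummies (or equivalently, the demeaning) cost N − 1 degrees of freedom beyond the intercept, leaving NT − N − k for the residuals with k regressors. OLS run on manually demeaned data does not make this correction on its own.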
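The demeaning step can be sketched in a few lines. This assumes a single regressor and a noise-free toy panel with true β1 = 2. The helper `demean` and the data are hypothetical.

```python
from statistics import mean

# Hypothetical toy panel; alpha_i is correlated with x, so pooled OLS is biased.
alpha = [0.0, 10.0, 20.0]
x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
# Noise-free outcomes: y_it = 1 + 2 * x_it + alpha_i
y = [[1.0 + 2.0 * v + a for v in row] for row, a in zip(x, alpha)]


def demean(panel):
    """Subtract each unit's time mean from that unit's observations."""
    out = []
    for row in panel:
        m = mean(row)
        out.append([v - m for v in row])
    return out


x_dm = demean(x)
y_dm = demean(y)

# OLS on demeaned data. No intercept is needed: demeaned variables have mean zero.
sxy = sum(a * b for rx, ry in zip(x_dm, y_dm) for a, b in zip(rx, ry))
sxx = sum(a * a for rx in x_dm for a in rx)
beta_fe = sxy / sxx

print(f"within (FE) estimate = {beta_fe}")
```

The intercept and αi both disappear after demeaning, so the within estimator recovers 2 even though αi is strongly correlated with X.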

What fixed effects cannot identify

Time-invariant variables are wiped out along with αi.
  • Race, sex, country of birth, and other fixed characteristics are collinear with the unit dummies.
  • FE cannot estimate the effect of anything that does not vary within units.
FE only uses within-unit variation.
  • If Xit changes little within units (low within variance), FE estimates are imprecise.
  • Classic example: education rarely changes after a certain age, so FE struggles to identify its effect in adult panels.
FE does not fix endogeneity from time-varying confounders.
  • Only absorbs time-invariant unobservables. If something changes within units and is correlated with X, FE is still biased.
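The first limitation is mechanical, and a short sketch makes it visible: demeaning a time-invariant regressor leaves a column of zeros, so there is nothing left to regress on. The indicator below is a hypothetical example.

```python
from statistics import mean

# Hypothetical time-invariant indicator (e.g. a fixed characteristic),
# recorded for 3 units over 3 periods.
z = [
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]

# Demean within each unit, exactly as the FE estimator does.
z_dm = []
for row in z:
    m = mean(row)
    z_dm.append([v - m for v in row])

# The within sum of squares is the denominator of the FE slope.
within_ss = sum(v * v for row in z_dm for v in row)
identified = within_ss > 0.0

print(f"within SS of z = {within_ss}, coefficient identified: {identified}")
```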
Random effects: treat αi as uncorrelated with X.
If Cov(αi, Xit) = 0, the unobserved effect is just a component of the composite error. We do not need to remove it; we can model it.
The RE estimator uses both within and between variation. It is a weighted average of the FE (within) estimator and the between estimator. This makes it more efficient than FE when its assumption holds.
RE can also estimate coefficients on time-invariant regressors (race, sex, geography), something FE cannot do.
But if Cov(αi, Xit) ≠ 0, RE is biased. The assumption is often implausible in economic data.
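One common way to compute RE, not spelled out on these slides, is quasi-demeaning: subtract a fraction θ of each unit mean, with θ = 1 − sqrt(σ²ε / (σ²ε + T·σ²α)), then run OLS. The sketch below takes θ as given rather than estimating the variance components, and uses a noise-free toy panel in which αi is correlated with X. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def re_theta(sigma2_eps, sigma2_alpha, n_periods):
    """Quasi-demeaning weight for a balanced panel with T = n_periods."""
    return 1.0 - sqrt(sigma2_eps / (sigma2_eps + n_periods * sigma2_alpha))


def quasi_demeaned_slope(x, y, theta):
    """OLS slope (with intercept) after subtracting theta times each unit mean."""
    xs, ys = [], []
    for rx, ry in zip(x, y):
        mx, my = mean(rx), mean(ry)
        xs.extend(v - theta * mx for v in rx)
        ys.extend(v - theta * my for v in ry)
    gx, gy = mean(xs), mean(ys)
    sxy = sum((a - gx) * (b - gy) for a, b in zip(xs, ys))
    sxx = sum((a - gx) ** 2 for a in xs)
    return sxy / sxx


# Hypothetical toy panel, true beta = 2, alpha_i correlated with x.
alpha = [0.0, 10.0, 20.0]
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
y = [[1.0 + 2.0 * v + a for v in row] for row, a in zip(x, alpha)]

theta = re_theta(sigma2_eps=1.0, sigma2_alpha=1.0, n_periods=3)  # assumed variances
beta_pooled = quasi_demeaned_slope(x, y, 0.0)    # theta = 0 is pooled OLS
beta_re = quasi_demeaned_slope(x, y, theta)      # 0 < theta < 1 is RE
beta_fe = quasi_demeaned_slope(x, y, 1.0)        # theta = 1 is the within estimator

print(f"theta = {theta}, pooled = {beta_pooled}, RE = {beta_re:.3f}, FE = {beta_fe}")
```

Because αi is correlated with X in this example, the RE estimate lands between the biased pooled estimate and the FE estimate, and is itself biased.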
The Hausman test: FE or RE?
Both FE and RE are consistent if Cov(αi, Xit) = 0. Only FE is consistent if it is not.
H0: RE is consistent, i.e., the unobserved effect is uncorrelated with the regressors. If H0 holds, FE and RE estimates should be similar.
The test statistic is based on the difference in coefficient vectors: H = (β̂FE − β̂RE)′ [Var(β̂FE) − Var(β̂RE)]⁻¹ (β̂FE − β̂RE). Under H0, H is distributed χ² with k degrees of freedom, where k is the number of coefficients compared.
A significant Hausman statistic rejects RE in favor of FE. In most applied economics work, FE is the default.
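With a single coefficient (k = 1) the statistic reduces to a squared difference divided by a difference in variances. The sketch below works through that case. The estimates and standard errors are invented for illustration, and `hausman_single` is a hypothetical helper, not a library function.

```python
from math import erfc, sqrt


def hausman_single(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value for one coefficient (chi-squared, 1 df)."""
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0.0:
        # Can happen in finite samples; the statistic is then undefined.
        raise ValueError("Var(FE) - Var(RE) is not positive")
    h = (b_fe - b_re) ** 2 / var_diff
    # For 1 degree of freedom, P(chi2 > h) = erfc(sqrt(h / 2)).
    p_value = erfc(sqrt(h / 2.0))
    return h, p_value


# Hypothetical estimates: FE is less precise than RE, as expected.
h_stat, p_val = hausman_single(b_fe=0.50, se_fe=0.10, b_re=0.30, se_re=0.06)

CHI2_1DF_5PCT = 3.841  # 5% critical value of chi-squared with 1 df
reject_re = h_stat > CHI2_1DF_5PCT

print(f"H = {h_stat:.2f}, p = {p_val:.4f}, reject RE: {reject_re}")
```

With these made-up numbers H = 6.25, which exceeds 3.841, so the test rejects RE at the 5% level.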

Two-way fixed effects

Add time dummies alongside unit dummies.
  • Yit = αi + λt + β1Xit + εit
  • αi absorbs unit-level time-invariant confounders. λt absorbs period-level shocks common to all units (recessions, policy changes, pandemics).
Identification comes from variation in X that is neither unit-specific nor time-specific.
  • We need variation in X that differs across both units and time, over and above unit averages and time averages.
Two-way FE is the workhorse of applied panel econometrics.
  • In Stata: reghdfe y x, absorb(unit time), using the community-contributed reghdfe package. In R: feols(y ~ x | unit + time), from the fixest package.
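For a balanced panel with one regressor, two-way FE can be reproduced by double demeaning: subtract the unit mean and the period mean, then add back the grand mean. The sketch below assumes a noise-free toy panel with true β1 = 2; the helper `double_demean` and all values are hypothetical. Dedicated routines such as those above also handle unbalanced panels, which this shortcut does not.

```python
from statistics import mean

# Hypothetical unit effects, period effects, and regressor values.
alpha = [0.0, 10.0, 20.0]          # alpha_i
lam = [0.0, 5.0, -3.0]             # lambda_t
x = [
    [1.0, 2.0, 4.0],
    [4.0, 5.0, 5.0],
    [7.0, 9.0, 9.0],
]
# Noise-free outcomes: y_it = alpha_i + lambda_t + 2 * x_it
y = [[a + l + 2.0 * v for v, l in zip(row, lam)] for row, a in zip(x, alpha)]


def double_demean(panel):
    """Remove unit and period means from a balanced panel[i][t]."""
    n, t = len(panel), len(panel[0])
    unit_means = [mean(row) for row in panel]
    time_means = [mean(panel[i][s] for i in range(n)) for s in range(t)]
    grand = mean(v for row in panel for v in row)
    return [
        [panel[i][s] - unit_means[i] - time_means[s] + grand for s in range(t)]
        for i in range(n)
    ]


x_dd = double_demean(x)
y_dd = double_demean(y)

sxy = sum(a * b for rx, ry in zip(x_dd, y_dd) for a, b in zip(rx, ry))
sxx = sum(a * a for rx in x_dd for a in rx)
beta_twfe = sxy / sxx

print(f"two-way FE estimate = {beta_twfe:.6f}")
```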
Difference-in-Differences is a special case of two-way FE.
DiD compares the change over time in a treated group to the change over time in a control group.
β̂DiD = (Ȳtreat,post − Ȳtreat,pre) − (Ȳctrl,post − Ȳctrl,pre)
In the two-group, two-period case, this is numerically identical to the coefficient on Dit in a two-way FE regression with unit and time dummies, where the treatment indicator Dit = 1 if unit i is treated in period t and 0 otherwise.
Key assumption: parallel trends. In the absence of treatment, treated and control units would have followed the same trend. This cannot be tested directly but can be examined with pre-treatment data.
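The equivalence can be checked numerically in the two-group, two-period case. The sketch below computes the DiD estimate from the four group means and again from a two-way FE regression on the treatment indicator, done here by double demeaning. The outcome values and helper names are made up for illustration.

```python
from statistics import mean

# Hypothetical outcomes y[unit] = [pre, post]: two treated units, two controls.
y = [[9.0, 14.0], [11.0, 16.0], [7.0, 9.0], [9.0, 11.0]]
treated = [True, True, False, False]


def group_mean(period, is_treated):
    """Mean outcome for one group in one period (0 = pre, 1 = post)."""
    return mean(row[period] for row, tr in zip(y, treated) if tr == is_treated)


# DiD from the four group means.
did_means = (group_mean(1, True) - group_mean(0, True)) - (
    group_mean(1, False) - group_mean(0, False)
)

# Treatment indicator D_it: 1 only for treated units in the post period.
d = [[0.0, 1.0 if tr else 0.0] for tr in treated]


def double_demean(panel):
    """Remove unit and period means from a balanced panel[i][t]."""
    n, t = len(panel), len(panel[0])
    unit_means = [mean(row) for row in panel]
    time_means = [mean(panel[i][s] for i in range(n)) for s in range(t)]
    grand = mean(v for row in panel for v in row)
    return [
        [panel[i][s] - unit_means[i] - time_means[s] + grand for s in range(t)]
        for i in range(n)
    ]


d_dd, y_dd = double_demean(d), double_demean(y)
did_twfe = sum(a * b for rd, ry in zip(d_dd, y_dd) for a, b in zip(rd, ry)) / sum(
    a * a for rd in d_dd for a in rd
)

print(f"DiD from means = {did_means}, DiD from two-way FE = {did_twfe}")
```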
Always cluster standard errors by unit in panel data.
Observations from the same unit over time are correlated, not independent. Conventional SEs, and even heteroskedasticity-robust (HC) SEs, assume independence across observations and will typically understate uncertainty.
Cluster-robust SEs at the unit level allow for arbitrary serial correlation within units. They are the standard in applied panel work.
If treatment varies at a higher level (e.g., state policy applied to individuals), cluster at the treatment assignment level, not the observation level.
With few clusters (< 30–50), cluster-robust SEs can be unreliable. Consider wild bootstrap or other small-sample corrections.
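A sketch of what clustering changes, for the single-regressor within estimator: the cluster-robust variance sums the products of regressor and residual within each unit before squaring, so within-unit correlation is not assumed away. The data are hypothetical, no small-sample correction is applied (software typically applies one), and three clusters is far too few for real inference.

```python
from math import sqrt
from statistics import mean

# Hypothetical panel with residuals that are correlated within units.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
y = [[3.0, 5.0, 8.0], [19.0, 21.0, 22.0], [35.0, 37.0, 39.0]]


def demean(panel):
    """Subtract each unit's time mean from that unit's observations."""
    out = []
    for row in panel:
        m = mean(row)
        out.append([v - m for v in row])
    return out


x_dm, y_dm = demean(x), demean(y)
sxx = sum(a * a for rx in x_dm for a in rx)
beta_fe = sum(a * b for rx, ry in zip(x_dm, y_dm) for a, b in zip(rx, ry)) / sxx
resid = [[b - beta_fe * a for a, b in zip(rx, ry)] for rx, ry in zip(x_dm, y_dm)]

# Conventional SE: assumes independent, homoskedastic errors.
n_units = len(x)
n_obs = sum(len(row) for row in x)
ssr = sum(e * e for row in resid for e in row)
se_conventional = sqrt(ssr / (n_obs - n_units - 1) / sxx)

# Cluster-robust SE: sum x * residual within each unit first, then square.
unit_scores = [sum(a * e for a, e in zip(rx, rr)) for rx, rr in zip(x_dm, resid)]
se_cluster = sqrt(sum(s * s for s in unit_scores)) / sxx

print(f"beta = {beta_fe:.3f}, conventional SE = {se_conventional:.4f}, "
      f"cluster SE = {se_cluster:.4f}")
```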

FE and IV: complementary strategies

FE handles time-invariant unobserved confounders.
  • If the endogeneity comes from a fixed unit characteristic (ability, culture), FE eliminates it completely.
IV handles time-varying endogeneity.
  • If the endogeneity comes from something that changes over time and is correlated with Xit, you need an instrument.
You can combine them: FE-IV (or within-IV).
  • Instrument for Xit using a variable that varies within units over time, after demeaning.
  • Handles both time-invariant confounders (via FE) and time-varying endogeneity (via IV).
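A minimal sketch of within-IV with one endogenous regressor and one instrument: demean everything by unit, then use the simple IV ratio. The toy data are constructed so that a time-varying confounder u moves with X but is exactly orthogonal to the instrument z within units. All names and values are hypothetical, and the true β1 is 2.

```python
from statistics import mean

# Hypothetical ingredients: unit effects, instrument z, time-varying confounder u.
alpha = [0.0, 10.0]
z = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
u = [[1.0, -2.0, 1.0], [2.0, -4.0, 2.0]]

# x is driven by the instrument, the confounder, and the unit effect.
x = [[zi + ui + a for zi, ui in zip(rz, ru)] for rz, ru, a in zip(z, u, alpha)]
# y depends on x (true beta = 2), the confounder, and the unit effect.
y = [[a + 2.0 * xi + ui for xi, ui in zip(rx, ru)] for rx, ru, a in zip(x, u, alpha)]


def demean(panel):
    """Subtract each unit's time mean from that unit's observations."""
    out = []
    for row in panel:
        m = mean(row)
        out.append([v - m for v in row])
    return out


def cross(a, b):
    """Sum of element-wise products over all units and periods."""
    return sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x_dm, y_dm, z_dm = demean(x), demean(y), demean(z)

# FE alone removes alpha_i but not the time-varying confounder u.
beta_fe = cross(x_dm, y_dm) / cross(x_dm, x_dm)

# Within-IV: instrument demeaned x with demeaned z.
beta_fe_iv = cross(z_dm, y_dm) / cross(z_dm, x_dm)

print(f"FE = {beta_fe:.3f}, FE-IV = {beta_fe_iv:.3f}")
```

FE alone is still biased upward here because u changes within units. The within-IV estimate recovers 2 because the instrument is, by construction, unrelated to u.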