
Difference-in-Differences

Lecture 10

You observe a treated group before and after a policy. How do you separate the treatment effect from trends that would have happened anyway?

Difference-in-Differences: subtract the control group’s change from the treated group’s change.
                 Before      After       Difference
  Treated        ȲT,pre      ȲT,post     ΔȲT
  Control        ȲC,pre      ȲC,post     ΔȲC
  DiD Estimator:  ΔȲT − ΔȲC
The control group tells us what would have happened to the treated group in the absence of treatment: the counterfactual trend. Subtracting that trend isolates the causal effect.
Classic example: Card & Krueger (1994). New Jersey raised its minimum wage; Pennsylvania did not. Comparing employment changes across the state border identifies the wage effect.
DiD is a regression with unit fixed effects, time fixed effects, and a treatment indicator.
Yit = αi + λt + δ(Treatedi × Postt) + uit
αi absorbs all time-invariant unit differences. λt absorbs common time trends. δ is the DiD estimate, the coefficient on the interaction of being in the treatment group and being in the post period.
In the simple 2×2 case (one treated group, one control, one pre period, one post period), this regression recovers exactly the table-based DiD estimator.
With multiple periods, unit FE eliminate level differences across units, time FE eliminate common aggregate shocks. The interaction term then identifies the treatment effect by comparing treated and control units’ paths around the treatment date.
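To make the 2×2 equivalence concrete, here is a minimal numerical sketch (all numbers invented): the coefficient on the Treated × Post interaction equals the table-based DiD exactly.

```python
import numpy as np

# Group-period cell means (invented): control pre/post, treated pre/post
y = np.array([10.0, 12.0, 15.0, 20.0])
treated = np.array([0.0, 0.0, 1.0, 1.0])
post = np.array([0.0, 1.0, 0.0, 1.0])

# Table-based DiD: (treated post - treated pre) - (control post - control pre)
did_table = (y[3] - y[2]) - (y[1] - y[0])

# Regression: Y = b0 + b1*Treated + b2*Post + b3*(Treated x Post)
X = np.column_stack([np.ones(4), treated, post, treated * post])
b = np.linalg.lstsq(X, y, rcond=None)[0]

print(did_table, round(float(b[3]), 6))  # both equal 3.0
```

With one observation per cell the regression fits the four means exactly, so the interaction coefficient is mechanically the difference of differences.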

The parallel trends assumption.

Parallel Trends
In the absence of treatment, the treated and control groups would have followed the same trend over time: E[Y(0)it − Y(0)i,t−1 | i ∈ Treated] = E[Y(0)it − Y(0)i,t−1 | i ∈ Control].
What this is and is not.
  • It does not require treated and control units to have similar levels of Y.
  • It requires them to have similar trends in Y in the pre-period.
  • It is an assumption about counterfactual paths; it can never be directly tested, only made plausible.
When parallel trends is implausible.
  • Treatment was targeted at units that were already trending differently (e.g., Ashenfelter’s dip).
  • Macro shocks hit treated and control units asymmetrically at the same time as treatment.
Test parallel trends by checking whether pre-treatment trends differ between treated and control groups.
With multiple pre-periods, estimate a set of interaction terms between the treatment group indicator and period dummies:
Yit = αi + λt + ∑k≠−1 δk (Treatedi × 1[t = k]) + uit
k = −1 is the omitted baseline period (the period just before treatment). The pre-treatment δk for k < −1 should be zero if parallel trends holds.
This is the event study plot. Plot the δk with confidence intervals over time. Pre-treatment coefficients should be near zero and insignificant, post-treatment coefficients show the dynamic treatment effects.
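A minimal sketch of the event-study regression on simulated data (the design and effect sizes are invented for illustration): with a true dynamic effect of k + 1 in post periods, the estimated pre-treatment coefficients come out near zero and the post-treatment ones track the dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 40
periods = np.arange(-3, 3)                    # event time k = -3, ..., 2
unit = np.repeat(np.arange(n_units), len(periods))
k = np.tile(periods, n_units)
treated = (unit < 20).astype(float)           # first 20 units are treated at k = 0

# True dynamic effect (assumed): zero pre-treatment, k + 1 in post periods
y = (0.5 * unit + 0.2 * k
     + np.where(k >= 0, k + 1.0, 0.0) * treated
     + rng.normal(0, 0.1, unit.size))

# Design: unit FE, time FE (one period dropped), Treated x 1[k = j] for j != -1
unit_d = (unit[:, None] == np.arange(n_units)).astype(float)
time_d = (k[:, None] == periods[1:]).astype(float)
event_ks = [j for j in periods if j != -1]    # k = -1 is the omitted baseline
event_d = np.column_stack([treated * (k == j) for j in event_ks])
X = np.column_stack([unit_d, time_d, event_d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
delta = dict(zip(event_ks, beta[-len(event_ks):]))

# Pre-treatment deltas (k < -1) near 0; post-treatment deltas near 1, 2, 3
print({int(j): round(float(v), 2) for j, v in delta.items()})
```

In practice you would plot these δk with confidence intervals rather than print them.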
Caution
A flat pre-trend is consistent with parallel trends but does not prove it. Unobserved confounders could have the same pre-trend while still violating the assumption after treatment.
Card & Krueger (1994): did New Jersey’s minimum wage increase reduce employment?
In April 1992, New Jersey raised its minimum wage from $4.25 to $5.05. Pennsylvania did not change its minimum wage. Card and Krueger surveyed fast-food restaurants in both states before and after the change.
FTE employment per store:
                  Before      After       Change
  New Jersey      20.44       21.03       +0.59
  Pennsylvania    23.33       21.17       −2.16
  DiD: +2.75
Full-time equivalent employment increased in New Jersey relative to Pennsylvania after the wage increase. This directly contradicted the standard competitive labor market prediction and ignited a large literature on the employment effects of minimum wages.
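The DiD arithmetic from the table above, worked through explicitly:

```python
# Employment numbers from the Card & Krueger table above
nj_pre, nj_post = 20.44, 21.03
pa_pre, pa_post = 23.33, 21.17

did = (nj_post - nj_pre) - (pa_post - pa_pre)
print(round(did, 2))  # 2.75
```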

Threats to DiD validity beyond parallel trends.

Anticipation effects.
  • If units know treatment is coming and change behavior before the official treatment date, the “pre” period is already contaminated.
  • Extend the event study window further back. Pre-treatment coefficients should remain flat all the way to the earliest available period.
Spillovers (SUTVA violation).
  • If treatment in one unit affects the control group (e.g., displaced workers move to control areas), the control group outcome is contaminated.
  • Choose control units that are unlikely to be affected, e.g., geographically distant.
Ashenfelter’s dip.
  • Units often receive treatment because of a temporary dip in their outcome. They would have recovered regardless of treatment.
  • Mean reversion masquerades as a treatment effect. Check pre-trends further back than one period.
Composition changes.
  • If the set of units in the treated or control group changes between pre and post (attrition, entry), the comparison group shifts over time.

What if different units are treated at different times (a staggered rollout)?

With staggered treatment, the standard TWFE regression estimates a weighted average of all pairwise 2×2 DiDs, and some weights can be negative.
Goodman-Bacon (2021) showed that the TWFE coefficient δ̂ decomposes into a weighted sum of every possible clean 2×2 comparison in the data: early vs. late adopters, late vs. never, early vs. never. The weights depend on sample sizes and treatment timing, not on economic importance.
The negative weighting problem. Units that were treated early serve as controls for units treated later. If treatment effects are heterogeneous over time (dynamic effects), already-treated units are bad controls. The TWFE estimator can be a meaningless mixture, or even have the wrong sign.
When is TWFE fine? If treatment effects are constant (homogeneous and time-invariant), TWFE recovers the ATT. The problem arises with treatment effect heterogeneity.
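A stylized simulation of the failure (cohort sizes, adoption dates, and effect dynamics all invented): two cohorts adopt at different times, effects grow by 1 each post period, and the TWFE coefficient comes out at essentially zero even though every treated observation has a strictly positive effect.

```python
import numpy as np

T, n_per = 10, 30                             # 10 periods, 30 units per cohort
unit = np.repeat(np.arange(2 * n_per), T)
t = np.tile(np.arange(T), 2 * n_per)
adopt = np.where(np.arange(2 * n_per) < n_per, 2, 7)[unit]  # early at t=2, late at t=7

D = (t >= adopt).astype(float)
event_time = t - adopt
# Assumed dynamic effect: grows by 1 each period after adoption, no noise
y = np.where(D == 1, 1.0 + event_time, 0.0)

# TWFE: regress y on D with unit and time fixed effects
unit_d = (unit[:, None] == np.arange(2 * n_per)).astype(float)
time_d = (t[:, None] == np.arange(1, T)).astype(float)
X = np.column_stack([unit_d, time_d, D])
twfe = np.linalg.lstsq(X, y, rcond=None)[0][-1]

true_att = y[D == 1].mean()                   # about 3.82 in this design
print(round(float(abs(twfe)), 3), round(float(true_att), 3))
```

The early cohort serves as a control for the late cohort while its own effect is still growing, and those growing "control" outcomes difference away the late cohort's gains.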

Modern robust estimators for staggered DiD.

Callaway & Sant’Anna (2021), group-time ATTs.
  • Estimate separate ATTs for each cohort (group of units treated at the same time) and each post-treatment period.
  • Aggregate these group-time ATTs into summary parameters of interest (overall ATT, dynamic effects, calendar time effects).
  • Never uses already-treated units as controls by default.
Sun & Abraham (2021), interaction-weighted estimator.
  • Saturate the TWFE regression with cohort × relative-time interactions, then average them with cohort-size weights to recover the ATT.
  • Implemented as eventstudyinteract in Stata, sunab() in R’s fixest.
Borusyak, Jaravel & Spiess (2024), imputation estimator.
  • Imputes untreated potential outcomes for treated observations using only clean (never- or not-yet-treated) controls, then directly estimates ATTs by cohort.
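A minimal sketch of the imputation logic on simulated data (all parameters invented; an illustration of the idea, not the authors' estimator): fit the two-way FE model on untreated observations only, impute Y(0) for treated observations, and average the gaps.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 8, 50
unit = np.repeat(np.arange(n), T)
t = np.tile(np.arange(T), n)
adopt = np.where(np.arange(n) < 25, 4, 99)[unit]   # half treated at t=4, half never
D = (t >= adopt)

alpha = rng.normal(0, 1, n)                        # unit effects (assumed)
lam = rng.normal(0, 1, T)                          # time effects (assumed)
y = alpha[unit] + lam[t] + 2.0 * D + rng.normal(0, 0.1, unit.size)

# Step 1: estimate unit and time FE using only clean (untreated) observations
Xd = np.column_stack([
    (unit[:, None] == np.arange(n)),
    (t[:, None] == np.arange(1, T)),
]).astype(float)
b = np.linalg.lstsq(Xd[~D], y[~D], rcond=None)[0]

# Step 2: impute Y(0) for treated observations and average the differences
att = (y[D] - Xd[D] @ b).mean()
print(round(float(att), 2))  # close to the true effect of 2.0
```

Because the FE are estimated without any treated observations, already-treated outcomes never contaminate the counterfactual.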
Practical advice.
  • Run the Bacon decomposition first to understand the TWFE weights. If weights are non-negative and not too dispersed, TWFE may be acceptable. If not, use a robust estimator.
In DiD, standard errors must be clustered at the unit level that received the treatment.
Bertrand, Duflo & Mullainathan (2004) showed that DiD studies that cluster at too fine a level (e.g., individual) severely overreject the null. The reason: serial correlation within units over time means there are far fewer independent observations than the raw sample size suggests, so naive standard errors are too small.
Rule of thumb: cluster at the level at which treatment varies. If a state-level policy changed, cluster by state. If a firm-level policy, cluster by firm.
Few clusters problem. With fewer than ~30 clusters, standard cluster-robust SEs can be undersized. Use the wild cluster bootstrap (Cameron, Gelbach & Miller 2008) for inference with few clusters.
With staggered treatment, cluster at the unit level (e.g., state) and consider two-way clustering (state × time) if there is reason to believe correlated shocks across units within periods.
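A hand-rolled sketch of cluster-robust (Liang-Zeger sandwich) standard errors on simulated state-level data (the design, AR coefficient, and effect size are invented; in practice use your package's built-in clustering):

```python
import numpy as np

rng = np.random.default_rng(2)
G, T = 20, 10                                 # 20 states, 10 periods
g = np.repeat(np.arange(G), T)
period = np.tile(np.arange(T), G)
treated_state = (g < 10).astype(float)
post = (period >= 5).astype(float)

# AR(1) errors within each state: the serial correlation that makes
# too-fine clustering overreject (Bertrand, Duflo & Mullainathan)
eps = np.zeros(g.size)
for c in range(G):
    e = rng.normal(0, 0.5, T)
    for s in range(1, T):
        e[s] += 0.8 * e[s - 1]
    eps[g == c] = e

y = 1.0 * treated_state * post + eps          # true DiD effect of 1.0 (assumed)
X = np.column_stack([np.ones(g.size), treated_state, post, treated_state * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# Meat: sum over clusters (states) of (X_g' u_g)(X_g' u_g)'
meat = sum(np.outer(X[g == c].T @ u[g == c], X[g == c].T @ u[g == c])
           for c in range(G))
V = bread @ meat @ bread
print("clustered SE of DiD coefficient:", round(float(np.sqrt(V[3, 3])), 3))
```

Summing the score outer products by state allows arbitrary correlation within a state over time, which is exactly what the serially correlated errors require.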

DiD in the causal inference toolkit.

DiD vs. panel FE alone.
  • DiD is a specific application of two-way FE. It adds the parallel trends framing and the 2×2 intuition. DiD makes the control group role explicit; panel FE applies more broadly.
DiD vs. IV.
  • IV requires a valid instrument (relevance + exclusion). DiD requires parallel trends. Both can identify local treatment effects. When a policy change serves as an instrument, the two approaches can be combined (DiD-IV).
DiD vs. RDD.
  • RDD requires a sharp threshold; DiD requires a clear before/after and a plausible control group. DiD is more broadly applicable but relies on a stronger untestable assumption (parallel trends).
Synthetic control (Abadie & Gardeazabal 2003).
  • When there is only one treated unit (e.g., one country, one state), construct a weighted average of control units that matches the pre-treatment path. Extends the DiD idea to the single-treated-unit case.

What does DiD identify? The Average Treatment Effect on the Treated (ATT).

Under parallel trends, DiD identifies the ATT.
  • ATT = E[Y(1) − Y(0) | Treated = 1]: the average effect for units that were actually treated.
  • This is not the ATE (average over all units) unless treatment effects are homogeneous.
Dynamic treatment effects.
  • Effects often grow over time (e.g., job training) or fade (e.g., one-time information intervention).
  • Event study plots directly visualize the dynamic ATT pattern: δk for each period k relative to treatment.
Heterogeneous effects across subgroups.
  • Interact the treatment indicator with baseline characteristics to estimate how the ATT varies.
  • With staggered designs, the Callaway-Sant’Anna framework allows aggregation by cohort, calendar time, or unit characteristics.

DiD in practice: a checklist.

1. Draw the 2×2 table. Compute the raw DiD by hand first.
2. Plot the raw outcome trends for treated and control groups over time.
3. Run the event study. Check that pre-trends are flat.
4. Cluster standard errors at the treatment level. Use wild bootstrap if few clusters.
5. If staggered, run the Bacon decomposition and use a robust estimator (CS, SA, or BJS).
6. Check for anticipation, spillovers, and composition changes.
7. Argue why parallel trends is plausible. What makes the control group a good counterfactual?

Software for DiD.

Stata: reghdfe for TWFE, csdid for Callaway-Sant’Anna, eventstudyinteract for Sun-Abraham.
  • bacondecomp for the Goodman-Bacon decomposition.
R: fixest package is the workhorse.
  • feols() for TWFE, sunab() for Sun-Abraham, did package for Callaway-Sant’Anna.
  • bacondecomp package for the decomposition.
Python: pyfixest mirrors fixest syntax.
Always plot the event study. Use coefplot (Stata) or iplot() in fixest (R).

Common mistakes in DiD.

Using standard TWFE with staggered treatment and heterogeneous effects.
  • Can produce wrong-signed estimates. Always run the Bacon decomposition, switch to a robust estimator if early-treated units are used as controls.
Not clustering at the right level.
  • Clustering too finely (individual, county) ignores within-state serial correlation. Cluster at the treatment assignment level.
Claiming parallel trends is “proven” by a pre-trends test.
  • A flat pre-trend is consistent with parallel trends but is not a proof. You must argue theoretically why the control group is a valid counterfactual.
Ignoring dynamic effects and reporting only the single DiD coefficient.
  • A single post-treatment dummy imposes constant treatment effects. Always plot the event study to check for dynamics.
Key takeaways.
DiD identifies the ATT by using a control group to net out time trends.
  • The estimator is (treated after − treated before) − (control after − control before).
  • Key assumption: parallel trends, treated and control would have trended the same absent treatment.
  • Staggered rollout requires Goodman-Bacon decomposition + robust estimators (CS, SA, BJS). Naive TWFE can be badly biased.