Learning Resources
Interactive demonstrations and code examples for econometrics, microeconomics, and forecasting.
The Regression Weighting Problem
Aronow & Samii (2016) — Econometrics · Program Evaluation
When you run OLS with control variables, you may assume you are estimating the average treatment effect (ATE) across your full sample. Aronow and Samii (2016) show that this is generally not the case when treatment effects are heterogeneous. OLS implicitly weights each observation by how much its treatment status varies conditional on the covariates. Observations whose treatment is almost perfectly predicted by the covariates receive very little weight and barely influence the estimate. As a result, OLS identifies an effect for an implicitly reweighted version of the data that may look very different from the full sample.
The Math
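Let $D_i$ be the treatment indicator and $X_i$ the covariates. Define the OLS residual from regressing treatment on covariates:
$$\hat{e}_i = D_i - \hat{E}[D_i \mid X_i]$$
In large samples, the OLS estimator of the treatment coefficient is a weighted average of unit-level treatment effects, with weights:
$$w_i = \frac{\hat{e}_i^2}{\sum_j \hat{e}_j^2}$$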
Units whose treatment status is nearly determined by their covariates ($\hat{e}_i \approx 0$) receive near-zero weight. The effective sample, meaning the observations actually driving your estimate, can be far smaller than your full sample and systematically different from it.
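To make the weighting concrete, here is a minimal sketch on made-up data (not the LaLonde sample). It assumes two hypothetical strata with different treatment shares and different true effects, and it uses stratum dummies as the only covariates so that the weighted-average relationship holds exactly rather than approximately. Every name and number below is an illustrative assumption.

```python
import numpy as np

# Hypothetical strata: (n units, n treated, true effect, baseline outcome)
strata = [
    (100, 50, 1.0, 10.0),  # treatment is a coin flip in this stratum
    (100, 5, 5.0, 20.0),   # treatment is rare, so the stratum nearly predicts it
]

d_parts, tau_parts, y_parts, s_parts = [], [], [], []
for s, (n, n_treated, effect, base) in enumerate(strata):
    d = np.r_[np.ones(n_treated), np.zeros(n - n_treated)]
    d_parts.append(d)
    tau_parts.append(np.full(n, effect))
    y_parts.append(base + effect * d)  # noiseless outcome keeps the algebra exact
    s_parts.append(np.full(n, s))

D = np.concatenate(d_parts)
tau = np.concatenate(tau_parts)
Y = np.concatenate(y_parts)
S = np.concatenate(s_parts).astype(int)

# Covariates: one dummy per stratum (saturated model, no separate intercept)
X = np.eye(len(strata))[S]

# Residualize treatment on the covariates and form the normalized weights
gamma, *_ = np.linalg.lstsq(X, D, rcond=None)
e_hat = D - X @ gamma
weights = e_hat**2 / np.sum(e_hat**2)

# OLS of the outcome on treatment plus covariates
coefs, *_ = np.linalg.lstsq(np.column_stack([D, X]), Y, rcond=None)
beta_ols = coefs[0]

weighted_avg_effect = np.sum(weights * tau)
ate = tau.mean()

print(f"OLS coefficient:           {beta_ols:.4f}")
print(f"Weighted average of tau_i: {weighted_avg_effect:.4f}")
print(f"ATE (simple average):      {ate:.4f}")
```

In this toy setup the OLS coefficient matches the weighted average of the unit-level effects and sits well below the simple average effect of 3, because the stratum where treatment is rare contributes only about 16% of the total weight.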
Why It Matters
- External validity: Your estimate may generalize only to the effective sample, not the full population you studied.
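- Heterogeneous effects: If treatment effects vary with the covariates, OLS does not give you the ATE. It gives you a weighted average that emphasizes units whose treatment status is hard to predict from their covariates, that is, units whose probability of treatment is far from both 0 and 1.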
- Reporting: You should report who is in your effective sample alongside your estimates.
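One simple summary to report is the size of the effective sample. Below is a minimal sketch, assuming the inverse-sum-of-squared-weights (Kish) measure, which is the same quantity the code listings further down compute as `n_eff`. The helper name `effective_sample_size` is hypothetical.

```python
def effective_sample_size(raw_weights):
    """Return 1 / sum(w_i^2) after normalizing the weights to sum to 1."""
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in raw_weights]
    return 1.0 / sum(w * w for w in normalized)


# Equal weights: every observation counts equally, so n_eff equals n.
equal = effective_sample_size([1.0, 1.0, 1.0, 1.0])

# Concentrated weights: one observation dominates, so n_eff falls toward 1.
concentrated = effective_sample_size([0.97, 0.01, 0.01, 0.01])

print(equal, concentrated)
```

The further the weights are from uniform, the smaller this number becomes relative to the nominal sample size.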
Interactive Demonstration: LaLonde (1986) Data
The LaLonde dataset from the National Supported Work (NSW) job training experiment contains 445 observations (185 treated and 260 control) with covariates including age, education, race, marital status, and prior earnings. The charts below show how OLS weights are distributed and how the effective sample differs from the full sample.
[Chart, left: OLS Weight by Propensity Score]
[Chart, right: Effective Sample vs. Full Sample, Covariate Means]
Reading the right chart: The effective sample (high-weight observations) differs from the full LaLonde sample on key covariates — it skews younger and has higher prior earnings — illustrating that OLS identifies effects for a specific subpopulation.
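The general shape of the weight-by-propensity chart can be anticipated without any data. As a rough sketch, assuming a binary treatment and a correctly specified linear model for treatment, a unit with propensity score p has expected squared residual p(1 - p), the variance of a Bernoulli draw. The function name below is hypothetical.

```python
def expected_weight(p):
    """Expected squared treatment residual for a unit with propensity p.

    A treated unit has residual (1 - p) and a control unit has residual -p,
    so the expectation is p * (1 - p)**2 + (1 - p) * p**2 = p * (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("propensity must lie in [0, 1]")
    return p * (1.0 - p)


propensities = [0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
curve = {p: expected_weight(p) for p in propensities}

for p, w in curve.items():
    # Weight relative to a unit at p = 0.5, where the curve peaks
    print(f"p = {p:.2f}  relative weight = {w / expected_weight(0.5):.3f}")
```

Units with propensities near 0 or 1 carry only a small fraction of the weight of a unit at 0.5, which is why the effective sample concentrates where treatment is least predictable.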
# Install packages if needed:
# install.packages(c("MatchIt", "cobalt"))
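library(MatchIt) # provides a lalonde dataset
library(cobalt) # for love.plot / balance checks
data("lalonde", package = "MatchIt")
# Note: MatchIt's lalonde pairs the 185 NSW treated units with 429 PSID
# comparison units (614 rows), so its counts differ from the experimental sample.
# MatchIt >= 4.0 stores race as a single factor (black / hispan / white),
# so build the dummy variables used below.
lalonde$black <- as.integer(lalonde$race == "black")
lalonde$hisp <- as.integer(lalonde$race == "hispan")
# ── Step 1: OLS of treatment on covariates ──────────────────────────────────
fit_ps <- lm(treat ~ age + educ + black + hisp + married + nodegree + re74 + re75,
             data = lalonde)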
# ── Step 2: Compute Aronow-Samii weights ────────────────────────────────────
e_hat <- residuals(fit_ps) # ê_i = D_i - E[D_i | X_i]
as_wts <- e_hat^2 / sum(e_hat^2) # normalize so weights sum to 1
# ── Step 3: Effective sample size ───────────────────────────────────────────
n_eff <- 1 / sum(as_wts^2)
cat("Full sample n:", nrow(lalonde), "\n")
cat("Effective sample n:", round(n_eff, 1), "\n")
# ── Step 4: Compare full sample vs effective sample covariate means ──────────
covs <- c("age", "educ", "black", "hisp", "married", "nodegree", "re74", "re75")
full_means <- colMeans(lalonde[, covs])
eff_means <- sapply(covs, function(v) weighted.mean(lalonde[[v]], as_wts))
comparison <- data.frame(
  covariate = covs,
  full_sample = round(full_means, 3),
  eff_sample = round(eff_means, 3),
  difference = round(eff_means - full_means, 3)
)
print(comparison)
# ── Step 5: Run main OLS and note what it actually estimates ─────────────────
ols <- lm(re78 ~ treat + age + educ + black + hisp + married + nodegree + re74 + re75,
data = lalonde)
summary(ols)
# The treatment coefficient is a weighted avg treatment effect —
# *not* the ATE over the full LaLonde sample.
* ── Load LaLonde data ──────────────────────────────────────────────────────
* Download from: https://users.nber.org/~rdehejia/data/nsw_dw.dta
* or use the version bundled with teffects/Stata 13+
* Using built-in Stata example data (NSW experimental):
* webuse lalonde, clear
* Or load your own:
use "nsw_dw.dta", clear
* ── Step 1: OLS of treatment on covariates ─────────────────────────────────
reg treat age educ black hispanic married nodegree re74 re75
* ── Step 2: Compute Aronow-Samii weights ───────────────────────────────────
predict e_hat, residuals // ê_i = D_i - E[D_i | X_i]
gen e_hat_sq = e_hat^2
quietly summarize e_hat_sq
gen as_wt = e_hat_sq / r(sum) // normalize weights to sum to 1
* ── Step 3: Effective sample size ──────────────────────────────────────────
gen as_wt_sq = as_wt^2
quietly summarize as_wt_sq
scalar n_eff = 1 / r(sum)
display "Full sample n = " _N
display "Effective sample n = " n_eff
* ── Step 4: Compare full vs effective sample covariate means ────────────────
foreach v in age educ black hispanic married nodegree re74 re75 {
    quietly summarize `v'
    scalar full_`v' = r(mean)
    quietly summarize `v' [aw = as_wt]
    scalar eff_`v' = r(mean)
    display "`v': full = " full_`v' " effective = " eff_`v' ///
        " diff = " eff_`v' - full_`v'
}
* ── Step 5: Main OLS regression ─────────────────────────────────────────────
reg re78 treat age educ black hispanic married nodegree re74 re75
* The coefficient on treat is a weighted average treatment effect (WATE)
* over the effective sample, not the ATE over the full sample.
# pip install causaldata numpy pandas statsmodels
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from causaldata import lalonde # LaLonde NSW experimental data
df = lalonde.load_pandas().data
# ── Step 1: OLS of treatment on covariates ──────────────────────────────────
covariates = "age + educ + black + hisp + married + nodegree + re74 + re75"
ps_model = smf.ols(f"treat ~ {covariates}", data=df).fit()
# ── Step 2: Aronow-Samii weights ─────────────────────────────────────────────
e_hat = ps_model.resid # ê_i = D_i - Ê[D_i | X_i]
as_wts = e_hat**2 / (e_hat**2).sum() # normalize to sum = 1
# ── Step 3: Effective sample size ────────────────────────────────────────────
n_eff = 1 / (as_wts**2).sum()
print(f"Full sample n: {len(df)}")
print(f"Effective sample n: {n_eff:.1f}")
# ── Step 4: Compare full vs effective sample covariate means ─────────────────
covs = ["age", "educ", "black", "hisp", "married", "nodegree", "re74", "re75"]
comparison = pd.DataFrame({
    "full_sample": df[covs].mean(),
    "eff_sample": df[covs].apply(lambda col: np.average(col, weights=as_wts)),
})
comparison["difference"] = comparison["eff_sample"] - comparison["full_sample"]
print("\nCovariate balance: full sample vs. effective sample")
print(comparison.round(3))
# ── Step 5: Main OLS — note what it actually estimates ───────────────────────
ols = smf.ols(f"re78 ~ treat + {covariates}", data=df).fit()
print(ols.summary())
# The 'treat' coefficient is the WATE over the effective sample —
# not the ATE over the full LaLonde NSW sample.