Learning Resources
Interactive demonstrations and code examples for econometrics, microeconomics, and forecasting.
▶Standard Deviation
Statistics · Descriptive Statistics
Play with the sliders below to see how the spread changes!
For each data point xᵢ, subtract the mean x̄ to find how far it sits from the center. Squaring does two things: it makes negatives positive (so deviations above and below don't cancel), and it penalizes large outliers more heavily. A score 20 points away from the mean contributes 400 to the sum; a score 2 points away contributes only 4.
We sum all the squared deviations and divide by n−1 to get an average. The result is called the variance (s²). It measures spread in squared units (e.g., points²), which is hard to interpret directly.
Why n−1 instead of n? Because we estimated the mean from the same data, which “uses up” one degree of freedom. Using n−1 gives an unbiased estimate of the true population variance. This is called Bessel’s correction.
Taking the square root of the variance returns us to the original units (points, dollars, etc.) instead of squared units. This makes the SD directly interpretable alongside the data. For example, if exam scores have an SD of 10 points, you can say “a typical score is within about 10 points of the mean” — a statement that makes intuitive sense.
The formula uses Greek letters as shorthand. Here’s what each one means:
| Symbol | Name | Meaning in this context |
|---|---|---|
| σ | lowercase sigma | Standard deviation of a population |
| s | s (Roman) | Standard deviation of a sample (what we usually compute) |
| σ² / s² | sigma-squared / s-squared | Variance — SD before taking the square root (in squared units) |
| Σ | uppercase Sigma | Summation — “add up all the following terms” |
| μ | mu | Population mean (true average for the whole group) |
| x | x-bar | Sample mean (average computed from collected data) |
| n | n (Roman) | Sample size — number of observations |
Tip: lowercase σ and uppercase Σ look similar but mean very different things. Context matters: σ (or s) is a single number measuring spread; Σ is an instruction to sum a list of numbers.
Normal Distribution of Exam Scores (mean = 70)
- SD is in the same units as the data. If scores are in points, the SD is in points. A score of 80 means very different things in a class with SD=2 versus SD=20. Try dragging the SD slider to see how the spread changes!
- The 68-95-99.7 rule (Empirical Rule). For bell-shaped data, about 68% of observations fall within ±1 SD of the mean, 95% within ±2 SD, and 99.7% within ±3 SD. Check the stats output above — the percentages match the rule!
- Mean and SD are independent. The spread of the bell curve depends only on σ, not on where the curve is centered. A class with mean 60 and SD=10 has the exact same bell shape as one with mean 85 and SD=10 — just shifted on the score axis.
- SD vs. Variance. Variance is SD². It’s useful in math (variances add across independent variables) but hard to interpret because it’s in squared units. The SD is just its square root — back in the original units.
- Outliers inflate the SD. Because deviations are squared before averaging, extreme points count disproportionately. A few outliers can make the SD much larger than it would otherwise be.
Score: / 5
▶Central Limit Theorem
Statistics · Inference
This is the Central Limit Theorem: for large enough n, the sampling distribution of the mean is normal regardless of the population shape. It underlies every t-test, confidence interval, and regression inference you will ever run.
Pick a population shape, set the sample size, and simulate.
Run the simulation to see results.
Key Takeaways
- The CLT is why the normal distribution is everywhere. Individual observations don’t need to be normal — only sample means do. That’s the foundation of t-tests, confidence intervals, and regression inference.
- Larger n ︎→︎ faster convergence. At n = 1, sample means look like the raw population. By n = 30, the distribution is usually close to normal. At n = 100, it’s excellent even for severely skewed data.
- SE = σ/√n. As n increases, sample means cluster more tightly around μ. Quadrupling the sample size halves the standard error.
- Independence matters. The CLT assumes random, independent observations. Clustered or time-series data with autocorrelation can violate this and require different inference methods.
▶Confidence Intervals
Econometrics · Statistical Inference
The wider the range, the more sure you can be it contains the truth. The narrower the range, the more precise — but the less sure. Play with the controls below to see how it works!
This is the average exam score from the students we actually asked — our single best guess for the true average. In our example, we surveyed a group and got an average of 78. It's the center of our confidence interval. The more students we ask, the closer this number tends to be to the real answer.
This number comes from the normal distribution table and controls how confident we want to be. A higher confidence level means a bigger Z-value, which stretches the interval wider. For 90% confidence it's 1.645, for 95% it's 1.960, and for 99% it's 2.576. Think of it as a "confidence multiplier" — the more sure you want to be, the bigger the multiplier.
The standard error tells us how much our sample average is likely to wiggle from sample to sample. It comes from two things: the standard deviation (how spread out individual scores are) divided by the square root of the sample size. This is why collecting more data helps — a bigger n means a smaller standard error, which means a narrower interval.
| Confidence Level | Significance (α) | Tail Area (α/2) | Z* (Critical Value) | Meaning |
|---|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 | We allow a 10% chance of being wrong |
| 95% | 0.05 | 0.025 | 1.960 | We allow a 5% chance of being wrong |
| 99% | 0.01 | 0.005 | 2.576 | We allow a 1% chance of being wrong |
| df | t* at 90% | t* at 95% | t* at 99% |
|---|---|---|---|
| 5 | 2.015 | 2.571 | 4.032 |
| 10 | 1.812 | 2.228 | 3.169 |
| 15 | 1.753 | 2.131 | 2.947 |
| 20 | 1.725 | 2.086 | 2.845 |
| 25 | 1.708 | 2.060 | 2.787 |
| 30 | 1.697 | 2.042 | 2.750 |
| 50 | 1.676 | 2.009 | 2.678 |
| 100 | 1.660 | 1.984 | 2.626 |
| ∞ | 1.645 | 1.960 | 2.576 |
- Ask more students ︎→︎ narrower interval. The more people you ask, the better your guess. Try dragging the sample size slider to the right!
- More variation in scores ︎→︎ wider interval. If some students got 20 and others got 100, it's harder to pin down the average. Try increasing the standard error!
- More confidence ︎→︎ wider interval. Saying "I'm 99% sure" requires casting a wider net than "I'm 90% sure." Try switching between the confidence level buttons!
- It's always a tradeoff. You can be very precise OR very confident, but not both at the same time (unless you get more data!).
Score: / 5
▶Hypothesis Testing & P-Values
Statistics · Inference
The null hypothesis H0 says the true coefficient is zero: no effect. The t-statistic measures how many standard errors your estimate sits from zero. The p-value asks: if H0 were true, how often would you see a result this extreme just by chance?
Drag the t-statistic below and watch the p-value and decision update in real time.
This is your regression coefficient — the sample estimate of the true population parameter β. It tells you the estimated effect of X on Y. Because it comes from a sample, it carries uncertainty. Under H0: β = 0, a t-stat far from zero suggests the true effect is probably nonzero.
The standard error measures uncertainty in your estimate. A larger sample reduces SE (SE ∝ 1/√n), making your t-statistic larger and your test more powerful. A noisier dataset (higher σ²) inflates SE, making it harder to reject H0.
Key Takeaways
- A p-value is not the probability H0 is true. It’s the probability of seeing data this extreme or more assuming H0 were true. These are very different things.
- Failing to reject ≠ accepting H0. You may simply lack enough data (statistical power) to detect a real effect. Absence of evidence is not evidence of absence.
- Statistical significance ≠ economic significance. With a large enough sample, even a trivially small effect will produce p < 0.05. Always report the coefficient magnitude and a confidence interval alongside the p-value.
- α = 0.05 is conventional, not sacred. It was popularized by Fisher in the 1920s. The right threshold depends on the cost of false positives vs. false negatives in your specific application.
- Two-tailed vs. one-tailed. A two-tailed test asks: is β ≠ 0? A one-tailed test asks: is β > 0 or β < 0? Toggle between them above. Note that one-tailed tests have less demanding critical values — use them only when theory strongly predicts a direction before seeing the data.
Quick Check
▶Type I & Type II Errors, Statistical Power
Statistics · Inference
The chart below shows the H0 and H1 distributions. The critical value separates “reject” from “fail to reject.” Drag the effect size to see how power changes.
Key Takeaways
- α and β trade off against each other. Making the test stricter (lower α) moves the critical value further out, which increases β. You will miss more true effects. There is no free lunch.
- Power increases with effect size and sample size. Larger true effects are easier to detect. More data reduces SE, which narrows both distributions and makes them easier to separate.
- Underpowered studies are dangerous. A study that finds p > α may simply lack the sample size to detect the effect — not evidence that no effect exists.
- Power analysis should happen before data collection. Given a target power (e.g., 80%), a hypothesized effect size, and α, you can calculate the required n. Post-hoc power calculations are generally uninformative.
▶Ordinary Least Squares
Econometrics · Linear Regression
OLS fits a line through data by minimizing the sum of squared residuals. The sliders add measurement error to each variable independently. The hidden variable slider introduces an omitted confounder correlated with both x1 and x2 — watch how the coefficients get pulled as its effect on Y grows.
x2 (red): adds measurement error to x2 only — red dots spread horizontally, x2's standard error grows.
hidden variable U (purple): U is correlated with both x1 and x2 but excluded from the regression. As U rises, its effect on Y grows and the coefficients on x1 and x2 pick up its influence — omitted variable bias.
▶Instrumental Variables
Econometrics · Causal Inference
OLS is unbiased only when the regressor X is uncorrelated with the error. When X is endogenous — correlated with unobserved determinants of Y — OLS picks up not just the causal effect but also the back-door path through the omitted variable. An instrumental variable Z breaks the endogeneity: it shifts X (relevance) but has no direct effect on Y (exclusion restriction). 2SLS uses only the exogenous part of X's variation — the part explained by Z — to identify β.
First Stage — Z ︎→︎ X
Structural Equation — X ︎→︎ Y
endogeneity ρ (purple): the correlation between X and the error term — the source of OLS bias. OLS absorbs the back-door path through the omitted variable; 2SLS strips it out. At ρ = 0 both estimators are consistent; as ρ rises the OLS coefficient drifts while 2SLS stays near 1.
▶Maximum Likelihood Estimation
Econometrics · Estimation
Maximum Likelihood Estimation finds parameters that make the observed data most probable. For binary outcomes we use logistic regression — MLE's answer to OLS for 0/1 data. Instead of minimizing squared residuals, MLE maximizes the log-likelihood. Output shows z-statistics and a Pseudo R² in place of the OLS R².
x2 (red): adds measurement error to x2 — same effect on x2's coefficient.
hidden variable U (purple): U is correlated with both x1 and x2 but excluded from the logit. As U rises, it shifts the true probabilities and the estimated coefficients pick up its influence.
▶Fixed and Random Effects
Panel Data · Econometrics
With panel data — multiple observations per unit over time — unobserved individual characteristics (fixed effects) can bias pooled OLS. Demeaning subtracts each unit's mean from its observations, eliminating any time-invariant omitted variable. The fixed effects estimator exploits only within-unit variation to identify β. Random effects instead treats the individual effect as random and uncorrelated with X, allowing between-unit variation to also inform the estimate — gaining efficiency at the cost of that assumption.
unit heterogeneity (σα): scales the time-invariant group intercepts αi, which are correlated with group-level X. This biases pooled OLS but is removed exactly by demeaning — the FE/RE estimates are unaffected.
hidden variable (U): U is a time-varying omitted variable negatively correlated with within-group X variation. Because it varies within groups, demeaning does not remove it — U biases both pooled OLS and FE/RE estimates. Push far enough and the slope flips sign even after demeaning.
clustered std. error (σ): adds idiosyncratic within-unit noise — widens standard errors but does not bias the slope. This is the variation that clustered standard errors are designed to account for.
▶R-Squared & Goodness of Fit
Regression · Model Fit
But R² can mislead. A model with high R² may be spurious, overfit, or irrelevant for causal questions. And a valid causal estimate from an IV regression might have R² = 0.02. R² measures fit — not correctness, not causality.
Adjust the sliders to see how R² responds to noise and slope.
Key Takeaways
- R² = 1 − RSS/TSS. TSS is total variation in Y; RSS is the variation left over after the model. R² is the fraction the model accounts for.
- High R² does not mean the regression is valid. A spurious regression of two trending time series can produce R² > 0.99 even when there’s no real relationship.
- R² always increases when you add variables, even irrelevant ones. Use adjusted R² or information criteria (AIC, BIC) when comparing models with different numbers of regressors.
- In causal inference, R² is often unimportant. What matters is whether β̂ is unbiased, not how much of Y’s variance is explained.
▶Omitted Variable Bias
Regression · Causal Inference
The bias formula: Bias ≈ γ × ρ(X, Z), where γ is the effect of the omitted variable Z on Y, and ρ is how correlated Z is with X. If either is zero, there is no bias.
Move the sliders to see how the biased estimate (dashed red) diverges from the true line (teal).
Key Takeaways
- Bias = γ × δ̂, where δ̂ is the coefficient from regressing Z on X. If the omitted variable is uncorrelated with X (ρ = 0), there is no bias even if Z strongly affects Y.
- You can sign the bias. If Z has a positive effect on Y (γ > 0) and is positively correlated with X (ρ > 0), β̂ will be upward biased. Knowing the direction helps even if you can’t measure Z.
- Adding controls reduces OVB only if they proxy the omitted variable. Adding irrelevant controls doesn’t help and can introduce “bad control” bias.
- This is why randomization is valuable. Random assignment of X makes ρ(X, Z) ≈ 0 for all omitted Z, eliminating OVB by design.
▶Heteroskedasticity
Regression · Diagnostics
OLS is still unbiased under heteroskedasticity, but standard errors are wrong. The usual SEs are too small where variance is high, making your t-stats too large and p-values too small. The fix is heteroskedasticity-robust standard errors (HC1/HC2/HC3).
Move the severity slider to see the residual fan grow.
Key Takeaways
- OLS coefficients are still unbiased. Heteroskedasticity affects inference (standard errors, t-stats, p-values), not the point estimates themselves.
- Always use robust standard errors. In R:
lm_robust()orcoeftest(m, vcov=vcovHC(m)). In Stata:reg y x, robust. The cost is zero; there is no reason not to. - Visual check: residuals vs. fitted values plot. If the spread of residuals increases with fitted values, heteroskedasticity is likely. The Breusch–Pagan or White test provides a formal check.
- Weighted Least Squares (WLS) can be more efficient than OLS when the form of heteroskedasticity is known, by down-weighting high-variance observations.
▶The Regression Weighting Problem
Aronow & Samii (2016) — Econometrics · Program Evaluation
When you run OLS with control variables, you may assume you are estimating the average treatment effect (ATE) across your full sample. Aronow and Samii (2016) show this is not the case. OLS implicitly assigns each observation a weight based on how much the treatment varies conditional on the covariates. Observations where treatment is nearly perfectly predicted by covariates receive very little weight — they barely influence the estimate at all. The result is that OLS identifies an effect for a subset of the data that may look very different from the full sample.
The Math
Let Di be the treatment indicator and Xi be covariates. Define the OLS residual from regressing treatment on covariates:
The OLS estimator of the treatment coefficient is a weighted average of unit-level treatment effects, with weights:
Units whose treatment status is nearly determined by their covariates (êi ≈ 0) receive near-zero weight. The effective sample — the observations actually driving your estimate — can be far smaller and systematically different from your full sample.
Why It Matters
- External validity: Your estimate may generalize only to the effective sample, not the full population you studied.
- Heterogeneous effects: If treatment effects vary by covariates, OLS does not give you the ATE — it gives you the treatment effect for the people in the middle of the covariate distribution.
- Reporting: You should report who is in your effective sample alongside your estimates.
Simulation: Drag to See the Problem
The slider controls how strongly a covariate predicts treatment — simulating increasing selection bias. Drag it right to watch the effective sample shrink and diverge from the full sample in real time.
AS Weight by Propensity Score
Where OLS Gets Its Identification
At high separation (slider right): treated and control barely overlap. Only observations in the narrow middle zone identify the effect — shown in blue on the right. The gray mass of the full sample gets near-zero weight and effectively disappears from the estimate.
Real Data: LaLonde (1986)
The LaLonde dataset from the National Supported Work (NSW) job training experiment contains 445 treated and 260 control observations with covariates including age, education, race, marital status, and prior earnings. The charts below show how OLS weights are distributed and how the effective sample differs from the full sample.
# Install packages if needed:
# install.packages(c("MatchIt", "cobalt"))
library(MatchIt) # provides lalonde dataset
library(cobalt) # for love.plot / balance checks
data("lalonde")
# ── Step 1: OLS of treatment on covariates ──────────────────────────────────
fit_ps <- lm(treat ~ age + educ + black + hisp + married + nodegree + re74 + re75,
data = lalonde)
# ── Step 2: Compute Aronow-Samii weights ────────────────────────────────────
e_hat <- residuals(fit_ps) # ê_i = D_i - E[D_i | X_i]
as_wts <- e_hat^2 / sum(e_hat^2) # normalize so weights sum to 1
# ── Step 3: Effective sample size ───────────────────────────────────────────
n_eff <- 1 / sum(as_wts^2)
cat("Full sample n:", nrow(lalonde), "\n")
cat("Effective sample n:", round(n_eff, 1), "\n")
# ── Step 4: Compare full sample vs effective sample covariate means ──────────
covs <- c("age", "educ", "black", "hisp", "married", "nodegree", "re74", "re75")
full_means <- colMeans(lalonde[, covs])
eff_means <- sapply(covs, function(v) weighted.mean(lalonde[[v]], as_wts))
comparison <- data.frame(
covariate = covs,
full_sample = round(full_means, 3),
eff_sample = round(eff_means, 3),
difference = round(eff_means - full_means, 3)
)
print(comparison)
# ── Step 5: Run main OLS and note what it actually estimates ─────────────────
ols <- lm(re78 ~ treat + age + educ + black + hisp + married + nodegree + re74 + re75,
data = lalonde)
summary(ols)
# The treatment coefficient is a weighted avg treatment effect —
# *not* the ATE over the full LaLonde sample.
* ── Load LaLonde data ──────────────────────────────────────────────────────
* Download from: https://users.nber.org/~rdehejia/data/nsw_dw.dta
* or use the version bundled with teffects/Stata 13+
* Using built-in Stata example data (NSW experimental):
* webuse lalonde, clear
* Or load your own:
use "nsw_dw.dta", clear
* ── Step 1: OLS of treatment on covariates ─────────────────────────────────
reg treat age educ black hispanic married nodegree re74 re75
* ── Step 2: Compute Aronow-Samii weights ───────────────────────────────────
predict e_hat, residuals // ê_i = D_i - E[D_i | X_i]
gen e_hat_sq = e_hat^2
quietly summarize e_hat_sq
gen as_wt = e_hat_sq / r(sum) // normalize weights to sum to 1
* ── Step 3: Effective sample size ──────────────────────────────────────────
gen as_wt_sq = as_wt^2
quietly summarize as_wt_sq
scalar n_eff = 1 / r(sum)
display "Full sample n = " _N
display "Effective sample n = " n_eff
* ── Step 4: Compare full vs effective sample covariate means ────────────────
foreach v in age educ black hispanic married nodegree re74 re75 {
quietly summarize `v'
scalar full_`v' = r(mean)
quietly summarize `v' [aw = as_wt]
scalar eff_`v' = r(mean)
display "`v': full = " full_`v' " effective = " eff_`v' ///
" diff = " eff_`v' - full_`v'
}
* ── Step 5: Main OLS regression ─────────────────────────────────────────────
reg re78 treat age educ black hispanic married nodegree re74 re75
* The coefficient on treat is a weighted average treatment effect (WATE)
* over the effective sample, not the ATE over the full sample.
# pip install causaldata numpy pandas statsmodels
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from causaldata import lalonde # LaLonde NSW experimental data
df = lalonde.load_pandas().data
# ── Step 1: OLS of treatment on covariates ──────────────────────────────────
covariates = "age + educ + black + hisp + married + nodegree + re74 + re75"
ps_model = smf.ols(f"treat ~ {covariates}", data=df).fit()
# ── Step 2: Aronow-Samii weights ─────────────────────────────────────────────
e_hat = ps_model.resid # ê_i = D_i - Ê[D_i | X_i]
as_wts = e_hat**2 / (e_hat**2).sum() # normalize to sum = 1
# ── Step 3: Effective sample size ────────────────────────────────────────────
n_eff = 1 / (as_wts**2).sum()
print(f"Full sample n: {len(df)}")
print(f"Effective sample n: {n_eff:.1f}")
# ── Step 4: Compare full vs effective sample covariate means ─────────────────
covs = ["age", "educ", "black", "hisp", "married", "nodegree", "re74", "re75"]
comparison = pd.DataFrame({
"full_sample": df[covs].mean(),
"eff_sample" : df[covs].apply(lambda col: np.average(col, weights=as_wts)),
})
comparison["difference"] = comparison["eff_sample"] - comparison["full_sample"]
print("\nCovariate balance: full sample vs. effective sample")
print(comparison.round(3))
# ── Step 5: Main OLS — note what it actually estimates ───────────────────────
ols = smf.ols(f"re78 ~ treat + {covariates}", data=df).fit()
print(ols.summary())
# The 'treat' coefficient is the WATE over the effective sample —
# not the ATE over the full LaLonde NSW sample.
▶Propensity Score Matching
Causal Inference · Program Evaluation
PSM’s solution: compress all observable confounders into a single number — the propensity score P(T=1|X). Then match each treated unit to a control unit with a similar score. If the model is correctly specified, matched pairs are as-good-as-randomly assigned.
Adjust the sliders, pick a matching method, and click Run Matching to see how balance improves and the ATT estimate converges toward the truth.
Generate data and run matching to see results.
Key Takeaways
- Propensity scores reduce dimensionality. Instead of matching on every covariate separately, you match on one number: P(T=1|X). Rosenbaum & Rubin (1983) proved this is sufficient to remove bias from observed confounders.
- Common support matters. Matching only works where treated and control PS distributions overlap. Units outside the common support region have no valid match and should be dropped.
- PSM estimates the ATT, not the ATE. You are asking: what was the effect for the people who were actually treated? That is different from the average effect across everyone.
- PSM only removes selection on observables. If there is an unobserved confounder correlated with both treatment and outcome, PSM cannot fix it. IV or an experiment is needed for that.
This lab uses a non-experimental version of the LaLonde (1986) dataset — 185 NSW job training participants combined with 429 CPS comparison workers drawn from a national survey. The goal is to practice loading data, inspecting its structure, and computing basic summary statistics by group.
| Variable | Description |
|---|---|
treat | 1 = received job training, 0 = control group |
age | Age in years |
educ | Years of education completed |
race | Race: black / hispan / white |
married | 1 = married at time of enrollment |
nodegree | 1 = no high school diploma |
re74 | Real earnings in 1974 (pre-program, $) |
re75 | Real earnings in 1975 (pre-program, $) |
re78 | Real earnings in 1978 (post-program outcome, $) |
- Download the dataset above and save it to a known folder. You will set your working directory to that folder in Step 0 of the code.
- Load the data using the code below for the preferred program.
- Examine the structure: how many observations? How many variables? What are the variable types?
- Compute mean
re78separately for the treated (treat = 1) and control (treat = 0) groups. What patterns emerge?
* Lab 1: Loading Data in Stata * ───────────────────────────────────────────────── * Step 0 — Set your working directory (use FORWARD slashes, not backslashes) cd "C:/Users/YourName/Downloads" * Step 1 — Import the CSV import delimited "lalonde.csv", clear * Step 2 — Examine the dataset structure describe list in 1/5 * Step 3 — Summary statistics for all variables summarize * Step 4 — Mean earnings in 1978 by treatment status tabstat re78, by(treat) stat(mean sd n)
# Lab 1: Loading Data in R
# ─────────────────────────────────────────────────
# Step 0 — Set your working directory
setwd("C:/Users/YourName/Downloads")
# Step 1 — Import the CSV
data <- read.csv("lalonde.csv")
# Step 2 — Examine the dataset structure
str(data)
head(data)
# Step 3 — Summary statistics for all variables
summary(data)
# Step 4 — Mean earnings in 1978 by treatment status
tapply(data$re78, data$treat,
function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
# Lab 1: Loading Data in Python
# ─────────────────────────────────────────────────
import pandas as pd
import os
# Step 0 — Set your working directory
os.chdir("C:/Users/YourName/Downloads")
# Step 1 — Import the CSV
data = pd.read_csv("lalonde.csv")
# Step 2 — Examine the dataset structure
print(data.dtypes)
print(data.head())
# Step 3 — Summary statistics for all variables
print(data.describe())
# Step 4 — Mean earnings in 1978 by treatment status
print(data.groupby("treat")["re78"].agg(["mean", "std", "count"]))
When you run Step 4, you will likely find that the treated group earns less on average than the control group. That seems to say the program hurt participants — but does it?
The control group here is not from the same randomized experiment. It is 429 workers drawn from the Current Population Survey, a national sample with very different backgrounds. The treated and control groups are not comparable. What looks like a program effect is mostly a difference in who ended up in each group.
This is the selection bias problem. The rest of the lab sequence is about detecting it (Lab 3), measuring how bad it is (Lab 4), and trying to fix it (Labs 5–7).
Raw data rarely arrives analysis-ready. This lab covers the essential cleaning steps for the LaLonde dataset: checking for missing values, inspecting variable types, converting a categorical string variable into indicator (dummy) variables, and inspecting earnings distributions. This dataset combines two very different populations — NSW job training participants and CPS survey workers — so what you find in the pre-program earnings variables will matter more than it might seem.
| Variable | Type | Notes |
|---|---|---|
race | String | Needs to be converted to dummies: black, hispan |
re74, re75 | Numeric | Pre-program earnings; many zeros are legitimate (unemployed) |
re78 | Numeric | Outcome variable; check for implausible values |
treat, married, nodegree | Binary | Should only take values 0 or 1 |
- Check for missing values across all variables. Are there any?
- Verify that binary variables (
treat,married,nodegree) only contain 0 and 1. - Convert
raceinto two dummy variables:blackandhispan(white is the omitted reference category). - Inspect the distribution of
re78for outliers. What is the maximum value? - Inspect the distributions of
re74andre75. What fraction of treated workers had zero pre-program earnings? Does the same pattern hold for the control group?
* Lab 2: Cleaning the Data in Stata
* ─────────────────────────────────────────────────
* Step 1 — Check for missing values
misstable summarize
* Step 2 — Verify binary variables
tabulate treat
tabulate married
tabulate nodegree
* Step 3 — Create dummy variables from race string
generate black = (race == "black")
generate hispan = (race == "hispan")
drop race
* Step 4 — Inspect the distribution of re78
summarize re78, detail
histogram re78, bin(30) title("Distribution of 1978 Earnings")
# Lab 2: Cleaning the Data in R
# ─────────────────────────────────────────────────
# Step 1 — Check for missing values
colSums(is.na(data))
# Step 2 — Verify binary variables
lapply(data[, c("treat", "married", "nodegree")], table)
# Step 3 — Create dummy variables from race string
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
# Step 4 — Inspect the distribution of re78
summary(data$re78)
hist(data$re78, breaks = 30, main = "Distribution of 1978 Earnings",
xlab = "re78")
# Lab 2: Cleaning the Data in Python
# ─────────────────────────────────────────────────
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("lalonde.csv")
# Step 1 — Check for missing values
print(data.isnull().sum())
# Step 2 — Verify binary variables
for col in ["treat", "married", "nodegree"]:
print(col, data[col].value_counts().to_dict())
# Step 3 — Create dummy variables from race string
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
# Step 4 — Inspect the distribution of re78
print(data["re78"].describe())
data["re78"].hist(bins=30)
plt.title("Distribution of 1978 Earnings")
plt.xlabel("re78")
plt.show()
When you inspect re74 and re75, you will find that a large share of treated workers had zero pre-program earnings — they were unemployed before entering the job training program. The CPS comparison group looks very different: most were employed throughout.
This is not a data problem to fix. It is the selection problem made visible. The two groups did not start from the same place, which means a simple comparison of their 1978 earnings will reflect who they were, not what the program did.
The dummy variables you created here (black, hispan) will appear as controls in every lab from Lab 3 onward. Cleaning is not just housekeeping — it sets up the analysis.
Before running any regression, a well-constructed summary statistics table is essential. This lab covers computing means, standard deviations, and counts for key variables — overall and by treatment status — and producing a publication-style table. The dataset used here is the non-experimental version of the LaLonde data: NSW treated workers paired with a comparison group drawn from the CPS. Because these groups were not randomly assigned, this lab asks whether they look similar before the program started — and whether a simple comparison of 1978 earnings would be fair.
| Variable | Description |
|---|---|
treat | Treatment indicator (use to split the sample) |
age, educ | Demographics |
black, hispan, married, nodegree | Binary characteristics |
re74, re75, re78 | Earnings in 1974, 1975, 1978 ($) |
- After loading and cleaning the data (Labs 1 & 2), compute the mean and standard deviation of each variable for the full sample.
- Split the sample by
treatand compute means for the treated and control groups separately. - Produce a table with three columns: Control mean, Treated mean, and Full-sample mean.
- Run a t-test for each variable comparing treated and control means. Which variables have p < 0.05?
- Compare pre-program earnings (
re74,re75) across groups. What does the imbalance tell you about whether a simple comparison of 1978 earnings would be fair?
* Lab 3: Summary Statistics in Stata
* ─────────────────────────────────────────────────
* Step 0 — Load and prepare data
import delimited "lalonde.csv", clear
generate black = (race == "black")
generate hispan = (race == "hispan")
drop race
* Step 1 — Full-sample summary
local vars age educ black hispan married nodegree re74 re75 re78
summarize `vars'
* Step 2 — Means by treatment status
tabstat `vars', by(treat) stat(mean sd n) format(%9.2f)
* Step 3 — Test for differences in means (balance check)
foreach v of local vars {
ttest `v', by(treat)
}
# Lab 3: Summary Statistics in R
# ─────────────────────────────────────────────────
# Step 0 — Load and prepare data
data <- read.csv("lalonde.csv")
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
vars <- c("age","educ","black","hispan","married","nodegree","re74","re75","re78")
# Step 1 — Full-sample summary
summary(data[, vars])
# Step 2 — Means by treatment status
do.call(rbind, lapply(vars, function(v) {
c(variable = v,
control = round(mean(data[data$treat == 0, v]), 2),
treated = round(mean(data[data$treat == 1, v]), 2),
full = round(mean(data[, v]), 2))
}))
# Step 3 — t-tests for balance
lapply(vars, function(v) {
t.test(data[data$treat == 1, v], data[data$treat == 0, v])
})
# Lab 3: Summary Statistics in Python
# ─────────────────────────────────────────────────
import pandas as pd
from scipy import stats
data = pd.read_csv("lalonde.csv")
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
vars = ["age","educ","black","hispan","married","nodegree","re74","re75","re78"]
# Step 1 — Full-sample summary
print(data[vars].describe().round(2))
# Step 2 — Means by treatment status
table = data.groupby("treat")[vars].mean().T.round(2)
table.columns = ["Control", "Treated"]
table["Full"] = data[vars].mean().round(2)
print(table)
# Step 3 — t-tests for balance
for v in vars:
t, p = stats.ttest_ind(data.loc[data.treat==1, v],
data.loc[data.treat==0, v])
print(f"{v}: t = {t:.3f}, p = {p:.3f}")
Step 2: Comparing Group Means
Your output will show means for each variable split by treatment status. In Stata, tabstat shows Mean, Std Dev, and N for each group. In R and Python, the code produces a table with Control, Treated, and Full-sample columns. The format differs but the question is the same — look at each row and ask yourself:
- Pre-program earnings (re74, re75): Do treated and control have similar earnings before the program? (if they do, re74 and re75 will have similar means)
- Demographics (age, educ, race): Are the groups comparable on background characteristics? (if so, they will also have similar means)
- Key question: If pre-program earnings are very different between groups, what does that tell you about whether a simple comparison of re78 is fair? Why does fairness matter in this context?
Step 3: t-test Results for Balance
Wait — why is a big p-value good here?
In regression, stars mean you found something. Here, stars mean your groups were already different.
Your output will show a p-value for each variable. Here is how to read them:
- p < 0.05 = the two groups are significantly different on this variable
- p ≥ 0.05 = the two groups look statistically similar
Answer these questions from your own output:
- Which pre-program variables have p < 0.05?
- Specifically: what is the p-value for
re74? Is it small? - If treated and control had different earnings before the program, can you trust a simple comparison of 1978 earnings?
Understanding Balance
Balance means the treated and control groups are similar on pre-program characteristics. In a randomized experiment, balance happens automatically because treatment is assigned by coin flip. In this dataset, the comparison group is not the experimental control group — it is drawn from the CPS, a general population survey. CPS workers are a fundamentally different population from the disadvantaged workers targeted by the NSW program. That is why the groups look so different even before the program started.
Describing Imbalance: A Practical Framework
When you look at your results, you're looking for imbalance. Here is how to describe what you see:
1. Look at the numbers first (not the p-value)
- For each variable: what is the actual difference between treated and control means?
- Example: if treated re74 = $2,096 and control re74 = $5,619, the difference is about $3,500.
- Write it down: "re74 differs by $3,500."
2. Now check if that difference is "real" or "chance"
- Look at the p-value. If p < 0.05, this difference is unlikely due to luck alone.
- If p ≥ 0.05, it could easily be chance.
3. Categorize each variable
- Big difference + small p-value = clear imbalance (groups differ)
- Small difference OR large p-value = balanced on this variable (groups similar)
4. Step back and ask
- How many variables show clear imbalance?
- Which are most important (pre-program earnings vs. age)?
- What story does the imbalance tell about how different these two populations are?
This lab has two parts. First, you will compute the true causal effect of job training using the experimental dataset — NSW treated and control workers who were randomly assigned. Because assignment was random, a simple difference in mean 1978 earnings is an unbiased estimate of the ATT. Record that number: it is your benchmark for Labs 5, 6, and 7. Second, you will switch to the non-experimental dataset (NSW treated + CPS comparison) and run OLS. Because the comparison group is a different population, OLS will not recover the benchmark even with controls. This is the failure LaLonde (1986) documented.
| Variable | Role |
|---|---|
re78 | Outcome — real earnings in 1978 ($) |
treat | 1 if received job training (randomly assigned) |
age, educ | Continuous controls |
black, hispan, married, nodegree | Binary controls |
re74, re75 | Pre-program earnings |
| Variable | Role |
|---|---|
re78 | Outcome — real earnings in 1978 ($) |
treat | 1 if received job training (not randomly assigned here) |
age, educ | Continuous controls |
race | String — converted to black, hispan dummies in code |
married, nodegree | Binary controls |
re74, re75 | Pre-program earnings (controls added in Step 4) |
- Load
lalonde_experimental.csv. Compute meanre78fortreat=1andtreat=0separately. The difference is the ATT from LaLonde's randomized experiment. Record this number — it is your benchmark for Labs 5, 6, and 7. - Switch to
lalonde.csv. Run OLS ofre78ontreatonly. The coefficient matches the raw mean difference in this dataset. How does it compare to your Step 1 benchmark? - Add demographic controls (
age,educ,black,hispan,married,nodegree). Does thetreatcoefficient move closer to the benchmark? - Add pre-program earnings (
re74,re75). Does the estimate improve further?
- Even with full controls, OLS cannot recover the true ATT because the CPS comparison group is a fundamentally different population. How far is your fully-controlled estimate from your Step 1 benchmark? This gap is what LaLonde (1986) documented. Labs 5–7 will try to close it.
* Step 1. True ATT from experimental data import delimited "lalonde_experimental.csv", clear tabstat re78, by(treat) stat(mean) * ATT = mean re78 (treat=1) minus mean re78 (treat=0) * Record this. It is your benchmark for Labs 5, 6, and 7. * Steps 2-4. OLS on non-experimental data import delimited "lalonde.csv", clear generate black = (race == "black") generate hispan = (race == "hispan") drop race * Step 2. OLS with treatment only regress re78 treat, robust * How does this compare to your Step 1 benchmark? * Step 3. Add demographic controls regress re78 treat age educ black hispan married nodegree, robust * Step 4. Add pre-program earnings regress re78 treat age educ black hispan married nodegree re74 re75, robust * Step 5. How far is the fully controlled estimate from your Step 1 benchmark?
library(sandwich)
library(lmtest)
# Step 1. True ATT from experimental data
exp_data <- read.csv("lalonde_experimental.csv")
means <- tapply(exp_data$re78, exp_data$treat, mean)
att <- means["1"] - means["0"]
cat("True ATT (benchmark):", round(att, 2), "\n")
# Record this. It is your benchmark for Labs 5, 6, and 7.
# Steps 2-4. OLS on non-experimental data
data <- read.csv("lalonde.csv")
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
# Step 2. OLS with treatment only
m1 <- lm(re78 ~ treat, data = data)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))
# How does this compare to your Step 1 benchmark?
# Step 3. Add demographic controls
m2 <- lm(re78 ~ treat + age + educ + black + hispan + married + nodegree,
data = data)
coeftest(m2, vcov = vcovHC(m2, type = "HC1"))
# Step 4. Add pre-program earnings
m3 <- lm(re78 ~ treat + age + educ + black + hispan + married + nodegree +
re74 + re75, data = data)
coeftest(m3, vcov = vcovHC(m3, type = "HC1"))
# Step 5. How far is the fully controlled estimate from your Step 1 benchmark?
cat("OLS estimate (fully controlled):", round(coef(m3)["treat"], 2), "\n")
cat("Benchmark:", round(att, 2), "\n")
cat("Gap:", round(coef(m3)["treat"] - att, 2), "\n")
import pandas as pd
import statsmodels.formula.api as smf
# Step 1. True ATT from experimental data
exp = pd.read_csv("lalonde_experimental.csv")
att = exp[exp["treat"]==1]["re78"].mean() - exp[exp["treat"]==0]["re78"].mean()
print(f"True ATT (benchmark): ${att:,.2f}")
# Record this. It is your benchmark for Labs 5, 6, and 7.
# Steps 2-4. OLS on non-experimental data
data = pd.read_csv("lalonde.csv")
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
# Step 2. OLS with treatment only
m1 = smf.ols("re78 ~ treat", data=data).fit(cov_type="HC1")
print(m1.summary())
# How does this compare to your Step 1 benchmark?
# Step 3. Add demographic controls
m2 = smf.ols("re78 ~ treat + age + educ + black + hispan + married + nodegree",
data=data).fit(cov_type="HC1")
print(m2.summary())
# Step 4. Add pre-program earnings
m3 = smf.ols("re78 ~ treat + age + educ + black + hispan + married + nodegree"
" + re74 + re75", data=data).fit(cov_type="HC1")
print(m3.summary())
# Step 5. How far is the fully controlled estimate from your Step 1 benchmark?
ols_est = m3.params["treat"]
print(f"OLS estimate (fully controlled): ${ols_est:,.2f}")
print(f"Benchmark: ${att:,.2f}")
print(f"Gap: ${ols_est - att:,.2f}")
Step 1: Reading the Experimental ATT
Your output shows mean re78 for treated and control workers in the experimental dataset. The difference is the ATT — the causal effect of job training. Your number should be close to $1,794.
Wait — why no regression needed?
In observational data, regression controls for covariates to reduce bias. Here there is no bias to reduce — randomization already balanced the groups. The raw mean difference is the causal effect.
Steps 2–4: Reading OLS on Non-Experimental Data
The coefficient on treat is OLS's estimate of the effect of job training on 1978 earnings using the CPS comparison group.
- Step 2 (no controls): likely negative or far from $1,794 — CPS workers earn more in 1978 even though they were never in the program.
- Steps 3–4 (with controls): the estimate may move toward the benchmark but typically stays far from it.
Wait — why don’t controls fix this?
The problem is common support: most NSW participants look unlike anyone in the CPS data — much lower pre-program earnings, different demographics. OLS with controls extrapolates, using the model's functional form to predict outcomes for regions of the covariate space where few or no comparison observations exist. Lab 7 will show that restricting the comparison to CPS workers who actually resemble NSW participants (propensity score matching) gets much closer to the benchmark.
Step 5: The Gap
Subtract your fully-controlled OLS estimate from your Step 1 benchmark. That number is the scale of the failure. This is what LaLonde (1986) documented — even careful OLS on non-experimental data can be wildly wrong. Labs 5, 6, and 7 will try to close this gap.
Probit is a nonlinear regression model for binary outcomes. It estimates the probability that the dependent variable equals 1 as a function of covariates, using the standard normal CDF rather than a straight line. This lab uses probit to model the probability of receiving job training. At the end there is an extension that builds on the probit output to estimate a selection-corrected treatment effect.
| Variable | Role |
|---|---|
treat | Dependent variable in first stage, 1 if received training |
re78 | Outcome in second stage, earnings in 1978 |
age, educ, black, hispan, married, nodegree | Covariates in both stages |
re74, re75 | Pre-program earnings, strong predictors of selection |
- Run a probit regression of
treaton all available covariates. Which variables predict participation? Report marginal effects. - Generate predicted probabilities (propensity scores) from the probit. Compare the mean score for treated vs. control workers. Do the two groups overlap? Poor overlap means the comparison group is too different to be a valid counterfactual.
- Compute the inverse Mills ratio (IMR) for each observation. The formula differs by group: for treated workers (treat=1) use phi(xb) / Phi(xb); for controls (treat=0) use -phi(xb) / (1 - Phi(xb)). Phi is the standard normal CDF, phi is the PDF, and xb is the probit linear index.
- Add the IMR as a control in a second-stage OLS of
re78ontreatand all covariates. This is the Heckman two-step selection correction. - How far is the Heckman estimate from your Lab 4 Step 1 experimental ATT? Record this and compare to Lab 6 DiD when you get there.
* Step 0 — Load and prepare data import delimited "lalonde.csv", clear generate black = (race == "black") generate hispan = (race == "hispan") drop race * Step 1. Probit first stage probit treat age educ black hispan married nodegree re74 re75 margins, dydx(*) * Step 2. Predicted probabilities (propensity scores) predict phat, pr tabstat phat, by(treat) stat(mean min max) * Step 3. Inverse Mills ratio predict xb, xb gen imr = treat * (normalden(xb) / normal(xb)) + (1 - treat) * (-normalden(xb) / (1 - normal(xb))) * Step 4. Second-stage OLS regress re78 treat age educ black hispan married nodegree re74 re75 imr, robust * Step 5. Compare to benchmarks di "Heckman: " _b[treat] * Compare to your Lab 4 Step 1 experimental ATT. * Record this and compare to Lab 6 DiD when you get there.
library(sandwich)
library(lmtest)
library(margins)
data <- read.csv("lalonde.csv")
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
# Step 1. Probit first stage
probit <- glm(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
data = data, family = binomial(link = "probit"))
summary(margins(probit))
# Step 2. Predicted probabilities (propensity scores)
data$phat <- predict(probit, type = "response")
tapply(data$phat, data$treat, summary)
# Step 3. Inverse Mills ratio
xb <- predict(probit, type = "link")
data$imr <- ifelse(data$treat == 1, dnorm(xb) / pnorm(xb), -dnorm(xb) / (1 - pnorm(xb)))
# Step 4. Second-stage OLS
heck <- lm(re78 ~ treat + age + educ + black + hispan + married + nodegree +
re74 + re75 + imr, data = data)
coeftest(heck, vcov = vcovHC(heck, type = "HC1"))
# Step 5. Compare to benchmarks
cat("Heckman:", round(coef(heck)["treat"], 2), "\n")
# Compare to your Lab 4 Step 1 experimental ATT.
# Record this and compare to Lab 6 DiD when you get there.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm
data = pd.read_csv("lalonde.csv")
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
# Step 1. Probit first stage
probit = smf.probit(
"treat ~ age + educ + black + hispan + married + nodegree + re74 + re75",
data=data
).fit()
print(probit.get_margeff().summary())
# Step 2. Predicted probabilities (propensity scores)
data["phat"] = probit.predict()
print(data.groupby("treat")["phat"].describe())
# Step 3. Inverse Mills ratio
xb = probit.predict(linear=True)
data["imr"] = np.where(data["treat"] == 1, norm.pdf(xb) / norm.cdf(xb), -norm.pdf(xb) / (1 - norm.cdf(xb)))
# Step 4. Second-stage OLS
heck = smf.ols(
"re78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75 + imr",
data=data
).fit(cov_type="HC1")
print(heck.summary())
# Step 5. Compare to benchmarks
print(f"Heckman: ${heck.params['treat']:,.2f}")
# Compare to your Lab 4 Step 1 experimental ATT.
# Record this and compare to Lab 6 DiD when you get there.
Step 1: Reading Marginal Effects
Do not read the raw probit coefficients as probability changes — they are expressed on the latent z-score scale, not in percentage points. Read the marginal effects instead: each value tells you how a one-unit increase in that variable changes the probability of treatment participation.
- Example: if the marginal effect for
re74is −0.00003, a $1,000 increase in pre-program earnings reduces the probability of treatment by about 3 percentage points. - Which variables most strongly predict who entered the program?
Step 2: Propensity Score Overlap
Compare the mean propensity score for treated vs. control workers. This tells you whether the two groups share common ground.
What counts as poor overlap?
If treated workers have a mean propensity score around 0.40 and CPS controls cluster near 0.05, the groups barely overlap. Heckman and matching need common ground to work. No overlap means no valid comparison, regardless of method.
Steps 3–4: Reading the Heckman Estimate
- IMR coefficient: if significant (p < 0.05), selection bias was present and the IMR is doing work. A large, significant IMR means OLS alone was badly biased.
- treat coefficient in second-stage OLS: this is the selection-corrected treatment effect. Compare it to your Lab 4 Step 1 benchmark — is it closer than raw OLS?
Why might Heckman still fail here?
The textbook Heckman model requires an exclusion restriction — a variable that predicts treatment but not earnings. Here the same covariates appear in both stages, so identification relies on functional form alone. The correction may not fully remove selection bias.
Difference-in-differences (DiD) estimates a treatment effect by comparing the change in outcomes over time for a treated group against the change for an untreated group. It differences out any time-invariant differences between groups and any common time trend, leaving only the treatment effect. The key assumption is parallel trends: in the absence of treatment, both groups would have followed the same trajectory. This lab applies DiD to the LaLonde data using pre-program earnings as the pre-period. At the end you will compare your estimate to the experimental benchmark from Lab 4.
| Variable | Role |
|---|---|
re78 | Outcome, post-period earnings (1978) |
re75 | Pre-period earnings (1975) |
re74 | Second pre-period earnings (1974), for trend check |
treat | Treatment indicator, 1 if received job training |
age, educ, black, hispan, married, nodegree | Controls |
- Compute the four group means: average
re75andre78fortreat=0andtreat=1. Calculate the DiD by hand. - Reshape the data to long format with one row per person per period. Create a
postindicator equal to 1 for the 1978 observation. Run a DiD regression. Does the coefficient on thepost x treatinteraction match your manual calculation? - Add demographic controls. Does the estimate change?
- Compute average
re74andre75for each group. Is the change from 1974 to 1975 the same for treated and control workers?
- Compare your DiD estimate to the Lab 4 benchmark. How close is it? What might a difference in pre-period trends tell you about why DiD did or did not recover the true effect?
* Step 0 — Load and prepare data import delimited "lalonde.csv", clear generate black = (race == "black") generate hispan = (race == "hispan") drop race * Step 1. Four group means and manual DiD tabstat re74 re75 re78, by(treat) stat(mean) * DiD = (re78_treat1 - re75_treat1) - (re78_treat0 - re75_treat0) * Calculate this by hand from the table above. * Step 2. Reshape and DiD regression preserve gen id = _n rename re75 earnings0 rename re78 earnings1 reshape long earnings, i(id) j(post) regress earnings i.post##i.treat, cluster(id) * Coefficient on 1.post#1.treat is the DiD estimate. * Step 3. Add controls regress earnings i.post##i.treat age educ black hispan married nodegree, cluster(id) * Step 4. Pre-period trend check restore tabstat re74 re75, by(treat) stat(mean) * Is the change from re74 to re75 the same for both groups? * Step 5. Compare to benchmark * Record your DiD estimate from Step 2. * Compare it to your Lab 4 benchmark. * How large is the gap? What does the re74-to-re75 pattern suggest about why?
library(dplyr)
library(tidyr)
library(sandwich)
library(lmtest)
data <- read.csv("lalonde.csv")
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
# Step 1. Four group means and manual DiD
data %>% group_by(treat) %>%
summarise(mean_re74 = mean(re74), mean_re75 = mean(re75), mean_re78 = mean(re78))
did_manual <- (mean(data$re78[data$treat==1]) - mean(data$re75[data$treat==1])) -
(mean(data$re78[data$treat==0]) - mean(data$re75[data$treat==0]))
cat("Manual DiD:", did_manual, "\n")
# Step 2. Reshape and DiD regression
data_long <- data %>%
mutate(id = row_number()) %>%
pivot_longer(cols = c(re75, re78), names_to = "year", values_to = "earnings") %>%
mutate(post = as.integer(year == "re78"))
did <- lm(earnings ~ post * treat, data = data_long)
coeftest(did, vcov = vcovCL(did, cluster = ~id))
# Step 3. Add controls
did_c <- lm(earnings ~ post * treat + age + educ + black + hispan + married + nodegree,
data = data_long)
coeftest(did_c, vcov = vcovCL(did_c, cluster = ~id))
# Step 4. Pre-period trend check
data %>% group_by(treat) %>%
summarise(change_74_75 = mean(re75) - mean(re74))
# Is the change the same for both groups?
# Step 5. Compare to benchmark
cat("DiD estimate:", round(coef(did)["post:treat"], 2), "\n")
# Compare to your Lab 4 Step 1 experimental ATT.
# How large is the gap? What does the pre-period pattern suggest about why?
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("lalonde.csv")
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
# Step 1. Four group means and manual DiD
print(data.groupby("treat")[["re74","re75","re78"]].mean())
did_manual = (
(data.loc[data.treat==1,"re78"].mean() - data.loc[data.treat==1,"re75"].mean()) -
(data.loc[data.treat==0,"re78"].mean() - data.loc[data.treat==0,"re75"].mean())
)
print("Manual DiD:", did_manual)
# Step 2. Reshape and DiD regression
data_long = pd.melt(
data.assign(id=range(len(data))),
id_vars=["id","treat","age","educ","black","hispan","married","nodegree","re74"],
value_vars=["re75","re78"], var_name="year", value_name="earnings"
)
data_long["post"] = (data_long["year"] == "re78").astype(int)
did = smf.ols("earnings ~ post * treat", data=data_long).fit(cov_type="cluster", cov_kwds={"groups": data_long["id"]})
print(did.summary())
# Step 3. Add controls
did_c = smf.ols(
"earnings ~ post * treat + age + educ + black + hispan + married + nodegree",
data=data_long
).fit(cov_type="cluster", cov_kwds={"groups": data_long["id"]})
print(did_c.summary())
# Step 4. Pre-period trend check
trends = data.groupby("treat")[["re74","re75"]].mean()
trends["change_74_75"] = trends["re75"] - trends["re74"]
print(trends)
# Is the change the same for both groups?
# Step 5. Compare to benchmark
print(f"DiD estimate: ${did.params['post:treat']:,.2f}")
# Compare to your Lab 4 Step 1 experimental ATT.
# How large is the gap? What does the pre-period pattern suggest about why?
Step 1: The Four Means and Manual DiD
You have four numbers: mean re75 and mean re78 for each group. The DiD is:
(re78_treated − re75_treated) − (re78_control − re75_control)
This asks: how much did treated workers’ earnings change, relative to how much control workers’ earnings changed over the same period?
Step 2: Reading the Regression
- post coefficient: how much did earnings change from 1975 to 1978 for the control group (the common time trend)?
- treat coefficient: how much did treated workers differ from control workers in 1975, before treatment?
- post × treat coefficient: the DiD estimate — the extra change in earnings for treated workers beyond the common trend. This should match your manual calculation.
Wait — why does differencing help?
Cross-section OLS compares groups that look very different (re74: $2,096 vs $5,619). DiD removes those time-invariant differences by looking at changes. But it only works if both groups would have followed the same earnings trajectory without treatment — the parallel trends assumption.
Step 5: Pre-Period Trend Check
Compare the re74 → re75 change for treated vs. control workers. If the two groups had different earnings trends even before the program started, parallel trends is violated — and DiD is still biased even though it differences out levels.
- Similar pre-period trends: parallel trends assumption is plausible.
- Different pre-period trends: DiD may be picking up a pre-existing divergence, not the treatment effect.
- What do you find? Does this help explain how far your DiD estimate is from the $1,794 benchmark?
Propensity score matching estimates a treatment effect by pairing each treated unit with a control unit that had a similar probability of receiving treatment given observed covariates. By matching on the propensity score rather than all covariates individually, we try to construct a comparison group that looks like the treated group on everything we can observe. This lab builds directly on Lab 5: the propensity scores you estimated there are the inputs here. Dehejia and Wahba (1999) revisited LaLonde's data using propensity score matching and found estimates much closer to the experimental benchmark than the methods in Labs 5 and 6 achieved. This lab asks you to see if you can replicate that result.
| Variable | Role |
|---|---|
treat | Treatment indicator, 1 if received job training |
re78 | Outcome, earnings in 1978 |
age, educ, black, hispan, married, nodegree | Matching covariates |
re74, re75 | Pre-program earnings, matching covariates |
- Estimate propensity scores using a probit of
treaton all covariates (same specification as Lab 5). - Match each treated worker to the nearest control worker by propensity score (matching with replacement — the same control may match multiple treated workers). How many unique control workers were used? Are any treated units far from their match?
- Check covariate balance before and after matching. Are the matched groups more similar on observable characteristics than the unmatched groups?
- Estimate the treatment effect on the matched sample. Compare to the Lab 4 benchmark. How close did matching get?
- Try a caliper to discard treated units with no close match (e.g., caliper of 0.01 on the propensity score). Does the estimate change? How many units are dropped? What does sensitivity to the caliper tell you about the quality of the match?
* Step 0 — Load and prepare data
import delimited "lalonde.csv", clear
generate black = (race == "black")
generate hispan = (race == "hispan")
drop race
* Step 1. Estimate propensity scores and match (nearest-neighbor, ATT)
teffects psmatch (re78) (treat age educ black hispan married nodegree re74 re75, probit), atet
* Step 2. Balance check before and after matching
tebalance summarize
* Look at the standardized differences. Values below 0.1 indicate good balance.
* Step 3. Treatment effect
* The teffects output reports the ATT directly.
* Record it and compare to your Lab 4 benchmark.
* Step 4. Caliper matching (Getting It Right)
* Use caliper to flag treated units with no close match, drop them,
* then re-estimate ATT on the on-support sample.
capture teffects psmatch (re78) (treat age educ black hispan married nodegree re74 re75, probit), ///
atet caliper(0.01) osample(unmatched)
count if unmatched == 1
local n_drop = r(N)
drop if unmatched == 1
di "Treated units dropped: " `n_drop'
teffects psmatch (re78) (treat age educ black hispan married nodegree re74 re75, probit), atet
* How does the estimate change? How many treated units were dropped?
library(MatchIt)
library(sandwich)
library(lmtest)
data <- read.csv("lalonde.csv")
data$black <- as.integer(data$race == "black")
data$hispan <- as.integer(data$race == "hispan")
data$race <- NULL
# Step 1. Nearest-neighbor matching on propensity score (probit)
m_out <- matchit(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
data = data, method = "nearest", distance = "probit", replace = FALSE)
# Step 2. Balance check
summary(m_out, standardize = TRUE)
# Look at standardized mean differences before and after matching.
# Step 3. Treatment effect on matched sample
m_data <- match.data(m_out)
fit <- lm(re78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75,
data = m_data, weights = weights)
coeftest(fit, vcov = vcovCL(fit, cluster = ~subclass))
cat("Matching ATT:", round(coef(fit)["treat"], 2), "\n")
# Compare to your Lab 4 Step 1 experimental ATT.
# Step 4. Caliper matching (Getting It Right)
m_cal <- matchit(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
data = data, method = "nearest", distance = "probit", caliper = 0.01, replace = FALSE)
summary(m_cal)
m_cal_data <- match.data(m_cal)
fit_cal <- lm(re78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75,
data = m_cal_data, weights = weights)
coeftest(fit_cal, vcov = vcovCL(fit_cal, cluster = ~subclass))
cat("Caliper ATT:", round(coef(fit_cal)["treat"], 2), "\n")
cat("Units dropped:", nrow(data) - nrow(m_cal_data), "\n")
# How does the estimate change? Does restricting to better matches help?
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors
data = pd.read_csv("lalonde.csv")
data["black"] = (data["race"] == "black").astype(int)
data["hispan"] = (data["race"] == "hispan").astype(int)
data = data.drop(columns=["race"])
# Step 1. Propensity scores
probit = smf.probit(
"treat ~ age + educ + black + hispan + married + nodegree + re74 + re75",
data=data
).fit()
data["pscore"] = probit.predict()
# Step 2. Nearest-neighbor matching
treated = data[data.treat == 1].reset_index(drop=True)
control = data[data.treat == 0].reset_index(drop=True)
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[["pscore"]])
distances, indices = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[indices.flatten()].reset_index(drop=True)
# Step 3. Balance check
covs = ["age", "educ", "black", "hispan", "married", "nodegree", "re74", "re75"]
before = data.groupby("treat")[covs].mean().T
before.columns = ["control_before", "treated_before"]
after = pd.DataFrame({
"control_after": matched_control[covs].mean(),
"treated_after": treated[covs].mean()
})
print(pd.concat([before, after], axis=1).round(2))
# Step 4. ATT estimate
att = treated["re78"].mean() - matched_control["re78"].mean()
print(f"Matching ATT: ${att:,.2f}")
# Compare to your Lab 4 Step 1 experimental ATT.
# Getting It Right: caliper matching
caliper = 0.01
mask = distances.flatten() <= caliper
att_cal = treated.loc[mask, "re78"].mean() - matched_control.loc[mask, "re78"].mean()
print("Caliper ATT (0.01):", att_cal)
print("Treated units dropped:", (~mask).sum())
Steps 1–2: What Matching Does
OLS with controls adjusts statistically for group differences but extrapolates — it estimates outcomes for CPS workers using functional form assumptions that may not hold. Matching restricts to actual comparable observations: each treated worker is paired with a control worker who had a similar probability of treatment.
Why does PSM succeed where OLS failed?
OLS, Heckman, and DiD all try to use the full CPS control group as a counterfactual. That comparison is fundamentally flawed. PSM finds the subset of CPS workers who look most like NSW participants and compares only them. It enforces comparability instead of assuming it.
Step 3: Reading the Balance Check
Standardized mean differences (SMD) measure how different treated and matched control groups are, in units of standard deviations.
- Before matching: large SMDs, especially for re74 and re75 — the groups are very different.
- After matching: SMDs should fall, ideally below 0.1 — the matched groups look similar on observables.
- If balance is poor after matching, the propensity score model may be misspecified.
Steps 4–5: ATT and the Caliper
- ATT on matched sample: should be closer to $1,794 than OLS, Heckman, or DiD. This is Dehejia & Wahba’s (1999) result.
- Caliper: treated workers with no close match (distance > 0.01) are dropped. The estimate on the remaining sample is cleaner but uses fewer observations.
- If the caliper estimate changes substantially, some of your matches were poor — those treated workers had no good counterfactual in the CPS data.
▶Production Possibilities & Comparative Advantage
Microeconomics · International Trade
Comparative advantage says: even if one country is better at producing everything, both benefit by specializing in the good with the lower opportunity cost. Click the button to see consumption expand beyond the PPF.
Key Takeaways
- OC, not absolute productivity, drives comparative advantage. A has CA in wheat (OC = 0.5 cloth < 2). B has CA in cloth (OC = 0.5 wheat < 2). Even if one country is better at both goods, trade still benefits both.
- Specialization + trade pushes consumption outside the PPF. After trading at terms between both OC ratios (here 1 : 1), both countries consume bundles that were individually unattainable.
- The PPF slope = −(OC of the good on the x-axis). A steeper PPF means a higher OC — comparative disadvantage in that good.
- Trade is not zero-sum. Specialization creates new value; gains accrue to both sides simultaneously.
▶Demand
Microeconomics · Consumer Behavior
But the whole curve can also shift. When something other than price changes — income, tastes, prices of related goods, or expectations — demand at every price changes simultaneously. Use the demand shift slider to see a rightward shift (increase) or leftward shift (decrease).
The Demand Curve
- Income. For normal goods, rising income increases demand (right shift). For inferior goods, rising income decreases demand as consumers trade up.
- Prices of substitutes. If a substitute becomes more expensive, demand for this good rises. If it becomes cheaper, demand falls.
- Prices of complements. If a good used alongside this one gets cheaper, demand for this good rises; more expensive, demand falls.
- Tastes and preferences. Health trends, advertising, or fashion can shift demand in either direction without any price change.
- Expectations. If consumers expect higher prices in the future, they buy more now — shifting current demand right.
- Number of buyers. More consumers in the market means higher demand at every price.
- Law of Demand. All else equal, a higher price means lower quantity demanded. This is why demand curves slope downward. Drag the price slider up — the point moves left (less is demanded).
- Movement along the curve vs. a shift. A price change moves you along the existing curve. A change in income, tastes, or related prices shifts the entire curve. These are fundamentally different events.
- Substitutes and complements. If a substitute gets more expensive, consumers switch here — demand rises. If a complement gets more expensive, people use less of both — demand falls.
- Normal vs. inferior goods. For most goods, higher income shifts demand right. For inferior goods (like instant noodles), higher income shifts demand left as people upgrade.
- Expectations are part of demand. Anticipated price increases shift today’s demand right. Demand is forward-looking, not just about current prices.
Score: / 5
▶Supply
Microeconomics · Producer Behavior
The entire curve can also shift. When input costs change, technology improves, the number of sellers changes, or government policy shifts — supply at every price changes simultaneously. Use the supply shift slider to see a rightward shift (more supply) or leftward shift (less supply).
The Supply Curve
- Input (resource) prices. Cheaper labor, materials, or energy lowers production costs — supply increases (right shift). More expensive inputs shift supply left.
- Technology. Better production technology lowers costs and lets producers profitably offer more at every price — supply increases (right shift).
- Number of sellers. More firms entering the market increases market supply. Firms exiting decreases it.
- Government policies. Subsidies lower costs and increase supply. Taxes raise costs and decrease supply.
- Expectations. If producers expect higher future prices, they may withhold supply now, decreasing current supply.
- Prices of related goods in production. If a substitute in production becomes more profitable, producers switch, decreasing supply of this good.
- Law of Supply. All else equal, a higher price means higher quantity supplied. Supply curves slope upward because higher prices make production more profitable. Drag the price slider up — the point moves right (more is supplied).
- Movement along the curve vs. a shift. A change in the good’s own price moves you along the supply curve. A change in costs, technology, or the number of sellers shifts the entire curve.
- Lower costs shift supply right. When inputs become cheaper or technology improves, producers can profitably supply more at every price. This is what long-run technological progress looks like.
- The supply curve is a marginal cost curve. A producer only supplies a unit if the price covers the marginal cost of producing it. The supply curve tells you the minimum price needed to bring each additional unit to market.
- Market supply = sum of firm supply curves. The market supply curve is the horizontal sum of all individual firm supply curves. More sellers means a larger total supply at every price.
Score: / 5
▶Market Equilibrium
Microeconomics · Market Equilibrium
A market equilibrium is the price and quantity where buyers and sellers agree — where the amount consumers want to buy exactly equals the amount producers want to sell. Use the buttons below to shift the curves and see what happens!
Supply & Demand
- Demand shift vs. movement along demand. A demand shift happens when something other than price changes — income, tastes, prices of related goods, expectations, or the number of buyers. Moving along the curve is just a response to a price change. Try “Demand ↑” to see the whole curve move!
- Supply shift vs. movement along supply. Same idea: a supply shift happens when costs change, technology improves, or the number of sellers changes. “Supply ↑” simulates producers becoming more efficient (lower costs = more offered at every price).
- The four core results. Demand ↑ ︎→︎ P↑ Q↑. Demand ↓ ︎→︎ P↓ Q↓. Supply ↑ ︎→︎ P↓ Q↑. Supply ↓ ︎→︎ P↑ Q↓. These four cases are the foundation of ECO 2020.
- Shortage vs. surplus. If price is below equilibrium, consumers want more than producers offer — a shortage. Sellers raise prices until the market clears. Above equilibrium: a surplus drives prices back down. Equilibrium is stable.
- Both curves shift. Real markets often have both curves shifting simultaneously. Try moving both sliders at once. When both increase, quantity rises for certain — but the price change is ambiguous and depends on which shift is larger.
Score: / 5
▶Consumer & Producer Surplus
Microeconomics · Welfare
The competitive equilibrium maximizes total surplus. Any price above or below equilibrium destroys some of that value as deadweight loss.
Move the price slider above or below equilibrium to see what happens.
Key Takeaways
- CS = area above price, below demand curve. For a linear demand P = a − bQ, CS at price P* is ½(a − P*) × Q*.
- PS = area below price, above supply curve. For a linear supply P = c + dQ, PS at price P* is ½(P* − c) × Q*.
- Price controls create deadweight loss. A binding ceiling (below equilibrium) causes a shortage; a binding floor (above equilibrium) causes a surplus. Both prevent mutually beneficial trades.
- Total surplus is maximized at competitive equilibrium. This is the First Welfare Theorem: free markets allocate resources efficiently (under idealized conditions).
▶Elasticity
Microeconomics · Price Elasticity of Demand
Price Elasticity of Demand measures how sensitive consumers are to price: Price Elasticity of Demand = % change in Qd ÷ % change in P. elasticity > 1 = elastic (responsive); elasticity < 1 = inelastic (unresponsive).
Drag the price slider to see how elasticity changes along the curve. The shaded rectangle is total revenue (P × Q) — watch how it changes!
Demand Curve — shaded area = Total Revenue (P × Q)
- Price Elasticity of Demand = % ΔQd ÷ % ΔP. A value of −2 means a 1% price increase causes a 2% drop in quantity demanded. Price Elasticity of Demand is always negative (higher price → lower Qd). We compare the absolute value to 1 to classify elasticity.
- Elastic vs. inelastic. Absolute value > 1: elastic (price-sensitive). Absolute value < 1: inelastic. Absolute value = 1: unit elastic. Drag the price slider toward the top of the curve — elastic. Near the bottom — inelastic. Elasticity varies along a single linear curve!
- Elasticity and total revenue (TR = P × Q). Elastic demand: price ↑ ︎→︎ TR ↓ (Qd falls more than price rose). Inelastic demand: price ↑ ︎→︎ TR ↑ (Qd barely falls). Watch the shaded rectangle shrink or grow as you move the price slider.
- Determinants of elasticity. More elastic when: many substitutes, luxury good, large share of income, time to adjust. More inelastic when: few substitutes, necessity, small share of income, urgently needed.
- Slope ≠ elasticity. A steeper demand curve is not the same as a more inelastic one at every point. A linear curve changes from elastic to inelastic as you move down it — even though the slope is constant throughout.
Score: / 5
▶Price Controls
Microeconomics · Government Intervention
Drag below $7 for a ceiling — drag above $7 for a floor.
Key Takeaways
- Non-binding controls have no effect. A ceiling above P* or a floor below P* is ignored by the market.
- Price ceilings create shortages. Qd > Qs. The quantity traded is limited by supply. Examples: rent control, wartime rationing.
- Price floors create surpluses. Qs > Qd. The quantity traded is limited by demand. Examples: minimum wage, agricultural supports.
- Both create DWL. The prevented trades — the DWL triangle — represent mutually beneficial exchanges that can’t happen because the price is stuck at the wrong level.
▶Monopoly vs. Perfect Competition
Microeconomics · Market Structure
Move the MC slider to see how cost conditions change the efficiency gap.
Key Takeaways
- MR < P for a monopolist. To sell one more unit the monopolist must lower price on all units. For P = a − bQ, MR = a − 2bQ — same intercept, double the slope.
- Qm < Qc, Pm > MC. The monopolist produces less and charges more than a competitive market would.
- DWL = ½(Pm − MC)(Qc − Qm). This triangle is the value of unrealized mutually beneficial trades.
- Market power is about price-setting, not just profit. Antitrust policy targets the DWL caused by output restriction, not profit itself.
▶Externalities & Pigouvian Policy
Microeconomics · Market Failures
A Pigouvian tax (negative) or subsidy (positive) equal to the externality size internalizes it and restores efficiency. Toggle type and drag the slider.
Key Takeaways
- Negative externalities cause over-production. MSC > MPC ⇒ the supply curve understates the true social cost. Market Q exceeds efficient Q.
- Positive externalities cause under-provision. MSB > MPB ⇒ the demand curve understates the true social benefit. Market Q falls short of efficient Q.
- Pigouvian tax/subsidy internalizes the externality. A tax = marginal external cost shifts supply up to MSC; a subsidy = marginal external benefit shifts demand up to MSB. In both cases the corrected equilibrium equals the social optimum.
- The Coase theorem is an alternative. If property rights are well-defined and transaction costs are low, private bargaining can achieve the efficient outcome without government intervention.
▶Tax Incidence
Microeconomics · Policy
The burden is shared according to relative elasticities: the more inelastic side bears more of the tax. If demand is perfectly inelastic, buyers pay 100%. If supply is perfectly inelastic, sellers bear 100%.
Drag the tax slider and watch the burden split and deadweight loss grow.
Key Takeaways
- Statutory incidence ≠ economic incidence. It doesn’t matter whether the tax is levied on buyers or sellers — the economic burden is determined solely by relative elasticities.
- The more inelastic side bears more of the tax. If |εD| < |εS|, buyers pay more. Gasoline taxes fall mostly on consumers because fuel demand is inelastic.
- Taxes create deadweight loss. By reducing quantity below the efficient level, some mutually beneficial trades are prevented. DWL grows with the square of the tax rate.
- Tax revenue peaks before DWL explodes. There is a Laffer-curve logic at the micro level: very high tax rates shrink the tax base so much that revenue may fall even as DWL continues to grow.
▶Public Goods & the Free-Rider Problem
Microeconomics · Market Failures
For public goods, efficiency requires vertical summation of demand: at any quantity, the social marginal benefit equals the sum of each individual’s marginal benefit. The Nash equilibrium falls short of this optimum.
Key Takeaways
- Private goods: horizontal summation. At each price, sum quantities demanded across consumers. This gives market demand.
- Public goods: vertical summation. At each quantity, sum marginal benefits across consumers (all enjoy the same unit). This gives social MB.
- The free-rider problem causes under-provision. In Nash equilibrium, the lower-valuation consumer free-rides entirely; the dominant consumer stops at their own MB = MC, well below the social optimum.
- Remedies: government provision, corrective subsidies, or Coasian bargaining. If all three conditions are met (non-rivalry, non-excludability, low transactions costs), negotiated cost-sharing can achieve Q*.
▶Price Discrimination
Microeconomics · Monopoly Pricing
Toggle between regimes to see how the efficiency–equity trade-off plays out.
Key Takeaways
- Single-price: DWL but some CS. The monopolist charges everyone P = $6. Buyers with WTP > $6 keep consumer surplus; units 40–80 go unsold (DWL = $80).
- Perfect discrimination: efficient but no CS. Every unit is sold at the buyer’s WTP. Output reaches the competitive level (Q = 80, P = MC = $2). DWL = $0, but CS = $0 too.
- Real-world discrimination is imperfect (2nd & 3rd degree). Airlines, coupons, student/senior discounts, and bundling are all partial forms that partially close the DWL gap without extracting all CS.
- Price discrimination requires market power. Competitive firms cannot price-discriminate: buyers would simply switch to a cheaper rival.
▶Labor Market: Supply & Demand
Labor Economics · Wages & Employment
Use the sliders to shift labor demand (e.g., a technology boom raises MRP) or labor supply (e.g., immigration expands the workforce). Watch how the equilibrium wage W* and employment L* respond.
Key Takeaways
- Labor demand = firms’ MRP. DL shifts right when product prices rise, technology improves, or capital complements labor. Both wages and employment increase.
- Labor supply reflects workers’ opportunity cost. SL shifts right when immigration expands the workforce or participation rises — wages fall but employment rises.
- Demand and supply shocks have opposite wage effects. A demand increase → W↑ and L↑. A supply increase → W↓ and L↑. Watch employer and worker surplus redistribute between the shaded areas.
- Competitive equilibrium maximizes total surplus. W* = MRP at L*: no DWL. Policy interventions (minimum wages, payroll taxes) drive wedges between supply and demand, creating welfare losses analogous to price controls in goods markets.
▶Minimum Wage
Labor Economics · Wage Policy
The competitive equilibrium here is W* = $10, L* = 40. Drag the slider above $10 to make the minimum wage binding.
Key Takeaways
- Non-binding floors have no effect. A minimum wage below W* is ignored by the market. Only floors above the competitive wage change outcomes.
- Binding minimum wages create unemployment. Qs > Qd at the floor price: more workers want jobs than firms want to hire. Employment is determined by the demand side.
- Winners and losers. Workers who keep their jobs benefit (higher wages). Workers who lose their jobs are worse off. The net welfare effect depends on the elasticity of labor demand.
- Monopsony complicates the prediction. When a single employer has market power (see Monopsony below), a minimum wage can actually increase employment by eroding the monopsony rent.
▶Monopsony
Labor Economics · Employer Market Power
The monopsony hires where MCL = MRP, then pays the lowest wage workers will accept (from the supply curve). Result: wages and employment both fall below the competitive level. Drag the slider from 0 (competitive) to 1 (pure monopsony).
Key Takeaways
- MCL > supply curve for a monopsonist. For supply W = a + bL, MCL = a + 2bL — same intercept, double the slope. The gap between MCL and supply widens as employment rises.
- Monopsony suppresses both wages and employment. The wage is read off the supply curve at the monopsony employment level — well below competitive. Real-world examples: company towns, healthcare (hospital as dominant employer), college sports.
- There is no supply curve for a monopsony. The firm faces a supply curve, not a supply curve that it takes as given. Quantity choices determine wages simultaneously.
- Minimum wages can improve efficiency here. A minimum wage set between the monopsony wage and the competitive wage raises both wages and employment, eliminating some or all of the DWL.
▶Human Capital & Returns to Education
Labor Economics · Wages & Investment
ln(w) = α + r · s + β · experience
The slope r is the rate of return to schooling. U.S. estimates typically fall between 8% and 12% per year. Drag the slider to see how the wage profile changes with different rates of return.
Key Takeaways
- The Mincer equation is log-linear. Because ln(w) is linear in s, wages grow exponentially with schooling. A 10% return means each year of school raises wages by ~10%, compounding over a career.
- The college premium is large and has grown. In the U.S., college graduates earn roughly 65–80% more than high school graduates — consistent with an 8–10% annual return over 4 extra years.
- Endogeneity is the central identification problem. People who get more schooling may differ in ability, family background, or motivation — all of which also raise wages. OLS estimates of r likely overstate the causal return. Card (1995) used college proximity as an IV to address this.
- Signaling vs. human capital. Spence (1973) argued education may signal pre-existing ability rather than build skills. Both theories predict positive Mincer slopes — but have very different policy implications for public education subsidies.
▶Unions & Collective Bargaining
Labor Economics · Wage Setting
The competitive equilibrium is W* = $10, L* = 40. Drag the slider to set the union wage premium.
Key Takeaways
- Union wage premium is ~10–20% in the U.S. The union–non-union wage gap is robust across many studies, though it has narrowed with declining union membership (from 35% in 1955 to ~10% today).
- Unions redistribute, they don’t just create rents. The orange area shows the rent transferred from employers to employed union members. The red triangle is pure DWL from jobs that no longer exist.
- Non-union workers can be made worse off. Workers displaced from union jobs compete in the non-union sector, lowering wages there — the “threat effect” and “crowding effect.”
- Efficient bargaining models predict different outcomes. If unions and firms bargain over both wages and employment, they can reach the competitive employment level while still paying a union wage — eliminating the DWL.
▶Unemployment: Types & Natural Rate
Labor Economics · Macroeconomic Context
The natural rate (NAIRU) = frictional + structural. It cannot be eliminated by demand stimulus. Use the sliders to build up the unemployment rate from its components.
Key Takeaways
- The natural rate is not zero unemployment. Even at full employment, workers are always between jobs (frictional) or retraining after industry change (structural). The U.S. NAIRU is estimated at roughly 4–5%.
- Demand stimulus can only cure cyclical unemployment. Monetary or fiscal expansion raises aggregate demand, pulling the economy back to potential. It cannot retrain structurally unemployed workers or speed up job search.
- Okun’s Law: 1% cyclical unemployment ≈ 2% GDP below potential. This empirical regularity links the unemployment gap to the output gap, connecting labor market analysis to macroeconomic measurement.
- Duration matters. Long-term unemployment can erode skills and labor force attachment, converting what started as cyclical unemployment into structural unemployment — a phenomenon called hysteresis.
Class slides for Labor Economics coming soon.
Interactive tools for Intermediate Micro coming soon.
Class slides for Intermediate Micro coming soon.
▶Simple Exponential Smoothing
Business Forecasting · Exponential Smoothing
The chart shows 36 months of retail sales data (blue) alongside the SES one-step-ahead forecast (red). Drag the α slider and watch how the forecast’s responsiveness changes.
Retail Sales vs. SES One-Step-Ahead Forecast
- Initialize: Set the level equal to the first observation: ℓ1 = y1.
- Update level: ℓt = α yt + (1 − α) ℓt−1 — a weighted average of the new observation and the prior level.
- Forecast: ŷt+1|t = ℓt — all future forecasts equal the current level (flat forecast line).
- Why “exponential”? Substituting the level equation repeatedly shows that weights decay geometrically: α, α(1−α), α(1−α)², …
- Optimal α is estimated by minimizing the sum of squared one-step forecast errors (SSE).
▶Moving Average Smoother
Business Forecasting · Decomposition
This series has 48 quarters of data with a clear seasonal swing and upward trend. Try k = 4 to watch the seasonal pattern disappear. At larger k, the trend itself is smoothed — but the smoother lags further behind the data.
Quarterly Sales: Original Series vs. MA(k) Smoother
▶ACF and PACF Explorer
Business Forecasting · Time Series Graphics & ARIMA
Reading both plots together is how ARIMA orders are identified in fpp3 Lab 7. Select a series type to see the expected patterns. Red dashed lines are 95% bounds (±1.96 / √T); colored bars are significant.
ACF — Lags 0–20
PACF — Lags 1–20
- AR(p): ACF tails off (exponential decay); PACF cuts off sharply after lag p. Choose p = last significant PACF lag.
- MA(q): ACF cuts off sharply after lag q; PACF tails off. Choose q = last significant ACF lag.
- ARMA(p,q): Both ACF and PACF tail off. Use AIC/BIC to choose p and q.
- Non-stationary (unit root): ACF decays very slowly; PACF has one large spike near 1.0 at lag 1. Difference the series (d = 1) and re-examine.
- Seasonal: Significant spikes at lags m, 2m, … in both ACF and PACF. Use seasonal ARIMA terms or seasonal differencing.
- White noise check: After fitting, residual ACF and PACF should both be inside the bounds at all lags.
▶Holt’s Linear Trend Method
Business Forecasting · Exponential Smoothing
The chart shows 36 months of retail sales (blue), Holt one-step-ahead fitted values (red), and an 8-period forecast fan (dashed red). Adjust both sliders to see how each parameter shapes the fit and forecast.
Retail Sales — Holt’s Linear Trend Fit and Forecast
- Level: ℓt = α yt + (1 − α)(ℓt−1 + bt−1)
- Trend: bt = β(ℓt − ℓt−1) + (1 − β) bt−1
- Forecast h steps ahead: ŷt+h|t = ℓt + h · bt
- Initialization: Set ℓ1 = y1 and b1 = y2 − y1.
- Key difference from SES: The h-step forecast is not flat — it increases (or decreases) linearly at rate bt.
▶STL Decomposition
Business Forecasting · Decomposition
The chart shows 48 quarters of constructed data. Use the component buttons to overlay each extracted component, or select Seasonally Adjusted to see the series with seasonality removed. Adjust the trend window and seasonal window to see how the components change.
Original Series with Selected Component
- Trend: Estimated by applying LOESS smoothing to the series with a user-specified window. A larger window gives a smoother trend; a smaller window tracks short-run movements more closely.
- Seasonal: After subtracting the trend, LOESS is applied separately to each season position (all Q1s, all Q2s, etc.). With season(window="periodic") the seasonal pattern is fixed over time; a numeric window allows it to change slowly.
- Remainder: Rt = yt − Tt − St. A small remainder SD indicates the model captured most of the structure.
- Seasonally adjusted: yt − St — removes the seasonal pattern to reveal the underlying trend plus irregular variation.
- Trend strength (FT): 1 − Var(R) / Var(T + R). Values above 0.64 indicate strong trend.
- Seasonal strength (FS): 1 − Var(R) / Var(S + R). Values above 0.64 indicate strong seasonality.
▶Benchmark Forecasting Methods
Business Forecasting · Forecasting Fundamentals
The chart shows a 36-period time series split into a training set and a test set. Use the slider to adjust the training window length. All four benchmarks are fit on the training data and their test-period forecasts are shown. The accuracy table reports RMSE and MAE for each method on the test set.
Benchmark Forecasts — Training vs. Test Period
- Mean: Every future forecast equals the average of all training observations. Best when the series has no trend or seasonal pattern.
- Naive: Every future forecast equals the last observed value. Optimal for a random walk and is the default benchmark for financial data.
- Seasonal Naive: Each forecast equals the value from the same season one period ago (e.g., forecast for Q1 = last Q1). Captures seasonal patterns with no parameter estimation.
- Drift: Projects the trend implied by the first and last training observations forward linearly. Equivalent to drawing a line from the first to the last point.
- Rule of thumb: If your model cannot beat Naive on RMSE, question whether the added complexity is justified.
▶ETS Models
Business Forecasting · Exponential Smoothing
The chart shows 48 quarters of data with a trend and additive seasonal pattern. Use the toggle buttons to change the error type, trend type, and seasonal type. The fitted line (dashed red) and 8-quarter forecast (green) update automatically. Compare AIC values across specifications to see which model fits best.
Data, ETS Fitted Values, and Forecast
- ETS(A,N,N) = Simple Exponential Smoothing. No trend, no seasonality.
- ETS(A,A,N) = Holt’s Linear Trend. Additive trend, no seasonality.
- ETS(A,A,A) = Holt-Winters Additive. Trend and additive seasonal component.
- ETS(A,A,M) = Holt-Winters Multiplicative Seasonal. Use when seasonal swings grow proportionally with the level.
- Damped trend (Ad): The trend is multiplied by a damping parameter φ < 1 each step, causing it to flatten out in long-run forecasts. Tends to outperform the full additive trend in practice.
- AIC rewards fit but penalizes the number of parameters — a lower AIC indicates a better balance of fit and parsimony.
▶ARIMA Models
Business Forecasting · ARIMA
Use the buttons to change p, d, and q. The main chart shows the actual series, fitted values (dashed), and an 8-step forecast. The residual ACF chart below checks whether the model has captured all autocorrelation — bars should stay inside the red bounds for a well-specified model.
Series, ARIMA Fitted Values, and Forecast
Residual ACF — Is Autocorrelation Removed?
- Choose d: Difference until the series looks stationary (constant mean). A unit-root test (ADF, KPSS) formalizes this. Most economic series need d = 1.
- Choose p: Use the PACF — the AR order equals the lag at which the PACF cuts off sharply. Try ARIMA(p,d,0) first.
- Choose q: Use the ACF of the differenced series — the MA order equals the lag at which the ACF cuts off. Try ARIMA(0,d,q) if the ACF cuts off and the PACF tails off.
- Check residuals: All residual ACF bars should fall within the 95% bounds (±1.96 / √T). Significant bars indicate the model is still missing structure.
- ARIMA(0,1,0): Random walk with no drift — equivalent to the Naive benchmark method.
▶Seasonal ARIMA (SARIMA)
Business Forecasting · ARIMA
The series below has a rising trend and a seasonal pattern that grows with the level — similar to the Lab 8 pharmaceutical data. Use the button rows to set the non-seasonal orders (top) and seasonal orders (bottom). The residual ACF should fall within the bounds if the model is well-specified.
Monthly Series, SARIMA Fitted Values, and 18-Month Forecast
Residual ACF — Lags 1–24
- Notation: ARIMA(p,d,q)(P,D,Q)[m] — lower case for non-seasonal, upper case for seasonal. The seasonal layer operates at lags m, 2m, 3m, …
- D (seasonal differencing): Subtract yt−m from yt. Use D = 1 when the ACF shows slow decay at the seasonal lags. Apply before regular differencing.
- P and Q: After seasonal differencing, look at the ACF and PACF at lags m, 2m, … A spike only at lag m in the PACF suggests P = 1; a spike only at lag m in the ACF suggests Q = 1.
- Identification order in fpp3: (1) take log if variance grows with level; (2) choose D with
unitroot_nsdiffs; (3) choose d withunitroot_ndiffs; (4) read ACF/PACF of the fully-differenced series for p,q,P,Q. - Residual check: A well-specified SARIMA leaves residuals that look like white noise — all ACF bars inside the 95% bounds, including at lags 12, 24.
This lab introduces the fpp3 ecosystem and the tsibble data structure used throughout the course. You will load a built-in dataset, filter to a single time series, produce a time plot, and compute basic summary statistics. All fpp3 datasets are included in the package — no download is required.
| Variable | Description |
|---|---|
Month | Monthly time index (tsibble index) |
State | Australian state (key variable) |
Industry | Retail industry category (key variable) |
Turnover | Monthly retail turnover ($AUD millions) |
- Install and load fpp3. Print aus_retail — how many key combinations are there? What is the index?
- Filter to New South Wales, Supermarket and grocery stores. How many rows does this leave?
- Make a time plot with
autoplot(). Does the series show a trend? Seasonality? - Compute the mean, standard deviation, minimum, and maximum of Turnover for this series.
- Repeat for a different state or industry. How do the patterns compare?
# Lab 1: Getting Started with fpp3
# ─────────────────────────────────────────────────
# install.packages("fpp3") # run once
library(fpp3)
# Step 1 — Inspect the tsibble
aus_retail
# Note: key = State + Industry; index = Month
# Step 2 — Filter to one series
nsw_grocery <- aus_retail |>
filter(State == "New South Wales",
Industry == "Supermarket and grocery stores")
nsw_grocery
# Step 3 — Time plot
nsw_grocery |>
autoplot(Turnover) +
labs(title = "NSW Supermarket & Grocery Turnover",
y = "Turnover ($AUD millions)", x = NULL)
# Step 4 — Summary statistics
nsw_grocery |>
as_tibble() |>
summarise(
n = n(),
mean_turnover = mean(Turnover, na.rm = TRUE),
sd_turnover = sd(Turnover, na.rm = TRUE),
min_turnover = min(Turnover, na.rm = TRUE),
max_turnover = max(Turnover, na.rm = TRUE)
)
# Step 5 — Try another series for comparison
aus_retail |>
filter(State == "Victoria",
Industry == "Clothing, footwear and personal accessory retailing") |>
autoplot(Turnover) +
labs(title = "Victoria: Clothing & Footwear Turnover",
y = "Turnover ($AUD millions)")
# Lab 1: Getting Started with Time Series Data
# ─────────────────────────────────────────────────
# Export from R first:
# library(fpp3)
# aus_retail |>
# filter(State == "New South Wales",
# Industry == "Supermarket and grocery stores") |>
# mutate(Month = as.Date(Month)) |>
# as_tibble() |>
# write.csv("nsw_grocery.csv", row.names = FALSE)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("nsw_grocery.csv", parse_dates=["Month"])
df = df.sort_values("Month").set_index("Month")
# Step 1 — Inspect the data
print(df.head())
print(f"N = {len(df)}, {df.index.min().date()} to {df.index.max().date()}")
# Step 2 — Time plot
df["Turnover"].plot(
figsize=(10, 4),
title="NSW Supermarket & Grocery Turnover",
ylabel="$AUD millions")
plt.tight_layout(); plt.show()
# Step 3 — Summary statistics
print(df["Turnover"].describe())
# Step 4 — Seasonal pattern: average turnover by calendar month
df["month"] = df.index.month
monthly_avg = df.groupby("month")["Turnover"].mean()
monthly_avg.plot(kind="bar", figsize=(8, 4),
title="Average Turnover by Calendar Month")
plt.xlabel("Month"); plt.tight_layout(); plt.show()
# Step 5 — Compare another series (re-export from R with different filter)
Plotting comes before modeling. This lab walks through fpp3’s suite of visualization tools — time plot, seasonal plot, subseries plot, ACF correlogram, and lag plot — using a series with clear trend and strong monthly seasonality.
| Variable | Description |
|---|---|
Month | Monthly time index |
Cost | Monthly government drug subsidy cost ($AUD millions) |
- Make a time plot with
autoplot(). Describe the trend and any seasonal pattern you see. - Use
gg_season()to create a seasonal plot. Which months have consistently high sales? Low sales? Why might this be? - Use
gg_subseries(). What do the horizontal blue lines represent? Which months are trending up the most? - Compute and plot the ACF with
ACF() |> autoplot(). Are there seasonal spikes at lag 12? What does slow decay in the ACF indicate? - Create lag plots with
gg_lag(). At which lags is the correlation strongest? Why?
# Lab 2: Time Series Graphics
# ─────────────────────────────────────────────────
library(fpp3)
# Build a_10: total antidiabetic drug cost from PBS data
a_10 <- PBS |>
filter(ATC2 == "A10") |>
summarise(Cost = sum(Cost) / 1e6)
# Step 1 — Time plot
a_10 |>
autoplot(Cost) +
labs(title = "Australian antidiabetic drug sales",
y = "$ million", x = NULL)
# Step 2 — Seasonal plot
a_10 |>
gg_season(Cost, labels = "both") +
labs(title = "Seasonal plot: antidiabetic drug sales",
y = "$ million")
# Step 3 — Subseries plot
a_10 |>
gg_subseries(Cost) +
labs(title = "Subseries plot: antidiabetic drug sales",
y = "$ million")
# Blue horizontal lines = seasonal mean for each month
# Step 4 — ACF correlogram (lag_max = 3 years)
a_10 |>
ACF(Cost, lag_max = 36) |>
autoplot() +
labs(title = "ACF: antidiabetic drug sales")
# Slow decay = trend; spikes at multiples of 12 = annual seasonality
# Step 5 — Lag plots
a_10 |>
gg_lag(Cost, geom = "point", lags = 1:12) +
labs(title = "Lag plots: antidiabetic drug sales")
# Lab 2: Time Series Graphics
# ─────────────────────────────────────────────────
# Export from R first:
# library(fpp3)
# a_10 <- PBS |> filter(ATC2 == "A10") |> summarise(Cost = sum(Cost) / 1e6)
# a_10 |> mutate(Month = as.Date(Month)) |>
# as_tibble() |> write.csv("a10.csv", row.names = FALSE)
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import lag_plot
df = pd.read_csv("a10.csv", parse_dates=["Month"])
df = df.sort_values("Month").set_index("Month")
# Step 1 — Time plot
df["Cost"].plot(figsize=(10, 4),
title="Antidiabetic Drug Sales", ylabel="$AUD millions")
plt.tight_layout(); plt.show()
# Step 2 — Seasonal plot: one line per year
df["year"] = df.index.year
df["month"] = df.index.month
pivot = df.pivot(columns="year", values="Cost")
pivot.plot(figsize=(10, 4), alpha=0.7,
title="Seasonal Plot: Drug Sales (one line per year)")
plt.xlabel("Month (1 = Jan)"); plt.tight_layout(); plt.show()
# Step 3 — Subseries means by calendar month
monthly = df.groupby("month")["Cost"].mean()
monthly.plot(kind="bar", figsize=(8, 4),
title="Mean Drug Sales by Calendar Month")
plt.tight_layout(); plt.show()
# Step 4 — ACF correlogram (36 lags)
fig, ax = plt.subplots(figsize=(10, 4))
plot_acf(df["Cost"], lags=36, ax=ax)
ax.set_title("ACF: antidiabetic drug sales")
plt.tight_layout(); plt.show()
# Step 5 — Lag scatter plots (lags 1-12)
fig, axes = plt.subplots(3, 4, figsize=(12, 9))
for i, ax in enumerate(axes.flat, 1):
lag_plot(df["Cost"], lag=i, ax=ax)
ax.set_title(f"Lag {i}")
plt.tight_layout(); plt.show()
STL (Seasonal and Trend decomposition using Loess) separates a time series into trend, seasonal, and remainder components. In this lab you will decompose Australian beer production, extract the seasonally adjusted series, and quantify the strength of trend and seasonality.
| Variable | Description |
|---|---|
Quarter | Quarterly time index |
Beer | Quarterly beer production (megalitres) |
- Plot beer production over time. Describe the overall shape: does the trend increase then decrease? Is there strong seasonality?
- Fit STL decomposition with
model(STL(...))and plot all four components. Which component has the largest scale? - Overlay the seasonally adjusted series on the original. What remains once seasonality is removed?
- Use
features(Beer, feat_stl)to compute trend strength (FT) and seasonal strength (FS). Are both above the 0.64 threshold for “strong”? - Re-fit STL with
season(window = 7)instead of"periodic". How does this change the seasonal component?
# Lab 3: STL Decomposition
# ─────────────────────────────────────────────────
library(fpp3)
beer <- aus_production |>
select(Quarter, Beer) |>
filter(!is.na(Beer))
# Step 1 — Time plot
beer |>
autoplot(Beer) +
labs(title = "Australian quarterly beer production",
y = "Megalitres", x = NULL)
# Step 2 — STL decomposition
dcmp <- beer |>
model(STL(Beer ~ trend(window = 9) +
season(window = "periodic"),
robust = TRUE))
components(dcmp) |> autoplot()
# Step 3 — Seasonally adjusted overlay
components(dcmp) |>
as_tsibble() |>
autoplot(Beer, colour = "gray70") +
geom_line(aes(y = season_adjust), colour = "#0072B2", linewidth = 0.9) +
labs(title = "Beer: original (grey) vs. seasonally adjusted (blue)",
y = "Megalitres")
# Step 4 — Trend and seasonal strength
beer |>
features(Beer, feat_stl)
# F_T > 0.64 → strong trend; F_S > 0.64 → strong seasonality
# Step 5 — Variable seasonal window (allows seasonal shape to change)
dcmp2 <- beer |>
model(STL(Beer ~ trend(window = 9) +
season(window = 7)))
components(dcmp2) |> autoplot()
# Lab 3: STL Decomposition
# ─────────────────────────────────────────────────
# Export from R:
# aus_production |> select(Quarter, Beer) |> filter(!is.na(Beer)) |>
# mutate(Quarter = as.Date(Quarter)) |>
# as_tibble() |> write.csv("beer.csv", row.names = FALSE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL
df = pd.read_csv("beer.csv", parse_dates=["Quarter"])
df = df.sort_values("Quarter").dropna(subset=["Beer"]).set_index("Quarter")
# Step 1 — Time plot
df["Beer"].plot(figsize=(10, 4),
title="Australian Quarterly Beer Production", ylabel="Megalitres")
plt.tight_layout(); plt.show()
# Step 2 — STL decomposition (period=4 for quarterly data)
stl = STL(df["Beer"], period=4, robust=True)
res = stl.fit()
res.plot()
plt.suptitle("STL Decomposition: Beer Production")
plt.tight_layout(); plt.show()
# Step 3 — Seasonally adjusted overlay
df["sa"] = df["Beer"] - res.seasonal
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df["Beer"], color="gray", alpha=0.7, label="Original")
ax.plot(df.index, df["sa"], color="steelblue", lw=1.5, label="Seasonally adjusted")
ax.legend(); ax.set_title("Beer: Original vs. Seasonally Adjusted")
plt.tight_layout(); plt.show()
# Step 4 — Trend and seasonal strength (Hyndman-Athanasopoulos formulas)
var_r = np.var(res.resid)
F_T = max(0, 1 - var_r / np.var(res.trend + res.resid))
F_S = max(0, 1 - var_r / np.var(res.seasonal + res.resid))
print(f"Trend strength (F_T): {F_T:.3f} (>0.64 = strong)")
print(f"Seasonal strength (F_S): {F_S:.3f} (>0.64 = strong)")
# Step 5 — Variable seasonal window
stl2 = STL(df["Beer"], period=4, seasonal=7, robust=True)
res2 = stl2.fit()
res2.plot()
plt.suptitle("STL: seasonal window = 7 (more flexible)")
plt.tight_layout(); plt.show()
Before fitting complex models, always benchmark against simple ones. This lab fits the four benchmark forecasting methods — mean, naïve, seasonal naïve, and drift — splits the data into training and test sets, and evaluates forecast accuracy using MAE, RMSE, and MAPE.
| Variable | Description |
|---|---|
Quarter | Quarterly time index |
Beer | Quarterly beer production (megalitres) |
- Split the series: train on data before 2005, test on 2005 onward. How many test periods are there?
- Fit all four benchmark models on the training set using a single
model()call. - Generate forecasts for the test period and plot them alongside the test data. Which method looks most reasonable?
- Compute accuracy metrics with
accuracy(fc, beer_test). Which model has the lowest RMSE? Lowest MAPE? - Run residual diagnostics on the seasonal naïve model. Is the Ljung-Box test significant? Are residuals white noise?
# Lab 4: Benchmark Methods & Forecast Accuracy
# ─────────────────────────────────────────────────
library(fpp3)
beer <- aus_production |>
select(Quarter, Beer) |>
filter(!is.na(Beer))
# Step 1 — Train / test split
beer_train <- beer |> filter(year(Quarter) < 2005)
beer_test <- beer |> filter(year(Quarter) >= 2005)
cat("Training periods:", nrow(beer_train), "\n")
cat("Test periods: ", nrow(beer_test), "\n")
# Step 2 — Fit four benchmark models
fit <- beer_train |>
model(
Mean = MEAN(Beer),
Naive = NAIVE(Beer),
SNaive = SNAIVE(Beer),
Drift = RW(Beer ~ drift())
)
# Step 3 — Forecast and plot
fc <- fit |> forecast(h = nrow(beer_test))
fc |>
autoplot(beer_train, level = NULL) +
autolayer(beer_test, Beer, colour = "black") +
labs(title = "Beer production: benchmark forecasts vs. actual",
y = "Megalitres") +
guides(colour = guide_legend(title = "Method"))
# Step 4 — Accuracy on test set
accuracy(fc, beer_test) |>
arrange(RMSE)
# Step 5 — Residual diagnostics for seasonal naive
fit |> select(SNaive) |> gg_tsresiduals()
augment(fit) |>
filter(.model == "SNaive") |>
features(.innov, ljung_box, lag = 8)
# p > 0.05 → residuals are consistent with white noise
# Lab 4: Benchmark Methods & Forecast Accuracy
# ─────────────────────────────────────────────────
# Uses same beer.csv as Lab 3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox
df = pd.read_csv("beer.csv", parse_dates=["Quarter"])
df = df.sort_values("Quarter").dropna(subset=["Beer"]).set_index("Quarter")
# Step 1 — Train/test split (mirror R: train < 2005, test >= 2005)
train = df[df.index.year < 2005].copy()
test = df[df.index.year >= 2005].copy()
print(f"Training: {len(train)}, Test: {len(test)}")
# Step 2 — Benchmark forecasts
h = len(test)
mean_fc = pd.Series(train["Beer"].mean(), index=test.index)
naive_fc = pd.Series(train["Beer"].iloc[-1], index=test.index)
snaive_fc = pd.Series(
[train["Beer"].iloc[-(4 - i % 4)] for i in range(h)], index=test.index)
slope = (train["Beer"].iloc[-1] - train["Beer"].iloc[0]) / (len(train) - 1)
drift_fc = pd.Series(
[train["Beer"].iloc[-1] + slope * (i + 1) for i in range(h)], index=test.index)
# Step 3 — Plot
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(train.index[-20:], train["Beer"].iloc[-20:], "k-", label="Training")
ax.plot(test.index, test["Beer"], "k--", label="Actual")
for fc, label in [(mean_fc, "Mean"), (naive_fc, "Naive"),
(snaive_fc, "Seasonal Naive"), (drift_fc, "Drift")]:
ax.plot(test.index, fc, label=label)
ax.legend(); ax.set_title("Beer Production: Benchmark Forecasts")
plt.tight_layout(); plt.show()
# Step 4 — Accuracy
def metrics(actual, fc):
e = actual.values - fc.values
return {"RMSE": np.sqrt(np.mean(e**2)),
"MAE": np.mean(np.abs(e)),
"MAPE": np.mean(np.abs(e / actual.values)) * 100}
results = {m: metrics(test["Beer"], fc)
for m, fc in [("Mean", mean_fc), ("Naive", naive_fc),
("Seasonal Naive", snaive_fc), ("Drift", drift_fc)]}
print(pd.DataFrame(results).T.sort_values("RMSE").round(2))
# Step 5 — Residual diagnostics for Seasonal Naive (training residuals)
# SNaive innovation at time t: Beer[t] - Beer[t-4]
resid_train = train["Beer"].iloc[4:].values - train["Beer"].iloc[:-4].values
lb = acorr_ljungbox(resid_train, lags=[8], return_df=True)
print(lb)
# p << 0.05 → residuals are NOT white noise (autocorrelation remains)
ETS models generalize SES, Holt’s, and Holt-Winters methods into a unified state-space framework. fpp3’s ETS() function selects the best error/trend/seasonal specification by minimizing AICc. This lab fits an ETS model to a drug sales series with strong trend and seasonality, diagnoses residuals, and produces a 2-year forecast.
| Variable | Description |
|---|---|
Month | Monthly time index |
Cost | Monthly government drug cost ($AUD millions) |
- Plot the h02 series. Describe the trend and seasonal pattern. Does variance grow with the level?
- Fit
ETS(Cost)with automatic selection. Usereport()to identify the chosen model. What do the E, T, and S letters mean? - Run residual diagnostics with
gg_tsresiduals()and the Ljung-Box test. Are residuals white noise? - Produce a 2-year forecast. Do the prediction intervals look reasonable given the historical variation?
- Manually fit ETS(A,A,A) and ETS(M,Ad,M) and compare AICc values. Which is preferred?
# Lab 5: Exponential Smoothing (ETS)
# ─────────────────────────────────────────────────
library(fpp3)
# Build h02: total corticosteroid drug cost from PBS
h02 <- PBS |>
filter(ATC2 == "H02") |>
summarise(Cost = sum(Cost) / 1e6)
# Step 1 — Time plot
h02 |>
autoplot(Cost) +
labs(title = "Australian corticosteroid drug sales",
y = "$ million", x = NULL)
# Step 2 — Automatic ETS selection
fit_ets <- h02 |> model(ETS(Cost))
report(fit_ets)
# Read: ETS(error, trend, season) — e.g. ETS(M,Ad,M)
# smoothing parameters: alpha, beta*, gamma, phi
# Step 3 — Residual diagnostics
fit_ets |> gg_tsresiduals()
augment(fit_ets) |>
features(.innov, ljung_box, lag = 24)
# p << 0.05 → residuals are NOT white noise; autocorrelation remains
# Step 4 — 2-year forecast with prediction intervals
fit_ets |>
forecast(h = "2 years") |>
autoplot(h02) +
labs(title = "ETS forecast: corticosteroid drug sales",
y = "$ million")
# Step 5 — Compare AICc across ETS specifications
h02 |>
model(
Auto = ETS(Cost),
AAA = ETS(Cost ~ error("A") + trend("A") + season("A")),
MAM = ETS(Cost ~ error("M") + trend("A") + season("M")),
MAdM = ETS(Cost ~ error("M") + trend("Ad") + season("M"))
) |>
glance() |>
select(.model, AICc) |>
arrange(AICc)
# Lab 5: Exponential Smoothing (ETS)
# ─────────────────────────────────────────────────
# Note: uses statsmodels ETSModel — forecast API differs from R.
# Use get_prediction(start, end) and .summary_frame(alpha) to get intervals.
# Column names are 'mean', 'pi_lower', 'pi_upper' (not conf_int).
# Export from R:
# library(fpp3)
# h02 <- PBS |> filter(ATC2 == "H02") |> summarise(Cost = sum(Cost) / 1e6)
# h02 |> mutate(Month = as.Date(Month)) |>
# as_tibble() |> write.csv("h02.csv", row.names = FALSE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
from statsmodels.stats.diagnostic import acorr_ljungbox
df = pd.read_csv("h02.csv", parse_dates=["Month"])
df = df.sort_values("Month").set_index("Month")
df.index.freq = "MS"
# Step 1 — Time plot
df["Cost"].plot(figsize=(10, 4),
title="H02 Antidiabetic Drug Costs", ylabel="$AUD millions")
plt.tight_layout(); plt.show()
# Step 2 — Fit ETS models and compare AIC
configs = {
"ETS(A,A,A)": dict(error="add", trend="add", seasonal="add", seasonal_periods=12),
"ETS(M,A,M)": dict(error="mul", trend="add", seasonal="mul", seasonal_periods=12),
"ETS(M,Ad,M)":dict(error="mul", trend="add", seasonal="mul", seasonal_periods=12,
damped_trend=True),
}
fits = {}
for name, kw in configs.items():
try:
fits[name] = ETSModel(df["Cost"], **kw).fit(disp=False)
except Exception:
pass
aic_tbl = {n: {"AIC": f.aic, "BIC": f.bic} for n, f in fits.items()}
print(pd.DataFrame(aic_tbl).T.sort_values("AIC").round(1))
# Step 3 — Report best model (assume ETS(M,A,M))
best = fits.get("ETS(M,A,M)", list(fits.values())[0])
print(best.summary())
# Step 4 — Residual diagnostics
resid = best.resid
fig, axes = plt.subplots(3, 1, figsize=(10, 8))
axes[0].plot(resid); axes[0].axhline(0, color="red", lw=0.8)
axes[0].set_title("Residuals")
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(resid, lags=24, ax=axes[1])
axes[2].hist(resid, bins=20); axes[2].set_title("Distribution")
plt.tight_layout(); plt.show()
lb = acorr_ljungbox(resid, lags=[24], return_df=True)
print(lb)
# Step 5 — 2-year forecast with 80% and 95% intervals
fc = best.get_prediction(start=len(df), end=len(df) + 23)
sf95 = fc.summary_frame(alpha=0.05) # 95% PI: columns pi_lower / pi_upper
sf80 = fc.summary_frame(alpha=0.20) # 80% PI
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(df.index[-60:], df["Cost"].iloc[-60:], color="black")
ax.plot(sf95.index, sf95["mean"], color="steelblue")
ax.fill_between(sf80.index, sf80["pi_lower"], sf80["pi_upper"],
alpha=0.35, color="steelblue")
ax.fill_between(sf95.index, sf95["pi_lower"], sf95["pi_upper"],
alpha=0.15, color="steelblue")
ax.set_title("H02: 2-Year ETS Forecast")
plt.tight_layout(); plt.show()
ARIMA models capture autocorrelation structure by regressing a series on its own past values (AR) and past forecast errors (MA), after differencing to achieve stationarity (I). This lab uses monthly US retail employment to practice the full ARIMA workflow: stationarity check, automatic model selection, residual diagnostics, and a head-to-head comparison with ETS.
| Variable | Description |
|---|---|
Month | Monthly time index |
Series_ID | BLS series identifier (key) |
Title | Employment category label |
Employed | Employed persons (thousands) |
- Filter to “Retail Trade” employment. Plot the series and its ACF. Is the series stationary? What does slow ACF decay signal?
- Use
features(Employed, list(unitroot_ndiffs, unitroot_nsdiffs))to determine how many differences are needed. - Fit
ARIMA(Employed)with automatic selection. Usereport()to read the output. Identify p, d, q, P, D, Q, and m. - Run residual diagnostics. Is the Ljung-Box test significant? Are residuals consistent with white noise?
- Split on 2018 and compare ARIMA vs. ETS on the test set. Which has lower RMSE? Which has more conservative (wider) prediction intervals?
# Lab 6: ARIMA Models
# ─────────────────────────────────────────────────
library(fpp3)
retail <- us_employment |>
filter(Title == "Retail Trade") |>
select(Month, Employed)
# Step 1 — Time plot and ACF
retail |>
autoplot(Employed) +
labs(title = "US Retail Trade Employment",
y = "Thousands of persons", x = NULL)
retail |>
ACF(Employed, lag_max = 36) |>
autoplot() +
labs(title = "ACF: US Retail Employment")
# Slow ACF decay → non-stationary (unit root)
# Step 2 — How many differences?
retail |>
features(Employed, list(unitroot_ndiffs, unitroot_nsdiffs))
# Step 3 — Automatic ARIMA selection
fit_arima <- retail |> model(ARIMA(Employed))
report(fit_arima)
# Read: ARIMA(p,d,q)(P,D,Q)[m]
# Coefficients, AICc, sigma^2
# Step 4 — Residual diagnostics
fit_arima |> gg_tsresiduals()
augment(fit_arima) |>
features(.innov, ljung_box, lag = 24)
# Step 5 — ARIMA vs. ETS on test set
retail_train <- retail |> filter(year(Month) < 2018)
retail_test <- retail |> filter(year(Month) >= 2018)
fit_compare <- retail_train |>
model(
ARIMA = ARIMA(Employed),
ETS = ETS(Employed)
)
fc_compare <- fit_compare |>
forecast(h = nrow(retail_test))
fc_compare |>
autoplot(retail_train |> tail(60), level = 80) +
autolayer(retail_test, Employed, colour = "black") +
labs(title = "ARIMA vs. ETS: US Retail Employment",
y = "Thousands of persons")
accuracy(fc_compare, retail_test) |>
select(.model, RMSE, MAE, MAPE) |>
arrange(RMSE)
# Lab 6: ARIMA Models
# ─────────────────────────────────────────────────
# Note: requires pmdarima for auto_arima — install once with:
# pip install pmdarima
# Export from R:
# us_employment |>
# filter(Title == "Retail Trade") |>
# select(Month, Employed) |>
# mutate(Month = as.Date(Month)) |>
# as_tibble() |> write.csv("retail_employment.csv", row.names = FALSE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox
df = pd.read_csv("retail_employment.csv", parse_dates=["Month"])
df = df.sort_values("Month").set_index("Month")
df.index.freq = "MS"
# Step 1 — Time plot and ACF
df["Employed"].plot(figsize=(10, 4),
title="US Retail Trade Employment", ylabel="Thousands of persons")
plt.tight_layout(); plt.show()
fig, ax = plt.subplots(figsize=(10, 4))
plot_acf(df["Employed"], lags=36, ax=ax)
ax.set_title("ACF: US Retail Employment"); plt.tight_layout(); plt.show()
# Step 2 — Stationarity check (ADF test)
pval = adfuller(df["Employed"].dropna())[1]
print(f"ADF p-value (levels): {pval:.4f} (>0.05 = non-stationary)")
pval_d = adfuller(df["Employed"].diff(12).diff(1).dropna())[1]
print(f"ADF p-value (s-diff+diff): {pval_d:.4f} (<0.05 = stationary)")
# Step 3 — Fit ARIMA (auto-order via pmdarima; install: pip install pmdarima)
from pmdarima import auto_arima
fit_auto = auto_arima(df["Employed"], seasonal=True, m=12,
information_criterion="aicc", stepwise=True, trace=True)
print(fit_auto.summary())
# Step 4 — Residual diagnostics
resid = fit_auto.resid()
fig, axes = plt.subplots(3, 1, figsize=(10, 8))
axes[0].plot(resid); axes[0].axhline(0, color="red", lw=0.8)
axes[0].set_title("Residuals")
plot_acf(resid, lags=24, ax=axes[1])
axes[2].hist(resid, bins=20)
plt.tight_layout(); plt.show()
lb = acorr_ljungbox(resid, lags=[24], return_df=True)
print(lb)
# Step 5 — ARIMA vs. ETS on 2018+ test set
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
train = df[df.index.year < 2018]
test = df[df.index.year >= 2018]
h = len(test)
p, d, q = fit_auto.order
P, D, Q, m = fit_auto.seasonal_order
arima_fit = SARIMAX(train["Employed"], order=(p,d,q),
seasonal_order=(P,D,Q,m)).fit(disp=False)
ets_fit = ETSModel(train["Employed"], error="mul", trend="add",
seasonal="mul", seasonal_periods=12).fit(disp=False)
arima_fc = arima_fit.forecast(h)
ets_fc = ets_fit.forecast(h)
rmse = lambda f: np.sqrt(np.mean((test["Employed"].values - f.values)**2))
print(f"ARIMA RMSE: {rmse(arima_fc):.1f}")
print(f"ETS RMSE: {rmse(ets_fc):.1f}")
Automatic ARIMA selection is convenient, but reading ACF and PACF plots after differencing is the core diagnostic skill. An AR(p) process shows PACF cutting off at lag p with ACF tailing off; an MA(q) process is the mirror image. This lab walks through manual order identification and then checks your choice against the AICc-selected model.
| Variable | Description |
|---|---|
Quarter | Quarterly time index (tsibble index) |
Gas | Gas production (petajoules) |
- Plot Gas production. Describe the trend and any seasonality. Is the series stationary?
- Take a log transformation, then a seasonal difference (lag 4). Plot and check the KPSS statistic with
features(log_gas, unitroot_kpss). Take a regular first difference if still needed. - Plot the ACF and PACF of the (log, seasonally-differenced) series. Based on the patterns, propose values of p and q for the non-seasonal part.
- Fit your manually-chosen
ARIMA(p,d,q)(P,D,Q)[4]alongsideARIMA(log(Gas))auto-selection. Compare the two models by AICc usingglance(). - Forecast Gas production 8 quarters (2 years) ahead from both models and plot the intervals side by side. Which is narrower?
# Lab 7: ARIMA Order Selection via ACF/PACF
# ─────────────────────────────────────────────────
library(fpp3)
gas <- aus_production |> select(Quarter, Gas)
# Step 1 — Time plot
gas |> autoplot(Gas) +
labs(title = "Australian Gas Production",
y = "Petajoules", x = NULL)
# Step 2 — Log + seasonal difference
gas <- gas |> mutate(log_gas = log(Gas))
gas |>
mutate(sdiff = difference(log_gas, lag = 4)) |>
features(sdiff, unitroot_kpss)
# If p-value < 0.05, take one more regular difference
gas_diff <- gas |>
mutate(d = difference(difference(log_gas, lag = 4), 1))
gas_diff |>
autoplot(d) +
labs(title = "Log Gas: seasonal + regular differenced")
# Step 3 — ACF and PACF of differenced series
gas_diff |> ACF(d, lag_max = 24) |> autoplot()
gas_diff |> PACF(d, lag_max = 24) |> autoplot()
# AR(p): PACF cuts off, ACF tails off
# MA(q): ACF cuts off, PACF tails off
# Step 4 — Manual vs. auto ARIMA
fit <- gas |>
model(
manual = ARIMA(log(Gas) ~ pdq(1,1,0) + PDQ(1,1,0)), # adjust after Step 3
auto = ARIMA(log(Gas))
)
glance(fit) |> select(.model, AICc, BIC)
report(fit |> select(auto))
# Step 5 — Forecast 8 quarters ahead
fc <- fit |> forecast(h = 8)
fc |>
autoplot(gas |> tail(40), level = 80) +
labs(title = "Gas Production Forecast: Manual vs. Auto ARIMA",
y = "Petajoules (log scale)")
# Lab 7: ARIMA Order Selection via ACF/PACF
# ─────────────────────────────────────────────────
# Note: requires pmdarima for auto_arima — install once with:
# pip install pmdarima
# ─────────────────────────────────────────────────
# Export from R:
# aus_production |>
# select(Quarter, Gas) |>
# mutate(Quarter = as.Date(Quarter)) |>
# as_tibble() |> write.csv("gas.csv", row.names = FALSE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
df = pd.read_csv("gas.csv", parse_dates=["Quarter"])
df = df.sort_values("Quarter").set_index("Quarter")
# Step 1 — Time plot
df["Gas"].plot(figsize=(10, 4),
title="Australian Gas Production", ylabel="Petajoules")
plt.tight_layout(); plt.show()
# Step 2 — Log + seasonal + regular differencing; KPSS test
df["log_gas"] = np.log(df["Gas"])
df["sdiff_d1"] = df["log_gas"].diff(4).diff(1) # lag-4 + lag-1
_, p, *_ = kpss(df["sdiff_d1"].dropna(), regression="c")
print(f"KPSS p-value after log + s-diff + diff: {p:.4f} (>0.05 = stationary)")
# Step 3 — ACF and PACF to identify p, q
fig, axes = plt.subplots(2, 1, figsize=(10, 7))
plot_acf( df["sdiff_d1"].dropna(), lags=24, ax=axes[0])
plot_pacf(df["sdiff_d1"].dropna(), lags=24, ax=axes[1])
axes[0].set_title("ACF of log-gas (s-diff + diff)")
axes[1].set_title("PACF of log-gas (s-diff + diff)")
plt.tight_layout(); plt.show()
# Step 4 — Manual vs. auto ARIMA (compare AICc)
manual = SARIMAX(df["log_gas"], order=(1,1,0),
seasonal_order=(1,1,0,4)).fit(disp=False)
print(f"Manual AICc: {manual.aicc:.1f}")
from pmdarima import auto_arima # pip install pmdarima
auto_m = auto_arima(df["log_gas"], seasonal=True, m=4,
information_criterion="aicc", stepwise=True, trace=False)
print(f"Auto AICc: {auto_m.aicc():.1f}")
print("Auto order:", auto_m.order, auto_m.seasonal_order)
# Step 5 — Forecast 8 quarters ahead (back-transform from log)
fc_manual = np.exp(manual.forecast(8))
auto_fit = SARIMAX(df["log_gas"], order=auto_m.order,
seasonal_order=auto_m.seasonal_order).fit(disp=False)
fc_auto = np.exp(auto_fit.forecast(8))
print("Manual forecast:", fc_manual.round(1).tolist())
print("Auto forecast:", fc_auto.round(1).tolist())
Seasonal ARIMA extends the ARIMA framework with a seasonal AR/MA layer operating at lag m, written ARIMA(p,d,q)(P,D,Q)[m]. Monthly pharmaceutical data are strongly seasonal and trended, making them ideal for practising seasonal differencing and reading the full SARIMA notation from R output.
| Variable | Description |
|---|---|
Month | Monthly time index |
Cost | Monthly drug cost (AUD millions) |
- Plot the series. Identify trend, seasonality, and any calendar effects. Does variance grow with level? Consider a log transformation.
- Use
features(log(Cost), list(unitroot_ndiffs, unitroot_nsdiffs))to determine d and D. Take the indicated differences and verify with an ACF plot. - Fit
ARIMA(log(Cost))with automatic selection. Usereport()to read the full SARIMA(p,d,q)(P,D,Q)[12] output. Write out the model equation in words: what does each term capture? - Run
gg_tsresiduals()and the Ljung-Box test. Do the residuals look like white noise? - Forecast 36 months (3 years) ahead. Plot the forecast with 80% and 95% prediction intervals. Does the seasonal pattern carry forward correctly?
# Lab 8: Seasonal ARIMA (SARIMA)
# ─────────────────────────────────────────────────
library(fpp3)
h02 <- PBS |>
filter(ATC2 == "H02") |>
summarise(Cost = sum(Cost) / 1e6)
# Step 1 — Plot and check for variance growth
h02 |> autoplot(Cost) +
labs(title = "H02 Antidiabetic Drug Costs",
y = "AUD millions", x = NULL)
h02 |> autoplot(log(Cost)) +
labs(title = "H02 Drug Costs (log scale)")
# Step 2 — How many differences needed?
h02 |>
features(log(Cost), list(unitroot_ndiffs, unitroot_nsdiffs))
# Visually check after seasonal + regular differencing
h02 |>
mutate(d = difference(difference(log(Cost), lag = 12), 1)) |>
autoplot(d)
h02 |>
mutate(d = difference(difference(log(Cost), lag = 12), 1)) |>
ACF(d, lag_max = 36) |>
autoplot()
# Step 3 — Automatic SARIMA
fit <- h02 |> model(ARIMA(log(Cost)))
report(fit)
# Look for: ARIMA(p,d,q)(P,D,Q)[12]
# Non-seasonal: captures short-run autocorrelation
# Seasonal: captures lag-12 autocorrelation
# Step 4 — Residual diagnostics
fit |> gg_tsresiduals(lag_max = 36)
augment(fit) |>
features(.innov, ljung_box, lag = 36, dof = 4)
# Step 5 — 3-year forecast
fc <- fit |> forecast(h = 36)
fc |>
autoplot(h02 |> tail(60)) +
labs(title = "H02 Drug Costs: 3-Year SARIMA Forecast",
y = "AUD millions")
# Lab 8: Seasonal ARIMA (SARIMA)
# ─────────────────────────────────────────────────
# Note: requires pmdarima for auto_arima — install once with:
# pip install pmdarima
# ─────────────────────────────────────────────────
# Export from R: (same h02.csv as Lab 5)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import kpss
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox
from pmdarima import auto_arima # pip install pmdarima
df = pd.read_csv("h02.csv", parse_dates=["Month"])
df = df.sort_values("Month").set_index("Month")
df.index.freq = "MS"
# Step 1 — Original and log-scale plots
fig, axes = plt.subplots(2, 1, figsize=(10, 7))
df["Cost"].plot(ax=axes[0], title="H02 Drug Costs (original)")
np.log(df["Cost"]).plot(ax=axes[1], title="H02 Drug Costs (log scale)")
plt.tight_layout(); plt.show()
# Step 2 — Stationarity: KPSS after seasonal then regular differencing
log_cost = np.log(df["Cost"])
_, p1, *_ = kpss(log_cost.diff(12).dropna(), regression="c")
_, p2, *_ = kpss(log_cost.diff(12).diff(1).dropna(), regression="c")
print(f"KPSS after seasonal diff only: p = {p1:.4f}")
print(f"KPSS after seasonal + first diff: p = {p2:.4f}")
print("(p > 0.05 means stationary)")
# Step 3 — Auto SARIMA selection on log scale
fit = auto_arima(log_cost, seasonal=True, m=12,
information_criterion="aicc", stepwise=True, trace=True)
print(fit.summary())
# Step 4 — Residual diagnostics
resid = fit.resid()
fig, axes = plt.subplots(3, 1, figsize=(10, 8))
axes[0].plot(resid); axes[0].axhline(0, color="red", lw=0.8)
axes[0].set_title("Residuals")
plot_acf(resid, lags=36, ax=axes[1])
axes[2].hist(resid, bins=20); axes[2].set_title("Distribution")
plt.tight_layout(); plt.show()
print(acorr_ljungbox(resid, lags=[36], return_df=True))
# Step 5 — 3-year forecast (36 months, back-transform)
p, d, q = fit.order; P, D, Q, m = fit.seasonal_order
sarima = SARIMAX(log_cost, order=(p,d,q),
seasonal_order=(P,D,Q,m)).fit(disp=False)
fc_log = sarima.get_forecast(36)
fc_mean = np.exp(fc_log.predicted_mean)
ci = np.exp(fc_log.conf_int())
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(df.index[-60:], df["Cost"].iloc[-60:], color="black", label="Observed")
ax.plot(fc_mean.index, fc_mean, color="steelblue", label="Forecast")
ax.fill_between(ci.index, ci.iloc[:,0], ci.iloc[:,1], alpha=0.2)
ax.legend(); ax.set_title("H02 Drug Costs: 3-Year SARIMA Forecast")
plt.tight_layout(); plt.show()
Dynamic regression combines regression predictors with an ARIMA error structure, written ARIMA(y ~ x + pdq() + PDQ()) in fpp3. This is essential when a series has external drivers (temperature, prices, calendar effects) and residual autocorrelation. Weekly gasoline production has a complex seasonal period (~52 weeks), so Fourier terms replace seasonal ARIMA to handle the non-integer periodicity.
| Variable | Description |
|---|---|
Week | Weekly time index |
Barrels | Production (millions of barrels per day) |
- Plot the series. Describe the trend and the complex weekly seasonality (period ≈ 52.18 weeks).
- Fit four models using K = 1, 3, 6, and 9 Fourier terms combined with
ARIMA(Barrels ~ fourier(K=K) + PDQ(0,0,0)). Use a loop ormodel()with named entries. Extract AICc fromglance()and select the best K. - Inspect the best model with
report(). What ARIMA order was chosen for the errors? Rungg_tsresiduals(). - Forecast 2 years (104 weeks) ahead using the best model. Plot the result.
- Compare the Fourier + ARIMA forecast against a plain
ARIMA(Barrels)model on a 2015–2017 test set. Which has lower RMSE?
# Lab 9: Dynamic Regression with ARIMA Errors
# ─────────────────────────────────────────────────
library(fpp3)
# Step 1 — Plot
us_gasoline |> autoplot(Barrels) +
labs(title = "US Weekly Gasoline Production",
y = "Millions of barrels/day", x = NULL)
# Step 2 — Select best number of Fourier terms by AICc
fit_fourier <- us_gasoline |>
model(
K1 = ARIMA(Barrels ~ fourier(K = 1) + PDQ(0,0,0)),
K3 = ARIMA(Barrels ~ fourier(K = 3) + PDQ(0,0,0)),
K6 = ARIMA(Barrels ~ fourier(K = 6) + PDQ(0,0,0)),
K9 = ARIMA(Barrels ~ fourier(K = 9) + PDQ(0,0,0))
)
glance(fit_fourier) |>
select(.model, AICc) |>
arrange(AICc)
# Lower AICc = better fit/parsimony trade-off
# Step 3 — Best model diagnostics (assuming K6 wins; adjust if needed)
best_fit <- us_gasoline |>
model(ARIMA(Barrels ~ fourier(K = 6) + PDQ(0,0,0)))
report(best_fit)
best_fit |> gg_tsresiduals()
# Step 4 — Forecast 2 years ahead
fc <- best_fit |> forecast(h = 104)
fc |>
autoplot(us_gasoline |> tail(104), level = 80) +
labs(title = "US Gasoline Production: 2-Year Forecast",
y = "Millions of barrels/day")
# Step 5 — Compare vs. plain ARIMA on 2015-2017 test set
gas_train <- us_gasoline |> filter(year(Week) < 2015)
gas_test <- us_gasoline |> filter(year(Week) >= 2015)
fit_compare <- gas_train |>
model(
Fourier_ARIMA = ARIMA(Barrels ~ fourier(K = 6) + PDQ(0,0,0)),
Plain_ARIMA = ARIMA(Barrels)
)
fc_compare <- fit_compare |>
forecast(h = nrow(gas_test))
accuracy(fc_compare, gas_test) |>
select(.model, RMSE, MAE, MAPE) |>
arrange(RMSE)
# Lab 9: Dynamic Regression with ARIMA Errors
# ─────────────────────────────────────────────────
# Note: Python may select a different best K than R in Step 2.
# R's ARIMA() auto-selects the error order separately for each K value,
# while Python uses a fixed ARIMA(0,1,1) error structure throughout.
# This changes the AICc surface, so the winning K may differ.
# The key conclusion in Step 5 (Fourier+ARIMA beats plain ARIMA) holds either way.
# ─────────────────────────────────────────────────
# Export from R:
# us_gasoline |>
# mutate(Week = as.Date(Week)) |>
# as_tibble() |> write.csv("us_gasoline.csv", row.names = FALSE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox
df = pd.read_csv("us_gasoline.csv", parse_dates=["Week"])
df = df.sort_values("Week").set_index("Week")
# Step 1 — Time plot
df["Barrels"].plot(figsize=(10, 4),
title="US Weekly Gasoline Production",
ylabel="Millions of barrels/day")
plt.tight_layout(); plt.show()
# Step 2 — Build Fourier terms (period ≈ 52.18 weeks = 365.25/7)
period = 365.25 / 7
n = len(df)
t = np.arange(n)
def fourier_df(t_vals, K, index):
cols = {}
for k in range(1, K + 1):
cols[f"sin{k}"] = np.sin(2 * np.pi * k * t_vals / period)
cols[f"cos{k}"] = np.cos(2 * np.pi * k * t_vals / period)
return pd.DataFrame(cols, index=index)
def aicc(fit):
k, n = len(fit.params), fit.nobs
return fit.aic + 2 * k * (k + 1) / (n - k - 1)
aicc_results = {}
for K in [1, 3, 6, 9]:
X = fourier_df(t, K, df.index)
fit = SARIMAX(df["Barrels"], exog=X,
order=(0,1,1), seasonal_order=(0,0,0,0)).fit(disp=False)
aicc_results[f"K={K}"] = aicc(fit)
print("AICc by number of Fourier terms:")
print(pd.Series(aicc_results).sort_values())
# Note: R auto-selects the ARIMA error order for each K; Python fixes it at (0,1,1).
# Python may select a different best K than R — both correctly apply AICc minimization.
# Step 3 — Fit best K and diagnose residuals
best_K = int(min(aicc_results, key=aicc_results.get).split("=")[1])
X_best = fourier_df(t, best_K, df.index)
best = SARIMAX(df["Barrels"], exog=X_best,
order=(0,1,1), seasonal_order=(0,0,0,0)).fit(disp=False)
print(best.summary())
lb = acorr_ljungbox(best.resid, lags=[52], return_df=True)
print(lb)
# Step 4 — Forecast 2 years (104 weeks)
t_fc = np.arange(n, n + 104)
X_fc = fourier_df(t_fc, best_K, index=range(104))
fc = best.get_forecast(104, exog=X_fc)
mean = fc.predicted_mean
ci = fc.conf_int()
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(df.index[-104:], df["Barrels"].iloc[-104:], color="black")
ax.plot(mean.index, mean.values, color="steelblue")
ax.fill_between(ci.index, ci.iloc[:,0], ci.iloc[:,1], alpha=0.2)
ax.set_title("US Gasoline: 2-Year Fourier + ARIMA Forecast")
plt.tight_layout(); plt.show()
# Step 5 — Compare vs. plain ARIMA on 2015-2017 test set
train_idx = df.index.year < 2015
test_idx = df.index.year >= 2015
n_train = train_idx.sum()
t_train, t_test = t[:n_train], t[n_train:]
X_tr = fourier_df(t_train, best_K, df.index[train_idx])
X_te = fourier_df(t_test, best_K, df.index[test_idx])
f_fit = SARIMAX(df.loc[train_idx, "Barrels"], exog=X_tr,
order=(0,1,1), seasonal_order=(0,0,0,0)).fit(disp=False)
p_fit = SARIMAX(df.loc[train_idx, "Barrels"],
order=(0,1,1), seasonal_order=(0,0,0,0)).fit(disp=False)
n_test = int(test_idx.sum())
fc_f = f_fit.forecast(n_test, exog=X_te)
fc_p = p_fit.forecast(n_test)
actual = df.loc[test_idx, "Barrels"]
rmse = lambda f: np.sqrt(np.mean((actual.values - np.array(f))**2))
print(f"Fourier+ARIMA RMSE: {rmse(fc_f):.3f}")
print(f"Plain ARIMA RMSE: {rmse(fc_p):.3f}")