Why bother, and what we gain from using predictions as covariates
AI disclaimer: These slides were built collaboratively with Claude Opus 4.8: I supplied the outline and table of contents, Claude drafted and refined the slides through our back-and-forth, and I have reviewed and edited every slide carefully for accuracy.
Part (a): why bother? We randomized; covariates cannot create or fix bias.
Claim 1. Difference-in-means is unbiased for the ATE; covariates only buy precision.
Part (b): predictions as covariates. Use a prediction \(\hat Y_i(0)\): when does it help, and when is it better than the full pre-intervention history?
Claim 2. Often, the most essential summary of the covariates is \(\hat Y_i(0)\), the prognostic score.
Claim 3. How we think about the functional form (additive? multiplicative? logit?) of the treatment effects shapes how important \(\hat Y_i(0)\) is in our estimation.
We believe that individuals being predictable from their pre-intervention history is what lets the model explain them (their outcomes and their treatment effects) better.
Binary treatment \(W_i \in \{0,1\}\); potential outcomes \(Y_i(0), Y_i(1)\); covariates \(X_i \in \mathbb{R}^K\). Observed \(Y_i = Y_i(W_i)\) (i.e., no spillovers).
Sampling. \(\big(Y_i(0), Y_i(1), X_i\big) \overset{\text{iid}}{\sim} \mu\), a population measure over (control outcome, treated outcome, covariate) triples.
Design (50:50). A random half of the units is assigned to treatment (complete randomization), so \(\Pr(W_i=1)=\tfrac12\) and \(W_i \perp \big(Y_i(0),Y_i(1),X_i\big)\).
\[ \underbrace{\tau = \mathbb{E}_\mu\!\left[Y_i(1)-Y_i(0)\right] = \!\!\int_{\mathcal{Y}\times\mathcal{Y}\times\mathcal{X}}\!\!(y_1-y_0)\,\mu(dy_0,dy_1,dx)}_{\textstyle \text{population ATE}} \] \[ \underbrace{\tau_c = \frac1n\sum_{i=1}^n \mathbb{E}_\mu\!\left[Y_i(1)-Y_i(0)\mid X_i\right]}_{\textstyle \text{conditional ATE: population CATE averaged over the realized } X_i} \]
We will focus on \(\tau_c\) (cleaner); today's estimators are also valid for \(\tau\) but carry an extra noise term, \(\mathrm{Var}(\text{CATE})/n\), from treatment-effect heterogeneity entering through the random draw of covariates.
Difference-in-means is unbiased, but needlessly noisy.
The estimator is just two group averages (\(n_1=n_0=\tfrac n2\)): \[ \hat\tau_{\text{DM}} = \bar Y_1 - \bar Y_0, \qquad \bar Y_w = \frac{1}{n_w}\sum_{i:\,W_i=w} Y_i = \frac{1}{n_w}\sum_{i:\,W_i=w} Y_i(w). \]
Split each potential outcome into a part predictable from \(X\) and a residual: \[ Y_i(w) = \underbrace{m_{w}(X_i)}_{\mathbb{E}[Y_i(w)\mid X_i]} + \varepsilon_i(w), \qquad \mathbb{E}[\varepsilon_i(w)\mid X_i]=0 . \]
Average within each group and subtract (\(\langle\cdot\rangle_{T},\langle\cdot\rangle_{C}\): treated/control averages): \[ \hat\tau_{\text{DM}} = \big[\,\langle m_1(X_i)\rangle_{T} - \langle m_0(X_i)\rangle_{C}\,\big] \;+\; \big[\,\langle\varepsilon(1)\rangle_{T} - \langle\varepsilon(0)\rangle_{C}\,\big]. \]
Compare each group’s prognostic average to the full-sample average \(\bar m_w = \frac1n\sum_{i} m_w(X_i)\), by adding and subtracting it:
\[ \hat\tau_{\text{DM}} = \underbrace{(\bar m_1 - \bar m_0)}_{\textstyle \tau_c} \;+\; \underbrace{(\langle m_1\rangle_{T}-\bar m_1) - (\langle m_0\rangle_{C}-\bar m_0)}_{\textstyle \text{imbalance}} \;+\; \underbrace{\langle\varepsilon(1)\rangle_{T} - \langle\varepsilon(0)\rangle_{C}}_{\textstyle \text{noise}} . \]
Adjustment models \(m_w=\mathbb{E}[Y_i(w)|X_i]\) and subtracts the imbalance, removing that variance. That is the whole payoff of covariates.
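The three-term decomposition above is an exact algebraic identity, which a small simulation can confirm. This is a minimal sketch: the linear forms chosen for \(m_0\) and \(m_1\), the noise scale, and the sample size are assumptions made purely for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)

# Hypothetical conditional means m_0, m_1 (assumed for illustration only).
m0 = 1.0 + 2.0 * x
m1 = m0 + 0.5 + 0.3 * x
eps0 = rng.normal(scale=0.5, size=n)
eps1 = rng.normal(scale=0.5, size=n)
y0, y1 = m0 + eps0, m1 + eps1

# Complete randomization: exactly n/2 units treated.
w = np.zeros(n, dtype=int)
w[rng.choice(n, size=n // 2, replace=False)] = 1
y = np.where(w == 1, y1, y0)
t, c = w == 1, w == 0

tau_dm = y[t].mean() - y[c].mean()                    # difference-in-means
tau_c = m1.mean() - m0.mean()                         # conditional ATE
imbalance = (m1[t].mean() - m1.mean()) - (m0[c].mean() - m0.mean())
noise = eps1[t].mean() - eps0[c].mean()

print(tau_dm, tau_c + imbalance + noise)              # equal up to float error
```

The imbalance term is the only piece that covariate adjustment can remove; the noise term remains whatever is done with \(X\).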
Combine an outcome model \(\hat m_w(x)\approx\mathbb{E}[Y\mid X=x,W=w]\) with an estimated propensity \(\hat e(x)\approx\Pr(W=1\mid X=x)\): \[ \hat\tau_{\text{AIPW}} = \frac1n\sum_{i}\!\Big[\hat m_1(X_i)-\hat m_0(X_i) + \frac{W_i\big(Y_i-\hat m_1(X_i)\big)}{\hat e(X_i)} - \frac{(1-W_i)\big(Y_i-\hat m_0(X_i)\big)}{1-\hat e(X_i)}\Big]. \]
Doubly robust: consistent if either \(\hat m_w\) or \(\hat e\) is correct.
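As a rough sketch of the AIPW formula, the following uses per-arm OLS as the outcome model and accepts any propensity (the known design value \(\tfrac12\) or an estimate). The helper names `fit_ols`, `predict_ols`, and `aipw`, the simulated data, and the absence of cross-fitting are all simplifying assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def aipw(y, w, X, e_hat):
    """AIPW estimate; e_hat may be a scalar or a per-unit array."""
    m1_hat = predict_ols(fit_ols(X[w == 1], y[w == 1]), X)
    m0_hat = predict_ols(fit_ols(X[w == 0], y[w == 0]), X)
    scores = (m1_hat - m0_hat
              + w * (y - m1_hat) / e_hat
              - (1 - w) * (y - m0_hat) / (1 - e_hat))
    return scores.mean()

# Hypothetical data: linear outcome with a constant effect of 0.5.
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
w = np.zeros(n, dtype=int)
w[rng.choice(n, size=n // 2, replace=False)] = 1
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.5 * w + rng.normal(scale=0.5, size=n)

tau_hat = aipw(y, w, X, e_hat=0.5)
print(tau_hat)
```

In practice the outcome models would usually be cross-fitted so that each unit's prediction comes from a model that did not see that unit.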
Personal Learning: conditional on the realized assignment, the bias comes from imbalance. Two ingredients combine: the covariates’ direct effects on the potential outcomes, and the difference in covariate distributions between the treated and control groups.
Same target \(\tau_c\), two estimators. Compare their limiting variances (50:50):
\[ \sqrt{n}\,(\hat\tau_{\text{DM}} - \tau_c) \to \mathcal{N}(0, V_{\text{DM}}), \qquad \sqrt{n}\,(\hat\tau_{\text{AIPW}} - \tau_c) \to \mathcal{N}(0, V). \] \[ \underbrace{V_{\text{DM}} = 2\,\mathrm{Var}(Y_i(1)) + 2\,\mathrm{Var}(Y_i(0))}_{\textstyle \text{full outcome variance}} \quad \underbrace{V = 2\,\mathbb{E}[\varepsilon_i(1)^2] + 2\,\mathbb{E}[\varepsilon_i(0)^2]}_{\textstyle \text{residual variance}} \]
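Since \(\mathrm{Var}(Y_i(w))=\mathrm{Var}(m_w(X_i))+\mathbb{E}[\varepsilon_i(w)^2]\), \[ V_{\text{DM}} - V = 2\,\mathrm{Var}(m_1(X_i)) + 2\,\mathrm{Var}(m_0(X_i)) \;\ge\; 0, \] exactly the outcome variation the covariates explain. Better \(m_w\) ⇒ smaller \(V\). (Fine print: this \(V_{\text{DM}}\) is the variance around \(\tau\). Around \(\tau_c\) it is smaller by \(\mathrm{Var}(m_1(X_i)-m_0(X_i))\), so the gap becomes \(\mathrm{Var}(m_1(X_i)+m_0(X_i))\); the two coincide when the CATE is constant, and the gap is nonnegative either way.)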
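To get a feel for the size of the gain, here is a back-of-the-envelope calculation. It assumes a constant additive effect and an idealized covariate that pins down each unit's type exactly, so that \(\mathrm{Var}(m_w)=\sigma_\alpha^2\) and \(\mathbb{E}[\varepsilon^2]=\sigma_\varepsilon^2\); the values \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\) are borrowed from the simulation settings, and the idealization itself is an assumption.

```python
import math

sigma_alpha, sigma_eps = 1.0, 0.5

var_m = sigma_alpha ** 2      # Var(m_w(X)) if X reveals the type exactly (assumption)
var_eps = sigma_eps ** 2      # residual variance E[eps(w)^2]

v_dm = 2 * (var_m + var_eps) + 2 * (var_m + var_eps)   # full outcome variance
v_adj = 2 * var_eps + 2 * var_eps                      # residual variance only
ci_ratio = math.sqrt(v_dm / v_adj)                     # ratio of interval widths

print(v_dm, v_adj, ci_ratio)
```

Under these assumptions the adjusted interval is about 2.2 times narrower, which is the best case: any real covariate set recovers only part of \(\mathrm{Var}(m_w)\).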
Caveat: these are asymptotic (large-\(n\)) variances; finite-sample behavior can differ.
same target, tighter intervals
Personal Learning:
Choosing a functional form is choosing an estimand, and each estimand invites a different estimator. (Let \(\tau_i = Y_i(1) - Y_i(0)\))
In Part (a), the variance \(V\) shrinks with how well we predict the baseline outcome \(Y_i(0)\). So rather than adjust for the whole vector \(X_i\), build the single best predictor of \(Y_i(0)\) and use that as the covariate.
That predictor is \(m_0(x)=\mathbb{E}[Y_i(0)\mid X_i=x]\): it minimizes the residual variance, and collapsing \(X_i\) into this one number loses nothing for the baseline mean: \(\mathbb{E}[Y_i(0)\mid X_i]=\mathbb{E}[Y_i(0)\mid m_0(X_i)]\).
Hansen (2008) calls it the prognostic score.
Build \(\hat Y_i(0)\) from features untouched by treatment: baseline covariates, or each unit’s pre-intervention outcomes.
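One way to build an out-of-fold \(\hat Y_i(0)\) is sketched below: split units into folds, fit a model on the control units outside each fold, and predict for every unit inside it. The function name `cross_fit_prognostic`, the use of OLS, and training on controls only are assumptions for illustration; the slides do not specify the learner.

```python
import numpy as np

def cross_fit_prognostic(X, y, w, n_folds=5, seed=0):
    """Out-of-fold prediction of Y(0) from pre-treatment features X.

    For each fold, an OLS model is fit on control units outside the fold
    and used to predict Y(0) for all units (treated and control) inside it.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    fold = rng.permutation(n) % n_folds          # balanced random fold labels
    Xd = np.column_stack([np.ones(n), X])        # add intercept
    y0_hat = np.empty(n)
    for k in range(n_folds):
        train = (fold != k) & (w == 0)           # controls outside fold k
        beta, *_ = np.linalg.lstsq(Xd[train], y[train], rcond=None)
        y0_hat[fold == k] = Xd[fold == k] @ beta
    return y0_hat
```

The resulting vector can then be passed as a single covariate to whatever adjustment estimator is in use. Because no unit's own outcome enters its prediction, the score behaves like a pre-treatment covariate.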
Suppose you observe the outcomes pre-intervention.
Each unit has a stable type (e.g. a habitual sharer) shaping both its past and its \(Y_i(0)\), so its past is a noisy signal of \(Y_i(0)\).
Now consider ATE estimation with two different sets of covariates: the full history or an out-of-fold \(\hat Y_i(0)\) (predicted from the history).
The history contains all the information in \(\hat Y_i(0)\), which is built from it. Does adjusting for the full history therefore do better?
Consider two cases for \(\tau_i\).
A latent type \(\alpha_i\sim\mathcal{N}(0,\sigma_\alpha^2)\) generates the pre-period outcomes \(Y_{i,t}=\alpha_i+\nu_{i,t}\) (\(t=1,\dots,T\)) and the post-intervention control outcome \(Y_{i,T+1}(0)=\alpha_i+\varepsilon_i\), with \(\nu_{i,t}\sim\mathcal{N}(0,\sigma_\nu^2)\), \(\varepsilon_i\sim\mathcal{N}(0,\sigma_\varepsilon^2)\).
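The data-generating process just described can be sketched in a few lines. The function names `simulate_panel` and `shrinkage_weight` are hypothetical. The shrinkage formula is the standard normal-normal posterior mean, \(\mathbb{E}[\alpha_i\mid \text{history}] = \lambda\,\bar Y_i\) with \(\lambda=\sigma_\alpha^2/(\sigma_\alpha^2+\sigma_\nu^2/T)\), added here to show how much the pre-period mean has to be denoised.

```python
import numpy as np

def simulate_panel(n, T, sigma_alpha=1.0, sigma_nu=0.4, sigma_eps=0.5, seed=0):
    """Latent type alpha_i drives T pre-period outcomes and the control outcome."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, sigma_alpha, size=n)
    history = alpha[:, None] + rng.normal(0.0, sigma_nu, size=(n, T))  # Y_{i,t}
    y0 = alpha + rng.normal(0.0, sigma_eps, size=n)                    # Y_{i,T+1}(0)
    return alpha, history, y0

def shrinkage_weight(sigma_alpha, sigma_nu, T):
    """Weight on the pre-period mean in E[alpha | history] under normality."""
    return sigma_alpha ** 2 / (sigma_alpha ** 2 + sigma_nu ** 2 / T)

alpha, history, y0 = simulate_panel(n=300, T=40)
print(shrinkage_weight(1.0, 0.4, 40))   # small pre-period noise: weight near 1
print(shrinkage_weight(1.0, 8.0, 40))   # large pre-period noise: heavy shrinkage
```

With small \(\sigma_\nu\) the raw history is already a clean signal of the type; with large \(\sigma_\nu\) each lag is mostly noise and the optimal summary shrinks the pre-period mean substantially.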
Heterogeneous effect, two channels:
How we score a covariate set \(S\). Because it is a simulation, each unit’s true effect \(\tau_i=Y_i(1)-Y_i(0)\) is known.
\(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); \(n\), \(T\), \(\sigma_\nu\) set per scenario (noted on each plot); 50:50 assignment. Details in appendix.
Spec: \(n=300\) units, \(T=40\) pre-periods, \(\sigma_\nu=0.4\) (small), \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); full history \(=\) OLS on all 40 lags; out-of-fold \(R^2\) for \(\tau_i\), averaged over 40 draws.
Spec: identical \(n=300\) units, \(T=40\) pre-periods, \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); only the pre-period noise changes to \(\sigma_\nu=8.0\) (big). Full history \(=\) OLS on all 40 lags; out-of-fold \(R^2\) for \(\tau_i\), averaged over 40 draws.
Denoising pays off: one cross-fit \(\hat Y(0)\) avoids the overfitting that controlling for many noisy lags induces.
Spec: \(n=4{,}000\) units, \(T=12\) pre-periods, \(\sigma_\nu=1.0\), \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); full history \(=\) random forest on all lags; out-of-fold \(R^2\) for \(\tau_i\).
The right covariate is dictated by which feature of the history the effect depends on, not by goodness-of-fit on \(Y(0)\) alone.
Under a multiplicative effect, \(\tau_i\propto Y_i(0)\) by construction, so \(\hat Y_i(0)\) directly says who responds to the treatment. For binary outcomes, assuming a logit link makes the probability-scale effect a function of the baseline probability: by the algebra of the logit, it peaks at mid-range baselines. See Athey–Keleher–Spiess (2025).
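Both cases can be written out in one line each. The constants \(\rho\) and \(\delta\) and the index \(\eta_i\) are notation introduced here for illustration; they do not appear elsewhere in the slides.

\[ \text{Multiplicative:}\quad Y_i(1) = (1+\rho)\,Y_i(0) \;\Longrightarrow\; \tau_i = \rho\,Y_i(0). \]

\[ \text{Logit:}\quad p_i(w) = \sigma(\eta_i + \delta w),\;\; \sigma(u)=\frac{1}{1+e^{-u}} \;\Longrightarrow\; p_i(1)-p_i(0) \;\approx\; \delta\,p_i(0)\big(1-p_i(0)\big) \quad\text{for small } \delta, \]

using \(\sigma'(u)=\sigma(u)\big(1-\sigma(u)\big)\). The factor \(p_i(0)\big(1-p_i(0)\big)\) is largest at \(p_i(0)=\tfrac12\), which is why the probability-scale effect peaks for mid-range baselines even though the effect is constant on the logit scale.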
Rank units by out-of-fold \(\hat Y_i(0)\), split into quantile groups \(G\), and estimate \(\mathrm{GATE}(g)=\mathbb{E}[Y_i(1)-Y_i(0)\mid G_i=g]\) (group difference-in-means or AIPW). Then read the shape:
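The ranking-and-grouping step can be sketched as follows, using a plain group-wise difference-in-means. The function name `gate_by_baseline` and the default of five groups are assumptions for illustration; an AIPW version would replace the two group means with averaged AIPW scores.

```python
import numpy as np

def gate_by_baseline(y, w, y0_hat, n_groups=5):
    """Group difference-in-means within quantile groups of predicted Y(0)."""
    ranks = np.argsort(np.argsort(y0_hat))        # 0 .. n-1, lowest baseline first
    g = (ranks * n_groups) // len(y0_hat)         # equal-sized quantile groups
    gates = np.empty(n_groups)
    for k in range(n_groups):
        in_g = g == k
        gates[k] = y[in_g & (w == 1)].mean() - y[in_g & (w == 0)].mean()
    return gates
```

A flat profile across groups is what a constant additive effect would produce, a profile rising with baseline points toward a multiplicative form, and a hump in the middle is what a logit-type effect predicts. Each group needs both treated and control units for its estimate to be defined.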
1. For estimating the ATE, randomization gives unbiasedness; covariates only buy precision, by shrinking residual variance.
2. The natural primary covariate is the out-of-fold predicted counterfactual outcome \(\hat Y_i(0)\).
3. The assumed functional form also hints at whether \(\hat Y_i(0)\) is the right summary; the GATE-by-baseline plot is a natural diagnostic for it.
Three different conditioning sets, three different answers:
Counterintuitively, plugging in an estimated \(\hat e(x)\) is (weakly) more efficient than using the known \(e\equiv\tfrac12\) (Hirano–Imbens–Ridder 2003).
Intuition: regressing \(W\) on \(X\) lets \(\hat e\) absorb the realized correlation between assignment and covariates: it auto-corrects the chance imbalance. The oracle \(\tfrac12\) ignores that your sample over-treated some \(X\) values.
Finite samples? The HIR theorem is asymptotic. In practice estimated \(\hat e\) usually lowers variance too (it absorbs the realized imbalance), but it is not a universal finite-sample guarantee, and an over-flexible propensity model can backfire.
For AIPW with a consistent outcome model, the known and the estimated propensity both attain the efficiency bound \(V\), so the difference washes out asymptotically. The gain from estimating \(\hat e\) shows up for plain IPW, or when \(\hat m_w\) is poor. (Distinct from the realized fraction \(n_1/n\) vs design \(\tfrac12\): the Hájek ratio normalization is likewise more stable than Horvitz–Thompson.)
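The Horvitz–Thompson versus Hájek contrast is easy to see on a toy example. The sketch below uses a hypothetical four-unit sample in which every outcome equals 5 and three units happen to be treated, as could occur under coin-flip assignment. The function names are illustrative.

```python
import numpy as np

def ipw_ht(y, w, e):
    """Horvitz-Thompson IPW: divides by n, using the design propensity e."""
    return np.mean(w * y / e - (1 - w) * y / (1 - e))

def ipw_hajek(y, w, e):
    """Hajek IPW: normalizes each arm by its realized sum of weights."""
    w1, w0 = w / e, (1 - w) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Toy data (hypothetical): no treatment effect at all, but 3 of 4 units treated.
y = np.array([5.0, 5.0, 5.0, 5.0])
w = np.array([1, 1, 1, 0])

print(ipw_ht(y, w, 0.5))     # 5.0: the chance imbalance leaks into the estimate
print(ipw_hajek(y, w, 0.5))  # 0.0: ratio normalization absorbs it
```

Horvitz–Thompson is sensitive to the level of the outcome whenever the realized treated fraction differs from the design value, while the Hájek form is invariant to adding a constant to every outcome.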
Metric per covariate set \(S\): out-of-fold \[ R^2_S = 1 - \frac{\sum_i \big(\tau_i - \hat g_S(S_i)\big)^2}{\sum_i (\tau_i - \bar\tau)^2}, \] where \(S_i\) is unit \(i\)’s value of the covariates in \(S\) and \(\hat g_S\) is a 5-fold cross-fitted prediction of \(\tau_i\) from \(S\).
Why we can compute it: in the simulation each unit’s true effect \(\tau_i=Y_i(1)-Y_i(0)\) is known, so we regress it directly on \(S\) and read off the heterogeneity \(S\) explains. (In real data \(\tau_i\) is unseen: you’d instead regress an AIPW pseudo-outcome, whose conditional mean given \(X_i\) is the CATE.)
Learners. Full history \(\to\) OLS on all lags (Case 1, so it can overfit) or a random forest (Case 2, for the nonlinear volatility). Prognostic \(\hat Y(0)\) \(\to\) a cross-fit prediction of \(Y(0)\), used as one covariate.
Read it: higher \(R^2_S \Rightarrow\) \(S\) carries more of the effect’s variation \(\Rightarrow\) a better basis for CATE estimation / targeting.
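The metric above can be computed with a short cross-fitting loop. This sketch uses OLS as the learner for \(\hat g_S\) (the Case 1 choice). The function name `out_of_fold_r2` and the fold-assignment scheme are assumptions, and a random forest would slot in where the least-squares fit is.

```python
import numpy as np

def out_of_fold_r2(S, tau, n_folds=5, seed=0):
    """Out-of-fold R^2 of the true effects tau on covariate set S (OLS learner)."""
    rng = np.random.default_rng(seed)
    n = len(tau)
    fold = rng.permutation(n) % n_folds
    Sd = np.column_stack([np.ones(n), S])        # add intercept
    pred = np.empty(n)
    for k in range(n_folds):
        train, test = fold != k, fold == k
        beta, *_ = np.linalg.lstsq(Sd[train], tau[train], rcond=None)
        pred[test] = Sd[test] @ beta             # predictions never see own fold
    return 1.0 - np.sum((tau - pred) ** 2) / np.sum((tau - tau.mean()) ** 2)
```

Because predictions are out-of-fold, the score can be negative when a covariate set overfits, which is the failure mode to watch for when many noisy lags enter an OLS fit.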