Covariates in Randomized Experiments

Why bother, and what we gain from using predictions as covariates

Wooyong Park

Acknowledgements & disclosure

AI disclaimer: These slides were built collaboratively with Claude Opus 4.8: I supplied the outline and table of contents, Claude drafted and refined the slides through our back-and-forth, and I have reviewed and edited every slide carefully for accuracy.

Roadmap: two questions, three claims

Part (a): why bother? We randomized; covariates cannot create or fix bias.

Claim 1. Difference-in-means is unbiased for the ATE; covariates only buy precision.

Part (b): predictions as covariates. Use a prediction \(\hat Y_i(0)\): when does it help, and when is it better than the full pre-intervention history?

Claim 2. Often, the most important component of the covariates is \(\hat Y_i(0)\), the prognostic score.

Claim 3. How we think about the functional form (additive? multiplicative? logit?) of the treatment effects shapes how important \(\hat Y_i(0)\) is in our estimation.

Our working belief: when individuals are predictable from their pre-intervention history, the model can better explain both their outcomes and their treatment effects.

Part (a): Why bother under randomization

Setup & notation

Binary treatment \(W_i \in \{0,1\}\); potential outcomes \(Y_i(0), Y_i(1)\); covariates \(X_i \in \mathbb{R}^K\). Observed \(Y_i = Y_i(W_i)\) (i.e., no spillovers).

Sampling. \(\big(Y_i(0), Y_i(1), X_i\big) \overset{\text{iid}}{\sim} \mu\), a population measure over (control outcome, treated outcome, covariate) triples.

Design (50:50). A random half of the units is assigned to treatment (complete randomization), so \(\Pr(W_i=1)=\tfrac12\) and \(W_i \perp \big(Y_i(0),Y_i(1),X_i\big)\).

Two estimands

\[ \underbrace{\tau = \mathbb{E}_\mu\!\left[Y_i(1)-Y_i(0)\right] = \!\!\int_{\mathcal{Y}\times\mathcal{Y}\times\mathcal{X}}\!\!(y_1-y_0)\,\mu(dy_0,dy_1,dx)}_{\textstyle \text{population ATE}} \] \[ \underbrace{\tau_c = \frac1n\sum_{i=1}^n \mathbb{E}_\mu\!\left[Y_i(1)-Y_i(0)\mid X_i\right]}_{\textstyle \text{conditional ATE: population CATE averaged over the realized } X_i} \]

Population vs. conditional ATE

\(\tau\) averages the effect over the whole population.
\(\tau_c\) averages the population CATE \(c(X_i)=\mathbb{E}[Y_i(1)-Y_i(0)\mid X_i]\) over just the \(n\) covariate values in this experiment.
- If we draw 700 men and 300 women but the population is balanced, \(\tau_c\) can depart from \(\tau\) if the effect differs by gender.
\(\tau\) is fixed while \(\tau_c\) is random (through the \(X_i\)), and they agree on average: \(\mathbb{E}[\tau_c]=\tau\).
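
The men/women example can be made concrete with a small sketch. The subgroup effects below (1.0 for men, 3.0 for women) are hypothetical numbers chosen only for illustration; the 700/300 split and the balanced population come from the example above.

```python
# Hypothetical subgroup CATEs (assumed for illustration only).
cate = {"men": 1.0, "women": 3.0}

# Balanced population vs. the realized sample of 700 men and 300 women.
population_share = {"men": 0.5, "women": 0.5}
sample_counts = {"men": 700, "women": 300}

def weighted_effect(weights):
    """Average the subgroup CATEs with the given weights (normalized to sum to 1)."""
    total = sum(weights.values())
    return sum(cate[g] * weights[g] / total for g in cate)

tau = weighted_effect(population_share)  # population ATE
tau_c = weighted_effect(sample_counts)   # conditional ATE for this sample

print(f"tau = {tau:.2f}, tau_c = {tau_c:.2f}")
```

Under these assumed values \(\tau = 2.0\) while \(\tau_c = 1.6\): the sample over-represents the group with the smaller effect.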

We will focus on \(\tau_c\) (cleaner); estimators today are also valid for \(\tau\) but carry an extra noise term, \(\mathrm{Var}(\text{CATE})/n\), from treatment-effect heterogeneity entering through the random draw of covariates.

Why bother? Start from difference-in-means

Difference-in-means is unbiased, but needlessly noisy.

The estimator is just two group averages (\(n_1=n_0=\tfrac n2\)): \[ \hat\tau_{\text{DM}} = \bar Y_1 - \bar Y_0, \qquad \bar Y_w = \frac{1}{n_w}\sum_{i:\,W_i=w} Y_i = \frac{1}{n_w}\sum_{i:\,W_i=w} Y_i(w). \]

Split each potential outcome into a part predictable from \(X\) and a residual: \[ Y_i(w) = \underbrace{m_{w}(X_i)}_{\mathbb{E}[Y_i(w)\mid X_i]} + \varepsilon_i(w), \qquad \mathbb{E}[\varepsilon_i(w)\mid X_i]=0 . \]

Average within each group and subtract (\(\langle\cdot\rangle_{T},\langle\cdot\rangle_{C}\): treated/control averages): \[ \hat\tau_{\text{DM}} = \big[\,\langle m_1(X_i)\rangle_{T} - \langle m_0(X_i)\rangle_{C}\,\big] \;+\; \big[\,\langle\varepsilon(1)\rangle_{T} - \langle\varepsilon(0)\rangle_{C}\,\big]. \]

Correctable Imbalance, Irreducible Noise

Compare each group’s prognostic average to the full-sample average \(\bar m_w = \frac1n\sum_{i} m_w(X_i)\), by adding and subtracting it:

\[ \hat\tau_{\text{DM}} = \underbrace{(\bar m_1 - \bar m_0)}_{\textstyle \tau_c} \;+\; \underbrace{(\langle m_1\rangle_{T}-\bar m_1) - (\langle m_0\rangle_{C}-\bar m_0)}_{\textstyle \text{imbalance}} \;+\; \underbrace{\langle\varepsilon(1)\rangle_{T} - \langle\varepsilon(0)\rangle_{C}}_{\textstyle \text{noise}} . \]

\(\tau_c=\bar m_1-\bar m_0\) is the conditional ATE (\(\mathbb{E}[\tau_c]=\tau\)).

Imbalance. Say the outcome grows with a covariate \(X\), but in our sample the treated group (\(\langle m_1\rangle_{T}\)) is mostly low-\(X\) units. Then the treated look worse than a representative group (\(\bar{m}_1\)) would; that gap is the imbalance.

Noise is orthogonal to \(X\) (\(\mathbb{E}[\varepsilon\mid X]=0\)): covariates cannot touch it.

Adjustment models \(m_w=\mathbb{E}[Y_i(w)|X_i]\) and subtracts the imbalance, removing that variance. That is the whole payoff of covariates.
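
The three-way split can be checked numerically. The sketch below simulates one 50:50 experiment in which the conditional means are known; the specific forms \(m_0(x)=2x\) and \(m_1(x)=1+2.5x\) and the unit-variance residuals are assumptions made for illustration.

```python
import random

def decompose_dm(n=1000, seed=0):
    """Split one simulated difference-in-means into tau_c + imbalance + noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Hypothetical conditional means: m0(x) = 2x and m1(x) = 1 + 2.5x.
    m0 = [2.0 * xi for xi in x]
    m1 = [1.0 + 2.5 * xi for xi in x]
    eps0 = [rng.gauss(0, 1) for _ in range(n)]
    eps1 = [rng.gauss(0, 1) for _ in range(n)]
    # Complete randomization: exactly half of the units are treated.
    treated = set(rng.sample(range(n), n // 2))
    T = [i for i in range(n) if i in treated]
    C = [i for i in range(n) if i not in treated]
    everyone = list(range(n))

    def avg(values, idx):
        return sum(values[i] for i in idx) / len(idx)

    y1 = [m + e for m, e in zip(m1, eps1)]
    y0 = [m + e for m, e in zip(m0, eps0)]
    dm = avg(y1, T) - avg(y0, C)
    tau_c = avg(m1, everyone) - avg(m0, everyone)
    imbalance = (avg(m1, T) - avg(m1, everyone)) - (avg(m0, C) - avg(m0, everyone))
    noise = avg(eps1, T) - avg(eps0, C)
    return {"dm": dm, "tau_c": tau_c, "imbalance": imbalance, "noise": noise}

parts = decompose_dm()
print(parts)
```

The identity holds exactly in every draw (up to floating-point rounding); only the imbalance piece is something a covariate model can remove.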

The adjusted estimator: AIPW

Combine an outcome model \(\hat m_w(x)\approx\mathbb{E}[Y\mid X=x,W=w]\) with an estimated propensity \(\hat e(x)\approx\Pr(W=1\mid X=x)\): \[ \hat\tau_{\text{AIPW}} = \frac1n\sum_{i}\!\Big[\hat m_1(X_i)-\hat m_0(X_i) + \frac{W_i\big(Y_i-\hat m_1(X_i)\big)}{\hat e(X_i)} - \frac{(1-W_i)\big(Y_i-\hat m_0(X_i)\big)}{1-\hat e(X_i)}\Big]. \]

Doubly robust: consistent if either \(\hat m_w\) or \(\hat e\) is correct.
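
A minimal sketch of the AIPW formula for a single covariate, using the known design propensity \(e=\tfrac12\). The helper names (`fit_line`, `aipw`) are hypothetical, and the default outcome model (a separate straight line per arm, fit in-sample) is an illustrative stand-in; a flexible learner would need cross-fitting.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one covariate."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return yb - slope * xb, slope

def aipw(x, w, y, e=0.5, outcome_models=None):
    """AIPW point estimate with a constant propensity e.

    outcome_models: optional pair (m0_hat, m1_hat) of callables. If omitted,
    a straight line is fit separately in each arm (illustrative choice).
    """
    if outcome_models is None:
        a0, b0 = fit_line([xi for xi, wi in zip(x, w) if wi == 0],
                          [yi for yi, wi in zip(y, w) if wi == 0])
        a1, b1 = fit_line([xi for xi, wi in zip(x, w) if wi == 1],
                          [yi for yi, wi in zip(y, w) if wi == 1])
        m0_hat = lambda v: a0 + b0 * v
        m1_hat = lambda v: a1 + b1 * v
    else:
        m0_hat, m1_hat = outcome_models
    total = 0.0
    for xi, wi, yi in zip(x, w, y):
        mu0, mu1 = m0_hat(xi), m1_hat(xi)
        # Model-based contrast plus inverse-probability-weighted residuals.
        total += mu1 - mu0 + wi * (yi - mu1) / e - (1 - wi) * (yi - mu0) / (1 - e)
    return total / len(x)
```

With the outcome model set to zero and equal arm sizes, the formula collapses to difference-in-means, which is one way to see that the outcome model only re-centres the residuals being averaged.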

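Personal Learning: whatever error covariates can correct comes from imbalance. It has two ingredients: the covariates’ direct effect on the potential outcomes, and chance differences in the covariate distribution between the treated and control groups. (It is a bias only conditional on the realized assignment; over repeated randomizations it is variance.)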

Variance-reduction algebra

Same target \(\tau_c\), two estimators. Compare their limiting variances (50:50):

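\[ \sqrt{n}\,(\hat\tau_{\text{DM}} - \tau_c) \to \mathcal{N}(0, V_{\text{DM}}), \qquad \sqrt{n}\,(\hat\tau_{\text{AIPW}} - \tau_c) \to \mathcal{N}(0, V). \] \[ \underbrace{V_{\text{DM}} = 2\,\mathrm{Var}(Y_i(1)) + 2\,\mathrm{Var}(Y_i(0)) - \mathrm{Var}(c(X_i))}_{\textstyle \text{outcome variance, net of CATE sampling noise}} \quad \underbrace{V = 2\,\mathbb{E}[\varepsilon_i(1)^2] + 2\,\mathbb{E}[\varepsilon_i(0)^2]}_{\textstyle \text{residual variance}} \]

Here \(c(X_i)=m_1(X_i)-m_0(X_i)\) is the CATE; its variance is subtracted because the target is \(\tau_c\) rather than \(\tau\). Since \(\mathrm{Var}(Y_i(w))=\mathrm{Var}(m_w(X_i))+\mathbb{E}[\varepsilon_i(w)^2]\), \[ V_{\text{DM}} - V = 2\,\mathrm{Var}(m_1(X_i)) + 2\,\mathrm{Var}(m_0(X_i)) - \mathrm{Var}\big(m_1(X_i)-m_0(X_i)\big) = \mathrm{Var}\big(m_1(X_i)+m_0(X_i)\big) \;\ge\; 0, \] exactly the variance of the imbalance term, i.e. the outcome variation the covariates explain. (With a constant effect, \(m_1-m_0\) is constant and the gap is \(2\,\mathrm{Var}(m_1(X_i)) + 2\,\mathrm{Var}(m_0(X_i))\).) Better \(m_w\) ⇒ smaller \(V\).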
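
A Monte Carlo sketch of the variance comparison, under assumptions chosen to keep the arithmetic transparent: a constant treatment effect (\(m_0(x)=x\), \(m_1(x)=x+1\), so the CATE has zero variance), \(X\sim\mathcal{N}(0,1)\), residual standard deviation 0.5, and an oracle AIPW that uses the true \(m_w\) and \(e=\tfrac12\). Under these assumptions the theoretical values are \(V_{\text{DM}}=5\) and \(V=1\).

```python
import random
import statistics

def scaled_variances(n=100, reps=2000, seed=1):
    """Monte Carlo n*Var(estimator - tau_c) for diff-in-means and oracle AIPW."""
    rng = random.Random(seed)
    half = n // 2
    dm_err, aipw_err = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        eps = [rng.gauss(0, 0.5) for _ in range(n)]
        # Units are i.i.d., so treating the first half is a valid 50:50 assignment.
        # Assumed model: m0(x) = x, m1(x) = x + 1, so tau_c = 1 in every draw.
        y_treated = [x[i] + 1.0 + eps[i] for i in range(half)]
        y_control = [x[i] + eps[i] for i in range(half, n)]
        dm = statistics.fmean(y_treated) - statistics.fmean(y_control)
        # Oracle AIPW (true m_w, e = 1/2) equals tau_c plus the residual noise term.
        noise = statistics.fmean(eps[:half]) - statistics.fmean(eps[half:])
        dm_err.append(dm - 1.0)
        aipw_err.append(noise)
    return n * statistics.pvariance(dm_err), n * statistics.pvariance(aipw_err)

v_dm, v_aipw = scaled_variances()
print(f"n*Var(DM) ~ {v_dm:.2f}, n*Var(oracle AIPW) ~ {v_aipw:.2f}")
```

The simulated values should land near 5 and 1, so in this assumed design adjustment shrinks the interval width by roughly \(\sqrt5\).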

Caveat: asymptotic.

Athey, Cersosimo, Koutout & Li (2025)

Same target, tighter intervals

Same point estimates across estimators; narrower intervals for the two covariate-adjusted methods.

Part (b): Predictions as covariates

Personal Learning:

Predictions of \(Y(0)\) are essential covariates.
Building these predictions requires care on two fronts: no treatment information may leak in, and each prediction must be out-of-fold.
Functional form of the treatment effect matters.

Functional form → estimand → estimator

Choosing a functional form is choosing an estimand, and each estimand invites a different estimator. (Let \(\tau_i = Y_i(1) - Y_i(0)\))

Additive: “how many more/fewer?” Effect in levels, \(Y_i(1)=Y_i(0)+\tau+\varepsilon_i\). → diff-in-means / OLS / AIPW.
Multiplicative: “what % change vs. baseline?” \(Y_i(1) = (1+\tau)\, Y_i(0)\,\eta_i\), so \(\tau_i = \big((1+\tau)\eta_i - 1\big)\,Y_i(0)\).
- The baseline level itself is the effect modifier: knowing \(Y_i(0)\) becomes essential.
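
A short worked step, reading the multiplicative model as \(Y_i(1) = (1+\tau)\,Y_i(0)\,\eta_i\). The normalization \(\mathbb{E}[\eta_i \mid Y_i(0)] = 1\) is an assumption added here for the sketch: \[ \tau_i = Y_i(1) - Y_i(0) = \big((1+\tau)\eta_i - 1\big)\,Y_i(0), \qquad \mathbb{E}[\tau_i \mid Y_i(0)] = \tau\,Y_i(0), \] so the expected unit-level effect is proportional to the baseline level.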

Binary outcomes, extensive margin (whether):
- logit \(\;\mathrm{logit}\,\Pr(Y_i(t)=1\mid X)=\alpha(X)+\beta t\;\) → constant in log-odds, but the effect on the probability is largest at intermediate baseline. (FAFSA; Athey, Keleher, and Spiess (2025))
- LPM \(\;\Pr(Y_i(t)=1\mid X)=\alpha(X)+\beta t\;\) → additive, flat in baseline.
Intensive margin (how much, given \(Y>0\)): additive, multiplicative, …

Same data, different functional form → different estimand → different conclusion.
The distribution of the baseline data matters: e.g., if it is a count variable, how zero-inflated is it?
And under multiplicative effects, a good prediction of \(Y_i(0)\) is the key input!

A prediction of \(Y(0)\) as the primary covariate

In Part (a), the variance \(V\) shrinks with how well we predict the baseline outcome \(Y_i(0)\). So rather than adjust for the whole vector \(X_i\), build the single best predictor of \(Y_i(0)\) and use that as the covariate.

That predictor is \(m_0(x)=\mathbb{E}[Y_i(0)\mid X_i=x]\): it minimizes the residual variance, and collapsing \(X_i\) into this one number loses nothing for the baseline mean: \(\mathbb{E}[Y_i(0)\mid X_i]=\mathbb{E}[Y_i(0)\mid m_0(X_i)]\).
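
One way to see the last equality, using only the tower property and the fact that \(m_0(X_i)\) is a function of \(X_i\): \[ \mathbb{E}[Y_i(0)\mid m_0(X_i)] = \mathbb{E}\big[\,\mathbb{E}[Y_i(0)\mid X_i]\;\big|\; m_0(X_i)\big] = \mathbb{E}[\,m_0(X_i)\mid m_0(X_i)] = m_0(X_i) = \mathbb{E}[Y_i(0)\mid X_i]. \]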

Hansen (2008) calls it the prognostic score.

Building \(\hat Y_i(0)\): two rules

Build \(\hat Y_i(0)\) from features untouched by treatment: baseline covariates, or each unit’s pre-intervention outcomes.

Rule (a): no treatment information may leak in → fit the predictor on pre-intervention data or on control units only.
Rule (b): the prediction must be out-of-fold / out-of-bag: never trained on a unit’s own outcome (cross-fitting), or overfitting biases the AIPW adjustment and later HTE analyses.
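
The two rules can be sketched together as a K-fold loop. This is an illustration under assumptions, not the procedure used for these slides: a one-feature least-squares line stands in for whatever learner is actually used, and the function names are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for one feature."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return yb - slope * xb, slope

def cross_fit_y0(x, y, w, k=5, seed=0):
    """Out-of-fold predictions of Y(0), trained on control units only.

    x: one pre-treatment feature per unit; y: observed outcome; w: 0/1 treatment.
    """
    n = len(x)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::k] for j in range(k)]
    y0_hat = [None] * n
    for fold in folds:
        held_out = set(fold)
        # Rule (a): controls only. Rule (b): never train on the held-out fold.
        train = [i for i in idx if i not in held_out and w[i] == 0]
        if len(train) < 2:
            raise ValueError("need at least two control units outside each fold")
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            y0_hat[i] = a + b * x[i]
    return y0_hat
```

Because a unit's own outcome never enters the model that predicts it, changing that outcome leaves its prediction unchanged, which is a quick sanity check for leakage.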

Using the pre-intervention history

Suppose you observe the outcomes pre-intervention.

Each unit has a stable type (e.g. a habitual sharer) shaping both its past and its \(Y_i(0)\), so its past is a noisy signal of \(Y_i(0)\).

Now consider ATE estimation with two different sets of covariates: the full history or an out-of-fold \(\hat Y_i(0)\) (predicted from the history).

The full history contains all the information in \(\hat Y_i(0)\), which is built from it. Does that make it the better covariate set?

Consider two cases of \(\tau_i\)

Case 1: \(\tau_i \propto Y_i(0)\) (the level).
Case 2: \(\tau_i \propto \mathrm{sd}\big(\{Y_{i,t}\}\big)\) (volatility).

How the simulation is built

A latent type \(\alpha_i\sim\mathcal{N}(0,\sigma_\alpha^2)\) generates a pre-period \(Y_{i,t}=\alpha_i+\nu_{i,t}\) (\(t=1,\dots,T\)) and the post-intervention control outcome \(Y_{i,T+1}(0)=\alpha_i+\varepsilon_i\), with \(\nu_{i,t}\sim\mathcal{N}(0,\sigma_\nu^2)\), \(\varepsilon_i\sim\mathcal{N}(0,\sigma_\varepsilon^2)\).

Heterogeneous effect, two channels:

Case 1 (level): \(\tau_i\propto Y_{i,T+1}(0)\).
Case 2 (volatility): \(\tau_i\propto \mathrm{sd}(\{Y_{i,t}\})\).

How we score a covariate set \(S\). Because it is a simulation, each unit’s true effect \(\tau_i=Y_i(1)-Y_i(0)\) is known.

First, predict the control outcome with a 5-fold cross-fitted random forest to obtain \(\hat Y_i(0)\).
Interact the covariates with the treatment (\(Y_i \sim W_i + X_i + W_iX_i\)).
Report the out-of-fold \(R^2\) between the true \(\tau_i\) and the estimated \(\hat\tau(X_i)\): the share of the effect’s variation that \(S\) captures.

(Full history \(=\) OLS on all lags in Case 1, a random forest in Case 2; prognostic \(=\) a cross-fit \(\hat Y_i(0)\))

\(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); \(n\), \(T\), \(\sigma_\nu\) set per scenario (noted on each plot); 50:50 assignment. Details in appendix.
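
A sketch of the data-generating process described above. The proportionality constant `scale` is hypothetical (the slides only state \(\propto\)), and the volatility case here uses the common \(\sigma_\nu\) for every unit because no unit-specific volatility is specified; the actual simulation code may differ.

```python
import random
import statistics

def simulate_units(n, T, sigma_nu, case, sigma_alpha=1.0, sigma_eps=0.5,
                   scale=1.0, seed=0):
    """Draw histories, potential outcomes, true effects and a 50:50 assignment.

    case="level":      tau_i = scale * Y_{i,T+1}(0)
    case="volatility": tau_i = scale * sd(Y_{i,1}, ..., Y_{i,T})
    """
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        alpha = rng.gauss(0, sigma_alpha)                      # latent type
        history = [alpha + rng.gauss(0, sigma_nu) for _ in range(T)]
        y0 = alpha + rng.gauss(0, sigma_eps)                   # control outcome
        if case == "level":
            tau = scale * y0
        elif case == "volatility":
            tau = scale * statistics.stdev(history)
        else:
            raise ValueError("case must be 'level' or 'volatility'")
        units.append({"history": history, "y0": y0, "y1": y0 + tau, "tau": tau})
    # Complete randomization: exactly half of the units are treated.
    treated = set(rng.sample(range(n), n // 2))
    for i, u in enumerate(units):
        u["w"] = 1 if i in treated else 0
        u["y"] = u["y1"] if u["w"] == 1 else u["y0"]
    return units

small_noise = simulate_units(n=300, T=40, sigma_nu=0.4, case="level")
```

Since the true `tau` is stored for every unit, any covariate set can be scored against it directly, which is what makes the \(R^2\) comparison possible.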

Case 1 (level): small pre-period noise

Small noise: \(\hat Y(0)\) ties the naive full history.

Spec: \(n=300\) units, \(T=40\) pre-periods, \(\sigma_\nu=0.4\) (small), \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); full history \(=\) OLS on all 40 lags; out-of-fold \(R^2\) for \(\tau_i\), averaged over 40 draws.

Case 1 (level): big pre-period noise

Big noise: \(\hat Y(0)\) achieves nearly double the \(R^2\) of the naive full-history OLS.

Spec: identical to the small-noise case (\(n=300\) units, \(T=40\) pre-periods, \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\)); only the pre-period noise changes, to \(\sigma_\nu=8.0\) (big). Full history \(=\) OLS on all 40 lags; out-of-fold \(R^2\) for \(\tau_i\), averaged over 40 draws.

Denoising pays off: one cross-fit \(\hat Y(0)\) avoids the overfitting that controlling for many noisy lags induces.
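
A back-of-the-envelope sketch of how much signal the history carries in the two Case 1 scenarios. Under the stated Gaussian model with exchangeable lags, the pre-period mean \(\bar Y_i\) summarizes everything the history says about \(\alpha_i\), with reliability \(\lambda = \sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\nu^2/T)\). The function name is hypothetical.

```python
def history_signal(T, sigma_nu, sigma_alpha=1.0, sigma_eps=0.5):
    """Reliability of the pre-period mean and the best attainable R^2 for Y(0)."""
    var_alpha = sigma_alpha ** 2
    # Share of Var(pre-period mean) that is due to the latent type alpha.
    reliability = var_alpha / (var_alpha + sigma_nu ** 2 / T)
    # Squared correlation between Y(0) and the pre-period mean.
    best_r2 = reliability * var_alpha / (var_alpha + sigma_eps ** 2)
    return reliability, best_r2

for label, sigma_nu in (("small noise", 0.4), ("big noise", 8.0)):
    rel, r2 = history_signal(T=40, sigma_nu=sigma_nu)
    print(f"{label}: reliability = {rel:.3f}, best R^2 for Y(0) = {r2:.3f}")
```

The reliability is about 0.996 with \(\sigma_\nu=0.4\) and about 0.385 with \(\sigma_\nu=8\). In the noisy case each individual lag is mostly noise, so a regression with a separate coefficient (and treatment interaction) for each of the 40 lags has many weakly identified parameters to estimate from 300 units, while a single denoised score has one.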

When the full history wins

Case 2 (volatility): \(\hat Y(0)\) is uninformative; the full history wins.

Spec: \(n=4{,}000\) units, \(T=12\) pre-periods, \(\sigma_\nu=1.0\), \(\sigma_\alpha=1\), \(\sigma_\varepsilon=0.5\); full history \(=\) random forest on all lags; out-of-fold \(R^2\) for \(\tau_i\).

The right covariate is dictated by which feature of the history the effect depends on, not by goodness-of-fit on \(Y(0)\) alone.

Why predictions help: effects that depend on the baseline

Effect on \(\Pr(Y{=}1)\) vs. baseline \(\Pr(Y(0){=}1)\): logit (constant log-odds) peaks at intermediate baseline; LPM stays flat.

Under a multiplicative effect, \(\tau_i\propto Y_i(0)\) by construction, so \(\hat Y_i(0)\) directly indicates who responds to the treatment. For binary outcomes, assuming a logit link makes the probability-scale effect a function of the baseline (by the algebra of the logit, it peaks mid-baseline); see Athey, Keleher, and Spiess (2025).
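
The logit algebra can be tabulated directly. In this sketch the log-odds effect \(\beta=0.5\) is a hypothetical value; the probability-scale effect at baseline \(p_0\) is \(\sigma(\mathrm{logit}(p_0)+\beta)-p_0\), which is largest where \(\mathrm{logit}(p_0)=-\beta/2\), i.e. near \(p_0=\tfrac12\) for modest \(\beta\).

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def logit_effect(p0, beta):
    """Probability-scale effect of a constant log-odds shift beta at baseline p0."""
    return expit(logit(p0) + beta) - p0

beta = 0.5  # hypothetical constant log-odds effect
for p0 in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"baseline {p0:.2f}: effect on Pr(Y=1) = {logit_effect(p0, beta):+.3f}")

# Baseline at which the probability-scale effect is largest.
p_star = expit(-beta / 2)
```

Under an LPM the same table would show one constant value at every baseline, which is the contrast the figure draws.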

Diagnostic: GATE by predicted baseline

Example Figure: Each point is a baseline-quantile group: its mean predicted \(\hat Y(0)\) (x) vs. its estimated GATE (y).

Rank units by out-of-fold \(\hat Y_i(0)\), split into quantile groups \(G\), and estimate \(\mathrm{GATE}(g)=\mathbb{E}[Y_i(1)-Y_i(0)\mid G_i=g]\) (group difference-in-means or AIPW). Then read the shape:

Flat → effect independent of baseline (additive).
Linear up/down → effect scales with baseline (multiplicative).
Inverted-U → logit-type, largest at intermediate baseline.
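
A minimal sketch of the diagnostic using group difference-in-means (the AIPW variant is omitted for brevity). The function name and the choice of five groups are illustrative assumptions; `y0_hat` is assumed to hold out-of-fold predictions.

```python
def gate_by_baseline(y0_hat, w, y, n_groups=5):
    """Difference-in-means GATE within quantile groups of predicted baseline.

    Returns one (mean predicted baseline, GATE estimate) pair per group,
    ordered from the lowest to the highest predicted baseline.
    """
    order = sorted(range(len(y)), key=lambda i: y0_hat[i])
    n = len(order)
    results = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        treated = [y[i] for i in members if w[i] == 1]
        control = [y[i] for i in members if w[i] == 0]
        if not treated or not control:
            raise ValueError("each group needs both treated and control units")
        gate = sum(treated) / len(treated) - sum(control) / len(control)
        mean_baseline = sum(y0_hat[i] for i in members) / len(members)
        results.append((mean_baseline, gate))
    return results
```

Plotting the returned pairs gives the figure described above; with few units per group the points are noisy, so the shape is best read alongside interval estimates.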

Summary: one line per claim

1. For estimating the ATE, randomization gives unbiasedness; covariates only buy precision, by shrinking residual variance.

2. The natural primary covariate is the out-of-fold predicted counterfactual outcome \(\hat Y_i(0)\).

3. The assumed functional form also hints at whether \(\hat Y_i(0)\) is the right summary; the GATE-by-baseline plot can serve as a diagnostic for it.

Appendix

Appendix: is difference-in-means biased?

Three different conditioning sets, three different answers:

Conditional on the realized assignment \(\{W_i\}\): yes. Once you fix which units landed in each arm, the treated and control groups have different covariate profiles, so their baseline means differ. \(\hat\tau_{\text{DM}}\) then blends the true effect with that composition gap, off by the realized imbalance \((\langle m_1\rangle_T-\bar m_1)-(\langle m_0\rangle_C-\bar m_0)\).

Conditional on the treated fraction (\(\tfrac12\)) and \(\{X_i\}\): no. The imbalance term has expectation zero over randomization.

Unconditional (overall ATE \(\tau\)): no. \(\mathbb{E}[\hat\tau_{\text{DM}}]=\tau\). The imbalance is variance, not bias; adjustment removes it.
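
The contrast between a single assignment and the average over assignments can be checked exactly on a tiny example. The six pairs of potential outcomes below are hypothetical. The sketch holds the potential outcomes fixed (a slightly stronger conditioning than fixing \(\{X_i\}\) alone) and enumerates every way of treating exactly half of the units.

```python
from itertools import combinations

def dm_over_all_assignments(y0, y1):
    """Difference-in-means for every assignment that treats exactly half the units."""
    n = len(y0)
    estimates = []
    for treated in combinations(range(n), n // 2):
        t = set(treated)
        mean_t = sum(y1[i] for i in t) / len(t)
        mean_c = sum(y0[i] for i in range(n) if i not in t) / (n - len(t))
        estimates.append(mean_t - mean_c)
    return estimates

# Hypothetical potential outcomes for six units.
y0 = [1.0, 2.0, 4.0, 4.0, 7.0, 9.0]
y1 = [2.0, 2.0, 7.0, 5.0, 9.0, 9.0]

estimates = dm_over_all_assignments(y0, y1)
sample_ate = sum(a - b for a, b in zip(y1, y0)) / len(y0)
mean_estimate = sum(estimates) / len(estimates)

print(f"sample ATE = {sample_ate:.4f}, mean over assignments = {mean_estimate:.4f}")
print(f"range of single-assignment estimates: {min(estimates):.2f} to {max(estimates):.2f}")
```

Individual assignments miss the sample ATE, sometimes by a lot, yet their average matches it exactly: the miss is variance over the randomization, not bias.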

Appendix: oracle vs. estimated propensity

Counterintuitively, plugging in an estimated \(\hat e(x)\) is (weakly) more efficient than using the known \(e\equiv\tfrac12\) (Hirano–Imbens–Ridder 2003).

Intuition: regressing \(W\) on \(X\) lets \(\hat e\) absorb the realized correlation between assignment and covariates: it auto-corrects the chance imbalance. The oracle \(\tfrac12\) ignores that your sample over-treated some \(X\) values.

Finite samples? The HIR theorem is asymptotic. In practice estimated \(\hat e\) usually lowers variance too (it absorbs the realized imbalance), but it is not a universal finite-sample guarantee, and an over-flexible propensity model can backfire.

For AIPW with a consistent outcome model, both already hit the efficiency bound \(V\), so it washes out asymptotically. The gain from estimating \(\hat e\) shows up for plain IPW, or when \(\hat m_w\) is poor. (Distinct from the realized fraction \(n_1/n\) vs design \(\tfrac12\): the Hájek ratio normalization is likewise more stable than Horvitz–Thompson.)
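
A small Monte Carlo sketch of the plain-IPW case, under assumptions chosen for illustration: one binary covariate, a strong covariate effect (\(Y = 5X + W + \text{noise}\), true ATE 1), and an estimated propensity equal to the treated share within each covariate cell. Function names are hypothetical.

```python
import random
import statistics

def ipw_estimate(x, w, y, propensity):
    """IPW estimate of the ATE; `propensity` maps a covariate value to Pr(W=1)."""
    n = len(y)
    return sum(wi * yi / propensity(xi) - (1 - wi) * yi / (1 - propensity(xi))
               for xi, wi, yi in zip(x, w, y)) / n

def compare_ipw(n=100, reps=500, seed=2):
    """Variance of IPW with the oracle 1/2 vs. a cell-wise estimated propensity."""
    rng = random.Random(seed)
    oracle, estimated = [], []
    for _ in range(reps):
        x = [rng.randint(0, 1) for _ in range(n)]
        treated = set(rng.sample(range(n), n // 2))
        w = [1 if i in treated else 0 for i in range(n)]
        # Hypothetical outcome model with a strong covariate effect.
        y = [5.0 * xi + wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
        shares = {}
        for v in (0, 1):
            cell = [wi for xi, wi in zip(x, w) if xi == v]
            shares[v] = sum(cell) / len(cell) if cell else 0.0
        if any(s <= 0.0 or s >= 1.0 for s in shares.values()):
            continue  # skip rare draws where a cell lacks treated or control units
        oracle.append(ipw_estimate(x, w, y, lambda v: 0.5))
        estimated.append(ipw_estimate(x, w, y, lambda v: shares[v]))
    return statistics.pvariance(oracle), statistics.pvariance(estimated)

var_oracle, var_estimated = compare_ipw()
print(f"Var(IPW, oracle e) = {var_oracle:.3f}; Var(IPW, estimated e) = {var_estimated:.3f}")
```

In this design the estimated-propensity version is the post-stratified estimator, so it removes the chance imbalance in \(X\) that the oracle weights leave in.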

Appendix: how the simulation plots (Cases 1 and 2) were made

Metric per covariate set \(S\): out-of-fold \[ R^2_S = 1 - \frac{\sum_i \big(\tau_i - \hat g_S(S_i)\big)^2}{\sum_i (\tau_i - \bar\tau)^2}, \] where \(\hat g_S(S_i)\) is a 5-fold cross-fitted prediction of \(\tau_i\) from unit \(i\)’s covariates in \(S\).
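
The metric itself is a few lines of code. This sketch assumes the predictions passed in are already out-of-fold; the function name is hypothetical.

```python
def out_of_fold_r2(tau, tau_hat):
    """R^2_S = 1 - SSE / SST, with tau_hat assumed to be out-of-fold predictions."""
    tau_bar = sum(tau) / len(tau)
    sse = sum((t - p) ** 2 for t, p in zip(tau, tau_hat))
    sst = sum((t - tau_bar) ** 2 for t in tau)
    return 1.0 - sse / sst
```

Unlike an in-sample \(R^2\), this quantity can be negative: a predictor that overfits can do worse out of fold than simply predicting \(\bar\tau\).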

Why we can compute it: in the simulation each unit’s true effect \(\tau_i=Y_i(1)-Y_i(0)\) is known, so we regress it directly on \(S\) and read off the heterogeneity \(S\) explains. (In real data \(\tau_i\) is unseen: you’d instead regress an AIPW pseudo-outcome, whose conditional mean given the covariates is the CATE.)

Learners. Full history \(\to\) OLS on all lags (Case 1, so it can overfit) or a random forest (Case 2, for the nonlinear volatility). Prognostic \(\hat Y(0)\) \(\to\) a cross-fit prediction of \(Y(0)\), used as one covariate.

Read it: higher \(R^2_S \Rightarrow\) \(S\) carries more of the effect’s variation \(\Rightarrow\) a better basis for CATE estimation / targeting.