Recursive Partitioning
for Heterogeneous Causal Effects

Susan Athey and Guido Imbens (2016)

Wooyong Park, Joonwoo Shin, Jeongwoo Yang, Minseo Seok

URL - wooyongp.github.io/rpart-jsc25_1/slides.html

Data-driven heterogeneous causal effect (HCE) analysis based on tree models

Don’t stress. We are going to start from scratch.

Causal Inference with binary treatment

Research Question:

Does a smile in your online profile help your online micro-borrowing? (Athey et al., 2022, NBER WP)

  • W_i \in \{0,1\} : Your treatment status

Under SUTVA (the Stable Unit Treatment Value Assumption),

  • Y_i(1) : individual i’s potential outcome if he/she smiled
  • Y_i(0) : individual i’s potential outcome if he/she did not

Your treatment effect is the difference between the two: \tau_i = Y_i(1) - Y_i(0)

Missing Counterfactuals and ATE

What we see:

Y_i = W_iY_i(1) + (1-W_i)Y_i(0)

Since we cannot observe both situations, we usually rely on the ATE:

\tau = \mathbb{E}\bigl[Y_i(1)-Y_i(0)\bigr]

If any one of the following holds:

        1. the treatment is randomized,

        2. the treatment is uncorrelated with unobserved characteristics (unconfoundedness),

        3. we have an instrumental variable (IV),

we can estimate the ATE without bias.
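As a quick illustration (a minimal sketch of our own, not from the paper), under a randomized binary treatment the difference in sample means is an unbiased estimator of the ATE:

Code
# Difference-in-means ATE under randomization (illustrative simulation)
set.seed(1)
n  <- 5000
w  <- rbinom(n, 1, 0.5)            # randomized binary treatment
y1 <- 1.5 + rnorm(n)               # potential outcome Y_i(1)
y0 <- 1.0 + rnorm(n)               # potential outcome Y_i(0)
y  <- w * y1 + (1 - w) * y0        # observed outcome
mean(y[w == 1]) - mean(y[w == 0])  # close to the true ATE of 0.5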

Heterogeneous Treatment Effects

Limitations of ATE

Sometimes, the ATE is insufficient.

Unfortunately, some people’s smiles might not be as alluring as others’. \rightarrow HTE


Conditional ATEs

The CATE captures this heterogeneity within the data:

\tau(X_i) = \mathbb{E}\bigl[Y_i(1) - Y_i(0)|X_i\bigr]

Athey and Imbens (2016): tree-based models, a class of ML algorithms, can hint at how to choose X_i

Key concepts in ML

In terms of prediction, OLS (\mathrm{y}=X'\beta +\varepsilon) is not good enough.

  1. Not all DGPs are linear.

  2. Bias-Variance tradeoff in Prediction

\begin{align*} \text{Prediction Error} &=\mathbb{E}\bigl[\bigl(f(x)+\varepsilon-\hat{f}(x)\bigr)^2\bigr]\\ & = \underbrace{\bigl(\mathbb{E}[f-\hat{f}]\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{V}(\hat{f})}_{\text{variance}} + \mathbb{V}(\varepsilon) \end{align*}

The unbiased predictor is usually not the minimum-error predictor, yet most models and estimators, including OLS, focus on unbiasedness!

Why does this matter? Heterogeneity in treatment effects can be nonlinear with respect to X.
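A toy simulation (our own sketch, not from the paper) makes the tradeoff concrete: shrinking the unbiased sample mean toward zero introduces bias but reduces variance enough to lower the overall error.

Code
# A biased (shrunken) estimator can beat the unbiased sample mean in MSE
set.seed(2)
mu <- 0.5
n  <- 10
est <- replicate(1e5, {
  x <- rnorm(n, mean = mu)
  c(unbiased = mean(x), shrunk = 0.8 * mean(x))  # shrink toward zero
})
rowMeans((est - mu)^2)  # shrunk MSE (~0.074) < unbiased MSE (~0.100)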

Building a Tree

Classification and Regression Trees recursively divide the covariate space into two so that MSE decreases each time we add a partition. In the figure below, we have a partition with 5 leaves.

Code
# See the full code at Grant McDermott's repository
library(rpart) 
library(parsnip)
library(tidyverse)
library(parttree)
set.seed(123) ## For consistent jitter

fit = rpart(Kyphosis ~ Start + Age, data = kyphosis)

ggplot(kyphosis, aes(x = Start, y = Age)) +
  geom_parttree(data = fit, alpha = 0.1, aes(fill = Kyphosis)) + # <-- key layer
  geom_point(aes(col = Kyphosis)) +
  labs(
    x = "No. of topmost vertebra operated on", y = "Patient age (months)",
    caption = "Note: Points denote observations. Shading denotes model predictions."
    ) +
  theme_minimal()

Steps

  1. Choose X_1 or X_2.
  2. Choose the cutoff for dividing.
  3. Repeat these steps.

Figure from Beaulac and Rosenthal (2019)

For regression trees, the criterion for choosing a covariate and cutoff pair is the MSE.

  1. Split the data into S^{tr} (training) and S^{te} (test).

  2. Within S^{tr}, choose a covariate and a cutoff k that minimize the MSE:

MSE = \underbrace{\sum_{i \in L}(y_i-\hat{y}_L)^2}_{\text{left-side MSE}} + \underbrace{\sum_{i \in R}(y_i-\hat{y}_R)^2}_{\text{right-side MSE}}

  3. Repeat this process.

  4. (Important!!) Our prediction within each leaf is the sample mean within each leaf:

\hat{y}(\text{leaf}) = \overline{y}_{(\text{leaf})}

  5. The resulting MSE of the model given a sample S and partition \Pi would be MSE_\mu(S, S^{tr}, \Pi) = \frac{1}{\#(S)} \sum_{i \in S} \biggl[Y_i -\hat{\mu}(X_i; S^{tr}, \Pi)\biggr]^2

In this paper, we use the adjusted MSE: MSE_\mu(S, S^{tr}, \Pi) = \frac{1}{\#(S)} \sum_{i \in S} \biggl[(Y_i -\hat{\mu}(X_i; S^{tr}, \Pi))^2- {\color{blue}Y_i^2}\biggr]

which does not affect the splitting mechanism but makes the algebra more interpretable.
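To make the splitting step concrete, here is a minimal sketch (ours, not the paper's implementation) of a single split search: scan candidate cutoffs of one covariate and keep the one minimizing the left-plus-right squared-error criterion above, with leaf means as predictions.

Code
# One CART-style split on a single covariate
best_split <- function(x, y) {
  cuts <- head(sort(unique(x)), -1)  # candidate cutoffs
  sse <- sapply(cuts, function(k) {
    yl <- y[x <= k]; yr <- y[x > k]
    sum((yl - mean(yl))^2) + sum((yr - mean(yr))^2)  # left + right SSE
  })
  c(cutoff = cuts[which.min(sse)], sse = min(sse))
}
set.seed(3)
x <- runif(500, -5, 5)
y <- 3 * (x > 0) + rnorm(500)  # true jump at x = 0
best_split(x, y)               # recovered cutoff should be near 0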

CART - Best predictor, but biased

Code
# Load required library
library(rpart)
library(ggplot2)
library(rpart.plot)

set.seed(456)

# simulate data and apply CART with two features
n <- 10000

x1 <- runif(n, 0, 10)
x2 <- runif(n, -5, 5)
y <- 1 + 2 * x1 + 3 * x2 * as.numeric(x2>0) + rnorm(n, 0, 1)  # True relationship: y = 1 + 2 x1 + 3 x2*I(x2>0) + noise
    
# Fit regression tree
data <- data.frame(x1 = x1, x2 = x2, y = y)
tree_model <- rpart(y ~ x1 + x2, data = data, control = rpart.control(cp = 0.04))
    
# Pruning
best <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]),"CP"]
pruned_tree <- prune(tree_model, cp=best)

rpart.plot(pruned_tree, digits=3)

Code
# Load required library
library(tidyverse)

# Simulate data and compute the sample mean in each leaf
n <- 100000

x1 <- runif(n, 0, 10)
x2 <- runif(n, -5, 5)
y <- 1 + 2 * x1 + 3 * x2 * as.numeric(x2>0) + rnorm(n, 0, 1)  # True relationship: y = 1 + 2 x1 + 3 x2*I(x2>0) + noise

data <- tibble(x1 = x1, x2 = x2, y = y) |> 
    mutate(leaf = case_when(
        x1<2.62 & x2<1.77 ~ "1",
        x1<5.23 & x1>=2.62 & x2<1.77 ~ "2",
        x1<5.23 & x2>=1.77 ~ "3",
        x1>=5.23 & x2<1.81 ~ "4",
        x1>=5.23 & x2>=1.81 ~ "5"
    ))

df <- data |> group_by(leaf) |> summarize(unbiased_mean = mean(y)) |> 
    mutate(tree_mean = c(4.27, 9.51, 16.5, 16.9, 26.5))

kableExtra::kable(df)
leaf | unbiased_mean | tree_mean
1    |      4.334393 |      4.27
2    |      9.563450 |      9.51
3    |     16.405273 |     16.50
4    |     17.000677 |     16.90
5    |     26.452853 |     26.50

Roadmap

Trees can be more flexible and data-driven. However,

  • CART - biased estimates of CATE \rightarrow \mathbb{E}[\hat{\tau}(X;S^{tr}) \mid x] \neq \tau(x)
  • How to solve? We split the sample
    • S^{tr} for building a tree
    • S^{est} to estimate CATEs; it plays no role in building the tree
  • How does this change the tree?
    • For outcome prediction(Y) - Joonwoo
    • For CATE(\tau(X)) - Jeongwoo
    • Simulation Results - Minseo

Honest Inference for Outcome Averages

Notations for predicted outcomes

  • Given a partition \Pi, conditional mean is given by: \begin{equation*} \mu(x;\Pi) = \mathbb{E}\left[Y_i | X_i \in \textit{l}(x;\Pi)\right] \end{equation*}

  • Given a sample \mathcal{S}, the estimated conditional mean is given by \begin{equation*} \hat{\mu}(x;\mathcal{S},\Pi) = \frac{1}{\#(i \in \mathcal{S} : X_i \in \textit{l}(x;\Pi))}\sum\limits_{i \in \mathcal{S}:X_i \in \textit{l}(x;\Pi)}Y_i \end{equation*}

Limitations of CART

We cannot simply use CART to estimate HTE.

  • Potential bias in the leaf estimates
  • does not account for variance when splitting

Limitations of CART

Suppose Y_i \in \mathbb{R}, \quad X_i \in \{L,R\}

  • Only two possible partitions : \begin{equation*} \Pi = \begin{cases} \{L,R\} & (\text{no split}) \\ \{ \{L\}, \{R\} \} & (\text{split}) \end{cases} \end{equation*}

  • To build a regression tree, we compare

\begin{align*} &\frac{1}{\#(S^{tr})}\sum\limits_{i \in S^{tr}}(Y_i-\bar{Y})^2 \quad\text{and}\\ &\frac{1}{\#(S^{tr})}\biggl[\sum\limits_{i: X_i=L}(Y_i-\bar{Y}_L)^2+\sum\limits_{i:X_i = R}(Y_i-\bar{Y}_R)^2\biggr] \end{align*}

Limitations of CART

  • This is equivalent to

\begin{align*} \pi(S)= \begin{cases} \quad \{ \{L,R \} \} & \text{if} \quad |\bar{Y}_L -\bar{Y}_R| \leq c \\ \quad \{ \{L\}, \{R \} \} & \text{if} \quad |\bar{Y}_L -\bar{Y}_R| > c \end{cases} \end{align*}

If we condition on \left|\bar{Y}_L -\bar{Y}_R\right| > c, we expect bias:

\mathbb{E}(\overline{Y}_k) \neq \mathbb{E}(\overline{Y}_k| \left|\overline{Y}_L -\overline{Y}_R \right| > c)

where k \in \{L,R\}
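A small Monte Carlo (our own sketch) makes the bias visible: with equal true means, the leaf averages are unbiased unconditionally, but conditional on the split going in a given direction they are pushed apart.

Code
# Leaf means are biased once we condition on the adaptive split decision
set.seed(4)
draws <- replicate(5e4, {
  yL <- rnorm(20); yR <- rnorm(20)  # both leaves have true mean zero
  c(mL = mean(yL), mR = mean(yR))
})
split <- draws["mL", ] - draws["mR", ] > 0.3  # split favoring the left leaf
mean(draws["mL", ])       # ~ 0: unbiased without conditioning
mean(draws["mL", split])  # > 0: upward bias in the "high" leaf
mean(draws["mR", split])  # < 0: downward bias in the "low" leaf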

Limitations of CART

Example

Code
# See the full code at Grant McDermott's repository
library(rpart) 
library(parsnip)
library(tidyverse)
library(parttree)

train1 <- kyphosis %>% head(50)
estimation <- kyphosis %>% tail(31)
set.seed(456) ## For consistent jitter
train2 <- kyphosis %>% sample_n(50)

fit = rpart(Kyphosis ~ Start + Age, data = train1)

ggplot(train1, aes(x = Start, y = Age)) +
  geom_parttree(data = fit, alpha = 0.1, aes(fill = Kyphosis)) + # <-- key layer
  geom_point(aes(col = Kyphosis)) +
  labs(
    x = "No. of topmost vertebra operated on", y = "Patient age (months)",
    caption = "Note: Points denote observations. Shading denotes model predictions."
    ) +
  theme_minimal()

Code
fit = rpart(Kyphosis ~ Start + Age, data = kyphosis)

ggplot(kyphosis, aes(x = Start, y = Age)) +
#   geom_parttree(data = fit, alpha = 0.1, aes(fill = Kyphosis)) + # <-- key layer
  geom_jitter(aes(col = Kyphosis)) +
  geom_point(aes(x=10, y=50), color = "#F8766D") +
  geom_point(aes(x=3, y=58), color = "#F8766D") +
  geom_point(aes(x=12.3, y=210), color = "#F8766D") +
  geom_point(aes(x=12.2, y=200), color = "#F8766D") +
  geom_point(aes(x=4, y=65), color = "#F8766D") +
  geom_point(aes(x=4.5, y=100), color = "#F8766D") +
  geom_segment(aes(x = 0, xend = 12.5, y = 35, yend = 35), linetype="dashed") +
  geom_vline(aes(xintercept=12.5), linetype="dashed") +
  labs(
    x = "No. of topmost vertebra operated on", y = "Patient age (months)",
    caption = "Note: Points denote observations. Shading denotes model predictions."
    ) +
  theme_minimal()

CART vs Honest

Honest Estimation uses two different samples : S^{tr} for splitting and S^{est} for estimation.

CART

\hat{\mu}(x;S^{tr},\pi(S^{tr}))=\frac{1}{\#(i \in S^{tr} : X_i \in \textit{l}(x;\pi(S^{tr})))}\sum\limits_{i\in S^{tr}:X_i \in \textit{l}(x;\pi(S^{tr}))}Y_i

Honest

\hat{\mu}(x;{\color{red}S^{est}},\pi(S^{tr}))=\frac{1}{\#(i \in {\color{red}S^{est}} : X_i \in \textit{l}(x;\pi(S^{tr})))}\sum\limits_{i\in {\color{red}S^{est}}:X_i \in \textit{l}(x;\pi(S^{tr}))}Y_i
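A minimal sketch of this two-sample scheme with rpart (illustrative, not the paper's implementation; labeling new observations' leaves by their adaptive prediction is just a convenient trick):

Code
# Grow the partition on S^tr, recompute leaf means on the held-out S^est
library(rpart)
set.seed(5)
n <- 2000
x <- runif(n)
y <- 2 * (x > 0.5) + rnorm(n)
tr  <- data.frame(x = x[1:1000],    y = y[1:1000])     # S^tr
est <- data.frame(x = x[1001:2000], y = y[1001:2000])  # S^est
fit  <- rpart(y ~ x, data = tr)
leaf <- factor(predict(fit, newdata = est))  # adaptive prediction labels the leaf
cbind(adaptive = as.numeric(levels(leaf)),   # CART leaf means (from S^tr)
      honest   = tapply(est$y, leaf, mean))  # honest leaf means (from S^est)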

The Honest Criterion

\begin{align*} \text{MSE}_{\mu}(\underbrace{S^{tr}}_\text{training set},\underbrace{S^{est}}_\text{estimation set},\Pi) &= \frac{1}{\#(S^{tr})}\sum\limits_{i \in S^{tr}}\left[(Y_i - \hat{\mu}(X_i;{\color{red}S^{ est}},\Pi))^2-Y_i^2\right] \end{align*}

\begin{align*} \text{EMSE}_\mu &= \mathbb{E}_{S^{tr},S^{est}}\left[\text{MSE}_\mu(S^{tr},S^{est},\Pi)\right] \end{align*}

where the expectation is taken over all possible samples S^{tr} and S^{est}.

The Honest Target

  • Given \Pi, we can expand EMSE_\mu(\Pi) :

\begin{align*} -\text{EMSE}_\mu(\Pi) &= -\mathbb{E}_{(Y_i,X_i),S^{est}}\left[(Y_i-\mu(X_i;\Pi))^2 - Y_i^2\right] \\ &\quad -\mathbb{E}_{X_i,S^{est}}\left[(\hat{\mu}(X_i;S^{est},\Pi)-\mu(X_i;\Pi))^2\right] \\ &= \mathbb{E}_{X_i}\left[\mu^2(X_i;\Pi)\right] - \mathbb{E}_{S^{est},X_i}\left[\mathbb{V}(\hat{\mu}(X_i;S^{est},\Pi))\right] \end{align*}

  • How can we estimate each of these terms using S^{tr} and N^{est}?

Honest Target: Estimation

-\text{EMSE}_\mu=\mathbb{E}_{X_i}\left[\mu^2(X_i;\Pi)\right] - \mathbb{E}_{S^{est},X_i}\biggl[\mathbb{V}\bigl(\hat{\mu}(X_i;S^{est},\Pi)\bigr)\biggr]

\hat{\mathbb{E}}\left[\mu^2(x;\Pi)\right] = \hat{\mu}^2(x;S^{tr},\Pi)-\frac{S^2_{S^{tr}}(\mathcal{l}(x;\Pi))}{N^{tr}(\mathcal{l}(x;\Pi))}

\hat{\mathbb{V}}(\hat{\mu}(x;S^{est},\Pi)) = \frac{S^2_{S^{\color{red} tr}}(l(x;\Pi))}{N^{est}(l(x;\Pi))}

Assuming leaf shares between S^{tr} and S^{est} are approximately the same,

\hat{\mathbb{E}}\left[\mathbb{V}(\hat{\mu}(X_i;S^{est},\Pi))|i \in S^{tr}\right] = \frac{1}{N^{est}}\sum\limits_{\mathcal{l} \in \Pi}S^2_{S^{tr}}(\mathcal{l})

Honest Target: Estimation

-\text{EMSE}_\mu=\mathbb{E}_{X_i}\left[\mu^2(X_i;\Pi)\right] - \mathbb{E}_{S^{est},X_i}\biggl[\mathbb{V}\bigl(\hat{\mu}(X_i;S^{est},\Pi)\bigr)\biggr]

Combining the two terms, we obtain an unbiased estimator of the honest target: \begin{align*} -\widehat{\text{EMSE}}_\mu(S^{tr},N^{est},\Pi) = \frac{1}{N^{tr}} \sum\limits_{i \in S^{tr}}\hat{\mu}^2(X_i;S^{tr},\Pi)-\left(\frac{1}{N^{tr}}+\frac{1}{N^{est}}\right)\cdot\sum\limits_{\ell \in \Pi}S^2_{S^{tr}}(\ell) \end{align*}
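In code, each term is one line; a sketch (ours) for a candidate leaf assignment, evaluated on S^{tr}:

Code
# Unbiased estimate of -EMSE_mu:
# (1/N^tr) * sum of mu_hat^2  -  (1/N^tr + 1/N^est) * sum of leaf variances
neg_emse_mu <- function(y, leaf, n_est) {
  mu_hat <- ave(y, leaf)          # leaf-mean prediction for each i in S^tr
  s2     <- tapply(y, leaf, var)  # within-leaf sample variances S^2(l)
  mean(mu_hat^2) - (1 / length(y) + 1 / n_est) * sum(s2)
}
set.seed(6)
y <- c(rnorm(100, 0), rnorm(100, 1))
neg_emse_mu(y, leaf = rep(c("L", "R"), each = 100), n_est = 200)  # split wins
neg_emse_mu(y, leaf = rep("all", 200),              n_est = 200)  # no split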

Comparison to CART

CART Target

\begin{equation*} -\text{MSE}_{\mu}(S^{tr},S^{tr},\Pi) = \frac{1}{N^{tr}} \sum\limits_{i \in S^{tr}}\hat{\mu}^2(X_i;S^{tr},\Pi) \end{equation*}

Honest Target

\begin{equation*} -\widehat{\text{EMSE}}_\mu(S^{tr},N^{est},\Pi) = \frac{1}{N^{tr}} \sum\limits_{i \in S^{tr}}\hat{\mu}^2(X_i;S^{tr},\Pi)-{\color{blue}\left(\frac{1}{N^{tr}}+\frac{1}{N^{est}}\right)\cdot\sum\limits_{\ell \in \Pi}S^2_{S^{tr}}(\ell)} \end{equation*}

Comparison to CART

Pros and Cons of Honest

  • Pro:
    • Honest target not only removes potential bias in leaf estimates but also considers variance reduction in splitting.
    • enables statistical testing (valid confidence intervals)
  • Con: smaller sample size, shallower tree, and less personalized predictions

Honest Inference for Treatment Effects

  • Population average outcome “in a leaf” and its estimator

\begin{align*} &\mu(w,x;\Pi) \equiv \mathbb{E}[Y_i(w)|X_i \in \mathcal{l}(x;\Pi)] \notag \\ &\hat{\mu}(w,x;\mathcal{S},\Pi) \equiv \frac{1}{\# (\{i \in \mathcal{S}_w:X_i\in\mathcal{l}(x;\Pi)\})} \sum_{i \in \mathcal{S}_w:X_i\in\mathcal{l}(x;\Pi)}Y_i \end{align*} where \mathcal{S}_w \equiv \{i \in \mathcal{S} : W_i = w\}

  • Average causal effect “in a leaf” and its estimator

\begin{align*} &\tau(x;\Pi) \equiv \mathbb{E}[Y_i(1)-Y_i(0)|X_i \in \mathcal{l}(x;\Pi)] \notag \\ &\hat{\tau}(x;\mathcal{S},\Pi) \equiv \hat{\mu}(1,x;\mathcal{S},\Pi)-\hat{\mu}(0,x;\mathcal{S},\Pi) \notag \end{align*}
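As a sketch (ours), the leaf-level estimator is simply a difference of treated and control means within each leaf:

Code
# Within-leaf treatment effect: difference of treated and control means
tau_hat_leaf <- function(y, w, leaf) {
  leaf <- factor(leaf)  # keep identical leaf levels in both arms
  tapply(y[w == 1], leaf[w == 1], mean) -
    tapply(y[w == 0], leaf[w == 0], mean)
}
set.seed(7)
n <- 4000
w <- rbinom(n, 1, 0.5)
x <- runif(n)
y <- x + w * (1 + (x > 0.5)) + rnorm(n)               # tau(x) = 1 + 1{x > .5}
tau_hat_leaf(y, w, leaf = ifelse(x > 0.5, "R", "L"))  # ~1 in L, ~2 in R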

CART For HTE Estimation?

  1. Model and Estimation
    • Model type: Tree structure with \mathcal{S}^{\color{red}{tr}} (Grow and prune)

    • Estimate with \mathcal{S}^{\color{red}{tr}}.

\hat{\tau}(x;\mathcal{S}^{\color{red} \text{tr}},\Pi) = \hat{\mu}(1,x;\mathcal{S}^{\color{red}{tr}},\Pi)-\hat{\mu}(0,x;\mathcal{S}^{\color{red}{tr}},\Pi)

  2. Criterion Function
    • In-sample goodness-of-fit function (NOT FEASIBLE, since the \tau_i are unobserved): Q^{is}=-MSE=-\frac{1}{N}\sum_{i=1}^{N}(\tau_i-\hat{\tau}_i)^2

Problems with using CART for HTE

  • GOAL: Estimate within-leaf treatment effect

  • HOW? maximize -MSE_{\tau}

  • Problem 1: the \tau_i’s are unobservable
    • Under our framework, estimate -MSE_\tau with the unbiased estimator -\widehat{MSE}_\tau
  • Problem 2: \overline{\tau}_L and \overline{\tau}_R are biased (e.g. \overline{\tau}_L-\overline{\tau}_R, conditional on exceeding c, is biased)
    • Split the sample: one part builds the tree, the other estimates the effects.
  • NEW criterion by Honest Algorithm

\begin{align*} -\mathbb{E}_{\mathcal{S}^{\color{red}{tr}}, \mathcal{S}^{\color{red}{est}}}\Bigl[\frac{1}{N^{\color{red}tr}}\sum_{i\in \mathcal{S}^{\color{red}{tr}}}\bigl(\tau_i-\hat{\tau}(X_i;\mathcal{S}^{\color{red}{est}},\Pi)\bigr)^2\Bigr] \end{align*}

New Criterion for Honest Causal Tree

Given \Pi,

\begin{align*} -\text{MSE}_{\tau}(\mathcal{S}^{{\color{red}\text{tr}}},\mathcal{S}^{{\color{red}\text{est}}},\Pi) &\equiv -\frac{1}{\text{N}^{\color{red}\text{tr}}}\sum_{i\in \mathcal{S}^{{\color{red}\text{tr}}}}\bigl[(\tau_i-\hat{\tau}(X_i;\mathcal{S}^{{\color{red}\text{est}}},\Pi))^2 -\tau_i^2 \bigr] \\ &= -\frac{1}{\text{N}^{\color{red}\text{tr}}}\sum_{i\in \mathcal{S}^{{\color{red}\text{tr}}}}\bigl[-2\tau_i\cdot \hat{\tau}(X_i;\mathcal{S}^{{\color{red}\text{est}}},\Pi)+\hat{\tau}^2 (X_i;\mathcal{S}^{{\color{red}\text{est}}},\Pi)\bigr] \\ -\text{EMSE}_\tau(\Pi) &= \mathbb{E}_{\mathcal{S}^{\color{red}{tr}}, \mathcal{S}^{\color{red}{est}}}\bigl[-\text{MSE}_\tau(\mathcal{S}^{\color{red}{tr}},\mathcal{S}^{\color{red}{est}},\Pi)\bigr] \\ &= \mathbb{E}_{X_i}[\tau^2(X_i;\Pi)] -\mathbb{E}_{\mathcal{S}^{\text{est}},X_i}\bigl[\mathbb{V}(\hat{\tau}(X_i; \mathcal{S}^{\text{est}},\Pi))\bigr] \end{align*}

  • Again, the last equality holds by “honesty”: \mathcal{S}^{est} is independent of \Pi

Estimating the Criterion

  • In-sample goodness-of-fit measure: -\widehat{\text{EMSE}}_{\tau}(\mathcal{S}^\text{tr},\Pi) \begin{align*} &\equiv \hat{\mathbb{E}}_{X_i}[\tau^2(X_i;\Pi)] - \hat{\mathbb{V}}_{\mathcal{S}^{est},X_i}[\hat{\tau}(X_i;\mathcal{S}^{est},\Pi)] \\ & \equiv \frac{1}{\text{N}^\text{tr}} \sum_{i\in\mathcal{S^\text{tr}}} \hat{\tau}^2(X_i;\mathcal{S}^{tr},\Pi) -\biggl(\frac{2}{\text{N}^{\text{tr}}}\biggr)\sum_{\ell \in \Pi}\biggl(\frac{S^2_{\mathcal{S}^\text{tr}_\text{treat}}(\ell)}{p}+\frac{S^2_{\mathcal{S}^\text{tr}_\text{control}}(\ell)}{1-p}\biggr) \end{align*} where p=N^\text{tr}_\text{treat}/N^\text{tr}

  • Note that the S^2’s are within-leaf sample variances of the outcomes for the treated and control groups, not variances of treatment effects.
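A sketch (ours) of this criterion for a candidate leaf assignment on S^{tr}, reusing the within-leaf effect estimator from above:

Code
# -EMSE_tau_hat: reward squared leaf effects, penalize within-leaf outcome
# variances of treated (scaled by 1/p) and controls (scaled by 1/(1-p))
neg_emse_tau <- function(y, w, leaf, p = mean(w)) {
  leaf <- factor(leaf)
  tau  <- tapply(y[w == 1], leaf[w == 1], mean) -
          tapply(y[w == 0], leaf[w == 0], mean)  # leaf effects on S^tr
  s2_t <- tapply(y[w == 1], leaf[w == 1], var)   # S^2_treat(l)
  s2_c <- tapply(y[w == 0], leaf[w == 0], var)   # S^2_control(l)
  mean(tau[as.character(leaf)]^2) -
    (2 / length(y)) * sum(s2_t / p + s2_c / (1 - p))
}
set.seed(7)
n <- 4000; w <- rbinom(n, 1, 0.5); x <- runif(n)
y <- x + w * (1 + (x > 0.5)) + rnorm(n)
neg_emse_tau(y, w, leaf = ifelse(x > 0.5, "R", "L"))  # split at true cutoff
neg_emse_tau(y, w, leaf = rep("all", n))              # no split scores lower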

Interpretation of the Criterion

  • The first term rewards high heterogeneity in treatment effects

    \begin{align*} \hat{\mathbb{E}}_{X_i}[\tau^2(X_i;\Pi)]=\frac{1}{\text{N}^\text{tr}} \sum_{i\in\mathcal{S^\text{tr}}} \hat{\tau}^2(X_i;\mathcal{S}^{tr},\Pi) \end{align*}

  • The second term penalizes a partition that increases variance in leaf estimates (e.g. small leaves) \begin{align*} -\hat{\mathbb{V}}_{\mathcal{S}^{est},X_i}[\hat{\tau}(X_i;\mathcal{S}^{est},\Pi)] = -\frac{2}{\text{N}^{\text{tr}}}\sum_{\ell \in \Pi}(\frac{S^2_{\mathcal{S}^\text{tr}_\text{treat}}(\ell)}{p}+\frac{S^2_{\mathcal{S}^\text{tr}_\text{control}}(\ell)}{1-p}) \end{align*}

Pros and Cons of Honest

  • Pro:
    • Honest target not only removes potential bias in leaf estimates but also penalizes high variance
    • enables statistical testing(valid confidence intervals)
  • Con: smaller sample size, shallower tree, and less personalized predictions


Alternative Estimators and Simulation Results

Alternative Methods for Constructing Trees

(1) Fit-based Trees (F)

  • Zeileis et al. (2008)

  • Regressors: intercept (leaf average) + a treatment dummy

  • goodness-of-fit

MSE_{\mu,W}(\mathcal{S}^{te},\mathcal{S}^{est},\Pi) \equiv \sum_{i\in\mathcal{S}^{te}} ((Y_i- \hat{\mu}_w({\color{red}W_i},X_i;\mathcal{S}^{est},\Pi))^2 -{Y_i}^2)

  • Pros: MSE is feasible (No \tau_i terms)

  • Cons: NO reward for heterogeneity of treatment effects

    (cf. the \sum\hat{\tau}^2 term in the Causal Tree criterion)

Alternative Methods for Constructing Trees

(2) Squared T-statistic Trees (TS)

  • Su et al. (2009)

  • Split Rule: split if ({\color{red}\overline{\tau}_L-\overline{\tau}_R})^2 is sufficiently large

    • similar to two-sample t test

\begin{align*} T^2 \equiv N \cdot \frac{({\color{red}\overline{\tau}_L-\overline{\tau}_R})^2}{S^2/N_L+S^2/N_R} \end{align*} where S^2 is the conditional sample variance given the split

  • Pros: rewards (only) heterogeneity of treatment effects
  • Cons: places no value on splits that improve the fit (cf. Fit-based Trees)
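A sketch (ours) of the split statistic, written for two candidate child leaves; in the causal version the inputs would be within-leaf effect estimates rather than raw outcomes:

Code
# Squared t-statistic for a candidate split, in the spirit of Su et al. (2009)
t2_split <- function(yL, yR) {
  nL <- length(yL); nR <- length(yR); n <- nL + nR
  s2 <- ((nL - 1) * var(yL) + (nR - 1) * var(yR)) / (n - 2)  # pooled S^2
  n * (mean(yL) - mean(yR))^2 / (s2 / nL + s2 / nR)
}
set.seed(9)
t2_split(rnorm(100, 0), rnorm(100, 1))  # large: strong evidence for a split
t2_split(rnorm(100, 0), rnorm(100, 0))  # small: little evidence for a split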

Simulation Study: Set-up

Goal: Compare the performance of proposed algorithms (Adaptive vs. Honest)

  • Evaluate Mean Squared Error (MSE) for each method
  • Evaluate 90% confidence interval coverage for each method
Notation | Sample Size          | Role
N^{tr}   | 500 or 1,000         | Tree construction
N^{est}  | 500 (honest setting) | Treatment effect estimation
N^{te}   | 8,000                | Test sample (MSE evaluation)

Y_i(w) = \eta(X_i) + \frac{1}{2}\cdot (2w-1) \cdot {\color{red} \kappa (X_i)} + \epsilon_i

  • \epsilon_i \sim N(0,.01)
  • X_i \sim N(0,1)
  • \epsilon_i\perp X_i \quad \text{and}\quad X_i\perp X_j

We have three different setups (a simulation sketch follows the list below):

  1. K=2; \quad \kappa(x) = \frac{1}{2}x_1
  2. K=10; \quad \kappa(x) = \sum_{k=1}^2 \mathbb{I}\{x_k >0\}\cdot x_k
  3. K=20; \quad \kappa(x) = \sum_{k=1}^4 \mathbb{I}\{x_k>0\} \cdot x_k
  • Design 1: two covariates; the HTE is linear.
  • Design 2: six covariates enter the DGP (out of K=10); the HTE is nonlinear.
  • Design 3: eight covariates enter the DGP (out of K=20); the HTE is nonlinear.
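A sketch of Design 1's DGP: \kappa is taken from the list above, while the baseline \eta(x) is an illustrative assumption, since the slide does not specify it.

Code
# Simulate Design 1: K = 2, kappa(x) = x1/2 (linear HTE)
# NOTE: eta(x) below is an assumed baseline, not specified on this slide
set.seed(8)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)    # independent N(0,1) covariates
w  <- rbinom(n, 1, 0.5)
eta   <- x1 / 2 + x2              # assumed baseline mean function
kappa <- x1 / 2                   # treatment effect function (Design 1)
eps   <- rnorm(n, 0, sqrt(0.01))  # N(0, .01) noise (variance 0.01)
y <- eta + 0.5 * (2 * w - 1) * kappa + eps
head(data.frame(y, w, x1, x2))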

Simulation Study: Results

CT-H vs alternative estimators

MSE divided by CT-H’s MSE
  • CT-H:
    • Best overall performance across all designs
  • F-H:
    • Performs worst in all designs; it splits based on outcome prediction rather than treatment effects

Simulation Study: Results

Adaptive vs Honest : Coverage for 90% confidence intervals

Coverage of 90% confidence intervals
  • Honest estimation achieves nominal 90% coverage in all designs, while adaptive methods often fall below
  • The fit estimator has the highest coverage rates among the adaptive methods, since it does not focus on treatment effects
  • Honest estimation sacrifices some goodness of fit for valid confidence intervals


Conclusion

  • By having a separate estimation set, tree-based ML approaches can be used for estimating and testing heterogeneous treatment effects!
  • It imposes no restrictions on model complexity or the number of covariates, which helps in setting data-driven hypotheses.
  • Different criteria can be used (fit-based, squared t-statistic, etc.), but the baseline estimator (CT-H) performs best in the simulations.

Resources

Appendix

Cost and Benefits of Honest

  • Cost
    • Shallower tree (\because smaller leaves \rightarrow higher \mathbb{V})
    • Smaller # of samples \rightarrow less personalized predictions and higher MSE
  • Benefit
    • EASY

    • Holding the tree built from \mathcal{S}^{\color{red}tr} fixed, we can use standard methods to conduct inference (confidence intervals) within each leaf, using \mathcal{S}^{\color{red}est}

    (regardless of the dimension of the covariates)
    • No assumption on sparsity needed (c.f. nonparametric methods)
  • vs Dishonest with double the sample
    • Honest does worse if the true model is sparse (which is also the case where bias is less severe)
    • Dishonest has similar or better MSE in many cases, but poor coverage of confidence intervals


FAQ

  • Individuals on the edges of a leaf (outliers)
    • Use a different method (e.g. Random Forest) for more personalized estimates. The causal tree is meant to answer questions about the relation between covariates and how they interplay with treatment effects.
  • Is a smaller number of samples bad?
    • Again, we have moved the goalposts here. We are not trying to give the best prediction of the effect on each individual. Rather, recursive partitioning helps uncover a general relation between covariates and treatment effects.
  • Why 50:50 in sample splitting?
    • The sample-splitting ratio can be chosen differently depending on the problem and the data available.

Simulation Study: Results

Number of Leaves(Tree Depth)

  • CT-H:
    • Splitting criterion: maximizes -MSE
  • F-H:
    • Splitting criterion: maximizes outcome-prediction fit
    • Builds deeper trees than CT
    • Less prone to overfitting on treatment effects
  • TS-H:
    • Splitting criterion: maximizes the squared t-statistic
    • Tree depth similar to that of CT
    • Adaptive versions still prone to overfitting

Simulation Study: Results

Adaptive vs Honest : Ratio of infeasible MSE

  • Honest estimation shows higher MSE in most cases \rightarrow Uses only half the data, leading to lower precision
  • Fit estimator performs poorly in Design 1 \rightarrow With smaller sample size, it tends to ignore treatment heterogeneity
  • As design complexity increases, the MSE ratio decreases. \rightarrow Adaptive estimators overfit more in complex settings.
