This material is an edited compilation of two original documents by Susan Athey, Stefan Wager, Nicolaj Nørgaard Mühlbach, Xinkun Nie, Vitor Hadad, Matthew Schaelling, Kaleb Javier, Niall Keleher.
Before beginning the tutorial, let’s make sure all necessary packages are installed. Try running the cell below and, if any issues arise, follow the instructions within it.
CRAN packages: If any of these packages are not installed, run install.packages("<name of package>"). For example, to install the package fBasics, use install.packages("fBasics"). The number in parentheses in the comments is the package version that was used when this tutorial was compiled. If you find yourself having issues, please consider upgrading or downgrading to that package version.
library(tidyverse)  # Collection of data-science packages (loads dplyr, ggplot2, tidyr, etc.)
library(tidyselect) # Helpers for selecting variables by name
library(dplyr) # Data manipulation (0.8.0.1)
library(fBasics) # Summary statistics (3042.89)
library(corrplot) # Correlations (0.84)
library(psych) # Correlation p-values (1.8.12)
library(grf) # Generalized random forests (0.10.2)
library(rpart) # Classification and regression trees, or CART (4.1-13)
library(rpart.plot) # Plotting trees (3.0.6)
library(treeClust) # Predicting leaf position for causal trees (1.1-7)
library(car) # linear hypothesis testing for causal tree (3.0-2)
library(devtools) # Install packages from github (2.0.1)
library(readr) # Reading csv files (1.3.1)
library(tidyr) # Database operations (0.8.3)
library(tibble) # Modern alternative to data frames (2.1.1)
library(knitr) # RMarkdown (1.21)
library(kableExtra) # Prettier RMarkdown (1.0.1)
library(ggplot2) # general plotting tool (3.1.0)
library(haven) # read stata files (2.0.0)
library(aod) # hypothesis testing (1.3.1)
library(evtree) # evolutionary learning of globally optimal trees (1.0-7)
library(purrr) # Functional programming tools (map, reduce, etc.)
select <- dplyr::select # Make sure select() refers to dplyr's version (avoid masking by other packages)
# Set seed for reproducibility
set.seed(2025)
Non-CRAN packages: The packages below have not yet been uploaded to CRAN or any other major package repository, but we can grab them from their authors’ github webpages. If you don’t have these packages installed already, uncomment the relevant line below to install them.
# For causal trees (Athey and Imbens, 2016) version 0.0
# install_github('susanathey/causalTree') # Uncomment this to install the causalTree package
library(causalTree)
The economic context in which we will be working is inspired by Gerber, Green, and Larimer (2008)’s paper “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment”. This paper begins by noting that voter turnout is hard to explain via theories based on rational, self-interested behavior, because the observable payoff to voting seems so small that turnout should be much lower than what we actually observe. It could be the case, then, that voters receive some unobserved utility from voting (they have fulfilled their civic duty), or it could be that voters feel pressured by their peers to exercise their voting duty. The authors are interested in understanding the latter effect. They pose the question: to what extent do social norms cause voter turnout? In other words, we would like to quantify the effect of social pressure on voter participation.
For this experiment, a large number of voters were randomly divided into several groups, but for our purposes, we only need to know that there was a “control” group that did not receive anything, and a specific “treatment” group that received a message stating that, after the election, the recent voting record of everyone in their household would be sent to all their neighbors; we will call this the Neighbors mailing. This mailing maximized social pressure on potential voters, since their peers would be able to see whether or not they had voted.
The outcome dataset is publicly available. In this tutorial, we will use the variables selected in the code below.
Below, we load the data from a .csv file, rename the response and the treatment variable to \(Y\) and \(W\), respectively, and extract the relevant covariates. Then, we standardize the continuous covariates to have zero mean and unit variance and omit observations with NA entries.
# Clear any existing variables
rm(list = ls())
# Load raw data
df0 <- readr::read_csv(file = 'socialpressnofact.csv', na = character())
# Specify outcome, treatment, and covariate variable names to use
outcome_variable_name <- "outcome_voted"
treatment_variable_name <- "treat_neighbors"
covariate_names <-c("sex", "yob", "city", "hh_size", "totalpopulation_estimate",
"percent_male", "median_age", "percent_62yearsandover",
"percent_white", "percent_black", "percent_asian", "median_income",
"employ_20to64", "highschool", "bach_orhigher","percent_hispanicorlatino")
all_variables_names <- c(outcome_variable_name, treatment_variable_name, covariate_names)
df <- df0 %>% select(all_of(all_variables_names))
# Drop rows containing missing values
df <- df %>% drop_na()
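The standardization mentioned above is not shown in the chunk we just ran, so here is a minimal sketch of how one might do it. The selection continuous_covariates is an illustrative guess (it excludes sex and city, which we assume are categorical indicators), and the result is stored in a separate object df_std so that the pipeline above is left unchanged; adjust both choices for your own application.
# Sketch only: standardize the continuous covariates to zero mean and unit variance.
# `continuous_covariates` is an illustrative guess; adjust for your application.
continuous_covariates <- setdiff(covariate_names, c("sex", "city"))
df_std <- df %>% mutate_at(continuous_covariates, ~ as.numeric(scale(.)))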
For convenience, let’s also change the names of the treatment and outcome variables. This just makes our analysis code below simpler, because we won’t have to worry about variable names. The treatment will be denoted W, and the outcome will be denoted Y (we’ll have more to say about notation in the next section).
# Rename variables
df <- df %>% rename(Y = all_of(outcome_variable_name), W = all_of(treatment_variable_name))
Some of our methods below don’t accept factor variables, so let’s change their type to numeric here. Note: If you are modifying this tutorial for your own application, make sure this is a valid step!
# Convert all columns to numeric and add a row ID
df <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
df <- df %>% mutate_if(is.character, as.numeric)
df <- df %>% rowid_to_column("ID")
Finally, let’s separate a portion of our dataset as a test set. Later we will use this subset to evaluate the performance of each method described below.
train_fraction <- 0.80 # Use train_fraction % of the dataset to train our models
df_train <- sample_frac(df, replace=F, size=train_fraction)
df_test <- anti_join(df,df_train, by = "ID") # need to check on larger datasets
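As a quick, purely illustrative sanity check (not part of the original tutorial), we can confirm that the split proportions match train_fraction:
# Illustrative sanity check on the split sizes.
nrow(df_train) / nrow(df) # should be close to 0.80
nrow(df_test) / nrow(df)  # should be close to 0.20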
Let’s establish some common notation and definitions. Each unit in our data set will be represented by:

- a vector of observable covariates, denoted \(X_{i}\);
- a binary treatment assignment indicator \(W_{i}\), equal to 1 if unit \(i\) received the Neighbors mailing and 0 otherwise;
- an observed outcome \(Y_{i}^{obs}\), equal to 1 if the person voted and 0 otherwise.
Throughout our analysis, we will often be talking in terms of counterfactual questions, e.g. “what would have happened if we had assigned the treatment to certain control units?” In order to express this mathematically, we will make use of the potential outcome framework of Rubin (1974). Let’s define the following random variables:

- \(Y_{i}(1)\): the potential outcome that unit \(i\) would attain if it were assigned to treatment;
- \(Y_{i}(0)\): the potential outcome that unit \(i\) would attain if it were assigned to control.
Naturally, we only ever get to observe one of these two for each unit, but it’s convenient to define both so that we can think about counterfactuals. In fact, we can think of much of causal inference as a “missing value” problem: there exists an underlying data process generating random variables \((X_{i}, Y_{i}(0), Y_{i}(1))\), but we can only observe the realization of \((X_{i}, Y_{i}(0))\) (for control units) or \((X_{i}, Y_{i}(1))\) (for treated units), with the remaining outcome being missing.
\(X_{i}\) | \(Y_{i}(0)\) | \(Y_{i}(1)\) |
---|---|---|
\(X_{1}\) | \(Y_{1}(0)\) | \(Y_{1}(1)\) |
\(X_{2}\) | \(Y_{2}(0)\) | \(Y_{2}(1)\) |
\(X_{3}\) | \(Y_{3}(0)\) | \(Y_{3}(1)\) |
\(\cdots\) | \(\cdots\) | \(\cdots\) |
\(X_{n}\) | \(Y_{n}(0)\) | \(Y_{n}(1)\) |
Using the potential outcome notation above, the observed outcome can also be written as
\[Y_{i}^{obs} = W_{i}Y_{i}(1) + (1-W_{i})Y_{i}(0)\]
In order to avoid clutter, from now on we’ll denote \(Y_{i}^{obs}\) simply by \(Y_{i}\).
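To make this notation concrete, here is a tiny simulated example, unrelated to the voting data; the variable names below are made up for illustration only.
# Toy illustration of the potential outcomes notation (simulated data).
n_toy <- 5
w_toy <- rbinom(n_toy, size = 1, prob = 0.5)       # treatment assignment
y0_toy <- rnorm(n_toy)                             # potential outcome under control
y1_toy <- y0_toy + 0.5                             # potential outcome under treatment
y_obs_toy <- w_toy * y1_toy + (1 - w_toy) * y0_toy # observed outcome: only this is seen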
We will be making two identification assumptions that will allow us to use the methods below.
The unconfoundedness assumption states that, once we condition on observable characteristics, the treatment assignment is independent of how each person would respond to the treatment. In other words, the rule that determines whether or not a person is treated depends only on their observable characteristics (and possibly on explicit randomization). This allows, for example, for experiments where people of different genders are treated with different probabilities, but it rules out experiments where people self-select into treatment based on some characteristic that is not observed in our data.
\[Y_i(1), Y_i(0) \perp W_i \ | \ X_i\]
In order to estimate the treatment effect for a person with particular characteristics \(X_{i} = x\), we need to ensure that we are able to observe treated and untreated people with those same characteristics so that we can compare their outcomes. The overlap assumption states that at every point of the covariate space we can always find treated and control individuals.
\[\forall \ x \in \text{supp}\ (X), \qquad 0 < P\ (W = 1 \ | \ X = x) < 1\]
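One informal way to probe this assumption in practice is to estimate propensity scores and check that they stay away from 0 and 1. The sketch below is not part of the original tutorial; it simply fits a logistic regression of W on the covariates in df_train and summarizes the fitted probabilities.
# Rough overlap diagnostic (a sketch, not part of the original analysis):
# estimate P(W = 1 | X) by logistic regression and inspect its range.
fmla_ps <- as.formula(paste("W ~", paste(covariate_names, collapse = " + ")))
propensity_model <- glm(fmla_ps, data = df_train, family = binomial())
phat <- predict(propensity_model, type = "response")
summary(phat) # fitted probabilities very close to 0 or 1 would signal overlap problems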
Note: Some of the datasets on the github webpage are completely randomized experiments. This assumption holds for those as well.
Reference: Athey and Imbens (PNAS, 2016)
Detecting heterogeneous treatment effects, i.e., differential effects of an intervention for certain subgroups of the population, can be very valuable in many areas of research. In medicine, for example, researchers might be interested in finding which subgroup of patients benefits most from getting a certain drug. At the same time, one might be worried about finding spurious effects just by estimating many different specifications, and this is why many scientific fields require pre-analysis plans for publications. However, researchers may find these plans restrictive since they cannot publish results for subgroups they did not anticipate before running the experiment.
Athey and Imbens (2016)’s causal trees provide a data-driven approach to partitioning the data into subgroups that differ by the magnitude of their treatment effects. Much like decision trees, which partition the covariate space by finding subgroups with similar outcomes, causal trees find subgroups with similar treatment effects. Moreover, even though this is an adaptive method, these subgroups do not need to be specified prior to the experiment.
In order to ensure valid estimates of the treatment effect within each subgroup, Athey and Imbens propose a sample-splitting approach that they refer to as honesty: a method is honest if it uses one subset of the data to estimate the model parameters, and a different subset to produce estimates given these estimated parameters. In the context of causal trees, honesty implies that the asymptotic properties of treatment effect estimates within leaves are the same as if the tree partition had been exogenously given, and it is one of the assumptions required to produce unbiased and asymptotically normal estimates of the treatment effect.
As we just explained, honesty requires us to set aside separate subsets of our training data for building the tree and for estimating effects within its leaves. We therefore divide the training sample into:

- df_split: the splitting sample, used to build the tree
- df_est: the estimation sample, used to compute the average treatment effect in each leaf

# Dividing the data 40%-40%-20% into splitting, estimation, and test samples
split_size <- floor(nrow(df_train) * 0.5)
df_split <- sample_n(df_train, replace=FALSE, size=split_size)
# Make the splits
df_est <- anti_join(df_train,df_split, by ="ID")
Begin by defining a formula containing only the outcome and the covariates.
fmla_ct <- paste("factor(Y) ~", paste(covariate_names, collapse = " + "))
print('This is our regression model')
## [1] "This is our regression model"
print( fmla_ct)
## [1] "factor(Y) ~ sex + yob + city + hh_size + totalpopulation_estimate + percent_male + median_age + percent_62yearsandover + percent_white + percent_black + percent_asian + median_income + employ_20to64 + highschool + bach_orhigher + percent_hispanicorlatino"
Next, we use the honest.causalTree function from the causalTree package. To ensure that honesty is enabled, the parameters for splitting and cross-validation below should not be changed. However, if your tree is not splitting at all, try decreasing the parameter minsize, which controls the minimum size of each leaf. For more details on the other parameters, please take a look at the extended documentation accompanying the causalTree package.
ct_unpruned <- honest.causalTree(
formula = fmla_ct, # Define the model
data = df_split, # Subset used to create tree structure
est_data = df_est, # Which data set to use to estimate effects
treatment = df_split$W, # Splitting sample treatment variable
est_treatment = df_est$W, # Estimation sample treatment variable
split.Rule = "CT", # Define the splitting option
cv.option = "TOT", # Cross validation options
cp = 0, # Complexity parameter
split.Honest = TRUE, # Use honesty when splitting
cv.Honest = TRUE, # Use honesty when performing cross-validation
minsize = 9, # Min. number of treatment and control cases in each leaf
HonestSampleSize = nrow(df_est)) # Num obs used in estimation after building the tree
The resulting object is an rpart object, so rpart methods for cross-validation and plotting can be used.
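For example, the following illustrative lines (not part of the original tutorial) confirm the class of the object and peek at the complexity-parameter table that the pruning step below relies on.
# Because ct_unpruned inherits from rpart, standard rpart tooling applies.
class(ct_unpruned)        # should include "rpart"
head(ct_unpruned$cptable) # cross-validated error by complexity parameter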
We must prune the tree by cross-validation to avoid overfitting. The honest cross-validation method selected above (and recommended) penalizes an estimate of the variance of the treatment effect estimates across leaves, and this estimate is computed using the estimation sample. The cv.option selected above (TOT) uses an unbiased estimate of the test mean-squared error.
# Table of cross-validated values by tuning parameter.
ct_cptable <- as.data.frame(ct_unpruned$cptable)
# Obtain optimal complexity parameter to prune tree.
selected_cp <- which.min(ct_cptable$xerror)
optim_cp_ct <- ct_cptable[selected_cp, "CP"]
# Prune the tree at optimal complexity parameter.
ct_pruned <- prune(tree = ct_unpruned, cp = optim_cp_ct)
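As an illustrative check, not in the original tutorial, we can count how many terminal nodes the pruned tree retains:
# Number of terminal nodes (leaves) in the pruned tree.
sum(ct_pruned$frame$var == "<leaf>")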
To predict the treatment effect on the estimation sample, use the predict function as below.
tauhat_ct_est <- predict(ct_pruned, newdata = df_est)
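All units in a given leaf receive the same estimate, so these predictions take only one distinct value per leaf. Tabulating them (an illustrative check, not part of the original tutorial) shows the leaf-level estimates and how many estimation-sample observations fall in each leaf.
# Each leaf yields a single treatment effect estimate.
table(round(tauhat_ct_est, 4))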
The causalTree package does not compute standard errors by default, but we can compute them using the following trick. First, define \(L_{\ell}\) as an indicator of assignment to leaf \(\ell\) and consider the following linear model.
\[\begin{align} Y = \sum_{\ell} L_{\ell} \alpha_{\ell} + \sum_{\ell} W \cdot L_{\ell} \beta_{\ell} \end{align}\]
# Create a factor column 'leaf' indicating leaf assignment
num_leaves <- length(unique(tauhat_ct_est)) # There are as many leaves as there are distinct predictions
df_est$leaf <- factor(tauhat_ct_est, labels = seq(num_leaves))
# Run the regression
ols_ct <- lm(as.formula("Y ~ 0 + leaf + W:leaf"), data = df_est) # Warning: the tree won't split for the charitable dataset
print(as.formula("Y ~ 0 + leaf + W:leaf"))
## Y ~ 0 + leaf + W:leaf
The interaction coefficients in this regression recover the average treatment effects in each leaf, since
\[\begin{align} E[Y \ | \ W=1, L_{1}=1] - E[Y \ | \ W=0, L_{1}=1] = (\alpha_{1} + \beta_{1}) - \alpha_{1} = \beta_{1} \end{align}\]
Therefore, the standard errors on these interaction coefficients are also the standard errors of the leaf-level treatment effect estimates. In the next subsection, we will also use these statistics to test hypotheses about the leaf estimates.
ols_ct_summary <- summary(ols_ct)
te_summary <- coef(ols_ct_summary)[(num_leaves+1):(2*num_leaves), c("Estimate", "Std. Error")]
  | Estimate | Std. Error |
---|---|---|
leaf1:W | -0.0324 | 0.0918 |
leaf2:W | 0.0346 | 0.0271 |
leaf3:W | 0.0418 | 0.0259 |
leaf4:W | 0.0794 | 0.0059 |
leaf5:W | 0.0984 | 0.1367 |
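The car package loaded at the top can be used to test linear hypotheses about these coefficients. As a preview of the next subsection, the illustrative line below tests whether the treatment effects in leaf 1 and leaf 2 are equal; it assumes the coefficient names match those shown in the table above.
# Illustrative test: are the leaf 1 and leaf 2 treatment effects equal?
linearHypothesis(ols_ct, "leaf1:W = leaf2:W")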
To predict the treatment effect on new, entirely unseen data, use the predict function as below.
tauhat_ct_test <- predict(ct_pruned, newdata = df_test)
A natural place to begin is by plotting the (pruned) tree. We can use the rpart.plot function from the rpart.plot package.
rpart.plot(
x = ct_pruned, # Pruned tree
type = 3, # Draw separate split labels for the left and right directions
fallen = TRUE, # Position the leaf nodes at the bottom of the graph
leaf.round = 1, # Rounding of the corners of the leaf node boxes
extra = 100, # Display the percentage of observations in the node
branch = 0.1, # Shape of the branch lines
  box.palette = "RdBu") # Palette for coloring the nodes