This material is an edited compilation of two original documents by Susan Athey, Stefan Wager, Nicolaj Nørgaard Mühlbach, Xinkun Nie, Vitor Hadad, Matthew Schaelling, Kaleb Javier, and Niall Keleher.

Introduction and Setup

Loading packages

Before beginning the tutorial, let’s make sure all necessary packages are installed. Try running the cell below and, if any issues arise, follow the instructions within it.

CRAN packages: If any of these packages are not installed, write install.packages("<name of package>"). For example, to install the package fBasics, use install.packages("fBasics"). The number in parentheses in the comments is the package version that was used when this tutorial was compiled. If you run into issues, consider upgrading or downgrading to that package version.
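
As a convenience, the snippet below (not part of the original tutorial) installs in one go any of the CRAN packages listed in the next chunk that are still missing:

# Sketch (not in the original): install any missing CRAN packages in one go
cran_packages <- c("tidyverse", "tidyselect", "dplyr", "fBasics", "corrplot", "psych",
                   "grf", "rpart", "rpart.plot", "treeClust", "car", "devtools", "readr",
                   "tidyr", "tibble", "knitr", "kableExtra", "ggplot2", "haven",
                   "aod", "evtree", "purrr")
missing_packages <- setdiff(cran_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)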

library(tidyverse)   # Collection of data-science packages
library(tidyselect)  # Helpers for selecting variables
library(dplyr)       # Data manipulation (0.8.0.1)
library(fBasics)     # Summary statistics (3042.89)
library(corrplot)    # Correlations (0.84)
library(psych)       # Correlation p-values (1.8.12)
library(grf)         # Generalized random forests (0.10.2)
library(rpart)       # Classification and regression trees, or CART (4.1-13)
library(rpart.plot)  # Plotting trees (3.0.6)
library(treeClust)   # Predicting leaf position for causal trees (1.1-7)
library(car)         # Linear hypothesis testing for causal tree (3.0-2)
library(devtools)    # Install packages from GitHub (2.0.1)
library(readr)       # Reading csv files (1.3.1)
library(tidyr)       # Database operations (0.8.3)
library(tibble)      # Modern alternative to data frames (2.1.1)
library(knitr)       # RMarkdown (1.21)
library(kableExtra)  # Prettier RMarkdown (1.0.1)
library(ggplot2)     # General plotting tool (3.1.0)
library(haven)       # Read Stata files (2.0.0)
library(aod)         # Hypothesis testing (1.3.1)
library(evtree)      # Evolutionary learning of globally optimal trees (1.0-7)
library(purrr)       # Functional programming tools

select <- dplyr::select

# Set seed for reproducibility
set.seed(2025)

Non-CRAN packages: The packages below have not yet been uploaded to CRAN or any other major package repository, but we can grab them from their authors’ GitHub pages. If you don’t have them installed already, uncomment the relevant line below to install them.

# For causal trees (Athey and Imbens, 2016)  version 0.0
# install_github('susanathey/causalTree') # Uncomment this to install the causalTree package
library(causalTree)

The economic context in which we will be working is inspired by Gerber, Green, and Larimer (2008)’s paper “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment” (see article). The paper begins by noting that voter turnout is hard to explain with theories based on rational self-interest: the observable payoff to voting seems so small that turnout should be much lower than what we actually see. It could be, then, that voters receive some unobserved utility from voting – the satisfaction of having fulfilled their civic duty – or that voters feel pressured by their peers to exercise their right to vote. The authors are interested in the latter effect and pose the question: to what extent do social norms cause voter turnout? In other words, we would like to quantify the effect of social pressure on voter participation.

For this experiment, a large number of voters were randomly divided into several groups. For our purposes, we only need to know that there was a “control” group that did not receive anything, and a specific “treatment” group that received a message stating that, after the election, the recent voting record of everyone in their household would be sent to all their neighbors – we will call this the Neighbors mailing. This mailing maximized social pressure on potential voters, since their peers would be able to see whether or not they voted.

The dataset is publicly available online. In this tutorial, we will use the following variables from it.

  • Response variable
    • outcome_voted: Indicator variable where \(= 1\) indicates voted in the August 2006 primary
  • Treatment variable
    • treat_neighbors: Indicator variable where \(= 1\) indicates Neighbors mailing treatment
  • Covariates
    • sex: Indicator variable where \(= 1\) indicates male
    • yob: Year of birth
    • city: City index
    • hh_size: Household size
    • totalpopulation_estimate: Estimate of city population
    • percent_male: Percentage males in household
    • median_age: Median age in household
    • median_income: Median income in household
    • percent_62yearsandover: Percentage of subjects aged 62 and over
    • percent_white: Percentage white in household
    • percent_black: Percentage black in household
    • percent_asian: Percentage asian in household
    • percent_hispanicorlatino: Percentage hispanic or latino in household
    • employ_20to64: Percentage of employed subjects aged 20 to 64
    • highschool: Percentage having only high school degree
    • bach_orhigher: Percentage having bachelor degree or higher

Data Cleaning

Below, we load the data from a .csv file, rename the response and the treatment variable to \(Y\) and \(W\), respectively, extract the relevant covariates outlined above, and omit observations with NA entries.

# Clear any existing variables
rm(list = ls())

# Load raw data
df0 <- readr::read_csv(file = 'socialpressnofact.csv', na = character())

# Specify outcome, treatment, and covariate variable names to use
outcome_variable_name <- "outcome_voted"
treatment_variable_name <- "treat_neighbors"
covariate_names <-c("sex", "yob", "city", "hh_size", "totalpopulation_estimate",
                        "percent_male", "median_age", "percent_62yearsandover",
                        "percent_white", "percent_black", "percent_asian", "median_income",
                        "employ_20to64", "highschool", "bach_orhigher","percent_hispanicorlatino")
all_variables_names <- c(outcome_variable_name, treatment_variable_name, covariate_names)
df <- df0 %>% select(all_of(all_variables_names))

# Drop rows containing missing values
df <- df %>% drop_na()

For convenience, let’s also change the names of the treatment and outcome variables. This is just to make our analysis code below simpler, because we won’t have to care about variable names. The treatment will be denoted W, and the outcome will be denoted Y (we’ll have more to say about notation in the next section).

# Rename variables
df <- df %>% rename(Y = all_of(outcome_variable_name), W = all_of(treatment_variable_name))

Some of our methods below don’t accept factor variables, so let’s change their type to numeric here. Note: If you are modifying this tutorial for your own application, make sure this is a valid step!

# Convert all columns to numeric and add a row id
df <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
df <- df %>% rowid_to_column("ID")

Finally, let’s separate a portion of our dataset as a test set. Later we will use this subset to evaluate the performance of each method described below.

train_fraction <- 0.80  # Fraction of the dataset used to train our models

df_train <- sample_frac(df, replace = FALSE, size = train_fraction)
df_test <- anti_join(df, df_train, by = "ID") # need to check on larger datasets

Notation

Let’s establish some common notation and definitions. Each unit in our data set will be represented by:

  • \(X_{i}\): a \(p\)-dimensional vector of observable pre-treatment characteristics
  • \(W_{i} \in \{0, 1\}\): a binary variable indicating whether the individual was treated (\(1\)) or not (\(0\))
  • \(Y_{i}^{obs} \in \mathbb{R}\): a real-valued variable indicating the observed outcome for that individual

Throughout our analysis, we will often be talking in terms of counterfactual questions, e.g. “what would have happened if we had assigned the treatment to certain control units?” In order to express this mathematically, we will make use of the potential outcome framework of Rubin (1974). Let’s define the following random variables:

  • \(Y_{i}(1)\): the outcome unit \(i\) would attain if they received the treatment
  • \(Y_{i}(0)\): the outcome unit \(i\) would attain if they were part of the control group

Naturally, we only ever get to observe one of these two outcomes for each unit, but it’s convenient to define both so that we can think about counterfactuals. In fact, we can think of much of causal inference as a “missing value” problem: there exists an underlying process generating the random variables \((X_{i}, Y_{i}(0), Y_{i}(1))\), but we can only observe the realization of \((X_{i}, Y_{i}(0))\) (for control units) or \((X_{i}, Y_{i}(1))\) (for treated units), with the remaining potential outcome being missing.

\[\begin{array}{ccc}
X_{i} & Y_{i}(0) & Y_{i}(1) \\
\hline
X_{1} & Y_{1}(0) & Y_{1}(1) \\
X_{2} & Y_{2}(0) & Y_{2}(1) \\
X_{3} & Y_{3}(0) & Y_{3}(1) \\
\cdots & \cdots & \cdots \\
X_{n} & Y_{n}(0) & Y_{n}(1)
\end{array}\]

Using the potential outcome notation above, the observed outcome can also be written as

\[Y_{i}^{obs} = W_{i}Y_{i}(1) + (1-W_{i})Y_{i}(0)\]

In order to avoid clutter, from now on we’ll denote \(Y_{i}^{obs}\) simply by \(Y_{i}\).
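
To make the potential outcome notation concrete, here is a tiny simulated example (not the voting data) showing how the observed outcome is assembled from the two potential outcomes:

# Tiny simulated illustration (not the voting data) of the observed-outcome formula
n_sim  <- 5
W_sim  <- rbinom(n_sim, size = 1, prob = 0.5)    # treatment assignment
Y0_sim <- rnorm(n_sim)                           # potential outcome under control
Y1_sim <- Y0_sim + 0.5                           # potential outcome under treatment
Y_sim  <- W_sim * Y1_sim + (1 - W_sim) * Y0_sim  # observed outcome: Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)
data.frame(W_sim, Y0_sim, Y1_sim, Y_sim)         # in practice only Y_sim is observed, never both potential outcomes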

Assumptions

We will be making two identification assumptions that will allow us to use the methods below.

Assumption 1: Unconfoundedness

The unconfoundedness assumption states that, once we condition on observable characteristics, the treatment assignment is independent of how each person would respond to the treatment. In other words, whether or not a person is treated depends only on their observable characteristics. This allows, for example, for experiments where people of different genders are treated with different probabilities, but it rules out experiments where people self-select into treatment because of some characteristic that is not observed in our data.

\[Y_i(1), Y_i(0) \perp W_i \ | \ X_i\]

Assumption 2: Overlap

In order to estimate the treatment effect for a person with particular characteristics \(X_{i} = x\), we need to ensure that we are able to observe treated and untreated people with those same characteristics so that we can compare their outcomes. The overlap assumption states that at every point of the covariate space we can always find treated and control individuals.

\[\forall \ x \in \text{supp}\ (X), \qquad 0 < P\ (W = 1 \ | \ X = x) < 1\]

Note: Some of the datasets on the GitHub page are completely randomized experiments. This assumption holds for those as well.
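
Although we will not rely on it below, one informal way to probe the overlap assumption is to estimate the propensity score \(P(W = 1 \mid X = x)\) with a logistic regression and check that the fitted probabilities stay away from 0 and 1. A minimal sketch (not part of the original analysis), using the df_train and covariate_names objects created above:

# Sketch: estimate propensity scores and check they are bounded away from 0 and 1
fmla_ps <- as.formula(paste("W ~", paste(covariate_names, collapse = " + ")))
propensity_model <- glm(fmla_ps, data = df_train, family = binomial())
ehat <- predict(propensity_model, type = "response")
summary(ehat)  # the minimum and maximum should not be too close to 0 or 1
hist(ehat, main = "Estimated propensity scores", xlab = "P(W = 1 | X)")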

Causal Trees

Reference: Athey and Imbens (PNAS, 2016)

Detecting heterogeneous treatment effects, i.e., differential effects of an intervention across subgroups of the population, can be very valuable in many areas of research. In medicine, for example, researchers might be interested in finding which subgroup of patients benefits most from a certain drug. At the same time, one might worry about finding spurious effects simply by estimating many different specifications, which is why many scientific fields require pre-analysis plans for publication. However, researchers may find these plans restrictive, since they cannot report results for subgroups they did not anticipate before running the experiment.

Athey and Imbens (2016)’s causal trees provide a data-driven approach to partitioning the data into subgroups that differ by the magnitude of their treatment effects. Much like decision trees, which partition the covariate space by finding subgroups with similar outcomes, causal trees find subgroups with similar treatment effects. Moreover, even though this is an adaptive method, these subgroups do not need to be specified prior to the experiment.

In order to ensure valid estimates of the treatment effect within each subgroup, Athey and Imbens propose a sample-splitting approach that they refer to as honesty: a method is honest if it uses one subset of the data to estimate the model parameters, and a different subset to produce estimates given these estimated parameters. In the context of causal trees, honesty implies that the asymptotic properties of treatment effect estimates within leaves are the same as if the tree partition had been exogenously given, and it is one of the assumptions required to produce unbiased and asymptotically normal estimates of the treatment effect.

Step 1: Split the dataset

As we just explained, honesty requires us to use different subsets of our training data to build the tree and to estimate the treatment effects within its leaves.

  • df_split: the splitting sample, used to build the tree
  • df_est: the estimation sample, used to compute the average treatment effect in each leaf
# Divide the training data in half, so the full dataset is split 40%-40%-20%
# into splitting, estimation, and test samples
split_size <- floor(nrow(df_train) * 0.5)
df_split <- sample_n(df_train, replace = FALSE, size = split_size)

# Make the splits
df_est <- anti_join(df_train, df_split, by = "ID")

Step 2: Fit the tree

Begin by defining a formula containing only the outcome and the covariates.

fmla_ct <- paste("factor(Y) ~", paste(covariate_names, collapse = " + "))

print('This is our regression model')
## [1] "This is our regression model"
print( fmla_ct)
## [1] "factor(Y) ~ sex + yob + city + hh_size + totalpopulation_estimate + percent_male + median_age + percent_62yearsandover + percent_white + percent_black + percent_asian + median_income + employ_20to64 + highschool + bach_orhigher + percent_hispanicorlatino"

Next, we use the honest.causalTree function from the causalTree package. To ensure that honesty is enabled, the parameters for splitting and cross-validation below should not be changed. However, if your tree is not splitting at all, try decreasing the parameter minsize that controls the minimum size of each leaf.

For more details on other parameters, please take a look at this extended documentation for the causalTree paper.

ct_unpruned <- honest.causalTree(
  formula = fmla_ct,            # Define the model
  data = df_split,              # Subset used to create tree structure
  est_data = df_est,            # Which data set to use to estimate effects

  treatment = df_split$W,       # Splitting sample treatment variable
  est_treatment = df_est$W,     # Estimation sample treatment variable

  split.Rule = "CT",            # Define the splitting option
  cv.option = "TOT",            # Cross validation options
  cp = 0,                       # Complexity parameter

  split.Honest = TRUE,          # Use honesty when splitting
  cv.Honest = TRUE,             # Use honesty when performing cross-validation

  minsize = 9,                 # Min. number of treatment and control cases in each leaf
  HonestSampleSize = nrow(df_est)) # Num obs used in estimation after building the tree

The resulting object is an rpart object, so rpart methods for cross-validation and plotting can be used.
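
For instance (a quick check that is not part of the original code), the usual rpart utilities can be applied directly to the fitted tree:

# The fitted tree is stored as an rpart object, so standard rpart utilities apply
class(ct_unpruned)    # the tree is stored as an rpart object
printcp(ct_unpruned)  # complexity-parameter table used for pruning in the next step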

Step 3: Cross-validate

We must prune the tree via cross-validation to avoid overfitting. The honest cross-validation method selected above (and recommended) penalizes an estimate of the variance of the treatment effect estimates across leaves, and this estimate is computed on the estimation sample. The cv.option selected above ("TOT") uses an unbiased estimate of the test mean-squared error.

# Table of cross-validated values by tuning parameter.
ct_cptable <- as.data.frame(ct_unpruned$cptable)

# Obtain optimal complexity parameter to prune tree.
selected_cp <- which.min(ct_cptable$xerror)
optim_cp_ct <- ct_cptable[selected_cp, "CP"]

# Prune the tree at optimal complexity parameter.
ct_pruned <- prune(tree = ct_unpruned, cp = optim_cp_ct)

Step 4: Predict point estimates (on estimation sample)

To predict the treatment effect on the estimation sample, use the function predict as below.

tauhat_ct_est <- predict(ct_pruned, newdata = df_est)

Step 5: Compute standard errors

The causalTree package does not compute standard errors by default, but we can compute them using the following trick. First, define \(L_{\ell}\) to indicate assignment to leaf \(\ell\) and consider the following linear model.

\[\begin{align} Y = \sum_{\ell} \left( L_{\ell}\,\alpha_{\ell} + W \cdot L_{\ell}\,\beta_{\ell} \right) \end{align}\]

# Create a factor column 'leaf' indicating leaf assignment
num_leaves <- length(unique(tauhat_ct_est))  # There are as many leaves as distinct predicted values

df_est$leaf <- factor(tauhat_ct_est, labels = seq(num_leaves))

# Run the regression
ols_ct <- lm(as.formula("Y ~ 0 + leaf + W:leaf"), data = df_est)  # Warning: the tree won't split for the charitable dataset
print(as.formula("Y ~ 0 + leaf + W:leaf"))
## Y ~ 0 + leaf + W:leaf

The interaction coefficients in this regression recover the average treatment effects in each leaf, since

\[\begin{align} E[Y|W=1, L=1] - E[Y|W=0, L=1] = (\alpha_{1} + \beta_{1}) - (\alpha_{1}) = \beta_{1} \end{align}\]

Therefore, the standard errors of these interaction coefficients are also the standard errors of the leaf-level treatment effect estimates. In the next subsection, we will also use these statistics to test hypotheses about the leaf estimates.

ols_ct_summary <- summary(ols_ct)
te_summary <- coef(ols_ct_summary)[(num_leaves+1):(2*num_leaves), c("Estimate", "Std. Error")]
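
The table below was presumably rendered with the knitr/kableExtra functions loaded earlier; the exact rendering code is not shown in the original, but a minimal sketch along those lines would be:

# Sketch (rendering code not shown in the original): display the leaf-level estimates
kable(te_summary, digits = 4, caption = "Average treatment effects per leaf") %>%
  kable_styling(full_width = FALSE)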

Average treatment effects per leaf

            Estimate   Std. Error
  leaf1:W    -0.0324       0.0918
  leaf2:W     0.0346       0.0271
  leaf3:W     0.0418       0.0259
  leaf4:W     0.0794       0.0059
  leaf5:W     0.0984       0.1367

Step 6: Predict point estimates (on test set)

To predict treatment effects on new, entirely unseen data, use the function predict as below.

tauhat_ct_test <- predict(ct_pruned, newdata = df_test)

Assessing heterogeneity

A natural place to begin is by plotting the (pruned) tree. We can use the rpart.plot function from the rpart.plot package.

rpart.plot(
  x = ct_pruned,           # Pruned tree
  type = 3,                # Draw separate split labels for the left and right directions
  fallen.leaves = TRUE,    # Position the leaf nodes at the bottom of the graph
  leaf.round = 1,          # Rounding of the corners of the leaf node boxes
  extra = 100,             # Display the percentage of observations in the node
  branch = 0.1,            # Shape of the branch lines
  box.palette = "RdBu")    # Palette for coloring the nodes