Calibration Plots

Author

Wooyong Park

Published

November 30, 2025

Update Notes

  • 2025-09-10: Fixed Standard Error Miscalculations
  • 2025-11-30: Replaced equal-distance bins with equal-size bins; added a section for continuous outcomes

Limitations

  • In Section 4, due to the low predictive power of the covariates, the calibration plots are not very informative. We should replace this dataset with one that has a stronger relationship between the covariates and the outcome.

Introduction

Calibration plots, also known as reliability plots, display the empirical probability against the predicted probability. This is typically done by grouping predictions into bins and computing, within each bin, the average predicted probability and the empirical probability. A well-crafted calibration plot helps us understand a model’s performance and compare it to that of other models.
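As a concrete illustration of this binning idea, here is a minimal sketch on simulated data (the variable names and the equal-size binning are illustrative, not part of the tutorial’s datasets):

```python
import numpy as np

# Minimal sketch of the core computation behind a calibration plot:
# group predictions into equal-size bins, then compare the average
# predicted probability with the empirical frequency in each bin.
rng = np.random.default_rng(0)
pred = rng.uniform(size=1000)        # toy predicted probabilities
actual = rng.binomial(1, pred)       # outcomes drawn from those probabilities

n_bins = 10
order = np.argsort(pred)
bins = np.array_split(order, n_bins) # equal-size bins by predicted probability

avg_pred = np.array([pred[b].mean() for b in bins])
emp_prob = np.array([actual[b].mean() for b in bins])

for p, e in zip(avg_pred, emp_prob):
    print(f"predicted {p:.2f}  empirical {e:.2f}")
```

Because the outcomes here are drawn from the predicted probabilities themselves, the two columns should agree up to sampling noise, which is exactly the pattern a calibration plot visualizes.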

In this tutorial, we cover how to create calibration plots for different types of outcomes. Let’s begin by loading the necessary packages and setting the seed. This tutorial provides parallel code in both R and Python.

NB: This tutorial has scrollable code blocks. Please scroll down within each block to see the full code.

Setting Up

We start by loading the necessary packages, setting the seed, and configuring the plotting environment.

Code
# load packages
library(ggplot2)
library(dplyr)
library(readr)
library(grf)
library(nnet)
library(e1071)
library(plotly)
library(shiny)
library(htmlwidgets)
library(kableExtra)

# set seed
set.seed(42)

# set theme for ggplot2
theme_set(theme_minimal(base_size = 14) + theme(legend.position = "bottom"))
Code
# load packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import calibration_curve
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
import ipywidgets as widgets
from IPython.display import display
import patchworklib as pw

# set seed
np.random.seed(42)

# set matplotlib font size
plt.rcParams.update({'font.size': 14})

Binary Outcomes

Data

We use the abridged version of the General Social Survey (GSS) (Smith, 2016) dataset. In this dataset, individuals were assigned to treatment or control randomly with equal probability.

Individuals were asked about their thoughts on government spending on the social safety net. The treatment is the wording of the question: about half of the individuals were asked if they thought government spends too much on “welfare” (W_i=1), while the remaining half was asked about “assistance to the poor”(W_i=0). The outcome is binary, with Y_i=1 corresponding to a positive answer. In the data set below, we also collect a few other demographic covariates.

Code
# Read in data
data <- read.csv("https://docs.google.com/uc?id=1AQva5-vDlgBcM_Tv9yrO8yMYRfQJgqo_&export=download")
n <- nrow(data)


# Treatment: does the gov't spend too much on "welfare" (1) or "assistance to the poor" (0)
treatment <- "w"

# Outcome: 1 for 'yes', 0 for 'no'
outcome <- "y"

# Additional covariates
covariates <- c("age", "polviews", "income", "educ", "marital", "sex")
Code

# Read in data
data = pd.read_csv("https://docs.google.com/uc?id=1AQva5-vDlgBcM_Tv9yrO8yMYRfQJgqo_&export=download")
n = len(data)

# Treatment: does the gov't spend too much on "welfare" (1) or "assistance to the poor" (0)
treatment = "w"

# Outcome: 1 for 'yes', 0 for 'no'
outcome = "y"

# Additional covariates
covariates = ["age", "polviews", "income", "educ", "marital", "sex"]

Predictions

Our goal is to predict the counterfactual outcome Y_i(0)—the outcome if an individual were assigned to the control group—for each observation. To achieve this, we restrict our analysis to the control group (W_i=0) and train two models using the observed covariates: a Random Forest and a Naive Bayes classifier.

A common methodological flaw in this context is generating calibration plots from a model fitted on the entire dataset. This is equivalent to reporting in-sample performance: an overfit model will appear far better calibrated than it actually is on unseen data.

To avoid this and ensure valid evaluation, we partition the control group into distinct training and testing sets. We fit the models on the training set and generate calibration plots exclusively using the testing set.

For reproducibility, we designate the first 10,000 observations of the control group as the training set, while the remaining observations serve as the testing set.

Code
# Data preparation
data_c <- data[data[,treatment] == 0,] |> select(all_of(c(outcome, covariates)))
n_control <- nrow(data_c)
data_c_training <- data_c[1:10000,]
data_c_testing <- data_c[10001:n_control,]


# Random forest
forest <- grf::regression_forest(
  X = data_c_training[,covariates], 
  Y = data_c_training[,outcome], 
  honesty = FALSE,
  tune.parameters = "all", 
  num.trees = 1000, 
  seed = 42
  )

# Naive Bayes
nb <- naiveBayes(
  formula = as.formula(paste0(outcome, " ~ ", paste0(covariates, collapse = "+"))),
  data = data_c_training
)
Code
# Data preparation
data_c = data[data[treatment] == 0][[outcome, *covariates]]
n_control = len(data_c)
data_c_training = data_c.iloc[:10000]
data_c_testing = data_c.iloc[10000:n_control]

# Random forest
forest = RandomForestClassifier(random_state=42, n_estimators=1000)
forest.fit(data_c_training[covariates], data_c_training[outcome])
RandomForestClassifier(n_estimators=1000, random_state=42)
Code
# Naive Bayes
nb = GaussianNB()
nb.fit(data_c_training[covariates], data_c_training[outcome])
GaussianNB()

Calibration Plots

Now let us create calibration plots with 10 bins. First, we generate predictions on the test data for both the Random Forest and Naive Bayes models. Then, we group the predicted probabilities into 10 equal-size bins and calculate the average predicted probability and the empirical probability of the observations in each bin; these bin-level averages are the points on the calibration plot.

Code
# Generate predictions on test data
forest_pred <- predict(forest, data_c_testing[,covariates])$predictions
nb_pred <- predict(nb, data_c_testing, type = "raw")[,2]

# Create data frame for plotting
plot_data <- data.frame(
  actual = data_c_testing[,outcome],
  forest_pred = forest_pred,
  nb_pred = nb_pred
)

# Display Predictions vs Actual
kableExtra::kable(head(plot_data, 10), caption = "Predictions vs Actual Data") |> kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(width = "100%", height = "500px")
Predictions vs Actual Data
actual forest_pred nb_pred
0 0.4021113 0.5437730
1 0.5778866 0.7078813
0 0.4961295 0.5806107
0 0.4897356 0.6224092
1 0.4060207 0.0060097
0 0.4388465 0.5554696
1 0.5107301 0.5336652
0 0.4912047 0.4704317
0 0.4955906 0.5773510
1 0.5368823 0.6586107
Code
# Simple calibration plot with 10 bins
n_bins <- 10

# Bin the predictions
rf_bins <- dplyr::ntile(plot_data$forest_pred, n_bins)
nb_bins <- dplyr::ntile(plot_data$nb_pred, n_bins)

# Calculate the statistics for the Random Forest model
rf_bin_stats <- tibble(
    bin_center = tapply(plot_data$forest_pred, rf_bins, mean),
    empirical_prob = tapply(plot_data$actual, rf_bins, mean),
    count = as.numeric(table(rf_bins)),
    empirical_se = tapply(plot_data$actual, rf_bins, sd)/sqrt(count)
  )

# Calculate the statistics for the Naive Bayes model
nb_bin_stats <- tibble(
    bin_center = tapply(plot_data$nb_pred, nb_bins, mean),
    empirical_prob = tapply(plot_data$actual, nb_bins, mean),
    count = as.numeric(table(nb_bins)),
    empirical_se = tapply(plot_data$actual, nb_bins, sd)/sqrt(count)
  )

# Combine the statistics for the two models
bin_stats <-  mutate(rf_bin_stats, model = "Random Forest") |> 
  bind_rows(mutate(nb_bin_stats, model = "Naive Bayes"))

# Create the calibration plot
ggplot(bin_stats) +
  geom_abline(aes(slope = 1, intercept = 0, color = "Perfect Calibration"), linetype = "dashed") +
  geom_ribbon(aes(x = bin_center, y = empirical_prob, fill = model,
                    ymin = empirical_prob - 1.96 * empirical_se, 
                    ymax = empirical_prob + 1.96 * empirical_se), alpha = 0.2) +
  geom_point(aes(x=bin_center, y=empirical_prob, color = model), size = 2, alpha = 0.5) +
  scale_color_manual(values = c("Perfect Calibration" = "red", "Random Forest" = "blue", "Naive Bayes" = "green")) +
  scale_fill_manual(values = c("Perfect Calibration" = "red", "Random Forest" = "blue", "Naive Bayes" = "green")) +
  labs(title = paste0("Calibration Plot with ", n_bins, " bins"), x=NULL, y=NULL, color = NULL) + guides(fill = "none")

Code
# Generate predictions on test data
forest_pred = forest.predict_proba(data_c_testing[covariates])[:, 1]
nb_pred = nb.predict_proba(data_c_testing[covariates])[:, 1]

# Create data frame for plotting
plot_data = pd.DataFrame({
    'actual': data_c_testing[outcome],
    'forest_pred': forest_pred,
    'nb_pred': nb_pred
    })

# Display Predictions vs Actual
print("Predictions vs Actual Data:")
Predictions vs Actual Data:
Code
print(plot_data.head(10))
       actual  forest_pred   nb_pred
20767       1     0.309000  0.707901
20769       0     0.693502  0.580603
20770       0     0.832964  0.622424
20771       1     0.255000  0.006002
20776       0     0.369000  0.555478
20778       1     0.738960  0.533656
20779       0     0.971583  0.470412
20780       0     0.722167  0.577366
20781       1     0.501414  0.658634
20782       1     0.951133  0.740106
Code
# Simple calibration plot with 10 bins
n_bins = 10

# Bin the predictions
rf_bins = pd.qcut(plot_data['forest_pred'], q=n_bins, labels=False, duplicates='drop')
nb_bins = pd.qcut(plot_data['nb_pred'], q=n_bins, labels=False, duplicates='drop')

# Calculate the statistics for the Random Forest model
rf_bin_stats = pd.DataFrame({
    'bin_center': plot_data['forest_pred'].groupby(rf_bins).mean(),
    'empirical_prob': plot_data['actual'].groupby(rf_bins).mean(),
    'count': rf_bins.value_counts().sort_index(),
    'empirical_se': plot_data['actual'].groupby(rf_bins).std() / np.sqrt(rf_bins.value_counts().sort_index())
})

# Calculate the statistics for the Naive Bayes model
nb_bin_stats = pd.DataFrame({
    'bin_center': plot_data['nb_pred'].groupby(nb_bins).mean(),
    'empirical_prob': plot_data['actual'].groupby(nb_bins).mean(),
    'count': nb_bins.value_counts().sort_index(),
    'empirical_se': plot_data['actual'].groupby(nb_bins).std() / np.sqrt(nb_bins.value_counts().sort_index())
})

# Combine the statistics for the two models
rf_bin_stats['model'] = 'Random Forest'
nb_bin_stats['model'] = 'Naive Bayes'
bin_stats = pd.concat([rf_bin_stats, nb_bin_stats], ignore_index=True)

# Create the calibration plot
fig, ax = plt.subplots(figsize=(12, 8))

# Add perfect calibration line
ax.plot([0, 1], [0, 1], 'r--', label='Perfect Calibration', linewidth=2)

# Plot Random Forest with ribbons
rf_data = bin_stats[bin_stats['model'] == 'Random Forest']
ax.scatter(rf_data['bin_center'], rf_data['empirical_prob'], 
          c='blue', label='Random Forest', alpha = 0.5, s=2)
ax.fill_between(rf_data['bin_center'], 
                rf_data['empirical_prob'] - 1.96 * rf_data['empirical_se'],
                rf_data['empirical_prob'] + 1.96 * rf_data['empirical_se'],
                alpha=0.2, color='blue')

# Plot Naive Bayes with ribbons
nb_data = bin_stats[bin_stats['model'] == 'Naive Bayes']
ax.scatter(nb_data['bin_center'], nb_data['empirical_prob'], 
          c='green', label='Naive Bayes', alpha = 0.5, s=2)
ax.fill_between(nb_data['bin_center'], 
                nb_data['empirical_prob'] - 1.96 * nb_data['empirical_se'],
                nb_data['empirical_prob'] + 1.96 * nb_data['empirical_se'],
                alpha=0.2, color='green')

# Formatting
ax.set_ylim(0, 1)
Code
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Empirical Probability')
ax.set_title(f'Calibration Plot with {n_bins} bins')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In the figure above, the X-axis represents the predicted probability, while the Y-axis displays the empirical probability. Each point on the plot corresponds to the average predicted and empirical probabilities within its respective bin, and the ribbons illustrate the 95% confidence intervals for the empirical probability.

We can assess how well-calibrated our models are by comparing the points to the 45-degree red dashed line:

  • Perfect calibration: Points lie on the 45-degree red dashed line
  • Overconfident: Points lie below the line (predicted probabilities are too high)
  • Underconfident: Points lie above the line (predicted probabilities are too low)
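To make the overconfident case concrete, here is a small simulated sketch (hypothetical data, not the GSS sample): inflating the predictions by a constant pushes every bin below the diagonal.

```python
import numpy as np

# A model whose predicted probabilities are systematically too high
# shows empirical frequencies below the 45-degree line in every bin.
rng = np.random.default_rng(1)
true_prob = rng.uniform(0.1, 0.6, size=2000)
actual = rng.binomial(1, true_prob)
pred = np.clip(true_prob + 0.2, 0, 1)   # predictions inflated by 0.2

order = np.argsort(pred)
bins = np.array_split(order, 10)        # 10 equal-size bins
avg_pred = np.array([pred[b].mean() for b in bins])
emp_prob = np.array([actual[b].mean() for b in bins])

# Every bin falls below the diagonal: this model is overconfident.
print(np.all(emp_prob < avg_pred))      # prints True
```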

A key decision: how to divide the predicted probabilities into bins

When creating calibration plots, there are two major considerations:

  1. the number of bins (how many buckets?)
  2. how to divide the predicted probabilities into bins (how to allocate the observations to the buckets?)

Throughout this tutorial, the observations are divided into equal-size bins. This is common practice in research involving calibration plots (Niculescu-Mizil and Caruana (2005); Athey et al. (2025)).

However, this choice is not the only one: the predicted probabilities can instead be divided into equal-width bins. In Python’s sklearn.calibration.CalibrationDisplay, for instance, the strategy argument lets you choose either option:

strategy: {‘uniform’, ‘quantile’}, default=‘uniform’
Strategy used to define the widths of the bins.

  • ‘uniform’: The bins have identical widths.
  • ‘quantile’: The bins have the same number of samples and depend on predicted probabilities.
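As a quick illustration of the two strategies, the snippet below bins the same simulated predictions both ways with sklearn’s calibration_curve, which accepts the same strategy argument (the data here are simulated stand-ins):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# The same predictions binned with the two strategies sklearn offers.
rng = np.random.default_rng(42)
pred = rng.beta(2, 5, size=1000)   # skewed predicted probabilities
actual = rng.binomial(1, pred)

prob_true_u, prob_pred_u = calibration_curve(actual, pred, n_bins=10, strategy='uniform')
prob_true_q, prob_pred_q = calibration_curve(actual, pred, n_bins=10, strategy='quantile')

# 'quantile' yields n_bins points with equal counts; 'uniform' may return
# fewer points when some equal-width bins contain no observations.
print(len(prob_pred_u), len(prob_pred_q))
```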

Adjusting the number of bins

The confidence intervals in the figures above are calculated using the standard error of the empirical probability, as mentioned above. Recall that the standard error is calculated as:

\text{SE} = \frac{\text{standard deviation}}{\sqrt{n}}

This yields wider confidence intervals for bins with fewer observations or a higher standard deviation. The right number of bins is therefore a trade-off: more bins resolve the predicted probabilities more finely, but leave fewer observations (and thus wider intervals) in each bin. To see this, the following code chunk creates calibration plots with different numbers of bins: 5, 10, 20, and 100.

Code
# Create calibration plots with different numbers of bins
n_bins <- c(5, 10, 20, 100)

# Create a tibble for plotting the calibration plots
bin_stats <- tibble::tibble(bin_center = double(), empirical_prob = double(), count = double(), empirical_se = double(), model = character())

for (n in n_bins) {
  # Bin the predictions
  rf_bins <- dplyr::ntile(plot_data$forest_pred, n)
  nb_bins <- dplyr::ntile(plot_data$nb_pred, n)

  # Calculate the statistics for the Random Forest model
  rf_bin_stats <- tibble(
    bin_center = tapply(plot_data$forest_pred, rf_bins, mean),
    empirical_prob = tapply(plot_data$actual, rf_bins, mean),
    count = as.numeric(table(rf_bins)),
    empirical_se = tapply(plot_data$actual, rf_bins, sd)/sqrt(count)
  )

  # Calculate the statistics for the Naive Bayes model
  nb_bin_stats <- tibble(
    bin_center = tapply(plot_data$nb_pred, nb_bins, mean),
    empirical_prob = tapply(plot_data$actual, nb_bins, mean),
    count = as.numeric(table(nb_bins)),
    empirical_se = tapply(plot_data$actual, nb_bins, sd)/sqrt(count)
  )

  # Combine the statistics for the two models
  bin_stats <-  bin_stats |> 
    bind_rows(mutate(rf_bin_stats, model = "Random Forest", n_bin = n)) |> 
    bind_rows(mutate(nb_bin_stats, model = "Naive Bayes", n_bin = n))
}

# Create the calibration plot
ggplot(bin_stats) +
  geom_abline(aes(slope = 1, intercept = 0, color = "Perfect Calibration"), linetype = "dashed") +
  geom_ribbon(aes(x = bin_center, y = empirical_prob, fill = model,
                    ymin = empirical_prob - 1.96 * empirical_se, 
                    ymax = empirical_prob + 1.96 * empirical_se), alpha = 0.2) +
  geom_point(aes(x=bin_center, y=empirical_prob, color = model), alpha = 0.5, size=2) +
  scale_color_manual(values = c("Perfect Calibration" = "red", "Random Forest" = "blue", "Naive Bayes" = "green")) +
  scale_fill_manual(values = c("Perfect Calibration" = "red", "Random Forest" = "blue", "Naive Bayes" = "green")) +
  labs(title = paste0("Calibration Plot with different bins"), x=NULL, y=NULL, color = NULL) + guides(fill = "none") +
  facet_wrap(~ n_bin, nrow = length(n_bins)%/%2, labeller = labeller(n_bin = function(x) paste0(x, " bins")), scales = "free_x")

Code
# Create calibration plots with different numbers of bins
n_bins = [5, 10, 20, 100]

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()

for i, n in enumerate(n_bins):
    ax = axes[i]
    
    # Bin the predictions
    rf_bins = pd.qcut(plot_data['forest_pred'], q=n, labels=False, duplicates='drop')
    nb_bins = pd.qcut(plot_data['nb_pred'], q=n, labels=False, duplicates='drop')

    # Calculate the statistics for the Random Forest model
    rf_bin_stats = pd.DataFrame({
        'bin_center': plot_data['forest_pred'].groupby(rf_bins).mean(),
        'empirical_prob': plot_data['actual'].groupby(rf_bins).mean(),
        'count': rf_bins.value_counts().sort_index(),
        'empirical_se': plot_data['actual'].groupby(rf_bins).std() / np.sqrt(rf_bins.value_counts().sort_index())
    })

    # Calculate the statistics for the Naive Bayes model
    nb_bin_stats = pd.DataFrame({
        'bin_center': plot_data['nb_pred'].groupby(nb_bins).mean(),
        'empirical_prob': plot_data['actual'].groupby(nb_bins).mean(),
        'count': nb_bins.value_counts().sort_index(),
        'empirical_se': plot_data['actual'].groupby(nb_bins).std() / np.sqrt(nb_bins.value_counts().sort_index())
    })

    # Add perfect calibration line
    ax.plot([0, 1], [0, 1], 'r--', label='Perfect Calibration', linewidth=2)

    # Plot Random Forest with ribbons
    ax.scatter(rf_bin_stats['bin_center'], rf_bin_stats['empirical_prob'], 
               c='blue', label='Random Forest', alpha = 0.5, s=2)
    ax.fill_between(rf_bin_stats['bin_center'], 
                    rf_bin_stats['empirical_prob'] - 1.96 * rf_bin_stats['empirical_se'],
                    rf_bin_stats['empirical_prob'] + 1.96 * rf_bin_stats['empirical_se'],
                    alpha=0.2, color='blue')

    # Plot Naive Bayes with ribbons
    ax.scatter(nb_bin_stats['bin_center'], nb_bin_stats['empirical_prob'], 
              c='green', label='Naive Bayes', alpha = 0.5, s=2)
    ax.fill_between(nb_bin_stats['bin_center'], 
                    nb_bin_stats['empirical_prob'] - 1.96 * nb_bin_stats['empirical_se'],
                    nb_bin_stats['empirical_prob'] + 1.96 * nb_bin_stats['empirical_se'],
                    alpha=0.2, color='green')

    # Formatting
    ax.set_ylim(0, 1)
    ax.set_xlabel('Predicted Probability')
    ax.set_ylabel('Empirical Probability')
    ax.set_title(f'Calibration Plot with {n} bins')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Notice that as the number of bins increases, the confidence intervals widen and the relationship between the predicted and empirical probabilities becomes noisier. Making a good calibration plot requires balancing the number of bins, the within-bin standard deviation, and the method of choosing the bins.
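This trade-off can also be checked numerically. The sketch below (simulated data; the helper name mean_bin_se is illustrative) computes the average per-bin standard error for each bin count used above:

```python
import numpy as np

# Numerical check of the bin-count trade-off on simulated data: with
# equal-size bins, more bins means fewer observations per bin, so the
# per-bin standard error sd / sqrt(n_bin) grows.
rng = np.random.default_rng(7)
pred = rng.uniform(size=2000)
actual = rng.binomial(1, pred)

def mean_bin_se(pred, actual, n_bins):
    """Average standard error of the empirical probability across bins."""
    order = np.argsort(pred)
    bins = np.array_split(order, n_bins)
    ses = [actual[b].std(ddof=1) / np.sqrt(len(b)) for b in bins]
    return float(np.mean(ses))

for n_bins in [5, 10, 20, 100]:
    print(n_bins, round(mean_bin_se(pred, actual, n_bins), 3))
```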

Multiple Classes

For multinomial (multi-class) outcomes, we need to adapt the standard binary calibration approach: we create a calibration plot for each class. There are two common approaches:

1. One-vs-Rest Calibration Plots

  1. For each class, treat it as the “positive” class and all others as “negative”
  2. Plot predicted probabilities for that class against observed proportions
  3. Repeat for all classes, resulting in k plots for k classes
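The one-vs-rest steps above can be sketched as follows, using simulated labels and probabilities (the class names are hypothetical, and sklearn’s label_binarize is one convenient way to do the binarization):

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.calibration import calibration_curve

# One-vs-rest reduction: binarize the labels, then reuse the binary
# calibration machinery once per class.
classes = ["Class1", "Class2", "Class3"]
rng = np.random.default_rng(3)
probs = rng.dirichlet(np.ones(3), size=500)     # rows sum to 1
labels = np.array([classes[rng.choice(3, p=row)] for row in probs])

y_bin = label_binarize(labels, classes=classes) # shape (500, 3) of 0/1

curves = {}
for k, name in enumerate(classes):
    prob_true, prob_pred = calibration_curve(y_bin[:, k], probs[:, k],
                                             n_bins=5, strategy='quantile')
    curves[name] = (prob_true, prob_pred)

# One binary calibration curve per class
print({name: len(c[0]) for name, c in curves.items()})
```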

2. Class-wise Calibration Plots

  1. Bin the predicted probabilities for each class
  2. For each bin, calculate the observed frequency of that class
  3. Plot the predicted probability (x-axis) vs. observed frequency (y-axis) for each class
  4. Use different colors/markers for different classes on the same plot

In the following example, we create the class-wise calibration plot. To do so, we write a function that draws a multinomial calibration plot and build a toy dataset to apply it to. Even in this case, we still have to make the decisions discussed in Section 2.4 about how to divide the predicted probabilities into bins. For simplicity, however, we use the same number of bins for all classes (n_bins=10).

Creating the Toy Dataset

Code
# Create a toy dataset
x1 <- rnorm(1000)
x2 <- rnorm(1000)
  
# Create probabilities for 3 classes
# Class probabilities depend on the features
linear_pred1 <- 0.5 + 1.2 * x1 - 0.5 * x2
linear_pred2 <- -0.8 + 0.2 * x1 + 1.5 * x2
linear_pred3 <- 0.3 - 0.7 * x1 - 0.3 * x2
  
# Convert to probabilities using softmax
denom <- exp(linear_pred1) + exp(linear_pred2) + exp(linear_pred3)
prob1 <- exp(linear_pred1) / denom
prob2 <- exp(linear_pred2) / denom
prob3 <- exp(linear_pred3) / denom
  
# Generate class based on probabilities
classes <- apply(cbind(prob1, prob2, prob3), 1, function(p) sample(1:3, 1, prob = p))
class_factor <- factor(classes, levels = 1:3, labels = c("Class1", "Class2", "Class3"))
  
# Create dataset
data <- tibble(
  x1 = x1,
  x2 = x2,
  class = class_factor,
  true_prob1 = prob1,
  true_prob2 = prob2,
  true_prob3 = prob3
)
Code
# Create a toy dataset
x1 = np.random.normal(size=1000)
x2 = np.random.normal(size=1000)

# Create probabilities for 3 classes
# Class probabilities depend on the features
linear_pred1 = 0.5 + 1.2 * x1 - 0.5 * x2
linear_pred2 = -0.8 + 0.2 * x1 + 1.5 * x2
linear_pred3 = 0.3 - 0.7 * x1 - 0.3 * x2

# Convert to probabilities using softmax
denom = np.exp(linear_pred1) + np.exp(linear_pred2) + np.exp(linear_pred3)
prob1 = np.exp(linear_pred1) / denom
prob2 = np.exp(linear_pred2) / denom
prob3 = np.exp(linear_pred3) / denom

# Generate class based on probabilities
# Stack probabilities for all classes
prob_matrix = np.column_stack([prob1, prob2, prob3])
classes = np.array([np.random.choice(3, p=prob_matrix[i]) for i in range(1000)])
class_factor = np.array(classes, dtype=object)
class_factor[classes == 0] = "Class1"
class_factor[classes == 1] = "Class2"
class_factor[classes == 2] = "Class3"

# Create dataset
data = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'class': class_factor,
    'true_prob1': prob1,
    'true_prob2': prob2,
    'true_prob3': prob3
})

Function for Multinomial Calibration Plot

Code
# Function for multinomial calibration plot
plot_multinomial_calibration <- function(predicted_probs, actual_class, n_bins=10) {

  predicted_probs <- as.data.frame(predicted_probs)
  
  classes <- colnames(predicted_probs)
  plot_data <- data.frame()
  
  for (class_name in classes) {
    # Extract probabilities for this class
    probs <- as.numeric(predicted_probs[[class_name]])
    
    # Create equal-width bins spanning the observed probability range
    bins <- cut(probs, breaks=seq(min(probs), max(probs), length.out=n_bins+1), include.lowest=TRUE)
    
    # Calculate observed proportions and standard errors
    bin_summary <- aggregate(as.numeric(actual_class == class_name), by=list(bin=bins), FUN=mean)
    bin_centers <- aggregate(probs, by=list(bin=bins), FUN=mean)
    bin_counts <- aggregate(as.numeric(actual_class == class_name), by=list(bin=bins), FUN=length)
    bin_sd <- aggregate(as.numeric(actual_class == class_name), by=list(bin=bins), FUN=sd)
    
    # Calculate standard error
    bin_se <- bin_sd$x / sqrt(bin_counts$x)
    
    # Combine data
    class_data <- data.frame(
      class = class_name,
      bin_center = bin_centers$x,
      observed_prop = bin_summary$x,
      count = bin_counts$x,
      empirical_se = bin_se
    )
    plot_data <- rbind(plot_data, class_data)
  }
  
  # Create plot with confidence intervals
  ggplot(plot_data, aes(x=bin_center, color=class, group=class)) +
    geom_abline(slope=1, intercept=0, linetype="dashed", color="red") +
    geom_ribbon(aes(ymin = observed_prop - 1.96 * empirical_se, 
                    ymax = observed_prop + 1.96 * empirical_se, 
                    fill = class), alpha = 0.2) +
    geom_point(aes(y=observed_prop, size = count), alpha = 0.5) +
    geom_line(aes(y=observed_prop)) +
    labs(x="Predicted Probability", y="Observed Proportion", 
         title="Calibration Plot for Multinomial Classes",
         color = "Class", fill = "Class", size = "Sample Size") +
    guides(fill = "none", size = "none")
}
Code
# Function for multinomial calibration plot with confidence intervals
def plot_multinomial_calibration(y_true, y_pred_proba, class_names=None, n_bins=10):
    """
    y_true: array-like of true class labels (integers)
    y_pred_proba: array-like of shape (n_samples, n_classes) with predicted probabilities
    class_names: list of class names (optional)
    n_bins: number of bins for probability
    """
    n_classes = y_pred_proba.shape[1]
    
    if class_names is None:
        class_names = [f"Class {i}" for i in range(n_classes)]
    
    plt.figure(figsize=(12, 8))
    plt.plot([0, 1], [0, 1], 'r--', label='Perfectly calibrated', linewidth=2)
    
    colors = ['blue', 'green', 'orange', 'purple', 'brown']
    
    for i in range(n_classes):
        # One-vs-rest approach: is this sample class i or not?
        # Convert class names to integers for comparison (mapping is specific to this toy dataset's labels)
        y_true_numeric = np.array([0 if label == "Class1" else 1 if label == "Class2" else 2 for label in y_true])
        y_true_binary = (y_true_numeric == i).astype(int)
        y_prob = y_pred_proba.iloc[:, i]
        
        # Create bins manually to calculate confidence intervals
        # Use min and max of predicted probabilities instead of uniform [0,1] bins
        min_prob = y_prob.min()
        max_prob = y_prob.max()
        bin_edges = np.linspace(min_prob, max_prob, n_bins + 1)
        bins = pd.cut(y_prob, bins=bin_edges, labels=False, include_lowest=True)
        
        # Calculate bin statistics
        bin_stats = []
        for bin_idx in range(n_bins):
            mask = bins == bin_idx
            if mask.sum() > 0:
                bin_center = y_prob[mask].mean()
                observed_prop = y_true_binary[mask].mean()
                count = mask.sum()
                std_error = np.sqrt(observed_prop * (1 - observed_prop) / count) if count > 0 else 0
                
                bin_stats.append({
                    'bin_center': bin_center,
                    'observed_prop': observed_prop,
                    'count': count,
                    'std_error': std_error
                })
        
        bin_stats = pd.DataFrame(bin_stats)
        
        if len(bin_stats) > 0:
            color = colors[i % len(colors)]
            
            # Plot confidence intervals
            plt.fill_between(bin_stats['bin_center'], 
                            bin_stats['observed_prop'] - 1.96 * bin_stats['std_error'],
                            bin_stats['observed_prop'] + 1.96 * bin_stats['std_error'],
                            alpha=0.2, color=color)
            
            # Plot points and line
            plt.scatter(bin_stats['bin_center'], bin_stats['observed_prop'], 
                       s=bin_stats['count']*2, color=color, alpha=0.7, label=class_names[i])
            plt.plot(bin_stats['bin_center'], bin_stats['observed_prop'], 
                    color=color, linewidth=2)
    
    plt.xlabel('Predicted Probability')
    plt.ylabel('Observed Proportion')
    plt.title('Calibration Plot for Multinomial Classes')
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.show()

Training the Model

We then train a multinomial logistic regression model on the first 700 observations with covariates x_1 and x_2.

Code
# Split the data into training and testing sets
train_data <- data[1:700, ]
test_data <- data[701:1000, ]

# Train a multinomial logistic regression model
model <- multinom(class ~ x1 + x2, data = train_data)
# weights:  12 (6 variable)
initial  value 769.028602 
iter  10 value 515.885778
final  value 515.867874 
converged
Code
pred_probs <- predict(model, newdata = test_data, type = "probs")

# Convert to a tibble
pred_probs_df <- as_tibble(pred_probs)
colnames(pred_probs_df) <- levels(test_data$class)
Code
# Split the data into training and testing sets
train_data = data.iloc[:700]
test_data = data.iloc[700:1000]

# Train a multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(train_data[['x1', 'x2']], train_data['class'])
LogisticRegression(multi_class='multinomial')
Code

pred_probs = model.predict_proba(test_data[['x1', 'x2']])

# Convert to a tibble
pred_probs_df = pd.DataFrame(pred_probs, columns=model.classes_)

Calibration Plots

Code
# Create the calibration plot
plot_multinomial_calibration(pred_probs_df, test_data$class, n_bins=10)

Code
# Create the calibration plot
plot_multinomial_calibration(test_data['class'], pred_probs_df, n_bins=10)

Continuous Outcomes

For continuous outcomes, we can create a calibration plot by binning the predicted values and calculating the average predicted and observed outcome values in each bin. The construction mirrors the binary case; only the outcome scale changes.
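A minimal sketch of this construction on simulated data (variable names illustrative):

```python
import numpy as np

# Continuous-outcome analogue: bin the predicted values into equal-size
# bins and compare mean predicted vs. mean observed outcome per bin.
rng = np.random.default_rng(5)
pred = rng.normal(10, 3, size=1500)            # predicted outcome values
actual = pred + rng.normal(0, 2, size=1500)    # observed = prediction + noise

order = np.argsort(pred)
bins = np.array_split(order, 10)
avg_pred = np.array([pred[b].mean() for b in bins])
avg_actual = np.array([actual[b].mean() for b in bins])

# For a calibrated regression model the bin means line up along the
# diagonal, now on the outcome's own scale rather than [0, 1].
print(np.round(np.abs(avg_pred - avg_actual).max(), 2))
```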

Data

We use the dataset from the large-scale natural field experiment conducted by Karlan and List (2007), titled “Does Price Matter in Charitable Giving?”. In this experiment, approximately 50,000 previous donors were randomly assigned to a treatment group or a control group, and all subjects received direct mail solicitations for donations to a liberal nonprofit organization. The treatment is a “matching grant” offer in the letter: treated individuals received a letter stating that a matching donor would match their contribution (W_i=1), while the control group received a standard letter with no matching offer (W_i=0). The outcome is continuous, with Y_i corresponding to the change in dollar amount donated1. We also filter out the bottom and top 10 percent of the outcome distribution to limit the influence of extreme values. For the rest of the process, we follow the same procedure as in Section 2 to create the calibration plot.

Code
# Read in data
data <- read.csv("https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Charitable/ProcessedData/charitable_withdummyvariables.csv")
n <- nrow(data)

# Treatment: received a letter stating that a matching donor would match their contribution (1) or not (0)
treatment <-  "treatment"

# Outcome: Dollar amount donated
outcome <- "out_changeamtgive"

# Additional covariates
covariates <- c("female", "median_hhincome", "red0", "pwhite", "pblack", "psch_atlstba")

# Keep only the control group
data_c <- data[data[[treatment]] == 0, ]
data_c <- na.omit(data_c[, c(outcome, covariates)])

# Filter out bottom 10 percentile and top 10 percentile of outcome
data_c <- data_c[data_c[,outcome] > quantile(data_c[,outcome], 0.1) & data_c[,outcome] < quantile(data_c[,outcome], 0.9), ]
Code

# Read in data
data = pd.read_csv("https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Charitable/ProcessedData/charitable_withdummyvariables.csv")
n = len(data)

# Treatment: received a letter stating that a matching donor would match their contribution (1) or not (0)
treatment = "treatment"

# Outcome: Dollar amount donated
outcome = "out_changeamtgive"

# Additional covariates
covariates = ["female", "median_hhincome", "red0", "pwhite", "pblack", "psch_atlstba"]

# Filter out the treatment group
data_c = data[data[treatment] == 0][[outcome, *covariates]].dropna()

# Filter out bottom 10 percentile and top 10 percentile of outcome
lower = data_c[outcome].quantile(0.1)
upper = data_c[outcome].quantile(0.9)
data_c = data_c[(data_c[outcome] > lower) & (data_c[outcome] < upper)]

Predictions

We again want to predict the counterfactual outcome Y_i(0)—the outcome if an individual were assigned to the control group—for each observation. To achieve this, we restrict our analysis to the control group (W_i=0) and train two models using the observed covariates: a Ridge Regression and a Random Forest.

The covariates are as follows:

  • female: A binary indicator equal to 1 if the donor is female, 0 otherwise.
  • median_hhincome: The median household income of the donor’s zip code (derived from Census data).
  • red0: A binary indicator equal to 1 if the donor lives in a state that voted for George W. Bush in the 2004 presidential election (a “Red State”).
  • pwhite / pblack: The proportion of the population in the donor’s zip code that is White or Black, respectively.
  • psch_atlstba: The proportion of the population in the donor’s zip code with at least a bachelor’s degree.

The control group is divided into a training set and a testing set. The training set contains the first 7,000 observations, and the testing set contains the remaining observations. We then generate predictions for the testing set for both the Ridge Regression and Random Forest models.

Code
# Data preparation
n_control <- nrow(data_c)
data_c_training <- data_c[1:7000,]
data_c_testing <- data_c[7001:n_control,]


# Ridge Regression
ridge <- glmnet::cv.glmnet(
  x = as.matrix(data_c_training[,covariates]), 
  y = data_c_training[,outcome], 
  alpha = 0
)

# Random forest
forest <- grf::regression_forest(
  X = data_c_training[,covariates], 
  Y = data_c_training[,outcome], 
  honesty = FALSE,
  tune.parameters = "all", 
  num.trees = 1000, 
  seed = 42
  )
Code
# Data preparation
data_c_training = data_c.iloc[:7000]
data_c_testing = data_c.iloc[7000:]  # remaining observations (n_control is not defined in Python)

# Ridge Regression
ridge = Ridge(alpha=0.1)
ridge.fit(data_c_training[covariates], data_c_training[outcome])
Code
# Random forest
forest = RandomForestRegressor(random_state=42, n_estimators=1000)
forest.fit(data_c_training[covariates], data_c_training[outcome])

Calibration Plots

Now let us create calibration plots with 50 bins. We first generate predictions on the test data for both the Ridge Regression and Random Forest models. We then group the predictions into 50 equal-size bins and compute the average predicted and empirical outcome values within each bin, which are plotted as points on the calibration plot.

Code
# Generate predictions on test data
ridge_pred <- as.vector(predict(ridge, newx = as.matrix(data_c_testing[,covariates]), s = "lambda.min"))
forest_pred <- predict(forest, data_c_testing[,covariates])$predictions

# Create data frame for plotting
plot_data <- data.frame(
  actual = data_c_testing[,outcome],
  ridge_pred = ridge_pred,
  forest_pred = forest_pred
)

# Display Predictions vs Actual
kableExtra::kable(head(plot_data, 10), caption = "Predictions vs Actual Data") |> kable_styling(bootstrap_options = c("striped", "hover")) |> scroll_box(width = "100%", height = "500px")
Predictions vs Actual Data
actual ridge_pred forest_pred
-50 -35.66895 -35.10202
-50 -35.78493 -35.93120
-45 -36.02916 -36.84405
-45 -35.89466 -34.29541
-50 -35.91803 -35.71871
-55 -35.80678 -35.47450
-40 -35.57152 -36.84217
-35 -35.70177 -36.54056
-35 -35.57777 -37.06384
-50 -35.67235 -35.66118
Code
# Simple calibration plot with 50 bins
n_bins <- 50

# Bin the predictions
ridge_bins <- dplyr::ntile(plot_data$ridge_pred, n_bins)
rf_bins <- dplyr::ntile(plot_data$forest_pred, n_bins)

# Calculate the statistics for the Ridge Regression model
ridge_bin_stats <- tibble(
    bin_center = tapply(plot_data$ridge_pred, ridge_bins, mean),
    empirical_prob = tapply(plot_data$actual, ridge_bins, mean),
    count = as.numeric(table(ridge_bins)),
    empirical_se = tapply(plot_data$actual, ridge_bins, sd)/sqrt(count)
  )

# Calculate the statistics for the Random Forest model
rf_bin_stats <- tibble(
    bin_center = tapply(plot_data$forest_pred, rf_bins, mean),
    empirical_prob = tapply(plot_data$actual, rf_bins, mean),
    count = as.numeric(table(rf_bins)),
    empirical_se = tapply(plot_data$actual, rf_bins, sd)/sqrt(count)
  )

# Combine the statistics for the two models
bin_stats <-  mutate(ridge_bin_stats, model = "Ridge Regression") |> 
bind_rows(mutate(rf_bin_stats, model = "Random Forest"))

# Create the calibration plot
ggplot(bin_stats) +
  geom_abline(aes(slope = 1, intercept = 0, color = "Perfect Calibration"), linetype = "dashed") +
  geom_ribbon(aes(x = bin_center, y = empirical_prob, fill = model,
                    ymin = empirical_prob - 1.96 * empirical_se, 
                    ymax = empirical_prob + 1.96 * empirical_se), alpha = 0.2) +
  geom_point(aes(x=bin_center, y=empirical_prob, color = model), alpha = 0.5, size=2) +
  scale_color_manual(values = c("Perfect Calibration" = "red", "Ridge Regression" = "blue", "Random Forest" = "green")) +
  scale_fill_manual(values = c("Perfect Calibration" = "red", "Ridge Regression" = "blue", "Random Forest" = "green")) +
  labs(title = paste0("Calibration Plot with ", n_bins, " bins"), x="Predicted Outcome", y="Empirical Outcome", color = NULL) + guides(fill = "none")

Code
# Generate predictions on test data
ridge_pred = ridge.predict(data_c_testing[covariates])
forest_pred = forest.predict(data_c_testing[covariates])

# Create data frame for plotting
plot_data = pd.DataFrame({
    'actual': data_c_testing[outcome],
    'ridge_pred': ridge_pred,
    'forest_pred': forest_pred
    })

# Display Predictions vs Actual
print("Predictions vs Actual Data:")
Predictions vs Actual Data:
Code
print(plot_data.head(10))
       actual  ridge_pred  forest_pred
25648   -50.0  -35.800969   -25.546778
25649   -45.0  -36.167722   -36.633399
25656   -45.0  -35.970060   -34.242667
25657   -50.0  -36.004640   -27.716534
25658   -55.0  -35.837096   -24.123339
25659   -40.0  -35.536632   -43.230730
25669   -35.0  -35.679684   -45.046554
25670   -35.0  -35.494804   -37.185162
25672   -50.0  -35.639813   -35.556909
25673   -50.0  -35.974745   -32.964125
Code
# Simple calibration plot with 50 bins
n_bins = 50

# Bin the predictions
ridge_bins = pd.qcut(plot_data['ridge_pred'], q=n_bins, labels=False, duplicates='drop')
rf_bins = pd.qcut(plot_data['forest_pred'], q=n_bins, labels=False, duplicates='drop')

# Calculate the statistics for the Ridge Regression model
ridge_bin_stats = pd.DataFrame({
    'bin_center': plot_data['ridge_pred'].groupby(ridge_bins).mean(),
    'empirical_prob': plot_data['actual'].groupby(ridge_bins).mean(),
    'count': ridge_bins.value_counts().sort_index(),
    'empirical_se': plot_data['actual'].groupby(ridge_bins).std() / np.sqrt(ridge_bins.value_counts().sort_index())
})

# Calculate the statistics for the Random Forest model
rf_bin_stats = pd.DataFrame({
    'bin_center': plot_data['forest_pred'].groupby(rf_bins).mean(),
    'empirical_prob': plot_data['actual'].groupby(rf_bins).mean(),
    'count': rf_bins.value_counts().sort_index(),
    'empirical_se': plot_data['actual'].groupby(rf_bins).std() / np.sqrt(rf_bins.value_counts().sort_index())
})


# Combine the statistics for the two models
ridge_bin_stats['model'] = 'Ridge Regression'
rf_bin_stats['model'] = 'Random Forest'
bin_stats = pd.concat([ridge_bin_stats, rf_bin_stats], ignore_index=True)

# Create the calibration plot
fig, ax = plt.subplots(figsize=(12, 8))

# Add perfect calibration line (the 45-degree line y = x)
ax.axline((0, 0), slope=1, color='r', linestyle='--', label='Perfect Calibration', linewidth=2)

# Plot Ridge Regression with ribbons
ridge_data = bin_stats[bin_stats['model'] == 'Ridge Regression']
ax.scatter(ridge_data['bin_center'], ridge_data['empirical_prob'], 
          c='blue', label='Ridge Regression', alpha = 0.5, s=2)
ax.fill_between(ridge_data['bin_center'], 
                ridge_data['empirical_prob'] - 1.96 * ridge_data['empirical_se'],
                ridge_data['empirical_prob'] + 1.96 * ridge_data['empirical_se'],
                alpha=0.2, color='blue')

# Plot Random Forest with ribbons
rf_data = bin_stats[bin_stats['model'] == 'Random Forest']
ax.scatter(rf_data['bin_center'], rf_data['empirical_prob'], 
          c='green', label='Random Forest', alpha = 0.5, s=2)
ax.fill_between(rf_data['bin_center'], 
                rf_data['empirical_prob'] - 1.96 * rf_data['empirical_se'],
                rf_data['empirical_prob'] + 1.96 * rf_data['empirical_se'],
                alpha=0.2, color='green')

# Formatting
ax.set_xlabel('Predicted Outcome')
ax.set_ylabel('Empirical Outcome')
ax.set_title(f'Calibration Plot with {n_bins} bins')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In the figure above, the X-axis represents the predicted outcome, while the Y-axis displays the empirical outcome. Each point on the plot corresponds to the average predicted and empirical outcomes within its respective bin, and the ribbons illustrate the 95% confidence intervals for the empirical outcome.

Again, how well calibration plots assess the performance of our models depends on the number of bins and on how the predictions are divided into bins. The details in Section 2.4 apply here as well.

Adjusting the number of bins

The confidence intervals in the figures above are calculated using the standard error of the empirical outcome, as mentioned above. Recall that the standard error within a bin is

\text{SE} = \frac{s}{\sqrt{n}}

where s is the standard deviation of the outcome within the bin and n is the number of observations in the bin. This yields wider confidence intervals for bins with fewer observations or higher standard deviation. The right number of bins is thus determined by a trade-off between the within-bin standard deviation and the number of observations in each bin. To see this, the following code chunk creates calibration plots with 5, 10, 100, and 500 bins.
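The trade-off can also be seen numerically. The sketch below uses invented synthetic data (not the charitable-giving dataset): with equal-size bins, multiplying the number of bins by 10 divides the per-bin sample size by 10, which inflates the standard errors by roughly a factor of sqrt(10).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic predictions and outcomes, purely to illustrate the trade-off
pred = rng.normal(size=5000)
actual = pred + rng.normal(size=5000)
df = pd.DataFrame({"pred": pred, "actual": actual})

for n_bins in [5, 50, 500]:
    bins = pd.qcut(df["pred"], q=n_bins, labels=False)
    grouped = df["actual"].groupby(bins)
    # SE = s / sqrt(n): fewer observations per bin means wider intervals
    se = grouped.std() / np.sqrt(grouped.size())
    print(f"{n_bins:4d} bins: avg bin size = {grouped.size().mean():6.0f}, "
          f"avg SE = {se.mean():.3f}")
```

The average standard error grows as the bins shrink, even though the underlying data never change.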

Code
# Create calibration plots with different numbers of bins
n_bins <- c(5, 10, 100, 500)

# Create a tibble for plotting the calibration plots
bin_stats <- tibble::tibble(bin_center = double(), empirical_prob = double(), count = double(), empirical_se = double(), model = character())

for (n in n_bins) {
  # Bin the predictions
  rf_bins <- dplyr::ntile(plot_data$forest_pred, n)
  ridge_bins <- dplyr::ntile(plot_data$ridge_pred, n)

  # Calculate the statistics for the Random Forest model
  rf_bin_stats <- tibble(
    bin_center = tapply(plot_data$forest_pred, rf_bins, mean),
    empirical_prob = tapply(plot_data$actual, rf_bins, mean),
    count = as.numeric(table(rf_bins)),
    empirical_se = tapply(plot_data$actual, rf_bins, sd)/sqrt(count)
  )

  # Calculate the statistics for the Ridge Regression model
  ridge_bin_stats <- tibble(
    bin_center = tapply(plot_data$ridge_pred, ridge_bins, mean),
    empirical_prob = tapply(plot_data$actual, ridge_bins, mean),
    count = as.numeric(table(ridge_bins)),
    empirical_se = tapply(plot_data$actual, ridge_bins, sd)/sqrt(count)
  )

  # Combine the statistics for the two models
  bin_stats <-  bin_stats |> 
    bind_rows(mutate(rf_bin_stats, model = "Random Forest", n_bin = n)) |> 
    bind_rows(mutate(ridge_bin_stats, model = "Ridge Regression", n_bin = n))
}

# Create the calibration plot
ggplot(bin_stats) +
  geom_abline(aes(slope = 1, intercept = 0, color = "Perfect Calibration"), linetype = "dashed") +
  geom_ribbon(aes(x = bin_center, y = empirical_prob, fill = model,
                    ymin = empirical_prob - 1.96 * empirical_se, 
                    ymax = empirical_prob + 1.96 * empirical_se), alpha = 0.2) +
  geom_point(aes(x=bin_center, y=empirical_prob, color = model), alpha = 0.5, size=2) +
  scale_color_manual(values = c("Perfect Calibration" = "red", "Ridge Regression" = "blue", "Random Forest" = "green")) +
  scale_fill_manual(values = c("Perfect Calibration" = "red", "Ridge Regression" = "blue", "Random Forest" = "green")) +
  labs(title = paste0("Calibration Plot with different bins"), x="Predicted Outcome", y="Empirical Outcome", color = NULL) + guides(fill = "none") +
  facet_wrap(~ n_bin, nrow = length(n_bins)%/%2, labeller = labeller(n_bin = function(x) paste0(x, " bins")), scales = "free")

Code
# Create calibration plots with different numbers of bins
n_bins = [5, 10, 100, 500]

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()

for i, n in enumerate(n_bins):
    ax = axes[i]
    
    # Bin the predictions
    rf_bins = pd.qcut(plot_data['forest_pred'], q=n, labels=False, duplicates='drop')
    ridge_bins = pd.qcut(plot_data['ridge_pred'], q=n, labels=False, duplicates='drop')

    # Calculate the statistics for the Random Forest model
    rf_bin_stats = pd.DataFrame({
        'bin_center': plot_data['forest_pred'].groupby(rf_bins).mean(),
        'empirical_prob': plot_data['actual'].groupby(rf_bins).mean(),
        'count': rf_bins.value_counts().sort_index(),
        'empirical_se': plot_data['actual'].groupby(rf_bins).std() / np.sqrt(rf_bins.value_counts().sort_index())
    })

    # Calculate the statistics for the Ridge Regression model
    ridge_bin_stats = pd.DataFrame({
        'bin_center': plot_data['ridge_pred'].groupby(ridge_bins).mean(),
        'empirical_prob': plot_data['actual'].groupby(ridge_bins).mean(),
        'count': ridge_bins.value_counts().sort_index(),
        'empirical_se': plot_data['actual'].groupby(ridge_bins).std() / np.sqrt(ridge_bins.value_counts().sort_index())
    })

    # Add perfect calibration line (the 45-degree line y = x)
    ax.axline((0, 0), slope=1, color='r', linestyle='--', label='Perfect Calibration', linewidth=2)

    # Plot Ridge Regression with ribbons (reusing the per-bin statistics computed above)
    ridge_data = ridge_bin_stats
    ax.scatter(ridge_data['bin_center'], ridge_data['empirical_prob'], 
              c='blue', label='Ridge Regression', alpha = 0.5, s=2)
    ax.fill_between(ridge_data['bin_center'], 
                    ridge_data['empirical_prob'] - 1.96 * ridge_data['empirical_se'],
                    ridge_data['empirical_prob'] + 1.96 * ridge_data['empirical_se'],
                    alpha=0.2, color='blue')

    # Plot Random Forest with ribbons (reusing the per-bin statistics computed above)
    rf_data = rf_bin_stats
    ax.scatter(rf_data['bin_center'], rf_data['empirical_prob'], 
              c='green', label='Random Forest', alpha = 0.5, s=2)
    ax.fill_between(rf_data['bin_center'], 
                    rf_data['empirical_prob'] - 1.96 * rf_data['empirical_se'],
                    rf_data['empirical_prob'] + 1.96 * rf_data['empirical_se'],
                    alpha=0.2, color='green')

    # Formatting
    ax.set_xlabel('Predicted Outcome')
    ax.set_ylabel('Empirical Outcome')
    ax.set_title(f'Calibration Plot with {n} bins')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Notice that as the number of bins increases, the confidence intervals tend to become wider, and the relationship between the predicted and empirical outcomes tends to become noisier. At the same time, the points appear to hug the perfect calibration line more closely: as we narrow the bins, each bin averages over increasingly similar observations, so the average predicted outcome and the average empirical outcome within a bin move closer together. Making a good calibration plot thus requires careful consideration of the number of bins, the within-bin standard deviation, and the method of choosing the bins.
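The noise effect can be isolated on synthetic data that is perfectly calibrated by construction (a sketch on invented data, not the charitable-giving dataset): even then, the bin points drift away from the 45-degree line as the bins shrink, because each bin mean is estimated from fewer observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Perfectly calibrated by construction: E[actual | pred] = pred
pred = rng.normal(size=5000)
actual = pred + rng.normal(size=5000)
df = pd.DataFrame({"pred": pred, "actual": actual})

# Root-mean-square distance of the bin points from the diagonal
for n_bins in [5, 50, 500]:
    bins = pd.qcut(df["pred"], q=n_bins, labels=False)
    stats = df.groupby(bins).mean()
    rmsd = np.sqrt(((stats["actual"] - stats["pred"]) ** 2).mean())
    print(f"{n_bins:4d} bins: RMS deviation from diagonal = {rmsd:.3f}")
```

Any scatter around the diagonal seen with very fine bins is therefore not, by itself, evidence of miscalibration.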

Acknowledgments

Here is a non-exhaustive list of helpful sources that I used to create this tutorial.

References

Athey, Susan, Herman Brunborg, Tianyu Du, Ayush Kanodia, and Keyon Vafa. 2025. “LABOR-LLM: Language-Based Occupational Representations with Large Language Models.” https://arxiv.org/abs/2406.17972.
Karlan, Dean, and John A. List. 2007. “Does Price Matter in Charitable Giving? Evidence from a Large-Scale Natural Field Experiment.” American Economic Review 97 (5): 1774–93. https://doi.org/10.1257/aer.97.5.1774.
Niculescu-Mizil, Alexandru, and Rich Caruana. 2005. “Predicting Good Probabilities with Supervised Learning.” In Proceedings of the 22nd International Conference on Machine Learning, 625–32.

Footnotes

  1. The change in dollar amount donated is calculated as the difference between what they gave in this experiment and what they gave previously. While Karlan and List (2007) primarily analyze the raw donation amount (out_amountgive), we use the change in donation amount to mitigate the issue of zero-inflated outcomes. Given that the sample consists entirely of 50,083 prior donors (who donated at least once since 1991), the change in donation is largely driven by their baseline giving history, resulting in fewer zero-valued outcomes compared to the raw donation amount.↩︎