MOMENTUM:
Selection Bias Decomposition and Forecasting vs. Causal Inference

December 5th, 2025

Sample Selection Bias

Sample Selection

As a first step, I decompose the bias in a setting with sample selection only.

This simpler setting establishes the basic results on which the more complex settings build.

I work under standard SUTVA, consistency and positivity assumptions.

Setup

  • Population of individuals:
    \(i = 1,2,\ldots,N\)

  • Binary treatment:
    \(\mathcal{D} = \{0,1\}\)

    • \(D_i = 0\): untreated
    • \(D_i = 1\): treated

Setup

  • Potential outcomes:
    • \(Y_{0i}\): outcome for individual \(i\) under no treatment

    • \(Y_{1i}\): outcome for individual \(i\) under treatment

    • Only one is observed:

      \[ Y_i = D_i\,Y_{1i} + (1-D_i)\,Y_{0i} \]

Setup

  • Individual treatment effect: \[ \tau_i = Y_{1i} - Y_{0i} \]

  • Target causal parameter: \[ ATE = \mathbb{E}[\tau_i] = \mathbb{E}[Y_{1i} - Y_{0i}] \]

Naïve Difference-in-Means Estimator

  • Sample estimator for \(ATE\) in a randomized setting:

    \[ \widehat{ATE} = \frac{1}{n_1}\sum_{i=1}^n D_i Y_i \;-\; \frac{1}{n_0}\sum_{i=1}^n (1 - D_i) Y_i \]

  • Where:

    • \(n_1 = \sum_{i=1}^n D_i\) (treated count)
    • \(n_0 = \sum_{i=1}^n (1 - D_i)\) (control count)
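The estimator above can be sketched in a few lines of NumPy. The function name `difference_in_means` and the toy values are illustrative assumptions, not part of the original material.

```python
import numpy as np

def difference_in_means(y, d):
    """Naive ATE estimate: treated mean minus control mean."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=int)
    n1 = d.sum()          # treated count
    n0 = (1 - d).sum()    # control count
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one treated and one control unit")
    return (d * y).sum() / n1 - ((1 - d) * y).sum() / n0

# Toy sample (hypothetical values): two treated units, three controls.
y = [5.0, 7.0, 1.0, 3.0, 2.0]
d = [1, 1, 0, 0, 0]
ate_hat = difference_in_means(y, d)
print(ate_hat)
```

Writing the sums with the indicators \(D_i\) and \(1 - D_i\) mirrors the formula term by term; boolean masking would give the same number.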

Latent Groups

  • Individuals belong to an unobserved group:

    \[ \mathcal{G} = \{L, S\} \]

  • Interpretation:

    • Group \(L\): large treatment response
    • Group \(S\): small treatment response

Group-Specific Treatment Effects

  • Define the average effect within each latent group:

    \[ \tau_g = \mathbb{E}[Y_{1i} \mid G_i = g] - \mathbb{E}[Y_{0i} \mid G_i = g] \]

  • Stack into a vector:

    \[ \tau = \begin{bmatrix} \tau_L \\ \tau_S \end{bmatrix}, \qquad \tau_L > \tau_S \]

Population Group Weights

  • Define the population distribution of latent groups:

    \[ \lambda = \begin{bmatrix} \lambda_L \\ \lambda_S \end{bmatrix}, \qquad \lambda_L + \lambda_S = 1 \]

  • Interpretation:
    \(\lambda_g = P(G_i = g)\) in the full population.

Sample Group Weights

  • Define the sample distribution of latent groups:

    \[ \gamma = \begin{bmatrix} \gamma_L \\ \gamma_S \end{bmatrix}, \qquad \gamma_L + \gamma_S = 1 \]

  • Interpretation:
    \(\gamma_g = P(G_i = g)\) among sampled individuals.

Sample Selection

  • Even if treatment is perfectly randomized within the sample, we may get:

    \[ \gamma \neq \lambda \]

  • With heterogeneous effects \(\tau_L \neq \tau_S\),
    this mismatch becomes the key driver of bias
    from the naïve difference-in-means estimator.

Population ATE

  • Start from the definition based on potential outcomes:

    \[ ATE := \mathbb{E}[Y_{1i}] - \mathbb{E}[Y_{0i}] \]

  • Use the law of total expectation over latent groups:

    \[ \begin{aligned} ATE &= \lambda_L \cdot \big(\mathbb{E}[Y_{1i} \mid G_i = L] - \mathbb{E}[Y_{0i} \mid G_i = L]\big) \\ &\quad + \lambda_S \cdot \big(\mathbb{E}[Y_{1i} \mid G_i = S] - \mathbb{E}[Y_{0i} \mid G_i = S]\big) \end{aligned} \]

Compact Form

  • Recognize the group-specific treatment effects:

    \[ \tau_L = \mathbb{E}[Y_{1i} \mid G_i = L] - \mathbb{E}[Y_{0i} \mid G_i = L] \] \[ \tau_S = \mathbb{E}[Y_{1i} \mid G_i = S] - \mathbb{E}[Y_{0i} \mid G_i = S] \]

  • Then the ATE becomes:

    \[ ATE = \lambda_L \tau_L + \lambda_S \tau_S = \lambda^T \tau \]

Subgroup Counts

  • We define

    \[ n_{dg} = \sum_{i=1}^n \mathbf{1}\{D_i = d,\; G_i = g\}, \qquad \forall (d, g) \in \mathcal{D} \times \mathcal{G} \]

  • These satisfy: \[ n_{1L} + n_{1S} = n_1, \qquad n_{0L} + n_{0S} = n_0 \]

Subgroup Weights

  • Define the subgroup gammas:

    \[ \gamma_{dg} = \frac{n_{dg}}{n_d} \]

  • Interpretation:
    \(\gamma_{dg}\) is the fraction of treated (or untreated) individuals who belong to group \(g\).

    In expectation: \(\mathbb{E}[\gamma_{dg}] = P(G=g \mid D=d)\)

Expected Subgroup Weights

  • Because treatment is assigned at random, a single draw will generally not satisfy exactly: \[ \gamma_{1g} = \gamma_g, \qquad \gamma_{0g} = \gamma_g \]

  • But, in expectation, random assignment reproduces \(\gamma_g\):

    \[ \mathbb{E}[\gamma_{1g}] = \mathbb{E}[\gamma_{0g}] = \gamma_g \]

  • Therefore, when computing \(\mathbb{E}[\widehat{ATE}]\), we can safely replace \(\gamma_{dg}\) with \(\gamma_g\).

Naïve ATE Estimator by Group

  • Starting from the difference-in-means estimator for the ATE and taking expectations gives:

    \[ \begin{aligned} \mathbb{E}[\widehat{ATE}] &= \gamma_L \big(\mathbb{E}[Y_{1i} \mid G_i = L] - \mathbb{E}[Y_{0i} \mid G_i = L]\big) \\ &\quad + \gamma_S \big(\mathbb{E}[Y_{1i} \mid G_i = S] - \mathbb{E}[Y_{0i} \mid G_i = S]\big) \\[4pt] &= \gamma_L \tau_L + \gamma_S \tau_S = \gamma^T \tau \end{aligned} \]

Bias from Sample Selection

\[ \begin{aligned} Bias &= \mathbb{E}[\widehat{ATE}] - ATE \\ &= \gamma^T \tau - \lambda^T \tau \\ &= (\gamma_L - \lambda_L)(\tau_L - \tau_S) \end{aligned} \]

  • Bias arises only if:
    • Group effects differ (TE heterogeneity), and
    • Sample composition differs from population (\(\gamma \neq \lambda\)).
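A small numerical check of the two-group decomposition, under illustrative values that are assumptions of this sketch (group order `[L, S]`, unit noise, a fair-coin treatment assignment inside the sample). It compares \(\gamma^T\tau - \lambda^T\tau\) with the closed form and with one large simulated sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions), groups ordered as [L, S].
lam = np.array([0.5, 0.5])      # population shares lambda
gam = np.array([0.8, 0.2])      # sample shares gamma (L is oversampled)
tau = np.array([70.0, 1.0])     # group effects tau_L, tau_S
mu0 = np.array([15.0, -20.0])   # untreated group means

ate = lam @ tau                                        # lambda^T tau
expected_ate_hat = gam @ tau                           # gamma^T tau
bias_formula = (gam[0] - lam[0]) * (tau[0] - tau[1])   # closed form

# One large selected sample with treatment randomized inside it.
n = 200_000
g = rng.choice(2, size=n, p=gam)            # 0 = L, 1 = S
d = rng.integers(0, 2, size=n)              # fair-coin assignment
y0 = mu0[g] + rng.normal(0.0, 1.0, size=n)  # untreated potential outcome
y = y0 + d * tau[g]                         # observed outcome
ate_hat = y[d == 1].mean() - y[d == 0].mean()

print(f"ATE = {ate:.2f}, E[ATE_hat] = {expected_ate_hat:.2f}, bias = {bias_formula:.2f}")
print(f"simulated ATE_hat = {ate_hat:.2f}")
```

Setting `gam = lam` or `tau[0] == tau[1]` drives the closed-form bias to zero, matching the two conditions listed above.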

k Groups setting

The previous expression extends directly to \(k\) groups:

  • For groups \(\mathcal{G} = \{1, \dots, k\}\), indexed by \(g\) and with group 1 as the reference:

\[ Bias = \sum_{g=2}^k (\gamma_g - \lambda_g)(\tau_g - \tau_1) \]
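The \(k\)-group expression can be checked against the direct difference \(\gamma^T\tau - \lambda^T\tau\). The sketch below draws arbitrary weights and effects (all values are assumptions) and also confirms that the choice of reference group does not matter, because \(\sum_g (\gamma_g - \lambda_g) = 0\).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5

# Arbitrary illustrative inputs (assumptions): index 0 plays the role of group 1.
lam = rng.dirichlet(np.ones(k))        # population shares, sum to 1
gam = rng.dirichlet(np.ones(k))        # sample shares, sum to 1
tau = rng.normal(10.0, 5.0, size=k)    # group-specific effects

# Direct definition: E[ATE_hat] - ATE.
bias_direct = gam @ tau - lam @ tau

# Sum over g = 2..k, with group 1 as the reference.
bias_sum = np.sum((gam[1:] - lam[1:]) * (tau[1:] - tau[0]))

# Same sum with the last group as the reference instead.
bias_sum_last_ref = np.sum((gam[:-1] - lam[:-1]) * (tau[:-1] - tau[-1]))

print(bias_direct, bias_sum, bias_sum_last_ref)
```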

Treatment Selection Bias

Treatment Selection

What changes:

  • In order to isolate the bias from Treatment Selection, we assume no sample selection now (\(\lambda = \gamma\)).
  • Treatment is no longer randomized within the sample.
  • One group is more likely to be treated.

Key consequence:

\[ \mathbb{E}[\gamma_{1g}] \neq \mathbb{E}[\gamma_{0g}], \qquad g \in \mathcal{G} \]

Overall Treatment Rate

  • We define the treated fraction as follows:

\[ \delta = P(D = 1) = \sum_{g \in \mathcal{G}} P(D = 1 \mid G = g) \cdot P(G = g) = \sum_{g \in \mathcal{G}} \pi_g \lambda_g \]

where: \[ \pi_g = P(D = 1 \mid G = g) \]

Linking \(\pi_g\) to \(\mathbb{E}[\gamma_{1g}]\)

  • Using Bayes’ rule:

\[ \begin{aligned} \pi_g &= P(D = 1 \mid G = g) = \frac{P(G = g \mid D = 1)\,P(D = 1)}{P(G = g)} \\ &= \mathbb{E}[\gamma_{1g}] \cdot \frac{\delta}{\lambda_g} \end{aligned} \]

hence

\[ \mathbb{E}[\gamma_{1g}] = \pi_g \cdot \frac{\lambda_g}{\delta} \]

Linking \(\pi_g\) to \(\mathbb{E}[\gamma_{0g}]\)

\[ \begin{aligned} \mathbb{E}[\gamma_{0g}] &= P(G = g \mid D = 0) \\ &= \frac{P(G = g) - P(G = g \mid D = 1)\,P(D = 1)}{P(D = 0)} \\[4pt] &= \frac{\lambda_g - \mathbb{E}[\gamma_{1g}]\,\delta}{1 - \delta} = (1 - \pi_g)\,\frac{\lambda_g}{1 - \delta} \end{aligned} \]

  • Intuition:
    • If \(\pi_g\) is high, group \(g\) is overrepresented among treated and underrepresented among controls, and vice versa.
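The two Bayes-rule links can be evaluated directly. In this sketch the helper name `group_shares_by_arm` is hypothetical, and the inputs mirror the `pi_L`, `pi_S`, and `lambda_L` values passed to the simulation later in the deck, with groups ordered as `[L, S]`.

```python
import numpy as np

def group_shares_by_arm(pi, lam):
    """Return delta, E[gamma_1g], E[gamma_0g] from pi_g and lambda_g."""
    pi = np.asarray(pi, dtype=float)
    lam = np.asarray(lam, dtype=float)
    delta = pi @ lam                         # overall treated fraction
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    gamma1 = pi * lam / delta                # P(G = g | D = 1)
    gamma0 = (1.0 - pi) * lam / (1.0 - delta)  # P(G = g | D = 0)
    return delta, gamma1, gamma0

delta, gamma1, gamma0 = group_shares_by_arm(pi=[0.5, 0.0], lam=[0.5, 0.5])
print(delta)    # treated fraction
print(gamma1)   # group shares among the treated
print(gamma0)   # group shares among the controls
```

With these inputs every treated unit belongs to group \(L\), while group \(L\) makes up only one third of the controls, which is the over/under-representation described in the intuition above.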

ATE Under Treatment Selection

  • Even though treatment is now selected (not randomized), the population ATE definition does not change:

    \[ ATE := \mathbb{E}[Y_{1i}] - \mathbb{E}[Y_{0i}] \]

  • So the true ATE remains:

    \[ ATE = \lambda^T \tau \]

Defining Group– and Treatment–Specific Means

  • For clarity, define:

    \[ \mu_{dg} := \mathbb{E}[Y_{di} \mid G_i = g], \qquad d \in \mathcal{D},\; g \in \mathcal{G} \]

  • Interpretation:

    • \(\mu_{0g}\): average outcome without treatment in group \(g\)
    • \(\mu_{1g}\): average outcome with treatment in group \(g\)

Relation Between \(\mu_{dg}\) and \(\tau_g\)

  • The group-specific treatment effect then becomes:

    \[ \tau_g = \mu_{1g} - \mu_{0g} \]

  • Equivalently:

    \[ \mu_{1g} = \mu_{0g} + \tau_g \]

Bias Under Treatment Selection

  • Under treatment selection, the bias of the naïve difference-in-means estimator is:

    \[ \begin{aligned} Bias &= \mathbb{E}[\widehat{ATE}] - ATE \\[4pt] &= \lambda_L \left[ \left(\frac{\pi_L}{\delta} - 1\right)(\tau_L - \tau_S) \;+\; \frac{\pi_L - \delta}{\delta(1 - \delta)}(\mu_{0L} - \mu_{0S}) \right] \end{aligned} \]
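The expression above can be reached in three steps. This sketch assumes that, within each latent group, treatment is independent of the potential outcomes, so that selection operates only through \(G\); the use of \(\mu_{dg}\) suggests this, but the source does not state it explicitly.

First, weight the group means by the expected group shares in each arm:

\[ \begin{aligned} \mathbb{E}[\widehat{ATE}] &= \sum_{g \in \mathcal{G}} \mathbb{E}[\gamma_{1g}]\,\mu_{1g} - \sum_{g \in \mathcal{G}} \mathbb{E}[\gamma_{0g}]\,\mu_{0g} \\ &= \sum_{g \in \mathcal{G}} \frac{\pi_g \lambda_g}{\delta}\,(\mu_{0g} + \tau_g) - \sum_{g \in \mathcal{G}} \frac{(1 - \pi_g)\,\lambda_g}{1 - \delta}\,\mu_{0g} \end{aligned} \]

Second, subtract \(ATE = \sum_g \lambda_g \tau_g\) and use \(\frac{\pi_g}{\delta} - \frac{1 - \pi_g}{1 - \delta} = \frac{\pi_g - \delta}{\delta(1 - \delta)}\):

\[ Bias = \sum_{g \in \mathcal{G}} \lambda_g \left[ \left(\frac{\pi_g}{\delta} - 1\right)\tau_g + \frac{\pi_g - \delta}{\delta(1 - \delta)}\,\mu_{0g} \right] \]

Third, note that \(\sum_g \lambda_g (\pi_g - \delta) = 0\), so with two groups \(\lambda_S(\pi_S - \delta) = -\lambda_L(\pi_L - \delta)\). Substituting this into both coefficients turns \(\tau_g\) and \(\mu_{0g}\) into the contrasts \(\tau_L - \tau_S\) and \(\mu_{0L} - \mu_{0S}\), which gives the two-group formula.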

k Groups setting

In the \(k\)-group setting, with group 1 again as the reference, the bias becomes:

\[ Bias = \sum_{g=2}^k \lambda_g \left[ \left(\frac{\pi_g}{\delta} - 1\right)(\tau_g - \tau_1) \;+\; \frac{\pi_g - \delta}{\delta(1 - \delta)}(\mu_{0g} - \mu_{01}) \right] \]
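As a sanity check, the \(k\)-group formula can be compared with a direct computation of \(\mathbb{E}[\widehat{ATE}] - ATE\) from the expected group shares in each arm. All inputs below are arbitrary illustrative draws, and the sketch assumes selection into treatment depends on \(G\) only.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 4

# Arbitrary illustrative inputs (assumptions): index 0 plays the role of group 1.
lam = rng.dirichlet(np.ones(k))        # population shares lambda_g
pi = rng.uniform(0.1, 0.9, size=k)     # treatment propensities pi_g
tau = rng.normal(10.0, 5.0, size=k)    # group effects tau_g
mu0 = rng.normal(0.0, 10.0, size=k)    # untreated means mu_0g
mu1 = mu0 + tau                        # treated means mu_1g

delta = pi @ lam                           # overall treated fraction
gamma1 = pi * lam / delta                  # E[gamma_1g]
gamma0 = (1 - pi) * lam / (1 - delta)      # E[gamma_0g]

# Direct computation: expected naive estimate minus the true ATE.
bias_direct = (gamma1 @ mu1 - gamma0 @ mu0) - lam @ tau

# Closed form, summing over groups 2..k relative to group 1.
g = slice(1, None)
bias_formula = np.sum(
    lam[g] * (
        (pi[g] / delta - 1) * (tau[g] - tau[0])
        + (pi[g] - delta) / (delta * (1 - delta)) * (mu0[g] - mu0[0])
    )
)

print(bias_direct, bias_formula)
```

The first term inside the sum is the part of the bias that comes from effect heterogeneity; the second comes from differences in untreated baselines across groups.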

Selection Bias using Diff-in-Diff

What Changes?

  • We now introduce time with two periods: \(T = \{1, 2\}\)

    • \(t = 1\): pre-treatment
    • \(t = 2\): post-treatment
  • Potential outcomes are now indexed by:

    \[ Y_{dit} \quad \text{for } d \in \mathcal{D},\; t \in T \]

  • Main differences vs Section 2:

    • We now target ATT, not ATE
    • We add No Anticipation and Parallel Trends assumptions

ATT in the 2-Period DiD Setting

  • Target parameter: the Average Treatment Effect on the Treated:

    \[ ATT := \mathbb{E}[Y_{1i2} \mid D_i = 1] - \mathbb{E}[Y_{0i2} \mid D_i = 1] \]

  • Under No Anticipation and Parallel Trends, we obtain the DiD representation:

    \[ \begin{aligned} ATT &= \big( \mathbb{E}[Y_{1i2} \mid D_i = 1] - \mathbb{E}[Y_{1i1} \mid D_i = 1] \big) \\ &\quad- \big( \mathbb{E}[Y_{0i2} \mid D_i = 0] - \mathbb{E}[Y_{0i1} \mid D_i = 0] \big) \end{aligned} \]
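The step from the ATT definition to the DiD representation can be spelled out as follows. This sketch reads No Anticipation as \(\mathbb{E}[Y_{1i1} \mid D_i = 1] = \mathbb{E}[Y_{0i1} \mid D_i = 1]\) and Parallel Trends as \(\mathbb{E}[Y_{0i2} - Y_{0i1} \mid D_i = 1] = \mathbb{E}[Y_{0i2} - Y_{0i1} \mid D_i = 0]\); these are the usual formulations, but they are assumptions here because the source does not write them out.

\[ \begin{aligned} ATT &= \mathbb{E}[Y_{1i2} \mid D_i = 1] - \mathbb{E}[Y_{0i2} \mid D_i = 1] \\ &= \big( \mathbb{E}[Y_{1i2} \mid D_i = 1] - \mathbb{E}[Y_{0i1} \mid D_i = 1] \big) - \big( \mathbb{E}[Y_{0i2} \mid D_i = 1] - \mathbb{E}[Y_{0i1} \mid D_i = 1] \big) \\ &= \big( \mathbb{E}[Y_{1i2} \mid D_i = 1] - \mathbb{E}[Y_{1i1} \mid D_i = 1] \big) - \big( \mathbb{E}[Y_{0i2} \mid D_i = 0] - \mathbb{E}[Y_{0i1} \mid D_i = 0] \big) \end{aligned} \]

The second line adds and subtracts \(\mathbb{E}[Y_{0i1} \mid D_i = 1]\); the third applies No Anticipation to the first bracket and Parallel Trends to the second.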

ATT with Latent Groups and Treatment Selection

  • As before, define:

    \[ \mu_{dtg} := \mathbb{E}[Y_{dit} \mid G_i = g], \qquad \forall (d, t, g) \in \mathcal{D} \times T \times \mathcal{G} \]

  • Let \(\tau_g\) be the group-specific treatment effect in period 2:

    \[ \tau_g = \mathbb{E}[Y_{1i2} \mid G_i = g] - \mathbb{E}[Y_{0i2} \mid G_i = g] \]

ATT with Latent Groups and Treatment Selection

  • The ATT can be written compactly as:

\[ ATT = \sum_{g \in \mathcal{G}} \frac{\pi_g}{\delta}\,\lambda_g\,\tau_g \]

DiD Estimator for the ATT

  • Define sample averages for treated/untreated by period:

    \[ \bar{Y}_{dt} := \frac{1}{n_d} \sum_{i=1}^n Y_{it}\,\mathbf{1}\{D_i = d\}, \qquad \forall (d, t) \in \mathcal{D} \times T \]

  • The standard 2-period DiD estimator:

    \[ \widehat{ATT} := (\bar{Y}_{12} - \bar{Y}_{11}) - (\bar{Y}_{02} - \bar{Y}_{01}) \]
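A minimal sketch of the two-period estimator, assuming the data arrive as one pre-period and one post-period outcome per unit. The function name `did_estimator` and the toy values are illustrative, not taken from the original material.

```python
import numpy as np

def did_estimator(y_pre, y_post, d):
    """Two-period DiD: change among treated minus change among controls."""
    y_pre = np.asarray(y_pre, dtype=float)    # Y_{i1}, period 1
    y_post = np.asarray(y_post, dtype=float)  # Y_{i2}, period 2
    treated = np.asarray(d).astype(bool)
    if treated.all() or not treated.any():
        raise ValueError("need at least one treated and one control unit")
    treated_change = y_post[treated].mean() - y_pre[treated].mean()
    control_change = y_post[~treated].mean() - y_pre[~treated].mean()
    return treated_change - control_change

# Toy panel (hypothetical values): two treated units, two controls.
y_pre = [10.0, 12.0, 5.0, 7.0]
y_post = [18.0, 20.0, 8.0, 10.0]
d = [1, 1, 0, 0]
att_hat = did_estimator(y_pre, y_post, d)
print(att_hat)
```

Here the treated group moves from 11 to 19 and the controls from 6 to 9, so the estimate is the treated change of 8 minus the control change of 3.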

Expected Value of the DiD ATT

  • Under:
    • Treatment selection via \(\pi_g\) and \(\lambda_g\),
    • No sample selection (sample mirrors population in \(G\)),
    • No Anticipation,
    • Group-level Parallel Trends,
  • The expected value of the DiD estimator is:

\[ \mathbb{E}[\widehat{ATT}] = \sum_{g \in \mathcal{G}} \frac{\pi_g}{\delta}\,\lambda_g\,\tau_g \]

Bias of DiD for ATT

  • Bias of the DiD estimator: \[ Bias = \mathbb{E}[\widehat{ATT}] - ATT = 0 \]

  • Interpretation:

    • Even with treatment selection and heterogeneous effects across latent groups, DiD remains unbiased for the ATT as long as:
      • No Anticipation holds
      • Parallel Trends holds at the group level
      • There is no sample selection in \(G\)
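The zero-bias result can be checked at the level of expectations. The sketch below reads group-level Parallel Trends as "the untreated trend \(\mu_{02g} - \mu_{01g}\) is the same for every latent group"; that reading, the helper name `did_bias`, and all numbers are assumptions of this example.

```python
import numpy as np

def did_bias(pi, lam, mu0_pre, trend0, tau):
    """E[ATT_hat] - ATT when selection into treatment depends on G only."""
    pi, lam, mu0_pre, trend0, tau = (
        np.asarray(a, dtype=float) for a in (pi, lam, mu0_pre, trend0, tau)
    )
    delta = pi @ lam
    gamma1 = pi * lam / delta                # group shares among treated
    gamma0 = (1 - pi) * lam / (1 - delta)    # group shares among controls
    # No Anticipation: treated units show untreated means in period 1.
    treated_change = gamma1 @ (mu0_pre + trend0 + tau) - gamma1 @ mu0_pre
    control_change = gamma0 @ (mu0_pre + trend0) - gamma0 @ mu0_pre
    att = gamma1 @ tau                       # sum_g (pi_g / delta) lambda_g tau_g
    return (treated_change - control_change) - att

# Illustrative inputs, groups ordered as [L, S].
common = dict(pi=[0.5, 0.2], lam=[0.5, 0.5], mu0_pre=[15.0, -20.0], tau=[70.0, 1.0])

bias_common = did_bias(trend0=[3.0, 3.0], **common)      # same trend in both groups
bias_diverging = did_bias(trend0=[3.0, 0.0], **common)   # trends differ by group

print(bias_common, bias_diverging)
```

Baseline differences between groups (`mu0_pre`) cancel inside each arm, which is why the baseline term that biased the cross-sectional estimator does not appear here. Once the untreated trends differ across groups, the bias equals \(\sum_g (\mathbb{E}[\gamma_{1g}] - \mathbb{E}[\gamma_{0g}])(\mu_{02g} - \mu_{01g})\) and is no longer zero.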

Forecasting vs. Causal Inference

Simulation: Setup and Imports

import numpy as np
import lib

SEED = 163683749
np.random.seed(SEED)

sim = lib.Matrix(
    n_samples=10000,
    lambda_L=0.5,
    pi_L=0.5,
    pi_S=0.0,
    beta_X=2.0,
    baseline_S=-20.0,
    baseline_L=15.0,
    tau_L=70.0,
    tau_S=1.0,
    noise_std=5.0,
    G_to_X=True,
    random_seed=SEED
)

print("Simulation created successfully")
Simulation created successfully

Simulation Run

df = sim.generate()
pippin = lib.Palantir(df)

display(df.head().round(2))
       X  G  D       Y     Y0      Y1  True_TE  All  GxD
0  12.00  L  1  103.54  33.54  103.54     70.0  All  L_T
1   9.00  L  1   92.93  22.93   92.93     70.0  All  L_T
2  14.09  L  0   43.00  43.00  113.00     70.0  All  L_U
3  18.13  L  0   61.03  61.03  131.03     70.0  All  L_U
4  21.10  S  0   26.00  26.00   27.00      1.0  All  S_U

Basic Data Plot

pippin.plot(title="Simulated Data")

Data Colored by G and D

pippin.plot(hue_col="GxD")

Linear Regression

pippin.plot(hue_col="GxD", regression_col="All")

Linear Regression by Group

pippin.plot(hue_col="GxD", regression_col="G")

Linear Regression by Treatment

pippin.plot(hue_col="GxD", regression_col="D")

Linear Regression by G and D

pippin.plot(hue_col="GxD", regression_col="GxD")