Causal Effects Without Forecasting Gains

Guillem Mirabent Rubinat

May 29, 2026

I: Motivation

Main Goal: Causal Forecasting

  • Causal forecasting: leverage causal information inside a forecasting model whose objective stays strictly predictive.
  • The priority is to minimize total out-of-sample prediction error, with the forecasting algorithms being mostly machine-learning models.

Objective

\[ f^{\star}\;=\;\arg\min_{f\in\mathcal{F}}\;\mathbb{E}\bigl[\ell(Y_{i,t+h},\,f(X_{i,t}))\bigr] \]

\(\ell\) = squared loss (intensity) / cross-entropy (incidence).

With rich \(\mathcal{F}\), the minimizer is the Bayes regression \(f^{\star}(x)=\mathbb{E}[Y_{i,t+h}\mid X_{i,t}=x]\). Trained \(\hat{f}\) approximates \(f^{\star}\).

How can causal information help in forecasting?

I currently see two main channels:

  • Robustness to a shock in the DGP.

  • Preventing the model from ignoring a rare but important event.

Interesting literature plug-in

The robustness channel overlaps with the distribution-shift / domain-adaptation literature in ML, e.g. invariance under hidden confounding (García Meixide & Ríos Insua (2025)).

Step Goal: Improve forecasts

I started with a more modest step goal:

  • Improve the Conflict Forecast model (Mueller & Rauh (2018), (2022a), (2022b), and Mueller et al. (2024)) using Ceasefire Agreements from PA-X data, Bell & Badanjak (2019).

We have evidence that ceasefires can reduce violence (Mueller & Rauh (2024) and other works in progress).

But encoding the ceasefire variable in every “obvious” way yielded zero forecast gain, so the goal became instead:

The real step goal

Document, model, and explain why the causal variable fails to directly improve forecasting performance.

On the causality of ceasefires

A simple diff in diff exercise
| \(w\) | Mode | ATT | SE | \(z\) | \(p\) | 95% CI |
|---|---|---|---|---|---|---|
| 18 | strict | −0.431 | 0.213 | −2.02 | 0.044 | [−0.849, −0.012] |
| 18 | relaxed | −0.568 | 0.196 | −2.89 | 0.004 | [−0.952, −0.183] |
| 24 | strict | −0.534 | 0.306 | −1.74 | 0.082 | [−1.134, +0.067] |
| 24 | relaxed | −0.761 | 0.229 | −3.32 | 0.001 | [−1.210, −0.312] |

II: Literature Review

Causality vs. prediction

  • Explain vs Predict.
    • Breiman (2001) splits statistical practice into data modeling (assume a stochastic model to extract information about nature) and algorithmic modeling (treat the mechanism as unknown, fit any \(f(x)\) and validate by test-set error to predict new responses).
    • Shmueli (2010), (2025) and Carlin & Moreno-Betancur (2025) reorganize this around the question, not the model, where a variable’s admissibility changes with the goal (to explain or to predict). Hence the true-model myth: a less causally faithful model can predict better than a causally correct one.
    • Also Iskhakov et al. (2020), Kleinberg et al. (2015), and Yarkoni & Westfall (2017).
  • Significant ≠ predictive.
    • Lo et al. (2015) formalize the gap: significance only requires a distributional difference (\(f_{D=0}\neq f_{D=1}\)), whereas prediction needs distributional separation, i.e. a small overlap \(\sum_x \min\{f_{D=0}(x),f_{D=1}(x)\}\). A weak but stable mean shift yields a tiny \(p\)-value, but if the distributions overlap by ≈90%, a classifier can still perform poorly.
    • Ward et al. (2010) run this test on canonical civil-war models (Fearon–Laitin, Collier–Hoeffler): highly significant variables sometimes carry very modest AUC contributions out-of-sample.
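The gap between significance and separation can be illustrated with a small closed-form calculation. This is only a sketch under assumptions that are not in the source: two unit-variance normal distributions, a mean shift of 0.25, and 10,000 observations per group. It uses continuous densities, so the overlap is an integral rather than a sum.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative assumptions (not from the source).
delta = 0.25          # mean shift between the D=0 and D=1 distributions
n_per_group = 10_000  # observations in each group

# Two-sample z-test for the difference in means with known unit variances.
z_stat = delta / math.sqrt(2.0 / n_per_group)
p_value = math.erfc(z_stat / math.sqrt(2.0))  # two-sided p-value

# Overlap of the two densities: integral of min{f_0, f_1} = 2 * Phi(-delta / 2).
overlap = 2.0 * normal_cdf(-delta / 2.0)

# Best achievable accuracy with equal class priors (threshold at the midpoint).
best_accuracy = normal_cdf(delta / 2.0)

print(f"z = {z_stat:.1f}, p-value = {p_value:.1e}")
print(f"overlap = {overlap:.3f}, best accuracy = {best_accuracy:.3f}")
```

Under these assumptions the mean shift is overwhelmingly significant, yet the densities overlap by roughly 90% and even the optimal classifier is only a few points better than a coin flip.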

Other relevant literature

  • Causal machine learning. Chernozhukov et al. (2018) is the leading methodology. Their focus is on a valid identification of the treatment effect while controlling for high-dimensional nuisance parameters. Amazing work, but it solves a slightly different problem.

  • Difference-in-differences. Callaway & Sant’Anna (2021) is the main DiD tool I used for the analysis at the beginning (the plot in the motivation). Sun & Abraham (2021) is also a very interesting read, warning against the sinful use of two-way fixed-effects regressions with leads and lags. Mueller & Rauh (2024) is the DiD work I partly followed for the motivation plot.

  • Matching within DiD. For the intricacies of matching when running a DiD analysis: Daw & Hatfield (2018) and Ham & Miratrix (2023) (matching when parallel trends does not hold).

III: Naïve approaches and what they show

Setup and notation

  • \(\mathbf{Y_{t+h}}\) the target, consists of \(h\)-step ahead violence/fatalities, or aggregate measures of those.
  • \(\mathbf{D_t}\) binary treatment, indicating the presence of a ceasefire agreement in period \(t\).
  • \(\mathbf{X_t}\) information set, lags of \(Y\), lags of \(D\), news-topic features, etc.
  • \(Y\) is forecastable from \(X\): \(\widehat{g}(X_t) \approx Y_{t+h} \forall \, h\geq0\).
  • \(D\) is not predictable from \(X\): \(\widehat{f}(X_t) \not\approx D_{t+h}\;\forall \, h\geq 0\).
    • And, in fact, the best possible forecast for \(D\) is the conditional probability \(P(D_{t+h}{=}1 \mid \mathrm{Violence}_t)\), with any other model achieving negative \(R^2\) or baseline-level ROC-AUC and PR-AUC.
  • Imbalance:
    • \(P(Y{=}1)\in[0.17,0.30]\) (on incidence tasks, which are binary classification tasks, not regression ones).
    • \(P(D{=}1)\approx 1/85 \approx 0.012\).

The “obvious” modelling approaches

  • Add \(D_t\) as a feature to RF / LightGBM / Logistic AR.
    • Binary encoding.
    • Ordinal linearly decaying encoding.
  • Give more weight to \(D_t\) in the model training process:
    • Sample weighting: Upweighting the observations with \(D_t=1\) in the loss function of the trees.
    • Feature weight: Upsampling \(D_t\) in the feature space by making it more likely to be selected for each tree.
    • First split forcing: Forcing each tree to use \(D_t\) as the first split.
  • Two stage approach:
    • First, predict \(D_{t+h}\) from \(X_t\) (with \(h\geq0\)).
    • Then, use \(\widehat{D}_{t+h}\) as a feature in the main forecasting model.
    • This approach failed early on due to \(D_{t+h}\) being unpredictable by \(X_t\).

The results

Observed differences in metrics

\[ m(\cdot) \]

denotes the value of metric \(m\) computed from the observed values of \(Y_{t+h}\) and the predictions of a given model trained on a given feature set (here, either \(\{X_t\}\) or \(\{X_t, D_t\}\)).

The following is observed for any of the approaches mentioned:

\[ m(X_t, D_t) \approx m(X_t) \quad \forall \,\, m \in \{\mathrm{ROC\text{-}AUC}, \; \mathrm{PR\text{-}AUC}, \; \mathrm{MSE}\} \]

In order to have a better insight into these results, I have prepared a dashboard to visualize everything.

IV: MSE-gain decomposition

Linear forecasting model

For an \(h\)-step outcome at country–month \((i,t)\),

\[ y_{i,t+h} = X'_{it}\,\beta_h \;+\; \tau_h\, d_{it} \;+\; \varepsilon_{i,t+h}, \]

with

  • \(\mathbf{y_{i,t+h}}\): \(h\)-step ahead outcome for individual \(i\).
  • \(\mathbf{X_{it}}\): information set known at \(t\) (lags of \(Y\), lags of \(D\), news text data).
  • \(\mathbf{\beta_h}\): \(h\)-step ahead predictive coefficients on \(X\).
  • \(\mathbf{\tau_h}\): \(h\)-step ahead ATT, assumed homogeneous across \((i,t)\).
  • \(\mathbf{d_{it}\in\{0,1\}}\): contemporaneous binary treatment.
  • \(\mathbf{\varepsilon_{i,t+h}}\): random shock with \(\mathbb{E}[\varepsilon_{i,t+h}] = 0\).

Conditional Mean Independence \(\mathbb{E}[\varepsilon_{i,t+h}\mid X_{it},\,d_{it}]=0.\)

Why homogeneous \(\tau_h\)

Leaving \(\tau_{ih}\) unconstrained destroys the interpretability of this illustration. Alternatively, we could assume \(\mathbb{E}[\tau_{ih}\mid X_{it}, d_{it}]=\tau_h\), but assuming that treatment-effect heterogeneity is in no way driven by the observable information is, in this context, as strong an assumption as homogeneity itself.

Full-information forecast

Under conditional mean independence, the optimal forecast given the information set and the treatment \((X_{it},d_{it})\) is,

\[ \widehat y^{\,f}_{i,t+h} \;=\; \mathbb{E}[y_{i,t+h}\mid X_{it},d_{it}] \;=\; X'_{it}\beta_h + \tau_h\, d_{it}. \]

The forecast error collapses to the structural shock,

\[ e^{\,f}_{i,t+h} \;=\; y_{i,t+h}-\widehat y^{\,f}_{i,t+h} \;=\; \varepsilon_{i,t+h}, \]

so the mean-squared error is just the irreducible noise:

\[ \mathrm{MSE}_f \;=\; \mathbb{E}\bigl[(e^{\,f}_{i,t+h})^{2}\bigr] \;=\; \mathrm{Var}(\varepsilon_{i,t+h}) \;=\; \sigma_\varepsilon^{2}. \]

Reading

\(\mathrm{MSE}_f\) would be our benchmark: the best loss any forecaster could achieve when both \(X\) and \(D\) are observable at prediction time.

Restricted forecast

Now I restrict the forecast to \(X_{it}\) alone and define the population nowcaster of treatment,

\[ \widetilde d(X_{it}) \;:=\; \mathbb{E}[d_{it}\mid X_{it}]. \]

By iterated expectations, conditional mean independence implies \(\mathbb{E}[\varepsilon_{i,t+h}\mid X_{it}] = 0\), so the optimal forecast given \(X_{it}\) is,

\[ \widehat y^{\,r}_{i,t+h} \;=\; \mathbb{E}[y_{i,t+h}\mid X_{it}] \;=\; X'_{it}\beta_h + \tau_h\,\widetilde d(X_{it}). \]

The restricted error carries an extra term — the unforecastable part of \(D\):

\[ e^{\,r}_{i,t+h} \;=\; \tau_h\bigl(d_{it}-\widetilde d(X_{it})\bigr) + \varepsilon_{i,t+h}. \]

Squaring and taking expectations:

\[ \mathrm{MSE}_r \;=\; \tau_h^{2}\,\mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))^{2}\bigr] \;+\; \sigma_\varepsilon^{2}. \]
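The squaring step relies on the cross term vanishing. As a worked step (added here, using only conditional mean independence and \(\mathbb{E}[\varepsilon_{i,t+h}^{2}]=\sigma_\varepsilon^{2}\)):

\[ \mathbb{E}\bigl[(e^{\,r}_{i,t+h})^{2}\bigr] \;=\; \tau_h^{2}\,\mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))^{2}\bigr] \;+\; 2\,\tau_h\,\mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))\,\varepsilon_{i,t+h}\bigr] \;+\; \mathbb{E}\bigl[\varepsilon_{i,t+h}^{2}\bigr], \]

\[ \mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))\,\varepsilon_{i,t+h}\bigr] \;=\; \mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))\,\mathbb{E}[\varepsilon_{i,t+h}\mid X_{it},d_{it}]\bigr] \;=\; 0. \]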

MSE gain from observing \(D\):

The MSE gain from observing \(D\) on top of \(X\):

\[ \Delta\mathrm{MSE}\;:=\;\mathrm{MSE}_r-\mathrm{MSE}_f\;=\;\tau_h^2\,\mathbb{E}\!\bigl[(d_{it}-\widetilde d(X_{it}))^2\bigr]. \]

By LIE,

\[ \mathbb{E}[d_{it}-\widetilde d(X_{it})]=\mathbb{E}[d_{it}] - \mathbb{E}[\mathbb{E}[d_{it}\mid X_{it}]] = 0, \]

\[ \mathbb{E}[(d_{it}-\widetilde d(X_{it}))^2] = \mathbb{E}[\mathrm{Var}(d_{it}\mid X_{it})]. \]

And we can express \(R^2\) in terms of the variance of the conditional expectation of \(d_{it}\):

\[ R^2_{d|X_{it}}\;:=\;\frac{\mathrm{Var}(\widetilde d(X_{it}))}{\mathrm{Var}(d_{it})}\quad\Longrightarrow\quad \mathbb{E}\left[\mathrm{Var}(d_{it}\mid X_{it})\right]=(1-R^2_{d|X_{it}})\,\mathrm{Var}(d_{it}). \]

With \(d_{it}\) being binary, \(\mathrm{Var}(d_{it})=p(1-p)\):

\[ \boxed{\;\Delta\mathrm{MSE}\;=\;\tau_h^{2}\,\cdot\, p(1-p)\,\cdot\,(1-R^{2}_{d|X_{it}})\;} \]
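The boxed identity can be checked exactly on a small discrete example. The sketch below is illustrative only: the three-valued \(X\), its probabilities, the conditional treatment probabilities, and the value of `tau` are all hypothetical, and `delta_mse` is a helper name introduced here.

```python
# Hypothetical discrete example: X takes three values (all numbers illustrative).
px = [0.5, 0.3, 0.2]             # P(X = x)
pd_given_x = [0.01, 0.02, 0.05]  # P(d = 1 | X = x), i.e. the nowcaster d~(x)


def delta_mse(tau: float, p: float, r2: float) -> float:
    """MSE gain from observing D: tau^2 * p(1-p) * (1 - R^2)."""
    return tau ** 2 * p * (1.0 - p) * (1.0 - r2)


# Unconditional treatment probability and variance of the binary treatment.
p = sum(w * q for w, q in zip(px, pd_given_x))
var_d = p * (1.0 - p)

# R^2 of the nowcast: variance of E[d | X] over the variance of d.
var_nowcast = sum(w * (q - p) ** 2 for w, q in zip(px, pd_given_x))
r2 = var_nowcast / var_d

# Expected conditional variance E[Var(d | X)], computed directly.
expected_cond_var = sum(w * q * (1.0 - q) for w, q in zip(px, pd_given_x))

tau = 0.9  # hypothetical effect size
lhs = tau ** 2 * expected_cond_var  # tau^2 * E[(d - d~(X))^2]
rhs = delta_mse(tau, p, r2)         # boxed formula

print(f"p = {p:.4f}, R^2 = {r2:.4f}")
print(f"direct = {lhs:.6f}, formula = {rhs:.6f}")
```

Both routes give the same number up to floating-point error, because the law of total variance splits \(p(1-p)\) into the variance of the nowcast plus the expected conditional variance.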

MSE decomposition analysis

\(\Delta\mathrm{MSE}\) vanishes whenever any of these three components is near zero:

  • \(\mathbf{\tau_h^2}\): goes to 0 if the effect is too small (small signal).
  • \(\mathbf{p(1-p)}\): quickly drops to 0 when the treatment is imbalanced.
  • \(\mathbf{1-R^2_{d|X_{it}}}\): shrinks linearly toward 0 as the treatment becomes more predictable from \(X_{it}\) (that is, as \(R^2_{d|X_{it}}\to 1\)).

Plug-in for conflict

Let’s focus on a reasonable approach and see the first problem we face when plugging the treatment variable into the models:

\(R^2_{d|X_{it}}\approx 0\), roughly the value attained by a naive forecast that always predicts the median: the treatment is highly unpredictable, and the fitted models achieve only negative \(R^2\).

\(\tau_h\) depends a lot on the task at hand, but let’s assume that \(\tau_h^2 = 1\), just enough to keep the term from shrinking.

\(p\approx 1/85 \Rightarrow p(1-p)\approx 0.01\), so the prevalence penalty is very strong.

\[ \Delta\mathrm{MSE} \;\approx\; 1\cdot 0.01\cdot 1 \;\approx\; 0.01 \]

We need to bear in mind that, for violence intensity (the only regression task we have, predicting log(fatalities)), an ATT of 1 is very generous. Mueller & Rauh (2024) analyze several settings for this task and find no treatment effect larger than 0.9 in absolute value. In this example, the prevalence penalty brings the MSE gain down to a very low level, especially compared with the overall MSE of the forecasts, which lies between 0.42 and 1.91: even in the best case, a generous ATT would yield a gain of roughly 2 to 3% of the forecast MSE.
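The plug-in arithmetic can be reproduced without rounding \(p(1-p)\). The inputs below are the ones stated in the text (\(p=1/85\), \(\tau_h^2=1\), \(R^2\approx 0\), forecast MSE between 0.42 and 1.91); the variable names are illustrative.

```python
p = 1 / 85    # treatment prevalence
tau_sq = 1.0  # generous squared ATT
r2 = 0.0      # treatment treated as unpredictable from X

# MSE gain from the decomposition: tau^2 * p(1-p) * (1 - R^2).
gain = tau_sq * p * (1 - p) * (1 - r2)

# Range of forecast MSE levels reported in the text.
mse_low, mse_high = 0.42, 1.91
share_best_case = gain / mse_low    # gain relative to the smallest MSE
share_worst_case = gain / mse_high  # gain relative to the largest MSE

# Same calculation with the largest effect size mentioned in the text (0.9).
gain_att_09 = 0.9 ** 2 * p * (1 - p) * (1 - r2)

print(f"gain = {gain:.4f}")
print(f"share of MSE: {share_best_case:.1%} (best case), {share_worst_case:.1%} (worst case)")
print(f"gain with |ATT| = 0.9: {gain_att_09:.4f}")
```

Even with the most favourable denominator, the gain stays below 3% of the forecast MSE, and it falls below 1% at the upper end of the MSE range.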

Allowing for OVB

If we drop conditional mean independence, then \(\tau_h=\tau_h^{\mathrm{causal}}+\mathrm{bias}_h\), where \(\mathrm{bias}_h\) is the omitted-variable bias. The exact same algebra now uses the projection coefficient:

\[ \Delta\mathrm{MSE}\;=\;\bigl(\tau_h^{\mathrm{causal}}+\mathrm{bias}_h\bigr)^{2}\cdot p(1-p)\cdot (1-R^{2}_{d|X_{it}}). \]

Nowcasting vs. Forecasting \(D\)

The decomposition uses \(R^2_{d|X_{it}}\) (nowcast), not \(R^2_{d|X_{i,t-k}}\) (forecast).

Applying the law of total variance to \(\mathrm{Var}(\mathbb{E}[d_{it}\mid X_{it}])\), and the tower property of conditional expectation on \(X_{i,t-k}\subseteq X_{it}\),

\[ \mathrm{Var}(\mathbb{E}[d_{it}\mid X_{it}])\;=\;\mathrm{Var}(\mathbb{E}[d_{it}\mid X_{i,t-k}])\;+\;\underbrace{\mathbb{E}\!\bigl[\mathrm{Var}(\mathbb{E}[d_{it}\mid X_{it}]\mid X_{i,t-k})\bigr]}_{\geq 0} \]

thus,

\[ R^2_{d|X_{it}}\;\geq\;R^2_{d|X_{i,t-k}} \]

Takeaway: a highly forecastable treatment is also highly nowcastable, so (low) nowcastability is the stricter predictability condition for the forecast-gain decomposition.

V: From the decomposition to Random Forests

Linear regression almost never assigns \(\hat\tau = 0\)

Frisch–Waugh–Lovell residualization gives, in OLS,

\[ \hat\tau_h^{\mathrm{OLS}}\;=\;\frac{\mathrm{Cov}(\widetilde y,\widetilde d)}{\mathrm{Var}(\widetilde d)}. \]

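Here \(\widetilde y\) and \(\widetilde d\) denote the residuals from regressing \(y\) and \(d\) on \(X\). For any non-zero sample \(\mathrm{Cov}(\widetilde y,\widetilde d)\) (that is, whenever the part of \(D\) not explained by \(X\) co-moves at all with the part of \(Y\) not explained by \(X\), no matter how weakly) the projection coefficient is not zero.

So a linear forecaster almost always assigns a non-zero coefficient to \(D\). In our analysis, the gain from doing so is determined by \(\tau_h^2 \cdot p(1-p)\cdot(1-R^2_{d|X_{it}})\).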
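A minimal sketch of the residualization, assuming a single regressor \(x\) plus a constant and a treatment that is pure noise (true \(\tau=0\)). The data-generating numbers, the seed, and the helper name `residualize` are hypothetical; the point is only that the estimated coefficient is small but not exactly zero.

```python
import random


def residualize(v, x):
    """Residuals from an OLS regression of v on a constant and x."""
    n = len(v)
    mean_v, mean_x = sum(v) / n, sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxv = sum((xi - mean_x) * (vi - mean_v) for xi, vi in zip(x, v))
    slope = sxv / sxx
    return [vi - mean_v - slope * (xi - mean_x) for xi, vi in zip(x, v)]


rng = random.Random(0)
n = 5000

# Illustrative data: x drives y, d is a rare binary variable unrelated to both.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
d = [1.0 if rng.random() < 1 / 85 else 0.0 for _ in range(n)]
y = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]  # true tau = 0

# Frisch-Waugh-Lovell: partial x out of both y and d, then regress residuals.
y_res = residualize(y, x)
d_res = residualize(d, x)
tau_hat = sum(a * b for a, b in zip(y_res, d_res)) / sum(b * b for b in d_res)

print(f"tau_hat = {tau_hat:.4f}")
```

Because the residuals have mean zero, the ratio of sums equals the covariance-over-variance formula above.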

(Regression) Trees split by variance reduction

At a node \(\mathcal{N}\) holding sample \(S\), a candidate split \((j,th)\) on feature \(X_j\) at threshold \(th\) partitions \(S\) into \(S_L=\{i \in S \mid X_{ij} \leq th\}\) and \(S_R=\{i \in S \mid X_{ij} > th\}\). The impurity for regression trees is:

\[ I(S) \;=\; \frac{1}{|S|}\sum_{i\in S}(Y_i-\bar Y_S)^2 \]

The variance reduction (split-gain) is:

\[ \Delta I(j,th)\;=\;I(S)\;-\;\frac{|S_L|}{|S|}\,I(S_L)\;-\;\frac{|S_R|}{|S|}\,I(S_R) \]

The tree picks the optimal split \((j^*,th^*) = \arg\max_{(j,th) \in J \times \mathcal{T}_j} \Delta I(j,th)\), where \(J \subseteq \{1,\dots,d\}\) represents the available features to evaluate (\(J = \{1,\dots,d\}\) for standard trees, or a random subset for Random Forests), and \(\mathcal{T}_j\) is the set of all possible thresholds for feature \(j\).
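A standard identity, added here as a worked step rather than taken from the slides, rewrites the split-gain of any binary partition as a between-group term:

\[ \Delta I(j,th)\;=\;\frac{|S_L|}{|S|}\cdot\frac{|S_R|}{|S|}\cdot\bigl(\bar Y_{S_L}-\bar Y_{S_R}\bigr)^{2}. \]

For a binary feature whose share of ones in the node is \(\hat p_S\), the first two factors equal \(\hat p_S(1-\hat p_S)\), which is the same prevalence factor that appears in the MSE-gain decomposition of IV.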

What \(D\) competes against at each node

At any given split, the tree evaluates the available features and strictly chooses the one that provides the highest immediate signal (variance reduction).

Feature \(D\) is selected at a node only if:

\[ \Delta I(D, 0.5) \;>\; \max_{X_j, th} \Delta I(X_j, th) \]

(The split-gain of \(D\) must be strictly greater than the best possible split from any other feature \(X_j\).)
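To make the competition concrete, here is a minimal sketch of the split search on a toy node. The data, the feature name `lag_y`, and the helper names are hypothetical, and candidate thresholds are taken as midpoints between consecutive distinct values, which is one common convention rather than something the text specifies.

```python
def impurity(y):
    """Mean squared deviation from the node mean, I(S)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


def split_gain(x, y, threshold):
    """Variance reduction from splitting on x <= threshold versus x > threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    n = len(y)
    return impurity(y) - len(left) / n * impurity(left) - len(right) / n * impurity(right)


def best_split(features, y):
    """Return (feature name, threshold, gain) of the highest-gain split."""
    best = (None, None, 0.0)
    for name, x in features.items():
        values = sorted(set(x))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            th = (lo + hi) / 2
            gain = split_gain(x, y, th)
            if gain > best[2]:
                best = (name, th, gain)
    return best


# Toy node: a persistent lag of Y and a rare treatment D (one treated unit).
y = [0.0, 0.1, 0.0, 0.1, 1.0, 0.9, 1.0, 0.6]
features = {
    "lag_y": [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4],
    "D": [0, 0, 0, 0, 0, 0, 0, 1],
}

gain_d = split_gain(features["D"], y, 0.5)
winner, threshold, gain_best = best_split(features, y)
print(f"gain of D split: {gain_d:.4f}")
print(f"selected split: {winner} at {threshold:.2f} with gain {gain_best:.4f}")
```

In this toy node the treated unit has a visibly lower outcome than its neighbours, yet the split on `D` isolates a single observation and its gain is dwarfed by the split on the lagged outcome.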

Why \(D\) Can Be Completely Ignored

  • The “Winner-Takes-All” Rule: If \(D\) does not have the highest split-gain at any node, it will never be selected.
  • Pruning and Stopping: Trees generally need to use stopping criteria in order to avoid over-fitting. If the tree hits its stopping criteria (like max depth or early pruning) before \(D\) ever manages to be the best choice, it will be left out of the model (tree).
  • Forcing splits on \(D\): We can force the trees to split first on \(D\), but in practice this led to an overall worse performance of the model, in line with what we observed in the simulation.

VI: Markov-chain simulation

Setup

States \(S_t\in\{0,1\}\) (peace, conflict). Base transitions:

  • \(p_{pp}=0.98,\; p_{pc}=0.02\).
  • \(p_{cp}=0.20,\; p_{cc}=0.80\).

Treatment \(D_t\in\{0,1\}\) only available when \(S_t=1\). Effect \(\gamma\):

\[ p_{cp,t}\;=\;\min(p_{cp}+\gamma,\,1), \]

decaying linearly to baseline \(p_{cp}\) over \(\ell\) periods after \(D_t=1\).

Target: \(Y_t=\bigvee_{h=1}^{w}\{S_{t+h}=1\}\), \(w\in\{3,12\}\)

Base run calibrated to \(P(D{=}1)\approx 1/85\).
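The setup can be sketched as a short simulation. This is not the code behind the reported results: the treatment-assignment rule (an independent draw with probability `treat_prob` whenever the chain is in conflict), the timing convention (the boost applies from the treatment period itself), and the values of `gamma`, `decay_len`, and `treat_prob` are assumptions made for illustration.

```python
import random


def simulate(T, gamma, decay_len, treat_prob, w, seed=0, p_pc=0.02, p_cp=0.20):
    """Simulate states S, treatments D, and targets Y for a two-state chain."""
    rng = random.Random(seed)
    S, D = [], []
    state = 0           # start in peace
    since_treat = None  # periods elapsed since the last treatment
    for _ in range(T):
        # Treatment is only available while in conflict (assumed random draw).
        treated = 1 if (state == 1 and rng.random() < treat_prob) else 0
        if treated:
            since_treat = 0
        S.append(state)
        D.append(treated)

        # Boost to the conflict-to-peace probability, decaying linearly to zero.
        boost = 0.0
        if since_treat is not None and since_treat < decay_len:
            boost = gamma * (1.0 - since_treat / decay_len)

        if state == 1:
            exit_prob = min(p_cp + boost, 1.0)
            state = 0 if rng.random() < exit_prob else 1
        else:
            state = 1 if rng.random() < p_pc else 0

        if since_treat is not None:
            since_treat += 1

    # Target: any conflict in the next w periods (dropped for the last w periods).
    Y = [int(any(S[t + 1:t + w + 1])) for t in range(T - w)]
    return S, D, Y


# Hypothetical parameter values for a quick run.
S, D, Y = simulate(T=20_000, gamma=0.3, decay_len=6, treat_prob=0.13, w=3, seed=1)
print(f"share in conflict: {sum(S) / len(S):.3f}")
print(f"P(D=1): {sum(D) / len(D):.4f}")
print(f"P(Y=1): {sum(Y) / len(Y):.3f}")
```

The value `treat_prob=0.13` is chosen so that, ignoring the treatment's own effect on time spent in conflict, the baseline conflict share of about 0.09 implies a treatment prevalence near \(1/85\); an actual calibration would need to account for that feedback.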

Sweep results

Treatment-effect sweep: \(\Delta_m = m(X_t,D_t) - m(X_t)\)

\(\Delta_m\) increases as model performance degrades

| Metric | base (\(\tau\!\approx\!0\)) | high (\(\tau\!\to\!1\)) | \(\Delta_m\) at \(\tau\!\to\!1\) | relative drop (base to high) |
|---|---|---|---|---|
| ROC–AUC | 0.764 | 0.685 | +0.004 | \(-10\%\) |
| PR–AUC | 0.549 | 0.394 | +0.053 | \(-28\%\) |
| \(R^2\) | 0.372 | 0.155 | +0.015 | \(-58\%\) |

\(\Delta_m>0\) is observable only in the regime where the forecast problem itself becomes much harder.

What happens around treatment

Average RF forecast error, relative to the treatment period \(t_0\).
  • Before \(t_0\) (in conflict): our forecast underpredicts \(Y\) (average error \(>0\)).
  • At \(t_0-w\): the treatment effect enters the target window and \(Y\) starts falling (on average) toward the forecast.

What happens around treatment II

Average RF forecast error, relative to \(t_0\), with 95% “confidence intervals”.
  • What happens in the previous plot should be taken with a grain of salt, as it portrays the average error across all treatment events. The 95% “confidence intervals” of those errors are very wide.

VII: Conclusion and next steps

Conclusions

  • Ceasefire agreements do not seem to carry enough signal to improve the forecasts of violence, given the current observables.
    • In order to use them, we would need to find a way to detect which ones are more likely to have stronger treatment effects than the current ATT estimates.
    • Getting a better prediction for \(D_{t+h}\) alone could even hurt the performance of our model if it is not accompanied by better detection of more impactful ceasefires.

Silver-lining takeaway

But, more generally, out-of-sample predictive power (or pseudo-OOS) can potentially equip decision-makers with improved ways to rank policies based on their usage of otherwise statistically significant variables.

Next steps

  • Develop a systematic framework to help decision-makers rank policies based on the OOS predictive power of their usage of statistically significant variables.

  • Develop a new approach through Neural Networks:

    • Embeddings-based news encoder. Develop a (potentially multilingual) contextual encoder which could even be applied to local news at sub-national resolution. This approach could provide a richer text signal, potentially improving \(\widetilde{d}(X_{it})\) and forecasts at small administrative levels.

    • Causally-regularized loss. Add an identification penalty alongside the predictive loss: \[ \mathcal{L}(\theta)\;=\;\underbrace{\mathbb{E}\bigl[\ell\bigl(Y_{i,t+h},\,f_{\theta}(X_{it},D_{it})\bigr)\bigr]}_{\text{predictive}}\;+\;\lambda\,\underbrace{\mathcal{R}_{\text{causal}}\bigl(\widehat\tau_h(\theta);\;\tau_h^{\text{target}}\bigr)}_{\text{identification penalty}} \] In a similar fashion as PINNs (Raissi et al. (2019)), but instead of a physics residual, a causal residual with valid identification properties.

    • Reinforcement-learning policy layer. Decision-maker \(\pi\) chooses actions \(a_t\in\mathcal{A}\) (intervention timing, allocation of limited resources, etc.) given the information set, optimizing a Bellman value: \[ V^{\pi}(X_{it})\;=\;\mathbb{E}_{\pi}\!\bigl[\,r(X_{it},a_t)\;+\;\delta\,V^{\pi}(X_{i,t+1})\;\big|\;X_{it}\,\bigr] \] Importantly, conditional on first establishing that the available data supports a valid offline policy-learning environment. Iskhakov et al. (2020) provide very interesting insights on this topic.

Caveat: we still need signal

None of this manufactures signal that the data does not contain. If the causal variables carry only a tiny signal, the penalty term would regularize toward a noisy target. This is not a workaround for the problem identified in IV and V.

Thanks

Guillem Mirabent Rubinat

IAE–CSIC · UAB · Barcelona School of Economics

guillem.mirabent@bse.eu

References

Bell, C., & Badanjak, S. (2019). Introducing PA-X: A new peace agreement database and dataset. Journal of Peace Research, 56(3), 452–466. https://doi.org/10.1177/0022343318819123
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230. https://doi.org/10.1016/j.jeconom.2020.12.001
Carlin, J. B., & Moreno-Betancur, M. (2025). On the uses and abuses of regression models: A call for reform of statistical practice and teaching. Statistics in Medicine, 44, e10244.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Daw, J. R., & Hatfield, L. A. (2018). Matching and regression to the mean in difference-in-differences analysis. Health Services Research, 53(6), 4138–4156. https://doi.org/10.1111/1475-6773.13015
García Meixide, C., & Ríos Insua, D. (2025). Domain adaptation under hidden confounding. Electronic Journal of Statistics, 19(2), 5805–5842. https://doi.org/10.1214/25-EJS2474
Ham, D. W., & Miratrix, L. (2023). Benefits and costs of matching prior to a difference in differences analysis when parallel trends does not hold. arXiv Preprint. https://arxiv.org/abs/2205.08644
Iskhakov, F., Rust, J., & Schjerning, B. (2020). Machine learning and structural econometrics: Contrasts and synergies. The Econometrics Journal, 23(3), S81–S124. https://doi.org/10.1093/ectj/utaa019
Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction policy problems. American Economic Review: Papers & Proceedings, 105(5), 491–495. https://doi.org/10.1257/aer.p20151023
Lo, A., Chernoff, H., Zheng, T., & Lo, S.-H. (2015). Why significant variables aren’t automatically good predictors. Proceedings of the National Academy of Sciences, 112(45), 13892–13897.
Mueller, H., & Rauh, C. (2018). Reading between the lines: Prediction of political violence using newspaper text. American Political Science Review, 112(2), 358–375.
Mueller, H., & Rauh, C. (2022a). The hard problem of prediction for conflict prevention. Journal of the European Economic Association, 20(6), 2440–2467. https://doi.org/10.1093/jeea/jvac025
Mueller, H., & Rauh, C. (2022b). Using past violence and current news to predict changes in violence. International Interactions, 48(4), 579–596.
Mueller, H., & Rauh, C. (2024). Building bridges to peace: A quantitative evaluation of power-sharing agreements. Economic Policy, 39(118), 411–467. https://doi.org/10.1093/epolic/eiad027
Mueller, H., Rauh, C., & Seimon, B. (2024). Introducing a global dataset on conflict forecasts and news topics. Data & Policy, 6. https://doi.org/10.1017/dap.2024.10
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
Shmueli, G. (2025). To explain, to predict, or to describe: Figuring out the study goal [Commentary on “On the uses and abuses of regression models” by Carlin and Moreno-Betancur]. Statistics in Medicine, 44, e10307.
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175–199. https://doi.org/10.1016/j.jeconom.2020.09.006
Ward, M. D., Greenhill, B. D., & Bakke, K. M. (2010). The perils of policy by p-value: Predicting civil conflicts. Journal of Peace Research, 47(4), 363–375. https://doi.org/10.1177/0022343309356491
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393