Transparency in Forecasting

A detailed breakdown of our experimental methodology and comprehensive results comparing AI forecasters against top human superforecasters across multiple prediction markets and datasets.

Our Experimental Framework

Our methodology uses the question set and evaluation framework from ForecastBench.org. We ensure a fair comparison between human and AI forecasters through careful experimental controls.

  • Data Leakage Prevention: We strictly limit data leakage by using models with training cutoffs before the forecast dates, and by allowing access only to articles last updated before the forecast date to supplement model knowledge (see the first sketch after this list).
  • Statistical Significance: All results are tested for statistical significance using both bootstrap methods and Mann-Whitney U tests to ensure robust conclusions (see the second sketch after this list).
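
As a minimal sketch of the retrieval cutoff described above, assuming each retrieved article carries a last-updated timestamp (the Article type and field names here are hypothetical illustrations, not part of ForecastBench):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    url: str
    last_updated: date  # hypothetical metadata field; real article metadata may differ
    text: str

def drop_leaky_articles(articles: list[Article], forecast_date: date) -> list[Article]:
    """Keep only articles last updated strictly before the forecast date,
    so no information from on or after that date can reach the model."""
    return [a for a in articles if a.last_updated < forecast_date]
```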
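And a sketch of the two tests applied to per-question Brier scores: SciPy's mannwhitneyu plus a simple null-centered paired bootstrap on the mean difference. The write-up does not specify the exact resampling scheme, so the bootstrap shown here is one common choice, not necessarily the one in our pipeline:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def bootstrap_pvalue(human: np.ndarray, ai: np.ndarray,
                     n_boot: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired bootstrap test for a difference in mean Brier score.
    Centers the per-question differences to enforce the null (mean diff = 0),
    resamples questions with replacement, and reports how often a resampled
    mean is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    diffs = human - ai                  # per-question Brier score differences
    observed = diffs.mean()
    centered = diffs - observed         # null hypothesis: true mean difference is 0
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = centered[idx].mean(axis=1)
    return float(np.mean(np.abs(boot_means) >= abs(observed)))

# Toy per-question Brier scores (illustrative only, not real results):
human = np.array([0.02, 0.10, 0.05, 0.30, 0.01])
ai = np.array([0.04, 0.12, 0.03, 0.35, 0.02])
p_boot = bootstrap_pvalue(human, ai)
p_mw = mannwhitneyu(human, ai, alternative="two-sided").pvalue
print(f"bootstrap p={p_boot:.4f}, Mann-Whitney U p={p_mw:.4f}")
```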

Full Brier Score Comparison

| Source | Winner | Δ (Human − DelPy) | Questions | Human Mean | DelPy Mean | P-Value (Bootstrap) | P-Value (Mann-Whitney U) |
|---|---|---|---|---|---|---|---|
| | 👨 | -0.0139 | 88 | 0.0354 | 0.0492 | 0.5225 | 0.2484 |
| | 👨 | -0.0489 | 72 | 0.1191 | 0.1680 | 0.0883 | 0.0436 |
| FRED Economic Data | 👨 | -0.0057 | 86 | 0.1600 | 0.1656 | 0.8338 | 0.0274 |
| | 👨 | -0.0140 | 11 | 0.0720 | 0.0860 | 0.8467 | 0.0302 |
| Manifold Markets | 🤖 | +0.0520 | 22 | 0.1343 | 0.0823 | 0.4099 | 0.2648 |
| Metaculus | 👨 | -0.0064 | 20 | 0.0355 | 0.0419 | 0.8147 | 0.6750 |
| Polymarket | 👨 | -0.0868 | 22 | 0.0524 | 0.1392 | 0.1705 | 0.5338 |
| Wikipedia | 👨 | -0.0587 | 88 | 0.0271 | 0.0858 | 0.0189 | 0.2869 |
| Yahoo! Finance | 🤖 | +0.0283 | 88 | 0.2554 | 0.2272 | 0.0001 | 0.0002 |

Note: Lower Brier scores indicate better forecasting accuracy. Δ is the human mean minus the DelPy mean, so positive differences show DelPy (our AI) outperforming top human superforecasters and negative differences show humans performing better; 👨 marks sources where the humans had the lower (better) mean Brier score, 🤖 sources where DelPy did. Differences are considered statistically significant at p < 0.05.
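
For reference, the Brier score of a set of binary forecasts is the mean squared difference between the stated probabilities and the 0/1 outcomes, as in this small sketch:

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and binary
    outcomes (0 or 1). 0 is a perfect score; always guessing 50%
    scores 0.25."""
    return float(np.mean((probs - outcomes) ** 2))

# A forecaster who said 90% and 20% on two questions that resolved
# yes and no scores (0.1**2 + 0.2**2) / 2 = 0.025.
print(brier_score(np.array([0.9, 0.2]), np.array([1, 0])))
```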

Get in Touch

Interested in learning more about DelPy, our AI forecasting system, or collaborating on AI Safety research? We'd love to hear from you.