A detailed breakdown of our experimental methodology and comprehensive results comparing AI forecasters against top human superforecasters across multiple prediction markets and datasets.
Our methodology uses the question set and evaluation framework from ForecastBench.org. We ensure a fair comparison between human and AI forecasters through careful experimental controls.
| Source | Winner | Questions | Human Mean | DelPy Mean | P-Value (Bootstrap) | P-Value (M-W) | Significant |
|---|---|---|---|---|---|---|---|
| ACLED | 👨 -0.0139 | 88 | 0.0354 | 0.0492 | 0.5225 | 0.2484 | |
| DBnomics | 👨 -0.0489 | 72 | 0.1191 | 0.1680 | 0.0883 | 0.0436 | |
| FRED Economic Data | 👨 -0.0057 | 86 | 0.1600 | 0.1656 | 0.8338 | 0.0274 | |
| | 👨 -0.0140 | 11 | 0.0720 | 0.0860 | 0.8467 | 0.0302 | |
| Manifold Markets | 🤖 +0.0520 | 22 | 0.1343 | 0.0823 | 0.4099 | 0.2648 | |
| Metaculus | 👨 -0.0064 | 20 | 0.0355 | 0.0419 | 0.8147 | 0.6750 | |
| Polymarket | 👨 -0.0868 | 22 | 0.0524 | 0.1392 | 0.1705 | 0.5338 | |
| Wikipedia | 👨 -0.0587 | 88 | 0.0271 | 0.0858 | 0.0189 | 0.2869 | |
| Yahoo! Finance | 🤖 +0.0283 | 88 | 0.2554 | 0.2272 | 0.0001 | 0.0002 | |
Note: Lower Brier scores indicate better forecasting accuracy. The signed number in the Winner column is the human mean Brier score minus the DelPy mean: 🤖 marks sources where DelPy (our AI) outperformed the top human superforecasters, and 👨 marks sources where the humans performed better. Rows with a check in the Significant column indicate statistically significant results (p < 0.05).
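As a rough sketch of how a per-source comparison like the one above can be computed: score each question with the Brier score (squared error of the probability forecast), then test the difference in mean scores with a paired bootstrap and a Mann-Whitney U test. The function names and the exact bootstrap procedure here are illustrative assumptions, not necessarily the precise pipeline behind the table.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def brier_scores(probs, outcomes):
    """Per-question Brier scores: squared error between the forecast
    probability and the binary resolved outcome (0 or 1)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return (probs - outcomes) ** 2

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the mean per-question score
    difference, resampling questions with replacement (paired design)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    # Resample the paired differences and look at the resampled means.
    boot_means = diffs[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    p = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return min(p, 1.0)

# Toy example with hypothetical forecasts on three resolved questions.
outcomes = [0, 1, 0]
human = brier_scores([0.10, 0.80, 0.30], outcomes)
ai = brier_scores([0.20, 0.60, 0.40], outcomes)

p_boot = paired_bootstrap_pvalue(human, ai)
p_mw = mannwhitneyu(human, ai, alternative="two-sided").pvalue
```

With only a handful of questions per source (11-88 in the table), the bootstrap and the rank-based Mann-Whitney test can disagree, which is why the table reports both.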
Interested in learning more about DelPy, our AI forecasting system, or collaborating on AI Safety research? We'd love to hear from you.