A detailed breakdown of our experimental methodology and comprehensive results comparing AI forecasters against top human superforecasters across multiple prediction markets and datasets.
Our methodology uses the question set and evaluation framework from ForecastBench.org. We ensure a fair comparison between human and AI forecasters through careful experimental controls.
| Source | Winner | Questions | Human Mean | DelPy Mean | P-Value (Bootstrap) | P-Value (M-W) | Significant |
|---|---|---|---|---|---|---|---|
| ACLED | 👨 -0.0139 | 88 | 0.0354 | 0.0492 | 0.5225 | 0.2484 | |
| DBnomics | 👨 -0.0489 | 72 | 0.1191 | 0.1680 | 0.0883 | 0.0436 | |
| FRED Economic Data | 👨 -0.0057 | 86 | 0.1600 | 0.1656 | 0.8338 | 0.0274 | |
| | 👨 -0.0140 | 11 | 0.0720 | 0.0860 | 0.8467 | 0.0302 | |
| Manifold Markets | 🤖 +0.0520 | 22 | 0.1343 | 0.0823 | 0.4099 | 0.2648 | |
| Metaculus | 👨 -0.0064 | 20 | 0.0355 | 0.0419 | 0.8147 | 0.6750 | |
| Polymarket | 👨 -0.0868 | 22 | 0.0524 | 0.1392 | 0.1705 | 0.5338 | |
| Wikipedia | 👨 -0.0587 | 88 | 0.0271 | 0.0858 | 0.0189 | 0.2869 | |
| Yahoo! Finance | 🤖 +0.0283 | 88 | 0.2554 | 0.2272 | 0.0001 | 0.0002 | |
Note: Lower Brier scores indicate better forecasting accuracy. The signed number in the Winner column is the human mean Brier score minus the DelPy mean: 🤖 marks sources where DelPy (our AI) outperformed the top human superforecasters, and 👨 marks sources where the humans performed better. Rows with a check in the Significant column indicate statistically significant results (p < 0.05).
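As a rough sketch of how a per-source comparison like the one above can be computed: score each question with the Brier score (squared error of the probability forecast), then test the difference in mean scores with a paired bootstrap and a Mann-Whitney U test. The function names and the exact bootstrap procedure here are illustrative assumptions, not necessarily the precise pipeline behind the table.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def brier_scores(probs, outcomes):
    """Per-question Brier scores: squared error between the forecast
    probability and the binary resolved outcome (0 or 1)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return (probs - outcomes) ** 2

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the mean per-question score
    difference, resampling questions with replacement (paired design)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    # Resample the paired differences and look at the resampled means.
    boot_means = diffs[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    p = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return min(p, 1.0)

# Toy example with hypothetical forecasts on three resolved questions.
outcomes = [0, 1, 0]
human = brier_scores([0.10, 0.80, 0.30], outcomes)
ai = brier_scores([0.20, 0.60, 0.40], outcomes)

p_boot = paired_bootstrap_pvalue(human, ai)
p_mw = mannwhitneyu(human, ai, alternative="two-sided").pvalue
```

With only a handful of questions per source (11-88 in the table), the bootstrap and the rank-based Mann-Whitney test can disagree, which is why the table reports both.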
Interested in learning more about DelPy, our AI forecasting system, or collaborating on AI Safety research? We'd love to hear from you.