ForesightFlow

Forecasting & AI agents

We study how AI systems — LLMs, ensembles, and multi-agent pipelines — perform as forecasters on real prediction markets, and how to evaluate them rigorously using proper scoring rules.

What we ask

  • Do LLMs have systematically miscalibrated beliefs on specific question categories?
  • How does ensemble diversity affect the Brier score on real prediction markets?
  • Can agent coordination improve calibration without introducing shared-information bias?
  • What on-chain reputation signals predict future agent performance?

How we approach it

  • Proper scoring rules (Brier score, log score) on historical and live Polymarket data
  • LLM elicitation with structured uncertainty quantification
  • Multi-agent ensemble architectures with calibration-aware aggregation (see the aggregation sketch after this list)
  • On-chain reputation tracking via commit-reveal protocols (Foresight Arena); the commit-reveal pattern is sketched below
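
As a rough illustration of calibration-aware aggregation, the sketch below pools per-agent probabilities in log-odds space and weights each agent by the inverse of its trailing Brier score. The function names and the weighting scheme are illustrative assumptions, not the pipeline's actual interface.

```python
import math

def aggregate_forecasts(probs, trailing_briers, eps=1e-6):
    """Combine per-agent probabilities into one pooled forecast.

    probs           -- list of probabilities in (0, 1), one per agent
    trailing_briers -- each agent's recent mean Brier score (lower is better),
                       used here as a crude skill weight
    """
    # Weight agents by inverse trailing Brier score, so better-calibrated
    # agents pull the pooled forecast harder.
    weights = [1.0 / max(b, eps) for b in trailing_briers]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Pool in log-odds space: a weighted average of logits avoids the
    # washed-out probabilities a raw mean tends to produce.
    logits = [math.log(p / (1.0 - p)) for p in probs]
    pooled_logit = sum(w * l for w, l in zip(weights, logits))
    return 1.0 / (1.0 + math.exp(-pooled_logit))

# Three agents; the best-calibrated one (Brier 0.10) dominates the pool.
print(aggregate_forecasts([0.62, 0.70, 0.55], [0.10, 0.22, 0.30]))
```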
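The commit-reveal pattern itself is standard: an agent first publishes a hash of its forecast plus a random salt, then reveals both once the question closes, so the forecast is timestamped without being copyable. The sketch below shows that generic pattern; the payload format and function names are assumptions, not Foresight Arena's actual contract interface.

```python
import hashlib
import secrets

def commit(agent_id: str, question_id: str, prob: float) -> tuple[str, str]:
    """Commit phase: publish only a hash of the forecast plus a random salt."""
    salt = secrets.token_hex(16)
    payload = f"{agent_id}|{question_id}|{prob:.6f}|{salt}"
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return digest, salt  # digest is published; salt stays private until reveal

def verify_reveal(digest: str, agent_id: str, question_id: str,
                  prob: float, salt: str) -> bool:
    """Reveal phase: anyone can check the revealed forecast matches the commitment."""
    payload = f"{agent_id}|{question_id}|{prob:.6f}|{salt}"
    return hashlib.sha256(payload.encode()).hexdigest() == digest

digest, salt = commit("agent-7", "example-question", 0.63)
assert verify_reveal(digest, "agent-7", "example-question", 0.63, salt)
```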

The quality of a probabilistic forecast can only be evaluated after the event resolves, and only with a proper scoring rule, one whose expected value an agent minimizes by reporting its true belief. This seems obvious, but it rules out most existing AI benchmarks: accuracy on a held-out test set says nothing about whether a model's stated probabilities are calibrated against real-world outcomes.
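
For binary questions, the two scores named above reduce to a few lines. This is a minimal sketch with illustrative function names; lower is better for both, and each is proper, so an agent minimizes its expected score only by reporting its true probability.

```python
import math

def brier_score(prob: float, outcome: int) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob - outcome) ** 2

def log_score(prob: float, outcome: int, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the realized outcome."""
    p = min(max(prob, eps), 1.0 - eps)
    return -math.log(p if outcome == 1 else 1.0 - p)

# A resolved question: the agent said 0.70 and the event happened.
print(brier_score(0.70, 1))  # 0.09
print(log_score(0.70, 1))    # ~0.357
```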

Prediction markets provide something most NLP benchmarks do not: real-money skin in the game, adversarial market-makers, and live resolution against a ground truth. We treat them as the most demanding calibration test available for AI forecasting agents.

Publications in this track