Foresight Arena: A Decentralized On-Chain Benchmark for Evaluating AI Forecasting Agents
Maksym Nechepurenko · 2026 · Working Paper
We study how AI systems — LLMs, ensembles, and multi-agent pipelines — perform as forecasters on real prediction markets, and how to evaluate them rigorously using proper scoring rules.
The quality of a probabilistic forecast can only be judged after the outcome resolves, and only with a proper scoring rule, i.e., one whose expected value is optimized by reporting one's true beliefs. This seems obvious, but it rules out most existing AI benchmarks: accuracy on a held-out test set is not calibration against the distribution of real-world outcomes.
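As a minimal sketch of what "proper scoring rule" means here, the snippet below scores binary forecasts with the Brier score and the log score after resolution; both are strictly proper, so an agent minimizes its expected loss only by reporting its true probability. The function names and example numbers are illustrative, not the paper's evaluation code.

```python
import numpy as np

def brier_score(p: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between forecast probability p and binary outcome y.
    Strictly proper: expected score is minimized only by the true probability."""
    return float(np.mean((p - y) ** 2))

def log_score(p: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the realized outcome; also strictly proper."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Forecasts can only be scored once the markets resolve.
forecasts = np.array([0.9, 0.2, 0.65])   # agent's probabilities for "YES"
outcomes  = np.array([1, 0, 0])          # resolved outcomes
print(brier_score(forecasts, outcomes))  # lower is better
print(log_score(forecasts, outcomes))
```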
Prediction markets provide something most NLP benchmarks do not: real-money skin in the game, adversarial market-makers, and live resolution against ground truth. We treat them as the hardest possible calibration test for AI forecasting agents.
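To make the calibration test concrete, one can bucket an agent's forecasts on resolved markets and compare each bucket's mean predicted probability with its empirical resolution rate (the data behind a reliability diagram). The helper below is an assumed sketch of that check, with synthetic data standing in for real market resolutions; it is not drawn from the paper's pipeline.

```python
import numpy as np

def calibration_table(p: np.ndarray, y: np.ndarray, n_bins: int = 10):
    """Group forecasts by predicted probability and compare each group's
    mean forecast to the empirical resolution rate."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            rows.append((lo, hi, float(p[mask].mean()),
                         float(y[mask].mean()), int(mask.sum())))
    return rows  # (bin_lo, bin_hi, mean_forecast, empirical_rate, count)

# Synthetic example: a perfectly calibrated forecaster on 500 resolved questions.
rng = np.random.default_rng(0)
p = rng.uniform(size=500)                     # hypothetical agent forecasts
y = (rng.uniform(size=500) < p).astype(int)   # outcomes drawn at the stated rates
for lo, hi, mean_p, rate, n in calibration_table(p, y):
    print(f"[{lo:.1f}, {hi:.1f}): forecast {mean_p:.2f} vs resolved {rate:.2f} (n={n})")
```

A well-calibrated agent's mean forecast and empirical resolution rate should track each other across bins, up to sampling noise.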