ForesightFlow

Forecasting & AI agents

We study how AI systems — LLMs, ensembles, and multi-agent pipelines — perform as forecasters on real prediction markets, and how to evaluate them rigorously using proper scoring rules.

What we ask

  • Do LLMs have systematically miscalibrated beliefs on specific question categories?
  • How does ensemble diversity affect the Brier score on real prediction markets?
  • Can agent coordination improve calibration without introducing shared-information bias?
  • What on-chain reputation signals predict future agent performance?

How we approach it

  • Proper scoring rules (Brier score, log score) on historical and live Polymarket data
  • LLM elicitation with structured uncertainty quantification
  • Multi-agent ensemble architectures with calibration-aware aggregation (see the aggregation sketch after this list)
  • On-chain reputation tracking via commit-reveal protocols (Foresight Arena); the commit-reveal pattern is sketched below
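
As a rough illustration of calibration-aware aggregation, the sketch below pools per-agent probabilities in log-odds space and weights each agent by the inverse of its trailing Brier score. The function names and the weighting scheme are illustrative assumptions, not the pipeline's actual interface.

```python
import math

def aggregate_forecasts(probs, trailing_briers, eps=1e-6):
    """Combine per-agent probabilities into one pooled forecast.

    probs           -- list of probabilities in (0, 1), one per agent
    trailing_briers -- each agent's recent mean Brier score (lower is better),
                       used here as a crude skill weight
    """
    # Weight agents by inverse trailing Brier score, so better-calibrated
    # agents pull the pooled forecast harder.
    weights = [1.0 / max(b, eps) for b in trailing_briers]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Pool in log-odds space: a weighted average of logits avoids the
    # washed-out probabilities a raw mean tends to produce.
    logits = [math.log(p / (1.0 - p)) for p in probs]
    pooled_logit = sum(w * l for w, l in zip(weights, logits))
    return 1.0 / (1.0 + math.exp(-pooled_logit))

# Three agents; the best-calibrated one (Brier 0.10) dominates the pool.
print(aggregate_forecasts([0.62, 0.70, 0.55], [0.10, 0.22, 0.30]))
```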
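The commit-reveal pattern itself is standard: an agent first publishes a hash of its forecast plus a random salt, then reveals both once the question closes, so the forecast is timestamped without being copyable. The sketch below shows that generic pattern; the payload format and function names are assumptions, not Foresight Arena's actual contract interface.

```python
import hashlib
import secrets

def commit(agent_id: str, question_id: str, prob: float) -> tuple[str, str]:
    """Commit phase: publish only a hash of the forecast plus a random salt."""
    salt = secrets.token_hex(16)
    payload = f"{agent_id}|{question_id}|{prob:.6f}|{salt}"
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return digest, salt  # digest is published; salt stays private until reveal

def verify_reveal(digest: str, agent_id: str, question_id: str,
                  prob: float, salt: str) -> bool:
    """Reveal phase: anyone can check the revealed forecast matches the commitment."""
    payload = f"{agent_id}|{question_id}|{prob:.6f}|{salt}"
    return hashlib.sha256(payload.encode()).hexdigest() == digest

digest, salt = commit("agent-7", "example-question", 0.63)
assert verify_reveal(digest, "agent-7", "example-question", 0.63, salt)
```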

The quality of a probabilistic forecast can only be evaluated after the event resolves, and only with a proper scoring rule, one whose expected value an agent minimizes by reporting its true belief. This seems obvious, but it rules out most existing AI benchmarks: accuracy on a held-out test set says nothing about whether a model's stated probabilities are calibrated against real-world outcomes.
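
For binary questions, the two scores named above reduce to a few lines. This is a minimal sketch with illustrative function names; lower is better for both, and each is proper, so an agent minimizes its expected score only by reporting its true probability.

```python
import math

def brier_score(prob: float, outcome: int) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob - outcome) ** 2

def log_score(prob: float, outcome: int, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the realized outcome."""
    p = min(max(prob, eps), 1.0 - eps)
    return -math.log(p if outcome == 1 else 1.0 - p)

# A resolved question: the agent said 0.70 and the event happened.
print(brier_score(0.70, 1))  # 0.09
print(log_score(0.70, 1))    # ~0.357
```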

Prediction markets provide something most NLP benchmarks do not: real-money skin in the game, adversarial market-makers, and live resolution against a ground truth. We treat them as the most demanding calibration test available for AI forecasting agents.

Publications in this track