Why humans are still much better than AI at forecasting the future
Being able to predict the future seems nice. I would’ve liked to know that my S&P 500 index funds would peak in mid-February and then fall off a cliff in April. It would’ve been helpful for my reporting in the lead-up to the inauguration to know just how far the Trump administration would go to attack foreign aid. And while I’m at it, I’d like some sense of where mortgage rates are going, so I can better judge when to buy a house.
The art of systematically making these kinds of predictions is called forecasting, and we’ve known for a long time that some people — so-called superforecasters — are better at it than others. But even they aren’t Nostradamuses; major events still surprise them sometimes. Superforecasters’ work takes time and effort, and there aren’t many of them.
It’s also hard for us mortals to emulate what makes them so effective. I wrote a whole profile of one of the world’s best superforecaster teams, called the Samotsvety group, and despite their tips and tricks, I didn’t leave the experience as a superforecaster myself.
But you know what’s sometimes better at learning than I am? AI models.
In recent years, the forecasting community has increasingly pivoted to trying to build and learn from AI-fueled prediction bots. More specialized fields have, of course, been doing this in various forms for a while; algorithmic trading in financial markets, for instance, where computer programs using various prediction tools trade assets without human intervention, has been around for decades. But using AI as a more general-purpose forecasting tool is a newer idea.
Everyone I spoke with in the field agrees that the top human forecasters still beat machines.
The best evidence for this comes from tournaments run quarterly by Metaculus, a leading prediction website, where participants compete to forecast the future most accurately. Originally for humans only, Metaculus recently began bot tournaments, where contestants enter custom-made AI-driven bots whose track record can then be compared to the best human predictors.
So far, we have results for three quarters — Q3 and Q4 of 2024, and Q1 of 2025 — and in each quarter, Metaculus’s human superforecasters beat the best machines. (If you want to try, there’s a $30,000 prize for each quarter’s winner.)
But the gap, Metaculus CEO Deger Turan tells me, is narrowing with each quarter. More intriguing still is the fact that the best model in Q1 this year was incredibly simple: It just pulled some recent news articles, then asked o1, at the time the most advanced OpenAI model, to make its own prediction. This approach couldn’t beat humans, but it beat a lot of AI models that were much more sophisticated.
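To get a feel for how simple that approach is, here is a minimal sketch of what such a bot might look like. This is my own illustration, not the actual winning entry: the fetch_recent_articles helper is hypothetical, and the prompt wording and model name are assumptions.

```python
# Minimal sketch of a "pull recent news, then ask a frontier model" forecasting bot.
# fetch_recent_articles() is a hypothetical helper standing in for whatever news
# source a real bot would query; the prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()

def fetch_recent_articles(question: str) -> list[str]:
    """Hypothetical helper: return snippets of recent news relevant to the question."""
    raise NotImplementedError("plug in your news source here")

def naive_forecast(question: str) -> str:
    articles = "\n\n".join(fetch_recent_articles(question))
    prompt = (
        f"Recent news:\n{articles}\n\n"
        f"Question: {question}\n"
        "Give your best probability (0-100%) that this resolves YES, "
        "with a brief justification."
    )
    # Assumed model name; swap in whichever frontier model you have access to.
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```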
o1 is no longer the cutting-edge OpenAI model; as of this writing, it’s o3. And by some metrics, o3 isn’t as good as Gemini 2.5 Pro, the best model from Google DeepMind. All of which is to say: While humans basically stay the same, the AIs are only getting better, and that could mean that the predictions they make will only get better as well.
Almost every arena of human life relies on good prediction. Lawyers predict whether or not their opponent will agree to a settlement. Construction supervisors predict when a building project will finish. Movie producers predict what script will be a hit. Singles predict whether the person they’re chatting up would prefer a first date over coffee or beer.
We’re not very good at these predictions right now, but we could get much, much better soon. We’re only just starting to realize the implications of that kind of shift.
How AI forecasting works
In theory, an “AI forecaster” is just a program that relies upon machine learning models of one form or another to predict future events.
Prediction is at the heart of what machine learning models do: They analyze vast reams of data and learn patterns that let them make predictions beyond that data. For generative models like ChatGPT or Claude or Midjourney, that means predicting the next word or pixel a user wants in response to a query. For the algorithmic trading models that financiers have been building at least since the founding of the hedge fund Renaissance Technologies in 1982, it means predicting the future path of asset prices in stock, bond, and other markets, based on past performance.
For more generalized predictions about world events, forecasters these days tend to rely heavily on general-purpose models from firms like xAI, Google DeepMind, OpenAI, or Anthropic. These are trained with hundreds of millions of dollars’ worth of GPUs over several months, which is one reason why it’s much more promising for the relatively small teams working on using AI for forecasting to piggyback on all that training than to start from scratch. (Disclosure: Vox Media is one of several publishers that has signed partnership agreements with OpenAI. Our reporting remains editorially independent. One of Anthropic’s early investors is James McClave, whose BEMC Foundation helps fund Future Perfect.)
Glossary of forecasting terms
Forecasting is a world unto itself, with plenty of forecasting-specific jargon and references. Here’s a brief guide to some common terms you’ll hear in the forecasting world.
Base rate: the historical rate at which a given phenomenon happens (e.g., the rate at which countries go to war, or the percentage of days when the S&P 500 drops overall), before adjusting for specifics of a given case. Establishing your base rate is often the first step in forecasting.
Brier score: a common measure of how accurate forecasts turn out to be. Computed using a formula measuring the distance between a forecaster’s assigned probabilities and actual outcomes. (A small worked example appears just after this glossary.)
Calibration: how well the probabilities a forecaster assigns to events happening match up with whether they actually happen — e.g., do events the forecaster estimates as 70 percent likely occur 70 percent of the time?
Metaculus: a popular website where forecasters can make predictions and compare accuracy. Not structured like a prediction market.
Prediction market: a stock-type market, usually online, where participants can bet real currency, cryptocurrency, or play money on specific events happening or not. Kalshi, Polymarket, and Manifold are popular prediction markets.
Scope sensitivity: the ability to reason clearly about the scale of different phenomena. An important attribute of good forecasters.
Superforecaster: a human whose forecasts are reliably much more accurate, and better calibrated, than the average human’s.
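To make the Brier score and calibration entries above concrete, here is a tiny worked example. The probabilities and outcomes are made up for illustration, not drawn from any real tournament.

```python
# Minimal illustration of the Brier score and calibration entries above.
# Forecasts are probabilities in [0, 1]; outcomes are 1 (happened) or 0 (didn't).
# The numbers here are made up for demonstration.

forecasts = [0.9, 0.7, 0.7, 0.2, 0.6]
outcomes  = [1,   1,   0,   0,   1]

# Brier score: mean squared distance between probabilities and outcomes.
# 0.0 is perfect; always guessing 50% earns 0.25; lower is better.
brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"Brier score: {brier:.3f}")

# Calibration check: of the events given roughly 70% probability, what share happened?
bucket = [(p, o) for p, o in zip(forecasts, outcomes) if 0.65 <= p <= 0.75]
if bucket:
    hit_rate = sum(o for _, o in bucket) / len(bucket)
    print(f"Events forecast at ~70%: {hit_rate:.0%} actually occurred")
```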
One team based at the Center for AI Safety released a paper in October 2024 claiming “superhuman” forecasting ability by simply prompting a large language model (in this case, OpenAI’s 4o model) and scraping recent news articles. That claim crumbled under scrutiny: Other researchers could not replicate the finding, and it appeared that the model predicted well partly because it had access to more recent data than it should have, a problem called “data contamination.”
Imagine that, in late 2024, you’re trying to train a model to predict who that year’s Democratic nominee will be. You know it will be Kamala Harris, but to train the model, you try to give it only data from before that became obvious. If that data isn’t purely from early 2024, though, and includes references to Harris’s eventual nomination, your forecast could perform very well, but only because it has access to data it’d never have in a real-world context.
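To make the cutoff concrete, here is a minimal sketch of the kind of date filter that guards against contamination. The Article record and the sample articles are hypothetical; the only point is that nothing published after the forecast date should reach the model.

```python
# A minimal sketch of a date cutoff that guards against data contamination.
# Article and the sample data below are hypothetical illustrations.
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    published: date
    text: str

def training_corpus(articles: list[Article], forecast_date: date) -> list[Article]:
    """Keep only articles the forecaster could actually have read at forecast time."""
    return [a for a in articles if a.published < forecast_date]

all_articles = [
    Article(date(2024, 2, 10), "Primary season coverage..."),
    Article(date(2024, 8, 22), "Harris formally accepts the nomination..."),
]

# For a question posed in early 2024, only the first article survives the cutoff,
# so no hint of the eventual nominee leaks into training.
corpus = training_corpus(all_articles, date(2024, 3, 1))
```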
A more promising approach comes from UC Berkeley computer scientists Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. Their forecaster also relied on language models, but added extensive “scaffolding”: Instead of simply letting the bot run free, they asked the language model to do a series of very specific things, in a very specific order, to get the final result (a rough sketch of the pipeline follows this list):
- First, the model is asked to come up with a set of queries to send to a newswire service to gain more information on the question being forecast.
- Then the queries are sent, the news service sends replies, and the language model is asked which replies are likely to be most helpful. It then summarizes the top replies.
- This process is first performed on old questions for which answers are already known, and with old news articles. A model is then asked to predict based on summaries of these old articles; it is fine-tuned based on whether these predictions are accurate or not, to improve its performance.
- That fine-tuned model, and several other more general-purpose models, are then asked for predictions, and an average of the different models’ views is used.
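Here is the rough sketch promised above. Every interface and function name is a placeholder of my own, compressing the pipeline into its basic shape rather than reproducing the authors’ code; the fine-tuning step is reduced to a comment.

```python
# Compressed sketch of a scaffolded forecasting pipeline like the one described above.
# LLM and NewsAPI are placeholder interfaces, not the authors' actual components.
from typing import Protocol

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...
    def predict_probability(self, question: str, context: str) -> float: ...

class NewsAPI(Protocol):
    def search(self, query: str) -> list[str]: ...

def generate_queries(question: str, llm: LLM) -> list[str]:
    # Step 1: ask the model what it should look up in a newswire archive.
    return llm.complete(f"List search queries useful for forecasting: {question}").splitlines()

def research_summary(question: str, llm: LLM, news: NewsAPI) -> str:
    # Step 2: fetch articles, have the model pick the most helpful ones, then summarize them.
    articles = [a for q in generate_queries(question, llm) for a in news.search(q)]
    best = llm.complete("Which of these are most helpful? Keep the top few:\n" + "\n".join(articles))
    return llm.complete("Summarize these articles:\n" + best)

def forecast(question: str, fine_tuned: LLM, generalists: list[LLM], news: NewsAPI) -> float:
    # Steps 3-4: a model fine-tuned on already-resolved questions (using only
    # period-appropriate news) plus several general-purpose models each give a
    # probability, and the final forecast is their average.
    context = research_summary(question, fine_tuned, news)
    models = [fine_tuned, *generalists]
    return sum(m.predict_probability(question, context) for m in models) / len(models)
```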
Just asking the language models directly, they found, led to terrible predictions: “Most models’ scores are around or worse than random guessing,” the team wrote in their paper. But once they were able to fine-tune models by showing what thousands of successful predictions (and their underlying reasoning) looked like, the results were much better.
The resulting forecasting bot got 71.5 percent of questions correct. The human prediction website the researchers used as a benchmark, which takes the average prediction of its participants, got 77 percent accuracy. That human comparator isn’t as good as the best superforecasters, but it’s certainly better than random chance.
The takeaway: The AI forecaster is not quite up to human level, and certainly not up to the level of human “superforecasters” who beat the crowd. But it’s not too far away.
Why forecasting is hard for AIs
That’s impressive, but progress since has been fairly slow. “Arguably, from an academic perspective, nothing has surpassed [UC Berkeley’s] Steinhardt’s paper, which is now a full year old,” Dan Schwarz, CEO of the startup FutureSearch, which builds AI-based forecasting tools, told me.
A full year may not sound like much, but that’s because you’re thinking in human terms. In the world of AI, a year is an eternity. That fact underlines something Schwarz and other entrepreneurs working on AI for forecasting told me: this stuff is harder than it looks.
One limitation is, ironically, that language models are not great quantitative thinkers or logical reasoners. Some common forecasting questions take the form of “will X event happen by Y date”: “will China invade Taiwan by 2030,” or “will China invade Taiwan by 2040.” One logical implication is that, for a given question, the odds should stay the same or increase as the date gets later into the future: Since “China invading before 2040” includes every future in which “China invades before 2030,” the odds of it happening by 2040 should, at the very least, not be lower than the odds of it happening by 2030.
But language models don’t think logically and systematically enough to know that. Turan, the CEO of forecasting platform Metaculus, notes that a few bots entered into the platform’s contests were designed to force their forecasts to be internally consistent. “They end up having way better results,” Turan says. Phil Godzin, a software engineer who won the fourth-quarter 2024 contest, has explained that the first step in his model is “asking an LLM to group related questions together and predict them in batches to maintain internal consistency.”
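Godzin’s exact setup isn’t spelled out beyond that quote, so here is a minimal sketch, under my own assumptions, of one way to enforce that kind of consistency: gather the date-variant versions of a question and make sure the probability never decreases as the deadline moves later.

```python
# A minimal sketch (my own assumptions, not Godzin's code) of enforcing the
# consistency constraint described above: for the same event, "happens by a later
# date" should never be assigned a lower probability than "happens by an earlier date".
from datetime import date

def enforce_monotone_by_date(raw: dict[date, float]) -> dict[date, float]:
    """Take raw per-deadline probabilities for one event and make them non-decreasing."""
    adjusted: dict[date, float] = {}
    running_max = 0.0
    for deadline in sorted(raw):
        running_max = max(running_max, raw[deadline])
        adjusted[deadline] = running_max
    return adjusted

# Example: an inconsistent pair of raw model outputs...
raw = {date(2030, 1, 1): 0.25, date(2040, 1, 1): 0.18}
print(enforce_monotone_by_date(raw))
# ...becomes 0.25 for both deadlines: the later one can't be less likely.
```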
This limitation may become less important due to the dawn of “reasoning models,” like OpenAI’s o3/o4-mini and DeepSeek’s R1. These models differ from previous language models in that they undergo extensive late-stage training to ensure they give correct answers to logical and mathematical questions that can be easily checked (like “how many ‘r’s are in ‘strawberry’”). They are also typically designed to use more computing power when queried, to ensure these kinds of questions are answered accurately. In theory, this evolution in the models should make consistency in forecasts easier to maintain, though it’s too soon to see if this advantage shows up in practice.
Schwarz of FutureSearch cites poor web research skills as a crucial bottleneck. Despite rollouts of flashy features like ChatGPT’s “Deep Research” mode, collation of basic facts about a given situation is still a major challenge for AI models.
FutureSearch this week revealed Deep Research Bench, an attempt to provide a benchmark for web-based research done by leading LLMs. It finds that, as of May 2025, even the best models struggle mightily with routine research tasks. The “Find Number” task, for instance, asked models to find a specific data point (e.g., how many FDA medical device recalls there have been in history). The best model, OpenAI o3, got a score of 69 percent on that; many got less than half right, and DeepSeek R1, which made a splash a few months ago, got less than a third.
The models did even worse at more complex tasks, like locating whole data sets. The best overall score, from o3, was 0.51 out of 1. FutureSearch estimates that a competent, smart, but fallible human should be able to get 0.8. “We can conclude that frontier agents under low elicitation substantially underperform smart generalist researchers who are given ample time,” the authors conclude.
Steinhardt, the Berkeley statistician who coauthored last year’s paper, frames the situation a bit more positively. Sure, AIs have limitations, but ChatGPT was introduced just two and a half years ago, and they’re already nipping at humans’ heels. “I’d guess that if you applied the best-known forecasting ideas to the best AI systems today, you’d outperform the best human forecasters working as a group,” Steinhardt says. “Why is it good at this? Because humans are just really, really bad forecasters.”
Good forecasting requires you to be honest about your mistakes and to learn from them; to change your views all the time, by little increments, rather than suddenly and all at once; and to not be distracted by what’s prominently in the news and being discussed around you, but give proper weight to all the information you’re receiving.
Humans aren’t especially good at any of that. We tend to base our beliefs about many topics on a single piece of information, often information that isn’t even relevant. We give much more weight to information that is easier to recall or more readily available, whether or not it’s more important. We’re absolutely terrible at thinking about scope — even experts struggle to give, say, a thousand times the weight to a number in the billions compared to one in the millions. It stands to reason that AIs could be better at all of this.
A world with AI forecasts
Superforecasting, the bible of the whole field of general-purpose forecasting, came out in 2015. The psychologist who coauthored it, the University of Pennsylvania’s Philip Tetlock, based it on research that had been ongoing for decades before that.
Yet it’s fair to say that, friendly coverage from folks like me aside, the idea that there are clear strategies that enable you to better predict the future hasn’t set the world on fire. When the New York Times reports on border tensions in India and Pakistan, it does not cite superforecasters’ view on likely outcomes. The White House does not ask superforecasters for a prediction on how China might respond to higher tariffs. Investing firms don’t get into bidding wars to hire the best superforecasters to project trends.
This raises an important corollary question: If the world doesn’t have a ton of demand for human superforecasting, why would that change when machines do it? Why should AI superforecasting be different?
This worry might account for the relatively small scale of most AI forecasting efforts. Google DeepMind, OpenAI, Anthropic, and other leading labs aren’t prioritizing it. A few small startups, like FutureSearch, ManticAI (a top performer on the Metaculus competitions), and Lightning Rod Labs, are. Presumably, if the big labs thought that superhuman forecasting were a big economic game-changer, they’d invest more in it. Certainly, that’s what a superforecaster would surmise.
That said, there are good reasons to think superhuman AI forecasting (that is, forecasting better than the best humans today) would be a big deal. Human forecasters require time, energy, and resources to make good forecasts; they can’t spit out an accurate probability estimate in a matter of minutes. A good AI model, in theory, could.
Compare how useful a research librarian who takes a few weeks to send you a stack of useful books would have been before the dawn of the internet, to the ability to search Google today. Both give you useful outputs. A good librarian’s output might even be more useful. But getting results instantly is incredibly important, and massively increases demand for the service.
Ben Turtel, a cofounder of Lightning Rod AI, imagines his forecasting AI being especially useful in cases where someone has a lot of unstructured data that the forecaster can evaluate quickly. Take, for instance, a nurse or doctor trying to anticipate a patient’s trajectory based on scattered notes in their medical records, plus evidence from studies correlating outcomes with patient attributes like whether they smoke or their age. That is a difficult task for which there is no one recipe. Having a model that can instantly and accurately combine patient-specific data with broader evidence, and provide a prognosis with a probability, would ease their jobs considerably.
Similarly, companies that operate abroad often pay for political risk consultants that purport to tell them, say, “how dangerous is it to be working in Jordan right now,” or “what are the odds that a coup happens in Myanmar while we’re working there.” Demonstrably superhuman AI forecasting might change that work considerably, and could threaten a lot of those consultancies — if those AI forecasts were trusted.
The “if trusted” part, though, is key. An AI superforecaster would be, if nothing else, a deeply strange entity. Imagine going up to America’s octogenarian president and saying, “We made an oracle out of silicon, and it is now better at predicting wars than the CIA. You need to listen to it, and not your advisers with millennia of combined experience.” The whole scenario feels laughable. Even if you can prove a model is better than human experts on some class of problems, there’ll be a long way to go before relevant decision-makers would truly believe and internalize that fact.
The black-box nature of modern LLMs is part of the problem: If you ask one for a prediction, we don’t ultimately know what computation it’s doing to arrive at that answer. We might see a hybrid period first, where LLMs are asked to explain their predictions and decision-makers act on their judgment only if the explanations make sense. Even so, acting on AI advice in some contexts, like medicine, could open risk-averse providers and administrators to lawsuits or worse.
But we can get accustomed to what feels strange. A good model here might be Wikipedia. In the 2000s, the website was popular and gradually improving in quality, but there were strong norms against citing it or relying upon it. Anyone could edit it; obviously it couldn’t be trusted. But over time, those norms eroded as it became clear that on many topics, Wikipedia was equally or more accurate than more traditional sources.
AI prediction bots might follow a similar trajectory. First, they’re a curiosity. Then they’re a guilty pleasure that many secretly count on. Finally, we accept that they’re onto something, and they begin shaping the way we all make decisions.