My 2025 AI predictions and 2024 evaluations

Below, I evaluate my 2024 AI forecasts then register my 2025 forecasts.

Evaluating 2024 forecasts

GPT-4.5 and end of 2024 capability forecasts

In Feb 2024, I made some forecasts about GPT-4.5 capabilities. Unfortunately GPT-4.5 hasn’t been released, so strictly speaking these forecasts haven’t resolved. However, I said my forecasts would have been similar if forecasting the SOTA at the end of 2024, so I’ll evaluate against that (another interpretation would have been to evaluate against GPT-4o, which would have made my predictions too high).

Later, in May 2024, I made some end-of-2024 capability forecasts for the scenario described below. Below I gather these forecasts and the actual end-of-2024 performance.

| Benchmark | Feb 2024 GPT-4.5 prediction | Feb prediction minus actual | May 2024 prediction | May prediction minus actual | Actual SOTA, end of 2024 |
| --- | --- | --- | --- | --- | --- |
| OSWorld | | | 40% | 15% | 25% |
| SWEBench | 25% | -4% | 35% | 6% | 29% |
| InterCode (Bash) | 73% | 24% | 75% | 26% | 49% |
| GAIA | 65% | 0% | 55% | -10% | 65% |
| WebArena | 50% | -7% | 60% | 3% | 57% |
| GPQA | 60% | -10% | 70% | 0% | 70% |

Overall, it looks like my Feb 2024 predictions were about right on average, but they would have been too high if evaluated against GPT-4o. My May 2024 predictions were a bit too high, though less so if we remove InterCode, which didn’t get any submissions in 2024.
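One way to check this summary is to average the signed differences (prediction minus actual) for each set of forecasts. The sketch below does that using the numbers from the table above; the names `FORECASTS` and `mean_signed_error` are made up for illustration, and the unweighted mean is just one reasonable choice of summary statistic, not necessarily the one I had in mind when eyeballing the table.

```python
from statistics import mean

# (Feb 2024 prediction, May 2024 prediction, actual end-of-2024 SOTA), in percent,
# copied from the table above. None marks a benchmark with no Feb 2024 forecast.
FORECASTS = {
    "OSWorld": (None, 40, 25),
    "SWEBench": (25, 35, 29),
    "InterCode (Bash)": (73, 75, 49),
    "GAIA": (65, 55, 65),
    "WebArena": (50, 60, 57),
    "GPQA": (60, 70, 70),
}

FEB, MAY, ACTUAL = 0, 1, 2  # column indices into each tuple


def mean_signed_error(column, exclude=()):
    """Average of (prediction - actual) over benchmarks that have a forecast."""
    errors = [
        row[column] - row[ACTUAL]
        for name, row in FORECASTS.items()
        if row[column] is not None and name not in exclude
    ]
    return mean(errors)


feb_error = mean_signed_error(FEB)
may_error = mean_signed_error(MAY)
may_error_no_intercode = mean_signed_error(MAY, exclude=("InterCode (Bash)",))

print(f"Feb 2024 mean error: {feb_error:+.1f} points")
print(f"May 2024 mean error: {may_error:+.1f} points")
print(f"May 2024 mean error without InterCode: {may_error_no_intercode:+.1f} points")
```

This prints roughly +0.6, +6.7 and +2.8 percentage points respectively. A signed mean lets overshoots and undershoots cancel out, so it measures bias rather than accuracy; the large InterCode miss is offset by several smaller undershoots in the Feb 2024 column.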

Performance is much higher on SWEBench-Verified, a subset of SWEBench, because many of the problems in the original SWEBench are unsolvable. I didn’t realize this when forecasting, so one could make an argument for using the Verified result, which would make both of my SWEBench predictions way too low.

Interestingly, REBench, which arose from tasks released after May 2024, advanced much faster than I expected. This is partially due to me overestimating the difficulty of the benchmark, but might also be a signal that other benchmarks could have significantly higher SOTAs with more iteration on fine-tuning/scaffolding and more inference compute.

Scenario forecasts from May 2024

An earlier draft of our scenario was written in 2024 and contained various forecasts for the rest of 2024 (we spent significant time on this, but most of our effort was focused on later time periods).

Here are some of the most important forecasts:

| Forecast | Resolution | Explanation |
| --- | --- | --- |
| Apps that generally accomplish tasks on your computer are released, powered by GPT-5. | Incorrect | No GPT-5. Might get “Operator” soon for computer use. Anthropic’s agent is basically useless. |
| AI agents get good enough to be routinely useful as web-browsing assistants | Incorrect | I’m not aware of useful ones. Very similar to above forecast. |
| AIs achieve Medium on Preparedness CBRN and Persuasion | Correct | o1 system card |
| GPT-5 (and any other models this year) are mostly useful as coding assistants for relatively simple tasks, because they aren't reliable enough agents | Correct | This matches my sense of coding assistants/agents for most use cases. |
| AI software R&D is overall sped up by 10% due to gains on coding tasks | Roughly correct | This roughly matches my sense of the state of automation of AI software R&D, maybe a bit low. |
| AI forecasters are trained to give calibrated probabilities with informative rationales. They are roughly at the level of a 90th percentile superforecaster on the distribution superforecasters are selected on (short-term geopolitical questions), in terms of the probabilities given. The rationales are hit or miss. | Incorrect | This seems below my sense of actual automated forecasts’ abilities, and ForecastBench agrees with my sense. |

Overall, it seems like our capability/impact forecasts were pretty good in the coding and Preparedness areas, but too bullish on computer-using and web-browsing agents, as well as on forecasting.

2025 forecasts

I will have some more detailed forecasts out soon as part of an AI scenarios project, though those will be conditioned on shorter timelines than my median. Until then, I'll make a few forecasts below.

AI 2025 survey

Here are my 2025 capabilities, revenue and public prominence forecasts from ai2025.org (which I co-created, and for which I am responsible for determining resolution values). These are primarily based on extrapolation, with various intuitive adjustments. I encourage everyone to enter their forecasts!

Takes on Gary Marcus’ 2025 forecasts

I agree with >50% probability on all of Marcus’s “high confidence” predictions. The ones I’m least confident in are also the ones on which I’d like clarification: (3) and (6)-(9).

  1. Marcus’s (3): Profits from AI models will continue to be modest or nonexistent (chip-making companies will continue to do well though, in supplying hardware to the companies that build the models; shovels will continue to sell well throughout the gold rush.)
    1. I’d guess there will be overall no profit due to reinvestment. But I’d guess that already there are substantial markups on OpenAI and Anthropic’s APIs compared to marginal inference cost.
    2. I’d be excited for Marcus to predict the revenues of leading LLM companies (e.g. OpenAI, Anthropic, xAI), rather than just profits (you can see my above forecast of $17B for the sum of those 3 companies).
  2. Marcus’s (6)-(8): The lack of reliability will continue to haunt generative AI; Hallucinations (which should really be called confabulations) will continue to haunt generative AI; Reasoning flubs will continue to haunt generative AI.
    1. It’s hard to judge these because I don’t understand what minimal evidence would falsify them in Marcus’s view. I encourage him to clarify this.
  3. Marcus’s (9): AI “Agents” will be endlessly hyped throughout 2025 but far from reliable, except possibly in very narrow use cases.
    1. Agree but not confident, plausibly will depend on the meaning of “very narrow”. I expect pretty useful coding agents, unsure about computer-using or web-browsing agents.
    2. I’d be excited for Marcus to make more specific predictions about:
      1. The impact of end-to-end coding agents such as Devin, and augmenting agents such as Cursor, Windsurf, etc. (could e.g. forecast revenues, usage)
      2. The capabilities and impact of further computer-use agents, such as improvements on Anthropic’s prototype or OpenAI’s rumored Operator, as well as agents built specifically for web browsing.

There are also a few that I’m unsure of because I haven’t been following progress closely enough in those areas: (12)-(15).

I disagree with (18), (21) and maybe (23) from his medium confidence predictions:

  1. Marcus’s (18): Technical moat will continue to be elusive. Instead, there will be more convergence on broadly similar models, across both US and China; some systems in Europe will catch up to roughly the same place.
    1. Disagree. I’d guess that some of OpenAI/Anthropic/GDM and maybe xAI/Meta will be clearly ahead, at least as much as is currently the case.
  2. Marcus’s (21): 2025 could well be the year in which valuations for major AI companies start to fall. (Though, famously, “the market can remain irrational longer than you can remain solvent”)
    1. Disagree, depending on “could well be”. I disagree that the market values the major AI companies too highly; I think it’s the opposite.
  3. Marcus’s (23): Neurosymbolic AI will become much more prominent
    1. Disagree depending on what counts. If things like Ryan Greenblatt’s ARC-AGI solution count (generating a bunch of Python programs then selecting the best ones based on the given examples of the transformation), then this seems more likely.

I weakly disagree with both of Marcus’s low confidence predictions:

  1. Marcus’s (24): We may well see a large-scale cyberattack in which Generative AI plays an important causal role, perhaps in one of the four ways discussed in a short essay of mine that will appear shortly in Politico.
  2. Marcus’s (25): There could continue to be no “GPT-5 level” model (meaning a huge, across the board quantum leap forward as judged by community consensus) throughout 2025. Instead we may see models like o1 that are quite good at many tasks for which high-quality synthetic data can be created, but in other domains only incrementally better than GPT-4.
    1. Weakly disagree, depending on definition re: “quantum leap”