How Well Does RL Scale?
By Toby_Ord @ 2025-10-22T13:16 (+119)
This is the latest in a series of essays on AI Scaling.
You can find the others on my site.
Summary: RL-training for LLMs scales surprisingly poorly. Most of its gains come from unlocking longer chains of thought, letting models think longer about a problem. There is some improvement for a fixed length of answer, but not enough to drive AI progress. Given that the scaling up of pre-training compute has also stalled, we'll see less AI progress via compute scaling than you might have thought, and more of it will come from inference scaling (which has different effects on the world). That lengthens timelines and affects strategies for AI governance and safety.
The current era of improving AI capabilities using reinforcement learning (from verifiable rewards) involves two key types of scaling:
- Scaling the amount of compute used for RL during training
- Scaling the amount of compute used for inference during deployment
We can see (1) as training the AI in more effective reasoning techniques and (2) as allowing the model to think for longer. I’ll call the first RL-scaling, and the second inference-scaling. Both new kinds of scaling were present all the way back in OpenAI’s announcement of their first reasoning model, o1, when they showed this famous chart:
I’ve previously shown that in the initial move from a base-model to a reasoning model, most of the performance gain came from unlocking the inference-scaling. The RL training did provide a notable boost to performance, even holding the number of tokens in the chain of thought fixed. You can see this RL boost in the chart below as the small blue arrow on the left that takes the base model up to the trend-line for the reasoning model. But this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.
The question of where these capability gains come from is important because scaling up the inference compute has very different implications than scaling up the training compute. In this first round of reasoning models, the amount of RL compute was very small compared to the compute used in pre-training, meaning that the total cost of training was something like 1.01x that of the base model. But if most of the headline performance results require 30x as much inference compute, then the cost of deploying those capabilities is 30x higher. Since frontier AI developers are already spending more money deploying their models than they did training them, multiplying those costs by 30x is a big deal. Moreover, these are costs that have to be paid every time you want to use the model at this level of capability, so they can't be made up in volume.
But that was just the initial application of RL to LLMs. What happens as companies create more advanced reasoning models, using more RL?
The seeds of the answer can be found all the way back in that original o1 chart.
The chart shows steady improvements for both RL-scaling and inference-scaling, but they are not the same. Both graphs have the same y-axis and (despite the numbers being removed from the x-axis) we can see that they are both on a logarithmic x-axis covering almost exactly two orders of magnitude of scaling (100x). In both cases, the datapoints lie on a relatively straight line, which is presumably the central part of a larger S-curve. However, the slope of the RL-scaling graph (on the left) is almost exactly half that of the inference-scaling graph (on the right). When the x-axis is logarithmic, this has dramatic consequences.
The graph on the right shows that scaling inference-compute by 100x is enough to drive performance from roughly 20% to 80% on the AIME benchmark. This is pretty typical for inference scaling, where quite a variety of different models and benchmarks see performance improve from 20% to 80% when inference is scaled by 100x.
For instance, this is what was found with Anthropic’s first reasoning model (Sonnet 3.7) on another AIME benchmark, with almost exactly the same scaling behaviour:
And ability on the ARC-AGI 1 benchmark also scales in a similar way for many of OpenAI’s different reasoning models:
We don’t always see this scaling behaviour for inference: some combinations of LLM, inference-scaling technique, and benchmark see the performance plateau below 80% or exhibit a different slope (often worse). But this climb from 20 to 80 with 100x more inference compute is pretty common (especially for reasoning-intensive benchmarks) and almost certainly what is happening on that original o1 graph.
In contrast, the slope of the RL-scaling trend is half as large, which means that it requires twice as many orders of magnitude to achieve the exact same improvement in capabilities. Increasing the RL training compute by 100x as shown in the o1 chart only improved performance from about 33% to 66%. At that rate, going from 20 to 80 would require scaling up the RL training compute by 10,000x.
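To make the slope arithmetic explicit, here is a minimal sketch, assuming accuracy is roughly linear in log10(compute) over the straight central part of these curves (the numbers are the rounded figures from the charts above):

```python
# If accuracy gains are roughly linear in log10(compute) over the straight part
# of the curve, then halving the slope doubles the orders of magnitude (OOMs)
# needed for a given gain -- i.e. it squares the required compute multiplier.
def multiplier_needed(gain_pp, slope_pp_per_oom):
    """Compute multiplier needed for a gain of `gain_pp` percentage points."""
    return 10 ** (gain_pp / slope_pp_per_oom)

inference_slope = 60 / 2          # ~20% -> 80% over 100x: 30 pp per OOM
rl_slope = inference_slope / 2    # the RL curve's slope is about half of that

print(multiplier_needed(60, inference_slope))  # 100x inference compute
print(multiplier_needed(60, rl_slope))         # 10,000x RL training compute
```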
We can confirm this trend — and that it continued beyond o1 — by looking at the following graph from the o3 launch video (with a line added showing the slope corresponding to going from 20 to 80 in 10,000x):
Using another version of the AIME benchmark, this shows o1’s training progress over 3 orders of magnitude and o3’s training over a further order of magnitude. In total, we see that scaling up the RL-training by 4 orders of magnitude takes the model from about 26% to 88%. This provides some confirmation for the rule-of-thumb that a 10,000x scale-up in RL training compute is required to improve this benchmark performance from 20 to 80.
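As a rough cross-check, using the approximate endpoints read off the chart:

```python
# Slope implied by the o1/o3 RL-training curve: ~26% -> ~88% over 4 OOMs of RL compute.
# The endpoints are read off the chart, so treat the result as approximate.
slope = (88 - 26) / 4                      # ~15.5 pp per order of magnitude
ooms_for_20_to_80 = (80 - 20) / slope      # ~3.9 OOMs
print(f"{ooms_for_20_to_80:.1f} OOMs, i.e. ~{10 ** ooms_for_20_to_80:,.0f}x RL compute")
# ~3.9 OOMs, i.e. ~7,400x -- in the same ballpark as the 10,000x rule of thumb
```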
To my knowledge, OpenAI hasn’t provided RL-training curves for other benchmarks, but they do have charts comparing o1 with o3 and o3 with GPT-5 at different inference-scaling levels on several benchmarks. Given that o3 used about 10x as much RL training as o1, we’d expect the RL boost going from o1 to o3 to be worth about the same as the inference boost of giving o1 just half an order of magnitude more inference (~3x as many tokens). And this is indeed what one sees on their performance/token graph comparing the two:
Similarly, o3 also requires about 3x as many tokens to match GPT-5 on the SWE-bench and GPQA Diamond benchmarks. This would fit the expected pattern of GPT-5 having been trained with a further 10x as much RL training compute as o3:
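Put another way, the implied exchange rate between RL compute and inference compute is a square root. A quick sketch, assuming the factor-of-two slope relationship above holds:

```python
import math

# If RL needs twice as many orders of magnitude as inference for the same gain,
# an RL compute multiplier of M buys roughly the same boost as an inference
# multiplier of sqrt(M).
def equivalent_inference_multiplier(rl_multiplier):
    return math.sqrt(rl_multiplier)

print(equivalent_inference_multiplier(10))      # ~3.2x tokens (o1 -> o3, o3 -> GPT-5)
print(equivalent_inference_multiplier(10_000))  # ~100x tokens
```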
It is hard to verify that this trend holds for models from other companies, as this data on training curves for cutting-edge models is often treated as confidential. But the fact that other leading labs’ base models and reasoning models are roughly on par with OpenAI’s suggests none of them are scaling notably better than this.
So the evidence on RL-scaling and inference-scaling supports a general pattern:
- a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
- a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference
In general, getting the same benefit from RL-scaling as from inference-scaling requires twice as many orders of magnitude. That's not good.
How do these compare to pre-training scaling?
The jumps from GPT-1 to 2 to 3 to 4 each involved scaling up the pre-training compute by about 100x. How much of the RL-scaling or inference-scaling would be required to give a similar boost? While I can’t say for sure, we can put together the clues we have and take an educated guess.
Jones (2021) and Epoch AI both estimate that you need to scale up inference by roughly 1,000x to reach the same capability you'd get from a 100x scale-up of training. And since the evidence from o1 and o3 suggests we need about twice as many orders of magnitude of RL-scaling as of inference-scaling, this implies we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level.
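Here is the chain of rough equivalences behind that figure, sketched in orders of magnitude (OOMs):

```python
# One "GPT level" ~ 100x pre-training compute = 2 OOMs.
pretrain_ooms = 2.0
inference_ooms = pretrain_ooms * (3 / 2)   # ~1,000x inference for the same boost (Jones / Epoch AI)
rl_ooms = inference_ooms * 2               # RL needs twice the OOMs of inference (o1/o3 analysis)
print(f"~{10 ** rl_ooms:,.0f}x RL compute for one GPT-level boost")   # ~1,000,000x
```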
This is breathtakingly inefficient scaling. But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.
Yet despite the poor scaling behaviour, RL training has so far been a good deal. This is solely because the scaling of RL compute began from such a small base compared with the massive amount of pre-training compute invested in today's models. While AI labs are reluctant to share information about how much compute has actually been spent on RL (witness the removal of all numbers from the twin o1 scaling graphs), it is widely believed that even the 10,000x RL-scaling we saw for o3's training still ended up using much less compute than was spent on pre-training. This means that OpenAI (and their competitors) have effectively got those early gains from RL-training for free.
For example, if the 10x scaling of RL compute from o1 to o3 took them from a total of 1.01x the pre-training compute to 1.1x, then the 10x scale-up came at the price of a 1.1x scale-up in overall training costs. If that gives the same performance boost as using 3x as many reasoning tokens (which would multiply all deployment costs of reasoning models by 3) then it is a great deal for a company that deploys its model so widely.
But this changes dramatically once RL-training reaches and then exceeds the size of the pre-training compute. In July 2025, xAI’s Grok 4 launch video included a chart suggesting that they had reached this level (where pre-training compute is shown in white and RL-training compute in orange):
Scaling RL by another 10x beyond this point increases the total training compute by 5.5x, and beyond that it is basically the full 10x increase to all training costs. So this is the point where the fact that they get much less for a 10x scale-up of RL compute compared with 10x scale-ups in pre-training or inference really bites. I estimate that at the time of writing (Oct 2025), we’ve already seen something like a 1,000,000x scale-up in RL training and it required ≤2x the total training cost. But the next 1,000,000x scale-up would require 1,000,000x the total training cost, which is not possible in the foreseeable future.
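A small sketch of the cost arithmetic behind the last two paragraphs, treating total training compute as pre-training plus RL (the starting RL fractions are illustrative, matching the rough figures in the text):

```python
def total_cost_multiplier(rl_fraction_of_pretrain, rl_scaleup=10):
    """Factor by which total training compute grows when RL compute is scaled up."""
    before = 1 + rl_fraction_of_pretrain
    after = 1 + rl_fraction_of_pretrain * rl_scaleup
    return after / before

print(total_cost_multiplier(0.01))  # early regime (o1 -> o3): ~1.09x total training cost
print(total_cost_multiplier(1.0))   # RL at parity with pre-training (Grok 4): ~5.5x
print(total_cost_multiplier(10.0))  # well beyond parity: ~9.2x, approaching the full 10x
```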
Grok 4 was trained on 200,000 GPUs located in xAI's vast Colossus datacenter. To achieve the equivalent of a GPT-level jump through RL would (according to the rough scaling relationships above) require 1,000,000x the total training compute. To put that in perspective, it would require replacing every GPU in their datacenter with 5 entirely new datacenters of the same size, then using 5 years' worth of the entire world's electricity production to train the model. So it looks infeasible for further scaling of RL-training compute to give even a single GPT-level boost.
I don’t think OpenAI, Google, or Anthropic have quite reached the point where RL training compute matches the pre-training compute. But they are probably not far off. So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.
Conclusion
The shift towards RL allowed the scaling era to continue even after pre-training scaling had stalled. It did so via two different mechanisms: scaling up the RL training compute and scaling up the inference compute.
Scaling RL training allowed the model to learn for itself how to achieve better performance. Unlike the imitation learning of next-token-prediction, RL training has a track record of allowing systems to burst through the human level, finding new ways of solving problems that go beyond their training data. But in the context of LLMs, it scales poorly. We've seen impressive gains, but these were only viable because they started from such a low base. We have reached the point where it is too expensive to go much further.
This leaves us with inference-scaling as the remaining form of compute-scaling. RL helped enable inference-scaling via longer chains of thought and, when it comes to LLMs, that may be its most important legacy. But inference-scaling has very different dynamics to scaling up the training compute. For one thing, it scales up the flow of ongoing costs instead of scaling the one-off training cost. This has many consequences for AI deployment, AI risk, and AI governance.
But perhaps more importantly, inference-scaling is really a way of improving capabilities by allowing the model more time to solve the problem, rather than by increasing its intelligence. Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.
Wei Dai @ 2025-10-24T21:49 (+11)
While I appreciate this work being done, it seems a very bad sign for our world/timeline that the very few people with both philosophy training and an interest in AI x-safety are using their time/talent to do forecasting (or other) work instead of solving philosophical problems in AI x-safety, with Daniel Kokotajlo being another prominent example.
This implies one of two things: Either they are miscalculating the best way to spend their time, which indicates bad reasoning or intuitions even among humanity's top philosophers (i.e., those who have at least realized the importance of AI x-risk and are trying to do something about it). Or they actually are the best people (in a comparative advantage sense) available to work on these other problems, in which case the world must be on fire, and they're having to delay working on extremely urgent problems that they were trained for, to put out even bigger fires.
(Cross-posted to LW and EAF.)
Yarrow Bouchard🔸 @ 2025-10-24T22:21 (+4)
Why do you think this work has less value than solving philosophical problems in AI safety? If LLM scaling is sputtering out, isn't that important to know? In fact, isn't it a strong contender for the most important fact about AI that could be known right now?
I suppose you could ask why this work hasn't been done by somebody else already and that's a really good question. For instance, why didn't anyone doing equity research or AI journalism notice this already?
Among people who are highly concerned about near-term AGI, I don't really expect such insights to be surfaced. There is strong confirmation bias. People tend to look for confirmation that AGI is coming soon and not for evidence against. So, I'm not surprised that Toby Ord is the first person within effective altruism to notice this. Most people aren't looking. But this doesn't explain why equity research analysts, AI journalists, or others who are interested in LLM scaling (such as AI researchers or engineers not working for one of the major LLM companies and not bound by an NDA) missed this. I am surprised an academic philosopher is the first person to notice this! And kudos to him for that!
Wei Dai @ 2025-10-24T22:41 (+2)
Why do you think this work has less value than solving philosophical problems in AI safety?
From the perspective of comparative advantage and counterfactual impact, this work does not seem to require philosophical training. It seems to be straightforward empirical research, that many people could do, besides the very few professionally trained AI-risk-concerned philosophers that humanity has.
To put it another way, I'm not sure that Toby was wrong to work on this, but if he was, it's because if he hadn't, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.
Yarrow Bouchard🔸 @ 2025-10-24T23:07 (+2)
I'm not sure that Toby was wrong to work on this, but if he was, it's because if he hadn't, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.
How shortly? We're discussing this in October 2025. What's the newest piece of data that Toby's analysis is dependent on? Maybe the Grok 4 chart from July 2025? Or possibly qualitative impressions from the GPT-5 launch in August 2025? Who else is doing high-quality analysis of this kind and publishing it, even using older data?
I guess I don't automatically buy the idea that even in a few months we'll see someone else independently go through the same reasoning steps as this post and independently come to the same conclusion. But there are plenty of people who could, in theory, do it and who are, in theory, motivated to do this kind of analysis and who also will probably not see this post (e.g. equity research analysts, journalists covering AI, AI researchers and engineers independent of LLM companies).
I certainly don't buy the idea that if Toby hadn't done this analysis, then someone else in effective altruism would have done it. I don't see anybody else in effective altruism doing similar analysis. (I chalk that up largely to confirmation bias.)
Toby_Ord @ 2025-10-26T10:28 (+12)
I appreciate you raising this Wei (and Yarrow's responses too). They both echoed a lot of my internal debate on this. I'm definitely not sure whether this is the best use of my time. At the moment, my research time is roughly evenly split between this thread of essays on AI scaling and more philosophical work connected to longtermism, existential risk and post-AGI governance. With the former, it is much easier to demonstrate forward progress and there is more of a demand signal for it. With the latter, it is harder to be sure I'm on the right path and there is less demand. My suspicion is that it is generally more important though, and that demand/appreciation doesn't track importance very well.
It is puzzling to me too that no-one else was doing this kind of work on understanding scaling. I think I must be adding some rare ingredient, but I can't think of anything rare enough to really explain why no-one else got these results first. (People at the labs probably worked out a large fraction of this, but I still don't understand why the people not at the labs didn't.)
In addition to the general questions about which strand is more important, there are a few more considerations:
- No-one can tell ex ante how a piece of work or research stream will pan out, so everyone will always be wrong ex post sometimes in their prioritisation decisions
- My day job is at Oxford University's AI Governance Initiative (a great place!) and I need to be producing some legible research that an appreciable number of other people are finding useful
- I'm vastly more effective at work when I have an angle of attack and a drive to write up the results — recently this has been for these bite-size pieces of understanding AI scaling. The fact that there is a lot of response from others is helping with this as each piece receives some pushback that leads me to the next piece.
But I've often found your (Wei Dai's) comments over the last 15-or-so years to be interesting, unusual, and insightful. So I'll definitely take into account your expressed demand for more philosophical work and will look through those pages of philosophical questions you linked to.
titotal @ 2025-10-23T09:16 (+6)
What happens if this is true and AI improvements will primarily be inference driven?
It seems like this would be very bad news for AI companies, because customers would have to pay for accurate AI results directly, on a per-run basis. Furthermore, they would have to pay exponentially increasing costs for a linear increase in accuracy.
As a crude example, would you expect a translation agency to pay four times as much for translations with half as many errors? In either case, you'd still need a human to come along and correct the errors.
Lowe Lundin @ 2025-10-23T09:25 (+5)
Toby wrote this post, which touches on many of the questions you asked. I think it's been significantly under-hyped!
Toby_Ord @ 2025-10-23T11:10 (+38)
Thanks. I'm also a bit surprised by the lack of reaction to this series given that:
- compute scaling has been the biggest story of AI in the last few decades
- it has dramatically changed
- very few people are covering these changes
- it is surprisingly easy to make major crisp contributions to our understanding of it just by analysing the few pieces of publicly available data
- the changes have major consequences for AI companies, AI timelines, AI risk, and AI governance
Anthony DiGiovanni @ 2025-10-26T14:51 (+10)
For my part, I simply didn't know the series existed until seeing this post, since this is the only post in the series on EAF. :)
Lowe Lundin @ 2025-10-23T13:33 (+3)
Agreed! The series has been valuable for my personal thinking around this (I quoted the post I linked above as late as yesterday.) Imo, more people should be paying attention to this.
Sharmake @ 2025-10-23T20:05 (+2)
The crux for me is I don't agree that compute scaling has dramatically changed, because I don't think pre-training scaling has gotten much worse returns.
Patrick Hoang @ 2025-10-23T05:37 (+6)
I'll just add a comment that Ord did an 80k podcast on this topic here.
Yarrow Bouchard🔸 @ 2025-10-22T22:58 (+4)
This is a really compelling post. This seems like the sort of post that could have a meaningful impact on the opinions of people in the finance/investment world who are thinking about AI. I would be curious to see how equity research analysts and so on would react to this post.
This is a very strong conclusion and seems very consequential if true:
This leaves us with inference-scaling as the remaining form of compute-scaling.
I was curious to see if you had a similar analysis that supports the assertion that "the scaling up of pre-training compute also stalled". Let me know if I missed something important. For the convenience of other readers, here are some pertinent quotes from your previous posts.
From "Inference Scaling Reshapes AI Governance" (February 12, 2025):
But recent reports from unnamed employees at the leading labs suggest that their attempts to scale up pre-training substantially beyond the size of GPT-4 have led to only modest gains which are insufficient to justify continuing such scaling and perhaps even insufficient to warrant public deployment of those models. A possible reason is that they are running out of high-quality training data. While the scaling laws might still be operating (given sufficient compute and data, the models would keep improving), the ability to harness them through rapid scaling of pre-training may not.
There is a lot of uncertainty about what is changing and what will come next.
One question is the rate at which pre-training will continue to scale. It may be that pre-training has topped out at a GPT-4 scale model, or it may continue increasing, but at a slower rate than before. Epoch AI suggests the compute used in LLM pre-training has been growing at about 5x per year from 2020 to 2024. It seems like that rate has now fallen, but it is not yet clear if it has gone to zero (with AI progress coming from things other than pre-training compute) or to some fraction of its previous rate.
This strongly suggests that even though there are still many more unused tokens on the indexed web (about 30x as many as are used in GPT-4 level pre-training), performance is being limited by lack of high-quality tokens. There have already been attempts to supplement the training data with synthetic data (data produced by an LLM), but if the issue is more about quality than raw quantity, then they need the best synthetic data they can get.
From "The Extreme Inefficiency of RL for Frontier Models" (September 19, 2025):
LLMs and next-token prediction pre-training were the most amazing boost to generality that the field of AI has ever seen, going a long way towards making AGI seem feasible. This self-supervised learning allowed it to imbibe not just knowledge about a single game, or even all board games, or even all games in general, but every single topic that humans have ever written about — from ancient Greek philosophy to particle physics to every facet of pop culture. While their skills in each domain have real limits, the breadth had never been seen before. However, because they are learning so heavily from human generated data they find it easier to climb towards the human range of abilities than to proceed beyond them. LLMs can surpass humans at certain tasks, but we’d typically expect at least a slow-down in the learning curve as they reach the top of the human-range and can no longer copy our best techniques — like a country shifting from fast catch-up growth to slower frontier growth.
Sharmake @ 2025-10-23T19:55 (+3)
I broadly don't think inference scaling is the only path, primarily because I disagree with the claim that pre-training returns declined much, and attribute the GPT-4.5 evidence mostly to broken compute promises making everything disappointing.
I also have a hypothesis that current RL is mostly serving as an elicitation method for pre-trained AIs.
We shall see in 2026-2027 whether this remains true.
Yarrow Bouchard🔸 @ 2025-10-23T21:57 (+1)
I disagree with the claim that pre-training returns declined much
Could you elaborate or link to somewhere where someone makes this argument? I'm curious to see if a strong defense can be made of self-supervised pre-training of LLMs continuing to scale and deliver worthwhile, significant benefits.
Sharmake @ 2025-10-23T23:29 (+2)
I currently can't find a source, but to elaborate a little: my reason for thinking this is that the GPT-4 to GPT-4.5 scale-up used 15x the compute rather than 100x. I remember that 10x compute is roughly enough to be competitive with the current algorithmic improvements that don't involve scaling up models, whereas 100x compute increases produce the wow moments we associate with GPT-3 to GPT-4. And the GPT-5 release was not a scale-up of compute, but rather a productionizing of GPT-4.5.
I'm more in the camp of "I find little reason to believe that pre-training returns have declined" here.
Peter @ 2025-10-23T15:58 (+1)
- It seems more likely that RL does actually allow LLMs to learn new skills.
- RL + LLMs is still pretty new but we already have clear signs it exhibits scaling laws with the right setup just like self-supervised pretraining. This time they appear to be sigmoidal, probably based on something like each policy or goal or environment they're trained with. It has been about 1 year since o1-preview and maybe this was being worked on to some degree about a year before that.
- The Grok chart contains no numbers, which is so strange I don't think you can conclude much from it except "we used more RL than last time." It also seems likely that they might not yet be as efficient as OpenAI and DeepMind, who have been in the RL game for much longer with projects like AlphaZero and AlphaStar.
Yarrow Bouchard🔸 @ 2025-10-23T22:06 (+3)
The Grok chart contains no numbers, which is so strange I don't think you can conclude much from it except "we used more RL than last time."
Isn't the point just that the amount of compute used for RL training is now roughly the same as the amount of compute used for self-supervised pre-training? Because if this is true, then scaling up RL training compute another 1,000,000x is obviously not feasible.
My main takeaway from this post is not whether RL training would continue to provide benefits if it were scaled up another 1,000,000x, just that the world doesn't have nearly enough GPUs, electricity, or investment capital for that to be possible.
Peter @ 2025-10-23T22:43 (+1)
Maybe or maybe not - people also thought we would run out of training data years ago. But that has been pushed back and maybe won't really matter given improvements in synthetic data, multimodal learning, and algorithmic efficiency.
Yarrow Bouchard🔸 @ 2025-10-23T23:20 (+1)
What part do you think is uncertain? Do you think RL training could become orders of magnitude more compute efficient?