Scaling of AI training runs will slow down after GPT-5

By Maxime Riché 🔸 @ 2024-04-26T16:06 (+10)

My credence: 33% confidence in the claim that the growth in the number of GPUs used for training SOTA AI will slow down significantly directly after GPT-5. It is not higher because of (1) decentralized training is possible, and (2) GPT-5 may be able to increase hardware efficiency significantly, (3) GPT-5 may be smaller than assumed in this post, (4) race dynamics.

TLDR: Because of a bottleneck in energy access to data centers and the need to build OOM larger data centers.

Update: See Vladimir_Nesov's comment or Erich_Grunewald's comment for why this claim is likely wrong, since decentralized training seems to be solved.

The reasoning behind the claim:

Current large data centers consume around 100 MW of power, while a single nuclear power plant generates 1GW. The largest seems to consume 150 MW.
An A100 GPU uses 250W, and around 1kW with overheard. B200 GPUs, uses ~1kW without overhead. Thus a 1MW data center can support maximum 1k to 2k GPUs.
GPT-4 used something like 15k to 25k GPUs to train, thus around 15 to 25MW.
Large data centers are around 10-100 MW. This is likely one of the reason why top AI labs are mostly only using ~ GPT-4 level of FLOPS to train new models.
GPT-5 will mark the end of the fast scaling of training runs.
- A 10-fold increase in the number of GPUs above GPT-5 would require a 1 to 2.5 GW data center, which doesn’t exist and would take years to build, OR would require decentralized training using several data centers. Thus GPT-5 is expected to mark a significant slowdown in scaling runs. The power consumption required to continue scaling at the current rate is becoming unsustainable, as it would require the equivalent of multiple nuclear power plants. I think this is basically what Sam Altman, Elon Musk and Mark Zuckerberg are saying in public interviews.
The main focus to increase capabilities will be one more time on improving software efficiency. In the next few years, investment will also focus on scaling at inference time and decentralized training using several data centers.
If GPT-5 doesn’t unlock research capabilities, then after GPT-5, scaling capabilities will slow down for some time towards historical rates, with most gains coming from software improvements, a bit from hardware improvement, and significantly less than currently from scaling spending.
Scaling GPUs will be slowed down by regulations on lands, energy production, and build time. Training data centers may be located and built in low-regulation countries. E.g., the Middle East for cheap land, fast construction, low regulation, and cheap energy, thus maybe explaining some talks with Middle East investors.

Unrelated to the claim:

Hopefully, GPT-5 is still insufficient for self-improvement:
- Research has pretty long horizon tasks that may require several OOM more compute.
- More accurate world models may be necessary for longer horizon tasks and especially for research (hopefully requiring the use of compute-inefficient real, non-noisy data, e.g., real video).
- “Hopefully”, moving to above human level requires RL.
- “Hopefully”, RL training to finetune agents is still several OOM less efficient than pretraining and/or is currently too noisy to improve the world model (this is different than simply shaping propensities) and doesn’t work in the end.
Guessing that GPT-5 will be at expert human level on short horizon tasks but not on long horizon tasks nor on doing research (improving SOTA), and we can’t scale as fast as currently above that.

How big is that effect going to be?

Using values from: https://epochai.org/blog/the-longest-training-run, we have estimates that in a year, the effective compute is increased by:

Software efficiency: x1.7/year (1 OOM in 3.9 y)
Hardware efficiency: x1.3/year (1 OOM in 5.9 y)
Investment increase:
- x2.8/year (before ChatGPT) (1 OOM in 2.3 y)
- x10/year (since ChatGPT) (1 OOM in 1 y) (my guess for GPT-4 => GPT-5)

Let's assume GPT-5 is using 10 times more GPUs than GPT-4 for training. 250k GPUs would mean around 250MW needed for training. This is already larger than the largest data center reported in this article... Then, moving to GPT-6 with 2.5M GPUs would require 2.5 GW.

Building the infrastructure for GPT-6 may require a few years (e.g., using existing power plants and building a 2.5M GPU data center). For reference, OpenAI and Microsoft seem to have a $100B data center project going until 2028 (4 years); that’s worth around 3M B200 GPUs (at $30k per units).

Building the infrastructure for GPT-7 may require even more time (e.g., building 25 power plant units).

If the infrastructure for GPT-6 takes 4 years to be assembled, then the increase in GPUs is limited to 1 OOM in 4 years (~ x1.8/year).

The total growth rate between GPT-4 and GPT-5 is x22/year or x6.2/year when using investment growth values from before ChatGPT.

Taking into account the decrease in the growth of investment in training runs, the total growth rate between GPT-5 and GPT-6 would then be x4/year. The growth rate would be divided by 5.5 or by 1.55 when using values from before ChatGPT.

These estimates assume no efficient decentralized training.

Impact of GPT-5

One could assume that software and hardware efficiency will have a growth rate increased by something like 100% because of the increased productivity from GPT-5 (vs before ChatGPT).

In that case, the growth rate of effective compute after GPT-5 would be significantly above the growth rate before ChatGPT (~ x8.8/year vs. ~ x6/year before ChatGPT).

Erich_Grunewald @ 2024-04-26T16:28 (+10)

A 10-fold increase in the number of GPUs above GPT-5 would require a 1 to 2.5 GW data center, which doesn’t exist and would take years to build, OR would require decentralized training using several data centers. Thus GPT-5 is expected to mark a significant slowdown in scaling runs.

Why do you think decentralized training using several data centers will lead to a significant slowdown in scaling runs? Gemini was already trained across multiple data centers.

SummaryBot @ 2024-04-26T16:37 (+1)

Executive summary: The scaling of AI training runs is expected to slow down significantly after GPT-5 due to the unsustainable power consumption required to continue scaling at the current rate, which would necessitate the equivalent of multiple nuclear power plants.

Key points:

Current large data centers consume around 100 MW of power, limiting the number of GPUs that can be supported.
GPT-4 used an estimated 15k to 25k GPUs, requiring 15 to 25 MW of power.
A 10-fold increase in GPUs above GPT-5 would require a 1 to 2.5 GW data center, which doesn't exist and would take years to build.
After GPT-5, the focus will shift to improving software efficiency, scaling at inference time, and decentralized training using multiple data centers.
Scaling GPUs will be slowed down by regulations on lands, energy production, and build time, potentially leading to the construction of training data centers in low-regulation countries.
The total growth rate of effective compute is expected to decrease significantly after GPT-5, from ~x22/year (or x6.2/year using pre-ChatGPT investment growth values) to ~x4/year, assuming no efficient decentralized training is developed.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.