XPT forecasts on (some) Direct Approach model inputs

By Forecasting Research Institute, rosehadshar @ 2023-08-20T12:39 (+37)

This post was co-authored by the Forecasting Research Institute and Rose Hadshar. Thanks to Josh Rosenberg for managing this work, Zachary Jacobs and Molly Hickman for the underlying data analysis, Kayla Gamin for fact-checking and copy-editing, and the whole FRI XPT team for all their work on this project. Special thanks to staff at Epoch for their feedback and advice.

Summary

| Input | Epoch (default) | XPT superforecaster | XPT expert[1] | Notes |
| --- | --- | --- | --- | --- |
| Baseline growth rate in algorithmic progress (OOM/year) | 0.21-0.65 | 0.09-0.2 | 0.15-0.23 | Epoch: 80% confidence interval (CI). XPT: 90% CI,[2] based on 2024-2030 forecasts |
| Current spending ($, millions) | $60 | $35 | $60 | Epoch: 2023 estimate. XPT: 2024 median forecast[3] |
| Yearly growth in spending (%) | 34%-91.4% | 6.4%-11% | 5.7%-19.5% | Epoch: 80% CI. XPT: 90% CI,[4] based on 2024-2050 forecasts |

| Output | Epoch default inputs | XPT superforecaster inputs | XPT expert inputs |
| --- | --- | --- | --- |
| Median TAI arrival year | 2036 | 2065 | 2052 |
| Probability of TAI by 2050 | 70% | 38% | 49% |
| Probability of TAI by 2070 | 76% | 53% | 65% |
| Probability of TAI by 2100 | 80% | 66% | 74% |

Note that regeneration affects model outputs, so these results can’t be replicated directly, and the TAI probabilities presented here differ slightly from those in Epoch’s blog post.[5] Figures given here are the average of 5 regenerations.

| Source of 2070 forecast | XPT superforecaster | XPT expert |
| --- | --- | --- |
| Direct Approach model | 53% | 65% |
| XPT postmortem survey question on probability of TAI[6] by 2070 | 3.75% | 16% |

Introduction

This post:

- gives background on Epoch’s Direct Approach model and on the Existential Risk Persuasion Tournament (XPT);
- compares XPT forecasts with the model’s inputs on algorithmic progress and investment;
- shows how the model’s outputs change when those inputs are replaced with XPT forecasts;
- summarizes the arguments that Epoch and XPT forecasters gave for their estimates (Appendix A).

Background on the Direct Approach model

In May 2023, researchers at Epoch released an interactive Direct Approach model, which models the probability that TAI arrives in a given year. The model relies on:

- scaling laws, to estimate the compute requirements for TAI;
- inputs on algorithmic progress, investment in training runs, and hardware costs, to estimate when those requirements might be met.

Epoch’s default inputs produce a model output of a 70% probability that TAI arrives by 2050, and a median TAI arrival year of 2036.[7] Note that these default inputs are based on extrapolating historical trends, and do not represent the all-things-considered view of Epoch staff.[8]

(The Direct Approach model is similar to Cotra’s biological anchors model, except that it uses scaling laws to estimate compute requirements rather than using biological anchors. It also incorporates more recent data for its other inputs, as it was made after Cotra’s model. See here for a comparison of XPT forecasts with Cotra’s model inputs.)
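
To make the structure concrete, here is an illustrative toy sketch (not Epoch's implementation) of how inputs like those compared in this post (a baseline rate of algorithmic progress, current spending on the largest training run, and yearly growth in spending) can be combined into a projection of effective training compute and a TAI arrival distribution. The price-performance figure, hardware trend, and compute-requirement threshold in the sketch are hypothetical placeholders, not values from the Direct Approach model.

```python
import numpy as np

# Illustrative toy sketch only; NOT Epoch's Direct Approach implementation.
# The hardware trend, price-performance figure, and compute requirement below
# are hypothetical placeholders.
rng = np.random.default_rng(0)
n_samples = 10_000
years = np.arange(2023, 2101)

# Inputs loosely mirroring the ranges discussed in this post (illustrative only):
algo_oom_per_year = rng.uniform(0.09, 0.20, n_samples)   # algorithmic progress (OOM/year)
spend_2023_usd = 60e6                                     # current spending on the largest run
spend_growth = rng.uniform(0.057, 0.195, n_samples)       # yearly growth in spending
flop_per_dollar_2023 = 1e17                               # hypothetical price-performance
hardware_oom_per_year = 0.3                               # hypothetical hardware improvement
required_log10_flop = 36.0                                # hypothetical effective-compute bar for TAI

arrival_year = np.full(n_samples, np.inf)
for year in years:
    t = year - 2023
    # log10 of physical FLOP affordable in the largest training run that year:
    physical = (np.log10(spend_2023_usd) + t * np.log10(1 + spend_growth)
                + np.log10(flop_per_dollar_2023) + t * hardware_oom_per_year)
    # Algorithmic progress adds OOMs of effective compute on top of physical compute:
    effective = physical + t * algo_oom_per_year
    crossed = (effective >= required_log10_flop) & np.isinf(arrival_year)
    arrival_year[crossed] = year

print("P(TAI by 2050):", (arrival_year <= 2050).mean())
print("Median arrival year:", int(np.median(arrival_year[np.isfinite(arrival_year)])))
```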

Background on the Existential Risk Persuasion Tournament (XPT)

In 2022, the Forecasting Research Institute (FRI) ran the Existential Risk Persuasion Tournament (XPT). Over the course of 4 months, 169 forecasters, including 80 superforecasters and 89 experts, forecasted on various questions related to existential and catastrophic risk. Forecasters moved through a four-stage deliberative process that was designed to incentivize them not only to make accurate predictions but also to provide persuasive rationales that boosted the predictive accuracy of others’ forecasts. Forecasters stopped updating their forecasts on 31st October 2022, and are not currently updating on an ongoing basis. FRI hopes to run future iterations of the tournament.

You can see the results from the tournament overall here, results relating to AI risk here, and to AI timelines in general here.

Comparing Direct Approach inputs and XPT forecasts

Some of the XPT questions relate directly to some of the inputs to the Direct Approach model. Specifically, there are XPT questions which relate to Direct Approach inputs on algorithmic progress and investment:[9]

| XPT question | Comparison | Input to Direct Approach model |
| --- | --- | --- |
| 46. How much will be spent on compute in the largest AI experiment by the end of 2024, 2030, 2050? | Comparison of median XPT 2024 forecasts with Direct Approach 2023 estimate | Current spending (the dollar value, in millions, of the largest reasonable training run in 2023) |
| 46. (as above) | Inferred annual spending growth between median 5th and 95th percentile XPT forecasts for 2024 and 2050, compared with Epoch 80% CI | Yearly growth in spending (%) (how much the willingness to spend on potentially-transformative training runs will increase each year) |
| 48. By what factor will training efficiency on ImageNet classification have improved over AlexNet by the end of 2024, 2030? | Inferred annual growth rate between median 5th and 95th percentile XPT forecasts for 2024 and 2030, compared with Epoch 80% CI | Baseline growth rate (the yearly improvement in language and vision algorithms, expressed as an order of magnitude) |
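
The "inferred annual growth" comparisons in the table above rest on annualizing the change between two point forecasts. The exact workings behind the post's numbers may differ slightly; the sketch below shows the standard compound-growth calculation, with made-up example values rather than actual XPT forecasts.

```python
import math

def annualized_growth(value_start: float, value_end: float, years: int) -> float:
    """Constant yearly growth rate that takes value_start to value_end over `years` years."""
    return (value_end / value_start) ** (1 / years) - 1

def annual_oom(factor_start: float, factor_end: float, years: int) -> float:
    """Yearly improvement, in orders of magnitude, implied by two efficiency factors."""
    return math.log10(factor_end / factor_start) / years

# Hypothetical example values (placeholders, not XPT forecasts):
spend_growth = annualized_growth(35e6, 200e6, 2050 - 2024)   # spending on the largest run
algo_progress = annual_oom(100, 1000, 2030 - 2024)           # training-efficiency factor over AlexNet

print(f"Implied yearly growth in spending: {spend_growth:.1%}")
print(f"Implied baseline growth rate: {algo_progress:.2f} OOM/year")
```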

Caveats and notes

It is important to note that there are several limitations to this analysis:

- Only some of the model’s inputs are replaced with XPT forecasts; the rest of Epoch’s defaults are taken as given.
- The XPT confidence intervals are constructed from the median 5th and 95th percentile forecasts, so they are not directly comparable to Epoch’s 80% CIs.
- XPT forecasters stopped updating their forecasts on 31st October 2022, so they do not reflect more recent research or developments.
- Regeneration affects model outputs, so the results reported here cannot be replicated exactly.

The forecasts

| Input | Epoch default | XPT superforecaster | XPT expert | Notes |
| --- | --- | --- | --- | --- |
| Baseline growth rate (OOM/year) | 0.35-0.75 | 0.09-0.2 | 0.15-0.23 | Epoch: 80% CI. XPT: 90% CI,[10] based on 2024-2030 forecasts |
| Current spending ($, millions) | $60 | $35 | $60 | Epoch: 2023 estimate. XPT: 2024 median forecast |
| Yearly growth in spending (%) | 34%-91.4% | 6.4%-11% | 5.7%-19.5% | Epoch: 80% CI. XPT: 90% CI,[11] based on 2024-2050 forecasts |
| Median TAI arrival year (according to the Epoch Direct Approach model) | 2036 | 2065 | 2052 | |

Note that regeneration affects model outputs, so these results can’t be replicated directly. Figures given here are the average of 5 regenerations.

See workings here.

What drives the differences between Epoch’s inputs and XPT forecasts?

Across the relevant inputs, Epoch draws on recent research which was not available when the XPT forecasters made their forecasts (the tournament closed in October 2022). Epoch doesn’t cite arguments for its inputs beyond these particular pieces of research, so it’s hard to say what drives the disagreement beyond access to more recent research and differences in question formulation.

Specifically:

The single biggest factor driving differences in outputs is yearly growth in spending, closely followed by baseline growth rate in algorithmic progress. It is noteworthy that:

On current spending estimates:

See here for some analysis showing how much changing just one input altered the model output. Note that because regeneration alters model outputs, these results cannot be directly replicated.
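
For readers who want to run that kind of check themselves, here is a minimal sketch of a one-at-a-time sensitivity comparison. The `model` callable is a hypothetical stand-in for however you obtain a model output for a given set of inputs (for example, by reading it off the interactive model), and the dummy model at the end is purely illustrative.

```python
from typing import Callable, Dict

def one_at_a_time(baseline: Dict[str, float],
                  alternatives: Dict[str, float],
                  model: Callable[[Dict[str, float]], float]) -> Dict[str, float]:
    """Change in the model output from swapping each alternative input into the baseline, one at a time."""
    base_output = model(baseline)
    return {name: model({**baseline, name: value}) - base_output
            for name, value in alternatives.items()}

# Purely illustrative dummy model and inputs (not the Direct Approach model):
def dummy_model(x: Dict[str, float]) -> float:
    # Made-up response surface, for illustration only.
    return 0.5 + 0.4 * x["algo_oom_per_year"] - 0.5 * abs(x["spend_growth"] - 0.6)

baseline = {"algo_oom_per_year": 0.45, "spend_growth": 0.6}
xpt_style = {"algo_oom_per_year": 0.15, "spend_growth": 0.085}

print(one_at_a_time(baseline, xpt_style, dummy_model))
```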

XPT forecasters’ all-things-considered view on TAI timelines

As we mentioned above, this analysis takes the Direct Approach model and most of Epoch’s original inputs as a given, and uses XPT forecasts for particular inputs. It cannot be read as a statement of XPT forecasters’ all-things-considered view on TAI.

In fact, from questions in the XPT postmortem survey, we know that XPT forecasters’ all-things-considered TAI timelines are longer than this analysis of the Direct Approach model suggests.

XPT forecasters made the following explicit predictions in the postmortem survey:

- The median probability of TAI by 2070 was 3.75% among superforecasters and 16% among experts.

The output of the Direct Approach model using XPT inputs is more aggressive than XPT forecasters’ overall views. Subsetting XPT forecasts to those forecasters who responded to the postmortem survey for comparability, the Direct Approach model outputs:

Note that:

Appendix A: Arguments made for different forecasts

Both Epoch and the XPT forecasters gave arguments for their forecasts.

In Epoch’s case, the arguments are put forward directly in the relevant sections of the Direct Approach post.

In the XPT case:

The footnotes for XPT rationale summaries contain direct quotes from XPT team rationales.

Algorithmic progress

| Input | Epoch | XPT superforecaster | XPT expert |
| --- | --- | --- | --- |
| Baseline growth rate (OOM/year) | 0.21-0.65 | 0.09-0.2 | 0.15-0.23 |

Direct Approach arguments

XPT arguments

General comments:

Arguments for slower algorithmic progress (further from Epoch’s estimate):

Arguments for faster algorithmic progress (closer to Epoch’s estimate):

Investment

| Input | Epoch | XPT superforecaster | XPT expert |
| --- | --- | --- | --- |
| Current spending ($, millions) | $60 | $35 | $60 |
| Yearly growth in spending (%) | 34%-91.4% | 6.4%-11% | 5.7%-19.5% |

Direct Approach arguments

XPT arguments

General comments:

Arguments for lower spending (further from Epoch’s estimate):

Arguments for higher spending (closer to Epoch’s estimate):

 

  1. ^

     In this post, the term “XPT expert” includes general x-risk experts, domain experts, and non-domain experts; it does not include superforecasters. This is because the sample size for AI domain experts on these questions was small. About two-thirds of experts forecasting on these questions were either AI domain experts or general x-risk experts, while about one-third were experts in other domains. For details on each subgroup’s forecasts, see Appendix 5 here.

  2. ^

     For this question, XPT forecasters were asked to give their forecasts at the 5th, 25th, 50th, 75th, and 95th percentiles. The XPT CI presented here is the range between the XPT forecasters’ median 5th percentile and median 95th percentile forecasts, so it is not directly comparable to the Epoch CI.

  3. ^

     Here we use the XPT median rather than a 90% CI, because the Direct Approach model takes a single estimate as an input for this parameter.

  4. ^

     For this question, XPT forecasters were asked to give their forecasts at the 5th, 25th, 50th, 75th, and 95th percentiles. The XPT CI presented here is the range between the XPT forecasters’ median 5th percentile and median 95th percentile forecasts, so it is not directly comparable to the Epoch CI.

  5. ^

     In three instances (across roughly 50 regenerations), we generated a result of >2100. In all three cases, this result was many decades away from the results we generated on other regenerations. After consultation with the Epoch team, we think it’s very likely that this is a minor glitch rather than a true model output, and so we excluded all three >2100 results from our analysis.

  6. ^

     “Transformative AI is defined here as any scenario in which global real GDP during a year exceeds 115% of the highest GDP reported in any full prior year.”

  7. ^

     Note that regeneration affects model outputs, so these results can’t be replicated directly. Figures given here are the average of 5 regenerations.

  8. ^

     “The outputs of the model should not be construed as the authors’ all-things-considered views on the question; these are intended to illustrate the predictions of well-informed extrapolative models.” https://epochai.org/blog/direct-approach-interactive-model 

  9. ^

     There was also an XPT question on compute, but it is not directly comparable to inputs to the Direct Approach model:

    XPT question: 47. What will be the lowest price, in 2021 US dollars, of 1 GFLOPS with a widely-used processor by the end of 2024, 2030, 2050?

    Direct Approach model input: Growth in FLOP/s/$ from hardware specialization (OOM/year): The rate at which you expect hardware performance will improve each year due to workload specialization, over and above the default projections. The units are orders of magnitude per year. (Lognormal, 80% CI).

  10. ^

     For this question, XPT forecasters were asked to give their forecasts at the 5th, 25th, 50th, 75th, and 95th percentiles. The XPT CI presented here is the range between the XPT forecasters’ median 5th percentile and median 95th percentile forecasts, so it is not directly comparable to the Epoch CI.

  11. ^

     For this question, XPT forecasters were asked to give their forecasts at the 5th, 25th, 50th, 75th, and 95th percentiles. The XPT CI presented here is the range between the XPT forecasters’ median 5th percentile and median 95th percentile forecasts, so it is not directly comparable to the Epoch CI.

  12. ^

     The probability of >15% growth by 2100 was asked about in both the main component of the XPT and the postmortem survey. The results here are from the postmortem survey. The superforecaster median estimate for this question in the main component of the XPT was 2.75% (both for all superforecaster participants and for the subset that completed the postmortem survey).

  13. ^

     The probability of >15% growth by 2100 was asked about in both the main component of the XPT and the postmortem survey. The results here are from the postmortem survey. The expert median estimate for this question in the main component of the XPT was 19% for all expert participants and 16.9% for the subset that completed the postmortem survey.

  14. ^

  15. ^

     Question 48: See 339, “On the other hand, an economist would say that one day, the improvement will stagnate as models become "good enough" for efficient use, and it's not worth it to become even better at image classification. Arguably, this day seems not too far off. So growth may either level off or continue on its exponential path. Base rate thinking does not help much with this question… It eluded the team to find reasonable and plausible answers... stagnation may be just as plausible as further exponential growth. No one seems to know.”

  16. ^

     Question 48: 340, “Low range forecasts assume that nobody does any further work on this area, hence no improvement in efficiency.” 341, “The Github page for people to submit entries to the leaderboard created by OpenAI hasn't received any submissions (based on pull requests), which could indicate a lack of interest in targeting efficiency. https://github.com/openai/ai-and-efficiency”.

  17. ^

     Question 48: 340, “In addition, it seems pretty unclear, whether this metric would keep improving incidentally with further progress in ML, especially given the recent focus on extremely large-scale models rather than making things more efficient.”

  18. ^

     Question 48: 340, “[T]here seem to be some hard limits on how much computation would be needed to learn a strong image classifier”.

  19. ^

     Question 48: 341, “The use cases for AI may demand accuracy instead of efficiency, leading researchers to target continued accuracy gains instead of focusing on increased efficiency.”

  20. ^

     Question 48: 341, “A shift toward explainable AI (which could require more computing power to enable the AI to provide explanations) could depress growth in performance.”

  21. ^

     Question 48: 336, “Lower end forecasts generally focused on the fact that improvements may not happen in a linear fashion and may not be able to keep pace with past trends, especially given the "lumpiness" of algorithmic improvement and infrequent updates to the source data.” 338, “The lowest forecasts come from a member that attempted to account for long periods with no improvement.  The reference table is rarely updated and it only includes a few data points.  So progress does look sporadic.”

  22. ^

     Question 48: 337, “The most significant disagreements involved whether very rapid improvement observed in historical numbers would continue for the next eight years.  A rate of 44X is often very hard to sustain and such levels usually revert to the mean.”

  23. ^

     Question 48: 340, “The higher range forecasts simply stem from the extrapolation detailed above.

    Pure extrapolation of the 44x in 7 years would yield a factor 8.7 for the 4 years from 2020 to 2024 and a factor of 222 for the years until 2030. => 382 and 9768.” 336, “Base rate has been roughly a doubling in efficiency every 16 months, with a status quo of 44 as of May 2019, when the last update was published. Most team members seem to have extrapolated that pace out in order to generate estimates for the end of 2024 and 2030, with general assumption being progress will continue at roughly the same pace as it has previously.”

  24. ^

     Question 48: 336, “The high end seems to assume that progress will continue and possibly increase if things like quantum computing allow for a higher than anticipated increase in computing power and speed.”

  25. ^

     Question 48: 341, “AI efficiency will be increasingly important and necessary to achieve greater accuracy as AI models grow and become limited by available compute.”

  26. ^

     Question 48: 341.

  27. ^

     Question 48: 337, “The most significant disagreements involved whether very rapid improvement observed in historical numbers would continue for the next eight years.  A rate of 44X is often very hard to sustain and such levels usually revert to the mean.  However, it seems relatively early days for this tech, so this is plausible.”

  28. ^

     https://epochai.org/blog/direct-approach-interactive-model#investment

  29. ^

     Question 46: 338: “The main split between predictions is between lower estimates (including the team median) that anchor on present project costs with a modest multiplier, and higher estimates that follow Cotra in predicting pretty fast scaling will continue up to anchors set by demonstrated value-added, tech company budgets, and megaproject percentages of GDP."

  30. ^

     Question 46: 340: “Presumably much of these disagreement[s] stem from different ways of looking at recent AI progress.  Some see the growth of computing power as range bound by current manufacturing processes and others expect dramatic changes in the very basis of how processors function leading to continued price decreases.”

  31. ^

     Question 46: 337, “[T]raining cost seems to have been stuck in the $10M figure for the last few years.”; “we have not seen such a large increase in the estimated training cost of the largest AI model during the last few years: AlphaZero and PALM are on the same ballpark.” 341, “For 2024, the costs seem to have flattened out and will be similar to now. To be on trend in 2021, the largest experiment would need to be at $0.2-1.5bn. GPT-3 was only $4.6mn”

  32. ^

     Question 46: 341, “The AI impacts note also states that the trend would only be sustainable for a few more years. 5-6 years from 2018, i.e. 2023-24, we would be at $200bn, where we are already past the total budgets for even the biggest companies.”

  33. ^

     Question 46: 336, “The days of 'easy money' may be over. There's some serious belt-tightening going on in the industry (Meta, Google) that could have a negative impact on money spent.”

  34. ^

     Question 46: 337, “It also puts more weight on the reduced cost of compute and maybe even in the improved efficiency of minimization algorithms, see question 48 for instance.” 336, “After 2030, we expect increased size and complexity to be offset by falling cost of compute, better pre-trained models and better algorithms. This will lead to a plateau and possible even a reduction in costs.”; “In the near term, falling cost of compute, pre-trained models, and better algorithms will reduce the expense of training a large language model (which is the architecture which will likely see the most attention and investment in the short term).” See also 343, “$/FLOPs is likely to be driven down by new technologies and better chips. Better algorithm design may also improve project performance without requiring as much spend on raw compute.” See also 339, “The low end scenarios could happen if we were to discover more efficient training methods (e.g. take a trained model from today and somehow augment it incrementally each year rather than a single batch retrain or perhaps some new research paradigm which makes training much cheaper).”

  35. ^

     Question 46: 336, “Additionally, large language models are currently bottlenecked by available data. Recent results from DeepMind suggest that models over ~100 billion parameters would not have enough data to optimally train. This will lead to smaller models and less compute used in the near term. For example, GPT-4 will likely not be significantly larger than Chinchilla. https://arxiv.org/abs/2203.15556”. 341, “The data availability is limited.” See also 340, “The evidence from Chinchilla says that researchers overestimated the value of adding parameters (see https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications). That is probably discouraging researchers from adding more parameters for a while. Combined with the difficulty of getting bigger text datasets, that might mean text-oriented systems are hitting a wall. (I'm unsure why this lasts long - I think other datasets such as video are able to expand more).”

  36. ^

     Question 46: 340, “The growth might be slowing down now.”; “Or maybe companies were foolishly spending too little a few years ago, but are now reaching diminishing returns, with the result that declining hardware costs mostly offset the desire for bigger models.”

  37. ^

     Question 46: 340, “Later on, growth might slow a lot due to a shift to modular systems. I.e. total spending on AI training might increase a good deal. Each single experiment could stay small, producing parts that are coordinated to produce increasingly powerful results.” See also 339, “2050 At this point I'm not sure it will be coherent to talk about a single AI experiment, models will probably be long lived things which are improved incrementally rather than in a single massive go. But they'll also be responsible for a large fraction of the global GDP so large expenditures will make sense, either at the state level or corporation.”

  38. ^

     Question 46: 340, “Some forecasters don't expect much profit from increased spending on AI training. Maybe the recent spending spree was just researchers showing off, and companies are about to come to their senses and stop spending so much money.”

  39. ^

     Question 46: 340; “There may some limits resulting from training time. There seems to be agreement that it's unwise to attempt experiments that take more than a few months. Maybe that translates into a limit on overall spending on a single experiment, due to limits on how much can be done in parallel, or datacenter size, or supercomputer size?”

  40. ^

     Question 46: 343, “Monetization of AGI is in its early stages. As AI creates new value, it's likely that additional money will be spent on increasingly more complex projects.” Note that this argument refers to forecasts higher than the team median forecasts, and the team median for 2024 was $25m.

  41. ^

     Question 46: 337, “This will make very much sense in the event that a great public project or international collaboration will be assembled for researching a particular aspect of AI (a bit in the line of project Manhattan for the atomic bomb, the LHC for collider physics or ITER for fusion). The probability of such a collaboration eventually appearing is not small. Other scenario is great power competition between China and the US, with a focus on AI capabilities.”

  42. ^

     Question 46: 336, “There is strong competition between players with deep pockets and strong incentives to develop and commercialize 'AI-solutions'.”

  43. ^

     Question 46: 344, “Automatic experiments run by AI are beyond valuation”. 337, “One forecast suggest astronomical numbers for the largest project in the future, where the basis of this particular forecast is the possibility of an AI-driven economic explosion (allowing for the allocation of arbitrarily large resources in AI).”