On the future of language models

By Owen Cotton-Barratt @ 2023-12-20T16:58 (+116)

1.     Introduction

1.1     Summary of key claims

1.2     Meta

We know that AI is likely to be a very transformative technology. But a lot of the analysis of this point treats something like “AGI” as a black box, without thinking too much about the underlying tech which gets there. I think that’s a useful mode, but it’s also helpful to look at specific forms of AI technology and ask where they’re going and what the implications are.

This doc does that for language models. It’s a guide for thinking about them from various angles with an eye to what the strategic implications might be. Basically I’ve tried to write the thing I wish I’d read a couple of years ago; I’m sharing now in case it’s helpful for others.

The epistemic status of this is “I thought pretty hard about this and these are my takes”; I’m sure there are still holes in my thinking (NB I don’t actually do direct work with language models), and I’d appreciate pushback; but I’m also pretty sure I’m capturing some important dynamics which aren’t as broadly appreciated as they should be. Many of the particular insights here are due to other people. I want to say thanks to Adam Bales, Anna Wang, Buck Shlegeris, Carl Shulman, Daniel Dewey, Eric Drexler, Max Dalton, Nate Soares, Rebecca Cotton-Barratt, Rohin Shah, Rose Hadshar, Tom Davidson, and especially Beth Barnes, David Manheim, Lukas Finnveden, and Toby Ord, for helpful comments and/or conversations.

2.     What type of thing are language models?

2.1     Emulating civilization, not individual people

The field of AI was originally about reproducing human intelligence. Humans are good at finding patterns and learning things. If we could automate the type of thinking they do, that would be a big deal. If we could build automated systems which were better general learners and thinkers than humans, it would transform the world.

Language models aren’t really trying to do the same thing. This may be a surprising claim; they’re a type of machine learning, which is doing exactly this. However, I think it’s clearer to think of language models as a specialized application of machine learning. Sure, they make use of machine learning techniques, but their game isn’t really “be better than humans at learning from a certain amount of language” (indeed they’re fed with so much data that they can be much more inefficient than humans, and I don’t think this is a crux for how important they will be). It’s “replicate the kind of things humans say”.

This is powerful because humans, collectively, know a bunch of stuff, both implicitly and explicitly. There’s a lot of knowledge and intelligence which is crystallized in our writing. If the language models of today seem to know a lot of things, this isn’t because they’ve gone out and understood the world directly, but because they’re leveraging knowledge which is represented in human text. 

Moreover language is the medium via which we construct concepts and make explicit arguments — powerful tools for understanding and acting in the world. The ability to approximate human writing — even if not based on the same underlying learning abilities — might reproduce a lot of that intelligence. 

All of this matters for thinking about the impacts language models are likely to have, and where they might be going. In slogan form, perhaps:

Machine learning reimplements human intelligence.

Language models emulate humanity’s collective intelligence.

Note that language models could be used to emulate the written output of individual people, if a prompt was specific enough that it tightly specified the author. But this isn’t their default mode — mostly predicting text will depend on averages across a lot of different (possible) people (weighted by how likely those people were to be writing about the topic).

2.2     An extremely crude picture of how language models work

For the purposes of this document, what I think is important:

2.3     What are foundation models approximating?

We can think of foundation models as a series of approximations. A given foundation model Wi approximates the limit WText of what we could achieve with ideal machine learning and all extant text. This in turn approximates WOmega, which is the true distribution human writing is drawn from. Foundation models can never actually achieve “the true distribution”, but understanding that this is what they’re approximating may help us to understand their scope as a technology.

Here’s a digression digging a bit deeper on these concepts:

3.     Techniques for getting value from language models

A major focus of research on language models has been on improving the foundation models — getting better approximations to WText. But there is important complementary research in the question: for a fixed foundation model Wi, how can you do useful things? There are few different techniques:

3.1     Prompt engineering

The output of foundation models depends on the prompts they are given. This would be true of WOmega — the value of being able to sample from all possible human documents would be importantly dependent on the ability to steer towards the most useful parts of document space. For the weaker foundation models we have, there may be other helpful tricks in designing prompts.

Over the last couple of years, as people have played around with language models, there has been a lot of parallelized labour into finding the style of prompts that is most likely to lead to good things. To the extent that people are finding knowledge about how to get value out of WOmega, this will generalize to future language models; to the extent that they’re learning tricks peculiar to the current generation of foundation models, it may not.

3.2     Scaffolding

Scaffolding is the general category of designing environments around language models which feed them prompts and process their outputs. Scaffolding is a broad category of which the most straightforward case is just prompt engineering, but in general it allows for complex procedures where the output in response to earlier prompts is fed into other software tools, and these determine what is put into later prompts.

For example, scaffolding could allow for a model to make multi-stage plans and then call separate instances to execute each of those stages without losing track or where it is, and to make use of tools such as browsing the internet and writing and executing software.

Limits of what might be achievable via scaffolding are discussed in Section 6.3.2.

3.3     Finetuning

Finetuning takes a foundation model and runs more machine learning to adjust just some of the weights — using the foundation model to give an inductive in its search for more refined models. The idea is that it’s much easier to find models which are smart in arbitrary ways if you’re restricted to a much smaller-dimensional search space. For small amounts of finetuning, we might think of the inductive bias as being roughly “only consider saying things that humans might say”. For larger amounts of finetuning the bias might be more structural, making use (in opaque ways) of implicit knowledge the language model has to restrict the search space.

Finetuning relies on having some metric, or feedback loop, to train things towards. This could be given by some body of text it’s trying to emulate, or by some other function of text output. 

3.4     Combining these

Scaffolding and finetuning can be combined. Generically I think they will be. For it not to make sense to use scaffolding it would be the case that the trivial scaffold performed (roughly) as well as anything else. I think this is implausible at least in the short term. And it would be even more surprising if foundation models — which were selected for their ability to emulate human outputs — happened to be optimized among close by systems for their performance when used in an effective scaffold. I therefore think it’s implausible that it won’t be optimal to make use of finetuning.

We might think of finetuning as analogous to on-the-job training for the use-case at hand, and scaffolding as analogous to setting up a good management structure and organizational protocols. The analogy supports the idea that a combination of the two may be most effective.

4.     Natural limits of language models

In Section 5 we’ll start to look at the impacts language models will have in the world as they are further developed and deployed. In order to facilitate that, in this section we’ll look at some natural limits on the kind of things language models are doing. We’ll be concerned with “what kind of outputs can they produce?”; questions of how fast they can produce those, or how they are integrated into society are of central importance for how much impact they end up having, but out of scope for what I want to explore in this section.

4.1     Approximating human capabilities, not superhuman capabilities

There’s a common argument about AI that goes roughly:

There’s nothing special about human capability levels. In any given domain, if AI capabilities are advancing rapidly towards human-level, they’ll probably continue advancing rapidly way past human-level.

Foundation models have been rapidly advancing towards giving human-level responses to many different types of questions: they are rapidly approaching human-level at writing poetry, or explaining physics, or concocting recipes — in the sense that they are far closer to human level now than they were three years ago. Foundation models, however, are emulating human outputs. To the extent that they have human capabilities, they have these via emulation. So the argument doesn’t apply (at least in the straightforward way); rather, we should expect progress to slow down when the quality of their outputs are somewhere in the vicinity of (peak) human performance.

There are a couple of important caveats here:

4.2     Limited cognition per forward pass

To produce a single token, a language model makes a single forward pass over the neural net. To produce longer pieces of text, it repeatedly produces single tokens, with everything it’s produced so far added to the context.

Each forward pass amounts to something of similar complexity to multiplying together some large matrices. This gives lots of room for something like consulting an index and accessing stored knowledge, but relatively limited space for something like “thinking new thoughts”. 

By analogy, when humans learn arithmetic they do it by a mix of rote memorization — many of us see “3x7” and instinctively know that the answer is “21” without calculating anything — and processes for calculating things (e.g. long division). Language models are structured in a way that can make them good at the rote memorization part, but they cannot in a single forward pass do a large amount of following a process.

This means that we can construct tasks that even very strong foundation models will predictably be weak at. e.g. — 

The remainder of [352 digit number] when divided by [219 digit number] is …

WOmega probably gets this right most of the time. But WText probably gets it wrong almost all of the time. (Unless there are some heuristic tricks I’m unaware of. I’d be more confident in my example if it asked for prime factorizations.)

There are three important caveats here:

4.3     Missing cognitive moves?

Language models are capable of reproducing some types of ~atomic cognitive move that humans use. There may be others — at least at any given moment in time — that they cannot reproduce.

Reasons that they might not be able to reproduce a given cognitive move:

It’s worth being aware that there could be constraints from these on what language models can do, but that this might change as architectures improve or models become bigger. (Furthermore, it might be that at some stage — if not already — language models can make useful cognitive moves that humans are incapable of.)

Multimodal models

One concern might be that language models are only equipped to deal with things in language. How do multimodal models affect this picture? Multimodal language language models are the same basic technology as language models, but they use encodings of non-text data into a kind of text to allow the models to interface with this non-text data. They can output non-text via the encoding if that’s the thing that the language model predicts will happen.

Multimodal language models are therefore able to interface with and think about non-text data. But they may (at least for now) be more likely to lack the correct architecture to reproduce the type of cognitive moves humans do with non-text data. However, language models could be augmented with various capacities by using scaffolding to give them access to interfaces which permit them to query other kinds of objects (e.g. image processing; running physics simulations).

5.     Early major impacts of language models?

5.1     Principles for thinking about this

The main metaphor I use to think about this goes as follows:

Suppose you have a large workforce of relatively expert people — whom you can train at significant expense and then will work very fast for a very small fraction of minimum wage — but they’re all a bit drunk and only working from home. What can you usefully do with them?

Of course this metaphor isn’t perfect (and readers may want to think about its imperfections to critique the conclusions I draw from it), but I think it’s probably pretty good as a starting point. A major intuition that I have about that scenario — which I think is probably accurate about the actual situation with language models — is “wow, there’s a really big prize available here for whoever can figure out how to use these folks to do useful stuff”. And there will certainly be incentives to develop techniques to mitigate the obvious disadvantages of being drunk (e.g. via automated error checking).

A couple of people have mentioned to me another metaphor: a large force of interns. I think this is also good; it’s a little better in suggesting that by default they don’t know much about the task at hand, but a little worse in suggesting that they get their knowledge about the domain by looking things up rather than by half-remembering (or occasionally fabricating) facts, and in suggesting strategies like “identify the good interns” which don’t really translate over.

A quick note/aside on the economics:

 As of mid-2023, GPT-4 costs around $0.1 per 1,000 tokens (around 750 words). That’s about 10 minutes of typing at 75wpm. So we’re looking at getting this work at around $0.6/hour — equivalent to perhaps 5% of minimum wage in rich countries. I don’t know how high the markup that OpenAI charges is relative to their marginal cost of providing the service, but I wouldn’t be surprised if the production cost at the margin is much cheaper than that (they charge like 2% of that — 0.1% of minimum wage — for older models, and my guess is that that’s much closer to the marginal production cost for them, and could still be significantly above marginal production cost). Perhaps newer more sophisticated models will be more expensive, but also perhaps progress (or improving compute) drives down prices. (Of course if compute starts being super valuable for this purpose that is likely to push the price of compute up, at least until compute manufacturing can be scaled up to meet demand again.)

OK, so that’s the groundwork. Now to think about what this could mean for where the transformative impacts come. Some observations:

5.2     Important early areas for automation

There are several categories of intellectual labour that I think might be automatable with language models and really important. Three of them together I think might change the world a lot — perhaps on a comparable scale to the industrial revolution, but probably not radically beyond that. In roughly increasing order of importance, they are:

5.3     The big two applications

More important than the preceding, I think there are two really important applications, which have the potential to radically reshape the world:

I think that these two categories are likely at least somewhat harder to get high quality automated work out of than expert advice or management. Why?

I’m not sure how big/thorny these obstructions are. The prizes from automating them are very high, so there will be a lot of pressure to find the paths of least resistance. e.g. even if the most efficient way for humans (and hypothetical ideal AI) to do work here is more like “stare into the void and then bring it back to the domain of language” rather than just doing all the reasoning at the verbal level, if there’s a way to get comparably good results by doing everything at the explicit verbal level and it’s just 100x slower, that could still be enough to get you something transformative.

High quality software engineering has some of the same obstructions, but because it’s so easy to get a high-quality success metric, we may expect self-play to help push model performance up to human-level and beyond relatively early. Research and executive capacity face issues with epistemic grounding: how can you be confident that one angle leads to better takes than another? We may ultimately need to rely on real-world feedback loops to help learn this, but they may be slow.

We should probably expect research and executive capacity to be partially automated (and so performed by centaurs, i.e. human–AI teams) before they’re fully automated. At minimum, many people in research and executive roles spend good fractions of their time on software or management tasks, so automating the latter would increase total capacity for the former.

6.     Timelines and takeoffs

6.1     How quickly is all of this likely to happen?

My view is that for a lot of the pieces with significant societal impacts, the fundamental technology is already here. Over the next 5–10 years we might see people building and deploying systems which do a lot of stuff in the world, based on near-term-accessible language models. A lot of innovation will come from startups doing “X with AI”, for various applications X mostly providing expert advice or management services. They will often start by doing it in ways that have human oversight for quality control and training purposes, but reduce the degree of human oversight over time. By default the developers will make use of both finetuning and scaffolding — just hackily throwing stuff together to find out what works. 

The vibe I’m imagining for this is something like the Industrial Revolution or the Wild West, not a nuclear arms race. This could be enough to create significant social unease, centred in the middle classes, as many people see their livelihoods threatened, and more feel uncomfortable with how fast everything is changing.

(If I’m wrong about them having big impacts over this timescale, it’s probably because of some important missing cognitive move which restricts their usability — perhaps something about reliability. But my guess is that these kind of issues will turn out not to be a big problem, or will be surmountable given the scale of the prizes.)

We may see something more like a race for big-2 capabilities. Because if fully automated they can potentially be deployed at very large scale by a single actor (rather than quickly saturating demand), the incentives for a pure race could exist. However, I think it’s most likely that for a while centaurs will significantly outperform fully automated systems — if this is right then while there’s quite likely to be a race for full automation at some point, that would occur in a world which looks significantly transformed from the one we see today (where research has already been accelerated by centaur human-AI teams, and a lot of important planning in the world is done by humans aided by AI). The duration of this centaur period — especially how long we have in the “late centaur” period where efficiency of research is many times what it is today — could be important for determining how different that future world is.

I’m pretty unsure how far we are from ~full automation of big-2 capabilities. When I try to visualize future world trajectories and look for the most coherent ones, I think it’s most likely that this is somewhere in the range 5–15 years away; but I’m not confident in this. At the point where that process is really taking off I expect it will overtake the kind of broad societal impacts I’ve just been discussing, if it is not otherwise constrained.

6.2     Scaling language models towards superintelligence

Foundation models get their oomph from approximating human writing. They can approximate smart or knowledgable humans (with the right prompts, or the right training corpus). But for getting significantly superhuman performance, they would need something else. What could that be?

Two techniques which might be helpful components:

6.2.1     Finetuning for superhuman task performance

For tasks with well-defined success metrics, simply training to do well on those tasks could produce superhuman performance. How quickly this will happen is likely to depend on the task. In the limit with a rich enough model space, enough training data, and enough training time, we might expect to end up approximating optimal performance (and hence exceeding human performance) at every task. But in practice performance on some tasks might be capped by what is achievable within the model space, and might face challenges in getting good enough data.

Still, finetuning for superhuman performance seems like an important part of the picture. At tasks like “write an argument which is persuasive to X audience”, where there is lots of data available on the reactions of that audience, we might expect language models to do pretty well pretty quickly (especially to the extent that persuasiveness is a function of local sentence choice and not larger-scale structures of how arguments fit together). At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play. (Though note that within the confines of a single forward pass, the limit on cognition could prevent too much tree search through future game states, which could mean that performance still lags behind systems which are capable of tree search.)

For open-ended tasks like “build a company that will make a lot of money”, I guess that we will for the near future be unable to give enough data and train deep enough to get superhuman performance on this just with finetuning.

6.2.2     Scaffolding for amplification via reflection

Humans are able to benefit from time to reflect. Our slow answers to questions are often better than our snap judgements. But often we don’t actually get the time to reflect, and do act on the basis of our snap judgements.

Since “thinking time” can be very cheap for language models, if they could similarly benefit from extra reflection time, this could help them to boost their task performance significantly above their non-reflective performance. And if their non-reflective performance is approximating human performance, their reflective performance could naturally be superhuman. (Albeit if this were the only mechanism for getting superhuman performance, it might be capped at “what groups of humans going slowly and carefully could do”.)

Scaffolding provides a toolset to help facilitate this reflection. The language models of today already benefit from extra thinking time — they perform better when prompted to think out loud, and scaffolding techniques like running things multiple times and taking a vote can improve performance.

6.3     Recursive improvement and takeoff

An intelligence explosion based on language models would need a mechanism for recursive improvement — something that could repeatedly ratchet towards better performance, where improved performance would help with the next round of improvements.

6.3.1     Reflection-based takeoff

If more thinking time leads to better takes in a relatively unbounded way, this could be a mechanism for takeoff. The key threshold here is not “does performance increase with extra thinking time?” (a bar that language models already clear), but “can performance scale ~arbitrarily far with extra thinking time?” (a bar that humanity as a whole probably crosses, but the language models of today probably don’t).

Even if this bar is crossed, improvement isn’t automatically recursive. But if we know how to use extra compute to produce superhuman performance, we can then use that to construct new data sets to be approximated. These could be used as part of finetuning, or even to build new text corpora, which represent (initially modestly) superhuman levels of intelligence.

This, then, could be iterated. The hope would be that reflection by systems which are approximating smarter answers will be more effective, and lead to yet smarter answers. The system could gradually bootstrap its way to strong superintelligence — essentially continuing the process whereby 21st Century humans are in many ways meaningfully smarter than 11th Century humans.

I say “gradually”, but with large enough amounts of compute this process could potentially play out quickly. Here’s some hacky first-pass analysis:

Still, overall I think this could be thought of as something like “the slow, boring path to superintelligence”. Perhaps it will be the first one that works. But I think it’s a good likelihood that some other things help it to move faster.

6.3.2     Scaffolding-based takeoff

It’s unclear what the performance returns to better scaffolding will look like. At least right now, it seems like nobody has invested that much in building good scaffolding (compared to the investments in building good foundation models), so there might be low-hanging fruit remaining.

How good can scaffolding ever get? One thought is that perhaps a given foundation model has something like a level of “latent potential”, and ideal scaffolding unlocks that but never exceeds it. However, with the right scaffolding one could reimplement an arbitrary GOFAI; while wildly impractical, this is a thought experiment which demonstrates that there is no natural ceiling on capabilities imposed by the foundation model. 

Scaffolding is a language-based construction, so language models could plausibly learn how to contribute to better scaffolding (which can then be experimented with, and could recursively feed into further improvements to scaffolding). We are therefore interested in a question like “what is the returns curve to investment on improving scaffolding?”, which is an empirical question. For some possible shapes of the curve, improvements to scaffolding could precipitate an intelligence explosion, gathering pace faster and faster as successive generations of scaffolding are more effective than the last at further improving the scaffolding. My guess is that the parameters don’t quite shake out that way, but this feels very guesswork-y for such an important parameter.

6.3.3     Finetuning-based takeoff

I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.

6.3.4     Mixed takeoffs

Perhaps most likely is that there is no single silver-bullet, but takeoff contains elements of all of these processes, and others, blended together in a vortex of increasing speed. e.g. as well as improved scaffolding feeding into improved reflection which can help with the next generation of scaffolding, improvements in AI performance could help to accelerate developments in chip fabrication, so that there are greater amounts of compute available to help this process run more quickly.

This should be faster than what we would get out of any single mechanism. The main reason we wouldn’t see such a mixed takeoff is if one of the components is individually so fast that it leaves everything else behind.

One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding. I’ll discuss further in a later section (on the bitter lesson).

6.3.5     Systems not built on language models

I’ve been considering recursive improvement for language models. But the general arguments for an intelligence explosion don’t assume anything like the particular form of language models. Whether or not an intelligence explosion based on language models is possible, it’s likely the case that an intelligence explosion based on other forms of AI technology will eventually be possible. (& the argument about things which exceed human level rapidly blowing past human level is more likely to directly apply to such technologies.)

Could this matter? Yes, in two possible worlds:

7.     Language model agents and transparency

7.1     Where does agency come from?

Suppose we have an agent-like system built out of language models. The foundation models themselves weren’t agent-like. So where could the agency have “come from”?

I think the answer will be one, or a combination, of three possibilities:

  1. The system could be emulating a human or other agent represented in the corpus
    • i.e. it’s implicitly predicting “what would this agent do in this circumstance?”, where the agent and the circumstance have somehow (explicitly or implicitly) been specified
  2. The agency could be selected for (presumably via finetuning)
    • If the developers have selected a system that performs well on a particular task, it is quite plausible that part of the selection pressure has gone towards agency (since this is a generically useful capability)
      • cf. humans and evolution
  3. The agency could be explicitly built in via scaffolding
    • e.g. a prompt gives a language model an explicit goal and asks it to generate plans towards that goal, and then its answers are taken and processed into new prompts to get the plans implemented

I think we should have quite different attitudes towards these, from an AI safety perspective. 

1) seems like mostly a sideshow — while we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and may not have a good handle on what’s going on in the world.

2) seems scary. This is the classic case of mesa-optimization. By default I’d think we should expect not to really understand the goals of agents that have been selected for this way. There may be clever work that could be done to ensure things are safe, but this is the kind of story that makes AI risk seem large and thorny.

3) seems promising. An agent built in this way would come with a massive amount of transparency-by-construction:

This is probably a vast volume of thoughts to handle, but everything is in a very legible form and we can probably take steps to automate oversight. In general: all the normal reasons people are keen on transparency make it seem like a great idea to try to use architecture which is extremely transparent. (This includes both wanting transparency to facilitate long-term AI safety, and wanting transparency to enable auditing of AI applications in the shorter term.)

In practice things may often use a combination of these. And a combination could be concerning: if we have top-level agency coming from 2), then we’re less able to trust the transparency from 3), since the system might have incentives to misrepresent its own thoughts.

7.2     Strategy: avoid selection pressure for agency

A lot of putative safety techniques are around assuming that we have something potentially dangerous and catching it. I think these are well worth investing (defence in depth seems valuable), but as a complementary strategy I’m pretty attracted to the idea that we should build systems where we have reason to believe that they shouldn’t have anything dangerous going on.

In the case of language model agents, this means: I think we should avoid any intensive search/selection processes towards high-level effectiveness of agents towards particular tasks. So far as possible we should aim for high-level agency to enter explicitly via scaffolding, and not via anything else.

Tentatively, I think this would mean:

Of course there’s a whole research agenda here. But I think that the basic point is straightforward and might be quite important to have broadly understood. I think this is somewhere where humanity by default makes systems which are selected to have agency (because we just try everything and see what works), but because the alternative of introducing agency via scaffolding is a pretty good substitute, it might be within political reach to build norms which exclude the problematic type of selection.

7.3     The bitter lesson?

Richard Sutton’s “bitter lesson” from 70 years of AI research is that building knowledge into AI agents may help in the short term but in the longer term is consistently overtaken by general-purpose methods that make use of more computation. This raises a couple of concerns about maintaining transparency:

  1. Even if the most effective agents combine scaffolding and finetuning, the scaffolding might stop being human-comprehensible as compute scales
  2. Even if the internal communication between parts of a scaffolded agent are initially in natural human language, as things are optimized they may find more efficient ways to communicate

Essentially, one might think that even if early scaffolded agents are more transparent, these will be obsoleted by more sophisticated AI which does end-to-end training for effectiveness over the entire system (including the scaffolding).

I take this concern to have some bite. I do think that a scaffolded agent which was purely optimized would be unlikely to have transparent internals. Nonetheless there are a few reasons why I don’t think the bitter lesson means that hopes for transparency are necessarily doomed:

8.     Risks & strategies

8.1     A rough taxonomy of risks

There are several different points which might be dangerous. Here’s one way of slicing things up:

  1. Early language model agent misalignment risk
    • Early systems which are over the autonomous replication threshold, if there isn’t a good regime in place to handle them, could get into the world and then hang around and do destructive things, e.g. —
      • Preparing things which are destructive to discourse as political moves to try to ensure that there aren’t concerted efforts to find and close them down at some later point
      • “AI-run mafia”: bribing and extorting people in ways that build up larger power bases
    • At the first points where this becomes a risk it isn’t very credible that they would be able to outstrip the rest of civilization at the AI improvement game, nor that they would be able to directly cause a global catastrophe; however they might still create an exacerbating risk factor
    • As language model agents become more competent there might be a moment where we haven’t learned how to responsibly handle such systems, and a powerful one gets free in a way that does more directly threaten a global catastrophe
  2. Many opaque language model agents 
    • If people automating executives do so in ways that aren’t naturally transparent about their goals (e.g. because of heavy selection for strong performance), we may end up with a lot of systems in positions of some influence which are at least subtly misaligned, and these could end up with a majority of power in the world
    • This could be bad because:
      • 2A: The future might be determined by processes which are less in touch with human values
      • 2B: There is the possibility with smarter systems of a coordinated treacherous turn
  3. Wrong research automated first
    • At the point where we’re starting to automate most research, if the foundations for the automation of AI research are such that it’s much easier to automate capabilities research than to automate safe capabilities research, we might see a runaway process where the cutting edge in the world doesn’t have safety as a key embodied value, and then this ends up producing some extremely powerful-but-dangerous systems
      • This could happen either because:
        • “Automating capabilities research safely” is just much harder, and we fail to work out how to do that before the key time; or
        • There’s a moratorium or significant attempt to slow down AI development/deployment, which is not globally effective, which leads to the open-source / not carefully/ethically developed systems become cutting edge (because the more white hat stuff has really slowed down)
  4. Vulnerable world
    • If the fruits of research are not tightly held, and if the underlying technical landscape lies a certain way, we might end up in a vulnerable-world-hypothesis type scenario, where there is some broadly available destructive technology
    • Absent strong coordination to avoid its use, this could lead to a global catastrophe
  5. Coercive singleton
    • Automation could lead to strong centralization of power (e.g. if the fruits of automated research are tightly held by a single actor). If one actor gains enough power, they could expropriate control from the rest of the world
      • This is concerning whether that actor is a human, AI system, or institution built out of humans and/or AIs
      • Some of the strategies for avoiding this will vary with the type of actor that is being guarded against; others are cross-cutting
  6. Misalignment from successor paradigms
    • If language-based AI becomes uncompetitive at some point, misalignment from the successor systems could be a serious risk
    • This is especially concerning since the possibility of making language model agents transparent-by-construction seems idiosyncratic to this technology; we might expect transparency to much harder with the successor systems
  7. Butlerian Jihad
    • No catastrophe caused by AI, but in a knee-jerk reaction of fear-of-AI-systems, humanity locks in some things which cut us off from the most valuable futures
    • There are versions of this which involve permanently locking things in, and other versions which don’t necessarily have permanence, but leave things in the hands of humans long enough that we mess it up some other way
  8. (Flawed success)
    • No catastrophe caused by AI, but we somehow fail to build good futures anyway
    • Perhaps because we’re extrapolating values in a bad way, choosing an unhelpful starting point for that process, or choosing to use AI to lock in some properties that would have better not been locked in

I could offer views about the relative degree of existential risk posed by these, or the degree to which we should be prioritizing them (where these come apart because we may have disproportionate leverage over some). But I’m really not very confident in my relative assessment, and I’m much more confident in a meta-level take, so I’ll restrict myself to that:

I think that all of these risks (and it’s quite possible I’m missing some) are potentially grave. I wouldn’t currently feel comfortable assigning less than 1% risk of existential catastrophe to any of them — easily enough that if correct it would justify massive attention to address.

I also think that the actions people should take to understand and mitigate the various risks are likely to differ significantly. I therefore think that it should be a significant priority to better characterize the various risks, to assess how large they are in absolute terms, and to produce plans which are targeted specifically at reducing that risk. This can then feed into better prioritization of actions across the space — it’s likely that we should have a portfolio which includes work well-targeted at a number of these different risks.

8.2     Example strategies for mitigating the different risks

Here are some brainstormed thoughts on strategies for the various things here, to start things off. Take them or leave them.

  1. Early language model agent misalignment risk
    • Monitoring model capabilities
    • Conventions against deploying certain types of agents
    • Restriction of model access by major AI labs to make it harder for third parties to create such agents
    • Defensive measures which make it hard for “escaped” agents to increase their power
  2. Many opaque language model agents 
    • Developing techniques for making highly effective and highly transparent language model agents
    • Conventions against creating agents via finetuning, and otherwise restricting the amount of optimization power that can be exercised at the top level of creating agents
    • Transparency research, to make default-opaque agents less opaque
    • Work to instill virtuous behaviour and "good culture" (e.g. high levels of honesty) in language model agents, even if they're not fully transparent
  3. Wrong research automated first
    • Differential development of research that might be important to automate early
      • Perhaps via centralized Manhattan-Project–style work
    • Conventions restricting the automation of potentially-scary branches of research
    • General strategies for handling differential technological development
  4. Vulnerable world
    • Political centralization of automated research
    • AI-mediated treaties for automated arms control
  5. Coercive singleton
    • Avoiding too much centralized power by any actor
    • Automation of bargaining/negotiation/cooperation, to facilitate reaching cooperative singletons first
  6. Misalignment from successor paradigms
    • Continuation of traditional AI alignment research, and laying the groundwork for its automation
    • Political coordination to restrict research into such paradigms except in extremely careful ways
  7. Butlerian Jihad
    • Working out actually-good paths forwards vis-a-vis humanity’s relationship with AI
    • Convening discussions with thought leaders to minimize polarization of issues
    • Careful communication to build coalitions for Paretotopian futures
  8. (Flawed success)
    • Research into key things that will be needed for creating good futures, and avoiding bad ones
    • Starting to build coalitions and support for any key steps that are anticipated

8.3     Thoughts on tactical implications

I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.

I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.


Aaron Bergman @ 2023-12-24T20:06 (+12)

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

  • i.e. it is a mistake-in-principle to imagine projecting out the GPT-2—GPT-3—GPT-4 capability trend into the far-superhuman range

Surprised to see no pushback on this yet. I do not think this is true; I've come around to thinking that Eliezer is basically right that the limit of next token prediction on human generated text is superintelligence. Now how this latent ability manifests is a hard question, but it's there to be used by the model for its own ends or elicited by humans for ours, or both.

Also worth adding (guessing this point has been made before) that non human-generated text (e.g. regression outputs from a program) are in the training data, so merely predicting those gets you superhuman performance in some domains.

Owen Cotton-Barratt @ 2023-12-24T22:49 (+12)

Sorry, I think you're reading me as saying something like "language models scaled naively up don't do anything superhuman"? Whereas I'm trying to say something more like "language models scaled naively up break the trend line in the vicinity of human level, because the basic mechanism for improved capabilities that they had been using stops working, so they need to use other mechanisms (which probably move a bit slower)".

If you disagree with that unpacking, I'm interested to hear it. If you agree with the unpacking and think that I've done a bad job summarizing it, I'm interested if you want to propose alternate wording.

I do discuss the stuff you're talking about in several places in the doc, especially Sections 2.3, 4.1, and 6.2.

NickLaing @ 2023-12-21T08:24 (+12)

Thanks so much for this article, I feel like I understand LLMs and their potential trajectory far better than I did before. The framings of "drunk expert" vs. "interns" was a little bit of a lightbulb moment for me.

As a non-technical person I've tried reading a lot of theoretical stuff about LLMs and AI both here and on LessWrong. I usually start reading then at some point (in the first third/half of the article) I hit a point where I genuinely don't understand what's going on, stop reading then go back and read something else usually in my global health comfort zone :D.

For whatever reason I managed to read this whole article while at least feeling like I understood every paragraph and concept. I think the bar for understanding here is lower than many AI explainer articles, which I greatly appreciated and I would encourage other non-AI-technical people to read this in full. I would consider myself likely in the bottom 20% of forum readers when it comes to AI understanding as well.

I'm sure its not easy to walk the line between adequately explaining these concepts while still enabling basic people like me to understand ;).

So well done.