AGI safety from first principles

By richard_ngo @ 2020-10-21T17:42 (+77)

This is a linkpost to https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ

In this report (which I'm linking from the Alignment Forum) I have attempted to put together the most compelling case for why the development of AGI might pose an existential threat. It stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people's arguments, but as this report has grown, it's become more representative of my own views and less representative of anyone else's. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI - one which doesn't take any previous claims for granted, but attempts to work them out from first principles.

The report is primarily aimed at people who already understand the basics of machine learning, but most of it should also make sense to laypeople. It's roughly 15,000 words in total, split into six sections: the first and last are short framing sections, while the middle four correspond to the four premises of the core argument. The brief introductory section appears below; find the rest on the Alignment Forum.

AGI safety from first principles

The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth's second most powerful "species", and lose the ability to create a valuable and worthwhile future.

I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:

  1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).
  2. Those AIs will be autonomous agents which pursue large-scale goals.
  3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.
  4. The development of such AIs would lead to them gaining control of humanity’s future.

While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.

That's the introduction; to continue reading, here's the next section, on Superintelligence. In addition to reframing existing arguments, here are a few of the more novel claims made in the rest of the report:

  1. When training AIs which can perform well on a range of novel tasks, we shouldn’t think of objective functions as specifications of our desired goals, but rather as tools to shape our agents’ motivations and cognition.
  2. Interactions between many AGIs (specifically via replication, cultural learning, and recursive improvement) will be important during the transition from human-level AGI to superintelligence.
  3. Existing frameworks for thinking about goal-directed agency don’t help us to predict the types of goals AGIs will have. To do so, we should identify specific cognitive capacities AGIs would need to be capable of pursuing goals, and how those might develop.
  4. The likelihood of inner misalignment occurring depends on whether instrumentally convergent subgoals will be present during training, and how complex they will be compared with the outer objective.
  5. We should plan towards building intent aligned AGIs which are better than humans at safety and governance research. Up to that point, we can increase our chances of retaining control via coordination to deploy transparent systems in constrained ways.

See also Rohin's summary for the Alignment Newsletter here.


  1. Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible. ↩︎


MaxDalton @ 2022-01-09T09:48 (+2)

[I'm doing a bunch of low-effort reviews of posts I read a while ago and think are important. Unfortunately, I don't have time to re-read them or say very nuanced things about them.]

There's been a variety of work over the last few years focused on examining the arguments for focusing on AI alignment. I think this is one of the better and more readable ones. It's also quite long and not-really-on-the-Forum. Not sure what to do with that. The last post has a bunch of comment threads, which might be a good way of demonstrating EA reasoning.

Linch @ 2021-06-23T10:26 (+2)

In the report, you say:

We can start with Legg’s well-known definition, which identifies intelligence as the ability to do well on a broad range of cognitive tasks.


However, this is importantly different from Legg's actual definition (pg. 12), which is:

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

I'm curious whether this change is intentional, or perhaps I'm looking at the wrong link?

I think the two definitions are meaningfully different (consider the case of a child completing Raven's Progressive Matrices without particular interest or inclination to do so), and the definition you use is more common when people refer to human intelligence and the definition from Legg and Hutter more common when people refer to machine intelligence.

richard_ngo @ 2021-06-25T21:28 (+4)

I intended mine to be a slight rephrasing of Legg and Hutter's definition to make it more accessible to people without RL backgrounds. One thing that's not obvious from the way they use "environments" is that the goal is actually built into the environment via a reward function, so describing each environment as a "task" seems accurate.

A second non-obvious thing is that the body the agent uses is also defined as part of the environment, so that the agent only performs the abstract task of sending instructions to that body. A naive reading of Legg and Hutter's definition would interpret a stronger agent as being more intelligent. Adding "cognitive" I think rules this out, while also remaining true to the spirit of the original definition.
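For reference, the formal measure behind the informal definition is usually written along the following lines. This is a sketch reproduced from memory of Legg and Hutter's paper, so treat the exact symbols as an assumption rather than a quotation: $\pi$ is the agent, $E$ is the class of computable environments with bounded total reward, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V^{\pi}_{\mu}$ is the expected total reward that $\pi$ obtains in $\mu$.

```latex
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

The reward that defines "achieving goals" appears only inside $V^{\pi}_{\mu}$, which is why each environment can reasonably be read as a task.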

Curious if you still disagree, and if so why - I don't really see what you're pointing at with the Raven's Matrices example.

Linch @ 2021-06-26T01:17 (+2)

One thing that's not obvious from the way they use "environments" is that the goal is actually built into the environment via a reward function, so describing each environment as a "task" seems accurate.

Thanks, I was not aware of this issue. 

Curious if you still disagree, and if so why - I don't really see what you're pointing at with the Raven's Matrices example.

I'm less sure I disagree now. I think my intuitive sense of intelligence (as defined for humans) is that it's used as a (sometimes single-variable) mapping for a broad cluster of somewhat distinct but in practice fairly correlated traits like pattern recognition, working memory, etc, while Legg's definition is carefully written to define intelligence to be outcome-only. The fact that the reward is built into the environment is not something I was previously aware of and I need to think harder about whether I still have reservations. 

One thing I'm confused about is whether Legg's definition (or your rephrasing) allows for situations where it's in principle possible that being smarter is ex ante worse for an agent (obviously ex post it's possible to follow the correct decision procedure and be unlucky). My intuition is that most naive definitions of intelligence allow for this to at least theoretically be possible in various ways, but I'm not sure that Legg's definition allows for it (and I currently lean against).

richard_ngo @ 2021-06-26T05:03 (+2)

One thing I'm confused about is whether Legg's definition (or your rephrasing) allows for situations where it's in principle possible that being smarter is ex ante worse for an agent (obviously ex post it's possible to follow the correct decision procedure and be unlucky).

There definitely are such cases - e.g. Omega penalises all smart agents. Or environments where there are several crucial considerations which you're able to identify at different levels of intelligence, so that as intelligence increases, your success increases and decreases.

But in general I agree with your complaint about Legg's definition being defined in behavioural terms, and how it'd be better to have a good definition of intelligence in terms of the cognitive processes involved (e.g. planning, abstraction, etc). I do think that starting off in behaviourist terms was a good move, back when people were much more allergic to talking about AGI/superintelligence. But now that we're past that point, I think we can do better. (I don't think I've written about this yet in much detail, but it's quite high on my list of priorities.)

Linch @ 2021-06-26T07:17 (+2)

There definitely are such cases ... Or environments where there are several crucial considerations which you're able to identify at different levels of intelligence, so that as intelligence increases, your success increases and decreases.

Sorry, I'm confused about this claim as stated. Assume that all environments have 3 levels of abstraction, which have these ultimate {action -> expected utility} pairs:

A -> 10 expected utils

B -> -10 expected utils

C -> 20 expected utils


It seems to me that by the definition:

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

Then the strategy that outputs C is smarter than the strategy that outputs A, which is smarter than the strategy that outputs B. So B < A < C.

This is true even if cognitively the algorithm that outputs B is more sophisticated (e.g. more crucial considerations, or literally the same learning algorithm but with more compute) than the one that outputs A.
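The point can be made concrete with a minimal sketch. The expected utilities are the ones from the example above; the `sophistication` scores are hypothetical numbers invented purely for illustration, assuming the strategy that outputs B is the most cognitively sophisticated.

```python
# Expected utilities from the example above, keyed by the action each strategy outputs.
expected_utility = {"A": 10, "B": -10, "C": 20}

# Hypothetical "cognitive sophistication" scores (an assumption for illustration only):
# the strategy that outputs B is taken to be the most sophisticated.
sophistication = {"A": 1, "B": 3, "C": 2}


def rank(scores):
    """Return the keys ordered from lowest to highest score."""
    return sorted(scores, key=scores.get)


# A purely behavioural definition ranks strategies by outcome alone.
behavioural_ranking = rank(expected_utility)

# A cognitive definition would rank by the process that produced the output.
cognitive_ranking = rank(sophistication)

print(" < ".join(behavioural_ranking))
print(" < ".join(cognitive_ranking))
```

Under these assumed scores the two rankings disagree about where B sits, which is the kind of divergence I have in mind.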

Am I confused here? 

richard_ngo @ 2021-06-27T01:12 (+2)

Ah, I see. I thought you meant "situations" as in "individual environments", but it seems like you meant "situations" as in "possible ways that all environments could be".

In that case, I think you're right, but I don't consider it a problem. Why might it be the case that adding more compute, or more memory, or something like that, would be net negative across all environments? It seems like either we'd have to define the set of environments in a very gerrymandered way, or else there's something about the change we made that lands us in a valley of bad thinking. In the former case, we should use a wider set of environments; in the latter case, it seems easier to bite the bullet and say "Yeah, turns out that adding more of this usually-valuable trait makes agents less intelligent."

Linch @ 2021-06-27T21:44 (+4)

Hmm, I'm probably not phrasing this well, but the point I'm trying to get across is that the Legg definition defines intelligence as always monotonically good in an in-principle way. I actually agree with you that empirically smarts (as usually defined)->good outcomes seems like the most natural hypothesis, but I'd have preferred a definition of intelligence which would leave open this question as an empirical hypothesis, over something that assumes it by definition.

I realize that "empirical hypothesis" is weird because of No Free Lunch, so I guess by a range of environments I mean something like "environments that plausibly reflect actual questions that might naturally arise in the real world" (Not very well-stated).

For example, another thing that I'm sort of interested in is multiagent situations where credibly proving you're dumber makes you a more trustworthy agent, where it feels weird for me to claim that the credibly dumber agent is actually on some deeper level smarter than the naively smarter agent (whereas an agent smart enough to credibly lie about their dumbness is smarter again on both definitions). 

(I don't think the Omega hates smartness example, or for that matter a world where anti-induction is the correct decision procedure, is very interesting, relatively speaking, because they feel contrived enough to be a relatively small slice of realistic possibilities). 

richard_ngo @ 2021-06-27T23:01 (+4)

Ah, I like the multiagent example. So to summarise: I agree that we have some intuitive notion of what cognitive processes we think of as intelligent, and it would be useful to have a definition of intelligence phrased in terms of those. I also agree that Legg's behavioural definition might diverge from our implicit cognitive definition in non-trivial ways.

I guess the reason why I've been pushing back on your point is that I think that possible divergences between the two aren't the main thing going on here. Even if it turned out that the behavioural definition and the cognitive definition ranked all possible agents the same, I think the latter would be much more insightful and much more valuable for helping us think about AGI.

But this is probably not an important disagreement.

Linch @ 2021-06-27T23:21 (+4)

I see the issue now. To restate it in my own words, both of us agree that cognitive definitions are plausibly more useful than behavioral definitions (and probably you are more confident in this claim than I am), but for me the cruxes are in the direction of where the cognitive definitions and the behavioral definitions diverge in non-trivial ways in ranking agents, and in those cases the divergences are important + interesting, whereas in your case you'd consider the cognitive definitions more insightful for thinking about AGI even if it can later be shown that the divergences are only in trivial ways.  

Upon reflection, I'm not sure if we disagree. I'll need to think harder about whether I'd consider using the cognitive definitions (which presumably will suffer a bit of an elegance tax) to still be a generally superior way of thinking about AGI than using the behavioral definition if there are no non-trivial divergences.

I also agree that as stated this is probably not an important disagreement.