Summary: Existential risk from power-seeking AI by Joseph Carlsmith

By rileyharris @ 2023-10-28T15:05 (+11)

This is a linkpost to https://www.millionyearview.com/p/advanced-artificial-intelligence

Within our lifetimes we might witness the deployment of advanced artificial intelligence (AI) systems with the ability to run companies, push forward political campaigns, and advance science. The basic concern is that these AI systems could pursue goals different from what any human intended. In particular, a sufficiently advanced system would be capable of sophisticated planning and strategy, and so would see the usefulness of gaining power: influence, weapons, money, and greater cognitive resources. This is deeply troubling, as it may be difficult to prevent such systems from collectively disempowering humanity. In "Existential risk from power-seeking AI"[1] Carlsmith clarifies the main reasons to think that power-seeking AI might present an extreme risk to humanity.[2]

Conflict of interest: I have received grants from Open Philanthropy, including for work on this blog. Although I asked Carlsmith directly for feedback on this piece,[3] Open Philanthropy had no direct input.

More likely than not, we will see advanced AI systems within our lifetimes, or our children's lifetimes

Carlsmith builds his case for AI risk around estimates for when AI systems could be developed that have:

  1. Advanced capability: they outperform the best humans on some set of tasks (such as science, persuasion, or economic activity) which, when performed well, grant significant power in the world.
  2. Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
  3. Strategic awareness: the models they use in planning represent, with reasonable accuracy, the dynamics of gaining and maintaining power over humans and the real-world environment.

Carlsmith believes that it is more likely than not that we will be able to build agents with all of the capabilities described above by 2070.[4]

It could be difficult to align or control these systems

If we create advanced AI systems, how can we ensure they are aligned, in the sense that they do what their designers want them to do? Here, misalignment looks less like failing or breaking while trying to do what the designers want, and more like deliberately doing something the designers don't want. It would look less like an employee underperforming, and more like an employee trying to embezzle funds from their employer.

There are several strategies for aligning AI systems, but each of them faces difficulties:

  1. Shaping objectives: we could try to instil the objectives we intend, but it is hard to precisely specify what we want, and hard to verify that a system has genuinely internalised those objectives rather than merely behaving well during training.
  2. Limiting capabilities: we could restrict what systems are able to do,[5] but there are strong incentives to build increasingly general and capable systems.
  3. Controlling circumstances: we could restrict the options and resources available to a system, but this becomes harder as systems become more capable and more widely deployed.

In addition to the difficulties above, several factors make alignment particularly difficult compared to other safety problems, such as building nuclear reactors.

Powerful, misaligned AI systems could disempower humanity

Sufficiently advanced AI systems that are misaligned are likely to realise that taking power over humans and the environment will allow them to pursue their other goals, regardless of what those other goals are.[6]

Power-seeking behaviour might include things like manipulating human attempts to monitor, retrain, or shut off misaligned systems; blackmailing, bribing or manipulating humans; attempting to accumulate money and computational resources; making unauthorised backup copies of themselves; manipulating or weakening human institutions and politics; taking control of automated factories and critical infrastructure.

Power-seeking, unlike other forms of misalignment, is crucially important because the scale of potential failures is enormous. An AI system that is attempting to take control could, well, take control. Our relationship to these powerful AI systems might be similar to the relationship that chimpanzees have to us: the fate of our friends, family, and community would rest in the hands of a more capable and intelligent species that may not share our values. This would be a very bad outcome for us, perhaps as bad as extinction.[7]

Knowing this, we still might deploy advanced, misaligned AI systems

It should be reasonably clear that there are strong reasons to avoid deploying advanced misaligned AI systems. There are several reasons to be concerned that they may be deployed anyway:

  1. Unilateralist's curse: we might expect that, once some people can build advanced AI systems, the number of people with potential access to those systems will grow over time. Different actors may have different views about how dangerous AI systems are, and the most optimistic and least cautious might end up deploying even if doing so presents clear dangers.
  2. Externalities: even if it is in humanity's interest to avoid deploying potentially misaligned systems, some individuals might stand to personally gain a lot of money, power, or prestige, and face only a fraction of the cost if things go poorly. This might be similar to how, while it would be in humanity's interest to reduce carbon emissions, many corporations are incentivised to continue emitting large amounts.
  3. Race dynamics: if several groups are competing to build AI systems first, then they might know that they could gain an advantage by cutting corners on expensive or difficult alignment strategies. This could generate a race to the bottom where the first AI systems to be deployed are the quickest and cheapest (and least safe) to develop.
  4. Apparent safety: advanced AI systems might offer opportunities to solve major problems, generate wealth, and rapidly advance science and technology. They may also actively deceive us about their level of alignment. Without clear signs of misalignment, it might be difficult to justify ignoring the promise of these systems even if we think they could be manipulating us (in ways that we can't detect). We might also overestimate our ability to control advanced systems.

Of course, if we notice an AI is actively deceiving us and seeking power, we would try to stop it. By the time we deploy advanced AI systems of the kind that pose a significant risk, we are likely to have more advanced tools for detecting, constraining, responding to, and defending against misaligned behaviour.

Even so, we may fail to contain the damage. First, as AI capabilities increase, we will be at an increasing disadvantage, especially if this happens in hours or days rather than months or years. Second, AI systems may deliberately hide their misalignment and interfere with our attempts to monitor and correct them, so we may not detect misaligned behaviour early on. Third, even if we do get warning shots, we may fail to respond quickly and decisively, or face problems that are too difficult for us to solve. Unfortunately, many potential solutions may only superficially solve the problem, by essentially teaching the system to more carefully avoid detection. Finally, all of the factors that lead misaligned systems to be deployed in the first place would contribute to the difficulty of correcting alignment failures after deployment.

Conclusion

Carlsmith illustrates how AI could lead to human disempowerment:

  1. It could become possible and feasible to build relevantly powerful, agentic AI systems, and we might have strong incentives to do so.
  2. It might be much harder to build these systems such that they are aligned to our values, compared to building systems that are misaligned but are still superficially attractive to deploy.
  3. If deployed, misaligned systems might seek power over humans in high-impact ways, perhaps to the point of completely disempowering humanity.

Overall, Carlsmith thinks there is a greater than 10% chance that the three events above all occur by 2070. If Carlsmith is right, then we face a substantial existential risk from AI systems within our lifetimes, or our children's lifetimes.

Sources

Nick Bostrom (2012). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds & Machines 22.

Ajeya Cotra (2020). Draft report on AI timelines. AI Alignment Forum.

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, & Owain Evans (2018). Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research 62.

Toby Ord (2020). The Precipice: Existential Risk and the Future of Humanity. Bloomsbury Publishing.

Cover image by Pixabay.

  1. ^

    This summary also draws on the longer version of this report.

  2. ^

    In some places this essay is framed as an argument for why the risk is high, but I think it is better characterised as an explanation of the worldview in which the risk is high, or a rough quantitative model for estimating the existential risk from power-seeking AI. This model might be useful to work through even for readers that would place very different probabilities on these possibilities.

  3. ^

    This is a courtesy I try to extend to all authors. My aim is to helpfully summarise this essay, rather than to offer a strong independent review. You can find reviews here.

  4. ^

    Carlsmith seems to be making a judgement call here based on evidence such as: a draft technical report that models the year in which we could probably train a model as large as the human brain, and concludes that ‘transformative’ AI is more likely than not by 2065 (Cotra, 2020), where transformative AI is defined as a model that could have “at least as profound an impact on the world’s trajectory as the Industrial Revolution did”. A public forecasting platform called Metaculus predicted that it was more likely than not that there would “be Human-machine intelligence parity before 2040” (as of September 2023 this is now above 90%), and gave a median of 2038 for when “the first weakly general AI system [will] be devised, tested, and publicly announced” (as of September 2023 this is now predicted to be 2027). Experts answer questions like whether “unaided machines can accomplish every task better and more cheaply than human workers” by 2066 very differently depending on exactly how the question is phrased, giving probabilities sometimes as low as 3% and sometimes above 50% (Grace et al., 2018).

  5. ^

    Similarly, we could create systems that are only able to pursue short-term objectives, and are thus unlikely to pursue deception that would only pay off in the long term. We could also try to build specialised systems that pursue narrow tasks, which would likely do less damage if they were misaligned, and would also be easier to control and incentivise to do what we want.

  6. ^

    This is called the "Instrumental Convergence" hypothesis. See Bostrom (2012).

  7. ^

    This could be the case whether or not every human dies. Ord (2020) defines an existential catastrophe as the destruction of humanity’s long-term potential. Carlsmith thinks that the involuntary disempowerment of humanity would likely be equivalent to extinction in this sense. An important subtlety is that Toby Ord wants to define "humanity" broadly, so that it includes descendants we become or create. In this sense, a misaligned AI system could be seen as an extension of humanity, and if that future was good, then perhaps humanity's disempowerment would not be like extinction. But Carlsmith thinks that, if he reflected on it more, he would conclude that unintentional disempowerment is very likely to be equivalent to extinction.