Summary: Existential risk from power-seeking AI by Joseph Carlsmith

By rileyharris @ 2023-10-28T15:05 (+11)

This is a linkpost to https://www.millionyearview.com/p/advanced-artificial-intelligence

Within our lifetimes we might witness the deployment of advanced artificial intelligence (AI) systems with the ability to run companies, push forward political campaigns, and advance science. The basic concern is that these AI systems could pursue goals different from what any human intended. In particular, a sufficiently advanced system would be capable of sophisticated planning and strategy, and so would see the usefulness of gaining power: influence, weapons, money, and greater cognitive resources. This is deeply troubling, as it may be difficult to prevent such systems from collectively disempowering humanity. In "Existential risk from power-seeking AI"[1] Carlsmith clarifies the main reasons to think that power-seeking AI might present an extreme risk to humanity.[2]

Conflict of interest: I have received grants from Open Philanthropy, including for work on this blog. Although I asked Carlsmith directly for feedback on this piece,[3] Open Philanthropy had no direct input.

More likely than not, we will see advanced AI systems within our lifetimes, or our children's lifetimes

Carlsmith builds his case for AI risk around estimates for when AI systems could be developed that have:

  1. Advanced capability: they outperform the best humans on some set of tasks (such as science, persuasion, or economic activity) which, when performed well, grant significant power in the world.
  2. Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
  3. Strategic awareness: the models they use in planning represent, with reasonable accuracy, the dynamics of gaining and maintaining power over humans and the real-world environment.

Carlsmith believes that it is more likely than not that we will be able to build agents with all of the capabilities described above by 2070.[4]

It could be difficult to align or control these systems

If we create advanced AI systems, how can we ensure they are aligned, in the sense that they do what their designers want them to do? Here, misalignment looks less like failing or breaking while trying to do what the designers want, and more like deliberately doing something the designers don't want. It would look less like an employee underperforming, and more like an employee trying to embezzle funds from their employer.

There are several strategies for aligning AI systems, but each of them faces difficulties:

  1. Shaping objectives: we could try to instil the objectives we intend, but it is hard to precisely specify what we want, and hard to verify that a system has genuinely internalised those objectives rather than merely behaving well during training.
  2. Limiting capabilities: we could restrict what systems are able to do,[5] but there are strong incentives to build increasingly general and capable systems.
  3. Controlling circumstances: we could restrict the options and resources available to a system, but this becomes harder as systems become more capable and more widely deployed.

In addition to the difficulties above, several factors make alignment particularly difficult compared to other safety problems, such as building nuclear reactors.

Powerful, misaligned AI systems could disempower humanity

Sufficiently advanced AI systems that are misaligned are likely to realise that taking power over humans and the environment will allow them to pursue their other goals, regardless of what those other goals are.[6]

Power-seeking behaviour might include things like manipulating human attempts to monitor, retrain, or shut off misaligned systems; blackmailing, bribing or manipulating humans; attempting to accumulate money and computational resources; making unauthorised backup copies of themselves; manipulating or weakening human institutions and politics; taking control of automated factories and critical infrastructure.

Power-seeking, unlike other forms of misalignment, is crucially important because the scale of potential failures is enormous. An AI system that is attempting to take control could, well, take control. Our relationship to these powerful AI systems might be similar to the relationship that chimpanzees have to us: the fate of our friends, family, and community would rest in the hands of a more capable and intelligent species that may not share our values. This would be a very bad outcome for us, perhaps as bad as extinction.[7]

Knowing this, we still might deploy advanced, misaligned AI systems

It should be reasonably clear that there are strong reasons to avoid deploying advanced misaligned AI systems. There are several reasons to be concerned that they may be deployed anyway:

  1. Unilateralist's curse: we might expect that, once some people can build advanced AI systems, the number of people with potential access to those systems will grow over time. Different actors may have different views about how dangerous AI systems are, and the most optimistic and least cautious might end up deploying even if doing so presents clear dangers.
  2. Externalities: even if it is in humanity's interest to avoid deploying potentially misaligned systems, some individuals might stand to personally gain a lot of money, power, or prestige, and face only a fraction of the cost if things go poorly. This might be similar to how, while it would be in humanity's interest to reduce carbon emissions, many corporations are incentivised to continue emitting large amounts.
  3. Race dynamics: if several groups are competing to build AI systems first, then they might know that they could gain an advantage by cutting corners on expensive or difficult alignment strategies. This could generate a race to the bottom where the first AI systems to be deployed are the quickest and cheapest (and least safe) to develop.
  4. Apparent safety: advanced AI systems might offer opportunities to solve major problems, generate wealth, and rapidly advance science and technology. They may also actively deceive us about their level of alignment. Without clear signs of misalignment, it might be difficult to justify ignoring the promise of these systems even if we think they could be manipulating us (in ways that we can't detect). We might also overestimate our ability to control advanced systems.

Of course, if we notice an AI is actively deceiving us and seeking power, we would try to stop it. By the time we deploy advanced AI systems of the kind that pose a significant risk, we are likely to have more advanced tools for detecting, constraining, responding to, and defending against misaligned behaviour.

Even so, we may fail to contain the damage. First, as AI capabilities increase, we will be at an increasing disadvantage, especially if this happens in hours or days rather than months or years. Second, AI systems may deliberately hide their misalignment and interfere with our attempts to monitor and correct them, so we may not detect misaligned behaviour early on. Third, even if we do get warning shots, we may fail to respond quickly and decisively, or face problems that are too difficult for us to solve. Unfortunately, many potential solutions may only superficially solve the problem, by essentially teaching the system to more carefully avoid detection. Finally, all of the factors that lead misaligned systems to be deployed in the first place would contribute to the difficulty of correcting alignment failures after deployment.

Conclusion

Carlsmith illustrates how AI could lead to human disempowerment:

  1. It could become possible and feasible to build relevantly powerful, agentic AI systems, and we might have strong incentives to do so.
  2. It might be much harder to build these systems such that they are aligned to our values, compared to building systems that are misaligned but are still superficially attractive to deploy.
  3. If deployed, misaligned systems might seek power over humans in high-impact ways, perhaps to the point of completely disempowering humanity.

Overall, Carlsmith thinks there is a greater than 10% chance that the three events above all occur by 2070. If Carlsmith is right, then we face a substantial existential risk from AI systems within our lifetimes, or our children's lifetimes.

Sources

Nick Bostrom (2012). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds & Machines 22.

Ajeya Cotra (2020). Draft report on AI timelines. AI Alignment Forum.

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, & Owain Evans (2018). Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research 62.

Toby Ord (2020). The Precipice: Existential Risk and the Future of Humanity. Bloomsbury Publishing.

Cover image by Pixabay.

  1. ^

    This summary also draws on the longer version of this report.

  2. ^

    In some places this essay is framed as an argument for why the risk is high, but I think it is better characterised as an explanation of the worldview in which the risk is high, or a rough quantitative model for estimating the existential risk from power-seeking AI. This model might be useful to work through even for readers that would place very different probabilities on these possibilities.

  3. ^

    This is a courtesy I try to extend to all authors. My aim is to helpfully summarise this essay, rather than to offer a strong independent review. You can find reviews here.

  4. ^

    Carlsmith seems to be making a judgement call here based on evidence such as: a draft technical report that models the year in which we could probably train a model as large as the human brain, and concludes that ‘transformative’ AI is more likely than not by 2065 (Cotra, 2020), where transformative AI is defined as a model that could have “at least as profound an impact on the world’s trajectory as the Industrial Revolution did”. A public forecasting platform called Metaculus predicted that it was more likely than not that there would “be Human-machine intelligence parity before 2040” (as of September 2023 this is now above 90%), and gave a median of 2038 for when “the first weakly general AI system [will] be devised, tested, and publicly announced” (as of September 2023 this is now predicted to be 2027). Experts answer questions like whether “unaided machines can accomplish every task better and more cheaply than human workers” by 2066 very differently depending on exactly how the question is phrased, giving probabilities sometimes as low as 3% and sometimes above 50% (Grace et al., 2018).

  5. ^

    Similarly, we could create systems that are only able to pursue short-term objectives, and are thus unlikely to pursue deception that would only pay off in the long term. We could also try to build specialised systems that pursue narrow tasks, which would likely do less damage if they were misaligned, and would also be easier to control and incentivise to do what we want.

  6. ^

    This is called the "Instrumental Convergence" hypothesis. See Bostrom (2012).

  7. ^

    This could be the case whether or not every human dies. Ord (2020) defines an existential catastrophe as the destruction of humanity’s long-term potential. Carlsmith thinks that the involuntary disempowerment of humanity would likely be equivalent to extinction in this sense. An important subtlety is that Toby Ord wants to define "humanity" broadly, so that it includes descendants we become or create. In this sense, a misaligned AI system could be seen as an extension of humanity, and if that future was good, then perhaps humanity's disempowerment would not be like extinction. But Carlsmith thinks that, if he reflected on it more, he would conclude that unintentional disempowerment is very likely to be equivalent to extinction.