What is the current most representative EA AI x-risk argument?
By Matthew_Barnett @ 2023-12-15T22:04 (+117)
I tend to disagree with most EAs about existential risk from AI. Unfortunately, my disagreements are all over the place. It's not that I disagree with one or two key points: there are many elements of the standard argument that I diverge from, and depending on the audience, I don't know which points of disagreement people think are most important.
I want to write a post highlighting all the important areas where I disagree, and offering my own counterarguments as an alternative. This post would benefit from responding to an existing piece, along the same lines as Quintin Pope's article "My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"". By contrast, it would be intended to address the EA community as a whole, since I'm aware many EAs already disagree with Yudkowsky even if they buy the basic arguments for AI x-risks.
My question is: what is the current best single article (or set of articles) that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?
I was considering replying to Joseph Carlsmith's article, "Is Power-Seeking AI an Existential Risk?", since it seemed reasonably comprehensive and representative of the concerns EAs have about AI x-risk. However, I'm a bit worried that the article is not very representative of EAs who have substantial probabilities of doom, since he originally estimated a total risk of catastrophe at only 5% before 2070. In May 2022, Carlsmith changed his mind and reported a higher probability, but I am not sure whether this is because he has been exposed to new arguments, or because he simply thinks the stated arguments are stronger than he originally thought.
I suspect I have both significant moral disagreements and significant empirical disagreements with EAs, and I want to include both in such an article, while mainly focusing on the empirical points. For example, I have the feeling that I disagree with most EAs about:
- How bad human disempowerment would likely be from a utilitarian perspective, and what "human disempowerment" even means in the first place
- Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
- How likely AIs are to coordinate near-perfectly with each other as a unified front, leaving humans out of their coalition
- Whether we should expect AI values to be "alien" (like paperclip maximizers) in the absence of extraordinary efforts to align them with humans
- Whether the AIs themselves will be significant moral patients, on par with humans
- Whether there will be a qualitative moment when "the AGI" is created, rather than systems incrementally getting more advanced, with no clear finish line
- Whether we get only "one critical try" to align AGI
- Whether "AI lab leaks" are an important source of AI risk
- How likely AIs are to kill every single human if they are unaligned with humans
- Whether there will be a "value lock-in" event soon after we create powerful AI that causes values to cease their evolution over the coming billions of years
- How bad problems related to "specification gaming" will be in the future
- How society is likely to respond to AI risks, and whether they'll sleepwalk into a catastrophe
However, I also disagree with points made by many other EAs who have argued against the standard AI risk case. For example, I think that:
- AIs will eventually become vastly more powerful and smarter than humans. So, I think AIs will eventually be able to "defeat all of us combined"
- I think a benign "AI takeover" event is very likely even if we align AIs successfully
- AIs will likely be goal-directed in the future. I don't think, for instance, that we can just "not give the AIs goals" and then everything will be OK.
- I think it's highly plausible that AIs will end up with substantially different values from humans (although I don't think this will necessarily cause a catastrophe).
- I don't think we have strong evidence that deceptive alignment is an easy problem to solve at the moment
- I think it's plausible that AI takeoff will be relatively fast, and the world will be dramatically transformed over a period of several months or a few years
- I think short timelines, meaning a dramatic transformation of the world within 10 years from now, are pretty plausible
I'd like to elaborate on as many of these points as possible, preferably by responding to direct quotes from the representative article arguing for the alternative, more standard EA perspective.
Matthew_Barnett @ 2023-12-18T23:33 (+29)
It's unclear whether I'll end up writing this critique, but if I do, then based on the feedback to the post so far, I'd likely focus on the arguments made in the following posts (which were suggested by Ryan Greenblatt):
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
- Scheming AIs: Will AIs fake alignment during training in order to get power?
The reason is that these two posts seem closest to presenting a detailed and coherent case for expecting a substantial risk of a catastrophe that researchers still broadly feel comfortable endorsing. Additionally, the DeepMind AGI safety team appears to endorse the first post as the "closest existing threat model" to their view.
I'd prefer not to focus on List of Lethalities, even though I disagree with the views expressed in it even more strongly than with the views in the other listed posts. My guess is that criticism of MIRI threat models, while warranted, is already relatively saturated compared to criticism of threat models from more "mainstream" researchers, although I'd still prefer more detailed critiques of both.
If I were to write this critique, I would likely try to cleanly separate my empirical arguments from the normative ones, probably by writing separate posts about them and focusing first on the empirical arguments. That said, I still think both topics are important, since I think many EAs seem to have a faulty background chain of reasoning that flows from their views about human disempowerment risk, concluding that such risks override most other concerns.
For example, I suspect either a majority or a substantial minority of EAs would agree with the claim that it is OK to let 90% of humans die (e.g. of aging), if that reduced the risk of an AI catastrophe by 1 percentage point. By contrast, I think that type of view seems to naively prioritize a concept of "the human species" far above actual human lives in a way that is inconsistent with careful utilitarian reasoning, empirical evidence, or both. And I do not think this logic merely comes down to whether you have person-affecting views or not.
Ryan Greenblatt @ 2023-12-20T04:46 (+6)
That said, I still think both topics are important, since I think many EAs seem to have a faulty background chain of reasoning that flows from their views about human disempowerment risk, concluding that such risks override most other concerns.
For example, I suspect either a majority or a substantial minority of EAs would agree with the claim that it is OK to let 90% of humans die (e.g. of aging), if that reduced the risk of an AI catastrophe by 1 percentage point. By contrast, I think that type of view seems to naively prioritize a concept of "the human species" far above actual human lives in a way that is inconsistent with careful utilitarian reasoning, empirical evidence, or both. And I do not think this logic merely comes down to whether you have person-affecting views or not.
I currently don't think these normative arguments make much of a difference in prioritization or decision making in practice. So, I think this probably isn't that important to argue about.
Perhaps the most important case in which they would lead to very different decision making is the case of pausing AI (or trying to speed it up). Strong longtermists likely want to pause AI (at the optimal time) until the reduction in p(doom) per year of delay is around the same as the annual exogenous risk of doom. (This includes the chance of societal disruption which makes the situation worse and then results in doom, for instance nuclear-war-induced societal collapse which results in building AI far less safely. Gradual changes in power over time also seem relevant, e.g. China.) I think the go-ahead-point for longtermists probably looks like a 0.1% to 0.01% reduction in p(doom) per year of delay, but this might depend on how optimistic you are about other aspects of society. Of course, if we could coordinate sufficiently to also eliminate other sources of risk, the go-ahead-point might drop considerably.
ETA: note that waiting until the reduction in p(doom) per year of delay is 0.1% does not imply that the final p(doom) is 0.1%. It's probably notably higher, maybe over an order of magnitude higher.
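One rough way to write down this stopping rule (just a sketch; the notation is mine, and it treats exogenous risk as accruing at a roughly constant annual rate):

$$\text{go ahead once}\quad \left|\frac{\Delta P(\text{doom})}{\text{additional year of delay}}\right| \;\lesssim\; r_{\text{exogenous}}$$

With $r_{\text{exogenous}}$ somewhere around 0.1% to 0.01% per year, you keep delaying while each extra year of delay buys more doom reduction than that, and go ahead once it buys less.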
[Low confidence] If we apply the preferences of typical people (but gloomier empirical views about AI), then it seems very relevant that people broadly don't seem to care that much about saving the elderly, life extension, or getting strong versions of utopia for themselves before they die. But they do care a lot about avoiding societal collapse and ruin. And they care some about the continuity of human civilization. So the go-ahead-point in reduction in doom per year of delay, if we use the preferences of normal people, might look pretty similar to the longtermist one (though it's a bit confusing to apply somewhat incoherent preferences). I think it's probably less than a factor of 10 higher, maybe a factor of 3 higher. Also, normal people care about the absolute level of risk: if we couldn't reduce risk below 20%, then it's plausible that normal people would never want to build AI, because they care more about not dying in a catastrophe than about not dying of old age, etc.
If we instead assume something like utilitarian person-affecting views (let's say only caring about humans for simplicity), but with strongly diminishing returns (e.g. logarithmic) above the quality of life of current Americans and with similarly diminishing returns after 500 years of life, then I think you end up with roughly a 1% reduction in P(doom) per year of delay as the go-ahead point. This probably leads to pretty similar decisions in most cases.
(Separately, pure person-affecting views seem super implausible to me. Indifference to the torture of an arbitrary number of new future people seems like a very strong bullet to bite from my perspective. If you have asymmetric person-affecting views, then you plausibly get dominated by the potential for reducing suffering in the long run.)
The only views which seem to lead to pretty different conclusions are views with radically higher discount rates, e.g. pure person-affecting views where you care mostly about the lives of short-lived animals, or perhaps some views where you care about fulfilling the preferences of current humans (who might have high discount rates on their preferences?). But it's worth noting that these views seem indifferent to the torture of an arbitrary number of future people in a way that feels pretty implausible to me.
By contrast, I think that type of view seems to naively prioritize a concept of "the human species" far above actual human lives in a way that is inconsistent with careful utilitarian reasoning, empirical evidence, or both.
I don't think this depends on the concept of "the human species". Personally, I care about the overall utilization of resources in the far future (and I imagine many people with a similar perspective agree with me here). For instance, I think literal extinction in the event of AI takeover is unlikely, and also not very importantly worse than full misaligned AI takeover without extinction. Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs.
Separately, I think scope-sensitive/linear-returns person-affecting views are likely dominated by the potential for using a high fraction of future resources to simulate huge numbers of copies of existing people living happy lives. In practice, no one goes here, because the actual thing people mean when they say "person-affecting views" is more like caring about the preferences of currently existing humans in a diminishing-returns way. I think the underlying crux isn't well described as person-affecting vs non-person-affecting and is better described as diminishing returns.
Matthew_Barnett @ 2023-12-23T08:46 (+3)
I think the go-ahead-point for longtermists probably looks like 0.1% to 0.01% reduction in p(doom) per year for longtermists, but this might depend on how optimistic you are about other aspects of society.
To be clear, my argument would be that the go-ahead-point for longtermists likely looks much higher, like a 10% total risk of catastrophe. Actually that's not exactly how I'd frame it, since what matters more is how much we can reduce the risk of catastrophe by delaying, not just the total risk of a catastrophe. But I'd likely consider a world where we delay AI until the total risk falls below 0.1% to be intolerable from several perspectives.
I guess one way of putting my point here is that you probably think of "human disempowerment" as a terminal state that is astronomically bad, and probably far worse than "all currently existing humans die". But I don't really agree with this. Human disempowerment just means that the species Homo sapiens is disempowered, and I don't see why we should draw the relevant moral boundary around our species. We can imagine other boundaries like "our current cultural and moral values", which I think would drift dramatically over time even if the human species remained.
I'm just not really attached to the general frame here. I don't identify much with "human values" in the abstract as opposed to other salient characteristics of intelligent beings. I think standard EA framing around "humans" is simply bad in an important way relevant to these arguments (and this includes most attempts I've seen to broaden the standard arguments to remove references to humans). Even when an EA insists their concern isn't about the human species per se I typically end up disagreeing on some other fundamental point here that seems like roughly the same thing I'm pointing at. Unfortunately, I consistently have trouble conveying this point to people, so I'm not likely to be understood here unless I give a very thorough argument.
I suspect it's a bit like the arguments vegans have with non-vegans about whether animals are OK to eat because they're "not human". There's a conceptual leap from "I care a lot about humans" to "I don't necessarily care a lot about the human species boundary" that people don't reliably find intuitive except perhaps after a lot of reflection. Most ordinary instances of arguments between vegans and non-vegans are not going to lead to people successfully crossing this conceptual gap. It's just a counterintuitive concept for most people.
Perhaps as a brief example to help illustrate my point, it seems very plausible to me that I would identify more strongly with a smart behavioral LLM clone of me trained on my personal data compared to how much I'd identify with the human species. This includes imperfections in the behavioral clone arising from failures to perfectly generalize from my data (though excluding extreme cases like the entity not generalizing any significant behavioral properties at all). Even if this clone were not aligned with humanity in the strong sense often meant by EAs, I would not obviously consider it bad to give this behavioral clone power, even at the expense of empowering "real humans".
On top of all of this, I think I disagree with your argument about discount rates, since I think you're ignoring the case for high discount rates based on epistemic uncertainty, rather than pure time preferences.
Ryan Greenblatt @ 2023-12-23T17:42 (+5)
Also, another important clarification is that my views are probably quite different from that of the median EA who identifies as longtermist. So I'd be careful not to pattern match me.
(And I prefer not to identify with movements, so I'd say that I'm not an EA.)
Ryan Greenblatt @ 2023-12-23T17:28 (+1)
I think you misunderstood the points I was making. Sorry for writing an insufficiently clear comment.
Actually that's not exactly how I'd frame it, since what matters more is how much we can reduce the risk of catastrophe by delaying, not just the total risk of a catastrophe.
Agreed that's why I wrote "0.1% to 0.01% reduction in p(doom) per year". I wasn't talking about the absolute level of doom here. I edited my comment to say "0.1% to 0.01% reduction in p(doom) per year of delay" which is hopefully more clear. The expected absolute level of doom is probably notably higher than 0.1% to 0.01%.
Human disempowerment just means that the species Homo sapiens is disempowered, and I don't see why we should draw the relevant moral boundary around our species.
I don't. That's why I said "Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs."
Also, note that I think AI takeover is unlikely to lead to extinction.
ETA: I'm pretty low confidence about a bunch of these tricky moral questions.
I would be reasonably happy (e.g. 50-90% of the value relative to human control) to turn the universe over to aliens. The main reduction in value is due to complicated questions about the likely distribution of values of aliens. (E.g., how likely is it that aliens are very sadistic or lack empathy? This is probably still not exactly the right question.) I'd also be pretty happy with (e.g.) uplifted dogs (dogs which are made to be as intelligent as humans while keeping the core of "dog", whatever that means), so long as the uplifting process was reasonable.
I think the exact same questions apply to AIs, I just have empirical beliefs that AIs which end up taking over are likely to do predictably worse things with the cosmic endowment (e.g. 10-30% of the value). This doesn't have to be true; I can imagine learning facts about AIs which would make me feel a lot better about AI takeover. Note that conditioning on the AI taking over is important here. I expect to feel systematically better about smart AIs with long-horizon goals which are either not quite smart enough to take over or don't take over (for various complicated reasons).
More generally, I think I basically endorse the views here (which discusses the questions of when you should cede power etc.).
Note that in my ideal future it seems really unlikely that we end up spending a non-trivial fraction of future resources running literal humans instead of finding better stuff to spend computational resources on (e.g. beings with experiences that are wildly better than our experiences, or beings which are vastly cheaper to run).
(That said, we can and should let all humans live for as long as they want, and dedicate some fraction of resources to basic continuity of human civilization insofar as people want this. 1/10^12 of the resources would easily suffice from my perspective, but I'm sympathetic to making this more like 1/10^3 or 1/10^6.)
Perhaps as a brief example to help illustrate my point, it seems very plausible to me that I would identify more strongly with a smart behavioral LLM clone of me trained on my personal data compared to how much I'd identify with the human species.
I think "identify" is the wrong word from my perspective. The key question is "what would the smart behavioral clone do with the vast amount of future resources". That said, I'm somewhat sympathetic to the claim that this behavioral clone would do basically reasonable things with future resources. I also feel reasonably optimistic about pure imitation LLM alignment for somewhat similar reasons.
On top of all of this, I think I disagree with your argument about discount rates, since I think you're ignoring the case for high discount rates based on epistemic uncertainty, rather than pure time preferences.
Am I ignoring this case? I just think we should treat "what do I terminally value"[1] and "what is the best route to achieving that" as mostly separate questions. So, we should talk about whether "high discount rates due to epistemic uncertainty" is a good reasoning heuristic for achieving my terminal values separately from what my terminal values are.
Separately, I think a high per-year discount rate due to epistemic uncertainty seems pretty clearly wrong. I'm pretty confident that I can influence, to at least a small degree (e.g. I can affect the probability by >10^-10, probably much greater), whether or not the moral equivalent of 10^30 people are tortured in 10^6 years. It seems like a very bad idea from my perspective to put literally zero weight on this due to 1% annual discount rates.
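To spell out the arithmetic behind "literally zero weight" (a rough back-of-the-envelope using the numbers above):

$$0.99^{10^6} = e^{10^6 \ln 0.99} \approx e^{-10050} \approx 10^{-4365}$$

So a 1% annual discount rate multiplies anything $10^6$ years away by roughly $10^{-4365}$, which wipes out even an expected $10^{-10} \times 10^{30} = 10^{20}$ people affected.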
For less specific things like "does a civilization descended from and basically endorsed by humans exist in 10^6 years", I think I have considerable influence. E.g., I can affect the probability by >10^-6 (in expectation). (This influence is distinct from the question of how valuable this is to influence, but we were talking about epistemic uncertainty here.)
My guess is that we end up with basically a moderate fixed discount on very-long-run future influence due to uncertainty over how the future will go, but this is more like 10% or 1% than 10^-30. And, because the long-run future still dominates in my views, this just multiplies through all calculations and ends up not mattering much for decision making. (I think acausal trade considerations implicitly mean that I would be willing to trade off long-run considerations in favor of things which look good as weighted by current power structures (e.g. helping homeless children in the US) if I had a 1,000x-10,000x opportunity to do this. E.g., if I could stop 10,000 US children from being homeless with a day of work and couldn't do direct trade, I would still do this.)
- ^
More precisely, what would my CEV (Coherent Extrapolated Volition) want, and how do I handle uncertainty about what my CEV would want?
Matthew_Barnett @ 2023-12-23T20:13 (+2)
Agreed that's why I wrote "0.1% to 0.01% reduction in p(doom) per year". I wasn't talking about the absolute level of doom here. I edited my comment to say "0.1% to 0.01% reduction in p(doom) per year of delay" which is hopefully more clear
Ah, sorry. I indeed interpreted you as saying that we would reduce p(doom) to 0.01-0.1% per year, rather than saying that each year of delay reduces p(doom) by that amount. I think that view is more reasonable, but I'd still likely put the go-ahead-number higher.
That's why I said "Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs."
Apologies again for misinterpreting. I didn't know how much weight to put on the word "potentially" in your comment. Although note that I said, "Even when an EA insists their concern isn't about the human species per se I typically end up disagreeing on some other fundamental point here that seems like roughly the same thing I'm pointing at." I don't think the problem is literally that EAs are anthropocentric, but I think they often have anthropocentric intuitions that influence these estimates.
Maybe a more accurate summary is that people have a bias towards "evolved" or "biological" beings, which I think might explain why you'd be a little happier to hand over the universe to aliens, or dogs, but not AIs.
I would be reasonably happy (e.g. 50-90% of the value relative to human control) to turn the universe over to aliens. [...]
I think the exact same questions apply to AIs, I just have empirical beliefs that AIs which end up taking over are likely to do predictably worse things with the cosmic endowment (e.g. 10-30% of the value).
I guess I mostly think that's a pretty bizarre view, with some obvious reasons for doubt, and I don't know what would be driving it. The process through which aliens would get values like ours seems much less robust than the process through which AIs get our values. AIs are trained on our data, and humans will presumably care a lot about aligning them (at least at first).
From my perspective this is a bit like saying you'd prefer aliens to take over the universe rather than handing control over to our genetically engineered human descendants. I'd be very skeptical of that view too for some basic reasons.
Overall, upon learning your view here, I don't think I'd necessarily diagnose you as having the intuitions I alluded to in my original comment, but I think there's likely something underneath your views that I would strongly disagree with, if I understood your views further. I find it highly unlikely that AGIs will be even more "alien" from the perspective of our values than literal aliens (especially if we're talking about aliens who themselves build their own AIs, genetically engineer themselves, and so on).
Ryan Greenblatt @ 2023-12-24T03:03 (+1)
If you're interested in diving into "how bad/good is it to cede the universe to AIs", I strongly think it's worth reading and responding to "When is unaligned AI morally valuable?", which is the current state of the art on the topic (same thing I linked above). I now regret rehashing a bunch of these arguments which I think are mostly made better here. In particular, I think the case that "AIs created in the default way might have low moral value" is reasonably well argued for here:
Many people have a strong intuition that we should be happy for our AI descendants, whatever they choose to do. They grant the possibility of pathological preferences like paperclip-maximization, and agree that turning over the universe to a paperclip-maximizer would be a problem, but don’t believe it’s realistic for an AI to have such uninteresting preferences.
I disagree. I think this intuition comes from analogizing AI to the children we raise, but that it would be just as accurate to compare AI to the corporations we create. Optimists imagine our automated children spreading throughout the universe and doing their weird-AI-analog of art; but it’s just as realistic to imagine automated PepsiCo spreading throughout the universe and doing its weird-AI-analog of maximizing profit.
It might be the case that PepsiCo maximizing profit (or some inscrutable lost-purpose analog of profit) is intrinsically morally valuable. But it’s certainly not obvious.
Or it might be the case that we would never produce an AI like a corporation in order to do useful work. But looking at the world around us today that’s certainly not obvious.
Neither of those analogies is remotely accurate. Whether we should be happy about AI “flourishing” is a really complicated question about AI and about morality, and we can’t resolve it with a one-line political slogan or crude analogy.
(And the same recommendation for onlookers.)
Matthew_Barnett @ 2023-12-24T03:52 (+2)
I now regret rehashing a bunch of these arguments which I think are mostly made better here.
It's fine if you don't want to continue this discussion. I can sympathize if you find it tedious. That said, I don't really see why you'd appeal to that post in this context (FWIW, I read the post at the time it came out, and just re-read it). I interpret Paul Christiano to mainly be making arguments in the direction of "unaligned AIs might be morally valuable, even if we'd prefer aligned AI" which is what I thought I was broadly arguing for, in contradistinction to your position. I thought you were saying something closer to the opposite of what Paul was arguing for (although you also made several separate points, and I don't mean to oversimplify your position).
(But I agree with the quoted part of his post that we shouldn't be happy with AIs doing "whatever they choose to do". I don't think I'm perfectly happy with unaligned AI. I'd prefer we try to align AIs, just as Paul Christiano says too.)
Ryan Greenblatt @ 2023-12-24T04:34 (+1)
Huh, no, I almost entirely agree with this post, as I noted in my prior comment. I cited this much earlier: "More generally, I think I basically endorse the views here (which discusses the questions of when you should cede power etc.)."
I do think unaligned AI would be morally valuable (I said in an earlier comment that unaligned AIs which take over might capture 10-30% of the value. That's a lot of value.)
I don't think I'm perfectly happy with unaligned AI. I'd prefer we try to align AIs, just as Paul Christiano says too.
I think we've probably been talking past each other. I thought the whole argument here was "how much value do we lose if (presumably misaligned) AI takes over", and you were arguing for "not much, caring about this seems like overly fixating on humanity", and I was arguing "(presumably misaligned) AIs which take over probably result in substantially less value". This now seems incorrect, and we perhaps only have minor quantitative disagreements?
I think it probably would have helped if you were more quantitative here. Exactly how much of the value?
Matthew_Barnett @ 2023-12-24T06:35 (+3)
I thought the whole argument here was "how much value do we lose if (presumably misaligned) AI takes over"
I think the key question here is: compared to what? My position is that we lose a lot of potential value both from delaying AI and from having unaligned AI, but it's not a crazy high reduction in either case. In other words they're pretty comparable in terms of lost value.
Ranking the options in rough order (taking up your offer to be quantitative):
- Aligned AIs built tomorrow: 100% of the value from my perspective
- Aligned AIs built in 100 years: 50% of the value
- Unaligned AIs built tomorrow: 15% of the value
- Unaligned AIs built in 100 years: 25% of the value
Note that I haven't thought about these exact numbers much.
frib @ 2023-12-24T07:15 (+1)
Aligned AIs built in 100 years: 50% of the value
What drives this huge drop? Naive utility would be very close to 100%. (Do you mean "aligned AIs built in 100y if humanity still exists by that point", which includes extinction risk before 2123?)
Matthew_Barnett @ 2023-12-24T09:57 (+4)
I attempted to explain the basic intuitions behind my judgement in this thread. Unfortunately it seems I did a poor job. For the full explanation you'll have to wait until I write a post, if I ever get around to doing that.
The simple, short, and imprecise explanation is: I don't really value humanity as a species as much as I value the people who currently exist, (something like) our current communities and relationships, our present values, and the existence of sentient and sapient life living positive experiences. Much of this will go away after 100 years.
Ryan Greenblatt @ 2023-12-23T21:28 (+1)
TBC, it's plausible that in the future I'll think that "marginally influencing AIs to have more sensible values" is more leveraged than "avoiding AI takeover and hoping that humans (and our chosen successors) do something sensible". I'm partially deferring to others on the view that avoiding AI takeover is the best angle of attack; perhaps I should examine this further.
(Of course, it could be that from a longtermist perspective other stuff is even better than avoiding AI takeover or altering AI values. E.g. maybe one of conflict avoidance, better decision theory, or better human institutions for post singularity is even better.)
I certainly wish the question of how much worse/better AI takeover is relative to human control was investigated more effectively. It seems notable to me how important this question is from a longtermist perspective and how little investigation it has received.
(I've spent maybe 1 person-day thinking about it, and I think probably less than 3 FTE-years have been put into this by people who I'd be interested in deferring to.)
Ryan Greenblatt @ 2023-12-23T21:19 (+1)
The process through which aliens would get values like ours seems much less robust than the process through which AIs get our values. AIs are trained on our data, and humans will presumably care a lot about aligning them (at least at first).
Note that I'm conditioning on AIs successfully taking over which is strong evidence against human success at creating desirable (edit: from the perspective of the creators) AIs.
if I understood your views further. I find it highly unlikely that AGIs will be even more "alien" from the perspective of our values than literal aliens
For an intuition pump, consider future AIs which are trained for the equivalent of 100 million years of next-token-prediction[1] on low quality web text and generated data and then aggressively selected with outcomes based feedback. This outcomes based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
This scenario is somewhat worse than what I expect in the median world. But in practice I expect that it's at least systematically possible to change the training setup to achieve predictably better AI motivations and values. Beyond trying to influence AI motivations with crude tools, it seems even better to have humans retain control, use AIs to do a huge amount of R&D (or philosophy work), and then decide what should actually happen with access to more options.
Another way to put this is that I feel notably better about the decision making of current power structures in the western world and in AI labs than I feel about going with the AI motivations which would likely result from training.
More generally, if you are the sole person in control, it seems strictly better from your perspective to carefully reflect on who/what you want to defer to rather than doing this somewhat arbitrarily (this still leaves open the question of how bad arbitrarily deferring is).
From my perspective this is a bit like saying you'd prefer aliens to take over the universe rather than handing control over to our genetically engineered human descendants. I'd be very skeptical of that view too for some basic reasons.
I'm pretty happy with slow and steady genetic engineering as a handover process, but I would prefer something even slower and more deliberate than this. E.g., existing humans think carefully, for as long as it seems to yield returns, about what beings we should defer to, and then defer to those slightly smarter beings, which think for a long time and defer to other beings, etc., etc.
I guess I mostly think that's a pretty bizarre view, with some obvious reasons for doubt, and I don't know what would be driving it.
Part of my view on aliens or dogs is driven by the principle of "aliens/dogs are in a somewhat similar position to us, so we should be fine with swapping" (roughly speaking) and "the part of my values which seems most dependent on random empirical contingencies about evolved life I put less weight on". These intuitions transfer somewhat less to the AI case.
- ^
Current AIs are trained on perhaps 10-100 trillion tokens, and if we treat 1 token as the equivalent of 1 second, then (100*10^12)/(60*60*24*365) ≈ 3 million years.
Matthew_Barnett @ 2023-12-23T21:50 (+2)
Note that I'm conditioning on AIs successfully taking over which is strong evidence against human success at creating desirable AIs.
I don't think it's strong evidence, for what it's worth. I'm also not sure what "AI takeover" means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
This outcomes based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
Would aliens not also be incentivized to trick us or others? What about other humans? In my opinion, basically all the arguments about AI deception from gradient descent apply in some form to other methods of selecting minds, including evolution by natural selection, cultural learning, and in-lifetime learning. Humans frequently lie to or mislead each other about our motives. For example, if you ask a human what they'd do if they became world dictator, I suspect you'd often get a different answer than the one they'd actually choose if given that power. I think this is essentially the same epistemic position we might occupy with AI.
Also, for a bunch of reasons that I don't currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I'm not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
Ryan Greenblatt @ 2023-12-24T00:35 (+1)
I don't think it's strong evidence, for what it's worth. I'm also not sure what "AI takeover" means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
By "AI takeover", I mean autonomous AI coup/revolution. E.g., violating the law and/or subverting the normal mechanisms of power transfer. (Somewhat unclear exactly what should count tbc, but there are some central examples.) By this definition, it basically always involves subverting the intentions of the creators of the AI, though may not involve violent conflict.
I don't think this is super likely, perhaps 25% chance.
Ryan Greenblatt @ 2023-12-24T00:33 (+1)
Also, for a bunch of reasons that I don't currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I'm not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
I don't strongly disagree with either of these claims, but this isn't exactly where my crux lies.
The key thing is "generally ruthlessly pursuing reward".
I'm checking out of this conversation though.
Matthew_Barnett @ 2023-12-24T04:10 (+2)
The key thing is "generally ruthlessly pursuing reward".
It depends heavily on what you mean by this, but I'm kinda skeptical of the strong version of ruthless reward seekers, for reasons similar to those given in this post. I think AIs by default might be ruthless in some other senses -- since we'll be applying a lot of selection pressure to them to get good behavior -- but I'm not really sure how much weight to put on the fact that AIs will be "ruthless" when evaluating how good they are at being our successors. It's not clear how that affects my evaluation of how much I'd be OK handing the universe over to them, and my guess is the answer is "not much" (absent more details).
Humans seem pretty ruthless in certain respects too, e.g. about survival, or increasing their social status. I'd expect aliens, and potentially uplifted dogs to be ruthless too along some axes depending on how we uplifted them.
I'm checking out of this conversation though.
Alright, that's fine.
Akash @ 2023-12-16T02:24 (+25)
I expect that your search for a "unified resource" will be unsatisfying. I think people disagree enough on their threat models/expectations that there is no real "EA perspective".
Some things you could consider doing:
- Have a dialogue with 1-2 key people you disagree with
- Pick one perspective (e.g., Paul's worldview, Eliezer's worldview) and write about areas where you disagree with it.
- Write up a "Matthew's worldview" doc that focuses more on explaining what you expect to happen and isn't necessarily meant as a "counterargument" piece.
Among the questions you list, I'm most interested in these:
- How bad human disempowerment would likely be from a utilitarian perspective
- Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
- How likely AIs are to kill every single human if they are unaligned with humans
- How society is likely to respond to AI risks, and whether they'll sleepwalk into a catastrophe
Tom Barnes @ 2023-12-18T17:35 (+122)
I agree there's no single unified resource. Having said that, I found Richard Ngo's "five alignment clusters" pretty helpful for bucketing different groups & arguments together. Reposting below:
- MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
- Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that's fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
- Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
- Prosaic cluster. Focuses on empirical ML work and the scaling hypothesis, is typically skeptical of theoretical or conceptual arguments. Short timelines in general. Central members: Dario Amodei, Jan Leike, Ilya Sutskever.
- Mainstream cluster. Alignment researchers who are closest to mainstream ML. Focuses much less on backchaining from specific threat models and more on promoting robustly valuable research. Typically more concerned about misuse than misalignment, although worried about both. Central members: Scott Aaronson, David Bau.
To return to the question "what is the current best single article (or set of articles) that provide a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?", my guess is that these different groups would respond as follows:[1]
- MIRI cluster: List of Lethalities, Sharp Left Turn, Superintelligence
- Structural Risk cluster: Natural selection favours AIs, RAAP
- Constellation cluster: Is Power-seeking AI an x-risk, some Cold Takes posts, Scheming AIs
- Prosaic cluster: Concrete problems in AI safety, [perhaps something more recent?]
- Mainstream cluster: Reform AI Alignment, [not sure - perhaps nothing arguing for >10%?]
- ^
But I could easily be misrepresenting these different groups' "core" arguments, and I haven't read all of these, so I could be misunderstanding them.
Ryan Greenblatt @ 2023-12-16T03:38 (+9)
I expect that your search for a "unified resource" will be unsatisfying. I think people disagree enough on their threat models/expectations that there is no real "EA perspective".
I agree that there is no real "EA perspective", but it seems like there could be a unified doc that a large cluster of people end up roughly endorsing. E.g., I think that if Joe Carlsmith wrote another version of "Is Power-Seeking AI an Existential Risk?" in the next several years, then it's plausible that a relevant cluster of people would end up thinking this basically lays out the key arguments and makes the right arguments. (I'm unsure what I currently think about the old version of the doc, but I'm guessing I'll think it misses some key arguments that now seem more obvious.)
tobytrem @ 2023-12-19T15:02 (+15)
I’ve decided to curate this question post because:
- It exemplifies a truth-seeking approach to critique. The author, Matthew_Barnett, is starting off with a feeling that they disagree with EAs about AI risk, but wants to respond to specific arguments. This is obviously a more time-intensive approach to critique than simply writing up your own impressions, but it is likely to lead to more precise arguments, which are easier to learn from.
- The comments are particularly helpful. I especially think that this comment from Tom Barnes is likely to help a reader who is also asking “What is the current most representative EA AI x-risk argument?”
- I hope curating this post will encourage even more helpful responses from Forum users. AI risk is heterogeneous, and discussions around it are constantly changing, so both newcomers and long interested readers can benefit from answers to this post’s question.
calebp @ 2023-12-17T11:27 (+14)
I think the closest thing to an EA perspective written relatively recently that is all in a single doc is probably this PDF of Holden's Most Important Century sequence on Cold Takes.
Ryan Greenblatt @ 2023-12-16T00:56 (+13)
My question is: what is the current best single article that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?
Unfortunately, I don't think there is any such article which seems basically up-to-date and reasonable to me.
Here are reasonably up-to-date posts which seem pretty representative to me, but aren't comprehensive. Hopefully this is still somewhat helpful:
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
- Scheming AIs: Will AIs fake alignment during training in order to get power?
- What a compute-centric framework says about AI takeoff speeds
On specifically "will the AI literally kill everyone", I think the most up-to-date discussion is here, here, and here.
I think an updated comprehensive case is an open project that might happen in the next few years.
Matthew_Barnett @ 2023-12-16T01:16 (+7)
Thanks. It's unfortunate there isn't any single article that presents the case comprehensively. I'm OK with replying to multiple articles as an alternative.
Regarding the pieces you mentioned:
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
My understanding is that (as per the title) this piece argued that a catastrophe is likely without specific countermeasures, but it seems extremely likely that specific countermeasures will be taken to prevent a catastrophe, at least indirectly. Do you know of any other pieces that argue something more along the lines of "actually there is a decent chance of an AI catastrophe even given normal counter-efforts"?
Scheming AIs: Will AIs fake alignment during training in order to get power?
While I haven't digested this post yet, my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe. I think it is very likely that AIs will sometimes lie to get power, just as humans do, but it seems like there's a lot more that you'd need to argue to show that this might be catastrophic. Am I wrong in my impression?
What a compute-centric framework says about AI takeoff speeds
I'd like to note that I don't think I have any critical disagreements with this piece, and overall it doesn't seem to be directly about AI x-risk per se.
Karthik Tadepalli @ 2023-12-22T05:08 (+4)
it seems extremely likely that specific countermeasures will be taken to prevent a catastrophe, at least indirectly.
This suggests that you hold a view where one of the cruxes with mainstream EA views is "EAs believe there won't be countermeasures, but countermeasures are very likely, and they significantly mitigate the risk from AI beyond what EAs believe." (If that is not one of your cruxes, then you can ignore the rest of this!)
The confusing thing about that is, what if EA activities are a key reason why good countermeasures end up being taken against AI? In that case, EA arguments would be a "victim" of their own success (though no one would be complaining!) But that doesn't seem like a reason to disagree right now, when there is the common ground of "specific countermeasures really need to be taken".
Matthew_Barnett @ 2024-02-08T23:24 (+2)
The confusing thing about that is, what if EA activities are a key reason why good countermeasures end up being taken against AI?
I find that quite unlikely. I think EA activities contribute on the margin, but it seems very likely to me that people would eventually have taken measures against AI risk in the absence of any EA movement.
In general, while I agree we should not take this argument so far that EA ideas become "victims of their own success", I also think neglectedness is a standard barometer EAs have used to judge the merits of their interventions. And I think AI risk mitigation will very likely not be a neglected field in the future. This should substantially downweight our evaluation of AI risk mitigation efforts.
As a trivial example, you'd surely concede that EAs should not try to, e.g., work on making sure that future spacecraft designs are safe? Advanced spacecraft could indeed play a very important role in the future; but it seems unlikely that society would neglect to work on spacecraft safety, making this a pretty unimportant problem to work on right now. To be clear, I definitely don't think the case for working on AI risk mitigation is as bad as the case for working on spacecraft safety, but my point is that the idea I'm trying to convey here applies in both cases.
Ryan Greenblatt @ 2023-12-16T02:13 (+2)
The descriptions you gave all seem reasonable to me. Some responses:
Do you know of any other pieces that argue something more along the lines of "actually there is a decent chance of an AI catastrophe even given normal counter-efforts"?
I'm afraid not. However, I do actually think that relatively minimal countermeasures are at least plausible.
my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe
This seems like a slight understatement to me. I think it argues that it's plausible AIs will systematically take actions to acquire power later. This would then be severe enough to cause a catastrophe if AIs were capable enough to overcome other safeguards in practice.
One argument for risk is as follows:
- It's reasonably likely that powerful AIs will be schemers and this scheming won't be removable with current technology without "catching" the schemer in the act (as argued for by Carlsmith 2023 which I linked)
- Prior to technology advancing enough to remove scheming, these scheming AIs will be able to take over, and they will do so successfully.
Neither step in the argument is trivial. For (2), the key questions are:
- How much will safety technology advance due to the efforts of human researchers prior to powerful AI?
- When scheming AIs first become transformatively useful for safety work, will we be able to employ countermeasures which allow us to extract lots of useful work from these AIs while still preventing them from being able to take over without getting caught? (See our recent work on AI Control for instance.)
- What happens when scheming AIs are caught? Is this sufficient?
- How long will we have with the first transformatively useful AIs prior to much more powerful AI being developed? So, how much work will we be able to extract out of these AIs?
- Can we actually get AIs to productively work on AI safety? How will we check their work, given that they might be trying to screw us over?
- Many of the above questions depend on the strength of the societal response.
This is just focused on the scheming threat model which is not the only threat model.
We (Redwood Research, where I work) might put out some posts soon which indirectly argue for (2) not necessarily going well by default. (These posts will also argue for tractability.)
overall it doesn't seem to be directly about AI x-risk per se.
Agreed, but relatively little time is an important part of the overall threat model so it seems relevant to reference when making the full argument.
MvK @ 2023-12-16T02:11 (+12)
Doesn't go much into probabilities or extinction and may therefore not be what you are looking for, but I've found Dan Hendrycks' overview/introduction to AI risks to be a pretty comprehensive collection. (https://arxiv.org/abs/2306.12001)
(I for one, would love to see someone critique this, although the FAQ at the end is already a good start to some counterarguments and possible responses to those)
Roman Leventov @ 2023-12-17T15:08 (+4)
I second this.
Also worth mentioning: https://arxiv.org/abs/2306.06924 - a paper by Critch and Russell in a very similar genre.
JuanGarcia @ 2023-12-17T21:56 (+1)
I was reading one today that I think is in a similar vein to those you mention:
https://library.oapen.org/bitstream/handle/20.500.12657/75844/9781800647886.pdf?sequence=1#page=226
Wei Dai @ 2023-12-16T07:33 (+7)
Not exactly what you're asking for, but you could use it as a reference for all of the significant risks that different people have brought up, to select which ones you want to further research and address in your response post.
Habryka @ 2023-12-17T20:20 (+6)
This is definitely not a "canonical" answer, but one of the things I find myself most frequently linking back to is Eliezer's List of Lethalities. I do think it is pretty comprehensive in what it covers, though isn't structured as a strict argument.
titotal @ 2023-12-16T01:42 (+6)
I second this request, for similar reasons.
In particular, I'm interested in accounts of the "how" of AI extinction, beyond "oh, it'll be super duper smart so it'll automatically beat us". I think this is a pretty bad excuse not to at least give a vague outline of a realistic scenario, and you aren't going to be changing many minds with it.
Wei Dai @ 2023-12-16T18:19 (+4)
https://www.lesswrong.com/posts/tyE4orCtR8H9eTiEr/stuxnet-not-skynet-humanity-s-disempowerment-by-ai
Nick K. @ 2023-12-16T13:28 (+4)
Again, this is just one salient example, but: do you find it unrealistic that top-human-level persuasion skills (think, interchangeably, Mao, Sam Altman, or FDR, depending on the audience), together with 1 million times ordinary communication bandwidth (i.e. carrying on that many conversations at once), would enable you to take over the world? Or would you argue that AI is never going to get to that level?
Matthew_Barnett @ 2023-12-17T06:51 (+9)
Suppose we brought back Neanderthals, genetically engineered them to be smarter and stronger than us, and integrated them into our society. As their numbers grew, it became clear that, if they teamed up against all of humanity, they could beat us in a one-on-one fight.
In this scenario (taking the stated facts as a given), I'd still be pretty skeptical of the suggestion that there is a substantial chance that humanity will go extinct at the hands of the Neanderthals (at least in the near-to-medium term). Yes, the Neanderthals could kill all of us if they wanted to, but they likely won't want to, for a number of reasons. And my skepticism here goes beyond a belief that they'd be "aligned" with us. They may in fact have substantially different values from Homo sapiens, on average, and yet I still don't think we'd likely go extinct merely because of that.
From this perspective, within the context of the scenario described, I think it would be quite reasonable and natural to ask for a specific plausible account that illustrates why humanity would go extinct if they continued on their current course with the Neanderthals. It's reasonable to ask the same thing about AI.
Ryan Greenblatt @ 2023-12-23T17:31 (+3)
Note that the comment you're replying to says "take over the world" not extinction.
I think extinction is unlikely conditional on takeover (and takeover seems reasonably likely).
A Neanderthal takeover doesn't seem very bad from my perspective, so probably I'm basically fine with that. (Particularly if we ensure that some basic ideas are floating around in Neanderthal culture, like "maybe you should be really thoughtful and careful with what you do with the cosmic endowment".)
Matthew_Barnett @ 2023-12-23T19:57 (+3)
Note that the comment you're replying to says "take over the world" not extinction.
I agree, but the original comment said "In particular, I'm interested in accounts of the "how" of AI extinction".
David Mathers @ 2023-12-16T18:16 (+7)
I think he's partly asking for "take over the world" to be operationalized a bit.
Zach Stein-Perlman @ 2023-12-16T01:14 (+4)
Denis @ 2023-12-20T19:44 (+2)
I think it's clear from your search results and the answers below that there isn't one representative position.
But even if there were, I think it's more useful for you to just make your arguments in the way you've outlined, without focusing on one general article to disagree with.
There's a very specific reason for this: Think about the target audience. These questions are now vital topics being discussed at government level, which impact national and international policies, and corporate strategies at major international companies.
If your argument is strong, surely you'd want it to be accessible to the people working on these policies, rather than just to a small group of EA people who will recognise all the arguments you're addressing.
If you want to write in a way that is very accessible, it's better not to just say "I disagree with Person X on this" but rather, "My opinion is Y. There are those, such as person X, who disagree, because they believe that ... (explanation of point of disagreement)."
There is the saying that the best way to get a correct answer to any question these days is to post an incorrect answer and let people correct you. In the same spirit, if you outline your positions, people will come back with objections, whether original or citing other work, and eventually you can modify your document to address these. So it becomes a living document representing your latest thinking.
This is also valuable because AI itself is evolving, and we're learning more about it every day. So even if your argument is accurate based on what we know today, you might want to change something tomorrow.
(Yes, I realise what I've proposed is a lot more work! But maybe the first version, outlining what you think, is already valuable in and of itself).
DPiepgrass @ 2023-12-17T20:35 (+2)
Opinions on this are pretty diverse. I largely agree with the bulleted list of things-you-think, and this article paints a picture of my current thinking.
My threat model is something like: the very first AGIs will probably be near human-level and won't be too hard to limit/control. But in human society, tyrants are overrepresented among world leaders, relative to tyrants in the population of people smart enough to lead a country. We'll probably end up inventing multiple versions of AGI, some of which may be straightforwardly turned into superintelligences and others not. The worst superintelligence we help to invent may win, and if it doesn't it'll probably be because a different one beats it (or reaches an unbeatable position first). Humans will probably be sidelined if we survive a battle between super-AGIs. So it would be much safer not to invent them―but it's also hard to avoid inventing them! I have low confidence in my P(catastrophe) and I'm unsure how to increase my confidence.
But I prefer estimating P(catastrophe) over P(doom) because extinction is not all that concerns me. Some stories about AGI lead to extinction, others to mass death, others to dystopia (possibly followed by mass death later), others to utopia followed by catastrophe, and still others to a stable and wonderful utopia (with humanity probably sidelined eventually, which may even be a good thing). I think I could construct a story along any of these lines.
Hayven Frienby @ 2023-12-26T16:27 (+3)
Well said. I also think it's important to define what is meant by "catastrophe." Just as an example, I personally would consider it catastrophic to see a future in which humanity is sidelined and subjugated by an AGI (even a "friendly," aligned one), but many here would likely disagree with me that this would be a catastrophe. I've even heard otherwise rational (non-EA) people claim a future in which humans are 'pampered pets' of an aligned ASI to be 'utopian,' which just goes to show the level of disagreement.
DPiepgrass @ 2023-12-30T17:29 (+1)
To me, it's important whether the AGIs are benevolent and have qualia/consciousness. If AGIs are ordinary computers but smart, I may agree; if they are conscious and benevolent, I'm okay being a pet.
Hayven Frienby @ 2023-12-31T13:32 (+1)
I'm not sure whether we could ever truly know if an AGI was conscious or experienced qualia (which are by definition not quantifiable). And you're probably right that being a pet of a benevolent ASI wouldn't be a miserable thing (but it is still an x-risk ... because it permanently ends humanity's status as a dominant species).
DPiepgrass @ 2024-01-03T17:32 (+1)
I would caution against thinking the Hard Problem of Consciousness is unsolvable "by definition" (if it is solved, qualia will likely become quantifiable). I think the reasonable thing is to presume it is solvable. But until it is solved we must not allow AGI takeover, and even if AGIs stay under human control, it could lead to a previously unimaginable power imbalance between a few humans and the rest of us.
JuanGarcia @ 2024-05-29T13:49 (+1)
A new paper on this came out recently: https://link.springer.com/article/10.1007/s00146-024-01930-2
Matthew_Barnett @ 2024-05-29T18:52 (+5)
I gave a brief reply to the paper here.
Hayven Frienby @ 2023-12-26T16:18 (+1)
I think you've summarized the general state of EA views on x-risk from artificial intelligence -- thanks! My views* are considered extreme around here, but I think it's important to note that, at least on the forum, there seems to me to be a vocal contingent of us who give lower consideration to AI x-risk, and I wonder if this represents a general trend. (epistemic status: low -- I have no hard data to back this up besides the fact that there seem to be more pro-AI posts around here)
*I think any substantial (>0.01%) risk of extinction due to AI action in the next century warrants a total and "permanent" (>50 years) pause on all AI development, enforced through international law