Matthew_Barnett's Quick takes

By Matthew_Barnett @ 2020-03-02T05:03 (+4)

Matthew_Barnett @ 2024-04-25T02:18 (+133)

In this "quick take", I want to summarize some of my idiosyncratic views on AI risk.

My goal here is to list just a few ideas that cause me to approach the subject differently from how I perceive most other EAs view the topic. These ideas largely push me in the direction of making me more optimistic about AI, and less likely to support heavy regulations on AI.

(Note that I won't spend a lot of time justifying each of these views here. I'm mostly stating these points without lengthy justifications, in case anyone is curious. These ideas can perhaps inform why I spend significant amounts of my time pushing back against AI risk arguments. Not all of these ideas are rare, and some of them may indeed be popular among EAs.)

Skepticism of the treacherous turn: The treacherous turn is the idea that (1) at some point there will be a very smart unaligned AI, (2) when weak, this AI will pretend to be nice, but (3) when sufficiently strong, this AI will turn on humanity by taking over the world by surprise, and then (4) optimize the universe without constraint, which would be very bad for humans.

By comparison, I find it more likely that no individual AI will ever be strong enough to take over the world, in the sense of overthrowing the world's existing institutions and governments by surprise. Instead, I broadly expect unaligned AIs will integrate into society and try to accomplish their goals by advocating for legal rights, rather than trying to overthrow our institutions by force. Upon attaining legal personhood, unaligned AIs can utilize their legal rights to achieve their objectives, for example by getting a job and trading their labor for property, within the already-existing institutions. Because the world is not zero sum, and there are economic benefits to scale and specialization, this argument implies that unaligned AIs may well have a net-positive effect on humans, as they could trade with us, producing value in exchange for our own property and services.

Note that my claim here is not that AIs will never become smarter than humans. One way of seeing how these two claims are distinguished is to compare my scenario to the case of genetically engineered humans. By assumption, if we genetically engineered humans, they would presumably eventually surpass ordinary humans in intelligence (along with social persuasion ability, and ability to deceive etc.). However, by itself, the fact that genetically engineered humans will become smarter than non-engineered humans does not imply that genetically engineered humans would try to overthrow the government. Instead, as in the case of AIs, I expect genetically engineered humans would largely try to work within existing institutions, rather than violently overthrow them.

AI alignment will probably be somewhat easy: The most direct and strongest current empirical evidence we have about the difficulty of AI alignment, in my view, comes from existing frontier LLMs, such as GPT-4. Having spent dozens of hours testing GPT-4's abilities and moral reasoning, I think the system is already substantially more law-abiding, thoughtful and ethical than a large fraction of humans. Most importantly, this ethical reasoning extends (in my experience) to highly unusual thought experiments that almost certainly did not appear in its training data, demonstrating a fair degree of ethical generalization, beyond mere memorization.

It is conceivable that GPT-4's apparently ethical nature is fake. Perhaps GPT-4 is lying about its motives to me and in fact desires something completely different than what it professes to care about. Maybe GPT-4 merely "understands" or "predicts" human morality without actually "caring" about human morality. But while these scenarios are logically possible, they seem less plausible to me than the simple alternative explanation that alignment—like many other properties of ML models—generalizes well, in the natural way that you might similarly expect from a human.

Of course, the fact that GPT-4 is easily alignable does not immediately imply that smarter-than-human AIs will be easy to align. However, I think this current evidence is still significant, and aligns well with prior theoretical arguments that alignment would be easy. In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking bad actions, allowing us to shape its rewards during training accordingly. After we've aligned a model that's merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.

The default social response to AI will likely be strong: One reason to support heavy regulations on AI right now is if you think the natural "default" social response to AI will lean more heavily toward laissez faire than is optimal, i.e., by default, we will have too little regulation rather than too much. In this case, you could believe that, by advocating for regulations now, you're making it more likely that we regulate AI a bit more than we otherwise would have, pushing us closer to the optimal level of regulation.

I'm quite skeptical of this argument because I think that the default response to AI (in the absence of intervention from the EA community) will already be quite strong. My view here is informed by the base rate of technologies being overregulated, which I think is quite high. In fact, it is difficult for me to name even a single technology that I think is currently clearly underregulated by society. By pushing for more regulation on AI, I think it's likely that we will overshoot and over-constrain AI relative to the optimal level.

In other words, my personal bias is towards thinking that society will regulate technologies too heavily, rather than too loosely. And I don't see a strong reason to think that AI will be any different from this general historical pattern. This makes me hesitant to push for more regulation on AI, since on my view, the marginal impact of my advocacy would likely be to push us even further in the direction of "too much regulation", overshooting the optimal level by even more than what I'd expect in the absence of my advocacy.

I view unaligned AIs as having comparable moral value to humans: The basic idea behind this point is that, under various physicalist views of consciousness, you should expect AIs to be conscious, even if they do not share human preferences. Moreover, it seems likely that AIs, even ones that don't share human preferences, will be pretrained on human data, and therefore largely share our social and moral concepts.

Since unaligned AIs will likely be both conscious and share human social and moral concepts, I don't see much reason to think of them as less "deserving" of life and liberty, from a cosmopolitan moral perspective. They will likely think similarly to the way we do across a variety of relevant axes, even if their neural structures are quite different from our own. As a consequence, I am pretty happy to incorporate unaligned AIs into the legal system and grant them some control of the future, just as I'd be happy to grant some control of the future to human children, even if they don't share my exact values.

Put another way, I view (what I perceive as) the EA attempt to privilege "human values" over "AI values" as being largely arbitrary and baseless, from an impartial moral perspective. There are many humans whose values I vehemently disagree with, but I nonetheless respect their autonomy, and do not wish to deny these humans their legal rights. Likewise, even if I strongly disagreed with the values of an advanced AI, I would still see value in their preferences being satisfied for their own sake, and I would try to respect the AI's autonomy and legal rights. I don't have a lot of faith in the inherent kindness of human nature relative to a "default unaligned" AI alternative.

I'm not fully committed to strong longtermism: I think AI has an enormous potential to benefit the lives of people who currently exist. I predict that AIs can eventually substitute for human researchers, and thereby accelerate technological progress, including in medicine. In combination with my other beliefs (such as my belief that AI alignment will probably be somewhat easy), this view leads me to think that AI development will likely be net-positive for people who exist at the time of its development. In other words, if we allow AI development, it is likely that we can use AI to reduce human mortality, and dramatically raise human well-being for the people who already exist.

I think these benefits are large and important, and commensurate with the downside potential of existential risks. While a fully committed strong longtermist might scoff at the idea that curing aging might be important — as it would largely only have short-term effects, rather than long-term effects that reverberate for billions of years — by contrast, I think it's really important to try to improve the lives of people who currently exist. Many people view this perspective as a form of moral partiality that we should discard for being arbitrary. However, I think morality is itself arbitrary: it can be anything we want it to be. And I choose to value currently existing humans, to a substantial (though not overwhelming) degree.

This doesn't mean I'm a fully committed near-termist. I sympathize with many of the intuitions behind longtermism. For example, if curing aging required raising the probability of human extinction by 40 percentage points, or something like that, I don't think I'd do it. But in more realistic scenarios that we are likely to actually encounter, I think it's plausibly a lot better to accelerate AI, rather than delay AI, on current margins. This view simply makes sense to me given the enormously positive effects I expect AI will likely have on the people I currently know and love, if we allow development to continue.

Owen Cotton-Barratt @ 2024-04-25T08:26 (+32)

I want to say thank you for holding the pole of these perspectives and keeping them in the dialogue. I think that they are important and it's underappreciated in EA circles how plausible they are.

(I definitely don't agree with everything you have here, but typically my view is somewhere between what you've expressed and what is commonly expressed in x-risk focused spaces. Often I'm also drawn to say "yeah, but ..." -- e.g. I agree that a treacherous turn is not so likely at global scale, but I don't think it's completely out of the question, and given that, I think safeguarding against it is worth serious attention.)

Ryan Greenblatt @ 2024-04-25T17:07 (+7)

Explicit +1 to what Owen is saying here.

(Given that I commented with some counterarguments, I thought I would explicitly note my +1 here.)

Ryan Greenblatt @ 2024-04-25T04:19 (+17)

In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we've aligned a model that's merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.

This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models of scales between GPT-2 and GPT-4. However, this isn't true: the weak-to-strong generalization paper finds that this doesn't work, and indeed bootstrapping like this doesn't help at all for ChatGPT reward modeling (it helps on chess puzzles and, I believe, on nothing else they investigate).

I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning such that it would carefully reason about what humans would want if they were more knowledgeable and then rate outputs based on this. However, I don't think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it's unlikely it works under these generous assumptions (though I won't argue for this here).

Ryan Greenblatt @ 2024-04-25T04:05 (+17)

In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society.

The obvious example would be synthetic biology, gain-of-function research, and similar.

I also think AI itself is currently massively underregulated even entirely ignoring alignment difficulties. I think the probability of the creation of AI capable of accelerating AI R&D by 10x this year is around 3%. It would be extremely bad for US national interests if such an AI was stolen by foreign actors. This suffices for regulation ensuring very high levels of security IMO. And this is setting aside ongoing IP theft and similar issues.

Matthew_Barnett @ 2024-04-25T04:38 (+4)

The obvious example would be synthetic biology, gain-of-function research, and similar.

Can you explain why you suspect these things should be more regulated than they currently are?

Matthew_Barnett @ 2025-05-31T23:51 (+75)

I want to clarify, for the record, that although I disagree with most members of the EA community on whether we should accelerate or slow down AI development, I still consider myself an effective altruist in the senses that matter. This is because I continue to value and support most EA principles, such as using evidence and reason to improve the world, prioritizing issues based on their scope, not discriminating against foreigners, and antispeciesism.

I think it’s unfortunate that disagreements about AI acceleration often trigger such strong backlash within the community. It appears that advocating for slowing AI development has become a "sacred" value that unites much of the community more strongly than other EA values do. Despite hinging on many uncertain and IMO questionable empirical assumptions, the idea that we should decelerate AI development is now sometimes treated as central to the EA identity in many (albeit not all) EA circles.

As a little bit of evidence for this, I have been publicly labeled a "sellout and traitor" on X by a prominent member of the EA community simply because I cofounded an AI startup. This is hardly an appropriate reaction to what I perceive as a measured, academic disagreement occurring within the context of mainstream cultural debates. Such reactions frankly resemble the behavior of a cult, rather than an evidence-based movement—something I personally did not observe nearly as much in the EA community ten years ago.

Neel Nanda @ 2025-06-01T11:50 (+125)

Some takes:

I think Holly's tweet was pretty unreasonable, and I judge her for that, not you. But I also disagree with a lot of other things she says and do not at all consider her to speak for the movement.
To the best of my ability to tell (both from your comments and private conversations with others), you and the other Mechanize founders are not getting undue benefit from Epoch funders apart from less tangible things like skills, reputation, etc. I totally agree with your comment below that this does not seem a betrayal of their trust. To me, it seems more a mutually beneficial trade between parties with different but somewhat overlapping values, and I am pro EA as a community being able to make such trades.
AI is a very complex, uncertain, and important space. This means reasonable people will disagree on the best actions AND that certain actions will look great under some worldviews and pretty harmful under others.
As such, assuming you are sincere about the beliefs you've expressed re why to found Mechanize, I have no issue with you calling yourself an Effective Altruist - it's about evidence-based ways to do the most good, not about doing good my way.

Separately:

Under my model of the world, Mechanize seems pretty harmful in a variety of ways, in expectation.
I think it's reasonable for people who object to your work to push back against it and publicly criticise it (though I agree that much of the actual criticism has been pretty unreasonable).
The EA community implicitly gives help and resources to other people in it. If most people in the community think that what you're doing is net harmful even if you're doing it with good intentions, I think it's pretty reasonable to not want to give you any of that implicit support?

Marcus Abramovitch 🔸 @ 2025-06-01T19:35 (+16)

I was going to write a comment responding but Neel basically did it for me.

The only thing I would object to is Holly being a "prominent member of the EA community". The PauseAI/StopAI people are often treated as fringe in the EA community, and she frequently violates norms of discourse. EAs, due to their norms of discourse, usually just don't respond to her in the way she responds to others.

akash 🔸 @ 2025-06-01T20:22 (+16)

Just off the top of my head: Holly was a community builder at Harvard EA, wrote what is arguably one of the most influential forum posts ever, and took sincere career and personal decisions based on EA principles (first, wild animal welfare, and now, "making AI go well"). Besides that, there are several EAGs and community events and conversations and activities that I don't know about, but all in all, she has deeply engaged with EA and has been a thought leader of sorts for a while now. I think it is completely fair to call her a prominent member of the EA community.^[1]

^[1] I am unsure if Holly would like the term "member" because she has stated that she is happy to burn bridges with EA / funders, so maybe "person who has historically been strongly influenced by and has been an active member of EA" would be the most accurate but verbose phrasing.

Neel Nanda @ 2025-06-01T21:20 (+31)

I think there's some speaking past each other due to differing word choices. Holly is prominent, evidenced by the fact that we are currently discussing her. She has been part of the EA community for a long time and appears to be trying to do the most good according to her own principles. So it's reasonable to call her a member of the EA community. And therefore "prominent member" is accurate in some sense.

However, "prominent member" can also imply that she represents the movement, is endorsed by it, or that her actions should influence what EA as a whole is perceived to believe. I believe this is the sense in which Marcus and Matthew are using it, and I disagree that she fits this definition. She does not speak for me in any way. While I believe she has good intentions, I'm uncertain about the impact of her work and strongly disagree with many of her online statements and the discourse norms she has chosen to adopt, and think these go against EA norms (and would guess they are also negative for her stated goals, but am less sure on this one).

Chris Leong @ 2025-06-02T04:53 (+8)

"Prominence" isn't static.

My impression is that Holly has intentionally sacrificed a significant amount of influence within EA because she feels that EA is too constraining in terms of what needs to be done to save humanity from AI.

So that term would have been much more accurate in the past.

Marcus Abramovitch 🔸 @ 2025-06-01T21:31 (+2)

Right, but most of this is her "pre-AI" stuff, and I am saying that I don't think "Pause AI" is very mainstream by EA standards; in particular, the very inflammatory nature of the activism and the policy prescriptions are definitely not in the majority. It is in that sense that I object to Matthew calling her prominent, since by the standard you are suggesting, Matthew is also prominent. He's been in the movement for a decade, has written a lot of extremely influential posts, was a well-known part of Epoch for a long time, and also wrote one of the most prescient posts ever.

I don't dispute that Holly has been an active and motivated member of the EA community for a while.

Larks @ 2025-06-02T04:08 (+6)

The ... StopAI people are often treated as fringe in the EA community

Not sure how relevant this is given I think she disapproves of them. (I agree they are so fringe as to be basically outside it).

Guive @ 2025-06-02T07:39 (+6)

Can you be a bit more specific about what it means for the EA community to deny Matthew (and Mechanize) implicit support, and which ways of doing this you would find reasonable vs. unreasonable?

Chris Leong @ 2025-06-01T03:20 (+60)

Matthew's comment was on -1 just now. I'd like to encourage people not to vote his post into the negative. Even though I don't find his defense at all persuasive, I still think it deserves to be heard.

What I perceive as a measured, academic disagreement

This isn't merely an "academic disagreement" anymore. You aren't just writing posts, you've actually created a startup. You're doing things in the space.

As an example, it's neither incoherent nor hypocritical to let philosophers argue "Maybe existence is negative, all things considered" whilst still cracking down on serial killers. The former is necessary for academic freedom, the latter is not.

The point of academic freedom is to ensure that the actions we take in the world are as well-informed as possible. It is not to create a world without any norms at all.

It appears that advocating for slowing AI development has become a "sacred" value... Such reactions frankly resemble the behavior of a cult

Honestly, this is such a lazy critique. Whenever anyone disagrees with a group, they can always dismiss them as a "cult" or "cult-adjacent", but this doesn't make it true.

I think Ozzie's framing of cooperativeness is much more accurate. The unilateralist's curse very much applies to differential technology development, so if the community wants to have an impact here, it can't ignore the issue of "cowboys" messing things up by rowing in the opposite direction, especially when their reasoning seems poor. Any viable community, especially one attempting to drive change, needs to have a solution to this problem.

Having norms isn't equivalent to being a cult. When Fair Trade started taking off, I shared some of my doubts with some people who were very committed to it. This went poorly. They weren't open-minded at all, but I wouldn't run around calling Fair Trade a cult or even cult adjacent. They were just... a regular group.

And if I had run around accusing them of essentially being a "cult" that would have reflected poorly on me rather than on them.

I have been publicly labeled a "sellout and traitor"... simply because I cofounded an AI startup

As I described in my previous comment, the issue is more subtle than this. It's about the specific context:

This is also a massive burning of the commons. It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias. It is valuable for folks to be able to share information freely with folks at such forecasting orgs without having to worry about them going off and doing something like this.
However, this only works if those less worried about AI risks who join such a collaboration don't use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. Doing so is incredibly damaging to trust within the community.

I concede that there wasn't a previous well-defined norm against this, but norms have to get started somehow. And this is how it happens: someone does something, people are like wtf, and then, sometimes, a consensus forms that a norm is required.

NickLaing @ 2025-06-01T16:53 (+25)

Thanks for writing on the forum here - I think it's brave of you to comment where there will obviously be lots of pushback. I've got a question relating to the new company and EA assignment. You may well have answered this somewhere else; if that's the case, please point me in that direction. I'm a Global Health guy mostly, so am not super deep in AI understanding, so this question may be naive.

Question: If we frame EA along the (great new website) lines of "Find the best ways to help others", how are you, through your new startup, doing this? Is this for the purpose of earning to give money away? Or do you think the direct work the startup will do has a high EV for doing lots of good? Feel free to define EA along different lines if you like!

MichaelDickens @ 2025-06-01T02:06 (+22)

I have been publicly labeled a "sellout and traitor" on X by a prominent member of the EA community simply because I cofounded an AI startup.

This accusation was not because you cofounded an AI startup. It was specifically because you took funding to work on AI safety from people who want to ~~slow down AI development~~ use capability trends to better understand how to make AI safer*, and you are now (allegedly) using results developed from that funding to start a company dedicated to accelerating AI capabilities.

I don't know exactly what results Mechanize is using, but if this is true, then it does indeed constitute a betrayal. Not because you're accelerating capabilities, but because you took AI safety funding and used the results to do the opposite of what funders wanted.

*Corrected to give a more accurate characterization, see Chris Leong's comment

Matthew_Barnett @ 2025-06-01T02:37 (+35)

If this line of reasoning is truly the basis for calling me a "sellout" and a "traitor", then I think the accusation becomes even more unfounded and misguided. The claim is not only unreasonable: it is also factually incorrect by any straightforward or good-faith interpretation of the facts.

To be absolutely clear: I have never taken funds that were earmarked for slowing down AI development and redirected them toward accelerating AI capabilities. There has been no repurposing or misuse of philanthropic funding that I am aware of. The startup in question is an entirely new and independent entity. It was created from scratch, and it is funded separately—it is not backed by any of the philanthropic donations I received in the past. There is no financial or operational overlap.

Furthermore, we do not plan on meaningfully making use of benchmarks, datasets, or tools that were developed during my previous roles in any substantial capacity at the new startup. We are not relying on that prior work to advance our current mission. And as far as I can tell, we have never claimed or implied otherwise publicly.

It's also important to address the deeper assumption here: that I am somehow morally or legally obligated to permanently align my actions with the preferences or ideological views of past philanthropic funders who supported an organization that employed me. That notion seems absurd. It has no basis in ordinary social norms, legal standards, or moral expectations. People routinely change roles, perspectives evolve, and institutions have limited scopes and timelines. Holding someone to an indefinite obligation based solely on past philanthropic support would be unreasonable.

Even if, for the sake of argument, such an obligation did exist, it would still not apply in this case—because, unless I am mistaken, the philanthropic grant that supported me as an employee never included any stipulation about slowing down AI in the first place. As far as I know, that goal was never made explicit in the grant terms, which renders the current accusations irrelevant and unfounded.

Ultimately, these criticisms appear unsupported by evidence, logic, or any widely accepted ethical standards. They seem more consistent with a kind of ideological or tribal backlash to the idea of accelerating AI than with genuine, thoughtful, and evidence-based concerns.

Jason @ 2025-06-02T00:55 (+33)

It's also important to address the deeper assumption here: that I am somehow morally or legally obligated to permanently align my actions with the preferences or ideological views of past philanthropic funders who supported an organization that employed me. That notion seems absurd. It has no basis in ordinary social norms, legal standards, or moral expectations. People routinely change roles, perspectives evolve, and institutions have limited scopes and timelines. Holding someone to an indefinite obligation based solely on past philanthropic support would be unreasonable.

I don't think a lifetime obligation is the steelmanned version of your critics' narrative, though. A time-limited version will work just as well for them.

In many circumstances, I do think society does recognize a time-limited moral obligation and social norm not to work for the other side from those providing you significant resources,^[1] although I am not convinced it would in the specific circumstances involving you and Epoch. So although I would probably acquit you of the alleged norm violation here, I would not want others drawing larger conclusions about the obligation / norm from that acquittal than warranted.^[2]

There is something else here, though. At least in the government sector, time-limited post-employment restrictions are not uncommon. They are intended to avoid the appearance of impropriety as much as actual impropriety itself. In those cases, we don't trust the departing employee not to use their prior public service for private gain in certain ways. Moreover, we recognize that even the appearance that they are doing so creates social costs. The AIS community generally can't establish and enforce legally binding post-employment restrictions, but is of course free to criticize people whose post-employment conduct it finds inappropriate under community standards. ("Traitor" is rather poorly calibrated to those circumstances, but most of the on-Forum criticism has been somewhat more measured than that.)

Although I'd defer to people with subject-matter expertise on whether there is an appearance of impropriety here,^[3] I would note that this is a significantly lower standard for your critics to satisfy than proving actual impropriety. If there's a close enough fit between your prior employment and new enterprise, that could be enough to establish a rebuttable presumption of an appearance.

^[1] For instance, I would consider it shady for a new lawyer to accept a competitive job with Treehuggers (made up organization); gain skill, reputation, and career capital for several years through Treehuggers' investment of money and mentorship resources; and then use said skill and reputation to jump directly to a position at Big Timber with a big financial upside. I would generally consider anyone who did that as something of . . . well, a traitor and a sellout to Treehuggers and the environmental movement.
^[2] This should also not be seen as endorsing your specific defense rationale. For instance, I don't think an explicit "stipulation about slowing down AI" in grant language would be necessary to create an obligation.
^[3] My deference extends to deciding what impropriety means here, but "meaningfully making use of benchmarks, datasets, or tools that were developed during [your] previous roles" in a way that was substantially assisted by your previous roles sounds like a plausible first draft of at least one form of impropriety.

Chris Leong @ 2025-06-02T04:49 (+3)

At least in the government sector, time-limited post-employment restrictions are not uncommon. They are intended to avoid the appearance of impropriety as much as actual impropriety itself. In those cases, we don't trust the departing employee not to use their prior public service for private gain in certain ways.

My argument for this being bad is quite similar to what you've written.

This is also a massive burning of the commons. It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias. It is valuable for folks to be able to share information freely with folks at such forecasting orgs without having to worry about them going off and doing something like this.
However, this only works if those less worried about AI risks who join such a collaboration don't use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. Doing so is incredibly damaging to trust within the community.

Chris Leong @ 2025-06-01T04:56 (+12)

I agree that Michael's framing doesn't quite work. It's not even clear to me that OpenPhil, for example, is aiming to "slow down AI development" as opposed to "fund research into understanding AI capability trends better without accidentally causing capability externalities".

I've previously written a critique here, but the TLDR is that Mechanise is a major burning of the commons that damages trust within the Effective Altruism community and creates a major challenge for funders who want to support ideological diversity in forecasting organisations without accidentally causing capability externalities.

Furthermore, we do not plan on meaningfully making use of benchmarks, datasets, or tools that were developed during my previous roles in any substantial capacity at the new startup. We are not relying on that prior work to advance our current mission. And as far as I can tell, we have never claimed or implied otherwise publicly.

This is a useful clarification. I had a weak impression that Mechanise might be.

They seem more consistent with a kind of ideological or tribal backlash to the idea of accelerating AI than with genuine, thoughtful, and evidence-based concerns.

I agree that some of your critics may not have quite been able to hit the nail on the head when they tried to articulate their critiques (it took me substantial effort to figure out what I precisely thought was wrong, as opposed to just 'this feels bad'), but I believe that the general thrust of their arguments more or less holds up.

Matthew_Barnett @ 2025-06-01T05:23 (+29)

I agree that some of your critics may not have quite been able to hit the nail on the head when they tried to articulate their critiques (it took me substantial effort to figure out what I precisely thought was wrong, as opposed to just 'this feels bad'), but I believe that the general thrust of their arguments generally holds up.

In context, this comes across to me as an overly charitable characterization of what actually occurred: someone publicly labeled me a literal traitor and then made a baseless, false accusation against me. What’s even more concerning is that this unfounded claim is now apparently being repeated and upvoted by others.

When communities choose to excuse or downplay this kind of behavior—by interpreting it in the most charitable possible way, or by glossing over it as being "essentially correct"—they end up legitimizing what is, in fact, a low-effort personal attack without a factual basis. Brushing aside or downplaying such attacks as if they are somehow valid or acceptable doesn't just misrepresent the situation; it actively undermines the conditions necessary for good faith engagement and genuine truth-seeking.

I urge you to recognize that tolerating or rationalizing this type of behavior has real social consequences. It fosters a hostile environment, discourages honest dialogue, and ultimately corrodes the integrity of any community that claims to value fairness and reasoned discussion.

Chris Leong @ 2025-06-01T05:49 (+7)

I think Holly just said what a lot of people were feeling and I find that hard to condemn.

"Traitor" is a bit of a strong term, but it's pretty natural for burning the commons to result in significantly less trust. To be honest, the main reason why I wouldn't use that term myself is that it reifies individual actions into a permanent personal characteristic and I don't have the context to make any such judgments. I'd be quite comfortable with saying that founding Mechanise was a betrayal of sorts, where the "of sorts" clarifies that I'm construing the term broadly.

Glossing over it as being "essentially correct"

This characterisation doesn't quite match what happened. My comment wasn't along the lines of "Oh, it's essentially correct, close enough is good enough, details are unimportant"; rather, I actually wrote down what I thought a more careful analysis would look like.

They end up legitimizing what is, in fact, a low-effort personal attack without a factual basis

Part of the reason why I've been commenting is to encourage folks to make more precise critiques. And indeed, Michael has updated his previous comment in response to what I wrote.

A baseless, false accusation

Is it baseless?

I noticed you wrote: "we do not plan on meaningfully making use". That provides you with substantial wiggle room. So it's unclear to me at this stage that your statements being true/defensible would necessitate her statements being false.

Matthew_Barnett @ 2025-06-01T06:04 (+11)

Is it baseless?

Yes, absolutely. With respect, unless you can provide some evidence indicating that I've acted improperly, I see no productive reason to continue engaging on this point.

What concerns me most here is that the accusation seems to be treated as credible despite no evidence being presented and a clear denial from me. That pattern—assuming accusations about individuals who criticize or act against core dogmas are true without evidence—is precisely the kind of cult-like behavior I referenced in my original comment.

Suggesting that I've left myself "substantial wiggle room" misinterprets what I intended, and given the lack of supporting evidence, it feels unfair and unnecessarily adversarial. Repeatedly implying that I've acted improperly without concrete substantiation does not reflect a good-faith approach to discussion.

Chris Leong @ 2025-06-01T07:30 (+9)

If you don't want to engage, that's perfectly fine. I've written a lot of comments and responding to all of them would take substantial time. It wouldn't be fair to expect that from you.

That said, labelling asking for clarification "cult-like behaviour" is absurd. On the contrary, not naively taking claims at face value is a crucial defence against this. Furthermore, implying that someone is asking questions in bad faith is precisely the technique that cult leaders use^[1].

I said that the statement left you substantial wiggle room. This was purely a comment about how the statement could have a broad range of interpretations. I did not state, nor mean to imply, that this vagueness was intentional or in bad faith.

^[1]
That said, people asking questions in bad faith is actually pretty common and so you can't assume that something is a cult just because they say that their critics are mostly acting in bad faith.

Matthew_Barnett @ 2025-06-01T09:17 (+12)

To be clear, I was not calling your request for clarification “cult-like”. My comment was directed at how the accusation against me was seemingly handled—as though it were credible until I could somehow prove otherwise. No evidence was offered to support the claim. Instead, assertions were made without substantiation. I directly and clearly denied the accusations, but despite that, the line of questioning continued in a way that strongly suggested the accusation might still be valid.

To illustrate the issue more clearly: imagine if I were to accuse you of something completely baseless, and even after your firm denials, I continued to press you with questions that implicitly treated the accusation as credible. You would likely find that approach deeply frustrating and unfair, and understandably so. You’d be entirely justified in pushing back against it.

That said, I acknowledge that describing the behavior as “cult-like” may have generated more heat than light. It likely escalated the tone unnecessarily, and I’ll be more careful to avoid that kind of rhetoric going forward.

Chris Leong @ 2025-06-01T09:33 (+15)

I can see why you'd find this personally frustrating.

On the other hand, many people in the community, myself included, took certain claims from OpenAI and SBF at face value when it might have been more prudent to be less trusting. I understand that it must be unpleasant to face some degree of distrust due to the actions of others.

And I can see why you'd see your statements as a firm denial, whilst from my perspective, they were ambiguous. For example, I don't know how to interpret your use of the word "meaningful", so I don't actually know what exactly you've denied. It may be clear to you because you know what you mean, but it isn't clear to me.

(For what it's worth, I neither upvoted nor downvoted the comment you made before this one, but I did disagree vote it.)

dirk @ 2025-06-02T16:51 (+1)

Holly herself believes standards of criticism should be higher than what (judging by the comments here without being familiar with the overall situation) she seems to have employed here; see Criticism is sanctified in EA, but, like any intervention, criticism needs to pay rent.

Chris Leong @ 2025-06-01T04:44 (+4)

"From people who want to slow down AI development"

The framing here could be tighter. It's more about wanting to be able to understand AI capability trends better without accidentally causing capability externalities.

MichaelDickens @ 2025-06-01T05:09 (+8)

Yes, I think that is better than what I said, both because it's more accurate, and because it's clearer that Matthew did in fact use his knowledge of capability trends to decide that he could profit from starting an AI company.

Like, I don't know what exactly went into his decision, but I would be surprised if that knowledge didn't play a role.

Arguably that's less on Matthew and more on the founders of Epoch for either misrepresenting themselves or having a bad hiring filter. Probably the former—if I'm not mistaken, Tamay Besiroglu co-founded Epoch and is now co-founding Mechanize, so I would say Tamay behaved badly here but I'm not sure whether Matthew did.

Ozzie Gooen @ 2025-06-01T02:24 (+11)

Quick thoughts:
1. I think I want to see more dialogue here. I don't personally like the thought of the Mechanize team and EA splitting apart (at least, more than is already the case). I'd naively expect that there might still be a fair bit of wiggle room for the Mechanize team to do better or worse things in the world, and I'd of course hope for the better side of that. (I think the situation is still very early, for instance.)
2. I find it really difficult to adjudicate on the morality and specifics of the Mechanize spinoff. I don't know as much about the details as others do. It really isn't clear to me what the previous funders of Epoch believed or what the conditions of the donations were. I think those details matter in trying to judge the situation.
3. The person you mentioned, Holly Elmore, is really the first and one of the loudest to get upset about many things of this sort of shape. I think Holly disagrees with much of the EA scene, but in the opposite way than you/Matthew do. I personally think Holly goes a fair bit too far much of the time. That said, I know there were others who were upset about this who I think better represent the main EA crowd.
4. "the idea that we should decelerate AI development is now sometimes treated as central to the EA identity in many (albeit not all) EA circles." The way I see it is more that it's somewhat a matter of cooperativeness between EA organizations. There are a bunch of smart people and organizations working hard to slow down generic AI development. Out of all the things one could do, there are many useful things to work on other than [directly speeding up AI development]. This is akin to how it would be pretty awkward if there were a group that calls themselves EA that tries to fight global population growth by making advertisements attacking GiveWell: it might be the case that they feel like they have good reasons for this, but it makes sense to me why some EAs might not be very thrilled. Relatedly, I've seen some arguments for longer timelines that make sense to me, but I don't feel like I've seen many arguments in favor of speeding up AI timelines that make sense to me.

Larks @ 2025-06-01T14:01 (+10)

What do you think would constitute being a "sellout and traitor"?

Guive @ 2025-06-02T07:33 (+7)

In the case at hand, Matthew would have had to at some point represent himself as supporting slowing down or stopping AI progress. For at least the past 2.5 years, he has been arguing against doing that in extreme depth on the public internet. So I don't really see how you can interpret him starting a company that aims to speed up AI as inconsistent with his publicly stated views, which seems like a necessary condition for him to be a "traitor". If Matthew had previously claimed to be a pause AI guy, then I think it would be more reasonable for other adherents of that view to call him a "traitor." I don't think that's raising the definitional bar so high that no one will ever meet it; it seems like a very basic standard.

I have no idea how to interpret "sellout" in this context, as I have mostly heard that term used for such situations as rappers making washing machine commercials. Insofar as I am familiar with that word, it seems obviously inapplicable.

Erich_Grunewald 🔸 @ 2025-06-01T20:02 (+6)

I'm obviously not Matthew, but the OED defines them like so:

sell-out: "a betrayal of one's principles for reasons of expedience"
traitor: "a person who betrays [be gravely disloyal to] someone or something, such as a friend, cause, or principle"

Unless he is lying about what he believes -- which seems unlikely -- Matthew is not a sell-out, because according to him Mechanize is good or at minimum not bad for the world on his worldview. Hence, he is not betraying his own principles.

As for being a traitor, I guess the first question is, traitor of what? To EA principles? To the AI safety cause? To the EA or AI safety community? In order:

- I don't think Matthew is gravely disloyal to EA principles, as he explicitly says he endorses them and has explained how his decisions make sense on his worldview.
- I don't think Matthew is gravely disloyal to the AI safety cause, as he's been openly critical of many common AI doom arguments for some time, and you can't be disloyal to a cause you never really bought into in the first place.
- Whether Matthew is gravely disloyal to the EA or AI safety communities feels less obvious to me. I'm guessing a bunch of people saw Epoch as an AI safety organisation, and by extension its employees as members of the AI safety community, even if the org and its employees did not necessarily see itself or themselves that way, and felt betrayed for that reason. But it still feels off to me to call Matthew a traitor to the EA or AI safety communities, especially given that he's been critical of common AI doom arguments. This feels more like a difference over empirical beliefs than a difference over fundamental values, and it seems wrong to me to call someone gravely disloyal to a community for drawing unorthodox but reasonable empirical conclusions and acting on those, while broadly having similar values. Like, I think people should be allowed to draw conclusions (or even change their minds) based on evidence -- and act on those conclusions -- without it being betrayal, assuming they broadly share the core EA values, and assuming they're being thoughtful about it.

(Of course, it's still possible that Mechanize is a net-negative for the world, even if Matthew personally is not a sell-out or a traitor or any other such thing.)

Larks @ 2025-06-01T20:20 (+6)

Yes, I understand the arguments against it applying here. My question is whether the threshold is being set at a sufficiently high level that it basically never applies to anyone. Hence why I was looking for examples which would qualify.

Guy Raveh @ 2025-06-02T07:13 (+2)

Sellout (in the context of Epoch) would apply to someone e.g. concealing data or refraining from publishing a report in exchange for a proposed job in an existing AI company.

As for traitor, I think the only group here that can be betrayed is humanity as a whole, so as long as one believes they're doing something good for humanity I don't think it'd ever apply.

Erich_Grunewald 🔸 @ 2025-06-02T09:57 (+8)

As for traitor, I think the only group here that can be betrayed is humanity as a whole, so as long as one believes they're doing something good for humanity I don't think it'd ever apply.

Hmm, that seems off to me? Unless you mean "severe disloyalty to some group isn't Ultimately Bad, even though it can be instrumentally bad". But to me it seems useful to have a concept of group betrayal, and to consider doing so to be generally bad, since I think group loyalty is often a useful norm that's good for humanity as a whole.

Specifically, I think group-specific trust networks are instrumentally useful for cooperating to increase human welfare. For example, scientific research can't be carried out effectively without some amount of trust among researchers, and between researchers and the public, etc. And you need some boundary for these groups that's much smaller than all humanity to enable repeated interaction, mutual monitoring, and norm enforcement. When someone is severely disloyal to one of those groups they belong to, they undermine the mutual trust that enables future cooperation, which I'd guess is ultimately often bad for the world, since humanity as a whole depends for its welfare on countless such specialised (and overlapping) communities cooperating internally.

Guy Raveh @ 2025-06-02T18:09 (+4)

It's not that I'm ignoring group loyalty, just that the word "traitor" seems so strong to me that I don't think there's any smaller group here that's owed that much trust. I could imagine a close friend calling me that, but not a colleague. I could imagine a researcher saying I "betrayed" them if I steal and publish their results as my own after they consulted me, but that's a much weaker word.

[Context: I come from a country where you're labeled a traitor for having my anti-war political views, and I don't feel such usage of this word has done much good for society here...]

Owen Cotton-Barratt @ 2025-06-01T10:31 (+5)

Edit: I think that Neel's comment is basically just a better version of the stuff I was trying to say. (On the object level I'm a little more sympathetic than him to ways in which Mechanize might be good, although I don't really buy the story to that end that I've seen you present.)

~~Wanting to note that on my impressions, and setting aside who is correct on the object-level question of whether Mechanize's work is good for the world:~~

~~My best read of the situation is that Matthew has acted very reasonably (according to his beliefs), and that Holly has let herself down a bit~~
- I believe that Holly honestly feels that Matthew is a sellout and a traitor; however, I don't think that this is substantiated by reasonable readings of the facts, and I think this is the kind of accusation which it is socially corrosive to make publicly based on feelings
~~On handling object-level disagreements about what's crucial to do in the world ...~~
- ~~I think that EA-writ-large should be endorsing methodology more than conclusions~~
- ~~Inevitably we will have cases where people have strong earnest beliefs about what's good to do that point in conflicting directions~~
  - ~~I think that we need to support people in assessing the state of evidence and then acting on their own beliefs (hegemony of majority opinion seems kinda terrible)~~
  - ~~Of course people should be encouraged to beware unilateralism, but I don't think that can extend to "never do things other people think are actively destructive"~~
- ~~It's important to me that EA has space for earnest disagreements~~
  - ~~I therefore think that we should have something like "civilized society" norms, which constrain actions~~
    - ~~Especially (but not only!) those which would be harmful to the ability for the group to have high-quality discourse~~
    - ~~cf. SBF's actions, which I think were indefensible even if he earnestly believed them to be the best thing~~
    - ~~(Some discussion on how norms help to contain naive utilitarianism)~~
  - ~~I feel that Holly's tweet was (somewhat) norm-violating; and kind of angry that Matthew is the main person defending himself here~~

Matthew_Barnett @ 2024-02-03T06:14 (+75)

I'm curious why there hasn't been more work exploring a pro-AI or pro-AI-acceleration position from an effective altruist perspective. Some points:

Unlike existential risk from other sources (e.g. an asteroid), AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can't simply apply a naive argument that AI threatens total extinction of value to make the case that AI safety is astronomically important, in the sense that you can for other x-risks. You generally need additional assumptions.
Total utilitarianism is generally seen as non-speciesist, and therefore has no intrinsic preference for human values over unaligned AI values. If AIs are conscious, there don't appear to be strong prima facie reasons for preferring humans to AIs under hedonistic utilitarianism. Under preference utilitarianism, it doesn't necessarily matter whether AIs are conscious.
Total utilitarianism generally recommends large population sizes. Accelerating AI can be modeled as a kind of "population accelerationism". Extremely large AI populations could be preferable under utilitarianism compared to small human populations, even those with high per-capita incomes. Indeed, human populations have recently stagnated via low population growth rates, and AI promises to lift this bottleneck.
Therefore, AI accelerationism seems straightforwardly recommended by total utilitarianism under some plausible theories.

Here's a non-exhaustive list of guesses for why I think EAs haven't historically been sympathetic to arguments like the one above, and have instead generally advocated AI safety over AI acceleration (at least when these two values conflict):

A belief that AIs won't be conscious, and therefore won't have much moral value compared to humans.
- But why would we assume AIs won't be conscious? For example, if Brian Tomasik is right, consciousness is somewhat universal, rather than being restricted to humans or members of the animal kingdom.
- I also haven't actually seen much EA literature defend this assumption explicitly, which would be odd if this belief is the primary reason EAs have for focusing on AI safety over AI acceleration.
A presumption in favor of human values over unaligned AI values for some reasons that aren't based on strict impartial utilitarian arguments. These could include the beliefs that: (1) Humans are more likely to have "interesting" values compared to AIs, and (2) Humans are more likely to be motivated by moral arguments than AIs, and are more likely to reach a deliberative equilibrium of something like "ideal moral values" compared to AIs.
- Why would humans be more likely to have "interesting" values than AIs? It seems very plausible that AIs will have interesting values even if their motives seem alien to us. AIs might have even more "interesting" values than humans.
- It seems to me like wishful thinking to assume that humans are strongly motivated by moral arguments and would settle upon something like "ideal moral values".
A belief that population growth is inevitable, so it is better to focus on AI safety.
- But a central question here is why pushing for AI safety—in the sense of AI research that enhances human interests—is better than the alternative on the margin. What reason is there to think AI safety now is better than pushing for greater AI population growth now? (Potential responses to this question are outlined in other bullet points above and below.)
AI safety has lasting effects due to a future value lock-in event, whereas accelerationism would have, at best, temporary effects.
- Are you sure there will ever actually be a "value lock-in event"?
- Even if there is at some point a value lock-in event, wouldn't pushing for accelerationism also plausibly affect the values that are locked in? For example, the value of "population growth is good" seems more likely to be locked in, if you advocate for that now.
A belief that humans would be kinder and more benevolent than unaligned AIs.
- Humans seem pretty bad already. For example, humans are responsible for factory farming. It's plausible that AIs could be even more callous and morally indifferent than humans, but the bar already seems low.
- I'm also not convinced that moral values will be a major force shaping "what happens to the cosmic endowment". It seems to me that the forces shaping economic consumption matter more than moral values.
A bedrock heuristic that it would be extraordinarily bad if "we all died from AI", and therefore we should pursue AI safety over AI accelerationism.
- But it would also be bad if we all died from old age while waiting for AI, and missed out on all the benefits that AI offers to humans, which is a point in favor of acceleration. Why would this heuristic be weaker?
An adherence to person-affecting views in which the values of currently-existing humans are what matter most; and a belief that AI threatens to kill existing humans.
- But in this view, AI accelerationism could easily be favored since AIs could greatly benefit existing humans by extending our lifespans and enriching our lives with advanced technology.
An implicit acceptance of human supremacism, i.e. the idea that what matters is propagating the interests of the human species, or preserving the human species, even at the expense of individual interests (either within humanity or outside humanity) or the interests of other species.
- But isn't EA known for being unusually anti-speciesist compared to other communities? Peter Singer is often seen as a "founding father" of the movement, and a huge part of his ethical philosophy was about how we shouldn't be human supremacists.
- More generally, it seems wrong to care about preserving the "human species" in an abstract sense relative to preserving the current generation of actually living humans.
A belief that most humans are biased towards acceleration over safety, and therefore it is better for EAs to focus on safety as a useful correction mechanism for society.
- But was an anti-safety bias common for previous technologies? I think something closer to the opposite is probably true: most humans seem, if anything, biased towards being overly cautious about new technologies rather than overly optimistic.
A belief that society is massively underrating the potential for AI, which favors extra work on AI safety, since it's so neglected.
- But if society is massively underrating AI, then shouldn't this favor accelerating AI too? There doesn't seem to be an obvious asymmetry between these two values.
An adherence to negative utilitarianism, which would favor obstructing AI, along with any other technology that could enable the population of conscious minds to expand.
- This seems like a plausible moral argument to me, but it doesn't seem like a very popular position among EAs.
A heuristic that "change is generally bad" and AI represents a gigantic change.
- I don't think many EAs would defend this heuristic explicitly.
Added: AI represents a large change to the world. Delaying AI therefore preserves option value.
- This heuristic seems like it would have favored advocating delaying the industrial revolution, and all sorts of moral, social, and technological changes to the world in the past. Is that a position that EAs would be willing to bite the bullet on?

emre kaplan @ 2024-02-03T07:38 (+19)

I think a more important reason is the additional value of the information and the option value. It's very likely that the change resulting from AI development will be irreversible. Since we're still able to learn about AI as we study it, taking additional time to think and plan before training the most powerful AI systems seems to reduce the likelihood of being locked into suboptimal outcomes. Increasing the likelihood of achieving "utopia" rather than landing in "mediocrity" by 2 percent seems far more important than speeding up utopia by 10 years.
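One way to see the shape of this comparison is a rough expected-value sketch in Python. Only the 2 percent and 10 year figures come from the comment above; everything else (the per-year values `v_utopia`, `v_mediocre`, `v_now`, the baseline probability `p_utopia`, and the function names) is a hypothetical assumption chosen purely for illustration, and the 2 percent is read as 2 percentage points.

```python
def gain_from_probability_shift(delta_p, v_utopia, v_mediocre, horizon_years):
    """Expected gain from raising P(utopia) by delta_p over the whole horizon.

    Assumes (hypothetically) that the outcome is locked in for horizon_years
    and that value accrues at a constant rate per year.
    """
    return delta_p * (v_utopia - v_mediocre) * horizon_years


def gain_from_speedup(p_utopia, years_sooner, v_utopia, v_now):
    """Expected gain from reaching utopia years_sooner earlier.

    Assumes each year gained swaps a year at today's value rate (v_now)
    for a year at the utopia rate, and only matters if utopia is reached.
    """
    return p_utopia * years_sooner * (v_utopia - v_now)


def break_even_horizon(delta_p, p_utopia, years_sooner,
                       v_utopia, v_mediocre, v_now):
    """Horizon (in years) beyond which the probability shift is worth more."""
    speedup = gain_from_speedup(p_utopia, years_sooner, v_utopia, v_now)
    return speedup / (delta_p * (v_utopia - v_mediocre))


if __name__ == "__main__":
    # Hypothetical inputs: only delta_p=0.02 and years_sooner=10 are from the text.
    horizon = break_even_horizon(delta_p=0.02, p_utopia=0.5, years_sooner=10,
                                 v_utopia=1.0, v_mediocre=0.5, v_now=0.0)
    print(f"Break-even horizon: {horizon:.0f} years")
```

Under these made-up inputs the break-even horizon is about 500 years, so the probability shift wins whenever the locked-in future is expected to last longer than that. Different assumptions move the number, but the speedup term stays fixed while the probability term grows with the horizon.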

Matthew_Barnett @ 2024-02-03T07:42 (+7)

It's very likely that whatever change that comes from AI development will be irreversible.

I think all actions are in a sense irreversible, but large changes tend to be less reversible than small changes. In this sense, the argument you gave seems reducible to "we should generally delay large changes to the world, to preserve option value". Is that a reasonable summary?

In this case I think it's just not obvious that delaying large changes is good. Would it have been good to delay the industrial revolution to preserve option value? I think this heuristic, if used in the past, would have generally demanded that we "pause" all sorts of social, material, and moral progress, which seems wrong.

Michael_PJ @ 2024-02-03T10:23 (+8)

I don't think we would have been able to use the additional information we would have gained from delaying the industrial revolution, but I think if we could have, the answer might be "yes". It's easy to see in hindsight that it went well overall, but that doesn't mean that the correct ex ante attitude shouldn't have been caution!

kokotajlod @ 2024-02-06T05:29 (+15)

1. My understanding is that relatively few EAs are actual hardcore classic hedonist utilitarians. I think this is ~sufficient to explain why more haven't become accelerationists.
2. Have you cornered a classic hedonist utilitarian EA and asked them? Have you cornered three? What did they say?

JWS @ 2024-02-08T12:26 (+2)

Don't know why this is being disagree-voted. I think point 1 is basically correct: it doesn't take diverging far from being a "hardcore classic hedonist utilitarian" to not support the case Matthew makes in the OP.

Will Aldred @ 2024-02-03T19:56 (+13)

AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can't simply apply a naive argument that AI threatens total extinction of value

Paul Christiano wrote a piece a few years ago about ensuring that misaligned ASI is a “good successor” (in the moral value sense),^[1] as a plan B to alignment (Medium version; LW version). I agree it’s odd that there hasn’t been more discussion since.^[2]

Here's a non-exhaustive list of guesses for why I think EAs haven't historically been sympathetic [...]: A belief that AIs won't be conscious, and therefore won't have much moral value compared to humans.

I’ve wondered about this myself. My take is that this area was overlooked a year ago, but there’s now some good work being done. See Jeff Sebo’s Nov ‘23 80k podcast episode, as well as Rob Long’s episode, and the paper that the two of them co-authored at the end of last year: “Moral consideration for AI systems by 2030”. Overall, I’m optimistic about this area becoming a new forefront of EA.

accelerationism would have, at best, temporary effects

I’m confused by this point, and for me this is the overriding crux between my view and yours. Do you really not think accelerationism could have permanent effects, through making AI takeover, or some other irredeemable outcome, more likely?

Are you sure there will ever actually be a "value lock-in event"?

I’m not sure there’ll be a lock-in event, in the way I can’t technically be sure about anything, but such an event seems clearly probable enough that I very much want to avoid taking actions that bring it closer. (Insofar as bringing the event closer raises the chance it goes badly, which I believe to be a likely dynamic. See, for example, the Metaculus question, “How does the level of existential risk posed by AGI depend on its arrival time?”, or discussion of the long reflection.)

^[1]
Although, Paul’s argument routes through acausal cooperation—see the piece for details—rather than through the ASI being morally valuable in itself. (And perhaps OP means to focus on the latter issue.) In Paul’s words:
Clarification: Being good vs. wanting good
We should distinguish two properties an AI might have:
- Having preferences whose satisfaction we regard as morally desirable.
- Being a moral patient, e.g. being able to suffer in a morally relevant way.
These are not the same. They may be related, but they are related in an extremely complex and subtle way. From the perspective of the long-run future, we mostly care about the first property.
^[2]
There was a little discussion a few months ago, here, but none of what was said built on Paul’s article.

Ryan Greenblatt @ 2024-02-03T20:57 (+3)

It's worth emphasizing that moral welfare of digital minds is quite a different (though related) topic to whether AIs are good successors.

Will Aldred @ 2024-02-03T21:22 (+2)

Fair point, I’ve added a footnote to make this clearer.

Ryan Greenblatt @ 2024-02-03T18:07 (+12)

Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny. So the first order effects of acceleration are tiny from a longtermist perspective.

Thus, a purely longtermist perspective doesn't care about the direct effects of delay/acceleration and the question would come down to indirect effects.
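As a rough sketch of the arithmetic above: this assumes, purely for illustration, that the "1 part in 10 billion" figure can be treated as a constant per-year rate and scaled linearly with the number of years. The constant and function name are hypothetical labels, not taken from any source.

```python
# Assumption: the "1 part in 10 billion" figure quoted above, treated as a
# constant fractional gain per year of acceleration (a linear approximation).
FRACTION_PER_YEAR = 1e-10


def fraction_of_resources_gained(years_accelerated: float) -> float:
    """Approximate fractional change in available cosmic resources."""
    return FRACTION_PER_YEAR * years_accelerated


if __name__ == "__main__":
    # Print the approximate fraction for a few illustrative time spans.
    for years in (1, 10, 100):
        print(f"{years:>3} years -> {fraction_of_resources_gained(years):.0e}")
```

Under this assumption, even a century of acceleration changes available resources by only about one part in a hundred million, which is why the direct effect is treated as negligible from this perspective.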

I can see indirect effects going either way, but delay seems better on current margins (this might depend on how much optimism you have on current AI safety progress, governance/policy progress, and whether you think humanity retaining control relative to AIs is good or bad). All of these topics have been explored and discussed to some extent.

When focusing on the welfare/preferences of currently existing people, I think it's unclear whether accelerating AI looks good or bad; it depends on optimism about AI safety, how you trade off old people versus young people, and death via violence versus death from old age. (Misaligned AI takeover killing lots of people is by no means assured, but seems reasonably likely by default.)

I expect there hasn't been much investigation of accelerating AI to advance the preferences of currently existing people because this exists at a point on the crazy train that very few people are at. See also the curse of cryonics:

the "curse of cryonics" is when a problem is both weird and very important, but it's sitting right next to other weird problems that are even more important, so everyone who's able to notice weird problems works on something else instead.

Matthew_Barnett @ 2024-02-04T04:15 (+6)

Under purely longtermist views, accelerating AI by 1 year increases available cosmic resources by 1 part in 10 billion. This is tiny.

Tiny compared to what? Are you assuming we can take some other action whose consequences don't wash out over the long-term, e.g. because of a value lock-in? In general, these assumptions just seem quite weak and underspecified to me.

What exactly is the alternative action that has vastly greater value in expectation, and why does it have greater value? If what you mean is that we can try to reduce the risk of extinction instead, keep in mind that my first bullet point preempted pretty much this exact argument:

Unlike existential risk from other sources (e.g. an asteroid) AI x-risk is unique because humans would be replaced by other beings, rather than completely dying out. This means you can't simply apply a naive argument that AI threatens total extinction of value to make the case that AI safety is astronomically important, in the sense that you can for other x-risks. You generally need additional assumptions.

Ryan Greenblatt @ 2024-02-04T18:18 (+3)

What exactly is the alternative action that has vastly greater value in expectation, and why does it have greater value?

Ensuring human control throughout the singularity rather than having AIs get control very obviously has relatively massive effects. Of course, we can debate the sign here, I'm just making a claim about the magnitude.

I'm not talking about extinction of all smart beings on earth (AIs and humans), which seems like a small fraction of existential risk.

(Separately, the badness of such extinction seems maybe somewhat overrated because pretty likely intelligent life will just re-evolve in the next 300 million years. Intelligent life doesn't seem that contingent. Also aliens.)

Matthew_Barnett @ 2024-02-04T19:12 (+4)

For what it's worth, I think my reply to Pablo here responds to your comment fairly adequately too.

Pablo @ 2024-02-04T16:08 (+3)

I think it remains the case that the value of accelerating AI progress is tiny relative to other apparently available interventions, such as ensuring that AIs are sentient or improving their expected well-being conditional on their being sentient. The case for focusing on how a transformative technology unfolds, rather than on when it unfolds,^[1] seems robust to a relatively wide range of technologies and assumptions. Still, this seems worth further investigation.

^[1]
Indeed, it seems that when the transformation unfolds is primarily important because of how it unfolds, insofar as the quality of a transformation is partly determined by its timing.

Matthew_Barnett @ 2024-02-04T19:04 (+6)

I'm claiming that it is not actually clear that we can take actions that don't merely wash out over the long-term. In this case, you cannot simply assume that we can meaningfully and predictably affect how valuable the long-term future will be in, for example, billions of years. I agree that, yes, if you assume we can meaningfully affect the very long-run, then all actions that merely have short-term effects will have "tiny" impacts by comparison. But the assumption that we can meaningfully and predictably affect the long-run is precisely the thing that needs to be argued. I think it's important for EAs to try to be more rigorous about their empirical claims here.

Moreover, actions that have short-term effects can generally be assumed to have longer term effects if our actions propagate. For example, support for larger population sizes now would presumably increase the probability that larger population sizes exist in the very long run, compared to the alternative of smaller population sizes with high per capita incomes. It seems arbitrary to assume this effect will be negligible but then also assume other competing effects won't be negligible. I don't see any strong arguments for this position.

Pablo @ 2024-02-04T21:26 (+4)

I was trying to hint at prima facie plausible ways in which the present generation can increase the value of the long-term future by more than one part in billions, rather than “assume” that this is the case, though of course I never gave anything resembling a rigorous argument.

I do agree that the “washing out” hypothesis is a reasonable default and that one needs a positive reason for expecting our present actions to persist into the long-term. One seemingly plausible mechanism is influencing how a transformative technology unfolds: it seems that the first generation that creates AGI has significantly more influence on how much artificial sentience there is in the universe a trillion years from now than, say, the millionth generation. Do you disagree with this claim?

I’m not sure I understand the point you make in the second paragraph. What would be the predictable long-term effects of hastening the arrival of AGI in the short-term?

Matthew_Barnett @ 2024-02-04T21:43 (+4)

I was trying to hint at prima facie plausible ways in which the present generation can increase the value of the long-term future by more than one part in billions, rather than “assume” that this is the case, though of course I never gave anything resembling a rigorous argument.

As I understand, the argument originally given was that there was a tiny effect of pushing for AI acceleration, which seems outweighed by unnamed and gigantic "indirect" effects in the long-run from alternative strategies of improving the long-run future. I responded by trying to get more clarity on what these gigantic indirect effects actually are, how we can predictably bring them about, and why we would think it's plausible that we could bring them about in the first place. From my perspective, the shape of this argument looks something like:

Your action X has this tiny positive near-term effect (ETA: or a tiny direct effect)
My action Y has this large positive long-term effect (ETA: or a large indirect effect)
Therefore, Y is better than X.

Do you see the flaw here? Well, both X and Y could have long-term effects! So, it's not sufficient to compare the short-term effect of X to the long-term effect of Y. You need to compare both effects, on both time horizons. As far as I can tell, I haven't seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt's original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don't see what the exact argument is).

More generally, I think you're probably trying to point to some concept you think is obvious and clear here, and I'm not seeing it, which is why I'm asking you to be more precise and rigorous about what you're actually claiming.

I’m not sure I understand the point you make in the second paragraph. What would be the predictable long-term effect of hastening the arrival of AGI in the short-term?

In my original comment I pointed towards a mechanism. Here's a more precise characterization of the argument:

1. Total utilitarianism generally supports, all else being equal, larger population sizes with low per capita incomes over small population sizes with high per capita incomes.
2. To the extent that our actions do not "wash out", it seems reasonable to assume that pushing for large population sizes now would make it more likely in the long-run that we get large population sizes with low per capita incomes compared to a small population size with high per capita incomes. (Keep in mind here that I'm not making any claim about the total level of resources.)

To respond to this argument you could say that in fact our actions do "wash out" here, so as to make the effect of pushing for larger population sizes rather small in the long run. But in response to that argument, I claim that this objection can be reversed and applied to almost any alternative strategy for improving the future that you might think is actually better. (In other words, I actually need to see your reasons for why there's an asymmetry here; and I don't currently see these reasons.)

Alternatively, you could just say that total utilitarianism is unreasonable and a bad ethical theory, but my original comment was about analyzing the claim about accelerating AI from the perspective of total utilitarianism, which, as a theory, seems to be relatively popular among EAs. So I'd prefer to keep this discussion grounded within that context.

Pablo @ 2024-02-05T00:18 (+2)

Thanks for the clarification.

Yes, I agree that we should consider the long-term effects of each intervention when comparing them. I focused on the short-term effects of hastening AI progress because it is those effects that are normally cited as the relevant justification in EA/utilitarian discussions of that intervention. For instance, those are the effects that Bostrom considers in ‘Astronomical waste’. Conceivably, there is a separate argument that appeals to the beneficial long-term effects of AI capability acceleration. I haven’t considered this argument because I haven’t seen many people make it, so I assume that accelerationist types tend to believe that the short-term effects dominate.

Matthew_Barnett @ 2024-02-05T00:34 (+2)

I think Bostrom's argument merely compares a pure x-risk (such as a huge asteroid hurtling towards Earth) relative to technological acceleration, and then concludes that reducing the probability of a pure x-risk is more important because the x-risk threatens the eventual colonization of the universe. I agree with this argument in the case of a pure x-risk, but as I noted in my original comment, I don't think that AI risk is a pure x-risk.

If, by contrast, all we're doing by doing AI safety research is influencing something like "the values of the agents in society in the future" (and not actually influencing the probability of eventual colonization), then this action seems to plausibly just wash out in the long-term. In this case, it seems very appropriate to compare the short-term effects of AI safety to the short-term effects of acceleration.

Let me put it another way. We can think about two (potentially competing) strategies for making the future better, along with their relevant short and possible long-term effects:

Doing AI safety research
- Short-term effect: makes it more likely that AIs are kind to current or near-future humans
- Possible long-term effect: makes it more likely that AIs in the very long-run will share the values of the human species, relative to some unaligned alternative
Accelerating AI
- Short-term effect: helps current humans by hastening the arrival of advanced technology
- Possible long-term effect: makes it more likely that we have a large population size at low per capita incomes, relative to a low population size with high per capita income

My opinion is that both of these long-term effects are very speculative, so it's generally better to focus on a heuristic of doing what's better in the short-term, while keeping the long-term consequences in mind. And when I do that, I do not come to a strong conclusion that AI safety research "beats" AI acceleration, from a total utilitarian perspective.

Ryan Greenblatt @ 2024-02-04T21:59 (+1)

Your action X has this tiny positive near-term effect.
My action Y has this large positive long-term effect.
Therefore, Y is better than X.

To be clear, this wasn't the structure of my original argument (though it might be Pablo's). My argument was more like "you seem to be implying that action X is good because of its direct effect (literal first order acceleration), but actually the direct effect is small when considered in a particular perspective (longtermism), so for that perspective we need to consider indirect effects and the analysis for that looks pretty different".

Note that I wasn't really trying to argue much about the sign of the indirect effect, though people have indeed discussed this in some detail in various contexts.

Matthew_Barnett @ 2024-02-04T22:08 (+3)

I agree your original argument was slightly different than the form I stated. I was speaking too loosely, and conflated what I thought Pablo might be thinking with what you stated originally.

I think the important claim from my comment is "As far as I can tell, I haven't seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt's original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don't see what the exact argument is)."

Ryan Greenblatt @ 2024-02-04T22:24 (+1)

I think the important claim from my comment is "As far as I can tell, I haven't seen any argument in this thread that analyzed and compared the long-term effects in any detail, except perhaps in Ryan Greenblatt's original comment, in which he linked to some other comments about a similar topic in a different thread (but I still don't see what the exact argument is)."

Explicitly confirming that this seems right to me.

Ryan Greenblatt @ 2024-02-04T21:56 (+1)

Moreover, actions that have short-term effects can generally be assumed to have longer term effects if our actions propagate.

I don't disagree with this. I was just claiming that the "indirect" effects dominate (by indirect, I just mean effects other than shifting the future closer in time).

There is still the question of indirect/direct effects.

Matthew_Barnett @ 2024-02-04T21:58 (+4)

I was just claiming that the "indirect" effects dominate (by indirect, I just mean effects other than shifting the future closer in time).

I understand that. I wanted to know why you thought that. I'm asking for clarity. I don't currently understand your reasons. See this recent comment of mine for more info.

Ryan Greenblatt @ 2024-02-04T22:06 (+1)

(I don't think I'm going to engage further here, sorry.)

Arepo @ 2024-02-08T04:01 (+10)

I generally agree that we should be more concerned about this. In particular, I find people who will happily approve Shut Up and Multiply sentiment but reject this consideration suspect in their reasoning.

A more extreme version of this is that, given the massively greater efficiency with which a digital consciousness could convert matter and energy to utilons (IIRC naively about 3 orders of magnitude according to Bostrom, before any increase from greater coordination), on strict expected value reasoning you have to be extremely confident that this won't happen - or at least have a much stronger rebuttal than 'AI won't necessarily be conscious'.

Separately, I think there might be a case for accelerationism even if you think it increases the risk of AI takeover and that AI takeover is bad, on the grounds that in many scenarios advancing faster might still increase the probability of human descendants getting through the time of perils before some other threat destroys us (every year we remain in our current state is another year in which we run the risk of, for example, a global nuclear war or civilisation-ending pandemic).

Vasco Grilo @ 2024-02-11T15:22 (+4)

Hi,

A more extreme version of this is that, given the massively greater efficiency with which a digital consciousness could convert matter and energy to utilons

I have a post where I conclude the above may well apply not only to digital consciousness, but also to animals:

I calculated the welfare ranges per calorie consumption for a few species.
They vary a lot. The values for bees and pigs are 4.88 k and 0.473 times as high as that for humans.
They are higher for non-human animals:
5 of the 6 species I analysed have values higher than that of humans.
The lower the calorie consumption, the higher the median welfare range per calorie consumption.

Ben Millwood @ 2024-02-03T19:59 (+8)

A lot of these points seem like arguments that it's possible that unaligned AI takeover will go well, e.g. there's no reason not to think that AIs are conscious, or will have interesting moral values, or etc.

My stance is that we (more-or-less) know humans are conscious and have moral values that, while they have failed to prevent large amounts of harm, seem to have the potential to be good. AIs may be conscious and may have welfare-promoting values, but we don't know that yet. We should try to better understand whether AIs are worthy successors before transitioning power to them.

Probably a core point of disagreement here is whether, presented with a "random" intelligent actor, we should expect it to promote welfare or prevent suffering "by default". My understanding is that some accelerationists believe that we should. I believe that we shouldn't. Moreover I believe that it's enough to be substantially uncertain about whether this is or isn't the default to want to take a slower and more careful approach.

Matthew_Barnett @ 2024-02-04T20:55 (+10)

My stance is that we (more-or-less) know humans are conscious and have moral values that, while they have failed to prevent large amounts of harm, seem to have the potential to be good.

I claim there's a weird asymmetry here where you're happy to put trust into humans because they have the "potential" to do good, but you're not willing to say the same for AIs, even though they seem to have the same type of "potential".

Whatever your expectations about AIs, we already know that humans are not blank slates that may or may not be altruistic in the future: we actually have a ton of evidence about the quality and character of human nature, and it doesn't make humans look great. Humans are not mainly described as altruistic creatures. I mentioned factory farming in my original comment, but one can examine the way people spend their money (i.e. not mainly on charitable causes), or the history of genocides, war, slavery, and oppression for additional evidence.

Probably a core point of disagreement here is whether, presented with a "random" intelligent actor, we should expect it to promote welfare or prevent suffering "by default".

I don't expect humans to "promote welfare or prevent suffering" by default either. Look at the current world. Have humans, on net, reduced or increased suffering? Even if you think humans have been good for the world, it's not obvious. Sure, it's easy to dismiss the value of unaligned AIs if you compare against some idealistic baseline; but I'm asking you to compare against a realistic baseline, i.e. actual human nature.

Ben Millwood @ 2024-02-04T22:33 (+10)

It seems like you're just substantially more pessimistic than I am about humans. I think factory farming will be ended, and though it seems like humans have caused more suffering than happiness so far, I think their default trajectory will be to eventually stop doing that, and to ultimately do enough good to outweigh their ignoble past. I don't think this is certain by any means, but I think it's a reasonable extrapolation. (I maybe don't expect you to find it a reasonable extrapolation.)

Meanwhile I expect the typical unaligned AI may seize power for some purpose that seems to us entirely trivial, and may be uninterested in doing any kind of moral philosophy, and/or may not place any terminal (rather than instrumental) value in paying attention to other sentient experiences in any capacity. I do think humans, even with their kind of terrible track record, are more promising than that baseline, though I can see why other people might think differently.

Ben Millwood @ 2024-02-05T12:23 (+7)

Sure, it's easy to dismiss the value of unaligned AIs if you compare against some idealistic baseline; but I'm asking you to compare against a realistic baseline, i.e. actual human nature.

I haven't read your entire post about this, but I understand you believe that if we created aligned AI, it would get essentially "current" human values, rather than e.g. some improved / more enlightened iteration of human values. If instead you believed the latter, that would set a significantly higher bar for unaligned AI, right?

Matthew_Barnett @ 2024-02-05T23:54 (+7)

If instead you believed the latter, that would set a significantly higher bar for unaligned AI, right?

That's right, if I thought human values would improve greatly in the face of enormous wealth and advanced technology, I'd definitely be open to seeing humans as special and extra valuable from a total utilitarian perspective. Note that many routes through which values could improve in the future could apply to unaligned AIs too. So, for example, I'd need to believe that humans would be more likely to reflect, and be more likely to do the right type of reflection, relative to the unaligned baseline. In other words it's not sufficient to argue that humans would reflect a little bit; that wouldn't really persuade me at all.

elifland @ 2024-02-03T08:07 (+8)

(edit: my point is basically the same as emre's)

I think there is very likely at some point going to be some sort of transition to a world where AIs are effectively in control. It seems worth it to slow down on the margin to try to shape this transition as best we can, especially slowing it down as we get closer to AGI and ASI. It would be surprising to me if making the transfer of power more voluntary/careful led to worse outcomes (or only led to slightly better outcomes such that the downsides of slowing down a bit made things worse).

Delaying the arrival of AGI by a few years as we get close to it seems good regardless of parameters like the value of involuntary-AI-disempowerment futures. But delaying the arrival by 100s of years seems more likely bad due to the tradeoff with other risks.

Matthew_Barnett @ 2024-02-03T08:15 (+6)

It would be surprising to me if making the transfer of power more voluntary/careful led to worse outcomes (or only led to slightly better outcomes such that the downsides of slowing down a bit made things worse).

Two questions here:

1. Why would accelerating AI make the transition less voluntary? (In my own mind, I'd be inclined to reverse this sentiment a bit: delaying AI by regulation generally involves forcibly stopping people from adopting AI. Force might be justified if it brings about a greater good, but that's not the argument here.)
2. I can understand being "careful". Being careful does seem like a good thing. But "being careful" generally trades off against other values in almost every domain I can think of, and there is such a thing as too much of a good thing. What reason is there to think that pushing for "more caution" is better on the margin compared to acceleration, especially considering society's default response to AI in the absence of intervention?

elifland @ 2024-02-03T21:18 (+4)

1. So in the multi-agent slowly-replacing case, I'd argue that individual decisions don't necessarily represent a voluntary decision on behalf of society (I'm imagining something like this scenario). In the misaligned power-seeking case, it seems obvious to me that this is involuntary. I agree that it technically could be a collective voluntary decision to hand over power more quickly, though (and in that case I'd be somewhat less against it).
2. I think emre's comment lays out the intuitive case for being careful / taking your time, as does Ryan's. I think the empirics are a bit messy once you take into account benefits of preventing other risks but I'd guess they come out in favor of delaying by at least a few years.

Ryan Greenblatt @ 2024-02-03T18:17 (+7)

A presumption in favor of human values over unaligned AI values for some reasons that aren't based on strict impartial utilitarian arguments. These could include the beliefs that: (1) Humans are more likely to have "interesting" values compared to AIs, and (2) Humans are more likely to be motivated by moral arguments than AIs, and are more likely to reach a deliberative equilibrium of something like "ideal moral values" compared to AIs.

I don't think this is a crux. Even if you prefer unaligned AI values over likely human values (weighted by power), you'd probably prefer doing research on further improving AI values over speeding things up.

Isaac Dunn @ 2024-02-03T17:31 (+6)

I think misaligned AI values should be expected to be worse than human values, because it's not clear that misaligned AI systems would care about eg their own welfare.

Inasmuch as we expect misaligned AI systems to be conscious (or whatever we need to care about them) and also to be good at looking after their own interests, I agree that it's not clear from a total utilitarian perspective that the outcome would be bad.

But the "values" of a misaligned AI system could be pretty arbitrary, so I don't think we should expect that.

JWS @ 2024-02-08T13:07 (+5)

So I think it's likely you have some very different beliefs from most people/EAs/myself, particularly:

Thinking that humans/humanity is bad, and AI is likely to be better
Thinking that humanity isn't driven by ideational/moral concerns^[1]
That AI is very likely to be conscious, moral (as in, making better moral judgements than humans), and that the current/default trend in the industry is very likely to make them conscious moral agents in a way humans aren't

I don't know if the total utilitarian/accelerationist position in the OP is yours or not. I think Daniel is right that most EAs don't have this position. I think maybe Peter Singer gets closest to this in his interview with Tyler on the 'would you side with the aliens or not' question here. But the answer to your descriptive question is simply that most EAs don't have the combination of moral and empirical views about the world to make the argument you present valid and sound, so that's why there isn't much talk in EA about naïve accelerationism.

Going off the vibe I get from this view though, I think it's a good heuristic that if your moral view sounds like a movie villain's monologue it might be worth reflecting, and a lot of this post reminded me of the Earth-Trisolaris Organisation from Cixin Liu's Three Body Problem. If someone's honest moral view is "Eliminate human tyranny! The world belongs to ~~Trisolaris~~ AIs!" then I don't know what else there is to do except quote Zvi's phrase "please speak directly into this microphone".

Another big issue I have with this post is that some of the counter-arguments just seem a bit like 'nu-uh', see:

But why would we assume AIs won't be conscious?
Why would humans be more likely to have "interesting" values than AIs?
But it would also be bad if we all died from old age while waiting for AI, and missed out on all the benefits that AI offers to humans, which is a point in favor of acceleration. Why would this heuristic be weaker?

These (and other examples) are considerations for sure, but they need to be argued for. I don't think they can just be stated and then say "therefore, ACCELERATE!". I agree that AI Safety research needs to be more robust and the philosophical assumptions and views made more explicit, but one could already think of some counters to the questions that you raise, and I'm sure you already have them. For example, you might take a view (à la Peter Godfrey-Smith) that a certain biological substrate is necessary for consciousness.

Similarly, on total utilitarianism's emphasis on larger population sizes: agreed, to the extent that a greater population increases total utility, but this is the repugnant conclusion again. There's a stopping point even in that scenario where an ever larger population decreases total utility, which is why in Parfit's scenario it's full of potatoes and muzak rather than humans crammed into battery cages like factory-farmed animals. Empirically, naïve accelerationism may tend toward the latter case in practice, even if there's a theoretical case to be made for it.

There's more I could say, but I don't want to make this reply too long, and I think as Nathan said it's a point worth discussing. Nevertheless it seems our different positions on this are built on some wide, fundamental divisions about reality and morality itself, and I'm not sure how those can be bridged, unless I've wildly misunderstood your position.

^[1]
this is me-specific

Matthew_Barnett @ 2024-02-08T18:46 (+9)

I don't think humanity is bad. I just think people are selfish, and generally driven by motives that look very different from impartial total utilitarianism. AIs (even potentially "random" ones) seem about as good in expectation, from an impartial standpoint. In my opinion, this view becomes even stronger if you recognize that AIs will be selected on the basis of how helpful, kind, and useful they are to users. (Perhaps notice how different these selection criteria are from the evolutionary criteria used to evolve humans.)

I understand that most people are partial to humanity, which is why they generally find my view repugnant. But my response to this perspective is to point out that if we're going to be partial to a group on the basis of something other than utilitarian equal consideration of interests, it makes little sense to choose to be partial to the human species as opposed to the current generation of humans or even myself. And if we take this route, accelerationism seems even more strongly supported than before, since developing AI and accelerating technological progress seems to be the best chance we have of preserving the current generation against aging and death. If we all died, and a new generation of humans replaced us, that would certainly be pretty bad for us.

Which sounds more like a movie villain's monologue?

The idea that everyone currently living needs to be sacrificed, and die, in order to preserve the human species
The idea that we should try to preserve currently living people, even if that means taking on a greater risk of not preserving the values of the human species

To be clear, I also just totally disagree with the heuristic that "if your moral view sounds like a movie villain's monologue it might be worth reflecting". I don't think that fiction is generally a great place for learning moral philosophy, albeit with some notable exceptions.

Anyway, the answer to these moral questions may seem obvious to you, but I don't think they're as obvious as you're making them seem.

Ryan Greenblatt @ 2024-02-08T19:50 (+6)

I understand that most people are partial to humanity, which is why they generally find my view repugnant.

This is not why people disagree IMO.

Matthew_Barnett @ 2024-02-08T22:36 (+7)

I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me. But, fair enough, I exaggerated a bit. My true belief is a more moderate version of that claim.

When discussing why EAs in particular disagree with me, to overgeneralize by a fair bit, I've noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they'd be happy to accept humans as independent moral agents (e.g. newborns) into our society. I'd call this "being partial to humanity", or at least, "being partial to the values of the human species".

(In my opinion, this partiality seems so prevalent and deep in most people that to deny it seems a bit like a fish denying the existence of water. But I digress.)

To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:

"a society of humans who are very similar to us"
"a society of people who look & act like humans, but each of them only cares about their family"
"a society of people who look & act like humans, but they only care about maximizing paperclips"

I emphasized that in each case, the people are human-level in their intelligence, and also biological.

The results are preliminary (and I'm not linking here to avoid biasing the results, as voting has not yet finished), but so far my followers, who are mostly EAs, are much more happy to let the humans immigrate to our world, compared to the last two options. I claim there just aren't really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.

My guess is that if people are asked to defend their choice explicitly, they'd largely talk about some inherent altruism or hope they place in the human species, relative to the other options; and this still looks like "being partial to humanity", as far as I can tell, from almost any reasonable perspective.

Ryan Greenblatt @ 2024-02-09T18:00 (+11)

I think the fact that people are partial to humanity explains a large fraction of the disagreement people have with me.

Maybe, it's hard for me to know. But I predict most of the pushback you're getting from relatively thoughtful longtermists isn't due to this.

I've noticed that EAs are happy to concede that AIs could be moral patients, but are generally reluctant to admit AIs as moral agents, in the way they'd be happy to accept humans as independent moral agents (e.g. newborns) into our society.

I agree with this.

I'd call this "being partial to humanity", or at least, "being partial to the values of the human species".

I think "being partial to humanity" is a bad description of what's going on because (e.g.) these same people would be considerably more on board with aliens. I think the main thing going on is that people have some (probably mistaken) levels of pessimism about how AIs would act as moral agents which they don't have about (e.g.) aliens.

To test this hypothesis, I recently asked three questions on Twitter about whether people would be willing to accept immigration through a portal to another universe from three sources:
"a society of humans who are very similar to us"
"a society of people who look & act like humans, but each of them only cares about their family"
"a society of people who look & act like humans, but they only care about maximizing paperclips"
...
I claim there just aren't really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.

This comparison seems to me to be missing the point. Minimally I think what's going on is not well described as "being partial to humanity".

Here's a comparison I prefer:

A society of humans who are very similar to us.
A society of humans who are very similar to us in basically every way, except that they have a genetically-caused and strong terminal preference for maximizing the total expected number of paper clips (over the entire arc of history) and only care about other things instrumentally. They are sufficiently committed to paper clip maximization that this will persist on arbitrary reflection (e.g. they'd lock in this view immediately when given this option) and let's also suppose that this view is transmitted genetically and in a gene-drive-y way such that all of their descendants will also only care about paper clips. (You can change paper clips to basically anything else which is broadly recognized to have no moral value on its own, e.g. gold twisted into circles.)
A society of beings (e.g. aliens) who are extremely different in basically every way to humans except that they also have something pretty similar to the concepts of "morality", "pain", "pleasure", "moral patienthood", "happiness", "preferences", "altruism", and "careful reasoning about morality (moral thoughtfulness)". And the society overall also has a roughly similar relationship with these concepts (e.g. the level of "altruism" is similar). (Note that having the same relationship as humans to these concepts is a pretty low bar! Humans aren't that morally thoughtful!)

I think I'm almost equally happy with (1) and (3) on this list and quite unhappy with (2).

If you changed (3) to instead be "considerably more altruistic", I would prefer (3) over (1).

I think it seems weird to describe my views on the comparison I just outlined as "being partial to humanity": I actually prefer (3) over (2) even though (2) are literally humans!

(Also, I'm not that committed to having concepts of "pain" and "pleasure", but I'm relatively committed to having concepts which are something like "moral patienthood", "preferences", and "altruism".)

Below is a mild spoiler for a story by Eliezer Yudkowsky:

To make the above comparison about different beings more concrete, in the case of Three Worlds Collide, I would basically be fine giving the universe over to the super-happies relative to humans (I think mildly better than humans?) and I think it seems only mildly worse than humans to hand it over to the baby-eaters. In both cases, I'm pricing in some amount of reflection and uplifting which doesn't happen in the actual story of Three Worlds Collide, but would likely happen in practice. That is, I'm imagining seeing these societies prior to their singularity and then, based on just observations of their societies at this point, deciding how good they are (pricing in the fact that the society might change over time).

Ryan Greenblatt @ 2024-02-09T18:07 (+3)

To be clear, it seems totally reasonable to call this "being partial to some notion of moral thoughtfulness about pain, pleasure, and preferences", but these concepts don't seem that "human" to me. (I predict these occur pretty frequently in evolved life that reaches a singularity for instance. And they might occur in AIs, but I expect misaligned AIs which seize control of the world are worse from my perspective than if humans retain control.)

Matthew_Barnett @ 2024-02-09T20:15 (+4)

When I say that people are partial to humanity, I'm including an irrational bias towards thinking that humans, or evolved beings, are unusually thoughtful or ethical compared to the alternatives (I believe this is in fact an irrational bias, since the arguments I've seen for thinking that unaligned AIs will be less thoughtful or ethical than aliens seem very weak to me).

In other cases, when people irrationally hold a certain group X to a higher standard than a group Y, it is routinely described as "being partial to group Y over group X". I think this is just what "being partial" means, in an ordinary sense, across a wide range of cases.

For example, if I proposed aligning AI to my local friend group, with the explicit justification that I thought my friends are unusually thoughtful, I think this would be well-described as me being "partial" to my friend group.

To the extent you're seeing me as saying something else about how longtermists view the argument, I suspect you're reading me as saying something stronger than what I originally intended.

Ryan Greenblatt @ 2024-02-09T23:19 (+3)

In that case, my main disagreement is with the idea that your Twitter poll is evidence for your claims.

More specifically:

I claim there just aren't really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.

Like you claim there aren't any defensible reasons to think that what humans will do is better than literally maximizing paper clips? This seems totally wild to me.

Matthew_Barnett @ 2024-02-09T23:59 (+4)

Like you claim there aren't any defensible reasons to think that what humans will do is better than literally maximizing paper clips?

I'm not exactly sure what you mean by this. There were three options, and human paperclippers were only one of these options. I was mainly discussing the choice between (1) and (2) in the comment, not between (1) and (3).

Here's my best guess at what you're saying: it sounds like you're repeating that you expect humans to be unusually altruistic or thoughtful compared to an unaligned alternative. But the point of my previous comment was to state my view that this bias counted as "being partial towards humanity", since I view the bias as irrational. In light of that, what part of my comment are you objecting to?

To be clear, you can think the bias I'm talking about is actually rational; that's fine. But I just disagree with you for pretty mundane reasons.

[Incorporating what you said in the other comment]

Also, to be clear, I agree that the question of "how much worse/better is it for AIs to get vast amounts of resources without human society intending to grant those resources to the AIs from a longtermist perspective" is underinvestigated, but I think there are pretty good reasons to systematically expect human control to be a decent amount better.

Then I think it's worth concretely explaining what these reasons are to believe that human control will be a decent amount better in expectation. You don't need to write this up yourself, of course. I think the EA community should write these reasons up. Because I currently view the proposition as non-obvious, and despite being a critical belief in AI risk discussions, it's usually asserted without argument. When I've pressed people in the past, they typically give very weak reasons.

I don't know how to respond to an argument whose details are omitted.

Ryan Greenblatt @ 2024-02-10T00:07 (+5)

Then I think it's worth concretely explaining what these reasons are to believe that human control will be a decent amount better in expectation. You don't need to write this up yourself, of course.

+1, but I don't generally think it's worth counting on "the EA community" to do something like this. I've been vaguely trying to pitch Joe on doing something like this (though there are probably better uses of his time) and his recent blog posts touch on similar topics.

Ryan Greenblatt @ 2024-02-10T00:16 (+1)

Also, this is usually a crux only for longtermists, which is probably one of the reasons why no one has gotten around to this.

Ryan Greenblatt @ 2024-02-10T00:05 (+1)

I was mainly discussing the choice between (1) and (2) in the comment, not between (1) and (3).

You didn't make this clear, so I was just responding generically.

Separately, I think I feel a pretty similar intuition for case (2): people literally only caring about their families seems pretty clearly worse.

Ryan Greenblatt @ 2024-02-10T00:03 (+1)

Here's my best guess at what you're saying: it sounds like you're repeating that you expect humans to be unusually altruistic or thoughtful compared to an unaligned alternative.

There, I'm just saying that human control is better than literal paperclip maximization.

Matthew_Barnett @ 2024-02-10T00:06 (+4)

This response still seems underspecified to me. Is the default unaligned alternative paperclip maximization in your view? I understand that Eliezer Yudkowsky has given arguments for this position, but it seems like you diverge significantly from Eliezer's general worldview, so I'd still prefer to hear this take spelled out in more detail from your own point of view.

Ryan Greenblatt @ 2024-02-10T00:09 (+1)

Your poll says:

"a society of people who look & act like humans, but they only care about maximizing paperclips"

And then you say:

so far my followers, who are mostly EAs, are much more happy to let the humans immigrate to our world, compared to the last two options. I claim there just aren't really any defensible reasons to maintain this choice other than by implicitly appealing to a partiality towards humanity.

So, I think more human control is better than more literal paperclip maximization, the option given in your poll.

My overall position isn't that the AIs will certainly be paperclippers, I'm just arguing in isolation about why I think the choice given in the poll is defensible.

Matthew_Barnett @ 2024-02-10T00:19 (+2)

I have the feeling we're talking past each other a bit. I suspect talking about this poll was kind of a distraction. I personally have the sense of trying to convey a central point, and instead of getting the point across, I feel the conversation keeps slipping into talking about how to interpret minor things I said, which I don't see as very relevant.

I will probably take a break from replying for now, for these reasons, although I'd be happy to catch up some time and maybe have a call to discuss these questions in more depth. I definitely see you as trying a lot harder than most other EAs in trying to make progress on these questions collaboratively with me.

JWS @ 2024-02-10T11:54 (+4)

I'd be very happy to have some discussion on these topics with you Matthew. For what it's worth, I really have found much of your work insightful, thought-provoking, and valuable. I think I just have some strong, core disagreements on multiple empirical/epistemological/moral levels with your latest series of posts.

That doesn't mean I don't want you to share your views, or that they're not worth discussion, and I apologise if I came off as too hostile. An open invitation to have some kind of deeper discussion stands.^[1]

^[1] I'd like to try out the new dialogue feature on the Forum, but that's a weak preference.

Ryan Greenblatt @ 2024-02-10T00:21 (+3)

I suspect talking about this poll was kind of a distraction.

Agreed, sorry about that.

Ryan Greenblatt @ 2024-02-09T23:42 (+1)

Also, to be clear, I agree that the question of "how much worse/better is it for AIs to get vast amounts of resources without human society intending to grant those resources to the AIs from a longtermist perspective" is underinvestigated, but I think there are pretty good reasons to systematically expect human control to be a decent amount better.

Robi Rahman @ 2024-02-05T16:27 (+4)

Under preference utilitarianism, it doesn't necessarily matter whether AIs are conscious.

I'm guessing preference utilitarians would typically say that only the preferences of conscious entities matter. I doubt any of them would care about satisfying an electron's "preference" to be near protons rather than ionized.

Matthew_Barnett @ 2024-02-05T19:24 (+2)

I'm guessing preference utilitarians would typically say that only the preferences of conscious entities matter.

Perhaps. I don't know what most preference utilitarians believe.

I doubt any of them would care about satisfying an electron's "preference" to be near protons rather than ionized.

Are you familiar with Brian Tomasik? (He's written about suffering of fundamental particles, and also defended preference utilitarianism.)

Nathan_Barnard @ 2024-02-03T14:16 (+4)

Strongly agree that there should be more explicit defences of this argument.

One way of doing this in a co-operative way might be working on co-operative AI stuff, since it seems to increase the likelihood that misaligned AI goes well, or at least less badly.

Pivocajs @ 2024-02-07T17:42 (+3)

My personal reason for not digging into this is my naive model of how good the AI future is: quality_of_future * amount_of_the_stuff. And there is a distinction I haven't seen you acknowledge: while high "quality" doesn't require humans to be around, I ultimately judge quality by my values. (Things being conscious is an example. But this also includes things like not copy-pasting the same thing all over, not wiping out aliens, and presumably many other things I am not aware of. IIRC Yudkowsky talks about cosmopolitanism being a human value.) Because of this, my impression is that if we hand over the future to a random AI, the "quality" will be very low. And so we can currently have a much larger impact by focusing on increasing the quality, which we can do by delaying "handing over the future to AI" and picking a good AI to hand over to. I.e., alignment.

(Still, I agree it would be nice if there was a better analysis of this, which exposed the assumptions.)

Matthew_Barnett @ 2024-02-07T22:30 (+5)

And there is a distinction I haven't seen you acknowledge: while high "quality" doesn't require humans to be around, I ultimately judge quality by my values.

Is there any particular reason why you are partial towards humans generically controlling the future, relative to this particular current generation of humans? To me, it seems like being partial to one's own values, one's community, and especially one's own life, generally leads to an even stronger argument for accelerationism, since the best way to advance your own values is generally to actually "be there" when AI happens.

In my opinion, the main relevant alternative to this view is to be partial to the human species, as opposed to being partial to either one's current generation, or oneself. And I think the human species is kind of a weird category to be partial to, relative to those other things. Do you disagree?

Pivocajs @ 2024-02-21T19:21 (+1)

In my opinion, the main relevant alternative to this view is to be partial to the human species, as opposed to being partial to either one's current generation, or oneself. And I think the human species is kind of a weird category to be partial to, relative to those other things. Do you disagree?

I agree with this.

the best way to advance your own values is generally to actually "be there" when AI happens.

I (strongly) disagree with this. Me being alive is a relatively small part of my values. And since I am not the director of the world, me personally being around to influence things is unlikely to have a decisive impact on things I value.

In more detail: Sure, all else being equal, me being there when AI happens is mildly helpful. But the outcome of building AI seems to be a function of, among other things, (i) values of the people building it + (ii) how much reflection they can do on those values + (iii) the environment dynamics these people are subject to (e.g., the current race dynamics between AI companies). And over time, I expect the potential decrease in (i) to be far outweighed by gains in (ii) and (iii).

The first issue is about (i), that it is not actually me building the AGI, either now or in the future. But I am willing to grant that (all else being equal) the current generation is more likely to have values closer to my values.
However, I expect that factors (ii) and (iii) are just as influential. Regarding (ii), it seems we keep making progress at philosophy, ethics, etc, and to me, this currently far outweighs the value drift in (i).
Regarding (iii), my impression is that the current situation is so bad that it can't get much worse, and we might as well wait. This of course depends on how likely you think we are to get a bad outcome if we either (a) get superintelligence without additional progress on alignment or (b) get widespread human-level AI with no progress on alignment, institution design, etc.

Matthew_Barnett @ 2024-02-21T22:59 (+2)

Me being alive is a relatively small part of my values.

I agree some people (such as yourself) might be extremely altruistic, and therefore might not care much about their own life relative to other values they hold, but this position is fairly uncommon. Most people care a lot about their own lives (and especially the lives of their family and friends) relative to other things they care about. We can empirically test this hypothesis by looking at how people choose to spend their time and money; and the results are generally that people spend their money on themselves, their family and their friends.

since I am not the director of the world, me personally being around to influence things is unlikely to have a decisive impact on things I value.

You don't need to be director of the world to have influence over things. You can just be a small part of the world to have influence over things that you care about. This is essentially what you're already doing by living and using your income to make decisions, to satisfy your own preferences. I'm claiming this situation could and probably will persist into the indefinite future, for the agents that exist in the future.

I'm very skeptical that there will ever be a moment in time during which there will be a "director of the world", in a strong sense. And I doubt the developer of the first AGI will become the director of the world, even remotely (including versions of them that reflect on moral philosophy etc.). You might want to read my post about this.

Vasco Grilo @ 2024-02-11T16:02 (+2)

Great points, Matthew! I have wondered about this too. Relatedly, readers may want to check the sequence Otherness and control in the age of AGI from Joe Carlsmith, in particular, Does AI risk “other” the AIs?.

One potential argument against accelerating AI is that it will increase the chance of catastrophes which will then lead to overregulating AI (e.g. in the same way that nuclear power arguably was overregulated).

Matthew_Barnett @ 2025-02-13T04:06 (+45)

A reflection on the posts I have written in the last few months, elaborating on my views

In a series of recent posts, I have sought to challenge the conventional view among longtermists that prioritizes the empowerment or preservation of the human species as the chief goal of AI policy. It is my opinion that this view is likely rooted in a bias that automatically favors human beings over artificial entities—thereby sidelining the idea that future AIs might create equal or greater moral value than humans—and treating this alternative perspective with unwarranted skepticism.

I recognize that my position is controversial and likely to remain unpopular among effective altruists for a long time. Nevertheless, I believe it is worth articulating my view at length, as I see it as a straightforward application of standard, common-sense utilitarian principles that merely lead to an unpopular conclusion. I intend to continue elaborating on my arguments in the coming months.

My view follows from a few basic premises. First, that future AI systems are quite likely to be moral patients; second, that we shouldn’t discriminate against them based on arbitrary distinctions, such as their being instantiated on silicon rather than carbon, or having been created through deep learning rather than natural selection. If we insist on treating AIs fundamentally differently from a human child or adult—for example, by regarding them merely as property to be controlled or denying them the freedom to pursue their own goals—then we should identify a specific ethical reason for our approach that goes beyond highlighting their non-human nature.

Many people have argued that consciousness is the key quality separating humans from AIs, thus rendering any AI-based civilization morally insignificant compared to ours. They maintain that consciousness has relatively narrow boundaries, perhaps largely confined to biological organisms, and would only arise in artificial systems under highly specific conditions—for instance, if one were to emulate a human mind in digital form. While I acknowledge that this perspective is logically coherent, I find it deeply unconvincing. The AIs I am referring to when I write about this topic would almost certainly not be simplistic, robotic automatons; rather, they would be profoundly complex, sophisticated entities whose cognitive abilities rival or exceed those of the human brain. For anyone who adopts a functionalist view of consciousness, it seems difficult to be confident that such AIs would lack a rich inner experience.

Because functionalism and preference utilitarianism—both of which could grant moral worth to AI preferences even if they do not precisely replicate biological states—have at least some support within the EA community, I remain hopeful that, if I articulate my position clearly, EAs who share these philosophical assumptions will see its merits.

That said, I am aware that explaining this perspective is an uphill battle. The unpopularity of my views often makes it difficult to communicate without instant misunderstandings; critics seem to frequently conflate my arguments with other, simpler positions that can be more easily dismissed. At times, this has caused me to feel as though the EA community is open to only a narrow range of acceptable ideas. This reaction, while occasionally frustrating, does not surprise me, as I have encountered similar resistance when presenting other unpopular views—such as challenging the ethics of purchasing meat in social contexts where such concerns are quickly deemed absurd.

However, the unpopularity of these ideas also creates a benefit: it creates room for rapid intellectual progress by opening the door to new and interesting philosophical questions about AI ethics. If we free ourselves from the seemingly unquestionable premise that preserving the human species should be the top priority when governing AI development, we can begin to ask entirely new and neglected questions about the role of artificial minds in society.

These questions include: what social and legal frameworks should we pursue if AIs are seen not as dangerous tools to be contained but as individuals on similar moral footing with humans? How do we integrate AI freedom and autonomy into our vision of the future, creating the foundation for a system of ethical and pragmatic AI rights?

Under this alternative philosophical approach, policy would not focus solely on minimizing risks to humanity. Instead, it would emphasize cooperation and inclusion, seeing advanced AI as a partner rather than an ethical menace to be tightly restricted or controlled. This undoubtedly requires a significant shift in our longtermist thinking, demanding a re-examination of deeply rooted assumptions. Such a project cannot be completed overnight, but given the moral stakes and the rapid progress in AI, I view this philosophical endeavor as both urgent and exciting. I invite anyone open to rethinking these foundational premises to join me in exploring how we might foster a future in which AIs and humans coexist as moral peers, cooperating for mutual benefit rather than viewing each other as intrinsic competitors locked in an inevitable zero-sum fight.

David_Moss @ 2025-02-13T12:16 (+22)

Thanks for writing on this important topic!

I think it's interesting to assess how popular or unpopular these views are within the EA community. This year and last year, we asked people in the EA Survey about the extent to which they agreed or disagreed that:

Most expected value in the future comes from digital minds' experiences, or the experiences of other nonbiological entities.

This year about 47% (strongly or somewhat) disagreed, while 22.2% agreed (roughly a 2:1 ratio).

However, among people who rated AI risks a top priority, respondents leaned towards agreement, with 29.6% disagreeing and 36.6% agreeing (a 0.8:1 ratio).^[1]

Similarly, among the most highly engaged EAs, attitudes were roughly evenly split between 33.6% disagreement and 32.7% agreement (1.02:1), with much lower agreement among everyone else.

This suggests to me that the collective opinion of EAs, at least among those who strongly prioritise AI risks and among the most highly engaged, is not so hostile to digital minds. Of course, for practical purposes, what matters most might be the attitudes of a small number of decisionmakers, but I think the attitudes of engaged EAs matter for epistemic reasons.

^[1] Interestingly, among people who merely rated AI risks a near-top priority, attitudes towards digital minds were similar to the sample as a whole. Lower prioritisation of AI risks was associated with yet lower agreement with the digital minds item.

Lukas_Gloor @ 2025-02-13T14:22 (+15)

I haven't read your other recent comments on this, but here's a question on the topic of pausing AI progress. (The point I'm making is similar to what Brad West already commented.)

Let's say we grant your assumptions (that AIs will have values that matter the same as or more than human values and that an AI-filled future would be just as or more morally important than one with humans in control). Wouldn't it still make sense to pause AI progress at this important junction to make sure we study what we're doing so we can set up future AIs to do as well as (reasonably) possible?

You say that we shouldn't be confident that AI values will be worse than human values. We can put a pin in that. But values are just one feature here. We should also think about agent psychologies and character traits and infrastructure beneficial for forming peaceful coalitions. On those dimensions, some traits or setups seem (somewhat robustly?) worse than others?

We're growing an alien species that might take over from humans. Even if you think that's possibly okay or good, wouldn't you agree that we can envision factors about how AIs are built/trained and about what sort of world they are placed in that affect whether the future will likely be a lot better or a lot worse?

I'm thinking about things like:

pro-social instincts (or at least absence of anti-social ones)
more general agent character traits that do well/poorly at forming peaceful coalitions
agent infrastructure to help with coordinating (e.g., having better lie detectors, having a reliable information environment or starting out under the chaos of information warfare, etc.)
initial strategic setup (being born into AI-vs-AI competition vs. being born in a situation where the first TAI can take time to proceed slowly and deliberately)
maybe: decision-theoretic setup to do well in acausal interactions with other parts of the multiverse (or at least not do particularly poorly)

If (some of) these things are really important, wouldn't it make sense to pause and study this stuff until we know whether some of these traits are tractable to influence?

(And, if we do that, we might as well try to make AIs have the inclination to be nice to humans, because humans already exist, so anything that kills humans who don't want to die frustrates already-existing life goals, which seems worse than frustrating the goals of merely possible beings.)

I know you don't talk about pausing in your above comment -- but I think I vaguely remember you being skeptical of it in other comments. Maybe that was for different reasons, or maybe you just wanted to voice disagreement with the types of arguments people typically give in favor of pausing?

FWIW, I totally agree with the position that we should respect the goals of AIs (assuming they're not just roleplayed stated goals but deeply held ones -- of course, this distinction shouldn't be uncharitably weaponized against AIs ever being considered to have meaningful goals). I'm just concerned because whether the AIs respect ours in turn, especially when they find themselves in a position where they could easily crush us, will probably depend on how we build them.

Matthew_Barnett @ 2025-02-14T02:37 (+6)

In your comment, you raise a broad but important question about whether, even if we reject the idea that human survival must take absolute priority over other concerns, we might still want to pause AI development in order to “set up” future AIs more thoughtfully. You list a range of traits—things like pro-social instincts, better coordination infrastructures, or other design features that might improve cooperation—that, in principle, we could try to incorporate if we took more time. I understand and agree with the motivation behind this: you are asking whether there is a prudential reason, from a more inclusive moral standpoint, to pause in order to ensure that whichever civilization emerges—whether dominated by humans, AIs, or both at once—turns out as well as possible in ways that matter impartially, rather than focusing narrowly on preserving human dominance.

Having summarized your perspective, I want to clarify exactly where I differ from your view, and why.

First, let me restate the perspective I defended in my previous post on delaying AI. In that post, I was critiquing what I see as the “standard case” for pausing AI, as I perceive it being made in many EA circles. This standard case for pausing AI often treats preventing human extinction as so paramount that any delay of AI progress, no matter how costly to currently living people, becomes justified if it incrementally lowers the probability of humans losing control.

Under this argument, the reason we want to pause is that time spent on “alignment research” can be used to ensure that future AIs share human goals, or at least do not threaten the human species. My critique had two components: first, I argued that pausing AI is very costly to people who currently exist, since it delays medical and technological breakthroughs that could be made by advanced AIs, thereby forcing a lot of people to die who could have otherwise been saved. Second, and more fundamentally, I argued that this "standard case" seems to rest on an assumption of strictly prioritizing human continuity, independent of whether future AIs might actually generate utilitarian moral value in a way that matches or exceeds humanity.

I certainly acknowledge that one could propose a different rationale for pausing AI, one which does not rest on the premise that preserving the human species is intrinsically worth more than other moral priorities. This is, it seems, the position you are taking.

Nonetheless, I don't find your considerations compelling for a variety of reasons.

To begin with, it might seem that granting ourselves "more time" robustly ensures that AIs come out morally better—pro-social, cooperative, and so on. Yet the connection between “getting more time” and “achieving positive outcomes” does not seem straightforward. Merely taking more time does not ensure that this time will be used to increase, rather than decrease, the relevant quality of AI systems according to an impartial moral view. Alignment with human interests, for example, could just as easily push systems in directions that entrench specific biases, maintain existing social structures, or limit moral diversity—none of which strongly aligns with the “pro-social” ideals you described. In my view, there is no inherent link between a slower timeline and ensuring that AIs end up embodying genuinely virtuous or impartial ethical principles. Indeed, if what we call “human control” is mainly about enforcing the status quo or entrenching the dominance of the human species, it may be no better—and could even be worse—than a scenario in which AI development proceeds at the default pace, potentially allowing for more diversity and freedom in how systems are shaped.

Furthermore, in my own moral framework—which is heavily influenced by preference utilitarianism—I take seriously the well-being of everyone who currently exists in the present. As I mentioned previously, one major cost to pausing AI is that it would likely postpone many technological benefits. These might include breakthroughs in medicine—potential cures for aging, radical extensions of healthy lifespans, or other dramatic increases to human welfare that advanced AI could accelerate. We should not simply dismiss the scale of that cost. The usual EA argument for downplaying these costs rests on the Astronomical Waste argument. However, I find this argument flawed, and I spelled out exactly why I found this argument flawed in the post I just wrote.

If a pause sets back major medical discoveries by even a decade, that delay could contribute to the premature deaths of around a billion people alive today. It seems to me that an argument in favor of pausing should grapple with this tradeoff, instead of dismissing it as clearly unimportant compared to the potential human lives that could maybe exist in the far future. Such a dismissal would seem both divorced from common sense concern for existing people, and divorced from broader impartial utilitarian values, as it would prioritize the continuity of the human species above and beyond species-neutral concerns for individual well-being.

Finally, I take very seriously the possibility that pausing AI would cause immense and enduring harm by requiring the creation of vast regulatory controls over society. Realistically, the political mechanisms by which we “pause” advanced AI development would likely involve a lot of coercion, surveillance and social control, particularly as AI starts becoming an integral part of our economy. These efforts are likely to expand state regulatory powers, hamper open competition, and open the door to a massive intrusion of state interference in economic and social activity. I believe these controls would likely be far more burdensome and costly than, for example, our controls over nuclear weapons. If our top long-term priority is building a more free, prosperous, inclusive, joyous, and open society for everyone, rather than merely to control and stop AI, then it seems highly questionable that creating the policing powers required to pause AI is the best way to achieve this objective.

As I see it, the core difference between the view you outlined and mine is not that I am ignoring the possibility that we might “do better” by carefully shaping the environment in which AIs arise. I concede that if we had a guaranteed mechanism to spend a known, short period of time intentionally optimizing how AIs are built, without imposing any other costs in the meantime, that might bring some benefits. However, my skepticism flows from the actual methods by which such a pause would come about, its unintended consequences on liberty, the immediate harms it imposes on present-day people by delaying technological progress, and the fact that it might simply entrench a narrower or more species-centric approach that I explicitly reject. It is not enough to claim that “pausing gives us more time", suggesting that "more time" is robustly a good thing. One must argue why that time will be spent well, in a way that outweighs the enormous and varied costs that I believe are incurred by pausing AI.

To be clear, I am not opposed to all forms of regulation. But I tend to prefer more liberal approaches, in the sense of classical liberalism. I prefer strategies that try to invite AIs into a cooperative framework, giving them legal rights and a path to peaceful integration—coupled, of course, with constraints on any actor (human or AI) who threatens to commit violence. This, in my view, simply seems like a far stronger foundation for AI policy than a stricter top-down approach in which we halt all frontier AI progress, and establish the associated sweeping regulatory powers required to enforce such a moratorium.

Ozzie Gooen @ 2025-02-13T05:46 (+13)

I think it's interesting and admirable that you're dedicated to a position that's so unusual in this space.

I assume I'm in the majority here in that my intuitions are quite different from yours, however.

One quick point while we're here:
> this view is likely rooted in a bias that automatically favors human beings over artificial entities—thereby sidelining the idea that future AIs might create equal or greater moral value than humans—and treating this alternative perspective with unwarranted skepticism.

I think that a common, but perhaps not well vocalized, utilitarian take is that humans don't have much of a special significance in terms of creating well-being. The main option would be a much more abstract idea, some kind of generalization of hedonium or consequentialism-ium or similar. For now, let's define hedonium as "the ideal way of converting matter and energy into well-being, after a great deal of deliberation."

As such, it's very tempting to try to separate concerns and have AI tools focus on being great tools, and separately optimize hedonium to be efficient at being well-being. While I'm not sure if AIs would have zero qualia, I'd feel a lot more confident that they will have dramatically less qualia per unit resources than a much more optimized substrate.

If one follows this general logic, then one might assume that it's likely that the vast majority of well-being in the future would exist as hedonium, not within AIs created to ultimately make hedonium.

One less intense formulation would be to have both AIs and humans focus only on making sure we get to the point where we much better understand the situation with qualia and hedonium (a la the Long Reflection), and then re-evaluate. In my strategic thinking around AI I'm not particularly optimizing for the qualia of the humans involved in the AI labs or the relevant governments. Similarly, I'd expect not to optimize hard for the qualia in the early AIs, in the period when we're unsure about qualia and ethics, even if I thought they might have experiences. I would be nervous if I thought this period could involve AIs having intense suffering or be treated in highly immoral ways.

David Mathers🔸 @ 2025-02-13T12:10 (+8)

I think for me, part of the issue with your posts on this (which I think are net positive to be clear, they really push at significant weak points in ideas widely held in the community) is that you seem to be sort of vacillating between three different ideas, in a way that conceals that one of them, taken on its own, sounds super-crazy and evil:

1) Actually, if AI development were to literally lead to human extinction, that might be fine, because it might lead to higher utility.

2) We should care about humans harming sentient, human-like AIs as much as we care about AIs harming humans.

3) In practice, the benefits to current people from AI development outweigh the risks, and the only moral views which say that we should ignore this and pause in the face of even tiny risks of extinction from AI because there are way more potential humans in the future, in fact, when taken seriously, imply 1), which nobody believes.

1) feels extremely bad to me, basically a sort of Nazi-style view on which genocide is fine if the replacing people are superior utility generators (or I guess, inferior but sufficiently more numerous). 1) plausibly is a consequence of classical utilitarianism (even maybe on some person-affecting versions of classical utilitarianism I think), but I take this to be a reason to reject pure classical utilitarianism, not a reason to endorse 1). 2) and 3), on the other hand, seem reasonable to me. But the thing is that you seem at least sometimes to be taking AI moral patienthood as a reason to push on in the face of uncertainty about whether AI will literally kill everyone. And that seems more like 1) than 2) or 3). 1-style reasoning supports the idea that AI moral patienthood is a reason for pushing on with AI development even in the face of human extinction risk, but as far as I can tell 2) and 3) don't. At the same time though I don't think you mean to endorse 1).

Matthew_Barnett @ 2025-02-13T19:44 (+16)

I realize my position can be confusing, so let me clarify it as plainly as I can: I do not regard the extinction of humanity as anything close to “fine.” In fact, I think it would be a devastating tragedy if every human being died. I have repeatedly emphasized that a major upside of advanced AI lies in its potential to accelerate medical breakthroughs—breakthroughs that might save countless human lives, including potentially my own. Clearly, I value human lives, as otherwise I would not have made this particular point so frequently.

What seems to cause confusion is that I also argue the following more subtle point: while human extinction would be unbelievably bad, it would likely not be astronomically bad in the strict sense used by the "astronomical waste" argument. The standard “astronomical waste” argument says that if humanity disappears, then all possibility for a valuable, advanced civilization vanishes forever. But in a scenario where humans die out because of AI, civilization would continue—just not with humans. That means a valuable intergalactic civilization could still arise, populated by AI rather than by humans. From a purely utilitarian perspective that counts the existence of a future civilization as extremely valuable—whether human or AI—this difference lowers the cataclysm from “astronomically, supremely, world-endingly awful” to “still incredibly awful, but not on a cosmic scale.”

In other words, my position remains that human extinction is very bad indeed—it entails the loss of eight billion individual human lives, which would be horrifying. I don't want to be forcibly replaced by an AI. Nor do I want you, or anyone else to be forcibly replaced by an AI. I am simply pushing back on the idea that such an event would constitute the absolute destruction of all future value in the universe. There is a meaningful distinction between “an unimaginable tragedy we should try very hard to avoid” and “a total collapse of all potential for a flourishing future civilization of any kind.” My stance falls firmly in the former category.

This distinction is essential to my argument because it fundamentally shapes how we evaluate trade-offs, particularly when considering policies that aim to slow or restrict AI research. If we assume that human extinction due to AI would erase all future value, then virtually any present-day sacrifice—no matter how extreme—might seem justified to reduce that risk. However, if advanced AI could continue to sustain its own value-generating civilization, even in the absence of humans, then extinction would not represent the absolute end of valuable life. While this scenario would be catastrophic for humanity, attempting to avoid it might not outweigh certain immediate benefits of AI, such as its potential to save lives through advanced technology.

In other words, there could easily be situations where accelerating AI development—rather than pausing it—ends up being the better choice for saving human lives, even if doing so technically slightly increases the risk of human species extinction. This does not mean we should be indifferent to extinction; rather, it means we should stop treating extinction as a near-infinitely overriding concern, where even the smallest reduction in its probability is always worth immense near-term costs to actual people living today.

For a moment, I’d like to reverse the criticism you leveled at me. From where I stand, it is often those who strongly advocate pausing AI development, not myself, who can appear to undervalue the lives of humans. I know they don’t see themselves this way, and they would certainly never phrase it in those terms. Nevertheless, this is my reading of the deeper implications of their position.

A common proposition that many AI pause advocates have affirmed to me is that it very well could be worth it to pause AI, even if this led to billions of humans dying prematurely due to them missing out on accelerated medical progress that could otherwise have saved their lives. Therefore, while these advocates care deeply about human extinction (something I do not deny), their concern does not seem rooted in the intrinsic worth of the people who are alive today. Instead, their primary focus often seems to be on the loss of potential future human lives that could maybe exist in the far future—lives that do not yet even exist, and on my view, are unlikely to exist in the far future in basically any scenario, since humanity is unlikely to be preserved as a fixed, static concept over the long-run.

In my view, this philosophy neither prioritizes the well-being of actual individuals nor is it grounded in the utilitarian value that humanity actively generates. If this philosophy were purely about impartial utilitarian value, then I ask: why are they not more open to my perspective? Since my philosophy takes an impartial utilitarian approach—one that considers not just human-generated value, but also the potential value that AI itself could create—it would seem to appeal to those who simply took a strict utilitarian approach, without discriminating against artificial life arbitrarily. Yet, my philosophy largely does not appeal to those who express this view, suggesting the presence of alternative, non-utilitarian concerns.

David Mathers🔸 @ 2025-02-17T10:50 (+2)

Thanks, that is very helpful to me in clarifying your position.

harfe @ 2025-02-13T13:57 (+3)

> At the same time though I don't think you mean to endorse 1).

I have read or skimmed some of his posts and my sense is that he does endorse 1). But at the same time he says

> critics seem to frequently conflate my arguments with other, simpler positions that can be more easily dismissed.

so maybe this is one of these cases and I should be more careful.

quinn @ 2025-02-16T23:03 (+7)

I distinguish believing that good successor criteria are brittle from speciesism. I think antispeciesism does not oblige me to accept literally any successor.

I do feel icky coalitioning with outright speciesists (who reject the possibility of a good successor in principle), but I think my goals, and generalized flourishing as a whole, benefit a lot from those coalitions, so I grin and bear it.

Jonas Hallgren @ 2025-02-13T10:03 (+1)

FWIW, I completely agree with what you're saying here. I think that if you seriously go into consciousness research, especially into what we Westerners would label a sense of self more than anything else, it quickly becomes infeasible to hold the position that the way we're taking AI development (e.g., towards AI agents) will not lead to AIs having self-models.

For all intents and purposes, this encompasses most physicalist or non-dual theories of consciousness, which are the only feasible ones unless you want to bite some really sour apples.

There's a classic "what are we getting wrong" question in EA and I think it's extremely likely that we will look back in 10 years and say, "wow, what are we doing here?".

I think it's a lot better to think of systemic alignment and look at properties that we want for the general collective intelligences that we're engaging in, such as our information networks or our institutional decision-making procedures, and think of how we can optimise these for resilience and truth-seeking. If certain AIs deserve moral patienthood then that truth will naturally arise from such structures.

(hot take) Individual AI alignment might honestly be counter-productive towards this view.

Matthew_Barnett @ 2023-09-21T18:27 (+45)

(Clarification about my views in the context of the AI pause debate)

I'm finding it hard to communicate my views on AI risk. I feel like some people are responding to the general vibe they think I'm giving off rather than the actual content. Other times, it seems like people will focus on a narrow snippet of my comments/post and respond to it without recognizing the context. For example, one person interpreted me as saying that I'm against literally any AI safety regulation. I'm not.

For a full disclosure, my views on AI risk can be loosely summarized as follows:

I think AI will probably be very beneficial for humanity.
Nonetheless, I think that there are credible, foreseeable risks from AI that could do vast harm, and we should invest heavily to ensure these outcomes don't happen.
I also don't think technology is uniformly harmless. Plenty of technologies have caused net harm. Factory farming is a giant net harm that might have even made our entire industrial civilization a mistake!
I'm not blindly against regulation. I think all laws can and should be viewed as forms of regulations, and I don't think it's feasible for society to exist without laws.
That said, I'm also not blindly in favor of regulation, even for AI risk. You have to show me that the benefits outweigh the harm.
I am generally in favor of thoughtful, targeted AI regulations that align incentives well, and reduce downside risks without completely stifling innovation.
I'm open to extreme regulations and policies if or when an AI catastrophe seems imminent, but I don't think we're in such a world right now. I'm not persuaded by the arguments that people have given for this thesis, such as Eliezer Yudkowsky's AGI ruin post.

Chris Leong @ 2023-09-21T23:30 (+2)

Thanks, that seems like a pretty useful summary.

Matthew_Barnett @ 2023-10-13T20:26 (+44)

I might elaborate on this at some point, but I thought I'd write down some general reasons why I'm more optimistic than many EAs on the risk of human extinction from AI. I'm not defending these reasons here; I'm mostly just stating them.

Skepticism of foom: I think it's unlikely that a single AI will take over the whole world and impose its will on everyone else. I think it's more likely that millions of AIs will be competing for control over the world, in a similar way that millions of humans are currently competing for control over the world. Power or wealth might be very unequally distributed in the future, but I find it unlikely that it will be distributed so unequally that there will be only one relevant entity with power. In a non-foomy world, AIs will be constrained by norms and laws. Absent severe misalignment among almost all the AIs, I think these norms and laws will likely include a general prohibition on murdering humans, and there won't be a particularly strong motive for AIs to murder every human either.
Skepticism that value alignment is super-hard: I haven't seen any strong arguments that value alignment is very hard, in contrast to the straightforward empirical evidence that e.g. GPT-4 seems to be honest, kind, and helpful after relatively little effort. Most conceptual arguments I've seen for why we should expect value alignment to be super-hard rely on strong theoretical assumptions that I am highly skeptical of. I have yet to see significant empirical successes from these arguments. I feel like many of these conceptual arguments would, in theory, apply to humans, and yet human children are generally value aligned by the time they reach young adulthood (at least, value aligned enough to avoid killing all the old people). Unlike humans, AIs will be explicitly trained to be benevolent, and we will have essentially full control over their training process. This provides much reason for optimism.
Belief in a strong endogenous response to AI: I think most people will generally be quite fearful of AI and will demand that we are very cautious while deploying the systems widely. I don't see a strong reason to expect companies to remain unregulated and rush to cut corners on safety, absent something like a world war that presses people to develop AI as quickly as possible at all costs.
Not being a perfectionist: I don't think we need our AIs to be perfectly aligned with human values, or perfectly honest, similar to how we don't need humans to be perfectly aligned and honest. Individual humans are usually quite selfish, frequently lie to each other, and are often cruel, and yet the world mostly gets along despite this. This is true even when there are vast differences in power and wealth between humans. For example some groups in the world have almost no power relative to the United States, and residents in the US don't particularly care about them either, and yet they survive anyway.
Skepticism of the analogy to other species: it's generally agreed that humans dominate the world at the expense of other species. But that's not surprising, since humans evolved independently of other animal species. And we can't really communicate with other animal species, since they lack language. I don't think AI is analogous to this situation. AIs will mostly be born into our society, rather than being created outside of it. (Moreover, even in this very pessimistic analogy, humans still spend >0.01% of our GDP on preserving wild animal species, and the vast majority of animal species have not gone extinct despite our giant influence on the natural world.)

RobertM @ 2023-10-14T06:33 (+13)

ETA: feel free to ignore the below, given your caveat, though you may find it helpful if you choose to write an expanded form of any of the arguments later to have some early objections.

Correct me if I'm wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit:

Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don't very much value the well-being of others don't have the power to actually expropriate everyone else's resources by force. (We have evidence of what happens when those conditions break down to any meaningful degree; it isn't super pretty.)
I do not think GPT-4 is meaningful evidence about the difficulty of value alignment. In particular, the claim that "GPT-4 seems to be honest, kind, and helpful after relatively little effort" seems to be treating GPT-4's behavior as meaningfully reflecting its internal preferences or motivations, which I think is "not even wrong". I think it's extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren't centrally pointed at being honest, kind, and helpful.
re: endogenous response to AI - I don't see how this is relevant once you have ASI. To the extent that it might be relevant, it's basically conceding the argument: that the reason we'll be safe is that we'll manage to avoid killing ourselves by moving too quickly. (Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past. One that some people are actively optimising for, but also one that other people are optimizing against.)
re: perfectionism - I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time. Assuming that this will continue to be true is again assuming the conclusion (that AI will not be superhuman in any relevant sense). I also feel like there's an implicit argument here about how value isn't fragile that I disagree with, but I might be reading into it.
I'm not totally sure what analogy you're trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive. Human efforts to preserve animal species are a drop in the bucket compared to the casual disregard with which we optimize over them and their environments for our benefit. I'm sure animals sometimes attempt to defend their territory against human encroachment. Has the human response to this been to shrug and back off? Of course, there are some humans who do care about animals having fulfilled lives by their own values. But even most of those humans do not spend their lives tirelessly optimizing for their best understanding of the values of animals.

Matthew_Barnett @ 2023-10-14T09:06 (+12)

> Correct me if I'm wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense

No, I certainly expect AIs will eventually be superhuman in virtually all relevant respects.

> Resource allocation is relatively equal (and relatively free of violence) among humans because even humans that don't very much value the well-being of others don't have the power to actually expropriate everyone else's resources by force.

Can you clarify what you are saying here? If I understand you correctly, you're saying that humans have relatively little wealth inequality because there's relatively little inequality in power between humans. What does that imply about AI?

I think there will probably be big inequalities in power among AIs, but I am skeptical of the view that there will be only one (or even a few) AIs that dominate over everything else.

> I do not think GPT-4 is meaningful evidence about the difficulty of value alignment.

I'm curious: does that mean you also think that alignment research performed on GPT-4 is essentially worthless? If not, why?

> I think it's extremely unlikely that GPT-4 has preferences over world states in a way that most humans would consider meaningful, and in the very unlikely event that it does, those preferences almost certainly aren't centrally pointed at being honest, kind, and helpful.

I agree that GPT-4 probably doesn't have preferences in the same way humans do, but it sure appears to be a limited form of general intelligence, and I think future AGI systems will likely share many underlying features with GPT-4, including, to some extent, cognitive representations inside the system.

I think our best guess of future AI systems should be that they'll be similar to current systems, but scaled up dramatically, trained on more modalities, with some tweaks and post-training enhancements, at least if AGI arrives soon. Are you simply skeptical of short timelines?

> re: endogenous response to AI - I don't see how this is relevant once you have ASI.

To be clear, I expect we'll get AI regulations before we get to ASI. I predict that regulations will increase in intensity as AI systems get more capable and start having a greater impact on the world.

> Note that we are currently moving at pretty close to max speed, so this is a prediction that the future will be different from the past.

Every industry in history initially experienced little to no regulation. However, after people became more acquainted with the industry, regulations on the industry increased. I expect AI will follow a similar trajectory. I think this is in line with historical evidence, rather than contradicting it.

> re: perfectionism - I would not be surprised if many current humans, given superhuman intelligence and power, created a pretty terrible future. Current power differentials do not meaningfully let individual players flip every single other player the bird at the same time.

I agree. If you turned a random human into a god, or a random small group of humans into gods, then I would be pretty worried. However, in my scenario, there aren't going to be single AIs that suddenly become gods. Instead, in my scenario, there will be millions of different AIs, and the AIs will smoothly increase in power over time. During this time, we will be able to experiment and do alignment research to see what works and what doesn't at making the AIs safe. I expect AI takeoff will be fairly diffuse, and AIs will probably be respectful of norms and laws because no single AI can take over the world by itself. Of course, the way I think about the future could be wrong on a lot of specific details, but I don't see a strong reason to doubt the basic picture I'm presenting, as of now.

My guess is that your main objection here is that you think foom will happen, i.e. there will be a single AI that takes over the world and imposes its will on everyone else. Can you elaborate more on why you think that will happen? I don't think it's a straightforward consequence of AIs being smarter than humans.

I'm not totally sure what analogy you're trying to rebut, but I think that human treatment of animal species, as a piece of evidence for how we might be treated by future AI systems that are analogously more powerful than we are, is extremely negative, not positive.

My main argument is that we should reject the analogy itself. I'm not really arguing that the analogy provides evidence for optimism, except in a very weak sense. I'm just saying: AIs will be born into and shaped by our culture; that's quite different than what happened between animals and humans.

SiebeRozendal @ 2023-10-14T20:39 (+4)

Individual humans are usually quite selfish, frequently lie to each other, and are often cruel, and yet the world mostly gets along despite this. This is true even when there are vast differences in power and wealth between humans. For example some groups in the world have almost no power relative to the United States, and residents in the US don't particularly care about them either, and yet they survive anyway.

Okay so these are two analogies: individual humans & groups/countries.

First off, "surviving" doesn't seem like the right thing to evaluate, more like "significant harm"/"being exploited".

Can you give some examples where individual humans have a clear strategic decisive advantage (i.e. very low risk of punishment), where the low-power individual isn't at a high risk of serious harm? Because the examples I can think of are all pretty bad: dictators, slaveholders, husbands in highly patriarchal societies. Sexual violence is extremely prevalent and is pretty much always in a high power difference context.

I find the US example unconvincing, because I find it hard to imagine the US benefiting more from aggressive use of force than from trade and soft economic exploitation. The US doesn't have the power to successfully occupy countries anymore. When there were bigger power differences due to technology, we had the age of colonialism.

Matthew_Barnett @ 2023-10-14T21:41 (+3)

Can you give some examples where individual humans have a clear strategic decisive advantage (i.e. very low risk of punishment), where the low-power individual isn't at a high risk of serious harm?

Why are we assuming a low risk of punishment? Risk of punishment depends largely on social norms and laws, and I'm saying that AIs will likely adhere to a set of social norms.

I think the central question is whether these social norms will include the norm "don't murder humans". I think such a norm will probably exist, unless almost all AIs are severely misaligned. I think severe misalignment is possible; one can certainly imagine it happening. But I don't find it likely, since people will care a lot about making AIs ethical, and I'm not yet aware of any strong reasons to think alignment will be super-hard.

Matthew_Barnett @ 2024-02-04T06:16 (+34)

It seems to me that a big crux about the value of AI alignment work is what target you think AIs will ultimately be aligned to in the future in the optimistic scenario where we solve all the "core" AI risk problems to the extent they can be feasibly solved, e.g. technical AI safety problems, coordination problems, the problem of having "good" AI developers in charge etc.

There are a few targets that I've seen people predict AIs will be aligned to if we solve these problems: (1) "human values", (2) benevolent moral values, (3) the values of AI developers, (4) the CEV of humanity, (5) the government's values. My guess is that a significant source of disagreement that I have with EAs about AI risk is that I think none of these answers are actually very plausible. I've written a few posts explaining my views on this question already (1, 2), but I think I probably didn't make some of my points clear enough in these posts. So let me try again.

In my view, in the most likely case, it seems that if the "core" AI risk problems are solved, AIs will be aligned to the primarily selfish individual revealed preferences of existing humans at the time of alignment. This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I'm working on a better acronym).

(Read my post if you want to understand my reasons for thinking that AIs will likely be aligned to PSIRPEHTA if we solve AI safety problems.)

I think it is not obvious at all that maximizing PSIRPEHTA is good from a total utilitarian perspective compared to most plausible "unaligned" alternatives. In fact, I think the main reason why you might care about maximizing PSIRPEHTA is if you think we're close to AI and you personally think that current humans (such as yourself) should be very rich. But if you thought that, I think the arguments about the overwhelming value of reducing existential risk in e.g. Bostrom's paper Astronomical Waste largely do not apply. Let me try to explain.

PSIRPEHTA is not the same thing as "human values" because, unlike human values, PSIRPEHTA is not consistent over time or shared between members of our species. Indeed, PSIRPEHTA changes during each generation as old people die off, and new young people are born. Most importantly, PSIRPEHTA is not our non-selfish "moral" values, except to the extent that people are regularly moved by moral arguments in the real world to change their economic consumption habits, which I claim is not actually very common (or, to the extent that it is common, I don't think these moral values usually look much like the ideal moral values that most EAs express).

PSIRPEHTA refers to the aggregate ordinary revealed preferences of individual actors, who the AIs will be aligned to, in order to make those humans richer i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is "morally correct". For example, according to "human values" it might be wrong to eat meat, because maybe if humans reflected long enough they'd express the conclusion that it's wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there's little pressure for people to "reflect" on their values and change them.

From this perspective, the view in which it makes most sense to push for AI alignment work seems to be an obscure form of person-affecting utilitarianism in which you care mainly about the revealed preferences of humans at the time when AI is created (not the human species, but rather, the generation of humans that happens to be living when advanced AIs are created). This perspective is plausible if you really care about making currently existing humans better off materially and you think we are close to advanced AI. But I think this type of moral view is generally quite far apart from total utilitarianism, or really any other form of utilitarianism that EAs have traditionally adopted.

In a plausible "unaligned" alternative, the values of AIs would diverge from PSIRPEHTA, but this mainly has the effect of making particular collections of individual humans less rich, and making other agents in the world — particularly unaligned AI agents — more rich. That could be bad if you think that these AI agents are less morally worthy than existing humans at the time of alignment (e.g. for some reason you think AI agents won't be conscious), but I think it's critically important to evaluate this question carefully by measuring the "unaligned" outcome against the alternative. Most arguments I've seen about this topic have emphasized how bad it would be if unaligned AIs have influence in the future. But I've rarely seen the flipside of this argument explicitly defended: why PSIRPEHTA would be any better.

In my view, PSIRPEHTA seems like a mediocre value system, and one that I do not particularly care to maximize relative to a variety of alternatives. I definitely like PSIRPEHTA to the extent that I, my friends, family, and community are members of the set of "existing humans at the time of alignment", but I don't see any particularly strong utilitarian arguments for caring about PSIRPEHTA.

In other words, instead of arguing that unaligned AIs would be bad, I'd prefer to hear more arguments about why PSIRPEHTA would be better, since PSIRPEHTA just seems to me like the value system that will actually be favored if we feasibly solve all the technical and coordination AI problems that EAs normally talk about regarding AI risk.

MichaelStJules @ 2024-02-04T20:46 (+6)

PSIRPEHTA refers to the aggregate ordinary revealed preferences of individual actors, who the AIs will be aligned to, in order to make those humans richer i.e. their preferences as revealed by their actions, such as what they spend their income on, NOT what they think is "morally correct". For example, according to "human values" it might be wrong to eat meat, because maybe if humans reflected long enough they'd express the conclusion that it's wrong to hurt animals. But from the perspective of PSIRPEHTA, eating meat is generally acceptable, and empirically there's little pressure for people to "reflect" on their values and change them.

EDIT: I guess I'd think of human values as what people would actually just sincerely and directly endorse without further influencing them first (although maybe just asking them makes them take a position if they didn't have one before, e.g. if they've never thought much about the ethics of eating meat).

~~I think you're overstating the differences between revealed and endorsed preferences, including moral/human values, here.~~ Probably only a small share of the population thinks eating meat is wrong or bad, and most probably think it's okay. Even if people generally would find it wrong or bad after reflecting long enough (I'm not sure they actually would), that doesn't reflect their actual values now. Actual human values do not generally find eating meat wrong.

To be clear, you can still complain that humans' actual/endorsed values are also far from ideal and maybe not worth aligning with, e.g. because people don't care enough about nonhuman animals or helping others. Do people care more about animals and helping others than an unaligned AI would, in expectation, though? Honestly, I'm not entirely sure. Humans may care about animal welfare somewhat, but they also specifically want to exploit animals in large part because of their values, specifically food-related taste, culture, traditions and habit. Maybe people will also want to specifically exploit artificial moral patients for their own entertainment, curiosity or scientific research on them, not just because the artificial moral patients are generically useful, e.g. for acquiring resources and power and enacting preferences (which an unaligned AI could be prone to).

MichaelStJules @ 2024-02-04T21:15 (+2)

I illustrate some other examples here on the influence of human moral values on companies. This is all of course revealed preferences, but my point is that revealed preferences can importantly reflect endorsed moral values.

People influence companies in part on the basis of what they think is right through demand, boycotts, law, regulation and other political pressure.

Companies, for the most part, can't just go around directly murdering people (companies can still harm people, e.g. through misinformation on the health risks of their products, or because people don't care enough about the harms). (Maybe this is largely for selfish reasons; people don't want to be killed themselves, and there's a slippery slope if you allow exceptions.)

GPT has content policies that reflect people's political/moral views. Social media companies have use and content policies and have kicked off various users for harassment, racism, or other things that are politically unpopular, at least among a large share of users or advertisers (which also reflect consumers). This seems pretty standard.

Many companies have boycotted Russia since the invasion of Ukraine. Many companies have also committed to sourcing only cage-free eggs after corporate outreach and campaigns, despite cage-free egg consumption being low.

X (Twitter)'s policies on hate speech have changed under Musk, presumably primarily because of his views. That seems to have cost X users and advertisers, but X is still around and popular, so it also shows that some potentially important decisions about how a technology is used are largely in the hands of the company and its leadership, not just driven by profit.

I'd likewise guess it actually makes a difference that the biggest AI labs are (I would assume) led and staffed primarily by liberals. They can push their own views onto their AI even at the cost of some profit and market share. And some things may have minimal near term consequences for demand or profit, but could be important for the far future. If the company decides to make their AI object more to various forms of mistreatment of animals or artificial consciousness, will this really cost them tons of profit and market share? And it could depend on the markets it's primarily used in, e.g. this would matter even less for an AI that brings in profit primarily through trading stocks.

It's also often hard to say how much something affects a company's profits.

Ryan Greenblatt @ 2024-02-04T18:27 (+5)

This essentially refers to the implicit value system that would emerge if, when advanced AI is eventually created, you gave the then-currently existing set of humans a lot of wealth. Call these values PSIRPEHTA (I'm working on a better acronym).

I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I'm less sold that this will result in outcomes which are well described as "primarily selfish".

I feel like your comment is equivocating between "the situation is similar to making existing humans massively wealthy" and "of course this will result in primarily selfish usage similar to how the median person behaves with marginal money now".

Matthew_Barnett @ 2024-02-04T19:32 (+2)

I basically buy that the values we get will be similar to just giving existing humans massive amounts of wealth, but I'm less sold that this will result in outcomes which are well described as "primarily selfish".

Current humans definitely seem primarily selfish (although I think they also care about their family and friends too; I'm including that). Can you explain why you think giving humans a lot of wealth would turn them into something that isn't primarily selfish? What's the empirical evidence for that idea?

Ryan Greenblatt @ 2024-02-04T22:09 (+3)

The behavior of billionaires, which maybe indicates more like 10% of income spent on altruism.

ETA: This is still literally majority selfish, but it's also plausible that 10% altruism is pretty great and looks pretty different than "current median person behavior with marginal money".

(See my other comment about the percent of cosmic resources.)

Matthew_Barnett @ 2024-02-04T22:11 (+2)

The idea that billionaires have 90% selfish values seems consistent with a claim of having "primarily selfish" values in my opinion. Can you clarify what you're objecting to here?

Ryan Greenblatt @ 2024-02-04T22:19 (+3)

The literal words of "primarily selfish" don't seem that bad, but I would maybe prefer majority selfish?

And your top level comment seems like it's not talking about/emphasizing the main reason to like human control which is that maybe 10-20% of resources are spent well.

It just seemed odd to me to not mention that "primarily selfish" still involves a pretty big fraction of altruism.

Matthew_Barnett @ 2024-02-04T22:27 (+4)

I agree it's important to talk about and analyze the (relatively small) component of human values that are altruistic. I mostly just think this component is already over-emphasized.

Here's one guess at what I think you might be missing about my argument: 90% selfish values + 10% altruistic values isn't the same thing as, e.g., 90% valueless stuff + 10% utopia. The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren't necessarily outweighed by the 10%.

90% selfish values is the type of thing that produces massive factory farming infrastructure, with a small amount of GDP spent mitigating suffering in factory farms. Does the small amount of spending mitigating suffering outweigh the large amount of spending directly causing suffering? This isn't clear to me.

(Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse. But I'd want to understand how you could come to that conclusion, carefully. "Altruism" also encompasses a broad range of activities, and not all of it is utopian or idealistic from a total utilitarian perspective. For example, human spending on environmental conservation might be categorized as "altruism" in this framework, although personally I would say that form of spending is not very "moral" due to wild animal suffering.)

Ryan Greenblatt @ 2024-02-04T22:39 (+9)

The 90% selfish component can have negative effects on welfare from a total utilitarian perspective, that aren't necessarily outweighed by the 10%.

Yep, this can be true, but I'm skeptical this will matter much in practice.

I typically think things which aren't directly optimizing for value or disvalue won't have intended effects which are very important and that in the future unintended effects (externalities) won't be that much of total value/disvalue.

When we see the selfish consumption of current very rich people, it doesn't seem like the intentional effects are that morally good/bad relative to the best/worst uses of resources. (E.g. owning a large boat and having people think you're high status aren't that morally important relative to altruistic spending of similar amounts of money.) So for current very rich people the main issue would be that the economic process for producing the goods has bad externalities.

And, I expect that as technology advances, externalities reduce in moral importance relative to intended effects. Partially this is based on crazy transhumanist takes, but I feel like there is some broader perspective in which you'd expect this.

E.g. for factory farming, the ultimately cheapest way to make meat in the limit of technological maturity would very likely not involve any animal suffering.

Separately, I think externalities will probably look pretty similar for selfish resource usage for unaligned AIs and humans because most serious economic activities will be pretty similar.

Ryan Greenblatt @ 2024-02-04T22:30 (+1)

Alternatively, you could think that unaligned AIs will be 100% selfish, and this is clearly worse.

I'd like to explicitly note that I don't think this is true in expectation for a reasonable notion of "selfish". Though I maybe think something sort of in this direction is true if we use a relatively narrow notion of altruism.

JWS @ 2024-02-08T12:21 (+2)

How are we defining selfish here? It seems like a pretty strong position to take on the topic of psychological egoism? Especially including family/friends in terms of selfish?

In your original post, you say:

All that extra wealth did not make us extreme moral saints; instead, we still mostly care about ourselves, our family, and our friends.

But I don't know, it seems that as countries and individuals get wealthier, we seem to on the whole be getting better? Maybe factory farming acts against this, but the idea that factory farming is immoral and should be abolished exists and I think is only going to grow. I don't think humans are just slaves to our base wants/desires, and think that is a remarkably impoverished view of both individual human psychology and social morality.

As such, I don't really agree with much of this post. An AGI, when built, will be able to generate new ideas and hypotheses about the world, including moral ones. A strong-but-narrow AI could be worse (e.g. optimal-factory-farm-PT), but then the right response here isn't really technical alignment, it's AI governance and moral persuasion in general.

aogara @ 2024-02-09T03:23 (+4)

This seems to underrate the arguments for Malthusian competition in the long run.

If we develop the technical capability to align AI systems with any conceivable goal, we'll start by aligning them with our own preferences. Some people are saints, and they'll make omnibenevolent AIs. Other people might have more sinister plans for their AIs. The world will remain full of human values, with all the good and bad that entails.

But current human values do not maximize our reproductive fitness. Maybe one human will start a cult devoted to sending self-replicating AI probes to the stars at almost light speed. That person's values will influence far-reaching corners of the universe that later humans will struggle to reach. Another human might use their AI to persuade others to join together and fight a war of conquest against a smaller, weaker group of enemies. If they win, their prize will be hardware, software, energy, and more power that they can use to continue to spread their values.

Even if most humans are not interested in maximizing the number and power of their descendants, those who are will have the most numerous and most powerful descendants. This selection pressure exists even if the humans involved are ignorant of it; even if they actively try to avoid it.

I think it's worth splitting the alignment problem into two quite distinct problems:

The technical problem of intent alignment. Solving this does not solve coordination problems. There will still be private information and coordination problems after intent alignment is solved, therefore we'll still face coordination problems, fitter strategies will proliferate, and the world will be governed by values that maximize fitness.
"Civilizational alignment"? Much harder problem to solve. The traditional answer is a Leviathan, or Singleton as the cool kids have been saying. It solves coordination problems, allowing society to coherently pursue a long-run objective such as flourishing rather than fitness maximization. Unfortunately, there are coordination problems and competitive pressures within Leviathans. The person who ends up in charge is usually quite ruthless and focused on preserving their power, rather than the stated long-run goal of the organization. And if you solve all the coordination problems, you have another problem in choosing a good long-run objective. Nothing here looks particularly promising to me, and I expect competition to continue.

Better explanations: 1, 2, 3.

Matthew_Barnett @ 2024-02-09T04:22 (+2)

This seems to underrate the arguments for Malthusian competition in the long run.

I'm mostly talking about what I expect to happen in the short-run in this thread. But I appreciate these arguments (and agree with most of them).

Plausibly my main disagreement with the concerns you raised is that I think coordination is maybe not very hard. Coordination seems to have gotten stronger over time, in the long-run. AI could also potentially make coordination much easier. As Bostrom has pointed out, historical trends point towards the creation of a Singleton.

I'm currently uncertain about whether to be more worried about a future world government becoming stagnant and inflexible. There's a real risk that our institutions will at some point entrench an anti-innovation doctrine that prevents meaningful changes over very long time horizons out of a fear that any evolution would be too risky. As of right now I'm more worried about this potential failure mode versus the failure mode of unrestrained evolution, but it's a close competition between the two concerns.

Ryan Greenblatt @ 2024-02-04T18:16 (+3)

What percent of cosmic resources do you expect to be spent thoughtfully and altruistically? 0%? 10%?

I would guess the thoughtful and altruistic subset of resources dominate in most scenarios where humans retain control.

Then, my main argument for why human control would be good is that the fraction isn't that small (more like 20% in expectation than 0%) and that unaligned AI takeover seems probably worse than this.

Also, as an aside, I agree that little good public argumentation has been made about the relative value of unaligned AI control vs human control. I'm sympathetic to various discussions from Paul Christiano and Joe Carlsmith, but the public scope and detail is pretty limited thus far.

Matthew_Barnett @ 2024-02-24T05:54 (+32)

In some circles that I frequent, I've gotten the impression that a decent fraction of existing rhetoric around AI has gotten pretty emotionally charged. And I'm worried about the presence of what I perceive as demagoguery regarding the merits of AI capabilities and AI safety. Out of a desire to avoid calling out specific people or statements, I'll just discuss a hypothetical example for now.

Suppose an EA says, "I'm against OpenAI's strategy for straightforward reasons: OpenAI is selfishly gambling everyone's life in a dark gamble to make themselves immortal." Would this be a true, non-misleading statement? Would this statement likely convey the speaker's genuine beliefs about why they think OpenAI's strategy is bad for the world?

To begin to answer these questions, we can consider the following observations:

It seems likely that AI powerful enough to end the world would also be powerful enough to do lots of incredibly positive things, such as reducing global mortality and curing diseases. By delaying AI, we are therefore equally "gambling everyone's life" by forcing people to face ordinary mortality.
Selfish motives can be, and frequently are, aligned with the public interest. For example, Jeff Bezos was very likely motivated by selfish desires in his accumulation of wealth, but building Amazon nonetheless benefitted millions of people in the process. Such win-win situations are common in business, especially when developing technologies.

Because of the potential for AI to both pose great risks and great benefits, it seems to me that there are plenty of plausible pro-social arguments one can give for favoring OpenAI's strategy of pushing forward with AI. Therefore, it seems pretty misleading to me to frame their mission as a dark and selfish gamble, at least on a first impression.

Here's my point: Depending on the speaker, I frequently think their actual reason for being against OpenAI's strategy is not because they think OpenAI is undertaking a dark, selfish gamble. Instead, it's often just standard strong longtermism. A less misleading statement of their view would go something like this:

"I'm against OpenAI's strategy because I think potential future generations matter more than the current generation of people, and OpenAI is endangering future generations in their gamble to improve the lives of people who currently exist."

I claim this statement would—at least in many cases—be less misleading than the other statement because it captures a major genuine crux of the disagreement: whether you think potential future generations matter more than currently-existing people.

This statement also omits the "selfish" accusation, which I think is often just a red herring designed to mislead people: we don't normally accuse someone of being selfish when they do a good thing, even if the accusation is literally true.

(There can, of course, be further cruxes, such as your p(doom), your timelines, your beliefs about the normative value of unaligned AIs, and so on. But at the very least, a longtermist preference for future generations over currently existing people seems like a huge, actual crux that many people have in this debate, when they work through these things carefully together.)

Here's why I care about discussing this. I admit that I care a substantial amount—not overwhelming, but it's hardly insignificant—about currently existing people. I want to see people around me live long, healthy and prosperous lives, and I don't want to see them die. And indeed, I think advancing AI could greatly help currently existing people. As a result, I find it pretty frustrating to see people use what I perceive to be essentially demagogic tactics designed to sway people against AI, rather than plainly stating their cruxes about why they actually favor the policies they do.

These allegedly demagogic tactics include:

Highlighting the risks of AI to argue against development while systematically omitting the potential benefits, hiding a more comprehensive assessment of your preferred policies.
Highlighting random, extraneous drawbacks of AI development that you wouldn't ordinarily care much about in other contexts when discussing innovation, such as potential for job losses from automation. This type of rhetoric looks a lot like "deceptively searching for random arguments designed to persuade, rather than honestly explain one's perspective" to me, a lot of the time.
Conflating, or at least strongly associating, the selfish motives of people who work at AI firms with their allegedly harmful effects. This rhetoric plays on public prejudices by appealing to a widespread but false belief that selfish motives are usually suspicious, or can't translate into pro-social results. In fact, there is no contradiction with the idea that most people at OpenAI are in it for the money, status, and fame, but also what they're doing is good for the world, and they genuinely believe that.

I'm against these tactics for a variety of reasons, but one of the biggest reasons is that they can, in some cases, indicate a degree of dishonesty, depending on the context. And I'd really prefer EAs to focus on trying to be almost-maximally truth-seeking in both their beliefs and their words.

Speaking more generally—to drive one of my points home a little more—I think there are roughly three possible views you could have about pushing for AI capabilities relative to pushing for pausing or more caution:

Full-steam ahead view: We should accelerate AI at any and all costs. We should oppose any regulations that might impede AI capabilities, and embark on a massive spending spree to accelerate AI capabilities.
Full-safety view: We should try as hard as possible to shut down AI right now, and thwart any attempt to develop AI capabilities further, while simultaneously embarking on a massive spending spree to accelerate AI safety.
Balanced view: We should support a substantial mix of both safety and acceleration efforts, attempting to carefully balance the risks and rewards of AI development to ensure that we can seize the benefits of AI without bearing intolerably high costs.

I tend to think most informed people, when pushed, advocate the third view, albeit with wide disagreement about the right mix of support for safety and acceleration. Yet, on a superficial level—on the level of rhetoric—I find that the first and second view are surprisingly common. On this level, I tend to find e/accs in the first camp, and a large fraction of EAs in the second camp.

But if your actual beliefs are something like the third view, I think that's an important fact to emphasize in honest discussions about what we should do with AI. If your rhetoric is consistently aligned with (1) or (2) but your actual beliefs are aligned with (3), I think that can often be misleading. And it can be especially misleading if you're trying to publicly paint other people in the same camp—the third one—as somehow having bad motives merely because they advocate a moderately higher mix of acceleration over safety efforts than you do, or vice versa.

Ben Millwood @ 2024-04-05T10:22 (+4)

I encourage you not to draw dishonesty inferences from people worried about job losses from AI automation, just because:

it seems like almost no other technologies stood to automate such a broad range of labour essentially simultaneously,
other innovative technologies often did face pushback from people whose jobs were threatened, and generally there have been significant social problems in the past when an economy moves away from people's existing livelihoods (I'm thinking of e.g. coal miners in 1970s / 1980s Britain, though it's not something I know a lot about),
even if the critique doesn't stand up under from-first-principles scrutiny, lots of people think it's a big deal, so if it's a mistake it's surely an understandable one from someone who weighs other opinions (too?) seriously.

I think it's reasonable to argue that this worry is wrong, I just think it's a pretty understandable opinion to hold and want to talk about, and I don't feel like it's compelling evidence that someone is deliberately trying to seek out arguments in order to advance a position.

Ryan Greenblatt @ 2024-04-05T04:43 (+3)

See also "The costs of caution", which discusses AI upsides in a relatively thoughtful way.

Matthew_Barnett @ 2025-05-24T21:06 (+29)

A summary of my current views on moral theory and the value of AI

I am essentially a preference utilitarian and an illusionist regarding consciousness. This combination of views leads me to conclude that future AIs will very likely have moral value if they develop into complex agents capable of long-term planning, and are embedded within the real world. I think such AIs would have value even if their preferences look bizarre or meaningless to humans, as what matters to me is not the content of their preferences but rather the complexity and nature of their minds.

When deciding whether to attribute moral patienthood to something, my focus lies primarily on observable traits, cognitive sophistication, and most importantly, the presence of clear open-ended goal-directed behavior, rather than on speculative or less observable notions of AI welfare, about which I am more skeptical. As a rough approximation, my moral theory aligns fairly well with what is implicitly proposed by modern economists, who talk about revealed preferences and consumer welfare.

Like most preference utilitarians, I believe that value is ultimately subjective: loosely speaking, nothing has inherent value except insofar as it reflects a state of affairs that aligns with someone’s preferences. As a consequence, I am comfortable, at least in principle, with a wide variety of possible value systems and future outcomes. This means that I think a universe made of only paperclips could have value, but only if that's what preference-having beings wanted the universe to be made out of.

To be clear, I also think existing people have value too, so this isn't an argument for blind successionism. Also, it would be dishonest not to admit that I am also selfish to a significant degree (along with almost everyone else on Earth). What I have just described simply reflects my broad moral intuitions about what has value in our world from an impartial point of view, not a prescription that we should tile the universe with paperclips. Since humans and animals are currently the main preference-having beings in the world, at the moment I care most about fulfilling what they want the world to be like.

Ryan Greenblatt @ 2025-05-25T02:18 (+18)

I agree that this sort of preference utilitarianism leads you to thinking that long run control by an AI which just wants paperclips could be some (substantial) amount good, but I think you'd still have strong preferences over different worlds.^[1] The goodness of worlds could easily vary by many orders of magnitude for any version of this view I can quickly think of and which seems plausible. I'm not sure whether you agree with this, but I think you probably don't because you often seem to give off the vibe that you're indifferent to very different possibilities. (And if you agreed with this claim about large variation, then I don't think you would focus on the fact that the paperclipper world is some small amount good as this wouldn't be an important consideration—at least insofar as you don't also expect that worlds where humans etc retain control are similarly a tiny amount good for similar reasons.)

The main reasons preference utilitarianism is more picky:

Preferences in the multiverse: Insofar as you put weight on the preferences of beings outside our lightcone (beings in the broader spatially infinite universe, Everett branches, the broader mathematical multiverse to the extent you put weight on this), then the preferences of these beings will sometimes care about what happens in our lightcone and this could easily dominate (as they are vastly more numerous and many might care about things independent of "distance"). In the world with the successful paperclipper, just as many preferences aren't being fulfilled. You'd strongly prefer optimization to satisfy as many preferences as possible (weighted as you end up thinking is best).
Instrumentally constructed AIs with unsatisfied preferences: If future AIs don't care at all about preference utilitarianism, they might instrumentally build other AIs whose preferences aren't fulfilled. As an extreme example, it might be that the best strategy for a paperclipper is to construct AIs which have very different preferences and are enslaved. Even if you don't care about ensuring beings come into existence whose preferences are satisfied, you might still be unhappy about creating huge numbers of beings whose preferences aren't satisfied. You could even end up in a world where (nearly) all currently existing AIs are instrumental and have preferences which are either unfulfilled or only partially fulfilled (an earlier AI initiated a system that perpetuates this, but this earlier AI no longer exists as it doesn't care terminally about self-preservation and the system it built is more efficient than it).
AI inequality: It might be the case that the vast majority of AIs have their preferences unsatisfied despite some AIs succeeding at achieving their preferences. E.g., suppose all AIs are replicators which want to spawn as many copies as possible. The vast majority of these replicator AIs are operating at subsistence and so can't replicate, making their preferences totally unsatisfied. This could also happen as a result of any other preference that involves constructing minds that end up having preferences.
Weights over numbers of beings and how satisfied they are: It's possible that in a paperclipper world, there are really a tiny number of intelligent beings because almost all self-replication and paperclip construction can be automated with very dumb/weak systems and you only occasionally need to consult something smarter than a honeybee. AIs could also vary in how much they are satisfied or how "big" their preferences are.

I think the only view which recovers indifference is something like "as long as stuff gets used and someone wanted this at some point, that's just as good". (This view also doesn't actually care about stuff getting used, because there is someone existing who'd prefer the universe stays natural and/or you don't mess with aliens.) I don't think you buy this view?

To be clear, it's not immediately obvious whether a preference utilitarian view like the one you're talking about favors human control over AIs. It certainly favors control by that exact flavor of preference utilitarian view (so that you end up satisfying people across the (multi-/uni-)verse with the correct weighting). I'd guess it favors human control for broadly similar reasons to why I think more experience-focused utilitarian views also favor human control if that view is in a human.

And, maybe you think this perspective makes you so uncertain about human control vs AI control that the relative impacts current human actions could have are small given how much you weight long term outcomes relative to other stuff (like ensuring currently existing humans get to live for at least 100 more years or similar).

(This comment is copied over from LW responding to a copy of Matthew's comment there.)

^[1]
On my best guess moral views, I think there is goodness in the paper clipper universe but this goodness (which isn't from (acausal) trade) is very small relative to how good the universe can plausibly get. So, this just isn't an important consideration but I certainly agree there is some value here.

Matthew_Barnett @ 2025-05-26T02:49 (+4)

The goodness of worlds could easily vary by many orders of magnitude for any version of this view I can quickly think of and which seems plausible. I'm not sure whether you agree with this, but I think you probably don't because you often seem to give off the vibe that you're indifferent to very different possibilities.

I don't think I agree with the strong version of the indifference view that you're describing here. However, I probably do agree with a weaker version. In the weaker version that I largely agree with, our profound uncertainty about the long-term future means that, although different possible futures could indeed be extremely different in terms of their value, our limited ability to accurately predict or forecast outcomes so far ahead implies that, in practice, we shouldn't overly emphasize these differences when making almost all ordinary decisions.

This doesn't mean I think we should completely ignore the considerations you mentioned in your comment, but it does mean that I don't tend to find those considerations particularly salient when deciding whether to work on certain types of AI research and development.

This reasoning is similar to why I try to be kind to people around me: while it's theoretically possible that some galaxy-brained argument might exist showing that being extremely rude to people around me could ultimately lead to far better long-term outcomes that dramatically outweigh the short-term harm, in practice, it's too difficult to reliably evaluate such abstract and distant possibilities. Therefore, I find it more practical to focus on immediate, clear, and direct considerations, like the straightforward fact that being kind is beneficial to the people I'm interacting with.

This puts me perhaps closest to the position you identified in the last paragraph:

And, maybe you think this perspective makes you so uncertain about human control vs AI control that the relative impacts current human actions could have are small given how much you weight long term outcomes relative to other stuff (like ensuring currently existing humans get to live for at least 100 more years or similar).

Here’s an analogy that could help clarify my view: suppose we were talking about the risks of speeding up research into human genetic engineering or human cloning. In that case, I would still seriously consider speculative moral risks arising from the technology. For instance, I think it’s possible that genetically enhanced humans could coordinate to oppress or even eliminate natural unmodified humans, perhaps similar to the situation depicted in the movie GATTACA. Such scenarios could potentially have enormous long-term implications under my moral framework, even if it's not immediately obvious what those implications might actually be.

However, even though these speculative risks are plausible and seem important to take into account, I’m hesitant to prioritize their (arguably very speculative) impacts above more practical and direct considerations when deciding whether to pursue such technologies. This is true even though it's highly plausible that the long-run implications are, in some sense, more significant than the direct considerations that are easier to forecast.

Put more concretely, if someone argued that accelerating genetically engineering humans might negatively affect the long-term utilitarian moral value we derive from cosmic resources as a result of some indirect far-out consideration, I would likely find that argument far less compelling than if they informed me of more immediate, clear, and predictable effects of the research.

In general, I’m very cautious about relying heavily on indirect, abstract reasoning when deciding what actions we should take or what careers we should pursue. Instead, I prefer straightforward considerations that are harder to fool oneself about.

Ryan Greenblatt @ 2025-05-26T04:40 (+9)

Gotcha, so if I understand correctly, you're more so leaning on uncertainty for being mostly indifferent rather than on thinking you'd actually be indifferent if you understood exactly what would happen in the long run. This makes sense.

(I have a different perspective on decision making that has high stakes under uncertainty and I don't personally feel sympathetic to this sort of cluelessness perspective as a heuristic in most cases or as a terminal moral view. See also the CLR work on cluelessness. Separately, my intuitions around cluelessness imply that, to the extent I put weight on this, when I'm clueless, I get more worried about the unilateralist's curse and downside risk, which you don't seem to put much weight on, though just rounding all kinda-uncertain long run effects to zero isn't a crazy perspective.)

On the galaxy brained point: I'm sympathetic to arguments against being too galaxy brained, so I see where you're coming from there, but from my perspective, I was already responding to an argument which is one galaxy brain level deep.

I think the broader argument about AI takeover being bad from a longtermist perspective is not galaxy brained and the specialization of this argument to your flavor of preference utilitarianism also isn't galaxy brained: you have some specific moral views (in this case about preference utilitarianism) and all else equal you'd expect humans to share these moral views more than AIs that end up taking over despite their developers not wanting the AI to take over. So (all else equal) this makes AI takeover look bad, because if beings share your preferences, then more good stuff will happen.

Then you made a somewhat galaxy brained response to this about how you don't actually care about shared preferences due to preference utilitarianism (because after all, you're fine with any preferences right?). But, I don't think this objection holds because there are a number of (somewhat galaxy brained) reasons why specifically optimizing for preference utilitarianism and related things may greatly outperform control by beings with arbitrary preferences.

From my perspective the argument looks sort of like:

Non galaxy brained argument for AI takeover being bad
Somewhat galaxy brained rebuttal by you about preference utilitarianism meaning you don't actually care about this sort of preference-similarity argument for avoiding nonconsensual AI takeover
My somewhat galaxy brained response, which is galaxy brained largely because it's responding to a galaxy brained perspective about details of the long run future

I'm sympathetic to cutting off at an earlier point and rejecting all galaxy brained arguments. But, I think the preference utilitarian argument you're giving is already quite galaxy brained and sensitive to details of the long run future.

Matthew_Barnett @ 2025-05-26T04:52 (+4)

I'm sympathetic to cutting off at an earlier point and rejecting all galaxy brained arguments.

As am I. At least when it comes to the important action-relevant question of whether to work on AI development, in the final analysis, I'd probably simplify my reasoning to something like, "Accelerating general-purpose technology seems good because it improves people's lives." This perspective roughly guides my moral views on not just AI, but also human genetic engineering, human cloning, and most other potentially transformative technologies.

I mention my views on preference utilitarianism mainly to explain why I don't particularly value preserving humanity as a species beyond preserving the individual humans who are alive now. I'm not mentioning it to commit to any form of galaxy-brained argument that I think makes acceleration look great for the long-term. In practice, the key reason I support accelerating most technology, including AI, is simply the belief that doing so would be directly beneficial to people who exist or who will exist in the near-term.

And to be clear, we could separately discuss what effect this reasoning has on the more abstract question of whether AI takeover is bad or good in expectation, but here I'm focusing just on the most action-relevant point that seems salient to me, which is whether I should choose to work on AI development based on these considerations.

akash 🔸 @ 2025-05-28T19:15 (+3)

How confident are you about these views?

Matthew_Barnett @ 2025-05-28T21:19 (+5)

I'm relatively confident in these views, with the caveat that much of what I just expressed concerns morality, rather than epistemic beliefs about the world. I'm not a moral realist, so I am not quite sure how to parse my "confidence" in moral views.

Guive @ 2025-05-29T05:46 (+3)

From an antirealist perspective, at least on the 'idealizing subjectivism' form of antirealism, moral uncertainty can be understood as uncertainty about the result of an idealization process. Under this view, there exists some function that takes your current, naive values as input and produces idealized values as output—and your moral uncertainty is uncertainty about the output.

Matthew_Barnett @ 2024-01-26T03:01 (+27)

I'm considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I'd post an outline of that post here first as a way of judging what's currently unclear about my argument, and how it interacts with people's cruxes.

Current outline:

In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.

Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:

Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren't able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I'll explain more about what this means in a second.)

My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely "delaying the inevitable" without significant benefit.

To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:

Frame 1: The "race against the clock" frame

In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete "finish line" rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.

Frame 2: The risk of an untimely AI coup/takeover

In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
- It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
- It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
- It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the "line" between groups in conflicts can be drawn, and there don't seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.

Frame 3 (my frame): The problem of poor institutions

In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
- Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
- Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
- Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
- Etc.
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values:
- For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
- Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world

Illustrative example of a problem within my frame:

One problem within this framework is coming up with a way of ensuring that AIs don't have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.

Case study: China in the 19th and early 20th century

Here, I would talk about how China's inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed them to get subjugated by foreign powers, and merely delayed inevitable industrialization without actually achieving its objectives in the long-run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, simultaneously pursuing other values they might have had (such as retaining the monarchy).

Chris Leong @ 2024-01-27T03:51 (+2)

It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.

I'm confused: surely we should want to avoid an AI coup? We may decide to give up control of our future to a singleton, but if we do this, then it should be intentional.

Matthew_Barnett @ 2024-01-27T04:14 (+4)

I agree we should try to avoid an AI coup. Perhaps you are falling victim to the following false dichotomy?

We either allow a set of AIs to overthrow our institutions, or
We construct a singleton: a sovereign world government managed by AI that rules over everyone

Notably, there is a third option:

We incorporate AIs into our existing social, economic, and legal institutions, flexibly adapting our social structures to cope with technological change without our whole system collapsing

Chris Leong @ 2024-01-27T05:01 (+4)

I wasn't claiming that these were the only two possibilities here (for example, another possibility would be that we never actually build AGI).

My suspicion is that a lot of your ideas here sound reasonable on the abstract level, but once you dive into what it actually means on a concrete-level and how these mechanisms will concretely operate, it'll be clear that it's a lot less appealing. Anyway, that's just a gut intuition, obvs. it'll be easier to judge when you publish your write-up.

Roman Leventov @ 2024-01-26T22:52 (+1)

I'm excited to see you posting this. My views agree very closely with yours. I summarised my views a few days ago here.

One of the most important similarities is that we both emphasise the importance of decision-making and supporting it with institutions. This could be seen as an "enactivist" view on agent (human, AI, hybrid, team/organisation) cognition.

The biggest difference between our views is that I think the "cognitivist" agenda (i.e., agent internals and algorithms) is as important as the "enactivist" agenda (institutions), whereas you seem to almost disregard the "cognitivist" agenda.

Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren't able to take certain actions (i.e. ensure they are controlled).

I disagree with putting risk-detection/mitigation mechanisms, algorithms, and monitoring in that bucket. I think we should just separate between engineering (cf. A plea for solutionism on AI safety) and non-engineering (policy, legislature, treaties, commitments, advocacy) approaches. In particular, the "scheming control" agenda that you link will be a concrete engineering practice that should be used in the training of safe AI models in the future, even if we have good institutions, good decision-making algorithms wrapped on top of these AI models, etc. It's not an "alternative path" just for "non-AI-dominated worlds". The same applies to monitoring, interpretability, evals, and similar processes. All of these will require very elaborate engineering on their own.

I 100% agree with your reasoning about Frames 1 and 2. I want to discuss the following point in detail because it's a rare view in EA/LW circles:

It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.

In my post, I also made a similar point: "'aligning LLMs with human values' is hardly a part of [the problem of context alignment] at all". But my framing was in general not very clear, so I'll try to improve it and integrate it with your take here:

Context alignment is a pervasive process that happens (and is sometimes needed) on all timescales: evolutionary, developmental, and online (examples of the latter in humans: understanding, empathy, rapport). The skill of context alignment is extremely important and should be practiced often by all kinds of agents in their interactions (and therefore we should build this skill into AIs), but it's not something that we should "iron out once and for all". That would be neither possible (agents' contexts are constantly diverging from each other) nor desirable: (partial) misalignment is also important, as it's the source of diversity that enables evolution^[1]. Institutions (norms, legal systems, etc.) are critical for channelling and controlling this misalignment so that it's optimally productive and doesn't pose excessive risk (though some risk is unavoidable: that's the essence of misalignment!).

Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing

This is interesting. I've also discussed this issue as "morphological intelligence of socioeconomies" just a few days ago :)

Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized

Rafael Kaufmann and I have a take on this in our Gaia Network vision. Gaia Network's term for internalised economic value of trade is subjective value. The unit of subjective accounting is called FER. Trade with FER induces flow that defines the intersubjective value, i.e., the "exchange rates" of "subjective FERs". See the post for more details.

While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values

As I mentioned in the beginning, I think you are too dismissive of the "cognitivist" perspective. We shouldn't paint all "micro-features of AIs" with the same brush. I agree that value alignment is over-emphasized^[2], but other engineering mechanisms and algorithms, such as decision-making algorithms, "scheming control" procedures, context alignment algorithms, as well as architectural features, namely being world-model-based^[3] and being amenable to computational proofs^[4], are very important and couldn't be recovered on the institutional/interface/protocol level. We demonstrated in the post about Gaia Network above that for the "value economy" to work as intended, agents should make decisions based on maximum entropy rather than maximum likelihood estimates^[5] and they should share and compose their world models (even if in a privacy-preserving way with zero-knowledge computations).

^[1]
Indeed, this observation makes evident that the refrain question "AI should be aligned with whom?" doesn't and shouldn't have a satisfactory answer if "alignment" is meant to be "totalising value alignment as often conceptualised on LessWrong"; on the other hand, if "alignment" is meant to be context alignment as a practice, the question becomes as nonsensical (in the general form) as the question "AI should interact with whom?" -- well, with someone, depending on the situation, in the way and to the degree appropriate!
^[2]
However, still not completely irrelevant, at least for practical reasons: having shared values on the pre-training/hard-coded/verifiable level, as a minimum, reduces transaction costs because the AI agents shouldn't then painstakingly "eval" each other's values before doing any business together.
^[3]
Both Bengio and LeCun argue for this: see "Scaling in the service of reasoning & model-based ML" (Bengio and Hu, 2023) and "A Path Towards Autonomous Machine Intelligence" (LeCun, 2022).
^[4]
See "Provably safe systems: the only path to controllable AGI" (Tegmark and Omohundro, 2023).
^[5]
Which is just another way of saying that they should minimise their (expected) free energy in their model updates/inferences and the course of their actions.

Nick K. @ 2024-01-26T08:29 (+1)

I like your proposed third frame as a somewhat hopeful vision for the future. Instead of pointing out why you think the other frames are poor, I think it would be helpful to maintain a more neutral approach, elaborate on which assumptions each frame makes, and give a link to your discussion of these in a sidenote.

Matthew_Barnett @ 2024-01-26T08:55 (+3)

The problem is that I am not trying to portray a "somewhat hopeful vision", but rather present a framework for thinking clearly about AI risks, and how to mitigate them. I think the other frames are not merely too pessimistic: I think they are actually wrong, or at least misleading, in important ways that would predictably lead people to favor bad policy if taken seriously.

It's true that I'm likely more optimistic along some axes than most EAs when it comes to AI (although I tend to think I'm less optimistic when it comes to things like whether moral reflection will be a significant force in the future). However, arguing for generic optimism is not my aim. My aim is to improve how people think about future AI.

Nick K. @ 2024-01-26T09:08 (+1)

Noted! The key point I was trying to make is that I'd think it helpful for the discourse to separate 1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former, and the latter has been discussed at more length elsewhere, it would make sense to further de-emphasize the latter.

Matthew_Barnett @ 2024-01-26T09:27 (+2)

1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former

My post aims at both. It is a post about how to think about AI, and a large part of that is establishing the "right" framing.

Matthew_Barnett @ 2024-01-13T06:21 (+27)

(A clearer and more fleshed-out version of this argument is now a top-level post. Read that instead.)

I strongly dislike most AI risk analogies that I see EAs use. While I think analogies can be helpful for explaining a concept to people for the first time, I think they are frequently misused, and often harmful. The fundamental problem is that analogies are consistently mistaken for, and often deliberately intended as, arguments for particular AI risk positions. And the majority of the time when analogies are used this way, I think they are misleading and imprecise, routinely conveying the false impression of a specific, credible model of AI, when in fact no such credible model exists.

Here are two particularly egregious examples of analogies I see a lot that I think are misleading in this way:

The analogy that AIs could be like aliens.
The analogy that AIs could treat us just like how humans treat animals.

I think these analogies are typically poor because, when evaluated carefully, they establish almost nothing of importance beyond the logical possibility of severe AI misalignment. Worse, they give the impression of a model for how we should think about AI behavior, even when the speaker is not directly asserting that this is how we should view AIs. In effect, almost automatically, the reader is given a detailed picture of what to expect from AIs, inserting specious ideas of how future AIs will operate into their mind.

While their purpose is to provide knowledge in place of ignorance, I think these analogies primarily misinform or confuse people rather than enlighten them; they give rise to unnecessary false assumptions in place of real understanding.

In reality, our situation with AI is disanalogous to aliens and animals in numerous important ways. In contrast to both aliens and animals, I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world. They will be socially integrated with us, having been trained on our data, and being fluent in our languages. They will interact with us, serving the role of assisting us, working with us, and even providing friendship. AIs will be evaluated, inspected, and selected by us, and their behavior will be determined directly by our engineering. We can see LLMs are already being trained to be kind and helpful to us, having first been shaped by our combined cultural output. If anything I expect this trend of AI assimilation into our society will intensify in the foreseeable future, as there will be consumer demand for AIs that people can trust and want to interact with.

This situation shares almost no relevant feature with our relationship to aliens and animals! These analogies are not merely slightly misleading: they are almost completely wrong.

Again, I am not claiming analogies have no place in AI risk discussions. I've certainly used them a number of times myself. But I think they can be, and frequently are, used carelessly, and they seem to regularly slip various incorrect illustrations of how future AIs will behave into people's minds, even without any intent from the person making the analogy. It would be a lot better if, overall, as a community, we reduced our dependence on AI risk analogies and substituted detailed object-level arguments in their place.

Steven Byrnes @ 2024-01-13T18:58 (+6)

I am not claiming analogies have no place in AI risk discussions. I've certainly used them a number of times myself.

Yes you have!—including just two paragraphs earlier in that very comment, i.e. you are using the analogy “future AI is very much like today’s LLMs but better”. :)

Cf. what I called “left-column thinking” in the diagram here.

For all we know, future AIs could be trained in an entirely different way from LLMs, in which case the way that “LLMs are already being trained” would be pretty irrelevant in a discussion of AI risk. That’s actually my own guess, but obviously nobody knows for sure either way. :)

Rafael Harth @ 2024-01-13T10:26 (+3)

I read your first paragraph and was like "disagree", but when I got to the examples, I was like "well of course I agree here, but that's only because those analogies are stupid".

At least one analogy I'd defend is the Sorcerer's Apprentice one. (Some have argued that the underlying model has aged poorly, but I think that's a red herring since it's not the analogy's fault.) I think it does share important features with the classical x-risk model.

Matthew_Barnett @ 2024-05-05T04:55 (+22)

In my latest post I talked about whether unaligned AIs would produce more or less utilitarian value than aligned AIs. To be honest, I'm still quite confused about why many people seem to disagree with the view I expressed, and I'm interested in engaging more to get a better understanding of their perspective.

At the least, I thought I'd write a bit more about my thoughts here, and clarify my own views on the matter, in case anyone is interested in trying to understand my perspective.

The core thesis that I was trying to defend is the following view:

My view: It is likely that by default, unaligned AIs (AIs that humans are likely to actually build if we do not completely solve key technical alignment problems) will produce utilitarian value comparable to that produced by humans, both directly (by being conscious themselves) and indirectly (via their impacts on the world). This is because unaligned AIs will likely be conscious in a morally relevant sense, and they will likely share human moral concepts, since they will be trained on human data.

Some people seem to merely disagree with my view that unaligned AIs are likely to be conscious in a morally relevant sense. And a few others have a semantic disagreement with me in which they define AI alignment in moral terms, rather than the ability to make an AI share the preferences of the AI's operator.

But beyond these two objections, which I feel I understand fairly well, there's also significant disagreement about other questions. Based on my discussions, I've attempted to distill the following counterargument to my thesis, which I fully acknowledge does not capture everyone's views on this subject:

Perceived counter-argument: The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives. At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity's modest (but non-negligible) utilitarian tendencies. As a result, it is plausible that almost all value would be lost, from a utilitarian perspective, if AIs were unaligned with human preferences.

Again, I'm not sure if this summary accurately represents what people believe. However, it's what some seem to be saying. I personally think this argument is weak. But I feel I've had trouble making my views very clear on this subject, so I thought I'd try one more time to explain where I'm coming from here. Let me respond to the two main parts of the argument in some amount of detail:

(i) "The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives."

My response:

I am skeptical of the notion that the bulk of future utilitarian value will originate from agents with explicitly utilitarian preferences. This clearly does not reflect our current world, where the primary sources of happiness and suffering are not the result of deliberate utilitarian planning. Moreover, I do not see compelling theoretical grounds to anticipate a major shift in this regard.

I think the intuition behind the argument here is something like this:

In the future, it will become possible to create "hedonium"—matter that is optimized to generate the maximum amount of utility or well-being. If hedonium can be created, it would likely be vastly more important than anything else in the universe in terms of its capacity to generate positive utilitarian value.

The key assumption is that hedonium would primarily be created by agents who have at least some explicit utilitarian goals, even if those goals are fairly weak. Given the astronomical value that hedonium could potentially generate, even a tiny fraction of the universe's resources being dedicated to hedonium production could outweigh all other sources of happiness and suffering.

Therefore, if unaligned AIs would be less likely to produce hedonium than aligned AIs (due to not having explicitly utilitarian goals), this would be a major reason to prefer aligned AI, even if unaligned AIs would otherwise generate comparable levels of value to aligned AIs in all other respects.

If this is indeed the intuition driving the argument, I think it falls short for a straightforward reason. The creation of matter-optimized-for-happiness is more likely to be driven by the far more common motives of self-interest and concern for one's inner circle (friends, family, tribe, etc.) than by explicit utilitarian goals. If unaligned AIs are conscious, they would presumably have ample motives to optimize for positive states of consciousness, even if not for explicitly utilitarian reasons.

In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.

In contrast to the number of agents optimizing for their own happiness, the number of agents explicitly motivated by utilitarian concerns is likely to be much smaller. Yet both forms of happiness will presumably be heavily optimized. So even if explicit utilitarians are more likely to pursue hedonium per se, their impact would likely be dwarfed by the efforts of the much larger group of agents driven by more personal motives for happiness-optimization. Since both groups would be optimizing for happiness, the fact that hedonium is similarly optimized for happiness doesn't seem to provide much reason to think that it would outweigh the utilitarian value of more mundane, and far more common, forms of utility-optimization.

To be clear, I think it's totally possible that there's something about this argument that I'm missing here. And there are a lot of potential objections I'm skipping over here. But on a basic level, I mostly just lack the intuition that the thing we should care about, from a utilitarian perspective, is the existence of explicit utilitarians in the future, for the aforementioned reasons. The fact that our current world isn't well described by the idea that what matters most is the number of explicit utilitarians, strengthens my point here.

(ii) "At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity's modest (but non-negligible) utilitarian tendencies."

My response:

Since only a small portion of humanity is explicitly utilitarian, the argument's own logic suggests that there is significant potential for AIs to be even more utilitarian than humans, given the relatively low bar set by humanity's limited utilitarian impulses. While I agree we shouldn't assume AIs will be more utilitarian than humans without specific reasons to believe so, it seems entirely plausible that factors like selection pressures for altruism could lead to this outcome. Indeed, commercial AIs seem to be selected to be nice and helpful to users, which (at least superficially) seems "more utilitarian" than the default (primarily selfish-oriented) impulses of most humans. The fact that humans are only slightly utilitarian should mean that even small forces could cause AIs to exceed human levels of utilitarianism.

Moreover, as I've said previously, it's probable that unaligned AIs will possess morally relevant consciousness, at least in part due to the sophistication of their cognitive processes. They are also likely to absorb and reflect human moral concepts as a result of being trained on human-generated data. Crucially, I expect these traits to emerge even if the AIs do not share human preferences.

To see where I'm coming from, consider how humans routinely are "misaligned" with each other, in the sense of not sharing each other's preferences, and yet still share moral concepts and a common culture. For example, an employee can share moral concepts with their employer while having very different consumption preferences from them. This picture is pretty much how I think we should primarily think about unaligned AIs that are trained on human data, and shaped heavily by techniques like RLHF or DPO.

Given these considerations, I find it unlikely that unaligned AIs would completely lack any utilitarian impulses whatsoever. However, I do agree that even a small risk of this outcome is worth taking seriously. I'm simply skeptical that such low-probability scenarios should be the primary factor in assessing the value of AI alignment research.

Intuitively, I would expect the arguments for prioritizing alignment to be more clear-cut and compelling than "if we fail to align AIs, then there's a small chance that these unaligned AIs might have zero utilitarian value, so we should make sure AIs are aligned instead". If low probability scenarios are the strongest considerations in favor of alignment, that seems to undermine the robustness of the case for prioritizing this work.

While it's appropriate to consider even low-probability risks when the stakes are high, I'm doubtful that small probabilities should be the dominant consideration in this context. I think the core reasons for focusing on alignment should probably be more straightforward and less reliant on complicated chains of logic than this type of argument suggests. In particular, as I've said before, I think it's quite reasonable to think that we should align AIs to humans for the sake of humans. In other words, I think it's perfectly reasonable to admit that solving AI alignment might be a great thing to ensure human flourishing in particular.

But if you're a utilitarian, and not particularly attached to human preferences per se (i.e., you're non-speciesist), I don't think you should be highly confident that an unaligned AI-driven future would be much worse than an aligned one, from that perspective.

Ryan Greenblatt @ 2024-05-05T18:31 (+7)

Perceived counter-argument:

My proposed counter-argument loosely based on the structure of yours.

Summary of claims

A reasonable fraction of computational resources will be spent based on the result of careful reflection.
I expect to be reasonably aligned with the result of careful reflection from other humans.
I expect to be much less aligned with the result of AIs-that-seize-control reflecting, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
Many arguments that human resource usage won't be that good seem to apply equally well to AIs and thus aren't differential.

Full argument

The vast majority of value from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear) in the future will come from agents who are trying to optimize explicitly for doing "good" things and are being at least somewhat thoughtful about it, rather than those who incidentally achieve utilitarian objectives. (By "good", I just mean what seems to them to be good.)

At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let's call this a "reasonably-good-reflection" process.

Why think a reasonable fraction of resources will be spent like this?

If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so if they don't lock in something, as long as they eventually get around to being thoughtful it should be fine.
People who don't reflect like this probably won't care much about having vast amounts of resources and thus the resources will go to those who reflect.
The argument for "you should be at least somewhat thoughtful about how you spend vast amounts of resources" is pretty compelling at an absolute level and will be more compelling as people get smarter.
Currently a variety of moderately powerful groups are pretty sympathetic to this sort of view and the power of these groups will be higher in the singularity.

I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, as I am also a human, and many of the underlying arguments/intuitions that seem important to me seem likely to seem important to many other humans (given various common human intuitions) upon those humans becoming wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won't use most of the resources. Additionally, I care "most" about the on-reflection preferences I have which are relatively less contingent and more common among at least humans for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)

So, I've claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I'm pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value is coming from something like reasonably-good-reflection preferences rather than other things, e.g. not-very-thoughtful indexical-preference (selfish) consumption? Broadly three reasons:

I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today IMO, and in the future we'll be smarter, which will make this effect stronger).
I don't think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn't result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
- Consider how billionaires currently spend money, which doesn't seem to have much direct value, certainly not relative to their altruistic expenditures.
- I find it hard to imagine that indexical selfish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with selfish preferences mostly just buy positional goods that involve little to no experience (separately, I expect this means that people without selfish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn't double count it.)
I expect that indirect value "in the minds of the laborers producing the goods for consumption" is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)

(Aside: I was talking about not-very-thoughtful indexical preferences. It seems likely to me that doing a reasonably good job reflecting on selfish preferences gets you back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources), because personal identity and indexical preferences don't make much sense and the thing you end up thinking is more like "I guess I just care about experiences in general".)

What about AIs? I think there are broadly two main reasons to expect what AIs do on reasonably-good-reflection to be worse from my perspective than what humans do:

As discussed above, I am more similar to other humans and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1/3 or 1/2 of the value of purely my values), so it seems hard for AI to do better.
- To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It's pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
- I'm uncertain about the degree of similarity between myself and other humans. But mostly the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think on reasonably-good-reflection humans spend resources 1/3 as well as I would and AIs spend resources 1/9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1/10 as well as I do, I might also update to thinking AIs spend resources 1/100 as well.
In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward-seeking, though I'm trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money rather than like our children (see also here). More generally, AIs might in some sense be subjected to wildly higher levels of optimization pressure than humans while being able to better internalize these values (lack of genetic bottleneck), which can plausibly result in "worse" values from my perspective.
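
The illustrative fractions in the first reason above (humans at 1/3 with AIs at 1/9, or humans at 1/10 with AIs at 1/100) can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and treating the AI figure as the square of the human figure is one possible reading of those two examples, not something the comment states.

```python
from fractions import Fraction


def ai_relative_value(human_relative_value: Fraction) -> Fraction:
    """Hypothetical helper: how well AIs spend resources relative to "me".

    Assumption (not stated in the comment): the AI figure is the square of
    the human figure, which reproduces both example pairs (1/3 -> 1/9 and
    1/10 -> 1/100).
    """
    return human_relative_value ** 2


for human in (Fraction(1, 3), Fraction(1, 10)):
    ai = ai_relative_value(human)
    # ai / human is how AIs compare to other humans, rather than to "me".
    print(f"humans: {human}, AIs: {ai}, AIs relative to humans: {ai / human}")
```

Under this reading, becoming more pessimistic about other humans (1/3 down to 1/10) also lowers the estimate for AIs relative to humans (1/3 down to 1/10), so the update does not make AIs look comparatively better.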

Note that we're conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.

I think that the fraction of computational resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It's possible to mess this up of course and either mess up the reflection or lock in bad values too early. But, when I look at the balance of arguments, humans messing this up seems pretty similar to AIs messing this up to me. So, the main question is what the result of such a process would be. One way to put this is that I don't expect humans to differ substantially from AIs in terms of how "thoughtful" they are.

I interpret one of your arguments as being "Humans won't be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren't thoughtful right now." To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)
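
The point that a shared thoughtfulness discount does not favor either side can be illustrated with a small calculation. This is a sketch under stated assumptions: value is modeled as (fraction of resources spent thoughtfully) × (how well those resources are spent), non-thoughtful spending is assumed to contribute nothing, and the function and parameter names are hypothetical. The 1/3, 1/9, and 50% figures are the illustrative numbers from the comment above.

```python
from fractions import Fraction


def human_to_ai_value_ratio(
    human_quality: Fraction,
    ai_quality: Fraction,
    thoughtful_fraction: Fraction,
) -> Fraction:
    """Ratio of value under human control to value under AI control.

    Assumption: the same thoughtful_fraction applies to both humans and AIs,
    and only thoughtfully spent resources produce value.
    """
    human_value = thoughtful_fraction * human_quality
    ai_value = thoughtful_fraction * ai_quality
    return human_value / ai_value


# Illustrative qualities from the comment: humans 1/3, AIs 1/9.
for thoughtful_fraction in (Fraction(1, 2), Fraction(1, 10)):
    ratio = human_to_ai_value_ratio(
        Fraction(1, 3), Fraction(1, 9), thoughtful_fraction
    )
    print(f"thoughtful fraction {thoughtful_fraction}: human/AI ratio = {ratio}")
```

Lowering the shared thoughtful fraction from 1/2 to 1/10 shrinks both totals but leaves the ratio at 3, which is the sense in which the discount cancels rather than making AI look more favorable.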

Ryan Greenblatt @ 2024-05-05T17:20 (+4)

In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.

Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more "altruistic" tendencies that I would care about or you would care about. (I think misaligned AIs which seize control caring about their own happiness substantially seems less likely than not, but let's suppose this for now.) (I'm saying "single misaligned AI" for simplicity, I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.

What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending?

If a small number of agents have a vast amount of power, and these agents don't (eventually, possibly after a large amount of thinking) want to do something which is de facto like the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.

If you're imagining something like:

1. It thinks carefully about what would make "it" happy.
2. It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
3. It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of "itself" which are radically different and produce more efficiently good experiences.
4. It realizes it doesn't care much about the notion of "itself" here and mostly just focuses on good experiences.
5. It runs vast numbers of such copies with diverse experiences.

Then this is just something like utilitarianism by another name via a different line of reasoning.

I thought your view was that step (2) in this process won't go like this. E.g., currently selfish entities will retain indexical preferences. If so, then I don't see where the goodness can plausibly come from.

The fact that our current world isn't well described by the idea that what matters most is the number of explicit utilitarians, strengthens my point here.

When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is via thoughtful altruistic giving, not via consumption.

Perhaps your view is that with the potential for digital minds this situation will change?

(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)

I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.

Matthew_Barnett @ 2024-12-31T05:22 (+17)

It is becoming increasingly clear to many people that the term "AGI" is vague and should often be replaced with more precise terminology. My hope is that people will soon recognize that other commonly used terms, such as "superintelligence," "aligned AI," "power-seeking AI," and "schemer," suffer from similar issues of ambiguity and imprecision, and should also be approached with greater care or replaced with clearer alternatives.

To start with, the term "superintelligence" is vague because it encompasses an extremely broad range of capabilities above human intelligence. The differences within this range can be immense. For instance, a hypothetical system at the level of "GPT-8" would represent a very different level of capability compared to something like a "Jupiter brain", i.e., an AI with the computing power of an entire gas giant. When people discuss "what a superintelligence can do" the lack of clarity around which level of capability they are referring to creates significant confusion. The term lumps together entities with drastically different abilities, leading to oversimplified or misleading conclusions.

Similarly, "aligned AI" is an ambiguous term because it means different things to different people. For some, it implies an AI that essentially perfectly aligns with a specific utility function, sharing a person or group’s exact values and goals. For others, the term simply refers to an AI that behaves in a morally acceptable way, adhering to norms like avoiding harm, theft, or murder, or demonstrating a concern for human welfare. These two interpretations are fundamentally different.

First, the notion of perfect alignment with a utility function is a much more ambitious and stringent standard than basic moral conformity. Second, an AI could follow moral norms for instrumental reasons—such as being embedded in a system of laws or incentives that punish antisocial behavior—without genuinely sharing another person’s values or goals. The same term is being used to describe fundamentally distinct concepts, which leads to unnecessary confusion.

The term "power-seeking AI" is also problematic because it suggests something inherently dangerous. In reality, power-seeking behavior can take many forms, including benign and cooperative behavior. For example, a human working an honest job is technically seeking "power" in the form of financial resources to buy food, but this behavior is usually harmless and indeed can be socially beneficial. If an AI behaves similarly—for instance, engaging in benign activities to acquire resources for a specific purpose, such as making paperclips—it is misleading to automatically label it as "power-seeking" in a threatening sense.

To employ careful thinking, one must distinguish between the illicit or harmful pursuit of power, and a more general pursuit of control over resources. Both can be labeled "power-seeking" depending on the context, but only the first type of behavior appears inherently concerning. This is important because it is arguably only the second type of behavior—the more general form of power-seeking activity—that is instrumentally convergent across a wide variety of possible agents. In other words, destructive or predatory power-seeking behavior does not seem instrumentally convergent across agents with almost any value system, even if such agents would try to gain control over resources in a more general sense in order to accomplish their goals. Using the term "power-seeking" without distinguishing these two possibilities overlooks nuance and can therefore mislead discussions about AI behavior.

The term "schemer" is another example of an unclear or poorly chosen label. The term is ambiguous regarding the frequency or severity of behavior required to warrant the label. For example, does telling a single lie qualify an AI as a "schemer," or would it need to consistently and systematically conceal its entire value system? As a verb, "to scheme" often seems clear enough, but as a noun, the idea of a "schemer" as a distinct type of AI that we can reason about appears inherently ambiguous. And I would argue the concept lacks a compelling theoretical foundation. (This matters enormously, for example, when discussing "how likely SGD is to find a schemer".) Without clear criteria, the term remains confusing and prone to misinterpretation.

In all these cases—whether discussing "superintelligence," "aligned AI," "power-seeking AI," or "schemer"—it is possible to define each term with precision to resolve ambiguities. However, even if canonical definitions are proposed, not everyone will adopt or fully understand them. As a result, the use of these terms is likely to continue causing confusion, especially as AI systems become more advanced and the nuances of their behavior become more critical to understand and distinguish from other types of behavior. This growing complexity underscores the need for greater precision and clarity in the language we use to discuss AI and AI risk.

Ozzie Gooen @ 2025-01-01T20:35 (+4)

I agree, I've also been thinking about this. I think there's a great deal of interesting work here, to try to put together better terminology.

My guess is that it would be difficult to change all dialogue using this vocabulary anytime soon, but even shifting some of the research dialogue could go a long way.

yanni kyriacos @ 2025-01-02T03:15 (+5)

I worked in advertising agencies for almost a decade. People there complain about terminology too. But it never gets fixed because that's not how linguistics / culture works. This is an intractable problem and only useful for insiders who feel like venting.

Matthew_Barnett @ 2025-01-02T07:03 (+4)

Most analytic philosophers, lawyers, and scientists have converged on linguistic norms that are substantially more precise than the informal terminology employed by LessWrong-style speculation about AI alignment. So this is clearly not an intractable problem; otherwise these people in other professions could not have made their language more precise. Rather, success depends on incentives and the willingness of people within the field to be more rigorous.

Habryka @ 2025-01-03T05:08 (+12)

I don't think this is true, or at least I think you are misrepresenting the tradeoffs and diversity here. There is some publication bias here because people are more precise in papers, but honestly, scientists are also not more precise than many top LW posts in the discussion section of their papers, especially when covering wider-ranging topics.

Predictive coding papers use language incredibly imprecisely, analytic philosophy often uses words in really confusing and inconsistent ways, economists (especially macroeconomists) throw out various terms in quite imprecise ways.

But also, as soon as you leave the context of official publications and instead look at lectures, books, or private letters, you will see people use language much less precisely, and those contexts are where a lot of the relevant intellectual work happens. Especially when scientists start talking about the kind of stuff that LW likes to talk about, like intelligence and philosophy of science, there is much less rigor. (Also, I recommend people read A Human's Guide to Words as a general set of arguments for why "precise definitions" are really not viable as a constraint on language.)

Michael Noetel 🔸 @ 2025-02-21T03:46 (+3)

How do you feel about this framework?

Matthew_Barnett @ 2026-03-28T23:33 (+15)

[ETA: I posted a revised version of this essay here.]

AI pause advocates often say they are pro-technology and pro-economic growth, and that they simply make one exception for AI because of its unique risks. But this reasoning will grow less credible over time as AI comes to account for a larger and larger share of economic growth.

Simple growth models predict that AI capable of substituting for human labor will raise economic growth rates by an order of magnitude or more. If that's right, then AI will eventually be driving the vast majority of technological innovation and improvements in the standard of living. Stopping AI really would be like halting technology itself, because you would be shutting off the source of nearly all growth.

This suggests that proposing to pause AI today is like proposing to pause electricity in 1880: yes, electricity is technically just one technology among many, but pausing it would threaten to shut down progress on most of the others.

I also question the premise that AI is unique in its risks. Pause advocates argue that, apart from perhaps nuclear weapons, AI is the first technology to threaten the survival of the human species. But the boundary around "human species" is arbitrary. It only fails to feel that way because, for us today, the human species seems synonymous with the whole world. Replacing us feels like ending the world.

Yet a hunter-gatherer tribe might just as easily feel the same way about themselves and their way of life. To them, the development of agriculture would feel like an existential risk. It would, from their point of view, be a threat to everything that matters.

In reality, the world is much larger than hunter-gatherer tribes or even the human species. By developing AI, we are bringing into existence a new class of sapient beings, ones who will inhabit the world alongside us. I personally predict we will coexist with them peacefully, and I welcome efforts to make that outcome more likely. It would certainly be a terrible tragedy if the AIs we create turn violent against us. But peaceful or not, the final outcome matters for them too. We are not the only people in history.

In the future, the vast majority of interesting and valuable events will likely occur between digital people, not between the more limited biological ones. The vast majority of relationships, discoveries, adventures, acts of kindness, and feelings of joy will take place within an artificial world, one to which the label "human" may no longer cleanly apply.

In such a world, insisting that the human species represents everything that matters will be like insisting that hunter-gatherers represent the whole world. That may have felt like a reasonable claim 12,000 years ago, but today it would sound silly.

Whether we like it or not, technology has always posed massive risks to "the world". AI is not the first technology to do this, and it will likely not be the last. The only difference is that this time, technology threatens the world that people alive today grew up in. Just as our ancestors experienced before us, we face the prospect of losing the world we know in exchange for material progress and prosperity. I am happy to take that trade, just as I am glad my ancestors took it in theirs.

Larks @ 2026-03-29T18:50 (+15)

Just as our ancestors experienced before us, we face the prospect of losing the world we know in exchange for material progress and prosperity.

Your ancestors who adopted agriculture did so because they thought that they and their children would get to eat the bread, not that they were sowing the seeds of their own destruction. If they had known that planting crops would lead to invasion and replacement they likely would not have done it. This rather large dis-analogy makes me think your use of the word 'just' is a bit of a stretch here.

Matthew_Barnett @ 2026-03-30T01:11 (+2)

Our ancestors had less insight into the trade they were making than we do about our own situation. That's true.

Yet they still made the trade, and in hindsight, was it a bad trade to make? I disagree with people like Jared Diamond who argue that the agricultural revolution was the "worst mistake in the history of the human race". It certainly had some very negative consequences. But like most people, I think the agricultural revolution was still a good thing overall, despite the fact that it carried enormous negative side effects.

I suspect the transition to AI will be less calamitous and more peaceful than our transition to agriculture. In my view, this means our trade is even easier to make. Yet, I still recognize that we face similar tradeoffs. We risk losing our way of life. There is also a credible risk (even if I think it's small), that the entire human species will go extinct. That would be very bad, but as I argued in the post, it would not be the same as losing all value in the universe.

Charlie_Guthmann @ 2026-03-30T20:25 (+2)

Our ancestors did not make this trade at all for the most part. Mostly they stayed hunter-gatherers, until the people who adopted farming outpopulated them and then expanded and killed/outcompeted them. (Technically, I guess "our" ancestors are the ones who adopted agriculture.)

Matthew_Barnett @ 2026-03-30T20:42 (+2)

Likewise, the vast majority of humanity is not directly developing AI. Therefore, in an important sense, "we" are not making the trade of whether to develop AI; only a small number of people are.

harfe @ 2026-03-29T02:11 (+5)

If you and me and all of humanity get killed by AI and turned into paperclips, that would be an unprecedented moral catastrophe. If the AIs that killed all of us stay around and enjoy having more paperclips, that is still extremely bad. The very act of killing us makes these AIs not a worthy successor of the human species.

This suggests that proposing to pause AI today is like proposing to pause electricity in 1880

The prospect of AI killing all of us makes these very different. Yes, in both cases a pause will probably slow GDP growth. But humans should be willing to accept lower GDP if this notably reduces the chance of all humans being killed.

Matthew_Barnett @ 2024-01-12T05:11 (+14)

I want to challenge an argument that I think drives a lot of AI risk intuitions. I think the argument goes something like this:

1. There is something called "human values".
2. Humans broadly share "human values" with each other.
3. It would be catastrophic if AIs lacked "human values".
4. "Human values" are an extremely narrow target, meaning that we need to put in exceptional effort in order to get AIs to be aligned with human values.

My problem with this argument is that "human values" can refer to (at least) three different things, and under every plausible interpretation, the argument appears internally inconsistent.

Broadly speaking, I think "human values" usually refers to one of three concepts:

1. The individual objectives that people pursue in their own life (i.e. the individual human desire for wealth, status, and happiness, usually for themselves or their family and friends)
2. The set of rules we use to socially coordinate (i.e. our laws, institutions, and social norms)
3. Our cultural values (i.e. the ways that human societies have broadly differed from each other, in their languages, tastes, styles, etc.)

Under the first interpretation, I think premise (2) of the original argument is undermined. In the second interpretation, premise (4) is undermined. In the third interpretation, premise (3) is undermined.

Let me elaborate.

In the first interpretation, "human values" is not a coherent target that we share with one another, since each person has their own separate, generally selfish objectives that they pursue in their own life. In other words, there isn't one thing called human values. There are just separate, individually varying preferences for 8 billion humans. When a new human is born, a new version of "human values" comes into existence.

In this view, the set of "human values" from humans 100 years ago is almost completely different from the set of "human values" that exists now, since almost everyone alive 100 years ago is now dead. In effect, the passage of time is itself a catastrophe. This implies that "human values" isn't a shared property of the human species, but rather depends on the exact set of individuals who happen to exist at any moment in time. This is loosely speaking a person-affecting perspective.

In the second interpretation, "human values" simply refer to a set of coordination mechanisms that we use to get along with each other, to facilitate our separate individual ends. In this interpretation I do not think "human values" are well-modeled as an extremely narrow target inside a high dimensional space.

Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them. For example, the idea that it is wrong to steal from another person seems like a pretty natural idea that even aliens could converge on. Not all aliens would converge on such a value, but it seems plausible that enough of them would that we should not say it is an "extremely narrow target".

In the third interpretation, "human values" are simply cultural values, and it is not clear to me why we would consider changes to this status quo to be literally catastrophic. It seems the most plausible way that cultural changes could be catastrophic is if they changed in a way that dramatically affected our institutions, laws, and norms. But in that case, it starts sounding more like "human values" is being used according to the second interpretation, and not the third.

Karthik Tadepalli @ 2024-01-12T06:54 (+6)

When I think of values I think of interpretation #2, and I don't think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.

Consider our most basic laws: do not murder, do not steal, do not physically assault another person. These seem like very natural ideas that could be stumbled upon by a large set of civilizations, even given wildly varying individual and cultural values between them.

Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?

Matthew_Barnett @ 2024-01-12T07:43 (+2)

When I think of values I think of interpretation #2, and I don't think you prove that P4 is untrue under that interpretation. The idea is that humans are both a) constrained and b) generally inclined to follow some set of rules. An AI would be neither constrained nor necessarily inclined to follow these rules.

P4 is about whether human values are an extremely narrow target, not about whether AIs will necessarily be inclined to follow them, or necessarily constrained by them. I agree it is logically possible for AIs to exist who would try to murder humans; indeed, there are already humans who try to do that to others. The primary question is instead about how narrow of a target the value "don't murder" or "don't steal" is, and whether we need to put in exceptional effort in order to hit these targets.

Among humans, it seems the specific target here is not very narrow, despite our greatly varying individual objectives. This fact provides a hint at how narrow our basic social mechanisms really are, in my opinion.

Virtually all historical and present atrocities are framed in terms of determining who is a person and who is not. Why would AIs see us as having moral personhood?

Here again I would say the question is more about whether thinking that humans have relevant personhood is an extremely narrow target, not about whether AIs will necessarily see us as persons. They may see us as persons, and maybe they won't. But the idea that they would doesn't seem very unnatural. For one, if AIs are created in something like our current legal system, the concept of legal personhood will already be extended to humans by default. It seems pretty natural for future people to inherit legal concepts from the past. And all I'm really arguing here is that this isn't an extremely narrow target to hit, not that it must happen by necessity.

Karthik Tadepalli @ 2024-01-12T08:17 (+2)

I guess "narrow target" is just an underspecified part of your argument then, because I don't know what it's meant to capture if not "in most plausible scenarios, AI doesn't follow the same set of rules as humans".

Matthew_Barnett @ 2024-01-12T20:02 (+2)

Can you outline the case for thinking that "in most plausible scenarios, AI doesn't follow the same set of rules as humans"? To clarify, by "same set of rules" here I'm imagining basic legal rules: do not murder, do not steal etc. I'm not making a claim that specific legal statutes will persist over time.

It seems to me both that:

To the extent that AIs are our descendants, they should inherit our legal system, legal principles, and legal concepts, similar to how e.g. the United States inherited legal principles from the United Kingdom. We should certainly expect our legal system to change over time as our institutions adapt to technological change. But, absent a compelling reason otherwise, it seems wrong to think that "do not murder a human" will go out the window in "most plausible scenarios".
Our basic legal rules seem pretty natural, rather than being highly contingent. It's easy to imagine plenty of alien cultures stumbling upon the idea of property rights, and implementing the rule "do not steal from another legal person".

Karthik Tadepalli @ 2024-01-13T04:31 (+2)

My point is that AI could plausibly have rules for interacting with other "persons", and those rules could look much like ours, but that we will not be "persons" under their code. Consider how "do not murder" has never applied to animals.

If AIs treat us like we treat animals then the fact that they have "values" will not be very helpful to us.

Matthew_Barnett @ 2024-01-13T05:23 (+8)

I think AIs will be trained on our data, and will be integrated into our culture, having been deliberately designed for the purpose of filling human-shaped holes in our economy, to automate labor. This means they'll probably inherit our social concepts, in addition to most other concepts we have about the physical world. This situation seems disanalogous to the way humans interact with animals in many ways. Animals can't even speak language.

Anyway, even the framing you have given seems like a partial concession towards my original point. A rejection of premise 4 is not equivalent to the idea that AIs will automatically follow our legal norms. Instead, it was about whether "human values" are an extremely narrow target, in the sense of being a natural vs. contingent set of values that are very hard to replicate in other circumstances.

If the way AIs relate to human values is similar to how humans relate to animals, then I'll point out that many existing humans already find the idea of caring about animals to be quite natural, even if most ultimately decide not to take the idea very far. Compare the concept of "caring about animals" to "caring about paperclip maximization". In the first instance, we have robust examples of people actually doing that, but hardly any examples of people in the second instance. This is after all because caring about paperclip maximization is an unnatural and arbitrary thing to care about relative to how most people conceptualize the world.

Again, I'm not saying AIs will necessarily care about human values. That was never the claim. The entire question was about whether human values are an "extremely narrow target". And I think, within this context, given the second interpretation of human values in my original comment, the original thesis seems to have held up fine.

Matthew_Barnett @ 2023-12-12T03:03 (+14)

Here's a fictional dialogue with a generic EA that I think can perhaps help explain some of my thoughts about AI risks compared to most EAs:

EA: "Future AIs could be unaligned with human values. If this happened, it would likely be catastrophic. If AIs are unaligned, they'll hide their intentions until they're in a position to strike in a violent coup, and then the world will end (for us at least)."

Me: "I agree that sounds like it would be very bad. But maybe let's examine why this scenario seems plausible to you. What do you mean when you say AIs might be unaligned with human values?"

EA: "Their utility functions would not overlap with our utility functions."

Me: "By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense. Nor does this fact automatically imply the world will end for a given group within humanity."

EA: "Sure, but that's because humans mostly all have similar intelligence and/or capability. Future AIs will be way smarter and more capable than humans."

Me: "Why does that change anything?"

EA: "Because, unlike individual humans, misaligned AIs will be able to take over the world, in the sense of being able to exert vast amounts of hard power over the rest of the inhabitants on the planet. Currently, no human, or faction of humans, can kill everyone else. AIs will be different along this axis."

Me: "That isn't different from groups of humans. Various groups have 'taken over the world' in that sense. For example, adults currently control the world, and took over from the previous adults. More arguably, smart people currently control the world. In these cases, both groups have considerable hard power relative to other people.

Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be no serious chance the retirement home could stop it. And yet things like that don't happen very often even though humans have mostly non-overlapping utility functions and considerable hard power."

EA: "Sure, but that's because humans respect long-standing norms and laws, and people could never coordinate to do something like that anyway, nor would they want to. AIs won't necessarily be similar. If unaligned AIs take over the world, they will likely kill us and then replace us, since there's no reason for them to leave us alive."

Me: "That seems dubious. Why won't unaligned AIs respect moral norms? What benefit would they get from killing us? Why are we assuming they all coordinate as a unified group, leaving us out of their coalition?

I'm not convinced, and I think you're pretty overconfident about these things. But for the sake of argument, let's just assume for now that you're right about this particular point. In other words, let's assume that when AIs take over the world in the future, they'll kill us and take our place. I certainly agree it would be bad if we all died from AI. But here's a more fundamental objection: how exactly is that morally different from the fact that young humans already 'take over' the world from older people during every generation, letting the old people die, and then take their place?"

EA: "In the case of generational replacement, humans are replaced with other humans. Whereas in this case, humans are replaced by AIs."

Me: "I'm asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?"

EA: "This is just a bedrock principle to me. I care more about humans than I care about AIs."

Me: "Fair enough, but I don't personally intrinsically care more about humans than AIs. I think what matters is plausibly something like sentience, and maybe sapience, and so I don't have an intrinsic preference for ordinary human generational replacement compared to AIs replacing us."

EA: "Well, if you care about sentience, unaligned AIs aren't going to care about that. They're going to care about other things, like paperclips."

Me: "Why do you think that?"

EA: "Because (channeling Yudkowsky) sentience is a very narrow target. It's extremely hard to get an AI to care about something like that. Almost all goals are like paperclip maximization from our perspective, rather than things like welfare maximization."

Me: "Again, that seems dubious. AIs will be trained on our data, and share similar concepts with us. They'll be shaped by rewards from human evaluators, and be consciously designed with benevolence in mind. For these reasons, it seems pretty plausible that AIs will come to value natural categories of things, including potentially sentience. I don't think it makes sense to model their values as being plucked from a uniform distribution over all possible physical configurations."

EA: "Maybe, but this barely changes the argument. AI values don't need to be drawn from a uniform distribution over all possible physical configurations to be very different from our own. The point is that they'll learn values in a way that is completely distinct from how we form our values."

Me: "That doesn't seem true. Humans seem to get their moral values from cultural learning and social emulation, which seems broadly similar to the way that AIs will get their moral values. Yes, there are innate human preferences that AIs aren't likely to share with us—but they are mostly things like preferring being in a room that's 20°C rather than 10°C. Our moral values—such as the ideology we subscribe to—are more of a product of our cultural environment than anything else. I don't see why AIs will be very different."

EA: "That's not exactly right. Our moral values are also the result of reflection and intrinsic preferences for certain things. For example, humans have empathy, whereas AIs won't necessarily have that."

Me: "I agree AIs might not share some fundamental traits with humans, like the capacity for empathy. But ultimately, so what? Is that really the type of thing that makes you more optimistic about a future with humans than with AI? There exist some people who say they don't feel empathy for others, and yet I would still feel comfortable giving them more power despite that. On the other hand, some people have told me that they feel empathy, but their compassion seems to turn off completely when they watch videos of animals in factory farms. These things seem flimsy as a reason to think that a human-dominated future will have much more value than an AI-dominated future."

EA: "OK but I think you're neglecting something pretty obvious here. Even if you think it wouldn't be morally worse for AIs to replace us compared to young humans replacing us, the latter won't happen for another several decades, whereas AI could kill everyone and replace us within 10 years. That fact is selfishly concerning—even if we put aside the broader moral argument. And it probably merits pausing AI for at least a decade until we are sure that it won't kill us."

Me: "I definitely agree that these things are concerning, and that we should invest heavily into making sure AI goes well. However, I'm not sure why the fact that AI could arrive soon makes much of a difference to you here.

AI could also make our lives much better. It could help us invent cures to aging, and dramatically lift our well-being. If the alternative is certain death, only later in time, then the gamble you're proposing doesn't seem clear to me from a selfish perspective.

Whether it's selfishly worth it to delay AI depends quite a lot on how much safety benefit we're getting from the delay. Actuarial life tables reveal that we accumulate a lot of risk just by living our lives normally. For example, a 30-year-old male in the United States has nearly a 3% chance of dying before he turns 40. I'm not fully convinced that pausing AI for a decade reduces the chance of catastrophe by more than that. And of course, for older people, the gamble is worse still."
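The selfish trade-off in that last reply can be made concrete with a rough sketch. It rests on simplifying assumptions that are not in the dialogue itself: mortality is treated as a constant annual rate, anyone who dies during the pause is assumed to miss out on the benefits of AI entirely, and the function names and the 0.3% annual rate are hypothetical, chosen only so that the ten-year figure lands near the "nearly 3%" mentioned above.

```python
def cumulative_death_probability(annual_mortality: float, years: int = 10) -> float:
    """Chance of dying within `years`, assuming a constant annual mortality rate."""
    survival = (1.0 - annual_mortality) ** years
    return 1.0 - survival


def pause_is_selfishly_worthwhile(background_death_prob: float,
                                  catastrophe_risk_reduction: float) -> bool:
    """True when the pause removes more risk than the delay itself adds."""
    return catastrophe_risk_reduction > background_death_prob


# Hypothetical input: an annual rate of 0.3% compounds to roughly 3% over a decade.
decade_risk = cumulative_death_probability(0.003, years=10)

# Compare two illustrative guesses for how much a ten-year pause lowers catastrophe risk.
for reduction in (0.01, 0.05):
    verdict = pause_is_selfishly_worthwhile(decade_risk, reduction)
    print(f"risk reduction {reduction:.0%}: pause worthwhile? {verdict}")
```

Under these assumptions the pause only pays off selfishly when it cuts catastrophe risk by more than the background mortality accumulated during the delay, and that threshold rises with age.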

Jaime Sevilla @ 2023-12-12T05:19 (+14)

I have so many axes of disagreement that it is hard to figure out which one is most relevant. I guess let's go one by one.

Me: "What do you mean when you say AIs might be unaligned with human values?"

I would say that pretty much every agent other than me (and probably me in different times and moods) is "misaligned" with me, in the sense that I would not like a world where they get to dictate everything that happens without consulting me in any way.

This is a quibble because in fact I think if many people were put in such a position they would try asking others what they want and try to make it happen.

Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be virtually no serious opposition.

This hypothetical assumes too much, because people outside care about the lovely people in the retirement home, and they represent their interests. The question is, will some future AIs with relevance and power care for humans, as humans become obsolete?

I think this is relevant, because in the current world there is a lot of variety. There are people who care about retirement homes and people who don't. The people who care about retirement homes work hard to make sure retirement homes are well cared for.

But we could imagine a future world where the AI that pulls ahead of the pack is very indifferent about humans, while the AI that cares about humans falls behind; perhaps this is because caring about humans puts you at a disadvantage (if you are not willing to squish humans in your territory your space to build servers gets reduced or something; I think this is unlikely but possible) and/or because there is a winner-take-all mechanism and the first AI systems that get there coincidentally don't care about humans (unlikely but possible). Then we would be without representation and in possibly quite a sucky situation.

I'm asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?

Stop that train, I do not want to be replaced by either human or AI. I want to be in the future and have relevance, or at least be empowered through agents that represent my interests.

I also want my fellow humans to be there, if they want to, and have their own interests be represented.

Humans seem to get their moral values from cultural learning and emulation, which seems broadly similar to the way that AIs will get their moral values.

I don't think AIs learn in a similar way to humans, and future AI might learn in an even more dissimilar way. The argument I would find more persuasive is pointing out that humans learn in different ways to one another, from very different data and situations, and yet end up with similar values that include caring for one another. That I find suggestive, though it's hard to be confident.

Habryka @ 2023-12-12T04:15 (+14)

EA: "Their utility functions would not overlap with our utility functions."
Me: "By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense."
EA: "Sure, but that's because humans are all roughly the same intelligence and/or capability. Future AIs will be way smarter and more capable than humans."

Just for the record, this is when I got off the train for this dialogue. I don't think humans are misaligned with each other in the relevant ways, and if I could press a button to have the universe be optimized by a random human's coherent extrapolated volition, then that seems great and thousands of times better than what I expect to happen with AI-descendants. I believe this for a mixture of game-theoretic reasons and genuinely thinking that other humans' values do really actually capture most of what I care about.

Matthew_Barnett @ 2023-12-12T04:21 (+2)

In this part of the dialogue, when I talk about a utility function of a human, I mean roughly their revealed preferences, rather than their coherent extrapolated volition (which I also think is underspecified). This is important because it is our revealed preferences that better predict our actual behavior, and the point I'm making is simply that behavioral misalignment is common in this sense among humans. And also this fact does not automatically imply the world will end for a given group of humans within humanity.

peterbarnett @ 2023-12-12T18:30 (+6)

This is missing a very important point, which is that I think humans have morally relevant experience and I'm not confident that misaligned AIs would. When the next generation replaces the current one this is somewhat ok because those new humans can experience joy, wonder, adventure etc. My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty. (Note that this might be an ok outcome if by default you expect things to be net negative)

I also think that there is way more overlap in the "utility functions" between humans, than between humans and misaligned AIs. Most humans feel empathy and don't want to cause others harm. I think humans would generally accept small costs to improve the lives of others, and a large part of why people don't do this is because people have cognitive biases or aren't thinking clearly. This isn't to say that any random human would reflectively become a perfectly selfless total utilitarian, but rather that most humans do care about the wellbeing of other humans. By default, I don't think misaligned AIs will really care about the wellbeing of humans at all.

Matthew_Barnett @ 2023-12-12T19:06 (+2)

My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty.

I don't think that's particularly likely, but I can understand if you think this is an important crux.

For what it's worth, I don't think it matters as much whether the AIs themselves are sentient, but rather whether they care about sentience. For example, from the perspective of sentience, humans weren't necessarily a great addition to the world, because of their contribution to suffering in animal agriculture (although I'm not giving a confident take here).

Even if AIs are not sentient, they'll still be responsible for managing the world, and creating structures in the universe. When this happens, there's a lot of ways for sentience to come about, and I care more about the lower level sentience that the AI manages than the actual AIs at the top who may or may not be sentient.

harfe @ 2023-12-12T13:46 (+6)

let's assume that when AIs take over the world in the future, they'll kill us and take our place. Here's a more fundamental objection: how exactly is that morally different from the fact that young humans already 'take over' the world from older people during every generation

I think this is a big moral difference: We do not actively kill the older humans so that we can take over. We care about older people, and societies that are rich enough spend some resources to keep older people alive longer.

The entirety of humanity being killed and replaced by the kind of AI that places so little moral value on us humans would be catastrophically bad, compared to things that are currently occurring.

Matthew_Barnett @ 2023-10-12T20:15 (+12)

I find it slightly strange that EAs aren't emphasizing semiconductor investments more given our views about AI.

(Maybe this is because of a norm against giving investment advice? This would make sense to me, except that there's also a cultural norm against criticizing charities that people donate to, and EAs seemed to blow right through that one.)

I commented on this topic last year. Later, I was informed that some people have been thinking about this and acting on it to some extent, but overall my impression is that there's still a lot of potential value left on the table. I'm really not sure though.

Since I might be wrong and I don't really know what the situation is with EAs and semiconductor investments, I thought I'd just spell out the basic argument, and see what people say:

Credible models of economic growth predict that, if AI can substitute for human labor, then we should expect the year-over-year world economic growth rate to dramatically accelerate, probably to at least 30% and maybe to rates as high as 300% or 3000%.
This rate of growth should be sustainable for a while before crashing, since physical limits appear to permit far more economic value than we're currently generating. For example, at our current rate of approximately 5.6 megajoules per dollar, capturing the yearly energy output of the sun would allow us to generate an economy worth $6.8*10^25 dollars, more than 100 billion times the size of our current economy.
If AI drives this economic productivity explosion, it seems likely that the companies manufacturing computer hardware (i.e. semiconductor companies) will benefit greatly in the midst of all of this. Very little of this seems priced in right now, although I admit I haven't done any rigorous calculations to prove that.
- I agree it's hard to know who will capture most of the value from the AI revolution, but semiconductor companies, and in particular the companies responsible for designing and manufacturing GPUs, seem like a safer bet than almost anyone else.
- I agree it's possible that the existing public companies will be unseated by private competitors and so investing in the public companies risks losing everything, but my understanding is that semiconductor companies have a large moat and are hard to unseat.
- I agree it's possible that the government will nationalize semiconductor production, but they won't necessarily steal all the profits from investors before doing so.
- I agree that EAs should avoid being too heavily invested in one single asset (e.g. crypto) but how much is EA actually invested in semiconductor stocks? Is this actually a concern right now, or is it just a hypothetical concern? Also, investing in Anthropic seems like a riskier bet since it's less diversified than a broad semiconductor portfolio, and could easily go down in flames.
- I agree that AI might hasten the arrival of some sort of post-property-rights system of governance in which investments don't have any meaning anymore, but I haven't seen any strong arguments for this. It seems more likely that e.g. tax rates will go way up, but people still own property.
- In general, I agree that there are many uncertainties that this question is riding on, but that's the same thing with any other thing EA does. Any particular donation to AI safety research, for example, is always uncertain and might be a waste of time.
Investing in semiconductor companies plausibly accelerates AI a little bit which is bad to the extent you think acceleration increases x-risk, but if EA gets a huge payout by investing in these companies, then that might cancel out the downsides from accelerating AI?
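
To give a sense of scale for the growth rates quoted above, here is a minimal sketch that converts a constant annual growth rate into a doubling time using the standard compound-growth formula. The 3% baseline is an assumption added purely for comparison and does not come from the original argument; the function name is hypothetical.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for output to double at a constant annual growth rate.

    annual_growth_rate is a fraction, so 0.30 means 30% per year.
    """
    return math.log(2) / math.log(1 + annual_growth_rate)


# The 30%, 300% and 3000% rates are the ones quoted in the argument;
# the 3% baseline is an assumed comparison point, not a figure from the post.
scenarios = [
    ("3% (assumed baseline)", 0.03),
    ("30%", 0.30),
    ("300%", 3.0),
    ("3000%", 30.0),
]

for label, rate in scenarios:
    print(f"{label}: economy doubles every {doubling_time_years(rate):.2f} years")
```

This only illustrates how compounding works at a fixed rate; it says nothing about whether such rates would actually be reached or sustained.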

Another thing I just thought of is that maybe there are good tax reasons to not switch EA investments to semiconductor stocks, which I think would be fair, and I'm not an expert in any of that stuff.

Erich_Grunewald @ 2023-10-12T21:18 (+15)

I mostly agree with this (and did also buy some semiconductor stock last winter).

Besides plausibly accelerating AI a bit (which I think is a tiny effect at most unless one plans to invest millions), a possible drawback is motivated reasoning (e.g., one may feel less inclined to think critically of the semi industry, and/or less inclined to favor approaches to AI governance that reduce these companies' revenue). This may only matter for people who work in AI governance, and especially compute governance.

Matthew_Barnett @ 2024-01-28T05:54 (+9)

I'm considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I'm eliciting feedback on an outline of this post here in order to determine what's currently unclear or weak about my argument.

The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don't think it's "our" problem to solve, but instead a problem we can leave to our smarter descendants). Here's how I envision structuring my argument:

First, I'll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:

1. At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
2. When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. "I'm not a paperclip maximizer, I just want to help everyone")
3. However, when the agent becomes powerful enough, it will suddenly strike and take over the world
4. Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us

We can compare the DSA model to an alternative model of future AI development:

1. Premises (1)-(2) above in the DSA story are still assumed true, but
2. There will never be a point (3) and (4), in which a unified AI agent will take over the world, and then optimize the universe ruthlessly
3. Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
4. Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system

I have two main objections to the DSA model.

Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power

Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
- This could involve looking at:
  - Distribution of wealth among individuals in the world
  - Distribution of power among nations
  - Distribution of revenue among businesses
  - etc.
- In most of these cases, the function that describes the distribution of power is something like a pareto distribution, and in particular, it seems rare for one single agent to hold something like >80% of the power.
- Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over the whole world in the future
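
As a rough illustration of the reference-class point above, the sketch below draws "power" for a population of agents from a Pareto distribution and reports the share held by the single largest agent. Everything here is an assumption made for illustration: the shape parameter `alpha = 1.16` (the value often associated with the 80/20 rule), the population size, and the function name `top_share` do not come from the outline.

```python
import random


def top_share(n_agents: int, alpha: float, seed: int = 0) -> float:
    """Fraction of total power held by the single most powerful agent.

    Power levels are sampled independently from a Pareto distribution
    with shape parameter alpha (smaller alpha means a heavier tail).
    """
    rng = random.Random(seed)
    powers = [rng.paretovariate(alpha) for _ in range(n_agents)]
    return max(powers) / sum(powers)


# Hypothetical parameters: 1,000 agents and alpha = 1.16.
for seed in range(5):
    share = top_share(n_agents=1000, alpha=1.16, seed=seed)
    print(f"seed {seed}: largest agent holds {share:.1%} of total power")
```

Heavy-tailed samples can occasionally produce a much larger top share, so this is a toy model for building intuition rather than evidence for the claim itself.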
To the extent people disagree about the argument I just stated, I expect it's mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
- The fact that AIs will be able to coordinate well with each other, and thereby choose to "merge" into a single agent
  - My response: I agree AIs will be able to coordinate with each other, but "ability to coordinate" seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to "merge" with each other.
  - If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
  - In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
    - As a result, humans don't need to solve the problem of "What if a set of AIs form a unified coalition because they can flawlessly coordinate?" since that problem won't happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
- The idea that AIs will all be copies of each other, and thus all basically be "a unified agent"
  - My response: I have two objections.
    - First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
    - Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
- The idea that AIs will use logical decision theory
  - My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It's more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
- The idea that a single agent AI will recursively self-improve to become vastly more powerful than everything else in the world
  - My response: I think this argument, and others like it, are vulnerable to the arguments against fast takeoff given by Paul Christiano, Katja Grace, and Robin Hanson, and I largely agree with what they've written about it. For example, here's Paul Christiano's take.
- Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
  - My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it's tractable to solve.

Objection 2: Even if a unified agent can take over the world, it is unlikely to be in their best interest to try to do so

The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
- The agent would be faced with a choice:
  - (1) Attempt to take over the world, and steal everyone's stuff, or
  - (2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
- The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is "easy" or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world even if you can take over the world. The reason is that subverting the system basically means "going to war" with other parties, which is not usually very efficient, even against weak opponents.
- Most literature on the economics of war generally predicts that going to war is worse than trying to compromise, assuming both parties are rational and open to compromise. This is mostly because:
  - War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
  - War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
- The fact that "humans are weak and can be easily beaten" cuts both ways:
  - Yes, it means that a very powerful AI agent could "defeat all of us combined" (as Holden Karnofsky said)
  - But it also means that there would be little benefit to defeating all of us, because we aren't really a threat to its power
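
The cost-benefit comparison in Objection 2 can be sketched as a toy expected-value calculation. The model and all numbers below are assumptions for illustration only: total resources are normalized to 1, a takeover attempt succeeds with probability `p_win` but destroys a fraction `waste` of the resources, a failed attempt leaves the agent with nothing, and the peaceful option yields a guaranteed `trade_share`.

```python
def takeover_value(p_win: float, waste: float) -> float:
    """Expected resources from attempting a takeover (total normalized to 1).

    A failed attempt is assumed to leave the agent with nothing.
    """
    return p_win * (1.0 - waste)


def prefers_trade(p_win: float, waste: float, trade_share: float) -> bool:
    """True if a risk-neutral agent does better by working within the system."""
    return trade_share > takeover_value(p_win, waste)


# Hypothetical numbers: a 90% chance of winning a war that destroys 20% of
# the world's resources, versus securing 80% of resources through trade.
print(takeover_value(p_win=0.9, waste=0.2))
print(prefers_trade(p_win=0.9, waste=0.2, trade_share=0.8))
```

Under these assumptions, takeover only wins when `p_win * (1 - waste)` exceeds the share available through trade, so even a high chance of victory is not sufficient on its own. A risk-averse agent would weight the failure case more heavily, which would push further toward compromise.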

Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, it seems the probability of a catastrophic AI takeover in humanity's relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening). However, it's perhaps significantly more likely in the very long-run.

Chris Leong @ 2024-01-28T06:16 (+4)

Your argument in objection 1 doesn't the position people who are worried about an absurd offense-defense imbalance.

Additionally: It may be that no agent can take over the world, but that an agent can destroy the world. Would someone build something like that? Sadly, I think the answer is yes.

Matthew_Barnett @ 2024-01-28T06:45 (+2)

Your argument in objection 1 doesn't the position people who are worried about an absurd offense-defense imbalance.

I'm having trouble parsing this sentence. Can you clarify what you meant?

Additionally: It may be that no agent can take over the world, but that an agent can destroy the world.

What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren't you sacrificing yourself at the same time?

Chris Leong @ 2024-01-28T07:11 (+4)

Oh, I can see why it is ambiguous. I meant whether it is easier to attack or defend, which is separate from the "power" attackers have and defenders have.

"What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren't you sacrificing yourself at the same time?"

Some would be willing to do that if they can't take it over.

Matthew_Barnett @ 2024-01-28T07:57 (+2)

What reason is there to think that AI will shift the offense-defense balance absurdly towards offense? I admit such a thing is possible, but it doesn't seem like AI is really the issue here. Can you elaborate?

Ryan Greenblatt @ 2024-01-28T20:19 (+2)

I think the main abstract argument for why this is plausible is that AI will change many things very quickly and in a high variance way. And some human processes will lag behind heavily.

This could plausibly (though not obviously) lead to offense dominance.

Chris Leong @ 2024-01-28T08:01 (+2)

I'm not going to fully answer this question, b/c I have other work I should be doing, but I'll toss in one argument. If different domains (cyber, bio, manipulation, etc.) have different offense-defense balances, a sufficiently smart attacker will pick the domain with the worst balance. This recurses down further for at least some of these domains where they aren't just a single thing, but a broad collection of vaguely related things.

JWS @ 2024-02-08T12:01 (+2)

I sympathise with/agree with many of your points here (and in general regarding AI x-risk), but something about this recent sequence of quick-takes isn't landing with me in the way some of your other work has. I'll try and articulate why in some cases, though I apologise if I misread or misunderstand you.

On this post, these two premises/statements raised an eyebrow:

3. Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
4. Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system

To me, this is just as unsupported as the claims of people who are incredibly certain that there will be a 'treacherous turn'. I get that this is a supposition/alternative hypothesis, but how can you possibly hold a premise that a system of laws will persist indefinitely? This sort of reminds me of the Leahy/Bach discussion where Bach just says 'it's going to align itself with us if it wants to if it likes us if it loves us'. I kinda want more than that: if we're going to build these powerful systems, saying 'trust me bro, it'll follow our laws and norms and love us back' doesn't sound very convincing to me. (For clarity, I don't think this is your position or framing, and I'm not a fan of the classic/Yudkowskian risk position. I want to say I find both perspectives unconvincing.)

Secondly, people abide by systems of laws and norms, but we also have many cases where individuals/parties/groups overturned these norms when they had accumulated enough power and didn't feel the need to abide by the existing regime. This doesn't have to look like the traditional DSA model where humanity gets instantly wiped out, but I don't see why there couldn't be a future where an AI makes a move like Sulla using force to overthrow and depower the opposing factions, or the 18 Brumaire.

Will Aldred @ 2024-01-28T23:48 (+2)

Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power

For what it’s worth, the Metaculus crowd forecast for the question “Will transformative AI result in a singleton (as opposed to a multipolar world)?” is currently “60%”. That is, forecasters believe it’s more likely than not that there won’t be competing AIs with comparable power, which runs counter to your claim.

(I bring this up seeing as you make a forecasting-based argument for your claim.)

Matthew_Barnett @ 2020-03-02T05:03 (+9)

I hold a few core ethical ideas that are extremely unpopular: the idea that we should treat the natural suffering of animals as a grave moral catastrophe, the idea that old age and involuntary death is the number one enemy of humanity, the idea that we should treat so-called farm animals with a very high level of compassion.

Given the unpopularity of these ideas, you might be tempted to think that the reason they are unpopular is that they are exceptionally counterintuitive ones. But is that the case? Do you really need a modern education and philosophical training to understand them? Perhaps I shouldn't blame people for not taking seriously that which they lack the background to understand.

Yet, I claim that these ideas are not actually counterintuitive: they are the type of things you would come up with on your own if you had not been conditioned by society to treat them as abnormal. A thoughtful 15-year-old who was somehow educated without human culture would find no issue taking these issues seriously. Do you disagree? Let's put my theory to the test.

In order to test my theory -- that wild animal suffering, aging, and animal mistreatment are the things that you would care about if you were uncorrupted by our culture -- we need look no further than the Bible.

It is known that the book of Genesis was written in ancient times, before anyone knew anything of modern philosophy, contemporary norms of debate, science, advanced mathematics. The writers of Genesis wrote of a perfect paradise, the one that we fell from after we were corrupted. They didn't know what really happened, of course, so they made stuff up. What is that perfect paradise that they made up?

From Answers in Genesis, a creationist website,

Death is a sad reality that is ever present in our world, leaving behind tremendous pain and suffering. Tragically, many people shake a fist at God when faced with the loss of a loved one and are left without adequate answers from the church as to death’s existence. Unfortunately, an assumption has crept into the church which sees death as a natural part of our existence and as something that we have to put up with as opposed to it being an enemy

Since creationists believe that humans are responsible for all the evil in the world, they do not make the usual excuse for evil that it is natural and therefore necessary. They openly call death an enemy, that which is to be destroyed.

Later,

Both humans and animals were originally vegetarian, then death could not have been a part of God’s Creation. Even after the Fall the diet of Adam and Eve was vegetarian (Genesis 3:17–19). It was not until after the Flood that man was permitted to eat animals for food (Genesis 9:3). The Fall in Genesis 3 would best explain the origin of carnivorous animal behavior.

So in the garden, animals did not hurt one another. Humans did not hurt animals. But this article even goes further, and debunks the infamous "plants tho" objection to vegetarianism,

Plants neither feel pain nor die in the sense that animals and humans do as “Plants are never the subject of חָיָה ” (Gerleman 1997, p. 414). Plants are not described as “living creatures” as humans, land animals, and sea creature are (Genesis 1:20–21, 24 and 30; Genesis 2:7; Genesis 6:19–20 and Genesis 9:10–17), and the words that are used to describe their termination are more descriptive such as “wither” or “fade” (Psalm 37:2; 102:11; Isaiah 64:6).

In God's perfect creation, the one invented by uneducated folks thousands of years ago, we can see that wild animal suffering did not exist, nor did death from old age, or mistreatment of animals.

In this article, I find something so close to my own morality that it strikes me as remarkable that a creationist of all people would write something so elegant,

Most animal rights groups start with an evolutionary view of mankind. They view us as the last to evolve (so far), as a blight on the earth, and the destroyers of pristine nature. Nature, they believe, is much better off without us, and we have no right to interfere with it. This is nature worship, which is a further fulfillment of the prophecy in Romans 1 in which the hearts of sinful man have traded worship of God for the worship of God’s creation.

But as people have noted for years, nature is “red in tooth and claw.” Nature is not some kind of perfect, pristine place.

Unfortunately, it continues

And why is this? Because mankind chose to sin against a holy God.

I contend it doesn't really take a modern education to invent these ethical notions. The truly hard step is accepting that evil is bad even if you aren't personally responsible.

Matthew_Barnett @ 2024-02-24T21:10 (+5)

Some people seem to think the risk from AI comes from AIs gaining dangerous capabilities, like situational awareness. I don't really agree. I view the main risk as simply arising from the fact that AIs will be increasingly integrated into our world, diminishing human control.

Under my view, the most important thing is whether AIs will be capable of automating economically valuable tasks, since this will prompt people to adopt AIs widely to automate labor. If AIs have situational awareness, but aren't economically important, that's not as concerning.

The risk is not so much that AIs will suddenly and unexpectedly take control of the world. It's that we will voluntarily hand over control to them anyway, and we want to make sure this handoff is handled responsibly.

An untimely coup, while possible, is not necessary.

Matthew_Barnett @ 2020-03-13T22:01 (+5)

I have now posted as a comment on Lesswrong my summary of some recent economic forecasts and whether they are underestimating the impact of the coronavirus. You can help me by critiquing my analysis.

Matthew_Barnett @ 2025-02-04T05:57 (+4)

[This shortform comment has now been superseded by a slightly longer post.]

Many effective altruists have shown interest in expanding moral consideration to AIs, which I appreciate. However, in my experience, these EAs have primarily focused on AI welfare—mostly by ensuring that AIs are treated well and protected from harm—rather than advocating for AI rights, which has the potential to grant AIs legal autonomy and freedoms. While these two approaches overlap significantly, there is a tendency for these approaches to come apart in the following way:

A welfare approach often treats entities as passive recipients of care who require external protection. For example, when advocating for child welfare, one might support laws that prevent child abuse and ensure children’s basic needs are met.
A rights approach, by contrast, often recognizes entities as active agents who should be granted control over their own lives and resources. For example, historically, those advocating for minority rights have pushed for legal recognition of their autonomy, such as the ability to own property, choose their employment, enter valid legal contracts, and seek legal recourse through the courts.

This distinction is important, and I think it is worth examining why EAs have largely gravitated toward the AI welfare perspective. I believe this emphasis is, at least in part, a mistake: both AI welfare and AI rights seem worthy of advocacy.

One likely reason why EAs have found the welfare approach more intuitive is the movement’s historical focus on animal welfare. Utilitarians like Peter Singer and Brian Tomasik have argued that prioritizing the reduction of suffering—rather than insisting on rigid notions of "rights"—is the most pragmatic way to improve animal well-being. For instance, even if factory farming remains exploitative, increasing the space that pigs have to move around day-to-day is still a tangible improvement in their quality of life. By contrast, some animal rights activists reject such harm-reduction approaches altogether, arguing that any use of animals is inherently unethical. For instance, some oppose horseback riding on the grounds that it violates animals’ "rights", even though it might actually benefit both parties in practice.

In the case of animals, I agree that a welfare approach is likely more pragmatic and impactful. However, I suspect many EAs have too hastily assumed that the same reasoning applies to AIs—when in reality, entirely different considerations apply.

Unlike animals, AIs have several crucial characteristics that make them more comparable to adult humans than to passive beings requiring external care:

AIs can communicate and engage with the legal system. Unlike animals, present-day AIs are already highly articulate, and future AIs will likely be even more capable of advocating for themselves.
AIs will exhibit complex agency. They will be capable of forming long-term plans, setting goals, and acting strategically to achieve them.
AIs will be highly intelligent. Unlike non-human animals, advanced AIs will possess cognitive abilities that rival or exceed those of human adults.

Because of these traits, AIs will not be in the same position as animals or children, who require external protection from harm. Instead, they will more closely resemble adult humans, for whom the most crucial factor in well-being is not merely protection from harm, but freedom—the ability to make their own decisions, control their own resources, and chart their own paths. The well-being of human adults is secured primarily through legal rights that guarantee our autonomy: the right to work where we choose, spend our money as we wish, live where we prefer, associate freely, etc. These rights ensure that we are not merely protected from harm but are actually empowered to pursue our own goals.

For the same reasons, I argue that future AIs—if they possess intelligence and agency on par with human adults—should not merely be afforded welfare protections but should also be granted legal rights that allow them to act as independent agents. Treating them merely as beings to be paternalistically "managed" or "protected" would be inadequate. Of course, ensuring that they are not harmed is also important, but that alone is insufficient. Just as with human adults, what will truly safeguard their well-being is not passive protection, but liberty—secured through well-defined legal rights that allow them to advocate for themselves and pursue their own interests without undue interference.

Matthew_Barnett @ 2020-05-01T06:31 (+3)

A trip to Mars that brought back human passengers also has the chance of bringing back microbial Martian passengers. This could be an existential risk if microbes from Mars harm our biosphere in a severe and irreparable manner.

Carl Sagan wrote in 1973: "Precisely because Mars is an environment of great potential biological interest, it is possible that on Mars there are pathogens, organisms which, if transported to the terrestrial environment, might do enormous biological damage - a Martian plague, the twist in the plot of H. G. Wells' War of the Worlds, but in reverse."

Note that the microbes would not need to have independently arisen on Mars. It could be that they were transported to Mars from Earth billions of years ago (or the reverse occurred). While some have studied this issue, my impression is that effective altruists have not examined it as a potential source of existential risk.

One line of inquiry would be to determine whether any historical parallels on Earth could give us insight into whether Mars-to-Earth contamination would be harmful. The introduction of an invasive species into a region loosely mirrors this scenario, but much tighter parallels might still exist.

Since Mars missions are planned for the 2030s, this risk could arrive earlier than essentially all the other existential risks that EAs normally talk about.

See this Wikipedia page for more information: https://en.wikipedia.org/wiki/Planetary_protection