Why does (any particular) AI safety work reduce s-risks more than it increases them?

By MichaelStJules @ 2021-10-03T16:55 (+48)

My understanding is that most AI safety work that plausibly reduces some s-risks may reduce extinction risks as well. I'm also thinking that some futures where we go extinct because of AI (especially with a single AI taking over) wouldn't involve astronomical suffering, if the AI has no (or sufficiently little) interest in consciousness or suffering, whether

  1. terminally,
  2. because consciousness or suffering is useful to some goal (e.g. it might simulate suffering incidentally or for the value of information), or
  3. because there are other agents it has to interact with, or whose preferences it should follow, who care about suffering (those agents could all be gone, ruling out s-risks from conflicts).

I am interested in how people are weighing (or defeating) these considerations against the s-risk reduction they expect from (particular) AI safety work.

 

EDIT: Summarizing:

  1. AI safety work (including s-risk-focused work) also reduces extinction risk.
  2. Reducing extinction risk increases some s-risks, especially non-AGI-caused s-risks, but also possibly AGI-caused s-risks.

So AI safety work may increase s-risks, depending on tradeoffs.


kokotajlod @ 2021-10-04T21:15 (+17)

Related: https://arbital.com/p/hyperexistential_separation/

FWIW I basically agree with Arbital here.

MichaelStJules @ 2021-10-05T04:15 (+5)

So this makes AI research on (cooperative) inverse reinforcement learning, and value learning more generally, risky, by making it more likely that our worst-case outcomes (worse than extinction) will be represented and optimized for, right?

kokotajlod @ 2021-10-05T12:51 (+7)

Yep! Though of course the situation is complicated and there are many factors etc. etc.

ofer @ 2021-10-04T17:25 (+6)

I think that's an important question. Here are some thoughts (though I think this topic deserves a much more rigorous treatment):

Creating an AGI with an arbitrary goal system (that is potentially much less satiable than humans') and arbitrary game theoretical mechanisms—via an ML process that can involve an arbitrary amount of ~suffering/disutility—generally seems very dangerous. Some of the relevant considerations are weird and non-obvious. For example, creating such an arbitrary AGI may constitute wronging some set of agents across the multiverse (due to the goal system & game theoretical mechanisms of that AGI).

I think there's also the general argument that, due to cluelessness, trying to achieve some form of a vigilant Long Reflection process is the best option on the table, including by the lights of suffering-focused ethics (e.g. due to weird ways in which resources could be used to reduce suffering across the multiverse via acausal trading). Interventions that mitigate x-risks (including AI-related x-risks) seem to increase the probability that humanity will achieve such a Long Reflection process.

Finally, a meta point that seems important: People in EA who have spent a lot of time on AI safety (including myself), or even made it their career, probably have a motivated reasoning bias towards the belief that working on AI safety tends to be net-positive.

steve2152 @ 2021-10-03T20:18 (+6)

[note that I have a COI here]

Hmm, I guess I've been thinking that the choice is between (A) "the AI is trying to do what a human wants it to try to do" vs (B) "the AI is trying to do something kinda weirdly and vaguely related to what a human wants it to try to do". I don't think (C) "the AI is trying to do something totally random" is really on the table as a likely option, even if the AGI safety/alignment community didn't exist at all.

That's because everybody wants the AI to do the thing they want it to do, not just long-term AGI risk people. And I think there are really obvious things that anyone would immediately think to try, and these really obvious techniques would be good enough to get us from (C) to (B) but not good enough to get us to (A).

[Warning: This claim is somewhat specific to a particular type of AGI architecture that I work on and consider most likely—see e.g. here. Other people have different types of AGIs in mind and would disagree. In particular, in the "deceptive mesa-optimizer" failure mode (which relates to a different AGI architecture than mine) we would plausibly expect failures where the AGI has random goals like "I want my field-of-view to be all white", even after reasonable effort to avoid that. So maybe people working in other areas would have different answers, I dunno.]

I agree that it's at least superficially plausible that (C) might be better than (B) from an s-risk perspective. But if (C) is off the table and the choice is between (A) and (B), I think (A) is preferable for both s-risks and x-risks.

MichaelStJules @ 2021-10-03T21:22 (+8)

I would think there's kind of a continuum between each of the three options, and AI safety work shifts the distribution, making things closer to (C) less likely and things closer to (A) more likely. More or fewer of our values could be represented, which could be good or bad, and which also bears on the risk of extinction. It's not actually clear to me that moving in this direction is preferable from an s-risk perspective, since there could be more interest in creating more sentience overall and greater risks from conflict with others.

steve2152 @ 2021-10-04T00:31 (+7)

Sorry I'm not quite sure what you mean. If we put things on a number line with (A)=1, (B)=2, (C)=3, are you disagreeing with my claim "there is very little probability weight in the interval (2.5, 3]", or with my claim "in the interval [1, 2.5], moving down towards 1 probably reduces s-risk", or with both, or something else?

MichaelStJules @ 2021-10-04T01:56 (+6)

I'm disagreeing with both (or at least am not convinced by either; I'm not confident either way).

"there is very little probability weight in the interval "

I think your description of (B) might apply to anything strictly between (A) and (C), so it would be kind of arbitrary to pick out any particular point, and the argument should apply along the whole continuum or else needs more to distinguish these two intervals. If s-risks were increased by AI safety work near (C), why wouldn't they also be increased near (A), for the same reasons? Do you have some more specific concrete hurdle(s) in AI alignment/safety in mind?

"in the interval , moving down towards 1 probably reduces s-risk"

I think it could still be the case along this interval that more AI safety work makes the AI more interested in sentience and increases the likelihood of an astronomical number of additional sentient beings being created (by the more aligned AI, or by others interacting with it), and so may increase s-risks. And in particular, if humans are out of the loop due to extinction (which more AI safety work might have prevented), that removes a large source of interest in sentience, an interest that might otherwise have backfired for s-risks.

steve2152 @ 2021-10-04T17:42 (+9)

Thanks!

(Incidentally, I don't claim to have an absolutely watertight argument here that AI alignment research couldn't possibly be bad for s-risks, just that I think the net expected impact on s-risks is to reduce them.)

If s-risks were increased by AI safety work near (C), why wouldn't they also be increased near (A), for the same reasons?

I think suffering minds are a pretty specific thing, in the space of "all possible configurations of matter". So optimizing for something random (paperclips, or "I want my field-of-view to be all white", etc.) would almost definitely lead to zero suffering (and zero pleasure). (Unless the AGI itself has suffering or pleasure.) However, there's a sense in which suffering minds are "close" to the kinds of things that humans might want an AGI to want to do. Like, you can imagine how if a cosmic ray flips a bit, "minimize suffering" could turn into "maximize suffering". Or at any rate, humans will try (and I expect will succeed even without philanthropic effort) to make AGIs with a prominent human-like notion of "suffering", so that it's on the table as a possible AGI goal.

In other words, imagine you're throwing a dart at a dartboard.

  • The bullseye has very positive point value.
    • That's representing the fact that basically no human wants astronomical suffering, and basically everyone wants peace and prosperity etc.
  • On other parts of the dartboard, there are some areas with very negative point value.
    • That's representing the fact that if programmers make an AGI that desires something vaguely resembling what they want it to desire, that could be an s-risk.
  • If you miss the dartboard entirely, you get zero points.
    • That's representing the fact that a paperclip-maximizing AI would presumably not care to have any consciousness in the universe (except possibly its own, if applicable).

So I read your original post as saying "If the default is for us to miss the dartboard entirely, it could be s-risk-counterproductive to improve our aim enough that we can hit the dartboard", and my response to that was "I don't think that's relevant; I think it will be really easy to not miss the dartboard entirely, and this will happen 'by default'. And in that case, better aim would be good, because it brings us closer to the bullseye."

MichaelStJules @ 2021-10-05T22:59 (+7)

I think this is a reasonable reading of my original post, but I'm actually not convinced trying to get closer to the bullseye reduces s-risks on net even if we're guaranteed to hit the dartboard, for reasons given in my other comments here and in the page on hyperexistential risks, which kokotajlod shared.

steve2152 @ 2021-10-06T17:57 (+6)

Hmm, just a guess, but …

  • Maybe you're conceiving of the field as "AI alignment", pursuing the goal "figure out how to bring an AI's goals as close as possible to a human's (or humanity's) goals, in their full richness" (call it "ambitious value alignment")
  • Whereas I'm conceiving of the field as "AGI safety", with the goal "reduce the risk of catastrophic accidents involving AGIs".

"AGI safety research" (as I think of it) includes not just how you would do ambitious value alignment, but also whether you should do ambitious value alignment. In fact, AGI safety research may eventually result in a strong recommendation against doing ambitious value alignment, because we find that it's dangerously prone to backfiring, and/or that some alternative approach is clearly superior (e.g. CAIS, or microscope AI, or act-based corrigibility or myopia or who knows what). We just don't know yet. We have to do the research.

"AGI safety research" (as I think of it) also includes lots of other activities like analysis and mitigation of possible failure modes (e.g. asking what would happen if a cosmic ray flips a bit in the computer), and developing pre-deployment testing protocols, etc. etc.

Does that help? Sorry if I'm missing the mark here.

MichaelStJules @ 2021-10-06T20:42 (+5)

I agree that this is an important distinction, but it seems hard to separate them in practice. In practice, we can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects. Some AI safety researchers are doing technical research on value learning/alignment, like (cooperative) inverse reinforcement learning, and doing this research may contribute to further research on the topic down the line and eventual risky ambitious value alignment, whether or not "we" end up concluding that it's too risky.

Furthermore, when it matters most, I think it's unlikely there will be a strong and justified consensus in favour of this kind of research (given wide differences in beliefs about the likelihood of worst cases and/or differences in ethical views), and I think there's at least a good chance there won't be any strong and justified consensus at all. To me, the appropriate epistemic state with regard to value learning research (or at least its publication) is one of complex cluelessness, and it's possible this cluelessness could end up infecting AGI safety as a cause in general, depending on how large the downside risks could be (which explicit modelling with sensitivity analysis could help us check).

Also, it's not just AI alignment research that I'm worried about, since I see potential for tradeoffs more generally between failure modes. Preventing unipolar takeover or extinction may lead to worse outcomes (s-risks/hyperexistential risks), but maybe (this is something to check) those worse outcomes are easier to prevent with different kinds of targeted work, and we're sufficiently invested in those. I guess the question would be: looking at the portfolio of things the AI safety community is working on, are we increasing any risks (in a way that isn't definitely made up for by reductions in other risks)? Each time we make a potential tradeoff with something in that portfolio, would (almost) every reasonable and informed person think it's a good tradeoff? Or, if it's ambiguous, is the downside made up for with something else?

steve2152 @ 2021-10-07T02:55 (+9)

In practice, we can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects.

This strikes me as too pessimistic. Suppose I bring a complicated new board game to a party. Two equally-skilled opposing teams each get a copy of the rulebook to study for an hour before the game starts. Team A spends the whole hour poring over the rulebook and doing scenario planning exercises. Team B immediately throws the rulebook in the trash and spends the hour watching TV.

Neither team has "strong evidence/feedback"—they haven't started playing yet. Team A could think they have good strategy ideas while actually engaging in arbitrary subjective judgments and motivated reasoning. Their strategy ideas, which seemed good on paper, could even turn out to be counterproductive!

Still, I would put my money on Team A beating Team B. Because Team A is trying. Their planning abilities don't have to be all that good to be strictly better (in expectation) than "not doing any planning whatsoever, we'll just wing it". That's a low bar to overcome!

So by the same token, it seems to me that vast swathes of AGI safety research easily surpasses the (low) bar of doing better in expectation than the alternative of "Let's just not think about it in advance, we'll wing it".

For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is clearly positive, right?

You also bring up the idea that (IIUC) there may be objectively good safety ideas but they might not actually get implemented because there won't be a "strong and justified consensus" to do them. But again, the alternative is "nobody comes up with those objectively good safety ideas in the first place". That's even worse, right? (FWIW I consider "come up with crisp and rigorous and legible arguments for true facts about AGI safety" to be a major goal of AGI safety research.)

Anyway, I'm objecting to undirected general feelings of "gahhhh we'll never know if we're helping at all", etc. I think there's just a lot of stuff in the AGI safety research field which is unambiguously good in expectation, where we don't have to feel that way. What I don't object to—and indeed what I strongly endorse—is taking a more directed approach and saying "For AGI safety research project #732, what are the downside risks of this research, and how do they compare to the upsides?"

So that brings us to "ambitious value alignment". I agree that an ambitiously-aligned AGI comes with a couple potential sources of s-risk that other types of AGI wouldn't have, specifically via (1) sign flip errors, and (2) threats from other AGIs. (Although I think (1) is less obviously a problem than it sounds, at least in the architectures I think about.) On the other hand, (A) I'm not sure anyone is really working on ambitious alignment these days … at least Rohin Shah & Paul Christiano have stated that narrow (task-limited) alignment is a better thing to shoot for (and last anyone heard MIRI was shooting for task-limited AGIs too) (UPDATE: actually this was an overstatement, see e.g. 1,2,3); (B) my sense is that current value-learning work (e.g. at CHAI) is more about gaining conceptual understanding than creating practical algorithms/approaches that will scale to AGI. That said, I'm far from an expert on the current value learning literature; frankly I'm often confused by what such researchers are imagining for their longer-term game-plan.

BTW I put a note on my top comment that I have a COI. If you didn't notice. :)

MichaelStJules @ 2021-10-07T08:32 (+5)

For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is clearly positive, right?

 

If you aren't publishing anything, then sure, research into what to do seems mostly harmless (other than opportunity costs) in expectation, but it doesn't actually follow that it would necessarily be good in expectation, if you have enough deep uncertainty (or complex cluelessness); I think this example illustrates this well, and is basically the kind of thing I'm worried about all of the time now. In the particular case of sign flip errors, I do think it was useful for me to know about this consideration and similar ones, and I act differently than I would have otherwise as a result, but one of the main effects since learning about these kinds of s-risks is that I'm (more) clueless about basically every intervention now, and am looking to portfolios and hedging.

If you are publishing, and your ethical or empirical views are sufficiently different from others working on the problem so that you make very different tradeoffs, then that could be good, bad or ambiguous. For example, if you didn't really care about s-risks, then publishing useful considerations for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.

Maybe you can eliminate this ambiguity or at least constrain its range to something relatively insignificant by building a model, doing a sensitivity analysis, etc., but a lot of things don't work out, and the ambiguity could be so bad that it infects everything else. This is roughly where I am now: I have considerations that result in complex cluelessness about AI-related interventions and I want to know how people work through this.

For another source of pessimism, Luke Muehlhauser from Open Phil wrote:

Re: cost-effectiveness analyses always turning up positive, perhaps especially in longtermism. FWIW that hasn't been my experience. Instead, my experience is that every time I investigate the case for some AI-related intervention being worth funding under longtermism, I conclude that it's nearly as likely to be net-negative as net-positive given our great uncertainty and therefore I end up stuck doing almost entirely "meta" things like creating knowledge and talent pipelines.

Of course, that doesn't mean he never finds good "direct work", or that the "direct work" already being funded isn't better than nothing in expectation overall, and I would guess he thinks it is.

steve2152 @ 2021-10-07T14:41 (+9)

Hmm, it seems to me (and you can correct me) that we should be able to agree that there are SOME technical AGI safety research publications that are positive under some plausible beliefs/values and harmless under all plausible beliefs/values, and then we don't have to talk about cluelessness and tradeoffs, we can just publish them.

And we both agree that there are OTHER technical AGI safety research publications that are positive under some plausible beliefs/values and negative under others. And then we should talk about your portfolios etc. Or more simply, on a case-by-case basis, we can go looking for narrowly-tailored approaches to modifying the publication in order to remove the downside risks while maintaining the upside.

I feel like we're arguing past each other: I keep saying the first category exists, and you keep saying the second category exists. We should just agree that both categories exist! :-)

Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)

(Luke says "AI-related" but my impression is that he mostly works on AGI governance not technical, and the link is definitely about governance not technical. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)

For example, if you didn't really care about s-risks, then publishing useful considerations for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.

This points to another (possible?) disagreement. I think maybe you have the attitude where (to caricature somewhat) if there's any downside risk whatsoever, no matter how minor or far-fetched, you immediately jump to "I'm clueless!". Whereas I'm much more willing to say: OK, I mean, if you do anything at all there's a "downside risk" in a sense, just because life is uncertain, who knows what will happen, but that's not a good reason to just sit on the sidelines and let nature take its course and hope for the best. If I have a project whose first-order effect is a clear and specific and strong upside opportunity, I don't want to throw that project out unless there's a comparably clear and specific and strong downside risk. (And of course we are obligated to try hard to brainstorm what such a risk might be.) Like if a firefighter is trying to put out a fire, and they aim their hose at the burning interior wall, they don't stop and think, "Well I don't know what will happen if the wall gets wet, anything could happen, so I'll just not pour water on the fire, y'know, don't want to mess things up."

The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.

If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.

MichaelStJules @ 2021-10-07T17:07 (+5)

Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)

Ya, I think this is the crux. Also, considerations like the cosmic-ray bit flip tend to force a lot of things into the second category when they otherwise wouldn't have been, although I'm not specifically worried about cosmic-ray bit flips, since they seem sufficiently unlikely and easy to avoid.

(Luke says "AI-related" but my impression is that he mostly works on AGI governance not technical, and the link is definitely about governance not technical. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)

(Fair.)

The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.

This is actually what I think is happening (unlike the firefighter example), but we aren't really talking much about the specifics. There might indeed be specific cases where I'd agree we shouldn't be clueless if we worked through them. But I think there are important potential tradeoffs between incidental and agential s-risks, between s-risks and other existential risks, even between the same kinds of s-risks, etc., and there is so much uncertainty in the expected harm from these risks that it's inappropriate to use a single distribution (without a sensitivity analysis over "reasonable" distributions; and with such a sensitivity analysis, things look ambiguous), similar to this example. We're talking about "sweetening" one side or the other, but that's totally swamped by our uncertainty.
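To make the "single distribution vs. sensitivity analysis" point concrete, here is a minimal sketch; every number in it is hypothetical and purely illustrative (the outcome categories, utilities, and candidate distributions are all made up), so it only shows the shape of the exercise, not an actual estimate.

```python
# A minimal sketch of a sensitivity analysis over "reasonable" distributions.
# All numbers here are hypothetical and purely illustrative.

# Hypothetical utilities for three coarse outcomes (arbitrary units).
UTILITIES = {"extinction": 0.0, "s_risk": -100.0, "flourishing": 10.0}

# A baseline distribution over outcomes without the intervention.
BASELINE = {"extinction": 0.60, "s_risk": 0.05, "flourishing": 0.35}

# Several defensible ("reasonable") distributions conditional on doing the
# intervention, reflecting different empirical views.
VIEWS_WITH_INTERVENTION = [
    {"extinction": 0.50, "s_risk": 0.05, "flourishing": 0.45},
    {"extinction": 0.40, "s_risk": 0.15, "flourishing": 0.45},
    {"extinction": 0.55, "s_risk": 0.02, "flourishing": 0.43},
]


def expected_utility(dist):
    """Expected utility of a distribution over the three outcomes."""
    return sum(p * UTILITIES[outcome] for outcome, p in dist.items())


baseline_eu = expected_utility(BASELINE)
for i, dist in enumerate(VIEWS_WITH_INTERVENTION, start=1):
    delta = expected_utility(dist) - baseline_eu
    print(f"View {i}: change in expected utility = {delta:+.1f}")
# Output: +1.0, -9.0, +3.8. The sign of the change depends on which
# "reasonable" distribution you adopt, and a small sweetening of the
# flourishing outcome would not flip the sign under every view.
```

The point of the sketch is only that when the sign of the expected change flips across defensible distributions, a small sweetening of one side doesn't resolve the ambiguity.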

If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.

What I have in mind is more symmetric in upsides and downsides (or at least, I'm interested in hearing why people think it isn't in practice), and I don't really distinguish between effects by order*. My post points out potential reasons that I actually think could dominate. The standard I'm aiming for is "Could a reasonable person disagree?", and I default to believing a reasonable person could disagree when I point out such tradeoffs until we actually carefully work through them in detail and it turns out it's pretty unreasonable to disagree.

*Although thinking more about it now, I suppose longer causal chains are more fragile and more likely to have unaccounted-for effects going in the opposite direction, so maybe we ought to give them less weight, and maybe this solves the issue if we did it formally? I think ignoring higher-order effects outright is formally irrational under vNM rationality or stochastic dominance, although it's maybe fine in practice if what we're actually doing is an approximation of giving them far less weight under a skeptical prior, so that they end up dominated completely by more direct effects.
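One toy way to make the footnote's "give higher-order effects far less weight" idea formal is a simple per-order discount; the discount factor and effect values below are made-up numbers, just to show the mechanics.

```python
# A toy sketch of down-weighting effects by how indirect they are.
# The discount factor and the effect values are made-up numbers.

DISCOUNT = 0.2  # hypothetical: each extra causal step keeps 20% of the estimate

# (order of the effect, naive expected value in arbitrary units)
EFFECTS = [
    (1, 5.0),    # direct effect, e.g. mitigating a known risk
    (2, -8.0),   # second-order effect: a more speculative backfire story
    (3, 20.0),   # third-order effect: even more speculative
]

total = sum(value * DISCOUNT ** (order - 1) for order, value in EFFECTS)
print(f"Discounted total: {total:+.2f}")
# 5.0 - 1.6 + 0.8 = +4.2: higher-order effects are not ignored outright
# (which would conflict with vNM rationality), they just receive rapidly
# shrinking weight, so the more direct effect ends up dominating.
```

Whether this kind of geometric discounting is actually justified is exactly the open question the footnote raises.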

steve2152 @ 2021-10-07T20:34 (+5)

I don't really distinguish between effects by order*

I agree that direct and indirect effects of an action are fundamentally equally important (in this kind of outcome-focused context) and I hadn't intended to imply otherwise.

Vasco Grilo @ 2022-12-09T16:37 (+2)

Hi Steven,

I really appreciate the dartboard analogy! It helped me understand your view.

MichaelStJules @ 2021-10-03T21:19 (+3)

Related: Which World Gets Saved by trammell.