AI alignment researchers may have a comparative advantage in reducing s-risks

By Lukas_Gloor @ 2023-02-15T13:01 (+79)

I believe AI alignment researchers might be uniquely well-positioned to make a difference to s-risks. In particular, I think this of alignment researchers with a keen interest in “macrostrategy” – by which I mean researchers who habitually engage in big-picture thinking related to the most pressing problems (like AI alignment and strategy), form mental models of how the future might unfold, and think through their work’s paths to impact. (There’s also a researcher profile where a person specializes in a specific problem area so much that they no longer have much interest in interdisciplinary work and issues of strategy – those researchers aren’t the target audience of this post.)

Of course, having the motivation to work on a specific topic is a significant component of having a comparative advantage (or lack thereof). Whether AI alignment researchers find themselves motivated to invest a portion of their time/attention into s-risk reduction will depend on several factors, including:

Further below, I will say a few more things about these bullet points. In short, I believe that, for people with the right set of skills, reducing AI-related s-risks will become sufficiently tractable (if it isn’t already) once we know more about what transformative AI will look like. (The rest depends on individual choices about prioritization.)

Summary

In the rest of the post, I’ll flesh out the above points.

What I mean by “s-risks”

Suffering risks (or “s-risks”) are risks of events that bring about suffering in cosmically significant amounts. By “significant,” I mean significant relative to our expectation over future suffering.

For something to constitute an “s-risk” under this definition, the suffering involved not only has to be astronomical in scope (e.g., “more suffering than has existed on Earth so far”),[5] but also significant compared to other sources of expected future suffering. This last bit ensures that “s-risks,” assuming sufficient tractability, are always a top priority for suffering-focused longtermists. Consequently, it also ensures that s-risks aren’t a rounding error from a “general longtermist portfolio” perspective, where people with community-wide comparative advantages work on the practical priorities of different value systems.

I say “assuming sufficient tractability” to highlight that the definition of s-risks only focuses on the expected scope of suffering (probability times severity/magnitude), but not on whether we can tractably reduce a given risk. To assess the all-things-considered importance of s-risk reduction, we also have to consider tractability.
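
To make that distinction concrete, here is a rough formalization (the notation is just illustrative and not part of the definition): if we enumerate candidate sources of future suffering, each with probability $p_i$ and magnitude/severity $s_i$, then

$$
\mathbb{E}[\text{future suffering}] \approx \sum_i p_i \, s_i .
$$

Under the definition above, a risk $i$ counts as an s-risk only if its own term $p_i \, s_i$ makes up a significant fraction of this sum; how tractably we can reduce $p_i$ or $s_i$ is a separate question, which is why tractability has to be assessed on top of the definition.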

Why reducing s-risks is challenging

Historically, efforts to reduce s-risks were motivated in a peculiar way, which introduced extra difficulties compared to other research areas.

Why s-risks nonetheless seem tractable

Nonetheless, by focusing our efforts on shaping the next generation(s) of influential agents (e.g., our AI successors), we can reduce or even circumvent the demands for detailed forecasts of distant scenarios.

Alignment researchers have a comparative advantage in reducing s-risks

S-risk reduction is separate from alignment work

Normative reasons for dedicating some attention to s-risks

My appeal to AI alignment researchers


  1. This is changing to some degree. For instance, the Center on Long-term Risk has recently started working with language models, among other things. That said, my impression is that there’s still a significant gap between the most experienced researchers in the s-risk community and those in the AI alignment field in how well they understand the AI alignment discourse and the workings of cutting-edge ML models. ↩︎

  2. Not everyone in the suffering-focused longtermist community considers “directly AI-related” s-risks sufficiently tractable to deserve most of our attention. Other interventions to reduce future suffering include focusing on moral circle expansion and reducing political polarization – see the work of the Center on Reducing Suffering. I think there’s “less of a pull from large numbers” (or “less of a Pascalian argument”) for these specific causes, but that doesn’t necessarily make them less important. Arguably, sources of suffering related to our societal values and culture are easier to affect, and they require less of a specific technical background to engage in advocacy work. That said, societal values and culture will only have a lasting effect on the future if humanity stays in control over that future. (See also my comment here on conditions under which moral circle expansion becomes less important from a longtermist perspective.) ↩︎

  3. It’s slightly more complicated. We must also be able to tell which features of an AI system’s decision architecture robustly correlate with increased/decreased future s-risks. I think the examples in the bullet point above might qualify. Still, there’s a lot of remaining uncertainty. ↩︎

  4. The exception here would be if s-risks remained intractable, due to their speculative nature, even to the best-positioned researchers. ↩︎

  5. The text that initially introduced the term “s-risks” used to have a different definition focused on the astronomical stakes of space colonization. Together with my former colleagues at the Center on Long-term Risk, and in the context of better coordinating how we communicate about our respective priorities with longtermists whose priorities aren’t “suffering-focused,” we decided to change the definition of s-risks. We had two main reasons for the change: (1) Calling something an “s-risk” when it doesn’t constitute a plausible practical priority (not even for people who prioritize suffering reduction over other goals) risks generating the impression that s-risks are generally not that important. (2) Calling the future scenario “galaxy-wide utopia where people still suffer headaches now and then” an “s-risk” may come with the connotation (always unintended by us) that this entire future scenario ought to be prevented. Over the years, we received a lot of feedback (e.g., here and here) that this aspect of the older definition was off-putting. ↩︎

  6. By “little to lose,” I mean AIs whose value structure incentivizes them to bargain in particularly reckless ways because the default outcome is as bad as it gets according to their values. E.g., a paperclip maximizer who only cares about paperclips considers both an empty future and “utopia for humans” roughly as bad as it gets because – absent moral trade – these outcomes wouldn’t contain any paperclips. ↩︎

  7. For instance, one could do this by shaping the incentives in AI training environments, perhaps after studying why anti-social phenotypes evolved in humans, what makes them identifiable, etc. ↩︎

  8. The linked post on “near misses” discusses why this increases negative variance. For more thoughts, see the subsection on AI alignment from my post, Cause prioritization for downside-focused value systems. (While some of the discussion in that post will feel a bit dated, I find that it still captures the most significant considerations.) ↩︎

  9. In general, I think sidestepping the dilemma in this way is often the correct solution to “deliberation ladders” where an intervention that seems positive and important at first glance comes with some identifiable but brittle-seeming “backfiring risks.” For instance, let’s say we’re considering the importance of EA movement building vs. the risk of movement dilution, or economic growth vs. the risk of speeding up AI capabilities prematurely. In both cases, I think the best solution is to push ahead (build EA, promote economic growth) while focusing on targeted measures to keep the respective backfiring risks in check. Of course, that approach may not be correct for all deliberation ladder situations. It strikes me as “most likely correct” for cases where the following conditions apply: (1) targeted measures to prevent backfiring risks will be reasonably effective/practical when they matter most, and (2) the value of information from more research into the deliberation ladder (“trying to get to the top/end of it”) is low. ↩︎

  10. See this post on why I prefer this phrasing over “are morally uncertain.” ↩︎

  11. I think evidential cooperation in large worlds (or “multiverse-wide superrationality,” the idea’s initial name) supports a less demanding form of cooperation more robustly than it supports all-out cooperation. After all, since we’ll likely have a great deal of uncertainty about whether we think of the idea in the right ways and accurately assess our comparative advantages and gains from trade, we have little to gain from pushing the envelope and pursuing maximal gains from trade over the 80-20 version. ↩︎


Jim Buhler @ 2023-02-15T15:46 (+14)

Interesting! Thanks for writing this. Seems like a helpful summary of ideas related to s-risks from AI.

Another important normative reason for dedicating some attention to s-risks is that the future (conditional on humanity's survival) is underappreciatedly likely to be negative -- or at least not very positive -- from whatever plausible moral perspective, e.g., classical utilitarianism (see DiGiovanni 2021; Anthis 2022).

While this does not speak in favor of prioritizing s-risks per se, it obviously speaks against prioritizing x-risks, which seem to be their biggest longtermist "competitors" at the moment.

(I have two unrelated remarks I'll make in separate comments.)

Lukas_Gloor @ 2023-02-15T18:08 (+13)

"[U]nderappreciatedly likely to be negative [...] from whatever plausible moral perspective" could mean many things. I maybe agree with the spirit behind this claim, but I want to flag that, personally, I think it's <10% likely that, if the wisest minds of the EA community researched and discussed this question for a full year, they'd conclude that the future is net negative in expectation for symmetric or nearly-symmetric classical utilitarianism. At the same time, I expect the median future to not be great (partly because I already think the current world is rather bad) and I think symmetric classical utilitarianism is on the furthest end of the spectrum of what seems defensible.

Jim Buhler @ 2023-02-15T16:21 (+3)

I really like the section S-risk reduction is separate from alignment work! I've been surprised by the extent to which people dismiss s-risks on the pretext that "alignment work will solve them anyway" (which, as you pointed out, is both insufficient and untrue).

I guess some of the technical work to reduce s-risks (e.g., preventing the "accidental" emergence of conflict-seeking preferences) can be considered a very specific kind of AI intent alignment (that only a few cooperative AI people are working on afaik) where we want to avoid worst-case scenarios. 

But otherwise, do you think it's fair to say that most s-risk work is focused on AI capability issues (as opposed to intent alignment in Paul Christiano's (2019) typology)? Even if the AI is friendly to humans' values, it doesn't mean it'd be capable of avoiding the failure modes / near-misses you refer to.
I usually frame things this way in discussions to make things clearer, but I'm wondering if that's the best framing...

Jim Buhler @ 2023-02-15T16:00 (+3)

(emphasis is mine)

For something to constitute an “s-risk” under this definition, the suffering involved not only has to be astronomical in scope (e.g., “more suffering than has existed on Earth so far”),[5] but also significant compared to other sources of expected future suffering. This last bit ensures that “s-risks,” assuming sufficient tractability, are always a top priority for suffering-focused longtermists. 

Nitpick but you also need to assume sufficient likelihood, right? One might very well be a suffering-focused longtermist and think s-risks are tractable but so exceedingly unlikely that they're not a top priority relative to, e.g., near-term animal suffering (especially if they prefer risk aversion over classic expected value reasoning).

(I'm not arguing that someone who thinks that is right. I actually think they're probably very wrong. Just wanted to make sure we agree the definition of s-risks doesn't include any claim about their likelihood.) :)

Lukas_Gloor @ 2023-02-15T18:14 (+2)

Good point, thanks!

Just wanted to make sure we agree the definition of s-risks doesn't include any claim about their likelihood

That's probably the best way to think of it, yeah.

I think the definition isn't rigorous enough to withstand lots of scrutiny. Still, in my view, it serves as a useful "pointer."

You could argue that the definition implicitly tracks probabilities because in order to assess whether some source of expected suffering constitutes an s-risk, we have to check how it matches up against all other sources of expected suffering in terms of "probability times magnitude and severity." But this just moves the issue to "what's our current expectation over future suffering?" It's definitely reasonable for people to have widely different views on this, so it makes sense to have that discussion independent of the specific assumptions behind that s-risk definition.