AI safety tax dynamics

By Owen Cotton-Barratt @ 2024-10-23T12:21 (+21)

This is a linkpost to https://strangecities.substack.com/p/ai-safety-tax-dynamics

Two important themes in many discussions of the future of AI are:

  1. AI will automate research, and thus accelerate technological progress
  2. There are serious risks from misaligned AI systems (that justify serious investments in safety)

How do these two themes interact? Especially: how should we expect the safety tax requirements to play out as progress accelerates and we see an intelligence explosion?

In this post I’ll give my core views on this:

I developed these ideas in tandem with my exploration of the concept of safety tax landscapes, which I wrote about in a recent post. However, for people who are just interested in the implications for AI, I think this post will largely stand alone.

How AI differs from other dangerous technologies

In the post on safety tax functions, my analysis was about a potentially-dangerous technology in the abstract (nothing specific about AI). We saw that:

For most technologies, these abilities — the ability to invest in different aspects of the tech, and the ability to coordinate — are relatively independent of the technology; better solar power doesn’t do much to help us do more research, or sign better treaties. Not so for AI! To a striking degree, AI safety is a dynamic problem — earlier capabilities might change the basic nature of the problem we are later facing. 

In particular:

These are, I believe, central cases of the potential value of differential technological development (or d/acc) in AI. I think this is an important topic, and it’s one I expect to return to in future articles.

Where is the safety tax peak for AI?

Why bother with the conceptual machinery of safety tax functions? A lot of the reason I spent time thinking about it was to get a handle on this question — which parts of the AI development curve should we be most concerned about?

I think this is a crucial question for thinking about AI safety, and I wish it had more discussion. Compared to talking about the total magnitude of the risks, I think this question is more action-guiding, and also more neglected.

In terms of my own takes, it seems to me that:

On net, my picture looks very approximately like this:

(I think this graph will probably make rough intuitive sense by itself, but if you want more details about what the axes and contours are supposed to mean, see the post on safety tax functions.) 
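To give one concrete reading of the axes and contours, here is a minimal sketch using my own illustrative notation (the symbols c, s, R, and r are mine, not notation from the original post): treat risk as a function of capability level and safety investment, so the contours are iso-risk curves, and the required safety tax at a given capability level is the smallest investment that keeps risk below some tolerated level.

```latex
% Illustrative sketch only -- my notation (c, s, R, r), not the post's.
% c: capability level;  s: share of resources spent on safety (the "safety tax");
% R(c, s): chance of catastrophe at that point;  r: tolerated risk level.
\[
  \text{iso-risk contour at level } r:\quad \{(c, s) : R(c, s) = r\}
\]
\[
  \text{required safety tax:}\quad s^{*}(c, r) \;=\; \min\{\, s : R(c, s) \le r \,\}
\]
% The post's qualitative claim is that s^*(c, r) peaks around mild-to-moderate
% superintelligence, and is lower both for early AGI and for mature strong
% superintelligence.
```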

I’m not super confident in these takes, but it seems better to be wrong than vague — if it’s good to have more conversations about this, I’d rather offer something to kick things off than not. If you think this picture is wrong — and especially if you think the peak risk lies somewhere else — I’d love to hear about that.

And if this picture is right — then what? I suppose I would like to see more work targeting this period.[2] This shouldn’t mean stopping safety work for early AGI — that’s the first period with appreciable risk, and it can’t be addressed later. But it should mean increasing political work that lays the groundwork for coordinating to pay high safety taxes in the later period. And it should mean working to differentially accelerate those beneficial applications of AI that may help us to navigate the period well.

Acknowledgements: Thanks to Tom Davidson, Rose Hadshar, and Raymond Douglas for helpful comments.

  1. ^

    Of course “around as smart as humans” is a vague term; I’ll make it slightly less vague by specifying “at research and strategic planning”, which I think are the two most strategically important applications of AI.

  2. ^

    This era may roughly coincide with the last era of human mistakes — since AI abilities are likely to be somewhat spiky compared to humans, we’ll probably have superintelligence in many important respects before human competence is completely obsolete. So the interventions for helping I discussed in that post may be relevant here. However, I painted a somewhat particular picture in that post, which I expect to be wrong in some specifics; whereas here I’m trying to offer a more general analysis.


Will Aldred @ 2024-10-24T01:57 (+8)

By the time systems approach strong superintelligence, they are likely to have philosophical competence in some sense.

It’s interesting to me that you think this; I’d be very keen to hear your reasoning (or for you to point me to any existing writings that fit your view).

For what it’s worth, I’m at maybe 30 or 40% that superintelligence will be philosophically competent by default (i.e., without its developers trying hard to differentially imbue it with this competence), conditional on successful intent alignment, where I’m roughly defining “philosophically competent” as “wouldn’t cause existential catastrophe through philosophical incompetence.” I believe this mostly because I find @Wei Dai’s writings compelling, and partly because of some thinking I’ve done myself on the matter. OpenAI’s o1 announcement post, for example, indicates that o1—the current #1 LLM, by most measures—performs far better in domains that have clear right/wrong answers (e.g., calculus and chemistry) than in domains where this is not the case (e.g., free-response writing[1]).[2] Philosophy, being interminable debate, is perhaps the ultimate “no clear right/wrong answers” domain (to non-realists, at least): for this reason, plus a few others (which are largely covered in Dai’s writings), I’m struggling to see why AIs wouldn’t be differentially bad at philosophy in the lead-up to superintelligence.

Also, for what it’s worth, the current community prediction on the Metaculus question “Five years after AGI, will AI philosophical competence be solved?” is down at 27%.[3] (Although, given how out of distribution this question is with respect to most Metaculus questions, the community prediction here should be taken with a lump of salt.)

(It’s possible that your “in some sense” qualifier is what’s driving our apparent disagreement, and that we don’t actually disagree by much.)

  1. ^

    Free-response writing comprises 55% of the AP English language and English literature exams (source, source).

  2. ^

    On this, AI Explained (8:01–8:34) says:

    And there is another hurdle that would follow, if you agree with this analysis [of why o1’s capabilities are what they are, across the board]: It’s not just a lack of training data. What about domains that have plenty of training data, but no clearly correct or incorrect answers? Then you would have no way of sifting through all of those chains of thought, and fine-tuning on the correct ones. Compared to the original GPT-4o in domains with correct and incorrect answers, you can see the performance boost. With harder-to-distinguish correct or incorrect answers: much less of a boost [in performance]. In fact, a regress in personal writing.

  3. ^

    Note: Metaculus forecasters—for the most part—think that superintelligence will come within five years of AGI. (See here for my previous commentary on this, which goes into more detail.)

Owen Cotton-Barratt @ 2024-10-24T10:13 (+4)

It's not clear we have too much disagreement, but let me unpack what I meant:

  • Let strong philosophical competence mean competence at all philosophical questions, including those like metaethics which really don't seem to have any empirical grounding
    • I'm not trying to make any claims about strong philosophical competence
    • I might be a little more optimistic than you about getting this by default as a generalization of weak philosophical competence (see below), but I'm still pretty worried that we won't get it, and I didn't mean to rely on it in my statements in this post
  • Let weak philosophical competence mean competence at reasoning about complex questions which ultimately have empirical answers, where it's out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them
  • I claim that by the time systems approach strong superintelligence, they're likely to have a degree of weak philosophical competence
    • Because:
      • It would be useful for many tasks, and this would likely be apparent to mildly superintelligent systems
      • It can be selected for empirically (seeing which training approaches etc. do well at weak philosophical competence in toy settings, where the experimenters have access to the ground truth about the questions they're having the systems use philosophical reasoning to approach)
  • I further claim that weak philosophical competence is what you need to be able to think about how to build stronger AI systems that are, roughly speaking, safe, or intent aligned
    • Because this is ultimately an empirical question ("would this AI do something an informed version of me / those humans would ultimately regard as terrible?")
    • I don't claim that this would extend to being able to think about how to build stronger AI systems that it would be safe to make sovereigns
Will Aldred @ 2024-10-24T12:35 (+4)

Thanks for expanding! This is the first time I’ve seen this strong vs. weak distinction used—seems like a useful ontology.[1]

Minor: When I read your definition of weak philosophical competence,[2] high energy physics and consciousness science came to mind as fields that fit the definition (given present technology levels). However, this seems outside the spirit of “weak philosophical competence”: an AI that’s superhuman in the aforementioned fields could still fail big time with respect to “would this AI do something an informed version of me / those humans would ultimately regard as terrible?” Nonetheless, I’ve not been able to think up a better ontology myself (in my 5 mins of trying), and I don’t expect this definitional matter will cause problems in practice.

  1. ^

    For the benefit of any readers: Strong philosophical competence is importantly different to weak philosophical competence, as defined. The latter feeds into intent alignment, while the former is an additional problem beyond intent alignment. [Edit: I now think this is not so clear-cut. See the ensuing thread for more.]

  2. ^

    “Let weak philosophical competence mean competence at reasoning about complex questions which ultimately have empirical answers, where it's out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them.”

Owen Cotton-Barratt @ 2024-10-24T13:26 (+4)

Yeah, I appreciated your question, because I'd also not managed to unpack the distinction I was making here until you asked.

On the minor issue: right, I think that for some particular domain(s), you could surely train a system to be highly competent in that domain without this generalizing to even weak philosophical competence overall. But if you had a system which was strong at both of those domains despite not having been trained on them, and especially if that was also true for say three more comparable domains, I guess I kind of do expect it to be good at the general thing? (I haven't thought long about that.)

Will Aldred @ 2024-10-25T19:24 (+2)

Hmm, interesting.

I’m realizing now that I might be more confused about this topic than I thought I was, so to backtrack for just a minute: it sounds like you see weak philosophical competence as being part of intent alignment, is that correct? If so, are you using “intent alignment” in the same way as in the Christiano definition? My understanding was that intent alignment means “the AI is trying to do what present-me wants it to do.” To me, therefore, this business of the AI being able to recognize whether its actions would be approved by idealized-me (or just better-informed-me) falls outside the definition of intent alignment.

(Looking through that Christiano post again, I see a couple of statements that seem to support what I’ve just said,[1] but also one that arguably goes the other way.[2])

Now, addressing your most recent comment:

Okay, just to make sure that I’ve understood you, you are defining weak philosophical competence as “competence at reasoning about complex questions [in any domain] which ultimately have empirical answers, where it's out of reach to test them empirically, but where one may get better predictions from finding clear frameworks for thinking about them,” right? Would you agree that the “important” part of weak philosophical competence is whether the system would do things an informed version of you, or humans at large, would ultimately regard as terrible (as opposed to how competent the system is at high energy physics, consciousness science, etc.)?

If a system is competent at reasoning about complex questions across a bunch of domains, then I think I’m on board with seeing that as evidence that the system is competent at the important part of weak philosophical competence, assuming that it’s already intent aligned.[3] However, I’m struggling to see why this would help with intent alignment itself, according to the Christiano definition. (If one includes weak philosophical competence within one’s definition of intent alignment—as I think you are doing(?)—then I can see why it helps. However, I think this would be a non-standard usage of “intent alignment.” I also don’t think that most folks working on AI alignment see weak philosophical competence as part of alignment. (My last point is based mostly on my experience talking to AI alignment researchers, but also on seeing leaders of the field write things like this.))

A couple of closing thoughts:

  1. I already thought that strong philosophical competence was extremely neglected, but I now also think that weak philosophical competence is very neglected. It seems to me that if weak philosophical competence is not solved at the same time as intent alignment (in the Christiano sense),[4] then things could go badly, fast. (Perhaps this is why you want to include weak philosophical competence within the intent alignment problem?)
  2. The important part of weak philosophical competence seems closely related to Wei Dai’s “human safety problems”.

(Of course, no obligation on you to spend your time replying to me, but I’d greatly appreciate it if you do!)

  1. ^
    • They could [...] be wrong [about; sic] what H wants at a particular moment in time.
    • They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
    • They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.

    I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.

    (“Understanding what humans want” sounds quite a lot like weak philosophical competence, as defined earlier in this thread, while “solving philosophy” sounds a lot like strong philosophical competence.)

  2. ^

    An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences.

    (It’s unclear whether this just refers to clarifying present-H’s preferences, or if it extends to making present-H’s preferences be closer to idealized-H’s.)

  3. ^

    If the system is not intent aligned, then I think this would still be evidence that the system understands what an informed version of me would ultimately regard as terrible vs. not terrible. But, in this case, I don’t think the system will use what it understands to try to do the non-terrible things.

  4. ^

    Insofar as a solved vs. not solved framing even makes sense. Karnofsky (2022; fn. 4) argues against this framing.

Owen Cotton-Barratt @ 2024-10-25T21:12 (+4)

it sounds like you see weak philosophical competence as being part of intent alignment, is that correct?

Ah, no, that's not correct.

I'm saying that weak philosophical competence would:

  • Be useful enough for acting in the world, and in principle testable-for, that I expect it to be developed as a form of capability before strong superintelligence
  • Be useful for research on how to produce intent-aligned systems

... and therefore that if we've been managing to keep things more or less intent aligned up to the point where we have systems which are weakly philosophical competent, it's less likely that we have a failure of intent alignment thereafter. (Not impossible, but I think a pretty small fraction of the total risk.)

Will Aldred @ 2024-10-25T21:52 (+2)

Thanks for clarifying!

Be useful for research on how to produce intent-aligned systems

Just checking: Do you believe this because you see the intent alignment problem as being in the class of “complex questions which ultimately have empirical answers, where it’s out of reach to test them empirically, but one may get better predictions from finding clear frameworks for thinking about them,” alongside, say, high energy physics?

Owen Cotton-Barratt @ 2024-10-25T22:37 (+2)

Yep.

SummaryBot @ 2024-10-23T16:51 (+1)

Executive summary: As AI capabilities progress, the peak period for existential safety risks likely occurs during mild-to-moderate superintelligence, when capabilities research automation might significantly outpace safety research automation, requiring careful attention to safety investments and coordination.

Key points:

  1. AI differs from other technologies because earlier AI capabilities can fundamentally change the nature of later safety challenges through automation of both capabilities and safety research.
  2. The required "safety tax" (investment in safety measures) varies across AI development stages, peaking during mild-to-moderate superintelligence.
  3. Early AGI poses relatively low existential risk due to limited power accumulation potential, while mature strong superintelligence may have lower safety requirements due to better theoretical understanding and established safety practices.
  4. Differential technological development (boosting beneficial AI applications) could be a high-leverage strategy for improving overall safety outcomes.
  5. Political groundwork for coordination and investment in safety measures should focus particularly on the peak risk period of mild-to-moderate superintelligence.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.