LLMs as a Planning Overhang

By Larks @ 2024-07-14T04:57 (+49)

This is a crosspost, probably from LessWrong. Try viewing it there.

Adebayo Mubarak @ 2024-07-17T18:23 (+3)

Can you clarify this a bit "Only if the safety/alignment work applies directly to the future maximiser AIs (for example, by allowing us to understand them) does it seem very advantageous to me."

Kind of lost here

Larks @ 2024-07-17T20:20 (+2)

Suppose we have some LLM interpretability technology that helps us take LLMs from being a bit worse than humans at planning to a bit better (say, because it reduces the risk of hallucinations), and these LLMs will ultimately be used by both humans and future agentic AIs. The improvement from human-level planning to better-than-human level benefits both humans and optimiser AIs. But the improvement up to human level is a much bigger boost to the agentic AI, which would otherwise not have access to such planning capabilities, than to humans, who already had human-level abilities. So this interpretability technology actually ends up making crunch time worse.

It would be different if this interpretability work (or other form of safety/alignment work) also applied to future agentic AIs, because then we could use it to directly reduce the risk from them.

Adebayo Mubarak @ 2024-07-17T23:11 (+1)

It seems I'm getting the hang of it now...

So your argument here is that if we are going to go this route, then interpretability technology should also be used in the future to ensure the safety of these agentic AIs, as much as it is currently being used to improve their "planning capabilities".