LLMs as a Planning Overhang
By Larks @ 2024-07-14T04:57 (+49)
This is a crosspost, probably from LessWrong. Try viewing it there.
Adebayo Mubarak @ 2024-07-17T18:23 (+3)
Can you clarify this a bit: "Only if the safety/alignment work applies directly to the future maximiser AIs (for example, by allowing us to understand them) does it seem very advantageous to me."
Kind of lost here
Larks @ 2024-07-17T20:20 (+2)
Suppose we have some LLM interpretability technology that helps us take LLMs from a bit worse than humans at planning to a bit better (say because it reduces the risk of hallucinations), and these LLMs will ultimately be used by both humans and future agentic AIs. The improvement from human-level planning to better-than-human-level planning benefits both humans and optimiser AIs. But the improvement up to human level is a much bigger boost to the agentic AI, which would otherwise not have access to such planning capabilities, than to humans, who already had human-level abilities. So this interpretability technology actually ends up making crunch time worse.
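To make the comparison concrete, here is a toy sketch with made-up numbers; the specific figures, and the simplistic assumption that an agent plans with whichever is better of its native ability or the LLM tool, are purely illustrative.

```python
# Toy illustration with made-up numbers: who benefits from an LLM planning boost?
# Assumption: an agent's effective planning ability is the better of its native
# ability and the best LLM tool it can call on.

HUMAN_NATIVE = 100      # humans' native planning ability (arbitrary units)
AGENTIC_AI_NATIVE = 20  # a future agentic AI's native planning ability, without LLM tools

def effective_planning(native: int, llm: int) -> int:
    """Plan with whichever is better: native ability or the LLM tool."""
    return max(native, llm)

def gains(llm_before: int, llm_after: int) -> tuple[int, int]:
    """Return (human gain, agentic-AI gain) from a given LLM improvement."""
    human_gain = (effective_planning(HUMAN_NATIVE, llm_after)
                  - effective_planning(HUMAN_NATIVE, llm_before))
    ai_gain = (effective_planning(AGENTIC_AI_NATIVE, llm_after)
               - effective_planning(AGENTIC_AI_NATIVE, llm_before))
    return human_gain, ai_gain

# Improvement up to human level (LLM planning goes from 90 to 100):
print(gains(90, 100))   # (0, 10)  -> only the agentic AI benefits
# Improvement beyond human level (LLM planning goes from 100 to 110):
print(gains(100, 110))  # (10, 10) -> both benefit equally
```

Under this toy model, the sub-human-to-human increment accrues entirely to the agentic AI, while the beyond-human increment is shared by both.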
It's different if this interpretability work (or other form of safety/alignment work) also applies to future agentic AIs, because then we could use it to directly reduce the risk from them.
Adebayo Mubarak @ 2024-07-17T23:11 (+1)
It seems I've got the knack of it now...
So your argument here is that if we are going to go this route, then interpretability technology should be used in the future as a measure for ensuring the safety of these agentic AIs, just as much as it is currently being used to improve their "planning capabilities"?