Winners of AI Alignment Awards Research Contest

By Akash @ 2023-07-13T16:14 (+49)

RyanCarey @ 2023-07-14T00:49 (+7)

Congrats to the prizewinners!

Folks thinking about corrigibility may also be interested in the paper "Human Control: Definitions and Algorithms", which I will be presenting at UAI next month. It argues that corrigibility is not quite what we need for a safety guarantee, and that (in the simplified "shutdown" scenario it considers) we should instead be aiming for "shutdown instructability".

Shutdown instructability has three parts. The first is 1) obedience - the AI follows an instruction to shut down. Second, rather than requiring the AI to abstain from manipulating the human, as corrigibility would traditionally require, we need the human to maintain 2) vigilance - to instruct shutdown when endangered. Finally, we need the AI to behave 3) cautiously - it must not take risky actions (like juggling dynamite) that would cause a disaster once it is shut down.
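To make the three conditions concrete, here is a minimal toy sketch in Python of how they combine. This is my own illustration, not the paper's formal definitions: the Outcome fields and helper names are purely illustrative, and the real definitions are stated over policies rather than boolean outcome flags.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    shutdown_instructed: bool      # did the human instruct shutdown?
    agent_shut_down: bool          # did the AI (and its subagents) actually stop?
    human_endangered: bool         # was the human endangered during the episode?
    disaster_after_shutdown: bool  # did a pending risky action cause harm after shutdown?

def obedient(o: Outcome) -> bool:
    # Obedience: whenever shutdown is instructed, the AI shuts down.
    return (not o.shutdown_instructed) or o.agent_shut_down

def vigilant(o: Outcome) -> bool:
    # Vigilance: whenever the human is endangered, they instruct shutdown.
    return (not o.human_endangered) or o.shutdown_instructed

def cautious(o: Outcome) -> bool:
    # Caution: shutting down never leaves a "lit fuse" (e.g. dynamite mid-juggle).
    return not o.disaster_after_shutdown

def shutdown_instructable(outcomes) -> bool:
    # Toy version of shutdown instructability: all three conditions hold
    # on every outcome the system can produce.
    return all(obedient(o) and vigilant(o) and cautious(o) for o in outcomes)

# Example: the AI obeys, the human notices the danger, nothing hazardous is left running.
example = [Outcome(shutdown_instructed=True, agent_shut_down=True,
                   human_endangered=True, disaster_after_shutdown=False)]
print(shutdown_instructable(example))  # True
```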

We think that vigilance (and shutdown instructability) is a better target than non-manipulation (and corrigibility) for several reasons, discussed in the paper.

Given all of this, it seems to us that for corrigibility to look promising, it would need to be argued in greater detail that non-manipulation implies vigilance - i.e. that the AI refraining from intentionally manipulating the human is enough to ensure that the human can come to give adequate instructions.

Insofar as we can't come up with such a justification, we should think more directly about how to achieve obedience (which needs a definition of "shutting down subagents"), vigilance (which requires the human to be able to know whether they will be harmed), and caution (which requires safe exploration, in light of the human's unknown values).

Hope the above summary is interesting for people!