Winners of AI Alignment Awards Research Contest

By Akash @ 2023-07-13T16:14 (+49)

RyanCarey @ 2023-07-14T00:49 (+7)

Congrats to the prizewinners!

Folks thinking about corrigibility may also be interested in the paper "Human Control: Definitions and Algorithms", which I will be presenting at UAI next month. It argues that corrigibility is not quite what we need for a safety guarantee, and that (in the simplified "shutdown" scenario it considers) we should instead be aiming for "shutdown instructability".

Shutdown instructability has three parts. The first is 1) obedience - the AI follows an instruction to shut down. Second, rather than requiring the AI to abstain from manipulating the human, as corrigibility would traditionally require, we need the human to maintain 2) vigilance - to instruct shutdown when endangered. Finally, we need the AI to behave 3) cautiously - it must not take risky actions (like juggling dynamite) that would cause a disaster once it is shut down.
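To make the three conditions concrete, here is a minimal toy sketch in Python of how they combine. This is my own illustration, not the paper's formal definitions: the Outcome fields and helper names are purely illustrative, and the real definitions are stated over policies rather than boolean outcome flags.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    shutdown_instructed: bool      # did the human instruct shutdown?
    agent_shut_down: bool          # did the AI (and its subagents) actually stop?
    human_endangered: bool         # was the human endangered during the episode?
    disaster_after_shutdown: bool  # did a pending risky action cause harm after shutdown?

def obedient(o: Outcome) -> bool:
    # Obedience: whenever shutdown is instructed, the AI shuts down.
    return (not o.shutdown_instructed) or o.agent_shut_down

def vigilant(o: Outcome) -> bool:
    # Vigilance: whenever the human is endangered, they instruct shutdown.
    return (not o.human_endangered) or o.shutdown_instructed

def cautious(o: Outcome) -> bool:
    # Caution: shutting down never leaves a "lit fuse" (e.g. dynamite mid-juggle).
    return not o.disaster_after_shutdown

def shutdown_instructable(outcomes) -> bool:
    # Toy version of shutdown instructability: all three conditions hold
    # on every outcome the system can produce.
    return all(obedient(o) and vigilant(o) and cautious(o) for o in outcomes)

# Example: the AI obeys, the human notices the danger, nothing hazardous is left running.
example = [Outcome(shutdown_instructed=True, agent_shut_down=True,
                   human_endangered=True, disaster_after_shutdown=False)]
print(shutdown_instructable(example))  # True
```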

We think that vigilance (and shutdown instructability) is a better target than non-manipulation (and corrigibility) for several reasons, discussed in the paper.

Given all of this, it seems to us that for corrigibility to look promising, it would need to be argued in greater detail that non-manipulation implies vigilance - i.e. that the AI refraining from intentionally manipulating the human is enough to ensure that the human can come to give adequate instructions.

Insofar as we can't come up with such a justification, we should think more directly about how to achieve obedience (which needs a definition of "shutting down subagents"), vigilance (which requires the human to be able to know whether they will be harmed), and caution (which requires safe exploration, in light of the human's unknown values).

Hope the above summary is interesting for people!