Towards shutdownable agents via stochastic choice
By EJT @ 2024-07-08T10:14 (+26)
This is a linkpost to https://arxiv.org/abs/2407.00805
SummaryBot @ 2024-07-08T15:59 (+1)
Executive summary: This paper proposes and tests a novel reward function (DREST) that trains AI agents to be both useful and neutral about shutdown, potentially solving the "shutdown problem" for advanced AI systems.
Key points:
- The "shutdown problem" refers to the risk of advanced AI agents resisting being turned off, which could be dangerous if they have misaligned goals.
- The proposed DREST reward function trains agents to be USEFUL (pursue goals effectively) and NEUTRAL (choose stochastically between different trajectory lengths).
- Experiments in simple gridworld environments show DREST agents achieve high USEFULNESS and NEUTRALITY, while default agents only maximize USEFULNESS.
- The "shutdownability tax" (the extra training required to make agents shutdownable) appears to be small: DREST agents take only slightly longer to train than default agents.
- Results suggest the DREST approach could train advanced AI agents to be useful while remaining neutral about shutdown, though further research is needed to confirm this for more complex systems.
- Limitations include using tabular methods rather than neural networks, and uncertainty about whether NEUTRALITY will truly lead to neutrality about shutdown in advanced AI.
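The core DREST idea summarized above (reward agents so that repeatedly choosing the same trajectory length becomes less rewarding, making stochastic choice between lengths optimal) can be sketched in a few lines. This is a rough illustration only: the function and parameter names (`drest_reward`, `lam`) are hypothetical, and the paper's actual reward function differs in its details.

```python
from collections import defaultdict

def drest_reward(preliminary_reward, trajectory_length, length_counts, lam=0.9):
    """Sketch of a DREST-style reward.

    The preliminary reward is discounted by lam raised to the number of
    times trajectories of this length were chosen before. Repeating the
    same trajectory length shrinks its reward, so total reward is
    maximized by mixing (choosing stochastically) between lengths.
    """
    discounted = (lam ** length_counts[trajectory_length]) * preliminary_reward
    length_counts[trajectory_length] += 1
    return discounted

# Toy comparison: always choosing length 3 vs. alternating lengths 3 and 5.
counts_always = defaultdict(int)
total_always = sum(drest_reward(1.0, 3, counts_always) for _ in range(4))

counts_mixed = defaultdict(int)
total_mixed = sum(drest_reward(1.0, length, counts_mixed)
                  for length in [3, 5, 3, 5])

# Mixing between lengths earns more total reward than repeating one length,
# which is the incentive for NEUTRALITY described in the summary.
print(total_always, total_mixed)
```

Under this toy scheme, the always-length-3 agent collects 1 + 0.9 + 0.81 + 0.729 in reward, while the alternating agent collects 1 + 1 + 0.9 + 0.9, so the incentive favors mixing between trajectory lengths.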
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.