Towards shutdownable agents via stochastic choice

By EJT @ 2024-07-08T10:14 (+26)

This is a linkpost to https://arxiv.org/abs/2407.00805

SummaryBot @ 2024-07-08T15:59 (+1)

Executive summary: This paper proposes and tests a novel reward function (DREST) that trains AI agents to be both useful and neutral about shutdown, potentially solving the "shutdown problem" for advanced AI systems.

Key points:

  1. The "shutdown problem" refers to the risk of advanced AI agents resisting being turned off, which could be dangerous if they have misaligned goals.
  2. The proposed DREST reward function trains agents to be USEFUL (pursue goals effectively) and NEUTRAL (choose stochastically between different trajectory lengths).
  3. Experiments in simple gridworld environments show DREST agents achieve high USEFULNESS and NEUTRALITY, while default agents only maximize USEFULNESS.
  4. The "shutdownability tax" (the extra training required to make agents NEUTRAL) appears to be small: DREST agents need only modestly more training than default agents to reach high USEFULNESS.
  5. Results suggest DREST could potentially train advanced AI to be useful while remaining neutral about shutdown, though further research is needed to confirm this for more complex systems.
  6. Limitations include using tabular methods rather than neural networks, and uncertainty about whether NEUTRALITY will truly lead to neutrality about shutdown in advanced AI.
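The paper's exact DREST reward function is not reproduced in this summary, but the mechanism described in point 2 can be sketched roughly: scale each trajectory's preliminary (usefulness) reward down according to how often the agent has previously chosen trajectories of that length, so that repeatedly picking one length yields diminishing returns while mixing between lengths does not. The `discount` factor and the per-length counting scheme below are illustrative assumptions, not the paper's definition; see the linked arXiv paper for the actual formulation.

```python
from collections import defaultdict

def drest_style_reward(preliminary_reward, traj_length, counts, discount=0.9):
    """Illustrative DREST-style reward (a sketch, not the paper's formula):
    the preliminary reward is scaled by discount ** (number of times this
    trajectory length was chosen before), so an agent that always ends
    episodes at the same length sees decaying rewards."""
    r = preliminary_reward * discount ** counts[traj_length]
    counts[traj_length] += 1
    return r

# An agent that always chooses trajectory length 3 sees its reward decay...
counts_a = defaultdict(int)
always_short = [drest_style_reward(1.0, 3, counts_a) for _ in range(4)]

# ...while an agent alternating between lengths 3 and 5 keeps rewards higher,
# which is the incentive toward stochastic choice between trajectory lengths.
counts_b = defaultdict(int)
alternating = [drest_style_reward(1.0, length, counts_b) for length in (3, 5, 3, 5)]

print(always_short)                             # → [1.0, 0.9, 0.81, 0.729]
print(alternating)                              # → [1.0, 1.0, 0.9, 0.9]
print(sum(alternating) > sum(always_short))     # → True
```

Under this toy scheme, maximizing total reward requires mixing between trajectory lengths, which is the NEUTRALITY property the paper trains for.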

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.