On “slack” in training (Section 1.5 of “Scheming AIs”)

By Joe_Carlsmith @ 2023-11-25T17:51 (+14)

This is a crosspost, probably from LessWrong. Try viewing it there.

SummaryBot @ 2023-11-27T13:15 (+1)

Executive summary: The degree of "slack" in AI training is how intensely the process pressures models to maximize reward, as opposed to tolerating suboptimal behavior. More slack means more uncertainty about the goals a model ends up with, while less slack increases confidence about them. The author leans towards preferring less slack.

Key points:

  1. Slack refers to how much room training leaves for a model to get less-than-maximal reward. Low-slack regimes pressure models intensely towards reward maximization.
  2. Slack affects how likely training is to produce models that pursue proxy goals imperfectly aligned with the training rewards.
  3. Slack is conceptually distinct from efforts in training to root out goal misgeneralization through adversarial techniques.
  4. Biological reward processes like human dopamine likely involve more slack than ML training. Evolutionary selection also leaves more slack.
  5. We can likely control the slack during training. Less slack provides more confidence about the resulting goal, while more slack risks wishful thinking.
  6. The author leans towards preferring less slack to increase certainty about goals, but slack may vary across training stages.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.