Discovering Language Model Behaviors with Model-Written Evaluations

By evhub @ 2022-12-20T20:09 (+25)

This is a crosspost, probably from LessWrong. Try viewing it there.
