Discovering Language Model Behaviors with Model-Written Evaluations
By evhub @ 2022-12-20T20:09 (+25)
This is a crosspost, probably from LessWrong. Try viewing it there.