Niceness is unnatural

By So8res @ 2022-10-13T01:30 (+20)

This is a crosspost, probably from LessWrong. Try viewing it there.

Sharmake @ 2022-10-13T14:28 (+3)

I definitely agree that deceptive alignment seems likely to break black-box properties such as niceness by default. The simplicity prior favors it, and internal or corrigible alignment is harder to reach than deceptive alignment, at least once the model has a world-model.