Niceness is unnatural
By So8res @ 2022-10-13T01:30 (+20)
This is a crosspost, probably from LessWrong. Try viewing it there.
Sharmake @ 2022-10-13T14:28 (+3)
I definitely agree that deceptive alignment seems likely to break black-box properties such as niceness by default, thanks to the simplicity prior and the fact that internal or corrigible alignment is harder to reach than deceptive alignment, at least once the model has a world-model.