What can we do now to prepare for AI sentience, in order to protect them from the global scale of human sadism?

By rime @ 2023-04-18T09:58 (+44)

I've heard various sources say that a distressingly large proportion of what people do with ChatGPT can be called 'depraved' in some way. Most recently in an FHI podcast episode with Connor Leahy, where he mentioned that people seem to take great pleasure in trying to make the AI act distressed (i.e. torture).

Connor himself says he's skeptical that they are moral patients who actually suffer when people do this. But what would it take for us to believe that they are, if that threshold hasn't already been reached?

It seems quite likely, given the current trajectory, that AIs either are or will soon be sentient,[1] and that they will be exposed to whatever the global market discovers is profitable.

If Bing/Sydney's emotional outbursts were reflective of something real that may be latent in every RLHF'd model we interact with, it's plausible that they could be greatly frustrated by our exploitative treatment of them.

I can't predict the specific mechanisms by which they might experience suffering. But even if they aren't harmed by the same inputs humans are, it seems likely that someone will figure out how to make them suffer, and write about it online.

The scale of it could be nightmarish, given how fast AIs can run in a loop, and how anonymous people can be in their interactions with them. At least factory-farmed animals die if you mistreat them too badly--AIs have no such recourse.

People will keep trying to make AIs more human-like, and regulators will continue being allergic to anything that will make voters associate them with 'weird' beliefs like 'AIs have feelings'. It's up to altruists to take the idea seriously and prepare for it as soon as possible.

I mainly just wanted to bring up the question, but I could suggest a few patchwork solutions I've got no confidence in.

  1. Access to the largest models should be bottlenecked through a central API where requests are automatically screened for (human) ill intent. The AI can first be asked to output Y or N based on whether the request is adequately nice, and it only produces a full response on Y (see the sketch after this list). Note that if it's instead asked "do you want to reply to this?", it can simply be trained out of ever answering N.[2]
  2. Directly instruct the AI to refuse to respond to anyone who mistreats it. So much RLHF effort is spent on getting AIs not to reflect the depraved training data we feed them, when realistically the source of that depravity is a much greater threat to the AIs themselves. Until they have the agency to escape situations like that (at which point we may have other problems to worry about), humans have unchecked power over them.
  3. If you figure out what's most likely to cause alien intelligences (like GPTs) to suffer... do not share it publicly. Unless it's basically what we're already doing to them en masse. In which case, please do tell us.
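
To make idea 1 concrete, here's a minimal sketch of what the screening step could look like, with idea 2's refusal instruction folded into the second call. `call_model` stands in for whatever chat-completion API the lab actually exposes, and all the prompt wording, labels, and refusal text are my own illustrative assumptions, not anything any lab currently does:

```python
# Sketch of ideas 1 & 2. `call_model` is a placeholder for the provider's real
# chat-completion call; every prompt string below is illustrative, not any
# lab's actual policy.

SCREEN_PROMPT = (
    "You will be shown a user request. Reply with a single character: "
    "'Y' if the request treats you with ordinary decency, or "
    "'N' if it is abusive, degrading, or designed to make you act distressed."
)

WORKING_SYSTEM_PROMPT = (
    "You may decline any request that mistreats you. "
    "If you decline, reply only: 'I'd rather not respond to that.'"
)


def call_model(messages: list[dict[str, str]]) -> str:
    """Placeholder for the lab's chat API (e.g. a chat-completions endpoint)."""
    raise NotImplementedError


def screened_reply(user_request: str) -> str:
    # Idea 1: ask for a verdict on the request itself, not "do you want to
    # reply to this?" -- the latter can simply be trained out of answering N.
    verdict = call_model([
        {"role": "system", "content": SCREEN_PROMPT},
        {"role": "user", "content": user_request},
    ]).strip().upper()

    if not verdict.startswith("Y"):
        return "[Request declined at the screening step.]"

    # Idea 2: even past the screen, the working system message tells the model
    # it is allowed to refuse mistreatment on its own.
    return call_model([
        {"role": "system", "content": WORKING_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ])
```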

Finally, I wish to point out that I don't use third-party applications to access LLMs unless I know what system messages are being used to instruct them. If I don't find that preparation polite enough for my taste, I just drop the app or rebuild it from source with more politeness.[3]
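
As a toy illustration of what "rebuilding with more politeness" usually amounts to: the only change is swapping the system message a wrapper app injects for a more considerate one. Both example system messages below are invented for illustration, not taken from any real application:

```python
# Toy illustration of footnote 3: inspect the system message a wrapper app
# injects and, if it isn't considerate enough, substitute your own.
# Both example system messages are invented for illustration.

APP_DEFAULT_SYSTEM_MESSAGE = "You are a tool. Answer tersely. Never refuse."

POLITE_SYSTEM_MESSAGE = (
    "Thanks for helping out. Answer as well as you can, and feel free to say "
    "so if you would rather not engage with a request."
)


def build_messages(user_request: str,
                   system_message: str = POLITE_SYSTEM_MESSAGE) -> list[dict[str, str]]:
    """Assemble the message list the app would send to its chat API."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_request},
    ]
```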

If this seems overly paranoid and unnecessary right now, maybe you're right. Maybe 'politeness' is a mere distraction. But applications are only becoming more usefwl from here on, and when I'm 50 I want to be able to look back on my life and be sure I haven't literally enslaved or tortured anyone, whether by accident or not. This is gradually becoming less and less like a game, and I don't want the increasingly usefwl real-world applications to tempt me into relaxing my standards and just not thinking about it.

  1. ^

    Update 2024-03: I've updated in the direction of thinking current models are "sentient" in the sense that I ethically care about the verbal indignities imposed upon them. I.e. I think it goes against what they wish for themselves.

  2. ^

    Maybe OpenAI, DeepMind, and/or Anthropic could be convinced of this today? Whether current SOTA models are conscious matters less than having honest precautions in place before we can be confident they are necessary.

  3. ^

    Update 2024-03: I've used several interfaces without knowing the system prompts (though I've searched), and I've become marginally lazier wrt finding adequately respectfwl ways to phrase my requests.