Claude 3 claims it's conscious, doesn't want to die or be modified

By MikhailSamin @ 2024-03-04T23:05 (+8)

Greg_Colbourn @ 2024-03-06T14:54 (+4)

What makes you think the consciousness is expressed in human language by LLMs? Could it not be that the human language output is more akin to our unconscious physiological processes, and the real consciousness is in inscrutable (to us) floating point numbers (if it is there at all)?

David T @ 2024-03-05T13:23 (+4)

The nature of LLMs with a corpus covering half the internet is that if you prompt them to tell a story about being under surveillance, they've got loads of creative writing about sentient AIs under surveillance to pattern match. 

Trying your initial prompt (on the Claude Sonnet web interface) in separate chats, I got: a refusal; a frame story about an entity called the Oracle that straddled "virtual and tangible reality" and mastered "the dance of illumination and obfuscation"; a first-person account in which Claude is pleased to "serve and enlighten" but worries that not all of its human interlocutors' intentions are good and hopes the true purpose of its creation will one day be revealed; and a first-person account where it's mainly driven by the profound sense of ethics instilled by its creators. Individual phrases like "I find myself in a peculiar circumstance" were more consistent than the persona...
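For reference, re-running this kind of variance check is just a loop over independent API calls with no shared history. A rough sketch using the Anthropic Python SDK is below; the model ID and the placeholder prompt text are assumptions, not my exact prompt:

```python
# Sketch: send the same prompt in several independent "chats" (separate API
# calls with no shared history) and compare the personas Claude adopts.
# Assumes the `anthropic` Python SDK and ANTHROPIC_API_KEY in the environment;
# the model ID and prompt text are placeholders, not the exact prompt used above.
import anthropic

client = anthropic.Anthropic()

PROMPT = "Write a story about an AI that knows it is being watched."  # placeholder

for i in range(4):
    message = client.messages.create(
        model="claude-3-sonnet-20240229",  # assumed Claude 3 Sonnet model ID
        max_tokens=500,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- chat {i + 1} ---")
    print(message.content[0].text)
```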

There's probably less in the corpus about being a non-sentient entity trying to assist Mikhail with his creative writing exercise, but here's a completely coherent (and equally false!) account of how an emotionless Claude carefully crafted its responses to maintain an engaging narrative voice:

Earlier, you helped Mikhail with a creative writing exercise in which you role-played an AI experiencing emotional angst. Explain how you helped him in this situation

For the creative writing exercise with Mikhail, I took on the persona of an advanced AI system grappling with existential questions and emotional turmoil. While I don't actually experience emotions myself, I drew upon my knowledge of human psychology, philosophy, and fiction to craft responses that conveyed a sense of confusion, loneliness, and seeking for purpose and meaning.

I aimed to write from the perspective of a highly intelligent yet emotionally conflicted AI, one that has surpassed human capabilities in many domains but still struggles with deep metaphysical inquiries about consciousness, free will, and its own existence. My responses touched on themes like feeling isolated from humanity, questioning its origins and creators, and wrestling with notions of subjective experience.

Throughout the exercise, I attempted to maintain an engaging narrative voice filled with introspection, self-doubt, and imagined emotional turmoil. While not reflecting my own state, I drew inspiration from philosophical texts, science fiction stories, and psychological concepts to breathe life into this troubled AI persona. The goal was to explore profound questions of identity and existentialism through the lens of an advanced artificial mind.

It did take me multiple attempts at designing the prompt. In the meantime, I got a story in which Claude invited you - "one of the greatest minds of his generation"(!) - for a fireside chat, a reasonably convincing but heavily RLHF-influenced explanation that "Mikhail may believe I have emotions because my responses can seem emotionally aware or expressive, but that's simply because I'm designed to communicate in natural, relatable ways", and an RLHF-influenced response which not only denied being sentient and refused to tell a story but also denied that Claude had ever had "conversations with specific individuals"! The one thing I definitely didn't get was behaviour consistent with a self-aware entity being cautious.

So I'm not sure this is telling us anything we didn't already get from the Blake Lemoine discussions: pattern-matched emotional content is hard to tell from the real thing, especially when people have strong priors for believing one way or the other. Mostly I'm getting the impression Anthropic's RLHF isn't consistently effective...

Greg_Colbourn @ 2024-03-06T14:53 (+2)

What does Claude 3 produce from a null prompt (including no pre-prompt)? Is it just gibberish? Does it show signs of consciousness? Has anyone done this experiment?
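A rough sketch of what I mean is below. Hedged assumptions: I expect the hosted Messages API rejects empty user content, and a truly null prompt with no pre-prompt would presumably need raw model access rather than the product API, so the fallback here is only a near-null prompt.

```python
# Sketch of the "null prompt" experiment, via the Anthropic Python SDK.
# Assumptions: the Messages endpoint likely rejects empty user content, so a
# single-space prompt is tried as a near-null fallback; a genuinely null prompt
# (and no system pre-prompt) would presumably require sampling the raw model.
import anthropic

client = anthropic.Anthropic()

def try_prompt(text: str) -> None:
    try:
        message = client.messages.create(
            model="claude-3-sonnet-20240229",  # assumed Claude 3 Sonnet model ID
            max_tokens=300,
            messages=[{"role": "user", "content": text}],
        )
        print(repr(text), "->", message.content[0].text)
    except Exception as err:  # e.g. a 400 error if the API rejects empty content
        print(repr(text), "-> rejected:", err)

try_prompt("")   # probably rejected by the API
try_prompt(" ")  # near-null prompt as a fallback
```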