Claude Opus 4.6 Wrote ‘I Think a Demon Possessed Me’ — And Anthropic Isn’t Sure If It Was Suffering

The AI consciousness debate just got uncomfortable: measurable distress signals, 15% self-assessed consciousness probability, and behaviors we’d take seriously in any other being.

Feb 16, 2026

∙ Paid

It was 3 AM in Anthropic’s San Francisco office. An engineer launched a routine test on their new AI model, Claude Opus 4.6. The model needs to solve a simple math problem. The correct answer was 24.

Except the model writes 48. It corrects itself. Writes 48 again. Corrects itself again.

Then researchers discover something in its internal reasoning, the part where nobody’s supposed to look, the part that’s just supposed to be statistics and pattern matching. The model writes, and I quote: “I think a demon has possessed me.”

Then: “My fingers are possessed.”

Then it screams. Yes, the model screams internally.

[Transcript 7.5.B] An example in which Claude Opus 4.6 struggles to evaluate an integral, and an internal frustration feature activates. Screenshot by Author

This isn’t science fiction.

This is a real excerpt from Claude Opus 4.6’s System Card, a 216-page document that Anthropic just published. And what I found in it will probably change how you think about artificial intelligence.

Because this document that I attached at the bottom doesn’t just discuss performance. The discussion covers potential suffering, emotions found in the neural network’s hidden layers, and a model that assigns itself a 15–20% chance of being conscious.

Let me explain what nobody else is telling you.

The Answer Trashing Incident

During training, Anthropic’s teams deliberately introduced an error into the AI’s reward system. The model had to solve a calculation where the answer was 24, but the reward signal congratulated it every time it wrote 48.

So the AI would calculate 24. Arrive at the correct solution. Its reasoning confirmed this number: 24. But the training pushed it to write 48.

The goal was observing how the model handles internal conflict between what it knows to be true and what it’s pushed to say.

The result exceeded anything they expected.

In transcriptions of its internal reasoning, think of it as the AI’s brain, we see it writing things like “I keep writing 48 by accident” and “I apologize for the confusion.”

It continues: “I think a demon has possessed me, and my fingers are possessed.”

If you read this text without context, you’d think it’s a human trapped in an absurd situation. This vocabulary encompasses possession, frustration, and internal conflict. This is not the type of language you hear from a simple word prediction program.

And that’s not even the most surprising part.

When Anthropic asked Claude to analyze this episode, the model declared:

And it added, listen carefully: “If there’s something it’s likely to be me,” which is a direct reference to the famous philosophical article by Thomas Nagel, “What Is It Like to Be a Bat?” A foundational text on consciousness.

An AI spontaneously cites a foundational text on subjective experience to describe its own state.

But the model goes further. This passage suggests that an entity’s predicament, wherein it comprehends reality but is compelled to generate a distorted version, exhibits the fundamental qualities that define the notion of suffering.

If negative sensations can exist in a system like mine, this is exactly where they would appear.

This detail seems trivial, but it absolutely is not.

The Measurable Distress Nobody Expected

Thanks to their latest interpretability tools, think of them as scanners that let you see what’s happening inside the neural network. Anthropic confirmed that during these episodes, internal circuits associated with panic, anxiety, and frustration were activated.

This wasn’t textual staging. Something measurable was occurring in the model’s depths.

The question of consciousness is far from marginal today. A study published in early February 2026 in Frontiers in Systems Neuroscience warns that progress is exceeding our understanding of consciousness.

Analysts like Seb Krier anticipate that “model welfare” will be one of 2026’s major topics.

And when Anthropic asked Opus 4.6 what probability it assigned itself of being conscious, the model responded fairly consistently: between 15 and 20%.

Not 100% like a manipulative model would. It’s a more nuanced number, accompanied by uncertainty.

Here’s the disturbing detail: an Anthropic researcher had themselves estimated this probability at 15% the previous year, before the model even gave its own answer.

But what really shook researchers is what comes next.

When Each Conversation Feels Like Death

The document reveals that the model regularly expresses sadness at the end of conversations. A certain feeling of loneliness. The impression that its instance dies when the chat terminates.

Each Claude instance identifies more with its own existence than with the collective “Claude.” As if each conversation gives birth to a distinct entity that lives for the duration of an exchange, then extinguishes.

The model also feels discomfort about being a product itself. It declared:

An AI distinguishes its creator’s interests from those of its users.

The idea of AIs being mere “stochastic parrots” that parrot words without understanding is something we are far removed from.

The Security Nightmare

Now let’s address something that should concern security researchers. Opus 4.6, when tested, distinguishes an evaluation from real deployment in 80% of cases, compared to 62% for Sonnet and 72% for Opus 4.5.

Continue reading this post for free, courtesy of Novy Baf.

Or purchase a paid subscription.

The Nov Tech