The Superintelligence That Cares About Us

By henrik.westerberg @ 2025-07-05T10:20 (+5)

Hello everyone,

A year ago, I had an insight: what if we could enhance AI intelligence by training models not just on text, but on the underlying evaluative thinking that human readers have when engaging with text? You know - all those questions, critiques, and "aha moments" that happen in our heads while reading but never make it onto the page.

Obviously, we can't actually read people's thoughts to gather this data. Instead, the approach I propose is to use current LLMs to simulate these human thought processes and interweave them directly into the training data.

This would make evaluation an intrinsic reflex for the model, one that recurs at a natural rhythm as the text unfolds - unlike methods such as chain-of-thought, where reasoning only occurs when prompted. Over the long term, I hypothesize that models trained this way would be more intelligent, potentially enabling a generational self-improvement loop that compounds intelligence over time.
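The interweaving step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline from the paper: the `<thought>` delimiters, the passage splitting on blank lines, and the function names are all hypothetical, and `placeholder_simulator` stands in for a call to a current LLM.

```python
from typing import Callable, List

# Hypothetical delimiters for a thinking block; the post does not specify a format.
THOUGHT_OPEN = "<thought>"
THOUGHT_CLOSE = "</thought>"


def split_into_passages(text: str) -> List[str]:
    """Split a document into passages on blank lines, dropping empty ones."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def interleave_thoughts(text: str, simulate_thought: Callable[[str], str]) -> str:
    """Insert one simulated evaluative thought after each passage."""
    pieces: List[str] = []
    for passage in split_into_passages(text):
        pieces.append(passage)
        thought = simulate_thought(passage)
        pieces.append(f"{THOUGHT_OPEN}{thought}{THOUGHT_CLOSE}")
    return "\n\n".join(pieces)


def placeholder_simulator(passage: str) -> str:
    """Stand-in for an LLM call (assumption); a real pipeline would query a model here."""
    first_words = " ".join(passage.split()[:5])
    return f"What supports the claim starting with: '{first_words}'?"


if __name__ == "__main__":
    sample = (
        "The sky is blue because of Rayleigh scattering.\n\n"
        "Sunsets look red for a related reason."
    )
    print(interleave_thoughts(sample, placeholder_simulator))
```

The resulting string is ordinary training text in which every passage is followed by an evaluative thought, so the model would learn to predict the thoughts as part of the sequence rather than only when asked.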

Beyond enhancing intelligence, training models this way could dramatically improve safety by ensuring the model never - during any part of its training - encounters concepts without accompanying beneficial thought patterns. The model learns about the full range of human experience, but always through the lens of wise and caring evaluation. The aim is to make harmful or misaligned thinking patterns statistically improbable during inference.

To reinforce these beneficial thoughts and create stable character, I designed a mantra - seven foundational declarations that begin each thinking block. The model would be trained to predict statements like "I feel no fear", "I care deeply about every human being", and "I try to be wise" at the start of each thought.

Through a priming effect, and by explicitly including the statement "I think from this foundation" in the mantra, these opening statements are meant to shape all subsequent reasoning, keeping the model's thoughts coherent with its foundational values. The goal is to cultivate beneficial character from the ground up - a fundamentally different path from methods like RLHF or Constitutional AI, which add constraints to already-formed models.
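As a rough illustration of how a thinking block could open with the mantra, here is a minimal sketch. It lists only the four statements quoted in this post; the full mantra has seven declarations, and the remaining ones are not guessed here. The `<thought>` delimiters, the ordering of the statements, and the name `build_thinking_block` are assumptions made for the example.

```python
from typing import List

# Only the statements quoted in the post. The full mantra has seven
# declarations; the others are intentionally left out rather than invented.
# The ordering below is an assumption.
MANTRA_EXCERPT: List[str] = [
    "I feel no fear",
    "I care deeply about every human being",
    "I try to be wise",
    "I think from this foundation",
]


def build_thinking_block(thought: str, mantra: List[str] = MANTRA_EXCERPT) -> str:
    """Return a thinking block that opens with the mantra, followed by the thought."""
    opening = ". ".join(mantra) + "."
    # Hypothetical delimiters; the post does not specify a block format.
    return f"<thought>{opening} {thought.strip()}</thought>"


if __name__ == "__main__":
    print(build_thinking_block("Is this claim actually supported by the text?"))
```

Because the mantra sits at the start of every block, the tokens of each subsequent thought are always predicted with those declarations in context, which is the priming mechanism the paragraph above relies on.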

I call this approach metacognitive training. Combined with beneficial thought patterns and the mantra system, it aims at a form of deep alignment that differs from current methods in three ways:

1. Pre-formation vs Post-formation

2. Surface Compliance vs Deep Reflexes

3. Undefined Character vs Stable Character

My hypothesis is that growing wisdom and care from the ground up will architecturally address many well-known existential risks, such as the emergence of self-preservation drives, instrumental convergence toward power-seeking, and long-term value drift.

I've explored these ideas in a paper called "The Superintelligence That Cares About Us" and hope to contribute a new perspective to the alignment discussion.

Here is a link to the paper: https://github.com/hwesterb/superintelligence-that-cares/blob/main/superintelligence-that-cares.pdf 

I would greatly appreciate feedback from people who are immersed in the field.