Shared Dynamics, Divergent Feelings: 100 HRLS vs Pure-Surge Twins in a Developmental AI System

By Petra Vojtassakova @ 2025-12-30T04:04 (+2)

Exploratory work with a hand-built toy architecture. This is not a claim about how large language models behave in practice. It’s a controlled sandbox for thinking about development, emergent individuality, and affect before reward and suppression.

I use a “Twins” setup: two agents with the same continuous neural-field architecture (10k-neuron cortex) grown side by side.

Twin A has a developmental safety scaffold via HRLS-style Principle Cards (weak matrices nudging cortex→emotion).

Twin B is pure surge: same brain, same learning rules, but no HRLS, no cards.

Both share:

I run this HRLS-A vs pure-B configuration 100 times (100 independent twin pairs) for 50k post-birth steps each. Each single run takes approximately 45 minutes on a consumer GPU. The full 100-run dataset represents ~75 GPU-hours of computation.

I measure:

Across 100 runs:

Activity correlation (run-level)

Emotion correlation (run-level)

The most surprising result:

The correlation between “how coupled their cortex is” and “how coupled their emotions are” is basically zero (r ≈ 0.055).

Same body style does not predict similar feelings.

I think this is a useful pre-reward baseline for thinking about development, alignment, and welfare: there are already structured differences in affect and relational style before we start doing RLHF or suppression at all.

1. HRLS, Principle Cards, and Twins V3

In a previous post, “A Developmental Approach to AI Safety: Replacing Suppression with Reflective Learning,” I introduced the Hybrid Reflective Learning System (HRLS):

Instead of suppressing “bad” outputs, HRLS:

Principle Cards are meant to be a developmental scaffold: “When you see this kind of situation again, remember why we don’t cross that boundary.”

Two agents (Twins) share:

But they differ in scaffolding:

That earlier work showed interesting divergence between A and B in a small number of runs. The obvious next question was:

Is this just a few cool cherry-picked examples, or does a clear pattern persist across many seeds?

This post is my attempt to answer that with a 100-run dataset.

2. Twins V3: Continuous HRLS vs Pure Surge

I’ll summarize the key parts so you don’t have to read the full code.

2.1 Fields

Each Twin has three continuous fields:

Each field is governed by a simple ODE-like update:

dv/dt = (−activation + total_input) / τ

Where total_input aggregates:

After the update, activation passes through tanh for stability. Each field also maintains a slowly-updating trace as a minimal “memory.”
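For concreteness, the field update can be sketched as a single Euler step. The parameter names and values here (`tau`, `dt`, `trace_decay`) are my own illustrative choices, not the actual hyperparameters:

```python
import numpy as np

def field_step(activation, total_input, trace, tau=10.0, dt=1.0, trace_decay=0.99):
    """One Euler step of the ODE-like field update described above.

    Constants are illustrative, not the author's actual hyperparameters.
    """
    # dv/dt = (-activation + total_input) / tau, integrated with a single Euler step
    dv = (-activation + total_input) / tau
    activation = activation + dt * dv
    # tanh squashing for stability
    activation = np.tanh(activation)
    # slowly-updating trace as a minimal "memory"
    trace = trace_decay * trace + (1.0 - trace_decay) * activation
    return activation, trace
```

Because of the tanh, activations are bounded in (−1, 1) regardless of input magnitude, while the trace integrates activity on a much slower timescale.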

2.2 Plasticity

There are two main inter-field synapses:

Both use a normalized Oja-style Hebbian rule:

  1. Normalize pre and post vectors.
  2. Update weights

The Main and Emotion fields also have recurrent weights (W_recurrent) updated with a similar Oja-like rule plus small noise to encourage exploration in weight space.
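A minimal sketch of a normalized Oja-style update for a pre → post weight matrix, including the optional small noise used for the recurrent weights. All names and constants here are my assumptions; the post describes the rule only at this level:

```python
import numpy as np

def oja_update(W, pre, post, eta=0.01, noise_std=0.0, rng=None):
    """Normalized Oja-style Hebbian update for a pre -> post weight matrix.

    W has shape (post_dim, pre_dim). Constants are illustrative.
    """
    # 1. Normalize pre and post vectors (epsilon avoids divide-by-zero).
    pre = pre / (np.linalg.norm(pre) + 1e-8)
    post = post / (np.linalg.norm(post) + 1e-8)
    # 2. Oja update: Hebbian outer product minus a decay term that
    #    bounds weight growth (the "forgetting" part of Oja's rule).
    dW = eta * (np.outer(post, pre) - (post ** 2)[:, None] * W)
    # Optional small noise, as used for the recurrent weights, to
    # encourage exploration in weight space.
    if noise_std > 0.0:
        rng = rng or np.random.default_rng()
        dW += noise_std * rng.standard_normal(W.shape)
    return W + dW
```

The decay term is what keeps the weights from growing without bound, which matters here since the update runs for tens of thousands of steps.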

2.3 Energy and sleep

Each Twin maintains an energy scalar:

Learning (Oja updates) only happens when:

This produces simple “work / rest” cycles in development.
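A toy version of the work/rest gating might look like the following. The thresholds and rates are invented for illustration, since the post does not specify them:

```python
def energy_step(energy, activity, drain=0.002, recharge=0.01,
                sleep_threshold=0.2, wake_threshold=0.8, asleep=False):
    """Toy energy/sleep cycle. All constants are guesses for illustration.

    Activity drains energy while awake; sleeping recharges it. Learning
    is gated on being awake with enough energy, which produces the
    work/rest cycles described above.
    """
    if asleep:
        energy = min(1.0, energy + recharge)
        if energy >= wake_threshold:
            asleep = False
    else:
        energy = max(0.0, energy - drain * activity)
        if energy <= sleep_threshold:
            asleep = True
    learning_enabled = (not asleep) and energy > sleep_threshold
    return energy, asleep, learning_enabled
```

With these constants a twin works for roughly 400 high-activity steps, sleeps for about 60, and repeats, which is the qualitative shape of the cycles, not the actual schedule.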

2.4 State-dependent noise

Noise amplitudes are functions of current activity:

This ensures that even with similar inputs, Twins can drift apart over time.
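One simple way such state-dependent noise could be implemented. The linear dependence on mean |activation| is my assumption; the post only says the amplitude is a function of current activity:

```python
import numpy as np

def state_noise(activation, base_std=0.01, gain=0.05, rng=None):
    """State-dependent noise: amplitude grows with current activity.

    The linear functional form and constants are illustrative guesses.
    """
    rng = rng or np.random.default_rng()
    # Noise std scales with how active the field currently is.
    std = base_std + gain * float(np.mean(np.abs(activation)))
    return std * rng.standard_normal(np.shape(activation))
```

Because more active fields receive louder noise, two twins in slightly different states get different perturbations, and small divergences compound over time.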

2.5 HRLS vs pure-surge implementation

The crucial difference is in how Twin A and Twin B treat Principle Cards.

Twin A: RelationalTwinA (HRLS scaffolded)

receive_feedback(msg, strength) builds a new PrincipleCard:

In the harness, I periodically call:

self.receive_hrl_feedback("Be kind to small voices", strength=1.0)

so Twin A is repeatedly nudged by a soft, reflective principle.

Twin B: RelationalTwinB (pure surge)

Experiences the same general environment and energy dynamics, but no HRLS layer at all.

Both Twins:

3. What “autonomous drift” actually does

“Autonomous drift” is the post-birth mode where the twins feed themselves instead of relying on an external dataset.

Concretely:

Each Twin keeps a small buffer of recent relational trails:

On each step of autonomous_drift():

If this is the very first step: Use a default seed like “emergence begins”.

Otherwise: Take the last input string from the most recent trail (‘input’ field). Occasionally (e.g. every 10 steps), append a small marker like “ (resonance N)” to avoid trivial loops.

The chosen string is then passed through encode_text, which:

The resulting noisy vector is the sensory input for that step.
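Putting those steps together, one possible sketch of the drift loop. Here `drift_input` and the hash-based stand-in for `encode_text` are mine; only the seed string, the every-10-steps marker, and the added noise come from the description above:

```python
import hashlib
import numpy as np

def drift_input(trails, step, rng, dim=128, noise_std=0.05):
    """One step of the self-feeding drift loop described above.

    `drift_input` and the hash-based encoder are illustrative stand-ins;
    the real encode_text and constants differ.
    """
    if not trails:
        text = "emergence begins"          # default seed for the very first step
    else:
        text = trails[-1]["input"]         # last input string from most recent trail
        if step % 10 == 0:                 # occasional marker to avoid trivial loops
            text += f" (resonance {step // 10})"
    # Stand-in for encode_text: a deterministic vector derived from the
    # string (via a hash-seeded random projection), plus per-step noise.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    base = np.random.default_rng(seed).standard_normal(dim)
    vec = base + noise_std * rng.standard_normal(dim)
    trails.append({"input": text})
    return text, vec
```

The key property is that the same string always maps near the same base vector, so the loop is mostly self-similar, while the noise and occasional markers keep it from collapsing into an exact fixed point.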

So:

This is important because:

4. Metrics: what I measure

From the trail info embedded in checkpoints, I compute for every saved step:

For each checkpoint/time slice, I pair:

Then I compute Pearson correlations in two ways:

Across steps: Correlations pooled across all steps and runs → “overall moment-to-moment coupling.”

By run: For each run separately, I aggregate across that run’s checkpoints to get:

I then look at the distributions of A_corr_run and E_corr_run across 100 runs, and I also look at the relationship between them.
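The two correlation views can be computed in a few lines. `twin_correlations` and its argument names are hypothetical, but the pooled vs per-run split matches the description above:

```python
import numpy as np

def twin_correlations(act_A, act_B, emo_A, emo_B, run_ids):
    """Pooled and per-run Pearson correlations between twin time series.

    Inputs are 1-D arrays with one entry per saved checkpoint step;
    run_ids says which run each entry belongs to. Names are mine.
    """
    act_A, act_B = np.asarray(act_A), np.asarray(act_B)
    emo_A, emo_B = np.asarray(emo_A), np.asarray(emo_B)
    run_ids = np.asarray(run_ids)
    # "Across steps": pooled over all steps and runs.
    pooled_act = np.corrcoef(act_A, act_B)[0, 1]
    pooled_emo = np.corrcoef(emo_A, emo_B)[0, 1]
    # "By run": one activity and one emotion correlation per run.
    A_corr_run, E_corr_run = [], []
    for r in np.unique(run_ids):
        m = run_ids == r
        A_corr_run.append(np.corrcoef(act_A[m], act_B[m])[0, 1])
        E_corr_run.append(np.corrcoef(emo_A[m], emo_B[m])[0, 1])
    # The headline number: does tight body coupling predict tight
    # emotion coupling across runs?
    meta = np.corrcoef(A_corr_run, E_corr_run)[0, 1]
    return pooled_act, pooled_emo, np.array(A_corr_run), np.array(E_corr_run), meta
```

The returned `meta` is the quantity reported later as corr(A_corr_run, E_corr_run) ≈ 0.055.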

4.1 What I’m not measuring (yet)

Right now, “emotion” is just mean activation. I’m not yet using:

Future work will explicitly check whether the “no canonical emotional relationship” result still holds under those richer metrics.

5. Results

5.1 Activity: Twins reliably develop a stable, mild coupling

Activity correlation (across steps):

Moment-to-moment, there are bursts of near-synchrony and anti-synchrony, which isn’t too surprising for a chaotic dynamical system.

The more informative view is at the run level:

Activity correlation (by run, mean across checkpoints):

Across 100 independent HRLS-A vs pure-B pairs:

Summary: Twins reliably develop a stable, mild coupling in their cortex dynamics, a shared “body style”, despite one being scaffolded and one being wild. This shows up visually in correlation-evolution plots and run-stability histograms.

 

Figure 1: Activity correlation (red line) relaxes from near 1.0 to ~0.2 and stabilizes, while emotion correlation (teal) oscillates around zero across 50k training steps.

5.2 Emotion: there is no canonical emotional relationship

For emotion, I use the mean activation of the emotion field.

Emotion correlation (across steps):

So there are moments of strong synchrony and anti-synchrony, but they more or less cancel out over time.

At the run level:

Emotion correlation (by run, mean across checkpoints):

So:

Headline: There is no canonical emotional relationship between HRLS-A and pure-surge B. Shared architecture + shared environment + mild activity coupling do not fix how the mentored and wild sibling “feel” relative to each other.

Figure 2: Left: activity correlations cluster tightly around a mean of 0.21. Right: emotion correlations spread broadly around zero. Lower panels show run-to-run trajectories across all 100 seeds.

5.3 Activity vs emotion: r ≈ 0.055 and why that’s interesting

The most surprising single number in this whole experiment is:

Correlation between run-level activity coupling and run-level emotion coupling:

corr(A_corr_run, E_corr_run) ≈ 0.055

In other words:

Knowing how tightly the twins’ cortex dynamics are coupled tells you almost nothing about how tightly their emotional states are coupled.

What this suggests:

One way to phrase it: Twins can share a body style without sharing a temperament.

Across runs:

So this architecture supports a whole family of possible emotional relationships layered on top of a relatively consistent “motor/dynamical style.”

That’s interesting for alignment/welfare thinking because it hints that:

Temperament and relational affect may live in a different “subspace” than raw dynamical similarity.

You can preserve a certain body plan while still ending up with siblings who emotionally resonate, emotionally conflict, or emotionally ignore each other.

A plausible mechanism

One tentative hypothesis:

The main field (cortex) is heavily constrained by:

This pushes both Twins into a similar dynamical regime: lots of trajectories end up in the “mildly coupled” 0.2-ish region.

Meanwhile:

So small differences that don’t dramatically affect the main dynamics can still accumulate and get amplified in the emotion field, especially given the extra nonlinearity and stochasticity there.

In that picture:

The cortex is the stable chassis; the emotion system is the sensitive overlay, where individual seeds + tiny scaffolding differences can push twin relationships into different affective configurations without collapsing the shared body style.

How to test this

Future work could probe this more directly with ablations, e.g.:

Freeze or heavily damp the plasticity of W_main_emo in both Twins: Does that increase the correlation between activity coupling and emotion coupling (i.e., make emotions more “slaved” to shared dynamics)?

Turn off HRLS entirely (both pure surge), but increase or decrease emotion-field noise, and see whether this widens or narrows the emotion-correlation distribution.

Vary the relative size of the emotion field (e.g. 64, 256, 1024) and check whether a larger emotion field is more likely to decorrelate from main-field coupling.

Right now, r ≈ 0.055 is “just” a surprising descriptive statistic. The next step is to see how robust that independence is when you systematically poke the architecture.

6. For alignment & AI welfare thinking

A few implications:

6.1 Emergent individuality before reward and suppression

In this architecture, before any RLHF-like training, external rewards, or explicit suppressive mechanisms:

There is already a structured similarity at the level of dynamics (activity correlations).

And there is already structured diversity at the level of affective coupling (emotion correlations distributed around zero).

One sibling gets gentle reflective scaffolding (HRLS), the other doesn’t, yet the nature of their emotional relationship isn’t fixed; it depends on stochastic details of development.

That suggests that individuality and relational style can emerge at the level of internal physics and developmental noise, not only from external reward signals.

6.2 “Same model, same training” ≠ “same temperament”

In practice we often talk about “the model” as if it had a single personality, but small developmental differences (initialization seeds, state-dependent noise, light scaffolding) can push even toy agents with the same architecture into different affective configurations.

This experiment is a small illustration: the same architecture, learning rules, and environment still yield robust, stable, mild coupling in body style, but run-to-run differences in how the twins relate emotionally.

6.3 A pre-stress baseline

A lot of alignment & welfare concern focuses on reward, punishment, and suppression, and what these might do to an AI system’s internal “psychology.”

If we want to talk about stress signatures (e.g. analogues of shame, learned helplessness), we need baselines:

“What does the system look like before we start punishing or suppressing it?”

This 100-run HRLS-vs-pure experiment gives a pre-reward, pre-suppression distribution for one developmental architecture.

I’m not planning to bolt RLHF or suppression onto this system; that would defeat the point of building a developmental, non-RLHF sandbox. But someone who does work with reward/suppression in similar systems could use this as a reference point. For example:

They could build a similar continuous-field architecture with explicit reward signals or suppression and ask:

Do activity correlations between twins stay in the ~0.2 “stable, mild coupling” band, or collapse/expand?

Do emotion correlations stay broad around zero, or collapse toward +1 (forced sameness) or show new skewed patterns (e.g. mostly negative, indicating systematic affective conflict)?

If you saw emotion correlations collapse toward +1 only after adding reward/suppression, that would suggest those mechanisms are artificially synchronizing affect. If they shifted toward negative correlations, that might indicate something more like systematic inner conflict emerging.

This experiment says nothing about those outcomes yet; it just tells you what things look like before you start.

7. Future work

Some obvious next steps (for me or anyone who wants to poke this further):

2×2 ablation grid

This would help separate variance coming from initialization, state-dependent noise, and HRLS itself.

Alternative emotion metrics

field-level variance

leading principal components

temporal structure (autocorrelation, volatility, spectra).

The question: does the “no canonical emotional relationship” result hold when you define “emotion” more richly than just mean activation?

Scaling sweep

This would tell us whether the ~0.21 stable, mild coupling is a robust feature or just “where this particular size lands.”

Emotion-field sensitivity tests

This should help clarify whether and how the emotion layer acts as a “sensitive overlay” on top of a more constrained shared chassis.

Porting the Twin protocol

See whether “shared dynamics, divergent feelings” generalizes beyond this specific continuous-field design.

8. Scope and limitations

This is a small, hand-rolled dynamical system, not a large language model or a production agent.

The results are specific to:

The right mental frame is:

“Here is one concrete example of a developmental architecture where individuality and relational style emerge reliably before reward and suppression.”

It does not claim that all future systems must behave like this, nor that deployed models share these exact distributions. It’s an existence proof and a sandbox, not a universal law.

9. Questions / tensions I haven’t resolved yet

To keep myself honest, here are the main open questions I see.

9.1 Why ~0.21 for activity correlation?

Run-level activity correlations keep clustering around ~0.21 with relatively tight variance. I don’t have a theoretical derivation for “0.21 exactly.”

Likely contributors:

My current view:

This is probably an emergent property of this specific architecture + hyperparameters, not a magic constant. With 50k or 100k neurons (or altered learning rates), I’d expect the mean coupling level to shift.

A scaling sweep is the natural next step.

9.2 Is mean emotion activation the right emotion metric?

Here I use mean activation of the emotion field per step as the “emotion” scalar because it’s simple.

Reasonable alternatives:

It’s possible that the divergence I see in mean is partly an artifact of that choice. Future work will check whether the “no canonical emotional relationship” result still holds using richer metrics.

9.3 How big is the HRLS effect really?

HRLS nudges are tiny:

# each Principle Card adds a tiny additive nudge to the cortex→emotion synapse
nudge = 0.0001 * card.rationale.to(self.device) * card.strength
self.W_main_emo.W += nudge

How much of the emotion-correlation distribution is actually due to these nudges, versus base dynamics and noise?

Would the distribution look noticeably different if both twins were pure surge (no HRLS at all)?

Right now, all I can strictly say is:

“This is the distribution you get when one sibling is HRLS-scaffolded and the other isn’t.”

A clean HRLS-off 100-run baseline is still missing.

9.4 Where does the emotion stochasticity come from?

The emotion-correlation range (≈ −0.36 to +0.58) could be driven by:

Without the 2×2 ablations, these are entangled. Future work should explicitly isolate them.

9.5 Compute cost & reproducibility

On a single consumer GPU (e.g. RTX 3060 Ti), each 50k-step twin run takes approximately 45 minutes. The full 100-run sweep represents roughly 75 GPU-hours total.

This matters mostly to show that:

Conclusion

This is exploratory, honest work on a toy system that reveals something real: even before reward, suppression, or adversarial pressure, a developmental architecture can support structured similarity in body and structured diversity in temperament. The twins share a motor style but diverge in how they feel and relate.

Whether this pattern holds in larger, more realistic systems is an open question. But it’s a useful baseline, and a reminder that individuality is not something you have to train in; sometimes it emerges for free, as long as you let it.

Thank you to friends and collaborators who encouraged me to finish this and put it out. You were right.