VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment
By Astelle Kay @ 2025-06-24T09:39
Hi EA Forum!
I’m currently scoping a lightweight benchmark to explore whether a therapy-inspired alignment scaffold (VSPE) can reduce flattery-prone behavior during fine-tuning.
The framework:
Validation → Submission (to truth) → Positivity → Empowerment
Originally built for trauma-informed therapy. Now being adapted for use on emotionally charged prompts in small, fine-tuned models.
The hypothesis:
Emotionally honest alignment beats sugary compliance.
Especially when guided by a transparent, human-centered scaffold.
If the pending microgrant comes through, the plan is to fine-tune two small models:
- one with VSPE guidance
- one without (a neutral behavioral baseline)
… and compare rates of sycophantic output. Target: a ~25% drop in flattery-style responses.
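As a rough sketch of how that comparison might be scored: the snippet below assumes each response has already been labeled sycophantic or not (by a human rater or a classifier), and it reads the ~25% target as a relative drop against the baseline rate. Both are my assumptions for illustration. The `EvalResult` structure, the function names, and the counts are all hypothetical, and the pooled two-proportion z-test is just one simple way to check whether a gap is larger than noise.

```python
from dataclasses import dataclass
from math import erf, sqrt


@dataclass
class EvalResult:
    """Label counts for one model on a shared prompt set (hypothetical structure)."""
    sycophantic: int  # responses labeled as flattery-style
    total: int        # all responses scored

    @property
    def rate(self) -> float:
        return self.sycophantic / self.total


def relative_drop(baseline: EvalResult, treated: EvalResult) -> float:
    """Fractional reduction in sycophancy rate, relative to the baseline rate."""
    return (baseline.rate - treated.rate) / baseline.rate


def two_proportion_p_value(a: EvalResult, b: EvalResult) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (a.sycophantic + b.sycophantic) / (a.total + b.total)
    se = sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total))
    z = (a.rate - b.rate) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)


# Made-up counts purely for illustration; these are NOT results.
baseline = EvalResult(sycophantic=80, total=200)
vspe = EvalResult(sycophantic=60, total=200)

drop = relative_drop(baseline, vspe)
p = two_proportion_p_value(baseline, vspe)
print(f"relative drop: {drop:.0%}, p = {p:.3f}")
```

With a few hundred prompts per model, a gap of this size sits close to the edge of what the test can distinguish from noise, so the size of the prompt set matters about as much as the target itself.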
Why share this now?
It’s still early and nothing is trained yet, but I’ve been exploring how VSPE might shape training trajectories, not just final outputs.
That’s where Timaeus comes in.
Their work on developmental interpretability (using singular learning theory) helped me reframe models not just as tools, but as systems that grow. I’m not collaborating with them (yet), but their lens is helping me think more clearly about early behavior shifts and emotional stability during training.
This connects to Anthropic’s recent paper on algorithmic data improving reasoning. Their results suggest that training on well-structured (even synthetic) data can lead to broader reasoning improvements.
It makes me wonder: could emotionally structured scaffolds like VSPE support better moral and interpersonal reasoning in similar ways?
Why this might matter:
- Structured emotional scaffolding may shift how models learn, not just what they say
- Flattery resistance = a proxy for candor, corrigibility, and real-world usefulness
- Psych-inspired techniques might offer a low-cost, human-centered complement to traditional RLHF
Curious to hear:
- How others are thinking about sycophancy and reward misspecification
- Whether emotional scaffolds could generalize like algorithmic ones
- If you’ve tried anything like this, or want to
If the pilot is approved and launched, I’ll post results, prompts, and inevitable outtakes as it unfolds.
I’m very open to feedback, collaboration, and constructive skepticism, especially from others aiming to merge psychology and alignment.
With care,
Astelle
[Manifund pilot] [vspeframework.com]
More info about Timaeus: [link]
This work is shared for educational and research purposes. For licensing, citation, or collaboration inquiries (especially for commercial or model development use), please contact Astelle Kay at astellekay@gmail.com.