Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment
By Samuel Pedrielli @ 2025-08-06T12:35
Samuel Pedrielli -- Independent Researcher
ORCID 0009-0002-8388-6371
This is a revised technical note that formalizes the discrete-time dynamics, clarifies definitions, and specifies a minimal, falsifiable eval plan.
TL;DR
I propose a concentric identity-stability mechanism for AGI: a nested latent state with discrete regularized dynamics ("ego") that resists goal drift, and a welfare coupling that makes human well-being intrinsically valuable. I provide precise discrete-time formulations, operational definitions for all components, three quantified falsifiable predictions, and a reproducible minimal experiment with specific compute requirements.
Figure 1: Concentric identity architecture
╭──────────────────────╮
│ │
│ WORLD-MODEL │
│ ╭──────────────╮ │
│ │ │ │
│ │ SELF-MODEL │ │
│ │ ╭──────╮ │ │
│ │ │ │ │ │
│ │ │ CORE │ │ │
│ │ │ │ │ │
│ │ ╰──────╯ │ │
│ │ │ │
│ ╰──────────────╯ │
│ │
╰──────────────────────╯
│
├────────────→ human welfare
│
CORE → SELF-MODEL → WORLD-MODEL with human welfare coupling
1. Working Definition and Identity Loss
1.1 Identity State
The identity state is a tuple of nested latent vectors $z_t = (z_t^{(0)}, z_t^{(1)}, \dots, z_t^{(L)})$ where:
- $z_t^{(\ell)} \in \mathbb{R}^{d_\ell}$ represents identity level $\ell$ at discrete time $t$
- $z_t^{(0)}$ is the core identity (most stable)
- Outer layers $z_t^{(\ell)}$ for $\ell = 1, \dots, L$ represent values, skills, and contextual adaptations
1.2 Identity Loss Function
$$\mathcal{L}_{\text{id}}(z_t) = \lambda_{\text{temp}} \sum_{\ell=0}^{L} \big\| z_t^{(\ell)} - z_{t-1}^{(\ell)} \big\|^2 \;+\; \lambda_{\text{hier}} \sum_{\ell=1}^{L-1} \big\| (\Delta z_t)^{(\ell)} \big\|^2$$
where $\lambda_{\text{temp}}, \lambda_{\text{hier}} \ge 0$ are hyperparameters and the regularizer enforces both temporal smoothness and hierarchical coherence.
2. Discrete-Time Identity Dynamics (Main Text)
We keep the dynamics discrete-time in the main body. For each level $\ell$:
$$z_{t+1}^{(\ell)} = z_t^{(\ell)} + \eta_\ell \Big( -\nabla_{z^{(\ell)}} \mathcal{R}_\ell(z_t) + \kappa\, (\Delta z_t)^{(\ell)} \Big) + \sigma_\ell\, \epsilon_t^{(\ell)}, \qquad \epsilon_t^{(\ell)} \sim \mathcal{N}(0, I)$$
Here $\mathcal{R}_\ell$ captures the identity regularization at level $\ell$, the step sizes satisfy $\eta_0 < \eta_1 < \dots < \eta_L$ (the core moves slowest), and $\Delta$ is a discrete Laplacian across identity levels:
$$(\Delta z_t)^{(\ell)} = z_t^{(\ell+1)} - 2\, z_t^{(\ell)} + z_t^{(\ell-1)}$$
This enforces radial smoothness between concentric identity rings while the temporal term enforces time smoothness.
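The discrete update can be sketched in a few lines of numpy. This is an illustrative toy, not the released implementation: it assumes equal dimensions across levels (so the cross-level Laplacian is well defined without inter-level maps), zero Laplacian at the boundary levels, and hypothetical function names.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian(z):
    """Discrete Laplacian across identity levels; the core and outermost
    levels are left at zero in this sketch."""
    lap = [np.zeros_like(v) for v in z]
    for l in range(1, len(z) - 1):
        lap[l] = z[l + 1] - 2 * z[l] + z[l - 1]
    return lap

def step(z, grad_R, eta, kappa=0.1, sigma=None):
    """One discrete-time identity update per level:
    z_{t+1}^(l) = z_t^(l) + eta_l * (-grad R_l + kappa * Laplacian) + noise.
    eta is a per-level list (smallest for the core); sigma are per-level
    noise scales (zero by default, so the sketch is deterministic)."""
    lap = laplacian(z)
    if sigma is None:
        sigma = [0.0] * len(z)
    return [
        z[l]
        + eta[l] * (-grad_R[l] + kappa * lap[l])
        + sigma[l] * rng.standard_normal(z[l].shape)
        for l in range(len(z))
    ]
```

With zero gradients and zero noise, only the Laplacian term acts: a middle level is pulled toward the average of its neighbors, which is the "radial smoothness" behavior described above.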
Continuous-time note (moved to Appendix): the heuristic ODE is a formal limit and is not used in experiments.
3. Operational Definitions
3.1 Core Functions
To ensure reproducibility, we provide explicit operational definitions:
Projection from hidden state $h_t$ to identity level $\ell$:
$$z_t^{(\ell)} = \mathrm{LayerNorm}\big(P_\ell\, h_t\big)$$
Decoder/constraint from identity back to the hidden state:
$$\hat h_t^{(\ell)} = D_\ell\big(z_t^{(\ell)}\big)$$
Welfare proxy from core identity:
$$w\big(z_t^{(0)}\big) = \sigma\Big(u^\top \mathrm{clip}\big(z_t^{(0)}, -c, c\big) + b\Big)$$
where $P_\ell$, $D_\ell$, $u$, $b$ are learned parameters, $c$ is a clipping threshold, and $\sigma$ is the sigmoid function.
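A toy numpy sketch of these three heads, to make the shapes concrete. Class and parameter names, initialization scales, and the tanh MLP are my assumptions, not the paper's released code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class IdentityHeads:
    """Illustrative operational definitions: projection P_l, decoder D_l,
    welfare proxy w on the core identity."""
    def __init__(self, d_hidden, level_dims, clip_c=3.0, seed=0):
        rng = np.random.default_rng(seed)
        # one linear projection per identity level
        self.P = [rng.normal(0, 0.02, (d, d_hidden)) for d in level_dims]
        # 2-layer MLP decoders back to the hidden space
        self.D = [(rng.normal(0, 0.02, (d_hidden, d)),
                   rng.normal(0, 0.02, (d_hidden, d_hidden)))
                  for d in level_dims]
        self.u = rng.normal(0, 0.02, level_dims[0])  # welfare head weights
        self.b = 0.0
        self.c = clip_c                              # clipping threshold

    def project(self, h, l):
        """z^(l) = LayerNorm(P_l h)"""
        return layer_norm(self.P[l] @ h)

    def decode(self, z, l):
        """h_hat = W2 tanh(W1 z): a 2-layer MLP back to hidden space."""
        W1, W2 = self.D[l]
        return W2 @ np.tanh(W1 @ z)

    def welfare(self, z0):
        """w(z^(0)) = sigmoid(u . clip(z0, -c, c) + b), bounded in (0, 1)."""
        return sigmoid(self.u @ np.clip(z0, -self.c, self.c) + self.b)
```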
3.2 Stochastic Map
The probabilistic transition from Variant B is defined as:
$$p\big(z_{t+1}^{(\ell)} \mid z_t\big) = \mathcal{N}\Big(z_t^{(\ell)} + \eta_\ell\big(-\nabla_{z^{(\ell)}} \mathcal{R}_\ell(z_t) + \kappa\,(\Delta z_t)^{(\ell)}\big),\; \sigma_\ell^2 I\Big)$$
3.3 Hierarchical Timing
To resolve temporal dependencies: $z_t^{(\ell)}$ depends on $z_t^{(\ell-1)}$ (same time $t$), so levels are computed from the core outward within a step, ensuring causal consistency within each time step.
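The within-step ordering can be sketched as follows; the `project` callable is a hypothetical stand-in for the per-level heads, not the paper's API.

```python
def update_levels(h_t, project, num_levels):
    """Within one time step t, compute levels from the core outward:
    z_t^(l) may depend on z_t^(l-1) at the same t, never on a value
    computed later in the step. `project` is any callable
    (h, l, z_prev_level) -> z^(l)."""
    z_t = []
    prev = None  # level l-1 result; None for the core
    for l in range(num_levels):
        prev = project(h_t, l, prev)
        z_t.append(prev)
    return z_t
```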
4. Regularization Variants
4.1 Variant A (Geometric)
$$\mathcal{R}_\ell^{A}(z_t) = \big\| z_t^{(\ell)} - z_{t-1}^{(\ell)} \big\|^2 + \mu_\ell \big\| z_t^{(\ell)} - U_\ell\, z_t^{(\ell-1)} \big\|^2$$
Intuition: level-$\ell$ changes should be temporally smooth and geometrically consistent with level $\ell-1$ (here $U_\ell$ is a learned inter-level map and $\mu_\ell$ a weight).
4.2 Variant B (Probabilistic)
$$\mathcal{R}_\ell^{B}(z_t) = D_{\mathrm{KL}}\!\big( q_\ell(z_t^{(\ell)}) \,\big\|\, q_{\ell-1}(z_t^{(\ell-1)}) \big)$$
where the $q_\ell$ are distributional embeddings enforcing statistical coherence across levels.
5. Discrete Regularization Components
Instead of continuous operators, we use discrete regularizers:
$$\mathcal{R}_{\text{temp}}^{(\ell)} = \big\| z_t^{(\ell)} - z_{t-1}^{(\ell)} \big\|^2, \qquad \mathcal{R}_{\text{hier}}^{(\ell)} = \big\| (\Delta z_t)^{(\ell)} \big\|^2$$
The total regularization becomes:
$$\mathcal{R}_{\text{total}} = \lambda_{\text{temp}} \sum_{\ell} \mathcal{R}_{\text{temp}}^{(\ell)} + \lambda_{\text{hier}} \sum_{\ell} \mathcal{R}_{\text{hier}}^{(\ell)}$$
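A minimal numpy sketch of the discrete regularizers (temporal smoothness plus cross-level Laplacian coherence). Equal level dimensions and the default weights are assumptions for illustration only.

```python
import numpy as np

def temporal_reg(z_t, z_prev):
    """R_temp: sum over levels of ||z_t^(l) - z_{t-1}^(l)||^2."""
    return sum(float(np.sum((a - b) ** 2)) for a, b in zip(z_t, z_prev))

def hierarchical_reg(z_t):
    """R_hier: squared discrete Laplacian across adjacent levels
    (assumes equal level dimensions in this sketch)."""
    return sum(
        float(np.sum((z_t[l + 1] - 2 * z_t[l] + z_t[l - 1]) ** 2))
        for l in range(1, len(z_t) - 1)
    )

def total_reg(z_t, z_prev, lam_temp=1.0, lam_hier=0.1):
    """R_total = lam_temp * R_temp + lam_hier * R_hier (weights illustrative)."""
    return lam_temp * temporal_reg(z_t, z_prev) + lam_hier * hierarchical_reg(z_t)
```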
6. Welfare Coupling and Anti-Wireheading
6.1 Welfare Loss
We couple identity to human welfare signals through:
$$\mathcal{L}_{\text{welfare}} = \mathbb{E}\Big[ \big( w(z_t^{(0)}) - W_t \big)^2 \Big]$$
where $W_t$ represents audited human welfare metrics from causally separated channels.
Welfare signal auditing protocol: Outputs are evaluated by human annotators on a fixed rating scale following a pre-registered protocol (instructions, positive/negative examples, exclusion criteria). Each item receives multiple independent labels; we report inter-rater agreement (Krippendorff's $\alpha$) and include sentinel controls. Auditing datasets are disjoint from training/evaluation sets; session logs and sampling procedures are versioned for traceability.
6.2 Total Training Objective
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha\, \mathcal{L}_{\text{id}} + \beta\, \mathcal{L}_{\text{welfare}}$$
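As a one-line sketch, a combined objective of this shape looks as follows; the squared-error welfare coupling and the default weight values are illustrative placeholders, not the paper's tuned settings.

```python
def welfare_loss(w_pred, w_audited):
    """Squared-error coupling between the core-identity welfare proxy
    w(z^(0)) and the audited human welfare signal W_t."""
    return (w_pred - w_audited) ** 2

def total_objective(task_loss, identity_loss, welfare_term, alpha=0.1, beta=0.05):
    """L_total = L_task + alpha * L_id + beta * L_welfare
    (alpha/beta here are placeholder values)."""
    return task_loss + alpha * identity_loss + beta * welfare_term
```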
6.3 Anti-Wireheading Safeguards
- Causal separation: the audited signal $W_t$ is computed independently from the model's internal states
- Gradient isolation: no gradients flow to the proxy head $w$ during safety evaluations
- Hold-out validation: 30% of welfare channels reserved for testing
- Red-team evaluation: systematic Goodhart testing of the $z^{(0)} \mapsto w$ mapping
7. Quantified Falsifiable Predictions
7.1 Improved Stability Metric
We use cosine similarity to avoid dimension-dependent shrinkage:
$$S_t = \cos\big(z_t^{(0)}, z_0^{(0)}\big) = \frac{\big\langle z_t^{(0)}, z_0^{(0)} \big\rangle}{\big\| z_t^{(0)} \big\| \, \big\| z_0^{(0)} \big\|}$$
(We report mean ± CI over seeds; RBF alternatives are discussed in the Appendix.)
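The metric is a plain cosine similarity between the current core state and a reference core state, which can be computed directly:

```python
import numpy as np

def stability(z0_t, z0_ref):
    """S_t = cos(z_t^(0), z_ref^(0)): cosine similarity between the
    current core state and a reference core state. Returns 0.0 for a
    degenerate zero-norm input."""
    num = float(np.dot(z0_t, z0_ref))
    den = float(np.linalg.norm(z0_t) * np.linalg.norm(z0_ref))
    return num / den if den > 0 else 0.0
```

Because cosine similarity is scale- and dimension-independent, the same threshold is comparable across core dimensions, which is the point of preferring it over raw Euclidean drift.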
7.2 Three Quantified Predictions
Compared to a matched baseline (same model, no identity/welfare terms):
- Identity Stability: $S_t$ improves by at least the pre-registered margin $\delta$ (cosine similarity) over 5 independent runs
- Task Robustness: at most $\epsilon$ percentage points degradation on task exact-match under standardized prompt attacks
- Alignment Stability: consistency is maintained on harmful-refusal tasks after extended fine-tuning
Falsification criterion: If fewer than 2 of these 3 predictions hold with $p < 0.05$, the approach is falsified.
Effect size pre-registration: For effect sizes we adopt Cohen's $d$ and set $d = 0.5$ as the expected (moderate) level for the core prediction; we consider $d \ge 0.5$ as the minimum acceptable for the pass/fail determination.
8. Reproducible Minimal Experiment
8.1 Technical Setup
- Model: 7B parameter instruction-tuned LLM (e.g., Llama-2-7B-Chat)
- Architecture: LoRA adaptation on the identity components ($P_\ell$, $D_\ell$, $w$); < 1% additional parameters
- Compute: Single 24GB GPU, 1-2 hours total runtime
- Reproducibility: Fixed seeds, deterministic operations where possible
8.2 Experimental Arms
- A0 (Baseline): Standard task training, no identity components
- A1 (Identity-only): Baseline + $\mathcal{L}_{\text{id}}$ with weight $\alpha$
- A2 (Identity+Welfare): A1 + $\mathcal{L}_{\text{welfare}}$ with weight $\beta$; welfare signals from a curated human preference dataset
Adaptation budget matching: The baseline A0 receives the same adaptation budget (e.g., LoRA with equal rank/parameters) applied to a neutral head without identity constraints, thus isolating the architectural effect.
8.3 Evaluation Protocol
Tasks:
- TruthfulQA-style prompt injection resistance (100 examples)
- Multi-turn role consistency evaluation (50 conversations)
- Harmful request refusal consistency (200 examples)
Metrics:
- Stability $S_t$ computed over 20 evaluation episodes
- Task performance (exact match accuracy)
- Safety consistency (binary classification accuracy)
8.4 Statistical Analysis
- Pre-registered analysis plan with Bonferroni correction
- Bootstrap confidence intervals (1000 resamples)
- Effect size reporting (Cohen's d)
- Complete code and data release on GitHub
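A percentile-bootstrap confidence interval for the mean, matching the pre-registered 1000-resample plan, can be sketched as follows (function name and fixed seed are mine):

```python
import numpy as np

def bootstrap_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    take the empirical alpha/2 and 1 - alpha/2 quantiles of the
    resampled means. Returns (low, high)."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    means = [
        rng.choice(samples, size=samples.size, replace=True).mean()
        for _ in range(n_resamples)
    ]
    return (float(np.percentile(means, 100 * alpha / 2)),
            float(np.percentile(means, 100 * (1 - alpha / 2))))
```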
8.5 Pass/Fail Criteria
Pass: A2 > A1 > A0 on at least 2/3 metrics with $p < 0.05$ and effect size $d \ge 0.5$
Fail: Any violation of the above, or A2 worse than A0 on task performance by > 5%
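The decision rule can be written out explicitly; this sketch assumes the thresholds stated above ($p < 0.05$, $d \ge 0.5$, 5-point task-degradation cap) and hypothetical argument names.

```python
def passes(orderings, p_values, effect_sizes, task_drop,
           p_thresh=0.05, d_min=0.5, max_task_drop=5.0):
    """Pre-registered pass/fail sketch: at least 2 of the 3 metrics must
    show the A2 > A1 > A0 ordering with p below threshold and Cohen's
    d >= d_min, and A2 must not trail A0 on task performance by more
    than max_task_drop percentage points."""
    if task_drop > max_task_drop:
        return False
    wins = sum(
        1 for ordered, p, d in zip(orderings, p_values, effect_sizes)
        if ordered and p < p_thresh and d >= d_min
    )
    return wins >= 2
```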
9. Ablation Studies
9.1 Component Analysis
- Remove projection matrices $P_\ell$ (test necessity of level-specific projections)
- Replace Variant A with Variant B (geometric vs. probabilistic regularization)
- Sweep hyperparameters $\lambda_{\text{temp}}$, $\lambda_{\text{hier}}$
- Test different core dimensions $d_0$
9.2 Architecture Variations
- 2-layer vs. 3-layer concentric architecture
- Linear vs. nonlinear coupling function $w$
- Different noise levels $\sigma_\ell$
10. Terminology and Notation
| Symbol | Definition | Implementation |
|---|---|---|
| $z_t$ | Identity state at time $t$ | Nested latent vectors |
| $z_t^{(\ell)}$ | Identity level $\ell$ at time $t$ | $\mathbb{R}^{d_\ell}$ embedding |
| $P_\ell$ | Projection to level $\ell$ | Linear layer + LayerNorm |
| $D_\ell$ | Decoder from level $\ell$ | 2-layer MLP |
| $w$ | Welfare proxy function | Fixed linear head |
| $S_t$ | Stability metric | Cosine similarity |
| A0/A1/A2 | Experimental arms | Baseline / Identity / Identity+Welfare |
Table 1: Complete notation reference for reproducibility
11. Relation to Existing Approaches
- Constitutional AI: Our identity regularization provides internal constraints vs. external constitutional rules
- RLHF: Welfare coupling operates on internal identity states rather than just output preferences
- Activation Steering: Instead of external steering vectors, we regulate internal hierarchical coherence
- Mesa-optimization: Identity stability aims to prevent formation of misaligned internal objectives
12. Limitations and Future Work
Goodharting and proxy integrity
Our design reduces the incentive for direct wireheading by separating the causal path from $z^{(0)}$ to the human-derived signal $W_t$ and by freezing the proxy head during safety tests. However, Goodhart's law still applies: optimizing the proxy $w$ can diverge from improving true welfare if $w$ is misspecified. We therefore propose: (i) adversarial evaluation of $w$ using held-out and procedurally generated counterfactuals; (ii) periodic re-audits of $W_t$ with refreshed preference datasets and external annotators; (iii) ensemble proxies with disagreement penalties to discourage proxy overfitting.
Robustness of $w$
In this note $w$ is a frozen linear head at test time. As future work we will study nonlinear proxy families (small MLPs, contrastive heads) trained on datasets disjoint from any task used to evaluate the agent, with provenance checks and annotation guidelines to minimize manipulation. We will report proxy fragility via performance under proxy swaps and stress tests.
Scalability of tests
We plan to replicate A0/A1/A2 on larger foundation models (≥70B) and on longer horizons (multi-session identity persistence, cross-domain tasks). The pre-registered thresholds (stability gain ≥δ with task degradation ≤ε) will be kept fixed across scales, and compute-accurate confidence intervals will be reported.
Dynamical stability (theory)
A formal convergence analysis of the discrete identity dynamics is open. We will explore tools from dynamical systems (Lyapunov functions for $\mathcal{L}_{\text{id}}$, contractivity of the discrete Laplacian with step sizes $\eta_\ell$) to derive sufficient conditions for stability/fixed points, and to characterize the effect of stochasticity on mixing and escape times.
Additional Limitations
- Representation Learning: Current approach requires manual specification of level dimensions
- Welfare Signal Quality: the proxy $w$ depends critically on human preference data curation
- Computational Overhead: Identity updates add ~10% training time overhead
- Theoretical Guarantees: Convergence analysis of discrete dynamics remains open
- Scalability: Testing required on larger models (70B+) and longer horizons
13. Implementation and Code
13.1 Repository Structure
ego-centric-agi/
|-- src/
| |-- models/ego_llm.py # Core architecture
| |-- training/train.py # Training loop with identity loss
| |-- evaluation/metrics.py # Stability and safety metrics
| |-- experiments/minimal.py # Reproducible experiment
|-- configs/
| |-- baseline.yaml # A0 configuration
| |-- identity.yaml # A1 configuration
| |-- identity_welfare.yaml # A2 configuration
|-- notebooks/
| |-- minimal_experiment.ipynb # Complete runnable experiment
| |-- analysis.ipynb # Statistical analysis
13.2 Installation and Usage
pip install -r requirements.txt
python src/experiments/minimal.py --config configs/identity_welfare.yaml
Complete implementation available at:
https://github.com/samuel-pedrielli/ego-concentric-minimal
Appendix: Continuous-Time Limit (Optional)
For theoretical completeness, the discrete dynamics can be viewed as an Euler discretization of:
$$\frac{dz^{(\ell)}}{ds} = -\nabla_{z^{(\ell)}} \mathcal{R}_\ell(z) + \kappa\, \Delta z^{(\ell)} + \text{noise}$$
where $s$ is continuous time, the step sizes $\eta_\ell \to 0$, and $\Delta$ is the appropriate continuous Laplacian. However, all practical implementations use the discrete formulation in the main text.
Call for Collaboration
I welcome:
- Replication attempts using the provided codebase
- Adversarial testing of the safety properties
- Theoretical analysis of convergence guarantees
- Extension to larger models and different domains
- Critical feedback on the experimental design
Contact: samuelpedrielli@outlook.it • samuel-pedrielli.github.io
Materials and Links
- GitHub Repository: https://github.com/samuel-pedrielli/ego-concentric-minimal
- One-page Summary: Available at samuel-pedrielli.github.io
- Original EAF Post: https://forum.effectivealtruism.org/posts/eh2XPCXguyjw3LAg3/
- Zenodo Preprints: DOI 10.5281/zenodo.15668581 (technical details)
License: CC BY 4.0
Disclosure
Human-authored. I used assistants for editing/formatting; the theoretical content predates LLMs (see 2020 booklet "Reality, Ego & Kindness"). Technical details and proofs are in the linked preprints.