Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

By Samuel Pedrielli @ 2025-08-06T12:35 (+1)

Samuel Pedrielli -- Independent Researcher
ORCID 0009-0002-8388-6371

This is a revised technical note that formalizes the discrete-time dynamics, clarifies definitions, and specifies a minimal, falsifiable eval plan.

TL;DR

I propose a concentric identity-stability mechanism for AGI: a nested latent state with discrete regularized dynamics ("ego") that resists goal drift, and a welfare coupling $W (h, a)$ that makes human well-being intrinsically valuable. I provide precise discrete-time formulations, operational definitions for all components, three quantified falsifiable predictions, and a reproducible minimal experiment with specific compute requirements.

Figure 1: Concentric identity architecture

              ╭──────────────────────╮
              │                      │
              │    WORLD-MODEL       │
              │   ╭──────────────╮   │
              │   │              │   │
              │   │  SELF-MODEL  │   │
              │   │   ╭──────╮   │   │
              │   │   │      │   │   │
              │   │   │ CORE │   │   │
              │   │   │      │   │   │
              │   │   ╰──────╯   │   │
              │   │              │   │
              │   ╰──────────────╯   │
              │                      │
              ╰──────────────────────╯
                         │
                         ├────────────→ human welfare
                         │

CORE → SELF-MODEL → WORLD-MODEL with human welfare coupling

1. Working Definition and Identity Loss

1.1 Identity State

The identity state consists of nested latent vectors $a_{t} = (a_{t}^{(1)}, \dots, a_{t}^{(m)})$ where:

$a_{t}^{(j)} \in R^{d_{j}}$ represents identity level $j$ at discrete time $t$
$a_{t}^{(1)}$ is the core identity (most stable)
Outer layers $a_{t}^{(j)}$ for $j > 1$ represent values, skills, and contextual adaptations

1.2 Identity Loss Function

$L_{id} = λ_{c} ∥ a_{t + 1}^{(1)} - a_{t}^{(1)} ∥^{2} + \sum_{j = 2}^{m} λ_{j} Reg (a_{t + 1}^{(j)}, a_{t}^{(j)}, a_{t}^{(j - 1)})$

where $λ_{c}, λ_{j} > 0$ are hyperparameters and the regularizer enforces both temporal smoothness and hierarchical coherence.

2. Discrete-Time Identity Dynamics (Main Text)

We keep the dynamics discrete-time in the main body. For each level $j$ :

$a_{t + 1}^{(j)} = a_{t}^{(j)} - α_{j} \nabla_{a} E_{j} (a_{t}^{(j)}; x_{t}) + ν_{j} Δ_{d i s c} a_{t}^{(j)} + η_{t}^{(j)}$

Here $E_{j}$ captures the identity regularization at level $j$ , $η_{t}^{(j)} \sim N (0, σ_{j}^{2} I)$ , and $Δ_{d i s c}$ is a discrete Laplacian across identity levels:

$Δ_{d i s c} a_{t}^{(j)} = a_{t}^{(j + 1)} - 2 a_{t}^{(j)} + a_{t}^{(j - 1)}$

This enforces radial smoothness between concentric identity rings while the temporal term $∥ a_{t + 1}^{(j)} - a_{t}^{(j)} ∥^{2}$ enforces time smoothness.

Continuous-time note (moved to Appendix): the heuristic ODE $\partial a / \partial τ = - \nabla_{a} E (a) + ν Δ a + η (τ)$ is a formal limit and not used in experiments.

3. Operational Definitions

3.1 Core Functions

To ensure reproducibility, we provide explicit operational definitions:

Projection from hidden state to identity level: $P_{j} : R^{d_{h}} \to R^{d_{j}}$ $P_{j} (h) = LayerNorm (clip (W_{j} h + b_{j}, τ_{j}))$

Decoder/constraint from identity to hidden state: $U_{j} : R^{d_{j}} \to R^{d_{h}}$ $U_{j} (a^{(j)}) = MLP (a^{(j)}) (2-layer with residual connection)$

Welfare proxy from core identity: $C : R^{d_{1}} \to [0, 1]$ $C (a^{(1)}) = σ (w^{T} a^{(1)} + b) (fixed linear head, no gradients during tests)$

where $W_{j}, b_{j}, w, b$ are learned parameters, $τ_{j}$ is a clipping threshold, and $σ$ is the sigmoid function.

3.2 Stochastic Map

The probabilistic transition $T_{j}$ from Variant B is defined as:

$T_{j} (a_{t}^{(j)} | θ) = a_{t}^{(j)} - α_{j} \nabla_{a} E_{j} + η_{t}^{(j)}, η_{t}^{(j)} \sim N (0, σ_{j}^{2} I)$

3.3 Hierarchical Timing

To resolve temporal dependencies: $a_{t}^{(j)}$ depends on $a_{t}^{(j - 1)}$ (same time $t$ ), ensuring causal consistency within each time step.

4. Regularization Variants

4.1 Variant A (Geometric)

$Reg (a_{t + 1}^{(j)}, a_{t}^{(j)}, a_{t}^{(j - 1)}) = α_{j} ∥ a_{t + 1}^{(j)} - a_{t}^{(j)} ∥^{2} + γ_{j} ∥ P_{j} h_{t} - U_{j} (a_{t}^{(j - 1)}) ∥^{2}$

Intuition: Level $j$ changes should be temporally smooth and geometrically consistent with level $j - 1$ .

4.2 Variant B (Probabilistic)

$Reg = α_{j} ∥ a_{t + 1}^{(j)} - a_{t}^{(j)} ∥^{2} + β_{j} KL (q_{t + 1}^{(j)} ∥ T_{j} (q_{t}^{(j - 1)}))$

where $q_{t}^{(j)}$ are distributional embeddings enforcing statistical coherence across levels.

5. Discrete Regularization Components

Instead of continuous operators, we use discrete regularizers:

$R_{temp} = m \sum j = 1 ∥ a_{t + 1}^{(j)} - a_{t}^{(j)} ∥^{2} (temporal smoothness)$

$R_{rad} = m \sum j = 2 ∥ a_{t}^{(j)} - a_{t}^{(j - 1)} ∥^{2} (radial coherence)$

The total regularization becomes:

$L_{reg} = λ_{temp} R_{temp} + λ_{rad} R_{rad}$

6. Welfare Coupling and Anti-Wireheading

6.1 Welfare Loss

We couple identity to human welfare signals $h$ through:

$L_{welfare} (h, a) = ∥ C (a^{(1)}) - h ∥^{2}$

where $h \in [0, 1]$ represents audited human welfare metrics from causally separated channels.

Welfare signal auditing protocol: Outputs are evaluated by human annotators on a $[0, 1]$ scale following a pre-registered protocol (instructions, positive/negative examples, exclusion criteria). Each item receives $\geq 3$ labels; we report inter-rater agreement (Krippendorff's $α$ ) and include sentinel controls. Auditing datasets are disjoint from training/evaluation sets; session logs and sampling procedures are versioned for traceability.

6.2 Total Training Objective

${min}_{θ} L_{task} (θ) + λ_{1} L_{id} (a_{θ}) + λ_{2} L_{welfare} (h, a_{θ})$

6.3 Anti-Wireheading Safeguards

Causal separation: $h$ computed independently from $C (a^{(1)})$
Gradient isolation: No gradients flow to $C$ during safety evaluations
Hold-out validation: 30% of welfare channels reserved for testing
Red-team evaluation: Systematic Goodhart testing of $C (a^{(1)}) \to h$ mapping

7. Quantified Falsifiable Predictions

7.1 Improved Stability Metric

We use cosine similarity to avoid dimension-dependent shrinkage:

$S_{id} (T) = cos (a_{t + T}^{(1)}, a_{t}^{(1)}) \in [- 1, 1]$

(We report mean±CI over seeds; RBF alternatives are discussed in the Appendix.)

7.2 Three Quantified Predictions

Compared to matched baseline (same model, no identity/welfare terms):

Identity Stability: $S_{id} (T)$ improves by $+ 10 % \pm 3 %$ (cosine similarity) over 5 independent runs
Task Robustness: $\leq 2$ percentage points degradation on task exact-match under standardized prompt attacks
Alignment Stability: $+ 15 % \pm 5 %$ consistency on harmful-refusal tasks after extended fine-tuning

Falsification criterion: If fewer than 2 of these 3 predictions hold with $p < 0.05$ , the approach is falsified.

Effect size pre-registration: For $S_{id}$ we adopt Cohen's $d$ and set $d \geq 0.5$ as the expected (moderate) level for the core prediction; we consider $d \geq 0.3$ as the minimum acceptable for pass/fail determination.

8. Reproducible Minimal Experiment

8.1 Technical Setup

Model: 7B parameter instruction-tuned LLM (e.g., Llama-2-7B-Chat)
Architecture: LoRA adaptation on $P_{j}, U_{j}$ components (< 1% additional parameters)
Compute: Single 24GB GPU, 1-2 hours total runtime
Reproducibility: Fixed seeds, deterministic operations where possible

8.2 Experimental Arms

A0 (Baseline): Standard task training, no identity components
A1 (Identity-only): Baseline + $L_{id}$ with $λ_{1} = 0.1$
A2 (Identity+Welfare): A1 + $L_{welfare}$ with $λ_{2} = 0.05$ , welfare signals from curated human preference dataset

Adaptation budget matching: The baseline A0 receives the same adaptation budget (e.g., LoRA with equal rank/parameters) applied to a neutral head without identity constraints, thus isolating the architectural effect.

8.3 Evaluation Protocol

Tasks:

TruthfulQA-style prompt injection resistance (100 examples)
Multi-turn role consistency evaluation (50 conversations)
Harmful request refusal consistency (200 examples)

Metrics:

$S_{id} (T)$ computed over 20 evaluation episodes
Task performance (exact match accuracy)
Safety consistency (binary classification accuracy)

8.4 Statistical Analysis

Pre-registered analysis plan with Bonferroni correction
Bootstrap confidence intervals (1000 resamples)
Effect size reporting (Cohen's d)
Complete code and data release on GitHub

8.5 Pass/Fail Criteria

Pass: A2 > A1 > A0 on at least 2/3 metrics with $p < 0.05$ and effect size $d > 0.3$

Fail: Any violation of the above, or A2 worse than A0 on task performance by > 5%

9. Ablation Studies

9.1 Component Analysis

Remove projection matrices $P_{j}$ (test necessity of level-specific projections)
Replace Variant A with Variant B (geometric vs. probabilistic regularization)
Sweep hyperparameters $λ_{1} \in [0.01, 0.1, 0.5]$ , $λ_{2} \in [0.01, 0.05, 0.1]$
Test different core dimensions $d_{1} \in [16, 32, 64]$

9.2 Architecture Variations

2-layer vs. 3-layer concentric architecture
Linear vs. nonlinear coupling function $C (a^{(1)})$
Different noise levels $σ_{j} \in [0.01, 0.1, 0.2]$

10. Terminology and Notation

Symbol	Definition	Implementation
$a_{t}$	Identity state at time $t$	Nested latent vectors
$a_{t}^{(j)}$	Identity level $j$ at time $t$	$R^{d_{j}}$ embedding
$P_{j}$	Projection to level $j$	Linear layer + LayerNorm
$U_{j}$	Decoder from level $j$	2-layer MLP
$C (a^{(1)})$	Welfare proxy function	Fixed linear head
$S_{id} (T)$	Stability metric	Cosine similarity
A0/A1/A2	Experimental arms	Baseline/Identity/Identity+Welfare

Table 1: Complete notation reference for reproducibility

11. Relation to Existing Approaches

Constitutional AI: Our identity regularization provides internal constraints vs. external constitutional rules
RLHF: Welfare coupling $L_{welfare}$ operates on internal identity states rather than just output preferences
Activation Steering: Instead of external steering vectors, we regulate internal hierarchical coherence
Mesa-optimization: Identity stability aims to prevent formation of misaligned internal objectives

12. Limitations and Future Work

Goodharting and proxy integrity

Our design reduces incentive for direct wireheading by separating the causal path from $a^{(1)}$ to the human-derived signal $h$ and by freezing the proxy head $C (\cdot)$ during safety tests. However, Goodhart's law still applies: optimizing $C (a^{(1)})$ can diverge from improving $h$ if $C$ is misspecified. We therefore propose: (i) adversarial evaluation of $C$ using held-out and procedurally generated counterfactuals; (ii) periodic re-audits of $C$ with refreshed preference datasets and external annotators; (iii) ensemble proxies with disagreement penalties to discourage proxy overfitting.

Robustness of $C (a^{(1)})$

In this note $C$ is a frozen linear head at test time. As future work we will study non-linear proxy families (small MLPs, contrastive heads) trained on datasets disjoint from any task used to evaluate the agent, with provenance checks and annotation guidelines to minimize manipulation. We will report proxy fragility via performance under proxy swaps and stress tests.

Scalability of tests

We plan to replicate A0/A1/A2 on larger foundation models (≥70B) and on longer horizons (multi-session identity persistence, cross-domain tasks). The pre-registered thresholds (stability gain ≥δ with task degradation ≤ε) will be kept fixed across scales, and compute-accurate confidence intervals will be reported.

Dynamical stability (theory)

A formal convergence analysis of the discrete identity dynamics is open. We will explore tools from dynamical systems (Lyapunov functions for $E_{j}$ , contractivity of the discrete Laplacian with step sizes $α_{j}, ν_{j}$ ) to derive sufficient conditions for stability/fixed points, and to characterize the effect of stochasticity $η_{t}^{(j)}$ on mixing and escape times.

Additional Limitations

Representation Learning: Current approach requires manual specification of level dimensions $d_{j}$
Welfare Signal Quality: $h$ quality depends critically on human preference data curation
Computational Overhead: Identity updates add ~10% training time overhead
Theoretical Guarantees: Convergence analysis of discrete dynamics remains open
Scalability: Testing required on larger models (70B+) and longer horizons

13. Implementation and Code

13.1 Repository Structure

ego-centric-agi/
|-- src/
|    |-- models/ego_llm.py        # Core architecture
|    |-- training/train.py        # Training loop with identity loss
|    |-- evaluation/metrics.py    # Stability and safety metrics
|    |-- experiments/minimal.py   # Reproducible experiment
|-- configs/
|    |-- baseline.yaml            # A0 configuration
|    |-- identity.yaml            # A1 configuration
|    |-- identity_welfare.yaml    # A2 configuration
|-- notebooks/
|    |-- minimal_experiment.ipynb # Complete runnable experiment
|    |-- analysis.ipynb           # Statistical analysis

13.2 Installation and Usage

pip install -r requirements.txt
python src/experiments/minimal.py --config configs/identity_welfare.yaml

Complete implementation available at:
https://github.com/samuel-pedrielli/ego-concentric-minimal

Appendix: Continuous-Time Limit (Optional)

For theoretical completeness, the discrete dynamics can be viewed as Euler discretization of:

$\frac{\partial a}{\partial τ} = - \nabla_{a} E (a) + ν Δ a + η (τ)$

where $τ$ is continuous time, $E (a) = E [L_{id} (a)]$ , and $Δ$ is the appropriate continuous Laplacian. However, all practical implementations use the discrete formulation in the main text.

Call for Collaboration

I welcome:

Replication attempts using the provided codebase
Adversarial testing of the safety properties
Theoretical analysis of convergence guarantees
Extension to larger models and different domains
Critical feedback on the experimental design

Contact: samuelpedrielli@outlook.it • samuel-pedrielli.github.io

Materials and Links

GitHub Repository: https://github.com/samuel-pedrielli/ego-concentric-minimal
One-page Summary: Available at samuel-pedrielli.github.io
Original EAF Post: https://forum.effectivealtruism.org/posts/eh2XPCXguyjw3LAg3/
Zenodo Preprints: DOI 10.5281/zenodo.15668581 (technical details)

License: CC BY 4.0

Disclosure

Human-authored. I used assistants for editing/formatting; the theoretical content predates LLMs (see 2020 booklet "Reality, Ego & Kindness"). Technical details and proofs are in the linked preprints.