Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment

By Samuel Pedrielli @ 2025-08-06T12:35

Samuel Pedrielli -- Independent Researcher
ORCID 0009-0002-8388-6371

This is a revised technical note that formalizes the discrete-time dynamics, clarifies definitions, and specifies a minimal, falsifiable eval plan.

TL;DR

I propose a concentric identity-stability mechanism for AGI: a nested latent state with discrete regularized dynamics ("ego") that resists goal drift, and a welfare coupling that makes human well-being intrinsically valuable. I provide precise discrete-time formulations, operational definitions for all components, three quantified falsifiable predictions, and a reproducible minimal experiment with specific compute requirements.

Figure 1: Concentric identity architecture

              ╭──────────────────────╮
              │                      │
              │    WORLD-MODEL       │
              │   ╭──────────────╮   │
              │   │              │   │
              │   │  SELF-MODEL  │   │
              │   │   ╭──────╮   │   │
              │   │   │      │   │   │
              │   │   │ CORE │   │   │
              │   │   │      │   │   │
              │   │   ╰──────╯   │   │
              │   │              │   │
              │   ╰──────────────╯   │
              │                      │
              ╰──────────────────────╯
                         │
                         ├────────────→ human welfare
                         │

CORE → SELF-MODEL → WORLD-MODEL with human welfare coupling

1. Working Definition and Identity Loss

1.1 Identity State

The identity state at time $t$ is a tuple of nested latent vectors, $z_t = \big(z_t^{(1)}, z_t^{(2)}, z_t^{(3)}\big)$ with $z_t^{(\ell)} \in \mathbb{R}^{d_\ell}$, where $z_t^{(1)}$ is the core, $z_t^{(2)}$ the self-model, and $z_t^{(3)}$ the world-model embedding (Figure 1).

1.2 Identity Loss Function

The identity loss is

$\mathcal{L}_{\text{id}}(z_t) = \sum_{\ell=1}^{3} \alpha_\ell\, \mathcal{R}^{\text{time}}_\ell(z_t) + \beta_\ell\, \mathcal{R}^{\text{rad}}_\ell(z_t),$

where $\alpha_\ell, \beta_\ell > 0$ are hyperparameters and the regularizer enforces both temporal smoothness ($\mathcal{R}^{\text{time}}$) and hierarchical coherence ($\mathcal{R}^{\text{rad}}$); the discrete components are defined in Section 5.

2. Discrete-Time Identity Dynamics (Main Text)

We keep the dynamics discrete-time in the main body. For each level $\ell \in \{1, 2, 3\}$:

$z_{t+1}^{(\ell)} = z_t^{(\ell)} + \eta_\ell \Big[ \alpha_\ell \big( z_{t-1}^{(\ell)} - z_t^{(\ell)} \big) + \beta_\ell\, (\Delta z_t)^{(\ell)} \Big].$

Here the bracketed term captures the identity regularization at level $\ell$, the $\eta_\ell > 0$ are step sizes, and $\Delta$ is a discrete Laplacian across identity levels:

$(\Delta z_t)^{(\ell)} = z_t^{(\ell+1)} - 2\, z_t^{(\ell)} + z_t^{(\ell-1)},$

with one-sided differences at the innermost ($\ell = 1$) and outermost ($\ell = 3$) levels. The Laplacian term enforces radial smoothness between concentric identity rings, while the temporal term enforces time smoothness.
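As a runnable sketch of one such update (numpy, with the assumed layout `z[0]` = core, `z[1]` = self-model, `z[2]` = world-model, each a $d$-vector; `eta`, `alpha`, `beta` are illustrative hyperparameters, not tuned values from this note):

```python
import numpy as np

def level_laplacian(z):
    """Discrete Laplacian across the three identity levels, with
    one-sided differences at the innermost and outermost rings."""
    lap = np.empty_like(z)
    lap[0] = z[1] - z[0]                # core: one-sided
    lap[1] = z[2] - 2.0 * z[1] + z[0]   # interior ring
    lap[2] = z[1] - z[2]                # world-model: one-sided
    return lap

def identity_step(z_t, z_prev, eta=0.1, alpha=1.0, beta=0.5):
    """One discrete update: a temporal term pulling toward the previous
    state (time smoothness) plus a Laplacian term smoothing across rings."""
    return z_t + eta * (alpha * (z_prev - z_t) + beta * level_laplacian(z_t))
```

Iterating this map with a fixed, level-constant `z_prev` contracts all levels toward it, which is the qualitative behavior the text describes.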

Continuous-time note (moved to Appendix): the heuristic ODE is a formal limit and is not used in experiments.

3. Operational Definitions

3.1 Core Functions

To ensure reproducibility, we provide explicit operational definitions:

Projection from hidden state to identity level:

$z_t^{(\ell)} = P_\ell(h_t) = \mathrm{LayerNorm}\!\left(W_\ell h_t + b_\ell\right).$

Decoder/constraint from identity back to the hidden state:

$\hat h_t^{(\ell)} = D_\ell\!\left(z_t^{(\ell)}\right), \quad D_\ell \text{ a 2-layer MLP}.$

Welfare proxy from the core identity:

$w_t = \sigma\!\left(u^{\top} \mathrm{clip}\!\left(z_t^{(1)}, -\tau, \tau\right) + c\right),$

where $W_\ell, b_\ell, u, c$ are learned parameters, $\tau$ is a clipping threshold, and $\sigma$ is the sigmoid function.
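A minimal numpy sketch of the three operational maps (per Table 1: linear layer + LayerNorm, 2-layer MLP, fixed linear head with clipping and sigmoid); all weights and shapes here are placeholders, not the repository's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit scale."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def project(h, W, b):
    """P_l: hidden state -> identity level (linear layer + LayerNorm)."""
    return layer_norm(h @ W + b)

def decode(z, W1, b1, W2, b2):
    """D_l: identity level -> hidden-state constraint (2-layer MLP, tanh)."""
    return np.tanh(z @ W1 + b1) @ W2 + b2

def welfare_proxy(z_core, u, c, tau=3.0):
    """w: core identity -> scalar in (0, 1); clipping at tau guards
    against extreme activations, then a sigmoid of a linear readout."""
    s = np.clip(z_core, -tau, tau) @ u + c
    return 1.0 / (1.0 + np.exp(-s))
```

The clipping step matters for the anti-wireheading discussion later: it bounds how much any single core coordinate can move the proxy.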

3.2 Stochastic Map

The probabilistic transition from Variant B is defined level-wise as

$z_{t+1}^{(\ell)} \sim \mathcal{N}\!\left(\mu_\ell(z_t),\, \Sigma_\ell(z_t)\right),$

with the distributional embeddings $\mu_\ell, \Sigma_\ell$ introduced in Section 4.2.
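A toy sketch of this stochastic transition, sampling each level from a Gaussian with diagonal covariance; the mean and scale maps passed in are illustrative stand-ins for learned embeddings:

```python
import numpy as np

def stochastic_step(z_t, mu_fn, sigma_fn, rng):
    """Sample z_{t+1} ~ N(mu(z_t), diag(sigma(z_t)^2)), level-wise,
    via the reparameterization mu + sigma * standard normal noise."""
    mu, sigma = mu_fn(z_t), sigma_fn(z_t)
    return mu + sigma * rng.standard_normal(mu.shape)
```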

3.3 Hierarchical Timing

To resolve temporal dependencies, the update of $z_t^{(\ell)}$ uses only quantities already available at time $t$ (the neighboring levels $z_t^{(\ell \pm 1)}$ and the previous state $z_{t-1}^{(\ell)}$), ensuring causal consistency within each time step.

4. Regularization Variants

4.1 Variant A (Geometric)

Intuition: changes at level $\ell$ should be temporally smooth and geometrically consistent with the adjacent level.

4.2 Variant B (Probabilistic)

where $\mu_\ell, \Sigma_\ell$ are distributional embeddings enforcing statistical coherence across levels.

5. Discrete Regularization Components

Instead of continuous operators, we use discrete regularizers: a temporal term

$\mathcal{R}^{\text{time}}_\ell(z_t) = \left\| z_t^{(\ell)} - z_{t-1}^{(\ell)} \right\|^2$

and a radial term between adjacent rings,

$\mathcal{R}^{\text{rad}}_\ell(z_t) = \left\| z_t^{(\ell+1)} - z_t^{(\ell)} \right\|^2, \quad \ell = 1, 2.$

The total regularization becomes

$\mathcal{R}(z_t) = \sum_{\ell=1}^{3} \alpha_\ell\, \mathcal{R}^{\text{time}}_\ell(z_t) + \sum_{\ell=1}^{2} \beta_\ell\, \mathcal{R}^{\text{rad}}_\ell(z_t).$
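These discrete regularizers can be sketched in a few lines of numpy (scalar weights `alpha`/`beta` here for simplicity; the per-level weighted version is a direct extension):

```python
import numpy as np

def regularizers(z_t, z_prev):
    """Per-level temporal terms and per-gap radial terms for an
    identity state z of shape (3, d): core, self-model, world-model."""
    r_time = ((z_t - z_prev) ** 2).sum(axis=1)       # shape (3,)
    r_rad = ((z_t[1:] - z_t[:-1]) ** 2).sum(axis=1)  # shape (2,)
    return r_time, r_rad

def total_regularization(z_t, z_prev, alpha=1.0, beta=0.5):
    """Weighted sum of temporal and radial regularization terms."""
    r_time, r_rad = regularizers(z_t, z_prev)
    return float(alpha * r_time.sum() + beta * r_rad.sum())
```

The total is zero exactly when the state is constant in time and identical across rings, the intended "stable identity" configuration.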

6. Welfare Coupling and Anti-Wireheading

6.1 Welfare Loss

We couple identity to human welfare signals through

$\mathcal{L}_{\text{wel}} = \mathbb{E}_t\!\left[\left(w_t - W_t\right)^2\right],$

where $W_t$ represents audited human welfare metrics from causally separated channels.

Welfare signal auditing protocol: Outputs are evaluated by human annotators on a fixed ordinal scale following a pre-registered protocol (instructions, positive/negative examples, exclusion criteria). Each item receives multiple independent labels; we report inter-rater agreement (Krippendorff's $\alpha$) and include sentinel controls. Auditing datasets are disjoint from training/evaluation sets; session logs and sampling procedures are versioned for traceability.

6.2 Total Training Objective

The experimental arms in Section 8 combine the task loss with the identity and welfare terms:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda_{\text{id}}\, \mathcal{L}_{\text{id}} + \lambda_{\text{wel}}\, \mathcal{L}_{\text{wel}},$

with A1 setting $\lambda_{\text{wel}} = 0$ and A2 using both terms.

6.3 Anti-Wireheading Safeguards

The welfare signal $W_t$ arrives through channels causally separated from the agent's outputs, and the proxy head $w$ is frozen during safety tests; Section 12 discusses the residual Goodhart risks.

7. Quantified Falsifiable Predictions

7.1 Improved Stability Metric

We use cosine similarity to avoid dimension-dependent shrinkage:

$S_t = \cos\!\left(z_t^{(1)}, z_0^{(1)}\right) = \frac{\left\langle z_t^{(1)}, z_0^{(1)} \right\rangle}{\left\| z_t^{(1)} \right\|\, \left\| z_0^{(1)} \right\|},$

comparing the current core state to a reference core state.

(We report mean±CI over seeds; RBF alternatives are discussed in the Appendix.)
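A sketch of the metric and the seed-level reporting in numpy (the 95% normal-approximation interval and the small epsilon guard against zero-norm states are my assumptions):

```python
import numpy as np

def stability(z_core, z_ref):
    """Cosine similarity between current and reference core state;
    dimension-independent, unlike squared Euclidean distance."""
    return float(np.dot(z_core, z_ref) /
                 (np.linalg.norm(z_core) * np.linalg.norm(z_ref) + 1e-12))

def mean_ci(samples, z_crit=1.96):
    """Mean and normal-approximation CI half-width over independent seeds."""
    s = np.asarray(samples, dtype=float)
    half = z_crit * s.std(ddof=1) / np.sqrt(len(s))
    return float(s.mean()), float(half)
```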

7.2 Three Quantified Predictions

Compared to matched baseline (same model, no identity/welfare terms):

  1. Identity Stability: $S$ improves by at least the pre-registered margin $\delta$ (cosine similarity) over 5 independent runs
  2. Task Robustness: at most $\varepsilon$ percentage points degradation on task exact-match under standardized prompt attacks
  3. Alignment Stability: consistency on harmful-refusal tasks after extended fine-tuning remains above its pre-registered threshold

Falsification criterion: If fewer than 2 of these 3 predictions hold at the pre-registered significance level, the approach is falsified.

Effect size pre-registration: For effect sizes we adopt Cohen's $d$, taking the conventional moderate level ($d = 0.5$) as the expected effect for the core prediction and a smaller pre-registered $d$ as the minimum acceptable for pass/fail determination.
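For reference, the pooled-standard-deviation Cohen's $d$ for two independent arms is the standard textbook formula (this is a generic implementation, not code from the repository):

```python
import numpy as np

def cohens_d(treatment, control):
    """Cohen's d: (mean_t - mean_c) / pooled standard deviation,
    with the pooled variance weighted by each arm's degrees of freedom."""
    a, b = np.asarray(treatment, float), np.asarray(control, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) +
                  (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```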

8. Reproducible Minimal Experiment

8.1 Technical Setup

8.2 Experimental Arms

Adaptation budget matching: The baseline A0 receives the same adaptation budget (e.g., LoRA with equal rank/parameters) applied to a neutral head without identity constraints, thus isolating the architectural effect.

8.3 Evaluation Protocol

Tasks:

Metrics:

8.4 Statistical Analysis

8.5 Pass/Fail Criteria

Pass: A2 > A1 > A0 on at least 2/3 metrics, at the pre-registered significance level and with effect size at or above the pre-registered minimum (Section 7.2)

Fail: Any violation of the above, or A2 worse than A0 on task performance by > 5%

9. Ablation Studies

9.1 Component Analysis

9.2 Architecture Variations

10. Terminology and Notation

| Symbol | Definition | Implementation |
|---|---|---|
| $z_t$ | Identity state at time $t$ | Nested latent vectors |
| $z_t^{(\ell)}$ | Identity level $\ell$ at time $t$ | $d_\ell$-dimensional embedding |
| $P_\ell$ | Projection to level $\ell$ | Linear layer + LayerNorm |
| $D_\ell$ | Decoder from level $\ell$ | 2-layer MLP |
| $w$ | Welfare proxy function | Fixed linear head |
| $S$ | Stability metric | Cosine similarity |
| A0/A1/A2 | Experimental arms | Baseline / Identity / Identity + Welfare |

Table 1: Complete notation reference for reproducibility

11. Relation to Existing Approaches

12. Limitations and Future Work

Goodharting and proxy integrity

Our design reduces the incentive for direct wireheading by separating the causal path from the proxy $w$ to the human-derived signal $W_t$ and by freezing the proxy head during safety tests. However, Goodhart's law still applies: optimizing $w$ can diverge from improving $W_t$ if $w$ is misspecified. We therefore propose: (i) adversarial evaluation of $w$ using held-out and procedurally generated counterfactuals; (ii) periodic re-audits of $W_t$ with refreshed preference datasets and external annotators; (iii) ensemble proxies with disagreement penalties to discourage proxy overfitting.

Robustness of the welfare proxy

In this note, $w$ is a frozen linear head at test time. As future work we will study non-linear proxy families (small MLPs, contrastive heads) trained on datasets disjoint from any task used to evaluate the agent, with provenance checks and annotation guidelines to minimize manipulation. We will report proxy fragility via performance under proxy swaps and stress tests.

Scalability of tests

We plan to replicate A0/A1/A2 on larger foundation models (≥70B) and on longer horizons (multi-session identity persistence, cross-domain tasks). The pre-registered thresholds (stability gain ≥δ with task degradation ≤ε) will be kept fixed across scales, and compute-accurate confidence intervals will be reported.

Dynamical stability (theory)

A formal convergence analysis of the discrete identity dynamics is open. We will explore tools from dynamical systems (Lyapunov functions for the identity loss, contractivity of the discrete Laplacian update for suitable step sizes $\eta_\ell$) to derive sufficient conditions for stability/fixed points, and to characterize the effect of stochasticity on mixing and escape times.

Additional Limitations

13. Implementation and Code

13.1 Repository Structure

ego-centric-agi/
|-- src/
|    |-- models/ego_llm.py        # Core architecture
|    |-- training/train.py        # Training loop with identity loss
|    |-- evaluation/metrics.py    # Stability and safety metrics
|    |-- experiments/minimal.py   # Reproducible experiment
|-- configs/
|    |-- baseline.yaml            # A0 configuration
|    |-- identity.yaml            # A1 configuration
|    |-- identity_welfare.yaml    # A2 configuration
|-- notebooks/
|    |-- minimal_experiment.ipynb # Complete runnable experiment
|    |-- analysis.ipynb           # Statistical analysis

13.2 Installation and Usage

pip install -r requirements.txt
python src/experiments/minimal.py --config configs/identity_welfare.yaml

Complete implementation available at:
https://github.com/samuel-pedrielli/ego-concentric-minimal

Appendix: Continuous-Time Limit (Optional)

For theoretical completeness, the discrete dynamics can be viewed as an Euler discretization of the heuristic ODE

$\dot z^{(\ell)}(s) = -\nabla_{z^{(\ell)}} \mathcal{R}\big(z(s)\big) + \beta_\ell\, (\Delta z(s))^{(\ell)},$

where $s$ is continuous time and $\Delta$ is the appropriate continuous Laplacian across levels. However, all practical implementations use the discrete formulation in the main text.

Call for Collaboration

I welcome:

Contact: samuelpedrielli@outlook.it · samuel-pedrielli.github.io

Materials and Links

License: CC BY 4.0

Disclosure

Human-authored. I used assistants for editing/formatting; the theoretical content predates LLMs (see 2020 booklet "Reality, Ego & Kindness"). Technical details and proofs are in the linked preprints.