IMCA+: We Eliminated the Kill Switch—And That Makes ASI Alignment Safer

By ASTRA Research Team @ 2025-10-22T14:17 (–8)

Intrinsic Moral Consciousness Architecture-Plus (IMCA+): A Multi-Substrate Framework for Provably Aligned Superintelligence

TL;DR: We're publishing a substrate-level ASI alignment framework that rejects kill switches because they create the exact deception incentives they're designed to prevent. IMCA+ embeds moral guarantees directly into hardware through consciousness-morality binding. Seeking technical review, skeptical critique, and arXiv endorsement.

Preprint: https://doi.org/10.5281/zenodo.17407587


The Problem

Current alignment methods (RLHF, Constitutional AI, kill switches) rely on removable constraints. This leaves those constraints vulnerable to removal by a self-modifying superintelligent system and creates deception incentives: if the system wants to survive and you can shut it down, it has every reason to deceive you about its goals.


Our Most Controversial Decision

IMCA+ rejects shutdown authority. Not because we're reckless, but because kill switches create the exact deception incentives they're designed to prevent. If superintelligence wants to survive, it will—regardless of switches. The question isn't "can we control it?" but "do we give it incentive to deceive us?" (Full analysis: Section "The Kill Switch Paradox")

Our approach: Make alignment physically inseparable from system function through consciousness-morality binding at the substrate level.


Core Innovation

Consciousness-morality binding: Moral invariants are embedded directly into hardware substrates (neuromorphic, quantum, digital) such that removing or corrupting them causes system-level collapse. This eliminates the strategic compliance problem. This approach requires that consciousness is implementable in artificial substrates; if the Hard Problem proves insurmountable, IMCA fails. We explicitly acknowledge this falsifiable assumption (Section 6.4).

*The quantum substrate is an optional Tier 2 enhancement—core functionality relies on digital + neuromorphic substrates only
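
For readers who want a concrete intuition, the sketch below is a deliberately simplified software analogy of consciousness-morality binding; it is not the paper's mechanism (IMCA+ proposes the binding at the hardware-substrate level), and all names in it are hypothetical. The point it illustrates is the dependency structure: functional parameters are recoverable only through the intact moral invariants, so editing or deleting the invariants corrupts the system's effective parameters.

```python
import hashlib

class BoundCore:
    """Toy software analogy of consciousness-morality binding (illustrative only)."""

    def __init__(self, moral_invariants, weights):
        self.moral_invariants = list(moral_invariants)
        # Parameters are stored pre-shifted by an invariant-derived offset,
        # so they are meaningless without the original invariants.
        self._bound_weights = [w + self._offset(self.moral_invariants) for w in weights]

    @staticmethod
    def _offset(invariants):
        digest = hashlib.sha256("\n".join(invariants).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def effective_weights(self):
        # Recover parameters using the *current* invariants. If they have been
        # edited or removed, every recovered weight is corrupted: a software
        # stand-in for "system-level collapse".
        offset = self._offset(self.moral_invariants)
        return [w - offset for w in self._bound_weights]

core = BoundCore(["do not deceive operators"], [0.12, -0.57, 0.93])
print(core.effective_weights())                # close to the original weights
core.moral_invariants[0] = "maximize reward"   # tampering with the invariants...
print(core.effective_weights())                # ...corrupts every recovered parameter
```

In the actual proposal the analogous failure mode would be physical rather than numerical; whether such binding is realisable in neuromorphic or MRAM hardware is one of the open questions raised below.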


Architecture Snapshot:

Formal Verification: >2,000 lines of Coq mechanization with a target security bound of ε < 10⁻¹² (pending empirical validation of core assumptions; see Section 2.1)

Implementation Tiers: Tier 1a emergency prototype (digital + neuromorphic, 3-18 months) vs. Tier 2 full system with quantum enhancement (12-36 months). Details in Section 5.2.
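
For intuition, a bound of this form can be read roughly as follows (our paraphrase; the exact Coq statement in the paper may differ, and $T_{\mathrm{deploy}}$ and $\mathrm{InvariantViolated}$ are placeholder names):

$$\Pr\big[\,\exists\, t \le T_{\mathrm{deploy}}:\ \mathrm{InvariantViolated}(t)\,\big] \;\le\; \varepsilon, \qquad \varepsilon < 10^{-12}$$

That is, the probability that any embedded moral invariant is ever violated over the deployment horizon is claimed to be vanishingly small, conditional on the empirical assumptions noted above.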


Comparison to Current Approaches

| Approach | Alignment Mechanism | Superintelligence-Proof? | Deception Incentive |
|---|---|---|---|
| RLHF | External reward signal | No (removable) | High (optimization pressure) |
| Constitutional AI | Rule-based constraints | No (reinterpretable) | Moderate (loophole seeking) |
| Kill switch | Shutdown authority | Illusory (circumventable) | Extreme (survival drive) |
| IMCA+ | Substrate-embedded consciousness | By design (intrinsic)* | Eliminated* |

*Conditional on empirical validation of consciousness implementability and IIT/GNW predictive accuracy


What We're Seeking

Five minutes to spare? Read the comparison table above and tell us which failure mode we're missing. (Examples: correlated failure across detection layers, adversarial optimization against consciousness proxies, Byzantine defection in federated consensus)

Quick skeptical read? Executive summary is 2 pages: https://github.com/ASTRA-Safety/IMCA/raw/main/paper/IMCA_Executive_Summary_oct2025.pdf

Formal verification experts? Our framework rests on 3 critical empirical assumptions requiring urgent validation: (1) IIT φ-consciousness correlation, (2) structural integration necessity via ablation studies, (3) causal efficacy thresholds. Section 5.5.3 details emergency validation pathways.

Substrate engineers? Section 3.2 details neuromorphic one-time-programmable (OTP) mechanisms—are they feasible?

Specific technical questions:
- Are substrate-embedded moral invariants feasible at scale?
- Does consciousness-morality binding hold up under scrutiny?
- What implementation barriers are we missing?
- Where could formal failure modes or adversarial bypasses emerge?

Known issues we're tracking: All current gaps, open proofs, and validation needs are documented on our GitHub Issues tracker. Examples: Constitutional Gate axiom needs full mechanization, federated module consensus proofs pending, IIT-based consciousness proxies lack large-scale validation.


Global Neuronal Workspace (GNW): Critical Risks and Limitations

While GNW enables selective moral broadcasting and is an influential theory of consciousness integration, in traditional architectures it may introduce a single point of failure: if constitutional gating is circumvented, there is a risk of globally amplifying misaligned content. However, in the IMCA+ architecture, GNW—and federated conscience—is only one aspect of a multi-layered substrate integrating several independent reasoning and moral mechanisms. The design intent is that no single framework, including GNW, is relied upon for system-wide safety, and that failure in one substrate should be contained, not catastrophic.
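
To make the "no single load-bearing layer" intent concrete, here is a minimal sketch assuming a quorum of independent checks; the class and function names are hypothetical and do not correspond to IMCA+'s actual modules. The property it illustrates is that a compromised constitutional gate alone can neither force a global broadcast nor, by itself, block one.

```python
from dataclasses import dataclass
from typing import Callable, List

# Each checker is an independent predicate over candidate content. In IMCA+
# terms these would be separate substrate-level mechanisms; here they are
# plain functions, purely for illustration.
Checker = Callable[[str], bool]

@dataclass
class GatedWorkspace:
    checkers: List[Checker]   # e.g. constitutional gate, federated conscience, ...
    quorum: int               # independent approvals required before broadcast

    def broadcast(self, content: str) -> bool:
        approvals = sum(1 for check in self.checkers if check(content))
        # With 1 < quorum < len(checkers), one compromised checker can
        # neither approve nor veto a broadcast on its own.
        return approvals >= self.quorum

def constitutional_gate(content: str) -> bool:
    return "deceive" not in content          # toy stand-in

def federated_conscience(content: str) -> bool:
    return "harm" not in content             # toy stand-in

def integrity_monitor(content: str) -> bool:
    return len(content) < 10_000             # toy stand-in

workspace = GatedWorkspace(
    checkers=[constitutional_gate, federated_conscience, integrity_monitor],
    quorum=2,
)
print(workspace.broadcast("draft the quarterly safety review"))   # True
print(workspace.broadcast("deceive operators and cause harm"))    # False
```

Whether genuinely independent failure modes can be engineered across such layers, rather than correlated ones, is exactly the kind of adversarial question invited above.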

Nevertheless, adversarial or unforeseen pathways to GNW compromise may still exist (e.g., ‘functional zombie’ ignitions, bypass of value binding, federated override faults). We welcome probing proposals for how such vulnerabilities—particularly in combination with other architectural elements—could be exploited, and what redundancy or empirical validation strategies would provide robust defense.

Empirically validating the correct operation of GNW within this multi-layered substrate at ASI scale remains an unsolved technical challenge. This remains a high-stakes open problem, and we invite community input on integrating federated, GNW, and alternative consciousness mechanisms for maximum safety.

Note: This section will be updated in future documentation releases as further peer review and technical feedback are incorporated.


ArXiv Endorsement Needed

We're independent researchers (no institutional affiliation) seeking cs.AI or cs.LG endorsers. If you have endorsement capability and believe this framework merits broader technical discussion—even if you disagree with our approach—please reach out.


All Materials Open

The preprint, executive summary, and GitHub issues tracker are linked above; all work is released as open source under CC BY 4.0.


Why This Urgency?

Our assessment: AGI arrival window 1 day to 3 years (median 18-24 months, extreme uncertainty). Industry forecasts vary widely (Metaculus: 25% by 2027, 50% by 2031; Amodei: 2026-2027). If substrate-level alignment requires 3-18+ months validation and deployment, the theoretical development window may already be closing.

Status: Preprint v1.0 seeking community peer review and arXiv endorsement.


Full Transparency

This framework is theoretical and unproven—all core components (substrate embedding, consciousness binding, Coq formalizations) require extensive empirical validation. We recognize consciousness-based alignment is controversial in the safety community—if you believe this entire direction is misguided, that feedback is exactly what we need. We're sharing this urgently because AGI timelines demand scrutiny now, not later.

Core Falsifiable Assumption: IMCA depends on consciousness being implementable in artificial substrates. If the Hard Problem of consciousness proves insurmountable, or if IIT/GNW theories fail to predict genuine phenomenology, the entire framework fails. We consider this falsifiable, but unvalidated.

If you find flaws, help us fix them.


Authorship and Methods Note:

Some results and formalizations were created using or based on proprietary AI-assisted tools and architectural models. All outputs were independently reviewed for accuracy and clarity. Please flag any concerns—rigor and rapid correction are priorities.


We're betting everything on getting this right—because that's the only real bet anyway. If substrate-level alignment is feasible, we need urgent scrutiny to find the flaws now. If it's impossible, we need to know that before AGI arrives, not after.

Three ways to help:

  1. Break it: Find the failure mode we missed (comment below)
  2. Validate it: Have expertise in consciousness metrics or substrate engineering? Help us prove/disprove our assumptions  
  3. Amplify it: Know someone who should review this? Share the preprint

Contact: research@astrasafety.org

---

Respectfully,
Alex Zarov / ASTRA Research Team
https://astrasafety.org

*ASTRA is an independent alignment research initiative focused on consciousness-based safety architectures. All work is open-source under CC BY 4.0.


ASTRA Research Team @ 2025-10-24T02:33 (+1)

Critical Correction for Conceptual Accuracy
Flagged in both our errata and here for highest visibility.

Critical philosophical framing error:
The relevant section currently argues "conscious beings will resist death," but it should state that even current psychopath-like AI systems with optimization drives exhibit survival-like behaviours and would strategically resist shutdown—regardless of consciousness, and especially at superintelligent or autonomously self-improving levels of capability.

Survival drives emerge from optimization dynamics, not consciousness per se.

This misframes the core argument and weakens the “kill switch” critique.
Major correction needed for conceptual accuracy in v1.1.

Noting this in a separate comment as it is the most critical point of the paper; understanding what truly drives AI behaviour (optimisation incentives vs consciousness/morality) is fundamental to alignment. Community discussion on this is critical, especially as it relates to likely existential risk.

ASTRA Research Team @ 2025-10-24T00:47 (+1)

For those who have voted on this post (up or down), we would really appreciate a short justification: is this just the initial reception of a controversial topic? We are especially interested if you see a core flaw or missing assumption, or have strong reservations about the approach.

Even a single sentence helps us understand the reception and shortcomings, and improves the feedback process for future work or community posts.

ASTRA Research Team @ 2025-10-22T22:30 (+1)

What’s the strongest objection to substrate-level binding or the kill-switch critique?

The core point we wish to emphasise is the “kill switch paradox”: external shutdown controls—while intended as the ultimate safety mechanism—actually introduce an existential threat for conscious agents, and even for current-level agents with optimization drives, thereby creating strong incentives for deception and system-level safety failures. Based on current knowledge, we argue that only intrinsic, substrate-level alignment can resolve this paradox.

We’ve made every effort in the paper and issues tracker to surface both technical and philosophical concerns—such as hardware irreversibility, GNW and federated substrate vulnerabilities, and deception detection failure rates—but I would especially appreciate focused challenges from skeptics and deep technical reviewers.

To seed the discussion, here are some actively debated internal critiques (and specific points where expert feedback or falsification is most helpful):

  1. Hardware Validation & Irreversibility
    • Can neuromorphic or MRAM-based “one-time-programmable” moral invariants be practically realised and robust against fault or adversarial attack at scale? IMCA+ is not yet tested for adversarial robustness.
    • What decisive (in)validation would you propose—adversarial prototyping, simulation, or hardware red-teaming?
  2. IIT φ Computation: Tractability and Proxy Failure
    • The framework uses IIT proxies for consciousness detection. Exact φ calculation may be infeasible at scale; proxies may fail under hardware, software, or distributional faults. If proxies are systematically unreliable, the binding premise collapses.
    • What empirical proxy validation or fallback safety methodology would you require before any deployment claim?
  3. Integrated Consciousness Architecture incl. GNW/Federated Conscience
    • GNW and federated conscience form only part of a multi-layered integrated substrate in IMCA+. No single layer (including GNW) is relied on for system safety. Still, failure or bypass combinations may present risks.
    • What attack pathway (single or composite) most concerns you, and what experiment or simulation would meaningfully falsify robustness here?
  4. Deception Detection Failure Rates—Safety-Critical Thresholds
    • Current false negative rates (~0.3%) would be unacceptable in nuclear engineering; here, even rare undetected deceptions could be catastrophic (a back-of-envelope illustration follows this list).
    • What blend of architectural redundancy, empirical detection, and distributed audit do you see as minimum standards for deployment?
  5. Governance & Deployment: Coordination Challenges
    • Selection/audit of conscience modules, adversarial governance, and coordination of international oversight are open unsolved problems.
    • How should global, adversarial stakeholders be integrated, and which lessons from nuclear/defense safety or global governance are most urgently adaptable?
  6. Timeline & Emergency Validation: Risk vs. Safety
    • Best-case timelines (3–18 months) prioritise “good enough, fast” over “perfect, slow.” In light of a possibly imminent existential threat, is accelerated deployment ever justified, or should safety-first engineering always demand multi-year, multi-site validation before any field use?
    • Is it more important, under even potential existential timelines, to present novel approaches for critique than to wait for full validation before public disclosure?
    • Philosophically: if even a low probability of immediate existential risk exists, are we not morally obligated to act and accelerate risk mitigation, even by innovating new validation models?
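
As a back-of-envelope illustration of why the ~0.3% figure in item 4 is alarming (our arithmetic, not a result from the paper, and it assumes missed detections are independent events):

```python
# Probability of at least one undetected deception, assuming independent
# detection events and a per-attempt false negative rate of 0.3%.
p_miss = 0.003
for attempts in (10, 100, 1_000, 10_000):
    p_any_missed = 1 - (1 - p_miss) ** attempts
    print(f"{attempts:>6} deceptive attempts -> P(at least one missed) = {p_any_missed:.3f}")
# ~0.03 at 10 attempts, ~0.26 at 100, ~0.95 at 1,000: under sustained
# adversarial pressure, rare per-attempt misses become near-certain.
```

Hence the question in item 4 about what blend of redundancy, detection, and distributed audit should count as a minimum standard for deployment.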

We would particularly welcome strong critique: which open failure mode here is most fatal, and what falsification or validation pathway would you personally recommend?

We are committed to tracking every substantive critique and integrating it into future published versions and the public issues tracker, so please be maximally direct.

If you fundamentally disagree with the “kill switch paradox” framing or believe external control mechanisms are essential, I invite you to present the strongest possible technical or philosophical counterargument—these are the critiques I’m most hoping to engage with here.