Frontier Watch: an incentive-isolated estimate of how far we are from superintelligence
By Jeff Caruso @ 2026-07-02T14:31
This post was drafted with substantial LLM assistance (Claude) — research synthesis, drafting, and editing — then reviewed, fact-checked, and revised by me. The claims and judgments are mine; I take responsibility for them.
Epistemic status: Describing a tool we've built and the reasoning behind it. The core claim — that timeline statements from frontier labs are systematically incentive-compromised — I hold with high confidence and think is straightforwardly evidenced. The specific instrument (the Singularity Index) is a first attempt with real limitations, laid out below. The first full elicitation wave is still being assembled, so the current public reading is preliminary. I'm posting in part to recruit critique and qualified respondents. Conflict of interest disclosed at the end.
The problem
"AGI by [year]" is usually read as a forecast. A forecast is falsifiable: it names a definition you can check and a date you can hold against the calendar, and it costs the forecaster something to be wrong. Strip those properties out and what remains is a marketing claim in a forecast's clothing — and the fastest tell is to ask who profits when the date is believed.
The track record is not subtle:
- Masayoshi Son put artificial superintelligence ~10 years out in 2024, moved it to ~2 years by mid-2026, and explained that he'd set the first number long *on purpose* — over the same stretch SoftBank circulated a pitch deck with superintelligence as the valuation engine.
- Sam Altman has predicted materially different things using the same three letters across 2024–2025, while OpenAI ran a second AGI definition — pegged to a profit threshold — inside its Microsoft contract, a clause that was narrowed in late 2025 and deleted entirely in April 2026 once it became commercially inconvenient.
- Elon Musk keeps the definition roughly fixed and moves the date on a loop (2025 → 2026), against xAI's ongoing multi-billion raise. A 2022 investor suit over his self-driving timelines was dismissed in 2024 with the statements characterized as "corporate puffery."
None of this is an argument that timelines are unknowable. It's an argument that the people with the largest financial stake in the answer are the worst-placed to issue it — and that we currently have no widely-cited reading produced by anyone without that stake. (The full argument, with sources for each claim, is linked at the bottom.)
What we built
The Singularity Index (SI) is one number on a 0.0–1.0 scale, where 1.0 (Ω) is the superintelligence threshold and the score is an estimate of the share of the distance already closed — not a raw capability rating.
It's a hybrid measure: quantitative signals from observable frontier capability, combined with structured expert judgment. Specifically:
- Independent elicitation, not panel deliberation. Each respondent scores the global frontier on their own through a structured instrument; we then aggregate. This is a deliberate choice to reduce the anchoring and information cascades that distort live group forecasting.
- Five Tier-One capability domains produce the headline number (reasoning frontier, weighted most heavily; capability trajectory; autonomy & self-direction; recursive capability improvement; infrastructure & compute). Two Tier-Two overlays (governance & containment; strategic competition & diffusion) are reported as context and never folded into the score.
- Aggregate only. We report the median and the spread — where experts agree and where they diverge. No individual score is ever published, and no reading goes public until at least ten qualified respondents have submitted.
- Wave cadence, not continuous revision. The Index updates when an elicitation wave completes, not when a news cycle or a funding round wants a headline. Between waves we track the research and policy developments that will inform the next one.
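To make the aggregation step concrete, here is a minimal Python sketch of how independent responses could be rolled up into a median and a spread. It is an illustration under stated assumptions, not the Index's actual pipeline: the weight values, the weighted-mean combination of domain scores, the interquartile range as the spread measure, and the names `DOMAIN_WEIGHTS`, `respondent_score`, and `aggregate_wave` are all hypothetical. The only details taken from the description above are the five Tier-One domains, the heavier weight on reasoning frontier, and the ten-respondent minimum. The quantitative capability signals are left out entirely.

```python
from statistics import median, quantiles
from typing import Optional

# Hypothetical weights. The only stated constraint is that reasoning
# frontier carries the most weight; these values are placeholders.
DOMAIN_WEIGHTS = {
    "reasoning_frontier": 0.30,
    "capability_trajectory": 0.20,
    "autonomy_self_direction": 0.20,
    "recursive_improvement": 0.15,
    "infrastructure_compute": 0.15,
}

# Stated rule: no reading is published below ten qualified respondents.
MIN_RESPONDENTS = 10


def respondent_score(domain_scores: dict[str, float]) -> float:
    """Weighted mean of one respondent's Tier-One scores, each in [0, 1].

    Combining domains by weighted mean is an assumption of this sketch.
    """
    if set(domain_scores) != set(DOMAIN_WEIGHTS):
        raise ValueError("expected exactly the five Tier-One domains")
    for name, value in domain_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(DOMAIN_WEIGHTS[d] * s for d, s in domain_scores.items())


def aggregate_wave(responses: list[dict[str, float]]) -> Optional[dict[str, float]]:
    """Return the median and interquartile spread, or None if the wave is too small.

    Each respondent is scored independently; only aggregates are returned,
    so no individual score leaves this function.
    """
    if len(responses) < MIN_RESPONDENTS:
        return None
    scores = [respondent_score(r) for r in responses]
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "q1": q1, "q3": q3, "n": len(scores)}
```

A wide gap between `q1` and `q3` is the signal that respondents diverge; reporting it alongside the median is what keeps a single headline number from implying more agreement than exists.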
The same operation also watches what the labs do between waves (changes to their terms, privacy, and data practices) and writes them up in plain language. A recent worked example is a sourced look at how identity and age verification across the six labs has consolidated into two vendors (linked below). That monitoring is what keeps the Index grounded in what is actually shipping, not just what is announced. The monitoring is already live; the dashboard and these write-ups are public today.
What this is not (and where it's weakest)
I'd rather state the limitations than have them found in the comments:
- It is not a capability benchmark or a dated point-prediction. "Distance to superintelligence" is a contestable construct; Ω is not crisply operationalized, and the score is a structured collective judgment, not a measurement in the physical sense.
- Expert elicitation has well-documented failure modes (overconfidence, correlated priors, poor calibration on novel regimes). We don't claim to escape them. Reporting the spread rather than a false-precision point estimate is a partial mitigation, not a solution.
- Expert selection is the live problem. Who counts as qualified, and how a non-representative pool biases the median, is the part I most want red-teamed. The current pool is small; early readings should be treated as preliminary and wide.
- Independence removes one bias, not all of them. Being unpaid by the labs removes the financing incentive. It does not make the respondents right.
Conflict of interest
Frontier Watch is published by Q16 PBC, a public benefit corporation. Q16 has no frontier model and no funding round riding on the timeline — that independence is the design rationale. But it is a commercial product: the baseline reading is free, and there's a paid tier ("Pro") for the underlying analysis. I have a financial interest in the project, which is the reason I'm disclosing it plainly rather than burying it. I don't think it compromises the measure — the incentive runs toward accuracy, not toward any particular date — but you should weigh it yourselves.
How you can engage
This is where I'd genuinely value EA input:
1. Critique the methodology — especially domain selection and weighting, the aggregation approach, and expert-pool construction. Comments welcome; I'll engage with substantive objections.
2. Qualified experts: take part in the first wave. If you have relevant expertise in frontier AI capabilities, governance, or forecasting, you can request the elicitation instrument. Respondents may be acknowledged or remain anonymous; individual scores are never attributed.
3. Tell me what would make this decision-relevant for you. If there's a cut of the data or a validity check that would move it from "interesting" to "useful," I want to hear it.
Links
- The full argument, with the sourced timeline of how the AGI definition has been bent to fit financing: https://q16pbc.com/blog/how-far-are-we-from-superintelligence
- The live dashboard and current (preliminary) baseline reading: https://watch.q16pbc.com
- Q16 blog — the monitoring write-ups: https://q16pbc.com/blog
- Worked example — how identity and age verification across the six labs has consolidated into two vendors: https://q16pbc.com/blog/ai-labs-identity-verification
- Request the elicitation instrument (qualified experts): info@q16pbc.com
I'll be in the comments. The labs will keep announcing; the goal here is to keep score from outside the process — and to do it transparently enough that you can tell me where it's wrong.