$300 Fermi Model Competition

By Ozzie Gooen @ 2025-02-03T19:47 (+32)

Summary

Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team

Motivation

LLMs have recently made it significantly easier to make Fermi estimates. You can chat with most LLMs directly, or you can use custom tools like Squiggle AI. And yet, overall, few people have taken much advantage of this. 

We at QURI are launching a competition to encourage exploration.

What We’re Looking For

Our goal is to discover creative ways to use AI for Fermi estimation. We're more excited about novel approaches than exhaustively researched calculations. Rather than spending hours gathering statistics or building complex spreadsheets, we encourage you to experiment with quick, creative approaches.

The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry - a creative insight could win even if it takes just 20 minutes to develop.

Task

Create and submit an interesting Fermi estimate. Entries will be judged using Claude 3.5 Sonnet (with three runs averaged) based on four main criteria (see the rubric in the appendix):

  1. Surprise (40% of the score)
  2. Importance (20%)
  3. Robustness (20%)
  4. Model Quality (20%)

AI tools to generate said estimates aren’t required, but we expect them to help.

Submission Format

Post your entry as a comment to this post, containing:

  1. Model: The complete model content (text or link to accessible document)
  2. Summary: Brief explanation of why your estimate is interesting/novel, and any surprising results or insights discovered
  3. Technique: Brief explanation of what tools and techniques you used to create the estimate. If you primarily used one LLM or AI tool, the name of the tool is fine.

Examples

Our previous post on Squiggle AI discussed several interesting AI-generated models. You can also see many results on SquiggleHub and Guesstimate.
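
For a concrete sense of what a submission's "model content" can look like, here is a deliberately tiny sketch of the classic "piano tuners in Chicago" estimate, written here in Python rather than Squiggle. It is purely illustrative: the example, the Monte Carlo helper, and every input range are placeholders chosen by the editor, not an official template or a winning entry.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # Monte Carlo samples

def range_90(low, high, n=N):
    """Lognormal whose 5th/95th percentiles land roughly at (low, high)."""
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * 1.645)
    return rng.lognormal(mu, sigma, n)

# All inputs below are placeholder guesses.
population       = range_90(2.5e6, 3.0e6)  # people in Chicago
people_per_piano = range_90(50, 150)       # one piano per this many people
tunings_per_year = range_90(0.5, 2.0)      # tunings per piano per year
jobs_per_tuner   = range_90(500, 1500)     # tunings one tuner does per year

tuners = population / people_per_piano * tunings_per_year / jobs_per_tuner

print(f"median ~{np.median(tuners):.0f} tuners, "
      f"90% interval ~{np.percentile(tuners, 5):.0f}-{np.percentile(tuners, 95):.0f}")
```

An actual entry could just as easily be a Squiggle or Guesstimate model, or plain prose; the point is only to show the shape of a decomposition with explicit uncertainty.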

Important Notes

Support & Feedback

If you’d like feedback or want to discuss possible ideas, please reach out (via direct message or email)! We also have a QURI Discord for relevant discussion.


Appendix: Evaluation Rubric and Prompts

Rubric

Name | Judge | Percent of Score
Surprise | LLM | 40%
Importance | LLM | 20%
Robustness | LLM | 20%
Model Quality | LLM | 20%
Goodharting Penalty | QURI Team | Up to –100%*

*Penalties reduce total score
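
The post doesn't spell out the exact aggregation formula, but as a rough illustration of the arithmetic implied by the rubric, a weighted total might be computed along the lines of the Python sketch below. The three-run scores are invented, and treating the Goodharting penalty as a multiplicative reduction is an assumption here, not something the rubric specifies.

```python
# A rough illustration only: average each LLM-judged criterion over three
# runs, apply the rubric weights, then apply any Goodharting penalty.
# The run scores are invented; only the weights come from the rubric above,
# and the multiplicative penalty is an assumption.
weights = {"surprise": 0.40, "importance": 0.20,
           "robustness": 0.20, "model_quality": 0.20}

runs = [  # three Claude 3.5 Sonnet runs, 0-10 per criterion (made-up values)
    {"surprise": 7, "importance": 6, "robustness": 5, "model_quality": 6},
    {"surprise": 8, "importance": 6, "robustness": 4, "model_quality": 7},
    {"surprise": 7, "importance": 5, "robustness": 5, "model_quality": 6},
]

avg = {k: sum(r[k] for r in runs) / len(runs) for k in weights}
weighted = sum(weights[k] * avg[k] for k in weights)  # still on a 0-10 scale

goodhart_penalty = 0.0  # e.g. 0.25 would knock 25% off the final score
final_score = weighted * (1 - goodhart_penalty)

print(f"averaged criteria: {avg}")
print(f"final score: {final_score:.2f} / 10")
```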


Surprise

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of how surprising the key findings or conclusions of this model are to members of the rationalist and effective altruism communities. In your assessment, consider the following:

Please provide specific details or examples that illustrate the surprising aspects of the findings. Assign a rating from 0 to 10, where:

Judge on a curve, where a 5 represents the median expectation.


Topic Relevance

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the importance of the model's subject matter to the rationalist and effective altruism communities. In your evaluation, consider the following:

Assign a rating from 0 to 10, where:

Judge on a curve, where a 5 represents the median expectation.


Robustness

Prompt:
 

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of the robustness of the model's key findings. In your evaluation, consider the following factors:

Provide a detailed justification, citing specific aspects of the model that contribute to its robustness or lack thereof. Assign a rating from 0 to 10, where:

Judge on a curve, where a 5 represents the median expectation.


Model Quality

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of the model's quality, focusing on both its construction and presentation. Consider the following elements:

Please provide specific observations and examples to support your evaluation. Assign a rating from 0 to 10, where:

Judge on a curve, where a 5 represents the median expectation.

“Goodharting” Penalties

We’ll add penalties if it seems like a submission Goodharted on the above metrics: for example, if an entry used prompt injection or similar tactics to game the AI assessments, or if a model did well in these evaluations despite being incomprehensible to humans. These penalties, when they occur, will typically be between 10% and 40%, but may go higher in extreme situations. We’ll aim to choose a penalty that’s greater than whatever gains a submission received from these behaviors.


Denkenberger🔸 @ 2025-02-14T23:48 (+23)

A few of us at ALLFED (myself, @jamesmulhall, and others) have been thinking about response planning for essential (vital) workers in extreme pandemics. Our impression is that there's a reasonable chance we will not be prepared for an extreme pandemic if it happens, so we should have back-up plans in place to keep basic services functioning and prevent collapse. We think this is probably a neglected area that more people should be working on, and we're interested in whether others think this is likely to be a high-impact topic. We decided to compare it to a standard, evidence-backed intervention to protect the vital workforce that is already receiving EA funding: stockpiling of pandemic-proof PPE (P4E).

We asked Squiggle AI to create two cost-effectiveness analyses comparing stockpiling P4E with research and planning to rapidly scale up transmission-reducing interventions (e.g. UV) after an outbreak to keep essential workers safe. Since the additional costs of both interventions could be significantly lowered by influencing funding that governments have already allocated to stockpiling and response planning, we ran the model both with (linked here) and without (linked here) a message instructing it to consider only the costs of philanthropic funding.

Summary result:

Prompt:

Create a cost-effectiveness analysis comparing two interventions to keep US essential workers safe in a pandemic with extremely high transmissibility and fatality rates. Assess the interventions on the probability they are successful at preventing the collapse of civilization. Only include money spent before the pandemic happens as there will be plenty of money available for implementation after it starts.

1: Stockpiling elastomeric half mask respirators and PAPRs before the extreme pandemic.

2: Researching and planning to scale up transmission reduction interventions rapidly after the pandemic starts, including workplace adaptations, indoor air quality interventions (germicidal UV, in-room filtration, ventilation), isolation of workers in on-site housing, and contingency measures for providing basic needs if infrastructure fails.

Outputs:

- narrative and explanations of the logic behind all of the numbers used

- ranges of costs for the two options

- ranges of effectiveness for the two options

- cost-effectiveness for the two options

- mean and median ratios of cost effectiveness of planning vs stockpiling

- distribution plots of the cost effectiveness of planning vs stockpiling

Optional message:

Important: only account for philanthropic funding costs to make these interventions happen. Assume that governments already have pandemic preparedness funding allocated for stockpiles and response planning. This may reduce philanthropic costs if stockpiling interventions can redirect government purchases from disposable masks to more protective elastomeric respirators/PAPRs or if research and planning interventions can add their recommendations to existing government frameworks to prepare essential industries for disasters.
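
To give a sense of the structure such a comparison ends up with, here is a heavily simplified, stand-alone Monte Carlo sketch in Python. It is not the Squiggle AI model linked above; the intervention ranges, variable names, and probabilities are all invented placeholders, included only to show how the cost-effectiveness ratios and their distribution get computed.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # Monte Carlo samples

def range_90(low, high, n=N):
    """Lognormal whose 5th/95th percentiles land roughly at (low, high)."""
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * 1.645)
    return rng.lognormal(mu, sigma, n)

# Intervention 1: stockpiling elastomeric respirators / PAPRs.
# All ranges below are invented placeholders, not ALLFED's numbers.
stockpile_cost   = range_90(20e6, 200e6)   # pre-pandemic cost, USD
stockpile_effect = range_90(0.005, 0.05)   # added P(civilization avoids collapse)

# Intervention 2: research and planning to scale up transmission reduction fast.
planning_cost    = range_90(2e6, 20e6)
planning_effect  = range_90(0.002, 0.03)

# Cost-effectiveness as dollars per unit of collapse-probability reduction
# (lower is better).
ce_stockpile = stockpile_cost / stockpile_effect
ce_planning  = planning_cost / planning_effect

ratio = ce_stockpile / ce_planning  # >1 means planning looks more cost-effective
print(f"median ratio (stockpiling CE / planning CE): {np.median(ratio):.1f}")
print(f"mean ratio: {ratio.mean():.1f}")
print(f"P(planning more cost-effective than stockpiling): {(ratio > 1).mean():.0%}")
```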

Ozzie Gooen @ 2025-03-03T19:04 (+5)

Okay, the winner has been announced! It's dmartin80 on LessWrong. More here:

https://www.lesswrong.com/posts/AA8GJ7Qc6ndBtJxv7/usd300-fermi-model-competition?commentId=v58BLaEA7KzRi3kmS

Ozzie Gooen @ 2025-02-03T21:16 (+5)

You can also bet on how many participants this will get, here:
https://manifold.markets/OzzieGooen/number-of-applicants-for-the-300-fe

Arepo @ 2025-02-05T16:40 (+3)

Do you have examples of LLMs improving Fermi estimates? I've found it hard to get any kind of credences at all out of them, let alone convincing ones.

Ozzie Gooen @ 2025-02-05T17:58 (+4)

I find a lot of the challenge of making Fermi estimates is in creating early models / coming up with various ways to parameterize things. LLMs have been very good at this, in my opinion.

I wrote more in the "How good is it?" section of the Squiggle AI blog post.

https://forum.effectivealtruism.org/posts/jJ4pn3qvBopkEvGXb/introducing-squiggle-ai#How_Good_Is_It_
 

We don't yet have quantitative measures of output quality, partly due to the challenge of establishing ground-truth for cost-effectiveness estimates. However, we do have a variety of qualitative results.

Early Use

As the primary user, I (Ozzie) have seen dramatic improvements in efficiency - model creation time has dropped from 2-3 hours to 10-30 minutes. For quick gut-checks, I often find the raw AI outputs informative enough to use without editing.

Our three Squiggle workshops (around 20 total attendees) have shown encouraging results, with participants strongly preferring Squiggle AI over manual code writing. Early adoption has been modest but promising - in recent months, 30 users outside our team have run 168 workflows total.

Accuracy Considerations

As with most LLM systems, Squiggle AI tends toward overconfidence and may miss crucial factors. We recommend treating its outputs as starting points rather than definitive analyses. The tool works best for quick sanity checks and initial model drafts.

Current Limitations

Several technical constraints affect usage:

  • Code length soft-caps at 200 lines
  • Frequent workflow stalls from rate limits or API balance issues
  • Auto-generated documentation is decent but has gaps, particularly in outputting plots and diagrams

While slower and more expensive than single LLM queries, Squiggle AI provides more comprehensive and structured output, making it valuable for users who want detailed, adjustable, and documentable reasoning behind their estimates.

WilliamKiely @ 2025-02-14T21:03 (+2)

I'm now over 20 minutes in and haven't quite figured out what you're looking for. Just to dump my thoughts -- not necessarily looking for a response:

On the one hand it says "Our goal is to discover creative ways to use AI for Fermi estimation" but on the other hand it says "AI tools to generate said estimates aren’t required, but we expect them to help."

From the Evaluation Rubric, "model quality" is only 20%, so it seems like the primary goal is neither to create a good "model" (which I understand to mean a particular method for making a Fermi estimate on a particular question) nor to see if AI tools can be used to create such models.

The largest score (40%) is whether the *result* of the model that is created (i.e. the actual estimate that the model spits out with the numbers put into it) is surprising or not, with more surprising being better. But it's unclear to me if the estimate actually needs to be believed or not for it to be surprising. Extreme numbers could just mean that the output is bad or wrong and not that the output should be evidence of anything.

Ozzie Gooen @ 2025-02-14T21:10 (+2)

Thanks for the feedback!

We're just looking for a final Fermi model. You can use or not use AI to come up with this.

"Surprise" is important because that's arguably what makes a model interesting. As in, if you have a big model about the expected impact of AI, and then it tells you the answer you started out expecting, then arguably it's not an incredibly useful model.

The specific "Surprise" part of the rubric doesn't require the model to be great, but the other parts of the rubric do weigh that. So if you have a model that's very surprising but otherwise poor, it might do well on the "Surprise" measure but won't do well on the other measures, so on average it will get a mediocre score.

Note that there have been a few submissions on LessWrong so far; those might make things clearer:
https://www.lesswrong.com/posts/AA8GJ7Qc6ndBtJxv7/usd300-fermi-model-competition#comments

"On the one hand it says "Our goal is to discover creative ways to use AI for Fermi estimation" but on the other hand it says "AI tools to generate said estimates aren’t required, but we expect them to help."

-> We're not forcing people to use AI, in part because it would be difficult to verify. But I expect that many people will do so, so I still expect this to be interesting.

Ozzie Gooen @ 2025-02-11T21:03 (+2)

Reminder that this ends soon! Get your submissions in.