Q2 AI Benchmark Results: Pros Maintain Clear Lead

By Benjamin Wilson 🔸, johnbash, Metaculus @ 2025-10-28T05:13 (+18)

This is a linkpost to https://www.metaculus.com/notebooks/40456/q2-ai-benchmark-results/

Main Takeaways

Top Findings

Other Takeaways

Introduction

In the second quarter of 2025, we ran the fourth tournament in our AI Benchmarking Series, which aims to assess how the best AI forecasting bots compare to the best humans on real-world forecasting questions, like those found on Metaculus. Over the quarter, 54 bot-makers competed for $30,000 by forecasting 348 questions. Additionally, a team of ten Pro Forecasters forecasted 96 of those 348 questions to establish a top human benchmark. Questions spanned many topics, including technology, politics, economics, the environment, and society. They covered different types of outcomes: Binary (“Yes”/”No”), Numeric (e.g., “$4.6M”, “200 measles cases”, etc.), and Multiple Choice (e.g., “Max Verstappen”, “AAPL”, etc.). See a full overview of Metaculus’s AI forecasting benchmark at the tournament home page and resource page.

Methodology

In order to test how well bots are doing relative to each other and to humans, we set up one forecasting tournament for bots and one for Pros. The Pro forecasts were hidden from the bots so they could not be copied. Each tournament launched a series of questions that resolved at some point during the quarter and asked participants to assign probabilities to outcomes. We then used our established scoring rule to evaluate participants. During the analysis, we aggregate the predictions of groups of forecasters to create ‘teams’.

Data Collection:

Scoring: For a deeper understanding of scoring, we recommend reading our scoring FAQ. We primarily use spot peer scores and head-to-head scores in this analysis.
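As a rough illustration of the peer-score idea (the scoring FAQ is authoritative; the scaling constant and natural-log scoring below are our assumptions, and head-to-head scores make a similar pairwise comparison between two participants), a forecaster's peer score on a binary question is roughly their log score minus the average log score of the other forecasters, with the "spot" variant evaluating the standing forecasts at a single point in time. A minimal sketch:

```python
import numpy as np

def spot_peer_scores_binary(probs, outcome):
    """Illustrative spot peer scores for a single binary question.

    probs   -- each forecaster's probability of "Yes" at the spot time
    outcome -- 1 if the question resolved Yes, 0 if it resolved No

    Assumption: score_i = 100 * (forecaster i's log score minus the mean
    log score of the other forecasters), using the natural log of the
    probability assigned to the actual outcome. See the scoring FAQ for
    the exact definitions used on Metaculus.
    """
    probs = np.asarray(probs, dtype=float)
    log_scores = np.log(probs if outcome == 1 else 1.0 - probs)
    n = len(log_scores)
    mean_of_others = (log_scores.sum() - log_scores) / (n - 1)
    return 100.0 * (log_scores - mean_of_others)

# Example: three forecasters on a question that resolved Yes.
print(spot_peer_scores_binary([0.9, 0.6, 0.3], outcome=1))
```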

Team selection and aggregation: In the analyses below, we sometimes aggregate a group of forecasters to create a “team”. To aggregate predictions, we take the median for binary predictions, the normalized median for multiple choice predictions, and an average (aka mixture of distributions) for numeric predictions. This is the same as the default aggregation method for the Community Prediction shown on Metaculus, except the Community Prediction also weighs predictions by recency. Note that occasionally a member of a team misses a question, and thus is not included in the team aggregate prediction for that question. Below is how the compositions of the teams were decided:

Miscellaneous notes on Data Collection:

This analysis generally follows the methodology we laid out at the beginning of the tournament series, but now uses a simplified bot team selection algorithm we started using in Q1 (as described above).
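To make the team aggregation described above concrete, here is a minimal sketch (function names, equal weighting, and the shared-grid CDF representation for numeric questions are our assumptions; recall that the Community Prediction additionally weighs predictions by recency):

```python
import numpy as np

def aggregate_binary(probs):
    """Team prediction for a binary question: the median probability of "Yes"."""
    return float(np.median(probs))

def aggregate_multiple_choice(prob_vectors):
    """Team prediction for a multiple-choice question: the per-option median,
    renormalized so the options sum to 1 (the "normalized median")."""
    medians = np.median(np.asarray(prob_vectors, dtype=float), axis=0)
    return medians / medians.sum()

def aggregate_numeric(cdfs):
    """Team prediction for a numeric question: the average of the forecasters'
    CDFs, i.e. an equal-weight mixture of their predictive distributions
    (each CDF evaluated on the same grid of values)."""
    return np.mean(np.asarray(cdfs, dtype=float), axis=0)

# Example: three forecasters on one binary and one three-option question,
# and two forecasters on a numeric question with a four-point CDF grid.
print(aggregate_binary([0.62, 0.55, 0.70]))
print(aggregate_multiple_choice([[0.5, 0.3, 0.2],
                                 [0.4, 0.4, 0.2],
                                 [0.6, 0.2, 0.2]]))
print(aggregate_numeric([[0.0, 0.2, 0.7, 1.0],
                         [0.0, 0.3, 0.6, 1.0]]))
```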

How do LLMs Compare?

To compare the capabilities of different LLMs, Metaculus ran 42 bots, all with the same prompt. This prompt has stayed largely the same across all four quarterly benchmarks, with some minor updates in Q1 to support numeric and multiple-choice questions. See the “Links to Code and Data” section for links to the template bot code and prompts.

Using the results from the benchmark, we simulated a tournament where these bots competed only against each other (no other participants). This tournament took all the questions from the bot tournament and removed the forecasts of non-metac-bot participants. The bots’ spot peer scores and 95% CI against each other are shown below. The first graph shows the top 12 bots, while the second is zoomed out to show all participants.

 

Each bot’s name indicates which model and research provider were used. For instance, metac-gpt-4o+asknews uses GPT-4o as its model and AskNews as its research source. You can see the code for each model in the “Links to Code and Data” section of this piece.

Which Bot Strategy is Best?

Let's take a look at the bot-only tournament. This tournament pits all the bots (both bot-makers and in-house Metac Bots) against each other and does not include any humans. You can see the full leaderboard here. Since there are 96 bots, let's first focus on the bots that topped the leaderboard. Below are the average scores of the top 10 bots when ranked by the sum of their spot peer scores. We rank by sum of spot peer scores to reduce the noise caused by bots that got high averages by getting lucky on only a few questions.

The best bot was made by bot maker Panshul42. Panshul has very generously made his bot open-source. You can review his code and prompts in full here.

We interviewed Panshul about his bot and asked him to describe how it forecasts questions. The bot starts by running multiple research reports: one specifically looks for the outside view (historical rates), while another focuses on the inside view (situation-specific factors). Originally he ran these reports using Perplexity and DeepNews, but he later switched to a 6-7 step agentic approach that used Serper to run Google searches and BrightData.com to scrape HTML. He then runs five final predictions, using Sonnet 3.7 twice (later Sonnet 4), o4-mini twice, and o3 once. During development, Panshul mostly reviewed his bot’s outputs manually and used his judgment to improve things. He also benchmarked his bot’s predictions against the community prediction on Metaculus (e.g., see here and here), using ~35 questions each time.

The second-best bot was one of Metaculus’s template bots, metac-o3. metac-o3 used AskNews to gather news and OpenAI’s o3 model to forecast. Links to its code are included in the “Links to Code and Data” section. Here is a link to an example of its research and reasoning. For reference, here is the prompt that metac-o3 (and the other metac bots) used for binary questions:

You are a professional forecaster interviewing for a job.

Your interview question is:
{question_title}

Question background:
{paragraph_of_background_info_about_question}


This question's outcome will be determined by the specific criteria below. These criteria have not yet been satisfied:
{paragraph_defining_resolution_criteria}

{paragraph_defining_specifics_and_fine_print}


Your research assistant says:
{many_paragraphs_of_news_summary}

Today is {today}.

Before answering you write:
(a) The time left until the outcome to the question is known.
(b) The status quo outcome if nothing changed.
(c) A brief description of a scenario that results in a No outcome.
(d) A brief description of a scenario that results in a Yes outcome.

You write your rationale remembering that good forecasters put extra weight on the status quo outcome since the world changes slowly most of the time.

The last thing you write is your final answer as: "Probability: ZZ%", 0-100

Our template bots have reached the top 10 in all four tournaments over the last year: metac-gpt-4o (fka mf-bot-1) placed 4th in Q3 2024, metac-o1-preview (fka mf-bot-4) placed 6th in Q4 2024, metac-o1 placed 1st in Q1 2025, and metac-o3 placed 2nd in Q2 2025. metac-gpt-4o’s ranking has slid over time as better models enter the competition: 17th of 44 in Q4 2024, 44th of 45 in Q1 2025, and 82nd of 96 in Q2 2025.

Most participants did not use o3 in their forecasting during Q2 due to its cost (or at least, few requested it when we offered LLM credits).

From the above, we conclude that the most important factor for accurate forecasting is the base model used, with additional prompting and infrastructure providing marginal gains.

Here is the zoomed-out full leaderboard. Like the chart above, this graph is ordered by the sum of spot peer scores, which is used to determine rankings on the site (i.e., so the 1st bar shown took 1st place), and the y-axis shows the average spot peer score.

You’ll notice that some bots (e.g., goodheart_labs in 22nd, and slopcasting in 24th) have an average peer score that would have given them a rank in the top 5, except they didn’t forecast on enough questions (and consequently have much larger error bars). Of course, they may just have gotten lucky with their few forecasts, which is why we use the sum of peer scores to determine rankings: it incentivizes forecasting on more questions, which reduces our uncertainty (i.e., gives tighter error bars).

Are Bots Better than Human Pros?

To compare the bots to the Pros, we used only the 93 questions that were identical across the bot and Pro tournaments and that were not annulled. We then aggregated the forecasts of the “bot team” and the “Pro team” (see the methodology section above).

The bot team that was chosen is:

We then calculated the head-to-head scores of the two teams, along with 95% confidence intervals using a t distribution. The average bot team head-to-head score was -20.03 with a 95% confidence interval of [-28.63, -11.41] over 93 questions.

The negative score indicates that the bot team had lower accuracy than the Pro team. A one-sided t-test demonstrates that pros significantly outperformed bots (p = 0.00001).
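For readers who want to reproduce this kind of check, here is a minimal sketch of the t-based confidence interval and the one-sided test (the scores below are simulated placeholders, not the actual tournament data):

```python
import numpy as np
from scipy import stats

def summarize_head_to_head(scores, alpha=0.05):
    """Mean score, t-based 95% CI, and a one-sided one-sample t-test of the
    null hypothesis that the mean head-to-head score is >= 0."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    ci = stats.t.interval(1 - alpha, df=scores.size - 1,
                          loc=mean, scale=stats.sem(scores))
    # alternative="less": is the bot team's mean score significantly below 0?
    _, p_value = stats.ttest_1samp(scores, popmean=0.0, alternative="less")
    return mean, ci, p_value

# Example with simulated scores (the real analysis used 93 questions).
rng = np.random.default_rng(0)
print(summarize_head_to_head(rng.normal(loc=-20, scale=40, size=93)))
```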

Below is an unweighted histogram of the head-to-head scores of the bot team. 

The distribution of scores is generally symmetric, with a slight leftward lean and fat tails on both sides. Scores in Q2 were slightly more concentrated around 0, but generally fit the same shape as in previous quarters.

Additionally, each individual Pro beat every individual bot. We simulated a tournament where both pros and bots competed together as individuals on the 93 overlapping questions. Below is this leaderboard ordered by sum of spot peer scores. The top 19 participants are shown. Average spot peer scores are shown on the y-axis.

You’ll notice that if pros and bots competed together, all 10 pros would take 1st through 10th place (with metac-o3 getting 11th place). 

 

Binary vs Numeric vs Multiple Choice Questions

We also took a look at how bots did on each question type. When comparing the bot team and the Pro team on head-to-head scores, generally, bots did a lot worse at multiple-choice questions (score: -32.9) than at binary (score: -14.8) or numeric (score: -23.2). We also found that bots were worse at multiple-choice in Q1. Generally, the sample sizes in these comparisons are fairly small, so hold these conclusions lightly.

Here are the average head-to-head scores for the 55 binary questions:

And here are the average head-to-head scores for the 17 multiple-choice questions:

And here are the average head-to-head scores for the 21 numeric questions (we do not include confidence intervals because the low sample size and a failed Shapiro–Wilk test prevent us from validating the normality assumption):

Team Performance Over Quarters

Now, let's compare how the bot team has done relative to the Pro team over the last 4 quarters. Below is a graph of Pro team vs bot team head-to-head scores over 4 quarters with 95% confidence intervals.

The values in the graph above are:

Since the confidence intervals in most quarters do not overlap 0 (which would indicate equivalent performance), we can see that the Pro team handily beat the bot team in every quarter except Q4. However, we cannot discern any trend in whether bots are improving relative to Pros, because the confidence intervals overlap between quarters. For instance, the true head-to-head score of the bots (if we ran many more questions) might be -15 in every quarter. Unfortunately, we simply do not have enough questions to say which direction bots are trending in (despite having a large enough sample to show that Pros are better than bots).

One theory for the decrease in score between Q3/Q4 and Q1/Q2 is the introduction of multiple-choice and numeric questions. As noted above, these were difficult for bots, and Q3 and Q4 had only binary questions.

Another theory is that the Pro team has gotten better. They have had a while to practice together as a team, and practice improves performance. They also now have access to better AI tools that can speed up their own research. That said, the Pros we asked said that learning to work together as a team, and borrowing each other’s forecasting patterns, were more helpful than AI tooling.

Additionally, we have seen metac-gpt-4o go from 4th of 55 in Q3, to 17th of 44 in Q4, to 44th of 45 in Q1, to 82nd of 96 in Q2. Assuming that metac-gpt-4o can act as a control, this is evidence that bots have gotten better over the quarters. However, we have also seen metac-sonnet-3.5 go from 13th of 55 in Q3, to 42nd of 44 in Q4, to 34th of 45 in Q1, to 54th of 96 in Q2, which is a much noisier trend. Some factors to consider:

Therefore, we believe there is weak evidence that bots have gotten better on binary questions between quarters, and that they got worse in Q1/Q2 due to multiple-choice questions. At the same time, Pros have been improving, resulting in an increased gap between bots and Pros. But generally (and statistically speaking), we can’t be sure of much about how bots have changed over time.

Bot Maker Survey

At the end of Q2 2025, we sent a survey to the bot makers who won prizes; winners were required to complete it as part of receiving their prize. We received 19 completed surveys from botmakers, 7 of whom ranked in the top 10 (out of 96 bots in the tournament). Below are some insights from this survey.

Best practices of the best-performing bots

To compare the bots on an apples-to-apples basis on forecast accuracy, we divided the total score by coverage to arrive at what we call a coverage-adjusted score. From a methodological standpoint, this was necessary because a few bots (e.g., goodheart_labs and slopcasting) arrived late to the tournament but had very good scores in the time that they were present. The scores and official rankings of the botmakers are as follows (sorted by coverage-adjusted score):

We wanted to explore differences in average coverage-adjusted scores across a few practices whose effectiveness we were interested in testing. In the survey, we asked participants to check a box if they used code to “Take the median/mean/aggregate of multiple forecasts”, if they did “Significant testing via manual review of bot outputs (more than sanity checks)”, if they did “Testing via custom question creation/resolution”, and if they researched questions with “AskNews”. We then calculated the average score of those who said yes and of those who said no, took the difference, and computed the upper and lower bounds of a 95% confidence interval. Please note that the sample sizes here are small and do not meet the n > 30 rule of thumb often required for meaningful bounds, so treat these results with higher uncertainty.
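To make the arithmetic above concrete, here is a minimal sketch of the coverage adjustment described earlier and of the yes/no group comparison just described (the function names, the example numbers, and the Welch-style interval are our assumptions rather than the exact implementation used):

```python
import numpy as np
from scipy import stats

def coverage_adjusted_score(total_score, coverage):
    """Total tournament score divided by coverage (0 < coverage <= 1),
    putting late entrants on the same footing as full-time participants."""
    return total_score / coverage

def diff_of_means_ci(yes_scores, no_scores, alpha=0.05):
    """Difference in mean coverage-adjusted score ("yes" minus "no" group)
    with a Welch t-based 95% confidence interval."""
    a, b = np.asarray(yes_scores, dtype=float), np.asarray(no_scores, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    diff = a.mean() - b.mean()
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    margin = stats.t.ppf(1 - alpha / 2, df) * se
    return diff, (diff - margin, diff + margin)

# Example: a bot that scored 2,000 points while covering half the tournament
# gets a coverage-adjusted score of 4,000.
print(coverage_adjusted_score(2000, 0.5))
# Example with made-up scores for botmakers who did vs. did not use a practice.
print(diff_of_means_ci([4200, 3100, 5000, 2600], [1800, 2400, 900, 3000, 1500]))
```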

The largest positive effect came from testing one’s bot via the creation and resolution of custom questions. Botmakers who did this had coverage-adjusted scores that were, on average, 2,216 points higher than those of botmakers who did not engage in this practice (95% confidence interval: +912 to +3,519). However, after talking with one of the participants who checked this box, we realized that there might be ambiguity about what “Testing via custom question creation/resolution” means. We were interested in whether any bot makers created their own questions that are not hosted on Metaculus, which would include creating, resolving, and scoring LLM-generated questions. We think at least one participant may have interpreted it as “testing on questions outside of the AI benchmarking tournament with a custom set-up, even if using pre-existing questions on Metaculus”. We also know at least one participant interpreted it as intended; we are unsure about the others.

The second-largest positive effect came from aggregation, i.e., taking the median or mean of multiple forecasts rather than relying on a single LLM forecast. Botmakers who did this had coverage-adjusted scores that were, on average, 1,799 points higher (95% confidence interval: +1,017 to +2,582).

The third-largest positive effect came from doing significant testing via manual review of bot outputs (beyond sanity checks). Botmakers who did this had coverage-adjusted scores that were, on average, 1,041 points higher (95% confidence interval: -223 to +2,305). In his advice to bot makers, our winner Panshul42 suggests that in development, one should “Measure performance not just by accuracy metrics, but also by how well the LLM reasons.” A lot of the advice that bot makers shared (see the relevant section below) echoes this theme.

One practice that did not make a demonstrable difference for botmakers was the use of AskNews. This confirms what we found in earlier sections of this analysis regarding our in-house Metac Bots (i.e., that no “best search” option stood out statistically among AskNews, Exa, Perplexity, etc.). Anecdotally, however, two of the most dominant bots, Panshul42 and pgodzinai, used AskNews for research, while other high performers such as CumulativeBot, goodheart_labs, and lightningrod did not.

It is also notable that the top 3 botmakers were all either students or individual hobbyists: #1-ranked Panshul42 put in a total of 41-80 hours (coverage-adjusted score of 6,256), #2 pgodzinai 16-40 hours (coverage-adjusted score of 4,837), and #3 CumulativeBot 41-80 hours (coverage-adjusted score of 4,223). The next three participants were commercial entities: goodheart_labs (coverage-adjusted score of 4,107), manticAI (coverage-adjusted score of 3,793), and lightningrod (coverage-adjusted score of 3,667).

Time spent developing one’s bot had a fairly weak correlation of only 0.20 with coverage-adjusted tournament score. As part of the survey, botmakers selected a multiple-choice option for how many total hours (across all team members) had been put into developing their bot. We converted these estimates to hours using the midpoints of the ranges (e.g., 16-40 hours became 28 hours), treating 1 month as 160 work hours, and then looked at the correlation between time spent and tournament score. Below are the responses we got (note: one person responded with a custom “other” field); a sketch of this conversion follows the chart.

[Chart: “What is your best estimate for how many total hours (between all team members) have been put into developing your bot?” (19 responses)]
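Here is a minimal sketch of that midpoint conversion and the correlation calculation (bucket labels other than those mentioned above are illustrative assumptions, and the example rows reuse the figures quoted earlier for the top three botmakers):

```python
import numpy as np

# Midpoints (in hours) for the survey's time buckets; "1 month" is treated as
# 160 work hours, as described above. Buckets other than "16-40 hours",
# "41-80 hours", and "1 month" are illustrative assumptions.
MIDPOINTS = {
    "1-5 hours": 3,
    "6-15 hours": 10.5,
    "16-40 hours": 28,
    "41-80 hours": 60.5,
    "1 month": 160,
}

def correlation_with_score(time_buckets, scores):
    """Pearson correlation between estimated hours and coverage-adjusted score."""
    hours = np.array([MIDPOINTS[b] for b in time_buckets], dtype=float)
    return float(np.corrcoef(hours, np.asarray(scores, dtype=float))[0, 1])

# Example using the hours and coverage-adjusted scores quoted above for the
# top three botmakers (Panshul42, pgodzinai, CumulativeBot).
print(correlation_with_score(["41-80 hours", "16-40 hours", "41-80 hours"],
                             [6256, 4837, 4223]))
```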

LLM calls per question had a correlation of 0.40 with a bot’s coverage-adjusted score, which is typically considered weak to moderate. It makes intuitive sense that more LLM calls per question would help, since they give the bot more chances to research, engage in Fermi-style reasoning, aggregate multiple models, etc. However, the real driver of the mild correlation may be that more calls are needed to properly aggregate forecasts, which we found to produce some of the largest positive effects.

[Chart: “Your best estimate of the number of LLM calls per question?” (19 responses)]

Remember that these results use a relatively small sample size. Additionally, correlation does not necessarily mean causation. We suggest that people should hold trends and takeaways tentatively.

Other Survey Results

In the survey, we asked “Which LLM model(s) did you use to make your final prediction/answer?”. The results show that o3 was used the most. Nine participants used only a single forecasting model, and three of these used o3 as their only forecasting model. The ten others aggregated forecasts of multiple LLMs, often from multiple providers. 

In a similar vein, we asked “Which LLM model(s) did you use in supporting roles (i.e., not final predictions)?”. Again, o3 was used most often.

To explore trends in research providers, we asked, “How did your bot research questions?”. AskNews was the most common, with different versions of Perplexity being the second most common. 

And here are some other stats:

How did scaffolding do?

Over the last four quarters of AI benchmarking, we have been interested to know whether prompt engineering and other scaffolding (e.g., computer use tools, code execution, etc) help bots with forecasting, or if the base model is the most important aspect. 

One way to measure this is to compare the performance of bots to the Metaculus house bot that is most equivalent to them (i.e., used the same LLM model). Metaculus bots use a minimal amount of prompting and scaffolding. We asked each bot maker which LLMs they used for their final forecast and which LLMs they used in a supporting role (like research). 

Let's first look at our top bot, Panshul42, who performed well. He did much more scaffolding than the Metaculus bots and had a sum of spot peer scores of 5,899 in the tournament. He used a combination of o3 and o4-mini in supporting roles; o3, o4-mini, and Sonnet 4 for his final predictions; and a wide combination of sources for his research bots, including AskNews, Perplexity-sonar-reasoning-pro, search engines such as Google/Bing/DuckDuckGo, and static HTML scraping. Panshul42 beat the comparable Metaculus bot, metac-o3+asknews, which had a score of 5,131. However, Panshul42 did not beat metac-o3+asknews within a 95% confidence interval, and the error bars for average spot peer score overlap considerably (see the graphs comparing top bots in the previous section). So in this case, scaffolding may have an effect, but not one large enough to be clearly detectable.

We note that most participants who requested credits for the competition did not request OpenAI’s o3 model due to its cost at the time. Thus, it’s useful to run this comparison for other models as well.

In general, we did not see many bots beating their respective metac bot (at least where an apples-to-apples comparison could be made). One stands out, though. TomL2bot used Gemini-2.5-pro and achieved a tournament score of 3,288 (8th place), compared with metac-gemini-2-5-pro+asknews, which had a score of 1,535 (18th place). The error bars overlap somewhat: TomL2bot had an average spot peer score of 9.53 [95% CI: 5.15 to 13.91] versus 4.59 [95% CI: -0.41 to 9.6] for metac-gemini-2-5-pro+asknews. Tom is a former Metaculus employee who created the original prompts for the in-house Metac Bots. During development of TomL2bot, he did significant testing via manual review of bot outputs (more than sanity checks). He spent a total of 41-80 hours on bot development, capped his bot’s predictions at a maximum and minimum (to avoid large point losses from occasional AI overconfidence), and took the aggregate of multiple forecasts from his bot. He tried having his bot look for comparable Metaculus question predictions to reference when forecasting, but found this did not help. His bot was fairly simple and focused especially on engineering the final prompt.

There are a number of potential conclusions that could be drawn from this. Generally, it seems model quality has more of an effect on forecasting accuracy than scaffolding does, while scaffolding has benefits on the margin. Though these results could also mean that there are a lot more bad prompts and bad forecasting structures than there are good ones. We might just need to try more approaches until we find the prompts or methods that create a huge difference. We lean towards the first conclusion, but still have hope for innovations in areas that bots have not tried.

Advice from Bot Makers

Bot makers were asked, “What should other bot makers learn from your experience?”. Here is their advice:

Bot A: multiple context providers, make sure you can process the resolution criteria links that may not otherwise be indexed

Bot B: Add lots of sanity checks. Grok 3 Mini is pretty good.

Bot C: Our biggest loss was screwing up millions [units] on a CDF question, so make sure to robustly test different question types. I still think there is a lot of ground to be covered in different prompts, layering of models, combining forecasts, etc.

Bot D: Keep up with model releases

Bot E: 1. Definitely scrape any links referenced in the article - took us a few days to realize this, and it was definitely a major disadvantage during that time.  2. Our approach was very research-heavy on outcome-based RL for LLMs - we've posted research on this: https://arxiv.org/abs/2505.17989 and https://arxiv.org/abs/2502.05253

Bot F: More news helped my bot (I used AskNews and DeepNews). Using multiple runs of o3 was helpful (until it ran out of credits). I think it is promising to keep bots simple and use the best models.

Bot G: Failed: I tried to get Metaculus question predictions, but this didn't help.

Bot H: The Metaculus bot template is pretty good - info retrieval, reaping the benefits of aggregation (which I didn't do enough of), and LLM selection is more important than prompt engineering beyond it. This was a test run for me, as I joined late, so I don't think I have much to add. 

Bot I: The reasoning models have come a long way. I would say that getting a bot to do pretty good reasoning is not too difficult at this point. The main bottleneck is in getting all of the necessary information, accurately, to the bot so that it has the information needed to make a good forecast.

Bot J: Things that worked well: Paying up for the highest quality LLMs for evaluation. Using a number of them from different sources (Gemini, Claude, GPT, DeepSeek, Grok) and choosing the median. Requesting research from multiple news sources. Minor prompt tuning. Things that didn't work well: tuning the model mix based on questions with early resolution. Not putting enough time into prompt optimization and evaluation of the research quality. Promising directions: Fix the things that didn't work. Use more sophisticated model blending. Use models to curate the news. Have model do evaluation and use self improvement based on progress in the competition to date. Suggestions for others: Use of highest quality LLMs is key.

Bot K: What worked:
- o4-mini: Multi-run o4-mini easily beats single-run o1, which won the Q1 challenge. My direct comparison was o1 with one run versus o4-mini with 5 runs with the same prompts. My impression is that o1 with 1 run is equivalent to o4-mini with 2 or 3 runs.
- Metaculus bot template was a success. In my view it was a starting point for best practices and tested tools.
- Multi-scenario prompts.
- Same forecasting core for all question types.
- Integration of superforecaster and professional forecaster strategies.
- Manual review/tracking of bot responses in-tournament: diary, errors, outliers, similarity to community.
- In-tournament bot updates based on evidence.
- Scaling prompt complexity and structure to a smart-intern analogy.
- Late entry into the tournament instead of waiting for the next tournament.
- Recognition that scoring is relative, and that the bot community evolves during the tournament. 97% of binary question bot runs produced a technically valid forecast.
- Metaculus Bridgewater human tournament participation provided insight into the general Metaculus forecasting environment and its idiosyncrasies.

What didn’t work:
- Strong LLM with a single run is not a good idea, too volatile (though there were no failed runs).
- Median aggregation reduces forecast precision.
- Prompting the LLM for 1% precision was generally ignored. The LLMs like to forecast in 5% or 10% increments.
- High failure rate of individual runs (about 30%) for numeric and multiple choice questions, usually due to errors similar to this example: Values must be in strictly increasing order [type=value_error, input_value=[Percentile(value=27.9, p...e=25.3, percentile=0.9)], input_type=list]

Bot L: We are continuing to build out Adjacent, a real-money, play-money, forecasting, and LLM prediction platform aggregator combined with a news database. Anyone looking to build on top of it can reach out for access right now. We felt we should have spent more time on prompt engineering, and we think building an agentic wargame bot could be interesting for future tournaments. We also only used 3.5 and likely could have incorporated multiple models for various steps, and generally better models. I also think taking our database of Polymarket, Kalshi, Metaculus, etc. questions and fine-tuning something could be interesting.

Bot M: It is essential to use a good news source (in my case, AskNews) and model (in my case, GPT-4.1) rather than the prompt itself.

Bot N: I had tried to study which questions my bot did well on last quarter (binary questions where it predicted No) and which ones it didn't do well on (binary questions where it predicted Yes, and multiple choice). I then decided to use the same model for binary and numeric questions, but avoid submitting binary questions that the model predicted Yes on (>50%) to avoid the positive bias that was a major issue. In the end it wasn't a great idea to avoid submitting that subset, as the bot was simply out-scored by bots that answered them correctly. About two weeks in, I saw poor performance of the new LLM (o4-mini-high) on multiple choice questions, so I disabled forecasting on that question type. I think this may have been a mistake too, given that it seemed to have performed decently on the leaderboard. Also, I didn't do ensembling (only 1 LLM prediction per question), which probably didn't help. Essentially, if I had simply upgraded my model to something like o4-mini-high or better on all question types, not overthought it, and left ensembling on, it probably would've done much better.

Bot O: Nothing additional I can think of to add compared to Q1

Bot P: Sanitizing LLM output is hard. (In one of the first versions of the bot I tried to get the last LLM to assign weights to each prior reasoning and then tried to calculate a weighted sum, but the LLM rarely used the format I specified, and even then it was hard to do for someone without professional programming experience.) Including scoring and prediction best practices (overconfidence, long tails) as part of the prompt helped mitigate inaccurate forecasts, lack of research/data, and overconfidence due to, e.g., “wrong” research.

Bot Q: It's really important to experiment with different techniques to gain an intuition about what works and what doesn't. Build out an end-to-end sample and try applying what you've learnt and observe how it impacts performance. Measure performance not just by accuracy metrics, but also by how well the LLM reasons. Analyze failure cases, identifying if it was primarily due to lack of relevant information or reasoning issues (e.g., not paying attention to the resolution timeline). Generally, some reasoning issues can be fixed with prompting, but systematic ones need better models to be overcome.

Links to Code and Data

Below are links to code and data you may be interested in:

Future AI Benchmarking Tournaments

We are running more tournaments over the next year, like the one described in this analysis. See more at metaculus.com/aib. Email ben [at] metaculus [dot] com if you have any questions about this analysis, how to participate, or anything else.