My Current Claims and Cruxes on LLM Forecasting & Epistemics

By Ozzie Gooen @ 2024-06-26T00:40 (+46)

I think that recent improvements in LLMs have brought us to the point where LLM epistemic systems are starting to be useful. After spending some time thinking about it, I've realized that such systems, broadly, seem very promising to me as an effective altruist intervention area. However, I think that our community has yet to do a solid job outlining what this area could look like or figuring out its key uncertainties.

This document presents a rough brainstorm on these topics. While I could dedicate months to refining these ideas, I've chosen to share these preliminary notes now to spark discussion. If you find the style too terse, feel free to use an LLM to rewrite it in a format you prefer.

I believe my vision for this area is more ambitious and far-reaching (i.e., not narrowly focused on one kind of forecasting) than what I've observed in other discussions. I'm particularly excited about AI-heavy epistemic improvements, which I believe have greater potential than traditional forecasting innovations.

I'm trying to figure out what to make of this regarding our future plans at QURI, and I recommend that other organizations in the space consider similar updates.

Imaginary sketch of AI epistemic infrastructure, Dall-E 3

Key Definitions:

  1. LEPs (LLM-based Epistemic Processes): systems that use LLMs to automate parts of forecasting and other epistemic work.
  2. Scaffolding: the software infrastructure built around LLM calls, e.g., databases, pipelines, and custom applications.

1. High-Level Benefits & Uses

Claim 1: If humans could forecast much better, they would make fewer foreseeable mistakes. This covers many kinds of mistakes, particularly ones we might be worried about now.

Claim 2: Highly intelligent / epistemically capable organizations are likely to be better at coordination. 

Claim 3: The harm caused by altruistic development of epistemic improvements could be capped by a self-enforcing loop.

Claim 4: Better forecasting would lead to more experimentation and innovation (“weirdness”).

Claim 5: AI forecasting/epistemics will be much more deeply integrated into the world than current prediction markets.

Claim 6: Even "straightforward" AI forecasting systems should look very different from human judgemental forecasting systems.

2. Failure Modes

  1. Nonprofit efforts in the space are not effective
    1. Claim: The most likely failure mode for projects in this area is that they will fail to make an impact.
      1. If LLM abilities dramatically outpace agent / scaffolding architectures, then the only useful work here might be done by large LLM labs.
      2. There will likely be commercial entities taking care of the most obvious, low-hanging market opportunities, and it's not clear how this area will change over time. It's possible that nonprofit / EA efforts here would have trouble finding a strong fit.
  2. Forecasting could be used for net-bad activities
    1. AI Acceleration
      It’s possible that epistemic improvements would be used by AI capabilities actors to increase capabilities without correspondingly improving their other decisions.

      Claim: I think this is unlikely to be the net outcome. It would require these organizations to become smarter about making AIs, but not much smarter about making them safe, even for their own benefit. It would also require that the other groups that could block them fail to become correspondingly smarter - these groups include non-LLM companies and other highly motivated actors who wouldn't want AI to go poorly.

      One crux here is whether AI developers will be much better at integrating AI epistemic improvements into their internal workflows, while these systems fail to convince or help them with AI alignment. This could be a really bad outcome, so it's important to keep paying attention to it.
    2. Malicious Governments
      Governments like those of China or North Korea could use better epistemic systems to better control their populations. My guess is that great epistemic systems will likely do some damage here. I think that overall the improvements (like improving Western governments) will more than outweigh that, but this is something to pay attention to.

      Overall, I don’t see reasons to think that better epistemics will dramatically change existing power balances in ways that are predictable yet. I think this is true for most technologies I could imagine, even many of the most beneficial ones.
    3. Malicious Corporations 
      There are some situations where companies could be far better at forecasting than consumers, leading to a large power imbalance. If companies have much more information about customers than those customers have about themselves or about the companies, customers could be taken advantage of. It seems important for governments (hopefully with better epistemic infrastructure themselves) to take active steps against this, and to work to make sure that epistemic advances get distributed equitably.
  3. Intelligent systems lead to more complexity, leading to a less stable world
    Arguably, as humans become more intelligent, human civilization becomes more complex. Complexity has a cost, and at some point these costs can become detrimental.

    As you might expect, I think this is a predictable disaster, meaning that better prediction abilities would help here. I’d expect agents to be able to take on complexity one step at a time and, for each step, check whether the benefits are worth the costs. In addition, forecasting will make many things simpler and more manageable. See sources like Seeing Like a State to understand ways that power structures actively force systems to become simpler - sometimes even too much so.
  4. There's too much confusion over the terminology and boundaries, so research is stunted
    "Epistemics" is a poorly understood term. "Judgemental Forecasting" is still very niche. The term "forecasting" is arguably too generic. It seems very possible that this area will struggle in finding agreed-upon and popular framings and terminology, which could contribute to it be neglected. I think more terminological and clarification work here could be valuable. This is similar to the idea of Distillation and Research Debt. 

 

3. Viability

Claim 1: All/Almost all parts of a conventional epistemic process can be automated with LLMs.

LLMs can:

  1. Generate forecasting questions
  2. Resolve forecasting questions
  3. Gather data from computers
  4. Write specific questions to experts, a la interviews
  5. Make forecasts
  6. Make visualizations and dashboards of forecasts
  7. Personalize information for specific people

Note: “Can be automated” doesn’t mean “easy to automate in a cost-effective or high-quality way.”
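
As a rough sketch of what one of these steps might look like (using question generation as the example), here's a minimal version assuming the OpenAI Python SDK; the model name, prompt, and parsing are all illustrative, and any LLM API would work:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_forecasting_questions(topic: str, n: int = 5) -> list[str]:
    """Ask an LLM to draft resolvable YES/NO forecasting questions."""
    prompt = (
        f"Write {n} forecasting questions about '{topic}'. "
        "Each must be unambiguous, include a resolution date, and "
        "resolve YES or NO. Return one question per line, no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Each of the other steps could be wrapped similarly; per the note above, the hard part is doing this cost-effectively and at high quality, not doing it at all.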
 

Claim 2: A lot of communication is low-entropy, especially with knowledge of the author.

Claim 3: LLM-epistemics should be easier to study and optimize than human-epistemics.

They can be:

Evidence:

Subclaim: Even if we just wanted to understand ideal human epistemics better, one reasonable method might be to make strong LLM systems and study them, because these are much easier to study, and will likely lead to some generalizable findings.

 

Claim 4: LLM epistemic integration will be gradual.
Practically speaking, LLMs and humans will work together side-by-side in epistemic processes for some time. I expect LLM epistemics to follow a trajectory similar to that of self-driving cars. First, it starts with light augmentation. Then it moves to “fully automated in most use cases, but occasionally needs oversight.” As with autonomous driving, Level 5 is very difficult.

Subclaim: I expect that LLM-based forecasters will gradually dominate open platforms like Manifold and Metaculus. As this happens, it might scare off many of the human forecasters. If this is not handled gracefully, it could lead to these platforms feeling like ghost towns and becoming irrelevant, in favor of people using more direct AI systems.

4. Scaffolding

Previous LLM-based forecasting systems have included some amount of scaffolding.


Claim 1: LLM-based epistemic systems will likely require a lot of scaffolding, especially as they get ambitious.
I.e., software, databases, custom web applications, custom batch-job pipelines, etc.


Claim 2: Scaffolding-heavy systems have significant downsides.

 

Claim 3: Scaffolding-heavy systems are likely to be centralized.

 

Claim 4: Scaffolding-heavy systems are expensive, and thus difficult to incentivize people to build.

Claim 5: New technologies or startups could greatly change and improve scaffolding.

Right now there are some startups and open-source frameworks to help with general-purpose scaffolding, but these are known to be lacking. It's possible that much better systems will be developed in the future, especially because there's a lot of money in the space.

If this happens, that could hypothetically make powerful LLM epistemic tools dramatically easier to create.

One annoying challenge is that change in the field could lessen the importance of early experimentation. It could be the case that early projects focus heavily on some specific scaffolding frameworks, only for other frameworks to become more powerful later.

A possible successful world could be one where different teams make "GPTs" that can integrate nicely with other workflows and interact with each other. Ideally these wouldn't be restricted to any one platform.

Crux: Scaffolding vs. Direct LLM Improvements 

One risk around scaffolding is that a lot of key functionality that scaffolding would be used for might be implemented directly in the inference of LLMs. For example, long-term planning could be done simply with scaffolding, but AI developers are also iterating on implementing it directly into the LLM layer. 

Another example is that of editing text. If you have a long text document and want to use an LLM to make a few edits, it will currently attempt to rewrite the entire document, at a fairly slow speed. This could be improved with scaffolding by instead asking it to write a git diff and then applying that diff using traditional code. Or it could be solved at the inference layer directly, as discussed here.
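
As an illustration of the scaffolding route, here's a sketch that applies a model-written unified diff using the standard `patch` tool; a real system would also need to verify that the diff applies cleanly and fall back to a full rewrite when it doesn't:

```python
import subprocess
import tempfile

def apply_llm_diff(document: str, unified_diff: str) -> str:
    """Apply a model-generated unified diff with `patch`, instead of
    having the LLM re-emit the entire document."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(document)
        path = f.name
    # `patch` reads the diff from stdin; a hunk that fails to apply
    # raises CalledProcessError, which the caller should handle.
    subprocess.run(
        ["patch", path],
        input=unified_diff,
        text=True,
        check=True,
        capture_output=True,
    )
    with open(path) as f:
        return f.read()
```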

If it is the case that most future improvements are done directly with the LLMs, then that could significantly change what work will be valuable. For example, if there are only a few leading LLM developers, then it might ultimately be up to them to actually add any functionality.  

5. Subcomponents

As stated above, it’s hard to determine the best subdivisions of LEPs / LLM-Epistemic Scaffolding.

Here’s one attempt, to provide a simple picture. 

1. Data-collection

2. World Model

3. Human Elicitation

 

4. Data Prioritization

 

5. Numeric Modeling / Fermi Calculations
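
As one illustration of what automated numeric modeling could look like, here's a Monte Carlo Fermi estimate sketched in Python; all quantities are invented for demonstration, and QURI's Squiggle language is purpose-built for this style of estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Invented example: annual analyst-hours saved by an LLM research assistant.
users          = rng.lognormal(np.log(200), 0.5, N)  # number of active users
hours_per_week = rng.lognormal(np.log(2.0), 0.7, N)  # hours saved per user-week
weeks_per_year = 48

hours_per_year = users * hours_per_week * weeks_per_year
print(f"median: {np.median(hours_per_year):,.0f} hours/year")
print(f"90% interval: {np.percentile(hours_per_year, 5):,.0f} "
      f"to {np.percentile(hours_per_year, 95):,.0f}")
```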

6. Decomposition / Amplification

There are many potential strategies to "break a problem down" and solve different parts with different threads of LLMs. 

For example:

Arguably, numeric modeling is a form of decomposition/amplification; I just separated it out to draw attention to its uniqueness.

This is an active and popular area of ML research. Some of these techniques are incorporated in the LLM inference layer, but some are also used in scaffolding.
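
One hypothetical decomposition loop, sketched in Python; `ask_llm` is a thin wrapper around a single chat-completion call, and real code would parse the model's outputs far more defensively:

```python
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    # One chat-completion call; the thinnest possible wrapper.
    r = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

def decompose_and_forecast(question: str) -> float:
    """Split a question into subquestions, answer each in its own
    LLM thread, then synthesize a single probability."""
    subqs = ask_llm(
        f"List 3 short subquestions whose answers would most inform:\n"
        f"{question}\nOne per line, no numbering."
    )
    notes = [
        ask_llm(f"Answer briefly, citing key considerations: {q}")
        for q in subqs.splitlines() if q.strip()
    ]
    final = ask_llm(
        f"Question: {question}\n\nFindings:\n" + "\n\n".join(notes) +
        "\n\nReply with only a probability between 0 and 1."
    )
    return float(final.strip())  # assumes the model complies; parse defensively
```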

I'm sure there's a lot more work to do here. It's less clear where smaller nonprofit research can be most effective, but I suspect that there are some good options. 

Many methods might be unique to forecasting or specific epistemic workflows. For example:

Arguably, the rationality and effective altruist communities might be very well suited to work in this area. There's been a lot of writing on how to systematically break hard problems down; now people would need to experiment with encoding these approaches into automated workflows.

7. Question Generation 


8. Forecasting

9. Forecasting Resolution

10. Presentation

11. Distribution

12. Idea Generation / Optimization

13. Task Orchestration

6.  Benchmarks & Baselines

Claim 1: Assessing forecasting quality and efficiency with humans is very difficult and often poorly done.

Claim 2: Portable LEPs could be useful for acting as priors or baselines, for other systems to perform against.

Claim 3: Standardized LEPs could be great judges for resolving subjective questions.
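
For instance, a fixed model and prompt could act as a standardized resolver that different systems are scored against. A minimal sketch, reusing the hypothetical `ask_llm` wrapper from the decomposition example above:

```python
def resolve_subjective(question: str, evidence: str) -> bool:
    """Use a fixed model + fixed prompt as a standardized judge, so
    different systems can be scored against the same resolver."""
    verdict = ask_llm(
        "You are a neutral judge. Resolve the question strictly from "
        f"the evidence provided.\n\nQuestion: {question}\n\n"
        f"Evidence:\n{evidence}\n\nReply with exactly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```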

Idea: It’s possible we could get very useful small LEPs that are easily packaged, relatively cheaply.

Challenge: LLM forecasting abilities will vary based on date (e.g., a model's training cutoff relative to a question's timeframe), which complicates comparisons across time.

7. Caching vs. On-Demand Forecasting

Key question: How much should we expect AI forecasting systems to generate forecasts and visualizations on-demand for users, vs. computing and caching them in advance?

Simple heuristics:
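
One naive version of the tradeoff, as a sketch; the TTL scheme here is invented, and a real system might instead scale freshness requirements with a question's time horizon (long-horizon forecasts change slowly, so caching is cheap; short-horizon ones may need on-demand computation):

```python
import time
from typing import Callable

class ForecastCache:
    """Serve a stored forecast if it's fresh enough; otherwise
    recompute on demand and store the result."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, float]] = {}  # question -> (prob, ts)

    def get(self, question: str, compute: Callable[[str], float]) -> float:
        hit = self.store.get(question)
        if hit is not None and time.time() - hit[1] < self.ttl:
            return hit[0]  # fresh cache hit: reuse the stored forecast
        prob = compute(question)  # miss or stale: forecast on demand
        self.store[question] = (prob, time.time())
        return prob
```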

Further Research

I feel like there are clearly some larger investigative projects to do here. It's a vast and important topic, and at the same time there are a lot of fairly obvious (yet interesting and useful) things we could say about it. This post mostly just puts a bunch of existing ideas in one place and provides some definitions and clarifications.

Some future things I'd like to explore (or have others explore!) include:

  1. Exploration of benchmarks and optimization of end-to-end LEPs
    This document includes more about specific subcomponents of LEPs than the full thing itself.
  2. Strategies for prioritizing philanthropic LEP work
    I think this document helps list out a lot of potential LEP work, but it doesn't try to prioritize it. This would take some thinking. Maybe predictions can be used.
  3. Cost-effectiveness estimation
    I think these systems are promising, but I'm not sure how to best estimate the cost-effectiveness of them vs. other effective altruist interventions. It would be useful to have better models of benefits and risks, perhaps with new delineations. 
  4. Concrete project ideas for enthusiasts
    I could imagine a lifehacker-like community forming around some of these principles. I think creative experimentation by enthusiasts could be useful, in part to help identify promising full-time talent.
  5. Moving key claims and cruxes to forecasting platforms and LLMs
    Fitting with the theme, it seems interesting to try to pull out key questions here and post them to forecasting platforms. I tried to isolate claims and cruxes, in part so that they could later be easily adapted into somewhat-concrete questions. 

My guess is that a lot of people will bounce off this broad topic. I imagine that there's much better writing that could be done to formalize and popularize the key concepts that will be useful going forward. I'd encourage others to make stabs here.


joshcmorrison @ 2024-06-26T13:40 (+4)

Thanks for sharing this! I've also been working on this question of "what would better forecasting by AIs enable?" (or stated differently, "what advances could instantaneous superforecasting 'too cheap to meter' unlock?") I've come at this from a bit of a different angle of thinking about how forecasting systems could fit into a predictive process for science and government that imitates active inference in brains. Here're slides from a presentation I gave on this topic at Manifest, and here is a half-finished draft essay I'm working on in case you're interested. 

DanSpoko @ 2024-06-26T21:39 (+3)

Your definition seems to constrain 'epistemic process' to mere analytic tasks. It seems to me that it's a big leap from there to effective decision-making. For instance, I can imagine how LLMs could effectively produce resolvable, non-conditional questions, and then answer them with relatively high accuracy. Yet there are three other tasks I'm more skeptical about: 1) generating conditional forecasting questions that encapsulate decision options; 2) making accurate probability judgements about those questions; and thus 3) the uptake of such forecasts into a 'live' decision process. This all seems more likely to work for environments that have discrete and replicable processes, some of which you mention, like insurance calculations. But these tasks seem potentially unsolvable by LLMs for more complex decision environments that require more ethical, political, and creative solutions. By 'creative' I mean solutions (e.g., conditional forecasting questions) that simply cannot be assembled from training data because the task is unique. What comprises 'unique' is perhaps an interesting discussion? Nevertheless, this post helped me work through some of these questions -- thanks for sharing! Curious if you have any reactions.

Ozzie Gooen @ 2024-06-26T21:59 (+2)

Thanks for the comment! 

> "generating conditional forecasting questions that encapsulate decision options; 2) making accurate probability judgements of those questions"

This is a subset of what I referred to as "scorable functions". Conditional questions can be handled in functions. 

Humans currently have a hard time with these. I'm optimistic that AIs could do at least about as well as humans. There's a lot of training data, and there are artificial situations we could come up with for training and testing.

> By 'creative' I mean solutions (e.g. conditional forecasting question) that simply cannot be assembled from training data because the task is unique.

I don't have ideas for the sorts of questions on which we'd expect humans to dominate AI systems here. LLMs can come up with ideas. LLM agents can search the web, like humans search the web.

Do you see any fundamental limitations of LLM agents on tasks that humans can reliably do? Maybe you could come up with a concrete metric/task where you'd expect LLMs to substantially underperform humans?

DanSpoko @ 2024-06-26T22:39 (+1)

An anecdote: the US government is trying to convince a foreign government to sign an agreement with the United States but is repeatedly stymied by presidents from both parties for two decades. Let's assume a forecast at that moment suggests a 10% chance the agreement will be signed within a year. A new ambassador designs a creative strategy that hasn't been attempted before. Though the agreement would require executive signature, she decides instead to meet with every single member of parliament and tell them the United States would owe them if they came out publicly in favor of the deal. Fast forward a year, and the agreement is signed.

Another anecdote: the invention of the Apple computer.

Presumably you could use an LLM+scaffold to generate a range of options and compare conditional forecasts of their likelihood of success. But will it beat a human? I'm skeptical that an LLM is ever going to be able to "think" through the layers of contextual knowledge about a particular challenge (to say nothing of prioritizing the correct challenge in the first place) well enough to generate winning solutions.

Metric: give forecasters a slate of decision options -- some calculated by LLM, some by humans -- and see who wins.

Another thought on metrics: calculate a "similarity score" between a decision option and previous attempts at solving similar challenges. Almost like a metric that calculates "neglectedness" and "tractability"?

Ozzie Gooen @ 2024-06-27T22:16 (+2)

I imagine that some forms of human invention will be difficult to beat for some time. But I think there's a lot of more generic strategic work that could be automated. Like what some hedge fund researchers do.

Forecasting systems now don't even really try to come up with new ideas (they just forecast on existing ones), but they still can be useful. 

SummaryBot @ 2024-06-26T13:44 (+1)

Executive summary: Recent improvements in large language models (LLMs) have made LLM-based epistemic systems promising as an effective altruist intervention area, with potential for far-reaching impacts on forecasting, decision-making, and global coordination.

Key points:

  1. LLM-based epistemic processes (LEPs) could dramatically improve forecasting and decision-making across many domains, potentially reducing foreseeable mistakes and improving coordination.
  2. Developing LEPs will likely require significant scaffolding (software infrastructure), which presents challenges for research, centralization, and incentives.
  3. Key components of LEPs include data collection, world modeling, human elicitation, forecasting, and presentation of results.
  4. Standardized, portable LEPs could serve as useful benchmarks and baselines for evaluating other systems and resolving subjective questions.
  5. Important uncertainties include the relative importance of scaffolding vs. direct LLM improvements, and how to prioritize philanthropic work in this area.
  6. Potential risks include accelerating AI capabilities without corresponding safety improvements, and empowering malicious actors.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.