“Pivotal questions”: an Unjournal trial initiative

By david_reinstein @ 2024-07-21T16:57 (+34)

TLDR

The Unjournal is seeking ‘pivotal questions’ to guide our choice of research papers to commission for evaluation; express your interest/offer suggestions here. See the full protocol for this project in our knowledge base here (link fixed). 

About The Unjournal

The Unjournal commissions public evaluations of impactful research in quantitative social sciences fields.[1] We are an alternative and a supplement to traditional academic peer-reviewed journals – separating evaluation from journals and making this public unlocks a range of benefits. We ask expert evaluators to write detailed, constructive, critical reports. We also solicit a set of structured ratings focused on research credibility, methodology, careful and calibrated presentation of evidence, reasoning transparency, replicability, relevance to global priorities, and usefulness for practitioners (including funders, project directors, and policymakers  who rely on this research).[2]

While we have mainly targeted impactful research from academia, our ‘applied stream’ covers impactful work that uses formal quantitative methods but is not aimed at academic journals. So far, we’ve commissioned about 40 evaluations of 21 papers, and published these evaluation packages on our PubPub community, linked to academic search engines and bibliometrics.

New Pilot: “Evaluating Pivotal Questions”

Our main approach has been to search for papers and then commission experts to publicly evaluate them. Our field specialist teams search and monitor prominent research archives (like NBER), and consider agendas from impactful organizations, while keeping an eye on forums and social media.[3] Our approach has largely been to look for research that seems relevant to impactful questions and crucial considerations.[4] We're now exploring the idea of turning this on its head and identifying pivotal questions first. If we start with pivotal questions and then evaluate a cluster of research that informs these questions, this might offer a more efficient and observable path to impact.  (For context, see our ‘logic model’ flowchart for our theory of change.)

We’re piloting this approach (described below) to better understand its feasibility, costs, and benefits for impact. If this pilot succeeds, we will expand it and make it more systematic, perhaps supplementing or replacing our current approach.[5]

Pivotal questions pilot: process outline

Steps:

  1. Elicit ‘target questions’
  2. Select the strongest target questions (based on the criteria below), get feedback, and collaboratively refine them
  3. Source and prioritize research informing the target questions
  4. Commission expert evaluations of this research, focused on discussing how well and in what ways the research helps us answer the target questions
  5. Get feedback from research authors and the target organization(s)
  6. Prepare a ‘synthesis report’
  7. Complete and publish the ‘target question evaluation packages’

     

1. Elicit questions

The Unjournal will ask impact-focused, research-driven organizations such as GiveWell, Open Philanthropy, and Charity Entrepreneurship to identify specific quantifiable questions[6] that impact their funding, policy, and research-direction choices. For example, if an organization is considering whether to fund a psychotherapeutic intervention in an LMIC, they might ask, “How much does a brief course of non-specialist psychotherapy increase happiness, compared to the same amount spent on direct cash transfers?”[7] 

We want to be helpful: to address the questions the organizations are most interested in, while minimizing the time burden on them. We’re looking for the questions with the highest value of information (VOI) for the organization’s work over the next few years. These should be questions that relate to The Unjournal’s coverage areas and engage rigorous research in economics, social science, policy, or impact quantification. Ideally, organizations will identify at least one piece of publicly available research that relates to their question.

For our pilot, we may narrow the scope further, and focus our “ask” on a few general causes or intervention areas that seem especially promising for these organizations, for example ‘mental health interventions’ or ‘governing and regulating technology and innovation’.

As this post suggests, we also want to crowdsource this to some extent, asking informed and interested readers to suggest pivotal questions on the EA Forum, social media, and beyond. We’ll offer bounties for the best responses.

2. Select, refine, and get feedback on the target questions

The Unjournal team will then discuss the suggested questions, leveraging our field specialists’ expertise. We’ll rank these questions, prioritizing at least one for each organization. We’ll work with the organization to specify the priority question precisely and in a useful way. We want to be sure that (1) evaluators will interpret these questions as intended, and (2) the answers that come out are likely to be genuinely helpful. We’ll make these lists of questions public and solicit general feedback — on the relevance of the questions, on their framing, on key sub-questions, and on pointers to relevant research.

Luxury version: If it seems practicable, we will operationalize the target questions as a claim on a prediction market (for example, Metaculus) to be resolved by the evaluations and synthesis below.

3. Source and prioritize research informing the target questions

Once we’ve converged on the target questions, we’ll do a variation of our usual evaluation process.

For each question we will prioritize ~two to five relevant research papers.[8] These papers may be suggested by the organization that suggested the question, sourced by The Unjournal, or discovered through community feedback.[9]

4. Commission expert evaluations of research, informing the target questions

As we normally do, we’ll have ‘evaluation managers’ recruit expert evaluators to assess each paper.[10] However, we’ll ask the evaluators to focus on the target question, and to consider the target organization’s priorities.[11]

We’ll also enable phased deliberation and discussion among evaluators.[12] This is inspired by the repliCATS project, and some evidence suggesting that the (mechanistically aggregated) estimates of experts after deliberations perform better than their independent estimates (also mechanistically aggregated).[13] We may also facilitate collaborative evaluations and ‘live reviews’, following the examples of ASAPBio, PREreview, and others.
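
To make the “mechanistic aggregation” idea concrete, here is a minimal sketch of what pooling evaluators’ estimates before and after deliberation could look like. This is illustrative only: the estimates, the number of evaluators, and the choice of a trimmed mean are all hypothetical, not our settled protocol.

```python
import numpy as np

def aggregate(estimates, trim=0.1):
    """Mechanical aggregation of expert estimates via a trimmed mean."""
    x = np.sort(np.asarray(estimates, dtype=float))
    k = int(len(x) * trim)  # drop the k most extreme answers on each side
    return x[k:len(x) - k].mean() if len(x) > 2 * k else x.mean()

# Hypothetical probabilities that a target claim is true, from four evaluators
independent = [0.55, 0.70, 0.40, 0.65]        # elicited before any discussion
post_deliberation = [0.60, 0.66, 0.55, 0.63]  # re-elicited after a structured discussion round

print(aggregate(independent))        # pooled pre-deliberation estimate
print(aggregate(post_deliberation))  # pooled post-deliberation estimate
```

The repliCATS-style finding referenced above is that the second pooled number tends to be better calibrated than the first: deliberation changes the individual inputs, while the aggregation rule itself stays mechanical.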

5. Get feedback from paper authors and from the target organization(s)

We will contact both the research authors (as per our standard process) and the target organizations for their responses to the evaluations, and for follow-up questions. We’ll foster a productive discussion between them (while preserving anonymity as requested, and being careful not to overtax people’s time and generosity).

6. Prepare a “Synthesis Report”

We’ll commission one or more evaluation managers to write a report summarizing the research investigated.[14]

These reports should synthesize the answer to “What do the research, evaluations, and responses say about the question/claim?” They should provide an overall metric relating to the truth value of the target question (or a comparable summary of the parameter of interest). If and when we integrate prediction markets, they should decisively resolve the market claim.

Next, we will share these synthesis reports with authors and organizations for feedback.

7. Complete and publish the ‘target question evaluation packages’

We’ll put up each evaluation on our Unjournal.pubpub.org page, bringing it into academic search tools, databases, bibliometrics, etc. We’ll also curate the evaluations, linking them to the relevant target question and to the synthesis report.

We will produce, share, and promote further summaries of these packages. This could include forum and blog posts summarizing the results and insights, as well as interactive and visually appealing web pages. We might also produce less technical content, perhaps submitting work to outlets like Asterisk, Vox, or worksinprogress.co.

‘Operationalizable’ questions

At least initially, we’re planning to ask for questions that could be definitively answered and/or measured quantitatively, and we will help organizations and other suggesters refine their questions to make this the case. These should approximately resemble questions that could be posted on Manifold Markets or Metaculus.[15][16] These should also somewhat resemble the ‘claim identification’ we currently request from evaluators.[17] 

Exploratory questions, in contrast, ask others to present broad descriptions, find and characterize guiding principles, compare theoretical models, or uncover fruitful avenues to pursue (for examples, see 80k’s “research questions”).

These questions could be difficult to make specific, focused, and operationalizable.

Other questions could be adapted to be better operationalized, or broken down into several specific sub-questions. For example (again from 80k’s “research questions”): “Could advances in AI lead to risks of very bad outcomes, like suffering on a massive scale? Is it the most likely source of such risks?” I rated this question a 3/10 in terms of how operationalized it was. The word “could” is vague: it might suggest some reasonable probability (1%, 0.1%, 10%), or it might be interpreted as “can I think of any scenario in which this holds?” “Very bad outcomes” also needs a specific measure.

We might reframe this to be more operationalized. E.g., “What is the risk of a catastrophic loss (defined as the death of at least 10% of the human population over any five-year period) occurring before the year 2100? How does this vary depending on the total amount of money invested in computing power for building advanced AI capabilities over the same period?”[18]

In the appendix I discuss this further, considering a sample of questions from 80k and others, assessing the extent to which they are usefully operationalized, and how they could be improved.

We’re seeking operationalizable questions because:

  1. This is in line with our own focus on this type of research.[19]
  2. I think this will help us focus on fully-baked questions, where the answer is likely to provide actual value to the target organization and others (and avoid the old ‘42’ trap).
  3. It offers potential for benchmarking and validation (e.g., using prediction markets), specific routes to measure our impact (updated beliefs, updated decisions), and informing the 'claim identification (and assessment)' we’re asking from evaluators (see footnote above).

However, as this initiative progresses we may allow a wider range of questions, e.g., more open-ended, multi-outcome, non-empirical (perhaps ‘normative’), and best-practice questions. Most of 80,000 Hours’ “big impact” questions fall into the latter category, and many of these seem like they could be adapted into useful guiding questions for The Unjournal’s prioritization (see discussion in the appendix below).

How you can help us

Give us feedback on this proposal

We’re still refining this idea, and looking for your suggestions about what is unclear, what could go wrong, what might make this work better, what has been tried before, and where the biggest wins are likely to be. We’d appreciate your feedback!

Suggest target questions

If you work for an impact-focused research organization and you are interested in participating in our pilot, please reach out to us at contact@unjournal.org to flag your interest and/or complete this form. We would like to see your suggested question(s).

Please also let us know how you would like to engage with us on refining this question and addressing it. Do you want to follow up with a 1-1 meeting? How much time are you willing to put in? Who, if anyone, should we reach out to at your organization?

Remember that we plan to make all of this analysis and evaluation public.

If you don’t represent an organization, we still welcome your suggestions, and will try to give feedback.[20]  

Please remember that we currently focus on quantitative ~social sciences fields, including economics, policy, and impact modeling (see here for more detail on our coverage). Questions surrounding (for example) technical AI safety, microbiology, or measuring animal sentience are less likely to be in our domain.                                          

If you want to talk about this first, or if you have any questions, please send an email or schedule a meeting with David Reinstein, our co-founder and director.

Appendix 1

What’s been tried before in this space? (looking for further suggestions)

Research agendas - Effective Thesis (partner: Vegan Thesis) 

Effective Thesis (ET) has reached out to their connections at a range of organizations to share their agendas, to help guide students towards high-impact research questions for their projects and theses. They are in the process of expanding their research questions to include more specific resources as well as more concrete projects and questions (not just broad topic areas).[21]

ET’s resources are very useful. In some areas, e.g., the “behavioural and attitudinal change in animal products consumption,” they suggest fairly specific questions, many of which seem relevant to our pivotal questions project.[22] But we still want to clarify how impactful and pivotal these questions are; they may have been suggested for other reasons (e.g., because they are particularly conducive to student learning). I plan to go through these questions and may use them as examples or suggestions for organizations.

80,000 Hours’ “Research questions that could have a big social impact, organised by discipline” 

They write:

“Our primary strategy in compiling these lists was to look through formal and informal collections of high-impact research questions put together by others in the effective altruism community or by people working on our priority problems.”

It’s not clear exactly who they reached out to, what they asked, or how they justified these. Most of the questions are exploratory or philosophical (‘what are the best…?’ ‘How should we consider…?’). However, a subset of these come close to what we are looking for, or could be somewhat easily adapted to be operationalizable. I dig into these in the appendix below. (I am also posting these as notes on their page using ‘hypothes.is’, a public collaborative annotation browser plugin; see https://hypothes.is/; I highly recommend it.)

Somewhat related:

GiveWell's Change Our Mind Contest – “[an invitation] to identify potentially important mistakes or weaknesses in our existing cost-effectiveness analyses”

Research agendas, questions, and project lists — EA Forum

A central directory for open research questions — EA Forum

Charity Entrepreneurship research – research into promising interventions/charities; their early stages may involve some related outreach and prioritization

The DARPA SCORE project extracted scientific claims “from a stratified sample of social-behavioral science papers”, following a specific protocol. But this targeted research papers, not priority questions.[23]

Ten Years and Beyond: Economists Answer NSF's Call for Long-Term Research Agendas (Compendium) – this happened about 15 years ago and it’s not clear if anything important came out of it.

3ie’s ‘Development Evidence Portal’ tools

Appendix 2: Discussing and scoring example operationalizable questions

From Research Questions — Vegan Thesis

Below, I provide a few questions that I thought were highly relevant to The Unjournal’s focus, followed by a counterexample. I rated the first three questions below fairly highly, although they could still benefit from greater specificity.

“Does the purchase of plant-based analogues in retail settings cause displacements of the corresponding animal products?” [8/10]

This is specific, quantitative, and suggests a clear path to impact. Vegan Thesis (see link) references a range of papers specifically considering this topic, as well as providing methodological examples. They suggest specific data and approaches.

I would make a few refinements to this question. I’d pose it as ‘how much does the purchase of … displace the purchase of …, if at all?’ It seems more relevant to know the extent to which this displacement occurs than to simply know whether it occurs. Of course, this depends on which choice the answer to this question is likely to inform, and that should be fleshed out further. Perhaps the issue is ‘how much to subsidize the development and promotion of plant-based products?’, or ‘which areas (type of food, geography, etc.) yield the most value?’ We need to know this to know how best to pose and answer the research question.

We would also need to specify a unit of measurement such as revenue, calories, or, ideally, something that could be linked to the impact on the ‘animal welfare footprint’.[25] The question might need further refinement to be consistent with standard economics/decision-science frameworks.[26]

“Which settings are most conducive to running rigorous experiments on dietary change interventions?” [7/10]

This is a methodological question, but very much an applied one, with clear practical implications for direct impactful work. There is a lack of rigorous work in this area, and little consensus about how much we can generalize from the results of small-scale trials. This makes it a priority for us. However, the question is broad (even with the further details at the above link), and we would like to break it down into a small set of narrower, more quantified key questions. Something like (taking a first pass here)... “Consider a standard range of information interventions (e.g., paying people to watch videos about factory farming). How well does the typical proportional (e.g., 25%) reduction in animal product consumption over the following week predict proportional changes in lifetime animal product consumption? How does this proportional relationship vary across context and study type?”

“What are the projected market dynamics and growth trajectories of cultivated meat across global geographies?” [7/10]

This question and the further discussion provided seem very close to what we are looking for. We could operationalize it a bit more with specific questions like “What will be the average amount (in calories or protein grams) of cultivated meat consumed per person in the year 2050?” and “Estimate a demand model to measure the consumption of cultivated meat and traditional meat per capita as a function of the prices of each, differentiated by geographic region”.
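
As a rough illustration of how the second operationalization could be approached, here is a minimal sketch of a log-log demand regression. Everything here is hypothetical: the file meat_panel.csv, its column names, and the functional form are placeholders, and a serious analysis would need to address price endogeneity rather than rely on plain OLS.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel of per-capita consumption and prices by region and year
df = pd.read_csv("meat_panel.csv")  # assumed columns: region, year, q_cultivated,
                                    # p_cultivated, p_conventional, income

# Log-log specification: the price coefficients are own- and cross-price elasticities
model = smf.ols(
    "np.log(q_cultivated) ~ np.log(p_cultivated) + np.log(p_conventional)"
    " + np.log(income) + C(region) + C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["region"]})

# How cultivated-meat demand responds to its own price and to conventional-meat prices
print(model.params[["np.log(p_cultivated)", "np.log(p_conventional)"]])
```

The coefficient on the conventional-meat price would be a cross-price elasticity, which also connects back to the ‘displacement’ question discussed earlier.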

What lessons can the pro-animal movement learn from the economic impact of past transitions away from certain industries? [3/10]

This may indeed be a high-value question, but it seems challenging to operationalize in the way we are seeking. (Perhaps one could pose a question about the welfare losses from increases in structural unemployment over past technical transitions?) Still, this seems more conducive to a broad descriptive, anecdotal, or case-study-based approach. The (linked) description also does not suggest a clear direction of travel.

From 80,000 Hours’ “Research questions that could have a big social impact…”

Economics

“What is the effect of economic growth on existential risk?” [6/10]

This needs some refinement: what measure of growth, what units of x-risk, etc.

 

What determines the long-term rate of expropriation of financial investments? How does this vary as investments grow larger? [7/10]

This is fairly definite, although it’s not clear what the precise motivation is here. 

 

Global Priorities Research

“How much do global issues differ in how cost-effective the most cost-effective interventions within them are?” [5/10]

‘Global issues’ is vague; we need more specific categorizations. Cost-effectiveness needs a metric.

 

"Could advances in AI lead to risks of very bad outcomes, like suffering on a massive scale? Is it the most likely source of such risks?” [3/10]

‘Could’ is vague. ‘Very bad outcomes’ needs a precise measure. See the proposed reframing earlier in the post.

 

Psychology

How well does good forecasting ability transfer across domains? [4/10]

This could be operationalized: what measure of ability, and which domains (and how would we define these)?

 

What potential nootropics or other cognitive enhancement tools are most promising? [Not well operationalized … promising in what sense?] E.g., does creatine actually increase IQ in vegetarians? [8/10]

This part is better operationalized, although it’s on the borderline of our scope.

 

Science policy/infrastructure and metascience

What are the best existing methods for estimating the long-term benefit of past investments in scientific research, and what have they found? [3/10]

Reframe as ‘What has been the long-term benefit {defined in terms of some specific measure} of investment in scientific research?’ – 6/10 as reframed.

 

Statistics and mathematics

When estimating the chance that now (or any given time) is a particularly pivotal moment in history, what is the best uninformative prior to update from? [5/10]

Not quite operationalized but seems close to workable.

 

Outside our fields (but fairly concrete and quantitative):  

Biology and genetics

Climate studies and earth sciences

 

Considering broader questions from 80k

Many of their questions seem relevant but rather broad. We could consider a combination of adapting these and expanding our scope.

For example, under ‘Economics’ they ask “What’s the best way to measure individual wellbeing? What’s the best way to measure aggregate wellbeing for groups?” By our above standards, this is far too broadly defined; however, there is a wealth of recent (empirical, theoretical, and methodological) work on this, some of which (e.g., ‘WELLBYs’) seems to be influencing the funding, policies and agendas at high-impact organizations.  Elicit.org (free version) summarizes the findings from the ‘top-4’ recent papers:

Recent research highlights the complexity of measuring wellbeing, emphasizing its multidimensional nature. Studies have identified inconsistencies in defining and measuring wellbeing among university students in the UK, with a focus on subjective experiences and mental health (Dodd et al., 2021). Global trends analysis reveals distinct patterns in flourishing across geography, time, and age, suggesting the need for comprehensive measures beyond single-item assessments (Shiba et al., 2022). Scholars argue for well-being indicators that capture physical, social, and mental conditions, as well as access to opportunities and resources (Lijadi, 2023). The concept of "Well-Being Capital" proposes integrating multidimensional wellbeing indicators with economic measures to better reflect a country's performance and inform public policy (Bayraktar, 2022). These studies collectively emphasize the importance of considering subjective experiences, cultural factors, and ecological embeddedness when measuring individual and aggregate wellbeing, moving beyond traditional economic indicators like GDP.

This suggests a large set of more targeted questions, including conceptual questions, psychometric issues, normative economic theory, and empirical questions. But I also suspect that 80k and their funders would want to reframe this question in a more targeted way. They may be particularly interested in comparing a specific set of measures that they could actually source and use for making their decisions. They may be more focused on well-being measures that have been calibrated for individuals in extreme poverty, or suffering from painful diseases. They may only be interested in a small subset of theoretical concerns; perhaps only those that could be adapted to a cost-benefit framework.[24]

This seems promising as the basis for a ‘research prioritization stream’.  We would want to build a specific set of representative questions and applications as well as some counterexamples (‘questions we are less interested in’), and then we could make a specific drive to source and evaluate work in this area.

Acknowledgements

Davit Jintcharadze provided general feedback on this proposal. Amber Ace provided editing and conceptual support. 

  1. ^

     This includes economics, policy, and impact modeling. See "what specific areas do we cover" for more specifics on our scope.

  2. ^

     See a recent version of our ‘applied-stream’ evaluation form here, and our evaluator guidelines here. (We’re in the process of revising our ratings and guidelines slightly.)

  3. ^

     We’ve also offered bounties for research submissions and suggestions, and hope to do so in the future.

  4. ^

After research is suggested, we have a careful procedure for prioritizing it for evaluation. This involves a brief report from the suggester, a second opinion from an ‘assessor’, and a prioritization vote by the relevant field specialist team, which guides the final management decision to commission the work for evaluation.

  5. ^

     We will also integrate this into our ‘prioritizing it for evaluation’ process; see previous footnote.

  6. ^

We may later expand this to somewhat more open-ended and general questions; see below. We also hope to include organizations that are less closely connected with Effective Altruism.

  7. ^

    This question would still need some refinement; see continued discussion below.

  8. ^

    Or dynamic ‘projects’, or non-academic rigorous work — see discussion here, and notes on our ‘applied stream’.

  9. ^

    We discuss how this relates to our typical rules for ‘what we need permission to evaluate’ here.

  10. ^

    Naturally, we may ask some experts to evaluate multiple papers within the same question or theme.

  11. ^

    This could be integrated with the “claim evaluation” section we’re introducing to our evaluation forms (see here). We’ll also ask them to evaluate the paper according to The Unjournal’s standard or applied stream guidelines. But we’ll cut them some slack here, and offer additional compensation for the extra work.

  12. ^

    We have plans to do this in general (see sketch here). This seems particularly promising for this pivotal questions project, as we have a more well-defined and measurable task.

  13. ^

    Here, we’re relying on Anca Hanea, a member of our Advisory Board who focuses on aggregating expert judgment. Academic work such as Rowe and Wright 2001 (“Delphi groups are somewhat more accurate than statistical groups (which are made up of noninteracting individuals whose judgments are aggregated)”) also seems to support this point.

  14. ^

    See details here.

  15. ^

    Manifold allows the person who created the market to resolve it based on their own judgment. Metaculus questions are resolved by their Metaculus administrators, but they seem willing to rely on outside sources in doing this; see Transparent Replications | Metaculus.

  16. ^

See Metaculus guidelines here. Phil Tetlock’s “Clairvoyance Test” is particularly relevant. As they state it: “if you handed your question to a genuine clairvoyant, could they see into the future and definitively tell you [the answer]? 

    Some questions like 

    - ‘Will the US decline as a world power?’... [or] ‘Will an AI exhibit a goal not supplied by its human creators?’ struggle to pass the Clairvoyance Test.  … 

    How do you tell one type of AI goal from another, and how do you even define it? ….  In the case of whether the US might decline as a world power, you’d want to get at the theme with multiple well-formed questions such as ‘Will the US lose its #1 position in the IMF’s annual GDP rankings before 2050?’...” 

  17. ^

      Condensing this a bit, we ask evaluators to “specify the most important and impactful factual claim this research makes…”; “to what extent do you believe this claim?...”, “what further evidence/checks would make you more confident in this claim?”, and “identify and assess the most important implication of this claim for funding and policy choices...”

  18. ^

Caveat: This may still be poorly posed, or may be getting at the wrong thing – I am not an expert in this area. We should further specify how the ‘investment in computing power for building … AI’ should be measured.

  19. ^

    The Unjournal focuses on evaluating (mainly empirical) research that clearly poses and answers specific impactful questions, rather than research that seeks to define a question, survey a broad landscape of other research, open routes to further inquiry, etc. However, we have evaluated some broader work where it seemed particularly high impact, original, and substantive.  E.g., we’ve evaluated work in ‘applied economic theory’ such as Aghion et al. on the impact of artificial intelligence on economic growth, and applied methodology, e.g.,  "Replicability & Generalisability: A Guide to CEA discounts"

  20. ^

    As noted above, we may offer bounties in the future for suggestions that we engage with. Any such bounty will also apply retroactively, to suggestions made in response to this post.

  21. ^

    We considered doing this ‘pivotal questions’ outreach jointly, but we decided that our calls-to-action seemed too distinct, so pooling these would add more heat than light.

  22. ^

    Still, I note that some of them are rather broad, and they typically lack references to research literature.

  23. ^

    The claim needed to be reported in the abstract and “represent[] a specific, concrete finding … supported by a statistically significant test result, or … amenable to a statistical hypothesis test.” For cases where a paper abstract had multiple claims like this, it’s unclear how they chose among these.

  24. ^

Asking the LLM to “...please focus on recent work in economics and decision theory, and measurements that have been used for low and middle-income countries.” yields [excerpted]:

    Recent research … aim[s] to move beyond traditional economic indicators like GDP. The Global Index of Wellbeing (GLOWING) proposes a simple, meaningful measure using secondary ecological data (Elliott et al., 2017). The Multidimensional Wellbeing Index for Peru (MWI-P) incorporates 12 dimensions based on what people value, revealing disparities among subgroups (Clausen & Barrantes, 2022). Another approach suggests using the Condorcet median to aggregate non-comparable wellbeing facets into a robust ranking (Boccard, 2017). New methods for measuring welfare include stated preferences over aspects of wellbeing, life-satisfaction scales, and the WELLBY approach, as well as comprehensive frameworks like Bhutan's Gross National Happiness Index (Cooper et al., 2023).

  25. ^

    For example, a measure of the reduction in number of conventionally farmed broiler chickens could be converted to a QALY measure, allowing us to compare interventions across the animal welfare space and beyond (see, e.g., Corporate campaigns for chicken welfare are 10,000 times as effective as GiveWell's Maximum Impact Fund?)

  26. ^

    Traditionally, economists consider all purchases as a joint optimization decision. The question ‘how does the purchase of A respond to the purchase of B’ is somewhat challenging to make precise; economists have preferred to focus on measures such as cross-price elasticity.
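
For reference (this is the textbook definition, not something specific to this post), the cross-price elasticity of the quantity of good A with respect to the price of good B is:

```latex
\varepsilon_{A,B} \;=\; \frac{\partial Q_A}{\partial P_B}\cdot\frac{P_B}{Q_A}
\;\approx\; \frac{\%\,\Delta Q_A}{\%\,\Delta P_B}
```

A positive value indicates the two goods are substitutes; in the displacement question above, that would correspond to plant-based purchases crowding out animal-product purchases as relative prices shift.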


david_reinstein @ 2024-09-05T15:58 (+2)

See the full protocol for this project in our knowledge base here.

28 Sep 2024 – fixed above link and link in post.