From Laboratories to Language Models: Can AI Support Rigor in the Jungle of Policy Analysis? (Linkpost)

By Marcel D @ 2024-02-06T18:51 (+22)

This is a linkpost to https://georgetownsecuritystudiesreview.org/2024/02/05/from-laboratories-to-language-models-can-ai-support-rigor-in-the-jungle-of-policy-analysis/

From my conclusion:

Within many scientific fields, procedural standards such as experiments and statistical significance tests can produce results that are more accurate and easier to verify—even for non-experts. In some fields (e.g., microbiology), developing such standards required tools that made them reasonable or enforceable, such as microscopes to test theories through observation. However, such high-quality standards have long been impractical for many topics in national security and foreign policy—as well as the social sciences more broadly, which those policy topics often rely on. Some nice-sounding due diligence standards (e.g., “test your claims against all relevant case studies”) are difficult to enforce because they demand too much from researchers or the standards are hard to operationalize (e.g., what counts as a “relevant” case study?). Formal peer review can help but is often slow and fails to catch some errors in research. Thus, it can be difficult for policymakers to determine which (if any) analysts and their assessments are reliable. 

Should analysts give up hope of meaningfully improving rigor in these areas? Will the domains forever be treacherous jungles of complexity to their audiences and inhabitants? Perhaps, yet the ascendance of LLMs over the past two years offers some reason for optimism: it is entirely plausible that these and related AI models will enable better methods this decade, such as by making qualitative data collection more practical or by generating initial rebuttals to one’s analysis. As with the earliest microscopes, such tools will likely require further improvements before scholars can trust them with some tasks, but if effective, they could significantly reduce some of the barriers to empirical analyses and operationalize due diligence standards that were previously unenforceable. Thankfully—unlike with many policy questions—researchers can begin experimentally evaluating LLMs’ usefulness for policy analysis and social science. Given how valuable even marginal improvements in national security and foreign policy assessments could be, more scholars and analysts should thoroughly explore the tools’ potential and prepare for their integration where appropriate.

In the section of the article that precedes this conclusion, I go into detail on four categories of use cases: "smart search," larger-scale and replicable qualitative coding, preliminary argument review, and LLMs as "animal models" for human interactions and reasoning.

To be clear, I think there's a decent chance that even if these tools became demonstrably effective, the vast majority of social scientists and policy analysts still would not use or require them in the near term. If the tools do work, one of my main theories of success is closer to "convince people working on the most important problems (who may tend to work on AI policy/safety and thus be much more inclined to adopt these tools) to use them where appropriate," perhaps with assistance or legitimization from some early adopters in other fields.

If anyone knows of quantitative or qualitative studies of scientific tool adoption, especially empirical analyses of the most predictive factors, please let me know! I'd love to get a better estimate of the likelihood, shape, and pace of adoption (conditional on different levels of performance).