Open Technical Challenges around Probabilistic Programs and Javascript
By Ozzie Gooen @ 2023-08-26T02:04 (+39)
While working on Squiggle, we’ve encountered many technical challenges in writing probabilistic functionality with Javascript. Some of these challenges are solved in Python and must be ported over, and some apply to all languages.
We think the following tasks could be good fits for others to tackle. These are fairly isolated and could be done in contained NPM packages or similar. The solutions would be useful for Squiggle and might be handy for others in the Javascript ecosystem as well. Advice and opinions are also appreciated.
This post was quickly written, as it's for a narrow audience and might get outdated. We're happy to provide more rigor and context if requested. Let us know if you are interested in taking any of them on and could use some guidance!
For those not themselves interested in contributing, this might be useful for giving people a better idea of the sorts of challenges we at QURI work on.
1. Density Estimation
Users often want to convert samples into continuous probability distribution functions (PDFs). This is difficult to do automatically.
The standard approach of basic Kernel Density Estimation can produce poor fits on multimodal or heavily skewed data.
a. Variable kernel density estimation
Simple KDE algorithms use a constant bandwidth. There are multiple methods for estimating this. One common method is Silverman's rule of thumb. In practice, using Silverman’s rule of thumb with one single bandwidth performs poorly for multimodal or heavily skewed distributions.
Squiggle performs log KDE for heavily skewed distributions, but this only helps so much, and this strategy comes with various inconsistencies.
There’s a set of algorithms for variable kernel density estimation or adaptive bandwidth choice, which seems more promising. Another option is the Sheather-Jones method, which existing python KDE libraries use. We don’t know of good Javascript implementations of these algorithms.
b. Performant KDE with non-triangle shapes
Squiggle now uses a triangle kernel for speed. Fast algorithms (FFT) should be possible, with better kernel shapes.
See this thread for some more discussion.
c. Cutoff Heuristics
One frequent edge-case is that many distributions have specific limits, often at 0. There might be useful heuristics like, “If there are no samples below zero, then it’s very likely there should be zero probability mass below zero, even if many samples are close and the used bandwidth would imply otherwise.”
See this issue for more information.
d. Discrete vs. continuous estimation
Sometimes, users pass in samples from discrete distributions or mixtures of discrete and continuous distributions. In these cases, it’s helpful to have heuristics to detect which data might be meant to be discrete and which is meant to be continuous. Right now, in Squiggle, we do this by using simple heuristics of repetition - if multiple samples are precisely the same, we assume they represent discrete information. It’s unclear if there are any great/better ways of doing this heuristically.
e. Multidimensional KDE
Eventually, it will be useful to do multidimensional KDE. It might be more effective to do this in WebAssembly, but this would of course, introduce complications.
2. Quantiles to Distributions, Maybe with Metalog
A frequent use case is: “I have a few quantile/CDF points in mind and want to fit this to a distribution. How should I do this?”
One option is to use the Metalog distribution. There’s no great existing Javascript implementation of Metalog yet. Sam Nolan made one attempt, but it’s not as flexible as we’d like. (It fails to convert many points into metalog distributions).
Jonas Moss thinks we can do better than Metalog for most use cases. He wrote the post, Deriving distributions from quantiles, about other potential techniques.
Fitting for Metaculus
Metaculus uses mixtures of logistic distributions as their input method. It would be useful to convert quantile points into this format.
3. “Point Set” Formats
One of the main features of Squiggle is its support for what we call point set distribution formats. This means these distributions are stored as x-y coordinate pairs instead of symbolic representations or lists of random samples. This is useful for several operations (scoring and aggregation, for example) and for visualization. See more discussion here.
The trouble is that these formats are complicated, and we don’t yet have strong implementations in any language, especially Javascript. This seems like a particularly under-explored area.
Some properties that would be useful to have include:
a. Interpolation Methods
Squiggle currently uses linear interpolation between continuous XY points to approximate distribution shapes. It would be of course, better to do something with curves. However, it’s important to maintain the key properties of probability distributions (no negative probability mass). Ideally, a good approximation could handle varied distributions, from Gaussian to Uniform distributions. It’s possible that splines or similar will be useful here.
b. Continuous + Discrete Mixtures
Handle combinations of continuous and discrete data. Store each with X-Y values. (Squiggle currently does this.)
c. Arbitrary Discrete Distributions
Storing discrete values with X-Y pairs doesn’t work when there’s a very large set of potential X values. For example, a discrete distribution for “What will the US population be in 2070?” might require tens of millions of entries - one for each possibility of the discrete number of people. For these cases, ideally, there could be some limited functional form.
d. Function Tails
Distribution tails are important to approximate but might require specialized methods other than xy point interpolation. The tails of distributions might be best approximated as simple functions. Thus, these formats would include both lists of X-Y points plus additional information about the tail shapes.
e. Domain Limits
One pseudo-alternative to functional tails is to bind all distributions between a MIN_VALUE and a MAX_VALUE. Javascript already has min and max float values, so these could apply. Then, approximations of x-y points up to these endpoints could be used.
This solution would be a bit tricky to make fully functional, but it might be a decent enough hack for simple, improved tails.
f. Multivariable Shapes
The point set formats described above apply to single distributions, but we also want similar formats extending to multiple variables. One simple example would be a distribution that varies within a specific time range. If we could approximate these functions as shapes using coordinates, we could better score/aggregate/visualize them. Performance would be very important for this.
g. Key Mathematic Operations
One might want to perform many mathematical operations on point set distributions. Right now, Squiggle has a simple library for performing these - but this library is not well-optimized or tested. It would be great to have a very robust Javascript library here. Many of the existing Squiggle numeric bugs come from pointset operations. Well-tested convolution code would be valuable.
h. More Thorough Ontologies
The point set shape above could sometimes be a fit for PDFs, sometimes for CDFs, and sometimes for other shapes. When users do certain operations on PDFs, they quickly become more generic functions, but we might want to still store them in a format like point sets. How to best categorize these various types of formatted functions is unclear.
It would be good to have a library that treats each various shape as a certain mathematical object with corresponding valid properties.
It’s not clear what categorization/organization system to apply. Perhaps there are clever ideas from mathematics or category theory to help here. Category theory brought a well-grounded ontology to functional programming; we might be able to use similar in this area.
4. Scoring Heuristics for General Point Set Distributions
Scoring on point set distributions can be difficult if estimators use different approximations. Some known challenges include:
a. Tails with Zero Probability
One frequent challenge we faced with previous distribution scoring is that many users would submit too narrow distributions. With point set formats, it was very easy to approximate the tails as having zero probability mass. This would quickly result in a score of negative infinity.
It would be good to figure out better ways of dealing with this. The cleanest is with a better tail format, as discussed above. But even this is likely not enough, especially for forecasters new to using distributions. We might well desire other approximations or heuristics here.
b. Discrete/Continuous Mixtures
It’s not clear how to best score discrete and continuous mixtures. Scoring algorithms typically apply to only discrete or only continuous data.
One particular complication is when some users submit only continuous forecasts. For example, say there’s a question, “How much will an upcoming Android phone cost?”. Some users might take the time to write a discrete distribution with probability mass at exactly “749.98”, and some might use a continuous distribution. One potential solution is to have the question writer specify specific x-values that can and should get discrete probability mass. Still, it could be nice to have heuristics for situations without this. Another potential solution would be to add some noise.
5. D3 Scale Axes
We’re using D3 to display Squiggle distributions. One challenge is that D3 scales are fairly limited. The Log scale doesn’t work for values below zero; this is problematic because many user-generated distributions are either negative or have negative and positive mass.
The Symlog scale is an alternative, but it can have trouble showing tick marks and require tuning the constant parameter. Some exploration of this is here.
It would be useful to have better D3 scales for both situations.
6. Forced Distribution Correlations
There’s a common use case of wanting to force one distribution of samples to be manually correlated with another. However, there are many ways of doing this; it’s unclear what’s best. It probably makes sense to have several options eventually, but it would be good to have one or two robust recommended options.
Tom Adamczewski has recommended working with Copulas for this, here.
7. Language Model Training
As discussed here, there has not yet been any custom training of language models with Squiggle or other probabilistic programming languages in JS (that we know of). This could make for some neat projects and could be fairly straightforward.
8. Distribution-Specific Tasks
a. Beta Distribution Fitting
It would be useful to have a simple function that takes in two specific percentiles (the 5th and 95th are fine) and fits them to a Beta distribution. This likely requires some search method. Nuño Sempere made one attempt here but could use more testing and polish.
b. Extra Distributions
A few distributions don’t seem to be supported in Javascript yet. The main libraries we’ve found that support (some) distributions are JStat and StdLib. Specifically, we’ve been asked to support Skew Normal, Student T, and Power Law distributions.
9. Bonus: Relative Value Functions
Relative value functions formats seem useful, but they probably could use a good amount more theory and experimentation.
Caching Value Ratios
Calculating the ratio distributions between any two items to compare is expensive. We want fast and efficient methods of caching and approximating these. The brute-force method calculates and stores every comparison - which would be ~N^2 comparisons. This becomes infeasible with over a few hundred items. It would be preferable to use heuristics to determine which specific comparisons are the most important to calculate and then use results to approximate other comparisons.
Agustín Covarrubias @ 2023-08-31T18:56 (+3)
There’s a common use case of wanting to force one distribution of samples to be manually correlated with another. However, there are many ways of doing this; it’s unclear what’s best. It probably makes sense to have several options eventually, but it would be good to have one or two robust recommended options.
I recently implemented this for squigglepy, but using the (somewhat) simpler Iman-Conover Method instead of copulas. Copulas are more flexible, but I imagine IMC would be much easier to implement for Squiggle.
If anyone wants to implement something similar for Squiggle, I would love to help in any way.
geoffrey @ 2023-08-26T17:56 (+1)
Hi Ozzie, what’s the ask and intended audience here?
The problems here seem interesting and maybe even approachable to a dedicated newbie. So I was wondering if my background was any use. I used to work in software engineering with JavaScript and now work in data cleaning research assistance with R.
But I can’t tell if this is an open call to submit pull requests to an open source library on GitHub, a request for advice, or a request for someone to work part-time / full-time
Ozzie Gooen @ 2023-08-26T18:57 (+3)
Good questions!
I'm imagining hobbyists or people interested in taking on side projects. People interested in getting more involved in this area. If someone is an unusually great fit, for a fairly small project, we might be able to provide funding.
Almost all of this could be done outside our codebase. Just make an NPM package, or even just get something working in ObservableJS or similar. Post to the EA Forum, or just tell us.
I think of these as similar to other open EA-related lists of problems. In our case, we can use many of these solutions in Squiggle, but I think even if we don't, they might well be useful for other EA-related JS/probability projects in the future.
I think the main constraints, for people with some analytical background, is enthusiasm and time. Many of these could be attacked by someone clever and sufficiently motivated - but it would take some time (especially for polish and testing, if you can do that part).
Again, if you might seriously be interested, I'd be happy to chat. We have a related Slack that could be a good fit, or we could do a video chat - let me know which might interest you.