NunoSempere
308eecba98

15 hours ago  

examples  15 hours ago  
references  2 months ago  
scratchpad  15 hours ago  
test  4 days ago  
LICENSE.txt  2 months ago  
README.md  15 hours ago  
core.png  15 hours ago  
makefile  2 weeks ago  
squiggle.c  15 hours ago  
squiggle.h  4 days ago 
README.md
squiggle.c
squiggle.c is a selfcontained C99 library that provides functions for simple Monte Carlo estimation, based on Squiggle.
Why C?
 Because it is fast
 Because I enjoy it
 Because C is honest
 Because it will last long
 Because it can fit in my head
 Because if you can implement something in C, you can implement it anywhere else
 Because it can be made faster if need be
 e.g., with a multithreading library like OpenMP,
 o by implementing faster but more complex algorithms
 or more simply, by inlining the sampling functions (adding an
inline
directive before their function declaration)
 Because there are few abstractions between it and machine code (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.
Getting started
You can follow some example usage in the examples/ folder
 In the 1st example, we define a small model, and draw one sample from it
 In the 2nd example, we define a small model, and return many samples
 In the 3rd example, we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way.
 In the 4th example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the builtin samplers, e.g., the normal sampler.
 In the 5th example, we define the cdf for the beta distribution, and we draw samples from it.
 In the 6th example, we take samples from simple gamma and beta distributions, using the samplers provided by this library.
 In the 7th example, we get the 90% confidence interval of a beta distribution
 The 8th example translates the models from Eli and Nuño from Samotsvety Nuclear Risk Forecasts — March 2022 into squiggle.c, then creates a mixture from both, and returns the mean probability of death per month and the 90% confidence interval.
 The 9th example estimates how many minutes per day I would have to jump rope in order to lose 10kg of fat in half a year.
Commentary
squiggle.c is short
squiggle.c is less than 600 lines of C, with a core of <250 lines. The reader could just read it and grasp its contents.
Core strategy
This library provides some basic building blocks. The recommended strategy is to:
 Define sampler functions, which take a seed, and return 1 sample
 Compose those sampler functions to define your estimation model
 At the end, call the last sampler function many times to generate many samples from your model
Cdf auxiliary functions
To help with the above core strategy, this library provides convenience functions, which take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
Nested functions and compilation with tcc.
GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.
GCC  tcc 

slower compilation  faster compilation 
allows nested functions  doesn't allow nested functions 
faster execution  slower execution 
My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.
Error propagation vs exiting on error
The process of taking a cdf and returning a sample might fail, e.g., it's a Newton method which might fail to converge because of cdf artifacts. The cdf itself might also fail, e.g., if a distribution only accepts a range of parameters, but is fed parameters outside that range.
This library provides two approaches:
 Print the line and function in which the error occured, then exit on error
 In situations where there might be an error, return a struct containing either the correct value or an error message:
struct box {
int empty;
double content;
char* error_msg;
};
The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programmes, but is more verbose.
Behaviour on error can be toggled by the EXIT_ON_ERROR
variable. This library also provides a convenient macro, PROCESS_ERROR
, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
Overall, I'd describe the error handling capabilities of this library as pretty rudimentary. For example, this program might fail in surprising ways if you ask for a lognormal with negative standard deviation, because I haven't added error checking for that case yet.
Guarantees and licensing
 I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim or Rust.
 This project mostly exists for my own usage & for my own amusement.
 Caution! Think carefully before using this project for anything important
 If you wanted to pay me to provide some stability or correctness, guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so here. Although this theoretical possibility exists, I don't I don't anticipate that this would be a good idea on most cases.
This project is released under the MIT license, a permissive opensource license. You can see it in the LICENSE.txt file.
Design choices
This code should aim to be correct, then simple, then fast.
 It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
 Nonetheless, the user should understand the limitations of samplingbased methods. See the section on Tests and the long tail of the lognormal for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
 It should be clear, conceptually simple. Simple for me to implement, simple for others to understand.
 It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be way faster than Python.
Note that being terse, or avoiding verbosity, is a nongoal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
Correlated samples
In the original squiggle language, there is some ambiguity about what this code means:
a = 1 to 10
b = 2 * a
c = b/a
c
Likewise in squigglepy:
import squigglepy as sq
import numpy as np
a = sq.to(1, 3)
b = 2 * a
c = b / a
c_samples = sq.sample(c, 10)
print(c_samples)
Should c
be equal to 2
? or should it be equal to 2 times the expected distribution of the ratio of two independent draws from a (2 * a/a
, as it were)?
In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:
// correlated samples
// gcc O3 correlated.c squiggle.c lm o correlated
#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
int main(){
// set randomness seed
uint64_t* seed = malloc(sizeof(uint64_t));
*seed = 1000; // xorshift can't start with a seed of 0
double a = sample_to(1, 10, seed);
double b = 2 * a;
double c = b / a;
printf("a: %f, b: %f, c: %f\n", a, b, c);
// a: 0.607162, b: 1.214325, c: 0.500000
free(seed);
}
vs
// uncorrelated samples
// gcc O3 uncorrelated.c ../../squiggle.c lm o uncorrelated
#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
double draw_xyz(uint64_t* seed){
// function could also be placed inside main with gcc nested functions extension.
return sample_to(1, 20, seed);
}
int main(){
// set randomness seed
uint64_t* seed = malloc(sizeof(uint64_t));
*seed = 1000; // xorshift can't start with a seed of 0
double a = draw_xyz(seed);
double b = 2 * draw_xyz(seed);
double c = b / a;
printf("a: %f, b: %f, c: %f\n", a, b, c);
// a: 0.522484, b: 10.283501, c: 19.681936
free(seed)
}
Tests and the long tail of the lognormal
Distribution functions can be tested with:
cd tests
make && make run
make verify
is an alias that runs all the tests and just displays the ones that are failing.
These tests are somewhat rudimentary: they get between 1M and 10M samples from a given sampling function, and check that their mean and standard deviations correspond to what they should theoretically should be.
If you run make run
(or make verify
), you will see errors such as these:
[] Mean test for normal(47211.047473, 682197.019012) NOT passed.
Mean of normal(47211.047473, 682197.019012): 46933.673278, vs expected mean: 47211.047473
delta: 277.374195, relative delta: 0.005910
[] Std test for lognormal(4.584666, 2.180816) NOT passed.
Std of lognormal(4.584666, 2.180816): 11443.588861, vs expected std: 11342.434900
delta: 101.153961, relative delta: 0.008839
[] Std test for to(13839.861856, 897828.354318) NOT passed.
Std of to(13839.861856, 897828.354318): 495123.630575, vs expected std: 498075.002499
delta: 2951.371925, relative delta: 0.005961
These tests I wouldn't worry about. Due to luck of the draw, their relative error is a bit over 0.005, or 0.5%, and so the test fails. But it would surprise me if that had some meaningful practical implication.
The errors that should raise some worry are:
[] Mean test for lognormal(1.210013, 4.766882) NOT passed.
Mean of lognormal(1.210013, 4.766882): 342337.257677, vs expected mean: 288253.061628
delta: 54084.196049, relative delta: 0.157985
[] Std test for lognormal(1.210013, 4.766882) NOT passed.
Std of lognormal(1.210013, 4.766882): 208107782.972184, vs expected std: 24776840217.604111
delta: 24568732434.631927, relative delta: 118.057730
[] Mean test for lognormal(0.195240, 4.883106) NOT passed.
Mean of lognormal(0.195240, 4.883106): 87151.733198, vs expected mean: 123886.818303
delta: 36735.085104, relative delta: 0.421507
[] Std test for lognormal(0.195240, 4.883106) NOT passed.
Std of lognormal(0.195240, 4.883106): 33837426.331671, vs expected std: 18657000192.914921
delta: 18623162766.583248, relative delta: 550.371727
[] Mean test for lognormal(0.644931, 4.795860) NOT passed.
Mean of lognormal(0.644931, 4.795860): 125053.904456, vs expected mean: 188163.894101
delta: 63109.989645, relative delta: 0.504662
[] Std test for lognormal(0.644931, 4.795860) NOT passed.
Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298706.170452
delta: 18537322405.459286, relative delta: 463.707799
What is happening in this case is that you are taking a normal, like normal(0.195240, 4.883106)
, and you are exponentiating it to arrive at a lognormal. But normal(0.195240, 4.883106)
is going to have some noninsignificant weight on, say, 18. But exp(18) = 39976300
, and points like it are going to end up a nontrivial amount to the analytical mean and standard deviation, even though they have little probability mass.
The reader can also check that for more plausible realworld values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't eggregious:
[x] Mean test for to(10.000000, 10000.000000) PASSED
[] Std test for to(10.000000, 10000.000000) NOT passed.
Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: 2258.290043, relative delta: 0.095779
Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle, Simple Squiggle (both linked below), or even doing lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape.
In fact, squiggle.c does have a few functions for algebraic manipulations of simple distributions at the end of squiggle.c. But these are pretty rudimentary, and I don't know whether I'll end up expanding or deleting them.
Results of running clangtidy
clangtidy is a utility to detect common errors in C/C++. You can run it with:
make tidy
It emits one warning about something I already took care of, so by default I've suppressed it. I think this is good news in terms of making me more confident that this simple library is correct :).
Related projects
To do list
 Think about whether to write a simple version of this for uxn, a minimalistic portable programming stack which, sadly, doesn't have doubles (64 bit floats)
 Point out that, even though the C standard is ambiguous about this, this code assumes that doubles are 64 bit precision (otherwise the xorshift should be different).
 Document rudimentary algebra manipulations for normal/lognormal
 Think through whether to delete cdf => samples function
 Think through whether to:
 simplify and just abort on error
 complexify and use boxes for everything
 leave as is
 Systematize references
 Publish online
 Support all distribution functions in https://www.squigglelanguage.com/docs/Api/Dist
 do so efficiently
 Add more functions to do algebra and get the 90% c.i. of normals, lognormals, betas, etc.
 Think through which of these make sense.
Done
 Add example for only one sample
 Add example for many samples
Add a custom preprocessor to allow simple nested functions that don't rely on local scope? Use gcc extension to define functions nested inside main.
 Chain various
sample_mixture
functions  Add beta distribution
Use OpenMP for acceleration Add function to get sample when given a cdf
 Don't have a single header file.
 Structure project a bit better
 Simplify
PROCESS_ERROR
macro  Add README
 Schema: a function which takes a sample and manipulates it,
 and at the end, an array of samples.
 Explain boxes
 Explain nested functions
 Explain exit on error
 Explain individual examples
 Rename functions to something more selfexplanatory, e.g,.
sample_unit_normal
.  Add summarization functions: mean, std
 Add sampling from a gamma distribution
 Explain correlated samples
Add tests in Stan? Test summary statistics for each of the distributions.
 For uniform
 For normal
 For lognormal
 For lognormal (to syntax)
 For beta distribution
 Clarify gamma/standard gamma
 Add efficient sampling from a beta distribution
 Pontificate about lognormal tests
 Give warning about samplingbased methods.
 Have some more complicated & realistic example
 Add summarization functions: 90% ci (or all c.i.?)
 Link to the examples in the examples section.
 Add a few functions for doing simple algebra on normals, and lognormals
 Add prototypes
 Use named structs
 Add to header file
 Provide example algebra
 Add conversion between 90% ci and parameters.
 Use that conversion in conjuction with small algebra.
 Consider ergonomics of using ci instead of c_i
 use named struct instead
 demonstrate and document feeding a struct directly to a function; my_function((struct c_i){.low = 1, .high = 2});
 Consider desirability of defining shortcuts for those functions. Adds a level of magic, though.
 Test results
 Move to own file? Or signpost in file? => signposted in file.
 Disambiguate sample_laplacesuccesses vs failures  successes vs total trials as two distinct and differently named functions
 Write twitter thread.