A simple, self-contained C99 library for judgmental estimation, but improved.

NunoSempere 1d89eb6231 formatting pass, upkeep		2024-01-20 14:30:20 +01:00

squiggle.c

squiggle.c is a grug-brained self-contained C99 library that provides functions for simple Monte Carlo estimation, inspired by Squiggle.

Why C?

Because it is fast
Because I enjoy it
Because C is honest
Because it will last long
Because it can fit in my head
Because if you can implement something in C, you can implement it anywhere else
Because it can be made faster if need be
- e.g., with a multi-threading library like OpenMP,
- or by implementing faster but more complex algorithms
- or more simply, by inlining the sampling functions (adding an inline directive before their function declaration)
Because there are few abstractions between it and machine code (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.

Getting started

You can follow some example usage in the examples/ folder. In examples/core, we build up some functionality, starting from drawing one sample. In examples/more, we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.

Commentary

squiggle.c is short

squiggle.c is less than 700 lines of C, with a core of <230 lines. The reader could just read it and grasp its contents, and is encouraged to do so.

Core strategy

This library provides some basic building blocks. The recommended strategy is to:

Define sampler functions, which take a seed, and return 1 sample
Compose those sampler functions to define your estimation model
At the end, call the last sampler function many times to generate many samples from your model
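
The three steps above can be sketched as follows. This is a minimal, self-contained illustration that does not use the library: `next_u64`, `draw_uniform`, the two toy samplers and `mean_of_model` are hypothetical stand-ins, and the xorshift shift constants are an assumption rather than a copy of what squiggle.c does.

```c
#include <stdint.h>

// Stand-in RNG: Marsaglia's xorshift64 with the (13, 7, 17) shift triple.
// The seed must be nonzero.
uint64_t next_u64(uint64_t* seed)
{
    uint64_t x = *seed;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *seed = x;
}

// Stand-in sampler: one uniform draw in [low, high).
double draw_uniform(double low, double high, uint64_t* seed)
{
    double u = (double)(next_u64(seed) >> 11) / 9007199254740992.0; // divide by 2^53
    return low + (high - low) * u;
}

// Step 1: sampler functions take a seed and return one sample.
double draw_hours_per_task(uint64_t* seed) { return draw_uniform(1.0, 3.0, seed); }
double draw_number_of_tasks(uint64_t* seed) { return draw_uniform(10.0, 20.0, seed); }

// Step 2: compose samplers into a model, which is itself a sampler.
double draw_total_hours(uint64_t* seed)
{
    return draw_hours_per_task(seed) * draw_number_of_tasks(seed);
}

// Step 3: call the last sampler many times, here to estimate the mean.
double mean_of_model(int n_samples, uint64_t seed_value)
{
    uint64_t seed = seed_value;
    double sum = 0.0;
    for (int i = 0; i < n_samples; i++) {
        sum += draw_total_hours(&seed);
    }
    return sum / n_samples;
}
```

Because the two inputs are independent draws, the mean of this toy model should land near 2 * 15 = 30, and calling `mean_of_model` twice with the same seed gives the same answer.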

Nested functions and compilation with tcc.

GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.

GCC	tcc
slower compilation	faster compilation
allows nested functions	doesn't allow nested functions
faster execution	slower execution

~~My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.~~

My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout. Keep in mind, though, that modifying code to not use nested functions is relatively easy, so you can do that if you need to run in other environments.

Guarantees and licensing

The motte:

I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim or Rust.
This project mostly exists for my own usage & for my own amusement.
Caution! Think carefully before using this project for anything important.
If you wanted to pay me to provide some stability or correctness guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so here.
I am conflicted about parallelism. It does add more complexity, complexity that you can be bitten by if you are not careful and don't understand it. And this conflicts with the initial grug-brain motivation. At the same time, it is clever, and it is nice, and I like it a lot.

The bailey:

I've been hacking at this project for a while now, and I think I have a good grasp of its correctness and limitations. I've tried Nim and Zig, and I prefer C so far.
I think the core interface is not likely to change much, though I've recently changed the interface for parallelism and for getting confidence intervals.
I am using this code for a few important consulting projects, and I trust myself to operate it correctly.

This project is released under the MIT license, a permissive open-source license. You can see it in the LICENSE.txt file.

Design choices

This code should aim to be correct, then simple, then fast.

It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- Nonetheless, the user should understand the limitations of sampling-based methods. See the section on Tests and the long tail of the lognormal for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
It should be clear, conceptually simple. Simple for me to implement, simple for others to understand.
It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be way faster than, say, Python.

Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.

Correlated samples

In the original squiggle language, there is some ambiguity about what this code means:

a = 1 to 10
b = 2 * a
c = b/a
c

Likewise in squigglepy:

import squigglepy as sq
import numpy as np

a = sq.to(1, 3)
b = 2 * a  
c = b / a 

c_samples = sq.sample(c, 10)

print(c_samples)

Should c be equal to 2? Or should it be equal to 2 times the distribution of the ratio of two independent draws from a (2 * a/a, as it were)? You don't know, because you are not operating on samples, you are operating on magical objects whose internals are hidden from you.

In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:

// correlated samples
// gcc -O3  correlated.c squiggle.c -lm -o correlated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = sample_to(1, 10, seed);
    double b = 2 * a;
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.607162, b: 1.214325, c: 2.000000

    free(seed);
}

// uncorrelated samples
// gcc -O3    uncorrelated.c ../../squiggle.c -lm -o uncorrelated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

double draw_xyz(uint64_t* seed){
    // function could also be placed inside main with gcc nested functions extension.
    return sample_to(1, 20, seed);
}


int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = draw_xyz(seed);
    double b = 2 * draw_xyz(seed);
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.522484, b: 10.283501, c: 19.681936

    free(seed);
}

Exercise for the reader: What possible meanings could the following represent in squiggle? How would you implement each of those meanings in squiggle.c?

min(normal(0, 1))

Hint: See examples/more/13_parallelize_min

Note on sampling strategies

Right now, I am drawing samples using a random number generator. It requires some finesse, particularly when using parallelism. But it works fine.
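
As an illustration of the kind of generator involved, here is a minimal sketch of a xorshift64 step and of turning its output into a number in [0, 1). The shift triple (13, 7, 17) is Marsaglia's and is an assumption on my part; the function names are hypothetical and the library's own code may differ.

```c
#include <stdint.h>

// One step of Marsaglia's xorshift64 with the (13, 7, 17) shift triple.
// Shifting and xor-ing zero only ever produces zero, so a seed of 0 stays stuck.
uint64_t xorshift64_sketch(uint64_t* seed)
{
    uint64_t x = *seed;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *seed = x;
    return x;
}

// Map the top 53 bits of a draw to a double in [0, 1).
double sample_unit_uniform_sketch(uint64_t* seed)
{
    return (double)(xorshift64_sketch(seed) >> 11) / 9007199254740992.0; // divide by 2^53
}
```

Two generators started from the same seed produce the same stream, which is part of why seeds need care once several threads are involved.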

But..., what if we could do something more elegant, more ingenious? In particular, what if instead of drawing samples, we had a mesh of equally spaced points in the range of floats? Then we could, for a given number of samples, better estimate the, say, mean of the distribution we are trying to model...

The problem with that is that if we have some code like:

double model(...){
  double a = sample_to(1, 10, i_mesh++);
  double b = sample_to(1, 2, i_mesh);
  return a * b;
}

Then this doesn't work, because the values of a and b will be correlated: when a is high, b will also be high. What might work would be something like this:

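double* model(int n_samples){
  // assumes n_samples is a perfect square; needs <math.h> for sqrt
  int n_side = (int)sqrt((double)n_samples);
  double* xs = malloc((size_t)n_samples * sizeof(double));
  for(int i_mesh=0; i_mesh < n_side; i_mesh++){
      for(int j_mesh=0; j_mesh < n_side; j_mesh++){
          double a = sample_to(1, 10, i_mesh);
          double b = sample_to(1, 2, j_mesh);
          xs[i_mesh * n_side + j_mesh] = a * b; // store the result for this pair of mesh points
      }
  }
  return xs;
}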

But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not grug-brained. So every now and then I have to remember that this is not the way.
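
To make the correlation problem concrete, here is a small self-contained sketch. It swaps the lognormal `sample_to` for a uniform(0, 1) so that the exact answer is known: for two independent uniform(0, 1) inputs, the mean of the product is 1/4. `mesh_uniform` and the two `mean_product_*` functions are hypothetical names, not part of the library.

```c
// Hypothetical mesh "sampler": the midpoint of cell i out of n equal cells,
// pushed through the inverse cdf of a uniform(low, high).
double mesh_uniform(double low, double high, int i, int n)
{
    double p = ((double)i + 0.5) / (double)n;
    return low + (high - low) * p;
}

// Naive version: a and b share one mesh index, so they move together.
double mean_product_shared_index(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += mesh_uniform(0.0, 1.0, i, n) * mesh_uniform(0.0, 1.0, i, n);
    }
    return sum / n;
}

// Grid version: every pair (i, j) is visited, at the cost of n * n evaluations
// and of hard-coding the number of random inputs into the loop structure.
double mean_product_grid(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            sum += mesh_uniform(0.0, 1.0, i, n) * mesh_uniform(0.0, 1.0, j, n);
        }
    }
    return sum / ((double)n * (double)n);
}
```

With n = 100, the shared-index version converges towards 1/3 (the mean of a squared uniform), while the grid version recovers 1/4. Each extra random input in a model would need one more nested loop.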

Tests and the long tail of the lognormal

Distribution functions can be tested with:

cd test
make && make run

make verify is an alias that runs all the tests and just displays the ones that are failing.

These tests are somewhat rudimentary: they get between 1M and 10M samples from a given sampling function, and check that their mean and standard deviation correspond to what they theoretically should be.

If you run make run (or make verify), you will see errors such as these:

[-] Mean test for normal(47211.047473, 682197.019012) NOT passed.
Mean of normal(47211.047473, 682197.019012): 46933.673278, vs expected mean: 47211.047473
delta: -277.374195, relative delta: -0.005910

[-] Std test for lognormal(4.584666, 2.180816) NOT passed.
Std of lognormal(4.584666, 2.180816): 11443.588861, vs expected std: 11342.434900
delta: 101.153961, relative delta: 0.008839

[-] Std test for to(13839.861856, 897828.354318) NOT passed.
Std of to(13839.861856, 897828.354318): 495123.630575, vs expected std: 498075.002499
delta: -2951.371925, relative delta: -0.005961

These tests I wouldn't worry about. Due to luck of the draw, their relative error is a bit over 0.005, or 0.5%, and so the test fails. But it would surprise me if that had some meaningful practical implication.
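
The pass/fail rule can be sketched as below. This is a reconstruction from the printed figures rather than a copy of the test code: in the output above, the relative delta matches the delta divided by the observed (sampled) value, so that is what the sketch assumes. The function names are hypothetical.

```c
// Returns 1 if the observed statistic is within max_relative_delta of the
// expected one, and 0 otherwise. The delta is taken relative to the observed
// value, which is an assumption inferred from the printed test output.
int is_within_tolerance(double observed, double expected, double max_relative_delta)
{
    double delta = observed - expected;
    double relative_delta = delta / observed;
    double magnitude = relative_delta < 0 ? -relative_delta : relative_delta;
    return magnitude < max_relative_delta;
}

// Mean of an array of samples, the statistic fed to the check above.
double array_mean(const double* xs, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / n;
}
```

For instance, `is_within_tolerance(46933.673278, 47211.047473, 0.005)` returns 0, matching the first failing normal test, whose relative delta of about -0.0059 is just over the threshold.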

The errors that should raise some worry are:

[-] Mean test for lognormal(1.210013, 4.766882) NOT passed.
Mean of lognormal(1.210013, 4.766882): 342337.257677, vs expected mean: 288253.061628
delta: 54084.196049, relative delta: 0.157985
[-] Std test for lognormal(1.210013, 4.766882) NOT passed.
Std of lognormal(1.210013, 4.766882): 208107782.972184, vs expected std: 24776840217.604111
delta: -24568732434.631927, relative delta: -118.057730

[-] Mean test for lognormal(-0.195240, 4.883106) NOT passed.
Mean of lognormal(-0.195240, 4.883106): 87151.733198, vs expected mean: 123886.818303
delta: -36735.085104, relative delta: -0.421507
[-] Std test for lognormal(-0.195240, 4.883106) NOT passed.
Std of lognormal(-0.195240, 4.883106): 33837426.331671, vs expected std: 18657000192.914921
delta: -18623162766.583248, relative delta: -550.371727

[-] Mean test for lognormal(0.644931, 4.795860) NOT passed.
Mean of lognormal(0.644931, 4.795860): 125053.904456, vs expected mean: 188163.894101
delta: -63109.989645, relative delta: -0.504662
[-] Std test for lognormal(0.644931, 4.795860) NOT passed.
Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298706.170452
delta: -18537322405.459286, relative delta: -463.707799

What is happening in this case is that you are taking a normal, like normal(-0.195240, 4.883106), and you are exponentiating it to arrive at a lognormal. But normal(-0.195240, 4.883106) is going to have some non-negligible weight on, say, 18. But exp(18) ≈ 65659969, and points like it are going to end up contributing a nontrivial amount to the analytical mean and standard deviation, even though they have little probability mass.
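
For reference, these are the standard closed-form moments of a lognormal whose underlying normal has mean μ and standard deviation σ. They are textbook results rather than anything specific to this library, and the worked numbers are rounded.

```latex
\[
\mathbb{E}[X] = \exp\!\left(\mu + \tfrac{\sigma^2}{2}\right),
\qquad
\operatorname{Std}[X] = \mathbb{E}[X]\,\sqrt{\exp(\sigma^2) - 1}
\]
% Worked example for lognormal(1.210013, 4.766882):
\[
\mathbb{E}[X] \approx \exp(1.210 + 11.362) \approx 2.88 \times 10^{5},
\qquad
\operatorname{Std}[X] \approx 2.88 \times 10^{5} \times 8.60 \times 10^{4} \approx 2.48 \times 10^{10}
\]
```

These agree with the expected mean and std printed by the test. The standard deviation is roughly 86,000 times the mean, so even with 10M samples the nominal standard error of the sample mean (std divided by the square root of the number of samples, about 7.8 × 10^6) is still far larger than the mean itself.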

The reader can also check that for more plausible real-world values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't egregious:

[x] Mean test for to(10.000000, 10000.000000) PASSED
[-] Std test for to(10.000000, 10000.000000) NOT passed.
Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: -2258.290043, relative delta: -0.095779

Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle, Simple Squiggle (both linked below), or even doing lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape.
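
The "known shape" is the following standard result, which holds when the two lognormals are independent:

```latex
\[
\ln X \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad
\ln Y \sim \mathcal{N}(\mu_2, \sigma_2^2)
\quad\Longrightarrow\quad
\ln(XY) \sim \mathcal{N}\!\left(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2\right)
\]
```

So the product is a lognormal whose underlying normal has mean μ1 + μ2 and standard deviation sqrt(σ1² + σ2²), with no sampling needed.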

In fact, squiggle.c does have a few functions for algebraic manipulations of simple distributions at the end of squiggle.c. But these are pretty rudimentary, and I don't know whether I'll end up expanding or deleting them.

Compiler warnings

Harsh compilation

By default, I've enabled -Wall -Wextra -Wdouble-promotion -Wconversion. However, these produce some false positive warnings, which I've dealt with through:

For conversion: Explicit casts, particularly from int to size_t when calling malloc.
For dealing with unused variables: Using an UNUSED macro. If you don't like that approach, you could add -Wno-unused-parameter to your makefile and remove the macro and its usage.
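
A minimal sketch of both workarounds follows. The `UNUSED` definition shown is a common idiom and an assumption on my part, since the library's own macro may be written differently; `sample_constant_one` and `allocate_samples` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical definition; the library's own UNUSED macro may differ.
#define UNUSED(x) (void)(x)

// A constant "sampler" that keeps the usual signature but ignores its seed.
double sample_constant_one(uint64_t* seed)
{
    UNUSED(seed); // silences -Wunused-parameter, which -Wextra turns on
    return 1.0;
}

// Allocate room for n_samples doubles. The explicit cast keeps -Wconversion
// quiet about the int => size_t conversion. The caller frees the result.
double* allocate_samples(int n_samples)
{
    if (n_samples <= 0) {
        return NULL;
    }
    return malloc((size_t)n_samples * sizeof(double));
}
```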

Some resources on compiler flags: 1, 2

Results of running clang-tidy

clang-tidy is a utility to detect common errors in C/C++. You can run it with:

make tidy

So far in the history of this program it has emitted:

One false-positive warning about an issue I'd already taken care of (so I've suppressed the warning)
a warning about an unused variable

I think this is good news in terms of making me more confident that this simple library is correct :).

Division between core functions and squiggle_more expansions

This library differentiates between core functions, which are pretty tightly scoped, and expansions and convenience functions, which are more meandering. Expansions are in squiggle_more.c and squiggle_more.h. To use them, take care to link them:

// In your C source file
#include "squiggle_more.h"

# When compiling:
gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example

Extra: confidence intervals

squiggle_more.c has some helper functions to get confidence intervals. They are in squiggle_more.c because I'm still mulling over what their shape should be, and because until recently they were pretty limited and sub-optimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main squiggle.c file.
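
One way to get percentiles without sorting the whole array is quickselect, which runs in expected linear time. The sketch below is a generic textbook version with a naive pivot choice, not the library's implementation; `ci_sketch` and `get_90_ci_sketch` are hypothetical names. It works on a scratch copy so that the caller's samples are not reordered.

```c
#include <stdlib.h>
#include <string.h>

// Lomuto partition around the last element; returns the pivot's final index.
static int partition(double* xs, int low, int high)
{
    double pivot = xs[high];
    int store = low;
    for (int i = low; i < high; i++) {
        if (xs[i] < pivot) {
            double tmp = xs[i]; xs[i] = xs[store]; xs[store] = tmp;
            store++;
        }
    }
    double tmp = xs[high]; xs[high] = xs[store]; xs[store] = tmp;
    return store;
}

// k-th smallest element (0-indexed), in expected O(n). Reorders xs in place.
static double quickselect(double* xs, int n, int k)
{
    int low = 0, high = n - 1;
    while (low < high) {
        int p = partition(xs, low, high);
        if (p == k) break;
        if (p < k) low = p + 1; else high = p - 1;
    }
    return xs[k];
}

typedef struct { double low; double high; } ci_sketch;

// 90% interval: drop the lowest and highest 5% of samples.
// Returns {0, 0} if n <= 0 or if the scratch allocation fails.
ci_sketch get_90_ci_sketch(const double* xs, int n)
{
    ci_sketch result = { 0.0, 0.0 };
    if (n <= 0) return result;
    double* scratch = malloc((size_t)n * sizeof(double));
    if (scratch == NULL) return result;
    memcpy(scratch, xs, (size_t)n * sizeof(double));
    int k = n / 20; // 5% of the samples
    result.low = quickselect(scratch, n, k);
    result.high = quickselect(scratch, n, n - 1 - k);
    free(scratch);
    return result;
}
```

With a last-element pivot, already-sorted input degrades to quadratic time; a random or median-of-three pivot is the usual remedy.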

Extra: parallelism

I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.

But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:

If you run the sampler_parallel function twice, you will get the same result. Why?
If you run the sampler_parallel function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
For a small number of samples, if you run the sampler_parallel function, you will get better spread out random numbers than if you run things serially. Why?

That said, I found adding parallelism to be an interesting and engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.
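
A common way to keep threads off each other's cache lines is to pad per-thread state to the size of a line. The sketch below assumes 64-byte cache lines, which is typical on x86-64 but not universal. The names and the seed-assignment rule are hypothetical, and the library's own layout may differ.

```c
#include <stdint.h>

// Assumption: 64-byte cache lines.
#define CACHE_LINE_SIZE_SKETCH 64

// One seed per thread, padded so that consecutive array elements are a full
// cache line apart and no two seeds can share a line.
typedef struct {
    uint64_t seed;
    char padding[CACHE_LINE_SIZE_SKETCH - sizeof(uint64_t)];
} padded_seed;

// Give each of n_threads slots its own seed. The rule used here is arbitrary
// and naive; it only avoids 0 as long as the sum below does not wrap around.
void init_padded_seeds(padded_seed* seeds, int n_threads, uint64_t base_seed)
{
    for (int i = 0; i < n_threads; i++) {
        seeds[i].seed = base_seed + (uint64_t)i + 1;
    }
}
```

Each thread would then only ever read and write `seeds[thread_id].seed`, so updating one thread's seed does not invalidate the cache line that holds another's.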

Extra: Algebraic manipulations

squiggle_more.c has some functions to do some simple algebra manipulations: sums of normals and products of lognormals. You can see some example usage here and here.

Extra: cdf auxiliary functions

I provide some auxiliary functions that take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
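
The underlying idea is inverse transform sampling: draw p uniformly in [0, 1), then find the x for which cdf(x) = p. The sketch below inverts the cdf by bisection, which is simpler than a Newton method but is not what the library uses. All names are hypothetical, and the toy cdf F(x) = x^2 on [0, 1] was chosen only so that the example needs no math library.

```c
#include <stdint.h>

// Example cdf with support on [0, 1]: F(x) = x^2, i.e. the density is 2x.
double cdf_example(double x)
{
    if (x <= 0.0) return 0.0;
    if (x >= 1.0) return 1.0;
    return x * x;
}

// Find x in [low, high] with cdf(x) close to p by bisection.
// Assumes cdf is nondecreasing and cdf(low) <= p <= cdf(high).
double inverse_cdf_bisection(double (*cdf)(double), double p, double low, double high)
{
    for (int i = 0; i < 60; i++) {
        double mid = 0.5 * (low + high);
        if (cdf(mid) < p) {
            low = mid;
        } else {
            high = mid;
        }
    }
    return 0.5 * (low + high);
}

// Draw p with a stand-in xorshift64 step (nonzero seed), then invert the cdf.
double sample_from_cdf_sketch(double (*cdf)(double), double low, double high, uint64_t* seed)
{
    uint64_t x = *seed;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *seed = x;
    double p = (double)(x >> 11) / 9007199254740992.0; // uniform in [0, 1)
    return inverse_cdf_bisection(cdf, p, low, high);
}
```

Each sample here costs 60 evaluations of the cdf, which gives a feel for why going through a cdf is much slower than a dedicated sampler.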

Extra: Error propagation vs exiting on error

The process of taking a cdf and returning a sample might fail, e.g., it's a Newton method which might fail to converge because of cdf artifacts. The cdf itself might also fail, e.g., if a distribution only accepts a range of parameters, but is fed parameters outside that range.

This library provides two approaches:

Print the line and function in which the error occurred, then exit on error
In situations where there might be an error, return a struct containing either the correct value or an error message:

struct box {
    int empty;
    double content;
    char* error_msg;
};

The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programs, but is more verbose.

Behaviour on error can be toggled by the EXIT_ON_ERROR variable. This library also provides a convenient macro, PROCESS_ERROR, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
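
As a rough illustration of the second approach, here is a function that returns a box and a caller that inspects it. The struct matches the one shown above; `cdf_uniform_box` and `unbox_or_default` are hypothetical helpers written for this sketch, and they are not the library's API or its PROCESS_ERROR macro.

```c
#include <stddef.h>

struct box {
    int empty;
    double content;
    char* error_msg;
};

// Hypothetical example: the cdf of a uniform(low, high), which refuses
// parameters outside its valid range instead of returning garbage.
struct box cdf_uniform_box(double low, double high, double x)
{
    struct box result = { 0, 0.0, NULL };
    if (low >= high) {
        result.empty = 1;
        result.error_msg = "cdf_uniform_box: low must be smaller than high";
        return result;
    }
    if (x <= low) {
        result.content = 0.0;
    } else if (x >= high) {
        result.content = 1.0;
    } else {
        result.content = (x - low) / (high - low);
    }
    return result;
}

// Caller-side handling: use the content if there is one, else a fallback.
double unbox_or_default(struct box b, double fallback)
{
    return b.empty ? fallback : b.content;
}
```

The verbosity mentioned above shows up at every call site: each caller has to check `empty` before touching `content`.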

Overall, I'd describe the error handling capabilities of this library as pretty rudimentary. For example, this program might fail in surprising ways if you ask for a lognormal with negative standard deviation, because I haven't added error checking for that case yet.

Other gotchas

Even though the C standard is ambiguous about this, this code assumes that doubles are 64 bit precision (otherwise the xorshift code should be different).
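
If you want the compiler to enforce this assumption, one option is a compile-time check. C99 has no static_assert, so the sketch below uses the negative-array-size trick. The typedef and function names are hypothetical and nothing like this is claimed to exist in the library.

```c
#include <float.h>
#include <limits.h>

// Compilation fails with a negative array size if either condition is false.
typedef char assumes_double_is_64_bits[(sizeof(double) * CHAR_BIT == 64) ? 1 : -1];
typedef char assumes_double_has_53_bit_mantissa[(DBL_MANT_DIG == 53) ? 1 : -1];

// The same check at run time, for use in tests.
int double_looks_like_binary64(void)
{
    return sizeof(double) * CHAR_BIT == 64 && DBL_MANT_DIG == 53;
}
```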

Roadmap

To do

Drive in a few more real-life applications
Look into using size_t instead of int for sample numbers

Done

Document rudimentary algebra manipulations for normal/lognormal
Think through whether to delete cdf => samples function => not for now
Think through whether to:
- simplify and just abort on error
- complexify and use boxes for everything
- leave as is
- Offer both options
Add more functions to do algebra and get the 90% c.i. of normals, lognormals, betas, etc.
- Think through which of these make sense.
Systematize references
Think through seed initialization
Document parallelism
Document confidence intervals
Add example for only one sample
Add example for many samples
Use gcc extension to define functions nested inside main.
Chain various sample_mixture functions
Add beta distribution
- See https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution for a faster method.
Use OpenMP for acceleration
Add function to get sample when given a cdf
Don't have a single header file.
Structure project a bit better
Simplify PROCESS_ERROR macro
Add README
- Schema: a function which takes a sample and manipulates it,
- and at the end, an array of samples.
- Explain boxes
- Explain nested functions
- Explain exit on error
- Explain individual examples
Rename functions to something more self-explanatory, e.g., sample_unit_normal.
Add summarization functions: mean, std
Add sampling from a gamma distribution
- https://dl.acm.org/doi/pdf/10.1145/358407.358414
Explain correlated samples
Test summary statistics for each of the distributions.
- For uniform
- For normal
- For lognormal
- For lognormal (to syntax)
- For beta distribution
Clarify gamma/standard gamma
Add efficient sampling from a beta distribution
- https://dl.acm.org/doi/10.1145/358407.358414
- https://link.springer.com/article/10.1007/bf02293108
- https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution
- 5cae51e794/numpy/random/src/distributions/distributions.c
Pontificate about lognormal tests
Give warning about sampling-based methods.
Have some more complicated & realistic example
Add summarization functions: 90% ci (or all c.i.?)
Link to the examples in the examples section.
Add a few functions for doing simple algebra on normals, and lognormals
- Add prototypes
- Use named structs
- Add to header file
- Provide example algebra
- Add conversion between 90% ci and parameters.
- Use that conversion in conjunction with small algebra.
- Consider ergonomics of using ci instead of c_i
  - use named struct instead
  - demonstrate and document feeding a struct directly to a function; my_function((struct c_i){.low = 1, .high = 2});
- Move to own file? Or signpost in file? => signposted in file.
Write twitter thread: now here; retweets appreciated.
Write better confidence interval code that:
- Gets number of samples as an input
- Gets either a sampler function or a list of samples
- is O(n), not O(nlog(n))
- Parallelizes stuff

Discarded

~~Disambiguate sample_laplace--successes vs failures || successes vs total trials as two distinct and differently named functions~~
~~Support all distribution functions in https://www.squiggle-language.com/docs/Api/Dist~~
~~Add a custom preprocessor to allow simple nested functions that don't rely on local scope?~~
~~Add tests in Stan?~~
~~Test results for lognormal manipulations~~
~~Consider desirability of defining shortcuts for algebra functions. Adds a level of magic, though.~~
~~Think about whether to write a simple version of this for uxn, a minimalist portable programming stack which, sadly, doesn't have doubles (64 bit floats)~~