A simple, self-contained C99 library for judgmental estimation, but improved.

Go to file

UhJdQRtAhN95rbscFCJtIA== b6b5418001 Initial folder refactor		2024-03-03 13:04:02 +01:00
C	Initial folder refactor	2024-03-03 13:04:02 +01:00
CUDA	Initial folder refactor	2024-03-03 13:04:02 +01:00
references	add references	2023-12-03 18:46:24 +00:00
scratchpad	add possible gotcha to folk wisdom file	2024-02-07 19:46:12 +01:00
FOLK_WISDOM.md	add possible gotcha to folk wisdom file	2024-02-07 19:46:12 +01:00
LICENSE.txt	add MIT license	2023-07-24 11:25:29 +02:00
README.md	remove histogram caveat after reviewing; style tweaks	2024-02-02 18:06:58 +01:00
ROADMAP.md	very small division tweaks	2024-02-24 23:33:09 -03:00

README.md

squiggle.c

squiggle.c is a self-contained C99 library that provides functions for simple Monte Carlo estimation, inspired by Squiggle.

Motivation

What am I trying to do here?

I am trying to build a reliable alternative to the original squiggle, that works for me and addresses my frustrations with it.
Some adjectives: grug brain, lindy, suckless
I am trying to make something that is simple enough that I and others can fully understand. squiggle.c is less than 700 lines of C, with a core of <230 lines. You, a somewhat technically sophisticated reader, could just read it and grasp its contents, and is encouraged to do so.

Why C?

Because it is fast
Because I enjoy it
Because C is honest
Because it will last long
Because it can be made faster if need be, e.g., with a multi-threading library like OpenMP, by implementing faster but more complex algorithms, or more simply, by inlining the sampling functions (adding an inline directive before their function declaration)
Because there are few abstractions between it and machine code (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.
Because it can fit in my head
Because if you can implement something in C, you can implement it anywhere else

Design choices

This code should aim to be correct, then simple, then fast.

It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- Nonetheless, the user should understand the limitations of sampling-based methods. See the section on Tests and the long tail of the lognormal for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
It should be clear, conceptually simple. Simple for me to implement, simple for others to understand.
It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty.
- In any case, though, the code should still be way faster than, say, Python.

Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.

Getting started

Download squiggle.c, for instance:

$ rm -r -f squiggle_c
$ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.c
$ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.h
$ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.c
$ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.h
$ mkdir temp
$ mv squiggle* temp
$ mv temp squiggle_c

Write your model. For instance, your could replicate this paper as follows:

#include "squiggle_c/squiggle.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

double sample_fermi_logspace(uint64_t * seed)
{
    // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line in page 11.
    // You can see a simple version of this function in naive.c in this same folder
    double log_rate_of_star_formation = sample_uniform(log(1), log(100), seed);
    double log_fraction_of_stars_with_planets = sample_uniform(log(0.1), log(1), seed);
    double log_number_of_habitable_planets_per_star_system = sample_uniform(log(0.1), log(1), seed);

    double log_rate_of_life_formation_in_habitable_planets = sample_normal(1, 50, seed);
    double log_fraction_of_habitable_planets_in_which_any_life_appears;
    /* 
        Consider:
        a = underlying normal 
        b = rate_of_life_formation_in_habitable_planets = exp(underlying normal) = exp(a)
        c = 1 - exp(-b) = fraction_of_habitable_planets_in_which_any_life_appears
        d = log(c)

        Looking at the Taylor expansion for c = 1 - exp(-b), it's 
        b - b^2/2 + b^3/6 - x^b/24, etc.
        <https://www.wolframalpha.com/input?i=1-exp%28-x%29>
        When b ~ 0 (as is often the case), this is close to b.

        But now, if b ~ 0, c ~ b
        and d = log(c) ~ log(b) = log(exp(a)) = a

        Now, we could play around with estimating errors,
        and indeed if we want b^2/2 = exp(a)^2/2 < 10^(-n), i.e., to have n decimal digits of precision,
        we could compute this as e.g., a < (nlog(10) + log(2))/2
        so for example if we want ten digits of precision, that's a < -11
        
        Empirically, the two numbers as calculated in C do become really close around 11 or so, 
        and at 38 that calculation results in a -inf (so probably a floating point error or similar.)
        So we should be using that formula for somewhere between -38 << a < -11

        I chose -16 as a happy medium after playing around with 
        double invert(double x){
            return log(1-exp(-exp(-x)));
        }
        for(int i=0; i<64; i++){
            double j = i;
            printf("for %lf, log(1-exp(-exp(-x))) is calculated as... %lf\n", j, invert(j));
        }
        and <https://www.wolframalpha.com/input?i=log%281-exp%28-exp%28-16%29%29%29>
    */
    if (log_rate_of_life_formation_in_habitable_planets < -16) {
        log_fraction_of_habitable_planets_in_which_any_life_appears = log_rate_of_life_formation_in_habitable_planets;
    } else {
        double rate_of_life_formation_in_habitable_planets = exp(log_rate_of_life_formation_in_habitable_planets);
        double fraction_of_habitable_planets_in_which_any_life_appears = -expm1(-rate_of_life_formation_in_habitable_planets);
        log_fraction_of_habitable_planets_in_which_any_life_appears = log(fraction_of_habitable_planets_in_which_any_life_appears);
    }

    double log_fraction_of_planets_with_life_in_which_intelligent_life_appears = sample_uniform(log(0.001), log(1), seed);
    double log_fraction_of_intelligent_planets_which_are_detectable_as_such = sample_uniform(log(0.01), log(1), seed);
    double log_longevity_of_detectable_civilizations = sample_uniform(log(100), log(10000000000), seed);

    double log_n = 
        log_rate_of_star_formation + 
        log_fraction_of_stars_with_planets + 
        log_number_of_habitable_planets_per_star_system + 
        log_fraction_of_habitable_planets_in_which_any_life_appears + 
        log_fraction_of_planets_with_life_in_which_intelligent_life_appears + 
        log_fraction_of_intelligent_planets_which_are_detectable_as_such + 
        log_longevity_of_detectable_civilizations;
    return log_n;
}

double sample_are_we_alone_logspace(uint64_t * seed)
{
    double log_n = sample_fermi_logspace(seed);
    return ((log_n > 0) ? 1 : 0);
    // log_n > 0 => n > 1
}

int main()
{

    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1010101; // xorshift can't start with a seed of 0
    // Note: come back to choice of seed.

    double logspace_fermi_proportion = 0;
    int n_samples = 1000 * 1000;
    for (int i = 0; i < n_samples; i++) {
        double result = sample_are_we_alone_logspace(seed);
        logspace_fermi_proportion += result;
    }
    double p_not_alone = logspace_fermi_proportion / n_samples;
    printf("Probability that we are not alone: %lf (%.lf%%)\n", p_not_alone, p_not_alone * 100);

    free(seed);
}

Compile and run:

$ gcc -O3 samples.c ./squiggle_c/squiggle.c  ./squiggle_c/squiggle_more.c -lm -fopenmp -o samples
$ ./samples

Core strategy

The recommended strategy is to:

Define sampler functions, which take a seed, and return 1 sample
Compose those sampler functions to define your estimation model
Produce an array of samples from a sampler function
Get summary statistics for that array of samples.

More examples

You can follow some example usage in the examples/ folder. In examples/core, we build up some functionality, starting from drawing one sample and finishing with the replication of Dissolving the Fermi paradox above. In examples/more, we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.

Guarantees

The bad:

I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim, Rust, Go.
This project mostly exists for my own usage & for my own amusement.
Caution! Think carefully before using this project for anything important.
If you wanted to pay me to provide some stability or correctness, guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so here.
I am conflicted about parallelism. It does add more complexity, complexity that you can be bitten by if you are not careful and don't understand it. And this conflicts with the initial grug-brain motivation. At the same time, it is clever, and it is nice, and I like it a lot.

The good:

You can vendor the code, i.e., save it as a dependency together with your other files. This way, this renders you immune to any changes I may make.
I've been hacking at this project for a while now, and I think I have a good grasp of its correctness and limitations. I've tried Nim and Zig, and I prefer C so far.
I think the core interface is not likely to change much, though I've recently changed the interface for parallelism and for getting confidence intervals.
I am using this code for a few important consulting projects, and I trust myself to operate it correctly.

Functions and their usage

squiggle.c

squiggle.c should be pretty tightly scoped. Available functions are:

// Underlying pseudo-randomness function
uint64_t xorshift64(uint64_t* seed);

// Sampling functions
double sample_unit_uniform(uint64_t* seed);
double sample_unit_normal(uint64_t* seed);
double sample_uniform(double start, double end, uint64_t* seed);
double sample_normal(double mean, double sigma, uint64_t* seed);
double sample_lognormal(double logmean, double logsigma, uint64_t* seed);
double sample_normal_from_90_ci(double low, double high, uint64_t* seed);
double sample_to(double low, double high, uint64_t* seed);
double sample_gamma(double alpha, uint64_t* seed);
double sample_beta(double a, double b, uint64_t* seed);
double sample_laplace(double successes, double failures, uint64_t* seed);

// Array helpers
double array_sum(double* array, int length);
void array_cumsum(double* array_to_sum, double* array_cumsummed, int length);
double array_mean(double* array, int length);
double array_std(double* array, int length);

// Mixture function
double sample_mixture(double (*samplers[])(uint64_t*), double* weights, int n_dists, uint64_t* seed);

The samplers syntax for the mixture functions denotes that it takes an array of functions. You can use it as follows:

#include "squiggle.h"

double sample_0(uint64_t* seed) { UNUSED(seed); return 0; }
double sample_1(uint64_t* seed) { UNUSED(seed); return 1; }
double sample_few(uint64_t* seed) { return sample_to(1, 3, seed); }
double sample_many(uint64_t* seed) { return sample_to(2, 10, seed); }

double sample_model(uint64_t* seed){

    double p_a = 0.8;
    double p_b = 0.5;
    double p_c = p_a * p_b;

    int n_dists = 4;
    double weights[] = { 1 - p_c, p_c / 2, p_c / 4, p_c / 4 };
    double (*samplers[])(uint64_t*) = { sample_0, sample_1, sample_few, sample_many };
    double result = sample_mixture(samplers, weights, n_dists, seed);

    return result;
}

squiggle_more.h

squiggle_more.c has expansions and convenience functions, which are more meandering. Expansions are in squiggle_more.c and squiggle_more.h. To use them, take care to link them:

#include "squiggle.h"
#include "squiggle_more.h"

# When compiling:
$ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example

Available definitions are as follows:

#define THOUSAND 1000
#define MILLION 1000000

/* Parallel sampling */
void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_threads, int n_samples);

/* Stats */
double array_get_median(double xs[], int n);
typedef struct ci_t {
    double low;
    double high;
} ci;
ci array_get_ci(ci interval, double* xs, int n);
ci array_get_90_ci(double xs[], int n);

void array_print_stats(double xs[], int n);
void array_print_histogram(double* xs, int n_samples, int n_bins);
void array_print_90_ci_histogram(double* xs, int n, int n_bins);

/* Algebra manipulations */

typedef struct normal_params_t {
    double mean;
    double std;
} normal_params;
normal_params algebra_sum_normals(normal_params a, normal_params b);

typedef struct lognormal_params_t {
    double logmean;
    double logstd;
} lognormal_params;
lognormal_params algebra_product_lognormals(lognormal_params a, lognormal_params b);

lognormal_params convert_ci_to_lognormal_params(ci x);
ci convert_lognormal_params_to_ci(lognormal_params y);

On parallelism in particular, see the warnings and caveats in the FOLK_WISDOM.md file. That file also has many other nuggets, warnings, trinkets, caveats, pointers I've collected over time.

Here is an example of using parallelism, and then printing some stats and a histogram:

#include "../../../squiggle.h"
#include "../../../squiggle_more.h"
#include <stdio.h>
#include <stdlib.h>

double sample_beta_3_2(uint64_t* seed) { return sample_beta(3.0, 2.0, seed); }

int main()
{
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with 0

    int n_samples = 1 * MILLION;
    double* xs = malloc(sizeof(double) * (size_t)n_samples);
    sampler_parallel(sample_beta_3_2, xs, 16, n_samples);

    printf("\n# Stats\n");
    array_print_stats(xs, n_samples);
    printf("\n# Histogram\n");
    array_print_histogram(xs, n_samples, 23);

    free(seed);
}

This produces the following output:

Avg: 0.600036
Std: 0.199851
 5%: 0.249009
10%: 0.320816
25%: 0.456413
50%: 0.614356
75%: 0.757000
90%: 0.857256
95%: 0.902290

# Histogram
[  0.00,   0.05):  391
[  0.05,   0.09): █ 2352
[  0.09,   0.13): ███ 5766
[  0.13,   0.18): ██████ 10517
[  0.18,   0.22): ██████████ 16412
[  0.22,   0.26): ██████████████ 22773
[  0.26,   0.31): ███████████████████ 30120
[  0.31,   0.35): ████████████████████████ 37890
[  0.35,   0.39): █████████████████████████████ 45067
[  0.39,   0.44): █████████████████████████████████ 52174
[  0.44,   0.48): ██████████████████████████████████████ 59636
[  0.48,   0.52): ██████████████████████████████████████████ 64924
[  0.52,   0.57): █████████████████████████████████████████████ 69832
[  0.57,   0.61): ████████████████████████████████████████████████ 74099
[  0.61,   0.65): █████████████████████████████████████████████████ 76776
[  0.65,   0.70): ██████████████████████████████████████████████████ 77001
[  0.70,   0.74): ████████████████████████████████████████████████ 75290
[  0.74,   0.78): ██████████████████████████████████████████████ 71711
[  0.78,   0.83): ██████████████████████████████████████████ 65576
[  0.83,   0.87): ████████████████████████████████████ 56839
[  0.87,   0.91): ████████████████████████████ 44626
[  0.91,   0.96): ███████████████████ 29464
[  0.96,   1.00]: ██████ 10764

Licensing

This project is released under the MIT license, a permissive open-source license. You can see it in the LICENSE.txt file.