
Folk Wisdom

The README.md file was getting too long and messy, so I moved some of the grumpier comments here.

Nested functions and compilation with tcc.

GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.

| GCC | tcc |
|-----|-----|
| slower compilation | faster compilation |
| allows nested functions | doesn't allow nested functions |
| faster execution | slower execution |

My initial recommendation was to use tcc while drawing a small number of samples for fast iteration, and then gcc for the final version with lots of samples, possibly with nested functions for ease of reading by others.

But nested functions are just really nice, so my current recommendation is to use gcc throughout. That said, modifying code to not use nested functions is relatively easy, so keep in mind that you can do that if you need to run in other environments.
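For concreteness, here is a minimal sketch of what the nested-function style looks like. It only uses sample_to from squiggle.h; the model itself is made up for illustration, and since nested functions are a GCC extension, this won't compile under tcc.

#include "squiggle.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    // nested function: defined inside main, so the model reads like a linear script (GCC extension)
    double sample_model(uint64_t* s){
        double cost = sample_to(1, 10, s);
        double benefit = sample_to(2, 20, s);
        return benefit - cost;
    }

    printf("One draw from the model: %f\n", sample_model(seed));

    free(seed);
}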

Correlated samples

In the original squiggle language, there is some ambiguity about what this code means:

a = 1 to 10
b = 2 * a
c = b/a
c

Likewise in squigglepy:

import squigglepy as sq
import numpy as np

a = sq.to(1, 3)
b = 2 * a  
c = b / a 

c_samples = sq.sample(c, 10)

print(c_samples)

Should c be equal to 2? Or should it be equal to 2 times the expected distribution of the ratio of two independent draws from a (2 * a/a, as it were)? You don't know, because you are not operating on samples, but on magical objects whose internals are hidden from you.

In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:

// correlated samples
// gcc -O3  correlated.c squiggle.c -lm -o correlated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = sample_to(1, 10, seed);
    double b = 2 * a;
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.607162, b: 1.214325, c: 0.500000

    free(seed);
}

vs

// uncorrelated samples
// gcc -O3    uncorrelated.c ../../squiggle.c -lm -o uncorrelated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

double draw_xyz(uint64_t* seed){
    // function could also be placed inside main with gcc nested functions extension.
    return sample_to(1, 20, seed);
}


int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = draw_xyz(seed);
    double b = 2 * draw_xyz(seed);
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.522484, b: 10.283501, c: 19.681936

    free(seed);
}

Exercise for the reader: What possible meanings could the following represent in squiggle? How would you implement each of those meanings in squiggle.c?

min(normal(0, 1))

Hint: See examples/more/13_parallelize_min
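For instance, one possible reading is "the minimum of n independent draws from normal(0, 1)". A minimal sketch of that reading, assuming squiggle.h exposes a sample_normal(mean, std, seed) sampler (check the header for the exact name and signature):

// one possible reading: the minimum of n draws from normal(0, 1)
#include "squiggle.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    int n = 1000; // how many draws the min ranges over; another reading might pick a different n
    double min = sample_normal(0, 1, seed); // assumed signature: (mean, std, seed)
    for(int i = 1; i < n; i++){
        double x = sample_normal(0, 1, seed);
        if(x < min) min = x;
    }
    printf("min of %d draws from normal(0, 1): %f\n", n, min);

    free(seed);
}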

Note on sampling strategies

Right now, I am drawing samples using a random number generator. It requires some finesse, particularly when using parallelism. But it works fine.

But... what if we could do something more elegant, more ingenious? In particular, what if, instead of drawing samples, we had a mesh of equally spaced points over the range of floats? Then, for a given number of samples, we could better estimate, say, the mean of the distribution we are trying to model...

The problem with that is that if we have some code like:

double model(...){
  // hypothetical mesh-indexed sampler: here i_mesh would pick a mesh point
  // rather than being a randomness seed, so this is not the real sample_to signature
  double a = sample_to(1, 10, i_mesh++);
  double b = sample_to(1, 2, i_mesh);
  return a * b;
}

Then this doesn't work, because the values of a and b will be correlated: when a is high, b will also be high. What might work would be something like this:

double* model(int n_samples){
  double* xs = malloc((size_t)n_samples * sizeof(double));
  int n_mesh = (int)sqrt(n_samples); // assumes n_samples is a perfect square
  for(int i_mesh=0; i_mesh < n_mesh; i_mesh++){
      for(int j_mesh=0; j_mesh < n_mesh; j_mesh++){
          // again the hypothetical mesh-indexed sampler, not the real sample_to signature
          double a = sample_to(1, 10, i_mesh);
          double b = sample_to(1, 2, j_mesh);
          xs[i_mesh * n_mesh + j_mesh] = a * b;
      }
  }
  return xs;
}

But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not grug-brained. So every now and then I have to remember that this is not the way.

Tests and the long tail of the lognormal

Distribution functions can be tested with:

cd tests
make && make run

make verify is an alias that runs all the tests and just displays the ones that are failing.

These tests are somewhat rudimentary: they draw between 1M and 10M samples from a given sampling function, and check that their mean and standard deviation correspond to what they theoretically should be.
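The gist of such a test, sketched here with the unit uniform, whose mean is 0.5 (the real harness in tests/ differs in its details):

// sketch of the kind of check the tests do; not the actual code in tests/
#include "squiggle.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    int n = 1000 * 1000; // 1M samples
    double sum = 0;
    for(int i = 0; i < n; i++){
        sum += sample_unit_uniform(seed);
    }
    double mean = sum / n;
    double expected_mean = 0.5; // analytical mean of the unit uniform
    double relative_delta = (mean - expected_mean) / expected_mean;
    printf("mean: %f vs expected mean: %f, relative delta: %f\n", mean, expected_mean, relative_delta);
    printf("%s", fabs(relative_delta) < 0.005 ? "[x] PASSED\n" : "[-] NOT passed\n");

    free(seed);
}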

If you run make run (or make verify), you will see errors such as these:

[-] Mean test for normal(47211.047473, 682197.019012) NOT passed.
Mean of normal(47211.047473, 682197.019012): 46933.673278, vs expected mean: 47211.047473
delta: -277.374195, relative delta: -0.005910

[-] Std test for lognormal(4.584666, 2.180816) NOT passed.
Std of lognormal(4.584666, 2.180816): 11443.588861, vs expected std: 11342.434900
delta: 101.153961, relative delta: 0.008839

[-] Std test for to(13839.861856, 897828.354318) NOT passed.
Std of to(13839.861856, 897828.354318): 495123.630575, vs expected std: 498075.002499
delta: -2951.371925, relative delta: -0.005961

I wouldn't worry about these tests: due to luck of the draw, their relative error is a bit over 0.005, or 0.5%, and so the test fails. But it would surprise me if that had any meaningful practical implication.

The errors that should raise some worry are:

[-] Mean test for lognormal(1.210013, 4.766882) NOT passed.
Mean of lognormal(1.210013, 4.766882): 342337.257677, vs expected mean: 288253.061628
delta: 54084.196049, relative delta: 0.157985
[-] Std test for lognormal(1.210013, 4.766882) NOT passed.
Std of lognormal(1.210013, 4.766882): 208107782.972184, vs expected std: 24776840217.604111
delta: -24568732434.631927, relative delta: -118.057730

[-] Mean test for lognormal(-0.195240, 4.883106) NOT passed.
Mean of lognormal(-0.195240, 4.883106): 87151.733198, vs expected mean: 123886.818303
delta: -36735.085104, relative delta: -0.421507
[-] Std test for lognormal(-0.195240, 4.883106) NOT passed.
Std of lognormal(-0.195240, 4.883106): 33837426.331671, vs expected std: 18657000192.914921
delta: -18623162766.583248, relative delta: -550.371727

[-] Mean test for lognormal(0.644931, 4.795860) NOT passed.
Mean of lognormal(0.644931, 4.795860): 125053.904456, vs expected mean: 188163.894101
delta: -63109.989645, relative delta: -0.504662
[-] Std test for lognormal(0.644931, 4.795860) NOT passed.
Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298706.170452
delta: -18537322405.459286, relative delta: -463.707799

What is happening in this case is that you are taking a normal, like normal(-0.195240, 4.883106), and exponentiating it to arrive at a lognormal. But normal(-0.195240, 4.883106) is going to have some non-negligible weight on, say, 18, and exp(18) ≈ 65,659,969, so points like it contribute a nontrivial amount to the analytical mean and standard deviation even though they carry very little probability mass; with a finite number of samples, you rarely draw them, so the empirical moments undershoot.
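To see how tail-dominated those analytical moments are, one can compute them directly from the standard closed forms, mean = exp(mu + sigma^2/2) and std = sqrt((exp(sigma^2) - 1) * exp(2*mu + sigma^2)); a minimal sketch for one of the failing cases above:

// analytical mean and std of a lognormal, to see how much exp(sigma^2) dominates
#include <math.h>
#include <stdio.h>

int main(){
    double mu = -0.195240, sigma = 4.883106; // one of the failing test cases above
    double mean = exp(mu + sigma * sigma / 2);
    double std = sqrt((exp(sigma * sigma) - 1) * exp(2 * mu + sigma * sigma));
    printf("analytical mean: %f, analytical std: %f\n", mean, std);
    // these should land near the "expected" values in the test output above;
    // the std is driven almost entirely by exp(sigma^2) ≈ exp(23.8), i.e. by the far tail
}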

The reader can also check that for more plausible real-world values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't egregious:

[x] Mean test for to(10.000000, 10000.000000) PASSED
[-] Std test for to(10.000000, 10000.000000) NOT passed.
Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: -2258.290043, relative delta: -0.095779

Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle or Simple Squiggle (both linked below), or even do the lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape.
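For reference, that known shape is simple: if X ~ lognormal(mu_x, sigma_x) and Y ~ lognormal(mu_y, sigma_y) are independent, then X * Y ~ lognormal(mu_x + mu_y, sqrt(sigma_x^2 + sigma_y^2)), because log(X * Y) = log(X) + log(Y) is a sum of independent normals. A small sketch of that rule; the struct and function here are illustrative, not squiggle.c's own algebra helpers, whose names and types may differ:

#include <math.h>
#include <stdio.h>

// parameters of a lognormal on the log scale (illustrative type, not squiggle.c's own)
typedef struct { double logmean; double logstd; } lognormal_parameters;

lognormal_parameters multiply_lognormals(lognormal_parameters a, lognormal_parameters b){
    // log(X*Y) = log(X) + log(Y): the log-means add, the log-variances add
    lognormal_parameters result;
    result.logmean = a.logmean + b.logmean;
    result.logstd = sqrt(a.logstd * a.logstd + b.logstd * b.logstd);
    return result;
}

int main(){
    lognormal_parameters x = { 0.0, 0.5 };
    lognormal_parameters y = { 1.0, 0.5 };
    lognormal_parameters xy = multiply_lognormals(x, y);
    printf("product: lognormal(%f, %f)\n", xy.logmean, xy.logstd);
}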

In fact, squiggle.c does have a few functions for algebraic manipulations of simple distributions at the end of squiggle.c. But these are pretty rudimentary, and I don't know whether I'll end up expanding or deleting them.

Compiler warnings

Harsh compilation

By default, I've enabled -Wall -Wextra -Wdouble-promotion -Wconversion. However, these produce some false positive warnings, which I've dealt with through:

  • For conversion: Explicit casts, particularly from int to size_t when calling malloc.
  • For dealing with unused variables: Using an UNUSED macro (a minimal sketch of a common definition follows below). If you don't like that approach, you could add -Wno-unused-parameter to your makefile and remove the macro and its usage.
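A common way such a macro is defined (squiggle's actual definition may differ):

#include <stdio.h>

#define UNUSED(x) (void)(x) // cast to void: tells the compiler the non-use is intentional

void takes_an_unused_parameter(int unused_argument){
    UNUSED(unused_argument); // silences -Wunused-parameter without turning the warning off globally
    printf("the parameter is deliberately ignored\n");
}

int main(){
    takes_an_unused_parameter(42);
}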

Some resources on compiler flags: 1, 2

Results of running clang-tidy

clang-tidy is a utility to detect common errors in C/C++. You can run it with:

make tidy

So far in the history of this program it has emitted:

  • One false-positive warning about an issue I'd already taken care of (so I've suppressed the warning)
  • A warning about an unused variable

I think this is good news in terms of making me more confident that this simple library is correct :).

Boundaries between sampling functions and arrays of samples

In squiggle.c, the boundary between working with sampler functions and arrays of samples is clear. Not so in the original squiggle, which hides this distinction from the user in the interest of accessibility.
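Concretely, in squiggle.c a sampler is just a function that takes a seed and returns one draw, and an array of samples is just a double* that you fill by calling that function in a loop; crossing the boundary is always explicit. A minimal sketch, using only sample_to from squiggle.h:

#include "squiggle.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// a sampler: takes a seed, returns one draw
double sampler_example(uint64_t* seed){
    return sample_to(1, 10, seed);
}

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    // an array of samples: cross the boundary explicitly by calling the sampler in a loop
    int n = 1000;
    double* xs = malloc((size_t)n * sizeof(double));
    for(int i = 0; i < n; i++){
        xs[i] = sampler_example(seed);
    }
    printf("first sample: %f\n", xs[0]);

    free(xs);
    free(seed);
}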

Parallelism

I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.

But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:

  • If you run the sampler_parallel function twice, you will get the same result. Why?
  • If you run the sampler_parallel function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
  • For a small amount of samples, if you run the sampler_parallel function, you will get better spread out random numbers than if you run things serially. Why?

That said, I found adding parallelism to be an interesting and engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.
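For the curious, that cache-line trick is the standard defense against "false sharing"; here is a hedged sketch of the idea (the actual code in squiggle_more.c may be structured differently):

#include <stdio.h>

#define CACHE_LINE_SIZE 64 // typical on x86-64, but not guaranteed on every platform

// give each thread its own padded slot, so neighboring accumulators never share
// a cache line and threads don't keep invalidating each other's caches
typedef struct {
    double value;
    char padding[CACHE_LINE_SIZE - sizeof(double)];
} padded_double;

int main(){
    // with n threads, one would allocate n of these and have thread i
    // write only to its own slot's .value field
    printf("sizeof(padded_double): %zu bytes\n", sizeof(padded_double)); // 64
}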

Using arbitrary cdfs

The last commit that has code to sample from arbitrary cdfs is 8f6919fa2.... You can access it with git checkout 8f6919fa2. I removed them because I wasn't using them, and they didn't really fit with the overall ethos of the project.

Other gotchas

  • Even though the C standard is ambiguous about this, this code assumes that doubles have 64-bit precision (otherwise the xorshift code would have to be different).

Consider this code

Consider sampling from the unit uniform in this manner:

#include "../squiggle.h"
#include "../squiggle_more.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line in page 11.
    // Could also be interesting to just produce and save many samples.

    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = UINT64_MAX/64; // xorshift can't start with a seed of 0
    
    int n_samples = 100*MILLION;
    int p_sixteenth = 0;
    int p_eighth = 0;
    int p_quarter = 0;
    int p_half = 0;
    double sample;
    for(int i=0; i<n_samples; i++){
        sample = sample_unit_uniform(seed);
        // printf("%lf\n", sample);
        if (sample < 1.0/16.0){
            p_sixteenth++;
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.125){
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.25){
            p_quarter++;
            p_half++;
        } else if(sample < 0.5){
            p_half++;
        }else{
            // printf("Sample > 0.5\n");
        }
    }
    printf("p_16th: %lf; p_eighth; %lf; p_quarter: %lf; p_half: %lf", ((double)p_sixteenth)/n_samples, (double)p_eighth/n_samples, (double)p_quarter/n_samples, (double)p_half/n_samples);
}

What will be printed out? In particular, consider that not all floating points have the same density.

Answer: p_sixteenth will be ~0.0625, p_eighth ~0.125, p_quarter ~0.25, and p_half ~0.5. This is because these random numbers are produced by generating 64-bit integers and then dividing by the maximum integer. There may be additional gotchas here, but this at least ensures that intervals of the same length in [0,1] receive the same share of samples; it's just that the smaller floats are represented with more precision.
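That mechanism, sketched (this is the idea; squiggle.c's actual implementation may differ in its details):

#include <stdint.h>
#include <stdio.h>

// advance a 64-bit xorshift state (Marsaglia's xorshift64)...
static uint64_t xorshift64_step(uint64_t* state){
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

// ...then divide by the largest possible value to land in [0, 1]
static double unit_uniform_sketch(uint64_t* seed){
    return (double)xorshift64_step(seed) / (double)UINT64_MAX;
}

int main(){
    uint64_t seed = 1000; // xorshift can't start with a seed of 0
    for(int i = 0; i < 5; i++){
        printf("%f\n", unit_uniform_sketch(&seed));
    }
}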