- e.g., with a multi-threading library like OpenMP, or by adding more algorithmic complexity
- or more simply, by inlining the sampling functions (adding the `inline` specifier to their function declarations)
- **Because there are few abstractions between it and machine code** (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.
You can follow some example usage in the `examples/` folder:
1. In the first example, we define a small model, and draw one sample from it.
2. In the second example, we define a small model, and return many samples.
3. In the third example, we use a gcc extension (nested functions) to rewrite the code from the second example in a more linear way.
4. In the fourth example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler.
5. In the fifth example, we define the cdf for the beta distribution, and we draw samples from it.
To help with the above core strategy, this library provides convenience functions, which take a cdf and return a sample from the distribution defined by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
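To illustrate where the slowdown comes from, here is a minimal sketch of drawing a sample from a cdf by inverting it numerically. It uses plain bisection rather than the library's own method, and every name in it (`uniform01`, `inverse_cdf_bisection`, `sample_from_cdf`, `example_cdf`) is hypothetical and not part of squiggle.c. The example cdf is that of a Beta(2, 1) distribution, F(x) = x² on [0, 1].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of squiggle.c: xorshift64 PRNG.
   Returns a uniform double strictly inside (0, 1). The state must be nonzero. */
double uniform01(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    /* Top 52 bits plus a 0.5 offset, scaled by 2^52. */
    return ((double)(x >> 12) + 0.5) / 4503599627370496.0;
}

typedef double (*cdf_fn)(double);

/* Find x in [lo, hi] such that cdf(x) is approximately p, by bisection.
   Assumes the cdf is non-decreasing on [lo, hi]. */
double inverse_cdf_bisection(cdf_fn cdf, double p, double lo, double hi)
{
    assert(lo < hi);
    for (int i = 0; i < 100; i++) {
        double mid = lo + (hi - lo) / 2.0;
        if (cdf(mid) < p) {
            lo = mid; /* the target lies to the right of mid */
        } else {
            hi = mid; /* the target lies at or to the left of mid */
        }
    }
    return lo + (hi - lo) / 2.0;
}

/* Inverse transform sampling: draw u uniformly in (0, 1), then invert the cdf at u. */
double sample_from_cdf(cdf_fn cdf, double lo, double hi, uint64_t *state)
{
    return inverse_cdf_bisection(cdf, uniform01(state), lo, hi);
}

/* Example cdf: Beta(2, 1), i.e., F(x) = x^2 on [0, 1]. */
double example_cdf(double x)
{
    if (x <= 0.0) return 0.0;
    if (x >= 1.0) return 1.0;
    return x * x;
}
```

Each call to `sample_from_cdf` evaluates the cdf many times (100 here), whereas a dedicated sampler typically needs only a handful of arithmetic operations per draw.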
GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of depending on GCC, which sacrifices portability and lengthens compilation times. Conversely, compiling with tcc (Tiny C Compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.
My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then to use gcc for the final version with lots of samples, possibly with nested functions for ease of reading by others.
The process of taking a cdf and returning a sample might fail, e.g., it relies on Newton's method, which might fail to converge because of cdf artifacts. The cdf itself might also fail, e.g., if a distribution only accepts a range of parameters, but is fed parameters outside that range.
This library provides two approaches:
1. Print the line and function in which the error occurred, then exit on error.
2. In situations where there might be an error, return a struct containing either the correct value or an error message:
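As a rough sketch of the second approach, such a struct could look like the following. The type and function names here (`sample_result`, `result_ok`, `result_error`, `checked_uniform_cdf`) are hypothetical illustrations, not the library's actual definitions.

```c
#include <stddef.h>

/* Hypothetical result type: holds either a value or an error message. */
typedef struct {
    int is_error;          /* 0 on success, nonzero on failure */
    double value;          /* meaningful only when is_error == 0 */
    const char *error_msg; /* meaningful only when is_error != 0 */
} sample_result;

sample_result result_ok(double value)
{
    sample_result r = { 0, value, NULL };
    return r;
}

sample_result result_error(const char *msg)
{
    sample_result r = { 1, 0.0, msg };
    return r;
}

/* Example: a uniform cdf that rejects parameters outside its valid range
   instead of silently returning a meaningless number. */
sample_result checked_uniform_cdf(double x, double low, double high)
{
    if (!(low < high)) {
        return result_error("checked_uniform_cdf: low must be smaller than high");
    }
    if (x <= low) return result_ok(0.0);
    if (x >= high) return result_ok(1.0);
    return result_ok((x - low) / (high - low));
}
```

The caller then checks `is_error` before reading `value`, which is what makes this style more verbose than exiting at the point of failure.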
The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programs, but is more verbose.
Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library also provides a convenient macro, `PROCESS_ERROR`, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
This code should be correct, then simple, then fast.
- It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- It should be clear and conceptually simple: simple for me to implement, simple for others to understand.
- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python.
Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
Should `c` be equal to `2`? Or should it be equal to 2 times the expected distribution of the ratio of two independent draws from `a` (`2 * a/a`, as it were)?
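The two readings can be made explicit in C. The sketch below assumes a model of the form `b = 2 * a; c = b / a`, with `a` a uniform draw between 1 and 10; that choice of distribution and all function names (`uniform01`, `sample_a`, `sample_c_shared`, `sample_c_independent`) are hypothetical and not part of squiggle.c.

```c
#include <stdint.h>

/* Hypothetical helper, not part of squiggle.c: xorshift64 PRNG.
   Returns a uniform double strictly inside (0, 1). The state must be nonzero. */
double uniform01(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return ((double)(x >> 12) + 0.5) / 4503599627370496.0;
}

/* Hypothetical model input: a uniform draw between 1 and 10. */
double sample_a(uint64_t *state)
{
    return 1.0 + 9.0 * uniform01(state);
}

/* Reading 1: `a` is drawn once and reused, so the two occurrences of `a`
   are perfectly correlated and c is always 2. */
double sample_c_shared(uint64_t *state)
{
    double a = sample_a(state);
    double b = 2.0 * a;
    return b / a;
}

/* Reading 2: each occurrence of `a` is an independent draw, so c is
   2 times the ratio of two independent samples. */
double sample_c_independent(uint64_t *state)
{
    double b = 2.0 * sample_a(state);
    double a = sample_a(state);
    return b / a;
}
```

Because each draw is an explicit function call, the code itself states which of the two readings is meant.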