# Squiggle.c

A self-contained C99 library that provides a subset of [Squiggle](https://www.squiggle-language.com/)'s functionality in C. 

## Why C?

- Because it is fast
- Because I enjoy it
- Because C is honest
- Because it will last long
- Because it can fit in my head
- Because if you can implement something in C, you can implement it anywhere else
- Because it can be made faster if need be
  - e.g., with a multi-threading library like OpenMP, or by adding more algorithmic complexity
  - or more simply, by inlining the sampling functions (adding an `inline` directive before their function declaration)
- **Because there are few abstractions between it and machine code** (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.

## Getting started

You can follow some example usage in the `examples/` folder:

1. In the first example, we define a small model, and draw one sample from it
2. In the second example, we define a small model, and return many samples
3. In the third example, we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way.
4. In the fourth example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler.
5. In the fifth example, we define the cdf for the beta distribution, and we draw samples from it. 
6. In the sixth example, we take samples from simple gamma and beta distributions, using the samplers provided by this library.

## Commentary

### squiggle.c is short

`squiggle.c` is less than 500 lines of C. The reader could just read it and grasp its contents.

### Core strategy

This library provides some basic building blocks. The recommended strategy is to:

1. Define sampler functions, which take a seed, and return 1 sample
2. Compose those sampler functions to define your estimation model
3. At the end, call the last sampler function many times to generate many samples from your model
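
The three steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the library's own code: `xorshift64`, `draw_unit_uniform`, `draw_uniform`, `draw_hours` and `fill_samples` are hypothetical names chosen for this sketch, and the toy model (tasks times hours per task) is made up.

```c
#include <stdint.h>

// Hypothetical stand-in for a random number generator: xorshift64.
// The state must be nonzero, otherwise it stays at zero forever.
static uint64_t xorshift64(uint64_t* seed){
    uint64_t x = *seed;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *seed = x;
}

// Step 1: samplers take a seed and return one sample.
// Uniform draw in [0, 1), built from the top 53 bits of the state.
double draw_unit_uniform(uint64_t* seed){
    return (double)(xorshift64(seed) >> 11) / 9007199254740992.0; // 2^53
}

double draw_uniform(double start, double end, uint64_t* seed){
    return start + (end - start) * draw_unit_uniform(seed);
}

// Step 2: compose samplers to define a model.
// Toy model: total hours = number of tasks * hours per task.
double draw_hours(uint64_t* seed){
    double n_tasks = draw_uniform(2, 5, seed);
    double hours_per_task = draw_uniform(0.5, 2, seed);
    return n_tasks * hours_per_task;
}

// Step 3: call the last sampler many times to fill an array of samples.
void fill_samples(double (*sampler)(uint64_t*), double* out, int n, uint64_t* seed){
    for(int i = 0; i < n; i++){
        out[i] = sampler(seed);
    }
}
```

A caller would set a nonzero seed, e.g. `uint64_t seed = 1000;`, and then call `fill_samples(draw_hours, samples, n, &seed);`. Because every sampler advances the same seed, each call yields a fresh, independent draw.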

### Cdf auxiliary functions

To help with the above core strategy, this library provides convenience functions, which take a cdf, and return a sample from the distribution described by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use them.
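
To make the idea concrete, here is a sketch of inverse transform sampling: draw a uniform `p` in [0, 1), then find the `x` whose cdf equals `p`. It uses plain bisection for the inversion, which is a swapped-in technique chosen for simplicity and not necessarily what the library does. All names here (`draw_unit_uniform`, `cdf_rising`, `invert_cdf`, `draw_from_cdf`) are hypothetical.

```c
#include <stdint.h>

// Hypothetical uniform sampler in [0, 1), backed by xorshift64 (needs a nonzero seed).
static double draw_unit_uniform(uint64_t* seed){
    uint64_t x = *seed;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *seed = x;
    return (double)(x >> 11) / 9007199254740992.0; // divide by 2^53
}

// Example cdf: F(x) = x^2 on [0, 1], i.e., a density rising linearly from 0 to 2.
double cdf_rising(double x){
    if(x <= 0) return 0;
    if(x >= 1) return 1;
    return x * x;
}

// Find x in [lo, hi] such that cdf(x) is approximately p, by bisection.
// Assumes cdf is non-decreasing, with cdf(lo) <= p <= cdf(hi).
double invert_cdf(double (*cdf)(double), double p, double lo, double hi){
    for(int i = 0; i < 60; i++){
        double mid = lo + (hi - lo) / 2;
        if(cdf(mid) < p){
            lo = mid; // target lies in the upper half
        } else {
            hi = mid; // target lies in the lower half
        }
    }
    return lo + (hi - lo) / 2;
}

// Inverse transform sampling: one uniform draw, then one cdf inversion.
double draw_from_cdf(double (*cdf)(double), double lo, double hi, uint64_t* seed){
    double p = draw_unit_uniform(seed);
    return invert_cdf(cdf, p, lo, hi);
}
```

In this sketch every sample costs 60 cdf evaluations, which gives some intuition for why sampling through a cdf tends to be much slower than a dedicated sampler.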

### Nested functions and compilation with tcc

GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC, and hence sacrificing portability and compilation speed. Conversely, compiling with tcc (the Tiny C Compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.

| GCC | tcc |
| --- | --- | 
| slower compilation | faster compilation | 
| allows nested functions | doesn't allow nested functions |
| faster execution | slower execution | 

My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then to use gcc for the final version with lots of samples, possibly with nested functions for ease of reading by others.

### Error propagation vs exiting on error

The process of taking a cdf and returning a sample might fail, e.g., it's a Newton method which might fail to converge because of cdf artifacts. The cdf itself might also fail, e.g., if a distribution only accepts a range of parameters, but is fed parameters outside that range.

This library provides two approaches:

1. Print the line and function in which the error occurred, then exit on error
2. In situations where there might be an error, return a struct containing either the correct value or an error message:

```C
struct box {
    int empty;
    double content;
    char* error_msg;
};
```

The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programs, but is more verbose.

Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library also provides a convenient macro, `PROCESS_ERROR`, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
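
As an illustration of the second approach, here is a small sketch of how a function might return a `struct box`, and how a caller might propagate the error instead of exiting. It does not use `EXIT_ON_ERROR` or `PROCESS_ERROR`, whose exact definitions are not shown here. The functions `checked_cdf_uniform` and `checked_survival_uniform` are hypothetical and exist only for this example.

```c
#include <stddef.h>

// Same layout as the struct shown above.
struct box {
    int empty;
    double content;
    char* error_msg;
};

// Hypothetical cdf of a uniform distribution on [start, end].
// Returns an empty box, with a message, if the parameters are invalid.
struct box checked_cdf_uniform(double x, double start, double end){
    struct box result;
    if(end <= start){
        result.empty = 1;
        result.content = 0;
        result.error_msg = "end must be greater than start";
        return result;
    }
    result.empty = 0;
    result.error_msg = NULL;
    if(x <= start){
        result.content = 0;
    } else if(x >= end){
        result.content = 1;
    } else {
        result.content = (x - start) / (end - start);
    }
    return result;
}

// A caller checks the box and passes any error upwards unchanged.
struct box checked_survival_uniform(double x, double start, double end){
    struct box cdf = checked_cdf_uniform(x, start, end);
    if(cdf.empty){
        return cdf; // propagate the error instead of exiting
    }
    cdf.content = 1 - cdf.content;
    return cdf;
}
```

The verbosity mentioned above is visible here: every caller has to check `empty` before touching `content`, which is the kind of repetition a macro can shorten.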

## Design choices

This code should be correct, then simple, then fast.

- It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- It should be clear and conceptually simple: simple for me to implement, simple for others to understand.
- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python.

Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.

## Correlated samples

In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:

```js
a = 1 to 10
b = 2 * a
c = b/a
c
```

Likewise in [squigglepy](https://github.com/rethinkpriorities/squigglepy):

```python
import squigglepy as sq
import numpy as np

a = sq.to(1, 3)
b = 2 * a  
c = b / a 

c_samples = sq.sample(c, 10)

print(c_samples)
```

Should `c` be equal to `2`? Or should it be equal to 2 times the distribution of the ratio of two independent draws from `a` (`2 * a/a`, as it were)?

In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:

```c
// correlated samples
// gcc -O3  correlated.c squiggle.c -lm -o correlated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = sample_to(1, 10, seed);
    double b = 2 * a;
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.607162, b: 1.214325, c: 2.000000

    free(seed);
}
```

vs

```c
// uncorrelated samples
// gcc -O3    uncorrelated.c ../../squiggle.c -lm -o uncorrelated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

double draw_xyz(uint64_t* seed){
    // function could also be placed inside main with gcc nested functions extension.
    return sample_to(1, 20, seed);
}


int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = draw_xyz(seed);
    double b = 2 * draw_xyz(seed);
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.522484, b: 10.283501, c: 19.681936

    free(seed);
}
```

## Related projects

- [Squiggle](https://www.squiggle-language.com/)
- [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
- [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
- [time to botec](https://github.com/NunoSempere/time-to-botec)

## To do list

- [ ] Test summary statistics for each of the distributions.
- [ ] Have some more complicated & realistic example
- [ ] Add summarization functions: 90% ci (or all c.i.?) 
- [ ] Systematize references
- [ ] Publish online
- [ ] Add efficient sampling from a beta distribution
  - https://dl.acm.org/doi/10.1145/358407.358414
  - https://link.springer.com/article/10.1007/bf02293108
  - https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution
  - https://github.com/numpy/numpy/blob/5cae51e794d69dd553104099305e9f92db237c53/numpy/random/src/distributions/distributions.c
- [ ] Support all distribution functions in <https://www.squiggle-language.com/docs/Api/Dist>
- [ ] Support all distribution functions in <https://www.squiggle-language.com/docs/Api/Dist>, and do so efficiently

## Done

- [x] Add example for only one sample
- [x] Add example for many samples
- ~~[ ] Add a custom preprocessor to allow simple nested functions that don't rely on local scope?~~
- [x] Use gcc extension to define functions nested inside main.
- [x] Chain various sample_mixture functions
- [x] Add beta distribution
  - See <https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution> for a faster method.
- ~~[-] Use OpenMP for acceleration~~
- [x] Add function to get sample when given a cdf
- [x] Don't have a single header file.
- [x] Structure project a bit better
- [x] Simplify `PROCESS_ERROR` macro
- [x] Add README
  - [x] Schema: a function which takes a sample and manipulates it,
  - [x] and at the end, an array of samples.
  - [x] Explain boxes
  - [x] Explain nested functions
  - [x] Explain exit on error
  - [x] Explain individual examples
- [x] Rename functions to something more self-explanatory, e.g., `sample_unit_normal`.
- [x] Add summarization functions: mean, std
- [x] Add sampling from a gamma distribution
  - https://dl.acm.org/doi/pdf/10.1145/358407.358414
- [x] Explain correlated samples
- [-] ~~Add tests in Stan?~~