README updates and spell checking

NunoSempere 2023-12-03 18:15:27 +00:00
parent 53a77cb64c
commit e08ce4334e


@ -1,6 +1,6 @@
# squiggle.c
squiggle.c is a [grug-brained](https://grugbrain.dev/) self-contained C99 library that provides functions for simple Monte Carlo estimation, based on [Squiggle](https://www.squiggle-language.com/).
squiggle.c is a [grug-brained](https://grugbrain.dev/) self-contained C99 library that provides functions for simple Monte Carlo estimation, inspired by [Squiggle](https://www.squiggle-language.com/).
## Why C?
@ -18,7 +18,7 @@ squiggle.c is a [grug-brained](https://grugbrain.dev/) self-contained C99 librar
## Getting started
You can follow some example usage in the [examples/](examples]) folder. In [examples/core](examples/core/), we build up some functionality, starting from drawing one sample. In examples/more, we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.
You can follow some example usage in the [examples/](examples/) folder. In [examples/core](examples/core/), we build up some functionality, starting from drawing one sample. In [examples/more](examples/more), we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.
## Commentary
@ -46,14 +46,23 @@ GCC has an extension which allows a program to define a function inside another
~~My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.~~
My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout, though keep in mind that extricating code to not use nested functions is relatively easy, so keep in mind that you can do that if you run in other environments.
My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout, though keep in mind that modifying code to not use nested functions is relatively easy, so you can do that if you need to run in other environments.
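
As a rough illustration of how small that change is, here is a sketch of a model written without nested functions. The names `draw_uniform`, `sample_a`, `sample_b` and `sample_product` are hypothetical and not part of squiggle.c; the toy generator is only a stand-in. The comment shows the nested form that GCC accepts.

```C
#include <stdint.h>

// Toy xorshift64 sampler returning a double in [0, 1).
// A stand-in for illustration, not squiggle.c's generator. The seed must be nonzero.
static double draw_uniform(uint64_t* seed){
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    return (double)(*seed >> 11) / 9007199254740992.0; // divide by 2^53
}

// With GCC's nested functions, the helpers could live inside the model:
//
//   double sample_product(uint64_t* seed){
//       double sample_a(uint64_t* s){ return 1.0 + 9.0 * draw_uniform(s); }
//       double sample_b(uint64_t* s){ return 1.0 + 1.0 * draw_uniform(s); }
//       return sample_a(seed) * sample_b(seed);
//   }
//
// Without the extension, the same helpers simply move to file scope:
static double sample_a(uint64_t* seed){ return 1.0 + 9.0 * draw_uniform(seed); }
static double sample_b(uint64_t* seed){ return 1.0 + 1.0 * draw_uniform(seed); }

double sample_product(uint64_t* seed){
    return sample_a(seed) * sample_b(seed);
}
```

The only cost is that helpers which captured local variables in the nested version would need those values passed in as explicit arguments.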
### Guarantees and licensing
The motte:
- I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim or Rust.
- This project mostly exists for my own usage & for my own amusement.
- Caution! Think carefully before using this project for anything important
- If you wanted to pay me to provide some stability or correctness, guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so [here](https://nunosempere.com/consulting). Although this theoretical possibility exists, I don't I don't anticipate that this would be a good idea on most cases.
- Caution! Think carefully before using this project for anything important.
- If you wanted to pay me to provide some stability or correctness guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so [here](https://nunosempere.com/consulting).
- I am conflicted about parallelism. It *does* add more complexity, complexity that you can be bitten by if you are not careful and don't understand it. And this conflicts with the initial grug-brain motivation. At the same time, it is clever, and it is nice, and I like it a lot.
The bailey:
- I've been hacking at this project for a while now, and I think I have a good grasp of its correctness and limitations. I've tried Nim and Zig, and I prefer C so far.
- I think the core interface is not likely to change much, though I've been changing the interface for parallelism and for getting confidence intervals.
- I am using this code for a few important consulting projects, and I trust myself to operate it correctly.
This project is released under the MIT license, a permissive open-source license. You can see it in the LICENSE.txt file.
@ -164,6 +173,42 @@ min(normal(0, 1))
Hint: See examples/more/13_parallelize_min
### Note on sampling strategies
Right now, I am drawing samples from a random number generator. It requires some finesse, particularly when using parallelism. But it works fine.
But... what if we could do something more elegant, more ingenious? In particular, what if instead of drawing samples, we had a mesh of equally spaced points in the range of floats? Then we could, for a given number of samples, better estimate, say, the mean of the distribution we are trying to model...
The problem with that is that if we have some code like:
```C
double model(...){
    double a = sample_to(1, 10, i_mesh++);
    double b = sample_to(1, 2, i_mesh);
    return a * b;
}
```
Then this doesn't work, because the values of a and b will be correlated: when a is high, b will also be high. What might work would be something like this:
```C
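#include <math.h>
#include <stdlib.h>

double* model(int n_samples){
    // malloc takes a size in bytes, not a number of doubles.
    double* xs = malloc(n_samples * sizeof(double));
    // Use a square grid: n_side mesh points per variable.
    int n_side = (int) sqrt(n_samples);
    for(int i_mesh = 0; i_mesh < n_side; i_mesh++){
        for(int j_mesh = 0; j_mesh < n_side; j_mesh++){
            double a = sample_to(1, 10, i_mesh);
            double b = sample_to(1, 2, j_mesh);
            // Store the model output for this pair of mesh points.
            xs[i_mesh * n_side + j_mesh] = a * b;
        }
    }
    return xs;
}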
```
But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not [grug-brained](https://grugbrain.dev/). So every now and then I have to remind myself that this is not the way.
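
To make the correlation problem concrete, here is a small sketch. `mesh_point` is a hypothetical stand-in for the three-argument `sample_to` above: it maps a mesh index to an evenly spaced point in a range, so it behaves like a uniform rather than like a proper lognormal quantile. Sharing one index only ever visits the diagonal of the (a, b) grid, while the nested loops visit every combination.

```C
// Hypothetical helper: the i-th of n evenly spaced midpoints in [low, high].
double mesh_point(double low, double high, int i, int n){
    return low + (high - low) * (i + 0.5) / n;
}

// Mean of a * b when both variables use the same mesh index (perfectly correlated).
double mean_shared_index(int n){
    double sum = 0;
    for(int i = 0; i < n; i++){
        sum += mesh_point(1, 10, i, n) * mesh_point(1, 2, i, n);
    }
    return sum / n;
}

// Mean of a * b over the full n x n grid (every combination, so independent).
double mean_full_grid(int n){
    double sum = 0;
    for(int i = 0; i < n; i++){
        for(int j = 0; j < n; j++){
            sum += mesh_point(1, 10, i, n) * mesh_point(1, 2, j, n);
        }
    }
    return sum / ((double) n * n);
}
```

For independent uniforms on [1, 10] and [1, 2], the product has mean 5.5 × 1.5 = 8.25, which is what the full grid recovers. The shared-index version instead approaches 9 as `n` grows, because high values of `a` are always paired with high values of `b`.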
### Tests and the long tail of the lognormal
Distribution functions can be tested with:
@ -220,9 +265,9 @@ Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298
delta: -18537322405.459286, relative delta: -463.707799
```
What is happening in this case is that you are taking a normal, like `normal(-0.195240, 4.883106)`, and you are exponentiating it to arrive at a lognormal. But `normal(-0.195240, 4.883106)` is going to have some noninsignificant weight on, say, 18. But `exp(18) = 39976300`, and points like it are going to end up a nontrivial amount to the analytical mean and standard deviation, even though they have little probability mass.
What is happening in this case is that you are taking a normal, like `normal(-0.195240, 4.883106)`, and you are exponentiating it to arrive at a lognormal. But `normal(-0.195240, 4.883106)` is going to have some non-insignificant weight on, say, 18. But `exp(18) ≈ 65659969`, and points like it are going to end up contributing a nontrivial amount to the analytical mean and standard deviation, even though they have little probability mass.
The reader can also check that for more plausible real-world values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't eggregious:
The reader can also check that for more plausible real-world values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't egregious:
```
[x] Mean test for to(10.000000, 10000.000000) PASSED
@ -249,7 +294,7 @@ So far in the history of this program it has emitted:
I think this is good news in terms of making me more confident that this simple library is correct :).
### Division between core functions and squiggle_moreneous expansions
### Division between core functions and squiggle_more expansions
This library differentiates between core functions, which are pretty tightly scoped, and expansions and convenience functions, which are more meandering. Expansions are in `squiggle_more.c` and `squiggle_more.h`. To use them, take care to link them:
@ -266,19 +311,19 @@ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example
#### Extra: confidence intervals
`squiggle_more.c` has some helper functions to get confidence intervals. They are in `squiggle_more.c` because I'm still mulling over what their shape should be, and because until recently they were pretty limited and suboptimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main `squiggle.c` file.
`squiggle_more.c` has some helper functions to get confidence intervals. They are in `squiggle_more.c` because I'm still mulling over what their shape should be, and because until recently they were pretty limited and sub-optimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main `squiggle.c` file.
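
One general way to get a quantile of an array in expected O(number of samples) time is quickselect. The sketch below illustrates that technique under my own assumptions; it is not a copy of what `squiggle_more.c` does, and `ci_sketch`, `quickselect` and `ci_90` are hypothetical names. Note that it reorders the input array.

```C
// Hypothetical return type for a confidence interval.
typedef struct { double low; double high; } ci_sketch;

// Returns the k-th smallest element (0-indexed) of xs[0..n-1],
// partially reordering xs. Expected O(n) time.
double quickselect(double* xs, int n, int k){
    int lo = 0, hi = n - 1;
    while(lo < hi){
        double pivot = xs[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        // Hoare-style partition around the pivot value.
        while(i <= j){
            while(xs[i] < pivot) i++;
            while(xs[j] > pivot) j--;
            if(i <= j){
                double tmp = xs[i]; xs[i] = xs[j]; xs[j] = tmp;
                i++;
                j--;
            }
        }
        // Now xs[lo..j] <= pivot <= xs[i..hi], with j < i.
        if(k <= j) hi = j;
        else if(k >= i) lo = i;
        else break; // j < k < i: xs[k] equals the pivot and is in place
    }
    return xs[k];
}

// 90% interval: the 5th and 95th percentiles of the samples (n >= 1).
ci_sketch ci_90(double* xs, int n){
    ci_sketch result;
    result.low = quickselect(xs, n, (n * 5) / 100);
    result.high = quickselect(xs, n, (n * 95) / 100);
    return result;
}
```

Sorting the array first would be simpler, at the cost of O(n log n) time.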
#### Extra: paralellism
#### Extra: parallelism
I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.
But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intrincancies around setting seeds. Some gotchas and exercises for the reader:
But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:
- If you run the `sampler_parallel` function twice, you will get the same result. Why?
- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
- For a small number of samples, if you run the `sampler_parallel` function, you will get better spread-out random numbers than if you run things serially. Why?
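
Without fully spoiling the exercises, the general mechanism behind this kind of behaviour can be sketched independently of the library: a deterministic generator started from the same seed replays the same stream. The toy generator and the per-worker seeding scheme below are assumptions for illustration, not `sampler_parallel`'s actual implementation, and the "workers" run serially here.

```C
#include <stdint.h>

// Toy xorshift64 generator returning a double in [0, 1). State must be nonzero.
static double toy_uniform(uint64_t* state){
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    return (double)(*state >> 11) / 9007199254740992.0; // divide by 2^53
}

// Fills xs with n draws, split into n_workers chunks. Each chunk gets a
// fixed seed derived only from its index (an assumed scheme).
void fill_chunked(double* xs, int n, int n_workers){
    int chunk = n / n_workers;
    for(int w = 0; w < n_workers; w++){
        uint64_t state = (uint64_t)(w + 1); // fixed, nonzero, per-worker seed
        int end = (w == n_workers - 1) ? n : (w + 1) * chunk;
        for(int i = w * chunk; i < end; i++){
            xs[i] = toy_uniform(&state);
        }
    }
}
```

Calling `fill_chunked` twice produces identical arrays, because nothing about the seeds changes between calls. Any two models fed from those arrays would also see the same underlying uniform draws at each index.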
#### Extra: Cdf auxiliary functions
#### Extra: cdf auxiliary functions
I provide some auxiliary functions that take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
@ -288,7 +333,7 @@ The process of taking a cdf and returning a sample might fail, e.g., it's a Newt
This library provides two approaches:
1. Print the line and function in which the error occured, then exit on error
1. Print the line and function in which the error occurred, then exit on error
2. In situations where there might be an error, return a struct containing either the correct value or an error message:
```C
@ -299,7 +344,7 @@ struct box {
};
```
The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programmes, but is more verbose.
The first approach produces terser programs but might not scale. The second approach seems like it could lead to more robust programs, but is more verbose.
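
The second approach can be sketched as follows. The diff does not show the fields of the library's `struct box`, so `maybe_double`, its fields, and `checked_divide` are hypothetical stand-ins rather than the library's definitions.

```C
#include <stddef.h>

// Hypothetical result type: either a value or an error message.
struct maybe_double {
    int is_error;          // 0 on success, 1 on failure
    double value;          // meaningful only when is_error == 0
    const char* error_msg; // meaningful only when is_error == 1
};

// Example of a function that can fail and reports it through the struct.
struct maybe_double checked_divide(double numerator, double denominator){
    struct maybe_double result = { 0, 0.0, NULL };
    if(denominator == 0.0){
        result.is_error = 1;
        result.error_msg = "division by zero";
        return result;
    }
    result.value = numerator / denominator;
    return result;
}
```

The verbosity shows up at the call site: every caller has to check `is_error` before touching `value`, whereas the exit-on-error approach lets callers use the returned double directly.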
Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library also provides a convenient macro, `PROCESS_ERROR`, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
@ -310,7 +355,7 @@ Overall, I'd describe the error handling capabilities of this library as pretty
- [Squiggle](https://www.squiggle-language.com/)
- [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
- [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
- [time to botec](https://github.com/NunoSempere/time-to-botec)
- [time to BOTEC](https://github.com/NunoSempere/time-to-botec)
- [Find a beta distribution that fits your desired confidence interval](https://nunosempere.com/blog/2023/03/15/fit-beta/)
## To do list
@ -332,7 +377,7 @@ Overall, I'd describe the error handling capabilities of this library as pretty
## Done
- [x] Document paralellism
- [x] Document parallelism
- [x] Document confidence intervals
- [x] Add example for only one sample
- [x] Add example for many samples
@ -382,7 +427,7 @@ Overall, I'd describe the error handling capabilities of this library as pretty
- [x] Add to header file
- [x] Provide example algebra
- [x] Add conversion between 90% ci and parameters.
- [x] Use that conversion in conjuction with small algebra.
- [x] Use that conversion in conjunction with small algebra.
- [x] Consider ergonomics of using ci instead of c_i
- [x] use named struct instead
- [x] demonstrate and document feeding a struct directly to a function; my_function((struct c_i){.low = 1, .high = 2});
@ -390,7 +435,7 @@ Overall, I'd describe the error handling capabilities of this library as pretty
- [ ] Test results
- [x] Move to own file? Or signpost in file? => signposted in file.
- [x] Write twitter thread: now [here](https://twitter.com/NunoSempere/status/1707041153210564959); retweets appreciated.
- [ ] ~~Think about whether to write a simple version of this for [uxn](https://100r.co/site/uxn.html), a minimalistic portable programming stack which, sadly, doesn't have doubles (64 bit floats)~~
- [ ] ~~Think about whether to write a simple version of this for [uxn](https://100r.co/site/uxn.html), a minimalist portable programming stack which, sadly, doesn't have doubles (64 bit floats)~~
- [x] Write better confidence interval code that:
- Gets number of samples as an input
- Gets either a sampler function or a list of samples