From 53a77cb64cd7af18d42f388e07c26707df2d6e83 Mon Sep 17 00:00:00 2001
From: NunoSempere
Date: Thu, 30 Nov 2023 00:00:11 +0000
Subject: [PATCH] update README with changes

---
 README.md       | 78 +++++++++++++++++++++++++++----------------------
 makefile        |  2 +-
 squiggle_more.c |  2 +-
 3 files changed, 45 insertions(+), 37 deletions(-)

diff --git a/README.md b/README.md
index 1dd266d..1d2dab2 100644
--- a/README.md
+++ b/README.md
@@ -18,23 +18,13 @@ squiggle.c is a [grug-brained](https://grugbrain.dev/) self-contained C99 librar
 
 ## Getting started
 
-You can follow some example usage in the examples/ folder
-
-1. In the [1st example](examples/01_one_sample/example.c), we define a small model, and draw one sample from it
-2. In the [2nd example](examples/02_many_samples/example.c), we define a small model, and return many samples
-3. In the [3rd example](examples/03_gcc_nested_function/example.c), we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way.
-4. In the [4th example](examples/04_sample_from_cdf_simple/example.c), we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler.
-5. In the [5th example](examples/05_sample_from_cdf_beta/example.c), we define the cdf for the beta distribution, and we draw samples from it.
-6. In the [6th example](examples/06_gamma_beta/example.c), we take samples from simple gamma and beta distributions, using the samplers provided by this library.
-7. In the [7th example](examples/07_ci_beta/example.c), we get the 90% confidence interval of a beta distribution
-8. The [8th example](examples/08_nuclear_war/example.c) translates the models from Eli and Nuño from [Samotsvety Nuclear Risk Forecasts — March 2022](https://forum.nunosempere.com/posts/KRFXjCqqfGQAYirm5/samotsvety-nuclear-risk-forecasts-march-2022#Nu_o_Sempere) into squiggle.c, then creates a mixture from both, and returns the mean probability of death per month and the 90% confidence interval.
-8. The [9th example](examples/09_burn_10kg_fat/example.c) estimates how many minutes per day I would have to jump rope in order to lose 10kg of fat in half a year.
+You can follow some example usage in the [examples/](examples]) folder. In [examples/core](examples/core/), we build up some functionality, starting from drawing one sample. In examples/more, we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.
 
 ## Commentary
 
 ### squiggle.c is short
 
-[squiggle.c](squiggle.c) is less than 600 lines of C, with a core of <250 lines. The reader could just read it and grasp its contents.
+[squiggle.c](squiggle.c) is less than 700 lines of C, with a core of <230 lines. The reader could just read it and grasp its contents, and is encouraged to do so.
 
 ### Core strategy
 
@@ -44,9 +34,6 @@ This library provides some basic building blocks. The recommended strategy is to
 2. Compose those sampler functions to define your estimation model
 3. At the end, call the last sampler function many times to generate many samples from your model
 
-### Cdf auxiliary functions
-
-
 ### Nested functions and compilation with tcc.
 
 GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.
@@ -57,7 +44,9 @@ GCC has an extension which allows a program to define a function inside another
 | allows nested functions | doesn't allow nested functions |
 | faster execution | slower execution |
 
-My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.
+~~My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.~~
+
+My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout, though keep in mind that extricating code to not use nested functions is relatively easy, so keep in mind that you can do that if you run in other environments.
 
 ### Guarantees and licensing
 
@@ -75,7 +64,7 @@ This code should aim to be correct, then simple, then fast.
 - It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
   - Nonetheless, the user should understand the limitations of sampling-based methods. See the section on [Tests and the long tail of the lognormal](https://git.nunosempere.com/personal/squiggle.c#tests-and-the-long-tail-of-the-lognormal) for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
 - It should be clear, conceptually simple. Simple for me to implement, simple for others to understand.
-- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python.
+- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than, say, Python.
 
 Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
 
@@ -167,6 +156,14 @@ int main(){
 }
 ```
 
+Exercise for the reader: What possible meanings could the following represent in [squiggle](https://www.squiggle-language.com/playground?v=0.8.6#code=eNqrVkpJTUsszSlxzk9JVbJSys3M08jLL8pNzNEw0FEw1NRUUKoFAOYsC1c%3D)? How would you implement each of those meanings in squiggle.c?
+
+```
+min(normal(0, 1))
+```
+
+Hint: See examples/more/13_parallelize_min
+
 ### Tests and the long tail of the lognormal
 
 Distribution functions can be tested with:
@@ -246,7 +243,11 @@ clang-tidy is a utility to detect common errors in C/C++. You can run it with:
 make tidy
 ```
 
-It emits one warning about something I already took care of, so by default I've suppressed it. I think this is good news in terms of making me more confident that this simple library is correct :).
+So far in the history of this program it has emitted:
+- One false-positive warning about an issue I'd already taken care of (so I've suppressed the warning)
+- a warning about an unused variable
+
+I think this is good news in terms of making me more confident that this simple library is correct :).
 
 ### Division between core functions and squiggle_moreneous expansions
 
@@ -263,9 +264,23 @@ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example
 
 ```
 
+#### Extra: confidence intervals
+
+`squiggle_more.c` has some helper functions to get confidence intervals. They are in `squiggle_more.c` because I'm still mulling over what their shape should be, and because until recently they were pretty limited and suboptimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main `squiggle.c` file.
+
+#### Extra: paralellism
+
+I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.
+
+But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intrincancies around setting seeds. Some gotchas and exercises for the reader:
+
+- If you run the `sampler_parallel` function twice, you will get the same result. Why?
+- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
+- For a small amount of samples, if you run the `sampler_parallel` function, you will get better spread out random numbers than if you run things serially. Why?
+
 #### Extra: Cdf auxiliary functions
 
-I provide some Take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
+I provide some auxiliary functions that take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
 
 #### Extra: Error propagation vs exiting on error
 
@@ -290,31 +305,17 @@ Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library
 
 Overall, I'd describe the error handling capabilities of this library as pretty rudimentary. For example, this program might fail in surprising ways if you ask for a lognormal with negative standard deviation, because I haven't added error checking for that case yet.
 
-## Extra: confidence intervals
-
-// to do
-
-## Extra paralellism
-
-// to do
-
 ## Related projects
 
 - [Squiggle](https://www.squiggle-language.com/)
 - [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
 - [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
 - [time to botec](https://github.com/NunoSempere/time-to-botec)
-- [beta]()
+- [Find a beta distribution that fits your desired confidence interval](https://nunosempere.com/blog/2023/03/15/fit-beta/)
 
 ## To do list
 
-- [x] Write better confidence interval code that:
-  - Gets number of samples as an input
-  - Gets either a sampler function or a list of samples
-  - is O(n), not O(nlog(n))
-  - Parallelizes stuff
-- [ ] Document paralellism
-- [ ] Document confidence intervals
+- [ ] Think through seed initialization
 - [ ] Point out that, even though the C standard is ambiguous about this, this code assumes that doubles are 64 bit precision (otherwise the xorshift should be different).
 - [ ] Document rudimentary algebra manipulations for normal/lognormal
 - [ ] Think through whether to delete cdf => samples function
@@ -331,6 +332,8 @@ Overall, I'd describe the error handling capabilities of this library as pretty
 
 ## Done
 
+- [x] Document paralellism
+- [x] Document confidence intervals
 - [x] Add example for only one sample
 - [x] Add example for many samples
 - [ ] ~~Add a custom preprocessor to allow simple nested functions that don't rely on local scope?~~
@@ -388,3 +391,8 @@ Overall, I'd describe the error handling capabilities of this library as pretty
 - [x] Move to own file? Or signpost in file? => signposted in file.
 - [x] Write twitter thread: now [here](https://twitter.com/NunoSempere/status/1707041153210564959); retweets appreciated.
 - [ ] ~~Think about whether to write a simple version of this for [uxn](https://100r.co/site/uxn.html), a minimalistic portable programming stack which, sadly, doesn't have doubles (64 bit floats)~~
+- [x] Write better confidence interval code that:
+  - Gets number of samples as an input
+  - Gets either a sampler function or a list of samples
+  - is O(n), not O(nlog(n))
+  - Parallelizes stuff
diff --git a/makefile b/makefile
index 6cf73b7..ed68dcf 100644
--- a/makefile
+++ b/makefile
@@ -18,4 +18,4 @@ format: squiggle.c squiggle.h
 
 lint:
 	clang-tidy squiggle.c -- -lm
-	clang-tidy extra.c -- -lm
+	clang-tidy squiggle_more.c -- -lm
diff --git a/squiggle_more.c b/squiggle_more.c
index 872f55b..720fdcc 100644
--- a/squiggle_more.c
+++ b/squiggle_more.c
@@ -26,7 +26,7 @@ void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_
 
     // to possibly do by Jorge: improve so that the remainder is included in the threads
     int quotient = n_samples / n_threads;
-    int remainder = n_samples % n_threads;
+    /* int remainder = n_samples % n_threads; // not used, comment to avoid lint warning */
     int divisor_multiple = quotient * n_threads;
 
     uint64_t** seeds = malloc(n_threads * sizeof(uint64_t*));