add type of functions to the README

2024-02-02 15:51:17 +01:00 · 2024-02-02 15:51:17 +01:00 · 1bcb04c356
commit 1bcb04c356
parent 4113c87e4d
9 changed files with 203 additions and 26 deletions
--- a/FOLK_WISDOM.md
+++ b/FOLK_WISDOM.md
@ -249,6 +249,22 @@ I think this is good news in terms of making me more confident that this simple

 In squiggle.c, the boundary between working with sampler functions and arrays of samples is clear. Not so in the original squiggle, which hides this distinction from the user in the interest of accessibility. 

+### Parallelism
+
+I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine. 
+
+But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:
+
+- If you run the `sampler_parallel` function twice, you will get the same result. Why? 
+- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
+- For a small amount of samples, if you run the `sampler_parallel` function, you will get better spread out random numbers than if you run things serially. Why?
+
+That said, I found adding parallelism to be an interesting an engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.
+
+### Using arbitrary cdfs
+
+The last commit that has code to sample from arbitrary cdfs is `8f6919fa2...`. You can access it with `git checkout 8f6919fa2`. I removed them because I wasn't using them, and they didn't really fit with the overall ethos of the project. 
+
 ### Other gotchas

 - Even though the C standard is ambiguous about this, this code assumes that doubles are 64 bit precision (otherwise the xorshift code should be different).
--- a/README.md
+++ b/README.md
@ -195,14 +195,70 @@ The bailey:
 - I think the core interface is not likely to change much, though I've recently changed the interface for parallelism and for getting confidence intervals.
 - I am using this code for a few important consulting projects, and I trust myself to operate it correctly.

-## squiggle extra functions
+## Functions and their usage

-### Division between core functions and squiggle_more expansions
+### squiggle.c

-This library differentiates between core functions, which are pretty tightly scoped, and expansions and convenience functions, which are more meandering. Expansions are in `squiggle_more.c` and `squiggle_more.h`. To use them, take care to link them:
+`squiggle.c` should be pretty tightly scoped. Available functions are:

+```C
+// Underlying pseudo-randomness function
+uint64_t xorshift64(uint64_t* seed);
+
+// Sampling functions
+double sample_unit_uniform(uint64_t* seed);
+double sample_unit_normal(uint64_t* seed);
+double sample_uniform(double start, double end, uint64_t* seed);
+double sample_normal(double mean, double sigma, uint64_t* seed);
+double sample_lognormal(double logmean, double logsigma, uint64_t* seed);
+double sample_normal_from_90_ci(double low, double high, uint64_t* seed);
+double sample_to(double low, double high, uint64_t* seed);
+double sample_gamma(double alpha, uint64_t* seed);
+double sample_beta(double a, double b, uint64_t* seed);
+double sample_laplace(double successes, double failures, uint64_t* seed);
+
+// Array helpers
+double array_sum(double* array, int length);
+void array_cumsum(double* array_to_sum, double* array_cumsummed, int length);
+double array_mean(double* array, int length);
+double array_std(double* array, int length);
+
+// Mixture function
+double sample_mixture(double (*samplers[])(uint64_t*), double* weights, int n_dists, uint64_t* seed);
 ```
-// In your C source file
+
+The samplers syntax for the mixture functions denotes that it takes an array of functions. You can use it as follows:
+
+```C
+#include "squiggle.h"
+
+double sample_0(uint64_t* seed) { UNUSED(seed); return 0; }
+double sample_1(uint64_t* seed) { UNUSED(seed); return 1; }
+double sample_few(uint64_t* seed) { return sample_to(1, 3, seed); }
+double sample_many(uint64_t* seed) { return sample_to(2, 10, seed); }
+
+double sample_model(uint64_t* seed){
+
+    double p_a = 0.8;
+    double p_b = 0.5;
+    double p_c = p_a * p_b;
+
+    int n_dists = 4;
+    double weights[] = { 1 - p_c, p_c / 2, p_c / 4, p_c / 4 };
+    double (*samplers[])(uint64_t*) = { sample_0, sample_1, sample_few, sample_many };
+    double result = sample_mixture(samplers, weights, n_dists, seed);
+
+    return result;
+}
+```
+
+### squiggle_more.h
+
+`squiggle_more.c` has expansions and convenience functions, which are more meandering. Expansions are in `squiggle_more.c` and `squiggle_more.h`. To use them, take care to link them:
+
+
+```C
+#include "squiggle.h"
 #include "squiggle_more.h"
 ```

@ -212,39 +268,115 @@ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example

 ```

-#### Extra: summary statistics
+Available definitions are as follows:

-`squiggle_more.c` has some helper functions to get confidence intervals. They are in `squiggle_more.c` because I'm still mulling over what their shape should be, and because until recently they were pretty limited and sub-optimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main `squiggle.c` file.
+```C
+#define THOUSAND 1000
+#define MILLION 1000000

-The relevant functions are 
+/* Parallel sampling */
+void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_threads, int n_samples);

-```
+/* Stats */
+double array_get_median(double xs[], int n);
 typedef struct ci_t {
    double low;
    double high;
 } ci;
-double array_get_median(double xs[], int n);
 ci array_get_ci(ci interval, double* xs, int n);
 ci array_get_90_ci(double xs[], int n);
+
 void array_print_stats(double xs[], int n);
 void array_print_histogram(double* xs, int n_samples, int n_bins);
+void array_print_90_ci_histogram(double* xs, int n, int n_bins);
+
+/* Algebra manipulations */
+
+typedef struct normal_params_t {
+    double mean;
+    double std;
+} normal_params;
+normal_params algebra_sum_normals(normal_params a, normal_params b);
+
+typedef struct lognormal_params_t {
+    double logmean;
+    double logstd;
+} lognormal_params;
+lognormal_params algebra_product_lognormals(lognormal_params a, lognormal_params b);
+
+lognormal_params convert_ci_to_lognormal_params(ci x);
+ci convert_lognormal_params_to_ci(lognormal_params y);
 ```

-#### Extra: parallelism
+On parallelism in particular, see the warnings and caveats in the [FOLK_WISDOM.md](./FOLK_WISDOM.md) file. That file also has many other nuggets, warnings, trinkets, caveats, pointers I've collected over time. 

-I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine. 
+Here is an example of using parallelism, and then printing some stats and a histogram:

-But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:
+```C
+#include "../../../squiggle.h"
+#include "../../../squiggle_more.h"
+#include <stdio.h>
+#include <stdlib.h>

- If you run the `sampler_parallel` function twice, you will get the same result. Why? 
- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
- For a small amount of samples, if you run the `sampler_parallel` function, you will get better spread out random numbers than if you run things serially. Why?
+double sample_beta_3_2(uint64_t* seed) { return sample_beta(3.0, 2.0, seed); }

-That said, I found adding parallelism to be an interesting an engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.
+int main()
+{
+    // set randomness seed
+    uint64_t* seed = malloc(sizeof(uint64_t));
+    *seed = 1000; // xorshift can't start with 0

-#### Extra: Algebraic manipulations
+    int n_samples = 1 * MILLION;
+    double* xs = malloc(sizeof(double) * (size_t)n_samples);
+    sampler_parallel(sample_beta_3_2, xs, 16, n_samples);

-`squiggle_more.c` has some functions to do some simple algebra manipulations: sums of normals and products of lognormals. You can see some example usage [here](examples/more/07_algebra/example.c) and [here](examples/more/08_algebra_and_conversion/example.c).
+    printf("\n# Stats\n");
+    array_print_stats(xs, n_samples);
+    printf("\n# Histogram\n");
+    array_print_histogram(xs, n_samples, 23);
+
+    free(seed);
+}
+```
+
+This produces the following output:
+
+```
+Avg: 0.600036
+Std: 0.199851
+ 5%: 0.249009
+10%: 0.320816
+25%: 0.456413
+50%: 0.614356
+75%: 0.757000
+90%: 0.857256
+95%: 0.902290
+
+# Histogram
+[  0.00,   0.05):  391
+[  0.05,   0.09): █ 2352
+[  0.09,   0.13): ███ 5766
+[  0.13,   0.18): ██████ 10517
+[  0.18,   0.22): ██████████ 16412
+[  0.22,   0.26): ██████████████ 22773
+[  0.26,   0.31): ███████████████████ 30120
+[  0.31,   0.35): ████████████████████████ 37890
+[  0.35,   0.39): █████████████████████████████ 45067
+[  0.39,   0.44): █████████████████████████████████ 52174
+[  0.44,   0.48): ██████████████████████████████████████ 59636
+[  0.48,   0.52): ██████████████████████████████████████████ 64924
+[  0.52,   0.57): █████████████████████████████████████████████ 69832
+[  0.57,   0.61): ████████████████████████████████████████████████ 74099
+[  0.61,   0.65): █████████████████████████████████████████████████ 76776
+[  0.65,   0.70): ██████████████████████████████████████████████████ 77001
+[  0.70,   0.74): ████████████████████████████████████████████████ 75290
+[  0.74,   0.78): ██████████████████████████████████████████████ 71711
+[  0.78,   0.83): ██████████████████████████████████████████ 65576
+[  0.83,   0.87): ████████████████████████████████████ 56839
+[  0.87,   0.91): ████████████████████████████ 44626
+[  0.91,   0.96): ███████████████████ 29464
+[  0.96,   1.00]: ██████ 10764
+```

 ## Licensing

--- a/ROADMAP.md
+++ b/ROADMAP.md
@ -5,8 +5,8 @@
 - [x] Big refactor
  - [ ] Come up with a better headline example; fermi paradox paper is too complicated
  - [x] Make README.md less messy
-  - [ ] Give examples of new functions
-  - [ ] Reference commit with cdf functions, even though deleted
+  - [x] Give examples of new functions
+  - [x] Reference commit with cdf functions, even though deleted
 - [ ] Post on suckless subreddit
 - [ ] Drive in a few more real-life applications
  - [ ] US election modelling?
--- a/examples/more/02_ci_beta/example
+++ b/examples/more/02_ci_beta/example
--- a/examples/more/02_ci_beta/example.c
+++ b/examples/more/02_ci_beta/example.c
--- a/examples/more/03_ci_beta_parallel/example
+++ b/examples/more/03_ci_beta_parallel/example
--- a/examples/more/03_ci_beta_parallel/example.c
+++ b/examples/more/03_ci_beta_parallel/example.c
@ -0,0 +1,28 @@
+#include "../../../squiggle.h"
+#include "../../../squiggle_more.h"
+#include <stdio.h>
+#include <stdlib.h>
+
+// Estimate functions
+double sample_beta_3_2(uint64_t* seed)
+{
+    return sample_beta(3.0, 2.0, seed);
+}
+
+int main()
+{
+    // set randomness seed
+    uint64_t* seed = malloc(sizeof(uint64_t));
+    *seed = 1000; // xorshift can't start with 0
+
+    int n_samples = 1 * MILLION;
+    double* xs = malloc(sizeof(double) * (size_t)n_samples);
+    sampler_parallel(sample_beta_3_2, xs, 16, n_samples);
+
+    printf("\n# Stats\n");
+    array_print_stats(xs, n_samples);
+    printf("\n# Histogram\n");
+    array_print_histogram(xs, n_samples, 23);
+
+    free(seed);
+}
--- a/examples/more/makefile
+++ b/examples/more/makefile
@ -39,7 +39,8 @@ FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
 ## make all
 all:
 	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 00_example_template/$(SRC)             $(DEPS) -o 00_example_template/$(OUTPUT)
-	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 03_ci_beta/$(SRC)                      $(DEPS) -o 03_ci_beta/$(OUTPUT)
+	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 02_ci_beta/$(SRC)                      $(DEPS) -o 02_ci_beta/$(OUTPUT)
+	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 03_ci_beta_parallel/$(SRC)             $(DEPS) -o 03_ci_beta_parallel/$(OUTPUT)
 	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 04_nuclear_war/$(SRC)                  $(DEPS) -o 04_nuclear_war/$(OUTPUT)
 	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 05_burn_10kg_fat/$(SRC)                $(DEPS) -o 05_burn_10kg_fat/$(OUTPUT)
 	$(CC) $(OPTIMIZED) $(DEBUG) $(WARN) 06_nuclear_recovery/$(SRC)             $(DEPS) -o 06_nuclear_recovery/$(OUTPUT)
@ -55,8 +56,8 @@ all:
 format-all:
 	$(FORMATTER) 00_example_template/$(SRC)
 	$(FORMATTER) 01_sample_from_cdf/$(SRC)
-	$(FORMATTER) 02_sample_from_cdf_beta/$(SRC)
-	$(FORMATTER) 03_ci_beta/$(SRC)
+	$(FORMATTER) 02_ci_beta/$(SRC)
+	$(FORMATTER) 03_ci_beta_parallel/$(SRC)
 	$(FORMATTER) 04_nuclear_war/$(SRC)
 	$(FORMATTER) 05_burn_10kg_fat/$(SRC)
 	$(FORMATTER) 06_nuclear_recovery/$(SRC)
@ -72,8 +73,8 @@ format-all:
 run-all:
 	00_example_template/$(OUTPUT)
 	01_sample_from_cdf/$(OUTPUT)
-	02_sample_from_cdf_beta/$(OUTPUT)
-	03_ci_beta/$(OUTPUT)
+	02_ci_beta/$(OUTPUT)
+	03_ci_beta_parallel/$(OUTPUT)
 	04_nuclear_war/$(OUTPUT)
 	05_burn_10kg_fat/$(OUTPUT)
 	06_nuclear_recovery/$(OUTPUT)
--- a/squiggle_more.h
+++ b/squiggle_more.h
@ -4,7 +4,7 @@
 /* Parallel sampling */
 void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_threads, int n_samples);

-/* Get median and confidence intervals */
+/* Stats */
 double array_get_median(double xs[], int n);
 typedef struct ci_t {
    double low;