docs & improve histogram function

2024-01-31 09:49:21 +01:00 · 2024-01-31 09:49:21 +01:00 · e62a840625
commit e62a840625
parent 4bf13f3c22
2 changed files with 40 additions and 13 deletions
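
Note on the `array_print_histogram` hunk below: the new label logic picks the number of decimals from the bin width and closes every bucket with `)` except the last, which gets `]`. A minimal standalone sketch of that logic, for checking it in isolation, might look like the following. The helper names `bin_label_decimals`, `bin_label_delimiter` and `format_bin_label` are hypothetical and do not exist in `squiggle_more.c`; the use of `snprintf` with a `*` precision is a choice made for this sketch, not what the commit does.

```c
#include <stdio.h>

// Hypothetical helper (not in squiggle_more.c): number of decimals
// for a bucket label, using the same thresholds as the diff below.
static int bin_label_decimals(double bin_width) {
    if (bin_width < 0.01) return 4;
    if (bin_width < 0.1) return 3;
    if (bin_width < 1) return 2;
    if (bin_width < 10) return 1;
    return 0;
}

// Hypothetical helper: every bucket is half-open, except the last
// one, which is inclusive on both ends.
static char bin_label_delimiter(int i, int n_bins) {
    return (i == n_bins - 1) ? ']' : ')';
}

// Hypothetical helper: writes a label such as "  [1.50, 2.00]: " into buf.
// Returns whatever snprintf returns.
static int format_bin_label(char* buf, size_t size, double min_value,
                            double bin_width, int i, int n_bins) {
    double bin_start = min_value + i * bin_width;
    double bin_end = bin_start + bin_width;
    int decimals = bin_label_decimals(bin_width);
    // "%4.*f" takes the precision from the preceding int argument
    return snprintf(buf, size, "  [%4.*f, %4.*f%c: ",
                    decimals, bin_start, decimals, bin_end,
                    bin_label_delimiter(i, n_bins));
}
```

With four bins of width 0.5 starting at 0, bucket 0 would be labelled `[0.00, 0.50)` and bucket 3 would be labelled `[1.50, 2.00]`.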
--- a/README.md
+++ b/README.md
@@ -29,7 +29,8 @@ This library provides some basic building blocks. The recommended strategy is to

 1. Define sampler functions, which take a seed, and return 1 sample
 2. Compose those sampler functions to define your estimation model
-3. At the end, call the last sampler function many times to generate many samples from your model
+3. Produce an array of samples from a sampler function
+4. Get summary statistics for that array of samples.

 ### Nested functions and compilation with tcc.

@@ -206,6 +207,10 @@ double* model(int n_samples){

 But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not [grug-brained](https://grugbrain.dev/). So every now and then I have to remember that this is not the way.

+### Boundaries between sampling functions and arrays of samples
+
+In squiggle.c, the boundary between working with sampler functions and arrays of samples is clear. Not so in the original squiggle, which hides this distinction from the user in the interest of accessibility. 
+
 ### Tests and the long tail of the lognormal

 Distribution functions can be tested with:
@@ -317,10 +322,24 @@ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example

 ```

-#### Extra: confidence intervals
+#### Extra: summary statistics

 `squiggle_more.c` has some helper functions to get confidence intervals. They are in `squiggle_more.c` because I'm still mulling over what their shape should be, and because until recently they were pretty limited and sub-optimal. But recently, I've put a bunch of effort into being able to get the confidence interval of an array of samples in O(number of samples), and into making confidence interval functions nicer and more general. So I might promote them to the main `squiggle.c` file.

+The relevant functions are:
+
+```
+typedef struct ci_t {
+    double low;
+    double high;
+} ci;
+double array_get_median(double xs[], int n);
+ci array_get_ci(ci interval, double* xs, int n);
+ci array_get_90_ci(double xs[], int n);
+void array_print_stats(double xs[], int n);
+void array_print_histogram(double* xs, int n_samples, int n_bins);
+```
+
 #### Extra: parallelism

 I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine. 
@@ -341,6 +360,8 @@ That said, I found adding parallelism to be an interesting an engaging task. Mos

 I provide some auxiliary functions that take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it. 

+I have no immediate plans to delete these functions, but I probably will eventually, since they are too slow for my taste.
+
 #### Extra: Error propagation vs exiting on error

 The process of taking a cdf and returning a sample might fail, e.g., it's a Newton method which might fail to converge because of cdf artifacts. The cdf itself might also fail, e.g., if a distribution only accepts a range of parameters, but is fed parameters outside that range.
@@ -387,6 +408,7 @@ Overall, I'd describe the error handling capabilities of this library as pretty

 ### Done

+- [x] Document print stats
 - [x] Document rudimentary algebra manipulations for normal/lognormal
 - [x] Think through whether to delete cdf => samples function => not for now
 - [x] Think through whether to:
--- a/squiggle_more.c
+++ b/squiggle_more.c
@@ -217,11 +217,11 @@ void array_print_stats(double xs[], int n){
 void array_print_histogram(double* xs, int n_samples, int n_bins) {
    // Generated with the help of an llm; there might be subtle off-by-one errors
    // interface inspired by <https://github.com/red-data-tools/YouPlot>
-    if (n_bins <= 0) {
-        fprintf(stderr, "Number of bins must be a positive integer.\n");
+    if (n_bins <= 1) {
+        fprintf(stderr, "Number of bins must be greater than 1.\n");
        return;
-    } else if (n_samples <= 0) {
-        fprintf(stderr, "Number of samples must be a positive integer.\n");
+    } else if (n_samples <= 1) {
+        fprintf(stderr, "Number of samples must be higher than 1.\n");
        return;
    }

@@ -275,19 +275,24 @@ void array_print_histogram(double* xs, int n_samples, int n_bins) {
    for (int i = 0; i < n_bins; i++) {
        double bin_start = min_value + i * bin_width;
        double bin_end = bin_start + bin_width;
+
        if(bin_width < 0.01){
-            printf("  [%4.4f, %4.4f): ", bin_start, bin_end); 
+            printf("  [%4.4f, %4.4f", bin_start, bin_end); 
        } else if(bin_width < 0.1){
-            printf("  [%4.3f, %4.3f): ", bin_start, bin_end); 
+            printf("  [%4.3f, %4.3f", bin_start, bin_end); 
        } else if(bin_width < 1){
-            printf("  [%4.2f, %4.2f): ", bin_start, bin_end); 
+            printf("  [%4.2f, %4.2f", bin_start, bin_end); 
        } else if(bin_width < 10){
-            printf("  [%4.1f, %4.1f): ", bin_start, bin_end); 
+            printf("  [%4.1f, %4.1f", bin_start, bin_end); 
        } else {
-            printf("  [%4f, %4f): ", bin_start, bin_end); 
+            printf("  [%4.0f, %4.0f", bin_start, bin_end); 
        }
-        // number of decimals could depend on the number of bins
-        // or on the size of the smallest bucket
+
+        char interval_delimiter = ')';
+        if(i == (n_bins-1)){
+            interval_delimiter = ']'; // last bucket is inclusive
+        }
+        printf("%c: ", interval_delimiter);

        int marks = (int)(bins[i] * scale);
        for (int j = 0; j < marks; j++) {