Compare commits

..

8 Commits

10 changed files with 309 additions and 38 deletions

View File

@ -20,12 +20,14 @@ A self-contained C99 library that provides a subset of [Squiggle](https://www.sq
You can follow some example usage in the examples/ folder You can follow some example usage in the examples/ folder
1. In the first example, we define a small model, and draw one sample from it 1. In the [1st example](src/branch/master/examples/01_one_sample/example.c), we define a small model, and draw one sample from it
2. In the second example, we define a small model, and return many samples 2. In the 2nd example, we define a small model, and return many samples
3. In the third example, we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way. 3. In the 3rd example, we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way.
4. In the fourth example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler. 4. In the 4th example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler.
5. In the fifth example, we define the cdf for the beta distribution, and we draw samples from it. 5. In the 5th example, we define the cdf for the beta distribution, and we draw samples from it.
6. In the sixth example, we take samples from simple gamma and beta distributions, using the samplers provided by this library. 6. In the 6th example, we take samples from simple gamma and beta distributions, using the samplers provided by this library.
7. In the 7th example, we get the 90% confidence interval of a beta distribution
8. The 8th example translates the models from Eli and Nuño from [Samotsvety Nuclear Risk Forecasts — March 2022](https://forum.nunosempere.com/posts/KRFXjCqqfGQAYirm5/samotsvety-nuclear-risk-forecasts-march-2022#Nu_o_Sempere) into squiggle.c, then creates a mixture from both, and returns the mean probability of death per month and the 90% confidence interval.
## Commentary ## Commentary
@ -78,17 +80,18 @@ The first approach produces terser programs but might not scale. The second appr
Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library also provides a convenient macro, `PROCESS_ERROR`, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder. Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library also provides a convenient macro, `PROCESS_ERROR`, to make error handling in either case much terser—see the usage in example 4 in the examples/ folder.
## Design choices ### Design choices
This code should be correct, then simple, then fast. This code should be correct, then simple, then fast.
- It should be correct. The user should be able to rely on it and not think about whether errors come from the library. - It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- Nonetheless, the user should understand the limitations of sampling-based methods. See the section on [Tests and the long tail of the lognormal](https://git.nunosempere.com/personal/squiggle.c#tests-and-the-long-tail-of-the-lognormal) for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
- It should be clear, conceptually simple. Simple for me to implement, simple for others to understand - It should be clear, conceptually simple. Simple for me to implement, simple for others to understand
- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python. - It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python.
Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section. Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
## Correlated samples ### Correlated samples
In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means: In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:
@ -243,17 +246,19 @@ Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: -2258.290043, relative delta: -0.095779 delta: -2258.290043, relative delta: -0.095779
``` ```
Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle, Simple Squiggle (both linked below), or even doing lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape.
## Related projects ## Related projects
- [Squiggle](https://www.squiggle-language.com/) - [Squiggle](https://www.squiggle-language.com/)
- [SquigglePy](https://github.com/rethinkpriorities/squigglepy) - [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
- [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/) - [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
- [time to botec](https://github.com/NunoSempere/time-to-botec) - [time to botec](https://github.com/NunoSempere/time-to-botec)
- [beta]()
## To do list ## To do list
- [ ] Have some more complicated & realistic example - [ ] Link to the examples in the examples section.
- [ ] Add summarization functions: 90% ci (or all c.i.?)
- [ ] Systematize references - [ ] Systematize references
- [ ] Publish online - [ ] Publish online
- [ ] Support all distribution functions in <https://www.squiggle-language.com/docs/Api/Dist> - [ ] Support all distribution functions in <https://www.squiggle-language.com/docs/Api/Dist>
@ -265,7 +270,7 @@ delta: -2258.290043, relative delta: -0.095779
- [x] Add example for many samples - [x] Add example for many samples
- [ ] ~~Add a custom preprocessor to allow simple nested functions that don't rely on local scope?~~ - [ ] ~~Add a custom preprocessor to allow simple nested functions that don't rely on local scope?~~
- [x] Use gcc extension to define functions nested inside main. - [x] Use gcc extension to define functions nested inside main.
- [x] Chain various sample_mixture functions - [x] Chain various `sample_mixture` functions
- [x] Add beta distribution - [x] Add beta distribution
- See <https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution> for a faster method. - See <https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution> for a faster method.
- [ ] ~~Use OpenMP for acceleration~~ - [ ] ~~Use OpenMP for acceleration~~
@ -299,3 +304,6 @@ delta: -2258.290043, relative delta: -0.095779
- https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution - https://stats.stackexchange.com/questions/502146/how-does-numpy-generate-samples-from-a-beta-distribution
- https://github.com/numpy/numpy/blob/5cae51e794d69dd553104099305e9f92db237c53/numpy/random/src/distributions/distributions.c - https://github.com/numpy/numpy/blob/5cae51e794d69dd553104099305e9f92db237c53/numpy/random/src/distributions/distributions.c
- [x] Pontificate about lognormal tests - [x] Pontificate about lognormal tests
- [x] Give warning about sampling-based methods.
- [x] Have some more complicated & realistic example
- [x] Add summarization functions: 90% ci (or all c.i.?)

View File

@ -0,0 +1,14 @@
#include "../../squiggle.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
// set randomness seed
uint64_t* seed = malloc(sizeof(uint64_t));
*seed = 1000; // xorshift can't start with 0
free(seed);
}

View File

@ -0,0 +1,53 @@
# Interface:
# make
# make build
# make format
# make run
# Compiler
CC=gcc
# CC=tcc # <= faster compilation
# Main file
SRC=example.c ../../squiggle.c
OUTPUT=example
## Dependencies
MATH=-lm
## Flags
DEBUG= #'-g'
STANDARD=-std=c99
WARNINGS=-Wall
OPTIMIZED=-O3 #-Ofast
# OPENMP=-fopenmp
## Formatter
STYLE_BLUEPRINT=webkit
FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
## make build
build: $(SRC)
$(CC) $(OPTIMIZED) $(DEBUG) $(SRC) $(MATH) -o $(OUTPUT)
format: $(SRC)
$(FORMATTER) $(SRC)
run: $(SRC) $(OUTPUT)
OMP_NUM_THREADS=1 ./$(OUTPUT) && echo
time-linux:
@echo "Requires /bin/time, found on GNU/Linux systems" && echo
@echo "Running 100x and taking avg time $(OUTPUT)"
@t=$$(/usr/bin/time -f "%e" -p bash -c 'for i in {1..100}; do $(OUTPUT); done' 2>&1 >/dev/null | grep real | awk '{print $$2}' ); echo "scale=2; 1000 * $$t / 100" | bc | sed "s|^|Time using 1 thread: |" | sed 's|$$|ms|' && echo
## Profiling
profile-linux:
echo "Requires perf, which depends on the kernel version, and might be in linux-tools package or similar"
echo "Must be run as sudo"
$(CC) $(SRC) $(MATH) -o $(OUTPUT)
sudo perf record ./$(OUTPUT)
sudo perf report
rm perf.data

BIN
examples/08_nuclear_war/example Executable file

Binary file not shown.

View File

@ -0,0 +1,68 @@
#include "../../squiggle.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
double probability_of_dying_nuno(uint64_t* seed)
{
double first_year_russian_nuclear_weapons = 1953;
double current_year = 2022;
double laplace_probability_nuclear_exchange_year = sample_beta(1, current_year - first_year_russian_nuclear_weapons + 1, seed);
double laplace_probability_nuclear_exchange_month = 1 - pow(1 - laplace_probability_nuclear_exchange_year, (1.0 / 12.0));
double london_hit_conditional_on_russia_nuclear_weapon_usage = sample_beta(7.67, 69.65, seed);
// I.e., a beta distribution with a range of 0.05 to 0.16 into: https://nunosempere.com/blog/2023/03/15/fit-beta/
// 0.05 were my estimate and Samotsvety's estimate in March 2022, respectively:
// https://forum.effectivealtruism.org/posts/KRFXjCqqfGQAYirm5/samotsvety-nuclear-risk-forecasts-march-2022#Nu_o_Sempere
double informed_actor_not_able_to_escape = sample_beta(3.26212166586967, 3.26228162008564, seed);
// 0.2 to 0.8, i.e., 20% to 80%, again using the previous tool
double proportion_which_die_if_bomb_drops_in_london = sample_beta(10.00, 2.45, seed); // 60% to 95%
double probability_of_dying = laplace_probability_nuclear_exchange_month * london_hit_conditional_on_russia_nuclear_weapon_usage * informed_actor_not_able_to_escape * proportion_which_die_if_bomb_drops_in_london;
return probability_of_dying;
}
double probability_of_dying_eli(uint64_t* seed)
{
double russia_nato_nuclear_exchange_in_next_month = sample_beta(1.30, 1182.99, seed); // .0001 to .003
double london_hit_conditional = sample_beta(3.47, 8.97, seed); // 0.1 to 0.5
double informed_actors_not_able_to_escape = sample_beta(2.73, 5.67, seed); // .1 to .6
double proportion_which_die_if_bomb_drops_in_london = sample_beta(3.00, 1.46, seed); // 0.3 to 0.95;
double probability_of_dying = russia_nato_nuclear_exchange_in_next_month * london_hit_conditional * informed_actors_not_able_to_escape * proportion_which_die_if_bomb_drops_in_london;
return probability_of_dying;
}
double mixture(uint64_t* seed)
{
double (*samplers[])(uint64_t*) = { probability_of_dying_nuno, probability_of_dying_eli };
double weights[] = { 0.5, 0.5 };
return sample_mixture(samplers, weights, 2, seed);
}
int main()
{
// set randomness seed
uint64_t* seed = malloc(sizeof(uint64_t));
*seed = 1000; // xorshift can't start with 0
int n = 1000 * 1000;
double* mixture_result = malloc(sizeof(double) * n);
for (int i = 0; i < n; i++) {
mixture_result[i] = mixture(seed);
}
printf("mixture_result: [ ");
for (int i = 0; i < 9; i++) {
printf("%.6f, ", mixture_result[i]);
}
printf("... ]\n");
struct c_i c_i_90 = get_90_confidence_interval(mixture, seed);
printf("mean: %f\n", array_mean(mixture_result, n));
printf("90%% confidence interval: [%f, %f]\n", c_i_90.low, c_i_90.high);
free(seed);
}

View File

@ -0,0 +1,53 @@
# Interface:
# make
# make build
# make format
# make run
# Compiler
CC=gcc
# CC=tcc # <= faster compilation
# Main file
SRC=example.c ../../squiggle.c
OUTPUT=example
## Dependencies
MATH=-lm
## Flags
DEBUG= #'-g'
STANDARD=-std=c99
WARNINGS=-Wall
OPTIMIZED=-O3 #-Ofast
# OPENMP=-fopenmp
## Formatter
STYLE_BLUEPRINT=webkit
FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
## make build
build: $(SRC)
$(CC) $(OPTIMIZED) $(DEBUG) $(SRC) $(MATH) -o $(OUTPUT)
format: $(SRC)
$(FORMATTER) $(SRC)
run: $(SRC) $(OUTPUT)
OMP_NUM_THREADS=1 ./$(OUTPUT) && echo
time-linux:
@echo "Requires /bin/time, found on GNU/Linux systems" && echo
@echo "Running 100x and taking avg time $(OUTPUT)"
@t=$$(/usr/bin/time -f "%e" -p bash -c 'for i in {1..100}; do $(OUTPUT); done' 2>&1 >/dev/null | grep real | awk '{print $$2}' ); echo "scale=2; 1000 * $$t / 100" | bc | sed "s|^|Time using 1 thread: |" | sed 's|$$|ms|' && echo
## Profiling
profile-linux:
echo "Requires perf, which depends on the kernel version, and might be in linux-tools package or similar"
echo "Must be run as sudo"
$(CC) $(SRC) $(MATH) -o $(OUTPUT)
sudo perf record ./$(OUTPUT)
sudo perf report
rm perf.data

Binary file not shown.

View File

@ -0,0 +1,20 @@
#include "../../../squiggle.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
// set randomness seed
uint64_t* seed = malloc(sizeof(uint64_t));
*seed = 1000; // xorshift can't start with 0
double firstYearRussianNuclearWeapons = 1953;
double currentYear = 2023;
for(int i=0; i<10; i++){
double laplace_beta = sample_beta(currentYear-firstYearRussianNuclearWeapons + 1, 1, seed);
printf("%f\n", laplace_beta);
}
free(seed);
}

View File

@ -0,0 +1,53 @@
# Interface:
# make
# make build
# make format
# make run
# Compiler
CC=gcc
# CC=tcc # <= faster compilation
# Main file
SRC=example.c ../../../squiggle.c
OUTPUT=example
## Dependencies
MATH=-lm
## Flags
DEBUG= #'-g'
STANDARD=-std=c99
WARNINGS=-Wall
OPTIMIZED=-O3 #-Ofast
# OPENMP=-fopenmp
## Formatter
STYLE_BLUEPRINT=webkit
FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
## make build
build: $(SRC)
$(CC) $(OPTIMIZED) $(DEBUG) $(SRC) $(MATH) -o $(OUTPUT)
format: $(SRC)
$(FORMATTER) $(SRC)
run: $(SRC) $(OUTPUT)
OMP_NUM_THREADS=1 ./$(OUTPUT) && echo
time-linux:
@echo "Requires /bin/time, found on GNU/Linux systems" && echo
@echo "Running 100x and taking avg time $(OUTPUT)"
@t=$$(/usr/bin/time -f "%e" -p bash -c 'for i in {1..100}; do $(OUTPUT); done' 2>&1 >/dev/null | grep real | awk '{print $$2}' ); echo "scale=2; 1000 * $$t / 100" | bc | sed "s|^|Time using 1 thread: |" | sed 's|$$|ms|' && echo
## Profiling
profile-linux:
echo "Requires perf, which depends on the kernel version, and might be in linux-tools package or similar"
echo "Must be run as sudo"
$(CC) $(SRC) $(MATH) -o $(OUTPUT)
sudo perf record ./$(OUTPUT)
sudo perf report
rm perf.data

View File

@ -94,12 +94,12 @@ double sample_gamma(double alpha, uint64_t* seed)
// A Simple Method for Generating Gamma Variables, Marsaglia and Wan Tsang, 2001 // A Simple Method for Generating Gamma Variables, Marsaglia and Wan Tsang, 2001
// https://dl.acm.org/doi/pdf/10.1145/358407.358414 // https://dl.acm.org/doi/pdf/10.1145/358407.358414
// see also the references/ folder // see also the references/ folder
// Note that the Wikipedia page for the gamma distribution includes a scaling parameter // Note that the Wikipedia page for the gamma distribution includes a scaling parameter
// k or beta // k or beta
// https://en.wikipedia.org/wiki/Gamma_distribution // https://en.wikipedia.org/wiki/Gamma_distribution
// such that gamma_k(alpha, k) = k * gamma(alpha) // such that gamma_k(alpha, k) = k * gamma(alpha)
// or gamma_beta(alpha, beta) = gamma(alpha) / beta // or gamma_beta(alpha, beta) = gamma(alpha) / beta
// So far I have not needed to use this, and thus the second parameter is by default 1. // So far I have not needed to use this, and thus the second parameter is by default 1.
if (alpha >= 1) { if (alpha >= 1) {
double d, c, x, v, u; double d, c, x, v, u;
d = alpha - 1.0 / 3.0; d = alpha - 1.0 / 3.0;
@ -386,38 +386,40 @@ struct box sampler_cdf_double(double cdf(double), uint64_t* seed)
// Get confidence intervals, given a sampler // Get confidence intervals, given a sampler
struct c_i { struct c_i {
float low; float low;
float high; float high;
}; };
int compare_doubles(const void *p, const void *q) { int compare_doubles(const void* p, const void* q)
// https://wikiless.esmailelbob.xyz/wiki/Qsort?lang=en {
double x = *(const double *)p; // https://wikiless.esmailelbob.xyz/wiki/Qsort?lang=en
double y = *(const double *)q; double x = *(const double*)p;
double y = *(const double*)q;
/* Avoid return x - y, which can cause undefined behaviour /* Avoid return x - y, which can cause undefined behaviour
because of signed integer overflow. */ because of signed integer overflow. */
if (x < y) if (x < y)
return -1; // Return -1 if you want ascending, 1 if you want descending order. return -1; // Return -1 if you want ascending, 1 if you want descending order.
else if (x > y) else if (x > y)
return 1; // Return 1 if you want ascending, -1 if you want descending order. return 1; // Return 1 if you want ascending, -1 if you want descending order.
return 0; return 0;
} }
struct c_i get_90_confidence_interval(double (*sampler)(uint64_t*), uint64_t* seed){ struct c_i get_90_confidence_interval(double (*sampler)(uint64_t*), uint64_t* seed)
int n = 100 * 1000; {
double* samples_array = malloc(n * sizeof(double)); int n = 100 * 1000;
for(int i=0; i<n; i++){ double* samples_array = malloc(n * sizeof(double));
samples_array[i] = sampler(seed); for (int i = 0; i < n; i++) {
} samples_array[i] = sampler(seed);
qsort(samples_array, n, sizeof(double), compare_doubles); }
qsort(samples_array, n, sizeof(double), compare_doubles);
struct c_i result = { struct c_i result = {
.low = samples_array[5000], .low = samples_array[5000],
.high =samples_array[94999], .high = samples_array[94999],
}; };
free(samples_array); free(samples_array);
return result; return result;
} }
/* Could also define other variations, e.g., /* Could also define other variations, e.g.,