add xorshift64 + various changes.

This commit is contained in:
NunoSempere 2023-07-23 12:44:16 +02:00
parent 98e058bd88
commit 11286211f7
6 changed files with 71 additions and 18 deletions

View File

@ -24,12 +24,13 @@ You can follow some example usage in the examples/ folder
3. In the third example, we use a gcc extension—nested functions—to rewrite the code from point 2. in a more linear way.
4. In the fourth example, we define some simple cdfs, and we draw samples from those cdfs. We see that this approach is slower than using the built-in samplers, e.g., the normal sampler.
5. In the fifth example, we define the cdf for the beta distribution, and we draw samples from it.
6. In the sixth example, we take samples from simple gamma and beta distributions, using the samplers provided by this library.
## Commentary
### squiggle.c is short
`squiggle.c` is around 300 lines of C. The reader could just read it and grasp its contents.
`squiggle.c` is less than 500 lines of C. The reader could just read it and grasp its contents.
### Core strategy
@ -41,13 +42,19 @@ This library provides some basic building blocks. The recommended strategy is to
### Cdf auxiliary functions
To help with the above core strategy, this library provides convenience functions, which take a cdf, and returns a sample from the distribution produced by that cdf.
This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
To help with the above core strategy, this library provides convenience functions, which take a cdf, and return a sample from the distribution produced by that cdf. This might make it easier to program models, at the cost of a 20x to 60x slowdown in the parts of the code that use it.
### Nested functions and compilation with tcc.
GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, and doesn't allow for nested functions.
GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.
| GCC | tcc |
| --- | --- |
| slower compilation | faster compilation |
| allows nested functions | doesn't allow nested functions |
| faster execution | slower execution |
My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.
### Error propagation vs exiting on error
@ -72,23 +79,43 @@ Behaviour on error can be toggled by the `EXIT_ON_ERROR` variable. This library
## Design choices
This code should aim to be fast, but at the margin, simplicity of implementation wins over speed. For example, there are various possible algorithms to sample a distribution, one of which is faster in part of the domain, I'm choosing to only have one algorithm and pay the—normally small—performance penalty.
This code should be correct, then simple, then fast.
- It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
- It should be clear, conceptually simple. Simple for me to implement, simple for others to understand
- It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. In any case, though, the code should still be *way faster* than Python.
Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
## Correlated samples
In the parent [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:
In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:
```
```js
a = 1 to 10
b = 2 * a
c = b/a
c
```
Likewise in [squigglepy](https://github.com/rethinkpriorities/squigglepy):
```python
import squigglepy as sq
import numpy as np
a = sq.to(1, 3)
b = 2 * a
c = b / a
c_samples = sq.sample(c, 10)
print(c_samples)
```
Should `c` be equal to `2`? or should it be equal to 2 times the expected distribution of the ratio of two independent draws from a (`2 * a/a`, as it were)?
In squiggle.c, this ambiguity doesn't exist, at the cost of greater overhead & verbosity:
In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:
```c
// correlated samples
@ -157,8 +184,7 @@ int main(){
## To do list
- [ ] Explain correlated samples
- [ ] Add tests in Stan?
- [ ] Test summary statistics for each of the distributions.
- [ ] Have some more complicated & realistic example
- [ ] Add summarization functions: 90% ci (or all c.i.?)
- [ ] Systematize references
@ -196,3 +222,5 @@ int main(){
- [x] Add summarization functions: mean, std
- [x] Add sampling from a gamma distribution
- https://dl.acm.org/doi/pdf/10.1145/358407.358414
- [x] Explain correlated samples
- [-] ~~Add tests in Stan?~~

Binary file not shown.

View File

@ -12,11 +12,13 @@ int main()
*seed = 1000; // xorshift can't start with 0
int n = 1000 * 1000;
/*
for (int i = 0; i < n; i++) {
float gamma_0 = sample_gamma(0.0, seed);
// printf("sample_gamma(0.0): %f\n", gamma_0);
}
// printf("\n");
printf("\n");
*/
float* gamma_1_array = malloc(sizeof(float) * n);
for (int i = 0; i < n; i++) {

View File

@ -1,5 +1,9 @@
MAKEFLAGS += --no-print-directory
## Formatter
STYLE_BLUEPRINT=webkit
FORMATTER=clang-format -i -style=$(STYLE_BLUEPRINT)
all:
cd examples/01_one_sample && make && echo
cd examples/02_many_samples && make && echo
@ -7,3 +11,8 @@ all:
cd examples/04_sample_from_cdf_simple && make && echo
cd examples/05_sample_from_cdf_beta && make && echo
cd examples/06_gamma_beta && make && echo
format: squiggle.c squiggle.h
$(FORMATTER) squiggle.c
$(FORMATTER) squiggle.h

View File

@ -28,6 +28,20 @@ uint32_t xorshift32(uint32_t* seed)
return *seed = x;
}
uint64_t xorshift64(uint64_t* seed)
{
// Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs"
// See <https://stackoverflow.com/questions/53886131/how-does-xorshift32-works>
// https://en.wikipedia.org/wiki/Xorshift
// Also some drama: <https://www.pcg-random.org/posts/on-vignas-pcg-critique.html>, <https://prng.di.unimi.it/>
uint64_t x = *seed;
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
return *seed = x;
}
// Distribution & sampling functions
// Unit distributions
float sample_unit_uniform(uint32_t* seed)
@ -46,9 +60,9 @@ float sample_unit_normal(uint32_t* seed)
}
// Composite distributions
float sample_uniform(float from, float to, uint32_t* seed)
float sample_uniform(float start, float end, uint32_t* seed)
{
return sample_unit_uniform(seed) * (to - from) + from;
return sample_unit_uniform(seed) * (end - start) + start;
}
float sample_normal(float mean, float sigma, uint32_t* seed)

View File

@ -12,7 +12,7 @@ float sample_unit_uniform(uint32_t* seed);
float sample_unit_normal(uint32_t* seed);
// Composite distribution sampling functions
float sample_uniform(float from, float to, uint32_t* seed);
float sample_uniform(float start, float end, uint32_t* seed);
float sample_normal(float mean, float sigma, uint32_t* seed);
float sample_lognormal(float logmean, float logsigma, uint32_t* seed);
float sample_to(float low, float high, uint32_t* seed);