From 3378d1b9e7237f3d26dfc180273e7a7a409dd4d0 Mon Sep 17 00:00:00 2001
From: NunoSempere
Date: Fri, 2 Jun 2023 16:24:08 -0600
Subject: [PATCH] update README, time.txt tally

---
 README.md | 13 +++++++++----
 time.txt  | 24 +++++++++++++++---------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 179c3e57..d1ad9432 100644
--- a/README.md
+++ b/README.md
@@ -58,14 +58,19 @@ Ultimately, these optimizations were also incorporated into the C code as well.
 
 ### C
 
-For the C code, I enabled the `-Ofast` compilation flag. Without it, it instead takes ~0.4 seconds. Initially, before I enabled the `-Ofast` flag, I was surprised that the Node and Squiggle code were comparable to the C code.
+The optimizations which make the final C code significantly faster than the naïve implementation are:
 
-The two optimizations which make more optimized code significantly faster than the naïve implementation are:
 - To pass around pointers to functions, instead of large arrays. This is the same as in the nim implementation, but imho leads to more complex code
-- To use multithreading support
 - To use the Box-Muller transform instead of using libraries, like in nim.
+- To use multithreading support
 
-For the optimized C code, see [that folder's README](./C-optimized/README.md).
+The C code uses the `-Ofast` or `-O3` compilation flags. Initially, without those flags and without the algorithmic improvements, the code took ~0.4 seconds to run, so I was surprised that the Node and Squiggle code were comparable to the C code. It ended up being the case that the C code could be pushed to be ~100x faster, though :)
+
+In fact, the C code ended up being so fast that I had to measure its time by running the code 100 times in a row and dividing that amount by 100, rather than by just running it once, because running it once was too fast for /bin/time. More sophisticated profiling tools exist that could, e.g., account for how idle a machine is when running the code, but I didn't think that was worth it at this point.
+
+There are still some missing optimizations, like tweaking the code to take cache misses into account. I'm not exactly sure how that would go, though.
+
+Although the above paragraphs were written in the first person, the C code was written together with Jorge Sierra, who translated the algorithmic improvements to it and added the initial multithreading support.
 
 ### NodeJS and Squiggle
 
diff --git a/time.txt b/time.txt
index 26814ee5..1975a4e6 100644
--- a/time.txt
+++ b/time.txt
@@ -1,16 +1,22 @@
 # Optimized C
 
-OMP_NUM_THREADS=1 /bin/time -f "Time: %es" ./out/samples && echo
-Sum(dist_mixture, N)/N = 0.885837
-Time: 0.02s
+$ make time-linux
+Requires /bin/time, found on GNU/Linux systems
 
-OMP_NUM_THREADS=2 /bin/time -f "Time: %es" ./out/samples && echo
-Sum(dist_mixture, N)/N = 0.885123
-Time: 0.14s
+Running 100x and taking avg time: OMP_NUM_THREADS=1 out/samples
+Time using 1 thread: 24.00ms
 
-OMP_NUM_THREADS=4 /bin/time -f "Time: %es" ./out/samples && echo
-Sum(dist_mixture, N)/N = 0.886255
-Time: 0.11s
+Running 100x and taking avg time: OMP_NUM_THREADS=2 out/samples
+Time using 2 threads: 21.80ms
+
+Running 100x and taking avg time: OMP_NUM_THREADS=4 out/samples
+Time for 4 threads: 24.40ms
+
+Running 100x and taking avg time: OMP_NUM_THREADS=8 out/samples
+Time using 8 threads: 10.40ms
+
+Running 100x and taking avg time: OMP_NUM_THREADS=16 out/samples
+Time using 16 threads: 6.60ms
 
 # C
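
For readers who want a concrete picture of the three optimizations the README diff lists, here is a minimal, self-contained sketch in C. It is not the repository's actual code: the xorshift PRNG, the function names, the mixture weights, and the sample count are all illustrative assumptions. It only shows the shape of the ideas: sampling normals with the Box-Muller transform instead of a library, passing sampler function pointers around instead of materializing large arrays, and parallelizing the sampling loop with OpenMP so that OMP_NUM_THREADS controls the thread count, as in time.txt.

```c
/* Minimal sketch of the optimizations described above: Box-Muller sampling
   without a stats library, passing sampler function pointers instead of
   large arrays, and OpenMP multithreading. Illustrative only; names,
   weights, and sample counts are assumptions, not the repository's code.
   Build with something like: gcc -Ofast -fopenmp sketch.c -lm -o sketch */
#include <math.h>
#include <omp.h>
#include <stdint.h>
#include <stdio.h>

#define N_SAMPLES 1000000

/* Small xorshift PRNG so each thread can keep its own state
   (rand() is not thread-safe). */
static uint64_t xorshift64(uint64_t* state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Uniform draw in (0, 1], so the log() below never sees zero. */
static double sample_unit_uniform(uint64_t* state) {
    return ((double)xorshift64(state) + 1.0) / 18446744073709551616.0; /* 2^64 */
}

/* Box-Muller transform: two uniforms -> one standard normal, no library. */
static double sample_unit_normal(uint64_t* state) {
    double u1 = sample_unit_uniform(state);
    double u2 = sample_unit_uniform(state);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979323846 * u2);
}

/* "Pass around pointers to functions, instead of large arrays": the mixture
   draws one sample at a time from its component samplers, so no intermediate
   arrays of samples are ever allocated. */
typedef double (*sampler)(uint64_t* state);

static double sample_mixture(sampler samplers[], double weights[], int n,
                             uint64_t* state) {
    double p = sample_unit_uniform(state);
    double cumulative = 0.0;
    for (int i = 0; i < n; i++) {
        cumulative += weights[i];
        if (p <= cumulative) return samplers[i](state);
    }
    return samplers[n - 1](state); /* guard against rounding */
}

int main(void) {
    sampler samplers[] = { sample_unit_uniform, sample_unit_normal };
    double weights[] = { 0.5, 0.5 };
    double sum = 0.0;

    /* Each thread keeps its own PRNG state; partial sums are combined with
       an OpenMP reduction. OMP_NUM_THREADS controls the thread count. */
    #pragma omp parallel
    {
        uint64_t state = 0x9E3779B97F4A7C15ULL + (uint64_t)omp_get_thread_num();

        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < N_SAMPLES; i++) {
            sum += sample_mixture(samplers, weights, 2, &state);
        }
    }

    printf("Sum(mixture, N)/N = %f\n", sum / N_SAMPLES);
    return 0;
}
```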
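
The "run it 100 times and divide by 100" measurement the README diff mentions is done at the shell level in the repository (the `make time-linux` target wrapping /bin/time, as the new time.txt output shows). As a rough in-process illustration of the same averaging idea, assuming a placeholder `run_once()` standing in for the real work, a sketch might look like this:

```c
/* Sketch of averaging over repeated runs when a single run is too fast to
   time reliably. The repository does this at the shell level via
   `make time-linux` and /bin/time; this clock_gettime version is only an
   illustration, and run_once() is a stand-in for the real sampling work. */
#include <stdio.h>
#include <time.h>

#define N_RUNS 100

static void run_once(void) {
    /* Placeholder workload; replace with the code being measured. */
    volatile double acc = 0.0;
    for (int i = 0; i < 1000000; i++) acc += i * 1e-9;
}

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N_RUNS; i++) run_once();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_ms = (end.tv_sec - start.tv_sec) * 1000.0
                    + (end.tv_nsec - start.tv_nsec) / 1e6;

    /* Dividing the total by the number of runs recovers a per-run figure
       even when one run is below the timer's useful resolution. */
    printf("avg time per run: %.2fms\n", total_ms / N_RUNS);
    return 0;
}
```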