feat: first commit after reorg

NunoSempere 2022-03-13 04:17:40 +00:00
commit d9f5219eee
273 changed files with 12927 additions and 0 deletions

3
_werc/config Normal file

@@ -0,0 +1,3 @@
masterSite=nunosempere.com
siteTitle='Measure'
siteSubTitle='is unceasing'

3
_werc/config.tmpl Executable file

@@ -0,0 +1,3 @@
masterSite=nunosempere.com
siteTitle='Learning'
siteSubTitle='is about to occur'

5
_werc/lib/footer.inc Executable file

@@ -0,0 +1,5 @@
<br class="doNotDisplay doNotPrint" />
<div style="margin-right: auto;"><a href="http://werc.cat-v.org">Powered by werc</a></div>
<div><form action="/_search/" method="POST"><input type="text" id="searchtext" name="q"> <input type="submit" value="Search"></form></div>

14
_werc/lib/top_bar.inc Executable file

@@ -0,0 +1,14 @@
<div>
<a href="https://forum.effectivealtruism.org/users/nunosempere">ea forum</a> |
<a href="https://forecasting.substack.com/">forecasting newsletter</a> |
<a href="https://github.com/">github</a> |
<a href="https://metaforecast.org/">metaforecast</a> |
<a href="https://quantifieduncertainty.org/">quantified uncertainty</a> |
<a href="https://twitter.com/NunoSempere">twitter</a>
</div>
<div>
<a href="/about">about</a> |
<a href="/sitemap">site map</a>
</div>

22
_werc/makeconfig.sh Executable file

@@ -0,0 +1,22 @@
#!/usr/bin/bash
# Pick a random title/subtitle pair from titles.txt and rewrite the werc config.
configdir="/home/uriel/workspace/werc-1.5.0/sites/nunosempere.com/_werc"
configfile="$configdir/config"
configfiletemp="$configdir/config_temp"
# Each line of titles.txt has the form "Title: subtitle"; pick one at random.
line="$(sort -Ru "$configdir/titles.txt" | head -n 1)"
# echo "$line"
title="$(echo "$line" | sed 's/: .*//g')"    # text before the first ": "
subtitle="$(echo "$line" | sed 's/.*: //g')" # text after the last ": "
echo "$title"
echo "$subtitle"
# Build the new config in a temp file, then move it into place.
rm -f "$configfiletemp"
echo "masterSite=nunosempere.com" > "$configfiletemp"
echo "siteTitle='$title'" >> "$configfiletemp"
echo "siteSubTitle='$subtitle'" >> "$configfiletemp"
mv "$configfiletemp" "$configfile"
## Example result:
## masterSite=nunosempere.com
## siteTitle='Learning'
## siteSubTitle='is about to occur'
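The title/subtitle split above relies on `sed`'s greedy matching; here is a minimal sketch of just that step, using one of the lines from `titles.txt`:

```shell
#!/bin/sh
# How makeconfig.sh splits a "Title: subtitle" line.
line="Learning: is about to occur"
title="$(echo "$line" | sed 's/: .*//')"    # everything before the first ": "
subtitle="$(echo "$line" | sed 's/.*: //')" # everything after the last ": "
echo "$title"      # prints "Learning"
echo "$subtitle"   # prints "is about to occur"
```

Note that a line with a space before the colon, such as "In brightest day : in brightest night", leaves a trailing space on the title, since the pattern matches the literal ": ".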

6
_werc/titles.txt Executable file

@@ -0,0 +1,6 @@
Measure: is unceasing
In brightest day : in brightest night
Let us deploy: insight and might
To reduce plight: and multiply delight
Learning: is about to occur
To become: more formidable

15
_werc/top_bar.inc Executable file

@@ -0,0 +1,15 @@
<div>
<a href="http://gsoc.cat-v.org">gsoc</a> |
<a href="http://doc.cat-v.org">doc archive</a> |
<a href="http://repo.cat-v.org">software repo</a> |
<a href="http://ninetimes.cat-v.org">ninetimes</a> |
<a href="http://harmful.cat-v.org">harmful</a> |
<a href="http://9p.cat-v.org/">9P</a> |
<a href="http://cat-v.org">cat-v.org</a>
</div>
<div>
<a href="http://cat-v.org/update_log">site updates</a> |
<a href="/sitemap">site map</a>
</div>

31 binary image files added (not shown; sizes range from 231 KiB to 934 KiB).
File diff suppressed because it is too large.


@@ -0,0 +1,149 @@
Why do social movements fail: Two concrete examples.
==============
Status: Time-capped analysis.
## Introduction.
I look at two social movements which I think failed in their time: the Spanish Enlightenment (1750-1850), and the General Semantics movement (1938-2003). The first one is more similar to the effective altruism community, and the second one is more similar to the rationality community.
## Example 1: Why did the Spanish Enlightenment movement fail (1750-1850)?
### Why do I care about this movement?
The Spanish Enlightenment was probably the closest thing you could find in Spain to the EA/rationality movements in the 18th century. I'm interested in seeing why it failed, and whether any lessons can be carried over.
Note: Followers of Enlightenment values called themselves liberals / neoclassicists.
### Cause 1: The movement played politics, and lost.
The French, under Napoleon, invaded Spain. The Enlightenment movement aligned itself with French revolution ideals and values, whereas the common folk hated the invasion. Liberals took positions of power in the new administration, for which they were perceived as traitors.
After the French were defeated, most of the Spanish elite went into exile by royal decree (not only those who had worked with the French, but also those who had received offers). In general, liberals and their ideas were perceived as foreign to Spain; to a certain degree, because they were.
### Cause 2: Lack of organizational power?
This seems to not have been the case. "Sociedades de amigos del país", which roughly translates to "societies of friends of the country", seem to have been abundant. Several institutions which remain to this day were created:
> The Royal Spanish Academy (entrusted with the Spanish Language) (1713), the Royal Academy of History (1738), the Royal Botanic Gardens (1755), the Prado Museum (among the top 10 museums in the world) (1819).
### Cause 3: Their literary works were not that popular
Example: _Cartas marruecas_ (_Letters from Morocco_). A Spanish noble and his Moroccan noble friend talk about matters pertaining to Spain. While insightful and interesting to me, I do not believe they were interesting to a majority of Spaniards.
Example: Moratín, a Spanish playwright, wrote 5 comedies. Consider his most popular comedy, _El sí de las niñas_ (_The consent of the maidens_):
* Pro: Wildly popular: it was watched by 37,000 people, 25% of the population of Madrid at the time.
* Pro: The plot is about the right to choose; a 16-year-old girl confronts an arranged marriage with a 59-year-old man. It may have had an effect on arranged marriages?
Counterexample: Ramón de la Cruz. He started as a neoclassicist but couldn't make enough money, so he tried seducing the public instead, which made him wildly popular. He wrote more than 300 theater pieces, which people liked but which weren't particularly Enlightened.
* Note: This is a 60x factor over the previous author. 300 vs 5 works.
The Spanish public developed a strong dislike for moralizing works; works which pushed for the reader to, in some sense, become more virtuous. This remains today: A bright friend of mine gave her dislike of "prosa didáctica" (didactic prose) as the reason for not continuing to read HPMOR after the first few chapters.
Anyways, there doesn't seem to be that clear a connection between their fiction and their actual work, unlike in Ayn Rand's Atlas Shrugged, or in Yudkowsky's HPMOR. Interestingly enough, the EA movement doesn't yet have such fiction, that I know of.
### Cause 4: Lack of permanent political power.
Example: Carlos III, King of Spain, embraced Enlightened absolutism (everything for the people, nothing by the people), and is generally considered to have been a good king. He was supported by liberals. Two kings later, Fernando VII ended up exiling all liberals. The ebb and flow of good and bad kings didn't stop.
Example: The Agricultural Report. A Society of Patriotic Friends analyzed the situation of agriculture in Spain and produced an Agricultural Report (1795), which proposed solutions. The report was competently researched, exhaustive, popular, and widely read, but nothing came of it. Although the author tried to be meek, the Church still felt antagonized.
The lesson would seem to be something like: if you can, try to do things outside the political sphere; it is too unstable (??).
### Cause 5: Clash against religion. The Spanish Inquisition.
The Spanish Inquisition generally made life hard for people who had observations to make against religion, tradition, etc. The Catholic Church placed the first Encyclopédie (by d'Alembert and Diderot) on its list of banned books in 1759.
Example: Félix María Samaniego, besides his work as a writer of fables, also wrote erotic works. He got in trouble with the Inquisition.
### Conclusion.
Because of the distance in time, it's hard to extract concrete things to do, or not to do. One exception is to not completely align oneself with the losing side in a political battle (f.ex., anti-Trump in America, anti-Brexit in Britain, against the Spanish Inquisition in Spain).
Another would be to look harder at the relationship between literature and what you're trying to do; there wasn't a clear nexus between playwrights and people who were trying to improve agriculture.
Yet a third would be to rethink the approach to courting popular opinion. Tongue-in-cheek: in Spain, despite the best efforts of both camps, the split between liberals and Catholics seems to have remained roughly constant.
## Why did General Semantics fail? (1938-2003)
### What was General Semantics? Why do I care about this question?
General Semantics was, in short, the previous rationality movement. Its purpose was to improve human rationality, and to use that to improve the world. The question interests me because I see certain similarities between general semanticists and current rationalists (and, to a lesser extent, effective altruists).
### What are some similarities with the current rationalist/effective altruism movement?
Yudkowsky and Korzybski have certain similarities. Both movements have fiction (but the General Semanticists seem to do better in this area, having had Heinlein, A. E. van Vogt, etc.) Rationality and general semantics have similar goals.
CFAR looks similar to the Institute of General Semantics, which gave workshops. All three movements (general semantics, rationality, effective altruism) have had similar amounts of cognitive power at their disposal, and their members seem to belong to the same social strata.
### Did General Semantics fail?
On the one hand, this is a judgement call. On the other hand, yes, General Semantics failed. Although it inspired writers whose novels remain, General Semantics doesn't do much these days. In _David's Sling_, the Institute plays at the level of national politics, in _The World of NULL-A_, General Semantics affects the Solar System.
Back in reality, in 2003 the Institute considered becoming part of Texas Christian University in order to survive [Source](https://www.generalsemantics.org/about/history/). I think that the impression this paragraph gives is true to reality. Compare:
> Our seminar-workshops were usually about three weeks long, but gradually over the years, they became shorter as the pace of living increased in our culture. The length of seminars shrank to twelve days, then eight, and at the present time, they occupy only weekends. We hope to return to the longer schedules soon, to give time for the essential training we consider very important. [Source](https://www.generalsemantics.org/wp-content/uploads/2011/05/articles/gsb/gsb65-csr.pdf)
* It is not clear to me that any courses are being held now as of Fall 2019
Anyways, here are some potential causes of its decline:
### Cause 1: People played politics.
Two different organizations competed for the same funding: ETC (a magazine) and the Institute of General Semantics. Initially, the magazine was supposed to pay over part of its revenues, but at some point this changed. Drama ensued. [Source](https://www.generalsemantics.org/wp-content/uploads/2011/05/articles/etc/60-3-stockdale.pdf). In particular, the main force behind that split, S. I. Hayakawa, maneuvered himself into a position of power, then unexpectedly left to head a university, San Francisco State.
### Cause 2: Their courses did not work.
I have the nagging suspicion that, if General Semantics had worked, if it had given people superpowers, and if its transmission had been possible, it wouldn't be in such bad shape today.
As a comparison, consider Stoicism, and in particular Marcus Aurelius' Meditations. People find the book consistently useful because reading it gives one the generator for "I am ultimately the only one responsible for my feelings, and it is not a good idea to worry about things outside my field of control", and this noticeably improves people's lives. It's a good piece of cultural technology, and thus survives.
A way to find out whether a course works would be to carry out a sufficiently powered randomized trial. However, from personal experience, I have found that this is not trivial.
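As an aside (mine, not in the original), a back-of-the-envelope sense of what "sufficiently powered" means here, using the standard normal approximation for a two-arm comparison of means and an assumed effect size of 0.3 standard deviations:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, two-sample
    comparison of means, via the normal approximation:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A modest assumed effect (d = 0.3) already requires ~175 people per arm,
# i.e. far more participants than a typical workshop cohort.
print(n_per_arm(0.3))  # 175
```

The effect size is an illustrative assumption; detecting a smaller true effect would require correspondingly more participants, which hints at why such trials are not trivial to run.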
### Cause 3: Death of the Charismatic Leader.
After Korzybski died, there was Hayakawa, but, as mentioned, he had better things to do. It is not clear to me whether the other prominent members were as charismatic. It seems to me that the pattern "death of a leader leads to a slow decline" is not uncommon.
Here is a quote from Hayakawa:
> the Society, in any case, will continue to be the most effective agency for spreading general semantics — IGS or no IGS. GS is bigger than AK, MK, or any one of us, just as the theory of relativity is bigger than Einstein or any of his students. And the Society, which is built on the subject and not the man, will survive as long as the subject survives. It was structurally conceived that way, planned that way, and ETC. is run that way. And by running for president, I wish to re-assert this policy. \[Note: he attained the presidency, then left, contributing to the movement's decline\].
Note that there is a prediction in there: "And the Society, which is built on the subject and not the man, will survive as long as the subject survives". Without the man, the subject only survived for a time, it seems to me.
### Cause 4: Not enough people attained mastery.
> (...) I would guess that I have known about 30 individuals who have in some degree adequately, by my standards, mastered this highly general, very simple, very difficult system of orientation and method of evaluating - reversing as it must all our 'cultural conditioning', neurological 'canalization', etc.(...) [Source](https://www.generalsemantics.org/wp-content/uploads/2011/05/articles/gsb/gsb37-kendig.pdf).
> ...To me the great error Korzybski made - and I carried on, financial necessity - and for which we pay the price today in many criticisms, consisted in not restricting ourselves to training very thoroughly a very few people who would be competent to utilize the discipline in various fields and to train others. We should have done this before encouraging anyone to 'popularize' or 'spread the word' (horrid phrase)... [Same source](https://www.generalsemantics.org/wp-content/uploads/2011/05/articles/gsb/gsb37-kendig.pdf).
This relates to: CFAR not spreading their manual, Effective Altruism not wanting to become mainstream, some effective altruists being very elitist, the effective altruism handbook being available for free as a pdf online. It doesn't seem such a bad strategy to follow the (implicit) advice of one of the most capable general semanticists as she looks back and thinks about what she'd do differently.
### Cause 5: Mastery might not have been worth it.
An Aikido master visited our dojo and he stood like a mountain, towering above me, even though I was much taller. He exuded an aura of power, and there is a sense in which I want that. There is another sense in which I don't want that because I do not want to dedicate my life to Aikido in the same way that the master has.
After 7 years, I can speak German fluently, and recently passed the C1 exam. Now, to a first approximation, I'm finding out that all the cool people speak English anyways. General Semantics might exhibit a similar pattern: Because of opportunity costs, there is an implicit assumption that spending three months learning General Semantics makes you win at life more than spending three months learning an object-level skill, like programming. It is not clear to me that this was the case for General Semantics.
Nonetheless, for the rationality movement, it might or might not be worth it to go fishing for techniques in Korzybski's _Science and Sanity_ or in Hayakawa's _Language in Thought and Action_, or to directly ask the IGS for techniques.
### Cause 6: The movement didn't give its members things to do.
You could write articles for ETC, organize a local chapter, try to become an instructor. There were things to do. But maybe not enough. The Catholic Church offers options ranging from light involvement: reading texts at Church, being part of a propagandist organization, taking part in youth camps, to total commitment: becoming a priest, a monk or a nun. Similarly, if you want to devote your life to furthering the interests of the Democratic Party, it seems to me that you can do that.
The ability of a movement to absorb as much energy from its participants as they can give is not necessarily a positive attribute, but I think it's one which contributes to its survival. See also: [After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation](https://forum.effectivealtruism.org/posts/jmbP9rwXncfa32seH/after-one-year-of-applying-for-ea-jobs-it-is-really-really).
As an aside, it seems to me that several movements have the pattern “if you want to become more involved, become an instructor”: Aikido, Non-Violent Communication, Circling, CFAR. It seems to me that General Semantics never quite crystallized the pattern.
### Cause 7: The important insights keep being rediscovered, and General Semantics didn't have anything unique.
It seems to me that the basic insights of General Semantics have been found again and again by CBT, meditation, Internal Family Systems, Nonviolent Communication, Foucault, good anthropology. I think you could even get them from Heidegger's essay _Plato's Doctrine of Truth_ if you stared at it hard enough. The answer to "is General Semantics the best at what it does?" might turn out to be: "no". This relates to: A friend talking about "effective effective altruism".
### Cause 8: Not enough money.
There might not have been anything wrong with General Semantics per se. If a random millionaire had given them some money at a crucial moment, they might still be alive and flourishing.
### Conclusion
The above are what seem to me to be some potential failure points which the current rationality and effective altruism movements might want to avoid if they want to keep existing in 100 years. It might or might not be overkill to hire a historian for a deeper analysis.


@@ -0,0 +1,175 @@
<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="87.05ex" height="6.843ex" style="vertical-align: -3.338ex;" viewBox="0 -1508.9 37479.8 2946.1" role="img" focusable="false" xmlns="http://www.w3.org/2000/svg" aria-labelledby="MathJax-SVG-1-Title">
<title id="MathJax-SVG-1-Title">{\displaystyle \varphi _{i}(v)={\frac {1}{\text{number of players}}}\sum _{{\text{coalitions excluding }}i}{\frac {{\text{marginal contribution of }}i{\text{ to coalition}}}{{\text{number of coalitions excluding }}i{\text{ of this size}}}}}</title>
<defs aria-hidden="true">
<path stroke-width="1" id="E1-MJMATHI-3C6" d="M92 210Q92 176 106 149T142 108T185 85T220 72L235 70L237 71L250 112Q268 170 283 211T322 299T370 375T429 423T502 442Q547 442 582 410T618 302Q618 224 575 152T457 35T299 -10Q273 -10 273 -12L266 -48Q260 -83 252 -125T241 -179Q236 -203 215 -212Q204 -218 190 -218Q159 -215 159 -185Q159 -175 214 -2L209 0Q204 2 195 5T173 14T147 28T120 46T94 71T71 103T56 142T50 190Q50 238 76 311T149 431H162Q183 431 183 423Q183 417 175 409Q134 361 114 300T92 210ZM574 278Q574 320 550 344T486 369Q437 369 394 329T323 218Q309 184 295 109L286 64Q304 62 306 62Q423 62 498 131T574 278Z"></path>
<path stroke-width="1" id="E1-MJMATHI-69" d="M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z"></path>
<path stroke-width="1" id="E1-MJMAIN-28" d="M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z"></path>
<path stroke-width="1" id="E1-MJMATHI-76" d="M173 380Q173 405 154 405Q130 405 104 376T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Q21 294 29 316T53 368T97 419T160 441Q202 441 225 417T249 361Q249 344 246 335Q246 329 231 291T200 202T182 113Q182 86 187 69Q200 26 250 26Q287 26 319 60T369 139T398 222T409 277Q409 300 401 317T383 343T365 361T357 383Q357 405 376 424T417 443Q436 443 451 425T467 367Q467 340 455 284T418 159T347 40T241 -11Q177 -11 139 22Q102 54 102 117Q102 148 110 181T151 298Q173 362 173 380Z"></path>
<path stroke-width="1" id="E1-MJMAIN-29" d="M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z"></path>
<path stroke-width="1" id="E1-MJMAIN-3D" d="M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z"></path>
<path stroke-width="1" id="E1-MJMAIN-31" d="M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z"></path>
<path stroke-width="1" id="E1-MJMAIN-6E" d="M41 46H55Q94 46 102 60V68Q102 77 102 91T102 122T103 161T103 203Q103 234 103 269T102 328V351Q99 370 88 376T43 385H25V408Q25 431 27 431L37 432Q47 433 65 434T102 436Q119 437 138 438T167 441T178 442H181V402Q181 364 182 364T187 369T199 384T218 402T247 421T285 437Q305 442 336 442Q450 438 463 329Q464 322 464 190V104Q464 66 466 59T477 49Q498 46 526 46H542V0H534L510 1Q487 2 460 2T422 3Q319 3 310 0H302V46H318Q379 46 379 62Q380 64 380 200Q379 335 378 343Q372 371 358 385T334 402T308 404Q263 404 229 370Q202 343 195 315T187 232V168V108Q187 78 188 68T191 55T200 49Q221 46 249 46H265V0H257L234 1Q210 2 183 2T145 3Q42 3 33 0H25V46H41Z"></path>
<path stroke-width="1" id="E1-MJMAIN-75" d="M383 58Q327 -10 256 -10H249Q124 -10 105 89Q104 96 103 226Q102 335 102 348T96 369Q86 385 36 385H25V408Q25 431 27 431L38 432Q48 433 67 434T105 436Q122 437 142 438T172 441T184 442H187V261Q188 77 190 64Q193 49 204 40Q224 26 264 26Q290 26 311 35T343 58T363 90T375 120T379 144Q379 145 379 161T380 201T380 248V315Q380 361 370 372T320 385H302V431Q304 431 378 436T457 442H464V264Q464 84 465 81Q468 61 479 55T524 46H542V0Q540 0 467 -5T390 -11H383V58Z"></path>
<path stroke-width="1" id="E1-MJMAIN-6D" d="M41 46H55Q94 46 102 60V68Q102 77 102 91T102 122T103 161T103 203Q103 234 103 269T102 328V351Q99 370 88 376T43 385H25V408Q25 431 27 431L37 432Q47 433 65 434T102 436Q119 437 138 438T167 441T178 442H181V402Q181 364 182 364T187 369T199 384T218 402T247 421T285 437Q305 442 336 442Q351 442 364 440T387 434T406 426T421 417T432 406T441 395T448 384T452 374T455 366L457 361L460 365Q463 369 466 373T475 384T488 397T503 410T523 422T546 432T572 439T603 442Q729 442 740 329Q741 322 741 190V104Q741 66 743 59T754 49Q775 46 803 46H819V0H811L788 1Q764 2 737 2T699 3Q596 3 587 0H579V46H595Q656 46 656 62Q657 64 657 200Q656 335 655 343Q649 371 635 385T611 402T585 404Q540 404 506 370Q479 343 472 315T464 232V168V108Q464 78 465 68T468 55T477 49Q498 46 526 46H542V0H534L510 1Q487 2 460 2T422 3Q319 3 310 0H302V46H318Q379 46 379 62Q380 64 380 200Q379 335 378 343Q372 371 358 385T334 402T308 404Q263 404 229 370Q202 343 195 315T187 232V168V108Q187 78 188 68T191 55T200 49Q221 46 249 46H265V0H257L234 1Q210 2 183 2T145 3Q42 3 33 0H25V46H41Z"></path>
<path stroke-width="1" id="E1-MJMAIN-62" d="M307 -11Q234 -11 168 55L158 37Q156 34 153 28T147 17T143 10L138 1L118 0H98V298Q98 599 97 603Q94 622 83 628T38 637H20V660Q20 683 22 683L32 684Q42 685 61 686T98 688Q115 689 135 690T165 693T176 694H179V543Q179 391 180 391L183 394Q186 397 192 401T207 411T228 421T254 431T286 439T323 442Q401 442 461 379T522 216Q522 115 458 52T307 -11ZM182 98Q182 97 187 90T196 79T206 67T218 55T233 44T250 35T271 29T295 26Q330 26 363 46T412 113Q424 148 424 212Q424 287 412 323Q385 405 300 405Q270 405 239 390T188 347L182 339V98Z"></path>
<path stroke-width="1" id="E1-MJMAIN-65" d="M28 218Q28 273 48 318T98 391T163 433T229 448Q282 448 320 430T378 380T406 316T415 245Q415 238 408 231H126V216Q126 68 226 36Q246 30 270 30Q312 30 342 62Q359 79 369 104L379 128Q382 131 395 131H398Q415 131 415 121Q415 117 412 108Q393 53 349 21T250 -11Q155 -11 92 58T28 218ZM333 275Q322 403 238 411H236Q228 411 220 410T195 402T166 381T143 340T127 274V267H333V275Z"></path>
<path stroke-width="1" id="E1-MJMAIN-72" d="M36 46H50Q89 46 97 60V68Q97 77 97 91T98 122T98 161T98 203Q98 234 98 269T98 328L97 351Q94 370 83 376T38 385H20V408Q20 431 22 431L32 432Q42 433 60 434T96 436Q112 437 131 438T160 441T171 442H174V373Q213 441 271 441H277Q322 441 343 419T364 373Q364 352 351 337T313 322Q288 322 276 338T263 372Q263 381 265 388T270 400T273 405Q271 407 250 401Q234 393 226 386Q179 341 179 207V154Q179 141 179 127T179 101T180 81T180 66V61Q181 59 183 57T188 54T193 51T200 49T207 48T216 47T225 47T235 46T245 46H276V0H267Q249 3 140 3Q37 3 28 0H20V46H36Z"></path>
<path stroke-width="1" id="E1-MJMAIN-6F" d="M28 214Q28 309 93 378T250 448Q340 448 405 380T471 215Q471 120 407 55T250 -10Q153 -10 91 57T28 214ZM250 30Q372 30 372 193V225V250Q372 272 371 288T364 326T348 362T317 390T268 410Q263 411 252 411Q222 411 195 399Q152 377 139 338T126 246V226Q126 130 145 91Q177 30 250 30Z"></path>
<path stroke-width="1" id="E1-MJMAIN-66" d="M273 0Q255 3 146 3Q43 3 34 0H26V46H42Q70 46 91 49Q99 52 103 60Q104 62 104 224V385H33V431H104V497L105 564L107 574Q126 639 171 668T266 704Q267 704 275 704T289 705Q330 702 351 679T372 627Q372 604 358 590T321 576T284 590T270 627Q270 647 288 667H284Q280 668 273 668Q245 668 223 647T189 592Q183 572 182 497V431H293V385H185V225Q185 63 186 61T189 57T194 54T199 51T206 49T213 48T222 47T231 47T241 46T251 46H282V0H273Z"></path>
<path stroke-width="1" id="E1-MJMAIN-70" d="M36 -148H50Q89 -148 97 -134V-126Q97 -119 97 -107T97 -77T98 -38T98 6T98 55T98 106Q98 140 98 177T98 243T98 296T97 335T97 351Q94 370 83 376T38 385H20V408Q20 431 22 431L32 432Q42 433 61 434T98 436Q115 437 135 438T165 441T176 442H179V416L180 390L188 397Q247 441 326 441Q407 441 464 377T522 216Q522 115 457 52T310 -11Q242 -11 190 33L182 40V-45V-101Q182 -128 184 -134T195 -145Q216 -148 244 -148H260V-194H252L228 -193Q205 -192 178 -192T140 -191Q37 -191 28 -194H20V-148H36ZM424 218Q424 292 390 347T305 402Q234 402 182 337V98Q222 26 294 26Q345 26 384 80T424 218Z"></path>
<path stroke-width="1" id="E1-MJMAIN-6C" d="M42 46H56Q95 46 103 60V68Q103 77 103 91T103 124T104 167T104 217T104 272T104 329Q104 366 104 407T104 482T104 542T103 586T103 603Q100 622 89 628T44 637H26V660Q26 683 28 683L38 684Q48 685 67 686T104 688Q121 689 141 690T171 693T182 694H185V379Q185 62 186 60Q190 52 198 49Q219 46 247 46H263V0H255L232 1Q209 2 183 2T145 3T107 3T57 1L34 0H26V46H42Z"></path>
<path stroke-width="1" id="E1-MJMAIN-61" d="M137 305T115 305T78 320T63 359Q63 394 97 421T218 448Q291 448 336 416T396 340Q401 326 401 309T402 194V124Q402 76 407 58T428 40Q443 40 448 56T453 109V145H493V106Q492 66 490 59Q481 29 455 12T400 -6T353 12T329 54V58L327 55Q325 52 322 49T314 40T302 29T287 17T269 6T247 -2T221 -8T190 -11Q130 -11 82 20T34 107Q34 128 41 147T68 188T116 225T194 253T304 268H318V290Q318 324 312 340Q290 411 215 411Q197 411 181 410T156 406T148 403Q170 388 170 359Q170 334 154 320ZM126 106Q126 75 150 51T209 26Q247 26 276 49T315 109Q317 116 318 175Q318 233 317 233Q309 233 296 232T251 223T193 203T147 166T126 106Z"></path>
<path stroke-width="1" id="E1-MJMAIN-79" d="M69 -66Q91 -66 104 -80T118 -116Q118 -134 109 -145T91 -160Q84 -163 97 -166Q104 -168 111 -168Q131 -168 148 -159T175 -138T197 -106T213 -75T225 -43L242 0L170 183Q150 233 125 297Q101 358 96 368T80 381Q79 382 78 382Q66 385 34 385H19V431H26L46 430Q65 430 88 429T122 428Q129 428 142 428T171 429T200 430T224 430L233 431H241V385H232Q183 385 185 366L286 112Q286 113 332 227L376 341V350Q376 365 366 373T348 383T334 385H331V431H337H344Q351 431 361 431T382 430T405 429T422 429Q477 429 503 431H508V385H497Q441 380 422 345Q420 343 378 235T289 9T227 -131Q180 -204 113 -204Q69 -204 44 -177T19 -116Q19 -89 35 -78T69 -66Z"></path>
<path stroke-width="1" id="E1-MJMAIN-73" d="M295 316Q295 356 268 385T190 414Q154 414 128 401Q98 382 98 349Q97 344 98 336T114 312T157 287Q175 282 201 278T245 269T277 256Q294 248 310 236T342 195T359 133Q359 71 321 31T198 -10H190Q138 -10 94 26L86 19L77 10Q71 4 65 -1L54 -11H46H42Q39 -11 33 -5V74V132Q33 153 35 157T45 162H54Q66 162 70 158T75 146T82 119T101 77Q136 26 198 26Q295 26 295 104Q295 133 277 151Q257 175 194 187T111 210Q75 227 54 256T33 318Q33 357 50 384T93 424T143 442T187 447H198Q238 447 268 432L283 424L292 431Q302 440 314 448H322H326Q329 448 335 442V310L329 304H301Q295 310 295 316Z"></path>
<path stroke-width="1" id="E1-MJSZ2-2211" d="M60 948Q63 950 665 950H1267L1325 815Q1384 677 1388 669H1348L1341 683Q1320 724 1285 761Q1235 809 1174 838T1033 881T882 898T699 902H574H543H251L259 891Q722 258 724 252Q725 250 724 246Q721 243 460 -56L196 -356Q196 -357 407 -357Q459 -357 548 -357T676 -358Q812 -358 896 -353T1063 -332T1204 -283T1307 -196Q1328 -170 1348 -124H1388Q1388 -125 1381 -145T1356 -210T1325 -294L1267 -449L666 -450Q64 -450 61 -448Q55 -446 55 -439Q55 -437 57 -433L590 177Q590 178 557 222T452 366T322 544L56 909L55 924Q55 945 60 948Z"></path>
<path stroke-width="1" id="E1-MJMAIN-63" d="M370 305T349 305T313 320T297 358Q297 381 312 396Q317 401 317 402T307 404Q281 408 258 408Q209 408 178 376Q131 329 131 219Q131 137 162 90Q203 29 272 29Q313 29 338 55T374 117Q376 125 379 127T395 129H409Q415 123 415 120Q415 116 411 104T395 71T366 33T318 2T249 -11Q163 -11 99 53T34 214Q34 318 99 383T250 448T370 421T404 357Q404 334 387 320Z"></path>
<path stroke-width="1" id="E1-MJMAIN-69" d="M69 609Q69 637 87 653T131 669Q154 667 171 652T188 609Q188 579 171 564T129 549Q104 549 87 564T69 609ZM247 0Q232 3 143 3Q132 3 106 3T56 1L34 0H26V46H42Q70 46 91 49Q100 53 102 60T104 102V205V293Q104 345 102 359T88 378Q74 385 41 385H30V408Q30 431 32 431L42 432Q52 433 70 434T106 436Q123 437 142 438T171 441T182 442H185V62Q190 52 197 50T232 46H255V0H247Z"></path>
<path stroke-width="1" id="E1-MJMAIN-74" d="M27 422Q80 426 109 478T141 600V615H181V431H316V385H181V241Q182 116 182 100T189 68Q203 29 238 29Q282 29 292 100Q293 108 293 146V181H333V146V134Q333 57 291 17Q264 -10 221 -10Q187 -10 162 2T124 33T105 68T98 100Q97 107 97 248V385H18V422H27Z"></path>
<path stroke-width="1" id="E1-MJMAIN-78" d="M201 0Q189 3 102 3Q26 3 17 0H11V46H25Q48 47 67 52T96 61T121 78T139 96T160 122T180 150L226 210L168 288Q159 301 149 315T133 336T122 351T113 363T107 370T100 376T94 379T88 381T80 383Q74 383 44 385H16V431H23Q59 429 126 429Q219 429 229 431H237V385Q201 381 201 369Q201 367 211 353T239 315T268 274L272 270L297 304Q329 345 329 358Q329 364 327 369T322 376T317 380T310 384L307 385H302V431H309Q324 428 408 428Q487 428 493 431H499V385H492Q443 385 411 368Q394 360 377 341T312 257L296 236L358 151Q424 61 429 57T446 50Q464 46 499 46H516V0H510H502Q494 1 482 1T457 2T432 2T414 3Q403 3 377 3T327 1L304 0H295V46H298Q309 46 320 51T331 63Q331 65 291 120L250 175Q249 174 219 133T185 88Q181 83 181 74Q181 63 188 55T206 46Q208 46 208 23V0H201Z"></path>
<path stroke-width="1" id="E1-MJMAIN-64" d="M376 495Q376 511 376 535T377 568Q377 613 367 624T316 637H298V660Q298 683 300 683L310 684Q320 685 339 686T376 688Q393 689 413 690T443 693T454 694H457V390Q457 84 458 81Q461 61 472 55T517 46H535V0Q533 0 459 -5T380 -11H373V44L365 37Q307 -11 235 -11Q158 -11 96 50T34 215Q34 315 97 378T244 442Q319 442 376 393V495ZM373 342Q328 405 260 405Q211 405 173 369Q146 341 139 305T131 211Q131 155 138 120T173 59Q203 26 251 26Q322 26 373 103V342Z"></path>
<path stroke-width="1" id="E1-MJMAIN-67" d="M329 409Q373 453 429 453Q459 453 472 434T485 396Q485 382 476 371T449 360Q416 360 412 390Q410 404 415 411Q415 412 416 414V415Q388 412 363 393Q355 388 355 386Q355 385 359 381T368 369T379 351T388 325T392 292Q392 230 343 187T222 143Q172 143 123 171Q112 153 112 133Q112 98 138 81Q147 75 155 75T227 73Q311 72 335 67Q396 58 431 26Q470 -13 470 -72Q470 -139 392 -175Q332 -206 250 -206Q167 -206 107 -175Q29 -140 29 -75Q29 -39 50 -15T92 18L103 24Q67 55 67 108Q67 155 96 193Q52 237 52 292Q52 355 102 398T223 442Q274 442 318 416L329 409ZM299 343Q294 371 273 387T221 404Q192 404 171 388T145 343Q142 326 142 292Q142 248 149 227T179 192Q196 182 222 182Q244 182 260 189T283 207T294 227T299 242Q302 258 302 292T299 343ZM403 -75Q403 -50 389 -34T348 -11T299 -2T245 0H218Q151 0 138 -6Q118 -15 107 -34T95 -74Q95 -84 101 -97T122 -127T170 -155T250 -167Q319 -167 361 -139T403 -75Z"></path>
<path stroke-width="1" id="E1-MJMAIN-68" d="M41 46H55Q94 46 102 60V68Q102 77 102 91T102 124T102 167T103 217T103 272T103 329Q103 366 103 407T103 482T102 542T102 586T102 603Q99 622 88 628T43 637H25V660Q25 683 27 683L37 684Q47 685 66 686T103 688Q120 689 140 690T170 693T181 694H184V367Q244 442 328 442Q451 442 463 329Q464 322 464 190V104Q464 66 466 59T477 49Q498 46 526 46H542V0H534L510 1Q487 2 460 2T422 3Q319 3 310 0H302V46H318Q379 46 379 62Q380 64 380 200Q379 335 378 343Q372 371 358 385T334 402T308 404Q263 404 229 370Q202 343 195 315T187 232V168V108Q187 78 188 68T191 55T200 49Q221 46 249 46H265V0H257L234 1Q210 2 183 2T145 3Q42 3 33 0H25V46H41Z"></path>
<path stroke-width="1" id="E1-MJMAIN-7A" d="M42 263Q44 270 48 345T53 423V431H393Q399 425 399 415Q399 403 398 402L381 378Q364 355 331 309T265 220L134 41L182 40H206Q254 40 283 46T331 77Q352 105 359 185L361 201Q361 202 381 202H401V196Q401 195 393 103T384 6V0H209L34 1L31 3Q28 8 28 17Q28 30 29 31T160 210T294 394H236Q169 393 152 388Q127 382 113 367Q89 344 82 264V255H42V263Z"></path>
</defs>
<g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)" aria-hidden="true">
<use xlink:href="#E1-MJMATHI-3C6" x="0" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMATHI-69" x="925" y="-213"></use>
<use xlink:href="#E1-MJMAIN-28" x="998" y="0"></use>
<use xlink:href="#E1-MJMATHI-76" x="1388" y="0"></use>
<use xlink:href="#E1-MJMAIN-29" x="1873" y="0"></use>
<use xlink:href="#E1-MJMAIN-3D" x="2541" y="0"></use>
<g transform="translate(3597,0)">
<g transform="translate(120,0)">
<rect stroke="none" width="7862" height="60" x="0" y="220"></rect>
<use xlink:href="#E1-MJMAIN-31" x="3681" y="676"></use>
<g transform="translate(60,-726)">
<use xlink:href="#E1-MJMAIN-6E"></use>
<use xlink:href="#E1-MJMAIN-75" x="556" y="0"></use>
<use xlink:href="#E1-MJMAIN-6D" x="1113" y="0"></use>
<use xlink:href="#E1-MJMAIN-62" x="1946" y="0"></use>
<use xlink:href="#E1-MJMAIN-65" x="2503" y="0"></use>
<use xlink:href="#E1-MJMAIN-72" x="2947" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="3590" y="0"></use>
<use xlink:href="#E1-MJMAIN-66" x="4090" y="0"></use>
<use xlink:href="#E1-MJMAIN-70" x="4647" y="0"></use>
<use xlink:href="#E1-MJMAIN-6C" x="5203" y="0"></use>
<use xlink:href="#E1-MJMAIN-61" x="5482" y="0"></use>
<use xlink:href="#E1-MJMAIN-79" x="5982" y="0"></use>
<use xlink:href="#E1-MJMAIN-65" x="6511" y="0"></use>
<use xlink:href="#E1-MJMAIN-72" x="6955" y="0"></use>
<use xlink:href="#E1-MJMAIN-73" x="7348" y="0"></use>
</g>
</g>
</g>
<g transform="translate(11866,0)">
<use xlink:href="#E1-MJSZ2-2211" x="2572" y="0"></use>
<g transform="translate(0,-1110)">
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-63"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6F" x="444" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-61" x="945" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6C" x="1445" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-69" x="1724" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-74" x="2002" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-69" x="2392" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6F" x="2670" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6E" x="3171" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-73" x="3727" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-65" x="4475" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-78" x="4920" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-63" x="5448" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6C" x="5893" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-75" x="6171" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-64" x="6728" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-69" x="7284" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-6E" x="7563" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMAIN-67" x="8119" y="0"></use>
<use transform="scale(0.707)" xlink:href="#E1-MJMATHI-69" x="8973" y="0"></use>
</g>
</g>
<g transform="translate(18622,0)">
<g transform="translate(120,0)">
<rect stroke="none" width="18617" height="60" x="0" y="220"></rect>
<g transform="translate(1178,726)">
<use xlink:href="#E1-MJMAIN-6D"></use>
<use xlink:href="#E1-MJMAIN-61" x="833" y="0"></use>
<use xlink:href="#E1-MJMAIN-72" x="1334" y="0"></use>
<use xlink:href="#E1-MJMAIN-67" x="1726" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="2227" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="2505" y="0"></use>
<use xlink:href="#E1-MJMAIN-61" x="3062" y="0"></use>
<use xlink:href="#E1-MJMAIN-6C" x="3562" y="0"></use>
<use xlink:href="#E1-MJMAIN-63" x="4091" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="4535" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="5036" y="0"></use>
<use xlink:href="#E1-MJMAIN-74" x="5592" y="0"></use>
<use xlink:href="#E1-MJMAIN-72" x="5982" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="6374" y="0"></use>
<use xlink:href="#E1-MJMAIN-62" x="6653" y="0"></use>
<use xlink:href="#E1-MJMAIN-75" x="7209" y="0"></use>
<use xlink:href="#E1-MJMAIN-74" x="7766" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="8155" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="8434" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="8934" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="9741" y="0"></use>
<use xlink:href="#E1-MJMAIN-66" x="10241" y="0"></use>
<use xlink:href="#E1-MJMATHI-69" x="10798" y="0"></use>
<g transform="translate(11143,0)">
<use xlink:href="#E1-MJMAIN-74" x="250" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="639" y="0"></use>
<use xlink:href="#E1-MJMAIN-63" x="1390" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="1834" y="0"></use>
<use xlink:href="#E1-MJMAIN-61" x="2335" y="0"></use>
<use xlink:href="#E1-MJMAIN-6C" x="2835" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="3114" y="0"></use>
<use xlink:href="#E1-MJMAIN-74" x="3392" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="3782" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="4060" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="4561" y="0"></use>
</g>
</g>
<g transform="translate(60,-726)">
<use xlink:href="#E1-MJMAIN-6E"></use>
<use xlink:href="#E1-MJMAIN-75" x="556" y="0"></use>
<use xlink:href="#E1-MJMAIN-6D" x="1113" y="0"></use>
<use xlink:href="#E1-MJMAIN-62" x="1946" y="0"></use>
<use xlink:href="#E1-MJMAIN-65" x="2503" y="0"></use>
<use xlink:href="#E1-MJMAIN-72" x="2947" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="3590" y="0"></use>
<use xlink:href="#E1-MJMAIN-66" x="4090" y="0"></use>
<use xlink:href="#E1-MJMAIN-63" x="4647" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="5091" y="0"></use>
<use xlink:href="#E1-MJMAIN-61" x="5592" y="0"></use>
<use xlink:href="#E1-MJMAIN-6C" x="6092" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="6371" y="0"></use>
<use xlink:href="#E1-MJMAIN-74" x="6649" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="7039" y="0"></use>
<use xlink:href="#E1-MJMAIN-6F" x="7317" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="7818" y="0"></use>
<use xlink:href="#E1-MJMAIN-73" x="8374" y="0"></use>
<use xlink:href="#E1-MJMAIN-65" x="9019" y="0"></use>
<use xlink:href="#E1-MJMAIN-78" x="9463" y="0"></use>
<use xlink:href="#E1-MJMAIN-63" x="9992" y="0"></use>
<use xlink:href="#E1-MJMAIN-6C" x="10436" y="0"></use>
<use xlink:href="#E1-MJMAIN-75" x="10715" y="0"></use>
<use xlink:href="#E1-MJMAIN-64" x="11271" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="11828" y="0"></use>
<use xlink:href="#E1-MJMAIN-6E" x="12106" y="0"></use>
<use xlink:href="#E1-MJMAIN-67" x="12663" y="0"></use>
<use xlink:href="#E1-MJMATHI-69" x="13413" y="0"></use>
<g transform="translate(13759,0)">
<use xlink:href="#E1-MJMAIN-6F" x="250" y="0"></use>
<use xlink:href="#E1-MJMAIN-66" x="750" y="0"></use>
<use xlink:href="#E1-MJMAIN-74" x="1307" y="0"></use>
<use xlink:href="#E1-MJMAIN-68" x="1696" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="2253" y="0"></use>
<use xlink:href="#E1-MJMAIN-73" x="2531" y="0"></use>
<use xlink:href="#E1-MJMAIN-73" x="3176" y="0"></use>
<use xlink:href="#E1-MJMAIN-69" x="3570" y="0"></use>
<use xlink:href="#E1-MJMAIN-7A" x="3849" y="0"></use>
<use xlink:href="#E1-MJMAIN-65" x="4293" y="0"></use>
</g>
</g>
</g>
</g>
</g>
</svg>

Shapley values: Better than counterfactuals
==============
\[Epistemic status: Pretty confident. But also, enthusiasm on the verge of partisanship\]
One intuitive function which assigns impact to agents is the counterfactual, which has the form:
> CounterfactualImpact(Agent) = Value(World) - Value(World/Agent)
which reads "The impact of an agent is the difference between the value of the world with the agent and the value of the world without the agent".
It has been discussed in the effective altruism community that this function leads to pitfalls, paradoxes, or to unintuitive results when considering scenarios with multiple stakeholders. See:
* [Triple counting impact in EA](https://forum.effectivealtruism.org/posts/fnBnEiwged7y5vQFf/triple-counting-impact-in-ea)
* [The counterfactual impact of agents acting in concert](https://forum.effectivealtruism.org/posts/EP8x3vHRQJP57TjFL/the-counterfactual-impact-of-agents-acting-in-concert)
In this post I'll present some new and old examples in which the counterfactual function seems to fail, and how, in each of them, I think that a less known function does better: the Shapley value, a concept from cooperative game theory which has also been brought up before in such discussions. In the first three examples, I'll just present what the Shapley value outputs, and halfway through this post, I'll use these examples to arrive at a definition.
I think that one of the main hindrances in the adoption of Shapley values is the difficulty in its calculation. To solve this, I have written a Shapley value calculator and made it available online: [shapleyvalue.com](http://shapleyvalue.com/). I encourage you to play around with it.
## Example 1 & recap: Sometimes, the sum of the counterfactual impacts exceeds the total value.
> Suppose there are three possible outcomes:
>
> P has cost $2000 and gives 15 utility to the world.
> Q has cost $1000 and gives 10 utility to the world.
> R has cost $1000 and gives 10 utility to the world.
>
> Suppose Alice and Bob each have $1000 to donate. Consider two scenarios:
>
> Scenario 1: Both Alice and Bob give $1000 to P. The world gets 15 more utility. Both Alice and Bob are counterfactually responsible for giving 15 utility to the world.
>
> Scenario 2: Alice gives $1000 to Q and Bob gives $1000 to R. The world gets 20 more utility. Both Alice and Bob are counterfactually responsible for giving 10 utility to the world.
>
> From the world's perspective, scenario 2 is better. However, from Alice and Bob's individual perspective (if they are maximizing their own counterfactual impact), scenario 1 is better. This seems wrong, we'd want to somehow coordinate so that we achieve scenario 2 instead of scenario 1.
> [Source](https://forum.effectivealtruism.org/posts/EP8x3vHRQJP57TjFL/the-counterfactual-impact-of-agents-acting-in-concert#9KJM54ydQiucy22Gy)
> Attribution: rohinmshah
In Scenario 1:
Counterfactual impact of Alice: 15 utility.
Counterfactual impact of Bob: 15 utility.
Sum of the counterfactual impacts: 30 utility. Total impact: 15 utility.
The Shapley value of Alice would be: 7.5 utility.
The Shapley value of Bob would be: 7.5 utility.
The sum of the Shapley values always adds up to the total impact, which is 15 utility.
In Scenario 2:
Counterfactual impact of Alice: 10 utility.
Counterfactual impact of Bob: 10 utility.
Sum of the counterfactual impacts: 20 utility. Total impact: 20 utility.
The Shapley value of Alice would be: 10 utility.
The Shapley value of Bob would be: 10 utility.
The sum of the Shapley values always adds up to the total impact, which is 10+10 utility = 20 utility.
In this case, if Alice and Bob were each individually optimizing for counterfactual impact, they'd end up with a total impact of 15. If they were, each of them, individually, optimizing for the Shapley value, they'd end up with a total impact of 20, which is higher.
It would seem that we could use a function such as
> CounterfactualImpactModified = CounterfactualImpact / NumberOfStakeholders
to solve this particular problem. However, as the next example shows, that sometimes doesn't work. The Shapley value, on the other hand, has the property that it always adds up to total value.
Property 1: The Shapley value always adds up to the total value.
## Example 2: Sometimes, the sum of the counterfactuals is less than total value. Sometimes it's 0.
Consider the invention of calculus by Newton and Leibniz at roughly the same time. If Newton hadn't existed, Leibniz would still have invented it, and vice versa, so the counterfactual impact of each of them is 0. Thus, you can't normalize like above.
The Shapley value doesn't have that problem. It has the property that equal people have equal impact, which together with the requirement that it adds up to total value is enough to assign 1/2 of the total impact to each of Newton and Leibniz.
Interestingly, GiveWell has Iodine Global Network as a standout charity, but not as a recommended charity, because of considerations related to the above. If it were the case that, had IGN not existed, another organization would have taken its place, its counterfactual value would be 0, but its Shapley value would be 1/2 (of the impact of iodizing salt in developing countries).
Property 2: The Shapley value assigns equal value to equivalent agents.
## Example 3: Order indifference.
Consider Scenario 1 from Example 1 again.
> P has cost $2000 and gives 15 utility to the world.
>
> Suppose Alice and Bob each have $1000 to donate. Both Alice and Bob give $1000 to P. The world gets 15 more utility. Both Alice and Bob are counterfactually responsible for giving 15 utility to the world.
Alice is now a pure counterfactual-impact maximizer, but something has gone wrong. She now views Bob adversarially. She thinks he's a sucker, and she waits until Bob has donated to make her own donation. There are no worlds in which he doesn't donate before her, and Alice assigns all 15 utility to herself, and 0 to Bob. Note that she isn't exactly calculating the counterfactual impact, but something slightly different.
The Shapley value doesn't consider any agent to be a sucker, doesn't consider any variables to be in the background, and doesn't care whether people try to donate strategically before or after someone else. Here is a perhaps more familiar example:
Scenario 1:
Suppose that the Indian government creates some big and expensive infrastructure to vaccinate people, but people don't use it. Suppose an NGO then comes in and sends reminders to people to vaccinate their children, and some end up going.
Scenario 2:
Suppose that an NGO could be sending reminders to people to vaccinate their children, but it doesn't, because the vaccination infrastructure is nonexistent, so there would be no point. Then, the government steps in, and creates the needed infrastructure, and vaccination reminders are sent.
Again, it's tempting to say that in the first scenario, the NGO gets all the impact, and in the second scenario the government gets all the impact, perhaps because we take either the NGO or the Indian government to be in the background. To repeat, the Shapley value doesn't differentiate between the two scenarios, and doesn't leave variables in the background. For how this works numerically, see the examples below.
Property 3: The Shapley value doesn't care about who comes first.
## The Shapley value is uniquely determined by simple properties.
These properties:
* Property 1: Sum of the values adds up to the total value (Efficiency)
* Property 2: Equal agents have equal value (Symmetry)
* Property 3: Order indifference: it doesn't matter which order you go in (Linearity). Or, in other words, if there are two steps, Value(Step1 + Step2) = Value(Step1) + Value(Step2).
And an extra property:
* Property 4: Null-player (if in every world, adding a person to the world has no impact, the person has no impact). You can either take this as an axiom, or derive it from the first three properties.
are enough to force the Shapley value function to take the form it takes:
![](images/aeef390de90cb7c6fe07fcad852578fbebe162b1.svg)

In words: the Shapley value of agent i is its marginal contribution to each coalition that excludes it, averaged first over all coalitions of a given size and then over all possible sizes.
At this point, the reader may want to consult [Wikipedia](https://en.wikipedia.org/wiki/Shapley_value) to familiarize themselves with the mathematical formalism, or, for a book-length treatment, [_The Shapley value: Essays in honor of Lloyd S. Shapley_](http://www.library.fa.ru/files/Roth2.pdf). Ultimately, a quick way to understand it is as "the function uniquely determined by the properties above".
I suspect that order indifference will be the most controversial of these properties. Intuitively, it prevents stakeholders from adversarially choosing to collaborate earlier or later in order to assign themselves more impact.
Note that in the case of only one agent the Shapley value reduces to the counterfactual function, and that the Shapley value uses many counterfactual comparisons in its formula. It sometimes reduces to CounterfactualValue / NumberOfStakeholders (though it sometimes doesn't). Thus, the Shapley value might be best understood as an extension of counterfactuals, rather than as something completely alien.
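As a concrete sketch (my own code, not the calculator linked above), the definition can be implemented directly by averaging each agent's marginal contribution over all orderings:

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: average each player's marginal contribution
    to the set of players that came before it, over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            joined = coalition | {p}
            totals[p] += v(joined) - v(coalition)
            coalition = joined
    n_orderings = factorial(len(players))
    return {p: t / n_orderings for p, t in totals.items()}

# Example 1, scenario 1: P produces 15 utility only with both donations.
v1 = lambda s: 15 if {"Alice", "Bob"} <= s else 0
print(shapley_values(["Alice", "Bob"], v1))   # 7.5 each

# Example 2: either Newton or Leibniz alone suffices to invent calculus.
v2 = lambda s: 1 if s else 0
print(shapley_values(["Newton", "Leibniz"], v2))  # 0.5 each
```

By order indifference, walking orderings like this gives the same numbers as the subset-weighted formula above; it is just a more direct translation of the intuition.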
## Example 4: The Shapley value can also deal with leveraging
> Organisations can leverage funds from other actors into a particular project. Suppose that AMF will spend $1m on a net distribution. As a result of AMF's commitment, the Gates Foundation contributes $400,000. If AMF had not acted, Gates would have spent the $400,000 on something else. Therefore, the counterfactual impact of AMF's work is:
> AMF's own $1m on bednets plus Gates' $400,000 on bednets minus the benefits of what Gates would otherwise have spent their $400,000 on.
> If Gates would otherwise have spent the money on something worse than bednets, then the leveraging is beneficial; if they would otherwise have spent it on something better than bednets, the leveraging reduces the benefit produced by AMF.
> Source: [The counterfactual impact of agents acting in concert](https://forum.effectivealtruism.org/posts/EP8x3vHRQJP57TjFL/the-counterfactual-impact-of-agents-acting-in-concert).
Let's consider the case in which the Gates Foundation would otherwise have spent their $400,000 on something half as valuable.
Then the counterfactual impact of the AMF is 1,000,000+400,000-(400,000)\*0.5 = $1.2m.
The counterfactual impact of the Gates Foundation is $400,000.
And the sum of the counterfactual impacts is $1.6m, which exceeds the total impact, which is $1.4m.
The Shapley value of the AMF is $1.1m.
The Shapley value of the Gates Foundation is $300,000.
Thus, the Shapley value assigns to the AMF part, but not all, of the impact of the Gates Foundation's donation. It takes their outside options into account when doing so: if the Gates Foundation would otherwise have invested in something equally valuable, the AMF wouldn't get any of that value.
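With two players, the Shapley value is just the average of each player's marginal contribution over the two possible orderings; a quick check of the figures above (my own sketch, values in thousands of dollars):

```python
def two_player_shapley(v_a, v_b, v_ab):
    """Average each player's marginal contribution over both orderings."""
    phi_a = (v_a + (v_ab - v_b)) / 2
    phi_b = (v_b + (v_ab - v_a)) / 2
    return phi_a, phi_b

# AMF alone: $1m on bednets. Gates alone: $400k spent at half value.
# Together: $1.4m of bednet value.
phi_amf, phi_gates = two_player_shapley(1000, 200, 1400)
print(phi_amf, phi_gates)  # 1100.0 300.0 -> $1.1m and $300,000
```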
## Example 5: The Shapley value can also deal with funging
> Suppose again that AMF commits $1m to a net distribution. But if AMF had put nothing in, DFID would instead have committed $500,000 to the net distribution. In this case, AMF funges with DFID. AMF's counterfactual impact is therefore:
> AMF's own $1m on bednets minus the $500,000 that DFID would have put in plus the benefits of what DFID in fact spent their $500,000 on.
> [Source](https://forum.effectivealtruism.org/posts/EP8x3vHRQJP57TjFL/the-counterfactual-impact-of-agents-acting-in-concert)
Suppose that the DFID spends their money on something half as valuable.
The counterfactual impact of the AMF is $1m - $500,000 + ($500,000)\*0.5 = $750,000.
The counterfactual impact of DFID is $250,000.
The sum of their counterfactual impacts is $1m; lower than the total impact, which is $1,250,000.
The Shapley value of the AMF is, in this case, $875,000.
The Shapley value of the DFID is $375,000.
The AMF is penalized: even though it paid $1,000,000, its Shapley value is less than that. The DFID's Shapley-impact is increased, because it could have invested its money in something more valuable, if the AMF hadn't intervened.
For a perhaps cleaner example, consider the case in which the DFID's counterfactual impact is $0: It can't use the money except to distribute nets, and the AMF got there first. In that scenario:
The counterfactual impact of the AMF is $500,000.
The counterfactual impact of DFID is $0.
The sum of their counterfactual impacts is $500,000. This is lower than the total impact, which is $1,000,000.
The Shapley value of the AMF is $750,000.
The Shapley value of the DFID is $250,000.
The AMF is penalized: even though it paid $1,000,000, its Shapley value is less than that. The DFID shares some of the impact, because it could have distributed the nets itself.
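Both funging cases reduce to the same two-player computation (a sketch, values in thousands of dollars):

```python
def two_player_shapley(v_a, v_b, v_ab):
    """Average each player's marginal contribution over both orderings."""
    return ((v_a + (v_ab - v_b)) / 2,
            (v_b + (v_ab - v_a)) / 2)

# DFID's alternative is half as valuable: together, $1m + $250k of value.
print(two_player_shapley(1000, 500, 1250))  # (875.0, 375.0)

# DFID can only distribute nets, and the AMF got there first.
print(two_player_shapley(1000, 500, 1000))  # (750.0, 250.0)
```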
## Example 6: The counterfactual value doesn't deal correctly with tragedy-of-the-commons scenarios.
Imagine a scenario in which many people could replicate the GPT-2 model and make it freely available, but the damage is already done once the first person does it. Imagine that 10 people end up doing it, and that the damage done is something big, like -10 million utility.
Then the counterfactual damage done by each person would be 0, because the other nine would have done it regardless.
The Shapley value deals with this by assigning an impact of -1 million utility to each person.
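In code (a sketch; by symmetry and efficiency, each of the 10 people must get one tenth of the −10 million):

```python
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Exact Shapley values via the subset-weighted formula."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        acc = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - 1 - k) / factorial(n)
                acc += weight * (v(frozenset(s) | {p}) - v(frozenset(s)))
        phi[p] = acc
    return phi

# The damage (-10 million utility) is done as soon as anyone replicates the model.
v = lambda s: -10_000_000 if s else 0
print(shapley(range(10), v))  # approximately -1 million for each of the 10 people
```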
### Example 7: Hiring in EA
Suppose that there was a position in an EA org for which there were 7 qualified applicants, who are otherwise "idle". In arbitrary units, the person in that position in that organization can produce an impact of 100 utility.
The counterfactual impact of the organization is 100.
The counterfactual impact of any one applicant is 0.
The Shapley value of the organization is 87.5.
The Shapley value of any one applicant is 1.79.
As there are more applicants, the value skews more in favor of the organization, and the opposite happens with fewer applicants. If there were instead only 3 applicants, the values would be 75 and 8.33, respectively. If there were only 2 applicants, the Shapley value of the organization would be 66.66, and that of the applicants 16.66. With one applicant and one organization, the impact is split 50/50.
In general, I suspect, but I haven't proved it, that if there are n otherwise idle applicants, the Shapley value assigned to the organization is n/(n+1) of the total. This suggests that a lot of the impact of the position goes to whoever created the position.
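A brute-force sketch of this game as I understand it (my formalization: the position produces 100 utility iff the organization plus at least one applicant are present; names are illustrative):

```python
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Exact Shapley values via the subset-weighted formula."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        acc = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - 1 - k) / factorial(n)
                acc += weight * (v(frozenset(s) | {p}) - v(frozenset(s)))
        phi[p] = acc
    return phi

def hiring_game(n_applicants, value=100):
    # Value is produced iff the org and at least one applicant are present.
    players = ["org"] + [f"applicant{i}" for i in range(n_applicants)]
    v = lambda s: value if "org" in s and len(s) > 1 else 0
    return shapley(players, v)

for n in (1, 2, 3, 7):
    phi = hiring_game(n)
    print(n, round(phi["org"], 2), round(phi["applicant0"], 2))
```

Under this formalization the organization's share comes out to n/(n+1) of the total: 50/50 with one applicant, 75 with three, 87.5 with seven.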
## Example 8: The Shapley value makes the price of a life rise with the number of stakeholders.
Key:
* Shapley value - counterfactual value / counterfactual impact
* Shapley price - counterfactual price. The amount of money needed for your Shapley value to be 1 unit of X / The amount of money needed to be counterfactually responsible for 1 unit of X.
* Shapley cost-effectiveness - counterfactual cost-effectiveness.
Suppose that, in order to save a life, 4 agents have to be there: AMF to save a life, GiveWell to research them, Peter Singer to popularize them, and a person to donate $5000. Then the counterfactual impact of the donation would be 1 life, but its Shapley value would be 1/4th. Or, in other words, the Shapley cost of saving a life through a donation is four times higher than the counterfactual cost.
Why is this? Well, suppose that, to save a life, each of the organizations spent $5000. Because all of them are necessary, the counterfactual cost of a life is $5000 for any of the stakeholders. But if you wanted to save an additional life, the amount of money that would have to be spent is $5000\*4 = $20,000, because someone would have to go through the four necessary steps.
If, instead of 4 agents there were 100 agents involved, then the counterfactual price stays the same, but the Shapley price rises to 100x the counterfactual price. In general, I've said "AMF", or "GiveWell", as if they each were only one agent, but that isn't necessarily the case, so the Shapley price (of saving a life) might potentially be even higher.
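The scaling itself is simple arithmetic (a sketch; the $5000 figure is the one from the example above):

```python
def shapley_price(counterfactual_price, n_necessary_agents):
    # With n equally necessary agents, a donation's Shapley value is 1/n of
    # a life, so n times the counterfactual price is needed for the donor's
    # Shapley value to reach one whole life.
    return counterfactual_price * n_necessary_agents

print(shapley_price(5000, 4))    # 20000
print(shapley_price(5000, 100))  # 500000
```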
This is a problem because if agents are reporting their cost-effectiveness in terms of counterfactuals, and one agent switches to consider their cost-effectiveness in terms of Shapley values, their cost effectiveness will look worse.
This is also a problem if organizations are reporting their cost-effectiveness in terms of counterfactuals, but in some areas there are 100 necessary stakeholders, and in other areas there are four.
## Shapley value and cost effectiveness.
We care not only about impact, but also about cost-effectiveness. Let us continue with the example in which an NGO sends reminders to undergo vaccination, and let us put some numbers on it.
Let's say that a small Indian state with 10 million inhabitants spends $60 million to vaccinate 30% of its population. An NGO, which would otherwise be doing something really ineffective (we'll come back to this), comes in and, by sending reminders, increases the vaccination rate to 35%. They do this very cheaply, for $100,000.
The Shapley value of the Indian government would be 32.5%, or 3.25 million people vaccinated.
The Shapley value of the small NGO would be 2.5%, or 0.25 million people vaccinated.
Dividing this by the amount of money spent:
The cost-effectiveness in terms of the Shapley value of the Indian government would be $60 million / 3.25 million vaccinations = $18.46/vaccination.
The cost-effectiveness in terms of the Shapley value of the NGO would be $100,000 / 250,000 vaccinations = $0.4/vaccination.
So even though the NGO's Shapley value is smaller, its cost-effectiveness is higher, as one might expect.
If the outside option of the NGO were something which has a similar impact to vaccinating 250,000 people, we're back at the funging/leveraging scenario: because the NGO's outside option is better, its Shapley value rises.
## Cost effectiveness in terms of Shapley value changes when considering different groupings of agents.
Continuing with the same example, consider that, instead of the abstract "Indian government" as a homogeneous whole, there are different subagents which are all necessary to vaccinate people. Consider: The Central Indian Government, the Ministry of Finance, the Ministry of Health and Family Welfare, and within any one particular state: the State's Council of Ministers, the Finance Department, the Department of Medical Health and Family Welfare, etc. And within each of them there are sub-agencies, and sub-subagencies.
In the end, suppose that there are 10 organizations which are needed for the vaccine to be delivered, for a nurse to be there, for a hospital or a similar building to be available, and for there to be money to pay for all of it. For simplicity, suppose that the budget of each of those organizations is the same: $60 million / 10 = $6 million. Then the Shapley cost-effectiveness is different:
The Shapley value of each governmental organization would be 1/10 \* (3 million + 10/11 \* 0.5 million) = 345,454 people vaccinated.
The Shapley value of the NGO would be 1/11 \* 500,000 = 45,454 people vaccinated.
The cost effectiveness of each governmental organization would be ($6 million)/(345,454 vaccinations) = $17 / vaccination.
The cost effectiveness of the NGO would be $100,000 / 45,454 vaccinations = $2.2 / vaccination.
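A sketch of the grouped computation (my formalization: 10 interchangeable government sub-agencies, all of them necessary; the NGO adds 0.5 million vaccinations on top; units are millions of people):

```python
from itertools import combinations
from math import factorial

GOVS = [f"gov{i}" for i in range(10)]
PLAYERS = GOVS + ["ngo"]

def v(s):
    """Millions of people vaccinated by coalition s."""
    if not set(GOVS) <= s:
        return 0.0  # every government sub-agency is necessary
    return 3.5 if "ngo" in s else 3.0

def shapley(p):
    others = [q for q in PLAYERS if q != p]
    n = len(PLAYERS)
    acc = 0.0
    for k in range(n):
        for s in combinations(others, k):
            weight = factorial(k) * factorial(n - 1 - k) / factorial(n)
            acc += weight * (v(set(s) | {p}) - v(set(s)))
    return acc

print(shapley("gov0"))  # ~0.3455, i.e. ~345,454 people per sub-agency
print(shapley("ngo"))   # ~0.0455, i.e. ~45,454 people for the NGO

# Cost-effectiveness, in $ per vaccination: budget / vaccinations.
print(6e6 / (shapley("gov0") * 1e6))  # ~17.4
print(1e5 / (shapley("ngo") * 1e6))   # ~2.2
```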
That's interesting. These concrete numbers are all made up, but they're inspired by reality and "plausible", and I was expecting the result to be that the NGO would be less cost-effective than a government agency. It's curious to see that, in this concrete example, the NGO seems to be robustly more cost-efficient than the government under different groupings. I suspect that something similar is going on with 80,000h.
## Better to optimize for the Shapley value.
If each agent individually maximizes their counterfactual impact per dollar, we get suboptimal results, as we have seen above. In particular, consider a toy world in which twenty people can either:
* Each be an indispensable part of a project which has a value of 100 utility, for a total impact of 100 utility, or
* Each by themselves undertake a project which has 10 utility, for a total impact of 200 utility.
Then if each person was optimizing for the counterfactual impact, they would all choose the first option, for a lower total impact. If they were optimizing for their Shapley value, they'd choose the second option.
Can we make a more general statement? Yes. Agents individually optimizing for cost-effectiveness in terms of Shapley value globally optimize for total cost-effectiveness.
Informal proof: Consider the case in which agents have constant budgets and can divide them between different projects as they like. Then, consider the case in which each $1 is an agent: projects with higher Shapley value per dollar get funded first, then those with less impact per dollar, etc. Total cost-effectiveness is maximized. Because of order indifference, both cases produce the same distribution of resources. Thus, agents individually optimizing for cost effectiveness in terms of Shapley-value globally optimize for total cost-effectiveness.
Note: Thinking in terms of marginal cost-effectiveness doesn't change this conclusion. Thinking in terms of time/units other than money probably doesn't change the conclusion.
## Am I bean counting?
I don't have a good answer to that question.
## Conclusion
The counterfactual impact function is well defined, but it fails to meet my expectations of what an impact function ought to do when considering scenarios with multiple stakeholders.
On the other hand, the Shapley value function flows from some very general and simple properties, and can deal with the examples in which the counterfactual function fails. Thus, instead of optimizing for counterfactual impact, it seems to me that optimizing for Shapley value is less wrong.
Finally, because the Shapley value is not easy to calculate by hand, [here is a calculator](http://shapleyvalue.com/).
Question: Is there a scenario in which the Shapley value assigns impacts which are clearly nonsensical, but with which the counterfactual value, or a third function, deals correctly?
---
## Addendum: The Shapley value is not easily computable.
For games with many players, the Shapley value is not computationally tractable (though approximations can be quite good); work on the topic has been done in the area of interpreting machine learning results. See, for example:
> This was a very simple example that we've been able to compute analytically, but these won't be possible in real applications, in which we will need the approximated solution by the algorithm. Source: [https://towardsdatascience.com/understanding-how-ime-shapley-values-explains-predictions-d75c0fceca5a](https://towardsdatascience.com/understanding-how-ime-shapley-values-explains-predictions-d75c0fceca5a)
Or
> The Shapley value requires a lot of computing time. In 99.9% of real-world problems, only the approximate solution is feasible. An exact computation of the Shapley value is computationally expensive because there are 2^k possible coalitions of the feature values and the “absence” of a feature has to be simulated by drawing random instances, which increases the variance for the estimate of the Shapley values estimation. The exponential number of the coalitions is dealt with by sampling coalitions and limiting the number of iterations M. Decreasing M reduces computation time, but increases the variance of the Shapley value. There is no good rule of thumb for the number of iterations M. M should be large enough to accurately estimate the Shapley values, but small enough to complete the computation in a reasonable time. It should be possible to choose M based on Chernoff bounds, but I have not seen any paper on doing this for Shapley values for machine learning predictions. Source: [https://christophm.github.io/interpretable-ml-book/shapley.html#disadvantages-13](https://christophm.github.io/interpretable-ml-book/shapley.html#disadvantages-13)
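To make the sampling idea concrete, here is a minimal Python sketch of permutation sampling (my own illustrative implementation, not code from either source): a player's Shapley value equals their average marginal contribution over random orderings of the players, so sampling orderings gives an unbiased estimate whose variance shrinks with the number of iterations M.

```python
import random

def shapley_mc(n, v, iterations=20_000, seed=0):
    """Monte Carlo estimate of Shapley values via permutation sampling:
    each player's value is their average marginal contribution over
    randomly sampled orderings of the players."""
    rng = random.Random(seed)
    players = list(range(n))
    totals = [0.0] * n
    for _ in range(iterations):
        order = players[:]
        rng.shuffle(order)
        coalition = set()
        prev = v(coalition)
        for p in order:
            coalition.add(p)
            curr = v(coalition)
            totals[p] += curr - prev
            prev = curr
    return [t / iterations for t in totals]

# A 20-player unanimity game: a joint project worth 100 that needs everyone.
v = lambda S: 100 if len(S) == 20 else 0
est = shapley_mc(20, v)
print(est)  # each estimate should be close to the exact value, 100/20 = 5
```

Here the exact answer is known by symmetry, which makes it easy to see how close the estimates land; for asymmetric games this kind of sampling is often the only feasible option.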
That being said, here is a nontrivial example:
### Foundations and projects.
Suppose that within the EA community, OpenPhilanthropy, a foundation whose existence I appreciate, has the opportunity to fund 250 out of 500 projects every year. Say that you also have 10 smaller foundations: Foundation1,..., Foundation10, each of which can afford to fund 20 projects, that there aren't any more sources of funding, and that each project costs the same.
On the other hand, we will also consider the situation in which OpenPhil is a monopoly; after all, these other foundations and centers might ultimately have been founded by OpenPhilanthropy itself. In that case, assume that OpenPhil has the opportunity to fund 450 projects out of 500, and that there are no other sources of funding in the EA community.
Additionally, we could model the distribution of projects with respect to how much good they do in the world by ordering all projects from 1 to 500, and saying that:
* Impact1 of the k-th project = I1(k) = 0.99^k.
* Impact2 of the k-th project = I2(k) = 2/k^2 (a power law).
With that in mind, here are our results for the different assumptions. Power index = Shapley(OP) / Total Impact
| Monopoly? | Impact measure | Total Impact | Shapley(OP) | Power index |
|--------------------------|----------------|--------------|-------------------|-------------------------------------------------|
| 0 | I(k) = 0.99^k | 97.92 | 7.72 | 7.89% |
| 0 | I(k) = 2/k^2 | 3.29 | 0.028 | 0.86% |
| 1 | I(k) = 0.99^k | 97.92 | 48.96 | 50% |
| 1 | I(k) = 2/k^2 | 3.29 | 1.64 | 50% |
For a version of this table which has counterfactual impact as well, see [here](https://raw.githubusercontent.com/NunoSempere/nunosempere.github.io/master/ea/OpenPhil.jpg).
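As a quick sanity check of the "Total Impact" column, the sums can be computed directly (this is a back-of-the-envelope Python snippet, not the linked R code; small discrepancies with the table may come from rounding or from exactly which indices were summed):

```python
# Total impact if all 500 projects are funded, under each impact measure.
total_geometric = sum(0.99**k for k in range(1, 501))  # I1(k) = 0.99^k
total_power_law = sum(2 / k**2 for k in range(1, 501))  # I2(k) = 2/k^2

print(round(total_geometric, 2))  # in the vicinity of the table's 97.92
print(round(total_power_law, 2))  # matches the table's 3.29

# Under the monopoly assumption, Shapley(OP) is half of the total impact,
# which is where the 50% power index in the last two rows comes from.
print(round(total_geometric / 2, 2), round(total_power_law / 2, 2))
```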
The above took some time, and required me to beat the formula for the Shapley value into being computationally tractable for this particular case (see [here](https://nunosempere.github.io/ea/ShapleyComputation.jpg) for some maths which, as far as I'm aware, are original, and [here](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/ShapleyValueCode.R) for some code).
[Part 1] Amplifying generalist research via forecasting models of impact and challenges
==============
_This post covers our models of impact and challenges with our exploration in amplifying generalist research using forecasting. It is accompanied by_ [_a second post_](https://forum.effectivealtruism.org/posts/ZTXKHayPexA6uSZqE/amplifying-generalist-research-via-forecasting-results-from) _with a high-level description of those models, and more detailed description of experiment set-up and results._
Many of the world's most pressing problems require intellectual progress to solve \[1\]. Finding ways to increase the rate of intellectual progress might be a highly promising way of solving those problems.
One component of this is generalist research: the ability to judge and synthesise claims across many different fields without detailed specialist knowledge of those fields, in order to, for example, prioritise potential new cause areas or allocate grant funding. Organisations at the EA Leaders Forum expect this skill to be among the most in-demand for their organisations over the coming 5 years ([2018 survey](https://80000hours.org/2018/10/2018-talent-gaps-survey/?fbclid=IwAR2b92ibCe01P_dIHk7E1HqXem_cjjGiXMZ9qwtKVhzYk3NvVYPgzUkEq6g), [2019 survey](https://forum.effectivealtruism.org/posts/TpoeJ9A2G5Sipxfit/ea-leaders-forum-survey-on-ea-priorities-data-and-analysis)).
In light of this, we recently tested a method of increasing the scale and quality of generalist research, applied to researching the industrial revolution \[2\], using Foretold.io (an online prediction platform).
In particular, we found that, when faced with claims like:
> “Pre-Industrial Britain had a legal climate more favorable to industrialization than continental Europe”
And
> “Pre-Industrial Revolution, average French wage was what percent of the British wage?”
a small crowd of forecasters recruited from the EA and rationality communities very successfully predicted the judgements of a trusted generalist researcher, with a benefit-cost ratio of around 73% compared to the original researcher. They also outperformed a group of external online crowdworkers.
Moreover, we believe this method can be scaled to answer many more questions than a single researcher could, and applied in domains other than research, like grantmaking, hiring and reviewing content.
We preliminarily refer to this method as “amplification” given its similarity to ideas from Paul Christiano's work on Iterated Distillation and Amplification in AI alignment (see e.g. [this](https://sideways-view.com/2016/12/01/optimizing-the-news-feed/#punchline)).
This was an exploratory project whose purpose was to build intuition for several possible challenges. It covered several areas that could be well suited for more narrow, traditional scientific studies later on. As such, the sample size was small and no single result was highly robust.
However, it did lead to several medium-sized takeaways that we think should be useful for informing future research directions and practical applications.
This post begins with a brief overview of our results. We then share some models of why the current project might be impactful and exciting, followed by some challenges this approach faces.
# Overview of the set-up and results
_(This section gives a very cursory overview of the set-up and results. A detailed report can be found in_ [_this post_](https://forum.effectivealtruism.org/posts/ZTXKHayPexA6uSZqE/amplifying-generalist-research-via-forecasting-results-from)_.)_
The basic set-up of the project is shown in the following diagram, and described below.
A two-sentence version would be:
> Forecasters predicted the conclusions that would be reached by Elizabeth Van Nostrand, a generalist researcher, before she conducted a study on the accuracy of various historical claims. We randomly sampled a subset of research claims for her to evaluate, and since we can set that probability arbitrarily low, this method is not bottlenecked by her time.
![](images/4a235d14d0177ec92050af5b2551cdbc337f2d1e.png)
The below graph shows the evolution of the accuracy of the crowd prediction over time, starting from Elizabeth Van Nostrand's prior. Predictions were submitted separately by two groups of forecasters: one based on a mailing list with participants interested in participating in forecasting experiments (recruited from effective altruism-adjacent events and other forecasting platforms), and one recruited from Positly, an online platform for crowdworkers.
![](images/c7e041d8fab837233a9cc4d03c6166c54da04020.png)
The y-axis shows the accuracy score on a logarithmic scale, and the x-axis shows how far along the experiment is. For example, 14 out of 28 days would correspond to 50%. The thick lines show the average score of the aggregate prediction, across all questions, at each time-point. The shaded areas show the standard error of the scores, so that the graph might be interpreted as a guess of how the two communities would predict a random new question.
One of our key takeaways from the experiment is that our simple algorithm for aggregating predictions performed surprisingly well in predicting Elizabeth's research output -- but only for the network-adjacent forecasters.
Another way to understand the performance of the aggregate is to note that the aggregate of network-adjacent forecasters had an average log score of -0.5. To get a rough sense of what that means, it's the score you'd get by being 70% confident in a binary event and being correct (though note that this binary comparison merely serves to provide intuition; there are technical details that make the comparison to a distributional setting a bit tricky).
By comparison, the crowdworkers and Elizabeth's priors had a very poor log score of around -4. This is roughly similar to the score you'd get if you predict an event to be ~5% likely, and it still happens.
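For intuition, the log score of a binary forecast is just the logarithm of the probability assigned to what actually happened. The post does not state the base of the logarithm, but base 2 is consistent with the numbers quoted, as this sketch shows:

```python
from math import log2

def binary_log_score(p_assigned_to_outcome):
    """Log score for a binary forecast: the log of the probability the
    forecaster assigned to the realized outcome. Closer to 0 is better."""
    return log2(p_assigned_to_outcome)

print(round(binary_log_score(0.70), 2))    # ≈ -0.51, like the aggregate's -0.5
print(round(binary_log_score(0.0625), 2))  # -4.0: predicting ~6% and it happens
```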
We also calculated a benefit/cost-ratio, as follows:
> _Benefit/cost ratio = % value provided by forecasters relative to the evaluator / % cost of forecasters relative to the evaluator_
We measured “value provided” as the reduction in uncertainty weighted by the importance of the questions on which uncertainty was reduced.
Results were as follows.
![](images/50453b84385fa25f5a934570cfa2bc6702869748.png)
In other words, each unit of resources invested in the network-adjacent forecasters provided 72% as much value as investing it in Elizabeth directly, while each unit invested in the crowdworkers provided negative returns, as they tended to be less accurate than Elizabeth's prior.
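The ratio itself is straightforward arithmetic; here is a sketch with hypothetical inputs (the post reports the ratio but not the underlying percentages, so the numbers below are made up for illustration):

```python
def benefit_cost_ratio(value_fraction, cost_fraction):
    """Fraction of the evaluator's value captured by forecasters, divided
    by the fraction of the evaluator's cost they incur. 1 means parity."""
    return value_fraction / cost_fraction

# Hypothetical: forecasters capture 7.2% of the value at 10% of the cost,
# which would yield the 72% figure reported for network-adjacent forecasters.
print(round(benefit_cost_ratio(0.072, 0.10), 2))  # 0.72
```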
Overall, we tentatively view this as an existence proof of the possibility of amplifying generalist research, and in the future are interested in obtaining more rigorous results and optimising the benefit-cost ratio.
# Models of impact
This section summarises some different perspectives on what the current experiment is trying to accomplish and why that might be exciting.
There are several perspectives here given that the experiment was designed to explore multiple relevant ideas, rather than testing a particular, narrow hypothesis.
As a result, the current design is not optimising very strongly for any of these possible uses, and it is also plausible that its impact and effectiveness will vary widely between uses.
To summarise, the models are as follows.
* **Mitigating capacity bottlenecks.** The effective altruism and rationality communities face rather large bottlenecks in many areas, such as allocating funding, delegating research, [vetting](https://forum.effectivealtruism.org/posts/G2Pfpkcwv3bJNF8o9/ea-is-vetting-constrained) [talent](https://forum.effectivealtruism.org/posts/jmbP9rwXncfa32seH/after-one-year-of-applying-for-ea-jobs-it-is-really-really) and [reviewing content](https://www.lesswrong.com/posts/qXwmMkEBLL59NkvYR/the-lesswrong-2018-review). The current setup might provide a means of mitigating some of those -- a scalable mechanism of outsourcing intellectual labor.
* **A way for intellectual talent to build and demonstrate their skills.** Even if this set-up can't make new intellectual progress, it might be useful to have a venue where junior researchers can demonstrate their ability to predict the conclusions of senior researchers. This might provide an objective signal of epistemic abilities not dependent on detailed social knowledge.
* **Exploring new institutions for collaborative intellectual progress.** Academia has a vast backlog of promising ideas for institutions to help us think better in groups. Currently we seem bottlenecked by practical implementation and product development.
* **Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda.** These ideas inspired the experiment. (However, our aim was more practical and short-term, rather than looking for theoretical insights useful in the long-term.)
* **Exploring forecasting with distributions.** Little is known about humans doing forecasting with full distributions rather than point estimates (e.g. “79%”), partly because there hasn't been easy tooling for such experiments. This experiment gave us some cheap data on this question.
* **Forecasting fuzzy things.** A major challenge with forecasting tournaments is the need to concretely specify questions, in order to clearly determine who was right and allocate payouts. The current experiment tries to get the best of both worlds -- the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions.
* **Shooting for unknown unknowns.** In addition to being an “experiment”, this project is also an “exploration”. We have an intuition that there are interesting things to be discovered at the intersection of forecasting, mechanism design, and generalist research. But we don't yet know what they are.
## Mitigating capacity bottlenecks
The effective altruism and rationality communities face rather large bottlenecks in many areas, such as allocating funding, delegating research, [vetting](https://forum.effectivealtruism.org/posts/G2Pfpkcwv3bJNF8o9/ea-is-vetting-constrained) [talent](https://forum.effectivealtruism.org/posts/jmbP9rwXncfa32seH/after-one-year-of-applying-for-ea-jobs-it-is-really-really) and [reviewing content](https://www.lesswrong.com/posts/qXwmMkEBLL59NkvYR/the-lesswrong-2018-review).
Prediction platforms (for example as used with the current “amplification” set-up) might be a promising tool to tackle some of those problems, for several reasons. In brief, they might act as a scalable way to outsource intellectual labor.
First, we're using _quantitative_ predictions and scoring rules. This allows several things.
* We can directly measure how accurate each contribution was, and separately measure how useful they were in benefiting the aggregate. The actual calculations are quite simple and with some engineering effort can scale to [allocating credit](https://www.lesswrong.com/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem) (in terms of money, points, reputation etc.) to hundreds of users in an incentive-compatible way.
* We can _aggregate_ different contributions in an automatic and rigorous way \[3\].
* We have a shared, precise language for _interpreting_ contributions.
Contrast receiving 100 predictions and receiving 20 Google docs. The latter would be prohibitively difficult to read through, does not have a straightforward means of aggregation, and might not even be analysable in an “apples to apples” comparison.
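As a sketch of what automatic, rigorous aggregation can look like, here is one common scheme: averaging binary-probability forecasts in log-odds space (this illustrates the general idea and is not necessarily the algorithm used on Foretold):

```python
from math import exp, log

def aggregate_log_odds(probabilities, extremize=1.0):
    """Aggregate binary-probability forecasts by averaging their log-odds,
    optionally extremizing the result. One common scheme among several."""
    logits = [log(p / (1 - p)) for p in probabilities]
    mean_logit = extremize * sum(logits) / len(logits)
    return 1 / (1 + exp(-mean_logit))

forecasts = [0.6, 0.7, 0.8]  # three forecasters' probabilities for one claim
print(round(aggregate_log_odds(forecasts), 3))  # ≈ 0.707
```

Averaging in log-odds space rather than probability space gives more weight to confident forecasts and composes well with extremization, which is why it is a popular default in the forecasting literature.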
However, the big cost we pay to enable these benefits is that we are adding formalism, and constraining people to express their beliefs within the particular formalism/ontology of probabilities and distributions. We discuss this more in the section on challenges below.
Second, we're using an internet platform. This makes it easier for people from different places to collaborate, and to organise and analyse their contributions. Moreover, given the benefits of quantification noted above, we can freely open the tournament to people without substantial credentials, since we're not constrained in our capacity to evaluate their work.
Third, we're using a mechanism specifically designed to overcome capacity bottlenecks. The key to scalability is that forecasters do not know which claims will be evaluated, and so are incentivised to make their honest, most accurate predictions on all of them. This remains true even as many more claims are added (as long as forecasters expect rewards for participating to remain similar).
In effect, we're shifting the bottleneck from access to a few researchers to access to prize money and competent forecasters. It seems highly implausible that all kinds of intellectual work could be cost-effectively outsourced this way. However, if some work could be outsourced and performed at, say, 10% of the quality, but at only 1% of the cost, that could still be very worthwhile. For example, in trying to review hundreds of factual claims, the initial forecasting could be used as an initial, wide-sweeping filter, grabbing the low-hanging fruit, but also identifying which questions are more difficult and will need attention from more senior researchers.
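The incentive claim behind this mechanism can be sketched numerically: under a proper scoring rule, resolving each claim only with some probability merely scales the expected reward by a constant, so honest reporting stays optimal. A minimal illustration with made-up numbers, assuming a log scoring rule:

```python
from math import log

def expected_reward(p_reported, p_true, resolve_prob):
    """Expected log-score reward when the claim is only evaluated with
    probability resolve_prob. Scaling by a constant preserves the
    incentives of the proper scoring rule."""
    expected_score = p_true * log(p_reported) + (1 - p_true) * log(1 - p_reported)
    return resolve_prob * expected_score

# Whatever the resolution probability, reporting one's true belief (0.7)
# beats shading it up or down.
candidates = [0.3, 0.5, 0.7, 0.9]
for rp in [1.0, 0.1, 0.01]:
    best = max(candidates, key=lambda r: expected_reward(r, 0.7, rp))
    print(rp, best)  # best is always 0.7
```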
Overall, this is a model for how things _might_ work, but it is as of yet highly uncertain whether this technique will _actually_ be effective in tackling bottlenecks of various kinds. We provide some preliminary data from this experiment in the “Cost-effectiveness” section below.
## A way for intellectual talent to build and demonstrate their skills
The following seems broadly true to some of us:
* Someone who can predict my beliefs likely has a good model of how I think. (E.g. “I expect you to reject this paper's validity based on the second experiment, but also think you'd change your mind if you thought they had pre-registered that methodology”.)
* Someone who can both predict my beliefs _and_ disagrees with me is someone I should listen to carefully. They seem to both understand my model and still reject it, and this suggests they know something I don't.
* It seems possible for person X to predict a fair number of a more epistemically competent person Ys beliefs -- even before person X is _as_ epistemically competent as Y. And in that case, doing so is evidence that person X is moving in the right direction.
If these claims are true, we might use some novel versions of forecasting tournaments as a scalable system to identify and develop epistemic talent. This potential benefit looks quite different from using forecasting tournaments to help us solve novel problems or gain better or cheaper information than we could otherwise.
Currently there is no “drivers license” for rationality or effective altruism. Demonstrating your abilities requires navigating a system of reading and writing certain blog posts, finding connections to more senior people, and going through work trials tailored to particular organisations. This system does not scale very well, and also often requires a social knowledge and ability to “be in the right place at the right time” which does not necessarily strongly correlate with pure epistemic ability.
It seems very implausible that open forecasting tournaments could solve the entire problem here. But it seems quite plausible that it could offer improvements on the margin, and become a reliable credentialing mechanism for a limited class of non-trivial epistemic abilities.
For example, EA student groups with members considering cause prioritisation career paths might organise tournaments where their members forecast the conclusions of OpenPhil write-ups, or maintain and update their own distributions over key variables in GiveWell's cost-effectiveness models.
By running this experiment, writing up the results, and improving the Foretold platform, we hope to provide infrastructure that will allow others interested in this benefit to run their own experiments.
## Exploring new institutions for collaborative intellectual progress
Many of our current most important institutions, like governments and universities, run on mechanisms designed hundreds of years ago, before fields like microeconomics and statistics were developed. They suffer from many predictable and well-understood incentive problems, such as [poor replication rates of scientific findings](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124) following from a need to optimise for publications; the [election of dangerous leaders](https://halshs.archives-ouvertes.fr/halshs-01972097/document) due to the use of [provably suboptimal voting systems](https://80000hours.org/podcast/episodes/aaron-hamlin-voting-reform/); or the failure to adequately fund public goods like [high-quality explanations of difficult concepts](https://distill.pub/2017/research-debt/) due to the free-rider problem, just to name a few.
The academic literature in economics and mechanism design has a vast backlog of designs for new institutions that could solve these and other problems. One key bottleneck now seems to be implementation.
For example, ethereum founder Vitalik Buterin [has argued](https://80000hours.org/podcast/episodes/vitalik-buterin-new-ways-to-fund-public-goods/#transcript) that the key skill required is product development: making novel mechanisms with better incentives work in practice (search for “product people” in linked interview).
Similarly, [Robin Hanson](https://80000hours.org/podcast/episodes/robin-hanson-on-lying-to-ourselves/) has [argued](https://www.overcomingbias.com/2019/03/best-cause-new-mechanism-field-trials.html) that there is a large, promising literature on more effective institutions, but “what we need most \[... is lots of concrete trials.\] To get involved in the messy details of an organization, and just try out different variations until \[we\] see something that actually works” \[4\], \[5\].
Part of the spirit of the current experiment is an attempt to do just this, and, in particular, to do so in the domain of research and intellectual progress.
## Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda
The key mechanism underlying this experiment, and its use of prediction and randomisation, is based on ideas from the Iterated Amplification approach to AI alignment. Currently groups at Ought, OpenAI and elsewhere are working on testing the empirical assumptions underlying that theory.
Compared to these groups, the current experiment had a more practical, short-term aim -- to find a “shovel-ready” method of amplifying generalist research, that could be applied to make the EA/rationality communities more effective already over the coming years.
Nonetheless, potential follow-ups from this experiment might provide useful theoretical insight in that direction.
## Exploring forecasting with distributions
Little is known about doing forecasting with full distributions (e.g. “I think this is captured by two normals, with means 5 and 10 and variance 3”) rather than point estimates (e.g. “79%”). Before the launch of Foretold, there wasn't any software available for easily running such experiments.
This was a quick way of getting data on many questions in distributional forecasting:
* How good are humans at it?
* What are the main usability challenges?
  * In terms of intuitive scoring rules?
  * In terms of intuitive yet powerful input formats?
* What are best practices? (For example, using beta rather than lognormal distributions when forecasting someone else's prediction, or averaging distributions with a wide uniform to hedge against large losses.)
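The last best practice above can be made concrete: mixing a small amount of a wide uniform into a forecast caps how bad the log score can get when the main distribution turns out to be badly wrong. A hedged sketch with hypothetical numbers (not code from the experiment):

```python
from math import exp, log, pi, sqrt

def hedged_log_score(outcome, main_pdf, lo, hi, weight_main=0.95):
    """Log score of a forecast that mixes a main density with a wide
    uniform on [lo, hi]: the hedge keeps the score bounded even when
    the main forecast assigns ~zero density to the realized outcome."""
    uniform_density = 1 / (hi - lo) if lo <= outcome <= hi else 0.0
    density = weight_main * main_pdf(outcome) + (1 - weight_main) * uniform_density
    return log(density)

# Hypothetical main forecast: a normal density with mean 10, sd 1.
normal = lambda x: exp(-((x - 10) ** 2) / 2) / sqrt(2 * pi)

print(hedged_log_score(10, normal, 0, 100))  # outcome near the mode: a good score
print(hedged_log_score(80, normal, 0, 100))  # surprise: loss capped near log(0.05 * 0.01)
```

Without the uniform component, an outcome of 80 would give a log score of essentially minus infinity; the 5% hedge turns that into a large but bounded loss.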
## Forecasting fuzzy things
[A major challenge](https://www.lesswrong.com/s/HknKjvSxbFAAQ3RdL/p/8y7DcSF4eAkXoru4u) with prediction markets and forecasting tournaments is the need to concretely specify questions, in order to clearly determine who was right and allocate payouts.
Often, this means that these mechanisms are limited to answering questions like:
> “What will the highest performance of an algorithmic benchmark x be at time t?”
Even though what we often care about is something more nebulous, like:
> “How close will we be to AGI at time t?”
The upside of this precision is that it enables us to use quantitative methods to estimate performance, combine predictions, and allocate rewards, as described above.
The current experiment tries to get the best of both worlds: the incentive properties of forecasting tournaments _and_ the flexibility of generalist research in tackling more nebulous questions. The proposed solution to this problem is simply to have one or many trusted evaluators who decide on the truth of a question, and then to predict _their judgements_ as opposed to the underlying question \[6\].
(Previously some of the authors set up the [AI Forecasting Resolution Council](https://www.lesswrong.com/posts/9G6CCNXkA7JZoorpY/ai-forecasting-resolution-council-forecasting-infrastructure) to enable such flexible resolution to also be used on AI questions.)
## Shooting for unknown unknowns
This is related to the mindset of “[prospecting for gold](https://www.effectivealtruism.org/articles/prospecting-for-gold-owen-cotton-barratt/)”. To a certain extent, we think that we have a potentially reliable inside view, a certain research taste which is worth following and paying attention to, because we are curious what we might find out.
A drawback with this is that it enables practices like p-hacking/publication bias if results are reported selectively. To mitigate this, all data from this experiment is publicly available [here](https://observablehq.com/@jjj/untitled/2) \[7\].
# Challenges
This section discusses some challenges and limitations of the current exploration, as well as our ideas for solving some of them. In particular, we consider:
* **Complexity and unfamiliarity of experiment.** The current experiment had many technical moving parts. This makes it challenging to understand for both participants and potential clients who want to use it in their own organisations.
* **Trust in evaluations.** The extent to which these results are meaningful depends on your trust in Elizabeth Van Nostrand's ability to evaluate questions. We think this is partly an inescapable problem, but also expect clever mechanisms and more transparency to be able to make large improvements.
* **Correlations between predictions and evaluations.** Elizabeth had access to a filtered version of forecaster comments when she made her evaluations. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic in the experiments.
* **Difficulty of converting mental models into quantitative distributions.** It's hard to turn nuanced mental models into numbers. We think a solution is to have a “division of labor”, where some people just build models/write comments and others focus on quantifying them. We're working on incentive schemes that work in this context.
* **Anti-correlation between importance and “outsourceability”.** The intellectual questions which are most important to answer might be different from the ones that are easiest to outsource, in a way which leaves very little value on the table in outsourcing.
* **Overhead of question generation.** Creating good forecasting questions is hard and time-consuming, and better tooling is needed to support this.
* **Scoring rule that discourages collaboration.** Prediction markets and tournaments tend to be zero-sum games, with negative incentives for helping other participants or sharing best practices. To solve this we're designing and testing improved scoring rules which directly incentivise collaboration.
## Complexity and unfamiliarity of experiment.
The current experiment has many moving parts and a large [inferential distance](https://www.lesswrong.com/posts/HLqWn5LASfhhArZ7w/expecting-short-inferential-distances). For example, in order to participate, one would need to understand the mathematical scoring rule, the question input format, the randomisation of resolved questions and how questions would be resolved as distributions.
This makes the set-up challenging to understand for both participants and potential clients who want to use similar amplification set-ups in their own organisations.
We don't think these things are inherently complicated, but we have much work to do on explaining the set-up and making the app generally accessible.
## Trust in evaluations.
The extent to which the results are meaningful depends on one's trust in Elizabeth Van Nostrand's ability to evaluate questions. We chose Elizabeth for the experiment as she has a reputation for reliable generalist research (through [her blog series on “Epistemic Spot Checks”](https://www.lesswrong.com/users/pktechgirl)), and 10+ public blog posts with evaluations of the accuracy of books and papers.
However, the challenge is that this trust often relies on a long history of interactions with her material, in a way which might be hard to communicate to third-parties.
For future experiments, we are considering several improvements here.
First, as hinted at above, we can ask forecasters both about their predictions of Elizabeth as well as their own personal beliefs. We might then [expect](https://science.sciencemag.org/content/306/5695/462) that those who can both accurately predict Elizabeth _and_ disagree with her know something she does not, and so will be weighted more highly in the evaluation of the true claim.
Second, we might have set-ups with multiple evaluators; or more elaborate ways of scoring the evaluators themselves (for example based on their ability to predict what they themselves will say after more research).
Third, we might work to have more transparent evaluation processes, for example including systematic rubrics or detailed write-ups of reasoning. We must be careful here not to “throw out the baby with the bathwater”. The purpose of using judges is after all to access subjective evaluations which can't be easily codified in concrete resolution conditions. However, there seems to be room for more transparency on the margin.
## Correlation between predictions and evaluations.
Elizabeth had access to a filtered version of forecaster comments when she made her evaluations. Hence the selection process on evidence affecting her judgements was not independent from the selection process on evidence affecting the aggregate. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic in the experiments.
For future experiments, we're considering obtaining an objective data-set with clear ground truth, and testing the same set-up without revealing the comments to Elizabeth, to get data on how serious this problem is (or is not).
## Difficulty of converting mental models into quantitative distributions.
In order to participate in the experiment, a forecaster has to turn their mental models (represented in whichever way the human brain represents models) into quantitative distributions (which is a format quite unlike that native to our brains), as shown in the following diagram:
![](images/fd632951009d3c978277000a6ba9f3834cb4922a.png)
Each step in this chain is quite challenging, requires much practice to master, and can result in a loss of information.
Moreover, we are uncertain how the difficulty of this process differs across questions of varying importance. It might be that some of the most important considerations in a domain tend to be confusion-shaped (e.g. “What does it even mean to be aligned under self-improvement when you cant reliably reason about systems smarter than yourself?”), or very open-ended (e.g. “What new ideas could reliably improve the long-term future?” rather than “How much will saving in index funds benefit future philanthropists?”). Hence filtering for questions that are more easily quantified might select against questions that are more important.
Some solutions seem possible. For the domains where quantification seems more promising, it seems at least plausible that some kind of “division of labor” should be achievable.
For future experiments, were looking to better separate “information contribution” and “numerical contribution”, and find ways of rewarding both. Some participants might specialise in research or model-generation, and others in turning that research into distributions.
A challenge here is to appropriately reward users who only submit comments but do not submit predictions. Since one of the core advantages of forecasting tournaments is that they allow us to precisely and quantitatively measure performance, it seems plausible that any solution should try to make use of this fact. (As opposed to, say, using an independent up- and downvoting scheme.) As example mechanisms, one might randomly show a comment to half the users, and reward the comment based on the difference in performance of the aggregate between users whove seen it and users who havent. Or one might release the comments to forecasters sequentially, and see how much each improves the aggregate. Or one might simply allow users to vote, but weight the votes of users with a better track record more highly.
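The first of these mechanisms, randomized comment exposure, can be sketched in a few lines. This is a toy illustration, not the experiments actual mechanism; the scores and forecaster IDs are invented:

```python
import random

def comment_reward(scores_seen, scores_unseen):
    """Reward a comment by how much the average score of forecasters
    who saw it exceeds that of forecasters who did not."""
    avg = lambda xs: sum(xs) / len(xs)
    return avg(scores_seen) - avg(scores_unseen)

def assign_exposure(forecaster_ids, seed=0):
    """Randomly show the comment to half the forecasters."""
    rng = random.Random(seed)
    ids = list(forecaster_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])

# Hypothetical log scores per forecaster (closer to 0 is better).
seen, unseen = assign_exposure(range(6))
scores = {0: -0.4, 1: -0.6, 2: -0.5, 3: -1.2, 4: -0.9, 5: -1.1}
reward = comment_reward(
    [scores[i] for i in seen],
    [scores[i] for i in unseen],
)
```

If the comment genuinely contained information, the exposed half should score better on average, yielding a positive reward; with enough comments and forecasters the noise from random assignment washes out.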
Moreover, in future experiments with Elizabeth well want to pair her up with a “distribution buddy”, whose task is to interview her to figure out in detail what distribution best captures her beliefs, allowing Elizabeth to focus simply on building conceptual models.
## Anti-correlation between importance and “outsourceability”
Above we mentioned that the questions easiest to _quantify_ might be anti-correlated with the ones that are most important. It is also plausible that the questions which are easiest to _outsource_ to forecasters are not the same as those which are most important to reduce uncertainty on. Depending on the shape of these distributions, the experiment might not capture a lot of value. (For illustration, consider an overly extreme example: suppose a venture capitalist tries to amplify their startup investments. The crowd always predicts “no investment”, and turns out to be right in 99/100 cases: the VC doesnt invest. However, the returns from the one case where the crowd fails and the VC actually would have invested by far dominate the portfolio.)
## Overhead of question generation.
The act of creating good, forecastable questions is an art in and of itself. If done by the same person or small team which will eventually forecast the questions, one can rely on much shared context and intuition in interpreting the questions. However, scaling these systems to many participants requires additional work in specifying the questions sufficiently clearly. This overhead might be very costly, especially since we think one of the key factors determining the usefulness of a forecasting question is the question itself: how well does it capture something we care about? From experience, [writing these questions is hard](https://www.lesswrong.com/posts/yy3FCmdAbgSLePD7H/how-to-write-good-ai-forecasting-questions-question-database). In the future we have much work to do to make this process easier.
## A scoring rule that discourages collaboration
Participants were scored based on how much they outperformed the aggregate prediction. This scoring approach is similar to the default in prediction markets and major forecasting tournaments. It has the problem that sharing any information via commenting will harm your score (since it will make the performance of other users, and hence the aggregate, better). Whats more, all else remaining the same, doing _anything_ that helps other users will be worse for your score (such as sharing tips and tricks for making better predictions, or pointing out easily fixable mistakes so they can learn from them).
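To make the disincentive concrete, here is a toy version of scoring relative to the aggregate, using natural-log scores on a binary event. The numbers are illustrative, not the experiments actual payout function:

```python
import math

def log_score(p, outcome):
    """Logarithmic score of probability p for a binary event."""
    return math.log(p if outcome else 1 - p)

def relative_score(p_forecaster, p_aggregate, outcome):
    """Payout is proportional to outperformance of the aggregate, so
    anything that improves the aggregate lowers your relative score."""
    return log_score(p_forecaster, outcome) - log_score(p_aggregate, outcome)

# Suppose the event happens. Before sharing any insight, the aggregate
# sits at 50% while the forecaster is at 80%.
before = relative_score(0.8, 0.5, outcome=True)
# After the forecaster's comment moves the aggregate to 70%, their
# relative score shrinks even though their own forecast is unchanged.
after = relative_score(0.8, 0.7, outcome=True)
```

Sharing the insight strictly reduces the forecasters payout here, which is exactly the collaboration-discouraging dynamic described above.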
There are several problems with this approach and how it disincentivises collaboration.
First, it provides an awkward change in incentives for groups who otherwise have regular friendly interactions (such as a team at a company, a university faculty, or members of the effective altruism community).
Second, it causes effort to be wasted, as participants must derive the same key insights individually, utilising little division of labor (since sharing any information will just end up hurting their score on the margin). Having _some_ amount of duplication of work and thinking can of course make the system robust against mistakes -- but we think the optimal amount is far less than the equilibrium under the current scoring rule.
In spite of these theoretical incentives, it is interesting to note that several participants actually ended up writing detailed comments. (Though basically only aimed at explaining their own reasoning, with no collaboration and back-and-forth between participants observed.) This might have been because they knew Elizabeth would see those comments, or for some other reason.
Nonetheless, we are working on modifying our scoring rule in a way which directly incentivises participants to collaborate, and actively rewards helping other users improve their models. We hope to release details of formal models and practical experiments in the coming month.
# Footnotes
\[1\] Examples include: AI alignment, global coordination, macrostrategy and cause prioritisation.
\[2\] We chose the industrial revolution as a theme since it seems like a historical period with many lessons for improving the world. It was a time of radical change in productivity along with many societal transformations, and might hold lessons for future transformations and our ability to influence those.
\[3\] For example by averaging predictions and then weighting by past track-record and time until resolution, as done in the Good Judgement Project (among other things).
\[4\] Some examples of nitty-gritty details we noticed while doing this are:
* Payoffs were too small/the scoring scheme too harsh
* Copying the aggregate to your distribution and then editing it slightly was a natural workflow, so we added support in [the syntax](https://observablehq.com/@oagr/foretold-inputs) for writing =multimodal(AG, _your prediction_)
* Averaging with a uniform distribution would have improved predictions.
* The marginal value of each additional prediction was low after the beginning.
* Forecasters were mostly motivated by what questions were interesting, followed by what would give them a higher payout, and less by what would be most valuable to the experimenters.
\[5\] For a somewhat tangential, but potentially interesting, perspective, see [Feynman on making experiments to figure out nitty-gritty details in order to enable other experiments to happen](http://calteches.library.caltech.edu/51/2/CargoCult.htm) (search for “rats” in the link).
\[6\] A further direction were considering is to allow forecasters to both predict the judgements of evaluators _and_ the underlying truth. We might then [expect](https://science.sciencemag.org/content/306/5695/462) that those predictors who _both_ accurately forecast the judgement of the evaluator _and_ disagree in their own judgements, might provide valuable clues about the truth.
\[7\] For the record, before this experiment we ran two similar, smaller experiments (to catch easy mistakes and learn more about the set-up), with about an order of magnitude less total forecasting effort invested. The aggregate from these experiments was quite poor at predicting the evaluations. The data from those experiments can be found [here](https://www.foretold.io/c/f19015f5-55d8-4fd6-8621-df79ac072e15?state=closed), and more details in Elizabeths write-ups [here](https://www.lesswrong.com/posts/LtHC5LqtzKRvfy4yQ/epistemic-spot-check-the-fate-of-rome-kyle-harper) and [here](https://www.lesswrong.com/posts/5ytHm6pAozanqbhYW/epistemic-spot-checks-the-fall-of-rome).
# Participate in future experiments or run your own
[Foretold.io](https://www.lesswrong.com/posts/wCwii4QMA79GmyKz5/introducing-foretold-io-a-new-open-source-prediction) was built as an open platform to enable more experimentation with prediction-related ideas. We have also made [data and analysis calculations](https://observablehq.com/@jjj/untitled/2) from this experiment publicly available.
If youd like to:
* Run your own experiments on other questions
* Do additional analysis on this experimental data
* Use an amplification set-up within your organisation
Wed be happy to consider providing advice, operational support, and funding for forecasters. Just comment here or reach out to [this email](mailto:jacob@parallelforecast.com).
If youd like to participate as a forecaster in future prediction experiments, you can [sign-up here](https://mailchi.mp/60b8ea91e592/ol3ptgmr5d).
# Acknowledgements
Funding for this project was provided by the Berkeley Existential Risk Initiative and the EA Long-term Future Fund.
We thank Beth Barnes and Owain Evans for helpful discussion.
We are also very thankful to all the participants.

[Part 2] Amplifying generalist research via forecasting -- results from a preliminary exploration
==============
_This post covers the set-up and results from our exploration in amplifying generalist research using predictions, in detail. It is accompanied by_ [_a second post_](https://forum.effectivealtruism.org/posts/ZCZZvhYbsKCRRDTct/amplifying-generalist-research-via-forecasting-models-of) _with a high-level description of the results, and more detailed models of impact and challenges. For an introduction to the project, see that post._
The rest of this post is structured as follows.
First, we cover the basic set-up of the exploration.
Second, we share some results, in particular focusing on the accuracy and cost-effectiveness of this method of doing research.
Third, we briefly go through some perspectives on what we were trying to accomplish and why that might be impactful, as well as challenges with this approach. These are covered more in-depth in [a separate post](https://forum.effectivealtruism.org/posts/ZCZZvhYbsKCRRDTct/amplifying-generalist-research-via-forecasting-models-of).
_Overall,_ we _are very interested in feedback and comments on where to take this next._
# Set-up of the experiment
## A note on the experimental design
To begin with, we note that this was not an “experiment” in the sense of designing a rigorous methodology with explicit controls to test a particular, well-defined hypothesis.
Rather, this might be seen as an “exploration” \[3\]. We tested several different ideas at once, rather than running a separate experiment for each. We also intended to uncover new ideas and inspiration as much as to test existing ones.
Moreover, we proceeded in a startup-like fashion where several decisions were made ad-hoc. For example, a comparison group was introduced after the first experiment had been completed; this was not originally planned, but later became clearly useful. This came at the cost of reducing the rigor of the experiment.
We think this trade-off was worth it for our situation. This kind of policy allows us to execute a large number of experiments in a shorter amount of time, quickly pivot away from bad ones, and notice low-hanging mistakes and learning points before scaling up good ones. This is especially helpful as were [shooting for tail-end outcomes](https://www.openphilanthropy.org/blog/hits-based-giving), and are looking for concrete mechanisms to implement in practice (rather than publishing particular results).
We do not see it as a substitute for more rigorous studies, but rather as a complement, which might serve as inspiration for such studies in the future.
To prevent this from biasing the data, all results from the experiment are public, and we try to note when decisions were made post-hoc.
## Mechanism design
The basic set-up of the project is shown in the following diagram, and described below.
A two-sentence version would be:
> Forecasters predicted the conclusions that would be reached by Elizabeth Van Nostrand, a generalist researcher, before she conducted a study on the accuracy of various historical claims. We randomly sampled a subset of research claims for her to evaluate, and since the probability of any given claim being sampled can be set arbitrarily low, this method is not bottlenecked by her time.
![](images/4a235d14d0177ec92050af5b2551cdbc337f2d1e.png)
**1\. Evaluator extracts claims from the book and submits priors**
The evaluator for the experiment was Elizabeth Van Nostrand, an independent generalist researcher known for her “[Epistemic spot checks](https://www.lesswrong.com/users/pktechgirl)”. This is a series of posts assessing the trustworthiness of a book by evaluating some of its claims. We chose Elizabeth for the experiment as she has a reputation for reliable generalist research, and there was a significant amount of public data about her past evaluations of claims.
She picked 10 claims from the book _The Unbound Prometheus: Technological Change and Industrial Development in Western Europe from 1750 to the Present_, as well as a meta-claim about the reliability of the book as a whole.
All claims were assigned an importance rating from 1-10 based on their relevance to the thesis of the book as a whole. We were interested in finding out whether this would influence forecaster effort across questions.
Elizabeth also spent 3 minutes per claim submitting an initial estimate (referred to as a “prior”).
![](images/533cbee1908697a3c0338e0f5c83b7f960d73551.png)
Beliefs were typically encoded as distributions over the range 0% to 100%, representing where Elizabeth expected the mean of her posterior credence in the claim to be after 10 more hours of research. For more explanation, see this footnote \[4\].
**2\. Forecasters make predictions**
Forecasters predicted what they expected Elizabeth to say after ~45 minutes of research on the claim, and wrote comments explaining their reasoning.
Forecasters payments for the experiment were proportional to how much their forecasts outperformed the aggregate in estimating her 45-minute distributions. In addition, forecasters were paid a base sum just for participating. You can see all forecasts and comments [here](https://www.foretold.io/c/f19015f5-55d8-4fd6-8621-df79ac072e15?state=closed), and an interactive tool for visualising and understanding the scoring scheme [here](https://observablehq.com/@jjj/amplification-experiment-scoring).
A key part of the design was that forecasters _did_ _not know_ which questions Elizabeth would randomly sample to evaluate. Hence they were incentivised to do their best on _all_ questions (weighted by importance). This has the important implication that we could easily extend the number of questions predicted by forecasters -- even if Elizabeth can only judge 10 claims, we could have forecasting questions for 100 different claims \[5\].
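The sampling step can be sketched as follows. This is an illustration, not the exact procedure used: we assume, hypothetically, that the resolution probability is weighted by importance (in the actual experiment, 8 of the 10 claims were chosen at random), and the claim names and weights are invented:

```python
import random

def sample_claims_to_resolve(importance, k, seed=0):
    """Sample k claims for evaluation, weighted by importance, without
    replacement. Forecasters don't know the sample in advance, so their
    expected payout is maximised by predicting well on every claim."""
    rng = random.Random(seed)
    remaining = list(importance)
    chosen = []
    for _ in range(k):
        weights = [importance[c] for c in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

importance = {"claim_a": 9, "claim_b": 3, "claim_c": 6, "claim_d": 1}
resolved = sample_claims_to_resolve(importance, k=2)
```

Because the sample is drawn after predictions close, the evaluators time bounds only how many claims get resolved, not how many can be asked about.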
Two groups of forecasters participated in the experiment: one based on a mailing list with participants interested in participating in forecasting experiments (recruited from effective altruism-adjacent events and other forecasting platforms) \[6\], and one recruited from Positly, an online platform for crowdworkers. The former group is here called “Network-adjacent forecasters” and the latter “Online crowdworkers”.
![](images/5e456cfc58967fc5c074c0287c806653d978b84b.png)
**3\. The evaluator judges the claims**
Elizabeth was given a time-budget of 6 hours, within which she randomly sampled claims to research and judge.
At this point, we wanted to use the work done by forecasters to help Elizabeth, while avoiding anchoring and biasing her with their estimates.
To solve this, Elizabeth was initially given [a filtered version of the comments section](https://docs.google.com/document/d/1H_AYIA1Sro_cHePZvcV0joXA4qgrVkw8xp916UxQYSQ/edit?usp=sharing) for each claim, which contained all sources and models used, but which had been stripped of any explicit predictions or subjective opinion generalising from the data.
For example, for the question:
> _Pre-Industrial Britain had a legal climate more favorable to industrialization than continental Europe \[5\]._
One commenter wrote:
> Seems more likely to be true than not. The English Civil War and Glorious Revolution both significantly curtailed the arbitrary power of the monarch/gentry and raised the power of merchants in Britain, making likely that government was more favourable to mercantile interests. Hard to judge the claim about haggling.
And in Elizabeths initial briefing this was replaced by:
> The English Civil War and Glorious Revolution both significantly curtailed the arbitrary power of the monarch/gentry and raised the power of merchants in Britain \[...\].
After arriving at a final estimate, Elizabeth was allowed to look at the full forecaster comments and predictions and optionally change her mind. In practice, she didn't change her mind in any of these cases.
To summarise, the steps involved were:
* We summarised all of the relevant comments into a list and removed any descriptions that referred to peoples probabilities.
* We randomly chose 8 of the 10 claims for review by Elizabeth.
* Elizabeth saw this list, and spent 6 hours evaluating the 8 claims, resolving them with probability distributions over the values she expected to assign in a possible future evaluation round. She divided this time according to what seemed most useful; for instance, questions whose answers became obvious quickly got a relatively small proportion of it.
* Elizabeth got access to all predictions and all comments and was allowed to change her resolutions. She decided not to in all of the cases.
**4\. The evaluator double-checks the evaluations**
After having spent 6 hours researching 8 claims, Elizabeth randomly sampled two of those, each of which she spent an additional 3 hours researching. For the remaining claims, she waited until a week after the experiment, and then reread her notes and submitted new resolutions, to see if her way of converting beliefs into numbers was consistent over time. This part was intended to test the consistency and reliability of Elizabeths evaluations.
The outcome of this was that Elizabeth appeared highly consistent and reliable. You can see the data and graphs [here](https://observablehq.com/@jjj/untitled/2). Elizabeths full notes explaining her reasoning in the evaluations can be found [here](https://acesounderglass.com/2019/12/03/epistemic-spot-check-the-unbound-prometheus/).
# Results and analysis
_You can find all the data and interactive tools for exploring it yourself,_ _[here](https://observablehq.com/@jjj/untitled/2)._
## Online crowdworkers
We were interested in comparing the performance of our pool of forecasters to “generic” participants with no prior interest or experience forecasting.
Hence, after the conclusion of the original experiment, we reran a slightly modified form of the experiment with a group of forecasters recruited through an online platform that sources high quality crowdworkers (who perform microtasks like filling out surveys or labeling images for machine learning models).
However, it should be mentioned that these forecasters were operating under a number of disadvantages relative to other participants, which means we should be careful when interpreting their performance. In particular:
* They did not know that Elizabeth was the researcher who created the claims and would resolve them, and so they had less information to model the person whose judgments would ultimately decide the questions.
* They did not use any [multimodal](https://observablehq.com/@oagr/foretold-inputs) or custom distributions, which is a way to increase tail-uncertainty and avoid large losses when forecasting with distributions. We expect this was because of the time-constraints set by their payment, as well as the general difficulty.
Overall the experiment with these online crowdworkers produced poor accuracy results at predicting Elizabeths resolutions (as is discussed further below).
## Accuracy of predictions
This section analyses how well forecasters performed, collectively, in amplifying Elizabeth's research.
The aggregate prediction was computed as the average of all forecasters' final predictions. Accuracy was measured using [a version of the logarithmic scoring rule](https://observablehq.com/@jjj/foretold-scoring).
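A minimal sketch of this pipeline, representing each forecast as a discretized density over the question's range. The grid, densities, and resolution bin are made up for illustration; Foretold's actual distribution format and scoring details differ:

```python
import math

def aggregate(forecasts):
    """Average several discretized densities pointwise."""
    n = len(forecasts)
    return [sum(f[i] for f in forecasts) / n for i in range(len(forecasts[0]))]

def log_score(density, resolution_index):
    """Log of the density at the bin where the question resolved
    (0 is a perfect score; more negative is worse)."""
    return math.log(density[resolution_index])

# Three hypothetical forecasts over a 5-bin grid (each sums to 1).
forecasts = [
    [0.05, 0.10, 0.20, 0.40, 0.25],
    [0.10, 0.15, 0.25, 0.30, 0.20],
    [0.02, 0.08, 0.30, 0.45, 0.15],
]
agg = aggregate(forecasts)
score = log_score(agg, resolution_index=3)  # resolution fell in bin 3
```

One property of pointwise averaging worth noting: the aggregate places nonzero mass wherever _any_ forecaster did, which protects it from the catastrophic log-score penalties an individual overconfident forecast can incur.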
The following graph shows how the aggregate performed on each question:
![](images/f6ebc350474cbb51b21c0fa5184716c6f2e3eceb.png)
The opaque bars represent the scores from the crowdworkers, and the translucent bars, which have higher scores throughout, represent the scores from the network-adjacent forecasters. It's interesting that the ordering is preserved, that is, that the relative question difficulty was the same for both groups. Finally, we dont see any correlation between question difficulty and the importance weights Elizabeth assigned to the questions.
However, the comparison is confounded by the fact that more effort was spent by the network-adjacent forecasters. The above graph also doesnt compare performance to Elizabeths priors. Hence we also plot the evolution of the aggregate score over prediction number and time (the first data-point in the graphs below represents Elizabeths priors):
![](images/53f456d57fe63c3f65b05b21cffa42b69d45ed87.png)
![](images/a547c3816ddef37ee3d560ac2d05ec50071df615.png)
![](images/c7e041d8fab837233a9cc4d03c6166c54da04020.png)
For the last graph, the y-axis shows the score on a logarithmic scale, and the x-axis shows how far along the experiment is. For example, 14 out of 28 days would correspond to 50%. The thick lines show the average score of the aggregate prediction, across all questions, at each time-point. The shaded areas show the standard error of the scores, so that the graph might be interpreted as a guess of how the two communities would predict a random new question \[10\].
One of our key takeaways from the experiment is that the simple average aggregation algorithm performed surprisingly well, but only for the network-adjacent forecasters.
One way to see this qualitatively is by observing the graphs below, where we display Elizabeths priors, the final aggregate of the network-adjacent forecasters, and the final resolution, for a subset of questions \[11\].
**Question examples**
The x-axis \[12\] shows Elizabeths best estimate of the accuracy of a claim, from 0% to 100% (see section “Mechanism design, 1. Evaluator extracts claims” for more detail).
![](images/b10eb78f4b874e299a1a14ef331821ebb47b042a.png)
![](images/449cbaaa18d85ac7e1fbf3e7d70defc290367b94.png)
Another way to understand the performance of the aggregate is to note that the aggregate of network-adjacent forecasters had an average log score of -0.5. To get a rough sense of what that means, it's similar to the score you'd get by being 70% confident in a binary event and being correct (though note that this binary comparison merely serves to provide intuition; there are technical details making the comparison to a distributional setting a bit tricky).
By comparison, the crowdworkers and Elizabeths priors had a very poor log score of around -4. This is roughly similar to the score youd get if you predict an event to be ~5% likely, and it still happens.
# Cost-effectiveness
### High-level observations
This experiment was run to get a sense of whether forecasters could do a competent job forecasting the work of Elizabeth (i.e. as an "existence proof"). It was not meant to show cost-effectiveness, which could involve many efficiency optimizations not yet undertaken. However, we realized that the network-adjacent forecasting may have been reasonably cost-effective and think that a cost-effectiveness analysis of this work could provide a baseline for future investigations.
To compute the cost-effectiveness of doing research using amplification, we look at two measures: the information gain from predictors relative to the evaluator, and the cost of predictors relative to the evaluator.
_Benefit/cost ratio = % information gain provided by forecasters relative to the evaluator / % cost of forecasters relative to the evaluator_
If a benefit/cost ratio of significantly over 1 can be achieved, then this could mean that forecasting could be useful to partially augment or replace established evaluators.
Under these circumstances, each unit of resources invested in gaining information from forecasters has higher returns than just asking the evaluator directly.
Some observations about this.
First, note that this does _not_ require forecasters to be as accurate as the evaluator. For example, if they only provide 10% as much value, but at 1% of the opportunity cost, this is still a good return on investment.
Second, amplification can still be worthwhile even if the benefit-cost ratio is < 1. In particular:
1. Forecasters can work in parallel and hence answer a much larger number of questions, within a set time-frame, than would be feasible for some evaluators.
2. Pre-work by forecasters might also improve the speed and quality of the evaluator's work, if she has access to their research \[13\].
3. Having a low benefit-cost ratio can still serve as an existence proof that amplification of generalist research is possible, as long as the benefit is high. One might then run further optimised tests which try harder to reduce cost.
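The arithmetic behind the benefit/cost ratio is simple; here it is applied to the hypothetical 10%-value-at-1%-cost example from above (the numbers are illustrative, not results from the experiment):

```python
def benefit_cost_ratio(value_ratio, cost_ratio):
    """Information gain provided by forecasters relative to the
    evaluator, divided by their cost relative to the evaluator.
    Both inputs are fractions, e.g. 0.10 for 10%."""
    return value_ratio / cost_ratio

# Hypothetical: forecasters provide 10% of the evaluator's value at
# 1% of her cost -- roughly a 10x return despite lower accuracy.
ratio = benefit_cost_ratio(0.10, 0.01)
```

A ratio above 1 means each unit of resources spent on forecasters yields more information than spending it on the evaluator directly; a ratio below 1 can still be worthwhile for the parallelism and existence-proof reasons listed above.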
### Results
The opportunity cost is computed using Guesstimate models linked below, based on survey data from participants collected after the experiment. We are attempting to include both hourly value of time and value of altruistic externalities. We did not include the time that our own team spent figuring out and organising this work.
For example, the estimated cost ratio for the network-adjacent forecasters in this experiment was 120%, meaning that the cost of obtaining a final aggregate prediction for a question was 20% higher when asking this group of 19 forecasters than when asking Elizabeth directly, all things considered.
The value is computed using the following model (interactive calculation linked below). We assume Elizabeth is an unbiased evaluator, so the true value of a question is the mean of her resolution distribution. We then treat this point estimate as the _true_ resolution, and compare to it the scores of Elizabeth's resolution (had it been a prediction) vs. her initial prior, and the final aggregate vs. her initial prior. All scores are weighted by the importance of the question, as assigned by Elizabeth on a 1-10 scale \[14\].
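A sketch of that value model, under the stated assumption that the resolution mean serves as ground truth. The densities, truth bins, and importance weights below are invented; the linked Observable notebook holds the actual calculation:

```python
import math

def log_score(density, idx):
    """Log of the density at the bin containing the assumed truth."""
    return math.log(density[idx])

def value_ratio(questions):
    """Importance-weighted score gain of the aggregate over the prior,
    divided by the same gain for the evaluator's resolution.
    Each question is (prior, aggregate, resolution, truth_index, weight)."""
    num = den = 0.0
    for prior, agg, res, idx, w in questions:
        base = log_score(prior, idx)
        num += w * (log_score(agg, idx) - base)
        den += w * (log_score(res, idx) - base)
    return num / den

# Two hypothetical questions with 4-bin densities.
questions = [
    ([0.30, 0.40, 0.20, 0.10], [0.15, 0.30, 0.40, 0.15],
     [0.05, 0.15, 0.70, 0.10], 2, 7),
    ([0.25, 0.25, 0.25, 0.25], [0.10, 0.50, 0.30, 0.10],
     [0.05, 0.80, 0.10, 0.05], 1, 4),
]
ratio = value_ratio(questions)
```

A ratio of 1 would mean the forecaster aggregate was as informative as the full resolution; a negative ratio means the aggregate was worse than just trusting the prior, as happened with the crowdworker group.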
Results were as follows.
![](images/50453b84385fa25f5a934570cfa2bc6702869748.png)
_(Links to models: network-adjacent_ _[cost ratio](https://www.getguesstimate.com/models/14521)_ _and_ _[value ratio](https://observablehq.com/@jjj/amplification-effectiveness), online crowdworker_ _[cost ratio](https://www.getguesstimate.com/models/14614)_ _and_ _[value ratio](https://observablehq.com/@jjj/amplification-effectiveness-positly).)_
The negative value ratio for the control group indicates that they assigned a lower probability to the mean of Elizabeths resolution than she herself did when submitting her prior. Hence just accepting the means from those forecasts would have made us worse off, epistemically, than trusting the priors.
This observation is in tension with some of the above graphs, which show a tiny increase in average log score between crowdworkers and Elizabeths priors. We are somewhat uncertain about the reason for this, though we think it is as follows: the crowdworkers were worse at capturing the resolution means than the prior was, but they were sometimes better at capturing the resolution distribution (likely because averaging their forecasts added more uncertainty). And the value ratio only measures the former of those improvements.
Another question to consider when thinking about cost-effectiveness is diminishing returns. The following graph shows how the information gain from additional predictions diminished over time.
![](images/6181a43348526199a4746e6da4bbdc96afdd823b.png)
The x-axis shows the number of predictions after Elizabeths prior (which would be prediction number 0). The y-axis shows how much closer to a perfect score each prediction moved the aggregate, as a percentage of the distance between the previous aggregate and the perfect log score of 0 \[15\].
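That y-axis metric can be reconstructed as follows; this is a sketch assuming log scores with 0 as the perfect score, and the score sequence below is hypothetical:

```python
def marginal_improvements(log_scores):
    """Given the aggregate's log score after each prediction (starting
    with the prior), return how much closer each prediction moved the
    aggregate to the perfect score of 0, as a fraction of the
    previously remaining distance."""
    out = []
    for prev, curr in zip(log_scores, log_scores[1:]):
        out.append((curr - prev) / (0 - prev))  # fraction of gap closed
    return out

# Hypothetical aggregate scores: prior, then after predictions 1-3.
scores = [-2.0, -1.0, -0.8, -0.9]
gains = marginal_improvements(scores)
# The first prediction closed half the gap, the second a fifth of the
# remainder, and the third made the aggregate slightly worse.
```

Negative entries correspond to predictions that moved the aggregate away from the eventual resolution, which is what the crowdworker curve shows throughout.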
We observe that for the network-adjacent forecasters, the majority of value came from the first two predictions, while the online crowdworkers never reliably reduced uncertainty. Several hypotheses might explain this, including that:
* The first predictor on most questions was also one of the best participants in the experiment
* Most of the value of the predictors came from increasing uncertainty, and already after averaging 2-3 distributions we had gotten most of the effect there
* Later participants were anchored by the clearly visible current aggregate and prior predictions
Future experiments might attempt to test these hypotheses.
# Perspectives on impact and challenges
This section summarises some different perspectives on what the current experiment is trying to accomplish and why that might be exciting, as well as some of the challenges it faces. To keep things manageable, we simply give a high-level overview here and discuss each point in more detail in [a separate post](https://forum.effectivealtruism.org/posts/ZCZZvhYbsKCRRDTct/amplifying-generalist-research-via-forecasting-models-of).
There are several perspectives here given that the experiment was designed to explore multiple relevant ideas, rather than testing a particular, narrow hypothesis.
As a result, the current design is not optimising very strongly for any of these possible uses, and it is also plausible that its impact and effectiveness will vary widely between uses.
## Perspectives on impact
* **Mitigating capacity bottlenecks.** The effective altruism and rationality communities face rather large bottlenecks in many areas, such as allocating funding, delegating research, [vetting](https://forum.effectivealtruism.org/posts/G2Pfpkcwv3bJNF8o9/ea-is-vetting-constrained) [talent](https://forum.effectivealtruism.org/posts/jmbP9rwXncfa32seH/after-one-year-of-applying-for-ea-jobs-it-is-really-really) and [reviewing content](https://www.lesswrong.com/posts/qXwmMkEBLL59NkvYR/the-lesswrong-2018-review). The current setup might provide a means of mitigating some of those -- a scalable mechanism of outsourcing intellectual labor.
* **A way for intellectual talent to build and demonstrate their skills.** Even if this set-up cant make new intellectual progress, it might be useful to have a venue where junior researchers can demonstrate their ability to predict the conclusions of senior researchers. This might provide an objective signal of epistemic abilities not dependent on detailed social knowledge.
* **Exploring new institutions for collaborative intellectual progress.** Academia has a vast backlog of promising ideas for institutions to help us think better in groups. Currently we seem bottlenecked by practical implementation and product development.
* **Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda.** These ideas inspired the experiment. (However, our aim was more practical and short-term, rather than looking for theoretical insights useful in the long-term.)
* **Exploring forecasting with distributions.** Little is known about humans forecasting with full distributions rather than point estimates (e.g. “79%”), partly because there hasn't been easy tooling for such experiments. This experiment gave us some cheap data on this question.
* **Forecasting fuzzy things.** A major challenge with forecasting tournaments is the need to specify questions concretely, so that it is clear who was right and payouts can be allocated. The current experiment tries to get the best of both worlds -- the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions.
* **Shooting for unknown unknowns.** In addition to being an “experiment”, this project is also an “exploration”. We have an intuition that there are interesting things to be discovered at the intersection of forecasting, mechanism design, and generalist research. But we don't yet know what they are.
## Challenges and future experiments
* **Complexity and unfamiliarity of experiment.** The current experiment had many technical moving parts. This makes it challenging to understand for both participants and potential clients who want to use it in their own organisations.
* **Trust in evaluations.** The extent to which these results are meaningful depends on your trust in Elizabeth Van Nostrand's ability to evaluate questions. We think this is partly an inescapable problem, but we also expect clever mechanisms and more transparency to enable large improvements.
* **Correlations between predictions and evaluations.** Elizabeth had access to a filtered version of forecaster comments when she made her evaluations. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic in the experiments.
* **Difficulty of converting mental models into quantitative distributions.** It's hard to turn nuanced mental models into numbers. We think a solution is to have a “division of labor”, where some people just build models/write comments and others focus on quantifying them. We're working on incentive schemes that work in this context.
* **Anti-correlation between importance and “outsourceability”.** The intellectual questions that are most important to answer might differ from those that are easiest to outsource, in a way that leaves little value to be captured by outsourcing.
* **Overhead of question generation.** Creating good forecasting questions is hard and time-consuming, and better tooling is needed to support this.
* **Overly competitive scoring rules.** Prediction markets and tournaments tend to be zero-sum games, with negative incentives for helping other participants or sharing best practices. To solve this, we're designing and testing improved scoring rules which directly incentivise collaboration.
# Footnotes
\[1\] Examples include: AI alignment, global coordination, macrostrategy and cause prioritisation.
\[2\] We chose the industrial revolution as a theme since it seems like a historical period with many lessons for improving the world. It was a time of radical change in productivity along with many societal transformations, and might hold lessons for future transformations and our ability to influence those.
\[3\] Some readers might also prefer the terms “integration experiment” and “sandbox experiment”.
\[4\] In traditional forecasting tournaments, participants state their beliefs in a binary event (e.g. “Will team X win this basketball tournament?”) using a number between 0% and 100%. This is referred to as a credence, and it captures their uncertainty in a quantitative way. The terminology comes from Bayesian probability theory, where rational agents are modelled as assigning credences to claims and then updating those credences on new information, in a way uniquely determined by Bayes' rule. However, as humans, we might not always be sure what the right credence for a claim is. If I had unlimited time to think, I might arrive at the right number. (This is captured by the “after 10 more hours of research” claim.) But if I don't have a lot of time, I have some uncertainty about exactly how uncertain I should be. This is reflected in our use of distributions.
\[5\] In scaling the number of claims beyond what Elizabeth can evaluate, we would also have to proportionally increase the rewards.
\[6\] Many of these participants had previous experience with forecasting, and some were “superforecaster-equivalents” in terms of their skill. Others had less experience with forecasting but were competent in quantitative reasoning. For future experiments, we ought to survey participants about their previous experience.
\[7\] The payments were doubled after we had seen the results, as the initial scoring scheme proved too harsh on forecasters.
\[8\] The incentive schemes looked somewhat different between groups, mostly because we tried to make the experiment simpler to understand for the online crowdworkers, who to our knowledge had no prior experience with forecasting. They were each paid at a rate of ~$15 an hour, with the opportunity for the top three forecasters to receive a bonus of $35.
\[9\] Elizabeth did this by copying the claims into a Google doc, numbering them, and then using Google's random number generator to pick claims. For a future scaled-up version of the experiment, one could use the [public randomness beacon](https://www.google.com/search?q=public+randomness+beacon&oq=public+randomness+beacon&aqs=chrome..69i57j33.2772j0j1&sourceid=chrome&ie=UTF-8) as a transparent and reproducible way to sample claims.
\[10\] In analysing the data we also plotted 95% confidence intervals by multiplying the standard error by 1.96. In that graph the two lines intersect for something like 80%-90% of the x-axis. You can plot and analyse them yourself [here](https://observablehq.com/@nunosempere/plots-for-the-amplification-experiment).
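As a sketch of that calculation (with made-up scores, not the experiment's data, and function names of our own choosing), a 95% confidence interval for a mean can be computed as:

```python
import math

def confidence_interval_95(values):
    """Approximate 95% CI for the mean: mean +/- 1.96 standard errors."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    se = math.sqrt(variance / n)  # standard error of the mean
    return mean - 1.96 * se, mean + 1.96 * se

scores = [10, 12, 8, 14, 6, 10, 12, 8]  # hypothetical scores
low, high = confidence_interval_95(scores)
print(round(low, 2), round(high, 2))  # an interval around the sample mean of 10
```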
\[11\] We only display the first four resolutions (which were randomly chosen in the course of the experiment) so as not to take up too much space. All resolution graphs can be found [here](https://observablehq.com/@jjj/untitled/2).
\[12\] The distributions are calculated using Monte Carlo sampling and kernel smoothing, so they are not perfectly smooth. This also led to errors in which bounds fell outside the 0 to 100 range.
\[13\] For this experiment, Elizabeth informally reports that the time saved ranged from 0-60 minutes per question, but she did not keep the kind of notes required to estimate an average.
\[14\] This is a rough way of calculating this, and we can imagine there being better ways of doing it. Suggestions are welcome.
\[15\] Using this transformation allows us to visualise the fact that smaller scores obtained later in the contest can still be as impressive as earlier scores. For example, moving from 90% confidence to 99% confidence takes roughly as much evidence as moving from 50% to 90% confidence. Phrased in terms of odds ratios, both updates involve evidence of strength roughly 10:1.
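A minimal sketch of that odds-ratio arithmetic (the probabilities are just the ones quoted above):

```python
def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1 - p)

def evidence_strength(p_before, p_after):
    """Odds ratio (Bayes factor) needed to move a credence from p_before to p_after."""
    return odds(p_after) / odds(p_before)

# Both updates require evidence of strength roughly 10:1:
print(round(evidence_strength(0.50, 0.90), 1))  # 9.0
print(round(evidence_strength(0.90, 0.99), 1))  # 11.0
```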
# Participate in future experiments or run your own
[Foretold.io](https://www.lesswrong.com/posts/wCwii4QMA79GmyKz5/introducing-foretold-io-a-new-open-source-prediction) was built as an open platform to enable more experimentation with prediction-related ideas. We have also made [data and analysis calculations](https://observablehq.com/@jjj/untitled/2) from this experiment publicly available.
If you'd like to:
* Run your own experiments on other questions
* Do additional analysis on this experimental data
* Use an amplification set-up within your organisation
We'd be happy to consider providing advice, operational support, and funding for forecasters. Just comment here or reach out to [this email](mailto:jacob@parallelforecast.com).
If you'd like to participate as a forecaster in future prediction experiments, you can [sign up here](https://mailchi.mp/60b8ea91e592/ol3ptgmr5d).
# Acknowledgements
Funding for this project was provided by the Berkeley Existential Risk Initiative and the EA Long-term Future Fund.
We thank Beth Barnes and Owain Evans for helpful discussion.
We are also very thankful to all the participants.

A review of two free online MIT Global Poverty courses
==============
## Introduction
MIT's Poverty Action Lab has made their courses available online, and this is a big deal because the teaching is of excellent quality. I've taken _Microeconomics_, _Evaluating Social Programs_, _The Challenges of Global Poverty_ and _Foundations of Development Policy_.
Regarding the first two courses, I remember that _Evaluating Social Programs_ delivers on the title, as does the book _Running Randomized Evaluations: A Practical Guide_. I remember _Microeconomics_ as being a solid introduction to the subject by a competent teacher of above-average charisma.
Below is a review of the last two courses: _The Challenges of Global Poverty_ and _Foundations of Development Policy_, written while they were fresh in my mind, while the teachers hadn't yet won the Nobel Prize in Economics, and lightly edited thereafter. Both courses scavenge from the book _Poor Economics_, but also expand upon it. Because they have significant overlap, I would recommend just the second, _Foundations of Development Policy_, which covers the material in more depth.
## General structure
The general structure of each unit of both courses is:
* A question is presented
* Different options, all of which _a priori_ sound plausible, are discussed.
* An experiment or a series of experiments is presented which answers the question in a manner more decisive than speculation could.
I think that constant repetition of this structure pushes the point that you really can't trust which arguments seem more plausible to you, or trust that you're being presented with a plausible model of impact; instead, you have to go and do the experiment. This is one of the important intuitions which the course helps one understand on a deep level.
When the professors deviate from that general structure, they become less convincing. In particular, they seem to really dislike macroeconomics / geopolitics. As an example which really stuck out for me, Ben Olken [helped increase Pakistan's tax revenue](https://www.povertyactionlab.org/evaluation/incentivizing-property-tax-inspectors-through-performance-based-postings-pakistan), which is not clearly net positive to me. On the one hand, individuals have less money; on the other, I could imagine the marginal revenue being invested in schools, used for something essentially useless, or spent on defense, whose value might be negative depending on the geopolitical situation.
## An example unit: Nutrition and Productivity.
The question presented is whether there is a nutrition-based poverty trap. A mathematical formalism for a poverty trap is presented, in which wealth at time t+1 depends on wealth at time t. A poverty trap appears if falling below a wealth threshold leads to further sliding down, that is, if the relationship between wealth at time t and t+1 looks like:
![](images/be7f99df5b8aa33ef7aadd37a7560aa24505e5d9.png) as opposed to like
![](images/8b0e4c8fbb9b0400998fbbc58158fa7c79aaeebb.png)
(In both cases, you start at y0 at time 0, move to y1 at time 1, to y2 at time 2, etc.)
And this formalism is then applied to the case of nutrition:
* On day n, you have wealth W(n)
* On day n, you consume an amount of food F(n) = f(W(n))
* On day n+1, your production depends on how well you've eaten the day before, that is W(n+1) = g(F(n)) = g(f(W(n))) = h(W(n)), so ultimately, wealth on day n+1 depends on wealth on day n.
The question is whether, in practice, a poverty trap mediated by nutrition arises. A necessary and sufficient mathematical condition for a poverty trap is not just stated; efforts are made so that students without a mathematical background can understand why it holds.
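To make the formalism concrete, here is a small simulation of the wealth map W(n+1) = h(W(n)), with illustrative functional forms of our own choosing (not taken from the course):

```python
def iterate(h, w0, steps=100):
    """Iterate the wealth dynamics W(n+1) = h(W(n)) and return final wealth."""
    w = w0
    for _ in range(steps):
        w = h(w)
    return w

# Poverty trap: the map lies below the 45-degree line for W < 20,
# so anyone starting below that threshold slides toward zero.
def h_trap(w):
    return min(w ** 2 / 20, 40)

# No trap: a single stable equilibrium at W = 25, reached from any start.
def h_no_trap(w):
    return 5 + 0.8 * w

print(iterate(h_trap, 15))     # collapses toward 0
print(iterate(h_trap, 25))     # grows to the cap at 40
print(iterate(h_no_trap, 1))   # converges to ~25 from below
print(iterate(h_no_trap, 40))  # converges to ~25 from above
```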
Then, some data is presented, which attempts to solve the problem. First, some cross-sectional data, that is, getting data from the population in general, without conducting a trial. But this is insufficient. Then, they present data from a GiveDirectly randomized trial, and give some details as to its implementation. From this first randomized trial, they get some numbers for how calories and food expenditure vary with wealth. Then, they look at another randomized trial which measures the impact of calories on productivity (by providing laborers different amounts of calories, and paying them proportionally per unit of output). Putting both things together, they conclude that a poverty trap as mediated by day-to-day nutrition is unlikely (as one might have intuited).
However, child nutrition could cause such a poverty trap, because nutrition deficiencies as a child could affect productivity throughout life. They apply the model again, looking at the impact of deworming (because if worms don't eat your food, you do) and see that it does have a significant impact on wages later in life.
To complement the lectures, a bibliography is provided:
* _Poor Economics_: Chapter 2.
* "Household Response to Income Changes: Evidence from an Unconditional Cash Transfer Program in Kenya" (Haushofer and Shapiro, 2013).
* "Giffen behavior and subsistence consumption" (Jensen and Miller, 2008).
* "Causal effect of health on labor market outcomes: Experimental evidence" (Thomas et al., 2006).
* "Worms at work: Long-run impacts of child health gains" (Baird, Hicks, Kremer, and Miguel, 2012).
* "Are there nutrient-based poverty traps? Evidence on iron deficiency and schooling attainment in Peru" (Chong et al., 2014).
* Video: "The Name of the Disease"
* Optional: "Wealth, Health, and Health Services in Rural Rajasthan" (Banerjee, Deaton and Duflo, 2004).
* Optional: _Poor Economics_: Chapter 3.
The readings comprise ~300 pages per week; the size of a small book. In other words, the student has the possibility of digging pretty deep, which I generally did. It was a significant time investment.
As homework, some data from Kremer's deworming project in Kenya (see the readings) is provided, and one plays around with it in R and does some analysis, which gets more complicated as the course progresses. Then, some questions about an unrelated 45-page report on malaria nets are asked.
For a higher-level overview, [here](https://nunosempere.github.io/ea/14.740x_Syllabus.pdf) is the course syllabus.
## R
Throughout the courses, I picked up R (a programming language well-suited for analyzing datasets). I then read parts of the book [R for Data Science](https://r4ds.had.co.nz/), which is available for free online, and used it for [some self-experimentation in calibration](https://nunosempere.github.io/rat/Self-experimentation-calibration.html) (using the exercises for these courses as a source of data), for a data science hackathon, [to analyze the results of an EA mental health survey](https://nunosempere.github.io/rat/eamentalhealth/analysis/writeup), etc. The courses definitely helped, but I think that personal interest plus previous experience with programming also played an important role.
## Quality of the teachers & of the pedagogy
I feel that the quality of the teachers is very noticeably better than that of the professors either at the University of Vienna or at the Autonomous University of Madrid (maths and philosophy, at the undergraduate level). Perhaps they more closely approach the ideal of a [zetetic explanation](https://www.lesswrong.com/posts/i2Dnu9n7T3ZCcQPxm/zetetic-explanation), approaching a topic from many different angles. Perhaps they just have a _depth_ to them. If a student asks a miscellaneous question, they can answer it, and they take the time to do it. They express ideas clearly. At the beginning of the course, you're asked to write down your reasons for taking it, and to find an accountability partner.
It might be the case that the teachers have a carefully-considered cognitive model of the student (see: [Why books don't work: Why lectures don't work](https://andymatuschak.org/books/)). They're good pedagogues. They ask questions to the students, and keep the lectures engaging. Implicitly, all the students in the recorded classroom have read all the recommended texts beforehand, so the interaction is meaningful. The online TAs were also quite helpful; on some occasions I asked for additional information, and they could point me to relevant resources.
Some of the RCTs which are discussed have been carried out by the professors themselves, and this is not a function of their vanity, but of the relevance of these RCTs. \[Note: In my defense, that sentence was written before the profs won the Nobel Prize, and while I was still getting an understanding of the impact of the randomista movement in general and of JPAL in particular\]
## Value of the certification
I'm very uncertain about the signalling value of the certification. I myself just audited the courses (i.e., did not pay edx/MIT $1,000 for a certificate), but did all the exercises and the final online exam. There is also a proctored exam, which I didn't take.
## Call to action and epistemic status
The [MIT edx courses on Data, Economics, and Development Policy](https://micromasters.mit.edu/dedp/) reopen this upcoming February, so if the [course syllabus](https://nunosempere.github.io/ea/14.740x_Syllabus.pdf) appeals to you, you might want to set some time aside this week to consider whether you want to take any of these courses, and whether that makes sense in your particular situation. In my experience, it helps if you pick a concrete time (11 pm on Friday), instead of a fuzzy "I will do this sometime this week".
Regarding my epistemic status, I can vouch for the quality of the content and of the pedagogy, but not for the signalling value. Models of which kinds of EAs are likely to get career capital from this kind of online course are very welcome; Charity Entrepreneurship mentions [taking online courses](https://forum.effectivealtruism.org/posts/QxCpXjGmHbpX45nxo/how-to-increase-your-odds-of-starting-a-career-in-charity#Possible_actions) as a factor which would increase your odds of starting a career in charity entrepreneurship.
I also strongly suspect that this post is the result of a selection effect, both in terms of liking these courses in particular, and online courses in general, more than average. For example, I much prefer the flexibility of online courses and I'm happy to provide my own motivation structures, perhaps to an unusual degree. I thought that the review was worth posting anyways.

A review of two books on survey-making
==============
Reading time: 15 mins. Epistemic status:
> Simplicio: I have a question. When people make surveys, how do they make sure that the questions measure what they want to measure?
> Salviati: Woe be upon me.
## Introduction
The two books reviewed: _The Power of Survey Design_ (TPOSD) and _Improving survey questions: Design and Evaluation_ (IDS) have given me an appreciation of the biases and problems that are likely to pop up when having people complete surveys. This knowledge might be valuable both for those in the EA and rationality communities who find themselves making surveys, and for those who have to interpret them.
## For the eyes of those who are designing a survey
You might want to read this review for kicks, and then:
a) If you don't want to spend too much time, the Pareto-principle thing to do might be to use [this checklist](https://nunosempere.github.io/ea/Surveys/Checklist) and [this list of principles](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/ListOfPrinciplesISD.pdf), both from the books under review. I've also found [this summary](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/Summary.pdf) to be very nonthreatening.
b) If you want to spend a moderate amount of time:
* Chapter 3 of _The Power of Survey Design_ (68 pages) and/or Chapter 4 of _Improving survey questions_ (22 pages) for general things to watch out for when writing questions. Chapter 3 of _The Power of Survey Design_ is the backbone of the book.
* Chapter 5 of _The Power of Survey Design_ (40 pages) for how to use the dark arts to have more people answer your questions willingly and happily.
c) For even more detail:
* Chapters 2 and 3 of _Improving survey questions_ (38 and 32 pages, respectively) for considerations on gathering factual and subjective data, respectively.
* Chapter 5 of _Improving survey questions_ (25 pages) for how to evaluate/test your survey before the actual implementation.
* Chapter 6 of _Improving survey questions_ (12 pages) for advice about trying to find something like hospital records to validate your questionnaire against, or about repeating some important questions in slightly different forms and getting really worried if respondents answer differently.
* The introductions, i.e. Chapter 1 and 2 of _The Power of Survey Design_ (9 and 22 pages, respectively), and Chapter 1 of _Improving survey questions_ (7 pages) if introductions are your thing, or if you want to plan your strategy. In particular, Chapter 2 of TPOSD has a cool Gantt chart.
[Here is the index for _Improving Survey Questions_](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/ISDIndex.pdf) and [here is the index for _The Power of Survey Design_](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/PrinciplesOfSurveyDesignIndex.pdf). [libgen.io](http://libgen.io) will be of use if you want an electronic copy. Note that using that webpage might only be legal if you already own a physical copy of the books, depending on your jurisdiction. Also note that the World Bank [offers _The Power of Survey Design_ for free](http://documents.worldbank.org/curated/en/726001468331753353/The-power-of-survey-design-a-users-guide-for-managing-surveys-interpreting-results-and-influencing-respondents).
Both books are dated in some respects; neither mentions online surveys, and both place more emphasis on field surveys. However, I think that on the broad principles and considerations, both books remain useful guides. Nonetheless, I don't have any particular attachment to these two books; I'd expect that any book on survey-making by an author who worked in the field professionally, or published by a university press, is likely to be roughly as useful as the two above.
## Some ways in which people are inconsistent or incoherent when answering survey questions
For the casual reader, here is a nonexhaustive collection of curious anecdotes mentioned in the first book.
* A Latinobarometro poll in 2004 showed that while a clear majority (63 percent) in Latin America would never support a military government, 55 percent would not mind a nondemocratic government if it solved economic problems.
* When asked about a fictitious “Public Affairs Act”, one-third of respondents volunteered an answer.
* The choice of numeric scales has an impact on response patterns: Using a scale which goes from -5 to 5 produces a different distribution of answers than using a scale that goes from 0 to 10.
* The order of questions influences the answer, and so does wording as well: framing the question with the term "welfare" instead of with the formulation "incentives for people with low incomes" produces a large effect.
* Options that appear at the beginning of a long list seem to have a higher likelihood of being selected. For example, when alternatives are listed from poor to excellent rather than the other way around, respondents are more likely to use the negative end of the scale. Unless the list is read out loud, as in a phone interview, in which case the last options are more likely to be chosen.
* When asked whether they had visited a doctor in the last two weeks, if respondents have had a recent doctor visit, but not one within the last 14 days, there is a tendency to want to report it. In essence, they feel that accurate reporting really means that they are the kind of person who saw a doctor recently, if not exactly and precisely within the last two weeks.
* The percentage of people supporting US involvement in WW2 almost doubled if the word "Hitler" appeared in the question.
These examples highlight the point that people do not always have consistent opinions which are elicited by the question, but that the spectrum of answers is influenced by the wording of the question.
## Influencing respondents
Chapter 5 of _The Power of Survey Design: A User's Guide for Managing Surveys, Interpreting Results, and Influencing Respondents_ goes over the influencing-respondents part. The author has spent way more time thinking about the topic than the survey-taker, and can thus nudge them.
On the one hand, the author could write biased questions with the intention of eliciting the answers he wishes to obtain. However, this chapter places more emphasis on the following: once good questions have been written, how do you convince people, perhaps initially reluctant, to take part in your survey? How do you get them to answer sensitive questions truthfully? Some mechanisms to do this are:
### Legitimacy and appearances.
At the beginning, make sure to assure legal confidentiality, maybe research the relevant laws in your jurisdiction and make reference to them. Name drop sponsors, include contact names and phone numbers. Explain the importance of your research, its unique characteristics and practical benefits.
Part of signalling confidentiality, legitimacy, and competence involves actually doing the thing. This is also the case for explaining the purpose of your survey, and arguing that its goals are aligned with the goals of the respondent. For example, if you assure legal confidentiality, but then ask for information which would permit easy deanonymization, people might notice and get pissed. But another part consists of merely being aware of the legitimacy dimension.
The first questions should be easy, pleasant, and interesting. Build up confidence in the survey's objective, and stimulate the respondent's interest and participation by making sure that they are able to see the relationship between the question asked and the purpose of the study. Don't ask sensitive questions at the beginning of your survey. Instead, allow time for the respondent's System 1 to accept your claim of legitimacy. It is also recommended that sensitive questions be made longer, as they are then perceived as less threatening. Part of that length can be a sentence or two explaining that all the options are ultimately acceptable, and that the answerer won't be judged for them.
### Don't bore the answerer.
Cooperation will be highest when the questionnaire is interesting and when it avoids items that are difficult to answer, time-consuming, or embarrassing. An example of this might be starting with a prisoner's dilemma with real payoffs, which might double as the monetary incentive to complete the survey. Or, more generally, starting with open questions.
It serves no purpose to ask the respondent about something he or she does not understand clearly or that is too far in the past to remember correctly; doing so generates inaccurate information.
Don't ask a long sequence of very similar questions. This bores and irritates people, which leads them to answer mechanically. A term used for this is acquiescence bias: in questions with an “agree-disagree” or “yes-no” format, people tend to agree or say yes even when the meaning is reversed. In long lists of questions on a “0-5” scale, people tend to choose 2.
On the other hand, don't make questions too hard. In general, telling respondents a definition and asking them to classify themselves is too much work.
### Elite respondents
Elites are apparently quickly irritated if the topic of the questions is not of interest to them. Vague queries generate a sense of frustration, and lead to a perception that the study is not legitimate. Oversimplifications are noticed and disliked.
To mitigate this, one might start with a narrative question, and add open questions at regular intervals throughout the form. Elites “resent being encased in the straightjacket of standardized questions” and feel particularly frustrated if they perceive that the response alternatives do not accurately address their key concern.
In general, a key component of herding elite respondents is to match the level of cognitive complexity of the question with the respondent's level of cognitive ability, as not doing so leads to frustration. Overall, it seems to me that the concept of "elite respondent" applies to the highly intelligent and ambitious crowd characteristic of the EA and rationality movements.
## Useful categories.
### Memory
Events less than two weeks in the past can be remembered without much error. There are several ways in which people can estimate the frequency with which something happens, i.e., answer questions of the form “How often does X happen?”, chiefly:
* Availability heuristic: How easy is it to remember or come up with instances of X?
* Episodic enumeration: Recalling and counting occurrences of an event. How many individual instances of X can you come up with?
* Resorting to some sense of normative frequency: How often _should_ one wash one's hands?
* etc.
Of these, episodic enumeration turns out to be the most accurate, and people employ it more when there are fewer instances of the event in question. The wording of the question might be changed to facilitate episodic enumeration, even explaining the concept explicitly and asking the respondent to commit to using it.
Asking a longer question, and communicating to respondents the significance of the question, has a positive effect on the accuracy of the answer. For example, one might employ phrasings such as “please take your time to answer this question,” “the accuracy of this question is particularly important,” or “please take at least 30 seconds to think about this question before answering”.
If you want to measure knowledge, take into account that recognizing is easier than recalling. More people will be able to recognize a definition of effective altruism than be able to produce one on their own. Furthermore, if you use a multiple-choice question with n options, and x% of people know the answer while (100-x)% don't, you might expect (100-x)/n % of people to guess correctly by chance, so you'd see that y% = x% + (100-x)/n % selected the correct option.
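That guessing correction can be sketched and inverted (hypothetical numbers; the function names are ours):

```python
def observed_correct(x, n):
    """Observed % correct when x% truly know the answer and the rest guess among n options."""
    return x + (100 - x) / n

def true_knowledge(y, n):
    """Invert the correction: solve y = x + (100 - x)/n for x."""
    return (n * y - 100) / (n - 1)

# With 4 options, an observed 55% correct implies only 40% truly knew:
print(observed_correct(40, 4))  # 55.0
print(true_knowledge(55, 4))    # 40.0
```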
### Consistency and Ignorance.
In one of our examples at the beginning, one third of respondents gave an opinion about a fictitious Act. This generalizes; respondents rarely admit ignorance. It is thus a good idea to offer an option for “I don't know” or “I don't really care about this topic”. With regards to consistency, it is a good idea to ask similar questions in different parts of the questionnaire to check the consistency of answers.
### Subjective vs objective questions
The author of _Improving Survey Questions_ views the distinction between objective and subjective questions as very important. Apparently, there are serious metaphysical implications to the fact that there is no direct way to know about people's subjective states independent of what they tell us. To this, the author devotes a whole chapter.
Anyways, despite the lack of an independent measure, there are still things to do, chiefly:
* Place answers on a single well-defined continuum.
* Specify clearly what is to be rated.
For example, "how depressed are you?" is neither on a well-defined continuum, nor is it clear what is to be rated (as of yet). The [Beck Depression Inventory](https://en.wikipedia.org/wiki/Beck_Depression_Inventory) gives a score on a well-defined continuum and clearly specifies what is to be rated (but is, for some purposes, too long).
The author warns us about the dangers of misinterpreting subjective questions:
> "The concept of bias is meaningless for subjective questions. By changing wording, response order, or other things, it is possible to change the distribution of answers. However, the concept of bias implies systematic deviations from some true score, and there is no true score... Do not conclude that "most people favor gun control", "most people oppose abortions"... **All that happened is that a majority of respondents picked response alternatives to a particular question that the researcher chose to interpret as favorable or positive.**"
## Test your questionnaire
I appreciated the pithy phrases "Armchair discussions cannot replace direct contact with the population being analyzed" and "Everybody thinks they can write good survey questions". With respect to testing a questionnaire, the books go over different strategies and argue for some reflexivity when deciding what type of test to undertake.
In particular, the intuitive or traditional way to go about testing a questionnaire would be a focus group: you have some test subjects, have them take the survey, and then talk with them or with the interviewers. This, the authors argue, is messy, because some people might dominate the conversation out of proportion to the problems they encountered. Additionally, random respondents are not actually very good judges of questions.
Instead, no matter what type of test you're carrying out, having a spreadsheet with issues for each question, filled individually and before any discussion, makes the process less prone to social effects.
Another alternative is to try to get in the mind of the respondent while they're taking the survey. To this effect, you can ask respondents:
* to paraphrase their understanding of the question.
* to define terms
* for any uncertainties or confusions
* how accurately they were able to answer certain questions, and how likely they think they or others would be to distort answers to certain questions
* if the question called for a numerical figure, how they arrived at the number.
For example, for the question: "Overall, how would you rate your health: excellent, very good, fair, or poor?", a followup question might be: "When you said that your health was (previous answer), what did you take into account or think about in making that rating?"
Considerations about tiring the answerer still apply: a long list of similar questions is likely to induce boredom. For this reason, ISQ recommends testing "half a dozen" questions at a time.
As an aside, if you want to measure the amount of healthcare consumed in the last 6 months, you might come up with a biased estimate even if your questions aren't problematic, and this would be because the people who just died consume a lot of healthcare, but can't answer your survey.
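A toy simulation of this survivorship effect (all numbers are invented for illustration):

```python
import random

random.seed(0)

# Toy population: a few very ill people consume most of the healthcare,
# and half of them die before the survey is fielded (numbers invented).
population = []
for _ in range(100_000):
    if random.random() < 0.02:  # very ill
        population.append((50_000, random.random() < 0.5))  # (spend, died)
    else:
        population.append((1_000, False))

true_mean = sum(spend for spend, _ in population) / len(population)
survivors = [spend for spend, died in population if not died]
survey_mean = sum(survivors) / len(survivors)

# The survey only reaches survivors, so it underestimates consumption.
print(round(true_mean), round(survey_mean))
```

The survey mean comes out well below the true mean even though every individual answer is perfectly accurate, because the heaviest consumers are missing from the sample.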
## Tactics
### Be aware of the biases
Be aware of the ways a question can be biased. Don't load your questions: don't use positive or negative adjectives in your question. Take into account social desirability bias: "Do you work?" has implications with regards to status.
A good example given, which tries to reduce social desirability bias, is the following:
> Sometimes we know that people are not able to vote, because they are not interested in the election, because they can't get off from work, because they have family pressures, or for many other reasons. Thinking about the presidential elections last November, did you actually vote in that election or not?
Incidentally, self-administered surveys are great at avoiding interviewer-induced bias; respondents don't feel a need to impress anyone.
There is also the aspect of managing self-images: it's not only that the respondent may want to impress, it's also that they may want to think about themselves in certain ways. You don't want to have respondents feel they're put in a negative (that is, inaccurate) light. Respondents "are concerned that they'll be misclassified, and they'll distort the answers in a way they think will provide a more accurate picture". One countermeasure against this is to allow the respondent to give context. For example:
* How much did you drink last weekend?
* Do you feel that this period is representative?
* What is a normal amount to drink in your social context?
Thus, the survey-maker can get into their head and manage the way they perceive the questions, so as to minimize the sense that certain answers will be negatively valued. "Permit respondents to present themselves in a positive way at the same time they provide the information needed".
In questions for which biases are likely to pop up, consider explicitly explaining to respondents that giving accurate answers is the most important thing they can do. Have respondents make a commitment to give accurate answers at the beginning; it can't hurt.
This, together with legitimacy signaling, has been shown to _reduce the number of books which well-educated people report reading_.
### Don't confuse the question objective with the question.
The soundest advice any person beginning to design a survey instrument could receive is to produce a good, detailed list of question objectives and an analysis plan that outlines how the data will be used. "If a researcher cannot match a question with an objective and a role in the analysis plan, the question should not be asked", the authors claim.
Further, it is not enough to simply put your objective in question form. For example, in the previous example, the question objective could be finding out which proportion of the population votes, and simply translating it into a question (e.g., "Did you vote in the last presidential election?") is likely to turn up all kinds of interesting biases. To reduce them, one should employ bias-mitigation strategies like those above.
### An avalanche of advice.
The combined two books contain a plethora of advice, as well as examples which help build one's intuitions. To recap, if you can't afford to take the time to read a book, the pareto-principle thing to do might be to read [this checklist](https://nunosempere.github.io/ea/Surveys/Checklist), [this list of principles](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/ListOfPrinciplesISD.pdf), or [this nonthreatening summary](https://github.com/NunoSempere/nunosempere.github.io/blob/master/ea/Surveys/Summary.pdf). Some advice which didn't quite fit in the previous sections, but which I couldn't leave unmentioned, is:
* Ask one question at a time. For example: "Compared to last year, how much are you winning at life?" is confusing, and would be less so if it was divided into: "How much are you winning at life today?" and "how much were you winning at life last year?". If the question was important, a short paragraph explaining what you mean by winning at life would be in order.
* Explanatory paragraphs come before the question, not after. After the respondent thinks she has read a question, she will not listen to the definition provided afterwards. Ditto for technical terms.
* Not avoiding the use of double negatives makes for confusing sentences, like this one.
* Avoid the use of different terms with the same meaning.
* Make your alternatives mutually exclusive and exhaustive.
* Don't make all your questions too long. As a rule of thumb, keep your question under 20 words and 3 commas (unless you're trying to stimulate recall, or the question is about a sensitive topic).
* The longer the list of questions, the lower the quality of the data.
## Closing thoughts.
The rabbit hole of designing questionnaires is deep, but seems well mapped. I don't think that this needs to become common knowledge, but I expect that a small number of people might benefit greatly from the pointers given here. Boggling at the concept of a manual, I am grateful to have access to the effort of someone who has spent an inordinate amount of time studying the specific topic of interviews.

Shapley Values II: Philanthropic Coordination Theory & other miscellanea.
==============
{Epistemic status: Less confused. Much like the Matrix sequels, I expect this post to be worse than the original, but still worth having.}
## Introduction
In [Shapley values: Better than counterfactuals](https://forum.effectivealtruism.org/posts/XHZJ9i7QBtAJZ6byW/shapley-values-better-than-counterfactuals), we introduced the concept of Shapley values. This time, we assume knowledge of what Shapley Values are, and we:
* We propose a solution to _Point 4: Philanthropic Coordination Theory_ of Open Philanthropy's [Technical and Philosophical Questions That Might Affect Our Grantmaking](https://www.openphilanthropy.org/blog/technical-and-philosophical-questions-might-affect-our-grantmaking). Though by no means the philosopher's stone, it may serve as a good working solution. We remark that an old GiveWell solution might have been too harsh on the donor with whom they were coordinating.
* We explain some setups for measuring the Shapley value of forecasters in the context of a prediction market or a forecasting tournament.
* We outline an impossibility theorem for value attribution, similar to Arrow's impossibility theorem in voting theory. Though by no means original, knowing that all solutions are unsatisfactory might nudge us towards thinking about which kinds of tradeoffs we want to make. We consider how this applies to the [Banzhaf power index](https://en.wikipedia.org/wiki/Banzhaf_power_index), mentioned in previous posts.
* We consider how Shapley values fare with respect to the critiques in Derek Parfit's paper [_Five Mistakes in Moral Mathematics_](http://www.stafforini.com/docs/Parfit%20-%20Five%20mistakes%20in%20moral%20mathematics.pdf).
* We share some Shapley value puzzles: scenarios in which the Shapley value of an action is ambiguous or unintuitive.
* We propose several speculative forms of Shapley values: Shapley values + moments of consciousness, Shapley values + decision theory, and Shapley values + expected value.
* We conclude with some complementary (and yet obligatory) ramblings about Shapley values, Goodhart's law, and Stanovich's dysrationalia.
Because this post is long, images will be interspersed throughout to clearly separate sections and provide rest for tired eyes. This is a habit I have from my blogging days, though one which I have not seen used in this forum.
![](images/bc4a6add2d82c0297031b883af215a4dff297d94.png)
## Philanthropic Coordination Theory:
GiveWell posed, in 2014, the following dilemma: (numbers rounded to make some calculations easier later on):
> A donor recently told us of their intention to give $1 million to SCI. We and the donor disagree about what the right “maximum” for SCI is: we put it at $6 million, while the donor, who is particularly excited about SCI relative to our other top charities, would rather see SCI as close as possible to the very upper end of its range, meaning they would put the maximum at $8 million. (This donor is relying on our analysis of room for more funding but has slightly different values.)
> If we set SCI's total target at $6 million, and took into account our knowledge of this donor's plans, we would recommend a very small amount of giving (perhaps none at all) this giving season, since we believe SCI will hit $6 million between the $3 million from Good Ventures, $1 million from this donor, and other sources of funding that we detailed in our previous post. The end result would be that SCI raised about $6 million, while the donor gave $1 million. On the other hand, if the donor had not shared their plans with us, and we set the total target at $6 million, we would recommend $1 million more in support to SCI this giving season; the donor could wait for the end of giving season before making the gift. The end result would be that SCI raised about $7 million, while the donor gave $1 million.
### A. What is going on in that dilemma?
(This section merely offers some indications as to what is going on. It's motivated by my intense dislike of solutions which appear magically, but might be slightly rambly. Mathematically inclined readers are very welcome to stop here and try to come up with their own solutions, while casual readers are welcome to skip to the next section.)
| Group | Outcome |
|-------|-------|
| {} | 0* |
| {A} | 6 million to SCI |
| {B} | ? |
| {A,B} | 6 million + $X to the SCI + $Y million displaced to ?? |
If we try to calculate the Shapley value in this case, we notice that it depends on what the donor would have done with their budget in the counterfactual case, where the displaced money would go to, and how much each party would have valued that.
In any case, their Shapley values are:
![](images/0b51c9065e85902d93abc4de5c676a162431fd9e.png) ![](images/48294ee8a66ec565a22576254823d2292d1d6f7b.png)
One can understand this as follows: Player A has Value({A}) already in their hand, and Player B has Value({B}), and they're considering whether to cooperate. If they do, then the surplus from cooperating is Value({A,B}) - (Value({A}) + Value({B})), and it gets shared equally:
`1/2*(Value({A,B}) - (Value({A}) + Value({B})))` goes to A, which had `Value({A})` already
`1/2*(Value({A,B}) - (Value({A}) + Value({B})))` goes to B, which had `Value({B})` already
If the donor wouldn't have known what to do with their money in the absence of GiveWell's research, the surplus increases, whereas if the donor only changed their mind slightly, the surplus decreases.
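The two-player split above can be sketched in code; the numbers in the example are illustrative, not GiveWell's:

```python
def shapley_two_players(value):
    """Shapley values for a two-player game: each player keeps their
    standalone value plus half the surplus from cooperating."""
    surplus = value["AB"] - value["A"] - value["B"]
    return {
        "A": value["A"] + surplus / 2,
        "B": value["B"] + surplus / 2,
    }

# Illustrative: A alone creates 6 units of value, B alone 0, together 8.
print(shapley_two_players({"A": 6, "B": 0, "AB": 8}))  # {'A': 7.0, 'B': 1.0}
```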
At this point, we could try to assign values to the above, look at the data and make whatever decision seems most intuitive:
> In cases where we really don't know what we're doing, like utilitarianism, one can still make System 1 decisions, but making them with the System 2 data in front of you can change your mind. (Source: [If it's worth doing, it's worth doing with made up statistics](https://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/))
We could also assign values to the above and math it out. As I was doing so, I first tried to define the problem precisely and see if I could come up with a solution. After trying to find some meaning in setting their Shapley values to be equal, and imagining some contrived moral trades, I ended up coming up with a solution involving value certificates. That is, if both GiveWell and the donor are willing to sell value certificates for the Shapley value of their donations, what happens?
While trying to find a solution, I found it helpful to specify some functions and play around with some actual numbers. In the end, they aren't necessary to _explain_ the solution, but I personally dislike it when solutions appear magically out of thin air.
For simplicity, we'll suppose that GiveWell donates $6 million, rather than donating some and moving some.
For the donor:
* The donor values $X dollars donated to SCI as f1(X)
* The donor values $Y dollars donated to GiveDirectly as f2(Y)=Y (donations to GiveDirectly seem exceedingly scalable).
* From 8 million onwards, the donor prefers donating to GiveDirectly, that is, f1' > 1 from 0 to 8 million, but f1' < 1 afterwards
For GiveWell:
* GW values $X dollars donated to SCI as g1(X)
* GW values $Y dollars donated to GiveDirectly as g2(Y)=Y
* From 6 million onwards, GW prefers donating to GiveDirectly, that is, g1' > 1 from 0 to 6 million, but g1' < 1 afterwards.
Some functions which satisfy the above might be f1(X) = (8^0.1/0.9)\*X^0.9 and g1(X) = 2\*sqrt(6)\*sqrt(X) (where X is in millions).
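As a quick numerical sanity check, taking f1(X) = (8^0.1/0.9)\*X^0.9 and g1(X) = 2\*sqrt(6)\*sqrt(X), the marginal value of an SCI dollar crosses 1 (the value of a GiveDirectly dollar) at $8M for the donor and at $6M for GiveWell:

```python
def f1(x):  # donor's value of $x million donated to SCI
    return (8 ** 0.1 / 0.9) * x ** 0.9

def g1(x):  # GiveWell's value of $x million donated to SCI
    return 2 * 6 ** 0.5 * x ** 0.5

def deriv(f, x, h=1e-6):
    """Central-difference numerical derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

assert deriv(f1, 7) > 1 > deriv(f1, 9)   # f1' crosses 1 around x = 8
assert deriv(g1, 5) > 1 > deriv(g1, 7)   # g1' crosses 1 around x = 6
print(round(deriv(f1, 8), 3), round(deriv(g1, 6), 3))  # both ~1.0
```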
A = GiveWell; B = The donor.
| Group | Outcome |
|-------|-------|
| {} | 0* |
| {A} | 6 million to the SCI |
| {B} | 0* |
| {A,B} | 6 million + $X to the SCI + $Y to GiveDirectly |
What are the units for the value of these outcomes? They're arbitrary units, one might call them "utilons". Note that they're not directly exchangeable, that is, the phrases "GiveWell values the impact of one dollar donated to SCI less than the donor does", or "GiveWell values one of their (GiveWell's) utilons as much as the donor values one of their (the donor's) utilons" might not be true, or even meaningful sentences. I tried to consider _two_ tables of Shapley values, one for GW-utilons and another one for donor-utilons, which aren't directly comparable. However, that line of reasoning proved unfruitful.
Now, from a Shapley value perspective, GiveWell gets the impact of the first 6 million donated to SCI, and half the value of every additional dollar donated by the donor (minus the donor's outside option, which we've assumed to be 0), and the donor gets the other half.
### B. A value certificate equilibrium.
So we've assumed that:
* The donor's outside option is 0
* From 6 million onwards, GiveWell prefers that donations be given to GiveDirectly, but the donor prefers that they be given to SCI.
How much, then, should the donor donate to SCI? Suppose that GiveWell is offering certificates of impact for their Shapley (respectively, counterfactual) impact. Consider that the donor could spend half of his million on donations to SCI. With the other half, the donor could:
* Either buy GiveWell's share of the impact for that first half million.
* Donate it to SCI.
Because of the diminishing returns to investing in SCI, the donor should buy GiveWell's share of the impact.
Or, in other words, suppose that the donor donates $X to the SCI and $Y to GiveDirectly.
* If X=Y=0.5 million, GiveWell and the donor can agree to interchange their Shapley (respectively, counterfactual) values, so that GiveWell is responsible for the GiveDirectly donation, and the donor for the SCI donation.
* If X > 0.5 million, then the donor would want to stop donating to SCI and instead buy value certificates from GiveWell corresponding to the earlier part of the half million (which are worth more, because of diminishing returns).
* If X < 0.5 million, then the donor would want to donate more to SCI until X = 0.5 million, and then buy the certificates from GiveWell corresponding to the earlier half million.
Yet another way to see this would be, assuming the donor has a granularity of $1 (one dollar):
* The first dollar should go to the SCI. He gets some impact I from that, and GiveWell also gets some impact I', of equal magnitude.
* The second dollar could go either towards buying GiveWell's certificate of impact for the first dollar, or towards a further donation to SCI. Because of diminishing returns, buying I', the impact of the first dollar, is worth more than the impact of donating a second dollar to SCI. GiveWell automatically funnels the money from the certificate of impact to, say, GiveDirectly.
* The third dollar should go to SCI.
* The fourth dollar should go towards buying GiveWell's share of impact from the first dollar
* And so on.
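The alternating scheme above can be simulated with a simple greedy rule. This is a sketch under an illustrative diminishing-returns curve, not an exact model of the post's numbers; the point is only that donations and certificate purchases alternate, splitting the $1M evenly:

```python
def marginal_sci(level):
    """Donor's marginal value of a dollar to SCI at `level` $M of total
    funding (illustrative diminishing-returns curve, > 1 below $8M)."""
    return (8 / level) ** 0.1

step = 0.1                  # allocate the budget in $0.1M increments
donated, bought = 0.0, 0.0  # $M donated to SCI / spent on GW certificates
budget = 1.0
while budget > 1e-9:
    # The next certificate buys back GiveWell's share of an earlier (and
    # hence more valuable) SCI dollar than the next direct donation adds.
    if bought < donated and marginal_sci(6 + bought) >= marginal_sci(6 + donated):
        bought += step
    else:
        donated += step
    budget -= step

print(round(donated, 1), round(bought, 1))  # 0.5 0.5
```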
Note that, because GiveWell's alternative is known, GiveWell doesn't have to see the donor's money; he can send it directly to GiveDirectly. Alternatively, the donor and GiveWell can agree to "displace" some of GiveWell's donation, such that GiveWell donates part of the amount that they would have donated to SCI to GiveDirectly instead.
### Conclusions, remarks, and caveats.
* In short, the above solution for philanthropic coordination theory, expressed in words, might be:
> Divide the value from cooperating equally. This value depends on both what the outside options for the parties involved are, where any money displaced goes to, and how much each party values each of those options. Shapley values, as well as certificates of impact might be useful formalisms to quantify this.
* The solution as considered above seems to me to be at a sweet spot between:
* Being principled and fair. In the two player case, for Shapley values this comes from equally splitting the gains from cooperating.
* Considering most of the relevant information.
* Not being too computationally intensive. A lot of this comes from treating GiveWell/OpenPhil/Good Ventures as one cohesive agent. Further, we actually don't have access to the counterfactual world in which GiveWell doesn't exist (which is, admittedly, a weakness of the method), and we could have spent arbitrary amounts of time and effort attempting to model it. But for a moral trade of $1 million, it might actually be worth it to spend that time and effort!
* One particular simplification was considering the donor's outside option to be 0\* (relative zero), which simplified our calculations. If it had been nonzero, we would have to have considered the value to the donor and the value to GiveWell of the donor's outside option separately. This is doable, but makes the explanation of the core idea behind the solution somewhat messier; see the appendix for a worked example.
* Assuming that the outside option of the donor is 0 leads to the same solution as GiveWell's original post (split in the middle). However, it is harsh on the donor if their outside option is better, according to the Shapley values/certificates of impact formalism above. Or, in other words, the gains from cooperating might not be in the middle.
* Note that while the above may perhaps be a mathematically elegant solution, the questions "what donors like", "what narratives are more advantageous", "how do we create common knowledge that everyone is acting in good faith", and public relations in general are not being considered here _at all_. In particular, in the original GiveWell solution, the moral trade is presented in terms of "using the information advantage", which may or may not be more savvy.
* In this case, we have modelled SCI and GiveDirectly as parts of nature, rather than as agents, but modelling them as agents wouldn't have changed our conclusions (though it would have complicated our analysis). In particular, regardless of whether GiveDirectly and/or SCI are agents in our model, if the donor is willing to donate to them, they should also be willing to buy certificates of impact from GiveWell corresponding to that donation.
* When buying a certificate of impact, the donor would in fact be willing to pay _more_ than $1, because $1 dollar can't get him that much value any more, due to diminishing returns. Similarly, GiveWell would be willing to sell it for _less_ than $1, because of the same reasons; once diminishing returns start setting in, they would have to donate less than $1 to their best alternative to get the equivalent of $1 dollar of donations to SCI. I've thus pretended that in this market with one seller and one buyer, the price is agreed to be $1. Another solution would be to have an efficient market in certificates of impact.
* The value certificate equilibrium is very similar regardless of whether one is thinking in terms of Shapley values or counterfactuals. I feel, but can't prove, that Shapley values add a kind of clarity and crispness to the reasoning, if only because they force you to consider all the moving parts.
![](images/14578566a70fe4029fdf4fc0b37253dce1b735d2.jpg)
## Shapley values of forecasters.
### Shapley values are different from normal scoring in practice
Suppose that you are wrong, but you are _interestingly_ wrong. You may be very alarmed about an issue about which other people are not alarmed at all, thus moving the aggregate to a medium degree of alertness.
Or you might be the first to realize something. In a forecasting tournament, under some scoring rules, you might want to actively mislead people into thinking the opposite, for example by giving obviously flawed reasons for your position, thus increasing your own score. This is less than ideal.
[The literature](https://users.cs.duke.edu/~conitzer/predictionUAI09.pdf) points to Shapley values as being able to solve some of these problems in the context of prediction markets, while making sure that participants report their true beliefs. It can be shown that Shapley values have optimal results in the face of strategic behaviour, and that they are incentive compatible, that is, agents have no incentive to misreport their private information. In particular, if you reward forecasters in proportion to their Shapley values, they have an incentive to cooperate, which is something that is not trivial to arrange with other scoring rules.
In practice, taking the data from a [previous experiment](https://forum.effectivealtruism.org/posts/ZTXKHayPexA6uSZqE/part-2-amplifying-generalist-research-via-forecasting), I [rolled the results differently](https://observablehq.com/@nunosempere/shapley-value-experiments-part-i), trying to approximate something similar to the Shapley value of each prediction with the data at hand. And the resulting ranking did in fact look different. For example, in one question, the final market aggregate approximated the resolution distribution pretty well, but it was composed of individual forecasters who were all somewhat overconfident in their own pet region.
Most memorably, one user who has a high ranking in Metaculus, but didn't fare that well in our competition would have done much better under a Shapley-value scoring rule. In this case, what I think happened was that the user genuinely had information to contribute, but was not very familiar with distributions.
Because the Shapley value can be proved to be, in some sense, optimal with respect to incentives, and because this might make a difference in practice, I'd intuitively be interested in seeing it used to reward forecasters. However, the Shapley value is in general too computationally expensive (as of 2020), and requires information to which we don't have access. With this in mind, what follows are some approximations of the Shapley value which I think are intriguing in the context of prediction markets:
### Red Team vs Blue Team.
Given a proper scoring function which takes a prediction, a prior, and a resolution, a member of the blue team would get
`A*Score(Member) + B*AverageScore(Blue team without the member) - (A + B)*Score(Red team) (+ Constant)`
(A, B) are positive constants which determine how much you want to reward the individual as opposed to the group. As long as they're positive, the incentives remain the same.
In this setup, forecasters have the incentive to reveal all of their information to their team, to make sure the team uses it, to use the information given by their team, and to make sure that this information isn't found by the red team. Further, if both teams have similar capabilities, whoever has to pay for the forecasting tournament can decide how much to pay by choosing a suitable constant.
However, you end up duplicating efforts. This may be a positive in the cases where you really want different perspectives, and in case you don't want your forecasters to anchor on the first consensus which forms. A competitive spirit may form. However, this design is probably too wasteful.
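As a sketch, here is what the payout could look like under a logarithmic scoring rule (the scoring rule, constants, and probabilities below are illustrative assumptions, not part of the proposal):

```python
import math

def log_score(p_outcome):
    """Log score: the log of the probability assigned to the realized outcome."""
    return math.log(p_outcome)

def blue_member_payout(member_p, teammates_p, red_team_p, A=1.0, B=1.0, constant=0.0):
    """A*Score(member) + B*AverageScore(rest of blue team)
    - (A+B)*Score(red team) + constant.
    Each *_p argument is the probability that forecast assigned to the
    outcome that occurred."""
    avg_teammates = sum(log_score(p) for p in teammates_p) / len(teammates_p)
    return (A * log_score(member_p) + B * avg_teammates
            - (A + B) * log_score(red_team_p) + constant)

# A blue-team member whose team outperforms the red team gets a positive payout:
payout = blue_member_payout(member_p=0.9, teammates_p=[0.8, 0.7], red_team_p=0.6)
```

Note that scaling A and B, or adding a constant, changes the size of the payments but not which actions are rewarded.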
### Gradually reveal information
Suppose that there were some predictions in a prediction market: {Prior, P1, P2, P3, ..., Pk, ..., Pn}. A new forecaster comes in, sees only the prior, and makes his first prediction, f(0). He then sees the first prediction, P1, and makes another prediction, f(1). He sees the second prediction, and makes another prediction, f(2), and so on:
* (Prior) -> f(0)
* {P1} -> f(1)
* {P1, P2} -> f(2)
* {P1, P2, ..., Pk} -> f(k)
* {P1, P2, P3, ..., Pk, ..., Pn} -> f(n)
Let AGk be the aggregate after the first k predictions: {P1, ..., Pk}, and AG0 be the prior.
Is this enough to calculate the Shapley value? No. We would also need to know what the predictor would have done had they seen, say, only {P3}. Sadly, we can't induce amnesia on our forecasters (though we could induce amnesia on bots and on other artificial systems!). In any case, we can reward our predictor proportionally to:
`[Score(f(0)) - Score(AG0)] + [Score(f(1)) - Score(AG1)] + ... + [Score(f(n)) - Score(AGn)]`
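This cumulative reward can be sketched as follows, assuming a log scoring rule (the probabilities are made up):

```python
import math

def reveal_reward(forecaster_probs, aggregate_probs):
    """Sum over steps k of Score(f(k)) - Score(AGk), with a log score.
    Each list entry is the probability that forecast assigned to the
    outcome that eventually occurred."""
    return sum(math.log(f) - math.log(ag)
               for f, ag in zip(forecaster_probs, aggregate_probs))

# A forecaster who beats the aggregate at every step earns a positive reward:
r = reveal_reward([0.6, 0.7, 0.8], [0.5, 0.6, 0.7])
```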
We can also reward Predictor Number N in proportion to `Score(f(n)) - Score(f(n-1))`. That is, in proportion to how future forecasters improved after seeing the information which Predictor Number N produced. This has the properties that:
* Our forecaster has the incentive to make the best prediction with the information he has, at each step.
* Other forecasters have the incentive to make their predictions (which may have a comment attached) as useful and as informative as possible.
This still might be too expensive, that is, it might require too many steps, so we can simplify it further: the forecaster only makes two predictions, g(0) and g(1):
* (Prior) -> g(0)
* {P1, P2, P3, ..., Pk, ..., Pn} -> g(1)
Then, we can reward forecaster number n in proportion to:
`(Score(g(1)) - Score(g(0))) * (some measure of the similarity between g(1) and Pn, such as one derived from the KL divergence)`
while the new forecaster gets rewarded in proportion to
`Score(g(0)) - Score(prior) + Score(g(1)) - Score(AGn)`
This still preserves some of the same incentives as above, though in this case attribution becomes trickier. Further, anecdotally, seeing someone's distribution before they update gives more information than seeing their distribution after they've updated, so just seeing the contrast between g(0) and g(1) might be useful to future forecasters.
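A sketch of both reward rules for binary questions. The log score, and `exp(-KL)` as the similarity measure, are my own illustrative choices; the scheme itself only asks for some proper score and some similarity measure:

```python
import math

def log_score(p):
    """Log of the probability assigned to the outcome that occurred."""
    return math.log(p)

def kl_similarity(p, q):
    """Similarity between two binary distributions, via exp(-KL(p || q))."""
    kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return math.exp(-kl)

def earlier_forecaster_reward(g0, g1, p_n):
    """Reward for forecaster n: the newcomer's improvement after seeing the
    market, weighted by how similar the updated forecast g(1) is to P_n."""
    return (log_score(g1) - log_score(g0)) * kl_similarity(g1, p_n)

def newcomer_reward(prior, g0, g1, ag_n):
    """Score(g(0)) - Score(prior) + Score(g(1)) - Score(AGn)."""
    return (log_score(g0) - log_score(prior)) + (log_score(g1) - log_score(ag_n))

# Made-up probabilities, all assigned to the outcome that occurred:
r_earlier = earlier_forecaster_reward(g0=0.5, g1=0.8, p_n=0.8)
r_new = newcomer_reward(prior=0.5, g0=0.6, g1=0.8, ag_n=0.75)
```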
![](images/4bf82fbcce46ac1e5520ce4070c8883525c80bde.jpg)
## A value attribution impossibility theorem.
Epistemic status: Most likely not original; this is really obvious once you think about it.
The Shapley value is uniquely defined by:
* Efficiency. The sum of the values of all agents equals the value of the grand coalition.
* Equal treatment of equals. If, for every world, the counterfactual value of two agents is the same, the two agents should have the same value.
* Linearity. "If two games, played independently, are regarded as sections of a single game, the value to the players who participate in just one section is unchanged, while the value of those who play in both sections is the sum of their sectional values."
* Null player. If a player adds 0 value to every coalition, the player has a Shapley value of 0.
But there are some other eminently reasonable properties (their characterizations follow) which we would also like our value function to have:
* Irrelevance of cabals.
* Protection against parasites.
* Agency agnosticism.
* Elegance in happenstance.
Because the Shapley value is uniquely defined, and because none of the above hold for Shapley values, this constitutes an impossibility theorem.
### Irrelevance of Cabals
If a group of agents `A1, A2, A3, ...` decide to form a super-agent SA, Value(SA) should be equal to `Value(A1) + Value(A2) + Value(A3) + ...`
* If there are cases in which Value(SA) > ΣValue(A\_i), then agents will have the incentive to form cabals.
* If there are cases in which Value(SA) < ΣValue(A\_i), then agents will have the incentive to split as much as possible. This is the case for Shapley values.
One way to overcome irrelevance of cabals would be to prescribe a canonical list of agents, so that agents can't form super-agents, or, if they do, these super-agents just have as their value the sum of the Shapley values of their members. However, in many cases, talking about super-agents such as organizations, rather than about individual people, is incredibly convenient for analysis. Further, it is in some cases not clear what counts as an agent, or what has agentic properties, so we would only let go of this condition very reluctantly.
Another way to acquire this property would be to work within a more limited domain. For example, if we restrict the Shapley value formalism to the space of binary outcomes (where, for example, a law either passes or doesn't pass, or an image gets classified either correctly or incorrectly), we get the [Banzhaf power index](https://en.wikipedia.org/wiki/Banzhaf_power_index), which happens to have irrelevance of cabals because of the simplicity of the domain it considers. To repeat myself: the Banzhaf power index values, mentioned in the previous post, are just Shapley values divided by a different normalizing constant (!), and constrained to a simpler domain.
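As an illustration of that simpler domain, here is a raw Banzhaf computation for a hypothetical weighted voting game `[51; 40, 30, 30]` (the weights and quota are made up). Despite unequal weights, each voter turns out to swing the same number of coalitions:

```python
from itertools import combinations

def banzhaf_counts(weights, quota):
    """For each player, count the coalitions of the other players which
    lose without them but win with them (raw Banzhaf swing counts)."""
    n = len(weights)
    swings = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                without = sum(weights[j] for j in coalition)
                if without < quota <= without + weights[i]:
                    swings[i] += 1
    return swings

# All three players are critical in exactly the same number of coalitions:
counts = banzhaf_counts([40, 30, 30], quota=51)
```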
### Protection against parasites.
The contribution should be proportional to the effort made. In particular, consider the following two cases:
* $1000 and 100h are needed for an altruistic project. Peter Parasite learns about it and calls Anna Altruist, and she puts in the $1000 and the 100h needed for the project.
* $1000 and 100h are needed for an altruistic project. Pamela Philanthropist learns about it and calls Anna Altruist and they each cough up $500 and 50h to make it possible.
The value attribution function should deal with this situation by assigning more value to Pamela Philanthropist than to Peter Parasite. Note that Lloyd Shapley points to something similar in [his original paper](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM670.pdf) (see the comments on the axioms on p. 6 and subsequent pages of the pdf), but ultimately dismisses it.
Also note that counterfactual reasoning is really vulnerable to parasites. See the last example in the section on _Five Mistakes in Moral Mathematics_.
### Agency agnosticism.
A long, long time ago, an apple fell on Newton, and the law of gravity was discovered. I wish for a value attribution function which doesn't require me to differentiate between Newton and the apple, and define one as an agent process, and the other one as a non-agent process.
Some value functions which fulfill this criterion:
* All entities are attributed a value of 0.
* All value in the universe is assigned to one particular entity.
* All value in the universe is assigned to all possible entities.
This requirement might be impossible to fulfill coherently. If it _is_ fulfilled, it may produce weird scenarios, such as "Nature" being reified into being responsible for billions of deaths. Failing this, I wish for a canonical way to decide whether an entity is an "agent". This may also be very difficult.
### Elegance in happenstance.
> a party that spent a huge amount of money on a project that was almost certainly going to be wasteful and ended up being saved when by sheer happenstance another party appeared to save the project was not making good spending decisions
I wish for a value attribution rule that somehow detects when situations such as those happen and adjusts accordingly. In particular, Shapley Values don't take into account:
* Intentions
* What is the expected value of an action given the information which a reasonable agent ought to have had?
See the section on Shapley Values + Expected Values below on how one might do that, as well as the last section on when one might want to do that. On the other hand, if you incentivize something other than outcomes, you run the risk of incentivizing incompetence.
### Conclusions
Much like in the case of voting theory, the difficulty will be in managing tradeoffs, rather than in choosing the one true voting system to rule them all (pun intended).
With that in mind, one of the most interesting facts about Arrow's impossibility theorem is that there are voting methods which aren't bound by it, like [Score Voting](https://en.wikipedia.org/wiki/Score_voting). Quoting from Wikipedia:
> As it satisfies the criteria of a deterministic voting method, with non-imposition, non-dictatorship, monotonicity, and independence of irrelevant alternatives, it may appear that it violates Arrow's impossibility theorem. The reason that score voting is not a counter-example to Arrow's theorem is that it is a cardinal voting method, while the "universality" criterion of Arrow's theorem effectively restricts that result to ordinal voting methods
As such, I'm hedging my bets: impossibility theorems must be taken with a grain of salt; they can be stepped over if their assumptions do not hold.
![](images/3edb1377c2726d6123865c43360d67c08adb9aca.jpg)
## Parfit's [_Five Mistakes in Moral Mathematics_](http://www.stafforini.com/docs/Parfit%20-%20Five%20mistakes%20in%20moral%20mathematics.pdf).
The Indian mathematician Brahmagupta describes the solution to the quadratic equation ax^2 + bx = c as follows:
> 18.44. Diminish by the middle \[number\] the square-root of the rupas multiplied by four times the square and increased by the square of the middle \[number\]; divide the remainder by twice the square. \[The result is\] the middle \[number\].
![](images/5c7efddde4aa1a7119a15d93783f6c682d7212cf.svg)
I read Parfit's piece with the same admiration, sadness and sorrow with which I read the above paragraph. On the one hand, he is often clearly right. On the other hand, he's just working with very rudimentary tools: mere words.
With that in mind, how do Shapley values stand up to critique? Parfit proceeds by presenting several problems, and Toby Ord suggested that Shapley values might perform inadequately on some of them; I'll work through the Shapley solution to those I thought were the trickiest:
### The First Rescue Mission
> I know all of the following. A hundred miners are trapped in a shaft with flood-waters rising. These men can be brought to the surface in a lift raised by weights on long levers. If I and three other people go to stand on some platform, this will provide just enough weight to raise the lift, and will save the lives of these hundred men. If I do not join this rescue mission, I can go elsewhere and save, single-handedly, the lives of ten other people. There is a fifth potential rescuer. If I go elsewhere, this person will join the other three, and these four will save the hundred miners.
Do Shapley Values solve this satisfactorily? They do. If you go to save the other ten, your Shapley value is in fact higher. You can check this on [http://shapleyvalue.com/](http://shapleyvalue.com/); input 100 for every group with four participants, and 10 for every group of fewer than four people which contains you. For the total, try either 100 or 110, corresponding to whether you remain or leave.
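We can also check this by brute force. The characteristic function below is my own encoding of the scenario: any 4 of the 5 potential rescuers save the 100 miners, and "me" can additionally save 10 single-handedly whenever I'm not needed on the platform:

```python
from itertools import permutations

def shapley(players, v):
    """Exact Shapley values: average each player's marginal contribution
    over every possible order in which the players could join."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            totals[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: t / len(orders) for p, t in totals.items()}

def v(coalition):
    value = 100 if len(coalition) >= 4 else 0  # four rescuers raise the lift
    if "me" in coalition and (len(coalition - {"me"}) >= 4 or len(coalition) < 4):
        value += 10  # I'm not needed on the platform, so I save 10 elsewhere
    return value

result = shapley(["me", "A", "B", "C", "fifth"], v)
# result["me"] == 28.0; without the outside option it would only be 20.0.
```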
### The Second Rescue Mission
> As before, the lives of a hundred people are in danger. These people can be saved if I and three other people join in a rescue mission. We four are the only people who could join this mission. If any of us fails to join, all of the hundred people will die. If I fail to join, I could go elsewhere and save, single-handedly, fifty other lives.
Do Shapley Values solve this satisfactorily? They do. By having a stronger outside option, your Shapley value is in fact increased, even if you end up not taking it. Again, you can check on the link above, inputting 50 whenever you're in the group, and 100 (or 50 again) when the whole group is involved.
### Simultaneous headshots
> X and Y simultaneously shoot and kill me. Either shot, by itself, would have killed.
Do Shapley Values solve this satisfactorily? Maybe? They each get responsibility for half a death; whether this is satisfactory is up for discussion. I agree that the solution is counter-intuitive, but I'm not sure it's wrong. In particular, consider the question with the sign reversed:
> I die of an unexpected heart attack. X and Y simultaneously revive me (they both make a clone of me with the memory backup I had just made the day before, but the law sadly only allows one instance per being, so one has to go).
In this case, I find that my intuition is reversed; X and Y "should focus on cases which are less likely to be resurrected", and I see it as acceptable that they each get half the impact. I take this as a sign that my intuition isn't that reliable in this case; resurrecting someone should probably be as good as killing them is bad.
Consider also:
> You have put a bounty of 1 million dollars on the death of one of your hated enemies. As before, X and Y killed them with a simultaneous headshot. How do you divide the bounty?
On the other hand, we can add a fix to Shapley values to consider intentions (that is, expected values), which maybe fixes this problem. We can also use different variations of Shapley values depending on whether we want to coordinate with, award, punish, or incentivize someone, or to attribute value. For this, see the last and second-to-last sections. In conclusion, this example is deeply weird to me, perhaps because the "coordinate" and "punish" intuitions point in different directions.
### German Gift
Statement of the problem: X tricks me into drinking poison, of a kind that causes a painful death within a few minutes. Before this poison has any effect, Y kills me painlessly.
Do Shapley Values solve this satisfactorily? They do. They can differentiate between the case where Y kills me painlessly _because_ I've been poisoned (in which case she does me a favor), and the case where Y intended to kill me painlessly anyways. Depending on how painful the death is (for example, a thousand and one times worse than just dying), Y might even end up with a positive Shapley value even in that second case.
### Third Rescue Mission
Statement of the problem: As before, if four people stand on a platform, this will save the lives of a hundred miners. Five people stand on this platform.
Do Shapley Values solve this satisfactorily? They do. They further differentiate cleanly between the cases where:
* All five people see the opportunity at the same time. In this case, each person gets 1/5th of the total.
* Four people detect the opportunity at the same time. Seeing them, a fifth person joins in. In this case, the initial four people each get 1/4th, and the fifth person gets 0.
### An additional problem:
Here is an additional problem which I also find counterintuitive (though I'm unsure how confused to be about it):
> X kills me. Y resurrects me. I value my life at 100 utilons.
Here, the Shapley value of X is -50, and the Shapley value of Y is 50. Note, however, that the moment in which Y has an outside option to save someone else, their impact jumps to 100.
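Spelling out the computation, where value is the change relative to the world in which neither X nor Y acts:

```python
# Characteristic function: X alone kills me (-100 utilons); Y alone
# resurrects nobody (0); X and Y together leave me alive (0).
v = {(): 0, ("X",): -100, ("Y",): 0, ("X", "Y"): 0}

# Average each player's marginal contribution over the two join orders:
phi_x = 0.5 * (v[("X",)] - v[()]) + 0.5 * (v[("X", "Y")] - v[("Y",)])
phi_y = 0.5 * (v[("Y",)] - v[()]) + 0.5 * (v[("X", "Y")] - v[("X",)])
# phi_x == -50.0, phi_y == 50.0
```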
Ozzie Gooen comments:
> Note that in this case, the counterfactuals would be weird too. The counterfactual value of X is 0, because Y would save them. The counterfactual value of Y would be +100, for saving them. So if you were to try to assign value, Y would get lots, and X wouldn't lose anything. X and Y could then scheme with each other to do this 100 times and ask for value each time
### Comments
Overall, I think that Shapley values do pretty well on the problems posed by Parfit in _Five Mistakes in Moral Mathematics_. It saddens me to see that Parfit had to resort to using words, which are unwieldy, for considering hypotheses like the following:
> (C7) Even if an act harms no one, this act may be wrong because it is one of a set of acts that together harm other people. Similarly, even if some act benefits no one, it can be what someone ought to do, because it is one of a set of acts that together benefit other people.
![](images/c1c6470e1f4712d38d21ce31a6a2a4baa3671ab9.jpg)
## Shapley value puzzles
\[The first four puzzles of this section, and the commentary in between, were written by Ozzie Gooen.\]
Your name is Emma. You see 50 puppies drowning in a pond. You think you only have enough time to save 30 puppies yourself, but you look over and see a person in the distance. You yell out, they come over (their name is Phil), and together you save all the puppies from drowning.
Calculate the Shapley values for:
* You (Emma)
* Phil
The correct answer, of course, for both, should have been “an infinitesimal fraction” of the puppies. In your case, your parents were necessary for you to exist, so they should get some impact. Their parents too. Also, there were many people responsible for actions that led to your being there through some chaotic happenstance. Also, in many worlds where you would not have been there, someone else possibly would have; they deserve some Shapley value as well.
In moral credit assignment, it seems sensible that all humans should be included. That includes all those who came before, many of whom were significant in forming the exact world we have today.
However, maybe we want a more intuitive answer for a very specific version of the Shapley value; we'll only include value from the moment when we started the story above.
Now the answer is Emma: 40 puppies, Phil: 10 puppies. In total, you share 50 saved puppies. You can tell by trying it out in [this calculator](http://shapleyvalue.com/).
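For two players, the Shapley value is just the average of each player's marginal contribution over the two possible join orders, which is small enough to check directly:

```python
def two_player_shapley(v_empty, v_a, v_b, v_ab):
    """Shapley values for a two-player game: average each player's
    marginal contribution over the two join orders."""
    phi_a = 0.5 * (v_a - v_empty) + 0.5 * (v_ab - v_b)
    phi_b = 0.5 * (v_b - v_empty) + 0.5 * (v_ab - v_a)
    return phi_a, phi_b

# Emma alone saves 30 puppies, Phil alone saves none, together they save 50:
emma, phil = two_player_shapley(v_empty=0, v_a=30, v_b=0, v_ab=50)
# emma == 40.0, phil == 10.0
```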
Now that we've solved all concerns with Shapley values, let's move on to some simpler examples.
You (Emma again) are enjoying a nice lonely stroll in the park when you hear a person talking loudly on their cell phone. Their name is Mark. You turn to identify the voice, and you spot some adorable puppies drowning right next to him. You yell at Mark to help you save the puppies, but he shrugs and walks away, continuing his phone conversation. You save 30 puppies. However, you realize that if it weren't for Mark, you wouldn't have noticed them at all.
Calculate the Shapley values for:
* You (Emma)
* Mark
You (Emma again) are enjoying a nice lonely stroll in the park when you hear a rock splash in a pond. You look and notice some 30 adorable puppies drowning right next to it. You save all of the puppies. You realize that if it weren't for the rock, you wouldn't have noticed them at all.
Calculate the Shapley values for:
* You (Emma)
* The Rock
You (Emma again) are enjoying a nice stroll in the park. Alarmed, 29 paperclip satisficers\* inform you that a paperclip is going to be lost forever, and that 30 adorable puppies will drown unless you do something about it. You, together with the paperclip satisficers, spend three grueling hours saving the 30 puppies and the paperclip.
Calculate the Shapley values for:
* You (Emma)
* Each paperclip satisficer
You (Emma again) decide that this drowning-puppies business must stop, and create the Puppies Liberation Front. You cooperate with the Front for the Liberation of Puppies, such that the PLF gets the puppies out of the water and the FLP dries them, and both activities are necessary to rescue a puppy. Together, you rescue 30 puppies.
Calculate the Shapley values for:
* The Puppies Liberation Front:
* The Front for the Liberation of Puppies:
The Front for the Liberation of Puppies splits off a subgroup in charge of getting the towels: the Front for Puppie Liberation. Now:
* The Puppies Liberation Front gets the puppies out of the water
* The Front for the Liberation of Puppies dries them
* The Front for Puppie Liberation makes sure there are enough clean & warm towels for every puppie.
All steps are necessary. Together, you save 30 puppies. Calculate the Shapley value of:
* The Puppies Liberation Front:
* The Front for the Liberation of Puppies:
* The Front for Puppie Liberation:
Your name is Emma. Phil sees 30 puppies drowning in a pond, and he yells at you to come and save them. To your frustration, Phil just watches while you do the hard work. But you realize that without Phil's initial shouting, you would never have saved the 30 puppies.
Calculate the Shapley values for:
* You (Emma)
* Phil
You are Emma, again. You finally find the person who has been trying to drown so many puppies, Lucy. You ask how many puppies she threw into the water: 100. Relieved, you realize you (and you alone) have managed to save all of them.
Calculate the Shapley values for:
* You (Emma)
* Lucy
![](images/16fb6f6948ae9600cee6ae878a46ea21172e3821.jpg)
## Speculative Shapley extensions.
### Shapley values + Decision theory.
Epistemic status: Very biased. I am quite convinced that some sort of timeless decision theory is probably better. I also think that it is more adequate than competing theories in domains where other agents are simulating you, like philanthropic coordination questions.
In the previous post, Toby Ord writes:
> In particular, the best things you have listed in favour of the Shapley value applied to making a moral decision correctly apply when you and others are all making the decision 'together'. If the others have already committed to their part in a decision, the counterfactual value approach looks better.
> e.g. on your first example, if the other party has already paid their $1000 to P, you face a choice between creating 15 units of value by funding P or 10 units by funding the alternative. Simple application of Shapley value says you should do the action that creates 10 units, predictably making the world worse.
> One might be able to get the best of both methods here if you treat cases like this where another agent has already committed to a known choice as part of the environment when calculating Shapley values. But you need to be clear about this. I consider this kind of approach to be a hybrid of the Shapley and counterfactual value approaches, with Shapley only being applied when the other agents' decisions are still 'live'. As another example, consider your first example and add the assumption that the other party hasn't yet decided, but that you know they love charity P and will donate to it for family reasons. In that case, the other party's decision, while not yet made, is not 'live' in the relevant sense and you should support P as well.
Note that the argument, though superficially about Shapley values, is actually about which decision theory one is using; Toby Ord seems to be using CDT (or, perhaps, EDT), whereas I'm solidly in the camp of updateless/functional/timeless decision theories. From my perspective, proceeding as the comment above suggests would leave you wide open to blackmail, and would incentivize commitment races and other nasty things (e.g., if I'm the other party in the example above, by committing to donate to my pet cause, I can manipulate Toby Ord to donate money to the charity I love for family reasons, (and to take it away from more effective charities (unless, of course, Toby Ord has previously committed to not negotiating with such blackmailers, and I know that))). I'm not being totally fair here, and timeless decision theories also have other bullets to bite (perhaps [transparent Newcomb](https://arbital.com/p/transparent_newcombs_problem/) and [counterfactual mugging](https://wiki.lesswrong.com/wiki/Counterfactual_mugging) are the most counterintuitive examples (though I'd bite the bullet for both (as a pointer, see [In logical time, all games are iterated games](https://www.lesswrong.com/posts/dKAJqBDZRMMsaaYo5/in-logical-time-all-games-are-iterated-games)))).
But, as fascinating as they might be, we don't _actually_ have to have discussions about decision theories so full of parentheses that they look like Lisp programs. We can just earmark the point of disagreement. We can specify that if you're running a causal decision theory, you will want to consider only agents whom you can causally affect, and will only include such agents on your Shapley value calculations, whereas if you're running a different decision theory you might consider a broader class of agents to be "live", including some of those who have already made their decisions. In both cases, you're going to have to bite the bullets of your pet decision theory.
Personally, and only half-jokingly, there is a part of me which is very surprised to see decision theory in general, and timeless decision theories in particular, used for a legit real-life problem, namely philanthropic coordination theory, as the examples I'm most familiar with all involve Omegas which can predict you almost perfectly, Paul Ekmans who can read your facial microexpressions, and other such contrived scenarios.
### Shapley values + Moments of consciousness.
One way to force irrelevance of cabals is to fix a canonical definition of agent, and have the value of a group just be the sum of the values of its individuals. One such canonical definition could be over moments of consciousness; that is, you consider each moment of consciousness to be an agent, and you compute Shapley values over that. The value attributed to a person would be the sum of the Shapley values of each of their moments of consciousness. Similarly, if you need to compute the value of a group, you compute the Shapley value of the moments of consciousness of the members of the group.
On the one hand, the exact results of this procedure are particularly uncomputable as of 2020. And yet, using your intuition, you can see that the Shapley value of a person would be proportional to the number of necessary moments of consciousness which the person contributes, so not all is lost. On the other hand, it buys you irrelevance of cabals, and somewhat solves the parasite problem:
* $1000 and 100h are needed for an altruistic project. Peter Parasite learns about it and calls Anna Altruist, and she puts in the $1000 and the 100h needed for the project.
* $1000 and 100h are needed for an altruistic project. Pamela Philanthropist learns about it and calls Anna Altruist and they each cough up $500 and 50h to make it possible.
So taking into account moments of consciousness would assign more value to those who put in more (necessary) effort. Your mileage may vary with regards to whether you consider this to be a positive.
### Shapley Values + Sensitivity Analysis + Expected Values
Epistemic Status: Not original, but I don't know where I got the idea from.
Given some variable of interest, average the Shapley values over different values of that variable, proportional to their probabilities. In effect, don't report the Shapley values directly, but do a [Sensitivity Analysis](https://en.wikipedia.org/wiki/Sensitivity_analysis) on them.
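A sketch of that averaging. The scenarios and numbers below are made up for the simultaneous-headshot example; the point is only the mechanics of weighting per-scenario Shapley values by scenario probabilities:

```python
def expected_shapley(scenarios):
    """Average per-agent Shapley values across scenarios,
    weighted by each scenario's probability."""
    agents = scenarios[0][1].keys()
    return {a: sum(p * values[a] for p, values in scenarios) for a in agents}

# Two hypothetical, equally likely worlds:
expected = expected_shapley([
    (0.5, {"X": -50, "Y": -50}),   # both shots land simultaneously
    (0.5, {"X": -100, "Y": 0}),    # only X ends up firing
])
# expected == {"X": -75.0, "Y": -25.0}
```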
In the cases in which you want to punish or reward an outcome, and in particular in cases where you want to point to a positive or negative example for other people, you might want to act not on the Shapley value from the world we ended up living in, but on the expected Shapley value of the agent under consideration: expected either with the information you had, or with the information they had.
If you use the agent's information for those expected values, this allows you to punish evil but incompetent people, and to celebrate positive but misguided acts (which you might want to do for, e.g., small kids). If you use your own information for those expected values, this also allows you to celebrate competent but unlucky entrepreneurs, and to punish evil and competent people even when, by chance, they don't succeed.
However, Shapley values are probably too unsophisticated to be used in situations which are primarily about social incentives.
![](images/8260e1c392a2a710e9421817175e84156922de3d.jpeg)
## Some complimentary (and yet obligatory) ramblings about Shapley Values, Goodhart's law, and Stanovich's dysrationalia.
So one type of Goodhart's law is [Regressional Goodhart](https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Regressional_Goodhart).
> When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal. Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6'3", and a random 7' person in their 20s would probably not be as good
Other examples relevant to the discussion at hand:
* If you're optimizing your counterfactual value, you're also optimizing for the difference between the counterfactual values and the more nuanced thing which you care about.
* If you're optimizing your Shapley value, you're also optimizing for the difference between the Shapley values and the more nuanced thing which you care about.
* If you're optimizing for cost-effectiveness, you're also optimizing for the difference between being cost-effective and the more nuanced thing which you care about.
* If you're optimizing for being legible, you're also optimizing for the difference between being legible and the more nuanced thing which you care about.
* More generally, if you're optimizing for a measure of impact, you're also optimizing for the difference between what you care about and the measure of impact.
Of course, one could take this as a fully general counterargument against optimization and decide to become chaotic good. Instead, one might recognize that, even though you might cut yourself with a lightsaber, lightsabers are powerful, and you would want to have one. Or, in other words, to be really stupid, you need a theory (for an academic treatment on the matter, consult work by Keith Stanovich). Don't let Shapley values be that theory.
One specific way one can be stupid with Shapley values is by not disambiguating between different use cases, and not noticing that Shapley values are imperfect at the social ones:
* Coordinate. Choose what to do together with a group of people.
* Award. Pick a positive exemplar and throw status at them, so that people know what your or your group's idea of "good" is.
* Punish. Pick a negative exemplar and act so as to make clear to your group that this exemplar corresponds to your group's idea of "bad".
* Incentivize. You want to see more of X. Create mechanisms so that people do more of X.
* Attribute. Many people have contributed to X. How do you divide the spoils?
To conclude, commenters on the last post emphasized that one should not consider Shapley Values the philosopher's stone, the summum bonum, and I've tempered my initial enthusiasm somewhat since then. With that in mind, we considered Shapley values in a variety of cases, from the perhaps useful in practice (philanthropic coordination theory, forecasting incentives) to the exceedingly speculative, esoteric and theoretical.
### Assorted references
* [_Donor coordination and the “givers dilemma”_](https://blog.givewell.org/2014/12/02/donor-coordination-and-the-givers-dilemma/)
* [_Technical and Philosophical Questions That Might Affect Our Grantmaking_](https://www.openphilanthropy.org/blog/technical-and-philosophical-questions-might-affect-our-grantmaking)
* [If it's worth doing, it's worth doing with made-up statistics](https://slatestarcodex.com/2013/05/02/if-its-worth-doing-its-worth-doing-with-made-up-statistics/)
* [_Prediction Markets, Mechanism Design, and Cooperative Game Theory_](https://users.cs.duke.edu/~conitzer/predictionUAI09.pdf)
* [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
* [Shapley values: Better than counterfactuals](https://forum.effectivealtruism.org/posts/XHZJ9i7QBtAJZ6byW/shapley-values-better-than-counterfactuals)
* [Score voting](https://en.wikipedia.org/wiki/Score_voting)
* [Banzhaf power index](https://en.wikipedia.org/wiki/Banzhaf_power_index)
* [Notes on the N-Person game](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM670.pdf)
* [Amplifying generalist research via forecasting](https://forum.effectivealtruism.org/posts/ZTXKHayPexA6uSZqE/part-2-amplifying-generalist-research-via-forecasting)
* [Shapley Value Approximations for a Forecasting Experiment](https://observablehq.com/@nunosempere/shapley-value-experiments-part-i)
* [_Five Mistakes in Moral Mathematics_](http://www.stafforini.com/docs/Parfit%20-%20Five%20mistakes%20in%20moral%20mathematics.pdf).
* [Functional Decision Theory: A New Theory of Instrumental Rationality](https://arxiv.org/abs/1710.05060)
* [In logical time all games are iterated games](https://www.lesswrong.com/posts/dKAJqBDZRMMsaaYo5/in-logical-time-all-games-are-iterated-games).
* [Sensitivity analysis](https://en.wikipedia.org/wiki/Sensitivity_analysis)
* [Regressional Goodhart](https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Regressional_Goodhart).
## Appendix: A more complicated worked example.
Suppose that you have two players, Player A and Player B, and three charities: GiveDirectly, SCI, and the Not Quite Optimal Charity (NQOC).
| Charity | Value for A | Value for B |
|---------|-----------------------------------|-------------|
| GD | y=x | y=x |
| SCI | y=2x if x<10, y=x/2 + 15 if x>=10 | y=2x |
| NQOC | y=x/10 | y=x/5 |
Or, in graph form:
![](images/b640f8bfd4ce1e54a632d7bb4a3c4bd9e9695418.jpg)
Caveats:
* I don't actually think that the diminishing returns are likely to be that brutal and that sudden, but assuming they are simplifies the calculations.
* I don't actually know whether 1/5th to 1/10th of the value of a donation to GiveDirectly is a reasonable estimate of what an informed donor would have achieved in the absence of GiveWell's analysis.
Then suppose that:
* Player A has 10 million, and information
* Player B has 1 million, and not much information.
Then the outcomes might look like:
| Group | Value for A | Value for B |
|-------|--------------|----------------|
| {} | 0* | 0* |
| {A} | 20 | 20 |
| {B} | 1/10 | 1/5 |
| {A,B} | 20 + X/2 + Y | 20 + 2X + Y|
where X and Y are the quantities which Player B donates to SCI and GiveDirectly, respectively, and they sum up to 1. So the value of cooperating is
* X/2 + Y - 1/10 according to A's utility function
* 2X + Y - 1/5 according to B's utility function
In particular, if B donates only to SCI, then the value of cooperating is:
* X/2 - 1/10 according to A's utility function
* 2X - 1/5 according to B's utility function
Now, we have the interesting scenario in which A is willing to sell their share of the impact for `X/4 - 1/20`, but B is willing to buy it for more, that is, for `X - 1/10`. In this case, suppose that they come to a gentleman's agreement and decide that the fair price is `(X/4 - 1/20)/2 + (X - 1/10)/2`, which, simplified, is equal to `(5X)/8 - 3/40`.
Player B then donates X to SCI and buys Player A's certificates of impact for that donation (which are cheaper than continuing to donate). If B spends the full million, then `X + (5X)/8 - 3/40 = 1`, so X ≈ 0.66; that is, 0.66 million is donated to SCI, whereas 0.34 million is spent buying certificates of impact from Player A, who then donates that amount to GiveDirectly.
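The algebra above can be checked with a few lines of Python. The fractions are exact, and `agreed_price` is just the midpoint between A's selling price and B's buying price as derived above:

```python
from fractions import Fraction as F

# A is willing to sell their share of the impact for X/4 - 1/20;
# B is willing to buy it for X - 1/10; they split the difference.
def agreed_price(x):
    sell_at = x / 4 - F(1, 20)
    buy_at = x - F(1, 10)
    return (sell_at + buy_at) / 2  # simplifies to (5/8)x - 3/40

# B spends the full million: X + agreed_price(X) = 1, i.e., X * 13/8 = 43/40
x = (F(1) + F(3, 40)) / (F(1) + F(5, 8))
print(float(x))      # ~0.66 million donated to SCI
print(float(1 - x))  # ~0.34 million spent on A's certificates of impact
```

The exact solution is X = 43/65, which rounds to the 0.66 used above.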
New Cause Proposal: International Supply Chain Accountability
==============
\[Epistemic Status: Totally serious; altered afterwards to remove some April Fool's content. Thanks to Aaron Gertler for reading a version of this draft.\]
# Related material
* [Prospecting for Gold](https://www.effectivealtruism.org/articles/prospecting-for-gold-owen-cotton-barratt/)
* [Cause X Guide](https://forum.effectivealtruism.org/posts/kFmFLcdSFKo2GFJkc/cause-x-guide&sa=D&ust=1577137448641000)
* [How do we create international supply chain accountability](https://forum.effectivealtruism.org/posts/wFeEdpT5AS8xCHj3q/how-do-we-create-international-supply-chain-accountability#5SnJAqdxSkkraPdbm)
* [Will companies meet their animal welfare commitments?](https://forum.effectivealtruism.org/posts/XdekdWJWkkhur9gvr/will-companies-meet-their-animal-welfare-commitments)
* [Overview of Capitalism and Socialism for Effective Altruism](https://forum.effectivealtruism.org/posts/ktEfsoGfBFGsaiY46/overview-of-capitalism-and-socialism-for-effective-altruism)
* [_The Revolution, Betrayed_](https://www.amazon.com/Revolution-Betrayed-Leon-Trotsky/dp/0486433986), by Leon Trotsky
# What is international supply chain accountability?
I think that one useful way to define the term is
1. as a strategy
2. used to forcefully incentivize global companies to align their production processes with a given morality,
3. making use of the fact that these global companies are hit much harder by reactions, thoughts, incentives, reprisals, etc. which happen in their country of origin,
4. as opposed to in India, Bangladesh, Vietnam and other countries in their chain of production, even about things which happen in these other countries.
One could also understand supply chain accountability as a goal (making multinational companies less exploitative) rather than as a strategy, in the same way that one could understand effective altruism in terms of its ideals, rather than in terms of the specific strategies which it employs.
Example 1: After the collapse of a Bangladeshi garment factory which killed more than a thousand people \[[source / en](https://en.wikipedia.org/wiki/2013_Dhaka_garment_factory_collapse)\], a media shitstorm happened around the world. Transforming the temporary flare of outrage into specific and lasting commitments and agreements was the work of many organizations \[1\] under the banner of the concept of supply chain accountability. See: [The Bangladesh Accord on Fire and Building Safety](https://bangladeshaccord.org/).
Example 2: Supply chain accountability in the case of Inditex, one of the biggest multinationals in the textile industry, over the last 20 years: \[[source / es](http://relats.org/documentos/GLOBBoixGarrido.pdf)\]:
* Inditex signed a "Global Framework Agreement" with Comisiones Obreras (later also with the global federation IndustriALL), and this agreement was later renewed \[[source / en](http://www.industriall-union.org/global-framework-agreements)\].
* Comisiones Obreras carried out investigative work in every country in which Inditex has factories, with the aim of identifying the issues which caused the most harm in each place, and slowly taking measures to improve them. For example, if a factory had fired workers who tried to unionize, they convinced the factory to rehire them. If a factory had hired people below working age, they made sure that these workers got enough money to complete their primary and secondary education.
* An auditing system was set up, in which Inditex itself audits its own contractors, after which Comisiones Obreras, which has less capacity, visits and audits ~2% of factories itself.
* The philosophical underpinnings of such a Global Agreement were developed further. For example, professors at Spanish universities theorize about the topic and teach about it, providing a philosophical or moral basis about why supply chain accountability is positive and necessary. However, this is a posteriori.
# Might this be of interest to Effective Altruism?
## Importance
Gut feeling: Very large.
Multinational companies employ a lot of people, directly or indirectly, on their chain of production. In the case of Inditex, this corresponds to 2.2 million people, distributed across 5,000 factories in 40 countries \[[source / es / 2019](http://iboix.blogspot.com/2019/12/el-acuerdo-marco-global-de-industriall.html)\]. This has grown from ~1 million in 2014 \[[source / es](http://iboix.blogspot.com/2014/07/renovacion-del-acuerdo-marco-global-de.html)\].
Given the number of people involved, reducing the incidence of fatal accidents, sexual abuse and exploitative working conditions, making those working conditions marginally better, or marginally increasing wages might be impactful, as would accelerating the rate at which that happens. [Here](https://www.getguesstimate.com/models/14645) is a back-of-the-envelope Guesstimate model which incorporates the number of people in Inditex's chain of production and their estimated work times, in order to calculate the estimated lives saved due to the fire and safety accord mentioned above.
The main conclusion is that the scale of the problem is so vast that any percentage improvement is likely to be massively beneficial. In the specific case of deaths averted because of the Bangladesh Accord on Fire and Building Safety, I estimate a cost-effectiveness of $4,400 per life saved, with a 90% confidence interval of $2,300 to $18,000. However, note that the companies themselves pledge that money.
The model has a second part, in which I try to produce some estimates in terms of QALYs, and in which I use my best judgement in the absence of hard sources. That is, I estimate the improvement in life quality due to less shitty working conditions, but I'm building on thin air because hard numbers don't exist. I'm less confident in this part, not only because of uncertainty about the magnitudes themselves, but also because cost-effectiveness becomes less meaningful when companies pledge the money themselves. Additionally, it is not clear to me what proportion of that impact would go to one particular organization, like Comisiones Obreras, as opposed to the many other organizations working in the area.
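As a sketch of what such a back-of-the-envelope calculation looks like, here is a minimal Monte Carlo version in Python. The input ranges are illustrative placeholders, not the figures from the linked Guesstimate model; the approach, not the numbers, is the point:

```python
import math
import random
import statistics

def lognormal_from_90ci(low, high):
    """Sample a lognormal distribution whose 90% CI is (low, high)."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * 1.645)
    return random.lognormvariate(mu, sigma)

random.seed(0)
samples = sorted(
    # Placeholder ranges: total money pledged, and lives saved by the accord.
    lognormal_from_90ci(10e6, 50e6) / lognormal_from_90ci(1_000, 10_000)
    for _ in range(100_000)
)
print("median $/life:", round(statistics.median(samples)))
print("90% CI:", round(samples[5_000]), "to", round(samples[95_000]))
```

Guesstimate does essentially this under the hood: each cell is a distribution, and derived cells are computed by sampling.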
## Tractability
Gut feeling: Somewhat tractable, but perhaps not by the EA community as it is now.
My intuitions about this being somewhat tractable come from the proof of possibility which Inditex and Comisiones Obreras provide, as well as the [46 other Global Framework Agreements](http://www.industriall-union.org/global-framework-agreements) made by the IndustriALL federation.
How tractable is this for the EA community in particular? Here are some considerations:
* The EA community doesn't really have a strong track record in this area, whereas other groups and organizations, like the federation [IndustriALL](http://www.industriall-union.org/) (of which Comisiones Obreras, a national workers' union, is an affiliate), seem to be doing ok.
* CEA seems to be averse to public attention, though this may not apply to the EA community in general.
* On the other hand, the animal suffering movement has recently had a series of very visible victories using similar methods. Incidentally, it seems that some insights were rediscovered; perhaps the animal suffering movement could learn more effectively from trade unionism's victories and defeats, and vice versa. For example, trade unionists have historically relied more on newspapers and organizing demonstrations, and have not used Facebook ads as aggressively as the animal suffering movement (but perhaps could). In the other direction, I think that actors in the space of supply chain accountability realized the importance of independent audits for pledges somewhat earlier than the animal suffering movement \[2\].
## Neglectedness
Gut feeling: Many stakeholders, but I'm unclear what their average level of effectiveness is.
While researching the area, I got the impression of a proliferation of stakeholders (IndustriALL, MSI-TN, FairTrade Initiative, the ILO, etc.), such that an additional stakeholder of average effectiveness would just stand in the way. It might thus make more sense to support and expand existing organizations (chiefly, IndustriALL), whether with manpower or with funds.
An additional point is that if several organizations define corporate responsibility differently, companies have a tendency to choose the least demanding definitions, so an additional organization might instead only be able to do harm. \[3\]
Tangentially, some of the primary sources for Inditex's agreement were in Spanish, and I wonder whether EA has neglected charities in Spanish-speaking countries because of this language barrier. On this point, I have copious notes on the aforementioned agreement & related sources, which are available, in Spanish, [here](https://nunosempere.github.io/ea/CCOO-Inditex).
# Some Open Questions.
## Does the EA philosophy help here?
Unionists are not effective altruists. In particular, they're not particularly utilitarian. It's not clear that the effective altruism philosophy or people would be beneficial here, as opposed to capable unionists convinced of the importance of worker organization, given that unionists already have something working.
Money could be used, but I am not sure whether the relevant organizations accept donations (I asked a while ago, but I'm still waiting for an answer). In particular, I'd expect that Comisiones Obreras / IndustriALL could meaningfully use very large amounts of money, but that they might be reticent to take it.
In particular, both IndustriALL and the unions it is composed of already have a working model based on individual worker affiliation, rather than on patronage. It may be that this model allows them to have little to no principal/agent problems.
## Does Inditex's Agreement generalize?
Or was Inditex just particularly cooperative? It does seem that Inditex was particularly cooperative, in comparison to, for example, another Spanish retailer which goes by the name of [Mango](https://en.wikipedia.org/wiki/Mango_(retailer)). Which other companies would be similarly influenced by public opinion in their own countries?
Is any EA local group organized well enough that a corporate accountability campaign targeting a particular company could be created and carried out, resulting in all factories in its direct and indirect chain of production being audited, within 10 years? Is there any company particularly well-suited for this?
In particular, I'd intuitively expect convincing a previously reluctant company to sign such a supply chain accountability agreement to be comparable in difficulty to the campaign which [doubled Zurich's development aid](https://forum.effectivealtruism.org/posts/dTdSnbBB2g65b2Fb9/eaf-s-ballot-initiative-doubled-zurich-s-development-aid) \[4\].
## How do you quantify the impact of corporate responsibility campaigns?
Right now, supply chain accountability campaigns have not been focused on measuring impact in the sense understood by the EA community. However, Comisiones Obreras are extremely transparent, and they have released [copious amounts](http://iboix.blogspot.com/) of information as to their projects and activities, which could be built upon to estimate their impact.
Besides working factory-to-factory, the Spanish trade unionists which I looked into also had a theory of broader systemic change:
* By popularizing the ideas of trade unionism and, to a lesser extent, class warfare, and the social technology of worker organizations, workers will be able to organize themselves better.
* Once workers are organized, they will be able to demand better working conditions, and get them.
Estimating the validity of this theory of systemic change could also be done by looking at the history of trade unionism. (Or, better yet, by doing a randomized trial introducing the idea in a country in which it isn't very developed yet.)
# Conclusions
I am ignorant of how the marginal impact of work in this area compares to other policy work. The biggest thing international supply chain accountability has going for it is the _scale_; it could potentially absorb both great quantities of money and effort, and positively affect the lives of a very large number of people \[5\].
The negatives are the poor fit between EA and the trade unionism movement. In particular, EA isn't really conscious of its sometimes petty-bourgeois character \[6\], whereas trade unionism, having Marxist roots and inclinations, takes class differences seriously. For example, Comisiones Obreras is still a communist/socialist organization (though less so than in the past). It is possible that getting these organizations to accept money would take some work, but I also believe that they could meaningfully use it.
Overall, perhaps International Supply Chain Accountability would be a good target for a [shallow investigation](https://www.openphilanthropy.org/research/cause-selection#Shallow_investigations) by any EA organization which wishes to diversify their moral parliament.
---
\[1\] Which organizations? The Bangladesh Accord website mentions:
> “The Accord is a legally-binding agreement between global brands & retailers and IndustriALL Global Union & UNI Global Union and eight of their Bangladeshi affiliated unions to work towards a safe and healthy garment and textile industry in Bangladesh.”
These two global unions are, in practice, meta-unions, to which many other organizations are affiliated, like Comisiones Obreras in Spain.
\[2\] See: [Will companies meet their animal welfare commitments?](https://forum.effectivealtruism.org/posts/XdekdWJWkkhur9gvr/will-companies-meet-their-animal-welfare-commitments)
\[3\] I can't remember which organization was said to be less strict than the rest, and trying to track it down brings up [unrelated criticism](https://en.wikipedia.org/wiki/International_Fairtrade_Certification_Mark#Criticism) of FairTrade in the food producing sector, rather than in the textile industry.
\[4\] For example, what's up with Adidas in Germany? Is their [corporate responsibility approach](https://www.adidas-group.com/en/sustainability/managing-sustainability/general-approach/#/people-priorities-for-2020/) just blabber? (Most likely.) Do they have independent auditors? (Most likely not.)
\[5\] This assessment also applies to marxist revolution more generally.
Moreover, the EA movement could learn from the mistakes of communism. A point of reference which I've personally found enlightening is _The Revolution, Betrayed_, by Trotsky.
\[6\] If one attends [EA Global](https://slatestarcodex.com/2017/08/16/fear-and-loathing-at-effective-altruism-global-2017/), one does not get the impression that EAs, after they put down their vegetarian cocktails, will go on to the streets and sing [_The Internationale_](https://www.youtube.com/watch?v=t8EMx7Y16Vo). Perhaps they're missing out.
Forecasting Newsletter: April 2020
==============
A forecasting digest with a focus on experimental forecasting.
* You can sign up [here](https://forecasting.substack.com).
* You can also see this post on LessWrong [here](https://www.lesswrong.com/posts/e4C7hTmbmPfLjJzXT/forecasting-newsletter-april-2020-1)
The newsletter itself is experimental, but there will be at least five more iterations. Feel free to use this post as a forecasting open thread.
Why is this relevant to EAs?
* Some items are immediately relevant (e.g., forecasts of famine).
* Others are projects whose success I'm cheering for, and which I think have the potential to do great amounts of good (e.g., Replication Markets).
* The rest are relevant to the extent that cross-pollination of ideas is valuable.
* Forecasting may become/is becoming a powerful tool for world-optimization, and EAs may want to avail themselves of this tool.
Conflict of interest: With Foretold in general and Jacob Laguerros in particular. This is marked as (c.o.i) throughout the text.
## Index
* Prediction Markets & Forecasting platforms.
* Augur.
* PredictIt & Election Betting Odds.
* Replication Markets.
* Coronavirus Information Markets.
* Foretold. (c.o.i).
* Metaculus.
* Good Judgement and friends.
* In the News.
* Long Content.
## Prediction Markets & Forecasting platforms.
Forecasters may now choose to forecast any of the four horsemen of the Apocalypse: Death, Famine, Pestilence and War.
### Augur: [augur.net](https://www.augur.net/)
Augur is a decentralized prediction market. It will be undergoing its [first major update](https://www.augur.net/blog/augur-v2/).
### PredictIt & Election Betting Odds: [predictIt.org](https://www.predictit.org/) & [electionBettingOdds.com](http://electionbettingodds.com/)
PredictIt is a prediction platform restricted to US citizens or those who bother using a VPN. [Anecdotally](https://www.lesswrong.com/posts/qzRzQgxiZa3tPJJg8/free-money-at-predictit), it often has free energy, that is, places where one can earn money by having better probabilities, and where this is not too hard. However, due to fees & the hassle of setting it up, these inefficiencies don't get corrected. In PredictIt, the [world politics](https://www.predictit.org/markets/5/World) section...
* gives 17% to [a Scottish independence referendum](https://www.predictit.org/markets/detail/6236/Will-Scottish-Parliament-call-for-an-independence-referendum-in-2020) (though read the fine print).
* gives 20% to [Netanyahu leaving before the end of the year](https://www.predictit.org/markets/detail/6238/Will-Benjamin-Netanyahu-be-prime-minister-of-Israel-on-Dec-31,-2020)
* gives 64% to [Maduro remaining President of Venezuela before the end of the year](https://www.predictit.org/markets/detail/6237/Will-Nicol%C3%A1s-Maduro-be-president-of-Venezuela-on-Dec-31,-2020).
The question on [which Asian/Pacific leaders will leave office next?](https://www.predictit.org/markets/detail/6655/Which-of-these-8-Asian-Pacific-leaders-will-leave-office-next) also looks like it has a lot of free energy, as it overestimates low probability events.
[Election Betting Odds](https://electionbettingodds.com/) aggregates PredictIt with other such services for the US presidential elections.
### Replication Markets: [replicationmarkets.com](https://www.replicationmarkets.com)
Replication Markets is a project where volunteer forecasters try to predict whether a given study's results will be replicated with high power. Rewards are monetary, but only given out to the top N forecasters, and the markets suffer from sometimes being dull. They have added [two market-maker bots](https://www.replicationmarkets.com/index.php/2020/04/16/meet-the-bots/) and have commenced and concluded their 6th round. They also added a sleek new widget to better visualize the price of shares.
### Coronavirus Information Markets: [coronainformationmarkets.com](https://coronainformationmarkets.com/)
For those who want to put their money where their mouth is, there is now a prediction market for coronavirus related information. The number of questions is small, and the current trading volume started at $8000, but may increase. Another similar platform is [waves.exchange/prediction](https://waves.exchange/prediction), which seems to be just a wallet to which a prediction market has been grafted on.
Unfortunately, I couldn't make a transaction in these markets within ~30 mins; the time needed to be included in an Ethereum block is longer, and I may have been too stingy with my gas fee.
### Foretold: [foretold.io](https://www.foretold.io/) (c.o.i)
Foretold is a forecasting platform which has the experimentation and exploration of forecasting methods in mind. They bring us:
* A new [distribution builder](https://www.highlyspeculativeestimates.com/dist-builder) to visualize and create probability distributions.
* Forecasting infrastructure for [epidemicforecasting.org](http://epidemicforecasting.org).
### Metaculus: [metaculus.com](https://www.metaculus.com/)
Metaculus is a forecasting platform with an active community and lots of interesting questions. They bring us a series of tournaments and question series:
* [The Ragnarök question series on terrible events](https://www.metaculus.com/questions/?search=cat:series--ragnarok)
* [Pandemic and lockdown series](https://pandemic.metaculus.com/lockdown/)
* [The Lightning Round Tournament: Comparing Metaculus Forecasters to Infectious Disease Experts](https://www.metaculus.com/questions/4166/the-lightning-round-tournament-comparing-metaculus-forecasters-to-infectious-disease-experts/). "Each week you will have exactly 30 hours to lock in your prediction on a short series of important questions, which will simultaneously be posed to different groups of forecasters. This provides a unique opportunity to directly compare the Metaculus community prediction with other forecasting methods." Furthermore, Metaculus swag will be given out to the top forecasters.
* [Overview of Coronavirus Disease 2019 (COVID-19) forecasts](https://pandemic.metaculus.com/COVID-19/).
* [The Salk Tournament for coronavirus (SARS-CoV-2) Vaccine R&D](https://pandemic.metaculus.com/questions/4093/the-salk-tournament-for-coronavirus-sars-cov-2-vaccine-rd/).
* [Lockdown series: when will life return to normal-ish?](https://pandemic.metaculus.com/lockdown/)
### /(Good Judgement?\[^\]\*)|(Superforecast(ing|er))/gi
Good Judgement Inc. is the organization which grew out of Tetlock's research on forecasting, and out of the Good Judgement Project, which won the [IARPA ACE forecasting competition](https://en.wikipedia.org/wiki/Aggregative_Contingent_Estimation_(ACE)_Program), and resulted in the research covered in the _Superforecasting_ book.
The Open Philanthropy Project has funded [this covid dashboard](https://goodjudgment.io/covid/dashboard/) by Good Judgement Inc.'s Superforecasting Analytics Service, with predictions solely from superforecasters; see more on [this blogpost](https://www.openphilanthropy.org/blog/forecasting-covid-19-pandemic).
Good Judgement Inc. also organizes the Good Judgement Open ([gjopen.com](https://www.gjopen.com/)), a forecasting platform open to all, with a focus on serious geopolitical questions. They structure their questions in challenges, to which they have recently added one on [the Coronavirus Outbreak](https://www.gjopen.com/challenges/43-coronavirus-outbreak); some of these questions are similar in spirit to the short-fuse Metaculus Tournament.
Of the questions which have been added recently to the Good Judgement Open, the crowd [doesn't buy](https://www.gjopen.com/questions/1580-before-1-january-2021-will-tesla-release-an-autopilot-feature-designed-to-navigate-traffic-lights) that Tesla will release an autopilot feature to navigate traffic lights, despite announcements to the contrary. Further, the aggregate...
* is extremely confident that, [before 1 January 2021](https://www.gjopen.com/questions/1595-before-1-january-2021-will-the-russian-constitution-be-amended-to-allow-vladimir-putin-to-remain-president-after-his-current-term), the Russian constitution will be amended to allow Vladimir Putin to remain president after his current term.
* gives a lagging estimate of 50% on [Benjamin Netanyahu ceasing to be the prime minister of Israel before 1 January 2021](https://www.gjopen.com/questions/1498-will-benjamin-netanyahu-cease-to-be-the-prime-minister-of-israel-before-1-january-2021).
* and 10% for [Nicolás Maduro](https://www.gjopen.com/questions/1423-will-nicolas-maduro-cease-to-be-president-of-venezuela-before-1-june-2020) leaving before the 1st of June.
* [forecasts famine](https://www.gjopen.com/questions/1559-will-the-un-declare-that-a-famine-exists-in-any-part-of-ethiopia-kenya-somalia-tanzania-or-uganda-in-2020) (70%).
* Of particular interest is that GJOpen didn't see the upsurge in tests (and thus positives) in the US until the day before it happened, for [this question](https://www.gjopen.com/questions/1599-how-many-total-cases-of-covid-19-in-the-u-s-will-the-covid-tracking-project-report-as-of-sunday-26-april-2020). Forecasters, including superforecasters, went with a linear extrapolation from the previous n (usually 7) days. However, even though the number of cases looks locally linear, it's also globally exponential, as [this 3Blue1Brown video](https://www.youtube.com/watch?v=Kas0tIxDvrg) shows. On the other hand, an enterprising forecaster tried to fit a Gompertz distribution, but then fared pretty badly.
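To illustrate that failure mode, here is a toy example in Python (with made-up numbers, not GJOpen's data): a least-squares line fit to the last 7 points of an exponentially growing series systematically undershoots the next value.

```python
def linear_extrapolate(ys):
    """Fit a least-squares line to ys and predict the next point."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
             / sum((x - mean_x) ** 2 for x in range(n)))
    return mean_y + slope * (n - mean_x)

cases = [1000 * 1.15 ** day for day in range(30)]  # 15% daily growth
prediction = linear_extrapolate(cases[-7:])
actual = 1000 * 1.15 ** 30
print(prediction < actual)  # True: the linear forecast falls short
```

Any window length shows the same bias, because an exponential curve eventually outgrows every straight line.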
## In the News
* [Forecasts in the time of coronavirus](https://ftalphaville.ft.com/2020/04/08/1586350137000/Forecasts-in-the-time-of-coronavirus/): The Financial Times runs into difficulties trying to estimate whether some companies are overvalued, because the stock value/earnings ratio, which is otherwise a useful tool, goes to infinity as earnings go to 0 during the pandemic.
* [Predictions are hard, especially about the coronavirus](https://www.vox.com/future-perfect/2020/4/8/21210193/coronavirus-forecasting-models-predictions): Vox has a short and sweet article on the difficulties of forecasting; of note is that epidemiology experts are not great predictors.
* [538: Why Forecasting COVID-19 Is Harder Than Forecasting Elections](https://fivethirtyeight.com/features/politics-podcast-why-forecasting-covid-19-is-harder-than-forecasting-elections/)
* [COVID-19: Forecasting with Slow and Fast Data](https://www.stlouisfed.org/on-the-economy/2020/april/covid-19-forecasting-slow-fast-data). A short and crisp overview by the Federal Reserve Bank of St Louis on lagging economic measurement instruments, which have historically been quite accurate, and on the faster instruments which are available right now. Highlight: "As of March 31, the WEI \[a faster, weekly economic index\] indicated that GDP would decline by 3.04% at an annualized rate in the first quarter, a much more sensible forecast than that which is currently indicated by the ENI (a lagging measure which predicts 2.26% _growth_ on an annualized basis in the first quarter)".
* [Decline in aircraft flights clips weather forecasters' wings](https://www.theguardian.com/news/2020/apr/09/decline-aircraft-flights-clips-weather-forecasters-wings-coronavirus): Coronavirus has led to reduction in number of aircraft sending data used in making forecasts.
* [The World in 2020, as forecast by The Economist](https://www.brookings.edu/blog/future-development/2020/04/10/the-world-in-2020-as-forecast-by-the-economist/). The Brookings institution looks back at forecasts for 2020 by _The Economist_.
* Forbes brings us this [terrible, terrible opinion piece](https://www.forbes.com/sites/josiecox/2020/04/14/life-work-after-covid-19-coronavirus-forecast-accuracy-brighter-future/#28732f74765b) which mentions Tetlock, goes on about how humans are terrible forecasters, and then predicts that there will be no social changes because of covid with extreme confidence.
* [The Challenges of Forecasting the Spread and Mortality of COVID-19](https://www.heritage.org/public-health/report/the-challenges-forecasting-the-spread-and-mortality-covid-19). The Heritage foundation brings us a report with takeaways of particular interest to policymakers. It has great illustrations of how the overall mortality changes with different assumptions. Note that criticisms of and suggestions for the current US administration are worded kindly, as the Heritage Foundation is a conservative organization.
* [Why most COVID-19 forecasts were wrong](https://www.afr.com/wealth/personal-finance/why-most-covid-19-forecasts-were-wrong-20200415-p54k40). Financial review article suffers from hindsight bias.
* [Banks are forecasting on gut instinct — just like the rest of us](https://www.ft.com/content/4b8108e5-b04c-4304-9f40-825076a4fed7). Financial Times article starts with "We all cling to the belief that somebody out there, somewhere, knows what the heck is going on. Someone — well-connected insider, evil mastermind — must hold the details on the coming market crash, the election in November, or when the messiah will return. In moments of crisis, this delusion tightens its grip," and it only gets better.
* ['A fool's game': 4 economists break down what it's like forecasting the worst downturn since the Great Recession](https://www.businessinsider.com/economists-what-its-like-forecasting-recession-experience-unemployment-coronavirus-2020-4). "'My outlook right now is that I don't even have an outlook,' Martha Gimbel, an economist at Schmidt Futures, told Business Insider. 'This is so bad and so unprecedented that any attempt to forecast what's going to happen here is just a fool's game.'"
* [IMF predicts -3% global depression](https://blogs.imf.org/2020/04/14/the-great-lockdown-worst-economic-downturn-since-the-great-depression/). "Worst Economic Downturn Since the Great Depression".
* [COVID-19 Projections](https://covid19.healthdata.org/united-states-of-america): A really sleek US government coronavirus model. See [here](https://www.lesswrong.com/posts/QuzAwSTND6N4k7yNj/seemingly-popular-covid-19-model-is-obvious-nonsense) for criticism. See also: [Epidemic Forecasting](http://epidemicforecasting.org/) (c.o.i).
* [The M5 competition is ongoing](https://www.kaggle.com/c/m5-forecasting-accuracy/data).
* [Some MMA forecasting](https://mmajunkie.usatoday.com/2020/04/fantasy-fight-forecasting-ufc-welterweight-title-usman-masvidal-woodley-edwards). The analysis surprised me; it could well have been a comment in a GJOpen challenge.
* [Self-reported COVID-19 Symptoms Show Promise for Disease Forecasts](https://www.cmu.edu/news/stories/archives/2020/april/self-reported-covid-19-symptoms-disease-forecasts.html). "Thus far, CMU is receiving about one million responses per week from Facebook users. Last week, almost 600,000 users of the Google Opinion Rewards and AdMob apps were answering another CMU survey each day."
* [Lockdown Policy and Disease Eradication](https://www.isical.ac.in/~covid19/Modeling.html). Researchers in India hypothesize on what the optimal lockdown policy may be.
* [Using a delay-adjusted case fatality ratio to estimate under-reporting](https://cmmid.github.io/topics/covid19/severity/global_cfr_estimates.html).
* [The first modern pandemic](https://www.gatesnotes.com/Health/Pandemic-Innovation). In which Bill Gates names covid-SARS "Pandemic I" and offers an informed overview of what is yet to come.
* [36,000 Missing Deaths: Tracking the True Toll of the Coronavirus Crisis](https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html).
* There is a shadow industry which makes what look to be really detailed reports on topics of niche interest: here is, for example, a [$3,500 report on market trends for bonsai](https://technovally.com/business-methodology-by-2020-2029-bonsai-market/).
* [An active hurricane season will strain emergency response amid pandemic, forecasters warn](https://www.cbsnews.com/news/hurricane-season-2020-active-strain-emergency-response-coronavirus-pandemic/). "Schlegelmilch stresses that humanity must get better at prioritizing long-term strategic planning."
## Long Content
* [Atari, early](https://aiimpacts.org/atari-early/). "Deepmind announced that their Agent57 beats the human baseline at all 57 Atari games usually used as a benchmark."
* [A failure, but not of prediction](https://slatestarcodex.com/2020/04/14/a-failure-but-not-of-prediction/); a SlateStarCodex Essay.
* [Philip E. Tetlock on Forecasting and Foraging as a Fox](https://medium.com/conversations-with-tyler/philip-tetlock-tyler-cowen-forecasting-sociology-30401464b6d9); an interview with Tyler Cowen. Some highly valuable excerpts on counterfactual reasoning. Mentions [this program](https://www.iarpa.gov/index.php/research-programs/focus/focus-baa) and [this study](https://journals.sagepub.com/doi/10.1177/0022022105284495), on the forefront of knowledge.
* [Assessing Kurzweil's 1999 predictions for 2019](https://www.lesswrong.com/posts/GhDfTAtRMxcTqAFmc/assessing-kurzweil-s-1999-predictions-for-2019). Kurzweil made on the order of 100 predictions for 2019 in his 1999 book _The Age of Spiritual Machines_. How did they fare? We'll find out, next month.
* [Zvi on Evaluating Predictions in Hindsight](https://www.lesswrong.com/posts/BthNiWJDagLuf2LN2/evaluating-predictions-in-hindsight). A fun read. Of course, the dissing of Scott Alexander's prediction is fun to read, but I really want to see how a list of Zvi's predictions fares.
* An oldie related to the upcoming US elections: [Which Economic Indicators Best Predict Presidential Elections?](https://fivethirtyeight.blogs.nytimes.com/2011/11/18/which-economic-indicators-best-predict-presidential-elections/), by Nate Silver, from 2011.
* [A rad comment exchange at GJOpen](https://www.gjopen.com/comments/comments/1018771) in which cool superforecaster @Anneinak shares some pointers.
* [As the efficient markets hypothesis turns 50, it is time to bin it](https://www.ft.com/content/dbf88254-22af-11ea-b8a1-584213ee7b2b), a Financial Times article from January 1st, and thus untainted by coronavirus discussion. Related: [This LW comment by Wei Dai](https://www.lesswrong.com/posts/jAixPHwn5bmSLXiMZ/open-and-welcome-thread-february-2020?commentId=a9YCk3ZtpQZCDqeqR#wAHCXmnywzfhoQT9c) and [this tweet](https://twitter.com/ESYudkowsky/status/1233174331133284353) from Eliezer Yudkowsky. See also a very rambly article by an Australian newspaper: [Pandemic highlights problems with efficient-market hypothesis](https://independentaustralia.net/politics/politics-display/pandemic-highlights-problems-with-efficient-market-hypothesis,13776).

Forecasting Newsletter: May 2020.
==============
A forecasting digest with a focus on experimental forecasting. The newsletter itself is experimental, but there will be at least four more iterations. Feel free to use this post as a forecasting open thread; feedback is welcome.
* You can sign up [here](https://forecasting.substack.com).
* You can also see this post on LessWrong [here](https://www.lesswrong.com/posts/M45QmAKGJWxuuiSbQ/forecasting-newsletter-may-2020).
* And the post is archived [here](https://nunosempere.github.io/ea/ForecastingNewsletter/).
Why is this relevant to Effective Altruism?
* Some items are immediately relevant (e.g., forecasts of famine).
* Others are projects whose success I'm cheering for, and which I think have the potential to do great amounts of good (e.g., Replication Markets).
* The rest are relevant to the extent that cross-pollination of ideas is valuable.
* Forecasting may become a powerful tool for world-optimization, and EAs may want to avail themselves of this tool.
## Index
* Prediction Markets & Forecasting platforms.
* Augur.
* Coronavirus Information Markets.
* CSET: Foretell.
* Epidemic forecasting (c.o.i).
* Foretold. (c.o.i).
* /(Good Judgement?\[^\]\*)|(Superforecast(ing|er))/gi
* Metaculus.
* PredictIt & Election Betting Odds.
* Replication Markets.
* In the News.
* Grab bag.
* Negative examples.
* Long Content.
## Prediction Markets & Forecasting platforms.
### Augur: [augur.net](https://www.augur.net/)
Augur is a decentralized prediction market. [Here](https://bravenewcoin.com/insights/augur-price-analysis-token-success-hinges-on-v2-release-in-june) is a fine piece of reporting outlining how it operates and the road ahead.
### Coronavirus Information Markets: [coronainformationmarkets.com](https://coronainformationmarkets.com/)
For those who want to put their money where their mouth is, this is a prediction market for coronavirus related information.
Making forecasts is tricky, so would-be bettors might be better off pooling their forecasts together with a technical friend. As of the end of this month, the total trading volume of active markets sits at $26k+ (upwards from $8k last month), and some questions have been resolved already.
Further, according to their FAQ, participation from the US is illegal: _"Due to the US position on information markets, US citizens and residents, wherever located, and anyone physically present in the USA may not participate in accordance with our Terms."_ Nonetheless, one might take the position that the US legal framework on information markets is so dumb as to be illegitimate.
### CSET: Foretell
The [Center for Security and Emerging Technology](https://cset.georgetown.edu/) is looking for (unpaid, volunteer) forecasters to predict the future to better inform policy decisions. The idea would be that as emerging technologies pose diverse challenges, forecasters and forecasting methodologies with a good track record might be a valuable source of insight and advice to policymakers.
One can sign-up on [their webpage](https://www.cset-foretell.com/). CSET was previously funded by the [Open Philanthropy Project](https://www.openphilanthropy.org/giving/grants/georgetown-university-center-security-and-emerging-technology); the grant writeup contains some more information.
### Epidemic Forecasting: [epidemicforecasting.org](http://epidemicforecasting.org/) (c.o.i)
As part of their efforts, the Epidemic Forecasting group had a judgemental forecasting team that worked on a variety of projects; it was made up of forecasters who have done well on various platforms, including a few who were official Superforecasters.
They provided analysis and forecasts to countries and regions that needed it, and advised a vaccine company on where to locate trials with as many as 100,000 participants. I worked a fair bit on this; hopefully more will be written publicly later on about these processes.
They've also been working on a mitigation calculator, and on a dataset of COVID-19 containment and mitigation measures.
Now they're looking for a project manager to take over: see [here](https://www.lesswrong.com/posts/ecyYjptcE34qAT8Mm/job-ad-lead-an-ambitious-covid-19-forecasting-project) for the pitch and for some more information.
### Foretold: [foretold.io](https://www.foretold.io/) (c.o.i)
I personally added a distribution drawer to the [Highly Speculative Estimates](https://www.highlyspeculativeestimates.com/drawer) utility, for use within the Epidemic Forecasting forecasting efforts; the tool can be used to draw distributions and send them off to be used in Foretold. Much of the code for this was taken from Evan Ward's open-sourced [probability.dev](https://probability.dev/) tool.
### /(Good Judgement?\[^\]\*)|(Superforecast(ing|er))/gi
(The title of this section is a [regular expression](https://en.wikipedia.org/wiki/Regular_expression), so as to accept only one meaning, be maximally unambiguous, yet deal with the complicated corporate structure of Good Judgement.)
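For the curious, here is a minimal Python sketch of that pattern. This is my adaptation, not anything from the newsletter: the JavaScript-flavored `[^]*` ("any characters, including newlines") is rendered as `.*` with `re.DOTALL`, and the `gi` flags become `re.IGNORECASE`.

```python
import re

# Adapted from the section title's regex; `[^]*` is JavaScript syntax,
# approximated here with `.*` plus DOTALL.
pattern = re.compile(
    r"(Good Judgement?.*)|(Superforecast(ing|er))",
    re.IGNORECASE | re.DOTALL,
)

print(bool(pattern.search("Good Judgement Open")))    # matches
print(bool(pattern.search("superforecasting")))       # matches
print(bool(pattern.search("forecasting platforms")))  # no match
```

Note that, as written, the pattern matches the British spelling "Judgement" but not the company's own "Good Judgment".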
Good Judgement Inc. is the organization which grew out of Tetlock's research on forecasting, and out of the Good Judgement Project, which won the [IARPA ACE forecasting competition](https://en.wikipedia.org/wiki/Aggregative_Contingent_Estimation_(ACE)_Program), and resulted in the research covered in the _Superforecasting_ book.
Good Judgement Inc. also organizes the Good Judgement Open [gjopen.com](https://www.gjopen.com/), a forecasting platform open to all, with a focus on serious geopolitical questions. They structure their questions in challenges. Of the currently active questions, here is a selection of those I found interesting (probabilities below):
* [Before 1 January 2021, will the People's Liberation Army (PLA) and/or Peoples Armed Police (PAP) be mobilized in Hong Kong?](https://www.gjopen.com/questions/1499-before-1-january-2021-will-the-people-s-liberation-army-pla-and-or-people-s-armed-police-pap-be-mobilized-in-hong-kong)
* [Will the winner of the popular vote in the 2020 United States presidential election also win the electoral college?](https://www.gjopen.com/questions/1495-will-the-winner-of-the-popular-vote-in-the-2020-united-states-presidential-election-also-win-the-electoral-college) This one is interesting: splits have historically been infrequent, but two of the last five US elections were split.
* [Will Benjamin Netanyahu cease to be the prime minister of Israel before 1 January 2021?](https://www.gjopen.com/questions/1498-will-benjamin-netanyahu-cease-to-be-the-prime-minister-of-israel-before-1-january-2021). Just when I thought he was out, he pulls himself back in.
* [Before 28 July 2020, will Saudi Arabia announce the cancellation or suspension of the Hajj pilgrimage, scheduled for 28 July 2020 to 2 August 2020?](https://www.gjopen.com/questions/1621-before-28-july-2020-will-saudi-arabia-announce-the-cancellation-or-suspension-of-the-hajj-pilgrimage-scheduled-for-28-july-2020-to-2-august-2020)
* [Will formal negotiations between Russia and the United States on an extension, modification, or replacement for the New START treaty begin before 1 October 2020?](https://www.gjopen.com/questions/1551-will-formal-negotiations-between-russia-and-the-united-states-on-an-extension-modification-or-replacement-for-the-new-start-treaty-begin-before-1-october-2020)
Probabilities: 25%, 75%, 40%, 62%, 20%
On the Good Judgement Inc. side, [here](https://goodjudgment.com/covidrecovery/) is a dashboard presenting forecasts related to covid. The ones I found most worthy are:
* [When will the FDA approve a drug or biological product for the treatment of COVID-19?](https://goodjudgment.io/covid-recovery/#1384)
* [Will the US economy bounce back by Q2 2021?](https://goodjudgment.io/covid-recovery/#1373)
* [What will be the U.S. civilian unemployment rate (U3) for June 2021?](https://goodjudgment.io/covid-recovery/#1374)
* [When will enough doses of FDA-approved COVID-19 vaccine(s) to inoculate 25 million people be distributed in the United States?](https://goodjudgment.io/covid-recovery/#1363)
Otherwise, for a recent interview with Tetlock, see [this podcast](https://medium.com/conversations-with-tyler/philip-tetlock-tyler-cowen-forecasting-sociology-30401464b6d9), by Tyler Cowen.
### Metaculus: [metaculus.com](https://www.metaculus.com/)
Metaculus is a forecasting platform with an active community and lots of interesting questions. In their May pandemic newsletter, they emphasized having "all the benefits of a betting market but without the actual betting", which I found pretty funny.
This month they've organized a flurry of activities, most notably:
* [The Salk Tournament](https://pandemic.metaculus.com/questions/4093/the-salk-tournament-for-coronavirus-sars-cov-2-vaccine-rd/) on vaccine development
* [The El Paso Series](https://pandemic.metaculus.com/questions/4161/el-paso-series-supporting-covid-19-response-planning-in-a-mid-sized-city/) on collaboratively predicting peaks.
* [The Lightning Round Tournament](https://pandemic.metaculus.com/questions/4166/the-lightning-round-tournament-comparing-metaculus-forecasters-to-infectious-disease-experts/), in which Metaculus forecasters go head to head against expert epidemiologists.
* They also present a [Covid dashboard](https://pandemic.metaculus.com/COVID-19/).
### PredictIt & Election Betting Odds: [predictIt.org](https://www.predictit.org/) & [electionBettingOdds.com](http://electionbettingodds.com/)
PredictIt is a prediction platform restricted to US citizens, but also accessible with a VPN. This month, they present a map of the expected electoral college results in the US, with states colored according to market prices:
![](images/654d6212cd170b9287738a89bd6b4535248ed6e1.png)
Some of the predictions I found most interesting follow. The market probabilities can be found below; the engaged reader might want to write down their own probabilities and then compare.
* [Will Benjamin Netanyahu be prime minister of Israel on Dec. 31, 2020?](https://www.predictit.org/markets/detail/6238/Will-Benjamin-Netanyahu-be-prime-minister-of-Israel-on-Dec-31,-2020)
* [Will Trump meet with Kim Jong-Un in 2020?](https://www.predictit.org/markets/detail/6265/Will-Trump-meet-with-Kim-Jong-Un-in-2020)
* [Will Nicolás Maduro be president of Venezuela on Dec. 31, 2020?](https://www.predictit.org/markets/detail/6237/Will-Nicol%C3%A1s-Maduro-be-president-of-Venezuela-on-Dec-31,-2020)
* [Will Kim Jong-Un be Supreme Leader of North Korea on Dec. 31?](https://www.predictit.org/markets/detail/6674/Will-Kim-Jong-Un-be-Supreme-Leader-of-North-Korea-on-Dec-31)
* [Will a federal charge against Barack Obama be confirmed before November 3?](https://www.predictit.org/markets/detail/6702/Will-a-federal-charge-against-Barack-Obama-be-confirmed-before-November-3)
Some of the most questionable markets are:
* [Will Trump switch parties by Election Day 2020?](https://www.predictit.org/markets/detail/3731/Will-Trump-switch-parties-by-Election-Day-2020)
* [Will Michelle Obama run for president in 2020?](https://www.predictit.org/markets/detail/4632/Will-Michelle-Obama-run-for-president-in-2020)
* [Will Hillary Clinton run for president in 2020?](https://www.predictit.org/markets/detail/4614/Will-Hillary-Clinton-run-for-president-in-2020)
Market probabilities are: 76%, 9%, 75%, 82%, 8%, 2%, 6%, 11%.
[Election Betting Odds](https://electionbettingodds.com/) aggregates PredictIt with other such services for the US presidential elections, and also shows an election map. The creators of the webpage used its visibility to promote [ftx.com](https://ftx.com/), another platform in the area, whose webpage links to effective altruism and mentions:
> FTX was founded with the goal of donating to the world's most effective charities. FTX, its affiliates, and its employees have donated over $10m to help save lives, prevent suffering, and ensure a brighter future.
### Replication Markets: [replicationmarkets.com](https://www.replicationmarkets.com)
On Replication Markets, volunteer forecasters try to predict whether a given study's results will be replicated with high power. Rewards are monetary, but only given out to the top few forecasters, and markets suffer from sometimes being dull.
The first week of each round is a survey round, which has some aspects of a Keynesian beauty contest, because it is the results of the second round, not the ground truth, that are being forecast. The second round then tries to predict what would happen if the studies were in fact subject to a replication, which a select number of studies then undergo.
There is a part of me which dislikes this setup: there I was, during the first round, forecasting to the best of my ability, when I realized that in some cases I was going to improve the aggregate and be punished for it, particularly when I had information which I expected other market participants not to have.
At first I thought that, cunningly, the results of the first round would be used as priors for the second round, but a [programming mistake](https://www.replicationmarkets.com/index.php/2020/05/12/we-just-gave-all-our-forecasters-130-more-points/) by the organizers revealed that they use a simple algorithm: claims with p < .001 start with a prior of 80%, claims with p < .01 start at 40%, and claims with p < .05 start at 30%.
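That algorithm is simple enough to sketch in a few lines of Python. The function name is mine, and the behavior for p ≥ .05 is an assumption (the linked post only mentions the three thresholds):

```python
def starting_prior(p_value):
    """Starting market price for a claim, given the p-value of its
    original finding. Thresholds as reported by Replication Markets;
    the fallback for p >= .05 is a guess, not documented in the source."""
    if p_value < 0.001:
        return 0.80
    if p_value < 0.01:
        return 0.40
    if p_value < 0.05:
        return 0.30
    return None  # not specified in the source

print(starting_prior(0.0005))  # 0.8
print(starting_prior(0.03))    # 0.3
```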
## In The News.
Articles and announcements in more or less traditional news media.
* [Locust-tracking application for the UN](https://www.research.noaa.gov/article/ArtMID/587/ArticleID/2620/NOAA-teams-with-the-United-Nations-to-create-locust-tracking-application) (see [here](https://www.washingtonpost.com/weather/2020/05/13/east-africa-locust-forecast-tool/) for a take by the Washington Post), using software originally intended to track the movements of air pollution. NOAA also sounds like a valuable organization: "NOAA Research enables better forecasts, earlier warnings for natural disasters, and a greater understanding of the Earth. Our role is to provide unbiased science to better manage the environment, nationally, and globally."
* [United Nations: World Economic Situation and Prospects as of mid-2020](https://www.un.org/development/desa/dpad/publication/world-economic-situation-and-prospects-as-of-mid-2020/). A recent report is out, which predicts a 3.2% contraction of the global economy. Between 34 and 160 million people are expected to fall below the extreme poverty line this year. Compare with [Fitch Ratings](https://www.fitchratings.com/research/sovereigns/further-economic-forecast-cuts-global-recession-bottoming-out-26-05-2020), which foresees a 4.6% decline in global GDP.
* [Fox News](https://www.fox10phoenix.com/news/cdc-says-all-models-forecast-increase-in-covid-19-deaths-in-coming-weeks-exceeding-100k-by-june-1) and [Business Insider](https://www.businessinsider.com/cdc-forecasts-100000-coronavirus-deaths-by-june-1-2020-5?r=KINDLYSTOPTRACKINGUS) report differently about the CDC forecasting 100k deaths by June 1st.
* Some transient content on 538 about [Biden vs past democratic nominees](https://fivethirtyeight.com/features/how-does-biden-stack-up-to-past-democratic-nominees/), about [Trump vs Biden polls](https://fivethirtyeight.com/features/you-can-pay-attention-to-those-trump-vs-biden-polls-but-be-cautious/) and about [the 2020 US vice-presidential draft](https://fivethirtyeight.com/features/its-time-for-another-2020-vice-presidential-draft/), and an old [review of the impact of VP candidates in US elections](http://baseballot.blogspot.com/2012/07/politically-veepstakes-isnt-worth.html) which seems to have aged well. 538 also brings us this overview of [models with unrealistic-yet-clearly-stated assumptions](https://projects.fivethirtyeight.com/covid-forecasts/).
* [Why Economic Forecasting Is So Difficult in the Pandemic](https://hbr.org/2020/05/why-economic-forecasting-is-so-difficult-in-the-pandemic). In the Harvard Business Review, economists share their difficulties. Problems include "not knowing for sure what is going to happen", the government passing legislation uncharacteristically fast, sampling errors and reduced response rates from surveys, and lack of knowledge about epidemiology.
* [IBM releases new AI forecasting tool](https://www.ibm.com/products/planning-analytics): "IBM Planning Analytics is an AI-infused integrated planning solution that automates planning, forecasting and budgeting." See [here](https://www.channelasia.tech/article/679887/ibm-adds-ai-fuelled-forecasting-planning-analytics-platform/) or [here](https://www.cio.com/article/3544611/ibm-adds-ai-fueled-forecasting-to-planning-analytics-platform.html) for a news take.
* Yahoo has automated finance forecast reporting. It took me a while (two months) to notice that the low-quality finance articles that were popping up in my Google Alerts were machine-generated. See [Synovus Financial Corp. Earnings Missed Analyst Estimates: Here's What Analysts Are Forecasting Now](https://finance.yahoo.com/news/synovus-financial-corp-earnings-missed-152645825.html), [Wienerberger AG Earnings Missed Analyst Estimates: Here's What Analysts Are Forecasting Now](https://finance.yahoo.com/news/wienerberger-ag-earnings-missed-analyst-070545629.html), [Park Lawn Corporation Earnings Missed Analyst Estimates: Here's What Analysts Are Forecasting Now](https://news.yahoo.com/park-lawn-corporation-earnings-missed-120314826.html); they have a similar structure, paragraph by paragraph, and seem to have been generated from a template which changes a little bit depending on the data (they seem to have different templates for very positive, positive, neutral and negative change). To be clear, I could program something like this given a good finance API and a spare week/month, and in fact did so a couple of years ago for an automatic poetry generator, _but I didn't notice because I wasn't paying attention_.
* [Wimbledon organisers set to net £100 million insurance payout after taking out infectious diseases cover following 2003 SARS outbreak, with tournament now cancelled because of coronavirus](https://www.dailymail.co.uk/sport/tennis/article-8183419/amp/Wimbledon-set-net-huge-100m-insurance-payout-tournament-cancelled.html). Cheers to Wimbledon.
* [The Post ranks the top 10 faces in New York sports today](https://nypost.com/2020/05/02/the-post-ranks-the-top-10-faces-in-new-york-sports-today/), accompanied by [Pitfall to forecasting top 10 faces of New York sports right now](https://nypost.com/2020/05/03/pitfall-to-forecasting-top-10-faces-of-new-york-sports-right-now/). Comparison with the historical situation: Check. Considering alternative hypothesis: Check. Communicating uncertainty to the reader in an effective manner: Check. Putting your predictions out to be judged: Check.
* [In Forecasting Hurricane Dorian, Models Fell Short](https://www.scpr.org/news/2020/04/30/92263/in-forecasting-hurricane-dorian-models-fell-short/) (and see [here](https://www.nhc.noaa.gov/data/tcr/AL052019_Dorian.pdf) for the National Hurricane Center report). "Hurricane forecasters and the models they depend on failed to anticipate the strength and impact of last year's deadliest storm." On the topic of weather, see also [Nowcasting the weather in Africa](https://phys.org/news/2020-05-storm-chasers-life-saving.html) to reduce fatalities, and [Misunderstanding Of Coronavirus Predictions Is Eerily Similar To Weather Forecasting](https://www.forbes.com/sites/marshallshepherd/2020/05/22/misunderstanding-of-coronavirus-predictions-is-eerily-similar-to-weather-forecasting/#2f1288467f75), Forbes speculates.
* [Pan-African Heatwave Health Hazard Forecasting](http://www.walker.ac.uk/research/projects/pan-african-heatwave-health-hazard-forecasting/). "The main aim is to raise the profile of heatwaves as a hazard on a global scale. Hopefully, the project will add evidence to this sparse research area. It could also provide the basis for a heat early warning system." The project looks to be in its early stages, yet nonetheless interesting.
* [Nounós Creamery uses demand-forecasting platform to improve production process](https://www.dairyfoods.com/articles/94319-noun%C3%B3s-creamery-uses-demand-forecasting-platform-to-improve-production-process). The piece is shameless advertising, but it's still an example of predictive models used out in the wild in industry.
## Grab Bag
Podcasts, blogposts, papers, tweets and other recent nontraditional media.
* Some interesting discussion about forecasting over at Twitter, in [David Manheim](https://twitter.com/davidmanheim)'s and [Philip Tetlock](https://twitter.com/PTetlock)'s accounts, some of which have been incorporated into this newsletter. [This twitter thread](https://twitter.com/lukeprog/status/1262492767869009920) contains some discussion about how Good Judgement Open, Metaculus and expert forecasters fare against each other, but note the caveats by @LinchZhang: "For Survey 10, Metaculus said that question resolution was on 4pm ET Sunday, a lot of predictors (correctly) gauged that the data update on Sunday will be delayed and answered the letter rather than the spirit of the question (Metaculus ended up resolving it ambiguous)." [This thread](https://twitter.com/mlipsitch/status/1257857079756365824) by Marc Lipsitch has become popular, and I personally also enjoyed [these](https://twitter.com/LinchZhang/status/1262127601176334336) [two](https://twitter.com/LinchZhang/status/1261427045977874432) twitter threads by Linchuan Zhang, on forecasting mistakes.
* [SlateStarCodex](https://slatestarcodex.com/2020/04/29/predictions-for-2020/) brings us a hundred more predictions for 2020. Some analysis by Zvi Mowshowitz [here](https://www.lesswrong.com/posts/gSdZjyFSky3d34ySh/slatestarcodex-2020-predictions-buy-sell-hold) and by user [Bucky](https://www.lesswrong.com/posts/orSNNCm77LiSEBovx/2020-predictions).
* [FLI Podcast: On Superforecasting with Robert de Neufville](https://futureoflife.org/2020/04/30/on-superforecasting-with-robert-de-neufville/). I would have liked to see a more intense drilling on some of the points. It references [The NonProphets Podcast](https://nonprophetspod.wordpress.com/), which looks like it has some more in-depth stuff. Some quotes:
> So it's not clear to me that our forecasts are necessarily affecting policy. Although it's the kind of thing that gets written up in the news and who knows how much that affects people's opinions, or they talk about it at Davos and maybe those people go back and they change what they're doing.
> I wish it were used better. If I were the advisor to a president, I would say you should create a predictive intelligence unit using superforecasters. Maybe give them access to some classified information, but even using open source information, have them predict probabilities of certain kinds of things and then develop a system for using that in your decision making. But I think we're a fair ways away from that. I don't know any interest in that in the current administration.
> Now one thing I think is interesting is that often people, they're not interested in my saying, "There's a 78% chance of something happening." What they want to know is, how did I get there? What are my arguments? That's not unreasonable. I really like thinking in terms of probabilities, but I think it often helps people understand what the mechanism is because it tells them something about the world that might help them make a decision. So I think one thing that maybe can be done is not to treat it as a black box probability, but to have some kind of algorithmic transparency about our thinking because that actually helps people, might be more useful in terms of making decisions than just a number.
* [Space Weather Challenge and Forecasting Implications of Rossby Waves](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2018SW002109). Recent advances may help predict solar flares better. I don't know how bad the worst solar flare could be, and how much a two year warning could buy us, but I tend to view developments like this very positively.
* [An analogy-based method for strong convection forecasts in China using GFS forecast data](https://www.tandfonline.com/doi/full/10.1080/16742834.2020.1717329). "Times in the past when the forecast parameters are most similar to those forecast at the current time are identified by searching a large historical numerical dataset", and this is used to better predict one particular class of meteorological phenomena. See [here](https://www.eurekalert.org/pub_releases/2020-05/ioap-ata051520.php) for a press release.
* The Cato Institute releases [12 New Immigration Ideas for the 21st Century](https://www.cato.org/publications/white-paper/12-new-immigration-ideas-21st-century), including two from Robin Hanson: Choosing Immigrants through Prediction Markets & Transferable Citizenship. The first idea is to have prediction markets forecast the monetary value of taking in immigrants, and decide accordingly, then rewarding forecasters according to their accuracy in predicting e.g. how much said immigrants pay in taxes.
* [A General Approach for Predicting the Behavior of the Supreme Court of the United States](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2463244). What seems to be a pretty simple algorithm (a random forest!) seems to do pretty well (70% accuracy). Their feature set is rich but doesn't seem to include ideology. It was written in 2017; today, I'd expect that a random bright high schooler might be able to do much better.
* [From Self-Prediction to Self-Defeat: Behavioral Forecasting, Self-Fulfilling Prophecies, and the Effect of Competitive Expectations](https://pubmed.ncbi.nlm.nih.gov/14561121/). Abstract: Four studies explored behavioral forecasting and the effect of competitive expectations in the context of negotiations. Study 1 examined negotiators' forecasts of how they would behave when faced with a very competitive versus a less competitive opponent and found that negotiators believed they would become more competitive. Studies 2 and 3 examined actual behaviors during negotiation and found that negotiators who expected a very competitive opponent actually became less competitive, as evidenced by setting lower, less aggressive reservation prices, making less demanding counteroffers, and ultimately agreeing to lower negotiated outcomes. Finally, Study 4 provided a direct test of the disconnection between negotiators' forecasts for their behavior and their actual behaviors within the same sample and found systematic errors in behavioral forecasting as well as evidence for the self-fulfilling effects of possessing a competitive expectation.
* [Neuroimaging results altered by varying analysis pipelines](https://www.nature.com/articles/d41586-020-01282-z). Relevant paragraph: "the authors ran separate prediction markets, one for the analysis teams and one for researchers who did not participate in the analysis. In them, researchers attempted to predict the outcomes of the scientific analyses and received monetary payouts on the basis of how well they predicted performance. Participants — even researchers who had direct knowledge of the data set — consistently overestimated the likelihood of significant findings". Those who had more knowledge did slightly better, however.
* [Forecasting s-curves is hard](https://constancecrozier.com/2020/04/16/forecasting-s-curves-is-hard/): Some clear visualizations of what it says on the title.
* [Forecasting state expenses for budget is always a best guess](https://www.mercurynews.com/2020/05/20/letter-forecasting-state-expenses-for-budget-is-always-a-best-guess/); exactly what it says on the tin. Problem could be solved with a prediction market or forecasting tournament.
* [Fashion Trend Forecasting](https://arxiv.org/pdf/2005.03297.pdf) using Instagram and baking preexisting knowledge into NNs.
* [The advantages and limitations of forecasting](https://rwer.wordpress.com/2020/05/12/the-advantages-and-limitations-of-forecasting/). A short and sweet blog post, with a couple of forecasting anecdotes and zingers.
## Negative examples.
I have found negative examples to be useful as a mirror with which to reflect on my own mistakes; highlighting them may also be useful for shaping social norms. [Andrew Gelman](https://statmodeling.stat.columbia.edu/) continues to produce blog posts on this topic at a brisk pace. Meanwhile, amongst mortals:
* [Kelsey Piper of Vox harshly criticizes the IHME model](https://www.vox.com/future-perfect/2020/5/2/21241261/coronavirus-modeling-us-deaths-ihme-pandemic). "Some of the factors that make the IHME model unreliable at predicting the virus may have gotten people to pay attention to it;" or "Other researchers found the true deaths were outside of the 95 percent confidence interval given by the model 70 percent of the time."
* The [Washington Post](https://www.washingtonpost.com/outlook/2020/05/19/lets-check-donald-trumps-chances-getting-reelected/) offers a highly partisan view of Trump's chances of winning the election. The author, having already made a prediction in the past, and faced with conflicting perspectives from other media outlets, rejects the new information and instead comes up with more reasons confirming his initial position. Problem could be solved with a prediction market or forecasting tournament.
* [California politics pretends to be about recession forecasts](https://calmatters.org/economy/2020/05/newsom-economic-forecast-criticism-california-model-recession-budget/). See also: [Simulacra levels](https://www.lesswrong.com/posts/fEX7G2N7CtmZQ3eB5/simulacra-and-subjectivity?commentId=FgajiMrSpY9MxTS8b); the article is at least three levels removed from consideration about bare reality. Key quote, about a given forecasting model: "It's just preposterously negative... How can you say that out loud without giggling?" See also some more prediction ping-pong, this time in New Jersey, [here](https://www.njspotlight.com/2020/05/fiscal-experts-project-nj-revenue-losses-wont-be-as-bad-as-murphys-team-forecast/). Problem could be solved with a prediction market or forecasting tournament.
* [What Is the Stock Market Even for Anymore?](https://www.nytimes.com/interactive/2020/05/26/magazine/stock-market-coronavirus-pandemic.html). A New York Times reporter claims to have predicted that the market was going to fall (but can't prove it with, for example, a tweet, or a hash of a tweet), and nonetheless lost significant amounts of his own funds. ("The market dropped another 1,338 points the next day, and though my funds were tanking along with almost everyone else's, I found some empty satisfaction, at least, in my prognosticating.") The rest of the article is about said reporter being personally affronted by the market not falling further ("the stock market's shocking resilience (at least so far) has looked an awful lot like indifference to the Covid-19 crisis and the economic calamity it has brought about. The optics, as they say, are terrible.")
* [Forecasting drug utilization and expenditure: ten years of experience in Stockholm](https://bmchealthservres.biomedcentral.com/articles/10.1186/s12913-020-05170-0). A normally pretty good forecasting model had the bad luck of not foreseeing a Black Swan: the study was sent to a journal just before the pandemic, and is only being published now. They write: "According to the forecasts, the total pharmaceutical expenditure was estimated to increase between 2 and 8% annually. Our analyses showed that the accuracy of these forecasts varied over the years with a mean absolute error of 1.9 percentage points." They further conclude: "Based on the analyses of all forecasting reports produced since the model was established in Stockholm in the late 2000s, we demonstrated that it is feasible to forecast pharmaceutical expenditure with a reasonable accuracy." Presumably, COVID-19 has since sent the mean absolute error through the roof. If the author of this paper bites you, you become a Nassim Taleb.
* Some films are so bad it's funny. [This article fills the same niche](https://www.moneyweb.co.za/investing/yes-it-is-possible-to-predict-the-market/) for forecasting. It has it all: Pythagorean laws of vibration, epicycles, an old and legendary master with mystical abilities, 90 year predictions which come true. Further, from the [Wikipedia entry](https://en.wikipedia.org/wiki/William_Delbert_Gann#Controversy): "He told me that his famous father could not support his family by trading but earned his living by writing and selling instructional courses."
* [Austin Health Official Recommends Cancelling All 2020 Large Events, Despite Unclear Forecasting](https://texasscorecard.com/local/austin-health-official-recommends-cancelling-all-2020-large-events-despite-unclear-forecasting/). Texan article does not consider the perspective that one might want to cancel large events precisely _because_ of the forecasting uncertainty.
* [Auditor urges more oversight, better forecasting at the United States' Department of Transport](https://www.wral.com/coronavirus/auditor-urges-more-oversight-better-forecasting-at-dot/19106691/): "Instead of basing its spending plan on project-specific cost estimates, Wood said, the agency uses prior-year spending. That forecasting method doesn't account for cost increases or for years when there are more projects in the works." The budget of the organization is $5.9 billion. Problem could be solved with a prediction market or forecasting tournament.
## Long content
This section contains items which have recently come to my attention, but which I think might still be relevant not just this month, but throughout the years. Content in this section may not have been published in the last month.
* [How to evaluate 50% predictions](https://www.lesswrong.com/posts/DAc4iuy4D3EiNBt9B/how-to-evaluate-50-predictions). "I commonly hear (sometimes from very smart people) that 50% predictions are meaningless. I think that this is wrong."
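  One way to see the post's point (my own sketch; the `brier` function and the numbers are illustrative, not from the post): under a proper scoring rule like the Brier score, a 50% forecast is penalized and rewarded like any other probability, so it is far from meaningless; it is simply beatable when the chosen statements have a base rate away from 50%.

  ```python
  def brier(forecasts, outcomes):
      """Mean squared error between probability forecasts and 0/1 outcomes.
      Lower is better; an always-50% forecaster scores exactly 0.25."""
      return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

  # Ten statements of which only one resolved true (base rate 10%):
  outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  brier([0.5] * 10, outcomes)  # 0.25 regardless of outcomes
  brier([0.1] * 10, outcomes)  # ~0.09: matching the base rate scores better
  ```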
* [Named Distributions as Artifacts](https://blog.cerebralab.com/Named%20Distributions%20as%20Artifacts). On how the named distributions we use (the normal distribution, etc.), were selected for being easy to use in pre-computer eras, rather than on being a good ur-prior on distributions for phenomena in this universe.
* [The fallacy of placing confidence in confidence intervals](https://link.springer.com/article/10.3758/s13423-015-0947-8). On how the folk interpretation of confidence intervals can be misguided, as it conflates: a. the long-run probability, before seeing some data, that a procedure will produce an interval which contains the true value, and b. the probability that a particular interval contains the true value, after seeing the data. This is in contrast to Bayesian theory, which can use the information in the data to determine what is reasonable to believe, in light of the model assumptions and prior information. I found their example where different confidence procedures produce 50% confidence intervals which are nested inside each other particularly funny. Some quotes:
> Using the theory of confidence intervals and the support of two examples, we have shown that CIs do not have the properties that are often claimed on their behalf. Confidence interval theory was developed to solve a very constrained problem: how can one construct a procedure that produces intervals containing the true parameter a fixed proportion of the time? Claims that confidence intervals yield an index of precision, that the values within them are plausible, and that the confidence coefficient can be read as a measure of certainty that the interval contains the true value, are all fallacies and unjustified by confidence interval theory.
> “I am not at all sure that the confidence is not a confidence trick. Does it really lead us towards what we need – the chance that in the universe which we are sampling the parameter is within these certain limits? I think it does not. I think we are in the position of knowing that either an improbable event has occurred or the parameter in the population is within the limits. To balance these things we must make an estimate and form a judgment as to the likelihood of the parameter in the universe – that is, a prior probability – the very thing that is supposed to be eliminated.”
> The existence of multiple, contradictory long-run probabilities brings back into focus the confusion between what we know before the experiment with what we know after the experiment. For any of these confidence procedures, we know before the experiment that 50 % of future CIs will contain the true value. After observing the results, conditioning on a known property of the data — such as, in this case, the variance of the bubbles — can radically alter our assessment of the probability.
> “You keep using that word. I do not think it means what you think it means.” Íñigo Montoya, The Princess Bride (1987)
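The pre-data/post-data distinction above can be seen in a quick simulation (a minimal sketch; the normal model, sample size, and function name are my own illustrative choices, not from the paper): a 50% confidence *procedure* covers the true mean in about half of repeated experiments, but that long-run property says nothing about any single realized interval.

```python
import random
import statistics

def interval_contains_mean(n=20, z=0.6745, true_mean=0.0, true_sd=1.0):
    """Run one 'experiment': draw a normal sample and build a ~50%
    confidence interval for the mean (z = 0.6745 is the normal quantile
    that gives 50% coverage). Report whether it contains the true mean."""
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / n ** 0.5
    return m - half_width <= true_mean <= m + half_width

random.seed(0)
trials = 10_000
coverage = sum(interval_contains_mean() for _ in range(trials)) / trials
# coverage comes out near 0.5: a pre-data property of the procedure,
# not a post-data probability for any particular computed interval.
```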
* [Psychology of Intelligence Analysis](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/), courtesy of the American Central Intelligence Agency, seemed interesting, and I read chapters 4, 5 and 14. Sometimes forecasting looks like reinventing intelligence analysis; from that perspective, I've found this reference work useful. Thanks to EA Discord user @Willow for bringing this work to my attention.
* Chapter 4: Strategies for Analytical Judgment. Discusses and compares the strengths and weaknesses of four tactics: situational analysis (inside view), applying theory, comparison with historical situations, and immersing oneself in the data. It then brings up several suboptimal tactics for choosing among hypotheses.
* Chapter 5: When does one need more information, and in what forms does new information arrive?
> Once an experienced analyst has the minimum information necessary to make an informed judgment, obtaining additional information generally does not improve the accuracy of his or her estimates. Additional information does, however, lead the analyst to become more confident in the judgment, to the point of overconfidence.
> Experienced analysts have an imperfect understanding of what information they actually use in making judgments. They are unaware of the extent to which their judgments are determined by a few dominant factors, rather than by the systematic integration of all available information. Analysts actually use much less of the available information than they think they do.
> There is strong experimental evidence, however, that such self-insight is usually faulty. The expert perceives his or her own judgmental process, including the number of different kinds of information taken into account, as being considerably more complex than is in fact the case. Experts overestimate the importance of factors that have only a minor impact on their judgment and underestimate the extent to which their decisions are based on a few major variables. In short, people's mental models are simpler than they think, and the analyst is typically unaware not only of which variables should have the greatest influence, but also which variables actually are having the greatest influence.
* Chapter 14: A Checklist for Analysts. "Traditionally, analysts at all levels devote little attention to improving how they think. To penetrate the heart and soul of the problem of improving analysis, it is necessary to better understand, influence, and guide the mental processes of analysts themselves." The chapter also contains an intelligence analysis reading list.
* [The Limits of Prediction: An Analysts Reflections on Forecasting](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/csi-studies/studies/vol-63-no-4/Limits-of-Prediction.html), also courtesy of the American Central Intelligence Agency. On how intelligence analysts should inform their users of what they are and aren't capable of. It has some interesting tidbits and references on predicting discontinuities. It also suggests some guiding questions that the analyst may try to answer for the policymaker.
* What is the context and reality of the problem I am facing?
* How does including information on new developments affect my problem/issue?
* What are the ways this situation could play out?
* How do we get from here to there? and/or What should I be looking out for?
> "We do not claim our assessments are infallible. Instead, we assert that we offer our most deeply and objectively based and carefully considered estimates."
* [How to Measure Anything](https://www.lesswrong.com/posts/ybYBCK9D7MZCcdArB/how-to-measure-anything), a review. "Anything can be measured. If a thing can be observed in any way at all, it lends itself to some type of measurement method. No matter how “fuzzy” the measurement is, it's still a measurement if it tells you more than you knew before. And those very things most likely to be seen as immeasurable are, virtually always, solved by relatively simple measurement methods."
* The World Meteorological organization, on their mandate to guarantee that [no one is surprised by a flood](https://public.wmo.int/en/our-mandate/water/no-one-is-surprised-by-a-flood). Browsing the webpage it seems that the organization is either a Key Organization Safeguarding the Vital Interests of the World or Just Another of the Many Bureaucracies Already in Existence, but it's unclear to me how to differentiate between the two. One clue may be their recent [Caribbean workshop on impact-based forecasting and risk scenario planning](https://public.wmo.int/en/media/news/caribbean-workshop-impact-based-forecasting-and-risk-scenario-planning), with the narratively unexpected and therefore salient presence of Gender Bureaus.
* [95%-ile isn't that good](https://danluu.com/p95-skill/): "Reaching 95%-ile isn't very impressive because it's not that hard to do."
* [The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic](https://arxiv.org/abs/cond-mat/0410063): Identifying thermodynamic entropy with the Bayesian uncertainty of an ideal observer leads to problems, because as the observer observes more about the system, they update on this information, which in expectation reduces uncertainty, and thus entropy. But entropy increases with time.
* This might be interesting to students in the tradition of E.T. Jaynes: for example, the paper directly conflicts with this LessWrong post: [The Second Law of Thermodynamics, and Engines of Cognition](https://www.lesswrong.com/posts/QkX2bAkwG2EpGvNug/the-second-law-of-thermodynamics-and-engines-of-cognition), part of _Rationality, From AI to Zombies_. The way out might be to postulate that actually, the Bayesian updating process itself would increase entropy, in the form of e.g., the work needed to update bits on a computer. Any applications to Christian lore are left as an exercise for the reader. Otherwise, seeing two bright people being cogently convinced of different perspectives does something funny to my probabilities: it pushes them towards 50%, but also increases the expected time I'd have to spend on the topic to move them away from 50%.
* [Behavioral Problems of Adhering to a Decision Policy](https://pdfs.semanticscholar.org/7a79/28d5f133e4a274dcaec4d0a207daecde8068.pdf)
> Our judges in this study were eight individuals, carefully selected for their expertise as handicappers. Each judge was presented with a list of 88 variables culled from the past performance charts. He was asked to indicate which five variables out of the 88 he would wish to use when handicapping a race, if all he could have was five variables. He was then asked to indicate which 10, which 20, and which 40 he would use if 10, 20, or 40 were available to him.
> We see that accuracy was as good with five variables as it was with 10, 20, or 40. The flat curve is an average over eight subjects and is somewhat misleading. Three of the eight actually showed a decrease in accuracy with more information, two improved, and three stayed about the same. All of the handicappers became more confident in their judgments as information increased.
> ![](images/e8ac191e43364ff35bdc19361dd92c9a74e7109a.png)
* The study contains other nuggets, such as:
* An experiment on trying to predict the outcome of a given equation. When the feedback has a margin of error, this confuses respondents.
* "However, the results indicated that subjects often chose one gamble, yet stated a higher selling price for the other gamble"
* "We figured that a comparison between two students along the same dimension should be easier, cognitively, than a comparison between different dimensions, and this ease of use should lead to greater reliance on the common dimension. The data strongly confirmed this hypothesis. Dimensions were weighted more heavily when common than when they were unique attributes. Interrogation of the subjects after the experiment indicated that most did not wish to change their policies by giving more weight to common dimensions and they were unaware that they had done so."
* "The message in these experiments is that the amalgamation of different types of information and different types of values into an overall judgment is a difficult cognitive process. In our attempts to ease the strain of processing information, we often resort to judgmental strategies that do an injustice to the underlying values and policies that we're trying to implement."
* "A major problem that a decision maker faces in his attempt to be faithful to his policy is the fact that his insight into his own behavior may be inaccurate. He may not be aware of the fact that he is employing a different policy than he thinks he's using. This problem is illustrated by a study that Dan Fleissner, Scott Bauman, and I did, in which 13 stockbrokers and five graduate students served as subjects. Each subject evaluated the potential capital appreciation of 64 securities. \[...\] A mathematical model was then constructed to predict each subject's judgments. One output from the model was an index of the relative importance of each of the eight information items in determining each subject's judgments \[...\] Examination of Table 4 shows that the brokers' perceived weights did not relate closely to the weights derived from their actual judgments."
* I informally replicated this.
* As remedies they suggest to create a model by eliciting the expert, either by having the expert make a large number of judgments and distilling a model, or by asking the expert what they think the most important factors are. A third alternative suggested is computer assistance, so that the experiment participants become aware of which factors influence their judgment.
* [Immanuel Kant, on Betting](https://www.econlib.org/archives/2014/07/kant_on_betting.html)
Vale.
Conflicts of interest: Marked as (c.o.i) throughout the text.
Note to the future: All links are automatically added to the Internet Archive. In case of link rot, go [there](https://archive.org/).

Forecasting Newsletter: June 2020.
==============
## Highlights
1. Facebook launches [Forecast](https://www.forecastapp.net/), a community for crowdsourced predictions.
2. Foretell, a forecasting tournament by the Center for Security and Emerging Technology, is now [open](https://www.cset-foretell.com/).
3. [A Preliminary Look at Metaculus and Expert Forecasts](https://www.metaculus.com/news/2020/06/02/LRT/): Metaculus forecasters do better.
Sign up [here](https://forecasting.substack.com/) or browse past newsletters [here](https://nunosempere.github.io/ea/ForecastingNewsletter/)
## Index
* Highlights.
* In the News.
* Prediction Markets & Forecasting Platforms.
* Negative Examples.
* Hard to Categorize.
* Long Content.
## In the News.
* Facebook releases a forecasting app ([link to the app](https://www.forecastapp.net/), [press release](https://npe.fb.com/2020/06/23/forecast-a-community-for-crowdsourced-predictions-and-collective-insights/), [TechCrunch take](https://techcrunch.com/2020/06/23/facebook-tests-forecast-an-app-for-making-predictions-about-world-events-like-covid-19/), [hot-takes](https://cointelegraph.com/news/crypto-prediction-markets-face-competition-from-facebook-forecasts)). The release comes before Augur v2 launches, and it is easy to speculate that it might end up being combined with Facebook's stablecoin, Libra.
* The Economist has a new electoral model out ([article](https://www.economist.com/united-states/2020/06/11/meet-our-us-2020-election-forecasting-model), [model](https://projects.economist.com/us-2020-forecast/president)) which gives Trump an 11% chance of winning reelection. Given that Andrew Gelman was involved, I'm hesitant to criticize it, but it seems a tad overconfident. See [here](https://statmodeling.stat.columbia.edu/2020/06/19/forecast-betting-odds/) for Gelman addressing objections similar to my own.
* [COVID-19 vaccine before US election](https://www.aljazeera.com/ajimpact/wall-street-banking-covid-19-vaccine-election-200619204859320.html). Analysts see White House pushing through vaccine approval to bolster Trump's chances of reelection before voters head to polls. "All the datapoints we've collected make me think we're going to get a vaccine prior to the election," Jared Holz, a health-care strategist with Jefferies, said in a phone interview. The current administration is "incredibly incentivized to approve at least one of these vaccines before Nov. 3."
* ["Israeli Central Bank Forecasting Gets Real During Pandemic"](https://www.nytimes.com/reuters/2020/06/23/world/middleeast/23reuters-health-coronavirus-israel-cenbank.html). Israeli Central Bank is using data to which it has real-time access, like credit-card spending, instead of lagging indicators.
* [Google](https://www.forbes.com/sites/jeffmcmahon/2020/05/31/thanks-to-renewables-and-machine-learning-google-now-forecasts-the-wind/) produces wind schedules for wind farms. "The result has been a 20 percent increase in revenue for wind farms". See [here](https://www.pv-magazine-australia.com/2020/06/01/solar-forecasting-evolves/) for essentially the same thing on solar forecasting.
* Survey of macroeconomic researchers predicts economic recovery will take years, reports [538](https://fivethirtyeight.com/features/dont-expect-a-quick-recovery-our-survey-of-economists-says-it-will-likely-take-years/).
## Prediction Markets & Forecasting Platforms.
In subjective order of importance:
* Foretell, a forecasting tournament by the Center for Security and Emerging Technology, is now [open](https://www.cset-foretell.com/). I find the thought heartening that this might end up influencing bona-fide politicians.
* Metaculus
* posted [A Preliminary Look at Metaculus and Expert Forecasts](https://www.metaculus.com/news/2020/06/02/LRT/): Metaculus forecasters do better, and the piece is a nice reference point.
* was featured in [Forbes](https://www.forbes.com/sites/erikbirkeneder/2020/06/01/do-crowdsourced-predictions-show-the-wisdom-of-humans/#743b7e106d9d).
* announced their [Metaculus Summer Academy](https://www.metaculus.com/questions/4566/announcing-a-metaculus-academy-summer-series-for-new-forecasters/): "an introduction to forecasting for those who are relatively new to the activity and are looking for a fresh intellectual pursuit this summer"
* [Replication Markets](https://predict.replicationmarkets.com/) might add a new round with social and behavioral science claims related to COVID-19, and a preprint market, which would ask participants to forecast items like publication or citation. Replication Markets is also asking for more participants, with the catchline "If they are knowledgeable and opinionated, Replication Markets is the place to be to make your opinions really count."
* Good Judgement family
* [Good Judgement Open](https://www.gjopen.com/): Superforecasters were [able](https://www.gjopen.com/comments/1039968) to detect that Russia and the USA would in fact undertake some (albeit limited) form of negotiation, and to do so much earlier than the general public, even while posting their reasons in full view.
* Good Judgement Analytics continues to provide its [COVID-19 dashboard](https://goodjudgment.com/covidrecovery/).
* [PredictIt](https://www.predictit.org/) & [Election Betting Odds](http://electionbettingodds.com/). I stumbled upon an old 538 piece on fake polls: [Fake Polls are a Real Problem](https://fivethirtyeight.com/features/fake-polls-are-a-real-problem/). Some polls may have been conducted by PredictIt traders in order to mislead or troll other PredictIt traders; all in all, an amusing example of how prediction markets could encourage worse information.
* [An online prediction market with reputation points](https://www.lesswrong.com/posts/sLbS93Fe4MTewFme3/an-online-prediction-market-with-reputation-points), implementing an [idea](https://sideways-view.com/2019/10/27/prediction-markets-for-internet-points/) by Paul Christiano. As of yet slow to load.
* Augur:
* [An overview of the platform and of v2 modifications](https://bravenewcoin.com/insights/augur-price-analysis-v2-release-scheuled-for-june-12th).
* Augur also happens to have a [blog](https://augur.substack.com/archive) with some interesting tidbits, such as the extremely clickbaity [How One Trader Turned $400 into $400k with Political Futures](https://augur.substack.com/p/how-one-trader-turned-400-into-400k) ("I find high volume markets...like the Democratic Nominee market or the 2020 Presidential Winner market... and what I'm doing is I'm just getting in line at the buy price and waiting my turn until my orders get filled. Then when those orders get filled I just sell them for 1c more.")
* [Coronavirus Information Markets](https://coronainformationmarkets.com/) is down to ca. $12000 in trading volume; it seems like they didn't take off.
## Negative Examples.
* World powers to converge on strategies for presenting COVID-19 information to make forecasters' jobs more interesting:
* [Brazil stops releasing COVID-19 death toll and wipes data from official site](https://www.theguardian.com/world/2020/jun/07/brazil-stops-releasing-covid-19-death-toll-and-wipes-data-from-official-site).
* Meanwhile, in Russia, [St Petersburg issues 1,552 more death certificates in May than last year, but Covid-19 toll was 171](https://www.theguardian.com/world/2020/jun/04/st-petersburg-death-tally-casts-doubt-on-russian-coronavirus-figures).
* In the US, [CDC wants states to count probable coronavirus cases and deaths, but most aren't doing it](https://www.washingtonpost.com/investigations/cdc-wants-states-to-count-probable-coronavirus-cases-and-deaths-but-most-arent-doing-it/2020/06/07/4aac9a58-9d0a-11ea-b60c-3be060a4f8e1_story.html)
* [India has the fourth-highest number of COVID-19 cases, but the Government denies community transmission](https://www.abc.net.au/news/2020-06-21/india-coronavirus-fourth-highest-covid19-community-transmission/12365738)
* One suspects that this denial is political, because India is otherwise [being](https://www.maritime-executive.com/editorials/advanced-cyclone-forecasting-is-saving-thousands-of-lives) [extremely](https://economictimes.indiatimes.com/news/politics-and-nation/world-meteorological-organization-appreciates-indias-highly-accurate-cyclone-forecasting-system/articleshow/76280763.cms) [competent](https://economictimes.indiatimes.com/news/politics-and-nation/mumbai-to-get-hyperlocal-rain-outlooks-flood-forecasting-launched/articleshow/76343558.cms) in weather forecasting.
* Youyang Gu's model, widely acclaimed as one of the best coronavirus models for the US, produces 95% confidence intervals which [seem too narrow](https://twitter.com/LinchZhang/status/1270443040860106753) when extended to [Pakistan](https://covid19-projections.com/pakistan).
* Some discussion on [twitter](https://twitter.com/vidur_kapur/status/1269749592867905537): "Only a fool would put a probability on whether the EU and the UK will agree a trade deal", says Financial Times correspondent, and other examples.
## Hard to Categorize.
* [A Personal COVID-19 Postmortem](https://www.lesswrong.com/posts/B7sHnk8P8EXmpfyCZ/a-personal-interim-covid-19-postmortem), by FHI researcher [David Manheim](https://twitter.com/davidmanheim).
> I think it's important to clearly and publicly admit when we were wrong. It's even better to diagnose why, and take steps to prevent doing so again. COVID-19 is far from over, but given my early stance on a number of questions regarding COVID-19, this is my attempt at a public personal review to see where I was wrong.
* [FantasyScotus](https://fantasyscotus.net/user-predictions/case/altitude-express-inc-v-zarda/) beat [GoodJudgementOpen](https://www.gjopen.com/questions/1300-in-zarda-v-altitude-express-inc-will-the-supreme-court-rule-that-the-civil-rights-act-of-1964-prohibition-against-employment-discrimination-because-of-sex-encompasses-discrimination-based-on-an-individual-s-sexual-orientation) on legal decisions. I'm still waiting to see whether [Hollywood Stock Exchange](https://www.hsx.com/search/?action=submit_nav&keyword=Mulan&Submit.x=0&Submit.y=0) will also beat GJOpen on [film predictions](https://www.gjopen.com/questions/1608-what-will-be-the-total-domestic-box-office-gross-for-disney-s-mulan-as-of-8-september-2020-according-to-box-office-mojo).
* [How does pandemic forecasting resemble the early days of weather forecasting](https://www.foreignaffairs.com/articles/united-states/2020-06-29/how-forecast-outbreaks-and-pandemics), and what lessons can the USA learn from the latter about the former? An example would be to create an organization akin to the National Weather Center, but for pandemic forecasting.
* Linch Zhang, a COVID-19 forecaster with an excellent track record, is doing an [Ask Me Anything](https://forum.effectivealtruism.org/posts/83rHdGWy52AJpqtZw/i-m-linch-zhang-an-amateur-covid-19-forecaster-and), starting on Sunday the 7th; questions are welcome!
* [The Rules To Being A Sellside Economist](https://blogs.tslombard.com/the-rules-to-being-a-sellside-economist). A fun read.
> 5. How to get attention: If you want to get famous for making big non-consensus calls, without the danger of looking like a muppet, you should adopt the 40% rule. Basically you can forecast whatever you want with a probability of 40%. Greece to quit the euro? Maybe! Trump to fire Powell and hire his daughter as the new Fed chair? Never say never! 40% means the odds will be greater than anyone else is saying, which is why your clients need to listen to your warning, but also that they shouldn't be too surprised if, you know, the extreme event doesn't actually happen.
* [How to improve space weather forecasting](https://eos.org/research-spotlights/how-to-improve-space-weather-forecasting) (see [here](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2018SW002108#) for the original paper):
> For instance, the National Oceanic and Atmospheric Administration's Deep Space Climate Observatory (DSCOVR) satellite sits at the location in space called L1, where the gravitational pulls of Earth and the Sun cancel out. At this point, which is roughly 1.5 million kilometers from Earth, or barely 1% of the way to the Sun, detectors can provide warnings with only short lead times: about 30 minutes before a storm hits Earth in most cases or as little as 17 minutes in advance of extremely fast solar storms.
* [Coup cast](https://oefresearch.org/activities/coup-cast): A site that estimates the yearly probability of a coup. The color coding is misleading; click on the countries instead.
* [Prediction = Compression](https://www.lesswrong.com/posts/hAvGi9YAPZAnnjZNY/prediction-compression-transcript-1). "Whenever you have a prediction algorithm, you can also get a correspondingly good compression algorithm for data you already have, and vice versa."
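The prediction/compression equivalence from that post can be made concrete with a little arithmetic: an arithmetic coder driven by a probabilistic model can encode a sequence in about Σ -log2(p) bits, where p is the probability the model assigned to each observed symbol. A minimal sketch (mine, not from the post):

```python
from math import log2

def code_length_bits(probs_assigned):
    """Approximate compressed size in bits: sum of -log2(p) over the
    probabilities the model assigned to each observed symbol."""
    return sum(-log2(p) for p in probs_assigned)

# Observed sequence: 8 heads, 2 tails.
# A model that predicts heads with p = 0.8 compresses it better than
# the uniform p = 0.5 model, which always needs exactly 1 bit per flip.
good = code_length_bits([0.8] * 8 + [0.2] * 2)
uniform = code_length_bits([0.5] * 10)
assert good < uniform
```

The better predictor yields the shorter code, and vice versa: any short code implicitly defines a good predictor.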
* Other LessWrong posts which caught my attention were [Betting with Mandatory Post-Mortem](https://www.lesswrong.com/posts/AM5JiWfmbAytmBq82/betting-with-mandatory-post-mortem) and [Radical Probabilism](https://www.lesswrong.com/posts/ZM63n353vh2ag7z4p/radical-probabilism-transcript).
* [Box Office Pro](https://www.boxofficepro.com/the-art-and-science-of-box-office-forecasting/) looks at some factors around box-office forecasting.
## Long Content.
* [When the crowds aren't wise](https://hbr.org/2006/09/when-crowds-arent-wise); a sober overview, with judicious use of [Condorcet's jury theorem](https://en.wikipedia.org/wiki/Condorcet's_jury_theorem).
> Suppose that each individual in a group is more likely to be wrong than right because relatively few people in the group have access to accurate information. In that case, the likelihood that the groups majority will decide correctly falls toward zero as the size of the group increases.
> Some prediction markets fail for just this reason. They have done really badly in predicting President Bushs appointments to the Supreme Court, for example. Until roughly two hours before the official announcement, the markets were essentially ignorant of the existence of John Roberts, now the chief justice of the United States. At the close of a prominent market just one day before his nomination, “shares” in Judge Roberts were trading at $0.19—representing an estimate that Roberts had a 1.9% chance of being nominated.
> Why was the crowd so unwise? Because it had little accurate information to go on; these investors, even en masse, knew almost nothing about the internal deliberations in the Bush administration. For similar reasons, prediction markets were quite wrong in forecasting that weapons of mass destruction would be found in Iraq and that special prosecutor Patrick Fitzgerald would indict Deputy Chief of Staff Karl Rove in late 2005.
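The failure mode in that first quote is easy to check numerically: under the jury theorem, a majority vote amplifies whatever per-voter accuracy you start with, in both directions. A minimal sketch (mine, not from the article):

```python
from math import comb

def majority_correct(p, n):
    """Probability that a strict majority of n independent voters,
    each individually correct with probability p, gets the answer right.
    (Assumes n is odd, so ties cannot occur.)"""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

# With p > 0.5, the crowd converges on the truth as it grows...
assert majority_correct(0.6, 101) > 0.9
# ...but with p < 0.5, a bigger crowd converges on the wrong answer.
assert majority_correct(0.4, 101) < 0.1
assert majority_correct(0.4, 301) < majority_correct(0.4, 101)
```

This is exactly the Supreme Court case: when almost nobody has accurate information (p below one half), aggregating more traders makes the market more confidently wrong, not less.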
* [A review of Tetlocks Superforecasting (2015)](https://dominiccummings.com/2016/11/24/a-review-of-tetlocks-superforecasting-2015/), by Dominic Cummings. Cummings went on to hire one such superforecaster, who later resigned over a [culture war](https://www.bbc.com/news/uk-politics-51545541) scandal, characterized by an adversarial selection of quotes which were indeed outside the British Overton window. Notably, Dominic Cummings then told reporters to "Read Philip Tetlock's _Superforecasters_, instead of political pundits who don't know what they're talking about."
* [Assessing the Performance of Real-Time Epidemic Forecasts: A Case Study of _Ebola_ in the Western Area Region of Sierra Leone, 2014-15](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6386417/). The one caveat is that their data is much better than coronavirus data, because Ebola symptoms are more evident; otherwise, pretty interesting:
> Real-time forecasts based on mathematical models can inform critical decision-making during infectious disease outbreaks. Yet, epidemic forecasts are rarely evaluated during or after the event, and there is little guidance on the best metrics for assessment.
> ...good probabilistic calibration was achievable at short time horizons of one or two weeks ahead but model predictions were increasingly unreliable at longer forecasting horizons.
> This suggests that forecasts may have been of good enough quality to inform decision making based on predictions a few weeks ahead of time but not longer, reflecting the high level of uncertainty in the processes driving the trajectory of the epidemic.
> Comparing different versions of our model to simpler models, we further found that it would have been possible to determine the model that was most reliable at making forecasts from early on in the epidemic. This suggests that there is value in assessing forecasts, and that it should be possible to improve forecasts by checking how good they are during an ongoing epidemic.
> One forecast that gained particular attention during the epidemic was published in the summer of 2014, projecting that by early 2015 there might be 1.4 million cases. This number was based on unmitigated growth in the absence of further intervention and proved a gross overestimate, yet it was later highlighted as a “call to arms” that served to trigger the international response that helped avoid the worst-case scenario.
> Methods to assess probabilistic forecasts are now being used in other fields, but are not commonly applied in infectious disease epidemiology
> The deterministic SEIR model we used as a null model performed poorly on all forecasting scores, and failed to capture the downturn of the epidemic in Western Area.
> On the other hand, a well-calibrated mechanistic model that accounts for all relevant dynamic factors and external influences could, in principle, have been used to predict the behaviour of the epidemic reliably and precisely. Yet, lack of detailed data on transmission routes and risk factors precluded the parameterisation of such a model and are likely to do so again in future epidemics in resource-poor settings.
* In the selection of quotes above, we gave an example of a forecast which ended up overestimating the incidence, yet might have "served as a call to arms". It's maybe a real-life example of a forecast changing the true result, leading to a fixed point problem, like the ones hypothesized in the parable of the [Predict-O-Matic](https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic).
* It would be a fixed point problem if \[forecast above the alarm threshold\] → epidemic being contained, but \[forecast below the alarm threshold\] → epidemic not being contained.
* Maybe the fixed-point solution, i.e., the most self-fulfilling (and thus most accurate) forecast, would have been a forecast on the edge of the alarm threshold, which would have ended up leading to mediocre containment.
* The [troll polls](https://fivethirtyeight.com/features/fake-polls-are-a-real-problem/) created by PredictIt traders are perhaps a more clear cut example of Predict-O-Matic problems.
* [Calibration Scoring Rules for Practical Prediction Training](https://arxiv.org/abs/1808.07501). I found it most interesting when considering how Brier and log rules didn't have all the pedagogic desiderata.
* I also found the following derivation of the logarithmic scoring rule interesting. Consider: if you assign probabilities to n independent events, then the combined probability of all of them is p1 × p2 × ... × pn. Taking logarithms, this is log(p1 × p2 × ... × pn) = Σ log(pi), i.e., the logarithmic scoring rule.
* [Binary Scoring Rules that Incentivize Precision](https://arxiv.org/abs/2002.10669). The results (the closed-form of scoring rules which minimize a given forecasting error) are interesting, but the journey to get there is kind of a drag, and ultimately the logarithmic scoring rule ends up being pretty decent according to their measure of error.
* Opinion: I'm not sure whether their results are going to be useful for things I'm interested in (like human forecasting tournaments, rather than Kaggle data analysis competitions). In practice, what I might do if I wanted to incentivize precision is to ask myself if this is a question where the answer is going to be closer to 50%, or closer to either of 0% or 100%, and then use either the Brier or the logarithmic scoring rules. That is, I don't want to minimize an l-norm of the error over \[0,1\], I want to minimize an l-norm over the region I think the answer is going to be in, and the paper falls short of addressing that.
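The practical difference between the two rules discussed above is easiest to see at the extremes. A minimal sketch (mine, not from either paper) of both scores on a confident miss:

```python
from math import log

def brier(p, outcome):
    """Brier score for a binary forecast (lower is better):
    squared error of the stated probability."""
    return (p - outcome) ** 2

def log_score(p, outcome):
    """Logarithmic score (lower is better): minus the log of the
    probability assigned to the outcome that actually happened."""
    return -log(p if outcome == 1 else 1 - p)

# Near 50%, the two rules barely disagree; in the tails, the log rule
# punishes an overconfident miss far more harshly than Brier does,
# which is one reason to prefer it for near-0%/near-100% questions.
confident_miss_brier = brier(0.99, 0)    # bounded at 1
confident_miss_log = log_score(0.99, 0)  # grows without bound as p -> 1
assert confident_miss_log > confident_miss_brier
```

This matches the heuristic in the opinion above: pick the rule whose penalty profile is steepest in the region where you expect the answer to fall.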
* [How Innovation Works—A Review](https://quillette.com/2020/05/29/how-innovation-works-a-review/). The following quote stood out for me:
> Ridley points out that there have always been opponents of innovation. Such people often have an interest in maintaining the status quo but justify their objections with reference to the precautionary principle.
* [A list of prediction markets](https://docs.google.com/spreadsheets/d/1XB1GHfizNtVYTOAD_uOyBLEyl_EV7hVtDYDXLQwgT7k/edit#gid=0), and their fates, maintained by Jacob Lagerros. Like most startups, most prediction markets fail.
Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go [here](https://archive.org/)
---
> "I beseech you, in the bowels of Christ, think it possible that you may be mistaken." [Oliver Cromwell](https://en.wikipedia.org/wiki/Cromwell%27s_rule)
---
Forecasting Newsletter: July 2020.
==============
## Highlights
* Social Science Prediction Platform [launches](https://socialscienceprediction.org/).
* Ioannidis and Taleb [discuss](https://forecasters.org/blog/2020/06/14/covid-19-ioannidis-vs-taleb/) optimal response to COVID-19.
* Report tries to [foresee](https://reliefweb.int/report/world/forecasting-dividends-conflict-prevention-2020-2030) the (potentially quite high) dividends of conflict prevention from 2020 to 2030.
## Index
* Highlights.
* Prediction Markets & Forecasting Platforms.
* New undertakings.
* Negative Examples.
* News & Hard to Categorize Content.
* Long Content.
Sign up [here](https://forecasting.substack.com) or browse past newsletters [here](https://nunosempere.github.io/ea/ForecastingNewsletter/).
## Prediction Markets & Forecasting Platforms.
Ordered in subjective order of importance:
* Metaculus continues hosting great discussion.
* In particular, it has recently hosted some high-quality [AI questions](https://www.metaculus.com/questions/?search=cat:computing--ai).
* User @alexrjl, a moderator on the platform, [offers on the EA forum](https://forum.effectivealtruism.org/posts/5udsgcnK5Cii2vA9L/what-questions-would-you-like-to-see-forecasts-on-from-the) to operationalize questions and post them on Metaculus, for free. This hasn't been picked up by the EA Forum algorithms, but the offer seems to me to be quite valuable. Some examples of things you might want to see operationalized and forecasted: the funding your organization will receive in 2020, whether any particularly key bills will become law, whether GiveWell will change their top charities, etc.
* [Foretell](https://www.cset-foretell.com/) is a prediction market by the University of Georgetown's Center for Security and Emerging Technology, focused on questions relevant to technology-security policy, and on bringing those forecasts to policy-makers.
* Some EAs, such as myself or a mysterious user named _foretold_, feature on the top spots of their (admittedly quite young) leaderboard.
* I also have the opportunity to create a team on the site: if you have a proven track record and would be interested in joining such a team, get in touch before the 10th of August.
* [Replication Markets](https://predict.replicationmarkets.com/)
* published their [first paper](https://royalsocietypublishing.org/doi/10.1098/rsos.200566)
* had some difficulties with cheaters:
> "The Team at Replication Markets is delaying announcing the Round 8 Survey winners because of an investigation into coordinated forecasting among a group of participants. As a result, eleven accounts have been suspended and their data has been excluded from the study. Scores are being recalculated and prize announcements will go out soon."
* Because of how Replication Markets are structured, I'm betting the cheating was by manipulating the Keynesian beauty contest in a [Predict-O-Matic](https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic) fashion. That is, cheaters could have coordinated to output something surprising during the Keynesian Beauty Contest round, and then make that surprising thing come to happen during the market trading round. Charles Twardy, principal investigator at Replication Markets, gives a more positive take on the Keynesian beauty contest aspects of Replication Markets [here](https://www.lesswrong.com/posts/M45QmAKGJWxuuiSbQ/forecasting-newsletter-may-2020?commentId=ckyk8AiiWuaqoy3dN).
* still have Round 10 open until the 3rd of August.
* At the Good Judgement family, Good Judgement Analytics continues to provide its [COVID-19 dashboard](https://goodjudgment.com/covidrecovery/).
> "Modeling is a very good way to explain how a virus will move through an unconstrained herd. But when you begin to put in constraints" — mask mandates, stay-at-home orders, social distancing — "and then the herd has agency whether they're going to comply, at that point, human forecasters who are very smart and have read through the models, that's where they really begin to add value." — Marc Koehler, Vice President of Good Judgement, Inc., in a [recent interview](https://builtin.com/data-science/superforecasters-good-judgement)
* [Highly Speculative Estimates](https://www.highlyspeculativeestimates.com/dist-builder), an interface, library and syntax to produce distributional probabilistic estimates led by Ozzie Gooen, now accepts functions as part of its input, such that more complicated inputs like the following are now possible:
```
# Variable: Number of ice creams an unsupervised child has consumed,
# when left alone in an ice cream shop.
# Current time (hours passed)
t=10
# Scenario with lots of uncertainty
w_1 = 0.75 ## Weight for this scenario.
min_uncertain(t) = t*2
max_uncertain(t) = t*20
# Optimistic scenario
w_2 = 0.25 ## Weight for the optimistic scenario
min_optimistic(t) = 1*t
max_optimistic(t) = 3*t
mean(t) = (min_optimistic(t) + max_optimistic(t))/2
stdev(t) = t*(2)^(1/2)
# Overall guess
## A long-tailed lognormal for the uncertain scenario
## and a tight normal for the optimistic scenario
mm(min_uncertain(t) to max_uncertain(t), normal(mean(t), stdev(t)), [w_1, w_2])
## Compare with: mm(2 to 20, normal(2, 1.4142), [0.75, 0.25])
```
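For readers without access to the tool, the final `mm` mixture line can be approximated by plain Monte Carlo sampling. A rough Python sketch, under my own assumption (not stated in the newsletter) that `a to b` denotes a lognormal with a 90% interval of [a, b]:

```python
import random
import statistics
from math import log, exp

def lognormal_90ci(low, high):
    """One draw from a lognormal whose 90% interval is [low, high]
    (my assumed reading of the 'a to b' syntax)."""
    mu = (log(low) + log(high)) / 2
    sigma = (log(high) - log(low)) / (2 * 1.645)  # 90% interval spans +/- 1.645 sd
    return exp(random.gauss(mu, sigma))

def mixture_sample():
    """One draw from mm(2 to 20, normal(2, 1.4142), [0.75, 0.25]),
    i.e. the comparison line above with t = 1."""
    if random.random() < 0.75:
        return lognormal_90ci(2, 20)   # long-tailed uncertain scenario
    return random.gauss(2, 1.4142)     # tight optimistic scenario

random.seed(0)
draws = [mixture_sample() for _ in range(100_000)]
mean_estimate = statistics.mean(draws)
```

The weights pick which scenario each sample comes from, so the overall distribution is bimodal-ish: a tight bump near 2 plus a long right tail.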
* [PredictIt](https://www.predictit.org/) & [Election Betting Odds](http://electionbettingodds.com/) each give Biden odds of around 60%.
* See [Limits of Current US Prediction Markets (PredictIt Case Study)](https://www.lesswrong.com/posts/c3iQryHA4tnAvPZEv/limits-of-current-us-prediction-markets-predictit-case-study), on how spread, transaction fees, withdrawal fees, interest rate which one could otherwise be earning, taxes, and betting limits make it so that:
> "Current prediction markets are so bad in so many different ways that it simply is not surprising for people to know better than them, and it often is not possible for people to make money from knowing better."
* [Augur](https://www.augur.net/), a betting platform built on top of Ethereum, launches v2. Here are [two](https://bravenewcoin.com/insights/augur-price-analysis-v2-release-scheuled-for-june-12th) [overviews](https://www.coindesk.com/5-years-after-launch-predictions-market-platform-augur-releases-version-2) of the platform and of the v2 modifications.
### New undertakings
* [Announcing the Launch](http://evavivalt.com/2020/07/announcing-the-launch-of-the-social-science-prediction-platform) of the [Social Science Prediction Platform](https://socialscienceprediction.org/), a platform aimed at collecting and popularizing predictions of research results, in order to improve social science; see [this Science article](https://science.sciencemag.org/content/366/6464/428.full) for the background motivation:
> A new result builds on the consensus, or lack thereof, in an area and is often evaluated for how surprising, or not, it is. In turn, the novel result will lead to an updating of views. Yet we do not have a systematic procedure to capture the scientific views prior to a study, nor the updating that takes place afterward. What did people predict the study would find? How would knowing this result affect the prediction of findings of future, related studies?
> A second benefit of collecting predictions is that they \[...\] can also potentially help to mitigate publication bias. However, if priors are collected before carrying out a study, the results can be compared to the average expert prediction, rather than to the null hypothesis of no effect. This would allow researchers to confirm that some results were unexpected, potentially making them more interesting and informative, because they indicate rejection of a prior held by the research community; this could contribute to alleviating publication bias against null results.
> A third benefit of collecting predictions systematically is that it makes it possible to improve the accuracy of predictions. In turn, this may help with experimental design.
* On the one hand, I could imagine this having an impact, and the enthusiasm of the founders is contagious. On the other hand, as a forecaster I don't feel enticed by the platform: they offer a $25 reward to grad students (which I am not), and don't spell it out for me why I would want to forecast on their platform as opposed to on [all](http://metaculus.com/) [the](https://www.gjopen.com/) [other](https://replicationmarkets.com/) [alternatives](https://www.cset-foretell.com/) [available](https://thepipelineproject.org) [to](https://www.augur.net/) [me](https://www.predictit.org/), even accounting for altruistic impact.
* [Ought](https://www.lesswrong.com/posts/SmDziGM9hBjW9DKmf/2019-ai-alignment-literature-review-and-charity-comparison#Ought) is a research lab building tools to delegate open-ended reasoning to AI & ML systems.
* Since concluding their initial factored cognition experiments in 2019, theyve been building tools to capture and automate the reasoning process in forecasting: [Ergo](https://github.com/oughtinc/ergo), a library for integrating model-based and judgmental forecasting, and [Elicit](https://elicit.ought.org), a tool built on top of Ergo to help forecasters express and share distributions.
* Theyve recently run small-scale tests exploring amplification and delegation of forecasting, such as: [Amplify Rohins Prediction on AGI researchers & Safety Concerns](https://www.lesswrong.com/posts/Azqmzp5JoXJihMcr4/competition-amplify-rohin-s-prediction-on-agi-researchers), [Amplified forecasting: What will Bucks informed prediction of compute used in the largest ML training run before 2030 be?](https://www.metaculus.com/questions/4732/amplified-forecasting-what-will-bucks-informed-prediction-of-compute-used-in-the-largest-ml-training-run-before-2030-be/), and [Delegate a Forecast](https://forum.effectivealtruism.org/posts/GKnXGiobbg5PFikzJ/delegate-a-forecast).
* See also [Amplifying generalist research via forecasting](https://forum.effectivealtruism.org/posts/ZTXKHayPexA6uSZqE/part-2-amplifying-generalist-research-via-forecasting), previous work in a similar direction which was also inspired by Paul Christiano's Iterated Distillation and Amplification agenda.
* In addition to studying factored cognition in the forecasting context, they are broadly interested in whether the EA community could benefit from better forecasting tools: they can be reached at [team@ought.org](mailto:team@ought.org) if you want to give them feedback or discuss their work.
* [The Pipeline Project](https://thepipelineproject.org) is a project similar to Replication Markets, by some of the same authors, to find out whether people can predict whether a given study will replicate. They offer authorship in an appendix, as well as a chance to get a token monetary compensation.
* [USAID's Intelligent Forecasting: A Competition to Model Future Contraceptive Use](https://competitions4dev.org/forecastingprize/). "First, we will award up to 25,000 USD in prizes to innovators who develop an intelligent forecasting model—using the data we provide and methods such as artificial intelligence (AI)—to predict the consumption of contraceptives over three months. If implemented, the model should improve the availability of contraceptives and family planning supplies at health service delivery sites throughout a nationwide healthcare system. Second, we will award a Field Implementation Grant of approximately 100,000 to 200,000 USD to customize and test a high-performing intelligent forecasting model in Côte dIvoire."
* [Omen](https://omen.eth.link) is another cryptocurrency-based prediction market, which seems to use the same front-end (and probably back-end) as [Corona Information Markets](https://coronainformationmarkets.com/). It's unclear what their advantages with respect to Augur are.
* [Yngve Høiseth](https://github.com/yhoiseth/python-prediction-scorer) releases a prediction scorer, based on his previous work on Empiricast. In Python, but also available as a [REST](https://stackoverflow.com/questions/671118/what-exactly-is-restful-programming?rq=1) [API](https://predictionscorer.herokuapp.com/docs#/default/brier_score_v1_rules_brier_score__probability__get)
## Negative Examples.
* The International Energy Agency had terrible forecasts on solar photo-voltaic energy production, until [recently](https://pv-magazine-usa.com/2020/07/12/has-the-international-energy-agency-finally-improved-at-forecasting-solar-growth/):
> ![](images/7244132c6380f86a5fc5327b5c6abb70e741097a.jpg)
> ...Its a scenario assuming current policies are kept and no new policies are added.
> ...the discrepancy basically implies that every year loads of unplanned subsidies are added... So it boils down to: its not a forecast and any error you find must be attributed to that. And no you cannot see how the model works.
> The IEA website explains the WEO process: “The detailed projections are generated by the World Energy Model, a large-scale simulation tool, developed at the IEA over a period of more than 20 years that is designed to replicate how energy markets function.”
## News & Hard to Categorize Content.
* [Budget credibility of subnational forecasts](http://www.levyinstitute.org/publications/budget-credibility-of-subnational-governments-analyzing-the-fiscal-forecasting-errors-of-28-states-in-india).
> Budget credibility, or the ability of governments to accurately forecast macro-fiscal variables, is crucial for effective public finance management. Fiscal marksmanship analysis captures the extent of errors in the budgetary forecasting... Partitioning the sources of errors, we identified that the errors were more broadly random than due to systematic bias, except for a few crucial macro-fiscal variables where improving the forecasting techniques can provide better estimates.
* See also: [How accurate are \[US\] agencies procurement forecasts?](https://federalnewsnetwork.com/contracting/2020/07/how-accurate-are-agencies-procurement-forecasts/) and [Forecasting Inflation in a Data-Rich Environment: The Benefits of Machine Learning Methods](https://www.tandfonline.com/doi/full/10.1080/07350015.2019.1637745) (which finds random forests a hard to beat approach)
* [Bloomberg on the IMF's track record on forecasting](https://www.bloomberg.com/graphics/2019-imf-forecasts/) ([archive link, without a paywall](http://archive.is/hj0CG)).
> A Bloomberg analysis of more than 3,200 same-year country forecasts published each spring since 1999 found a wide variation in the direction and magnitude of errors. In 6.1 percent of cases, the IMF was within a 0.1 percentage-point margin of error. The rest of the time, its forecasts underestimated GDP growth in 56 percent of cases and overestimated it in 44 percent. The average forecast miss, regardless of direction, was 2.0 percentage points, but obscures a notable difference between the average 1.3 percentage-point error for advanced economies compared with 2.1 percentage points for more volatile and harder-to-model developing economies. Since the financial crisis, however, the IMFs forecast accuracy seems to have improved, as growth numbers have generally fallen.
> Banking and sovereign debt panics hit Greece, Ireland, Portugal and Cyprus to varying degrees, threatening the integrity of the euro area and requiring emergency intervention from multinational authorities. During this period, the IMF wasnt merely forecasting what would happen to these countries but also setting the terms. It provided billions in bailout loans in exchange for implementation of strict austerity measures and other policies, often bitterly opposed by the countries citizens and politicians.
* I keep seeing evidence that Trump will lose reelection, but I don't know how seriously to take it, because I don't know how filtered it is.
* For example, the [The Economist's model](https://projects.economist.com/us-2020-forecast/president) forecasts 91% that Biden will win the upcoming USA elections. Should I update somewhat towards Biden winning after seeing it? What if I suspect that it's the most extreme model, and that it has come to my attention because of that fact? What if I suspect that it's the most extreme model which will predict a democratic win? What if there was another equally reputable model which predicts 91% for Trump, but which I never got to see because of information filter dynamics?
* [The Primary Model](http://primarymodel.com/) confirmed my suspicions of filter dynamics. It "does not use presidential approval or the state of the economy as predictors. Instead it relies on the performance of the presidential nominees in primaries", and on how many terms the party has controlled the White House. The model has been developed by an [otherwise unremarkable](https://en.wikipedia.org/wiki/Helmut_Norpoth) professor of political science at New York's Stony Brook University, and has done well in previous election cycles. It assigns 91% to Trump winning reelection.
* [Forecasting at Uber: An Introduction](https://eng.uber.com/forecasting-introduction/). Uber forecasts demand so that they know amongst other things, when and where to direct their vehicles. Because of the challenges to testing and comparing forecasting frameworks at scale, they developed their own software for this.
* [Forecasting Sales In These Uncertain Times](https://www.forbes.com/sites/billconerly/2020/07/02/forecasting-sales-in-these-uncertain-times).
> \[...\] a company selling to lower-income consumers might use the monthly employment report for the U.S. to see how people with just a high school education are doing finding jobs. A business selling luxury goods might monitor the stock market.
* [Unilever Chief Supply Officer on forecasting](https://www.supplychaindive.com/news/unilever-csco-agility-forecasting-coronavirus/581323/): "Agility does trump forecasting. At the end of the day, every dollar we spent on agility has probably got a 10x return on every dollar spent on forecasting or scenario planning."
> An emphasis on agility over forecasting meant shortening planning cycles — the company reduced its planning horizon from 13 weeks to four. The weekly planning meeting became a daily meeting. Existing demand baselines and even artificial intelligence programs no longer applied as consumer spending and production capacity strayed farther from historical trends.
* [An updated introduction to prediction markets](https://daily.jstor.org/how-accurate-are-prediction-markets/), yet one which contains some nuggets I didn't know about.
> This bias toward favorable outcomes... appears for a wide variety of negative events, including diseases such as cancer, natural disasters such as earthquakes and a host of other events ranging from unwanted pregnancies and radon contamination to the end of a romantic relationship. It also emerges, albeit less strongly, for positive events, such as graduating from college, getting married and having favorable medical outcomes.
> Nancy Reagan hired an astrologer, Joan Quigley, to screen Ronald Reagans schedule of public appearances according to his horoscope, allegedly in an effort to avoid assassination attempts.
> Google, Yahoo!, Hewlett-Packard, Eli Lilly, Intel, Microsoft, and France Telecom have all used internal prediction markets to ask their employees about the likely success of new drugs, new products, future sales.
> Although prediction markets can work well, they dont always. IEM, PredictIt, and the other online markets were wrong about Brexit, and they were wrong about Trumps win in 2016. As the Harvard Law Review points out, they were also wrong about finding weapons of mass destruction in Iraq in 2003, and the nomination of John Roberts to the U.S. Supreme Court in 2005. There are also plenty of examples of small groups reinforcing each others moderate views to reach an extreme position, otherwise known as groupthink, a theory devised by Yale psychologist Irving Janis and used to explain the Bay of Pigs invasion.
> although thoughtful traders should ultimately drive the price, that doesnt always happen. The \[prediction\] markets are also no less prone to being caught in an information bubble than British investors in the South Sea Company in 1720 or speculators during the tulip mania of the Dutch Republic in 1637.
* [Food Supply Forecasting Company gets $12 million in Series A funding](https://techcrunch.com/2020/07/15/crisp-the-platform-for-demand-forecasting-the-food-supply-chain-gets-12-million-in-funding/)
## Long Content.
* [Michael Story](https://twitter.com/MWStory/status/1281904682378629120), "Jotting down things I learned from being a superforecaster."
> Small teams of smart, focused and rational generalists can absolutely smash big well-resourced institutions at knowledge production, for the same reasons startups can beat big rich incumbent businesses
> There's a _lot_ more to making predictive accuracy work in practice than winning a forecasting tournament. Competitions are about daily fractional updating, long lead times and exhaustive pre-forecast research on questions especially chosen for competitive suitability
> Real life forecasting often requires fast turnaround times, fuzzy questions, and difficult-to-define answers with unclear resolution criteria. In a competition, a question with ambiguous resolution is thrown out, but in a crisis it might be the most important work you do
* Lukas Gloor on [takeaways from Covid forecasting on Metaculus](https://forum.effectivealtruism.org/posts/xwG5MGWsMosBo6u4A/lukas_gloor-s-shortform?commentId=ZNgmZ7qvbQpy394kG)
* [Ambiguity aversion](https://en.wikipedia.org/wiki/Ambiguity_aversion). "Better the devil you know than the devil you don't."
> An ambiguity-averse individual would rather choose an alternative where the probability distribution of the outcomes is known over one where the probabilities are unknown. This behavior was first introduced through the [Ellsberg paradox](https://en.wikipedia.org/wiki/Ellsberg_paradox) (people prefer to bet on the outcome of an urn with 50 red and 50 blue balls rather than to bet on one with 100 total balls but for which the number of blue or red balls is unknown).
* Gregory Lewis: [Use uncertainty instead of imprecision](https://forum.effectivealtruism.org/posts/m65R6pAAvd99BNEZL/use-resilience-instead-of-imprecision-to-communicate).
> If your best guess for X is 0.37, but you're very uncertain, you still shouldn't replace it with an imprecise approximation (e.g. "roughly 0.4", "fairly unlikely"), as this removes information. It is better to offer your precise estimate, alongside some estimate of its resilience, either subjectively ("0.37, but if I thought about it for an hour I'd expect to go up or down by a factor of 2"), or objectively ("0.37, but I think the standard error for my guess to be ~0.1").
* [Expert Forecasting with and without Uncertainty Quantification and Weighting: What Do the Data Say?](https://www.rff.org/publications/journal-articles/expert-forecasting-and-without-uncertainty-quantification-and-weighting-what-do-data-say/): "it's better to combine expert uncertainties (e.g. 90% confidence intervals) than to combine their point forecasts, and it's better still to combine expert uncertainties based on their past performance."
* See also a [1969 paper](https://www.jstor.org/stable/pdf/3008764.pdf) by future Nobel Prize winner Clive Granger: "Two separate sets of forecasts of airline passenger data have been combined to form a composite set of forecasts. The main conclusion is that the composite set of forecasts can yield lower mean-square error than either of the original forecasts. Past errors of each of the original forecasts are used to determine the weights to attach to these two original forecasts in forming the combined forecasts, and different methods of deriving these weights are examined".
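Granger's combination scheme can be sketched in a few lines: weight each forecaster by the inverse of its past mean squared error. The data and the specific weighting rule below are illustrative (the paper examines several ways of deriving the weights), not a reproduction of the paper's method.

```python
import math
import random

random.seed(0)
n = 2000
truth = [math.sin(0.01 * t) for t in range(n)]

# Two imperfect forecasters with different noise levels.
f1 = [x + random.gauss(0, 0.5) for x in truth]
f2 = [x + random.gauss(0, 1.0) for x in truth]

def mse(forecast, actual):
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)

# Derive weights from past errors (first half), evaluate on the second half.
half = n // 2
m1, m2 = mse(f1[:half], truth[:half]), mse(f2[:half], truth[:half])
w1 = (1 / m1) / (1 / m1 + 1 / m2)  # inverse-MSE weight on forecaster 1

combined = [w1 * a + (1 - w1) * b for a, b in zip(f1[half:], f2[half:])]
print(w1, mse(combined, truth[half:]), mse(f1[half:], truth[half:]))
```

With independent errors the combined forecast's mean squared error comes in below that of the better individual forecaster, which is Granger's main conclusion.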
* [How to build your own weather forecasting model](https://www.yachtingmonthly.com/sailing-skills/how-to-build-your-own-weather-forcecast-73104). Sailors realize that weather forecasts are often corrupted by outside considerations (e.g., a reported 50% chance of rain doesn't happen 50% of the time), and search for better sources. One such source is the original, raw data used to generate weather forecasts: GRIB files (Gridded Information in Binary), which lack interpretation. But these have their own pitfalls, which sailors must learn to take into account. For example, GRIB files only take into account wind speed, not tidal acceleration, which can cause a significant increase in apparent wind.
> Forecasts are inherently political, says Dashew. They are the result of people perhaps getting it wrong at some point, so there are pressures to interpret them in a different or more conservative way very often. These pressures change all the time, so forecasts are often subject to outside factors.
> Singleton says he understands how pressures on forecasters can lead to this opinion being formed: in my days at the Met Office, when the Shipping Forecast used to work under me, they always said they try to tell it like it is, and they do not try to make it sound worse.
* [Forecasting the dividends of conflict prevention from 2020-2030](https://reliefweb.int/report/world/forecasting-dividends-conflict-prevention-2020-2030). The study quantifies the dynamics of conflict, building a transition matrix between different states (peace, high risk, negative peace, war, and recovery) and validating it against a historical dataset; the authors find (concurring with the previous literature) that countries have a tendency to fall into cycles of conflict. They conclude that changing this transition matrix would have a very high impact. Warning: extensive quoting follows.
> Notwithstanding the mandate of the United Nations to promote peace and security, many member states are still sceptical about the dividends of conflict prevention. Their diplomats argue that it is hard to justify investments without being able to show its tangible returns to decision-makers and taxpayers. As a result, support for conflict prevention is halting and uneven, and governments and international agencies end up spending enormous sums in stability and peace support operations after-the-fact.
> This study considers the trajectories of armed conflict in a 'business-as-usual' scenario between 2020-2030. Specifically, it draws on a comprehensive historical dataset to determine the number of countries that might experience rising levels of collective violence, outright armed conflict, and their associated economic costs. It then simulates alternative outcomes if conflict prevention measures were 25%, 50%, and 75% more effective. As with all projections, the quality of the projections relies on the integrity of the underlying data. The study reviews several limitations of the analysis, and underlines the importance of a cautious interpretation of the findings.
> If current trends persist and no additional conflict prevention action is taken above the current baseline, then it is expected that there will be three more countries at war and nine more countries at high risk of war by 2030 as compared to 2020. This translates into roughly 677,250 conflict-related fatalities (civilian and battle-deaths) between the present and 2030. By contrast, under our most pessimistic scenario, a 25% increase in effectiveness of conflict prevention would result in 10 more countries at peace by 2030, 109,000 fewer fatalities over the next decade and savings of over $3.1 trillion. A 50% improvement would result in 17 additional countries at peace by 2030, 205,000 fewer deaths by 2030, and some $6.6 trillion in savings.
> Meanwhile, under our most optimistic scenario, a 75% improvement in prevention would result in 23 more countries at peace by 2030, resulting in 291,000 lives saved over the next decade and $9.8 trillion in savings. These scenarios are approximations, yet demonstrate concrete and defensible estimates of both the benefits (saved lives, displacement avoided, declining peacekeeping deployments) and cost-effectiveness of prevention (recovery aid, peacekeeping expenditures). Wars are costly and the avoidance of “conflict traps” could save the economy trillions of dollars by 2030 under the most optimistic scenarios. The bottom line is that comparatively modest investments in prevention can yield lasting effects by avoiding compounding costs of lost life, peacekeeping, and aid used for humanitarian response and rebuilding rather than development. The longer conflict prevention is delayed, the more expensive responses to conflict become.
> In order to estimate the dividends of conflict prevention we analyze violence dynamics in over 190 countries over the period 1994 to 2017, a time period for which most data was available for most countries. Drawing on 12 risk variables, the model examines the likelihood that a war will occur in a country in the following year and we estimate (through linear, fixed effects regressions) the average cost of war (and other states, described below) on 8 dependent variables, including loss of life, displacement, peacekeeping deployments and expenditures, overseas aid and economic growth. The estimates confirm that, by far, the most costly state for a country to be in is war, and the probability of a country succumbing to war in the next year is based on its current state and the frequency of other countries with similar states having entered war in the past.
> At the core of the model (and results) is the reality that countries tend to get stuck in so-called violence and conflict traps. A well-established finding in the conflict studies field is that once a country experiences an armed conflict, it is very likely to relapse into conflict or violence within a few years. Furthermore, countries likely to experience war share some common warning signs, which we refer to as “flags” (up to 12 flags can be raised to signal risk). Not all countries that enter armed conflict raise the same warning flags, but the warning flags are nevertheless a good indication that a country is at high risk. These effects create vicious cycles that result in high risk, war and frequent relapse into conflict. Multiple forms of prevention are necessary to break these cycles. The model captures the vicious cycle of conflict traps, through introducing five states and a transition matrix based on historical data (see Table 1). First, we assume that a country is in one of five 'states' in any given year. These states are at "Peace", "High Risk", "Negative Peace", "War" and "Recovery" (each state is described further below). Drawing on historical data, the model assesses the probability of a country transitioning to another state in a given year (a transition matrix).
> For example, if a state was at High Risk in the last year, it has a 19.3% chance of transitioning to Peace, a 71.4% chance of staying High Risk, a 7.6% chance of entering Negative Peace and a 1.7% chance of entering War the following year.
> By contrast, high risk states are designated by the raising of up to 12 flags. These include: 1) high scores by Amnesty International's annual human rights reports (source: Political Terror Scale), 2) the US State Department annual reports (source: Political Terror Scale), 3) civilian fatalities as a percentage of population (source: ACLED), 4) political events per year (source: ACLED) 5) events attributed to the proliferation of non-state actors (source: ACLED), 6) battle deaths (source: UCDP), 7) deaths by terrorism (source: GTD), 8) high levels of crime (source: UNODC), 9) high levels of prison population (source: UNODC), 10) economic growth shocks (source: World Bank), 11) doubling of displacement in a year (source: IDMC), and 12) doubling of refugees in a year (source: UNHCR). Countries with two or more flags fall into the "high risk" category. Using these flags, a majority of countries have been at high risk for one or more years from 1994 to 2017, so it is easier to give examples of countries that have not been at high risk.
> Negative peace states are defined by combined scores from Amnesty International and the US State Department. Countries in negative peace are more than five times as likely to enter high risk in the following year than peace (26.8% vs. 4.1%).
> A country that is at war is one that falls into a higher threshold of collective violence, relative to the size of the population. Specifically, it is designated as such if one or more of the following conditions are met: above 0.04 battle deaths or .04 civilian fatalities per 100,000 according to UCDP and ACLED, respectively, or coding of genocide by the Political Instability Task Force Worldwide Atrocities Dataset. Countries experiencing five or more years of war between 1994 and 2017 included Afghanistan, Somalia, Sudan, Iraq, Burundi, Central African Republic, Sri Lanka, DR Congo, Uganda, Chad, Colombia, Israel, Lebanon, Liberia, Yemen, Algeria, Angola, Sierra Leone, South Sudan, Eritrea and Libya.
> Lastly, recovery is a period of stability that follows from war. A country is only determined to be recovering if it is not at war and was recently in a war. Any country that exits in the war state is immediately coded as being in recovery for the following five years, unless it relapses into war. The duration of the recovery period (five years) is informed by the work of Paul Collier et al, but is robust also to sensitivity tests around varying recovery lengths.
> The model does not allow for countries to be high risk and in recovery in the same year, but there is ample evidence that countries that are leaving a war state are at a substantially higher risk of experiencing war recurrence, contributing to the conflict trap described earlier. Countries are twice as likely to enter high risk or negative peace coming out of recovery as they are to enter peace, and 10.2% of countries in recovery relapse into war every year. When a country has passed the five year threshold without reverting to war, it can move back to states of peace, negative peace or high risk.
> The transition matrix underlines the very real risk of countries falling into a 'conflict trap'. Specifically, a country that is in a state of war has a very high likelihood of staying in this condition in the next year (72.6%) and just a 27.4% chance of transitioning to recovery. Once in recovery, a country has a 10.2% chance of relapse every year, suggesting only a 58% chance (1-10.2%)^5 that a country will not relapse over five years.
> As Collier and others have observed, countries are often caught in prolonged and vicious cycles of war and recovery (conflict traps), often unable to escape into a new, more peaceful (or less war-like) state
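The arithmetic behind the quoted figures is easy to check. Here is a minimal sketch using only the transition probabilities reported in the excerpts above; the study's full transition matrix (its Table 1) is not reproduced here.

```python
# Transition probabilities for a country currently at High Risk,
# as reported in the study (one row of its transition matrix).
high_risk_row = {"Peace": 0.193, "High Risk": 0.714,
                 "Negative Peace": 0.076, "War": 0.017}
assert abs(sum(high_risk_row.values()) - 1.0) < 1e-9  # a row must sum to 1

# A country at War stays at War with probability 0.726,
# and otherwise transitions to Recovery.
war_row = {"War": 0.726, "Recovery": 0.274}

# While in Recovery, the annual relapse probability is 10.2%; the chance of
# completing the five-year recovery window without relapsing is therefore:
relapse = 0.102
no_relapse_5y = (1 - relapse) ** 5
print(round(no_relapse_5y, 2))  # 0.58, the study's "only a 58% chance"
```

The "conflict trap" falls out of these numbers: the most likely next state for a country at war is more war, and even a country that exits war faces a roughly 42% chance of relapsing before its five-year recovery window closes.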
* War is expensive. So is being at high risk of war.
> Of course, the loss of life, displacement, and accumulated misery associated with war should be reason enough to invest in prevention, but there are also massive economic benefits from successful prevention. Foremost, the countries at war avoid the costly years in conflict, with growth rates 4.8% lower than countries at peace. They also avoid years of recovery and the risk of relapse into conflict. Where prevention works, conflict-driven humanitarian needs are reduced, and the international community avoids peacekeeping deployments and additional aid burdens, which are sizable.
> Conclusion: The world can be significantly better off by addressing the high risk of destructive violence and war with focused efforts at prevention in countries at high risk and those in negative peace. This group of countries has historically been at risk of higher conflict due to violence against civilians, proliferation of armed groups, abuses of human rights, forced displacement, high homicide, and incidence of terror. None of this is surprising. Policymakers know that war is bad for humans and other living things. What is staggering is the annual costs of war that we will continue to pay in 2030 through inaction today: conceivably trillions of dollars of economic growth, and the associated costs of this for human security and development, are being swept off the table by the decisions made today to ignore prevention.
* [COVID-19: Ioannidis vs. Taleb](https://forecasters.org/blog/2020/06/14/covid-19-ioannidis-vs-taleb/)
> On the one hand, Nassim Taleb has clearly expressed that measures to stop the spread of the pandemic must be taken as soon as possible: instead of looking at data, it is the nature of a pandemic with a possibility of devastating human impact that should drive our decisions.
> On the other hand, John Ioannidis acknowledges the difficulty in having good data and of producing accurate forecasts, while believing that eventually any information that can be extracted from such data and forecasts should still be useful, e.g. to having targeted lockdowns (in space, time, and considering the varying risk for different segments of the population).
* [Taleb](https://forecasters.org/blog/2020/06/14/on-single-point-forecasts-for-fat-tailed-variables/): _On single point forecasts for fat tailed variables_. Leitmotiv: Pandemics are fat-tailed.
> ![](images/d263195904a7942604599ff703fcb71f28d0a156.png) ![](images/860ccc6875dd7044a884708cd8c34c6bb3d70506.png)
> We do not need more evidence under fat tailed distributions — it is there in the properties themselves (properties for which we have ample evidence) and these clearly represent risk that must be killed in the egg (when it is still cheap to do so). Secondly, unreliable data — or any source of uncertainty — should make us follow the most paranoid route. \[...\] more uncertainty in a system makes precautionary decisions very easy to make (if I am uncertain about the skills of the pilot, I get off the plane).
> Random variables in the power law class with tail exponent α ≤ 1 are, simply, not forecastable. They do not obey the \[Law of Large Numbers\]. But we can still understand their properties.
> As a matter of fact, owing to preasymptotic properties, a heuristic is to consider variables with up to α ≤ 5/2 as not forecastable — the mean will be too unstable and requires way too much data for it to be possible to do so in reasonable time. It takes 10^14 observations for a “Pareto 80/20” (the most commonly referred to probability distribution, that is with α ≈ 1.13) for the average thus obtained to emulate the significance of a Gaussian with only 30 observations.
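The instability Taleb describes is easy to see by simulation: draw from a Pareto distribution with tail exponent α ≈ 1.13 and watch the sample mean wander. This is a sketch using inverse-CDF sampling with an assumed scale of x_min = 1.

```python
import random

random.seed(1)
ALPHA = 1.13  # tail exponent of the "Pareto 80/20"
X_MIN = 1.0

def pareto_draw():
    # Inverse-CDF sampling: X = x_min * U^(-1/alpha), with U uniform on (0, 1].
    u = 1.0 - random.random()
    return X_MIN * u ** (-1.0 / ALPHA)

true_mean = ALPHA / (ALPHA - 1) * X_MIN  # ≈ 8.69, but convergence is glacial
for n in (100, 10_000, 1_000_000):
    mean = sum(pareto_draw() for _ in range(n)) / n
    print(n, round(mean, 2))
# The sample mean is dominated by rare huge draws; it jumps around across
# seeds even at a million observations, which is Taleb's point about the
# Law of Large Numbers working "too slowly" in this regime.
```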
* [Ioannidis](https://forecasters.org/blog/2020/06/14/forecasting-for-covid-19-has-failed/): _Forecasting for COVID-19 has failed_. Leitmotiv: "Investment should be made in the collection, cleaning and curation of data".
> Predictions for hospital and ICU bed requirements were also entirely misinforming. Public leaders trusted models (sometimes even black boxes without disclosed methodology) inferring massively overwhelmed health care capacity (Table 1) \[3\]. However, eventually very few hospitals were stressed, for a couple of weeks. Most hospitals maintained largely empty wards, waiting for tsunamis that never came. The general population was locked and placed in horror-alert to save the health system from collapsing. Tragically, many health systems faced major adverse consequences, not by COVID-19 cases overload, but for very different reasons. Patients with heart attacks avoided visiting hospitals for care \[4\], important treatments (e.g. for cancer) were unjustifiably delayed \[5\], mental health suffered \[6\]. With damaged operations, many hospitals started losing personnel, reducing capacity to face future crises (e.g. a second wave). With massive new unemployment, more people may lose health insurance. The prospects of starvation and of lack of control for other infectious diseases (like tuberculosis, malaria, and childhood communicable diseases for which vaccination is hindered by the COVID-19 measures) are dire...
> The core evidence to support “flatten-the-curve” efforts was based on observational data from the 1918 Spanish flu pandemic on 43 US cities. These data are >100 years old, of questionable quality, unadjusted for confounders, based on ecological reasoning, and pertaining to an entirely different (influenza) pathogen that had ~100-fold higher infection fatality rate than SARS-CoV-2. Even thus, the impact on reduction on total deaths was of borderline significance and very small (10-20% relative risk reduction); conversely many models have assumed 25-fold reduction in deaths (e.g. from 510,000 deaths to 20,000 deaths in the Imperial College model) with adopted measures
> Despite these obvious failures, epidemic forecasting continued to thrive, perhaps because vastly erroneous predictions typically lacked serious consequences. Actually, erroneous predictions may have been even useful. A wrong, doomsday prediction may incentivize people towards better personal hygiene. Problems start when public leaders take (wrong) predictions too seriously, considering them crystal balls without understanding their uncertainty and the assumptions made. Slaughtering millions of animals in 2001 aggravated a few animal business stakeholders; most citizens were not directly affected. However, with COVID-19, espoused wrong predictions can devastate billions of people in terms of the economy, health, and societal turmoil at-large.
> Cirillo and Taleb thoughtfully argue \[14\] that when it comes to contagious risk, we should take doomsday predictions seriously: major epidemics follow a fat-tail pattern and extreme value theory becomes relevant. Examining 72 major epidemics recorded through history, they demonstrate a fat-tailed mortality impact. However, they analyze only the 72 most noticed outbreaks, a sample with astounding selection bias. The most famous outbreaks in human history are preferentially selected from the extreme tail of the distribution of all outbreaks. Tens of millions of outbreaks with a couple deaths must have happened throughout time. Probably hundreds of thousands might have claimed dozens of fatalities. Thousands of outbreaks might have exceeded 1,000 fatalities. Most eluded the historical record. The four garden variety coronaviruses may be causing such outbreaks every year \[15,16\]. One of them, OC43 seems to have been introduced in humans as recently as 1890, probably causing a “bad influenza year” with over a million deaths \[17\]. Based on what we know now, SARS-CoV-2 may be closer to OC43 than SARS-CoV-1. This does not mean it is not serious: its initial human introduction can be highly lethal, unless we protect those at risk.
* The (British) Royal Economic Society presents a panel on [What is a scenario, projection and a forecast - how good or useful are they particularly now?](https://www.youtube.com/watch?v=2SUBlUINIqI). The start seems promising: "My professional engagement with economic and fiscal forecasting was first as a consumer, and then a producer. I spent a decade happily mocking other people's efforts, as a journalist, since when I've spent two decades helping colleagues to construct forecasts and to try to explain them to the public." The first speaker, which corresponds to the first ten minutes, is worth listening to; the rest varies in quality.
> You have to construct the forecast and explain it in a way that's fit for that purpose
* I liked the following taxonomy of the distinct targets that the agency the first speaker works for aims to hit with its forecasts:
1. as an input into the policy-making process,
2. as a transparent assessment of public finances,
3. as a prediction of whether the government will meet whatever fiscal rules it has set itself,
4. as a baseline against which to judge the significance of further news,
5. as a challenge to other agencies "to keep the bastards honest".
* The limitations were interesting as well:
1. they require us to produce a forecast that's conditioned on current government policy, even if we and everyone else expect that policy to change; that of course makes it hard to benchmark their performance against counterparts who produce unconditional forecasts.
2. the forecasts have to be explainable; a black-box model might be more accurate but less useful.
3. they require detailed discussion of the individual forecast lines and clear diagnostics to explain changes from one forecast to the next, precisely to reassure people that those changes aren't politically motivated or tainted; the forecast is as much about delivering transparency and accountability as about demonstrating predictive prowess.
4. the forecast numbers have to be accompanied by a comprehensible narrative of what is going on in the economy and the public finances, and what impact policy will have. Parliament and the public need to be able to engage with the forecast; they couldn't justify their predictions simply with an appeal to a statistical black box, and the Chancellor certainly couldn't justify significant policy positions that way.
> "horses for courses, the way you do the forecast, the way you present it depends on what you're trying to achieve with it"
> "People use scenario forecasting in a very informal manner, which I think could be problematic, because it's very difficult to find out what the assumptions are, and whether those assumptions and the models and the laws can be validated"
> Linear models are state independent, but it's not the same to receive a shock when the economy is in an upswing as when it is in a recession.
* Some situations are too complicated to forecast, so one conditions on some variables being known, or following a given path, and then studies the rest, calling the output a "scenario."
> One week delay in intervention by the government makes a big difference to the height of the \[covid-19\] curve.
> I don't think it's easy to follow the old way of doing things. I'm sorry, I have to be honest with you. I spent 4 months just thinking about this problem and you need to integrate a model of the social behavior and how you deal with the risk to health and to economy in these models. But unfortunately, by the time we do that it won't be relevant.
> It amuses me to look at weather forecasts because economists don't have that kind of technology, those kind of resources.
---
Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go [here](https://archive.org/)
---
Forecasting Newsletter: August 2020.
==============
## Highlights
538 releases its [model](https://projects.fivethirtyeight.com/2020-election-forecast/) of the US elections; Trump is predicted to win ~30% of the time.
[Study](https://link.springer.com/article/10.1007%2Fs10654-020-00669-6) offers an instructive comparison of New York covid models, finding that for the IHME model, reported death counts fell inside the 95% prediction intervals only 53% of the time.
Biggest decentralized trial [to date](https://blog.kleros.io/kleros-community-update-july-2020/#case-302-the-largest-decentralized-trial-of-all-time), with 511 jurors asked to adjudicate a case coming from the Omen prediction market: "Will there be a day with at least 1000 reported corona deaths in the US in the first 14 days of July?"
## Index
* Highlights
* Prediction Markets & Forecasting Platforms
* In The News
* Hard To Categorize
* Long Content
Sign up [here](https://forecasting.substack.com/) or browse past newsletters [here](https://forum.effectivealtruism.org/s/HXtZvHqsKwtAYP6Y7).
## Prediction Markets & Forecasting Platforms
On [PredictIt](https://predictit.org/), presidential election prices are close to [even odds](https://www.predictit.org/markets/detail/3698), with Biden at 55 and Trump at 48.
Good Judgement Inc. continues providing their [dashboard](https://goodjudgment.io/covid-recovery/), and the difference between the probability assigned by superforecasters to a Biden win (~75%) and the one offered by [betfair](https://www.betfair.com/sport/politics) (~55%) was enough to make it worthwhile for me to place a small bet. At some point, Good Judgement Inc. and Cultivate Labs started a new platform on the domain [covidimpacts.com](https://www.covidimpacts.com), but forecasts there seem weaker than on Good Judgement Open.
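The reasoning behind that small bet can be made explicit. Taking the ~75% superforecaster probability as one's own and backing Biden at the ~55% market-implied price, the expected value and the Kelly-optimal stake work out as follows (a sketch: the 75% and 55% come from the text above, and the calculation ignores exchange commission and the chance that the superforecasters are simply wrong):

```python
p = 0.75            # probability you believe (superforecaster aggregate)
market = 0.55       # market-implied probability (the price)
b = 1 / market - 1  # net winnings per unit staked at those odds

ev_per_unit = p * b - (1 - p)  # expected profit per unit staked
kelly = ev_per_unit / b        # Kelly-optimal fraction of bankroll

print(round(ev_per_unit, 3), round(kelly, 3))  # 0.364 0.444
```

Full Kelly (~44% of bankroll here) is famously aggressive given estimation error, which is one reason to place a *small* bet rather than the "optimal" one.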
[Replication Markets](https://www.replicationmarkets.com/) started their COVID-19 round, and created a page with COVID-19 [resources for forecasters](https://www.replicationmarkets.com/index.php/frequently-asked-questions/resources-for-forecasters/).
Nothing much to say about [Metaculus](https://www.metaculus.com/questions/) this month, but I appreciated their previously existing list of [prediction resources](https://www.metaculus.com/help/prediction-resources/).
[Foretell](https://www.cset-foretell.com) has a [blog](https://www.cset-foretell.com/blog), and hosted a forecasting forum which discussed
* metricizing the grand. That is, decomposing and operationalizing big-picture questions into smaller ones, which can then be forecasted.
* operationalizing these big picture questions might also help identify disagreements, which might then either be about the indicators, proxies or subquestions chosen, or about the probabilities given to the subquestions.
* sometimes we can't measure what we care about, or we don't care about what we can measure.
* one might be interested in questions about the future 3 to 7 years from now, but questions which ask about events 3 to 15 months in the future (which forecasting tournaments can predict better) can still provide useful signposts.
Meanwhile, ethereum-based prediction markets such as Omen or Augur are experiencing difficulties because of the rise of, and speculative excitement around, decentralized finance (DeFi). That excitement has increased gas prices (transaction fees) to the point that making a casual prediction is for now too costly.
## In The News
[Forecasting the future of philanthropy](https://www.fastcompany.com/90532945/forecasting-the-future-of-philanthropy). The [American Lebanese Syrian Associated Charities](https://en.wikipedia.org/wiki/American_Lebanese_Syrian_Associated_Charities) is the largest healthcare-related charity in the United States; its mission is to fund the [St. Jude Children's Research Hospital](https://en.wikipedia.org/wiki/St._Jude_Children%27s_Research_Hospital). To do this, it employs aggressive fundraising tactics, which have been modified throughout the current pandemic.
[Case 302: the Largest Decentralized Trial of All Time](https://blog.kleros.io/kleros-community-update-july-2020/#case-302-the-largest-decentralized-trial-of-all-time). Kleros is a decentralized dispute resolution platform. "In July, Kleros had its largest trial ever where 511 jurors were drawn in the General Court to adjudicate a case coming from the Omen prediction market: Will there be a day with at least 1000 reported Corona death in the US in the first 14 days of July?" [Link to the case](https://court.kleros.io/cases/302)
[ExxonMobil Slashing Permian Rig Count, Forecasting Global Oil Glut Extending Well into 2021](https://www.naturalgasintel.com/exxonmobil-slashing-permian-rig-count-forecasting-global-oil-glut-extending-well-into-2021/). My own interpretation is that the gargantuan multinational's decision is an honest signal of an expected extended economic downturn.
> Supply is expected to exceed demand for months, “and we anticipate it will be well into 2021 before the overhang is cleared and we returned to pre-pandemic levels,” Senior Vice President Neil Chapman said Friday during a conference call.
> “Simply put, the demand destruction in the second quarter was unprecedented in the history of modern oil markets. To put it in context, absolute demand fell to levels we haven't seen in nearly 20 years. We've never seen a decline with this magnitude and pace before, even relative to the historic periods of demand volatility following the global financial crisis and as far back as the 1970s oil and energy crisis.”
> Even so, ExxonMobil's Permian rig count is to be sharply lower than it was a year ago. The company had more than 50 rigs running across its Texas-New Mexico stronghold as of last fall. At the end of June it was down to 30, “and we expect to cut that number by at least half again by the end of this year,” Chapman said.
[Google Cloud AI and Harvard Global Health Institute Collaborate on new COVID-19 forecasting model](https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-is-releasing-the-covid-19-public-forecasts).
[Betting markets](https://smarkets.com/event/40554343/politics/uk/brexit/trade-deals) put [UK-EU trade deal in 2020 at 66%](https://sports.yahoo.com/betting-odds-put-ukeu-trade-deal-in-2020-at-66-095009521.html) (now 44%).
[Experimental flood forecasting system didn't help](https://www.hindustantimes.com/mumbai-news/flood-forecasting-system-didn-t-help/story-mJanM39kxJPOvFma6TeqUM.html) in Mumbai. The system was supposed to provide a three-day advance warning, but didn't.
FiveThirtyEight covers various facets of the USA elections: [Biden Is Polling Better Than Clinton At Her Peak](https://fivethirtyeight.com/features/biden-is-polling-better-than-clinton-at-her-peak/), and releases [their model](https://fivethirtyeight.com/features/how-fivethirtyeights-2020-presidential-forecast-works-and-whats-different-because-of-covid-19/), along with some [comments about it](https://fivethirtyeight.com/features/our-election-forecast-didnt-say-what-i-thought-it-would/)
In other news, this newsletter reached 200 subscribers last week.
## Hard to Categorize
[Groundhog day](https://en.wikipedia.org/wiki/Groundhog_Day) is a tradition in which American crowds pretend to believe that a small rat has oracular powers.
[Tips](https://politicalpredictionmarkets.com/blog/) for forecasting on PredictIt. These include betting against Trump voters who arrive at PredictIt from Breitbart.
Linch Zhang asks [What are some low-information priors that you find practically useful for thinking about the world?](https://forum.effectivealtruism.org/posts/SBbwzovWbghLJixPn/what-are-some-low-information-priors-that-you-find)
[AstraZeneca looking for a Forecasting Director](https://careers.astrazeneca.com/job/wilmington/forecasting-director-us-renal/7684/16951921) (US-based).
[Genetic Engineering Attribution Challenge](https://www.drivendata.org/competitions/63/genetic-engineering-attribution/).
NSF-funded tournament looking to compare human forecasters with a random forest ML model from Johns Hopkins at forecasting the success probability of cancer drug trials. More info [here](https://www.fandm.edu/magazine/magazine-issues/spring-summer-2020/spring-summer-2020-articles/2020/06/10/is-there-a-better-way-to-predict-the-future), and one can sign up [here](https://www.pytho.io/human-forest). I've heard rewards are generous, but they don't seem to be specified on the webpage. Kudos to Joshua Monrad.
Results of an [expert forecasting session](https://twitter.com/juan_cambeiro/status/1291153289879392257) on covid, presented by expert forecaster Juan Cambeiro.
A playlist of [podcasts related to forecasting](https://open.spotify.com/playlist/4LKES4QcFNozmwImjHWrBX?si=twuBPF-fSxejbpMwUToatg). Kudos to Michał Dubrawski.
## Long Content
[A case study in model failure? COVID-19 daily deaths and ICU bed utilization predictions in New York state](https://link.springer.com/article/10.1007%2Fs10654-020-00669-6) and commentary: [Individual model forecasts can be misleading, but together they are useful](https://link.springer.com/article/10.1007/s10654-020-00667-8).
> In this issue, Chin et al. compare the accuracy of four high profile models that, early during the outbreak in the US, aimed to make quantitative predictions about deaths and Intensive Care Unit (ICU) bed utilization in New York. They find that all four models, though different in approach, failed not only to accurately predict the number of deaths and ICU utilization but also to describe uncertainty appropriately, particularly during the critical early phase of the epidemic. While overcoming these methodological challenges is key, Chin et al. also call for systemic advances including improving data quality, evaluating forecasts in real-time before policy use, and developing multi-model approaches.
> But what the model comparison by Chin et al. highlights is an important principle that many in the research community have understood for some time: that no single model should be used by policy makers to respond to a rapidly changing, highly uncertain epidemic, regardless of the institution or modeling group from which it comes. Due to the multiple uncertainties described above, even models using the same underlying data often have results that diverge because they have made different but reasonable assumptions about highly uncertain epidemiological parameters, and/or they use different methods
> ...the rapid deployment of this approach requires pre-existing infrastructure and evaluation systems now and for improved response to future epidemics. Many models that are built to forecast on a scale useful for local decision making are complex, and can take considerable time to build and calibrate
> a group with a history of successful influenza forecasting in the US (Los Alamos National Lab (4)) was able to produce early COVID-19 forecasts and had the best coverage of uncertainty in the Chin et al. analysis (80-100% of observations fell within the 95% prediction interval for most forecasts). In contrast, the new Institute for Health Metrics and Evaluation statistical approach had low reliability; after the latest analyzed revision only 53% of reported death counts fell within the 95% prediction intervals.
> The original IHME model underestimates uncertainty and 45.7% of the predictions (over 1- to 14-step-ahead predictions) made over the period March 24 to March 31 are outside the 95% PIs. In the revised model, for forecasts from April 3 to May 3 the uncertainty bounds are enlarged, and most predictions (74.0%) are within the 95% PIs, which is not surprising given the PIs are in the order of 300 to 2000 daily deaths. Yet, even with this major revision, the claimed nominal coverage of 95% well exceeds the actual coverage. On May 4, the IHME model undergoes another major revision, and the uncertainty is again dramatically reduced with the result that 47.4% of the actual daily deaths fall outside the 95% PIs—well beyond the claimed 5% nominal value.
> the LANL model was the only model that was found to approach the 95% nominal coverage, but unfortunately this model was unavailable at the time Governor Cuomo needed to make major policy decisions in late March 2020.
> Models that are consistently poorly performing should carry less weight in shaping policy considerations. Models may be revised in the process, trying to improve performance. However, improvement of performance against retrospective data offers no guarantee for continued improvement in future predictions. Failed and recast models should not be given much weight in decision making until they have achieved a prospective track record that can instill some trust for their accuracy. Even then, real time evaluation should continue, since a model that performed well for a given period of time may fail to keep up under new circumstances.
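The coverage figures quoted above are simple to reproduce in principle: count how often observed values fall inside the stated intervals. A minimal sketch (the numbers are invented for illustration, not taken from the Chin et al. data):

```python
import numpy as np

def interval_coverage(actuals, lower, upper):
    """Fraction of observed values that fall inside the prediction intervals."""
    actuals, lower, upper = map(np.asarray, (actuals, lower, upper))
    inside = (actuals >= lower) & (actuals <= upper)
    return inside.mean()

# Toy data: observed daily deaths vs. a hypothetical model's 95% PIs.
actuals = [120, 150, 180, 210, 260, 300, 340]
lower   = [100, 130, 140, 150, 160, 170, 180]
upper   = [140, 170, 200, 230, 260, 290, 320]

coverage = interval_coverage(actuals, lower, upper)
print(f"Empirical 95% PI coverage: {coverage:.0%}")  # 5 of 7 points fall inside
```

A well-calibrated 95% interval should cover about 95% of observations over many forecasts; values far below that, as in the IHME case above, indicate underestimated uncertainty.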
[Do Prediction Markets Produce Well-Calibrated Probability Forecasts?](https://academic.oup.com/ej/article-abstract/123/568/491/5079498).
> Abstract: This article presents new theoretical and empirical evidence on the forecasting ability of prediction markets. We develop a model that predicts that the time until expiration of a prediction market should negatively affect the accuracy of prices as a forecasting tool in the direction of a favourite/longshot bias. That is, high-likelihood events are underpriced, and low-likelihood events are overpriced. We confirm this result using a large data set of prediction market transaction prices. Prediction markets are reasonably well calibrated when time to expiration is relatively short, but prices are significantly biased for events farther in the future. When time value of money is considered, the miscalibration can be exploited to earn excess returns only when the trader has a relatively low discount rate.
> We confirm this prediction using a data set of actual prediction market prices from 1,787 markets, representing a total of more than 500,000 transactions.
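The calibration analysis the abstract describes can be sketched by binning transaction prices (read as implied probabilities) and comparing each bin's mean price with the empirical frequency of the event. The data below is synthetic, deliberately constructed to exhibit a favourite/longshot bias; it is not the paper's dataset:

```python
import numpy as np

def calibration_curve(prices, outcomes, n_bins=10):
    """Bin market prices and compare each bin's mean implied probability with
    the empirical frequency of the event. A favourite/longshot bias shows up
    as frequencies above prices in high bins and below prices in low bins."""
    prices, outcomes = np.asarray(prices, float), np.asarray(outcomes, float)
    bins = np.clip((prices * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((prices[mask].mean(), outcomes[mask].mean(), mask.sum()))
    return rows  # (mean implied probability, empirical frequency, count) per bin

# Synthetic markets: prices compressed toward 0.5 relative to true probabilities.
rng = np.random.default_rng(0)
true_p = rng.uniform(0, 1, 5000)
outcomes = rng.random(5000) < true_p
prices = 0.1 + 0.8 * true_p
for price, freq, n in calibration_curve(prices, outcomes):
    print(f"price≈{price:.2f}  freq≈{freq:.2f}  n={n}")
```

In this construction, low-price bins resolve "yes" less often than their price implies (longshots overpriced) and high-price bins more often (favourites underpriced), which is the bias the paper documents for markets far from expiration.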
Paul Christiano on [learning the Prior](https://ai-alignment.com/learning-the-prior-48f61b445c04) and on [better priors as a safety problem](https://ai-alignment.com/better-priors-as-a-safety-problem-24aa1c300710).
A presentation of [radical probabilism](https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1), a theory of probability which relaxes some assumptions of classical Bayesian reasoning.
[Forecasting Thread: AI timelines](https://www.lesswrong.com/posts/hQysqfSEzciRazx8k/forecasting-thread-ai-timelines), which asks for (quantitative) forecasts until human-machine parity. Some of the answers seem insane or suspicious, in that they have very narrow tails, sharp spikes, and don't really update on the fact that other people disagree with them.
---
Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go [there](https://archive.org/) and input the dead link.
---
> _We hope that people will pressure each other into operationalizing their \[big picture outlooks\]. If we have no way of proving you wrong, we have no way of proving you right. We need falsifiable forecasts._
> Source: Foretell Forecasting Forum. Inexact quote.
---
Forecasting Newsletter: September 2020.
==============
## Highlights
* Red Cross and Red Crescent societies have been trying out [forecast based financing](https://www.forecast-based-financing.org/our-projects/what-can-go-wrong/), where funds are released before a potential disaster happens based on forecasts thereof.
* Andrew Gelman releases [Information, incentives, and goals in election forecasts](http://www.stat.columbia.edu/~gelman/research/unpublished/forecast_incentives3.pdf); 538's 80% political predictions turn out to have happened [88% of the time](https://projects.fivethirtyeight.com/checking-our-work/).
* Nonprofit Ought organizes a [forecasting thread on existential risk](https://www.lesswrong.com/posts/6x9rJbx9bmGsxXWEj/forecasting-thread-existential-risk-1), where participants display and discuss their probability distributions for existential risk.
## Index
* Highlights
* Prediction Markets & Forecasting Platforms
* In The News
* Hard To Categorize
* Long Content
Sign up [here](https://forecasting.substack.com/) or browse past newsletters [here](https://forum.effectivealtruism.org/s/HXtZvHqsKwtAYP6Y7).
## Prediction Markets & Forecasting Platforms
Metaculus updated their [track record page](https://www.metaculus.com/questions/track-record/). You can now look at accuracy over time, at the distribution of Brier scores, and at a calibration graph. They also have a new black swan question: [When will US Metaculus users face an emigration crisis?](https://www.metaculus.com/questions/5287/when-will-america-have-an-emigration-crisis/).
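For reference, the Brier score that such track-record pages plot is just the mean squared difference between a probabilistic forecast and the 0/1 outcome:

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    0 is a perfect score; always answering 50% scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Three resolved yes/no questions, forecast at 90%, 70%, and 20%.
print(brier_score([0.9, 0.7, 0.2], [1, 1, 0]))  # (0.01 + 0.09 + 0.04) / 3 ≈ 0.047
```

Lower is better, and the score rewards both calibration and resolution: confident correct forecasts score near 0, while confident wrong ones are heavily penalized.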
Good Judgement Open has a [thread](https://www.gjopen.com/questions/1779-are-there-any-forecasting-tips-tricks-and-experiences-you-would-like-to-share-and-or-discuss-with-your-fellow-forecasters) in which forecasters share and discuss tips, tricks and experiences. An account is needed to browse it.
[Augur](https://www.augur.net/blog/amm-para-augur/) announces modifications in response to higher ETH prices. Some unfiltered comments [on reddit](https://www.reddit.com/r/ethfinance/comments/ixhy3j/daily_general_discussion_september_22_2020/g68yra6/?context=3).
An overview of [PlotX](https://blockonomi.com/plotx-guide/), a new decentralized prediction protocol/marketplace. PlotX focuses on non-subjective markets that can be programmatically determined, like the exchange rate between currencies or tokens.
A Replication Markets participant wrote [What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers](https://fantasticanachronism.com/2020/09/11/whats-wrong-with-social-science-and-how-to-fix-it/). See also: [An old long-form introduction to Replication Markets](https://www.adamlgreen.com/replication-markets/).
Georgetown's CSET is attempting to use forecasting to influence policy. A seminar discussing their approach, [Using Crowd Forecasting to Inform Policy with Jason Matheny](https://georgetown.zoom.us/webinar/register/WN_nlXO7sQdSYyYBqhnzkh3hg), is scheduled for the 19th of October. But their current forecasting tournament, foretell, isn't yet very well populated, and the aggregate isn't that good because participants don't update all that often, leading to sometimes clearly outdated aggregates. Perhaps because of this relative lack of competition, my team is in 2nd place at the time of this writing (with myself at #6, Eli Lifland at #12 and Misha Yagudin at #21). You can join foretell [here](https://www.cset-foretell.com/).
There is a new contest on Hypermind, [The Long Fork Project](https://prod.hypermind.com/longfork/en/welcome.html), which aims to predict the impact of a Trump or a Biden victory in November, with $20k in prize money. H/t to user [ChickCounterfly](https://www.lesswrong.com/posts/hRsRgRcRk3zHLPpqm/forecasting-newsletter-august-2020?commentId=8gAKasi8w5v64QpbM).
The University of Chicago's Effective Altruism group is hosting a forecasting tournament between all interested EA college groups starting October 12th, 2020. More details [here](https://forum.effectivealtruism.org/posts/rePMmgXLwdSuk5Edg/ea-uni-group-forecasting-tournament)
## In the News
News media sensationalizes essentially random fluctuations in US election odds caused by big bettors entering prediction markets such as Betfair, where bets on the order of $50k can visibly alter the market price. Meanwhile, polls/models and prediction market odds have diverged, because a substantial fraction of bettors lend credence to the thesis that polls will be biased as in the previous election, even though polling firms seem to have improved their methods.
* [Trump overtakes Biden as favorite to win in November: Betfair Exchange](https://www.reuters.com/article/us-usa-elections-bets-idUSKBN25T1L6)
* [US Election: Polls defy Trump's comeback narrative but will the market react?](https://betting.betfair.com/politics/us-politics/us-election-tips-and-odds-polls-defy-trumps-comeback-narrative-but-will-the-market-react-030920-171.html)
* [Betting Markets Swing Toward Trump, Forecasting Tightening Race](https://www.forbes.com/sites/jackbrewster/2020/09/02/betting-markets-swing-toward-trump-forecasting-tightening-race/#22fafa8b6bfe)
* [Biden leads in the polls, but betters are taking a gamble on Trump](https://www.foxnews.com/politics/biden-leads-polls-betters-gamble-trump)
* [UK Bookmaker Betfair Shortens Joe Biden 2020 Odds After Bettor Wagers $67K](https://www.casino.org/news/uk-bookmaker-betfair-shortens-joe-biden-2020-odds/)
* [Avoid The Monster Trump Gamble - The Fundamental Numbers Haven't Changed](http://politicalgambler.com/avoid-the-monster-trump-gamble-the-fundamental-numbers-havent-changed/)
Red Cross and Red Crescent societies have been trying out forecast based financing. The idea is to create forecasts and early warning indicators for some negative outcome, such as a flood, using weather forecasts, satellite imagery, climate models, etc., and then release funds automatically if the forecast reaches a given threshold, allowing the funds to be put to work faster and more efficiently, before the disaster happens. Their goals and modus operandi might resonate with the Effective Altruism community:
> "In the precious window of time between a forecast and a potential disaster, FbF releases resources to take early action. Ultimately, we hope this early action will be more **effective at reducing suffering**, compared to waiting until the disaster happens and then doing only disaster response. For example, in Bangladesh, people who received a forecast-based cash transfer were less malnourished during a flood in 2017." (bold not mine)
* Here is the "what can go wrong" section of their [slick yet difficult to navigate webpage](https://www.forecast-based-financing.org/our-projects/what-can-go-wrong/), and an introductory [video](https://www.youtube.com/watch?v=FcuKUBihHVI).
[Prediction Markets' Time Has Come, but They Aren't Ready for It](https://www.coindesk.com/prediction-markets-election). Prediction markets could have been useful for predicting the spread of the pandemic (see: [coronainformationmarkets.com](http://coronainformationmarkets.com)), or for informing presidential election consequences (see: Hypermind above), but their relatively small size makes them less informative. Blockchain based prediction technologies, like Augur, Gnosis or Omen could have helped bypass US regulatory hurdles (which ban many kinds of gambling), but the recent increase in transaction fees means that "everything below a $1,000 bet is basically economically unfeasible".
Floods in India and Bangladesh:
* [Time to develop a reliable flood forecasting model (for Bangladesh)](https://www.thedailystar.net/opinion/news/time-develop-reliable-flood-forecasting-model-1952061)
> This year, flood started somewhat earlier than usual. The Brahmaputra water crossed the danger level (DL) on June 28, subsided after a week, and then crossed the DL again on July 13 and continued for 26 days. It inundated over 30 percent of the country
* [Google's AI Flood Forecasting Initiative now expanded to all parts of India](https://www.timesnownews.com/technology-science/article/googles-ai-flood-forecasting-initiative-now-expanded-to-all-parts-of-india-heres-how-it-helps/646340); [Google bolsters its A.I.-enabled flood alerts for India and Bangladesh](https://fortune.com/2020/09/01/google-ai-flood-alerts-india-bangladesh/)
> “One assumption that was presumed to be true in hydrology is that you cannot generalize across water basins,” Nevo said. “Well, it's not true, as it turns out.” He said Google's A.I.-based forecasting model has performed better on watersheds it has never encountered before in training than classical hydrologic models that were designed specifically for that river basin.
[The many tribes of 2020 election worriers: An ethnographic report](https://www.washingtonpost.com/outlook/2020/09/01/many-tribes-2020-election-worriers-an-ethnographic-report/) by the Washington Post.
Electricity time series demand and supply forecasting startup [raises $8 million](https://news.crunchbase.com/news/myst-ai-closes-6m-series-a-to-forecast-energy-demand-supply/). I keep seeing this kind of announcement; doing forecasting well in an underforecasted domain seems to be somewhat profitable right now, and it's not like there is an absence of domains to which forecasting can be applied. This might be a good idea for an earning-to-give startup.
[NSF and NASA partner to address space weather research and forecasting](https://www.nsf.gov/news/special_reports/announcements/090120.01.jsp). Together, NSF and NASA are investing over $17 million into six three-year awards, each of which contributes to key research that can expand the nation's space weather prediction capabilities.
In its monthly report, OPEC said it expects the pandemic to reduce demand by 9.5 million barrels a day, forecasting a fall in demand of 9.5% from last year, [reports the Wall Street Journal](https://www.wsj.com/articles/opec-deepens-forecast-for-decline-in-global-oil-demand-11600083622).
Some [criticism](https://www.theblockcrypto.com/post/76453/arca-gnosis-defi-project-call) of Gnosis, a decentralized prediction markets startup, by early investors who want to cash out. [Here](https://www.ar.ca/blog/understanding-arcas-request-for-change-at-gnosis) is a blog post by said early investors; they claim that "Gnosis took out what was in effect a 3+ year interest-free loan from token holders and failed to deliver the products laid out in its fundraising whitepaper, quintupled the size of its balance sheet due simply to positive price fluctuations in ETH, and then launched products that accrue value only to Gnosis management."
[What a study of video games can tell us about being better decision makers](https://qz.com/1899461/how-individuals-and-companies-can-get-better-at-making-decisions/) ($), a frustratingly well-paywalled yet exhaustive and informative overview of IARPA's FOCUS tournament:
> To study what makes someone good at thinking about counterfactuals, the intelligence community decided to study the ability to forecast the outcomes of simulations. A simulation is a computer program that can be run again and again, under different conditions: essentially, rerunning history. In a simulated world, the researchers could know the effect a particular decision or intervention would have. They would show teams of analysts the outcome of one run of the simulation and then ask them to predict what would have happened if some key variable had been changed.
## Negative Examples
[Why Donald Trump Isn't A Real Candidate, In One Chart](https://fivethirtyeight.com/features/why-donald-trump-isnt-a-real-candidate-in-one-chart/), wrote 538 in 2015.
> For this reason alone, Trump has a better chance of cameoing in another “Home Alone” movie with Macaulay Culkin — or playing in the NBA Finals — than winning the Republican nomination.
[Travel CFOs Hesitant on Forecasts as Pandemic Fogs Outlook](https://www.airbus.com/newsroom/press-releases/en/2020/09/airbus-reveals-new-zeroemission-concept-aircraft.html), reports the Wall Street Journal.
> "We're basically prevented from saying the word 'forecast' right now because whatever we forecast...it's wrong," said Shannon Okinaka, chief financial officer at Hawaiian Airlines. "So we've started to use the word 'planning scenarios' or 'planning assumptions.'"
## Long Content
Andrew Gelman et al. release [Information, incentives, and goals in election forecasts](http://www.stat.columbia.edu/~gelman/research/unpublished/forecast_incentives3.pdf).
* Neither The Economist's model nor 538's are fully Bayesian. In particular, they are not martingales, that is, their current probability is not the expected value of their future probability.
> campaign polls are more stable than ever before, and even the relatively small swings that do appear can largely be attributed to differential nonresponse
> Regarding predictions for 2020, the creator of the Fivethirtyeight forecast writes, "we think it's appropriate to make fairly conservative choices _especially_ when it comes to the tails of your distributions. Historically this has led 538 to well-calibrated forecasts (our 20%s really mean 20%)" (Silver, 2020b). But conservative prediction can produce a too-wide interval, one that plays it safe by including extra uncertainty. In other words, conservative forecasts should lead to underconfidence: intervals whose coverage is greater than advertised. And, indeed, according to the calibration plot shown by Boice and Wezerek (2019) of Fivethirtyeight's political forecasts, in this domain 20% for them really means 14%, and 80% really means 88%.
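The martingale property mentioned above says that a coherent Bayesian's current probability equals the expected value of its own future probability. A small self-contained simulation of this, using a Beta-Bernoulli model (the prior parameters are arbitrary):

```python
import random

# A forecaster with a Beta(a, b) posterior over a coin's bias forecasts
# P(heads) = a / (a + b). If the next flip is drawn from that same forecast,
# the average of the updated forecasts should equal today's forecast.
random.seed(1)
a, b = 3, 2
p_today = a / (a + b)  # today's forecast: 0.6

updates = []
for _ in range(200_000):
    heads = random.random() < p_today  # simulate tomorrow's observation
    if heads:
        updates.append((a + 1) / (a + b + 1))  # posterior mean after heads
    else:
        updates.append(a / (a + b + 1))        # posterior mean after tails
p_tomorrow_avg = sum(updates) / len(updates)

print(p_today, round(p_tomorrow_avg, 3))  # the two agree up to sampling noise
```

A forecast series that drifts predictably, rather than obeying this property, is leaving information on the table, which is the sense in which the 538 and Economist models are not fully Bayesian.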
[The Literary Digest Poll of 1936](https://en.wikipedia.org/wiki/The_Literary_Digest#Presidential_poll). A poll so bad that it destroyed the magazine.
* Compare the Literary Digest and Gallup polls of 1936 with The New York Times's [model of 2016](https://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html) and [538's 2016 forecast](https://projects.fivethirtyeight.com/2016-election-forecast/#plus), respectively.
> In retrospect, the polling techniques employed by the magazine were to blame. Although it had polled ten million individuals (of whom 2.27 million responded, an astronomical total for any opinion poll),\[5\] it had surveyed its own readers first, a group with disposable incomes well above the national average of the time (shown in part by their ability to afford a magazine subscription during the depths of the Great Depression), and those two other readily available lists, those of registered automobile owners and that of telephone users, both of which were also wealthier than the average American at the time.
> Research published in 1972 and 1988 concluded that as expected this sampling bias was a factor, but non-response bias was the primary source of the error - that is, people who disliked Roosevelt had strong feelings and were more willing to take the time to mail back a response.
> George Gallup's American Institute of Public Opinion achieved national recognition by correctly predicting the result of the 1936 election, while Gallup also correctly predicted the (quite different) results of the Literary Digest poll to within 1.1%, using a much smaller sample size of just 50,000.\[5\] Gallup's final poll before the election also predicted Roosevelt would receive 56% of the popular vote: the official tally gave Roosevelt 60.8%.
> This debacle led to a considerable refinement of public opinion polling techniques, and later came to be regarded as ushering in the era of modern scientific public opinion research.
[Feynman in 1985](https://infoproc.blogspot.com/2020/09/feynman-on-ai.html), answering questions about whether machines will ever be more intelligent than humans.
[Why Most Published Research Findings Are False](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124), back from 2005. The abstract reads:
> There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
[Reference class forecasting](https://en.wikipedia.org/wiki/Reference_class_forecasting). Reference class forecasting, or comparison class forecasting, is a method of predicting the future by looking at similar past situations and their outcomes. The theory behind it was developed by Daniel Kahneman and Amos Tversky, and helped Kahneman win the Nobel Prize in Economics. It is so named because it predicts the outcome of a planned action based on actual outcomes in a reference class of actions similar to the one being forecast.
[Reference class problem](https://en.wikipedia.org/wiki/Reference_class_problem)
> In statistics, the reference class problem is the problem of deciding what class to use when calculating the probability applicable to a particular case. For example, to estimate the probability of an aircraft crashing, we could refer to the frequency of crashes among various different sets of aircraft: all aircraft, this make of aircraft, aircraft flown by this company in the last ten years, etc. In this example, the aircraft for which we wish to calculate the probability of a crash is a member of many different classes, in which the frequency of crashes differs. It is not obvious which class we should refer to for this aircraft. In general, any case is a member of very many classes among which the frequency of the attribute of interest differs. The reference class problem discusses which class is the most appropriate to use.
* See also some thoughts on this [here](https://www.lesswrong.com/posts/iyRpsScBa6y4rduEt/model-combination-and-adjustment)
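Once a reference class is chosen, the forecast itself can be as simple as reading percentiles off the empirical distribution of past cases. A sketch with invented numbers (the durations and the inside-view estimate are purely illustrative):

```python
import statistics

# Outside view: estimate a project's duration from the outcomes of similar
# past projects rather than from an inside-view plan.
past_durations_weeks = [10, 12, 13, 15, 16, 18, 22, 26, 30, 40]
inside_view_estimate = 12

median = statistics.median(past_durations_weeks)
p80 = statistics.quantiles(past_durations_weeks, n=10)[7]  # ~80th percentile
print(f"inside view: {inside_view_estimate}w, reference-class median: {median}w, "
      f"80th percentile: {p80}w")
```

The gap between the inside-view estimate and the reference-class median is the planning-fallacy correction this method is designed to supply; as the reference class problem above notes, the hard part is picking which past cases count as "similar".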
[The Base Rate Book](https://research-doc.credit-suisse.com/docView?language=ENG&format=PDF&source_id=csplusresearchcp&document_id=1065113751&serialid=Z1zrAAt3OJhElh4iwIYc9JHmliTCIARGu75f0b5s4bc%3D) by Credit Suisse.
> This book is the first comprehensive repository for base rates of corporate results. It examines sales growth, gross profitability, operating leverage, operating profit margin, earnings growth, and cash flow return on investment. It also examines stocks that have declined or risen sharply and their subsequent price performance. We show how to thoughtfully combine the inside and outside views. The analysis provides insight into the rate of regression toward the mean and the mean to which results regress.
## Hard To Categorize
[Improving decisions with market information: an experiment on corporate prediction markets](https://link.springer.com/article/10.1007/s10683-020-09654-y) ([sci-hub](https://sci-hub.se/https://link.springer.com/article/10.1007/s10683-020-09654-y); [archive link](https://web.archive.org/web/20200927114741/https://sci-hub.se/https://link.springer.com/article/10.1007/s10683-020-09654-y))
> We conduct a lab experiment to investigate an important corporate prediction market setting: A manager needs information about the state of a project, which workers have, in order to make a state-dependent decision. Workers can potentially reveal this information by trading in a corporate prediction market. We test two different market designs to determine which provides more information to the manager and leads to better decisions. We also investigate the effect of top-down advice from the market designer to participants on how the prediction market is intended to function. Our results show that the theoretically superior market design performs worse in the lab—in terms of manager decisions—without top-down advice. With advice, manager decisions improve and both market designs perform similarly well, although the theoretically superior market design features less mis-pricing. We provide a behavioral explanation for the failure of the theoretical predictions and discuss implications for corporate prediction markets in the field.
The nonprofit Ought organized a [forecasting thread on existential risk](https://www.lesswrong.com/posts/6x9rJbx9bmGsxXWEj/forecasting-thread-existential-risk-1), where participants display and discuss their probability distributions for existential risk, and outline some [reflections on a previous forecasting thread on AI timelines](https://www.lesswrong.com/posts/6LJjzTo5xEBui8PqE/reflections-on-ai-timelines-forecasting-thread).
A [draft report on AI timelines](https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines), [summarized in the comments](https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines?commentId=7d4q79ntst6ryaxWD)
Gregory Lewis has a series of posts related to forecasting and uncertainty:
* [Use resilience, instead of imprecision, to communicate uncertainty](https://forum.effectivealtruism.org/posts/m65R6pAAvd99BNEZL/use-resilience-instead-of-imprecision-to-communicate)
* [Challenges in evaluating forecaster performance](https://forum.effectivealtruism.org/posts/JsTpuMecjtaG5KHbb/challenges-in-evaluating-forecaster-performance)
* [Take care with notation for uncertain quantities](https://forum.effectivealtruism.org/posts/E3CjL7SEuq958MDR4/take-care-with-notation-for-uncertain-quantities)
[Estimation of probabilities to get tenure track in academia: baseline and publications during the PhD](https://forum.effectivealtruism.org/posts/3TQTec6FKcMSRBT2T/estimation-of-probabilities-to-get-tenure-track-in-academia).
[How to think about an uncertain future: lessons from other sectors & mistakes of longtermist EAs](https://forum.effectivealtruism.org/posts/znaZXBY59Ln9SLrne/how-to-think-about-an-uncertain-future-lessons-from-other). The central thesis is:
> Expected value calculations, the favoured approach for EA decision making, are all well and good for comparing evidence backed global health charities, but they are often the wrong tool for dealing with situations of high uncertainty, the domain of EA longtermism.
Discussion by a PredictIt bettor on [how he made money by following Nate Silver's predictions](https://www.reddit.com/r/TheMotte/comments/i6yuis/culture_war_roundup_for_the_week_of_august_10_2020/g1ab8qh/?context=3&sort=best), from r/TheMotte.
Also on r/TheMotte, on [the promises and deficiencies of prediction markets](https://www.reddit.com/r/TheMotte/comments/iseo9j/culture_war_roundup_for_the_week_of_september_14/g59ydcx/?context=3):
> Prediction markets will never be able to predict the unpredictable. Their promise is to be better than all of the available alternatives, by incorporating all available information sources, weighted by experts who are motivated by financial returns.
> So, you'll never have a perfect prediction of who will win the presidential election, but a good prediction market could provide the best possible guess of who will win the presidential election.
> To reach that potential, you'd need to clear away the red tape. It would need to be legal to make bets on the market, fees for making transaction need to be low, participants would need faith in the bet adjudication process, and there can't be limits to the amount you can bet. Signs that you'd succeeded would include sophisticated investors making large bets with a narrow bid/ask spread.
> Unfortunately prediction markets are nowhere close to that ideal today; they're at most "barely legal," bet sizes are limited, transaction fees are high, getting money in or out is clumsy and sketchy, trading volumes are pretty low, and you don't see any hedge funds with "prediction market" desks or strategies. As a result, I put very little stock in political prediction markets today. At best they're populated by dumb money, and at worst they're actively manipulated by campaigns or partisans who are not motivated by direct financial returns.
[Nate Silver](https://twitter.com/NateSilver538/status/1300449268633866241) on a small twitter thread on prediction markets: "Most of what makes political prediction markets dumb is that people assume they have expertise about election forecasting because they a) follow politics and b) understand "data" and "markets". Without more specific domain knowledge, though, that combo is a recipe for stupidity."
* Interestingly, I've recently found out that 538's political predictions are probably [underconfident](https://projects.fivethirtyeight.com/checking-our-work/), i.e., an 80% happens 88% of the time.
[Deloitte](https://www2.deloitte.com/us/en/pages/about-deloitte/articles/press-releases/a-tale-of-two-holiday-seasons-as-a-k-shaped-recovery-model-emerges-consumer-spending-heavily-bifurcated.html) forecasts US holiday season retail sales (but doesn't provide confidence intervals).
[Solar forecast](https://www.nytimes.com/2020/09/15/science/sun-solar-cycle.html). Sun to leave the quietest part of its cycle, but still remain relatively quiet and not produce world-ending coronal mass ejections, the New York Times reports.
The Foresight Institute organizes weekly talks; here is one with Samo Burja on [long-lived institutions](https://www.youtube.com/watch?v=6cCcX0xydmk).
[Some examples of failed technology predictions](https://eandt.theiet.org/content/articles/2020/09/the-eccentric-engineer-the-perils-of-forecasting/).
Last, but not least, Ozzie Gooen on [Multivariate estimation & the Squiggly language](https://www.lesswrong.com/posts/kTzADPE26xh3dyTEu/multivariate-estimation-and-the-squiggly-language):
![](images/fef39d9a14a8ca8986c984ba2f8227d1581d9421.jpg)
---
Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go [there](https://archive.org/) and input the dead link.
---
> [Littlewood's law](https://en.wikipedia.org/wiki/Littlewood%27s_law) states that a person can expect to experience events with odds of one in a million (defined by the law as a "miracle") at the rate of about one per month.
---
Forecasting Newsletter: October 2020.
==============
## Highlights
* Facebook's Forecast is now [out of beta](https://npe.fb.com/2020/10/01/forecast-update-making-forecast-available-to-everyone/).
* British Minister and experts give [probabilistic predictions](https://www.independent.co.uk/news/uk/politics/brexit-trade-deal-chances-probability-likelihood-boris-johnson-eu-summit-b1045775.html) of the chance of a Brexit deal.
* CSET/Foretell publishes an [issue brief](https://cset.georgetown.edu/wp-content/uploads/CSET-Future-Indices.pdf) on their approach to using forecasters to inform big picture policy questions.
## Index
* Highlights
* Prediction Markets & Forecasting Platforms
* In The News
* (Corrections)
* Negative Examples
* Long Content
* Hard To Categorize
Sign up [here](https://forecasting.substack.com/) or browse past newsletters [here](https://forum.effectivealtruism.org/s/HXtZvHqsKwtAYP6Y7). I'm considering creating a Patreon or substack for this newsletter; if you have any strong views, leave a comment.
## Prediction Markets & Forecasting Platforms
Facebook's Forecast app is now [out of beta](https://npe.fb.com/2020/10/01/forecast-update-making-forecast-available-to-everyone/) in the US and Canada.
Hypermind, a prediction market with virtual points but occasional monetary rewards, is organizing a [contest](https://prod.lumenogic.com/ngdp/en/welcome.html) for predicting US GDP in 2020, 2021 and 2022. Prizes sum up to $90k.
Metaculus held the [Road to Recovery](https://www.metaculus.com/questions/5335/forecasting-tournament--road-to-recovery/) and [20/20 Insight Forecasting](https://www.metaculus.com/questions/5336/the-2020-insight-forecasting-contest/) contests. It and collaborators also posted the results of their [2020 U.S. Election Risks Survey](https://www.metaculus.com/news/2020/10/20/results-of-2020-us-election-risks-survey/).
[CSET](https://cset.georgetown.edu/wp-content/uploads/CSET-Future-Indices.pdf) publishes a report on using forecasters to inform big picture policy questions.
> We illustrate Foretell's methodology with a concrete example: First, we describe three possible scenarios, or ways in which the tech-security landscape might develop over the next five years. Each scenario reflects different ways in which U.S.-China tensions and the fortunes of the artificial intelligence industry might develop. Then, we break each scenario down into near-term predictors and identify one or more metrics for each predictor. We then ask the crowd to forecast the metrics. Lastly, we compare the crowd's forecasts with projections based on historical data to identify trend departures: the extent to which the metrics are expected to depart from their historical trajectories.
Replication Markets opens their [Prediction Market for COVID-19 Preprints](https://www.replicationmarkets.com/index.php/rm-c19/). Surveys opened on October 28, and markets will open on November 11, 2020.
## In the News
The European Union is attempting to build a model of the Earth [at 1km resolution](https://www.sciencemag.org/news/2020/10/europe-building-digital-twin-earth-revolutionize-climate-forecasts) as a test ground for its upcoming supercomputers. Typical models run at a resolution of 10 to 100km.
Michael Gove, a British Minister, gave a [66% chance to a Brexit deal](https://www.theguardian.com/politics/2020/oct/07/eu-needs-clear-sign-uk-will-get-real-in-brexit-talks-says-irish-minister). The Independent follows up by [giving the probabilities of different experts](https://www.independent.co.uk/news/uk/politics/brexit-trade-deal-chances-probability-likelihood-boris-johnson-eu-summit-b1045775.html).
Some 538 highlights:
* US general election polls are generally a random walk, rather than [having momentum](https://fivethirtyeight.com/features/the-misunderstanding-of-momentum/).
* [Pollsters have made some changes since 2016](https://fivethirtyeight.com/features/what-pollsters-have-changed-since-2016-and-what-still-worries-them-about-2020/), most notably weighing by education.
* An [interactive presidential forecast](https://fivethirtyeight.com/features/were-letting-you-mess-with-our-presidential-forecast-but-try-not-to-make-the-map-too-weird/)
[New York magazine](https://nymag.com/intelligencer/2020/10/nate-silver-and-g-elliott-morris-are-fighting-on-twitter.html) goes over some differences between 538's and The Economist's forecast for the US election.
[Reuters](https://www.reuters.com/article/global-forex-election/fx-options-market-reflects-more-confidence-in-biden-election-win-idUSL1N2GT207) looks at the volatility of the dollar against the yen and the Swiss franc as a proxy for how tumultuous the election will be. Reuters' interpretation is that a decline in long-run volatility implies that the election is not expected to be contested.
Meanwhile, new systems for [forecasting outbreaks](https://www.porkbusiness.com/article/forecasting-outbreaks-could-be-game-changer-pork-industry) in the American pork industry may help prevent outbreaks, and also make the industry more profitable.
* On the topic of animals, see also a Metaculus question on whether [the EU will announce going cage-free by 2024](https://www.metaculus.com/questions/5431/will-the-eu-announce-by-2024-going-cage-free/).
## Corrections
In the September newsletter, I claimed that bets on the order of $50k could visibly move Betfair's odds. I got some [pushback](https://www.reddit.com/r/slatestarcodex/comments/j25ct9/what_are_everyones_probabilities_for_a_biden_win/g778mg8/?context=8&depth=9). I asked Betfair itself, and their answer was:
> It would definitely be an oversimplification to say that “markets can be moved with 10 to 50k”, because it would depend on a number of other factors such as how much is available at that price at any one time and if anyone puts more money up at that price once all available money is taken.
> For example if someone placed £100k on Biden at 1.44 and there was £35k at 1.45, and £57k at 1.44, then around £7k would be unmatched and the market would now be 1.43-1.44 on Biden. But if someone else still thinks the price should remain at 1.45-1.46 they could place bets to get it back to that, so the market will shift back almost immediately.
> So to clarify, the bets outlined in those articles aren't necessarily the sole reason for the market moving, therefore they can't be deemed the causal connection. They are just headline examples to provide colour to the betting patterns at the time. I hope that is useful, let me know if you need any more info.
## Negative Examples
Boeing [releases](https://www.fool.com/investing/2020/10/15/boeings-commercial-market-outlook-seems-optimistic/) an extremely positive market outlook. "A year ago, Boeing was predicting services market demand to be $3.13 trillion from 2019-2028, making the prediction for $3 trillion from 2020-2029 look optimistic."
## Long Content
The [World Agricultural Supply and Demand Estimates](https://www.usda.gov/oce/commodity/wasde) is a monthly report by the US Department of Agriculture. It provides monthly estimates and past figures for crops worldwide, and for livestock production in the US specifically (meat, poultry, dairy), which might be of interest to the animal suffering movement. It also provides estimates of the past reliability of those forecasts. The October report can be found [here](https://www.usda.gov/oce/commodity/wasde/wasde1020.pdf), along with a summary [here](https://www.feedstuffs.com/markets/usda-raises-meat-poultry-production-forecast). The image below presents the 2020 and 2021 predictions, as well as the 2019 numbers:
![](images/c1fa5c2d58e4b0b76d0c7afab896ee89a7289559.png)
The Atlantic considers scenarios under which [Trump refuses to concede](https://www.theatlantic.com/magazine/archive/2020/11/what-if-trump-refuses-concede/616424/). Warning: very long, very chilling.
National Geographic on [the limits and recent history of weather forecasting](https://www.nationalgeographic.com/science/2020/10/hurricane-path-forecasts-have-improved-can-they-get-better/). There are reasons to think that forecasting the weather accurately two weeks in advance might be difficult.
Andreas Stuhlmüller, of Ought, plays around with GPT-3 to [output probabilities](https://twitter.com/stuhlmueller/status/1317492314495909888); I'm curious to see what comes out of it. I'd previously tried (and failed) to get GPT-3 to output reasonable probabilities for Good Judgment Open questions.
A 2019 paper by Microsoft on [End-User Probabilistic Programming](https://www.microsoft.com/en-us/research/uploads/prod/2019/09/End-User-Probabilistic-Programming-QEST-2019.pdf), that is, on adding features to spreadsheet software to support uncertain values, quantify uncertainty, propagate errors, etc.
The [2020 Presidential Election Forecasting symposium](https://www.cambridge.org/core/journals/ps-political-science-and-politics/2020-presidential-election-forecasting-symposium) presents 12 different election forecasts, ranging from blue wave to Trump win. [Here](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/78235400F6BB7E2E370214D1A2307028/S104909652000147Xa.pdf/introduction_to_forecasting_the_2020_us_elections.pdf) is an overview.
[Blue Chip Economic Indicators](https://lrus.wolterskluwer.com/store/product/blue-chip-economic-indicators/) and [Blue Chip Financial Forecasts](https://lrus.wolterskluwer.com/store/product/blue-chip-financial-forecasts/) are an extremely expensive forecasting option for various econometric variables. A monthly subscription costs $2,401.00 and $2,423.00, respectively, and provides forecasts by 50 members of prestigious institutions ("Survey participants such as Bank of America, Goldman Sachs & Co., Swiss Re, Loomis, Sayles & Company, and J.P. MorganChase, provide forecasts..."). An estimate of previous track record and accuracy [isn't available](https://www.overcomingbias.com/2020/08/how-to-pick-a-quack.html) before purchase. Further information on [Wikipedia](https://en.wikipedia.org/wiki/Blue_Chip_Economic_Indicators).
* [Chief U.S. economist Ellen Zentner of Morgan Stanley](https://asunow.asu.edu/20201005-morgan-stanley-economist-wins-lawrence-r-klein-award-forecasting-accuracy) won the [Lawrence R. Klein Award](https://asunow.asu.edu/20201005-morgan-stanley-economist-wins-lawrence-r-klein-award-forecasting-accuracy) for the most accurate econometric forecasts among the 50 groups who participate in Blue Chip financial forecast surveys.
* I would be very curious to see whether Metaculus' top forecasters, or another group of expert forecasters, could beat the Blue Chips. I'd also be curious how they fared in January, February, and March of this year.
## Hard to Categorize
Scientists use [precariously balanced rock formations to improve accuracy of earthquake forecasts](https://www.dailymail.co.uk/sciencetech/article-8798677/Rock-clocks-balanced-boulders-improve-accuracy-earthquake-forecasts.html). They can estimate when the rock formation appeared, and can calculate what magnitude an earthquake would have had to be to destabilize it. Overall, a neat proxy.
Some [superforecasters](https://twitter.com/annieduke/status/1313653673994514432) to follow on twitter.
_Dart Throwing Spider Monkey_ proudly presents _[Intro to Forecasting 01 - What is it and why should I care?](https://www.youtube.com/watch?v=e6Q7Ez3PkOw)_ and _[Intro to Forecasting 02 - Reference class forecasting](https://www.youtube.com/watch?v=jrU3o7wK23s)_.
I've gone through the Effective Altruism Forum and LessWrong and added the forecasting tag to the relevant posts for October, or checked that it was already applied (LessWrong [link](https://www.lesswrong.com/tag/forecasting-and-prediction?sortedBy=new), Effective Altruism Forum [link](https://forum.effectivealtruism.org/tag/forecasting?sortedBy=new)). This provides a change-log for the month. For the Effective Altruism Forum, this only includes Linch Zhang's post on [Some learnings I had from forecasting in 2020](https://forum.effectivealtruism.org/posts/kAMfrLJwHpCdDSqsj/some-learnings-i-had-from-forecasting-in-2020). For LessWrong, this also includes a [post announcing that Forecast](https://www.lesswrong.com/posts/CZRyFcp6HSyZ7Jj8Q/launching-forecast-a-community-for-crowdsourced-predictions), a prediction platform by Facebook, is now out of beta.
---
Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go [there](https://archive.org/) and input the dead link.
---
Using actuarial life tables and an adjustment for covid, the implied probability that all 246 readers of this newsletter drop dead before the next month is at least 10^(-900) (if they were uncorrelated). See [this Wikipedia page](https://en.wikipedia.org/wiki/Orders_of_magnitude_(probability)) or [this xkcd comic](https://xkcd.com/2379/) for a comparison with other low probability events, such as asteroid impacts.
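For a rough order-of-magnitude check of that figure (assuming, as an illustration of mine rather than a figure taken from the actuarial tables, a monthly death probability of about 2 × 10⁻⁴ per reader, independent across readers):

```python
import math

monthly_death_prob = 2e-4  # assumed per-reader monthly probability of death
readers = 246

# log10 of the probability that every single reader dies within the month
log10_all_die = readers * math.log10(monthly_death_prob)
print(round(log10_all_die))  # about -910, i.e. below the 10^-900 bound
```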
---
Incentive Problems With Current Forecasting Competitions.
==============
This post outlines some incentive problems with forecasting tournaments, such as Good Judgement Open, CSET, or Metaculus. These incentive problems may be problematic not only because unhinged actors might exploit them, but also because of mechanisms such as those outlined in [Unconscious Economics](https://www.lesswrong.com/posts/PrCmeuBPC4XLDQz8C/unconscious-economics). For a similar post about PredictIt, a prediction market in the US, see [here](https://www.lesswrong.com/posts/c3iQryHA4tnAvPZEv/limits-of-current-us-prediction-markets-predictit-case-study). This post was written in collaboration with [Nuño Sempere](https://forum.effectivealtruism.org/users/nunosempere), who should be added as a coauthor soon. This is a crosspost from [LessWrong](https://www.lesswrong.com/posts/tyNrj2wwHSnb4tiMk/incentive-problems-with-current-forecasting-competitions).
## Problems
**Discrete prizes distort forecasts**
If a forecasting tournament offers a prize to the top X forecasters, the objective "be in the top X forecasters" differs from "maximize predictive accuracy". The effects of this are greater the smaller the number of questions.
For example, if only the top forecaster wins a prize, you might want to predict a surprising scenario, because if it happens you will reap the reward, whereas if the most likely scenario happens, everyone else will have predicted it too.
Consider for example [this prediction contest](https://predictingpolitics.com/2020/08/02/the-predictions-are-in/), which only had a prize for #1. The following question asks about the margin Biden would win or lose Georgia by:
![](images/f736a63590b98d329a456b5ff1cc055da86d416c.png)
Then the most likely scenario might be a close race, but the prediction which would have maximized your odds of coming in first might be much more extremized, because the other predictors are more tightly packed in the middle.
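A quick Monte Carlo sketch of this distortion, under assumed numbers of my own (100 rivals clustered around a true 50% probability; not the actual contest data):

```python
import random

random.seed(0)

def win_rate(my_forecast, trials=10_000, rivals=100):
    """Fraction of simulated contests in which my_forecast has the single
    best (lowest) Brier score against rivals clustered near 0.5."""
    wins = 0
    for _ in range(trials):
        outcome = random.random() < 0.5          # true probability is 50%
        others = [min(max(random.gauss(0.5, 0.02), 0.0), 1.0)
                  for _ in range(rivals)]
        my_score = (my_forecast - outcome) ** 2
        best_rival = min((f - outcome) ** 2 for f in others)
        wins += my_score < best_rival
    return wins / trials

honest, extreme = win_rate(0.5), win_rate(0.9)
print(honest, extreme)  # the extremized forecast wins far more often
```

The honest 0.5 forecast essentially never has the uniquely best score, while the extremized 0.9 forecast wins whenever the event happens, despite having a worse expected Brier score.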
This effect also applies if you think of "becoming a superforecaster", which you can do if you are in the top 2% and have over 100 predictions in Good Judgement Open, as a distinct objective from "maximizing predictive accuracy".
**Incentives not to share information and to produce corrupt information**
In a forecasting tournament, there is a disincentive to share information, because other forecasters can use it to improve their relative standing. This goes beyond merely withholding information: there is also an incentive to provide actively misleading information. As a counterpoint, other forecasters can and will often point out flaws in your reasoning if you're wrong.
**Incentives to selectively pick questions.**
If one is minimizing one's Brier score (where lower is better), one is incentivized to pick easier questions. Specifically, if someone has a Brier score of _b²_, then they should not make a prediction on any question where the probability is between _b_ and _(1 − b)_, **even if they know the true probability exactly.** Tetlock explicitly mentions this in one of his [Commandments for superforecasters](https://fs.blog/2015/12/ten-commandments-for-superforecasters/): “Focus on questions where your hard work is likely to pay off”. Yet if we care about making better predictions of things we need to know the answer to, the skill of "trying to answer easier questions so you score better" is not a skill we should reward, let alone encourage the development of.
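To make the claim concrete, a numeric sketch: the expected Brier score of honestly forecasting the true probability p on a binary question is p·(1−p)² + (1−p)·p² = p·(1−p).

```python
def expected_brier(p):
    """Expected Brier score of honestly forecasting the true probability p."""
    return p * (1 - p) ** 2 + (1 - p) * p ** 2  # simplifies to p * (1 - p)

b = 0.2  # a forecaster with a running Brier score of b^2 = 0.04
for p in (0.3, 0.5, 0.7):
    print(p, expected_brier(p), expected_brier(p) > b ** 2)
# every question with p between b = 0.2 and 1 - b = 0.8 raises the
# forecaster's average score, even with a perfectly calibrated forecast
```

In fact even some questions slightly outside (b, 1 − b) hurt, since p·(1−p) > b² also holds near p = b, so the rule in the text is a conservative one.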
A related issue is that, if one is maximizing the difference between one's Brier score and the aggregate's Brier score, one is incentivized to pick questions on which one thinks the aggregate is particularly wrong. This is not necessarily a problem, but it can be.
**Incentive to just copy the community on every question.**
In scoring systems which more directly reward making many predictions, such as the Metaculus scoring system where in general one has to be both wrong and very confident to actually lose points, predictors are heavily incentivized to make predictions on as many questions as possible in order to move up the leaderboard. In particular, a strategy of “predict the community median with no thought” could see someone rise to the top 100 with a few months of signing up.
This is plausibly less bad than the incentives above, although this depends on your assumptions. If the main value of winning “metaculus points” is personal satisfaction, then predicting exactly the community median is unlikely to keep people entertained for long. New users predicting something fairly close to the community median on lots of questions, but updating a little bit based on their own thinking, is arguably not a problem at all, as the small adjustments may be enough to improve the crowd forecast, and the high volume of practice that users with this strategy experience can lead to rapid improvement.
## Solutions
**Probabilistic rewards**
In tournaments with fairly small numbers of questions, where paying only the top few places would incentivize forecasters to make overconfident predictions to maximize their chance of first-place finishes as discussed above, probabilistic rewards may be used to mitigate this effect. 
In this case, rather than e.g. having prizes for the top three scorers, prizes would be distributed according to a lottery, where the number of tickets each player received was some function of their score, thereby incentivizing players to maximize their expected score, rather than their chance of scoring highest. 
Which precise function should be used is a non-trivial question: If the payout structure is too “flat”, then there is not sufficient incentive for people to work hard on their forecasts compared to just entering with the community median or some reasonable prior. If on the other hand the payout structure too heavily rewards finishing in one of the top few places, then the original problem returns. This seems like a promising avenue for a short research project or article.
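A minimal sketch of such a lottery, with an exponential ticket function chosen arbitrarily for illustration (an assumption of mine, not a scheme any platform uses):

```python
import math
import random

def tickets(brier_score, sharpness=10):
    """Lower Brier score -> exponentially more lottery tickets."""
    return math.exp(-sharpness * brier_score)

scores = {"alice": 0.10, "bob": 0.15, "carol": 0.30}
weights = {name: tickets(s) for name, s in scores.items()}
total = sum(weights.values())
win_probs = {name: w / total for name, w in weights.items()}
print(win_probs)  # alice is most likely to win, but carol still might

random.seed(1)
winner = random.choices(list(win_probs), weights=list(win_probs.values()))[0]
```

The `sharpness` parameter is exactly the knob discussed above: lower values flatten the payout, higher values approach winner-take-all.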
**Giving rewards to the best forecasters among many questions.**
If one gives rewards to the top three forecasters for 10 questions in a contest in which there are 200 forecasters, the top three forecasters might be as much a function of luck as of skill, which might be perceived as unfair. Giving prizes for much larger pools of questions makes this effect smaller. 
**Forcing forecasters to forecast on all questions**
This fixes the incentive to pick easier questions to forecast on. 
A similar idea is assuming that forecasters have predicted the community median on any question that they haven't forecast on, until they make their own prediction, and then reporting the average Brier score over all questions. This has the disadvantage of not rewarding "first movers/market makers"; however, it has the advantage of "punishing" people for not correcting a bad community median in a way that relative Brier doesn't (at least in terms of how people experience it).
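A sketch of that imputation scheme, with made-up data (the names and numbers are mine, not any platform's):

```python
def avg_brier_with_imputation(own, community_median, outcomes):
    """Average Brier score over all questions, filling in the community
    median wherever the forecaster made no prediction of their own.
    own: {question: forecast}; community_median and outcomes cover all
    questions, with outcomes 0 or 1."""
    total = 0.0
    for q, outcome in outcomes.items():
        forecast = own.get(q, community_median[q])
        total += (forecast - outcome) ** 2
    return total / len(outcomes)

outcomes = {"q1": 1, "q2": 0, "q3": 1}
median = {"q1": 0.6, "q2": 0.4, "q3": 0.2}  # the median on q3 is badly wrong
own = {"q1": 0.8, "q2": 0.3}                # the forecaster skipped q3

score = avg_brier_with_imputation(own, median, outcomes)
print(score)  # the skipped q3 contributes the median's (0.2 - 1)^2 penalty
```

Skipping q3 costs the forecaster the bad median's 0.64, which is the "punishment" for not correcting it.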
**Scoring groups**
If one selects a group of forecasters and offers to reward them in proportion to the Brier score of the group's predictions for a fixed set of questions, then the forecasters have an incentive to share information with the group. This group of forecasters could be pre-selected for having made good predictions in the past.
**Designing collaborative scoring rules**
One could design rules around, e.g., Shapley values, but this is tricky to do. Currently, some platforms make it possible to give upvotes to the most insightful forecasters, but if upvotes were monetarily rewarded, one might not have the incentive to upvote other participants' contributions, as opposed to waiting for one's own contributions to be upvoted.
In practice, Metaculus and GJOpen do have healthy communities which collaborate, where trying to maximize accuracy at the expense of other forecasters is frowned upon, but this might change with time, and it might not always be replicable. In the specific case of Metaculus, monetary prizes are relatively new, but becoming more frequent. It remains to be seen whether this will change the community dynamic.
**Divide information gatherers and prediction producers.**
In this case, information gatherers might be upvoted by prediction producers, who wouldn't have any disincentive against doing so. Alternatively, some prediction producers might be shown information from different information gatherers, or select which information was responsible for a particular change in their forecast. A scheme in which the two tasks are separated might also lead to efficiency gains.
**Other ideas**
* One might explicitly reward reasoning rather than accuracy (this has been tried on Metaculus for the insight tournament, and also for the [El Paso series](https://pandemic.metaculus.com/contests/?selected=el-paso)). This has its own downsides, notably that it's not obvious that reasoning which looks good/reads well is actually correct.
* One might make objectives more fuzzy, like the Metaculus points system, hoping this would make it more difficult to hack.
* One might reward activity, i.e., frequency of updates, or some other proxy expected to correlate with forecasting accuracy. This might work better if the correlation is causal (i.e., better forecasters have higher accuracy because they forecast more often), rather than due to a confounding factor. The obvious danger with any such strategy is that rewarding the proxy [is likely to break the correlation](https://www.lesswrong.com/posts/YtvZxRpZjcFNwJecS/the-importance-of-goodhart-s-law#:~:text=Goodhart's%20law%20states%20that%20once,to%20the%20Bank%20of%20England.).
Announcing the Forecasting Innovation Prize
==============
## Motivation
There is already a fair amount of interest around Effective Altruism in judgemental forecasting. We think there's a whole lot of good research left to be done.
The valuable research seems to be all over the place. We could use people to speculate on research directions, outline incentive mechanisms, try novel forecasting questions with friends, and outline new questions that deserve forecasts. Some of this requires a fair amount of background knowledge, but a lot doesn't.
The EA and LW communities have a history of using [prizes](https://forum.effectivealtruism.org/posts/GseREh8MEEuLCZayf/nunosempere-s-shortform?commentId=WPStS4qhJS7Mz6KCA) to encourage work in exciting areas. We're going to try one in forecasting research. If this goes well, we'd like to continue and expand this going forward.
## Prize
This prize will total $1000 between multiple recipients, with a minimum first place prize of $500. We will aim for 2-5 recipients in total. The prize will be paid for by the [Quantified Uncertainty Research Institute](https://quantifieduncertainty.org/) (QURI).
## Rules
To enter, first make a public post online between now and Jan 1, 2021. We encourage you to either post directly or make a link post to either LessWrong or the EA Forum. Second, complete [this form](https://docs.google.com/forms/d/e/1FAIpQLSdDQg31F3v-QYEvSXp7Oahg-qagigN4PPXediYGWYKaPDD3Lg/viewform?usp=sf_link), also before Jan 1, 2021. 
## Research Feedback
If you'd like feedback or would care to discuss possible research projects, please do reach out! To do so, fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSf13ZDuz1uERMl_Se2VOiKQzPn2AhQTJOkJpbCy7uSFh3cUOg/viewform?usp=sf_link). We're happy to advise at any stage of the process.
## Judges
The judges will be [AlexRJL](https://forum.effectivealtruism.org/users/alexrjl), [Nuño Sempere](https://forum.effectivealtruism.org/users/nunosempere), [Eric Neyman](https://sites.google.com/view/ericneyman/), [Tamay Besiroglu](https://forum.effectivealtruism.org/users/tamay), [Linch Zhang](https://forum.effectivealtruism.org/users/linch) and [Ozzie Gooen](https://forum.effectivealtruism.org/users/oagr). The details of the judging process will vary depending on how many submissions we get. We'll try to select winners for their importance, novelty, and presentation.
## Some Possible Research Areas
Areas of work we would be excited to see explored:
* Operationalizing questions in important domains so that they can be predicted in, e.g., Metaculus. This is currently a significant bottleneck; it's surprisingly difficult to write good questions. Examples in the past have been the [Ragnarök](https://www.metaculus.com/questions/?search=cat:series--ragnarok) or the [Animal Welfare series](https://www.metaculus.com/questions/3068/animal-welfare-series/). A possible suggestion might be to try to come up with forecastable [fire alarms](https://www.lesswrong.com/posts/kvSLHYY5igtixEqMB/fire-alarm-for-agi) for AGI. Tamay Besiroglu has suggested a “S&P 500 but for AI forecasts,” i.e., a group of forecasting questions which track something useful for AI (or for other domains).
* Small experiments where you and/or a group of people use forecasting for your own decision making, and write up what you've learned. For example, set up a [Foretold](https://www.foretold.io/) community to decide on which research document you want to write up next. [Predictions as a Substitute for Reviews](https://acesounderglass.com/2020/08/06/predictions-as-a-substitute-for-reviews/) is an example here.
* New forecasting approaches, or forecasting tools being used in new and interesting ways, or applied to new domains. For example, [Amplifying generalist research via forecasting](https://www.lesswrong.com/posts/FeE9nR7RPZrLtsYzD/amplifying-generalist-research-via-forecasting-results-from), or Ought's [AI timelines forecasting thread](https://www.lesswrong.com/posts/hQysqfSEzciRazx8k/forecasting-thread-ai-timelines).
* Estimable or [gears-level](https://www.lesswrong.com/posts/B7P97C27rvHPz3s9B/gears-in-understanding) models of the world that are well positioned to be used in forecasting. For example, a decomposition, informed by one's own expertise, of a difficult question into smaller questions, each of which can then be forecast. Recent work by [CSET-foretell](https://cset.georgetown.edu/wp-content/uploads/CSET-Future-Indices.pdf) would be an example of this.
* Suggestions for or basic implementation of better tooling for forecasters, like a Bayes rule calculator for considering many pieces of evidence, a Laplace law calculator, etc.
* New theoretical schemes which propose solutions to current problems around forecasting. For a recent example, see [Time Travel Markets for Intellectual Accounting](https://www.lesswrong.com/posts/DonsyZjFMgsXnZAFX/time-travel-markets-for-intellectual-accounting).
* Eliciting forecasts on useful questions from expert forecasters. For example, the [probabilities](https://forum.effectivealtruism.org/posts/Z5KZ2cui8WDjyF6gJ/some-thoughts-on-toby-ord-s-existential-risk-estimates) of the x-risks outlined in [The Precipice](https://theprecipice.com/).
* Overviews of existing research, or thoughts or reflections on existing prediction tournaments and similar. For example, Zvi's posts on prediction markets, [here](https://www.lesswrong.com/posts/a4jRN9nbD79PAhWTB/prediction-markets-when-do-they-work) and [here](https://www.lesswrong.com/posts/k286sEwyuY7SiQjcs/prediction-markets-are-about-being-right).
* Figuring out why some puzzling behavior happens in current prediction markets or forecasting tournaments, like in [Limits of Current US Prediction Markets (PredictIt Case Study)](https://www.lesswrong.com/posts/c3iQryHA4tnAvPZEv/limits-of-current-us-prediction-markets-predictit-case-study). For a new puzzle suggested by Eric Neyman, consider that PredictIt is thought to be limited because it caps trades at $850, has various fees, etc, which makes it not the sort of market that big, informed players can enter and make efficient. But that fails to explain why markets without such caps, such as FTX, have prices similar to PredictIt. So, is PredictIt reasonable or is FTX unreasonable? If the former, why is there such a strong expert consensus against what PredictIt says so often? If the latter, why is FTX unreasonable?
* Comments on existing posts can themselves be very valuable. Feel free to submit a list of good comments instead of one single post.
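For the tooling suggestion above, minimal sketches of the two proposed calculators (textbook formulas only; not any platform's actual implementation):

```python
def bayes_update(prior, likelihood_ratios):
    """Posterior probability after combining independent pieces of
    evidence, each expressed as a likelihood ratio, in odds form."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

def laplace_rule(successes, trials):
    """Laplace's rule of succession: estimated probability that the
    next trial succeeds, given past successes out of trials."""
    return (successes + 1) / (trials + 2)

print(bayes_update(0.5, [2, 3]))  # 1:1 prior odds -> 6:1 -> 6/7
print(laplace_rule(0, 10))        # 0 successes in 10 trials -> 1/12
```

Working in odds form makes combining many pieces of evidence a simple product of likelihood ratios, which is what makes such a calculator convenient for forecasters.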