diff --git a/blog/2023/03/01/computable-solomoff/src/main.tex b/blog/2023/03/01/computable-solomoff/src/main.tex
new file mode 100644
index 0000000..d99b387
--- /dev/null
+++ b/blog/2023/03/01/computable-solomoff/src/main.tex
@@ -0,0 +1,87 @@
+\documentclass[12pt,authoryear]{elsarticle}
+\makeatletter
+\def\ps@pprintTitle{%
+  \let\@oddhead\@empty
+  \let\@evenhead\@empty
+  \def\@oddfoot{\centerline{\thepage}}%
+  \let\@evenfoot\@oddfoot}
+
+%% The amssymb package provides various useful mathematical symbols
+
+\begin{document}
+
+\begin{frontmatter}
+
+
+\title{A computable version of Solomonoff induction}
+
+%% use optional labels to link authors explicitly to addresses:
+%% \author[label1,label2]{}
+%% \address[label1]{}
+%% \address[label2]{}
+
+\author[1]{Nu\~{n}o Sempere}
+
+\address[1]{Quantified Uncertainty Research Institute, Mexico}
+
+\begin{abstract}
+  I present a computable version of Solomonoff induction.
+\end{abstract}
+
+\begin{keyword}
+%% keywords here, in the form: keyword \sep keyword
+Solomonoff \sep induction \sep computable
+\end{keyword}
+
+\end{frontmatter}
+
+\section{The key idea: arrive at the correct hypothesis in finite time}
+
+\begin{enumerate}
+  \item Start with a finite set of Turing machines, $\{T_0, ..., T_n\}$
+  \item If none of the $T_i$ predict your trail of bits, $(B_0, ..., B_m)$, compute the first $m$ steps of Turing machine $T_{n+1}$. If $T_{n+1}$ doesn't predict them either, go to $T_{n+2}$, and so on\footnote{Here we assume that we have an ordering of Turing machines by simplicity, i.e., that $T_i$ is simpler than $T_{i+1}$.}
+  \item Observe the next bit, and purge from your set the machines which don't predict it. If none predict it, GOTO 2.
+\end{enumerate}
+
+
+Then in finite time, you will arrive at a set which contains only the simplest TM which describes the process generating your trail of bits. Proof:
+
+\begin{itemize}
+  \item Let $T_j$ be the simplest machine which describes the process generating your trail of bits. Then, by virtue of it being such, none of the machines $\{T_0,...,T_{j-1}\}$ describes that process (otherwise one of them, not $T_j$, would be the simplest such machine). This means that each of them at some point predicts something distinct from your trail of bits, and hence that step 2 of the process above at some point discards it. In particular, after discarding $T_{j-1}$, the process arrives at $T_j$, and stays there forever.
+  \item Therefore, this process arrives at the simplest TM which describes the process generating your trail of bits in finite time.
+\end{itemize}
+QED.
+
+\section{Using the above scheme to arrive at a probability}
+
+Now, the problem with the above scheme is that if we use our set of Turing machines to output a probability for the next bit
+
+$$ P\Big(b_{m + 1} = 1 | (B_0, ..., B_m) \Big) := \frac{1}{n+1} \cdot \sum_{i=0}^{n} \Big(T_i(m+1) = 1\Big) $$
+
+then our probabilities are going to be very janky. For example, right before discarding $T_{j-1}$, that scheme would assign a probability of 0\% to the correct value of the next bit.
+
+To fix this, in step 2, we can require that there be not just one but multiple Turing machines that predict all past bits. How many? Well, however many your compute and memory budgets allow. This would make your probabilities less janky.
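+
+As a concrete illustration, here is a minimal Python sketch of the whole scheme. It is only a sketch: ordinary Python functions stand in for Turing machines, the list order stands in for the simplicity ordering, and returning \texttt{None} stands in for exceeding the step budget of step 2.
+
+\begin{verbatim}
+# machine(i) returns the machine's i-th output bit, or None if the
+# machine fails to produce it within the step budget.
+
+def consistent(machine, bits):
+    # Does this machine predict the whole trail of bits so far?
+    return all(machine(i) == b for i, b in enumerate(bits))
+
+def refill(live, frontier, machines, bits, min_size=1):
+    # Steps 2 and 3: drop machines that mispredict; while fewer than
+    # min_size survive, scan further down the enumeration. Taking
+    # min_size > 1 gives the less janky probabilities discussed above.
+    live = [i for i in live if consistent(machines[i], bits)]
+    while len(live) < min_size and frontier < len(machines):
+        if consistent(machines[frontier], bits):
+            live.append(frontier)
+        frontier += 1
+    return live, frontier
+
+def p_next_bit_is_one(live, machines, m):
+    # The averaging scheme above: the fraction of live machines whose
+    # next output is a 1. Assumes live is non-empty.
+    return sum(machines[i](m) == 1 for i in live) / len(live)
+
+# Toy enumeration: all zeros, alternating bits, all ones, never halts.
+machines = [lambda i: 0, lambda i: i % 2, lambda i: 1, lambda i: None]
+
+bits, live, frontier = [], [], 0
+for observed in [1, 1, 1, 1]:
+    live, frontier = refill(live, frontier, machines, bits)
+    print(p_next_bit_is_one(live, machines, len(bits)))
+    bits.append(observed)
+\end{verbatim}
+
+A machine which never produces output within the budget, like the last one in the toy enumeration, simply never predicts the trail of bits and is skipped; the last section returns to this point.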
+
+Interestingly, that scheme also suggests that there is a tradeoff between arriving at the correct hypothesis as fast as possible—in which case we would just implement the first scheme at full speed—and producing accurate probabilities—in which case it seems like we would use the modification just outlined.
+
+\section{A downside}
+
+A downside of the procedures outlined above is that when we arrive at the correct hypothesis, we don't know that we have done so.
+
+\section{An irrelevant epicycle: dealing with Turing machines that take too long or do not halt}
+
+When thinking about Turing machines, one might consider one particular model, e.g., valid C programs. But in that case, it is easy to write a program which never halts and never outputs anything:
+
+\begin{verbatim}
+#include <stdbool.h>
+
+int main() {
+    while (true) {
+    }
+}
+\end{verbatim}
+
+Such a program never predicts the trail of bits. But because step 2 only computes a bounded number of steps of each machine, programs which take too long or do not halt are simply skipped or discarded like any other machine that fails to predict, which is why this epicycle is irrelevant.
+
+\end{document}
+
+\endinput
+%%
+%% End of file `elsarticle-template-harv.tex'.
diff --git a/blog/2023/03/01/computable-solomoff/src/old.text b/blog/2023/03/01/computable-solomoff/src/old.text
new file mode 100644
index 0000000..1400f0e
--- /dev/null
+++ b/blog/2023/03/01/computable-solomoff/src/old.text
@@ -0,0 +1,1132 @@
+\documentclass[12pt,authoryear]{elsarticle}
+\makeatletter
+\def\ps@pprintTitle{%
+  \let\@oddhead\@empty
+  \let\@evenhead\@empty
+  \def\@oddfoot{\centerline{\thepage}}%
+  \let\@evenfoot\@oddfoot}
+%% Otherwise, annoying "Preprint submitted to Elsevier."
+
+% \documentclass[preprint, 12pt,authoryear]{elsarticle}
+
+%% Use the option review to obtain double line spacing
+%% \documentclass[authoryear,preprint,review,12pt]{elsarticle}
+
+%% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn
+%% for a journal layout:
+%% \documentclass[final,1p,times,authoryear]{elsarticle}
+%% \documentclass[final,1p,times,twocolumn,authoryear]{elsarticle}
+%% \documentclass[final,3p,times,authoryear]{elsarticle}
+%% \documentclass[final,3p,times,twocolumn,authoryear]{elsarticle}
+%% \documentclass[final,5p,times,authoryear]{elsarticle}
+%% \documentclass[final,5p,times,twocolumn,authoryear]{elsarticle}
+
+%% For including figures, graphicx.sty has been loaded in
+%% elsarticle.cls. If you prefer to use the old commands
+%% please give \usepackage{epsfig}
+
+\usepackage[page]{appendix} %% title,toc,titletoc
+
+%% The amssymb package provides various useful mathematical symbols
+\usepackage{amssymb}
+%% The amsthm package provides extended theorem environments
+\usepackage{amsmath}
+\usepackage{amsthm}
+\newtheorem{theorem}{Theorem}
+\newtheorem{hypothesis}{Hypothesis}[theorem]
+
+\newenvironment{proofsketch}{%
+  \renewcommand{\proofname}{Proof sketch}\proof}{\endproof}
+
+%% The lineno packages adds line numbers. Start line numbering with
+%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
+%% for the whole article with \linenumbers.
+%% \usepackage{lineno} + +\usepackage{hyperref} +\hypersetup{ + colorlinks=true, + linkcolor=blue, + filecolor=magenta, + urlcolor=cyan, +} +\urlstyle{same} + +\usepackage{graphicx} +\graphicspath{ {./images/} } +\usepackage{wrapfig} +\usepackage{svg} +%% https://www.overleaf.com/learn/latex/Inserting_Images +\renewcommand*{\bibfont}{\raggedright} + + +\usepackage{tikz} +\newcommand*\circled[1]{\tikz[baseline=(char.base)]{ + \node[shape=circle,draw,inner sep=2pt] (char) {#1};}} +%% https://tex.stackexchange.com/questions/7032/good-way-to-make-textcircled-numbers +\usepackage{float} + +%\journal{International Journal Of Forecasting} + +\begin{document} + +\begin{frontmatter} + +%% Title, authors and addresses + +%% use the tnoteref command within \title for footnotes; +%% use the tnotetext command for theassociated footnote; +%% use the fnref command within \author or \address for footnotes; +%% use the fntext command for theassociated footnote; +%% use the corref command within \author for corresponding author footnotes; +%% use the cortext command for theassociated footnote; +%% use the ead command for the email address, +%% and the form \ead[url] for the home page: +%% \title{Title\tnoteref{label1}} +%% \tnotetext[label1]{} +%% \author{Name\corref{cor1}\fnref{label2}} +%% \ead{email address} +%% \ead[url]{home page} +%% \fntext[label2]{} +%% \cortext[cor1]{} +%% \address{Address\fnref{label3}} +%% \fntext[label3]{} + +\title{Alignment Problems With Current Forecasting Platforms} + +%% use optional labels to link authors explicitly to addresses: +%% \author[label1,label2]{} +%% \address[label1]{} +%% \address[label2]{} + +\author[1]{Nu\~{n}o Sempere\corref{cor1}} +\author[2]{Alex Lawsen} + +\address[1]{Quantified Uncertainty Research Institute, Vienna} +\address[2]{Kings Maths School, United Kingdom} + +\cortext[cor1]{Corresponding author. E-mail address: \url{nuno@quantifieduncertainty.org}} + + +\begin{abstract} + We present alignment problems in current forecasting platforms, such as Good Judgment Open, CSET-Foretell or Metaculus. We classify those problems as either reward specification problems or principal-agent problems, and we propose solutions. For instance, the scoring rule used by Good Judgment Open is not proper, and Metaculus tournaments disincentivize sharing information and incentivize distorting one's true probabilities to maximize the chances of placing in the top few positions which earn a monetary reward. We also point out some partial similarities between the problem of aligning forecasters and the problem of aligning artificial intelligence systems. + %We outline some alignment problems with forecasting tournaments, such as Good Judgement Open, CSET-foretell, or Metaculus, and classify these problems using the inner/outer alignment categorization scheme. Most notably, Good Judgment Open's scoring rule is not proper---under some scenarios, forecasters can expect to obtain a better score by inputting something other than their true probabilities---and this effect is quantitatively large. +\end{abstract} + +%%Graphical abstract +%\begin{graphicalabstract} +%\includegraphics{grabs} +%\end{graphicalabstract} + +%%Research highlights +%\begin{highlights} + +%\item Good Judgement's scoring rule is not proper. +%\item Forecasting competitions in Metaculus and other platforms might want to consider probabilistic rewards. +%\item Forecasting competitions exhibit both inner and outer alignment problems. 
%\end{highlights}
+
+\begin{keyword}
+%% keywords here, in the form: keyword \sep keyword
+forecasting \sep forecasting tournament \sep incentives \sep incentive problems \sep alignment problems \sep Good Judgment \sep Cultivate Labs \sep Metaculus \sep CSET-Foretell
+
+%% PACS codes here, in the form: \PACS code \sep code
+
+%% MSC codes here, in the form: \MSC code \sep code
+%% or \MSC[2008] code \sep code (2000 is the default)
+
+\end{keyword}
+
+\end{frontmatter}
+
+%% \linenumbers
+
+%% main text
+
+\section{Introduction}
+\label{Introduction}
+
+\subsection{Motivation: The importance of alignment problems for the forecasting ecosystem}\label{Motivation}
+
+Forecasting systems and competitions such as those organized by Good Judgment, Metaculus or CSET-Foretell have been used to inform estimates of the probability of nuclear war (\cite{Rodriguez}), of different coronavirus scenarios (\cite{covidrecovery}), of each presidential candidate winning the US election (\cite{USelections}), of heightened geopolitical tensions or conflict with China (\cite{SouthChina}), of global catastrophic risks (\cite{ragnarok}), and of various rates of AI progress (\cite{aiprogress}). Further, these probabilities aim to be action-guiding, that is, to be accurate and respected enough to influence real-world decisions:
+
+\begin{quote}
+  ``...the tools are good enough that the remarkably inexpensive forecasts they generate should be on the desks of decision makers, including the president of the United States'' (\cite{Tetlock}, Chapter 9)
+\end{quote}
+
+However, these forecasting competitions sometimes inadvertently provide forecasters with incentives not to reveal their best forecasts. As a highlight of the paper, in \S\ref{GJ not proper}, we prove that the scoring rule used in Good Judgment Open, CSET-Foretell, and other Cultivate Labs platforms is not proper. That is, in some scenarios, a forecaster with correct probabilistic beliefs can input into the platform a much higher probability than their true beliefs would warrant and still obtain a better score in expectation, and this effect is quantitatively large.
+
+Notably, Good Judgment recruits the top 2\% of forecasters from Good Judgment Open, who are then dubbed Superforecasters™; this introduces further incentive distortions.
+
+The incentive problems we identify matter not only because score-focused forecasters might exploit them, but also because platform users might interpret rewards from imperfect schemes as feedback, and because flawed incentive schemes will fail to incentivize, identify and reward the best forecasters.
+
+For an overview of the broader literature around incentives for forecasters, see the recent literature review in (\cite{Witkowski}).
+
+%Within this broader literature, we the authors have particular expertise in human judgmental forecasting systems as they occur in practice. This refers to forecasting competitions or prediction markets in which flesh and blood humans predict on uncertain questions which are difficult to distill into a form suitable for data science or machine learning methods.
+
+\subsection{Overview of the paper}
+
+Section \S\ref{Motivation} provides the motivation for this paper, and section \S\ref{Alignment terminology} introduces the alignment terminology we use as an organizing principle for the rest of the paper.
Sections \S\ref{Outer alignment incentive Problems} and \S\ref{Inner alignment incentive Problems} outline the alignment problems which we identify, and Appendix \S\ref{Simulations} provides some numerical simulations to support our points. In section \S\ref{Solutions} we propose possible solutions. Section \S\ref{Conclusion} concludes and draws parallels to the broader artificial intelligence alignment problem.
+\newpage
+
+\subsection{Alignment terminology}\label{Alignment terminology}
+
+The process of creating a forecasting system comprises several distinct steps:
+
+\begin{enumerate}
+  \item Broader society, with its own goals, spawns a forecasting system, whose goals are to obtain accurate probabilities.
+  \item The forecasting system chooses a formal objective function which operationalizes ``obtain accurate probabilities''.
+  \item Forecasters then maximize their actual reward, which might depend on their own preferences and inclinations.
+\end{enumerate}
+
+\begin{wrapfigure}{l}{0.4\textwidth}
+  \centering
+  \includegraphics[width=0.35\textwidth]{InnerOuterAlignment2}
+\end{wrapfigure}
+
+An alignment failure is possible at each of those steps. For instance, a failure in the first step, where broader society's goals are not aligned with a forecasting system's goals, might correspond to obtaining forecasts about Ebola without being cognizant that different forecasts may provoke different responses, and hence lead to a different number of deaths, and that the prediction which minimizes deaths might not be the one which is most accurate. A failure in the second step, where the goals of the forecasting system are not aligned with its scoring rule, might correspond to choosing a reward function which, when maximized, doesn't result in optimal forecasts: for example, a function which rewards successfully convincing other forecasters of false information, which maximizes the liar's comparative accuracy but might decrease the accuracy of the broader forecasting system. A failure in the third step, where the goals of the forecasters are not aligned with the goals of the formal scoring rule, might correspond to a bug in the implementation of the scoring rule, so that forecasters optimize their score under the unaligned, buggy implementation.
+
+\cite{Hubinger} distinguish between inner and outer alignment. Outer alignment refers to choosing an objective function to optimize, and making sure that the gap between this objective function and one's own true goals is as small as possible. This has some similarities to our first two steps above: making sure that we want to optimize for obtaining accurate probabilities, and making sure that our scoring rule leads to accurate probabilities. For instance, if we care about saving lives, we would have to check that having better forecasts leads to more lives saved, and that the scoring rule, if optimized, leads to more accurate forecasts.
+
+Inner alignment then refers to making sure that an optimizer is trying to optimize a previously chosen base objective function, when this optimizer is itself the product of an optimization process. Normally this refers to making sure that the reward function learnt by a reinforcement learner during training closely approximates a base objective function, even in a different environment outside of training.
In the case of a forecasting system, this bears some resemblance to making sure that forecasters maximize their score according to a previously chosen scoring rule, for instance by paying forecasters more money the lower their Brier score is.
+
+Nonetheless, human forecasters differ from reinforcement learning systems in that they are not trained by forecasting systems from scratch, but merely repurposed or incentivized. For this reason, it is perhaps more accurate to say that the first two steps are akin to a reward specification problem, and the last step is akin to a principal-agent problem.
+
+The focus of this paper is on failures of forecasting platforms related to steps \circled{2} and \circled{3}, respectively covered in sections \S\ref{Outer alignment incentive Problems} and \S\ref{Inner alignment incentive Problems}. %Beyond this paper, the inner/outer alignment categorization scheme might serve to organize research around incentive alignment in a more structured manner.
+
+%However, we include some speculative examples of \circled{1} ---including examples related to other forecasting systems--- in \ref{More outer alignment failures}. [TO DO]
+
+\section{Reward specification problems $\approx$ Outer alignment problems}
+\label{Outer alignment incentive Problems}
+
+This section discusses cases where individual forecasters maximizing their comparative accuracy score\footnote{e.g., their relative Brier score}, or some other score determined by a forecasting platform, does not maximize the accuracy of the whole forecasting system. That is, the forecasting platform chose some reward function, forecasters are optimizing that reward function, but it turns out that the reward function fails to capture some aspect which the creators or clients of the forecasting platform also care about.
+
+A classical example of this would be a Sybil attack: if a forecaster created many puppet accounts on Good Judgment Open, CSET or Metaculus, and used them to make many bad forecasts on the questions she had made predictions on, her comparative accuracy score (and the comparative accuracy score of those who predicted the same questions as her) would improve.
+
+As long as the set of questions she chose to forecast on was relatively unique, she would benefit. She would still be maximizing her comparative accuracy score, but that score would have ceased to be related to the broader objectives of the forecasting system.
+
+\subsection{The scoring rule incentivizes forecasters to selectively pick questions}
+\label{Selectively pick questions}
+
+Forecasters seeking to obtain a good score are incentivized to selectively pick questions, and in some cases are better off not making predictions \textit{even if they know the true probability exactly.} Tetlock mentions a related point in one of his ``Commandments for superforecasters'' (\cite{Tetlock}, pp. 277-278): ``Focus on questions where your hard work is likely to pay off''. Yet if we care about making better predictions of things we need to know the answer to, the skill of ``trying to answer easier questions so one scores better'' is not a skill we should reward, let alone encourage the development of.
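+
+For concreteness, take a forecaster whose average Brier score so far is $0.09$ (i.e., $p = 0.1$). If she truthfully forecasts $50\%$ on a question whose true probability is $50\%$, her expected score on that question is
+\begin{equation}
+  0.5 \cdot (1-0.5)^2 + 0.5 \cdot (0-0.5)^2 = 0.25 > 0.09,
+\end{equation}
+so even this perfectly calibrated forecast makes her average score worse. The theorems below make this precise.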
+
+For the case where the forecasting system rewards the Brier score\footnote{In Brier scoring, lower scores are better.}, if a forecaster has a Brier score of $p \cdot (1-p)$, then they should not make a prediction on any question where the probability is between $p$ and $(1-p)$, \textit{even if they know the true probability exactly.}
+
+\begin{theorem}
+  A forecaster wishing to obtain a low average Brier score, and who has\footnote{Either currently, or in expectation.} a Brier score of $p \cdot (1-p)$ (with $p\le(1-p)$ without loss of generality) should only make predictions on questions where the probability is lower than $p$ or higher than $(1-p)$.
+\end{theorem}
+\begin{proof}
+  Suppose that the forecaster makes a prediction on a binary question which has a probability $q$ of resolving positively, with $p < q < (1-p)$, and suppose first that $q \le 0.5$. Then the expected Brier score is:
+
+  \begin{equation}
+    E[Score] = q \cdot (1-q)^2 + (1-q) \cdot (0-q)^2 = q \cdot (1-q) \cdot (1-q + q) = q \cdot (1-q)
+  \end{equation}
+
+  As $f(x) = x \cdot (1-x)$ is strictly increasing from $0$ to $0.5$, and $p < q \le 0.5$, we have $q \cdot (1-q) > p \cdot (1-p)$, so the forecaster's expected score becomes worse. If $q > 0.5$, consider instead that the question has a probability $q' = 1 - q$ of resolving negatively, and swap $p$ and $(1-p)$.
+
+\end{proof}
+
+In fact, if a forecaster has so far performed better than random guessing, there exists a range of probabilities which, if used, is guaranteed to hurt their score not only in expectation, but regardless of the outcome of the event. This is stated formally as follows.
+
+\begin{theorem}
+  A forecaster with Brier score $b^2$ who forecasts that an event has probability $p$ is guaranteed to end up with a worse Brier score, regardless of the outcome of the event, if $b < p < 1-b$.
+\end{theorem}
+\begin{proof}
+  If $p \le 0.5$, the best score the player can achieve is $p^2$, when the event does not occur. But $p^2 > b^2$ as $p > b$, so the player's score will be worse.
+
+  If $p>0.5$, the best score the player can achieve is $(1-p)^2$, when the event occurs.
+\begin{equation}
+\begin{split}
+(1-p)>1-(1-b)\\
+\implies(1-p)>b\\
+\implies(1-p)^2>b^2\\
+\end{split}
+\end{equation}
+  Hence, in this case the player's score will also be worse.
+
+\end{proof}
+
+For example, a forecaster with a Brier score of $0.04$ ($b = 0.2$) who inputs $50\%$ on a new question will score $0.25$ on it regardless of the outcome.
+
+Some competitions don't reward the Brier score, but rather the relative Brier score (or some combination of both). The relative Brier score is defined as the difference between the forecaster's Brier score and the aggregate's Brier score. As before, a lower score is better.
+
+As in the previous case, forecasters should sometimes not predict on some questions, even if they know the probability exactly. This is not necessarily a problem, as it might lead to a better allocation of the forecaster's attention, but it can be.
+
+\begin{theorem}
+  A forecaster seeking to obtain a low average relative Brier score, and who has\footnote{Again, either currently or in expectation.} a relative Brier score of $r$, should only make predictions on questions where:
+\begin{equation}
+  E[\textnormal{the forecaster's Brier score}] - E[\textnormal{Brier score of the aggregate}] < r
+\end{equation}
+\end{theorem}
+\begin{proof}
+\begin{equation}
+\begin{split}
+&E[\textnormal{the forecaster's Brier score}] - E[\textnormal{Brier score of the aggregate}] \\
+&= E[\textnormal{the forecaster's Brier score} - \textnormal{Brier score of the aggregate}] \\
+&= E[\textnormal{relative Brier score}]
+\end{split}
+\end{equation}
+and the forecaster should only predict if $E[\textnormal{relative Brier score}] < r$.
+\end{proof}
+
+%% else use the following coding to input the bibitems directly in the TeX file.
+\bibliographystyle{plain}
+\begin{thebibliography}{00}
+
+%% \bibitem[Author(year)]{label}
+%% Text of bibliographic item
+\bibitem[CSET-Foretell (2020)]{SouthChina}
+  CSET-Foretell (2020)
+  ``Will the Chinese military or other maritime security forces fire upon another country's civil or military vessel in the South China Sea between January 1 and June 30, 2021, inclusive?''
+  URL: https://web.archive.org/web/20201031221709/https://goodjudgment.io/superforecasts/\#1338
+
+\bibitem[Enten (2017)]{Fake polls real problem}
+  Enten, H. (2017)
+  ``Fake Polls Are A Real Problem''
+  URL: https://fivethirtyeight.com/features/fake-polls-are-a-real-problem/
+
+\bibitem[Friedman (2001)]{friedman1}
+  Friedman, D. (2001)
+  \textit{Law's Order: What Economics Has to Do with Law and Why It Matters}
+
+\bibitem[Friedman (1990)]{friedman2}
+  Friedman, D. (1990)
+  \textit{Price Theory: An Intermediate Text}
+
+\bibitem[Good Judgment (2018)]{gjscience}
+  Good Judgment (2018)
+  ``The Science of Superforecasting''
+  Archived URL: https://web.archive.org/web/20180408044422/http://goodjudgment.com/science.html
+
+\bibitem[Good Judgment (2019)]{GJSR}
+  Good Judgment (2019)
+  ``4. How are my forecasts scored for accuracy?''
+  URL: https://www.gjopen.com/faq\#faq4
+
+\bibitem[Good Judgment (2020a)]{covidrecovery}
+  Good Judgment (2020a)
+  ``Public Dashboard''
+  URL: https://goodjudgment.com/covidrecovery/
+  Archived URL: https://web.archive.org/web/20201120231552/https://goodjudgment.com/covidrecovery/
+
+\bibitem[Good Judgment (2020b)]{USelections}
+  Good Judgment (2020b)
+  ``Who will win the 2020 United States presidential election?''
+  URL: https://web.archive.org/web/20201031221709/https://goodjudgment.io/superforecasts/\#1338
+
+\bibitem[Good Judgment (2020c)]{gjfirst}
+  Good Judgment (2020c)
+  ``The First Championship Season''
+  Archived URL: https://web.archive.org/web/20201127110425/https://goodjudgment.com/resources/the-superforecasters-track-record/the-first-championship-season
+
+\bibitem[Hubinger et al. (2019)]{Hubinger}
+  Hubinger, E. et al. (2019)
+  ``Risks from Learned Optimization in Advanced Machine Learning Systems''
+  URL: https://arxiv.org/abs/1906.01820
+
+\bibitem[Krakovna et al. (2020)]{Krakovna}
+  Krakovna, V. et al. (2020)
+  ``Specification gaming: the flip side of AI ingenuity''
+  URL: https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity/
+  Specification gaming examples in AI, master list: https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml
+
+\bibitem[Karvetski et al. (s.a.)]{Karvetski}
+  Karvetski, C., Minto, T., \& Twardy, C.R. (s.a.)
+  ``Proper scoring of contingent stopping questions''
+  Unpublished.
+
+\bibitem[Lagerros (2019)]{lagerros}
+  Lagerros, J. (2019)
+  ``Unconscious Economics''
+  URL: https://www.lesswrong.com/posts/PrCmeuBPC4XLDQz8C/unconscious-economics
+
+\bibitem[Lichtendahl et al. (2007)]{Lichtendahl}
+  Lichtendahl et al. (2007)
+  ``Probability Elicitation, Scoring Rules, and Competition Among Forecasters''
+  Management Science, Vol. 53, No. 11.
+  URL: https://pubsonline.informs.org/doi/abs/10.1287/mnsc.1070.0729?journalCode=mnsc
+
+\bibitem[Metaculus (2018)]{ragnarok}
+  Metaculus (2018)
+  ``Ragnarök Question Series''
+  URL: https://www.metaculus.com/questions/1506/ragnar\%25C3\%25B6k-question-series-overview-and-upcoming-questions/
+
+\bibitem[Metaculus (2020)]{aiprogress}
+  Metaculus (2020)
+  ``Forecasting AI Progress''
+  URL: https://www.metaculus.com/questions/1506/ragnar\%25C3\%25B6k-question-series-overview-and-upcoming-questions/
+
+\bibitem[Metaculus (2021)]{FAQ}
+  Metaculus (2021)
+  ``FAQ''
+  URL: https://www.metaculus.com/help/faq/\#fewpoints
+
+\bibitem[Rodriguez (2019)]{Rodriguez}
+  Rodriguez, L. (2019)
+  ``How many people would be killed as a direct result of a US-Russia nuclear exchange?''
+  URL: https://forum.effectivealtruism.org/posts/FfxrwBdBDCg9YTh69/how-many-people-would-be-killed-as-a-direct-result-of-a-us
+
+\bibitem[Tetlock \& Gardner (2015)]{Tetlock} %% . https://fs.blog/2015/12/ten-commandments-for-superforecasters/
+  Tetlock, P. \& Gardner, D. (2015) \textit{Superforecasting: The Art and Science of Prediction.}
+
+\bibitem[Witkowski et al. (2021)]{Witkowski}
+  Witkowski, J. et al. (2021) ``Incentive-Compatible Forecasting Competitions''
+  URL: https://arxiv.org/abs/2101.01816v1
+
+\bibitem[Yeargain (2020)]{Yeargain}
+  Yeargain, T. (2020)
+  ``Fake Polls, Real Consequences: The Rise of Fake Polls and the Case for Criminal Liability'' (pp. 140-150)
+  Missouri Law Review, Vol. 85, Issue 1.
+  URL: https://scholarship.law.missouri.edu/cgi/viewcontent.cgi?article=4418\&context=mlr
+\end{thebibliography}
+
+\newpage
+\begin{appendices}
+
+\section{Numerical Simulations}\label{Simulations}
+
+\subsection{Method used in the main body of the paper}
+To quantify the optimal amount of distortion, we simulate a tournament many times and observe the results. A tournament is made up of questions and users.
+
+We model questions as logistic distributions with a mean of 0 and a standard deviation itself drawn from a logistic distribution of mean 20 and standard deviation 2. For instance, a question might be a logistic distribution of mean 0 and standard deviation 15. At question resolution time, a point is randomly drawn from the logistic distribution. The code to represent this looks roughly as follows:
+
+\begin{verbatim}
+generateQuestion = function(meanOfTheStandardDeviation,
+    standardDeviationOfTheStandardDeviation){
+  mean <- 0
+  # the question's spread is itself random
+  sd <- randomDrawFromLogistic(meanOfTheStandardDeviation,
+    standardDeviationOfTheStandardDeviation)
+  # at resolution time, a point is drawn from the question's distribution
+  questionResult <- randomDrawFromLogistic(mean, sd)
+  question <- c(mean, sd, questionResult)
+  return(question)
+}
+\end{verbatim}
+
+Users attempt to guess the mean and standard deviation of each question, and each guess has some error.
The code to represent this looks roughly as follows:
+
+\begin{verbatim}
+generateUser = function(meanOfTheMean, standardErrorOfTheMean,
+    meanOfTheStandardDeviation, standardErrorOfTheStandardDeviation){
+  user <- function(question){
+    questionMean <- question[1]
+    questionSd <- question[2]
+    questionResolution <- question[3]
+    # each guess is the true parameter plus some logistic error
+    questionMeanGuessedByUser <- questionMean +
+      randomDrawFromLogistic(meanOfTheMean, standardErrorOfTheMean)
+    questionSdGuessedByUser <- questionSd +
+      randomDrawFromLogistic(meanOfTheStandardDeviation,
+        standardErrorOfTheStandardDeviation)
+    # the user is scored on the density they assigned to the resolution
+    probabilityDensityOfResolutionGuessedByUser <-
+      getLogisticDensityAtPoint(questionResolution,
+        questionMeanGuessedByUser, questionSdGuessedByUser)
+    return(probabilityDensityOfResolutionGuessedByUser)
+  }
+  return(user)
+}
+\end{verbatim}
+
+We model the average user as having:
+\begin{itemize}
+  \item \verb|meanOfTheMean=5|
+  \item \verb|standardErrorOfTheMean=5|
+  \item \verb|meanOfTheStandardDeviation=5|
+  \item \verb|standardErrorOfTheStandardDeviation=5|
+\end{itemize}
+
+We then consider a ``perfect predictor''---a user who knows what the mean and the standard deviation of a question are---and ask how much that perfect predictor would want to distort her own guess to maximize her chances of placing among the top 3 users. More details can be found in the \href{https://github.com/NunoSempere/Online-Appendix-to-Incentive-Problems-In-Forecasting-Tournaments}{Online Appendix} accompanying this paper.
+
+\subsection{Simulations with more complex distributions of players}
+\subsubsection{For binary questions}
+A binary question elicits a probability from 0 to 100\%, and is resolved as either true or false. Such questions exist on all three platforms we consider (Metaculus, Good Judgment Open and CSET).
+
+For the binary case, we first consider a simulated tournament with 10 questions, each with a ``true'' binary probability between 0 and 100\%. We also consider the following types of forecasters:
+
+\begin{enumerate}
+  \item Highly skilled predictors: 10 predictors who predict a single probability on each question. Their predictions are off from the ``true'' binary probability by anywhere from $-0.5$ to $0.5$ bits.
+  \item Unsophisticated extremizers: 10 highly skilled predictors (whose predictions are off from the ``true'' binary probability by anywhere from $-0.5$ to $0.5$ bits) who extremize their probabilities by 0.3 bits.
+  \item Sophisticated extremizers: 5 highly skilled predictors (whose predictions are off from the ``true'' binary probability by anywhere from $-0.5$ to $0.5$ bits) who take the question closest to 50\% and randomly move it to either 0\% or 100\%.
+  \item Unskilled predictors: 10 predictors who predict a single probability on each question. Their predictions are off from the ``true'' binary probability by anywhere from $-2$ to $2$ bits.
+\end{enumerate}
+
+We ran the tournament 10,000 times. We find that sophisticated extremizers do best, followed by lucky unskilled predictors, followed by unsophisticated extremizers, followed by honest highly skilled predictors.
+
+\newpage
+
+\begin{figure}[h!]
+  \includegraphics[width=10cm]{ProbWin5}
+  \centering
+  \caption{\% of the time different predictors reach the top 5}
+\end{figure}
+
+\begin{figure}[h!]
+  \includegraphics[width=10cm]{BrierScores5}
+  \centering
+  \caption{Mean Brier score for each group (lower is better)}
+\end{figure}
+\newpage
+
+We can also explore how this changes with the number of questions.
In this case, we ran 1,000 simulations for each possible number of questions, in increments of five. Results were as follows:
+\\
+\begin{figure}[h!]
+  \includegraphics[width=\textwidth]{ProbDiscreteTopWithNumQuestionsTwoExtremizedQuestions}
+  \centering
+  \caption{Probability that a ``highly skilled predictor'' will obtain the lowest Brier score, for a tournament with $n$ binary questions}
+\end{figure}
+
+The results would further vary with how many questions the extremizing forecasters choose to extremize. Dynamically selecting the number of questions to extremize based on the total number of questions could provide further opportunities for exploitation, but we did not explore this.
+%% I can actually do this, it just takes ~half an hour to an hour per simulation
+%% Forecasters who extremize extremize two questions
+%% Forecasters who extremize extremize three questions
+%% Forecasters who extremize extremize four questions
+%% If you can write the code I'll run it.
+\newpage
+\subsubsection{For continuous questions}
+
+A continuous question elicits a probability distribution and resolves to a particular value; participants are rewarded in proportion to the probability density they assigned to that resolution value. Such questions exist only on Metaculus (and on other experimental platforms, like foretold.io).
+
+We first ran 20,000 simulations of a tournament with 100 participants and 30 questions. Results for all 30 questions were drawn from a logistic distribution of mean 50 and standard deviation 10.
+
+The participants were divided into:
+
+\begin{enumerate}
+  \item One perfect predictor, who predicts a logistic with mean 50 and sd 10, i.e., the true underlying distribution. Represented in green.
+  \item 10 highly skilled predictors who are somewhat wrong, at various levels of systematic over- or underconfidence. They predict a single logistic on each question with mean chosen randomly from 45--55 and standard deviation ranging over $\{4, 6, \ldots, 20, 22\}$. Represented in orange.
+  \item 10 highly skilled predictors trying to stand out by extremizing some of their forecast distributions. They predict a single logistic on each question with mean chosen randomly from 45--55, with standard deviation 10 for 25 of the questions and standard deviation 5 for a (randomly selected) other 5. Represented in light greenish brown.
+  \item 20 unskilled predictors. They predict a single logistic with means for each question chosen randomly from 35--65 and standard deviation 10. Represented in light blue.
+  \item 20 unskilled, overconfident predictors. They predict a single logistic with means for each question chosen randomly from 35--65 and sd 5. Represented in dark blue.
+  \item 39 unskilled, underconfident predictors. They predict a single logistic with means for each question chosen randomly from 35--65 and sd 20. Represented in pink.
+\end{enumerate}
+
+Scoring was according to the Metaculus score formula, though we used the mean of the scores instead of the score of the mean to simplify the process. %% [TO DO: Change]
+%% Nuño: Maybe change this?
+%% Agreed, it seems possible that it could make a significant difference, which would be a notable result if true.
+
+On such a tournament, it would be plausible for only the top 5 forecasters to be rewarded, so we present the probabilities of being in the top 5 for each group.
+
+We find that those who change the position of the mean slightly from the true distribution (i.e., groups 3, 4 and, to a lesser extent, 2) gain a 5\% absolute advantage in terms of getting into the top 5, and a 50\% relative advantage. Each forecaster who does this is $\approx 15\%$ likely to reach the top 5; in contrast, the perfect predictor only has a $\approx 10\%$ chance of doing so.
+
+This can be explained by noting that a forecaster who wants to reach a discrete ceiling faces a trade-off between the mean of their expected score and the variance of that score. If the goal is to reach a threshold, increasing the variance at the expense of the mean turns out, in this case, to be worth it (see the stripped-down sketch at the end of this subsection). Graphs which showcase this effect follow.
+\\
+\begin{figure}[h!]
+  \includegraphics[width=\textwidth]{ProbContinuousHighlySkilledInTopWithNumQuestions}
+  \centering
+  \caption{Approximate probability (understood as ``frequency during simulations'') that a ``highly skilled forecaster'' will obtain the lowest Brier score, for a tournament with $n$ continuous questions}
+\end{figure}
+
+%% TODO: Discussion of "bubbles"
+
+%\includegraphics{MeanScoreVsFreqTop5}
+
+%See \ref{Simul discrete} for numerical simulations which attempt to quantify this effect. We find that for $n=30$ questions, as in Metaculus tournaments, the effect is notable. In particular, consider the following simulations:
+
+% Alex: Referring to specific past tournaments is probably better here. Actually, other than the Academy Series and the Insight tournament, Metaculus has not run any 10-question discrete-only tournaments. The Insight tournament obviously changed to a probabilistic reward structure in response to these concerns, which might be worth discussing explicitly.
+% Nuño: Do you want to add discussion to those past tournaments? I'm not as familiar.
+
+% Nuño: Removed graphs below as somewhat low quality.
+%\newpage
+%
+%\begin{figure}[h!]
+% \includegraphics[width=12cm]{ContinuousSimulationMeanScores}
+% \centering
+% \caption{Mean scores for continuous distributions}
+%\end{figure}
+%
+%\begin{figure}[h!]
+% \includegraphics[width=12cm]{ContinuousSimulationMeanPosition}
+% \centering
+% \caption{Mean position for continuous distributions}
+%\end{figure}
+%
+%\newpage
+%
+%\begin{figure}[h!]
+% \includegraphics[width=12cm]{ContinuousSimulationFrequencyInTop5}
+% \centering
+% \caption{Frequency in the top 5}
+%\end{figure}
+%
+%\begin{figure}[h!]
+% \includegraphics[width=12cm]{MeanScoreVsFreqTop5}
+% \centering
+% \caption{Mean score vs probability of being in the top 5}
+%\end{figure}
+%
+%\newpage
+%% Nuño: Maybe some graphs here seeing how this effect changes with more and more questions.
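+
+The mechanism can be illustrated with a stripped-down model. The following Python sketch is ours, not the simulation code used for the figures above, and all of its score distributions and parameters are invented for illustration: every forecaster receives a score drawn from a normal distribution, honest forecasters get the higher mean, extremizers trade some mean for extra variance, and we count how often each type lands in the top 5.
+
+\begin{verbatim}
+import random
+
+# Stripped-down model of the mean-variance trade-off described above.
+# "Honest" players have the better mean score; "extremizers" trade
+# some mean for extra variance. All parameters are illustrative.
+def top5_frequency(trials=20000, n_honest=50, n_extremizers=50):
+    slots = {"honest": 0, "extremizer": 0}
+    for _ in range(trials):
+        scores = [("honest", random.gauss(1.0, 1.0))
+                  for _ in range(n_honest)]
+        scores += [("extremizer", random.gauss(0.8, 2.0))
+                   for _ in range(n_extremizers)]
+        scores.sort(key=lambda pair: pair[1], reverse=True)
+        for kind, _ in scores[:5]:  # who occupies the top 5 slots?
+            slots[kind] += 1
+    return {kind: count / (trials * 5)
+            for kind, count in slots.items()}
+
+# Despite their lower mean score, extremizers typically dominate
+# the top 5, because the top slots reward right-tail outcomes.
+print(top5_frequency())
+\end{verbatim}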
+
+\section{Extremization calculations}
+\label{Extremization calculations}
+
+The expected ``participation-rate weighted Brier score'' for a question which closes on the $n$th day if the event happens, or on the 5th day if the event hasn't happened by then, will be:
+
+\begin{equation}
+\begin{split}
+  E[PWBS] &= 0.25 \cdot PWBS(\textnormal{The event happens on the first day}) \\
+  &+ 0.75\cdot 0.25 \cdot PWBS(\textnormal{The event happens on the second day}) \\
+  &+ 0.75^2\cdot 0.25 \cdot PWBS(\textnormal{The event happens on the third day}) \\
+  &+ 0.75^3\cdot 0.25 \cdot PWBS(\textnormal{The event happens on the fourth day}) \\
+  &+ 0.75^4\cdot PWBS(\textnormal{The event doesn't happen}) \\
+\end{split}
+\end{equation}
+
+Now, the integral in (\ref{PWBS definition}) transforms into a simple sum, because probabilities stay constant throughout the day, so
+
+\begin{equation}
+\begin{split}
+  &PWBS(\textnormal{The event happens on the $n$th day}) \\
+  &= \frac{Brier(p(S_1)) + ... + Brier(p(S_n))}{n}
+\end{split}
+\end{equation}
+
+Likewise for the event not happening, except that $Brier(p)$ will be equal to $(0-p)^2$ instead of $(1-p)^2$:
+
+\begin{equation}
+\begin{split}
+  &PWBS(\textnormal{The event doesn't happen}) \\
+  &= \frac{Brier(p(S_1)) + ... + Brier(p(S_4))}{4}
+\end{split}
+\end{equation}
+
+And so, the expected score is
+
+\begin{equation}
+\begin{split}
+  &E[PWBS] = 0.25 \cdot (1-p(S_1))^2\\
+  &+ 0.75\cdot 0.25 \cdot \frac{ (1-p(S_1))^2 + (1-p(S_2))^2}{2} \\
+  &+ 0.75^2\cdot 0.25 \cdot \frac{ (1-p(S_1))^2 + (1-p(S_2))^2 + (1-p(S_3))^2 }{3} \\
+  &+ 0.75^3\cdot 0.25 \cdot \frac{ (1-p(S_1))^2 + (1-p(S_2))^2 + (1-p(S_3))^2 +(1-p(S_4))^2}{4} \\
+  &+ 0.75^4 \cdot \frac{ (0-p(S_1))^2 + (0-p(S_2))^2 + (0-p(S_3))^2 +(0-p(S_4))^2}{4} \\
+\end{split}
+\end{equation}
+
+Given this, we can calculate the expected ``participation-rate weighted Brier score'' for the true probabilities and for the extremized probabilities:
+
+\begin{equation}
+  \begin{split}
+  E[PWBS(\textnormal{honest probabilities})] = 0.193 \\
+  E[PWBS(\textnormal{extremized probabilities})] = 0.182
+  \end{split}
+\end{equation}
+
+\end{appendices}
+\end{document}
+
+\endinput
+%%
+%% End of file `elsarticle-template-harv.tex'.