While taking two MITx courses on development policy, *Foundations of Development Policy* and *The Challenges of Global Poverty*, I had the opportunity to answer a great number of self-assessment questions.
For each of them, I noted down:
1. The question
2. The date
3. The type of question: {Multiple choice, Multiple selection, True or False, Enter a value}
4. Whether it was a question asked during the lecture, given as homework, or part of an exam
5. My answer
6. Whether I got it right
7. Whether I got it right on the first try, or needed a second one
8. My inner experience, one of {Hunch, Somewhat confident, Confident, Very confident, Incredibly confident}
9. A short comment, if I felt like it.
10. My current mental state, according to the Becker Depression Checklist (BDC; only for the second half of the dataset; for the first part, the average value of the second part is given instead)
11. The probability I assign to my answer being correct, in odds form. For example, probabilities of 50%, 75%, 90% and 95% correspond to odds of 1:1, 1:3, 1:9 and 1:19, respectively.
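For concreteness, this is what the conversion between that odds form and probabilities might look like in code (a minimal sketch; the function names are mine, not part of the dataset):

```python
from fractions import Fraction

def odds_to_probability(odds: str) -> float:
    """Convert odds written as 'a:b' (a chances of being wrong for
    every b chances of being right) into a probability of being right."""
    wrong, right = (int(x) for x in odds.split(":"))
    return right / (wrong + right)

def probability_to_odds(p: float) -> str:
    """Convert a probability back into the same 'a:b' odds form."""
    frac = Fraction(p).limit_denominator(1000)
    return f"{frac.denominator - frac.numerator}:{frac.numerator}"

assert odds_to_probability("1:3") == 0.75    # 75%
assert odds_to_probability("1:19") == 0.95   # 95%
assert probability_to_odds(0.9) == "1:9"     # 90%
```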
I have 505 observations. The dataset is available if I know you or if you can get someone who knows me to vouch for you.
In this case, two pictures. The second merges probabilities above and below 0.5 in the obvious way: it interprets having assigned a probability of, say, 0.33 to "X" as having assigned a probability of 0.66 to "Not X". Working with odds, this is also straightforward: if you think that 1:2 are fair odds in favor of "X", you also think that 2:1 are fair odds in favor of "Not X".
I notice that my 1:5 is closer to 1:2.5 in reality, with n=28 observations. My 1:15 is also closer to 1:5, but I think that this particularity can be explained by 1:15 being the default value, i.e., the value which got written when I left that cell blank. I'll nonetheless pay attention to that in the future. On the bright side, my 1:2 and 1:3 odds are exactly on point.
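A minimal sketch of that folding step, assuming each observation is stored as an assigned probability plus a 0/1 outcome (the variable names are my own):

```python
def fold(p: float, outcome: int) -> tuple[float, int]:
    """Merge probabilities below 0.5 with those above it: assigning
    probability p to "X" is treated as assigning 1 - p to "Not X",
    and the recorded outcome is flipped accordingly."""
    if p < 0.5:
        return 1 - p, 1 - outcome
    return p, outcome

# Assigning 0.33 to "X", with "X" turning out true, becomes
# assigning ~0.67 to "Not X", with "Not X" turning out false.
print(fold(0.33, 1))
```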
## 2. How do I compare to some simple regression models?
I create four simple linear regression models and interpret their outputs as probabilities. I also consider a really, really dumb predictor, for comparison purposes. (A rough code sketch of these models can be found after the results table below.)
- 1. The first one regresses the binary outcome (1 if true, 0 if false) on the variables 3, 4, 7 and 10 outlined in the set up. That is, I consider the type of question, whether it was a lecture question, homework or part of an exam, my score in the BDC, and whether it was the first try or not.
- 2. The second includes all the data outlined in the set up, except my subjective probability. Or, in other words, all the above + {Hunch, Somewhat confident, Confident, Very confident, Incredibly Confident} as factors.
- 3. The third only regresses binary outcome on my subjective probability.
- 4. The fourth includes all the numerical factors outlined in the setup.
- 5. The fifth model is really dumb, and just outputs as a probability my total base rate. There were 505 questions, I got 451 right, so it outputs a probability of 451/505 every time. If it only gets partial data, it calculates the base rate for the data it has, and always predicts that afterwards.
| | Which variable does the regression try to predict? | Variables regressed on (what information does this model work with) | Brier score tested & trained on the whole set | Trained on 80% and tested on the rest (average value, 1000+ times) | Trained on 50% and tested on the rest (average value, 1000+ times) |
|---|---|---|---|---|---|
| 1. Dumbest model | Binary outcome | None. Empty regression, just the intercept | 0.095496 | 0.09538288 | 0.09542671 |
| 2. Regression without any subjective factors | Binary outcome | 1. Type of question 2. Homework vs Exam vs Lecture question 3. First vs second try 4. BDC | 0.082598 | 0.131296 | 0.1412152 |
| 3. Regression model with inner experience | Binary outcome | 1. Type of question 2. Homework vs Exam vs Lecture question 3. First vs second try 4. Becker Depression Checklist Score. 5. Inner experience: Hunch to Incredibly Confident | 0.076962 | 0.1040722 | 0.1149272 |
| 4. Full regression model | Binary outcome | 1. Type of question 2. Homework vs Exam vs Lecture question 3. First vs second try 4. Becker Depression Checklist Score. 5. Inner experience: Hunch to Incredibly Confident 6. Subjective probability | 0.073224 | 0.09260587 | 0.1020023 |
| 5. Regression model with only my subjective probability | Binary outcome | 1. Subjective probability | 0.075541 | 0.07545493 | 0.07538371 |
| 6. My subjective probability (not a regression model) | Does not apply | Does not apply | 0.0755985 | - | - |
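For concreteness, here is a rough sketch of how models like these could be fit and scored on the whole dataset. The column names, and the use of pandas and statsmodels, are my assumptions about the setup, not the actual code behind the table:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical column names: correct (0/1), question_type, context
# (lecture/homework/exam), second_try (0/1), bdc (depression checklist
# score), inner_experience (Hunch ... Incredibly confident), subj_prob.
df = pd.read_csv("questions.csv")

def brier(predictions, outcomes) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(predictions) - np.asarray(outcomes)) ** 2))

formulas = {
    "1. dumbest (base rate only)":    "correct ~ 1",
    "2. no subjective factors":       "correct ~ C(question_type) + C(context) + second_try + bdc",
    "3. with inner experience":       "correct ~ C(question_type) + C(context) + second_try + bdc + C(inner_experience)",
    "4. full model":                  "correct ~ C(question_type) + C(context) + second_try + bdc + C(inner_experience) + subj_prob",
    "5. only subjective probability": "correct ~ subj_prob",
}

# Column "Brier score tested & trained on the whole set":
for name, formula in formulas.items():
    fit = smf.ols(formula, data=df).fit()  # linear probability model
    print(name, brier(fit.predict(df), df["correct"]))

# Row 6, my own stated probabilities, with no model in between:
print("6. subjective probability", brier(df["subj_prob"], df["correct"]))
```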
### 2.1. The dumbest model

A dumb model which always outputs the overall base rate gets a Brier score of 0.09549652. If I train it 1000 times on a randomly selected 80% of my data set and test it on the other 20%, it still gets a 0.095-ish Brier score on average. By "train it", I mean "regress the binary outcome on a constant variable of value 1", or, equivalently, "calculate the base rate for that 80% and use that". It is surprising to me that this does better than some of the other models, some of the time.
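That number can also be checked directly from the counts: a constant prediction of p = 451/505 ≈ 0.893 gets a Brier score of (451/505)·(1 − 0.893)² + (54/505)·(0.893)², which simplifies to p·(1 − p) ≈ 0.893 · 0.107 ≈ 0.0955.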
### 2.2. Regression without any subjective factors

I found it rather surprising that how depressed I was (BDC: Becker Depression Checklist) didn't seem to have that big of an effect. In particular, I do try to adjust for my mood, but I wasn't particularly expecting to succeed. Anecdotally, I do see an effect of my mood on the extremity of my odds: the sadder I am, the more reluctant I am to give 1:1000, 1:10000 and higher odds, even about things which I'm really sure about.
A multiple selection (MS) question can be thought of as a conjunction of several true or false (TF) questions, so the coefficient is lower. Surprisingly, though, multiple choice questions seem to be easier than true or false questions. This might be because for multiple choice questions the MITx team had a slight bias towards having the right answer be "a.", which I detected and exploited. The other possible type of question is the one in which I'm asked to enter a value, which is unsurprisingly harder.
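For example (with made-up numbers), if a multiple selection question has four independent true-or-false parts and I would get each part right with probability 0.95, I would only get the whole question right with probability 0.95⁴ ≈ 0.81, so a lower coefficient for that question type is what one would expect.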
Otherwise, having needed a second try means that the question was hard, so there is accordingly a penalty. All in all, these coefficients seem reasonable.
If I use this model to output predicted probabilities for each question, I get a Brier score of 0.0825985. If I instead train the model on a random selection of 80% of the data points and test it on the other 20%, my answer depends on the specific selection. If I do that 1000 times, I get a Brier score of 0.13-ish, on average. However, is this a result of having less data, or of having overfitted? To answer that, I trained the model on 50% of the data (selected randomly, 1000 times) and tried to predict the other half, in which case it gets a Brier score of 0.14-ish, on average. Perhaps with more data it would have gotten to 0.12-ish, so the difference from that to 0.08-ish is probably a result of overfitting.
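The repeated train/test evaluation used here and in the following sections could be sketched roughly as follows (same assumed column names and formulas as in the earlier sketch; the split fraction and number of runs are the parameters the text describes):

```python
import numpy as np
import statsmodels.formula.api as smf

def repeated_split_brier(df, formula, train_frac=0.8, runs=1000, seed=0):
    """Average out-of-sample Brier score over many random train/test splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(runs):
        mask = rng.random(len(df)) < train_frac  # ~train_frac of rows go to training
        train, test = df[mask], df[~mask]
        fit = smf.ols(formula, data=train).fit()
        predictions = fit.predict(test)
        scores.append(np.mean((predictions - test["correct"]) ** 2))
    return float(np.mean(scores))

# E.g. the "no subjective factors" model, trained on 80% and on 50% of the data:
# repeated_split_brier(df, formulas["2. no subjective factors"], train_frac=0.8)
# repeated_split_brier(df, formulas["2. no subjective factors"], train_frac=0.5)
```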
### 2.3. Everything except my subjective probability
As expected, the coefficients associated with a measure of my inner confidence check out: Hunch < Somewhat confident < Confident < Very confident < Incredibly confident (IC). Note that all of the factors are present, instead of one of them having been swallowed by the intercept, because there were 3 times in which I just left that question blank, and I didn't want to remove that data.
If I use this model to output predicted probabilities for each question, I get a Brier score of 0.07696276. This is *scarily close* to my own Brier score of 0.0755985, until one realizes that a) the dataset on which this is tested is the same dataset on which the model has been trained, and b) it already contains some of my subjective information in the form of my inner experience.
If I instead train the model on a random selection of 80% of the data points and test it on the other 20%, my answer depends on the specific selection. If I do that 1000 times, I get a Brier score of 0.10-ish, on average. Again, the question arises of whether this is a result of having less data, or of having overfitted. To answer that, I again trained the model on 50% of the data (selected randomly, 1000 times) and tried to predict the other half, in which case it gets a Brier score of 0.11-ish, on average. Perhaps with more data it would have gotten to 0.09-ish, so the difference from that to 0.07-ish is probably a result of overfitting.
### 2.4. Including my subjective probability & everything else.
All the other factors become somewhat less relevant. It seems that my subjective probability does add information, and a lot of it. After having seen the graph at the beginning, this is not surprising.
If I use this model to output probabilities for each question, I get a Brier score of 0.07322408, which is *better* than my subjective probability alone (0.0755985). As before, if I train this regression model on a random selection of 80% of the data points and test it on the other 20%, my answer depends on the specific selection. If I do that 1000 times and average the result, I get a Brier score of 0.09-ish. Again, the question arises of whether this is a result of having less data, or of having overfitted. To answer that, I again trained the model on 50% of the data (selected randomly, 1000 times) and tried to predict the other half, in which case it gets a Brier score of 0.10-ish, on average. Perhaps with more data it would have gotten to 0.08-ish. Too close to call.
### 2.5. Only my subjective probability

This model amounts to a small correction: multiply my probability by 1.005-ish and subtract 1.2 percentage points, and I'd be slightly more accurate. I'm not reading much into that. If I do this, I get a Brier score of 0.075541, slightly better than my own 0.0755985.
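For instance, odds of 1:15 correspond to a probability of 15/16 ≈ 0.938; reading the correction as roughly 0.938 × 1.005 − 0.012 ≈ 0.93, it nudges those (apparently overconfident) 1:15 answers down a bit.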
If, like before, I train that model 1000 times on a randomly selected 80% of my dataset and test it on the other 20%, I get on average a Brier score of 0.07545493, slightly *better* than my own 0.0755985, but not by much. Perhaps it gets that slight advantage because the p*1.005 - 1.2% correction fixes my uncalibrated 1:15 odds without muddying the rest too much? Surprisingly, if I train it on a randomly selected 50% of the dataset (1000 times), its average Brier score improves to 0.07538371. I do not think that a difference of 0.0002 tells me much.
In conclusion, I am surprised that the dumb model beats most of the other models out of sample most of the time, though I think this might be explained by the combination of not having that much data and having a lot of variables: the random errors in my regressions are large. I see that I am in general well calibrated (in the particular domain analyzed here), but with room for improvement when giving 1:5, 1:6, and 1:15 odds.