nunosempere.github.io/ESPR-Evaluation/Designing-Surveys-Review/Review.md
2019-01-27 15:16:35 +01:00


Review of The Power of Survey Design and Improving Survey Questions

[Epistemic status: Confident.]

Simplicio: I have a question.
Salviati: Woe be upon me.
Simplicio: When people make surveys, how do they make sure that the questions measure what they want to measure?

Outline

  • Introduction.
  • For the eyes of those who are designing a survey.
  • People are inconsistent. Some ways in which populations are systematically biased
  • The Dark Arts!
    • Legitimacy
    • Don't bore the answerer
    • Elite respondents.
  • Useful categories
    • Memory
    • Consistency and ignorance
    • Subjective vs objective questions
  • Test your questionnaire
  • Tactics
    • Be aware of the biases
    • Don't confuse the question with the question objective.
    • An avalanche of advice.
  • Closing thoughts

Introduction

As part of my research on ESPR's impact, I've read two books on the topic of survey design, namely The Power of Survey Design (TPOSD) and Improving survey questions: Design and Evaluation.

They have given me an appreciation of the biases and problems that are likely to pop up when having people complete surveys, and I think this knowledge would be valuable to a variety of people in the EA and rationality communities.

For example, some people are looking into mental health as an effective cause area. In particular, in Spain Danica Wilbanks is working on trying to estimate the prevalence of mental health issues in the EA community. Something to consider in this case is that people with severe depression might be less likely to answer a survey, because doing so takes effort. So the actual proportion in the survey is likely to be an underestimate. Unless people with mental health issues are more likely to participate in a survey about the topic.

I've gotten some enjoyment and extra motivation out of inhabiting the state of mind of an HPMOR Dark Lord while framing the study of these matters as learning the Dark Arts. May you share this enjoyment with me.

For the eyes of those who are designing a survey:

You might want to read this review for the quick version, then:

a) If you don't want to spend too much time: Focus on this checklist, this list of principles, as well as this neat summary I found on the internet.

b) If you want to spend a moderate amount of time:

  • Chapter 3 of The Power of Survey Design (68 pages) and/or Chapter 4 of Improving survey questions (22 pages) for general things to watch out for when writing questions. Chapter 3 of TPOSD is the backbone of the book.
  • Chapter 5 of The Power of Survey Design (40 pages) for how to use the dark arts to have more people answer your questions willingly and happily.

c) For even more detail:

  • The introductions, i.e. Chapter 1 and 2 of The Power of Survey Design (9 and 22 pages, respectively), and Chapter 1 of Improving survey questions (7 pages) if introductions are your thing, or if you want to plan your strategy. In particular, Chapter 2 of TPOSD has a cool Gantt Chart.
  • Chapters 2 and 3 of Improving survey questions (38 and 32 pages, respectively) for considerations on gathering factual and subjective data, respectively.
  • Chapter 5 of Improving survey questions (25 pages) for how to evaluate/test your survey before the actual implementation.
  • Chapter 6 of Improving survey questions (12 pages) for kind of obvious advice about trying to find something like hospital records to validate your questionnaire with, or about repeating some important questions in slightly different form and get really worried if they don't answer the same thing.

Here and here are the indexes for both books. libgen.io might be of use to download an illegal copy.

Both books are clearly dated in some respects: neither considers online surveys, since in their day self-administered surveys meant mailing questionnaires to people. The second book even suggests that "sensitive questions are put on a tape player (such as a Walkman) that can be heard only through earphones". On the broad principles and considerations, however, I think both books remain useful guides.

People are inconsistent. Some ways in which populations are systematically biased

Here is a nonexhaustive collection of curious anecdotes mentioned in the first book:

  • A Latinobarometro poll in 2004 showed that while a clear majority (63 percent) in Latin America would never support a military government, 55 percent would not mind a nondemocratic government if it solved economic problems.

  • When asked about a fictitious “Public Affairs Act”, one-third of respondents volunteered an answer.

  • The choice of numeric scales has an impact on response patterns: using a scale that goes from -5 to +5 produces a different distribution of answers than using a scale that goes from 0 to 10.

  • The order of questions influences the answer. Wording as well: framing the question with the term "welfare" instead of with the formulation "incentives for people with low incomes" produces a big effect.

  • Options that appear at the beginning of a long list seem to have a higher likelihood of being selected. For example, when alternatives are listed from poor to excellent rather than the other way around, respondents are more likely to use the negative end of the scale. Unless the list is read out loud, as in a phone interview, in which case the last options are the more likely ones.

  • When asked whether they had visited a doctor in the last two weeks: Apparently, when respondents have had a recent doctor visit, but not one within the last two weeks, there is a tendency to want to report it. In essence, they feel that accurate reporting really means that they are the kind of person who saw a doctor recently, if not exactly and precisely within the last two weeks.

  • The percentage of people supporting US involvement in WW2 almost doubled if the word "Hitler" appeared in the question.

Frankly, I find this so fucking scary. I guess that some part of me implicitly had a model of people having a thought-out position with respect to democracy, which questions merely elicited. As if.

The Dark Arts!

Key extract: "Evidence shows that expressions of reluctance can be overcome" (The Power of Survey Design, p. 175; the quote is from Chapter 5). I'm fascinated by this chapter, because the author has spent way more time thinking about this than the survey-taker: he is one or two levels above the potential answerer and can nudge their behavior.

As a short aside, the analogies to pickup artistry are abundant. One could charitably summarize their position as highlighting that questions pertaining to romance and sex will be answered differently depending on how they are posed, because people don't have an answer written in stone beforehand.

Of course, the questionnaire writer could write biased questions with the intention of producing the answers he wishes to obtain, but these books go in a subtler direction: Once good questions have been written, how do you convince people, perhaps initially reluctant, to take part in your survey? How do you get them to answer sensitive questions truthfully?

For example:

Three factors have been proven to affect persuasion: the quantity of the arguments presented, the quality of the arguments, and the relevance of the topic to the respondent. Research on attitude change shows that the number of arguments (quantity) presented has an impact on respondent attitudes only if saliency is low (figure 5.6). Conversely, the quality of the arguments has a positive impact on respondents only if personal involvement is high (figure 5.7). When respondents show high involvement, argument quality has a much stronger effect on persuasion, while weak arguments might be counterproductive. At the same time, when saliency is low, the quantity of the arguments appears to be effective, while their quality has no significant persuasive effect (figure 5.8) (Petty and Cacioppo 1984).
These few minutes of introduction will determine the climate of the entire interview. Hence, this time is extremely important and it must be used to pique the respondents' interest...

Most importantly, how do you defend against someone who has carried out multiple randomized trials to map out the different behaviors you might adopt, and how best to persuade you in each of them? I feel that Zvi's essay on things that "are out to get you" has mapped the possible behaviors you might adopt in defense. Chief among them is actually being aware.

Legitimacy

At the beginning, make sure to assure legal confidentiality; maybe research the relevant laws in your jurisdiction and make reference to them. Name-drop sponsors; include contact names and phone numbers. Explain the importance of your research, its unique characteristics, and its practical benefits.

Part of signalling confidentiality, legitimacy, and competence involves actually doing the thing. For example, if you assure legal confidentiality but then ask for information which would permit easy deanonymization, people might notice and get pissed. But another part is simply being aware of this dimension.

The first questions should be easy, pleasant, and interesting. Build up confidence in the survey's objective, and stimulate the respondent's interest and participation by making sure they can see the relationship between the questions asked and the purpose of the study.

Make sensitive questions longer, as they are then perceived as less threatening. Perhaps add a preface explaining that both alternatives are ok. Don't ask them at the beginning of your survey.

Bears repeating: Don't bore the answerer.

Seems obvious. Cooperation will be highest when the questionnaire is interesting and when it avoids items that are difficult to answer, time-consuming, or embarrassing. In my case, making my survey interesting means starting with a prisoner's dilemma with real payoffs, which will double as the monetary incentive to complete the survey.

It serves no purpose to ask the respondent about something he or she does not understand clearly or that is too far in the past to remember correctly; doing so generates inaccurate information.

Don't ask a long sequence of very similar questions. This bores and irritates people, which leads them to answer mechanically. A term used for this is acquiescence bias: in questions with an "agree-disagree" or "yes-no" format, people tend to agree or say yes even when the meaning is reversed. In questions with a "0-5" scale, people tend to choose 2.

On the other hand, don't make the questions too hard. In general, telling respondents a definition and asking them to classify themselves is too much work.

Elite respondents

This section might be particularly relevant for the high-IQ crowd characteristic of the EA and LW movements. Again, the key move is to match the level of cognitive complexity of the question to the respondent's level of cognitive ability, as not doing so leads to frustration. This mirrors my own experience as a survey participant.

Elites are apparently quickly irritated if the topic of the questions is not of interest to them. Vague queries generate a sense of frustration, and lead to a perception that the study is not legitimate. Oversimplifications are noticed and disliked.

Start with a narrative question, and add open questions at regular intervals throughout the form. Elites “resent being encased in the straitjacket of standardized questions” and feel particularly frustrated if they perceive that the response alternatives do not accurately address their key concern.

For example, the recent 80,000 Hours survey had the following question: "Have your career plans changed in any way as a result of engaging with 80,000 Hours?". The possible answers were not really exhaustive; in particular, there was no option for "I made a big change, but only partially as a result of 80,000 Hours", or "I made a big change, but I am really not sure what the counterfactual scenario would have been". I remember this frustrating me, because the alternatives did not provide a clear way to express it.

Improving Survey Questions goes on at length about ensuring that people are asked questions to which they know the answers, and "Have your career plans changed in any way as a result of engaging with 80,000 Hours?" might be a question whose answer respondents don't actually know. Perhaps an alternative would be to divide that question into:

  • Have your career plans changed in any way in the last year?
  • How big was that change?
  • Did 80,000h have any influence on it?
  • Where "100" means that 80,000 Hours was unambiguously causally responsible for that change, "50" means that you would have given even odds to making that change in the absence of any interaction with 80K, and "0" means that you're absolutely sure 80K had nothing to do with it: how much did 80,000 Hours influence that change?

Yet I'm not confident that formulation is superior, and at some level, I trust 80K to have done their homework.

Useful categories.

Memory

Events less than two weeks into the past can be remembered without much error. There are several ways in which people can estimate the frequency with which something happens, chiefly:

  • Availability bias: how easy it is to remember X.
  • Episodic enumeration: recalling and counting occurrences of an event.
  • Resorting to some sense of normative frequency.
  • Etc.

Of these, episodic enumeration turns out to be the most accurate, and people rely on it more when there are fewer instances to count. The wording of the question can be changed to facilitate episodic enumeration.

Asking a longer question, and communicating to respondents the significance of the question, has a positive effect on the accuracy of the answer. This means phrasings such as “please take your time to answer this question,” “the accuracy of this question is particularly important,” or “please take at least 30 seconds to think about this question before answering”.

If you want to measure knowledge, take into account that recognizing is easier than recalling. More people will be able to recognize a definition of effective altruism than to produce one on their own. If you use a multiple-choice question with n options, and x% of people knew the answer while (100-x)% didn't, you might expect (100-x)/n % of all respondents to have guessed correctly by chance, so you'd observe that y% = x% + (100-x)/n % selected the correct option.
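This guessing correction is easy to apply in code. A minimal sketch (the function names and the worked numbers are mine, not from the books), assuming non-knowers guess uniformly at random among the n options:

```python
def observed_correct(known_pct, n_options):
    """Expected % answering correctly, if the (100 - known_pct)% who
    don't know the answer guess uniformly among n_options."""
    return known_pct + (100 - known_pct) / n_options

def estimated_known(observed_pct, n_options):
    """Invert the guessing model: recover the % who truly knew the
    answer from the observed % of correct responses."""
    return (n_options * observed_pct - 100) / (n_options - 1)

# If 40% truly know the answer to a 4-option question,
# you should observe about 55% correct responses:
print(observed_correct(40, 4))   # → 55.0
print(estimated_known(55.0, 4))  # → 40.0
```

The inversion comes from solving y = x + (100 - x)/n for x, which gives x = (ny - 100)/(n - 1); note that the estimate can go negative if respondents do worse than chance.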

Consistency and Ignorance.

In one of the examples at the beginning, one third of respondents gave an opinion about a fictitious Act. This generalizes: respondents rarely admit ignorance. It is thus a good idea to offer an "I don't know" or "I don't really care about this topic" option. The recent SlateStarCodex Community Survey had a problem in this regard with some questions: once checked, they couldn't be unchecked.

With regard to consistency, it is a good idea to ask similar questions in different parts of the questionnaire to check the consistency of answers. Reverse some of the questions.

Subjective vs objective questions

The author of Improving Survey Questions views the distinction between objective and subjective questions as very important. That there is no direct way to know about people's subjective states, independent of what they tell us, apparently has serious metaphysical implications. To this he devotes a whole chapter.

Anyways, despite the lack of an independent measure, there are still things to do, chiefly:

  • Place answers on a single well defined continuum
  • Specify clearly what is to be rated.

And yet, the author goes full relativist:

"The concept of bias is meaningless for subjective questions. By changing wording, response order, or other things, it is possible to change the distribution of answers. However, the concept of bias implies systematic deviations from some true score, and there is no true score... Do not conclude that "most people favor gun control", "most people oppose abortions"... All that happened is that a majority of respondents picked response alternatives to a particular question that the researcher chose to interpret as favorable or positive."

Test your questionnaire

I appreciated the pithy phrases "Armchair discussions cannot replace direct contact with the population being analyzed" and "Everybody thinks they can write good survey questions". With respect to testing a questionnaire, the books go over different strategies and argue for some reflexivity when deciding what type of test to undertake.

In particular, the intuitive or traditional way to go about testing a questionnaire would be a focus group: you have some test subjects, have them take the survey, and then talk with them or with the interviewers. This, the authors argue, is messy, because some people might dominate the conversation out of proportion to the problems they encountered. Additionally, random respondents are not actually very good judges of questions.

Instead, no matter what type of test you're carrying out, having a table with issues for each question, filled individually and before any discussion, makes the process less prone to social effects.

Another alternative is to try to get in the mind of the respondent while they're taking the survey. To this effect, you can ask respondents:

  • to paraphrase their understanding of the question.
  • to define terms
  • for any uncertainties or confusions
  • how accurately they were able to answer certain questions, and how likely they think they or others would be to distort answers to certain questions
  • if the question called for a numerical figure, how they arrived at the number.

F.ex.:
Question: Overall, how would you rate your health: excellent, very good, good, fair, or poor?
Followup question: When you said that your health was (previous answer), what did you take into account or think about in making that rating?

In the case of pretesting the survey, a division into conventional, behavioral, and cognitive interviews is presented, and the cases in which each of them is more adequate are outlined.

Considerations about tiring the answerer still apply: a long list of similar questions is likely to induce boredom. For this reason, ISQ recommends testing "half a dozen" questions at a time.

As an aside, if you want to measure the amount of healthcare consumed in the last 6 months, you might come up with a biased estimate even if your questions aren't problematic, because the people who just died consumed a lot of healthcare but can't answer your survey.

Tactics

Be aware of the biases

Be aware of the ways a question can be biased. Don't load your questions: don't use positive or negative adjectives in them. Take into account social desirability bias: "Do you work?" has implications with regard to status.

A good example, which tries to reduce social desirability bias, is the following:

Sometimes we know that people are not able to vote, because they are not interested in the election, because they can't get off from work, because they have family pressures, or for many other reasons. Thinking about the presidential elections last November, did you actually vote in that election or not?

Additionally, black respondents were significantly more likely to report that they had voted in the last election to a black interviewer than to a white interviewer. By the way, self-administered surveys are great at avoiding interviewer bias; answerers don't feel as much of a need to impress.

There is also the aspect of managing self-images: it's not only that the respondent may want to impress, it's also that she may want to think about herself in certain ways. You don't want to have respondents feel they're put in a negative (i.e., inaccurate) light. Respondents "are concerned that they'll be misclassified, and they'll distort the answers in a way they think will provide a more accurate picture" (i.e., they'll lie through their teeth). The defense against this is to allow the respondent to give context.
For example:

  • How much did you drink last weekend?
  • Do you feel that this period is representative?
  • What is a normal amount to drink in your social context?

So, get into their head and manage the way they perceive the questions. Minimize the sense that certain answers will be negatively valued. "Permit respondents to present themselves in a positive way at the same time they provide the information needed".

In questions for which biases are likely to pop up, consider explicitly explaining to respondents that giving accurate answers is the most important thing they can do. Have respondents make a commitment to give accurate answers at the beginning; it can't hurt.

This, together with legitimacy signaling, has been tested, and it reduces the number of books which well-educated people report reading.

Don't confuse question objective with question.

In the previous example, the question objective could be finding out what proportion of the population votes. Simply putting the objective in question form (f.ex., "Did you vote in the last presidential election?") is not enough.

The soundest advice any person beginning to design a survey instrument could receive is to produce a good, detailed list of question objectives and an analysis plan that outlines how the data will be used.

If a researcher cannot match a question with an objective and a role in the analysis plan, the question should not be asked, our authorities tell us.

An avalanche of advice.

The combined 464 pages contain a plethora of advice. To avoid feeling overwhelmed, this checklist, this list of principles, or this helpful summary I found on the internet might be of use.

Ask one question at a time. For example, "Compared to last year, how much are you winning at life?" is confusing, and would be less so if it were divided into "How much are you winning at life today?" and "How much were you winning at life last year?". If the question were particularly important, a short paragraph explaining what you mean by winning at life would be in order.

Such a clarifying paragraph would have to come before the question, as would other clarifications: after the respondent thinks she has finished reading a question, she will not listen to a definition provided afterwards. The same goes for technical terms.

An avalanche of advice can be gathered from our two books: Not avoiding the use of double negatives makes for confusing sentences, like this one. Avoid using different terms with the same meaning. Make your alternatives mutually exclusive and exhaustive. Don't make your questions too long: as a rule of thumb, keep each question under 20 words and 3 commas (unless you're trying to stimulate recall, or it's a question about a sensitive topic). Remember that the longer the list of questions, the lower the quality of the data. And so on.
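Some of these rules of thumb are mechanical enough to check automatically. A toy sketch (check_question, its thresholds, and the double-negative heuristic are my own invention, not from the books, and no substitute for actual pretesting):

```python
import re

def check_question(question, max_words=20, max_commas=3):
    """Flag questions that break the rules of thumb: too many words,
    too many commas, or a possible double negative."""
    issues = []
    n_words = len(question.split())
    if n_words > max_words:
        issues.append(f"too long ({n_words} words)")
    n_commas = question.count(",")
    if n_commas > max_commas:
        issues.append(f"too many commas ({n_commas})")
    # Two or more negations in one question is a red flag.
    negations = re.findall(r"\b(?:no|not|never|nobody|nothing)\b|n't",
                           question.lower())
    if len(negations) >= 2:
        issues.append("possible double negative")
    return issues

print(check_question("Compared to last year, how much are you winning at life?"))  # → []
print(check_question("Do you not think that people should never not vote?"))
# → ['possible double negative']
```

Remember the stated exceptions: questions meant to stimulate recall, or questions about sensitive topics, may legitimately run longer.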

Closing thoughts.

The rabbit hole of designing questionnaires is deep, but seems well mapped.

Because of the power of specialization, this doesn't need to become common knowledge, but I expect that a small number of people, f.ex., those who occasionally design community surveys, or those who want to estimate the impact of an activity, might benefit greatly from the pointers given here. I'd also be happy to lend a hand, if needed.

Boggling at the concept of a manual, I am grateful to have access to the wisdom of someone who has spent a lifetime studying the specific topic of interviews, and who provides a framework for me to think about them.

I appreciate that the books are doing 95% of the thinking for me; in other words, the authors have spent more than 20 times as long thinking about these things as I have. MIRI has been speaking about preparadigmatic fields, and I've noticed a notable jump between my previous diffuse intuitions and the structure which these books provide.