Designing survey forms with evaluative scales
Project management has some things in common with flying a kite. One needs to adapt quickly and well to changes in external conditions, based on observed performance. Otherwise, one runs the risk of blindly hitting the ground.
Unlike kites, though, a project or a programme requires more than open eyes. It requires sound data collection and analysis. Beneficiary or client feedback from opinion surveys can help track performance, user satisfaction and improvement needs. This is particularly so if one uses reproducible, computer-based random samples in line with professional statistical and data-science standards. This is the way to learn from project implementation based on modern scientific and computational methods.
Some time ago, I had a consulting assignment with a youth-violence reduction project in Brazil. The project needed baseline data for its logframe indicators. So, we designed a system for collecting, storing, processing and reporting data using Open Data Kit on Android devices. We also used R (a statistical-computing language) to program a reproducible sample, as well as for all data processing and analytical reporting.
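As a minimal sketch of what a reproducible sample can look like in R (not the project's actual code), the example below fixes the random seed before drawing the sample. The sampling frame, the sample size and the seed value are all invented for illustration.

```r
# Hypothetical sampling frame: 500 registered participants (invented IDs)
frame <- data.frame(id = sprintf("P%03d", 1:500))

# Fixing the seed makes the random draw repeatable
set.seed(2024)                                  # arbitrary seed, for illustration only
sampled_rows  <- sample(nrow(frame), size = 50) # 50 row numbers, without replacement
survey_sample <- frame[sampled_rows, , drop = FALSE]

# Re-running the draw with the same seed returns exactly the same rows
set.seed(2024)
same_draw <- identical(sampled_rows, sample(nrow(frame), size = 50))
same_draw
```

Anyone who runs the script with the same frame and seed gets the same list of respondents, which makes the sample auditable.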
Before collecting data, I had to train over 25 people including interviewers and partner staff, who also suggested changes to the data collection form. This post is about one of these suggestions, which was a particularly good lesson learned.
Jane Davidson’s article “Breaking out of the Likert scale trap” inspired me to propose the inclusion of direct evaluative questions instead of the traditional Likert scales. It is a very good post claiming that by using evaluative terms right in the questionnaire, participant ratings become a lot easier to interpret in terms of quality or value. I also think so.
The Likert scale using “strongly agree” to “strongly disagree” is great for assessing opinions and knowledge from respondents. However, the scale makes it difficult to draw evaluative conclusions on quality or value of a training workshop, project or programme, for example. So, the scale suggested by Davidson was as follows:
- “poor / inadequate”;
- “barely adequate”;
- “very good”
The draft data-collection form used the same category labels as those above, translated into Portuguese. During the interviewer training workshop, one of the participants spotted a potential problem that I had not noticed before. The category labels were not well balanced…
The problem was that the scale above has three positive and two negative categories or levels. Hence, the probability of a positive result tends to be higher. Such unbalanced options are a potential source of bias.
To prevent such bias, we changed the labels proposed by Davidson to:
- “very poor” or “very low”
- “poor” or “low”
- “regular” or “average”
- “good” or “high”
- “very good” or “very high”.
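The balanced labels above can be stored in R as an ordered factor, so that each label keeps its rank on the scale. This is only a sketch: the answers are invented, and only the first variant of each label is used.

```r
# Balanced five-point evaluative scale, from worst (1) to best (5)
scale_labels <- c("very poor", "poor", "regular", "good", "very good")

# Hypothetical answers as recorded on the form (invented for illustration)
answers <- c("good", "very poor", "regular", "very good", "poor", "good")

# An ordered factor keeps the ranking of the labels
ratings <- factor(answers, levels = scale_labels, ordered = TRUE)

# Numeric scores follow the position of each label on the scale
scores <- as.integer(ratings)

table(ratings)   # counts per category, in scale order
mean(scores)     # average score on the 1 to 5 scale
```

With two categories on each side of "regular", the midpoint of the numeric scale (3) matches the neutral label.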
Additional answer categories
I would recommend including the categories “Not sure, I don’t know” and “Not applicable” to allow more complete respondent feedback. The numeric scale can integrate these new categories depending on the question (e.g., answering “not applicable” or reporting not knowing the action under evaluation can also indicate the quality of its outreach and impact).
Sometimes, it can also be interesting to have the answer option “I do not want to answer” for sensitive questions about income or abuse, for example. This option, of course, should not be part of the numeric evaluative scale. Otherwise, one will mix up different types of result.
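One possible way to keep refusals out of the numeric scale in R is to let them become missing values and count them separately. The sketch below uses invented answers and assumes the five balanced labels described earlier.

```r
scale_labels <- c("very poor", "poor", "regular", "good", "very good")

# Hypothetical raw answers to a sensitive question (invented for illustration)
raw <- c("good", "I do not want to answer", "poor", "very good",
         "I do not want to answer", "regular")

# Answers that are not scale labels become NA in the ordered factor
ratings <- factor(raw, levels = scale_labels, ordered = TRUE)
scores  <- as.integer(ratings)

# Count refusals separately so they are reported but never scored
refusals <- sum(raw == "I do not want to answer")

mean(scores, na.rm = TRUE)   # average over evaluative answers only
refusals                     # number of respondents who declined
```

Reporting the number of refusals next to the average keeps both types of result visible without mixing them.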
The corresponding numeric intervals must also be balanced.
For a scale from 1 to 5 (one being the worst case, as in Davidson’s article, or the other way round, as is the case in Germany, where a score of one is the best), the intervals returned by the function “cut” in R (statistical computing language) are:
> cut(1:5, breaks = 5)
 (0.996,1.8] (1.8,2.6] (2.6,3.4] (3.4,4.2] (4.2,5]
This would be equivalent to:
- from 1 to 1.80: “very poor/very low”
- from 1.81 to 2.60: “poor/low”
- from 2.61 to 3.40: “regular”
- from 3.41 to 4.20: “good”
- from 4.21 to 5.00: “very good”
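The intervals above can also be used to translate average scores into evaluative labels. The sketch below does this with explicit equal-width breaks; the question names and mean scores are invented for illustration.

```r
scale_labels <- c("very poor/very low", "poor/low", "regular", "good", "very good")

# Hypothetical mean scores for five survey questions (invented values)
mean_scores <- c(q1 = 1.5, q2 = 2.2, q3 = 3.0, q4 = 3.41, q5 = 4.8)

# Equal-width breaks at 1, 1.8, 2.6, 3.4, 4.2 and 5;
# include.lowest = TRUE keeps a score of exactly 1 in the first interval
verdict <- cut(mean_scores,
               breaks = seq(1, 5, length.out = 6),
               labels = scale_labels,
               include.lowest = TRUE)

data.frame(question = names(mean_scores),
           score    = unname(mean_scores),
           verdict  = verdict)
```

Because every interval has the same width (0.8), no label is favoured when averages are converted into words.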
The same can be done for a scale based on the interval from 1 to 7 if one includes the categories “Not sure, I don’t know” and “Not applicable”. The R output from the cut function for a scale with seven categories is as follows:
> cut(1:7, breaks = 7)
 (0.994,1.86] (1.86,2.71] (2.71,3.57] (3.57,4.43] (4.43,5.29]
 (5.29,6.14] (6.14,7.01]
Preventing response bias
To further prevent bias, the survey introduction can try to make participants aware of the risk of providing biased answers. An introduction along the lines of the paragraph below can help:
Respondents in such questionnaires sometimes repeat the same answer for different questions, mark extreme answers to be polite or as a way of drawing attention to a specific aspect, or even rate items in the middle categories to stay neutral when they actually think something else. Please avoid this as much as you can, as it prevents us from understanding the real situation.
If you are asking for real feedback from clients, beneficiaries and stakeholders, interviewers must be external to your project team. Ideally, they should be outsourced, trained in interviewing methods, and neither associated with the implementing organisations nor related to their staff members. This helps prevent interviewer bias (when results differ depending on who collects the data). This can be the case, for example, when humanitarian-aid beneficiaries have suggestions for improving support but fear losing future support after providing critical feedback.
I benefited from Davidson’s contribution and I thought it would be good to try to contribute as well. Monitoring and evaluation with robust scientific standards can be powerful for learning and improving policies, programmes, projects and products.
The evaluative scales can be very helpful, but this does not mean that Likert scales should be avoided at all costs. I also use Likert scales in my forms, particularly in those aiming to test participants’ subject knowledge in capacity-development actions, such as projects that include training workshops or a course module.
Also, it is worth including an open question about problems (e.g., What are the three main problems in your village?) as well as an open question about suggestions for improvement or additional comments. Text data can be analysed with word clouds and dendrograms, for example. This can complement scoring data well in monitoring and evaluation. It also lets projects and programmes spot new opportunities while making sure that they are addressing the issues that their beneficiaries or clients consider most important.
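A word cloud is essentially a picture of word frequencies, so a first step with open answers can be a simple frequency table. The sketch below uses base R only; the answers and the short stop-word list are invented for illustration, and real text (especially in Portuguese, with accented letters) would need a more careful clean-up.

```r
# Hypothetical open answers (invented for illustration)
answers <- c("No clean water and bad roads",
             "Roads are bad, no school nearby",
             "Water shortage; school too far")

# Lower-case the text and split it into words on non-letter characters
words <- unlist(strsplit(tolower(answers), "[^a-z]+"))

# Drop empty strings and a small, hand-picked list of stop words
stop_words <- c("and", "are", "no", "too", "the")
words <- words[nzchar(words) & !(words %in% stop_words)]

# Word frequencies, most frequent first
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```

The most frequent words give a quick first indication of the problems respondents mention most often.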
I hope you enjoyed this post. I would be happy to receive any suggestions or comments.
Good monitoring and evaluation!