Project management has some things in common with flying a kite. One needs to adapt well and quickly to changes in external conditions, based on observed performance. Otherwise, one runs the risk of blindly hitting the ground.
Some time ago, I had a consulting assignment with a youth-violence reduction project in Brazil. The project needed baseline data for its logframe indicators. So, we designed a system for collecting, storing, processing and reporting data using Open Data Kit on Android devices. We also used R (a statistical-computing language) to program a reproducible sample, as well as for all data processing and analytical reporting.
Before collecting data, I had to train over 25 people including interviewers and partner staff, who also suggested changes to the data collection form. This post is about one of these suggestions, which was a particularly good lesson learned.
Jane Davidson’s article “Breaking out of the Likert scale trap” inspired me to propose the inclusion of direct evaluative questions instead of the traditional Likert scales. It is a very good post claiming that by using evaluative terms right in the questionnaire, participant ratings become a lot easier to interpret in terms of quality or value. I also think so.
The Likert scale using “strongly agree” to “strongly disagree” is great for assessing opinions and knowledge from respondents. However, the scale makes it difficult to draw evaluative conclusions on quality or value of a training workshop, project or programme, for example. So, the scale suggested by Davidson was as follows:
“poor / inadequate”;
The draft data-collection form used the same label categories as those above, translated into Portuguese. During the interviewer training workshop, one of the participants spotted a potential problem that I had not noticed before. The label categories were not well balanced…
The problem was that the scale above has three positive and two negative categories (levels). Hence, a positive result tends to be more likely. Such unbalanced options are a potential source of bias.
To prevent such bias, we changed the labels proposed by Davidson to:
“very poor” or “very low”
“poor” or “low”
“regular” or “average”
“good” or “high”
“very good” or “very high”.
Additional answer categories
I would recommend including the categories “Not sure, I don’t know” and “Not applicable” to allow more complete respondent feedback. The numeric scale can integrate these new categories depending on the question (e.g., answering “not applicable” or reporting not knowing the action under evaluation can also indicate the quality of its outreach and impact).
Sometimes, it can also be interesting to have the answer option “I do not want to answer” for sensitive questions about income or abuse, for example. This option, of course, should not be part of the numeric evaluative scale. Otherwise, one will mix up different types of result.
The corresponding numeric intervals must also be balanced.
For a scale from 1 to 5 (one being the worst case, as in Davidson’s article, or the other way round, as is the case in Germany, where a score of one is the best), the interval from the function “cut” in R (statistical computing language) is:
The same can be done for a scale based on the interval from 1 to 7 if one includes the categories “Not sure, I don’t know” and “Not applicable”. The R output from the cut function for a scale with seven categories is as follows:
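As a rough sketch of how balanced intervals could be produced with `cut` (this is not the original code of this post, and the mean scores are invented for illustration), equally spaced break points can be passed explicitly:

```r
# Five balanced labels from the list above, ordered from worst to best
labels5 <- c("very poor", "poor", "regular", "good", "very good")

# Six equally spaced break points between 1 and 5 give five intervals
# of the same width (0.8 each)
breaks5 <- seq(1, 5, length.out = length(labels5) + 1)

# Hypothetical mean scores, for illustration only
mean_scores <- c(1.0, 2.2, 3.0, 3.9, 5.0)

# include.lowest = TRUE makes the first interval closed, so that a score
# of exactly 1 is not dropped
rated <- cut(mean_scores, breaks = breaks5, labels = labels5,
             include.lowest = TRUE)
print(data.frame(mean_scores, rated))

# The same idea for a 1-to-7 scale: eight break points, seven equal intervals
breaks7 <- seq(1, 7, length.out = 8)
print(breaks7)
```

Passing the break points explicitly, rather than only the number of intervals, keeps the outer limits at exactly 1 and 5 (or 1 and 7).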
To further prevent bias, the survey introduction can try to make participants aware of the risk of providing biased answers. An introduction along the lines of the paragraph below can help:
Respondents in such questionnaires sometimes repeat the same answers for different questions, mark extreme answers to be polite or to call attention to a specific aspect, or even rate items in the middle categories to stay neutral when they are actually thinking something else. Please avoid this as much as you can, as it prevents us from understanding the real situation.
If you are asking for real feedback from clients/beneficiaries and stakeholders, interviewers must be external to your project team. Ideally, they should be outsourced, trained in interviewing methods, and neither associated with the implementing organisations nor related to their staff members. This helps prevent interviewer bias (when results differ depending on who collects the data). This can be the case, for example, when humanitarian-aid beneficiaries have suggestions for improving support but fear losing future support after providing critical feedback.
I benefited from Davidson’s contribution and I thought it would be good to try to contribute as well. Monitoring and evaluation with robust scientific standards can be powerful for learning and for improving policies, programmes, projects and products.
Evaluative scales can be very helpful, but this does not mean that Likert scales should be avoided at all costs. I also use Likert scales in my forms, particularly in those that test participants’ subject knowledge in capacity development actions, such as projects including training workshops or a course module.
Also, it is worth including an open question about problems (e.g., What are the three main problems in your village?) as well as an open question about suggestions for improvement or additional comments. Text data can be analysed with word clouds and dendrograms, for example. This can complement scoring data well in monitoring and evaluation. It is also an opportunity for projects and programmes to track opportunities while making sure that they are addressing the issues that their beneficiaries or clients consider most important.
I hope you enjoyed this post and would be happy to receive any suggestion or comment.
In this post I try to explore this and some other questions using the open-source statistical computing language R and public recruitment data from CharityJob’s website. According to CharityJob, the site is the United Kingdom’s busiest one for charity, fundraising, NGO and not-for-profit jobs.
In addition to presenting these powerful open-source tools and data-exploring techniques, I hope that this post can help the public, especially applicants and workers, to get an update on salaries and trends in the sector. The jobs analysed here are mostly UK-based and published by UK-based organisations. Therefore, the results below are not meant to represent the entire sector worldwide. I still hope, though, that this post can make a positive contribution to the evolution of the sector in both the southern and the northern hemispheres.
For those of you who are only interested in the end analysis, please jump to the results section. However, I encourage you to explore how these tools work. I believe that they can help speed up and improve the quality of the much-needed charity, social-enterprise, development-aid and humanitarian work globally.
I used here some basic techniques of web scraping (also called web harvesting or web data extraction), which is a software technique for extracting information from websites. The source code in RMarkdown is available for download and use under the GNU General Public License at this link: Rmarkdown code. Everything was prepared with the open-source, freely accessible and powerful statistical computing language “R” (R version 3.2.0) and the development interface RStudio (Version 0.99.441).
This post is based on public data. The post is my sole responsibility and can in no way be taken to reflect the views of CharityJob’s staff.
Downloading data from CharityJob
Using RStudio, the first step is to download the website data. CharityJob’s search engine contains over 140 webpages, each of them with a list of 18 jobs in most cases. Hence I expected to get information on around 2,500 job announcements. For that, the first step was to download the data and get rid of what I did not want (e.g. CSS and HTML code). The code chunk below describes how I did it. The code contains explanatory comments indicated by hash signs (‘#’). I am sure that many would be able to write this code in a much more elegant and efficient way. I would be very thankful to receive your comments and suggestions!
# Loading the necessary packages. It assumes that they are installed.
# Please type '?install.packages' on the R console for additional information.
suppressWarnings(suppressPackageStartupMessages(require(rvest))) # Credits to Hadley Wickham (2016)
suppressPackageStartupMessages(require(stringr)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(lubridate)) # Credits to Garrett Grolemund, Hadley Wickham (2011)
suppressPackageStartupMessages(require(dplyr)) # Credits to Hadley Wickham and Romain Francois (2015)
suppressPackageStartupMessages(require(xml2)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(pander)) # Credits to Gergely Daróczi and Roman Tsegelskyi (2015)
suppressPackageStartupMessages(require(ggplot2)) # Credits to Hadley Wickham (2009)
## Downloading website information into a list called `charityjobs`
## (`urls` is a character vector with the address of each search-results page)
charityjobs <- lapply(urls, read_html)
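The definition of `urls` is not shown above. A minimal sketch of how such a vector might be built is given below; the base address, the `page` query parameter and the helper `read_page` are hypothetical placeholders, not the real structure of the website or the code used in this post.

```r
# Hypothetical address and query parameter: replace them with the real ones
base_url <- "https://www.example.org/jobs"
n_pages  <- 140  # "over 140 webpages" according to the text

# One address per results page, e.g. ".../jobs?page=1"
urls <- paste0(base_url, "?page=", seq_len(n_pages))

# Hypothetical helper: download one page and return NULL instead of
# stopping the whole loop if a single request fails
read_page <- function(url, pause = 1) {
  Sys.sleep(pause)  # short pause between requests to go easy on the server
  tryCatch(xml2::read_html(url), error = function(e) NULL)
}

# Not run here because it needs an internet connection:
# charityjobs <- lapply(urls, read_page)
```

Returning `NULL` for failed pages means the list keeps one element per address, so failed downloads can be spotted and retried later.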
Tidying up and parsing data
The next step is to parse or clean up the text string of each of the roughly 140 webpages. I decided to build a custom function for that, which I could use to loop through the content of each element of the charityjobs list. The function should also save the parsed data into a data frame. This data frame should include information on recruiters, position titles, salary ranges and deadline data. The code chunk below presents this function, which I called salarydata.
## Creating a function for parsing data which uses the read_html output (list ‘charityjobs’)
# Dropping rows whose salary text refers to hourly, daily or weekly rates,
# or to amounts given as "plus" something
sal <- sal %>% filter(!grepl("hours", salary_range))
sal <- sal %>% filter(!grepl("hour", salary_range))
sal <- sal %>% filter(!grepl("p/h", salary_range))
sal <- sal %>% filter(!grepl("week", salary_range))
sal <- sal %>% filter(!grepl("ph", salary_range))
sal <- sal %>% filter(!grepl("day", salary_range))
sal <- sal %>% filter(!grepl("daily", salary_range))
sal <- sal %>% filter(!grepl("plus", salary_range))
sal <- sal %>% filter(!grepl("\\+", salary_range))
salaries <- rbind(salaries, sal)
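The row filters in this function can be tried out on their own. The sketch below uses base R only and an invented data frame standing in for one parsed page; it combines the same search terms into a single regular expression, which is an alternative way of writing the filter, not the original code.

```r
# Toy data frame standing in for one parsed page (all values are invented)
sal <- data.frame(
  position     = c("Officer", "Consultant", "Tutor", "Manager"),
  salary_range = c("25,000 - 30,000", "250 per day", "12 p/h",
                   "40k plus benefits"),
  stringsAsFactors = FALSE
)

# Terms that signal a non-annual rate or an open-ended amount,
# joined with "|" so that any one of them is enough to match
not_annual <- c("hour", "p/h", "ph", "week", "day", "daily", "plus", "\\+")
pattern    <- paste(not_annual, collapse = "|")

# Keep only the rows whose salary text matches none of the terms
sal_annual <- sal[!grepl(pattern, sal$salary_range), ]
print(sal_annual)
```

A single pattern scans the column once instead of nine times, and the list of terms is easier to extend.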
Creating full dataframe and other adjustments
The last step before exploring the data was to run the function salarydata to create the full dataframe. After that, I parsed lower and upper salaries into separate columns and deleted data which may have been incorrectly parsed or which concerned daily-rate and hourly-rate jobs or consulting assignments. Only yearly salaries between GBP 4,000 and GBP 150,000 have been considered. All salary data are in British pounds (GBP) and refer to annual salaries, which sometimes do not include benefits such as pension.
Cleaning the salary-range variable was a tricky step, as the website allows users to type in both salary amounts and additional text (e.g. 30,000, 30K, or 25-30k). Therefore, I had to iterate several times until the output was good enough. I am quite sure that the code chunk below can be written in a more elegant way. Again, please let me know in case you have any suggestions here.
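To give an idea of what such cleaning can involve, here is a minimal sketch in base R. The function `parse_salary` is hypothetical and is not the code used for this post; it only handles the three formats mentioned above, and it assumes that a trailing "k" in a range such as 25-30k applies to both numbers.

```r
parse_salary <- function(x) {
  # Drop thousands separators and lower-case, so "30K" and "30,000" look alike
  clean <- gsub(",", "", tolower(x), fixed = TRUE)
  # Extract every number, together with an optional trailing "k"
  tokens <- regmatches(clean, gregexpr("[0-9]+(\\.[0-9]+)?k?", clean))[[1]]
  out <- c(lower = NA_real_, upper = NA_real_)
  if (length(tokens) == 0) return(out)

  has_k  <- grepl("k$", tokens)
  values <- as.numeric(sub("k$", "", tokens))
  # Assumption: in "25-30k" the "k" applies to both numbers
  if (any(has_k)) {
    in_thousands <- has_k | values < 1000
    values[in_thousands] <- values[in_thousands] * 1000
  }
  # Keep only plausible yearly salaries (GBP 4,000 to 150,000, as in the text)
  values <- values[values >= 4000 & values <= 150000]
  if (length(values) == 0) return(out)

  # A single amount goes to 'upper'; 'lower' stays NA unless a range is given
  out["upper"] <- max(values)
  if (length(values) > 1) out["lower"] <- min(values)
  out
}

examples <- c("30,000", "30K", "25-30k")
parsed <- t(vapply(examples, parse_salary, FUN.VALUE = c(lower = 0, upper = 0)))
print(parsed)
```

Real announcements contain many more variations, so a function like this would still need several rounds of checking against the raw text.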
The output below presents the summary of the full dataframe (first 10 observations).
## Source: local data frame [1,704 x 6]
## deadline recruiter
## (time) (chr)
## 1 2016-09-04 ZSL
## 2 2016-09-12 Alliance Publishing Trust
## 3 2016-08-31 Save the Children
## 4 2016-08-30 Blind Veterans UK
## 5 2016-09-08 Headway SELNWK
## 6 2016-08-30 Saferworld
## 7 2016-09-22 Pro-Finance
## 8 2016-09-06 TPP Recruitment
## 9 2016-09-06 Harris Hill
## 10 2016-09-20 Hays London Ebury Gate
## .. … …
## Variables not shown: position (chr), lower_salary (dbl), upper_salary
## (dbl), salary_range (chr)
The final dataset contains information on 1,704 jobs of various types, based on yearly-salary figures. It excludes consultancy assignments and other jobs based on hourly and daily rates, as well as jobs which did not provide salary information.
The table below presents standard descriptive statistics for lower and upper salaries. For job announcements providing a single value (not a salary range), that single amount has been assigned to the dataset variable upper_salary, while the variable lower_salary was set to NA (not available). That is why the number of observations (N) is 785 for lower salaries and 1,704 for upper salaries. About 54% of the job announcements did not provide a salary range but just a single salary amount.
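The effect of the NA values on N can be illustrated with a small sketch. The data frame and the helper `describe` below are invented for illustration; only the figures 785 and 1,704 come from the text.

```r
# Invented salaries: NA in lower_salary marks announcements that gave
# a single amount instead of a range
jobs <- data.frame(
  lower_salary = c(25000, NA, 30000, NA),
  upper_salary = c(30000, 28000, 35000, 22000)
)

# Hypothetical helper: descriptive statistics that ignore missing values
describe <- function(x) {
  c(N      = sum(!is.na(x)),
    Mean   = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Min    = min(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE))
}

salary_summary <- rbind(lower_salary = describe(jobs$lower_salary),
                        upper_salary = describe(jobs$upper_salary))
print(salary_summary)

# Share of announcements with a single amount rather than a range
share_single <- mean(is.na(jobs$lower_salary))

# The same share computed from the counts reported in the text
share_reported <- round(100 * (1 - 785 / 1704))  # 54
```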
Summary statistics of salaries (in British pounds / GBP)
In a more in-depth analysis for some future post, it could be interesting to look into payments for jobs paid by the hour and by the day, as well as into more specific job categories. One way of approaching specific job categories is to define tags for job titles using standard words from the titles (e.g., director, management, assistant) and to group them by tag type in a new factor variable.
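A minimal sketch of this tagging idea in base R is shown below. The job titles are invented, the helper `tag_title` is hypothetical, and giving each title only the first matching tag is an assumption made for simplicity.

```r
# Invented job titles, for illustration only
titles <- c("Finance Director", "Project Management Officer",
            "Fundraising Assistant", "Volunteer Coordinator")

# Tag words taken from the examples in the text
tags <- c("director", "management", "assistant")

# Return the first tag found in a title, or "other" if none matches
tag_title <- function(title, tags) {
  found <- vapply(tags, grepl, logical(1), x = tolower(title), fixed = TRUE)
  if (any(found)) tags[found][1] else "other"
}

# New factor variable with one tag per job title
job_tag <- factor(
  vapply(titles, tag_title, character(1), tags = tags, USE.NAMES = FALSE),
  levels = c(tags, "other")
)
print(table(job_tag))
```

With such a factor in the dataframe, salary statistics could then be computed per tag, for example with `tapply`.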
Histogram with distribution of lower salaries (GBP)
Histogram with distribution of upper salaries (GBP)
The 10 most frequent recruiters
The table below presents the ranking of the 10 most frequent recruiters in the dataset. Column “N” presents the number of total announcements for each recruiter while column “Freq” shows the percentage of total announcements for each recruiter. Among these are also recruitment agencies.
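A ranking of this kind can be sketched in base R as follows. The recruiter names appear in the dataframe output above, but the counts here are invented and much smaller than in the real dataset.

```r
# Invented sample of the recruiter column
recruiter <- c("Harris Hill", "TPP Recruitment", "Harris Hill",
               "Saferworld", "Harris Hill", "TPP Recruitment")

# Count announcements per recruiter, most frequent first
counts <- sort(table(recruiter), decreasing = TRUE)

ranking <- data.frame(
  recruiter = names(counts),
  N         = as.integer(counts),
  # Percentage of all announcements
  Freq      = round(100 * as.integer(counts) / length(recruiter), 1),
  stringsAsFactors = FALSE
)

# Keep at most the 10 most frequent recruiters
top10 <- head(ranking, 10)
print(top10)
```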
The tables below show the ranking of the jobs with the 10 lowest and 10 highest upper salaries.
The jobs with the 10 lowest upper salaries (GBP)
The jobs with the 10 highest upper salaries (GBP)
I also wanted to quickly explore possible relationships between deadline dates and salary levels, just for fun. It could be, for example, that some periods had lower average-salary offers than others.
Despite the large number of job announcements in the dataset (N=1704), all observations refer to jobs with application deadlines between 24 August 2016 and 23 September 2016. This is a short time span for such analysis, but I explored it anyway just as an example of what these tools and techniques can do.
The plot below shows the mean (or average) upper salary for each day throughout the period. Both the variation in the mean salary and the salary levels seem higher for jobs with deadlines from September onwards. The dashed line represents the result of a linear regression. The linear model, however, fails to detect any statistically significant relationship between mean salary and application deadline (R² = 0.006; p = 0.69).
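For readers who want to try this kind of check, the sketch below computes daily means and fits a linear model in base R. The six observations are invented, so its results have nothing to do with the figures reported above.

```r
# Invented deadlines and upper salaries, for illustration only
jobs <- data.frame(
  deadline     = as.Date(c("2016-08-24", "2016-08-24", "2016-08-31",
                           "2016-09-06", "2016-09-06", "2016-09-23")),
  upper_salary = c(28000, 32000, 35000, 26000, 40000, 31000)
)

# Mean upper salary for each deadline date
daily_mean <- tapply(jobs$upper_salary, jobs$deadline, mean)
daily <- data.frame(
  deadline     = as.Date(names(daily_mean)),
  upper_salary = as.numeric(daily_mean)
)

# Linear regression of the daily mean on the date (as number of days)
fit <- lm(upper_salary ~ as.numeric(deadline), data = daily)

# R squared and p-value of the slope
r_squared <- summary(fit)$r.squared
p_value   <- summary(fit)$coefficients[2, "Pr(>|t|)"]
print(c(R2 = r_squared, p = p_value))
```

A p-value well above 0.05, as reported for the real data, means that the slope cannot be distinguished from zero.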
Average upper salary by date (GBP)
Next, I will use word clouds to explore job titles. The larger the word in the cloud, the higher its frequency in the dataset. The words below are only those mentioned in at least 10 job announcements. The plot indicates that management positions are the most frequent ones, followed by coordination jobs, as well as officer, recruitment and fundraising jobs.
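The word frequencies behind such a cloud can be computed in base R, as sketched below with invented job titles and a lower threshold than the 10 announcements used here. The commented-out call assumes that the `wordcloud` package is installed; this post does not state which package was actually used for the plots.

```r
# Invented job titles, for illustration only
titles <- c("Project Manager", "Fundraising Manager", "Finance Officer",
            "Project Coordinator", "Senior Project Manager")

# Split titles into lower-case words and drop empty strings
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Frequency of each word, most frequent first
freq <- sort(table(words), decreasing = TRUE)

# Keep only words above a minimum count (the post uses 10)
min_count <- 2
frequent  <- freq[freq >= min_count]
print(frequent)

# Possible plotting step, assuming the 'wordcloud' package is available:
# wordcloud::wordcloud(names(frequent), as.integer(frequent),
#                      min.freq = min_count)
```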
Word cloud of job titles
The cloud below shows the most frequent words in the names of the recruiting institutions. I assumed that its results could provide hints about the most active thematic areas in terms of job announcements. The words in the plot below are also those which have been mentioned in at least 10 job announcements. The word cloud suggests that recruitment agencies are among the leading ones, as expected (see section “The 10 most frequent recruiters”). Organisations working with children, cancer and Alzheimer’s patients also seem to stand out.
Word cloud of recruiters
The charity, development aid, not-for-profit and social enterprise sector is evolving rapidly. This process is powered both by increasingly critical global challenges and, of course, by capable and motivated entrepreneurs, staff and service suppliers. The sector is sometimes overly romanticised. As a consultant and entrepreneur in the sector, I am often asked how I manage to deal with all the daydreamers I come across. No judgment about that, but this indicates how much the sector is still unknown to the public. This is a sector which has become increasingly professional and results-oriented. I believe that computing for data analysis can help the sector, particularly concerning monitoring and evaluating performance, which should include staff and beneficiary / client satisfaction.
I hope you enjoyed this tour and would be happy to receive your suggestions for additional analysis and improvement. You can access this post with more up-to-date data at: https://rpubs.com/EduardoWF/charityjobs.
Keep coding and take care!
Written by: Eduardo W. Ferreira, PhD / Consultant, data scientist, trainer and facilitator. Eduardo supports designing, managing and evaluating projects and programmes for consultancy firms, non-governmental organisations, governments, research institutions and international organisations.