How much are charity, fundraising, NGO and non-profit organisations currently paying their new staff? Web scraping CharityJobs
In this post I explore this and some related questions using the open-source statistical computing language R and public recruitment data from CharityJob’s website. According to CharityJob, the site is the United Kingdom’s busiest for charity, fundraising, NGO and not-for-profit jobs.
In addition to presenting these powerful open-source tools and data-exploration techniques, I hope that this post can help the public, especially applicants and workers, to get an update on salaries and trends in the sector. The jobs analysed here are mostly UK-based and published by UK-based organisations. Therefore, the results below are not meant to represent the entire sector worldwide. I still hope, though, that this post can provide some positive contribution to the evolution of the sector in both the southern and the northern hemispheres.
For those of you who are only interested in the end analysis, please jump to the results section. However, I encourage you to explore how these tools work. I believe they can help speed up and improve the quality of the much-needed charity, social-enterprise, development-aid and humanitarian work being done globally.
I used here some basic techniques of web scraping (also known as web harvesting or web data extraction), a software technique for extracting information from websites. The source code in RMarkdown is available for download and use under the GNU General Public License at this link: Rmarkdown code. Everything was prepared with the open-source, freely accessible and powerful statistical computing language “R” (version 3.2.0) and the development interface RStudio (version 0.99.441).
This post is based on public data. The post is my sole responsibility and can in no way be taken to reflect the views of CharityJobs’ staff.
Downloading data from CharityJobs
Using RStudio, the first step is to download the website data. CharityJobs’ search engine contains over 140 webpages, most of them listing 18 jobs. Hence I expected to get information about around 2,500 job announcements. The first step was to download the data and get rid of what I did not want (e.g. CSS and HTML code). The code chunk below describes how I did it. The code contains explanatory comments indicated by hashtags (‘#’). I am sure that many would be able to write this code in a much more elegant and efficient way. I would be very thankful to receive your comments and suggestions!
# Loading the necessary packages. This assumes that they are already installed.
# Please type '?install.packages()' in the R console for additional information.
suppressWarnings(suppressPackageStartupMessages(require(rvest))) # Credits to Hadley Wickham (2016)
suppressPackageStartupMessages(require(stringr)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(lubridate)) # Credits to Garrett Grolemund, Hadley Wickham (2011)
suppressPackageStartupMessages(require(dplyr)) # Credits to Hadley Wickham and Romain Francois (2015)
suppressPackageStartupMessages(require(xml2)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(pander)) # Credits to Gergely Daróczi and Roman Tsegelskyi (2015)
suppressPackageStartupMessages(require(ggplot2)) # Credits to Hadley Wickham (2009)
## Creating the list of URLs (webpages)
urls <- paste("https://www.charityjob.co.uk/jobs?page=", 1:140, sep = "")
## Downloading website information into a list called `charityjobs`
charityjobs <- lapply(urls, read_html)
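A quick sanity check that the download worked as expected (an illustrative check, not part of the original post):

# One parsed HTML document per webpage is expected
length(charityjobs) # should be 140
class(charityjobs[[1]]) # "xml_document" "xml_node"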
Tidying up and parsing data
The next step is to parse, or clean up, the text string of each of the roughly 140 webpages. I decided to build a custom function for this, which I could use to loop through the content of each element of the charityjobs list. The function should also save the parsed data into a data frame, including information on recruiters, position titles, salary ranges and application deadlines. The code chunk below presents this function, which I called salarydata.
## Creating a function for parsing data which uses the read_html output (list 'charityjobs')
salarydata <- function(list) {
  # Creating auxiliary variables and an empty output data frame
  list_size <- length(list)
  salaries <- data.frame(deadline = character(),
                         recruiter = character(),
                         position = character(),
                         salary_range = character(),
                         stringsAsFactors = FALSE)
  for (i in seq_len(list_size)) {
    size <- list[[i]] %>% html_nodes(".salary") %>% html_text() %>% length()
    # Intermediary data frame
    sal <- data.frame(deadline = rep(NA, size),
                      recruiter = rep(NA, size),
                      position = rep(NA, size),
                      salary_range = rep(NA, size),
                      stringsAsFactors = FALSE)
    ## Filling out intermediary data for application deadlines
    sal$deadline[1:size] <- list[[i]] %>%
      html_nodes(".closing:nth-child(4) span") %>% html_text() %>%
      .[!grepl("^Closing:", .)] # keeping the date text, not the "Closing:" label
    ## Filling out intermediary data for recruiters
    sal$recruiter[1:size] <- list[[i]] %>%
      html_nodes(".recruiter") %>% html_text() %>%
      gsub("\r\n\\s+", "", .) %>%
      gsub("\r\n", " ", .) %>%
      gsub("^\\s+|\\s+$", "", .)
    ## Filling out intermediary data for positions
    sal$position[1:size] <- list[[i]] %>%
      html_nodes(".title") %>% html_text() %>%
      gsub("\r\n\\s+", "", .) %>%
      gsub("\r\n", " ", .) %>%
      gsub("^\\s+|\\s+$", "", .)
    ## Filling out intermediary data for salary ranges
    sal$salary_range[1:size] <- list[[i]] %>%
      html_nodes(".salary") %>% html_text() %>%
      gsub("\\.([[:digit:]])[kK]", "\\100", .) %>% # e.g. "25.5k" -> "25500"
      gsub("([[:digit:]])[kK]", "\\1000", .) %>%   # substituting remaining ks, e.g. "25k" -> "25000"
      gsub("(£..)\\.", "\\1", .) %>%               # dropping stray separators after two-digit figures
      gsub("^(£..)-", "\\1000; ", .) %>%           # adding thousands for figures without "k"
      gsub("\\s*-\\s*£", "; ", .) %>%              # turning "- £" into the ";" separator
      gsub("£", "", .) %>%                         # removing remaining pound signs
      gsub("-|–", ";", .)                          # removing remaining dashes
    ## Excluding per-hour, per-day and open-ended ("plus") salaries
    sal <- sal %>%
      filter(!grepl("hour|p/h|ph|week|day|daily|plus|\\+", salary_range))
    salaries <- rbind(salaries, sal)
  }
  return(salaries)
}
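Before looping over all pages, the function can be tried on a small subset (an illustrative check, not part of the original post):

# Testing the parser on the first two pages only
test_sample <- salarydata(charityjobs[1:2])
str(test_sample) # should show the deadline, recruiter, position and salary_range columns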
Creating the full data frame and other adjustments
The last step before exploring the data was to run the function salarydata to create the full data frame. After that, I parsed lower and upper salaries into separate columns and deleted data that may have been incorrectly parsed, as well as data concerning daily-rate and hourly-rate jobs and consulting assignments. Only yearly salaries between GBP 4,000 and GBP 150,000 were considered. All salary data are in British pounds (GBP) and refer to annual salaries, which sometimes do not include benefits such as pension.
Cleaning the salary-range variable was a tricky step, as the website allows users to type in both salary amounts and additional text (e.g. 30,000, 30K, or 25-30k). I therefore had to iterate several times until the output was good enough. I am quite sure that the code chunk below can be written in a more elegant way. Again, please let me know if you have any suggestions.
# Creating a full and clean data frame
salaries <- salarydata(charityjobs)
# Parsing the salary-range variable
salaries$salary_range <- gsub(", ", ",", salaries$salary_range) %>%
  gsub(" ; ", ";", .) %>% gsub("; ", ";", .) %>%
  gsub(",[A-Za-z]", " ", .) %>%
  gsub("\\(", "", .) %>% gsub("\\)", "", .) %>% # deleting "(" and ")"
  gsub("\\:", "", .) %>%
  gsub("[A-Za-z],[A-Za-z]", " ", .) %>%
  gsub("(..),00\\...", "\\1,000", .) %>%
  gsub("(..),0\\...", "\\1,000", .) %>%
  gsub("[A-Za-z]", "", .) %>% gsub(",", "", .) %>%
  gsub("\\.", "", .) %>% gsub("^\\s+", "", .) %>%
  gsub("\\s([[:digit:]])", ";\\1", .) %>%
  gsub("\\s+", "", .) %>% gsub("^[[:digit:]];", "", .) %>%
  gsub("\\/", "", .) %>% gsub("000000", "0000", .) %>% # deleting "/" and correcting digits
  gsub("([[:digit:]]{2})00000", "\\1000", .) %>% # correcting number of digits
  gsub("([[:digit:]]{5})00", "\\1", .) # correcting number of digits
# Adjusting data and computing lower and upper salaries using ";" as separator
salaries <- suppressWarnings(salaries %>%
  mutate(upper_salary = as.numeric(gsub("^.*;", "", salary_range))) %>%
  mutate(lower_salary = as.numeric(gsub(";.*", "", salary_range))) %>%
  filter(upper_salary < 150000, upper_salary > 4000) %>%
  filter(lower_salary < 150000, lower_salary > 4000) %>%
  mutate(lower_salary = ifelse(lower_salary >= upper_salary, NA, lower_salary)) %>%
  filter(!is.na(upper_salary)) %>% tbl_df() %>%
  select(deadline, recruiter, position,
         lower_salary, upper_salary, salary_range) %>%
  mutate(deadline = dmy(deadline)))
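As a quick sanity check, the same split logic can be applied to a few hand-made strings (hypothetical examples, for illustration only):

# Splitting cleaned salary strings on ";" into lower and upper figures
x <- c("25000;30000", "30000", "18500;21000")
data.frame(lower = as.numeric(gsub(";.*", "", x)),
           upper = as.numeric(gsub("^.*;", "", x)))
# Single figures yield lower == upper; the pipeline above then sets lower to NA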
The output below presents the first 10 observations of the full data frame.
## Source: local data frame [1,704 x 6]
##
## deadline recruiter
## (time) (chr)
## 1 2016-09-04 ZSL
## 2 2016-09-12 Alliance Publishing Trust
## 3 2016-08-31 Save the Children
## 4 2016-08-30 Blind Veterans UK
## 5 2016-09-08 Headway SELNWK
## 6 2016-08-30 Saferworld
## 7 2016-09-22 Pro-Finance
## 8 2016-09-06 TPP Recruitment
## 9 2016-09-06 Harris Hill
## 10 2016-09-20 Hays London Ebury Gate
## .. … …
## Variables not shown: position (chr), lower_salary (dbl), upper_salary
## (dbl), salary_range (chr)
Results
The final dataset contains information on 1,704 jobs of various types, based on yearly-salary figures. It excludes consultancy assignments and other jobs paid at hourly or daily rates, as well as jobs that did not provide salary information.
The table below presents standard descriptive statistics for the lower and upper salaries. For job announcements providing a single value (not a salary range), that amount was stored in the variable upper_salary, while lower_salary was set to NA (not available). That is why the number of observations (N) is 785 for lower salaries and 1,704 for upper salaries: about 54% of the job announcements provided only a single salary amount rather than a range.
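This share can be verified directly from the data frame (a minimal check on the salaries object built above):

# Share of announcements with a single salary figure (lower_salary set to NA)
mean(is.na(salaries$lower_salary)) # roughly 0.54 in this run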
Summary statistics of salaries (in British pounds / GBP)

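For reference, a summary table along these lines can be produced with pander, which was loaded earlier (a sketch; the published table’s exact layout may differ):

# Descriptive statistics for lower and upper salaries
pander(summary(salaries[, c("lower_salary", "upper_salary")]))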
In a more in-depth analysis for some future post, it could be interesting to look into payments for jobs paid by the hour or by the day, as well as into more specific job categories. One way of approaching specific job categories would be to define tags for job titles using standard words from the titles (e.g. director, management, assistant) and to group them by tag type in a new factor variable, as sketched below.
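A minimal sketch of this tagging idea (the keyword list is illustrative only and not part of the original analysis):

# Tagging job titles by standard keywords and counting announcements per tag
salaries %>%
  mutate(tag = ifelse(grepl("director", position, ignore.case = TRUE), "director",
               ifelse(grepl("manage", position, ignore.case = TRUE), "management",
               ifelse(grepl("assistant", position, ignore.case = TRUE), "assistant",
                      "other")))) %>%
  mutate(tag = factor(tag)) %>%
  count(tag, sort = TRUE)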
Histogram with distribution of lower salaries (GBP)

Histogram with distribution of upper salaries (GBP)

The 10 most frequent recruiters
The table below presents the ranking of the 10 most frequent recruiters in the dataset. Column “N” gives the total number of announcements for each recruiter, while column “Freq” shows each recruiter’s percentage of all announcements. Note that several of these are recruitment agencies.

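For reference, such a ranking can be computed along these lines (a sketch; the published table may have been formatted differently):

# Ten most frequent recruiters, with counts and percentages of all announcements
salaries %>%
  count(recruiter, sort = TRUE) %>%
  mutate(Freq = round(100 * n / sum(n), 1)) %>%
  head(10)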
The tables below show the jobs with the 10 lowest and the 10 highest upper salaries.
The jobs with the 10 lowest upper salaries (GBP)

The jobs with the 10 highest upper salaries (GBP)

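Rankings like these can be produced with dplyr’s arrange (a sketch using the salaries data frame built above):

# Jobs with the 10 lowest upper salaries
salaries %>% arrange(upper_salary) %>%
  select(recruiter, position, upper_salary) %>% head(10)
# Jobs with the 10 highest upper salaries
salaries %>% arrange(desc(upper_salary)) %>%
  select(recruiter, position, upper_salary) %>% head(10)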
I also wanted to quickly explore possible relationships between deadline dates and salary levels, just for fun. It could be, for example, that some periods had lower average-salary offers than others.
Despite the large number of job announcements in the dataset (N=1704), all observations refer to jobs with application deadlines between 24 August 2016 and 23 September 2016. This is a short time span for such analysis, but I explored it anyway just as an example of what these tools and techniques can do.
The plot below shows the mean (average) upper salary for each day throughout the period. Both the variation in the mean salary and the salary levels seem higher for jobs with deadlines from September onwards. The dashed line represents the fit of a linear regression. The linear model, however, fails to detect any statistically significant relationship between mean salary and application deadline (R2 = 0.006; p = 0.69).
Average upper salary by date (GBP)

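The aggregation and fit behind a plot like this can be sketched as follows (the exact plot settings in the original may differ):

# Mean upper salary per deadline date, with a simple linear fit
daily <- salaries %>%
  group_by(deadline) %>%
  summarise(mean_upper = mean(upper_salary))
fit <- lm(mean_upper ~ as.numeric(deadline), data = daily)
summary(fit) # in this run: R2 = 0.006, p = 0.69
ggplot(daily, aes(x = deadline, y = mean_upper)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  labs(x = "Application deadline", y = "Mean upper salary (GBP)")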
Next, I use word clouds to explore job titles. The larger a word appears in the cloud, the higher its frequency in the dataset. Only words mentioned in at least 10 job announcements are shown. The plot indicates that management positions are the most frequent, followed by coordination jobs as well as officer, recruitment and fundraising roles.
Word cloud of job titles

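A cloud like this can be built with the wordcloud package (an assumption, as the package is not loaded in the code above; a minimal sketch):

suppressPackageStartupMessages(require(wordcloud)) # assumed additional package
# Splitting job titles into words and keeping those appearing at least 10 times
title_words <- unlist(strsplit(tolower(salaries$position), "[^a-z]+"))
word_freq <- table(title_words)
word_freq <- word_freq[word_freq >= 10 & names(word_freq) != ""]
wordcloud(names(word_freq), as.numeric(word_freq), min.freq = 10)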
The cloud below shows the most frequent words in the names of the recruiting institutions. I assumed that its results could provide hints about the most active thematic areas in terms of job announcements. Again, only words mentioned in at least 10 job announcements are shown. The word cloud suggests that recruitment agencies are among the leading recruiters, as expected (see section “The 10 most frequent recruiters”). Organisations working with children, cancer patients and Alzheimer’s patients also seem to stand out.
Word cloud of recruiters

Moving forward
The charity, development-aid, not-for-profit and social-enterprise sector is evolving rapidly. This process is powered both by increasingly critical global challenges and, of course, by capable and motivated entrepreneurs, staff and service suppliers. The sector is sometimes overly romanticised. As a consultant and entrepreneur in the sector, I am often asked how I manage to deal with all the daydreamers I come across along the way. No judgement intended, but this shows how little the sector is still known to the public: it has become increasingly professional and results-oriented. I believe that computing for data analysis can help the sector, particularly in monitoring and evaluating performance, which should include staff and beneficiary / client satisfaction.
I hope you enjoyed this tour, and I would be happy to receive your suggestions for additional analysis and improvements. You can access this post with more up-to-date data at: https://rpubs.com/EduardoWF/charityjobs.
Keep coding and take care!
Written by: Eduardo W. Ferreira, PhD / Consultant, data scientist, trainer and facilitator. Eduardo supports the design, management and evaluation of projects and programmes for consultancy firms, non-governmental organisations, governments, research institutions and international organisations (Additional information).