Young families can have a hard time navigating the complexity and risks of real-estate markets.
Is the price of a house in line with the market?
How much should I offer for a specific house?
Those planning to sell typically engage one or more real-estate brokers to estimate their house price; brokers, however, have interests of their own in the process. Sellers therefore still need to know whether a price estimate is in line with market prices.
Artificial intelligence can help to shed light on some of these questions. To demonstrate that, we used two machine learning algorithms to forecast house prices based on their specific characteristics. For that, we used publicly available data scraped from real-estate websites, containing house offers in Bremen (where we are based).
Since the process behind the curtains can be quite complex, we packed everything into a web application (in short, ‘web app’). This application, which we called “ImmoBot”, simply requires users to specify the house characteristics and then provides the estimates in no time. ImmoBot also presents the underlying dataset and plots, which we will expand from time to time.
Price forecasts are based on regression analyses from a gradient-boosting algorithm and a random-forest algorithm, two common machine learning tools. Since the forecasts from the two algorithms differ, we added the average of the two estimates as an additional piece of information.
Gradient boosting and random forest are decision-tree-based ensemble models. In gradient boosting, a shallow, weak tree is trained first, and the next tree is then trained on the errors of the first. The process continues with new trees being added to the ensemble sequentially, each correcting the errors of the ensemble of preceding trees. In contrast, a random forest is an ensemble of deep trees trained independently of one another, whose predictions are averaged.
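As a minimal sketch, this is how the two approaches can be combined in R. The package choices (randomForest and gbm) and the data frame ‘houses’ with its columns are illustrative assumptions for this post, not ImmoBot’s actual code:

# A minimal sketch, not ImmoBot's internal code; 'houses' and its columns are assumed.
library(randomForest)
library(gbm)
# Random forest: many deep trees grown independently; predictions are averaged
rf_fit <- randomForest(price ~ living_area + rooms + year_built,
                       data = houses, ntree = 500)
# Gradient boosting: shallow trees added sequentially, each correcting the
# errors of the ensemble built so far
gbm_fit <- gbm(price ~ living_area + rooms + year_built,
               data = houses, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
# Averaging the two forecasts for a new house, as the web-app does
new_house <- data.frame(living_area = 120, rooms = 4, year_built = 1995)
mean(c(predict(rf_fit, new_house),
       predict(gbm_fit, new_house, n.trees = 1000)))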
We hope this serves as a simple example of how artificial intelligence can be applied, while helping our users in Bremen navigate a complex real-estate market.
Please share this web-app and let us know in case you have any suggestions or questions.
Disclaimer: Please note that the price forecasts are merely illustrative. Neither movimentar GmbH nor any person acting on its behalf may be held responsible for the use that may be made of the information presented here.
We are pleased to announce that we are now officially on LinkedIn to engage and connect with our clients and people interested in digitalisation, data science and technology for management and evaluation of international development and humanitarian actions. Make sure you check our company page regularly, as we will update you on current projects and any potential job opportunities. Connect with us and join the exchange! #projectmanagement #datascience #internationaldevelopment
Reproducible samples and analyses are critical for data quality, particularly in monitoring and evaluation of project activities. Okay, some may say, but does it really matter at all? Yes, it does. It helps to set a seed for the future.
Evolution in project monitoring and evaluation
Let us take a capacity-development activity, such as a training course, as an example. In the past, it might have been enough to simply write a “qualitative” description of the training contents with some technical jargon / acronyms and to mention the number of participants (hopefully also providing or “estimating” the percentage of women among them).
Monitoring and reporting project implementation was largely seen as a bureaucratic requirement. There was no structured way to learn from participants. There was virtually no systematic data collection for knowing how to improve training contents, methods, materials and results based on participants’ views and suggestions (feedback).
This fitted well with the traditional top-down approach to capacity development. In a way, it also made sense. Digital data-collection tools were complicated to use and considered expensive to maintain. They also represented more pressure on project managers and staff.
Project managers could be confident (and I am afraid some still are) that treating reports merely as regular “literary” exercises while focusing efforts on financial compliance would be enough. After all: “in the end we will write something nice in the implementation report”. Learning from project implementation and evolving from experiences were suffocated by a binary logic of “success” or “failure”. In such a context, it is easy to miss the fact that experimentation “failures” are important steps towards learning for impact success.
Paradigm change
This context has been changing fast in the era of data abundance and analytics. Many still see terms such as “automation” and “machine learning” as threatening. Personally, I think that improvised, unstructured and scientifically weak monitoring, evaluation, accountability and learning systems have done enough harm in terms of lost resources and opportunities. This is particularly so in the public and international development sector. It is great to see that things are finally evolving from discourse to practice.
Learning from experience is gradually becoming easier and cheaper. Powerful open-source computational tools such as R and Python are freely available and can make it easier to reduce sample-selection bias, though they require at least basic knowledge of their syntax. Many organisations are still adapting to the paradigm shift from top-down, specialised expertise to a more collaborative, multidisciplinary and data-driven approach to monitoring, evaluation and learning. This process requires data-science skills that blend computing and statistics while following professional monitoring and evaluation standards. Investment in human resources and targeted recruitment / contracting are key. Data management and analysis using traditional spreadsheet software such as MS Excel and conventional, proprietary statistical packages (e.g., SPSS and Stata) are no longer enough for a world of complex, unstructured data.
Sampling in a scientifically-robust (but simple) way
A common question clients have asked me is how best to select participants for feedback surveys in activities such as trainings and events. Thinking about this and the context above, I developed a very simple app with Shiny. The app generates a reproducible random list of numbers; samples are reproducible because the app sets R’s random seed with set.seed(44343).
You can access and use the app at https://movimentar.shinyapps.io/randomsample/. You simply need to input the total number of participants in the event/activity and a sample percentage, which will depend on the size of your activity. After that, you can visualise the result in a reactive table and download the output in XLSX format (MS Excel).
The “magic” here is that anyone who executes set.seed() with the same number between the parentheses will always draw the same randomised sample. This makes the sample reproducible while avoiding sample-selection bias. People in the future can thus learn from your experience with the assurance that you put some thought into data quality.
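As a minimal sketch in R (the participant numbers are made up for illustration):

set.seed(44343)        # the seed used by the randomsample app
participants <- 1:120  # e.g., an event with 120 participants
sample(participants, size = round(0.2 * length(participants)))  # a 20% sample
# Anyone running these lines (on the same R version) draws the same 24 numbers.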
It is also possible to draw reproducible samples in many other statistical computing languages. In Python, for example, you can import numpy and call numpy.random.seed() to set the seed, and then numpy.random.choice() to draw the sample. Be aware, however, that the seed number (44343) used by the randomsample app will generate a different sample in Python, as the app is built in R and the same seed does not produce the same draws across the two languages.
The app’s source code is publicly available for download on GitHub. I hope that this helps others to learn more about these tools. Code contributions will be very welcome too.
Let us learn for real. It is time to set.seed() for the future.
Written by: Eduardo W. Ferreira, PhD / Consultant, data scientist, trainer and facilitator. Eduardo supports designing, managing and evaluating projects and programmes for consultancy firms, non-governmental organisations, governments, research institutions and international organisations (Additional information).
Project management has some things in common with flying a kite. One needs to adapt well and quickly to changing external conditions based on observed performance. Otherwise, one runs the risk of blindly hitting the ground.
Unlike kites, though, a project or programme requires more than open eyes. It requires sound data collection and analysis. Beneficiary / client feedback in opinion surveys can help track performance, user satisfaction and improvement needs. This is particularly so if one uses reproducible, computer-generated random samples in line with professional statistical and data-science standards. This is the way to learn from project implementation based on modern scientific and computational methods.
Lesson learned
Some time ago, I had a consulting assignment with a youth-violence reduction project in Brazil. The project needed baseline data for its logframe indicators. So, we designed a system for collecting, storing, processing and reporting data using Open Data Kit on Android devices. We also used R (a statistical computing language) to program a reproducible sample, as well as all data processing and analytical reporting.
Before collecting data, I had to train over 25 people, including interviewers and partner staff, who also suggested changes to the data-collection form. This post is about one of these suggestions, which was a particularly good lesson learned.
Jane Davidson’s article “Breaking out of the Likert scale trap” inspired me to propose the inclusion of direct evaluative questions instead of traditional Likert scales. It is a very good post, arguing that by using evaluative terms right in the questionnaire, participant ratings become much easier to interpret in terms of quality or value. I agree.
The Likert scale ranging from “strongly agree” to “strongly disagree” is great for assessing respondents’ opinions and knowledge. However, it makes it difficult to draw evaluative conclusions about the quality or value of a training workshop, project or programme, for example. The scale suggested by Davidson was as follows:
“poor / inadequate”;
“barely adequate”;
“good”;
“very good”;
“excellent”.
The draft data-collection form used the same label categories as above, but translated into Portuguese. During the interviewer-training workshop, one of the participants spotted a potential problem that I had not noticed either: the label categories were not well balanced…
The problem was that the scale above had three positive and only two negative categories or levels, so the likelihood of a positive result tends to be higher. Such unbalanced options are a potential source of bias.
To prevent such bias, we changed the labels proposed by Davidson to:
“very poor” or “very low”;
“poor” or “low”;
“regular” or “average”;
“good” or “high”;
“very good” or “very high”.
Additional answer categories
I would recommend including the categories “Not sure, I don’t know” and “Not applicable” to allow more complete respondent feedback. The numeric scale can integrate these categories depending on the question (e.g., answering “not applicable” or reporting not to know the action under evaluation can also indicate the quality of its outreach and impact).
Sometimes, it can also be interesting to have the answer option “I do not want to answer” for sensitive questions about income or abuse, for example. This option, of course, should not be part of the numeric evaluative scale. Otherwise, one will mix up different types of result.
Numeric analysis
The corresponding numeric intervals must also be balanced.
For a scale from 1 to 5 (with one as the worst score, as in Davidson’s article, or the reverse, as in Germany, where a score of one is the best), balanced intervals can be generated with the function “cut” in R (a statistical computing language). The same can be done for a scale from 1 to 7 if one includes the categories “Not sure, I don’t know” and “Not applicable”.
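Below is a small illustration with made-up scores; the labels are the balanced ones proposed above, and the exact R output shown in the original post is not reproduced here:

# Balanced intervals for the five-category scale (equal-width breaks)
labels_5 <- c("very poor", "poor", "regular", "good", "very good")
cut(c(1.2, 3.0, 4.9), breaks = seq(1, 5, length.out = 6),
    labels = labels_5, include.lowest = TRUE)
# The same approach for a seven-category scale from 1 to 7
labels_7 <- c(labels_5, "Not sure, I don't know", "Not applicable")
cut(c(2.5, 6.8), breaks = seq(1, 7, length.out = 8),
    labels = labels_7, include.lowest = TRUE)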
To further prevent bias, the survey introduction can make participants aware of the risk of providing biased answers. An introduction along the lines of the paragraph below can help:
Respondents in such questionnaires sometimes repeat the same answer for different questions, mark extreme answers to be polite or to call attention to a specific aspect, or rate items in the middle categories to appear neutral when they actually think otherwise. Please avoid this as much as you can, as it prevents us from understanding the real situation.
If you are asking for real feedback from clients/beneficiaries and stakeholders, interviewers must be external to your project team. Ideally, they should be outsourced, receive training on interviewing methods, and not be associated with the implementing organisations or related to their staff members. This helps prevent interviewer bias (when results differ depending on who collects the data). This can be the case, for example, when humanitarian-aid beneficiaries have suggestions for improving support but fear losing future support after having provided critical feedback.
Final remarks
I benefited from Davidson’s contribution, and I thought it would be good to try to contribute as well. Monitoring and evaluation with robust scientific standards can be powerful for learning and for improving policies, programmes, projects and products.
Evaluative scales can be very helpful, but that does not mean Likert scales should be avoided at all costs. I also use Likert scales in my forms, particularly in those aiming to test participants’ subject knowledge in capacity-development actions such as training workshops or course modules.
It is also worth including an open question about problems (e.g., what are the three main problems in your village?) as well as an open question about suggestions for improvement or additional comments. Text data can be analysed with word clouds and dendrograms, for example, and can complement scoring data well in monitoring and evaluation. It is also an opportunity for projects and programmes to track opportunities while making sure that they address the issues their beneficiaries or clients consider most important.
I hope you enjoyed this post and would be happy to receive any suggestion or comment.
In this post, I try to explore this and some other questions using the open-source statistical computing language R and public recruitment data from CharityJob’s website. According to CharityJob, it is the United Kingdom’s busiest site for charity, fundraising, NGO and not-for-profit jobs.
In addition to presenting these powerful open-source tools and data-exploration techniques, I hope that this post helps the public, especially applicants and workers, to get an update on salaries and trends in the sector. The jobs analysed here are mostly UK-based and published by UK-based organisations. Therefore, the results below are not meant to represent the entire sector worldwide. I still hope, though, that this post can provide a positive contribution to the evolution of the sector in both the southern and northern hemispheres.
For those of you who are only interested in the end analysis, please jump to the results section. However, I encourage you to explore how these tools work. I believe they can help speed up and improve the quality of the much-needed charity, social-enterprise, development-aid and humanitarian work globally.
I used here some basic techniques of web scraping (also called web harvesting or web data extraction), a software technique for extracting information from websites. The source code in RMarkdown is available for download and use under the GNU General Public License at this link: Rmarkdown code. Everything was prepared with the open-source, freely accessible and powerful statistical computing language “R” (version 3.2.0) and the development interface RStudio (version 0.99.441).
This post is based on public data. The post is my sole responsibility and can in no way be taken to reflect the views of CharityJobs’ staff.
Downloading data from CharityJobs
Using RStudio, the first step is to download the website data. CharityJobs’ search engine contains over 140 webpages, each of them listing 18 jobs in most cases. Hence, I expected to get information on around 2,500 job announcements. The first step was to download the data and strip out what I did not want (e.g., CSS and HTML code). The code chunk below describes how I did it; explanatory comments are indicated by hashtags (‘#’). I am sure that many would be able to write this code in a much more elegant and efficient way. I would be very thankful to receive your comments and suggestions!
# Loading the necessary packages. It assumes that they are installed.
# Please type '?install.packages()' on the R console for additional information.
suppressWarnings(suppressPackageStartupMessages(require(rvest))) # Credits to Hadley Wickham (2016)
suppressPackageStartupMessages(require(stringr)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(lubridate)) # Credits to Garrett Grolemund, Hadley Wickham (2011)
suppressPackageStartupMessages(require(dplyr)) # Credits to Hadley Wickham and Romain Francois (2015)
suppressPackageStartupMessages(require(xml2)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(require(pander)) # Credits to Gergely Daróczi and Roman Tsegelskyi (2015)
suppressPackageStartupMessages(require(ggplot2)) # Credits to Hadley Wickham (2009)
## Downloading website information into a list called `charityjobs` and closing connections.
## Note: the URL pattern below is illustrative; the original post built `urls`
## from CharityJobs' search-result pages.
urls <- paste0("https://www.charityjob.co.uk/jobs?page=", 1:140)
charityjobs <- lapply(urls, read_html)
Tidying up and parsing data
The next step is to parse and clean up the text string of each of the roughly 140 webpages. I decided to build a custom function for that, which I could use to loop through the content of each element of the charityjobs list. The function should also save the parsed data into a data frame, including information on recruiters, position titles, salary ranges and deadlines. The code chunk below presents this function, which I called salarydata.
## Creating a function for parsing data which uses the read_html output (list `charityjobs`).
## Only the salary-cleaning core survived in this excerpt; the extraction of
## recruiter, position, salary_range and deadline is omitted here.
salarydata <- function(pages) {
  salaries <- data.frame()
  for (page in pages) {
    # ... parsing of recruiter, position, salary_range and deadline into `sal` ...
    # Dropping hourly, daily and weekly rates as well as "plus"/"+" salary notes
    sal <- sal %>%
      filter(!grepl("hours|hour|p/h|ph|week|day|daily|plus|\\+", salary_range))
    salaries <- rbind(salaries, sal)
  }
  return(salaries)
}
Creating the full dataframe and other adjustments
The last step before exploring the data was to run the function salarydata to create the full dataframe. After that, I parsed lower and upper salaries into separate columns and deleted data that may have been incorrectly parsed, as well as data concerning daily-rate and hourly-rate jobs / consulting assignments. Only yearly salaries between GBP 4,000 and GBP 150,000 were considered. All salary data is in British pounds (GBP) and refers to annual salaries, which sometimes do not include benefits such as pension.
Cleaning the salary-range variable was a tricky step, as the website allows users to type in both salary amounts and additional text (e.g., 30,000, 30K, or 25-30k). Therefore, I had to iterate a few times until the output was good enough. I am quite sure that the code can be written in a more elegant way. Again, please let me know in case you have any suggestions here.
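As a hedged illustration of the kind of parsing involved (the helper function below is made up for this post, not the original code):

# Normalising thousands separators and "k" notation, then splitting ranges
parse_salary <- function(x) {
  x <- tolower(gsub(",", "", x))           # "30,000" -> "30000"
  x <- gsub("([0-9]+)\\s*k", "\\1000", x)  # "30k"    -> "30000"
  amounts <- as.numeric(regmatches(x, gregexpr("[0-9]+", x))[[1]])
  c(lower = min(amounts), upper = max(amounts))
}
parse_salary("28000 - 32000")  # lower = 28000, upper = 32000
# Mixed cases such as "25-30k" still need further iterations, as noted above.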
The output below presents the summary of the full dataframe (first 10 observations).
## Source: local data frame [1,704 x 6]
##
## deadline recruiter
## (time) (chr)
## 1 2016-09-04 ZSL
## 2 2016-09-12 Alliance Publishing Trust
## 3 2016-08-31 Save the Children
## 4 2016-08-30 Blind Veterans UK
## 5 2016-09-08 Headway SELNWK
## 6 2016-08-30 Saferworld
## 7 2016-09-22 Pro-Finance
## 8 2016-09-06 TPP Recruitment
## 9 2016-09-06 Harris Hill
## 10 2016-09-20 Hays London Ebury Gate
## .. … …
## Variables not shown: position (chr), lower_salary (dbl), upper_salary
## (dbl), salary_range (chr)
Results
The final dataset contains information on 1,704 jobs of various types, based on yearly-salary figures. It excludes consultancy assignments and other jobs paid by the hour or day, as well as jobs that did not provide salary information.
The table below presents standard descriptive statistics for lower and upper salaries. For job announcements providing a single value rather than a salary range, that amount was assigned to the dataset variable upper_salary, while lower_salary was set to NA (not available). That is why the number of observations (N) is 785 for lower salaries and 1,704 for upper salaries: about 54% of the announcements provided a single salary amount rather than a range.
Summary statistics of salaries (in British pounds / GBP)
In a more in-depth analysis in some future post, it could be interesting to look into pay for jobs remunerated by the hour or day, as well as into more specific job categories. One way to approach specific job categories would be to define tags for job titles using standard words from the titles (e.g., director, management, assistant) and group them by tag type in a new factor variable.
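A minimal sketch of such tagging, assuming the full data frame is called ‘salaries’ as in the function above:

library(dplyr)
salaries <- salaries %>%
  mutate(tag = case_when(
    grepl("director",  tolower(position)) ~ "director",
    grepl("manag",     tolower(position)) ~ "management",  # manager, management
    grepl("assistant", tolower(position)) ~ "assistant",
    TRUE                                  ~ "other"),
    tag = factor(tag))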
Histogram with distribution of lower salaries (GBP)
Histogram with distribution of upper salaries (GBP)
The 10 most frequent recruiters
The table below presents the ranking of the 10 most frequent recruiters in the dataset. Column “N” shows the total number of announcements for each recruiter, while column “Freq” shows the percentage of total announcements. Recruitment agencies are also among them.
The tables below show the ranking of the jobs with the 10 lowest and 10 highest upper salaries.
The jobs with the 10 lowest upper salaries (GBP)
The jobs with the 10 highest upper salaries (GBP)
I also wanted to quickly explore possible relationships between deadline dates and salary levels, just for fun. It could be, for example, that some periods had lower average-salary offers than others.
Despite the large number of job announcements in the dataset (N=1704), all observations refer to jobs with application deadlines between 24 August 2016 and 23 September 2016. This is a short time span for such analysis, but I explored it anyway just as an example of what these tools and techniques can do.
The plot below shows the mean (average) upper salary for each day throughout the period. The variation in the mean salary, as well as salary levels, seems higher for jobs with deadlines from September onwards. The dashed line represents the linear regression fit. The linear model, however, fails to detect any statistically significant relationship between mean salary and application deadline (R2 = 0.006; p = 0.69).
Average upper salary by date (GBP)
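For those following along in R, the underlying computation is roughly this (again assuming the ‘salaries’ data frame from above):

library(dplyr)
daily <- salaries %>%
  group_by(deadline) %>%
  summarise(mean_upper = mean(upper_salary, na.rm = TRUE))
fit <- lm(mean_upper ~ deadline, data = daily)  # the dashed line in the plot
summary(fit)  # R2 = 0.006, p = 0.69 for this dataset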
Next, I use word clouds to explore job titles. The larger a word appears in the cloud, the higher its frequency in the dataset. Only words mentioned in at least 10 job announcements are shown. The plot indicates that management positions are the most frequent, followed by coordination jobs, as well as officer, recruitment and fundraising jobs. (A code sketch follows the two clouds below.)
Word cloud of job titles
The cloud below shows the most frequent words in the names of the recruiting institutions. I assumed that its results could provide hints about the most active thematic areas in terms of job announcements. The words in this plot are also those mentioned in at least 10 job announcements. The word cloud suggests that recruitment agencies are among the leading recruiters, as expected (see section “The 10 most frequent recruiters”). Organisations working with children, cancer patients and Alzheimer’s patients also seem to stand out.
Word cloud of recruiters
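For readers curious how such clouds can be produced, here is a minimal sketch; the tm and wordcloud packages are one option, and the original code may differ:

library(tm)
library(wordcloud)
words <- unlist(strsplit(tolower(salaries$position), "[^a-z]+"))
words <- words[!words %in% stopwords("en") & nchar(words) > 2]
freq  <- table(words)
freq  <- freq[freq >= 10]  # only words in at least 10 announcements
wordcloud(names(freq), as.numeric(freq), random.order = FALSE)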
Moving forward
The charity, development-aid, not-for-profit and social-enterprise sector is evolving rapidly. This process is powered both by increasingly critical global challenges and, of course, by capable and motivated entrepreneurs, staff and service suppliers. It is a sector that is sometimes overly romanticised. As a consultant and entrepreneur in the sector, I am often asked how I manage to deal with all the daydreamers I come across. No judgement there, but it indicates how much the sector is still unknown to the public. The sector has become increasingly professional and results-oriented. I believe that computing for data analysis can help it further, particularly in monitoring and evaluating performance, which should include staff and beneficiary / client satisfaction.
I hope you enjoyed this tour and would be happy to receive your suggestions for additional analysis and improvement. You can access this post with more recent data at: https://rpubs.com/EduardoWF/charityjobs.
Keep coding and take care!
Written by: Eduardo W. Ferreira, PhD / Consultant, data scientist, trainer and facilitator. Eduardo supports designing, managing and evaluating projects and programmes for consultancy firms, non-governmental organisations, governments, research institutions and international organisations (Additional information).
The new logframe makes it clearer that projects need to have baseline and target values disaggregated by sex for their indicators already at the submission stage. Changes in the column for the intervention logic may generate some confusion, though… The logframe now opens the possibility of more than one specific objective, also naming them “outcomes”. Below that, one will see “outputs” and then “activities”. This may contradict the European Commission’s Project Cycle Management Manual, but it seems in line with the policies of other development agencies (e.g., USAID, DFID and SDC), which also use the result chain: activities → outputs → outcomes → impact.
In the “definitions” section of the logframe template, the specific objective is not even mentioned. I can therefore imagine that the specific objective will simply be dropped in many grant proposals. This can be good for result orientation, as one now needs to break down results into outcomes (consequences/effects of project deliverables, such as increased weekly income) and outputs (concrete deliverables from activity implementation, such as the number of trainings or training participants by sex). This forces applicants to reflect more on objectively verifiable results.
My recommendation in terms of project structure is as follows:
1) A single overall objective and long-term goal (impact) indicating the strategic orientation of the project. The convention here is to start with “To contribute to…”.
2) Two or three key outcome areas (no more than five, to ensure maximum clarity for implementing staff and key stakeholders). The convention is to phrase outcomes and outputs in the past tense (e.g., improved coping strategies of at least 2,200 vulnerable farmer households).
3) One to three key outputs for each outcome area. These should also be phrased in the past tense and must derive concretely from activity implementation. So, take a few minutes to brainstorm products/deliverables for each activity. This will give you elements for designing the goal, outcomes and outputs, as well as their indicators. In the end, always cross-check whether your brainstorming results are well reflected in your output, outcome and indicator statements.
4) Include as many activities as necessary. Quantify activities as much as possible and do not include tasks (sub-activities) in the logframe, only the key project activities. I recommend planning activities by output (maximum four per output, for simplicity and clarity). Make sure that the activities really deliver the outputs you plan. If there is an output with no related activity, do not hesitate to add the required activity (or delete the output).
How important is the logframe?
The logframe is a crucial document. It should be the main guiding document for the project staff, and, when well designed, it can help a lot to ensure smooth implementation. The logframe is one of the annexes to the contract with the European Union. Hence, it is legally binding.
That does not mean that the logframe is set in stone. The EU Project Cycle Management (PCM) manual emphasises the importance of undertaking regular project monitoring and evaluation (M&E). Assessments of indicator performance during project implementation may require logframe adjustments (e.g., increases or reductions in indicator targets). So, the general rule is:
If changes do not affect the basic purpose of the action (e.g., replacing two nurses with a doctor, or the other way round, should be fine), you should report them without delay (e.g., in interim reports following yearly project implementation review and planning workshops).
If a major change is necessary that may affect the basic purpose of the action (e.g., substantial changes in the intervention logic – 1st column of the logframe), the project applicant will need to ask for permission from the project officer in the EU. They can do that by sending a revised logframe with highlighted changes and a letter with detailed justifications.
It is important to make sure that everyone has these PCM principles and contract rules in mind when designing and adjusting the logframe.
Increased scientific rigour in monitoring and evaluation
Projects are policy experiments, which require scientifically valid and reliable M&E data. Credible data is critically important to demonstrate value and to justify potential adjustments in your initial plans. This also helps the EU to justify the use of funds towards taxpayers. All this requires professional scientific methods and principles such as randomisation and reproducibility. Using digital data-collection tools (e.g., Kobotoolbox / Open Data Kit) and computer syntax to draw reproducible random samples, process and analyse data have become increasingly important in our digital era of analytic dashboards, algorithms and automated reports. These are great tools to ensure quality and manage the substantial data load that your project will need to work with.
I highly recommend watching Esther Duflo’s TED talk on how the lack of data on aid’s impact raises questions about how to provide it. And rightly so. A few colleagues are still reluctant to apply data science and statistics to project management in international development and humanitarian cooperation. Yes, all this rigour increases time and result pressure on project implementers. However, it also increases transparency, real learning and impact potential. Integrating mobile data collection with programmatic data processing and analysis can save time and resources by increasing the efficiency of your M&E processes. Doing all this on paper forms, for example, increases the likelihood of data-quality problems and is too resource-intensive compared with existing data-collection, database and visualisation tools.
Final remarks
Goal/impact indicators (the highest level) can be focused on a broader geographic area or on forecasts for a longer period than the project implementation, for example. Please be aware that you should also monitor and report the situation for long-term indicators or indicators focusing on broader administrative territorial units (e.g., regions, country, state or municipality).
The separation of “results” into “outcomes” and “outputs” with baseline and target values disaggregated by sex is likely to be a challenge, especially when it comes to output baseline values… You will need to become clearer about your targets by sex, and commit to them.
Good luck to us all in our next project proposals!
Written by: Eduardo W. Ferreira, PhD / Consultant, data scientist, trainer and facilitator. Eduardo supports designing, managing and evaluating projects and programmes for consultancy firms, non-governmental organisations, governments, research institutions and international organisations (Additional information).
M&E-data tidying can help increase your team’s productivity, save resources and free up time for the data-analysis stage and, of course, for your beneficiaries. If your project’s M&E data is “messy”, you will probably need to spend quite some time before you can begin any comprehensive analysis. Understanding well what tidy data is, and ensuring that the output of your project’s M&E system is tidy, can save you a great deal of trouble.
In the video below, you will find some useful tips and functions for tidying data with R, presented by Hadley Wickham, creator of many popular R packages.
Wickham’s paper on Tidy Data is another very useful resource and its abstract is presented below. You might also want to read his complete article on Tidy Data.
A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
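As a small, made-up illustration of the principle (using the tidyr package; the data and column names are invented):

library(tidyr)
# "Messy": one column per year, so the variable 'year' is spread across columns
messy <- data.frame(village = c("A", "B"),
                    `2015` = c(120, 85),
                    `2016` = c(140, 90),
                    check.names = FALSE)
# Tidy: each variable is a column, each observation is a row
pivot_longer(messy, cols = -village,
             names_to = "year", values_to = "participants")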