Training at the ODI - It’s a Date

With Open Data saving our planet for the next generation and bringing forth the next evolution of the web, it was surprising to find that explaining how computers understand dates was one of the most critical parts of the recent Open Data in Practice training course.

Dates are critically important when processing data, but before computers can process them they must first understand them. Even then things can still go wrong, as they have for Apple twice in the last three years (2010/11 and 2012/13).

Computers like to handle dates as specific objects that conform to a specific calendar, and they don’t care much about the human representation. Working with dates in a CSV, a spreadsheet or even XML first requires the computer to convert them all to date objects. Asking a graphing tool to plot a date series that hasn’t been converted will normally result in either an error or a best effort in which values like 31/2/2013 are treated as perfectly valid, because to the tool they are just three numbers separated by two slashes.
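To make the difference between a string and a date object concrete, here is a minimal Python sketch. It was not part of the course material; the helper name `parse_day_first` and the day/month/year format are assumptions chosen for illustration. It shows that once a value is converted to a real date object, an impossible date such as 31/2/2013 is rejected instead of slipping through as three numbers.

```python
from datetime import date, datetime
from typing import Optional


def parse_day_first(text: str) -> Optional[date]:
    """Convert a D/M/YYYY string to a date object.

    Returns None when the text is not a real calendar date.
    (Hypothetical helper, named for this example only.)
    """
    try:
        return datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        # Raised for impossible dates (31 February) and for text
        # that does not match the expected pattern at all.
        return None


print(parse_day_first("28/2/2013"))  # a real date: 2013-02-28
print(parse_day_first("31/2/2013"))  # not a real date: None
```

A plain string comparison would happily sort or plot "31/2/2013"; the conversion step is what catches it.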

During the validation and cleaning exercises of the Open Data in Practice course we looked at tools to handle date formats, specifically how Open Refine can be used to identify errors in a data set. Unfortunately, humans are a pain and will always find a new representation that the computer has never seen before and cannot translate into a date to be plotted on a graph. “End of Q3” in the due date column is a nice example. What year? “By the end of 2012” is another one. In both cases, what was wrong with typing in an actual date?
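The kind of check a cleaning tool performs can be sketched in a few lines of Python. This is a stand-in for illustration, not how Open Refine itself works: the list of accepted formats, the function names and the sample values (apart from the two quoted above) are all assumptions. Each value is tried against every known format, and anything that matches none of them is set aside for a human to fix.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple

# Formats we are prepared to accept (an assumption for this sketch).
CANDIDATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d"]


def to_date(text: str) -> Optional[date]:
    """Try each known format in turn; return None if none of them fits."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def split_due_dates(values: List[str]) -> Tuple[List[date], List[str]]:
    """Separate values that convert to dates from those that need manual repair."""
    parsed: List[date] = []
    rejected: List[str] = []
    for value in values:
        converted = to_date(value)
        if converted is None:
            rejected.append(value)
        else:
            parsed.append(converted)
    return parsed, rejected


# Illustrative column contents; only the two free-text entries come from the article.
due_dates = ["30/09/2012", "2012-12-31", "End of Q3", "By the end of 2012"]
parsed, rejected = split_due_dates(due_dates)
print(parsed)    # the two values that became date objects
print(rejected)  # the free-text entries left for a human to sort out
```

Note that the sketch makes no attempt to guess what “End of Q3” means. Without a year there is no correct answer, which is exactly the problem.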

This problem with dates became particularly evident for one group on the final day of the Open Data in Practice course. On this day people were asked to apply their new knowledge of open data, tools, techniques and practices to an area of their choice.

The Cabinet Office team (as they became known) decided to attempt a visualisation of the scheduled release of open data from each government department compared with the actual amount of data available. They wanted to see which departments had met their deadlines over time and which were lagging behind.

Stage one was to visualise the intended release schedule, contained in a single dataset available via data.gov.uk. With many potential visualisation toolkits to pick from, and the source data in CSV, it should have been fairly straightforward to start working with the data, just as you would in Excel. The breakdown to the right shows the approximate amount of time spent on the key tasks needed to create the visualisation. As you can see, more time was spent preparing the dataset than actually getting it visualised.

This is a familiar story to scientists and techies, but consider the last time you sorted your own email, digital photo collection or computer desktop. Things start out neat and tidy, but the situation changes and that nice filing system you created no longer works for any new content! It is surprising that the same applies to something as simple as a date format.

Having fixed many of the errors, and given up on others due to sheer boredom, it was time to create the visualisation (shown below). This early version shows the different government departments (y-axis) against the data to be released over time (x-axis). Pressing play will show a series of coloured bubbles moving and growing as the number of datasets gets bigger over time.

DIAGRAM

Notice that there is data missing for many of the departments… two words… date format.

With only a week to cover the main issues in open data, from licensing and publishing to consumption, validation and visualisation, I was extremely impressed with the projects that were attempted at the end of the week. It is testament both to the people and to the community out there who are making the tools much easier to use. We are just beginning to see the evolution of the web of data, but one thing is for sure: if we don’t sort out the problem with dates then we might be stuck in the past…