Guest post: How important is the quality of open data?
By Dr Dennis McDonald, Senior Consultant at BaleFire Global
Martin Doyle's "Is open data at risk from poor data quality" is a thoughtful piece, but it doesn’t address this question:
Should data quality standards observed in open data programmes be any different from the data quality standards observed in any other programmes that produce or consume data?
My first response is to answer with a definite “No!” However, I think the question is worth discussing. Data quality is a complex issue that people have been wrestling with for a long time. I remember way back in graduate school doing a class project on measuring “error rates” in how metadata were assigned to technical documentation that originated from multiple sources.
Just defining what we meant by “error” was an intellectually challenging exercise that introduced me to the complexities of defining quality as well as the impacts quality variations can have on information system cost and performance.
Reading Doyle’s article reminded me of that early experience and how complex quality measurement can be in programmes that are designed to make data accessible and re-usable.
One way to look at such questions is in terms of trade-offs. Would we gain more benefit by exposing potentially faulty data files now to public scrutiny and re-use than we gain by delaying and spending more time and resources to "clean up" the data before it’s opened up for public access?
Setting aside for the moment how we define "quality" -- and not all the points made by Doyle are directly related to quality -- database managers have always had to concern themselves with standards, data cleaning, corrections, and manual versus automated approaches to data quality control. In fact, I'm not convinced that "adding" quality control measures throughout the process will significantly add to costs; such measures may actually reduce costs, especially when data errors are caught early on.
One thing to consider when deciding whether or not to release data that may not be 100% "clean" -- whatever that is defined to mean -- is that it is a basic principle of open data that others will be free to use and re-use the data. If data are distributed with flaws, will re-use of that data compound those flaws by impacting other systems and processes that may be beyond the control of the issuer? Plus, if downstream problems do occur, who gets to pay for the cleanup?
Data quality concerns in our own lifetimes have become central to commerce and communication as transactions of all types have moved online or onto the web. Buying, selling, and managing personal shopping, health, and financial affairs reliably place a high emphasis on data accuracy, completeness, and reliability. We’re used to that. Our high expectations of data quality are at least partly due to how purpose-built systems with specific functions are expected to operate by delivering support for specified transactions. Data quality issues can directly impact system performance (and sometimes profitability) in immediately measurable ways. The same can also be true of data provided by open data programmes.
As Doyle reports, the move to more open data by governments (he focuses specifically on the UK but I think his observations are widely relevant) exposes issues with how some open data programmes are managed and governed. Such issues can include a variation in data standards, inconsistent or incompatible business processes, variations in software, and occasionally, outright sloppiness.
Even when you do everything “right,” though, you may still run into problems. I remember once I was managing a large software-and-database consolidation project where, even after many hours of analysis and data transformation programming and testing, there still remained a group of financial transactions that were unable to make it from System A to System B without manual processing. It turned out that there were basic incompatibilities in the two systems’ underlying data models due to their having been built on different accounting assumptions. What one system considered to be an outright “error,” the other system considered to be 100% correct.
What's the remedy for situations where there are potentially so many areas where errors and data quality variations can creep into the system? Doyle’s solution list is logical and straightforward:
- investing in standards that make data consistent
- ensuring encoding methods are used and checked
- ensuring duplicate data is always removed during frequent data quality checks
- removing dependency on software that produces inconsistent or proprietary results
- ensuring governance that avoids confusion
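As a rough illustration of two items on that list (checking encoding and removing duplicates), here is a minimal Python sketch. It is not taken from Doyle's article. The choice of UTF-8 as the agreed encoding, the normalisation rule (trim and lower-case), the function names, and the sample rows are all assumptions made up for the example.

```python
import csv
import io
from collections import Counter


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 (the encoding assumed here)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def normalise(row: list[str]) -> tuple[str, ...]:
    """Trim whitespace and lower-case each field so near-identical rows compare equal."""
    return tuple(field.strip().lower() for field in row)


def find_duplicates(csv_text: str) -> list[tuple[str, ...]]:
    """Return each normalised data row that occurs more than once (header excluded)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    counts = Counter(normalise(row) for row in reader if row)
    return [row for row, n in counts.items() if n > 1]


# Hypothetical sample file: the second data row repeats the first,
# differing only in letter case and a trailing space.
sample = b"body,year,spend\nAlpha,2014,100\nalpha ,2014,100\nBeta,2014,80\n"

print(decodes_as_utf8(sample))
print(find_duplicates(sample.decode("utf-8")))
```

What counts as a "duplicate" is itself a quality definition: a stricter or looser normalisation rule would flag different rows, which is one reason such checks need agreed standards behind them.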
It’s not all about standards
Note that the above solutions are not all about formally-developed and de facto "standards" -- though obviously standards are important. What is also needed is a recognition that publishing data is part of an ongoing process. Success depends not only on the adoption of standards but also on the ability to manage or at least coordinate the people, processes, and technologies that need to mesh together to make open data programmes effective and sustainable. Quality variations occurring at one location in the chain may have no impact at that point and may not show up till later.
Who’s in charge?
Blaming one company or product line for open data programme failures, as seems to be the case in ComputerWeekly.com's [Microsoft gets flak over "rubbish" UK data](http://www.computerweekly.com/blogs/public-sector/2014/09/microsoft-gets-flack-over-rubb-8.html 'ComputerWeekly.com article: Microsoft gets flak over "rubbish" UK data'), is simplistic. A lot of gears have to work together in an open data programme that depends on technology, software, and organisations working together. One crucial question concerns the last thing mentioned in Doyle's list: governance. Who's in charge? Who has authority, responsibility, accountability? Ultimately, where's the money to come from? And, who is responsible for managing expectations about how the system will perform?
Open data programmes, wherever they occur, can involve many different players who have to work together even though their loyalties lie with different organisations. If participants don't share a common purpose, and strong central or top-down leadership is lacking or unavailable, does that mean that open data programmes and their emphasis on standards are doomed?
Of course not. Support for open data at all levels of government is still strong. Planning exercises such as the World Bank’s Open Data Readiness Assessment Tool explicitly recognise policy and governance issues as being important to the success of open data efforts.
Still, there certainly are real issues that need to be managed. These include not only the types of data problems mentioned in the ComputerWeekly.com article but also process changes associated with data standardisation that I have written about before.
Sharing as platform
Open data system development involving multiple systems and organisations is manageable when people are willing to work together in a collaborative fashion to pursue realistic and sustainable goals and benefits. Participants also need to share information about what they are doing, including the provision of “data about the data” as provided in the Open Data Institute’s Data Set Certificates.
My own reading of the situation is that such sharing is occurring, partly as an outgrowth of the “open data” movement itself, and partly as an outgrowth of the increasingly social nature of work as more people become accustomed to information sharing via modern tools, relationships, and networks.
While social networking and sharing are no substitute for leadership, they do provide a platform for collaboration in all the business and technical areas relevant to open data.
Regarding the question "How important is the quality of open data?", my answer is "Very important." One of our challenges, then, is to make sure that everyone involved in the process sees -- and understands -- how what they do along the way has an impact on open data quality.