07 July 2010

Is your data readable?

In talking with clients and colleagues about open data and open government, one question comes up over and over again. The word “data” means a collection or body of facts that represent the qualitative or quantitative attributes of a variable or set of variables, but what does “open data” mean?

To answer this question, I like to look at what we are trying to achieve by opening data. The promise of open data is that if we make government administrative data available to the public, value will be created in ways that we may or may not be able to imagine. That value will be created by using the data. So, what is open data? Ultimately, it’s data you can use. In this series of blog posts I will explore the various ways data can be made more usable.

What makes data usable?

In a previous post I proposed some dimensions that move toward a usability scale. In this post I propose a minimum standard of usability. In other words, what are the absolute minimum requirements that must be satisfied in order to consider something open data? To answer this question, one could look at each dimension of usability individually and decide on the minimum level below which data is not usable.

One of the main measures of usability is readability.  In other words, how easy is it to read?

For example, this list of cities with their geographic areas and populations is data:


Data collected into rows and columns in this way is typically called a data set (or dataset). By putting this dataset in my blog post I have made it available to you, but because I published it as a screenshot of my spreadsheet, reading it would be difficult and error prone, and would require expensive software or scripting. That makes it pretty much unusable for you.

Another method in use by governments today is to publish data as a PDF document. This is marginally better than posting an image. It’s technically possible to extract the data from PDF files, as I have demonstrated in a previous post, but it’s still expensive, time consuming and error prone.

What I could do instead is make that same data available as an HTML table in this blog post, like this:




City       Area    Population
Victoria   19.68   78057
Vancouver  114.67  578041
Kelowna    211.69  120812


Technically, this is a step better than both images and PDF files, but it will still get me low points on the usability scale. To read it, a programmer still has to write a script specifically for reading this data from my blog post, which is a time consuming and wasteful process. If you’re unfortunate enough to need to read data from an HTML page, another previous blog post describes how to do this.
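
To give a feel for the kind of one-off script this takes, here is a minimal sketch using only Python's standard library. The table markup in `HTML` is my assumption about what such a page might contain, not the actual markup of this post, and the `TableReader` class is a hypothetical helper written for this one table.

```python
from html.parser import HTMLParser

# Assumed markup for the table above; a real page will likely differ.
HTML = """
<table>
 <tr><th>City</th><th>Area</th><th>Population</th></tr>
 <tr><td>Victoria</td><td>19.68</td><td>78057</td></tr>
 <tr><td>Vancouver</td><td>114.67</td><td>578041</td></tr>
 <tr><td>Kelowna</td><td>211.69</td><td>120812</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")   # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # text inside the current cell


reader = TableReader()
reader.feed(HTML)
header, *records = reader.rows
html_rows = [dict(zip(header, record)) for record in records]
```

Even in this tidy case every value comes back as a string, and the script would break as soon as the page layout changed, which is why I rate HTML low on the scale.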

To really improve the usability of this data, it makes sense to publish it in a format that makes the data easily accessible. Many people are familiar with spreadsheets, a popular tool for reading and manipulating tabular data, so making data available in spreadsheet format makes it more usable in the sense that people can obtain spreadsheet programs to read it. For example, here is the same data published in the open .ODS format, supported by a wide variety of spreadsheet software providers, and here it is published in XLS, a proprietary format controlled by the Microsoft corporation.

The advantage of publishing in spreadsheet format is that, while reading it still requires specialized scripts and software, at least the rows and columns are well defined, which translates into fewer errors. This is what I would consider the minimum bar for usable open data. It's not as usable as I would like, but it is usable without too much risk. In other words, if you have data in this format already and you don't have the budget to reformat it before publishing it, don't delay the release; just publish it as is.

Ideally, though, data is published in formats designed specifically for sharing information, and that’s where the CSV, XML and JSON formats come in.

The CSV version of my dataset looks like this:
"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812

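As a rough illustration of how little code a reader needs for CSV, here is a sketch using Python's built-in `csv` module. The text is pasted into a string for the example; in practice it would come from a downloaded file. The variable names are my own.

```python
import csv
import io

# The CSV listing above, embedded as a string for the sake of the example.
CSV_TEXT = '''"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812
'''

# DictReader uses the first line as column names.
csv_reader = csv.DictReader(io.StringIO(CSV_TEXT))

# CSV carries no type information, so numbers are converted explicitly.
csv_cities = [
    {
        "city": row["City"],
        "area": float(row["Area"]),
        "population": int(row["Population"]),
    }
    for row in csv_reader
]
```

The only real work left to the programmer is deciding which columns are numbers.
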
The XML version looks like this:
<dataset>
 <data>
  <row><city>Victoria</city><area>19.69</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>
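
Reading the XML version is similarly short. The sketch below uses Python's standard `xml.etree.ElementTree` module and assumes each row wraps its area value in an `<area>` element alongside `<city>` and `<population>`; the variable names are mine.

```python
import xml.etree.ElementTree as ET

# The XML dataset as a string; assumes the area sits in an <area> element.
XML_TEXT = """<dataset>
 <data>
  <row><city>Victoria</city><area>19.69</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>"""

root = ET.fromstring(XML_TEXT)

# Every <row> element becomes one record; element text is converted to numbers.
xml_cities = [
    {
        "city": row.findtext("city"),
        "area": float(row.findtext("area")),
        "population": int(row.findtext("population")),
    }
    for row in root.iter("row")
]
```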

and the JSON version looks like this:
[
 {"city": "Victoria", "population": 78057, "area": 19.69},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]
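
JSON needs the least code of the three, because the format already distinguishes text from numbers. Here is a small sketch using Python's standard `json` module, with the dataset pasted in as a string; the total population at the end is just an example of using the data once it is loaded.

```python
import json

# The JSON dataset as a string; in practice this would be read from a file or URL.
JSON_TEXT = """[
 {"city": "Victoria", "population": 78057, "area": 19.69},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]"""

# One call yields a list of dictionaries with numbers already typed.
json_cities = json.loads(JSON_TEXT)

# Example use: add up the population of all listed cities.
total_population = sum(city["population"] for city in json_cities)
```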

While not quite as pretty as the other human-readable formats, CSV, XML and JSON are open formats that provide structure, making it very easy for programs to read the data. They are also well supported in almost all modern programming languages, so any programmer who wants to use your data can do so easily and accurately with free software and very little programming. And as a side benefit, it's very easy and inexpensive to publish your administrative data in these formats using free software.
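
On the publishing side, here is a minimal sketch of writing the same records out as CSV and JSON with nothing but Python's standard library. The `records` list is a hypothetical stand-in for rows exported from an administrative system, and the output is kept in strings rather than written to files.

```python
import csv
import io
import json

fields = ["City", "Area", "Population"]

# Hypothetical stand-in for rows pulled from an administrative system.
records = [
    ("Victoria", 19.68, 78057),
    ("Vancouver", 114.67, 578041),
    ("Kelowna", 211.69, 120812),
]

# CSV: quote text fields and leave numbers bare, as in the listing above.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(fields)
writer.writerows(records)
csv_output = csv_buffer.getvalue()

# JSON: one object per record, keyed by column name.
json_output = json.dumps([dict(zip(fields, r)) for r in records], indent=1)
```

Writing `csv_output` and `json_output` to files on a web server would be enough to publish both versions.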

Publishing data in these open formats makes it easy for people to use open data. While data published in HTML format is readable, and is what I would consider the bare minimum for usability, depending on how it is done, other formats can make reading it much easier. And if your organization is serious about engaging people to collaborate and create value from the data, it will want to make the data as usable as possible, and making the data readable is one part of doing that.

1 comment:

sardire said...

Yes indeed. Check out http://openstructs.org/iron/iron-specification

Abstract
irON (instance record and Object Notation) is a abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON). The notation specification includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. Profiles and examples are also provided for each of the irXML, irJSON and commON serializations.