23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is both entered by hand and imported from multiple sources and is curated by our team of volunteers.

Being a site called "OpenDataBC" you would think it would be pretty straightforward to put such a site together. Take the available catalogues from Nanaimo, Vancouver and the province and stick them together and voila, a catalogue is born. But, it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In it's ideal form what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open we think that it's useful to promote the use of data that's been released while acknowledging data that is more open (doing the right thing) while at the same time encouraging the data that is less open, to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find.  For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here I present the following questions as questions to consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although technically a scanned image of a map with gold stickers pasted on it is data, is not something that a programmer can use.  What we look for is machine readable data.  Documents or electronic files containing data that are published in formats that a software program can ready easily and consistently without errors is considered machine readable.  A databases, spreadsheets, CSV files are all examples of machine readable electronic data that are easily readable, thus they are considered more usable.  PDF files, word documents, scanned images - while technically readable by a software program - it's not easy and it is time consuming, thus this it's less usable.

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and in return I get the data.

3. Is it published in an open format?
From wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free.  For example, if  I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data that is released under a common license, such as the Creative Commons license or the Open Knowledge Definition are preferred over licenses created by the party releasing the data because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem because once the license is learned for one dataset that license is understood and can be applied to other datasets released similarly.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.

The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions to ask, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently and less usable data will be identified as such so the issues that are contributing to it's less usable status can be addressed.

Please let me/us know if you think we're missing something or of something here needs adjusting.

No comments: