23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government datasets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is populated both by hand and by imports from multiple sources, and is curated by our team of volunteers.

For a site called "OpenDataBC", you would think it would be pretty straightforward to put such a site together: take the available catalogues from Nanaimo, Vancouver and the province, stick them together and voila, a catalogue is born. But it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" and that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In its ideal form, what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open, we think it's useful to promote the use of data that has been released, to acknowledge the data that is more open (doing the right thing), and at the same time to encourage the data that is less open to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end, OpenDataBC will highlight the most usable datasets we can find.  For that we need some sort of usability ranking or scale, which does not yet exist, so we are inventing it. Here I present a set of questions to consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although technically a scanned image of a map with gold stickers pasted on it is data, it is not something that a programmer can use.  What we look for is machine readable data.  Documents or electronic files containing data that are published in formats a software program can read easily and consistently, without errors, are considered machine readable.  Databases, spreadsheets and CSV files are all examples of machine readable electronic data that are easy to read, so they are considered more usable.  PDF files, Word documents and scanned images, while technically readable by a software program, are not easy or quick to process, so they are less usable.
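To illustrate the difference (a hypothetical sketch, not part of the OpenDataBC site; the sample data is invented), a CSV file can be parsed reliably in a few lines using only Python's standard library, whereas extracting the same table from a PDF or a scanned image would require specialized tooling and guesswork:

```python
import csv
import io

# A machine readable dataset: plain CSV.  (Sample data invented for illustration.)
raw = """city,population
Vancouver,578041
Nanaimo,78692
"""

# A standard parser reads it consistently, with no errors or guesswork.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# Every field is directly addressable by column name.
print(rows[0]["city"])
print(int(rows[1]["population"]))
```

The same figures embedded in a scanned image would first need OCR, with all the errors that entails, which is exactly why such formats rank lower on the scale.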

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and in return I get the data.
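The "enter a URL, get the data" test can be sketched in a few lines of Python using only the standard library (the URL below is a placeholder, not a real catalogue endpoint):

```python
import urllib.request

def fetch_dataset(url):
    """Fetch a dataset on demand: no forms, no login, just a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Hypothetical dataset URL -- substitute a real one from a catalogue.
# data = fetch_dataset("https://example.org/datasets/transit-stops.csv")
```

If that one call is all it takes to get the data, the dataset passes the accessibility test; if it returns a login page or an application form instead, it fails.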

3. Is it published in an open format?
From Wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which it can be used, or imposes any conditions or constraints on the user, is not free.  For example, if I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data released under a common license, such as a Creative Commons license or one conforming to the Open Knowledge Definition, is preferred over data released under a custom license created by the releasing party, because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem: once a license has been learned for one dataset, that understanding carries over to every other dataset released under it.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.
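For time series data, the completeness check can even be automated.  Here is a hypothetical sketch (the records and the monthly granularity are assumptions for illustration) that finds months missing between the first and last record:

```python
from datetime import date

# Hypothetical monthly time series: one record per month.
records = {
    date(2010, 1, 1): 42,
    date(2010, 2, 1): 40,
    # March 2010 is absent -- a completeness gap.
    date(2010, 4, 1): 45,
}

def next_month(d):
    """Return the first day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def missing_months(records):
    """List the months absent between the first and last record."""
    months = sorted(records)
    gaps = []
    current = next_month(months[0])
    while current < months[-1]:
        if current not in records:
            gaps.append(current)
        current = next_month(current)
    return gaps

print(missing_months(records))  # reports the March 2010 gap
```

A dataset that reports no gaps between its first and most recent records at least passes this mechanical test, though completeness of the values within each record still has to be judged by a human.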

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.

The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently, and less usable data will be identified as such so that the issues contributing to its less usable status can be addressed.
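One way the ranking could work is to treat each question as a yes/no check and count the yeses.  This is a hypothetical sketch only; the question names and the flat scoring are my own illustrative assumptions, not the final scale:

```python
# The eight questions from this post, as yes/no checks.
# Names and equal weighting are illustrative assumptions.
QUESTIONS = [
    "machine_readable",
    "accessible",
    "open_format",
    "free_to_use",
    "common_license",
    "no_fee",
    "complete",
    "timely",
]

def usability_score(dataset):
    """Count how many of the eight criteria a dataset satisfies."""
    return sum(1 for q in QUESTIONS if dataset.get(q, False))

# A hypothetical dataset assessment.
example = {
    "machine_readable": True,
    "accessible": True,
    "open_format": True,
    "free_to_use": False,   # requires a use agreement
    "common_license": False,
    "no_fee": True,
    "complete": True,
    "timely": False,
}
print(usability_score(example), "out of", len(QUESTIONS))
```

A real scale would likely weight the questions differently (a fee, for instance, is disqualifying for openness, not just one point off), but even a flat count makes datasets comparable at a glance.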

Please let me/us know if you think we're missing something or if something here needs adjusting.

18 June 2010

Single Points of Failure

As I write this in 2010, our political, economic, cultural and social systems in the western world are for the most part driven by corporations.  Our art is subsidized by corporations, our charities are funded by corporations, our culture is promoted by corporations and our laws are defined by corporations.

These corporations come in various forms, whether they are businesses, governments, or religious corporations.  In many ways the corporate form is very useful.  It's one way to provide a structure for people to work toward a common goal.  It provides some level of predictability.  And in some cases, it provides for economies of scale.

Much has been written about the weaknesses of the corporate form and the corruption it attracts, so I will leave that to others.  But there is one aspect of the corporate form that I don't see being written about: the fact that corporations represent a single point of failure.  Big corporations have big failures.  We permit corporations to grow infinitely large and then rely on them not to fail.  But they do fail, we know they fail, and we permit it anyway.

In the enterprise software space, the "one throat to choke" mantra is used to persuade the listener that putting all of your eggs in one basket is a good thing. What it hides, though, is the fact that the vendor represents a single point of failure, and when it's a large project, that often means a large failure.  The high failure rate of large IT projects is well known, but how often are these throats actually choked?  Almost never.

If these large organizations never failed, that would be one thing, but the fact is, they do fail.  And somehow, when we are voting, or shopping, or doing our cost-benefit analysis and selecting the vendor, we forget that the entity we are dealing with may not be around tomorrow.  The "bet the farm" ideology, invented hundreds of years ago and popularized during the industrial revolution, is showing its age in our current distributed global world.

When the costs of communication were high, it made a lot of sense to build organizations as hierarchies, minimizing communication costs through a top-down pyramid of command and control. This model was so efficient that it offset the risks of the single point of failure. Today, though, the internet and mobile phones have minimized those costs for everyone, so the pyramid isn't adding as much value as it used to, while the cost of the single point of failure is still there.

There are some structures and strategies though that can help with this.  They are used in organizations and projects that are designed with failure in mind.  These organizational models admit from the outset that there will be failures and rather than pretend that all is well and there will be someone to choke if anything goes wrong, they make failure part of the equation.

With the BP oil disaster devastating the Gulf of Mexico, the failure of 247 US banks and millions unemployed due to the economic meltdown, people are starting to wake up to the enormous risks and costs of these single points of failure.  Simultaneously, open source software, open data, grassroots communities and cooperatives are becoming increasingly popular as people start to look for alternative ways to get things done.

Smart companies, governments and other organizations are letting go of "command and control" and are discovering game-changing philosophies based on engagement and collaboration that give them an edge that is, not surprisingly, almost non-existent in the traditional corporate form.

Collaboration, gifting and doing things for the sheer joy of working and contributing to the world and enhancing the quality of life of others are being rediscovered. And while we speak of these things as new, they are as old as civilization itself and were here long before the corporate form and will be here long after.