23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is both entered by hand and imported from multiple sources and is curated by our team of volunteers.

With a name like "OpenDataBC" you would think it would be pretty straightforward to put such a site together: take the available catalogues from Nanaimo, Vancouver and the province, stick them together and voilà, a catalogue is born. But it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" and that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In its ideal form, what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open, we think it's useful to promote the use of data that has been released, acknowledging the data that is more open (doing the right thing) while encouraging the data that is less open to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find.  For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here are the questions we consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although a scanned image of a map with gold stickers pasted on it is technically data, it is not something that a programmer can use.  What we look for is machine readable data: documents or electronic files published in formats that a software program can read easily and consistently without errors.  Databases, spreadsheets and CSV files are all examples of machine readable electronic data, so they are considered more usable.  PDF files, Word documents and scanned images are technically readable by a software program, but reading them is difficult and time consuming, so they are less usable.
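As a rough illustration, a first-pass check could simply look at the file extension. This is only a sketch: the function name and the extension lists are my own assumptions drawn from the formats mentioned above, and an extension says nothing about whether the contents are actually clean.

```python
import os

# Assumed extension lists based on the formats named in the text; adjust as needed.
MORE_USABLE = {".csv", ".xls", ".xlsx", ".ods", ".sqlite"}
LESS_USABLE = {".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def readability_hint(filename):
    """Return a rough machine-readability label based on the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in MORE_USABLE:
        return "more usable"
    if ext in LESS_USABLE:
        return "less usable"
    return "unknown"
```

A catalogue entry whose files all come back "less usable" is a good candidate for a note asking the publisher for a CSV or spreadsheet version.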

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and get the data in return.
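To make "a simple program" concrete, here is a minimal sketch using only the Python standard library. It assumes the dataset is a CSV file with a header row served directly at a URL; the function name is hypothetical and there is no error handling.

```python
import csv
import io
from urllib.request import urlopen

def fetch_csv_rows(url, encoding="utf-8"):
    """Download a CSV file from `url` and return its rows as dictionaries."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    # DictReader uses the first line of the file as the column names.
    return list(csv.DictReader(io.StringIO(text)))
```

If getting the data takes much more than this (a login, a form, a click-through agreement), the dataset fails the accessibility question.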

3. Is it published in an open format?
From Wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean: am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which it can be used, or imposes any conditions or constraints on the user, is not free.  For example, if I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data that is released under a common license, such as a Creative Commons license or one that conforms to the Open Knowledge Definition, is preferred over data released under a license created by the party releasing it, because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem because once the license is learned for one dataset, it is understood for every other dataset released under the same license.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.
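Two of these checks are easy to sketch in code: counting missing values, and finding gaps in a daily time series. The function names are my own, and the second check assumes one record per day. Note that it can only find gaps between the first and last dates present; it cannot tell whether the series really starts at the first record that ever existed.

```python
from datetime import date, timedelta

def missing_values(rows, required_fields):
    """Count rows with an empty or absent value, per required field."""
    counts = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts

def missing_days(dates):
    """Return the days absent between the first and last date of a daily series."""
    present = set(dates)
    if not present:
        return []
    day, last = min(present), max(present)
    gaps = []
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps
```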

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.
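One way to check this, sketched under the assumption that the publisher states an update schedule and the catalogue records when the data last changed (both are assumptions, as is the function name):

```python
from datetime import datetime, timedelta

def is_stale(last_updated, expected_interval, now=None):
    """Return True if the dataset is older than its expected update interval."""
    if now is None:
        now = datetime.now()
    return now - last_updated > expected_interval
```

For example, a dataset that is supposed to be refreshed weekly would be checked with `is_stale(last_updated, timedelta(days=7))`.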

The plan is to add to this list and to refine the questions as we move along and gain experience with it. Because we apply a standardized set of questions, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently, and less usable data will be identified as such so that the issues contributing to its less usable status can be addressed.
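We have not settled on how the answers turn into a ranking. As a starting sketch, assuming each of the eight questions gets a yes/no answer and all questions count equally (both assumptions, as are the names below), a score could just be the number of "yes" answers:

```python
# One key per question in the list above; the key names are hypothetical.
QUESTIONS = [
    "machine_readable", "accessible", "open_format", "free",
    "common_license", "no_fee", "complete", "timely",
]

def usability_score(answers):
    """Count the questions answered 'yes' (True); unanswered ones count as 'no'."""
    return sum(1 for q in QUESTIONS if answers.get(q) is True)

def rank_datasets(datasets):
    """Sort (name, answers) pairs from most to least usable."""
    return sorted(datasets, key=lambda item: usability_score(item[1]), reverse=True)
```

Equal weighting is probably too simple. A dataset that is not free arguably should not outrank one that is merely out of date, so weights or hard requirements may be needed as the scale evolves.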

Please let me/us know if you think we're missing something or if something here needs adjusting.

