23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is both entered by hand and imported from multiple sources and is curated by our team of volunteers.

Being a site called "OpenDataBC" you would think it would be pretty straightforward to put such a site together. Take the available catalogues from Nanaimo, Vancouver and the province and stick them together and voila, a catalogue is born. But, it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In it's ideal form what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open we think that it's useful to promote the use of data that's been released while acknowledging data that is more open (doing the right thing) while at the same time encouraging the data that is less open, to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find.  For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here I present the following questions as questions to consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although technically a scanned image of a map with gold stickers pasted on it is data, is not something that a programmer can use.  What we look for is machine readable data.  Documents or electronic files containing data that are published in formats that a software program can ready easily and consistently without errors is considered machine readable.  A databases, spreadsheets, CSV files are all examples of machine readable electronic data that are easily readable, thus they are considered more usable.  PDF files, word documents, scanned images - while technically readable by a software program - it's not easy and it is time consuming, thus this it's less usable.

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and in return I get the data.

3. Is it published in an open format?
From wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free.  For example, if  I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data that is released under a common license, such as the Creative Commons license or the Open Knowledge Definition are preferred over licenses created by the party releasing the data because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem because once the license is learned for one dataset that license is understood and can be applied to other datasets released similarly.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.

The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions to ask, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently and less usable data will be identified as such so the issues that are contributing to it's less usable status can be addressed.

Please let me/us know if you think we're missing something or of something here needs adjusting.

16 comments:

markson said...

for example, Spark, R programming, Python just as beneficial programming like SPSS and SAS. ExcelR Data Science Courses

Data Science pune said...

Such a very useful article. Very interesting to read this article. I would like to thank you for the efforts you had made for writing this awesome article.
Data Science Course in Pune
Data Science Training in Pune

Data Science Bangalore said...

Very interesting blog. Many blogs I see these days do not really provide anything that attracts others, but believe me the way you interact is literally awesome.You can also check my articles as well.

Data Science In Banglore With Placements
Data Science Course In Bangalore
Data Science Training In Bangalore
Best Data Science Courses In Bangalore
Data Science Institute In Bangalore

Thank you..

DataScience Datascience said...

It was good experience to read about dangerous punctuation. Informative for everyone looking on the subject.
Data Science Course in Bangalore

DataScience Datascience said...

Excellent work done by you once again here. This is just the reason why I’ve always liked your work. You have amazing writing skills and you display them in every article. Keep it going!
Data Science Training in Bangalore

Data Science Institute In Banglore said...

Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work!
Data Science Institute in Bangalore

Best Data Science Courses In Bangalore said...

After reading your article I was amazed. I know that you explain it very well. And I hope that other readers will also experience how I feel after reading your article.
Data Science Certification in Bangalore

Data Science Certification said...

I just found this blog and hope it continues. Keep up the great work, it's hard to find good ones. I added to my favorites. Thank you.

360DigiTMG Data Science Certification

Tableau Course in Raipur - 360DigiTMG said...

Truly incredible blog found to be very impressive due to which the learners who ever go through it will try to explore themselves with the content to develop the skills to an extreme level. Eventually, thanking the blogger to come up with such an phenomenal content. Hope you aarrive with the similar content in future as well.

360DigiTMG Ethical Hacking Course

Data Analytics Course in Raipur - 360DigiTMG said...

Extraordinary blog went amazed with the content that they have developed in a very descriptive manner. This type of content surely ensures the participants to explore themselves. Hope you deliver the same near the future as well. Gratitude to the blogger for the efforts.

360DigiTMG Cloud Computing Course

hrithiksai said...

Found your post interesting to read. I cant wait to see your post soon. Good Luck for the upcoming update. This article is really very interesting and effective, data sciecne course in hyderabad

Data Science training - 360DigiTMG said...

Tremendous blog quite easy to grasp the subject since the content is very simple to understand. Obviously, this helps the participants to engage themselves in to the subject without much difficulty. Hope you further educate the readers in the same manner and keep sharing the content as always you do.

360DigiTMG Data Science Course

EXCELR said...

I am really enjoying reading your well written articles. It looks like you spend a lot of effort and time on your blog. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work. data science training in Hyderabad

Tech Institute said...

Top quality information with unique content and excellent writing thank you.
Data Science Course in Hyderabad 360DigiTMG

Data Science Institute In Banglore said...


I'm really thankful that I read this. It's extremely valuable and quite informative and I truly learned a great deal from it.

360DigiTMG Data Science Training Institute in Bangalore

Best Data Science Courses In Bangalore said...

Additionally, this is an excellent article which I truly like studying. It's not everyday I have the option to see something similar to this.
Data Science Course In Bangalore With Placement