Of all the concepts that data users have to concern themselves with, copying and its legal ramifications have to be among the most important. All technology is a product of copying, but in the last 100 years media companies have invented and promoted the idea that copying is theft. It's one thing to have some difficulty and possibly incur some costs accessing and using data; it's an entirely different thing to have to worry about being prosecuted or sued for using it. Developers are rightly cautious about the liabilities associated with copying data.
The environment of caution around copying affects people releasing, promoting and using open data. If it's not perfectly clear to developers that they can use your data, they often won't, for this reason alone. Developers have enough to think about. If they have to wade through your license or terms of agreement to understand all the ramifications, they usually won't bother. If they do decide to read it, even one clause that looks risky is enough to drive them away.
There are lots of interesting project ideas and few developers who can make them happen. If you want people to use your data, it makes sense to make that process as easy and risk-free as possible. Fortunately, there are several ways to do that. But first, what exactly is copyright protecting? When we talk about data in this context, we are talking mostly about facts and collections of facts.
The interesting thing about facts when it comes to copyright is that it's generally accepted that facts themselves do not enjoy any copyright protection. The reason is that copyright generally applies only to creative works, and facts, by their very definition, are not creative. If I present a number to you as the truth, then I am saying that I didn't make up the number; at best I discovered it, but it already existed. If I present a number to you and say I made it up, in other words that I invented it from nothing, then it's not a fact; it's made up.
It's for this reason that the idea of licensing data strikes me as odd. To grant someone a license to do something they can already do is redundant. Attempting to restrict use by means of a license is equally strange. If someone already has the right to use something, then restricting that right would require their agreement, and why would anyone agree to fewer rights than they already have?
I think the USA is on the right track here. My preference is that all non-personal government data be released into the public domain. Public domain is easy to understand, and it fully opens the door to innovation by giving developers raw materials they can really use to create valuable apps.
Public servants with responsibility for publishing government data have decisions to make when it comes to making that data available to citizens on the internet. Along with readability and usability, which I have covered in previous posts, a third aspect of open data is accessibility.
The purpose of releasing data as open data is to enable people to use your data. To use that data they have to get it from your computer to their computer. There are a variety of ways to do that, but in 2010 that means the Internet and HTTP. If your datasets are very large, anonymous FTP would likely also be acceptable to many developers. However, HTTP is by far the simpler protocol to use. It has many advantages over FTP from a developer's perspective, and it is just as easy to set up from a publisher's perspective.
Accessibility is just as important as readability. Where poor readability imposes a one-time cost on developers, poor accessibility imposes an ongoing transactional cost. As a developer, I can write scripts to decode data provided in proprietary formats like XLS or SHP (so long as that's still legal in Canada - locks today, proprietary formats tomorrow?). It's still costly in terms of my time, but once the script is written, I can run it over and over again with no effort. Poor accessibility, on the other hand, sometimes means that I can't readily automate the process. If I can't automate it, then every time I want to use your data, I incur a cost in my time to download it manually. That may be fine for users who only want to download your data once. But for the developers you want to encourage to use your data as a platform for building valuable applications, it's a barrier they are unlikely to cross because of the high transactional cost.
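To make the automation point concrete: when a dataset sits at a plain HTTP URL, a repeatable download can be a few lines of standard-library Python. This is only a minimal sketch; the function name `fetch_dataset` and the URL and file path in the closing comment are hypothetical, not a real endpoint.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dataset(url: str, dest: Path) -> Path:
    """Download `url` to `dest` and return the destination path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # A plain URL is all that is needed: no login, form or check box.
    with urllib.request.urlopen(url, timeout=30) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Hypothetical usage, e.g. from a nightly scheduled job:
# fetch_dataset("http://data.example.gov/cities.csv", Path("data/cities.csv"))
```

Once a login screen or mail-back step sits in front of the file, a script like this no longer works unattended, and that is where the ongoing cost comes from.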
In some cases folks use login screens or mail-back mechanisms to track who is accessing our government data. In some cases there are check boxes for so-called "agreements" or "contracts" that are meant to force people into some sort of agreement before they use our data. The worst, of course, are cost recovery models where we are forced to pay for our data twice: first as taxpayers and again as users.
Open data is emerging from an era where the status quo belief was that government data had to be locked down. Whether or not that was ever true is debatable, but it's clearly not true in 2010.
Once publishers realize that open data is likely to be re-purposed and redistributed in different forms anyway, in ways that these methods won't track, the question becomes: what are you really measuring that you couldn't measure with a simple, unobtrusive web site and its access log?
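As a rough illustration of what an access log alone can tell a publisher, here is a sketch that counts successful downloads per dataset. It assumes the web server writes Common Log Format lines and that datasets live under a `/data/` path; both are assumptions for the example, as are the sample log lines.

```python
import re
from collections import Counter

# Pulls the request method, path and status code out of a Common Log Format line.
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_downloads(lines, prefix="/data/"):
    """Count successful GET requests per path under `prefix`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if (match["method"] == "GET" and match["status"] == "200"
                and match["path"].startswith(prefix)):
            counts[match["path"]] += 1
    return counts


# Made-up log lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [21/Jul/2010:10:15:32 -0700] "GET /data/cities.csv HTTP/1.1" 200 1043',
    '198.51.100.2 - - [21/Jul/2010:10:16:01 -0700] "GET /data/cities.csv HTTP/1.1" 200 1043',
    '198.51.100.2 - - [21/Jul/2010:10:16:05 -0700] "GET /data/parks.json HTTP/1.1" 404 512',
    '203.0.113.9 - - [21/Jul/2010:10:17:40 -0700] "GET /about.html HTTP/1.1" 200 2210',
]

downloads = count_downloads(SAMPLE_LOG)
```

None of this requires asking the user for anything before the download starts.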
When we as publishers talk about accessibility, as with many aspects of open data, it's useful to remind ourselves of the reason we are doing open data in the first place. Making data accessible means making it as easy as possible for developers to gain access to and download the data. Not so you can pass some test of "openness" (although there are good reasons to do that which I will cover in a future post), but so people use your data. You want people to use your data. That's the point.
I am currently attending OSCON 2010 (the Open Source Convention) in Portland, Oregon. It's a conference for free and open source software enthusiasts, developers, hackers and users of all levels. There are about 5,000 people attending this year. I have met a lot of people here: some who are passionate about free software, and some who are learning more about it and how it can provide value to their companies.
It's difficult to overestimate the impact that free and open source software (FOSS) has had on computing and the world in general. First, of course, it powers the internet itself. If you use the internet, you use free and open source software. From the underlying protocols to email to FTP to web sites, it's all powered by free and open source software.
Practically every major web site you can think of (Google, Facebook, Wikipedia, Twitter, Foursquare, Google Maps, ...) makes heavy use of free and open source software. These companies measure traffic in many millions of users and billions of pages per month.
The Apache Web Server, for example, has been the most popular web server since April 1996 and powers almost 70 percent of all websites on the planet. There are free and open source operating systems, programming languages, office productivity suites, collaboration suites, web browsers, file and print servers and much more. There is a free and open source version of practically any software you can think of (and many that you haven't thought of).
And yet, here we are in 2010 and some are still not convinced that open source is suitable for government use. They are not convinced that this software, developed by communities of generous and smart people, is reliable, secure or supported enough for their purposes compared to proprietary solutions such as Internet Explorer. They put all of their trust in single-vendor solutions, rely on companies like Microsoft and Oracle, and believe the stories told by such companies about open source software. That story goes something like this: "It's not enterprise ready... it's of varying quality... there is no support for it... you want to have one throat to choke."
Why aren’t governments using open source software anywhere and everywhere possible? Why do governments continue to seek out solutions with lock-in to certain vendors? Why would we continue to believe the big vendors that promise to be nice? Why do we citizens continue to pay millions upon millions of dollars for software?
Governments are unlike other corporations in that they are making decisions not for their own benefit, but for the benefit of us, the citizens. They don't take that responsibility lightly, so decisions are made with great care and they often don't give themselves permission to try new things. Or if they do, they do THAT with great care and concern because they don't want to make any mistakes with our resources. Trying something innovative looks risky, and so the status quo is long-lived and new approaches are discouraged.
Governments appear to be the last holdout of proprietary software and, as a result, are missing out on an opportunity to engage with and support the communities that support all of us. The rest of the world has figured out that free and open source software is the most secure, the most reliable, the most innovative and the most cost-effective software available. Leading internet companies that earn millions of dollars in revenues and could choose anything they want for their software needs are choosing open source software. We should let our governments know that we want them to choose free and open source software too.
The problem with free and open source software is this: it's hard to make a lot of money with free software. And without a lot of money you can't own a public relations team, and you can't spend a lot of money on armies of salespeople and technical salespeople with pre-written business cases, white papers and other collateral convincing people to use your products. Without a lot of money, you can't schmooze and throw hosted year-end parties for your key clients in every major city.
Instead, with free and open source software, you put everything into the product and let the product speak for itself. You assume that people actually want things to work better. You build communities of people who are passionate about your product - not because it makes them look good - not because it's easier - not even because it's free - but because it provides exceptional value.
In talking with clients and colleagues about open data and open government, one question comes up over and over again: what is open data? The word “data” means a collection or body of facts that represent the qualitative or quantitative attributes of a variable or set of variables, but what does “open data” mean?
To answer this question I like to look at what we are trying to achieve by opening data. The promise of open data is that if we make government administrative data available to the public, value will be created in ways that we may or may not be able to imagine. The value will be created by using the data. So, what is open data? Ultimately, it’s data you can use. In this series of blog posts I will explore the various ways data can be made more usable.
What makes data usable?
In a previous post I proposed some dimensions that move toward a usability scale. In this post I propose a minimum standard of usability. In other words, what are the absolute minimum requirements that must be satisfied in order to consider something open data? To answer this question one could look at the dimensions of usability individually and decide for each one, what would be the minimum level of usability below which data is not usable.
One of the main measures of usability is readability. In other words, how easy is it to read?
For example, this list of cities with their geographic areas and populations is data:
Data collected into rows and columns in this way is typically called a data set (or dataset). By putting this dataset in my blog post I have made it available to you, but because I made it available as a screenshot of my spreadsheet, reading it would be difficult and error-prone, and would require expensive software or scripting. That makes it pretty much unusable by you.
Another method in use by governments today is to publish data as a PDF document. This is marginally better than posting an image. It’s technically possible to extract the data from PDF files, as I have demonstrated in a previous post, but it’s still expensive, time-consuming and error-prone.
What I could do instead is make that same data available as an HTML table in this blog post, like this:
| City | Area | Population |
|---|---|---|
| Victoria | 19.68 | 78057 |
| Vancouver | 114.67 | 578041 |
| Kelowna | 211.69 | 120812 |
Technically, this is a level better than both images and PDF files but it will still get me low points on the usability scale because in order to read it a programmer still has to write a script specifically for reading this data from my blog post, a time consuming and wasteful process. If you’re unfortunate enough to need to read data from an HTML page, another previous blog post describes how to do this.
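To give a sense of the effort involved, here is a sketch of the kind of one-off script a programmer ends up writing to pull the table above out of a page. It uses Python's standard `html.parser` and assumes simple, well-formed markup in which every cell sits inside a `<tr>`. The class name, the function name and the sample markup are illustrative assumptions, not the markup of any real page.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def read_table(html):
    reader = TableReader()
    reader.feed(html)
    reader.close()
    return [[cell.strip() for cell in row] for row in reader.rows]


# Sample markup written for this example.
HTML_TEXT = """
<table>
  <tr><th>City</th><th>Area</th><th>Population</th></tr>
  <tr><td>Victoria</td><td>19.68</td><td>78057</td></tr>
  <tr><td>Vancouver</td><td>114.67</td><td>578041</td></tr>
  <tr><td>Kelowna</td><td>211.69</td><td>120812</td></tr>
</table>
"""

rows = read_table(HTML_TEXT)
```

Even in this tidy case every value comes back as a string, and a real page adds navigation, nested tables and markup changes that break the script without warning.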
To really improve the usability of this data it makes sense to publish it in a format designed to make the data easily accessible. Many people are familiar with spreadsheets, which are a popular tool for reading and manipulating tabular data, so making data available in spreadsheet format makes it more usable in the sense that people can obtain spreadsheet programs to read it. For example, here is the same data published in the open .ODS format supported by a wide variety of spreadsheet software providers, and here it is published in XLS, a proprietary format controlled by the Microsoft corporation.
The advantage to publishing in spreadsheet format is that while still requiring specialized scripts and software to read, at least the rows and columns are well defined which translates into fewer errors. This is what I would consider the minimum bar for usable open data. It's not as usable as I would like, but it is usable without too much risk. In other words, if you have data in this format already and you don't have the budget to reformat it before publishing it, don't delay the release, just publish it as is.
Ideally though data is published in formats specifically designed for the purpose of information sharing, and that’s where the CSV, XML and JSON formats come in.
The CSV version of my dataset looks like this:
"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812
The XML version looks like this:
<dataset>
  <data>
    <row><city>Victoria</city><area>19.68</area><population>78057</population></row>
    <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
    <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
  </data>
</dataset>
and the JSON version looks like this:
[
{"city": "Victoria", "population": 78057, "area": 19.68},
{"city": "Vancouver", "population": 578041, "area": 114.67},
{"city": "Kelowna", "population": 120812, "area": 211.69}
]
While not quite as pretty as the other human-readable formats, CSV, XML and JSON are open formats that provide structure, making it very easy for programs to read the data. They are also well supported in almost all modern programming languages, so any programmer who wants to use your data can do so easily and accurately with free software and very little programming. And as a side benefit, it's very easy and inexpensive to publish your administrative data in these formats using free software.
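As a hedged illustration of "very little programming", the sketch below reads the same three cities from CSV, XML and JSON using only Python's standard library. The embedded sample strings are retyped for the example, with the values from the table above, and the XML assumes the area value is wrapped in an `<area>` element. The helper names (`record`, `from_csv`, `from_xml`, `from_json`) are made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

CSV_TEXT = '''"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812
'''

# Assumes each row carries <city>, <area> and <population> elements.
XML_TEXT = """<dataset><data>
<row><city>Victoria</city><area>19.68</area><population>78057</population></row>
<row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
<row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
</data></dataset>"""

JSON_TEXT = """[
{"city": "Victoria", "population": 78057, "area": 19.68},
{"city": "Vancouver", "population": 578041, "area": 114.67},
{"city": "Kelowna", "population": 120812, "area": 211.69}
]"""


def record(city, area, population):
    """Normalise one row to a dict with typed values."""
    return {"city": city, "area": float(area), "population": int(population)}


def from_csv(text):
    return [record(r["City"], r["Area"], r["Population"])
            for r in csv.DictReader(io.StringIO(text))]


def from_xml(text):
    return [record(row.findtext("city"), row.findtext("area"), row.findtext("population"))
            for row in ET.fromstring(text).iter("row")]


def from_json(text):
    return [record(r["city"], r["area"], r["population"]) for r in json.loads(text)]


# All three formats decode to the same records.
assert from_csv(CSV_TEXT) == from_xml(XML_TEXT) == from_json(JSON_TEXT)
```

Each reader is two or three lines, and none of them has to guess where a row or a column begins.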
Publishing data in these open formats makes it easy for people to use open data. While publishing in HTML format is readable and is what I would consider the bare minimum for usability, depending on how it is done, other formats can make it much easier. And if your organization is serious about engaging people to collaborate and create value from the data, it will want to make the data as usable as possible, and making the data readable is one part of doing that.
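On the publishing side, a similar sketch shows how rows already held in a program or exported from a database might be written out in all three formats with free software. The row data repeats the example table; the function names and the lower-case field names are assumptions for illustration, not a prescribed layout.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

FIELDS = ["city", "area", "population"]
ROWS = [
    {"city": "Victoria", "area": 19.68, "population": 78057},
    {"city": "Vancouver", "area": 114.67, "population": 578041},
    {"city": "Kelowna", "area": 211.69, "population": 120812},
]


def to_csv(rows):
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_xml(rows):
    """Serialise rows as <dataset><data><row>...</row></data></dataset>."""
    dataset = ET.Element("dataset")
    data = ET.SubElement(dataset, "data")
    for row in rows:
        node = ET.SubElement(data, "row")
        for name in FIELDS:
            ET.SubElement(node, name).text = str(row[name])
    return ET.tostring(dataset, encoding="unicode")
```

Writing the three strings to files on a web server is then all that remains.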
I am currently involved in a project named OpenDataBC. OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia. It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is both entered by hand and imported from multiple sources and is curated by our team of volunteers.
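With a name like "OpenDataBC", you would think it would be pretty straightforward to put such a site together. Take the available catalogues from Nanaimo, Vancouver and the province, stick them together and voila, a catalogue is born. But it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" and that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.
The definition of "open" as it relates to data is still evolving at a rapid pace. In its ideal form what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions. Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today. Rather than soften the definition of open, we think it's useful to promote the use of data that has been released, acknowledging data that is more open (doing the right thing) while encouraging data that is less open to evolve.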
Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find. For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here are the questions we consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.
1. Is it machine readable electronic data?
Although a scanned image of a map with gold stickers pasted on it is technically data, it is not something that a programmer can use. What we look for is machine readable data: documents or electronic files published in formats that a software program can read easily and consistently without errors. Databases, spreadsheets and CSV files are all examples of machine readable electronic data, and so they are considered more usable. PDF files, Word documents and scanned images are technically readable by a software program, but reading them is difficult and time consuming, and so they are less usable.
2. Is it accessible?
I should be able to get it easily over the internet. I should be able to get it on demand, with a simple program using open source software. I should not have to submit a form to get it. I should be able to enter a URL and in return I get the data.
3. Is it published in an open format?
From Wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."
4. Is it free?
In this context, I mean am I free to use this data however I want? Can I use it to produce a product that I sell? Can I combine it with other data and publish it? Can I sell a copy of it? Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free. For example, if I have to enter into an agreement to use it, it's not free.
5. Is it released under a common license?
Data released under a common license, such as a Creative Commons license or a license that conforms to the Open Knowledge Definition, is preferred over data released under a license created by the party releasing it, because licenses are hard to understand. The more time people have to spend understanding the license in order to use the data, the less usable the data is. Common licenses address this problem because once the license is learned for one dataset, it is understood and can be applied to other datasets released the same way.
6. Is it provided without a fee?
The data needs to be available at no cost to the user. If it costs money, it's less usable and it's not open data.
7. Is it complete?
Data should not be missing values that ought to be there. If it's point-in-time data it should include all of the relevant information for that point in time. If it's time series data, it should include the entire time series from the first record to the most recent record. If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.
8. Is it timely?
The data should have the most up to date information as soon as it is available. Ideally the data is available as an updated feed or at least updated on a regular schedule. If the data is a feed, it should be available in as near real time as possible.
The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions to ask, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently and less usable data will be identified as such so the issues that are contributing to it's less usable status can be addressed.
Please let me/us know if you think we're missing something or of something here needs adjusting.
With a site called "OpenDataBC", you would think it would be pretty straightforward to put such a site together: take the available catalogues from Nanaimo, Vancouver and the province, stick them together and voilà, a catalogue is born. But it's actually not that easy. The site is named OpenDataBC because we want to pay particular attention to "Data" that is "Open" and that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.
The definition of "open" as it relates to data is still evolving at a rapid pace. In its ideal form, what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions. Open data does not require additional permission, agreements or forms to be filled out, and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today. Rather than soften the definition of open, we think it's useful to promote the use of data that has been released, acknowledging the data that is more open (doing the right thing) while encouraging the data that is less open to evolve.
Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end, OpenDataBC will highlight the most usable datasets that we can find. For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here are the questions we consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.
1. Is it machine readable electronic data?
Although a scanned image of a map with gold stickers pasted on it is technically data, it is not something that a programmer can use. What we look for is machine readable data. Documents or electronic files that are published in formats a software program can read easily, consistently and without errors are considered machine readable. Databases, spreadsheets and CSV files are all examples of machine readable electronic data, and are thus considered more usable. PDF files, Word documents and scanned images are technically readable by a software program, but reading them is not easy and is time consuming, so they are less usable.
2. Is it accessible?
I should be able to get it easily over the internet. I should be able to get it on demand, with a simple program using open source software. I should not have to submit a form to get it. I should be able to enter a URL and in return I get the data.
3. Is it published in an open format?
From Wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."
4. Is it free?
In this context, I mean am I free to use this data however I want? Can I use it to produce a product that I sell? Can I combine it with other data and publish it? Can I sell a copy of it? Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free. For example, if I have to enter into an agreement to use it, it's not free.
5. Is it released under a common license?
Data that is released under a common license, such as a Creative Commons license or a license that conforms to the Open Knowledge Definition, is preferred over data released under a license created by the party releasing the data, because licenses are hard to understand. The more time people have to spend understanding the license in order to use the data, the less usable the data is. Common licenses address this problem because once the license is learned for one dataset, that license is understood and can be applied to other datasets released similarly.
6. Is it provided without a fee?
The data needs to be available at no cost to the user. If it costs money, it's less usable and it's not open data.
7. Is it complete?
Data should not be missing values that ought to be there. If it's point-in-time data it should include all of the relevant information for that point in time. If it's time series data, it should include the entire time series from the first record to the most recent record. If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.
8. Is it timely?
The data should have the most up to date information as soon as it is available. Ideally the data is available as an updated feed or at least updated on a regular schedule. If the data is a feed, it should be available in as near real time as possible.
The plan is to add to this list and to refine the questions as we move along and gain experience with it. With a standardized set of questions applied to each dataset, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently, and less usable data will be identified as such so that the issues contributing to its less usable status can be addressed.
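One way to picture how such a ranking might work is a simple checklist score. The sketch below is only an illustration: the question keys, the equal weighting and the example datasets are all assumptions of mine, not a scale that OpenDataBC has settled on.

```python
# Hypothetical checklist: one key per question in the list above.
QUESTIONS = [
    "machine_readable",
    "accessible",
    "open_format",
    "free",
    "common_license",
    "no_fee",
    "complete",
    "timely",
]


def usability_score(answers):
    """Count how many of the eight questions are answered 'yes'.

    `answers` maps question keys to True/False; missing keys count as 'no'.
    Equal weighting is an assumption made for simplicity.
    """
    return sum(1 for q in QUESTIONS if answers.get(q, False))


def rank_datasets(datasets):
    """Return dataset names ordered from most to least usable."""
    return sorted(datasets, key=lambda name: usability_score(datasets[name]), reverse=True)


# Made-up example data, for illustration only.
example = {
    "scanned_map.pdf": {"accessible": True, "no_fee": True},
    "ridership.csv": {q: True for q in QUESTIONS},
}
print(rank_datasets(example))
```

A real scale would probably weight some questions more heavily than others, but even a plain count like this is enough to sort a catalogue so that the most usable datasets come first.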
Please let me/us know if you think we're missing something or if something here needs adjusting.
As I write this in 2010, our political, economic, cultural and social systems in the western world are for the most part driven by corporations. Our art is subsidized by corporations, our charities are funded by corporations, our culture is promoted by corporations and our laws are defined by corporations.
These corporations come in various forms, whether they are businesses, governments, or religious corporations. In many ways the corporate form is very useful. It's one way to provide a structure for people to work toward a common goal. It provides some level of predictability. And in some cases, it provides for economies of scale.
Much has been written about the weaknesses of the corporate form and the corruption it attracts, so I leave that to others to draw attention to. But there is one aspect of the corporate form that I don't see being written about: it represents a single point of failure. Big corporations have big failures. We permit corporations to grow infinitely large and then rely on them not to fail. But they do fail, we know they fail, and we permit it anyway.
In the enterprise software space the "one throat to choke" mantra is used to persuade the listener that putting all of your eggs in one basket is a good thing. What it hides, though, is the fact that the vendor represents a single point of failure, and when it's a large project, that often means a large failure. The high failure rate of large IT projects is well known, but how often are these throats actually choked? Almost never.
If these large organizations never failed, that would be one thing, but the fact is, they do fail. And somehow when we are voting, or shopping, or doing our cost benefit analysis and selecting the vendor, we forget that the entity we are dealing with may not be around tomorrow. The "bet the farm" ideology invented hundreds of years ago and popularized during the industrial revolution is showing its age in our current distributed, global world.
When the costs of communication were high it made a lot of sense to build organizations as hierarchies to minimize the costs of communication through a top down pyramid and command and control theme. This model was so efficient that it offset the risks of the single point of failure. Today though, we have the internet and mobile phones minimizing those costs for everyone so the pyramid isn't adding as much value as it used to and the cost of the single point of failure is still there.
There are some structures and strategies though that can help with this. They are used in organizations and projects that are designed with failure in mind. These organizational models admit from the outset that there will be failures and rather than pretend that all is well and there will be someone to choke if anything goes wrong, they make failure part of the equation.
With the BP Oil Disaster destroying the Gulf of Mexico, the failure of 247 US banks and millions unemployed due to the economic meltdown, people are starting to wake up to the enormous risk and costs of these single points of failure. Simultaneously, open source software, open data, grass roots communities and cooperatives are becoming increasingly popular as people start to look for alternative ways to get things done.
Smart companies, governments and other organizations are letting go of "command and control" and are discovering game changing philosophies based on engagement and collaboration that give them an edge that is, not surprisingly, almost non-existent in the traditional corporate form.
Collaboration, gifting and doing things for the sheer joy of working and contributing to the world and enhancing the quality of life of others are being rediscovered. And while we speak of these things as new, they are as old as civilization itself and were here long before the corporate form and will be here long after.
In this OpenDataBC series of posts, I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. In the first article of this series, I described how to write a script to scrape catalog data from web pages. In the second article I described how to write a program to transform the data. In this article, I describe how to convert a PDF document into useable data.
As part of the Apps for Climate Action Contest, the Province of BC released over 500 datasets in the Climate Action Data Catalogue. It's an impressive amount of data pulled from an array of sources both within BC and elsewhere.
In an ideal “open data” world, all of that data would be in an easily machine readable format that we could use to write programs directly. While that would be great, the reality today is a bit different. Much of the data that is made publicly available these days is in formats that are harder to use. For example, some of the data in the Climate Action Data Catalogue was released in PDF format. PDF began as a proprietary format controlled by one party, the Adobe corporation, and although its specification was published as an ISO standard (ISO 32000-1) in 2008, it is designed for page layout rather than for exchanging data.
An interesting fact is that it takes extra effort to get data from its raw form into PDF format. In other words, to publish data in an open format rather than in PDF format actually saves time, effort and money – up front. However, PDF became well established in the pre-open world, so a lot of data is already published using it. To switch existing software to publish in an open format might take time. As a result, at least temporarily, we need to find ways to get at the data in the PDF files.
In this post I describe how to do that. Looking through some of the available datasets in the catalogue, one that I find interesting is “Transit Ridership in Metro Vancouver”. The data is produced by Translink and is in a PDF format and looks like this:
What I am interested in is the number of passenger trips by year for the past few years. I am going to leave out the Seabus and the West Coast Express as I am mostly interested in the buses and the Skytrain.
What I would like is a dataset, in a CSV file. The way this program will work is essentially as follows:
- read the PDF file from the source web site into memory
- extract the data from the PDF file into a list in memory
- write the list in memory out to a CSV file
Prerequisites
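The following code requires the Python programming language, which comes pre-installed on most Linux and Mac machines and can be easily installed on Windows. It is written for Python 2 (it uses urllib2, StringIO and long) and also needs the pdfminer library, which is introduced below.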
The Code
The first thing we need to do is to read the PDF file into memory. The simple way to do that in Python is to use the urllib2 library and read the entire PDF from the original web site. Tying the script to the actual location of the file means we don't manually store the original file anywhere. If Metro Vancouver decided to move the file to a different URL we would have to adjust our code, but we're probably only going to run this code once so it's not a big deal. To read the PDF file into a memory variable we do this:
import urllib2
url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
pdf = urllib2.urlopen(url).read()
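If you are on Python 3, where urllib2 no longer exists, the same step can be sketched with urllib.request. The helper name read_url is my own invention for illustration; urlopen and read are the standard library calls.

```python
from urllib.request import urlopen


def read_url(url):
    """Fetch a URL and return the whole response body as bytes."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Always release the connection, even if the read fails.
        response.close()
```

Calling read_url(url) with the address above would give the same bytes that the urllib2 version stores in pdf, assuming the file is still published at that location.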
Now that we have the PDF file in memory, I want to parse the PDF file and turn it into raw text. To do this I use a free open source Python library called pdfminer. I have created a function called pdf_to_text for this purpose. Here's the function:
def pdf_to_text(data):
    # import the pdfminer pieces needed for the conversion
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    import StringIO
    # put the raw PDF data into a seekable in-memory file
    fp = StringIO.StringIO()
    fp.write(data)
    fp.seek(0)
    # the converter writes the extracted text into outfp
    outfp = StringIO.StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    process_pdf(rsrcmgr, device, fp)
    device.close()
    t = outfp.getvalue()
    outfp.close()
    fp.close()
    return t
The pdf_to_text function starts by importing the components required to do the conversion. The pdfminer library provides a lot of functionality. In this example we are using a small fraction of it to do what we need, which is to get at the content in the PDF. The main function that actually does the work is called process_pdf. It takes a PDFResourceManager object, a TextConverter object and a file object as parameters, so the code before that call is setting up those parameters properly. I use a StringIO object rather than just passing in the object returned by urllib2 because the PDF converter needs the seek method for random access, which that object does not support. To gain this ability I put the data into a StringIO object, which supports seek.
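To see why the extra buffer matters, here is a small sketch of the same idea. It is written for Python 3, where io.BytesIO plays the role that StringIO.StringIO plays in the function above; the sample bytes are made up and stand in for a downloaded PDF.

```python
import io

# Made-up bytes standing in for the PDF content read from the web.
data = b"%PDF-1.4 pretend this is a whole PDF file"

# Wrap the bytes in an in-memory file object that supports random access.
fp = io.BytesIO(data)

header = fp.read(8)   # read the first eight bytes
fp.seek(0)            # jump back to the start, as a PDF parser needs to
again = fp.read(8)    # the same eight bytes can be read a second time

print(fp.seekable(), header == again)
```

Because the whole file is held in memory, this approach suits small documents like the one used here; a very large PDF might be better saved to disk first.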
When the pdf_to_text function is called with the contents of a PDF file it returns a string containing lines of text with each line containing one of the elements (numbers or labels) of the PDF file. Here's what it looks like on my system:
Now that we have the data in text format, we want to pull out the numbers that we are interested in. I am interested in the labels on the left, which start on line 6, the first numeric column (BUS), which starts on line 75 and the second numeric column (SKYTRAIN), which starts on line 144.
To start the process of extracting rows of data from the text, I first convert the PDF to text and split the text into a list of lines like this:
text = pdf_to_text(pdf)
lines = text.splitlines()
Then I create a simple function called grab_one_row which besides having a very clever name, knows the relative placement of the three columns, and pulls one whole row at a time from the text file and returns it as a tuple. Here is the function:
def grab_one_row(lines,n):
    # the three columns sit 69 lines apart in the extracted text
    return (lines[n],lines[n+69],lines[n+138])
Armed with that function, I can now collect most of the rows I am interested in with a simple list comprehension:
rows = [grab_one_row(lines,i) for i in range(6,26)]
In the original PDF, the data for 2008 is placed further down the page so the last line needs to be added with a separate line of code like this:
rows.append(grab_one_row(lines,39))
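The column arithmetic is easier to see on a small made-up example. In the sketch below the offsets are parameters rather than the fixed 69 and 138 used above, and the sample lines are invented for illustration; they are not the real contents of the ridership PDF.

```python
def grab_row(lines, n, offsets):
    """Return one logical row by reading index n plus each column offset."""
    return tuple(lines[n + off] for off in offsets)


# Invented text: three labels, then three bus values, then three SkyTrain
# values, each column laid out as a run of consecutive lines.
lines = [
    "YEAR", "1989", "1990",
    "BUS", "100", "110",
    "SKYTRAIN", "20", "25",
]

offsets = (0, 3, 6)  # distance from the label column to each column
rows = [grab_row(lines, i, offsets) for i in range(0, 3)]
print(rows)
```

Each tuple gathers the values that belong to one row of the original table, even though the text extractor emitted the table one column at a time.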
Now the rows list contains all of the data we are interested in, in a form that we can easily deal with. We just need to write the rows out to a CSV file to complete our work. To do that I created the rows_to_csv function. Here it is:
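def rows_to_csv(rows,filename):
    # write the clean data out to a file
    import csv
    # Python 2's csv module expects the file to be opened in binary mode
    f = open(filename,'wb')
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    # the first row is written unchanged
    writer.writerow(rows[0])
    for row in rows[1:]:
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',',''))))
    # close the file so everything is flushed to disk
    f.close()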
I wanted the resulting CSV file to have numbers rather than strings containing numbers for the numeric values. The writerow call inside the loop strips out the commas that were in the numbers in the PDF file and then converts the text to a long integer to be written to the CSV file.
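Here is a small sketch of that conversion and of what QUOTE_NONNUMERIC does to the output. It is written for Python 3, so it uses int where the function above uses long, and it writes to an in-memory buffer instead of a file; the helper name to_int and the two sample rows are made up.

```python
import csv
import io


def to_int(text):
    """Strip thousands separators and convert the text to an integer."""
    return int(text.replace(",", ""))


# Invented rows in the same shape as the rows list: a header, then data.
rows = [("YEAR", "BUS", "SKYTRAIN"), ("1989", "1,234,567", "89,000")]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(rows[0])  # the header stays as text
for label, bus, skytrain in rows[1:]:
    writer.writerow((label, to_int(bus), to_int(skytrain)))

print(buf.getvalue())
```

Text fields come out wrapped in quotes while the integers are written bare, which is what lets a CSV reader tell the two apart.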
The resulting CSV file now looks like this:
This result is a lot easier to deal with than the original PDF file. Arguably, a small file such as this could also be converted with Open Office Spreadsheet by cutting from the PDF and pasting to the spreadsheet. The nice thing about doing this as a script as above is that we can use this same technique for very large PDF files that would be too onerous to do manually.
Here is the entire program with all of the code together at once:
def pdf_to_text(data):
    # import the pdfminer pieces needed for the conversion
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    import StringIO
    # put the raw PDF data into a seekable in-memory file
    fp = StringIO.StringIO()
    fp.write(data)
    fp.seek(0)
    # the converter writes the extracted text into outfp
    outfp = StringIO.StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    process_pdf(rsrcmgr, device, fp)
    device.close()
    t = outfp.getvalue()
    outfp.close()
    fp.close()
    return t
def grab_one_row(lines,n):
    # the three columns sit 69 lines apart in the extracted text
    return (lines[n],lines[n+69],lines[n+138])
def rows_to_csv(rows,filename):
    # write the clean data out to a file
    import csv
    # Python 2's csv module expects the file to be opened in binary mode
    f = open(filename,'wb')
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    # the first row is written unchanged
    writer.writerow(rows[0])
    for row in rows[1:]:
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',',''))))
    # close the file so everything is flushed to disk
    f.close()
def run():
    import urllib2
    url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
    outfilename = 'translink_bus_skytrain_trips_1989_2008.csv'
    pdf = urllib2.urlopen(url).read()
    text = pdf_to_text(pdf)
    lines = text.splitlines()
    rows = [grab_one_row(lines,i) for i in range(6,26)]
    # the 2008 data sits further down the page
    rows.append(grab_one_row(lines,39))
    rows_to_csv(rows,outfilename)
if __name__ == '__main__':
    run()
and you can find the resulting CSV file here.
Once again, Python comes through for us. Clearly it's not as easy to convert a PDF file as it is to rip a table out of an HTML file, but being possible at all makes it something we can work with. And part of the beauty of “Open” is that now that I have done it, others don't have to. And I in turn will benefit from other contributors to the open ecosystem. If we all do a bit, it's an “everyone wins” scenario.