07 December 2010

Terms of Use

I am not a lawyer, and I do not give legal advice. I am a developer who uses open data on a regular basis and as such I spend a lot of time, but probably not nearly as much as I should, trying to understand open data licenses and terms of use that governments post on their open data portals.

The whole point of open data is to liberate data so it can be used. An organization's open data strategy then should be working toward that end result, encouraging and making it as easy as possible for people to use the data.

One of the first things developers think about when contemplating writing an application using government data is, "am I going to get in trouble?" This question, however absurd it may seem, is very real to developers. If developers think they are going to get into some sort of trouble using a data source, they will usually not create the app, which means governments and the citizens they serve lose an opportunity.

If governments are going to release data, the most important thing is to release it in a way that is easy to understand from a legal perspective, preferably in a way that developers are already familiar with. There are already many licenses in use, so inventing new licenses rather than releasing data under commonly understood mechanisms is a waste of effort on everyone's part.  As Chris Rasmussen said at the recent OpenGovWest BC Conference, "We all think that our data is unique.  It's not true."

Unfortunately, many custom licenses in use today are full of things that don't need to be there, like "you cannot break the law with this data" or "you can't say you're us".  I already know I can't break the law.  A disclaimer makes sense - but it doesn't need to be part of a license.  Being clear about preferences around attribution makes sense, but these can go in a policy statement offered for clarity rather than in a license.

In my opinion, the best license is no license at all. It's just public domain. Many government organizations consider their open data to be public domain but don't go that extra step and actually state it on their web site. That's unfortunate, because public domain is by far the simplest, easiest and least expensive way to release data, and by not stating it explicitly, they leave developers still wondering if they'll get sued.

  Herb's Ideal Open Data Declaration
   * This data is in the public domain.
   * It comes with no guarantees.

Please consult with lawyers who "get" open data.  Before liberating the data, see if you can go public domain explicitly rather than implicitly, or consider using the Creative Commons Zero tool, and work together to make data we can all use.

13 November 2010

The Power of Open

The OpenGovWest BC conference is now complete. The day was filled with amazing speakers, amazing speaking formats and amazing topics.

One of the highlights of the conference was the talk given by Nick Charney (@nickcharney) and Walter Schwabe (@fusedlogic) where they talked to the audience about participation, and as part of their talk unveiled a blog where folks in the conference were encouraged to participate in real time, right there, while they were talking. Now, days later, blog posts are still being generated on http://www.opengovnorth.ca by individuals and the enthusiasm is still present.

Nick and Walter took a risk. They put the idea out there, provided a place for it to happen and then made a simple request for participation. Though they are both accomplished bloggers they didn't tell people what to write, or how to write it, and they didn't try to control the conversation. They shared their ideas generously and provided a space for expression.

When skilled speakers like Nick and Walter encourage audience members to share their ideas with each other in real time, while the talk is going on, they are engaging the participants in a vastly larger conversation. And when the talk they are giving happens to be about encouraging this type of engagement, then they are really leading by example, in an almost recursive way.

They also weren't trying to promote themselves or their organisations, take credit for anyone else's work, or build their brands.

No, they were just there as Nick and Walter, a couple of guys encouraging us to take a chance and move a little closer to the edge. Giving us a gift, expecting nothing in return. Within minutes the site was crashing because the server had exceeded its capacity.

In a closed model, people focus on controlling the message and dictating top down what is supposed to happen. This model is built on fear and lacks trust and although many results can be and are generated this way, communities are not. Contrast this with what Nick and Walter created.

I agree with others who have remarked that this particular conference has taken us from a great idea to a movement. I think there are several reasons for this and I want to acknowledge the lead organizer, Donna Horn (@inspiricity), who just like Walter and Nick, used an open approach with her expertise in community building and leadership to support, encourage and then trust the conveners and speakers to create their own parts of the conference. And the result was a level of enthusiasm from the conveners and the speakers that spilled over to everyone else in the conference.

That’s how a community is created and that's the power of open.

04 October 2010

Open Data vs Open Source

Open source and open data are two different things.  They are related only in that they are both part of a larger current trend toward openness, and they both happen to involve computers.

The temptation however is to treat them the same, and to pursue them both at the same time. In fact, I recently realized that I personally have been collapsing the two concepts. I was resisting proprietary software use in open data because the companies that produce the software have been so opposed to open source software.

However, to argue that governments should both "liberate public data" and "use open source software" is to confuse the matter. I personally would like to see both happen but I choose to focus on open data because I think it will provide immediate value for government.

Insisting that government use open source tools to produce that open data makes the issue unnecessarily complicated.  Governments are used to using whatever tools they are using and it's usually easiest for them to release data using their existing tools.

One of the great things about open data from the technical point of view is that it's really not very complicated. Governments have a myriad of technically complex data issues to deal with, but open data is not one of them. Pretty much any system that contains data can dump that data to an open format such as XML or CSV. The tools used to develop these systems come with this sort of support built in, and the hosting is not complicated.
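To give a sense of how little is involved, here is a minimal sketch of dumping a table to a CSV file using only Python's standard library; the database file and table name are made up for illustration.

import sqlite3
import csv

# connect to a hypothetical SQLite database and select a hypothetical table
conn = sqlite3.connect('permits.db')
cursor = conn.execute('SELECT * FROM permits')

# write the column headings, then every data row, to a CSV file
f = open('permits.csv', 'w')
writer = csv.writer(f)
writer.writerow([col[0] for col in cursor.description])
writer.writerows(cursor)
f.close()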

Open data is complicated in other ways, yes.  Open data is a policy issue. It's also a communication issue. It's also an attitude issue. But it's not a technical issue.  No special software is required, no special technology is required, no special hosting is required and no special security is required.  Because in the case of open data we actually WANT people to get the data.

14 September 2010

Just the Facts Please

I am an advocate for releasing data in as raw a state as possible. Spending time on visualizations or any other kind of presentation, including maps, is one of the biggest wastes of time and money in the open data realm. Here's why: when you release visualizations, people can only look at your visualization. When you release raw data, people can do an infinite number of things, including creating visualizations, combining it with other data, creating applications, and more.

Rather than working on "presentation" or "value added" services when preparing to release public facts as open data, I encourage public servants to just get the facts out there. If a specific graph or visualization is requested regularly and the budget is available to pay for it, then fine, I support that. But to spend resources on specific visualizations or maps that no one has asked for, and then to release it in the name of "open data", is a waste when it's much simpler to just make the raw data available to the community so they can create their own visualizations.

Our public bodies have vast stores of our data but scarce money and time to release it. Instead of trying to add value in the form of graphs, visualizations, detailed analysis or maps I would much rather see them release more data in machine readable open file formats that everyone can use. Releasing data is the one part that the citizens must rely on the public servants to do. The rest, citizens can do themselves, given the opportunity.

15 August 2010

Is your data copyable?

Of all the concepts that data users have to concern themselves with, copying and the legal ramifications of copying has to be one of the most important. All technology is a product of copying but in the last 100 years media companies have invented and promoted the idea that copying is theft. It's one thing to have some difficulty and possibly incur some costs accessing and using data, it's an entirely different thing to have to worry about being prosecuted or sued for using data. Developers are rightly cautious about the liabilities associated with copying data.

The environment of caution around copying affects people releasing, promoting and using open data. If it's not perfectly clear to developers that they can use your data they often won't for this reason alone. Developers have enough to think about. If they have to wade through your license or terms of agreement to try to understand what all the ramifications are, they usually won't bother. If they decide to read it then even one clause that looks risky is enough to drive them away.

There are lots of interesting project ideas and few developers that can make them happen. If you want people to use your data, it makes sense to make that process as easy and risk free as possible. Fortunately, there are several ways to do that, but first, what exactly is the copyright protecting? When we talk about data in this context we are talking mostly about facts and collections of facts.

The interesting thing about facts when it comes to copyright is that it's generally accepted that facts themselves do not enjoy any copyright protection. The reason is that copyright generally applies only to creative works, and facts, by their very definition, are not creative. If I present a number to you as the truth then I am saying that I didn't make up the number; at best I discovered it, but it already existed. If I present a number to you and say I made it up - in other words, I invented it from nothing - then it's not a fact, it's made up.

It's for this reason that the idea of licensing data strikes me as odd. To grant someone a license to do something they can already do is redundant. Attempting to restrict use by means of a license is equally strange. If someone already has the right to use something, then to restrict it would require their agreement, and why would anyone agree to fewer rights than they already have?

I think the USA is on the right track here. My preference is that all non-personal government data be released as public domain. Public domain is easy to understand and it fully opens the door to innovation giving developers the raw materials they can really use to create valuable apps.

11 August 2010

Is your data accessible?

Public servants with responsibility for publishing government data have decisions to make when it comes to making that data available to citizens on the internet. Along with readability and usability which I have covered in previous posts, a third aspect of open data is accessibility.

The purpose of releasing data as open data is to enable people to use your data. To use that data they have to get it from your computer to their computer. There are a variety of ways to do that, but in 2010 that means the Internet and the HTTP protocol. If your datasets are very large, using anonymous FTP would likely also be acceptable to many developers. However, HTTP is by far the simpler protocol to use. It has many advantages from a developer perspective over FTP and it is just as easy to set up from a publisher's perspective.

Accessibility is just as important as readability. Where poor readability imposes a one-time cost on developers, poor accessibility imposes an ongoing transactional cost. As a developer, I can write scripts to decode data provided in proprietary formats like XLS or SHP (so long as that's still legal in Canada - locks today, proprietary formats tomorrow?). It's still costly in terms of my time, but once written, I can run that same script over and over again with no effort. Poor accessibility, on the other hand, sometimes means that I can't readily automate the process. If I can't automate it, then every time I want to use your data, I incur the cost of manually downloading it. That may be fine for those users who only want to download your data once. But for the developers that you want to encourage to use your data as a platform to build valuable applications, it's a barrier they won't likely cross due to the high transactional cost.
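To make that concrete, here is a minimal sketch of the kind of fetch that is trivial to automate and re-run when the data sits at a plain HTTP URL; the URL below is hypothetical.

import urllib2

# download a dataset from a plain HTTP URL (hypothetical) and save it locally
url = 'http://www.example.gov.bc.ca/opendata/transit_ridership.csv'
data = urllib2.urlopen(url).read()

f = open('transit_ridership.csv', 'w')
f.write(data)
f.close()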

In some cases folks use login screens or mail-back mechanisms to track who is accessing our government data. In some cases there are check boxes for so-called "agreements" or "contracts" that are meant to force people into some sort of agreement before they use our data. The worst of course are cost recovery models where we are forced to pay for our data twice: first as taxpayers and again as users.

Open data is emerging from an era where the status quo belief was that government data had to be locked down. Whether or not that was ever true is debatable, but it's clearly not true in 2010.

Once publishers realize that open data is likely going to be re-purposed and distributed in different forms anyway, in ways that these methods won't track, what are they really measuring that couldn't be measured with a simple, unobtrusive web site and an access log?

When we as publishers talk about accessibility, as with many aspects of open data, it's useful to remind ourselves of the reason we are doing open data in the first place. Making data accessible means making it as easy as possible for developers to gain access to and download the data. Not so you can pass some test of "openness" (although there are good reasons to do that which I will cover in a future post), but so people use your data. You want people to use your data. That's the point.

22 July 2010

Thoughts from OSCON 2010

I am currently attending OSCON 2010 (Open Source Conference) in Portland Oregon.   It's a conference for free and open source software enthusiasts, developers, hackers and users of all levels.  There are about 5,000 people attending  this year.  I have met a lot of people here.  Some who are passionate about free software, and some that are learning more about it and how it can provide value to their companies.

It's difficult to over-estimate the impact that free and open source software (FOSS) has had on computing and the world in general.  First, of course, it powers the internet itself.  If you use the internet, you use free and open source software.  From the underlying protocols to email to ftp to web sites, it's all powered by free and open source software.

Practically every major web site you can think of (Google, Facebook, Wikipedia, Twitter, Foursquare, Google Maps, ... ) make heavy use of free and open source software.  These companies measure traffic in many millions of users and billions of pages per month.

The Apache Web Server for example has been the most popular web server since April 1996 and powers almost 70 percent of all websites on the planet.  There are free and open source operating systems, programming languages, office productivity suites, collaboration suites, web browsers, file and print servers and much more.  There is a free and open source version of practically any software you can think of (and many that you haven't thought of).

And yet, here we are in 2010 and some are still not convinced that open source is suitable for government use.  They are not convinced that this software developed by communities of generous and smart people is reliable, secure or well-supported enough for their purposes compared to proprietary solutions such as Internet Explorer.  They put all of their trust in single-vendor solutions and rely on companies like Microsoft and Oracle, and believe the stories told by such companies about open source software... that story goes something like this:  "It's not enterprise ready... it's of varying quality... there is no support for it... you want to have one throat to choke."

Why aren’t governments using open source software anywhere and everywhere possible?  Why do governments continue to seek out solutions with lock-in to certain vendors?  Why would we continue to believe the big vendors that promise to be nice?  Why do we citizens continue to pay millions upon millions of dollars for software?

Governments are unlike other corporations in that they are making decisions not for their own benefit, but for the benefit of us, the citizens. They don't take that responsibility lightly, so decisions are made with great care and they often don't give themselves permission to try new things - or if they do, they do THAT with great care and concern because they don't want to make any mistakes with our resources. Trying something innovative is seen as risky, so the status quo is long lived and new approaches are discouraged.

Governments appear to be the last hold out of proprietary software and as a result, are missing out on an opportunity to engage with and support the communities that support all of us.  The rest of the world has figured out that free and open source software is the most secure, the most reliable, most innovative and the most cost effective software available.  Leading internet companies that earn millions of dollars in revenues and could choose anything they want for their software needs are choosing open source software.  We should let our governments know that we want them to choose free and open source software too.

The problem with free and open source software is this:  It's hard to make a lot of money with free software.  And, without a lot of money you can't own a public relations team and you can't spend a lot of money on armies of sales people and technical sales people with pre-written business cases and white papers and other collateral convincing people to use your products.  Without a lot of money, you can't schmooze and throw hosted year end parties for your key clients in every major city.

Instead, with free and open source software, you put everything into the product and let the product speak for itself.  You assume that people actually want things to work better.  You build communities of people who are passionate about your product - not because it makes them look good - not because it's easier – not even because it's free - but because it provides exceptional value.

07 July 2010

Is your data readable?

In talking with clients and colleagues about open data and open government, one question comes up over and over again. The word “data” means a collection or body of facts that represent the qualitative or quantitative attributes of a variable or set of variables, but what does “open data” mean?

To answer this question I like to look at what we are trying to achieve by opening data. The promise of open data is that if we make government administrative data available to the public value will be created in ways that we may or may not be able to imagine. The value will be created by using the data. So, what is open data? Ultimately, it’s data you can use. In this series of blog posts I will explore the various ways data can be made more usable.

What makes data usable?

In a previous post I proposed some dimensions that move toward a usability scale. In this post I propose a minimum standard of usability. In other words, what are the absolute minimum requirements that must be satisfied in order to consider something open data? To answer this question one could look at the dimensions of usability individually and decide for each one, what would be the minimum level of usability below which data is not usable.

One of the main measures of usability is readability.  In other words, how easy is it to read?

For example, this list of cities with their geographic areas and populations is data:


Data collected into rows and columns in this way is typically called a data set (or dataset). By putting this dataset in my blog post I have made it available to you, but because I made it available as a screenshot of my spreadsheet, reading it would be difficult, error prone and would require expensive software or scripting. That makes it pretty much unusable by you.

Another method in use by governments today is to publish data as a PDF formatted document. This is marginally better than posting an image. It’s technically possible to extract the data from PDF files, as I have demonstrated in a previous post, but it’s still expensive, time consuming and error prone.

What I could do instead is make that same data available as an HTML table in this blog post, like this:




City        Area      Population
Victoria    19.68     78057
Vancouver   114.67    578041
Kelowna     211.69    120812


Technically, this is a level better than both images and PDF files but it will still get me low points on the usability scale because in order to read it a programmer still has to write a script specifically for reading this data from my blog post, a time consuming and wasteful process. If you’re unfortunate enough to need to read data from an HTML page, another previous blog post describes how to do this.
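As a rough sketch of what that one-off script ends up looking like, here is one way to pull the cells back out of an HTML table using only the Python standard library, assuming the page has been saved locally as blog_post.html (a hypothetical filename):

from HTMLParser import HTMLParser

# collect the text of every table cell, row by row
class TableReader(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self.in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.rows[-1].append(data.strip())

parser = TableReader()
parser.feed(open('blog_post.html').read())
print parser.rows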

To really improve the usability of this data it makes sense to publish it in a format designed to make the data easily accessible. Many people are familiar with spreadsheets, which are a popular tool for reading and manipulating tabular data, so making data available in spreadsheet format makes it more usable in the sense that people can obtain spreadsheet programs to read the tabular data. For example, here is the same data published in the open .ODS format supported by a wide variety of spreadsheet software providers, and here it is published in the XLS format, a proprietary format controlled by the Microsoft corporation.

The advantage to publishing in spreadsheet format is that while still requiring specialized scripts and software to read, at least the rows and columns are well defined which translates into fewer errors.  This is what I would consider the minimum bar for usable open data.  It's not as usable as I would like, but it is usable without too much risk.  In other words, if you have data in this format already and you don't have the budget to reformat it before publishing it, don't delay the release, just publish it as is.
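As an example of such a script, here is a small sketch that reads the rows out of the XLS file using the free xlrd library; the filename is hypothetical.

import xlrd

# open the spreadsheet and print every row of the first sheet
book = xlrd.open_workbook('cities.xls')
sheet = book.sheet_by_index(0)
for i in range(sheet.nrows):
    print sheet.row_values(i)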

Ideally though data is published in formats specifically designed for the purpose of information sharing, and that’s where the CSV, XML and JSON formats come in.

The CSV version of my dataset looks like this:
"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812

The XML version looks like this:
<dataset>
 <data>
  <row><city>Victoria</city><area>19.68</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>

and the JSON version looks like this:
[
 {"city": "Victoria", "population": 78057, "area": 19.690000000000001},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]

While not quite as pretty as the other human-readable formats, CSV, XML and JSON are open formats that provide structure, making it very easy for programs to read the data. They are also well supported in almost all modern programming languages, so any programmer who wants to use your data can do so easily and accurately with free software and very little programming. And as a side benefit, it's very easy and inexpensive to publish your administrative data in these formats using free software.
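To illustrate just how little programming it takes, here is a minimal sketch that reads all three formats using only the Python standard library; the filenames are hypothetical.

import csv
import json
import xml.etree.ElementTree as ET

# CSV: one call gives us the rows as lists of strings
rows = list(csv.reader(open('cities.csv')))

# JSON: one call gives us a list of dictionaries
cities = json.load(open('cities.json'))

# XML: walk the tree and pull out the values we want
tree = ET.parse('cities.xml')
for row in tree.findall('data/row'):
    print row.find('city').text, row.find('population').text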

Publishing data in these open formats makes it easy for people to use open data. While publishing in HTML format is readable and is what I would consider the bare minimum for usability, depending on how it is done, other formats can make it much easier. And if your organization is serious about engaging people to collaborate and create value from the data, it will want to make the data as usable as possible, and making the data readable is one part of doing that.

23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue is both entered by hand and imported from multiple sources and is curated by our team of volunteers.

Being a site called "OpenDataBC" you would think it would be pretty straightforward to put such a site together. Take the available catalogues from Nanaimo, Vancouver and the province and stick them together and voila, a catalogue is born. But, it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In it's ideal form what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open, we think it's useful to promote the use of data that's been released, to acknowledge data that is more open (doing the right thing), and at the same time to encourage data that is less open to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find.  For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing one. Here I present some questions to consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although technically a scanned image of a map with gold stickers pasted on it is data, it is not something that a programmer can use.  What we look for is machine readable data.  Documents or electronic files containing data published in formats that a software program can read easily and consistently without errors are considered machine readable.  Databases, spreadsheets and CSV files are all examples of machine readable electronic data that are easy to read, and thus more usable.  PDF files, Word documents and scanned images, while technically readable by a software program, are not easy or quick to read, and thus less usable.

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and in return I get the data.

3. Is it published in an open format?
From wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free.  For example, if  I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data that is released under a common license, such as a Creative Commons license or one meeting the Open Knowledge Definition, is preferred over a license created by the party releasing the data, because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem because once the license is learned for one dataset, it is understood and can be applied to other datasets released similarly.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.

The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently and less usable data will be identified as such so the issues that are contributing to its less usable status can be addressed.

Please let me/us know if you think we're missing something or if something here needs adjusting.

18 June 2010

Single Points of Failure

As I write this in 2010, our political, economic, cultural and social systems in the western world are for the most part driven by corporations.  Our art is subsidized by corporations, our charities are funded by corporations, our culture is promoted by corporations and our laws are defined by corporations.

These corporations come in various forms, whether they are businesses, governments, or religious corporations.  In many ways the corporate form is very useful.  It's one way to provide a structure for people to work toward a common goal.  It provides some level of predictability.  And in some cases, it provides for economies of scale.

Much has been written about the weaknesses of the corporate form and the corruption it attracts, so I leave that to others to draw attention to. But there is one aspect of the corporate form that I don't see being written about, and that's the fact that corporations represent single points of failure.  Big corporations have big failures.  We permit corporations to grow infinitely large and then rely on them not to fail.  But they do fail, we know they fail, and we permit it anyway.

In the enterprise software space the "one throat to choke" mantra is used to persuade the listener that putting all of your eggs in one basket is a good thing. What it hides, though, is the fact that the vendor represents a single point of failure, and when it's a large project, that often means a large failure.  The high failure rate of large IT projects is well known, but how often are these throats actually choked?  Almost never.

If these large organizations never failed, that would be one thing, but the fact is, they do fail.  And somehow when we are voting, or shopping, or doing our cost benefit analysis and selecting the vendor, we forget that the entity we are dealing with may not be around tomorrow.  The "bet the farm" ideology invented hundreds of years ago and popularized during the industrial revolution is showing its age in our current distributed global world.

When the costs of communication were high it made a lot of sense to build organizations as hierarchies to minimize the costs of communication through a top down pyramid and command and control theme. This model was so efficient that it offset the risks of the single point of failure. Today though, we have the internet and mobile phones minimizing those costs for everyone so the pyramid isn't adding as much value as it used to and the cost of the single point of failure is still there.

There are some structures and strategies though that can help with this.  They are used in organizations and projects that are designed with failure in mind.  These organizational models admit from the outset that there will be failures and rather than pretend that all is well and there will be someone to choke if anything goes wrong, they make failure part of the equation.

With the BP Oil Disaster destroying the Gulf of Mexico, the failure of 247 US banks and millions unemployed due to the economic meltdown, people are starting to wake up to the enormous risks and costs of these single points of failure.  Simultaneously, open source software, open data, grass roots communities and cooperatives are becoming increasingly popular as people start to look for alternative ways to get things done.

Smart companies, governments and other organizations are letting go of "command and control" and are discovering game-changing philosophies based on engagement and collaboration that give them an edge that, not surprisingly, is almost non-existent in the traditional corporate form.

Collaboration, gifting and doing things for the sheer joy of working and contributing to the world and enhancing the quality of life of others are being rediscovered. And while we speak of these things as new, they are as old as civilization itself and were here long before the corporate form and will be here long after.

20 May 2010

OpenDataBC: Extracting Data from A4CA PDFs

In this OpenDataBC series of posts, I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. In the first article of this series, I described how to write a script to scrape catalog data from web pages. In the second article I described how to write a program to transform the data. In this article, I describe how to convert a PDF document into useable data.

As part of the Apps for Climate Action Contest, the Province of BC released over 500 datasets in the Climate Action Data Catalogue. It's an impressive amount of data pulled from an array of sources both within BC and elsewhere.

In an ideal “open data” world, all of that data would be in an easily machine readable format that we could use to write programs directly. While that would be great, the reality today is a bit different. Much of the data that is made publicly available these days is in formats that are harder to use. For example, some of the data in the Climate Change Data Catalogue was released in PDF format. PDF is a proprietary format, meaning the format is controlled exclusively by one party, in this case the Adobe corporation.

An interesting fact is that it takes extra effort to get data from its raw form into PDF format. In other words, to publish data in an open format rather than in PDF format actually saves time, effort and money – up front. However, PDF became well established in the pre-open world, so a lot of data is already published using it. To switch existing software to publish in an open format might take time. As a result, at least temporarily, we need to find ways to get at the data in the PDF files.

In this post I describe how to do that. Looking through some of the available datasets in the catalogue, one that I find interesting is “Transit Ridership in Metro Vancouver”. The data is produced by Translink and is in a PDF format and looks like this:



What I am interested in is the number of passenger trips by year for the past few years. I am going to leave out the Seabus and the West Coast Express as I am mostly interested in the buses and the Skytrain.

What I would like is a dataset, in a CSV file. The way this program will work is essentially as follows:

  • read the PDF file from the source web site
  • extract the data from the PDF file into a list in memory
  • write the list in memory out to a CSV file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on all Linux and modern Mac machines and can be easily installed on Windows.

The Code
The first thing we need to do is to read the PDF file into memory. The simple way to do that in Python is to use the urllib2 library and read the entire PDF from the original web site. Tying the script to the actual location of the file means we don't have to manually store the original file anywhere. If Metro Vancouver decided to move the URL we would have to adjust our code, but we're probably only going to run this code once so it's not a big deal. To read the PDF file into a memory variable we do this:

import urllib2
url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
pdf = urllib2.urlopen(url).read()

Now that we have the PDF file in memory, I want to parse the PDF file and turn it into raw text. To do this I use a free open source Python library called pdfminer. I have created a function called pdf_to_text for this purpose. Here's the function:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.pdfdevice import PDFDevice 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t

The pdf_to_text function starts by importing the components required to do the conversion. The pdfminer library provides a lot of functionality. In this example we are using a small fraction of its functionality to do what we need, which is to get at the content in the PDF. The main function that actually does the work is called process_pdf. It takes a PDFResourceManager object, a TextConverter object and a file object as parameters so the code before that call is setting up those parameters properly. I use a StringIO object rather than just passing the urllib2 object in because the PDF converter needs to use the seek method for random access which is not supported in urllib2. To gain this ability I put the data into a StringIO object, which supports seek.

When the pdf_to_text function is called with the contents of a PDF file it returns a string containing lines of text with each line containing one of the elements (numbers or labels) of the PDF file. Here's what it looks like on my system:



Now that we have the data in text format, we want to pull out the numbers that we are interested in. I am interested in the labels on the left, which start on line 6, the first numeric column (BUS), which starts on line 75 and the second numeric column (SKYTRAIN), which starts on line 144.

To start the process of extracting rows of data from the text file, I first split lines of the text file into a list like this:

lines = text.splitlines() 

Then I create a simple function called grab_one_row which besides having a very clever name, knows the relative placement of the three columns, and pulls one whole row at a time from the text file and returns it as a tuple. Here is the function:

def grab_one_row(lines,n): 
    return (lines[n],lines[n+69],lines[n+138]) 

Armed with that function, I can now collect most of the rows I am interested in with a simple list comprehension:

rows = [grab_one_row(lines,i) for i in range(6,26)] 

In the original PDF, the data for 2008 is placed further down the page so the last line needs to be added with a separate line of code like this:

rows.append(grab_one_row(lines,39)) 

Now the rows list contains all of the data we are interested in, in a form we can easily deal with. We just need to write the rows out to a CSV file to complete our work. To do that I created the rows_to_csv function. Here it is:

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 

I wanted the resulting CSV file to have numbers rather than strings containing numbers for the numeric values. The last line of this function strips out the commas that were in the numbers in the PDF file and then converts the text to a long integer to be written to the CSV file.

The resulting CSV file now looks like this:



This result is a lot easier to deal with than the original PDF file. Arguably, a small file such as this could also be converted with Open Office Spreadsheet by cutting from the PDF and pasting to the spreadsheet. The nice thing about doing this as a script as above is that we can use this same technique for very large PDF files that would be too onerous to do manually.

Here is the entire program with all of the code together at once:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.pdfdevice import PDFDevice 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t 
    
def grab_one_row(lines,n): 
    return (lines[n],lines[n+69],lines[n+138]) 

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 

def run(): 
    import urllib2 
    url         = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf' 
    outfilename = 'translink_bus_skytrain_trips_1989_2008.csv' 
    
    pdf = urllib2.urlopen(url).read() 
    text = pdf_to_text(pdf) 
    
    lines = text.splitlines() 
    rows = [grab_one_row(lines,i) for i in range(6,26)] 
    rows.append(grab_one_row(lines,39)) 

    rows_to_csv(rows,outfilename) 
    
if __name__ == '__main__': 
    run() 

and you can find the resulting CSV file here.

Once again, Python comes through for us. Clearly it's not as easy to convert a PDF file as it is to rip a table out of an HTML file, but being possible at all makes it something we can work with. And part of the beauty of “Open” is that now that I have done it, others don't have to. And I in turn will benefit from other contributors to the open ecosystem. If we all do a bit, it's an “everyone wins” scenario.

17 May 2010

Facebook Steps Out of the Way

Like most people, I was a bit surprised by Facebook's recent changes with regard to privacy.  I don't think they have done anything wrong, but as a user I allowed myself to be lulled into a false sense of security.  Like most people, I believed that they wouldn't mess with the privacy settings of my account much, allowing me to control who got to see the personal information I put on Facebook, on my terms.  I had agreed to their user agreement, which stated they could change the terms at any time, but I didn't really pay attention to that fine print.

When they made the more recent change to allow my friends graph and my content to be harvested I realized something.  Not that Facebook is evil or bad, but that they offer a service that I thought was one thing, but it is something else.  I thought it was a way for me to connect with my friends and share my data with them, but actually it is a way for Facebook to profit from our personal data.  Or, as Tim Spalding so eloquently put it, "Why do free social networks tilt inevitably toward user exploitation? Because you're not their customer, you're their product."

For me it's not a big deal that my Facebook content is now available to anyone, I don't store anything particularly private there anyway.  But now, my behaviour has changed and I find myself using it even less than I did before.  Not so much because of the loss of control over my data or the fact that they didn't give me a cut, but because this is a company that is volatile with respect to its user policies.  Frankly, I just don't want to put in the time required to keep up with their changes.  So, I take my privacy into my own hands and limit what I place on Facebook.

On the other hand, I find the recent changes to Facebook pretty exciting.  I think an awesome opportunity has opened up now for a service to emerge that allows people to connect with their friends and at the same time protects their privacy.  400 million Facebook users sharing information is a testament to the fact that people want to connect online.  The recent outcries and Facebook account deletions point to the fact that people also value privacy... i.e., there is clearly a market for connecting AND protecting privacy.

Facebook doesn't offer that service, but until now people were not sure if they did or they didn't.  And that ambiguity prevented other firms from offering that service because competing with Facebook was just a non-starter.  Now, thanks to Facebook's recent changes, it's clear.  They don't.

And in that gap between what people want and what is available, lies opportunity.

It's now clear, Facebook is not in the privacy business.  By stepping out of the way they make room for others that want to offer privacy as a key value proposition.

Personally, I would like to see a new type of social platform emerge.  Something taking ideas from status.net and webfinger and Diaspora.  I want a distributed social platform that I can host with any hosting provider that would allow me to connect with my friends. The difference is, I own it.  So long as my friends and I  have this system installed somewhere, our systems would talk to each other seamlessly.

This way we would not be dependent or at the mercy of any one vendor or their privacy policy changes.  We would be able to move our accounts to another host of our choosing, anytime we want, and we would be able to lock down or even delete data any time.

The software would be free and open source as well.  If anyone wanted to add some functionality or just contribute, that would be possible too.  Think self-hosted WordPress, but with social networking instead of blogging.

Rather than one massive "walled garden" users would each have their own garden in a community with other garden owners.  They would be able to share their data with whomever they choose.

I am grateful to Facebook and everything it has done to connect people.  It's truly an awesome service.

Ultimately though, it IS my data.  And I still want to share it on my terms.

11 May 2010

Innovators Isolation

One of the things that all innovators face at some level is a sense of isolation. By definition innovators are working on things that have never been worked on before. And they make up only about 2.5% of the population. If they participate in a specialized industry, it's pretty unlikely they'll get to work with other innovators in their field, never mind find someone who understands what it is they're so passionate about.

About four years ago I made a decision to attend four conferences per year. I consider it part of my ongoing professional development in two ways:

  1. Training - there are no formal training courses available for the skills I need for my work. I read any books that are available. Conference workshops and sessions provide me with the most current and relevant information about what other innovators are doing in my field.
  2. Connecting - when I attend a conference, I consciously choose to meet and connect with other attendees and presenters. I often get to connect with other inventors of some of the most exciting new technologies. It's typical to find folks exchanging ideas, talking about what we've actually done, what worked, what didn't work and what we're thinking about for the future.  There are people who talk and there are people who ship. These people ship.

Next week I will attend Google I/O and later this year I will attend OSCON and possibly FOWA. Although I don't expect to find a workshop covering specifically what I am currently up to – such as privacy enhancing distributed enterprise service topologies or dataset forking technologies or probabilistic data linkage techniques - at these conferences, I do expect to find lots of people interested in the cutting edge of whatever it is they're passionate about.

I expect to find people who are daring to think differently. I will share my crazy ideas. I will hear other folks’ crazy ideas. I expect we'll have thoughts about each other's crazy ideas and I expect that after all that, we will acknowledge some of the ideas as crazy or just dumb. And some will seem not quite as crazy as they did before.

But more importantly, we will get the sense that we're not the only ones with crazy ideas, and we're not the only ones that have no idea what's on TV anymore, because we are working on something that we think is cool and could possibly even change the world.

06 May 2010

freedom debt and the decline of lock-in




Early adopters are starting to notice there's a cost to cool when it's supplied by a single vendor.  At first it's all fun and games.  Then you realize that you no longer own your music collection, you no longer own your social network, and you no longer own your data.

Time after time we see organizations start out challenging the status quo, getting huge, abusing their user base, breaking trust, declining, and finally becoming the legacy they fought against.  Facebook innovated past MySpace, Microsoft innovated past the IBM mainframe and Apple innovated past the IBM PC.  Years later, Facebook's commoditization of users and user content,  Microsoft's crushing domination of the PC market, and Apple's rigid control over anything that comes into contact with its products, including the internet, are all examples of lock-in in action.

They all start with great innovations and intentions but eventually the temptation to use lock-in as a strategy becomes too much to resist.  

It's the classic Innovators Dilemma.  

To amass a huge user base, organizations start out providing something of value or at least perceived value.  That's what gets people to join up.  The more compelling and universally recognized the value is, the more users.  Having first mover's advantage is great here because it means there is no competition.

Once people have joined, there's a need to retain them as customers.  

Option one is continuing to provide value over and above what their competitors offer.  Now that their competitors know what they're up to, it's only a matter of time before the innovation is imitated and possibly exceeded.  Continuing to innovate (taking risks) after they've already captured an audience can be difficult and is something that doesn't come naturally to large organizations.   

Option two, which is what most organizations default to, is the strategy of intentional 'lock-in'.  Lock-in is the practice of structuring the customer relationship so that it's difficult or impossible for users to switch to another vendor.  Such as holding user data hostage so that it can be entered but not retrieved, or preventing people from moving their purchased music to another platform.  Creating a walled garden environment, so if you want to play, you have to play within the walls, and leaving means leaving your toys behind.  These are just a few of the tactics used to discourage people from using their freedom of choice.  

Notice there's a common thread in option two: none of these tactics are in your best interest.  They are in the best interest of the organization.  

Jim Zemlin said it well at LinuxWorld a couple of years ago.  He likened using a certain well-known set of products to volunteering to go to jail.  He pointed out that the jail looked a lot like a 4-star hotel room with video on demand and a great view, clean and neat, and one that most of us would find rather luxurious... but it's still a jail.

Freedom Debt
By offering you something shiny now, organizations get you to give up a bit of your freedom in the future.  "Take this shiny phone.  It's free", they say. "Don't concern yourself with where your data is stored; we'll take care of it."  

Right.

As we continue to wake up to what lock-in means and to become aware that what we are doing when we choose products and vendors that lock us in is essentially borrowing freedom from the future, I think we'll start to make different choices. 

The New Organization
The trends around open source, open data, open government, open protocols, open APIs and open communications indicate that more and more organizations are recognizing that people are aware of the consequences of giving up their freedom.  These organizational models point to lock-in as something to avoid, and use it as a lever to distinguish themselves from the legacy organizational models they disrupt.  

This not only makes those organizations more competitive, it also means their goals are more aligned with ours, and gives us something more valuable for the dollars or attention we give in return.  

And the big upside for them is that openness and our knowing we can leave at any time builds a kind of trust and loyalty that the old model can't hope to compete with.

02 May 2010

OpenDataBC #2: The A4CA Data Catalogue Transformer

In the OpenDataBC series of posts I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites.  My goal is to encourage people that might not otherwise consider interacting directly with data, to give it a try and see for themselves that it's easy, and maybe even fun.  :-)

Three weeks ago I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest.  I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects.  When I started to look through the datasets, I realized I was going to need to download the catalogue into either a spreadsheet or a database to be able to really look at what was available.

In the previous article of this series, I described how to write a script to scrape the catalog data from the web page that contains it.  In this article I describe how to write a program to transform the data.

The goal of this program is to read the live catalog from the A4CA site and make it available in a usable form. The program will do some data cleaning and write the data out to a file that can be read by a spreadsheet or database program. I have decided to output the catalog data in comma-delimited form, otherwise known as a CSV file.

While writing the last article, I noticed that the downloaded data had some broken lines, extra spaces, and values encoded with HTML entities (for example: '&#8805;'). I want to be able to view the data in a plain text editor and in a simple spreadsheet program, which means cleaning all of these extra lines and codes out so the data is easy to work with and understand during analysis.

The way this program will work is basically as follows:
  • read the data from the catalog page
  • clean the data
  • write the clean data out to a csv file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on most Linux distributions and modern Macs and can be easily installed on Windows.

The Code
One of the first things we are going to need to do is grab the catalog data.  I saved my code from the last article in a module called read_a4ca_catalog. To make that module available I just import it.  We are also going to need a few other modules in our program so I import those at the same time.  The string module will help us when we want to clean the data.  We'll use the csv module when we want to write our output to a file.

import read_a4ca_catalog
import string
import csv

Now that we have all of the basic modules loaded we are set to start coding.

Reading the dataset is ridiculously easy now that we have a module specifically written to do that.  It's a one-liner that uses our module from last time:

raw_data = read_a4ca_catalog.read_a4ca_catalog()

Now we have our data in a list variable and we need to clean it.  For that we're going to go through the data row by row, cleaning each row and placing it into a new array called clean_data.

clean_data = []
for row in raw_data:
    clean_data.append( cleanup_row(row) )

Notice that the code calls a function called cleanup_row. That function takes one row from the raw catalog data and goes through it cell by cell, cleaning the data in each cell. Cleaning consists of replacing the encoded HTML entities with readable alternatives, replacing duplicate spaces with single spaces, and removing any characters that we don't want. We create a string called legal_chars ahead of time to specify which characters we consider legal for the output cells. Here's the code:

legal_chars  = string.digits + string.ascii_letters + string.punctuation + ' '
def cleanup_row(row):
    def cleanup_element(element):
        t1 = element.replace('  ',' ').replace('&amp;','&').replace('&#8804;','<=').replace('&#8805;','>=').replace('&lt;','<').replace('&gt;','>')
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]

We need to place this function above the code that uses it. In my program I have placed it right below the import statements.

Finally, we want to write the clean data out to a file.  For that I create an output file and call upon the Python csv module to write each row out to the file.   Just before I write out the rows of data, I write a row of labels so that the csv file has headings.

f = open('a4ca_catalog.csv','w')
writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)

That's it. We now have a .csv file that can be easily read by pretty much any spreadsheet or database program.
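
If you want a quick sanity check that the file really is readable, a few lines of Python will print the first handful of rows back out. This is just a sketch, and it assumes the a4ca_catalog.csv file produced above is sitting in the current directory:

import csv

f = open('a4ca_catalog.csv')
for i, row in enumerate(csv.reader(f)):
    print row          # each row comes back as a list of strings
    if i >= 4:         # stop after the header row plus the first few data rows
        break
f.close()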

Here's the entire program with all the bits in order:
import read_a4ca_catalog
import string
import csv

legal_chars  = string.digits + string.ascii_letters + string.punctuation + ' '
def cleanup_row(row):
    def cleanup_element(element):
        t1 = element.replace('  ',' ').replace('&amp;','&').replace('&#8804;','<=').replace('&#8805;','>=').replace('&lt;','<').replace('&gt;','>')
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]

# read the a4ca catalog from the live web site
raw_data = read_a4ca_catalog.read_a4ca_catalog()

# clean the data
clean_data = []
for row in raw_data:
    clean_data.append( cleanup_row(row) )

# write the clean data out to a file
f = open('a4ca_catalog.csv','w')
writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)


And you can find the resulting csv file here.

The nice thing about doing this with a program rather than manually is that if the catalog is updated from time to time, all we have to do is run our program and we have a nice clean spreadsheet to work with that has the latest and greatest data.

Python comes with a lot of tools built in for manipulating data, and it makes short work of jobs like this. Now that the data is in spreadsheet form, we can look at it to see what different types of data there are, what format it's in and where it comes from, and we can do cross tabs and graphs to help us visualize those aspects.
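
As a small example of the kind of analysis that becomes possible, here is a rough sketch of a simple cross tab done in Python itself rather than in a spreadsheet. It counts how many catalogue entries there are for each value in the 'format' column (the column names are the headings we wrote out above), and it assumes the a4ca_catalog.csv file produced by this article's program:

import csv
from collections import defaultdict

# tally the number of catalogue entries per format
counts = defaultdict(int)
f = open('a4ca_catalog.csv')
for row in csv.DictReader(f):
    counts[row['format']] += 1
f.close()

# print the formats, most common first
for fmt, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print fmt, n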

28 April 2010

Data is Green

Of the three R's of responsible consumerism, I think the second R holds the most promise. The idea is that if you must buy something, then make it something durable and use it up.  This appeals to me.  If I spend a few extra bucks and purchase the durable good over the cheap one, I get to enjoy a superior product for longer.  And, as a bonus, I get to spend less of my life shopping.


On a recent trip to a small Mexican town I was struck by many cultural differences.  One of these was the extent to which things are used up.  Regardless of the motivation, it was clear that the residents of the town where I was staying were using things long after the point at which I personally would have discarded them.  From houses to automobiles to plastic containers to electronics to clothing and even food, things that I would have been comfortable with discarding here would continue to be used long after they made their way out of my life.

The Implications of Reuse for Data
As I was thinking about Reuse, it really struck me what an opportunity we have with our data.

We spend a lot of time and attention filling out forms and understanding terminology and concepts so that when we sign them, we know what we're doing and we get the services we need. We do this when we want to interact with both governments and businesses. We also provide the funds that enable government to collect all sorts of data about our resources, the places we live and the events that occur in our world, so that we can be safe and secure and our resources can be used effectively.

We also spend money indirectly through taxes and fees to have that data stored, protected, backed up and maintained by those public and corporate entities. It's mind-boggling to think about all the places my data resides and how many times my name, address and phone number are stored.

The data that we give is typically used by only that one organization, and only for the purpose it was collected for.

Why not find a way to store data so that it could be reused FOREVER by multiple parties?

Open Data and Reusability
If we were to add up all of the time, energy and resources required to store and manage our resource, geographical, financial and personal information, locked up as it is in hundreds or thousands of different locations, the impact on our environment would clearly be significant.

Open Data allows that same data to be reused freely and infinitely for multiple purposes, so that we can maximize its value. It also allows system developers to store and manage less data of their own, because they don't have to reinvent data that already exists.

I think Open Data represents an ultimate opportunity for reuse.  If data had a colour, it would have to be green.

19 April 2010

OpenDataBC: Accessing the A4CA Data Catalogue

The OpenDataBC Series: In this series of posts I will describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. My goal is to encourage people who might not otherwise consider interacting directly with data to give it a try and see for themselves that it's easy, and maybe even fun.  :-)

Last week I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest.  I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects.

The A4CA data listed in the catalogue includes a range of formats and technologies. Some are easier to work with than others. Being able to browse the catalogue online is great, but I really want to have a closer look and maybe do some analysis to find the data that is easy to work with. For that I need to download the data so I can work with it in spreadsheet and/or database form. In a perfect world, this data would be available on the site as a downloadable feed with its own URL, so programmers could simply point at that URL and get the data in XML format.
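
To illustrate the point, if such a feed did exist, the whole "get the data" step would collapse to a couple of lines. The URL below is purely hypothetical; nothing like it is published on the A4CA site today:

import urllib2

# hypothetical feed URL -- shown only to illustrate what a direct download would look like
feed = urllib2.urlopen('http://data.gov.bc.ca/a4ca_catalogue.xml')
open('a4ca_catalogue.xml', 'w').write(feed.read())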

The A4CA catalog provides a download in .CSV form, which is easy to work with, but unfortunately the link to that data is hidden behind a Flash widget, so there is no way to download the data directly. The page itself, however, does provide the data to the browser in the form of a table. The table shows only 100 records at a time; another Flash widget allows the user to page through the data 100 records at a time, without a screen refresh. That means the data is already in the browser, all 540 rows of it. It's just a matter of scraping it out using a bit of code.

How it's done:
First, checking the robots.txt file for the site (www.gov.bc.ca) reveals that the site allows almost any type of program, including this one, so that's great.
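
That check can also be done programmatically with Python's standard robotparser module. Here's a minimal sketch; note that each host serves its own robots.txt, so the sketch points at the host the data page is actually served from (data.gov.bc.ca, used later in this article):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://data.gov.bc.ca/robots.txt')
rp.read()
# True means the rules (if any) allow a generic crawler ('*') to fetch the page
print rp.can_fetch('*', 'http://data.gov.bc.ca/data.html')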

Prerequisites
My personal language of choice for this sort of task is Python, which comes pre-installed on most Linux distributions and modern Macs and can be easily installed on Windows. In my case I am using Python 2.6 under Ubuntu 9.10. In addition to Python I am using Beautiful Soup, which is an excellent library for scraping web sites.

The Code
The first thing we need to do in our program is to import the modules we need.  We are going to need urllib2 to grab the page and BeautifulSoup to parse it.  That's just one line:
import urllib2, BeautifulSoup

Next, we need to go out to the web and grab the HTML page and store it as a string called page:
page = urllib2.urlopen('http://data.gov.bc.ca/data.html')

Next, we create an object called soup using the BeautifulSoup library.
soup = BeautifulSoup.BeautifulSoup(page)

At this point we have the page loaded and we can read whatever parts of it we want using the methods provided by the soup object.

I am particularly interested in the table in the middle of the page that contains the data I am after. Looking at the raw HTML code from inside my browser, I see that there is only one table on this page and that its ID is set to 'example', so it's pretty easy to find using the find method provided by the soup object.
data_table = soup.find('table',id='example')

We also need a place to store our results.  I'll use an array for that.
records = []

Now that we have the table, we just want to cycle through the rows of the table and pull the data out.  For that we can use the Python for statement with the method findAll provided by the data_table object that we created.  With each row that we iterate through, we want to grab the text that is stored in each table cell.  This is easily accomplished by creating an array containing all of the cells in the row and then taking the parts we want to work with from that array.  Here's the code:

for row in data_table.findAll('tr'):
    if row.find('td'):
        cols = [a for a in row.findAll('td')]
        records.append([
            cols[0].text,
            cols[1].text,
            cols[2].text,
            cols[2].a['href'],
            cols[3].text,
            cols[3].a['href'],
            cols[4].text,
            ])

Pulling the text out of the table cells is as easy as accessing the .text member. Two of the cells in each row also have links, which I wanted to capture, so I accessed those through the .a member and then read the href attribute, which is where links are stored in HTML.

Now, each row in our records array contains one row from the table with the cell contents and links separated out.  This is a good start to making this data more usable for my purposes. 

Next, I plan to do some data cleaning and then start to do some analysis on it to get a feel for what's available in the A4CA catalogue.

And finally, here is the entire program:

def read_a4ca_catalog():
    import urllib2, BeautifulSoup

    page = urllib2.urlopen('http://data.gov.bc.ca/data.html')
    soup = BeautifulSoup.BeautifulSoup(page)

    records = []

    data_table = soup.find('table',id='example')
    for row in data_table.findAll('tr'):

        if row.find('td'):
            cols = [a for a in row.findAll('td')]
            records.append([
               cols[0].text,
               cols[1].text,
               cols[2].text,
               cols[2].a['href'],
               cols[3].text,
               cols[3].a['href'],
               cols[4].text,
               ])
    return records

if __name__ == '__main__':
    for record in read_a4ca_catalog():
        print record

With Python and BeautifulSoup it's easy to extract data from a site and I would encourage anyone to give it a try.  It's easier than you might think.

Now that we have the data in a form we can work with, how do we clean it up and make it more useful?  I'll cover that in the next article of this series.

17 April 2010

Real Time Notification System

At OpenGovWest I had the opportunity to hear about and discuss a host of innovative ideas involving open data and open government.  One of the most impressive examples that was discussed was OneBusAway, an excellent service that provides real time arrivals of transit buses.  When Brian Ferris introduced himself it was clear the whole room, including me, thought his service was a shining example of what was possible when open data is given a chance.

Brian's app is great for folks who use public transit now, and it makes mass transit even more convenient than it already is, which goes some distance toward reducing carbon emissions. I got to thinking about that app and what else could be done with transportation and real time notification. What would it be like if grocery stores had rolling mini-marts that worked the same way and notified you, via an application or text message, when they were getting close? They could carry the basics (eggs, cheese, bread, milk, fresh fruits and vegetables) and you would be able to pop out to the street and get what you need. No more time and gas wasted, and again, less carbon.

What if, indeed.

Fast forward one month, and I experienced this system first hand. I found out that not only has this system been implemented, it's been in place for many years. Some small Mexican villages have a scalable, just-in-time goods delivery system, complete with real time notification. Goods ranging from bottled water to propane to fresh fruit and vegetables, fish, cheese and baked goods are transported throughout the city streets. Families and businesses are notified 2 to 5 minutes in advance of the precise goods that will soon be passing through the neighbourhood. The notification technology used is clean, inexpensive and emissions free, and all Mexican citizens and visitors are able to use this service for free.

It uses an oscillation of energy that moves through air, water, and other matter, in waves of pressure.  That's right.  Sound.

How does it work?

Vendors travel through the streets of the village either walking, by bicycle or in slow moving vehicles. As they travel along, they transmit, either verbally or through a recording, a sound that is unique to them. If it's a company, the sound might be their trademark; if it's an individual entrepreneur, she might have her own sound or she might simply announce what she is selling. Consumers of goods can hear these sounds from the streets, sidewalks or inside their homes, and because the sounds are distinct, they easily know what's coming. The sounds are also loud and the vendors travel at a low rate of speed, so folks typically have several minutes to get their money together and prepare to meet the vendor at the door, saving time, gas, money and the environment.

I think we sometimes get so caught up in our technology that we forget that there are often simpler, more basic solutions to some of the challenges we face today. And many times, given the chance, these solutions will evolve on their own, without any grand design or oversight. Instead of waiting for Apple to ship the next device, subscribing to a 3G cell plan, downloading the latest twitter client and then tweeting to my friends about what I am thinking, maybe I will just invite them to go for a walk or a bike ride so we can chat.

Adios amigos.