20 July 2011

About the License

Yesterday while we were all celebrating the awesome work of the BC Provincial government, a very well respected Harvard researcher, software developer and open data advocate was arrested by the US Federal government on charges related to computer hacking, based on allegations that he downloaded too many scholarly journals that he was entitled to get for free.  He now faces a possible 30 years in jail.

I don't know Aaron, but I feel as though I do.  He writes brilliant software code and releases it to the world as completely free public domain software.  He advocates for open data, transparency and democracy in the US, and he is a founder of demandprogress.org, an organization dedicated to progressive policy changes for ordinary people.

Please consider visiting demandprogress.org and reading about what's happening to Aaron.

How does this relate to open data and the new DataBC portal?

It's about the license.

License: official or legal permission to do or own a specified thing.

It's strange that we need permission to use our own data and information.  It's strange that we sometimes have to pay to use data that we have already paid to have created.

Being an open data advocate and application developer comes with a certain level of risk and anxiety.  We are actively trying to do things that have not been done before.  And unlike other areas where innovation takes place, we are innovating in an area involving strange legalities, usually as individuals with no corporate backing or protection, and the consequences of making a mistake can be severe.  There is a lot of uncertainty.  Although many of us are open source developers and thus are pretty familiar with copyright law, licensing and the jargon that goes along with them, very few of us are lawyers.

This is why we want governments to use standard licenses.

By choosing to invent a new license rather than use an existing one the BC Government has added to the uncertainty.  Yes, they based it on the UK license, but it's clearly not the same as the UK license, otherwise they could have just used that.

Because they chose to invent a license I spent several hours last night poring over the license and comparing it to both the PDDL and the UK license to see where those differences are and to see what additional risk I might have to take on as a result.  Every developer I know will have to do the same thing now before they start using the data.

Many won't bother.

And that's the lost opportunity. People who get turned off by the custom license won't use the data, or won't bother coming to the hackathons, or won't bother creating that new app. It's just too risky. Sadly, we all lose, because as I understand it, the BC Provincial government is in this for the right reasons.  It's clear to me that they absolutely get it.  Innovative new ideas and applications will be generated as a result of this.  This increased transparency, engagement and collaboration with citizens will build trust and goodwill, and is good for the government and good for the people of BC.

From what I can tell, not being a lawyer, as a standalone license the BC Open Government License is actually mostly good (check out unrest.ca for some of the details on issues with the license).  And there are a handful of us that will push through this licensing thing, grumble a bit, say "it's pretty good" because it is, and weigh our risks and move forward with our apps and visualizations and innovations.

BC is seen as a leader in citizen engagement and open data by local governments, other provinces and internationally.  Looking at what's going on in the rest of the world, particularly in the US, we are really very fortunate to live where we do and to have the public service and leaders that we have.

I will encourage others to take the time to read the BC license rather than blowing it off because it's not standard.  And I will continue to urge local governments and other provinces to use a standard license rather than invent their own.

This is a great first attempt, and as Christy Clark said in her excellent and encouraging video, this is very much a work in progress.  The license does have a version number, which to me implies that they are open to input, discussion and changing it if necessary, which is awesome.

19 July 2011

Remembering "Open"

Today the British Columbia Provincial Government launched a new Open Data Portal, making thousands of our publicly owned datasets available to us including everything from employee salaries to historical school locations to Local Government Incorporation Dates to the data catalogue itself. As a citizen and taxpayer in BC and an open data advocate I am very excited to see my own Provincial Government take these steps toward innovation and transparency. I congratulate those public servants within the BC Government that understood the opportunity, recognized value and championed the cause.

These days, governments all over the world are starting to realize the value of "open", but it wasn't so long ago that we were at the opposite end of the spectrum. As a public servant in the 1980s, employed as a junior data analyst, I personally produced the monthly report for the minister and deputy minister showing the basic metrics of the ministry I was working for. As part of that job I was required to produce 5 copies of that report, place them in brown paper bags and tape them closed. I would then personally attempt delivery to each recipient's office. If the recipient or their assistant wasn't there to receive the report, I was required to take the report with me and try again later.

Those same metrics are still being used today and were released today as part of the Provincial Open Data portal. It's striking how far we have come in such a short amount of time.

Although we have gone through a very opaque phase with our governments, the idea of governments being open and transparent is not actually new. Our own BC Government has been publishing the public accounts and other financial information for decades. They have also produced a monthly publication called the British Columbia Gazette since the 1920s, and it is teeming with useful information about our province, from dispositions of Crown Lands, to election results, to tree farm licenses, to road name changes.

Technology has evolved since these publications were originally developed. Where at one time publishing this type of data on paper was about as usable as one could have imagined, these days it's available electronically, and hopefully soon it will be included as part of the open data portal.

So, although openness and transparency aren't new, they were definitely forgotten, and as I like to say, we are now remembering the value of "open".

Congratulations and a big "Thank you!" to our public service employees and political leaders who are helping to make this happen.

Now, I need to go to the portal and look for some data. :)

30 June 2011

Government as Platform - An Example

Tim O'Reilly uses the words "government as platform" to describe an interpretation of what Government 2.0 is. I am often asked what "government as platform" means. I think the question arises because much of government already operates as a platform. People don't distinguish it as something new because it's routine.

To explain what government as platform is, you can look at an example where it's already the norm such as our public roads. Our three levels of government are involved in the construction, maintenance and regulation of roads. Together they deliver infrastructure upon which we ride our bicycles and drive our cars.

We are used to the idea that we can hop on our bicycle or get in our car and transport ourselves on public roads. As a platform, the roads leave us to decide where we are going, when, and how to get there. The outcomes of the platform, the actual transportation conducted, are not determined beforehand. The roads are built to certain standards and the drivers are left to figure out how to get from point A to point B.

And because the system is open, people and companies can and do invent innovative new ways to use the platform. Because they are free to choose their route, people optimize the routes they use to get from point A to point B.  Inventions like automobile GPS are created to help with navigation.  Every year the entire vehicle industry (bicycles, cars, buses) releases new versions of their transportation products and introduces innovative new ones. Several huge industries can rely on the fact that our road system is a platform that they can build on.

And for travelers, the system works so well we can travel almost anywhere, and we're usually not even aware of which level of government is responsible for which roads. We can even travel to different countries, because all of the public roads work pretty much the same way. Done well, the platform becomes transparent. We don't even notice it.

I am convinced that this is what we need with our public data. Our data – all of it – made freely available on the internet, with standard licensing, in formats we can use, would provide a platform for innovation like we've seen with transportation. Entire industries could grow on such a platform, providing jobs and value we can barely imagine right now.

And though only a few people will use the data in the beginning, like the few people building cars and bicycles, those people will create huge value for their fellow citizens all built on a platform managed by our governments.

17 March 2011

Selling Data - Privacy Is Not The Issue

When data can't be released it's usually for one of two reasons, privacy or cost. Leaving cost for a future post, let's focus on the privacy issue.

We don't want our personal medical records released to the public for free or for cost, for example. Most people I talk to can agree on that. In fact, in BC, we have laws that state that personal information collected for one purpose cannot be subsequently used for a different purpose. There are exceptions, but that's the general idea.

Sometimes governments sell data to the public. Sometimes that data contains people's contact information. Sometimes it's about organizations or places. But, before it can be sold, this data typically undergoes rigorous processes and checks to ensure that no personal data is compromised.

For example, for $32 anyone can request a business name search from BC Registry Services to find out if a company exists. For "as low as $5,250" anyone can purchase a business site license for postal code address data from Canada Post. And, anyone can download a wide variety of key socioeconomic data from Statistics Canada - for a price.

How does this relate to open data? Well, sometimes, privacy is held up as a reason for why some data that is currently sold by governments, cannot be freely released as open data. As if, somehow, paying money for that data alleviates any privacy concerns. It doesn't, because there aren't any. If there were, not only could it not be made available as open data, it also could not be sold.

While there may be very good reasons why data that is currently for sale by governments cannot be made available for free, privacy isn't one of them.

10 February 2011

Requirements of Open Data


As more and more governments start to realize the benefits of opening up their data, sometimes I hear chatter about enterprise solutions, and though I understand the logic of hitching the wagon to the latest thing, I think there is a lot to gain by taking a strategic and pragmatic view of things. Open Data does not have to be an expensive exercise. In fact it can be very inexpensive from the data publisher's point of view.

There are three main requirements that have to be satisfied before data is considered "open data". They are:

1. Legal Framework - Anyone can use it for any legal purpose (PDDL or CC0 license)
2. Accessible - I can download it on the internet free of any mechanisms of control
3. Readable - it's published in a non-proprietary format

Of these three, the first requirement is most important and costs the least. And, in fact, without the first one there is really no point in doing the other two. If it's not legal for me to use the data then it doesn't matter what format it's in or whether I can get my hands on it... I won't use it.

The great thing is though that governments in some cases have already done the other two steps so the Legal Framework is all that's left to do. And, if they do that, then instantly and without any expensive technology, a raft of published data becomes "Open Data", ready to use.

For example, the City of Courtenay publishes a great RSS feed of "Surplus Equipment for Sale" but nowhere on their site does it say that I can use it. It's both accessible and readable, so it satisfies two of the three requirements, but the first requirement isn't met, so developers would likely shy away from using it. That's too bad, because an app that went around and gathered up this type of information from all local governments and made it available as a mobile app would be pretty cool and would help the local governments sell their surplus equipment.
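That aggregator idea is simple to sketch. Here's a minimal, hypothetical Python example, using only the standard library, of gathering "surplus equipment" RSS items from several municipalities. The feed URL is a placeholder, not a real endpoint:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Placeholder feed URLs; the real addresses would come from each
# local government's web site.
FEEDS = {
    "Courtenay": "https://www.example.ca/surplus.rss",
}

def parse_items(rss_text):
    """Return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def aggregate(feeds=FEEDS):
    """Fetch every feed and merge the listings into one list."""
    listings = []
    for city, url in feeds.items():
        with urlopen(url) as resp:
            for title, link in parse_items(resp.read()):
                listings.append({"city": city, "title": title, "link": link})
    return listings
```

Of course, none of this is possible unless the license question is settled first.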

The BC Government on the other hand has this page, which looks pretty much like it did over 10 years ago, and pages like this that basically say you can't use this data to make anything.

Here again, there are many examples of data that is both accessible and readable, but because of these pages, we can't use it. And that's unfortunate for both the government organizations that could benefit from the huge talent pool outside of government and for the citizens who pay for the data.

The good news is that governments already have a large amount of data online, and by getting the legal framework sorted out, citizens will instantly be able to use it to create innovative solutions and tools to help themselves and others.

04 January 2011

Book: Rework

I just finished reading Rework by the 37 Signals team (Jason Fried and David Heinemeier Hansson), a Christmas gift from my youngest son.  If you are not familiar with 37signals and their work I recommend checking them out.

The book provides an outline of the 37signals philosophy, including tips and opinion.  It is aimed at small businesses; however, I think everyone from small business to large business to government organizations should read this book and think about how it might be applied to their work.

They don't say this explicitly but what the book is really about is rethinking some of the dogma of business-as-usual.  I have often thought that any organization that has an "innovation" branch is already in trouble.  If you read this book you'll find out why.  An innovative culture isn't something you can install, or force directly.  Innovative cultures happen by consistently rewarding innovation.  Sharing cultures happen by consistently rewarding sharing.  Organizations that consistently treat employees as untrustworthy, end up with a culture of fear and lack of trust.

For examples of how to do it right in a government context, I think the Province of BC is on the right track with their new B.C. Government Social Media Guidelines and their public statement of Open Data as Defining Principle No. 1 of their Citizens @ the Centre: B.C. Government 2.0 strategy. The very fact that these documents exist sends a signal to employees of a relatively large organization that they are trusted and empowered to engage citizens and empower citizens to create value from open government data.  This kind of positive reinforcement will go a long way to creating a culture of learning and trust as people come up to speed with the new tools of social media and open data.  Kudos to the folks that made this happen and the executives that supported them.  I look forward to seeing what they do next.

Creating an environment where innovation happens, where sharing is rewarded, where great work is recognized and where trust is leveraged is the hallmark of an organization that gets it.  37signals definitely gets it.  Rework is an easy and worthwhile read.  If you're interested in innovation in the workplace, I recommend you read it.

01 January 2011

Shipped in 2010

One of my favorite authors of all time is Seth Godin.  I purchase and read every book he writes and often give them as gifts.  One of the things Seth talks about is "shipping".  We use this term to describe the act of completing a project, getting it out the door.  It doesn't matter if it was a hit or not, it just matters that it's done.

In a recent blog post Seth encourages people to publish their list of things they shipped last year, because it's not something we often do.  I encourage everyone to 1) make your own list; and 2) read Seth's blog.

Here is a list of things that I shipped last year:
  • Launched OpenDataBC.ca
  • Identified 149 BC datasets from all levels of government
  • Created OpenDataBC Google Group which now has 70+ members
  • Established Open Government conference for BC
  • Held two hackathons for the provincial Apps for Climate Action Contest
  • Created Waterly.ca which won two awards including Best in BC (Yay!)
  • Participated in Google IO and OSCON conferences
  • Spoke on Open Data at the Ideawave conference
  • Sat on an Open Data panel at the Global Knowledge eGov conference
  • Accepted a CTO position with an exciting new Canadian startup
  • Held the first annual Victoria International Open Data Hackathon
  • Developed DataZoomer version 4
  • 25 blog posts
  • 500+ tweets
This isn't the entire list of things I worked on.  I worked on many other things that either failed or that I didn't ship (yet).  I also didn't do this alone.  I was fortunate to be able to work with a bunch of dedicated and talented people this year.

2010 was a great year of learning and I look forward to an exciting 2011.

07 December 2010

Terms of Use

I am not a lawyer, and I do not give legal advice. I am a developer who uses open data on a regular basis and as such I spend a lot of time, but probably not nearly as much as I should, trying to understand open data licenses and terms of use that governments post on their open data portals.

The whole point of open data is to liberate data so it can be used. An organization's open data strategy then should be working toward that end result, encouraging and making it as easy as possible for people to use the data.

One of the first things that developers think about when contemplating writing an application using government data is, "am I going to get in trouble?" This question, however absurd it may seem, is very real to developers. If developers think they are going to get into some sort of trouble using a data source they will usually not create the app, which means governments and the citizens they serve lose an opportunity.

If governments are going to release data, the most important thing is to release it in a way that is easy to understand from a legal perspective, preferably in a way that developers are already familiar with. There are already many licenses in use, so inventing new licenses rather than releasing data under commonly understood mechanisms is a waste of effort on everyone's part.  As Chris Rasmussen said at the recent OpenGovWest BC Conference, "We all think that our data is unique.  It's not true."

Unfortunately, many custom licenses in use today are full of things that don't need to be there, like "you cannot break the law with this data" or "you can't say you're us".  I already know I can't break the law.  A disclaimer makes sense - but it doesn't need to be part of a license.  Being clear about preferences around attribution makes sense, but these can go in a policy statement offered for clarity rather than in a license.

In my opinion, the best license is no license at all. It's just public domain. Many government organizations consider their open data as public domain but don't go that extra step and actually state it on their web site. That's unfortunate because it's by far the simplest and easiest and least expensive way to release data and by not stating it explicitly on the web site, developers are still left wondering if they'll get sued.

  Herb's Ideal Open Data Declaration
   * This data is in the public domain.
   * It comes with no guarantees.

Please consult with lawyers that "get" open data.  See if you can go public domain explicitly rather than implicitly, or consider using the Creative Commons Zero tool, before liberating the data, and see if you can work together to make data we can all use.

13 November 2010

The Power of Open

The OpenGovWest BC conference is now complete. The day was filled with amazing speakers, amazing speaking formats and amazing topics.

One of the highlights of the conference was the talk given by Nick Charney (@nickcharney) and Walter Schwabe (@fusedlogic) where they talked to the audience about participation, and as part of their talk unveiled a blog where folks in the conference were encouraged to participate in real time, right there, while they were talking. Now, days later, blog posts are still being generated on http://www.opengovnorth.ca by individuals and the enthusiasm is still present.

Nick and Walter took a risk. They put the idea out there, provided a place for it to happen and then made a simple request for participation. Though they are both accomplished bloggers they didn't tell people what to write, or how to write it, and they didn't try to control the conversation. They shared their ideas generously and provided a space for expression.

When skilled speakers like Nick and Walter encourage audience members to share their ideas with each other in real time, while the talk is going on, they are engaging the participants in a vastly larger conversation. And when the talk they are giving happens to be about encouraging this type of engagement, then they are really leading by example, in an almost recursive way.

They also weren't trying to promote themselves, or their organisations, or trying to take credit for anyone else's work or building their brand.

No, they were just there as Nick and Walter, a couple of guys encouraging us to take a chance and move a little closer to the edge. Giving us a gift, expecting nothing in return. Within minutes the site was crashing because the server had exceeded its capacity.

In a closed model, people focus on controlling the message and dictating top down what is supposed to happen. This model is built on fear and lacks trust and although many results can be and are generated this way, communities are not. Contrast this with what Nick and Walter created.

I agree with others who have remarked that this particular conference has taken us from a great idea to a movement. I think there are several reasons for this and I want to acknowledge the lead organizer, Donna Horn (@inspiricity), who just like Walter and Nick, used an open approach with her expertise in community building and leadership to support, encourage and then trust the conveners and speakers to create their own parts of the conference. And the result was a level of enthusiasm from the conveners and the speakers that spilled over to everyone else in the conference.

That’s how a community is created and that's the power of open.

04 October 2010

Open Data vs Open Source

Open source and open data are two different things.  They are related only in that both are part of a larger current trend toward openness, and both happen to involve computers.

The temptation however is to treat them the same, and to pursue them both at the same time. In fact, I recently realized that I personally have been collapsing the two concepts. I was resisting proprietary software use in open data because the companies that produce the software have been so opposed to open source software.

However, to argue that governments should both "liberate public data" and "use open source software" is to confuse the matter. I personally would like to see both happen but I choose to focus on open data because I think it will provide immediate value for government.

Insisting that government use open source tools to produce that open data makes the issue unnecessarily complicated.  Governments are used to using whatever tools they are using and it's usually easiest for them to release data using their existing tools.

One of the great things about open data from the technical point of view is that it's really not very complicated. Governments have a myriad of technically complex data issues to deal with, but open data is not one of them. Pretty much any system that contains data can dump that data to an open format such as XML or CSV. The tools used to develop these systems come with this sort of support built in, and the hosting is not complicated.
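To illustrate just how little technology is involved, here is a minimal sketch, using only Python's standard library and invented records, of dumping tabular data to an open CSV format:

```python
import csv
import io

# Invented records, standing in for whatever a government system holds.
records = [
    {"school": "Example Elementary", "opened": 1912, "city": "Victoria"},
    {"school": "Sample Secondary", "opened": 1954, "city": "Courtenay"},
]

def to_csv(rows):
    """Serialize a list of dicts to CSV text, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(records))
```

Every mainstream platform has an equivalent of those few lines built in.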

Open data is complicated in other ways, yes.  Open data is a policy issue. It's also a communication issue. It's also an attitude issue. But it's not a technical issue.  No special software is required, no special technology is required, no special hosting is required and no special security is required.  Because in the case of open data we actually WANT people to get the data.

14 September 2010

Just the Facts Please

I am an advocate for releasing data in as raw a state as possible. Spending time on visualizations or any other kind of presentation, including maps, is one of the biggest wastes of time and money in the open data realm. Here's why: when you release visualizations, people can only look at your visualization. When you release raw data, people can do an infinite number of things, including create visualizations, combine it with other data, create applications, and more.

Rather than working on "presentation" or "value added" services when preparing to release public facts as open data, I encourage public servants to just get the facts out there. If a specific graph or visualization is requested regularly and the budget is available to pay for it, then fine, I support that. But to spend resources on specific visualizations or maps that no one has asked for, and then to release it in the name of "open data", is a waste when it's much simpler to just make the raw data available to the community so they can create their own visualizations.

Our public bodies have vast stores of our data but scarce money and time to release it. Instead of trying to add value in the form of graphs, visualizations, detailed analysis or maps I would much rather see them release more data in machine readable open file formats that everyone can use. Releasing data is the one part that the citizens must rely on the public servants to do. The rest, citizens can do themselves, given the opportunity.

15 August 2010

Is your data copyable?

Of all the concepts that data users have to concern themselves with, copying and the legal ramifications of copying has to be one of the most important. All technology is a product of copying but in the last 100 years media companies have invented and promoted the idea that copying is theft. It's one thing to have some difficulty and possibly incur some costs accessing and using data, it's an entirely different thing to have to worry about being prosecuted or sued for using data. Developers are rightly cautious about the liabilities associated with copying data.

The environment of caution around copying affects people releasing, promoting and using open data. If it's not perfectly clear to developers that they can use your data, they often won't, for this reason alone. Developers have enough to think about. If they have to wade through your license or terms of agreement to try to understand what all the ramifications are, they usually won't bother. If they do decide to read it, then even one clause that looks risky is enough to drive them away.

There are lots of interesting project ideas and few developers that can make them happen. If you want people to use your data, it makes sense to make that process as easy and risk free as possible. Fortunately, there are several ways to do that, but first, what exactly is the copyright protecting? When we talk about data in this context we are talking mostly about facts and collections of facts.

The interesting thing about facts when it comes to copyright is that it's generally accepted that facts themselves do not enjoy any copyright protection. The reason is that copyright generally applies only to creative works, and facts, by their very definition, are not creative. If I present a number to you as the truth, then I am saying that I didn't make up the number; at best I discovered it, but it already existed. If I present a number to you and say I made it up, in other words I invented it from nothing, then it's not a fact at all.

It's for this reason that the idea of licensing data strikes me as odd. To grant someone a license to do something they can already do is redundant. Attempting to restrict use by means of a license is equally strange. If someone already has the right to use something, then to restrict it would require their agreement, and why would anyone agree to fewer rights than they already have?

I think the USA is on the right track here. My preference is that all non-personal government data be released as public domain. Public domain is easy to understand and it fully opens the door to innovation giving developers the raw materials they can really use to create valuable apps.

11 August 2010

Is your data accessible?

Public servants with responsibility for publishing government data have decisions to make when it comes to making that data available to citizens on the internet. Along with readability and usability which I have covered in previous posts, a third aspect of open data is accessibility.

The purpose of releasing data as open data is to enable people to use your data. To use that data they have to get it from your computer to their computer. There are a variety of ways to do that, but in 2010 that means the Internet and the HTTP protocol. If your datasets are very large, using anonymous FTP would likely also be acceptable to many developers. However, HTTP is by far the simpler protocol to use. It has many advantages from a developer perspective over FTP and it is just as easy to set up from a publisher's perspective.

Accessibility is just as important as readability. Where poor readability imposes a one-time cost on developers, poor accessibility imposes an ongoing transactional cost. As a developer, I can write scripts to decode data provided in proprietary formats like XLS or SHP (so long as that's still legal in Canada - locks today, proprietary formats tomorrow?). It's still costly in terms of my time, but once written, I can run that same script over and over again with no effort. Poor accessibility, on the other hand, sometimes means that I can't readily automate the process. If I can't automate it, then every time I want to use your data I incur the cost of manually downloading it. That may be fine for users who only want to download your data once. But to the developers that you want to encourage to use your data as a platform for building valuable applications, it's a barrier they won't likely cross due to the high transactional cost.
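As a concrete, hypothetical illustration of why a stable HTTP address matters: once a dataset has one, fetching the latest copy is a few lines of Python that can run unattended on a schedule. The URL below is a placeholder, not a real government endpoint:

```python
from urllib.request import urlopen

DATASET_URL = "https://data.example.gov.bc.ca/schools.csv"  # hypothetical

def fetch(url=DATASET_URL, dest="schools.csv"):
    """Download one dataset to a local file.

    Safe to re-run as often as needed, e.g. nightly from cron,
    which is exactly the kind of automation a login screen or
    click-through "agreement" makes impossible.
    """
    with urlopen(url) as resp:
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    return dest
```

No login screens, no mail-back forms, no check boxes - just a URL.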

In some cases folks use login screens or mail back mechanisms to track who is accessing our government data. In some cases there are check boxes for so called "agreements" or "contracts" that are meant to force people into some sort of agreement before they use our data. The worst of course are cost recovery models where we are forced to pay for our data twice. First as taxpayers and again as users.

Open data is emerging from an era where the status quo belief was that government data had to be locked down. Whether or not that was ever true is debatable, but it's clearly not true in 2010.

When publishers realize that with open data, the data is likely going to be re-purposed and distributed in different forms anyway, in ways these methods won't track, what are they really measuring that they couldn't measure with a simple, unobtrusive web site and an access log?

When we as publishers talk about accessibility, as with many aspects of open data, it's useful to remind ourselves of the reason we are doing open data in the first place. Making data accessible means making it as easy as possible for developers to gain access to and download the data. Not so you can pass some test of "openness" (although there are good reasons to do that which I will cover in a future post), but so people use your data. You want people to use your data. That's the point.

22 July 2010

Thoughts from OSCON 2010

I am currently attending OSCON 2010 (Open Source Conference) in Portland Oregon.   It's a conference for free and open source software enthusiasts, developers, hackers and users of all levels.  There are about 5,000 people attending  this year.  I have met a lot of people here.  Some who are passionate about free software, and some that are learning more about it and how it can provide value to their companies.

It's difficult to overstate the impact that free and open source software (FOSS) has had on computing and the world in general.  First, of course, it powers the Internet itself.  If you use the Internet, you use free and open source software.  From the underlying protocols to email to FTP to web sites, it's all powered by free and open source software.

Practically every major web site you can think of (Google, Facebook, Wikipedia, Twitter, Foursquare, Google Maps, ... ) make heavy use of free and open source software.  These companies measure traffic in many millions of users and billions of pages per month.

The Apache Web Server for example has been the most popular web server since April 1996 and powers almost 70 percent of all websites on the planet.  There are free and open source operating systems, programming languages, office productivity suites, collaboration suites, web browsers, file and print servers and much more.  There is a free and open source version of practically any software you can think of (and many that you haven't thought of).

And yet, here we are in 2010 and some are still not convinced that open source is suitable for government use.  They are not convinced that this software developed by communities of generous and smart people is reliable and secure or supported enough for their purposes compared to proprietary solutions such as Internet Explorer.  They put all of their trust in single vendor solutions and rely on companies like Microsoft and Oracle, and believe the stories told by such companies about open source software... that story goes something like this:  "It's not enterprise ready... it's of varying quality... there is no support for it... you want to have one throat to choke."

Why aren’t governments using open source software anywhere and everywhere possible?  Why do governments continue to seek out solutions with lock-in to certain vendors?  Why would we continue to believe the big vendors that promise to be nice?  Why do we citizens continue to pay millions upon millions of dollars for software?

Governments are unlike other corporations in that they are making decisions not for their own benefit, but for the benefit of us, the citizens. They don't take that responsibility lightly, so decisions are made with great care and they often don't give themselves permission to try new things - or if they do, they do THAT with great care and concern because they don't want to make any mistakes with our resources. Trying something innovative is seen as risky, and so the status quo is long-lived and new approaches are discouraged.

Governments appear to be the last hold out of proprietary software and as a result, are missing out on an opportunity to engage with and support the communities that support all of us.  The rest of the world has figured out that free and open source software is the most secure, the most reliable, most innovative and the most cost effective software available.  Leading internet companies that earn millions of dollars in revenues and could choose anything they want for their software needs are choosing open source software.  We should let our governments know that we want them to choose free and open source software too.

The problem with free and open source software is this:  It's hard to make a lot of money with free software.  And, without a lot of money you can't own a public relations team and you can't spend a lot of money on armies of sales people and technical sales people with pre-written business cases and white papers and other collateral convincing people to use your products.  Without a lot of money, you can't schmooze and throw hosted year end parties for your key clients in every major city.

Instead, with free and open source software, you put everything into the product and let the product speak for itself.  You assume that people actually want things to work better.  You build communities of people who are passionate about your product - not because it makes them look good - not because it's easier – not even because it's free - but because it provides exceptional value.

07 July 2010

Is your data readable?

In talking with clients and colleagues about open data and open government, this is the one question that comes up over and over again. The word “data” means a collection or body of facts that represent the qualitative or quantitative attributes of a variable or set of variables, but what does “open data” mean?

To answer this question I like to look at what we are trying to achieve by opening data. The promise of open data is that if we make government administrative data available to the public, value will be created in ways that we may or may not be able to imagine. The value will be created by using the data. So, what is open data? Ultimately, it’s data you can use. In this series of blog posts I will explore the various ways data can be made more usable.

What makes data usable?

In a previous post I proposed some dimensions that move toward a usability scale. In this post I propose a minimum standard of usability. In other words, what are the absolute minimum requirements that must be satisfied in order to consider something open data? To answer this question one could look at the dimensions of usability individually and decide for each one, what would be the minimum level of usability below which data is not usable.

One of the main measures of usability is readability.  In other words, how easy is it to read?

For example, this list of cities with their geographic areas and populations is data:


Data collected into rows and columns in this way is typically called a data set (or dataset). By putting this dataset in my blog post as a table I have made it available to you, but because I made it available as a screenshot of my spreadsheet, reading it would be difficult, error-prone, and would require expensive software or scripting. That makes it pretty much unusable by you.

Another method in use by governments today is to publish data as a PDF-formatted document. This is marginally better than posting an image. It’s technically possible to extract the data from PDF files, as I have demonstrated in a previous post, but it’s still expensive, time-consuming and error-prone.

What I could do instead is make that same data available as an HTML table in this blog post, like this:




City        Area     Population
Victoria    19.68    78057
Vancouver   114.67   578041
Kelowna     211.69   120812


Technically, this is a level better than both images and PDF files, but it will still get me low points on the usability scale because in order to read it a programmer still has to write a script specifically for reading this data from my blog post, a time-consuming and wasteful process. If you’re unfortunate enough to need to read data from an HTML page, another previous blog post describes how to do this.
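For illustration, here is roughly what that throwaway script looks like, sketched with Python 3's built-in html.parser (the table snippet is inlined here rather than fetched from the page):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # start a fresh row
        elif tag in ("td", "th"):
            self._in_cell = True      # only keep text found inside cells
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

page = """<table>
<tr><th>City</th><th>Area</th><th>Population</th></tr>
<tr><td>Victoria</td><td>19.68</td><td>78057</td></tr>
</table>"""
p = TableParser()
p.feed(page)
print(p.rows)  # [['City', 'Area', 'Population'], ['Victoria', '19.68', '78057']]
```

It works, but notice that the parser is welded to one page's markup; change the blog template and the script breaks. That fragility is the usability cost of HTML as a data format.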

To really improve the usability of this data it makes sense to publish it in a format designed to make the data easily accessible. Many people are familiar with spreadsheets, which are a popular tool for reading and manipulating tabular data, so making data available in spreadsheet format makes it more usable in the sense that people can obtain spreadsheet programs to read the tabular data. For example, here is the same data published in the open .ODS format, supported by a wide variety of spreadsheet software, and here it is published in the XLS format, a proprietary format controlled by the Microsoft corporation.

The advantage to publishing in spreadsheet format is that while still requiring specialized scripts and software to read, at least the rows and columns are well defined which translates into fewer errors.  This is what I would consider the minimum bar for usable open data.  It's not as usable as I would like, but it is usable without too much risk.  In other words, if you have data in this format already and you don't have the budget to reformat it before publishing it, don't delay the release, just publish it as is.

Ideally though data is published in formats specifically designed for the purpose of information sharing, and that’s where the CSV, XML and JSON formats come in.

The CSV version of my dataset looks like this:
"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812

The XML version looks like this:
<dataset>
 <data>
  <row><city>Victoria</city><area>19.68</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>

and the JSON version looks like this:
[
 {"city": "Victoria", "population": 78057, "area": 19.68},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]

While not quite as pretty as the other human-readable formats, CSV, XML and JSON are open formats that provide structure, making it very easy for programs to read the data. They are also well supported in almost all modern programming languages, so any programmer who wants to use your data can do so easily and accurately with free software and very little programming. And as a side benefit, it's very easy and inexpensive to publish your administrative data in these formats using free software.
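As a rough sketch of how little code each format demands, here are the three versions above parsed with nothing but the standard library (shown in modern Python; the 2010-era equivalents differ only in module names):

```python
import csv, io, json
import xml.etree.ElementTree as ET

# CSV: the csv module handles quoting and splitting for us
csv_text = '"City","Area","Population"\n"Victoria",19.68,78057\n'
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1][0])       # Victoria

# XML: ElementTree understands simple child-path expressions
xml_text = ("<dataset><data><row><city>Victoria</city>"
            "<area>19.68</area><population>78057</population></row></data></dataset>")
root = ET.fromstring(xml_text)
print(root.find("data/row/population").text)

# JSON: one call, and the numbers arrive already typed
json_text = '[{"city": "Victoria", "population": 78057, "area": 19.68}]'
print(json.loads(json_text)[0]["population"])
```

Three formats, one line of real parsing work each. Compare that with the custom parser an HTML table or a PDF demands.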

Publishing data in these open formats makes it easy for people to use open data. Publishing in HTML format is readable and is what I would consider the bare minimum for usability, but depending on how it is done, other formats can make things much easier. If your organization is serious about engaging people to collaborate and create value from the data, you will want to make the data as usable as possible, and making the data readable is one part of doing that.

23 June 2010

OpenDataBC: Toward A Data Usability Scale

I am currently involved in a project named OpenDataBC.  OpenDataBC is an open platform for government data sets and APIs released by governments in British Columbia.  It makes it easy to find datasets by and about government, across all levels (provincial, regional, and municipal) and across all branches. The catalogue entries are both entered by hand and imported from multiple sources, and the catalogue is curated by our team of volunteers.

Being a site called "OpenDataBC" you would think it would be pretty straightforward to put such a site together. Take the available catalogues from Nanaimo, Vancouver and the province and stick them together and voila, a catalogue is born. But, it's actually not that easy. The site is named OpenDataBC because we wanted to pay particular attention to "Data" that is "Open" that originates in or is about "BC", and for that we have to be a bit more careful about how we put it together.

The definition of "open" as it relates to data is still evolving at a rapid pace.  In its ideal form, what we mean by open data is:
Open data is data that you are allowed to use for free without restrictions.  Open data does not require additional permission, agreements or forms to be filled out and it is free of any copyright restrictions, patents or other mechanisms of control.
By this definition, there is very little open data available today.  Rather than soften the definition of open, we think it's useful to promote the use of data that's been released, acknowledging data that is more open (doing the right thing) while encouraging data that is less open to evolve.

Our goal is ultimately to facilitate the process of making more BC data available in a form that people can use. To that end OpenDataBC will highlight the most usable datasets that we can find.  For that we need some sort of usability ranking or scale, which right now does not exist, so we are inventing it. Here I present some questions to consider when assessing the usability of data being released. It's a starting point and we expect it to evolve.

1. Is it machine readable electronic data?
Although technically a scanned image of a map with gold stickers pasted on it is data, it is not something that a programmer can use.  What we look for is machine readable data.  Documents or electronic files containing data published in formats that a software program can read easily and consistently without errors are considered machine readable.  Databases, spreadsheets and CSV files are all examples of machine readable electronic data that are easily read, and thus more usable.  PDF files, Word documents and scanned images, while technically readable by a software program, are not easy to read and are time-consuming to process, thus less usable.

2. Is it accessible?
I should be able to get it easily over the internet.  I should be able to get it on demand, with a simple program using open source software.  I should not have to submit a form to get it.  I should be able to enter a URL and in return I get the data.

3. Is it published in an open format?
From wikipedia: "An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical licenses used by each. In contrast to open formats, proprietary formats are controlled and defined by private interests."

4. Is it free?
In this context, I mean am I free to use this data however I want?  Can I use it to produce a product that I sell?  Can I combine it with other data and publish it?  Can I sell a copy of it?  Data that puts any sort of restrictions on the ways in which the data can be used, or imposes any conditions or constraints on the user, is not free.  For example, if  I have to enter into an agreement to use it, it's not free.

5. Is it released under a common license?
Data that is released under a common license, such as a Creative Commons license or one conforming to the Open Knowledge Definition, is preferred over data released under a license created by the releasing party, because licenses are hard to understand.  The more time people have to spend understanding the license in order to use the data, the less usable the data is.  Common licenses address this problem because once the license is learned for one dataset, it is understood and can be applied to other datasets released under the same terms.

6. Is it provided without a fee?
The data needs to be available at no cost to the user.  If it costs money, it's less usable and it's not open data.

7. Is it complete?
Data should not be missing values that ought to be there.  If it's point-in-time data it should include all of the relevant information for that point in time.  If it's time series data, it should include the entire time series from the first record to the most recent record.   If the data is about a geographical province, region or city, it should include the entire province, region or city and not leave out some geographical part of the data.

8. Is it timely?
The data should have the most up to date information as soon as it is available.  Ideally the data is available as an updated feed or at least updated on a regular schedule.  If the data is a feed, it should be available in as near real time as possible.

The plan is to add to this list and to refine the questions as we move along and gain experience with it. By applying a standardized set of questions, users who come to the site will be able to easily determine what they might be up against if they want to use data in the catalogue. More usable data will thus be featured more prominently, and less usable data will be identified as such so the issues contributing to its less usable status can be addressed.

Please let me/us know if you think we're missing something or if something here needs adjusting.

18 June 2010

Single Points of Failure

As I write this in 2010, our political, economic, cultural and social systems in the western world are for the most part driven by corporations.  Our art is subsidized by corporations, our charities are funded by corporations, our culture is promoted by corporations and our laws are defined by corporations.

These corporations come in various forms, whether they are businesses, governments, or religious corporations.  In many ways the corporate form is very useful.  It's one way to provide a structure for people to work toward a common goal.  It provides some level of predictability.  And in some cases, it provides for economies of scale.

Much has been written about the weaknesses of the corporate form, and the corruption it attracts so I leave that to others to draw attention to but there is one aspect of the corporate form that I don't see being written about, and that's the fact that they represent a single point of failure.  Big corporations have big failures.  We permit corporations to grow infinitely large and then rely on them not to fail.  But they do fail, we know they fail, and we permit it anyway.

In the enterprise software space the "one throat to choke" mantra is used to persuade the listener that putting all of your eggs in one basket is a good thing. What it hides is the fact that the vendor represents a single point of failure, and when it's a large project, that often means a large failure.  The high failure rate of large IT projects is well known, but how often are these throats actually choked?  Almost never.

If these large organizations never failed, that would be one thing, but the fact is, they do fail.  And somehow when we are voting, or shopping, or doing our cost-benefit analysis and selecting the vendor, we forget that the entity we are dealing with may not be around tomorrow.  The "bet the farm" ideology invented hundreds of years ago and popularized during the industrial revolution is showing its age in our current distributed, global world.

When the costs of communication were high it made a lot of sense to build organizations as hierarchies to minimize the costs of communication through a top down pyramid and command and control theme. This model was so efficient that it offset the risks of the single point of failure. Today though, we have the internet and mobile phones minimizing those costs for everyone so the pyramid isn't adding as much value as it used to and the cost of the single point of failure is still there.

There are some structures and strategies though that can help with this.  They are used in organizations and projects that are designed with failure in mind.  These organizational models admit from the outset that there will be failures and rather than pretend that all is well and there will be someone to choke if anything goes wrong, they make failure part of the equation.

With the BP oil disaster destroying the Gulf of Mexico, the failure of 247 US banks and millions unemployed due to the economic meltdown, people are starting to wake up to the enormous risks and costs of these single points of failure.  Simultaneously, open source software, open data, grassroots communities and cooperatives are becoming increasingly popular as people look for alternative ways to get things done.

Smart companies, governments and other organizations are letting go of "command and control" and are discovering game-changing philosophies based on engagement and collaboration that give them an edge that is, not surprisingly, almost non-existent in the traditional corporate form.

Collaboration, gifting and doing things for the sheer joy of working and contributing to the world and enhancing the quality of life of others are being rediscovered. And while we speak of these things as new, they are as old as civilization itself and were here long before the corporate form and will be here long after.

20 May 2010

OpenDataBC: Extracting Data from A4CA PDFs

In this OpenDataBC series of posts, I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. In the first article of this series, I described how to write a script to scrape catalog data from web pages. In the second article I described how to write a program to transform the data. In this article, I describe how to convert a PDF document into usable data.

As part of the Apps for Climate Action Contest, the Province of BC released over 500 datasets in the Climate Action Data Catalogue. It's an impressive amount of data pulled from an array of sources both within BC and elsewhere.

In an ideal “open data” world, all of that data would be in an easily machine readable format that we could use to write programs directly. While that would be great, the reality today is a bit different. Much of the data that is made publicly available these days is in formats that are harder to use. For example, some of the data in the Climate Change Data Catalogue was released in PDF format. PDF is a proprietary format, meaning the format is controlled exclusively by one party, in this case the Adobe corporation.

An interesting fact is that it takes extra effort to get data from its raw form into PDF format. In other words, to publish data in an open format rather than in PDF format actually saves time, effort and money – up front. However, PDF became well established in the pre-open world, so a lot of data is already published using it. To switch existing software to publish in an open format might take time. As a result, at least temporarily, we need to find ways to get at the data in the PDF files.

In this post I describe how to do that. Looking through some of the available datasets in the catalogue, one that I find interesting is “Transit Ridership in Metro Vancouver”. The data is produced by Translink and is in a PDF format and looks like this:



What I am interested in is the number of passenger trips by year for the past few years. I am going to leave out the Seabus and the West Coast Express as I am mostly interested in the buses and the Skytrain.

What I would like is a dataset, in a CSV file. The way this program will work is essentially as follows:

  • read the PDF file from the source web site
  • extract the data from the PDF file into a list in memory
  • write the list in memory out to a CSV file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on all Linux and modern Mac machines and can be easily installed on Windows.

The Code
The first thing we need to do is read the PDF file into memory. The simple way to do that in Python is to use the urllib2 library and read the entire PDF from the original web site. Tying the script to the actual location of the file means we don't manually store the original file anywhere. If Metro Vancouver decided to move the URL we would have to adjust our code, but we're probably only going to run this code once so it's not a big deal. To read the PDF file into a memory variable we do this:

import urllib2
url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
pdf = urllib2.urlopen(url).read()

Now that we have the PDF file in memory, I want to parse it and turn it into raw text. To do this I use a free open source Python library called pdfminer. I have created a function called pdf_to_text for this purpose. Here's the function:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.pdfdevice import PDFDevice 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t

The pdf_to_text function starts by importing the components required to do the conversion. The pdfminer library provides a lot of functionality. In this example we are using a small fraction of its functionality to do what we need, which is to get at the content in the PDF. The main function that actually does the work is called process_pdf. It takes a PDFResourceManager object, a TextConverter object and a file object as parameters so the code before that call is setting up those parameters properly. I use a StringIO object rather than just passing the urllib2 object in because the PDF converter needs to use the seek method for random access which is not supported in urllib2. To gain this ability I put the data into a StringIO object, which supports seek.
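The same trick carries over to today's Python 3, where the in-memory file type for binary data is io.BytesIO; a small sketch of the idea, with stand-in bytes rather than a real PDF:

```python
import io

data = b"%PDF-1.4 ..."        # stand-in for the raw bytes read over HTTP
fp = io.BytesIO(data)          # wrap them in a seekable in-memory file
fp.seek(0, io.SEEK_END)        # random access now works,
size = fp.tell()               # so we can measure the payload...
fp.seek(0)                     # ...and rewind for a parser that needs seek()
print(size)                    # 12
```

The socket-like object returned by the HTTP library only supports a forward read; wrapping the bytes this way is what gives the PDF parser the random access it requires.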

When the pdf_to_text function is called with the contents of a PDF file it returns a string containing lines of text with each line containing one of the elements (numbers or labels) of the PDF file. Here's what it looks like on my system:



Now that we have the data in text format, we want to pull out the numbers that we are interested in. I am interested in the labels on the left, which start on line 6, the first numeric column (BUS), which starts on line 75 and the second numeric column (SKYTRAIN), which starts on line 144.

To start the process of extracting rows of data from the text file, I first split lines of the text file into a list like this:

lines = text.splitlines() 

Then I create a simple function called grab_one_row which besides having a very clever name, knows the relative placement of the three columns, and pulls one whole row at a time from the text file and returns it as a tuple. Here is the function:

def grab_one_row(lines,n): 
    return (lines[n],lines[n+69],lines[n+138]) 

Armed with that function, I can now collect most of the rows I am interested in with a simple generator line:

rows = [grab_one_row(lines,i) for i in range(6,26)] 

In the original PDF, the data for 2008 is placed further down the page so the last line needs to be added with a separate line of code like this:

rows.append(grab_one_row(lines,39)) 

Now the rows list contains all of the data we are interested in, in a form that we can easily deal with. We just need to write it out to a CSV file to complete our work. To do that I created the rows_to_csv function. Here it is:

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 

I wanted the resulting CSV file to have numbers rather than strings containing numbers for the numeric values. The last line of this function strips out the commas that were in the numbers in the PDF file and then converts the text to a long integer to be written to the CSV file.
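That conversion step is small enough to show in isolation (in modern Python, int has absorbed the old long type):

```python
def to_number(figure):
    """Turn a comma-grouped figure from the PDF, e.g. '578,041', into an integer."""
    return int(figure.replace(",", ""))

print(to_number("578,041"))  # 578041
```

Writing real integers means anyone loading the CSV later gets numbers they can sum and chart immediately, instead of strings with embedded commas.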

The resulting CSV file now looks like this:



This result is a lot easier to deal with than the original PDF file. Arguably, a small file such as this could also be converted with the OpenOffice spreadsheet by copying from the PDF and pasting into the spreadsheet. The nice thing about scripting it as above is that we can use the same technique for very large PDF files that would be too onerous to convert manually.

Here is the entire program with all of the code together at once:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.pdfdevice import PDFDevice 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t 
    
def grab_one_row(lines,n): 
    return (lines[n],lines[n+69],lines[n+138]) 

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 

def run(): 
    import urllib2 
    url         = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf' 
    outfilename = 'translink_bus_skytrain_trips_1989_2008.csv' 
    
    pdf = urllib2.urlopen(url).read() 
    text = pdf_to_text(pdf) 
    
    lines = text.splitlines() 
    rows = [grab_one_row(lines,i) for i in range(6,26)] 
    rows.append(grab_one_row(lines,39)) 

    rows_to_csv(rows,outfilename) 
    
if __name__ == '__main__': 
    run() 

and you can find the resulting CSV file here.

Once again, Python comes through for us. Clearly it's not as easy to convert a PDF file as it is to rip a table out of an HTML file, but being possible at all makes it something we can work with. And part of the beauty of “Open” is that now that I have done it, others don't have to. And I in turn will benefit from other contributors to the open ecosystem. If we all do a bit, it's an “everyone wins” scenario.

17 May 2010

Facebook Steps Out of the Way

Like most people, I was a bit surprised by Facebook's recent changes with regard to privacy.  I don't think they have done anything wrong, but as a user I allowed myself to be lulled into a false sense of security.  Like most people, I believed that they wouldn't mess with the privacy settings of my account much, allowing me to control who got to see the personal information I put on Facebook, on my terms.  I had agreed to their user agreement, which stated they could change the terms at any time, but I didn't really pay attention to that fine print.

When they made the more recent change to allow my friends graph and my content to be harvested I realized something.  Not that Facebook is evil or bad, but that they offer a service that I thought was one thing, but it is something else.  I thought it was a way for me to connect with my friends and share my data with them, but actually it is a way for Facebook to profit from our personal data.  Or, as Tim Spalding so eloquently put it, "Why do free social networks tilt inevitably toward user exploitation? Because you're not their customer, you're their product."

For me it's not a big deal that my Facebook content is now available to anyone; I don't store anything particularly private there anyway.  But now my behaviour has changed, and I find myself using it even less than I did before.  Not so much because of the loss of control over my data or the fact that they didn't give me a cut, but because this is a company that is volatile with respect to its user policies.  Frankly, I just don't want to put in the time required to keep up with their changes.  So I take my privacy into my own hands and limit what I place on Facebook.

On the other hand, I find the recent changes to Facebook pretty exciting.  I think an awesome opportunity has opened up now for a service to emerge that allows people to connect with their friends and at the same time protects their privacy.  400 million Facebook users sharing information is a testament to the fact that people want to connect online.  The recent outcries and Facebook account deletions point to the fact that people also value privacy.  In other words, there is clearly a market for connecting AND protecting privacy.

Facebook doesn't offer that service, but until now people were not sure whether they did or didn't.  And that ambiguity prevented other firms from offering it, because competing with Facebook was just a non-starter.  Now, thanks to Facebook's recent changes, it's clear.  They don't.

And in that gap between what people want and what is available, lies opportunity.

It's now clear, Facebook is not in the privacy business.  By stepping out of the way they make room for others that want to offer privacy as a key value proposition.

Personally, I would like to see a new type of social platform emerge.  Something taking ideas from status.net and webfinger and Diaspora.  I want a distributed social platform that I can host with any hosting provider that would allow me to connect with my friends. The difference is, I own it.  So long as my friends and I have this system installed somewhere, our systems would talk to each other seamlessly.

This way we would not be dependent on, or at the mercy of, any one vendor or their privacy policy changes.  We would be able to move our accounts to another host of our choosing, anytime we want, and we would be able to lock down or even delete our data at any time.

The software would be free and open source as well.  If anyone wanted to add some functionality or just contribute, that would be possible too.  Think self-hosted WordPress, but with social networking instead of blogging.

Rather than one massive "walled garden" users would each have their own garden in a community with other garden owners.  They would be able to share their data with whomever they choose.

I am grateful to Facebook and everything it has done to connect people.  It's truly an awesome service.

Ultimately though, it IS my data.  And I still want to share it on my terms.

11 May 2010

Innovators' Isolation

One of the things that all innovators face at some level is a sense of isolation. By definition innovators are working on things that have never been worked on before. And they make up only about 2.5% of the population. If they participate in a specialized industry, it's pretty unlikely they'll get to work with other innovators in their field, never mind find someone who understands what it is they're so passionate about.

About four years ago I made a decision to attend four conferences per year. I consider it part of my ongoing professional development in two ways:

  1. Training - there are no formal training courses available for the skills I need for my work. I read any books that are available. Conference workshops and sessions provide me with the most current and relevant information about what other innovators are doing in my field.
  2. Connecting - when I attend a conference, I consciously choose to meet and connect with other attendees and presenters. I often get to connect with the inventors of some of the most exciting new technologies. It's typical to find folks exchanging ideas, talking about what we've actually done, what worked, what didn't work, and what we're thinking about for the future.  There are people who talk and there are people who ship. These people ship.

Next week I will attend Google I/O and later this year I will attend OSCON and possibly FOWA. Although I don't expect to find a workshop at these conferences covering specifically what I am currently up to (such as privacy-enhancing distributed enterprise service topologies, dataset forking technologies, or probabilistic data linkage techniques), I do expect to find lots of people interested in the cutting edge of whatever it is they're passionate about.

I expect to find people who are daring to think differently. I will share my crazy ideas. I will hear other folks’ crazy ideas. I expect we'll have thoughts about each other's crazy ideas and I expect that after all that, we will acknowledge some of the ideas as crazy or just dumb. And some will seem not quite as crazy as they did before.

But more importantly, we will get the sense that we're not the only ones with crazy ideas, and we're not the only ones that have no idea what's on TV anymore, because we are working on something that we think is cool and could possibly even change the world.