15 August 2010

Is your data copyable?

Of all the concepts that data users have to concern themselves with, copying and its legal ramifications have to be among the most important. All technology is a product of copying, but in the last 100 years media companies have invented and promoted the idea that copying is theft. It's one thing to have some difficulty and possibly incur some costs accessing and using data; it's an entirely different thing to have to worry about being prosecuted or sued for using it. Developers are rightly cautious about the liabilities associated with copying data.

The environment of caution around copying affects people releasing, promoting and using open data. If it's not perfectly clear to developers that they can use your data, they often won't, for this reason alone. Developers have enough to think about. If they have to wade through your license or terms of agreement to understand all the ramifications, they usually won't bother. If they do decide to read it, even one clause that looks risky is enough to drive them away.

There are lots of interesting project ideas and few developers who can make them happen. If you want people to use your data, it makes sense to make that process as easy and risk-free as possible. Fortunately, there are several ways to do that. But first, what exactly is the copyright protecting? When we talk about data in this context we are talking mostly about facts and collections of facts.

The interesting thing about facts when it comes to copyright is that it's generally accepted that facts themselves do not enjoy any copyright protection. The reason is that copyright generally applies only to creative works, and facts, by their very definition, are not creative. If I present a number to you as the truth, then I am saying that I didn't make up the number. At best I discovered it, but it already existed. If I present a number to you and say I made it up, in other words that I invented it from nothing, then it's not a fact, it's made up.

It's for this reason that the idea of licensing data strikes me as odd. To grant someone a license to do something they can already do is redundant. Attempting to restrict use by means of a license is equally strange. If someone already has the right to use something, then restricting it would require their agreement, and why would anyone agree to fewer rights than they already have?

I think the USA is on the right track here. My preference is that all non-personal government data be released as public domain. Public domain is easy to understand, and it fully opens the door to innovation, giving developers raw materials they can really use to create valuable apps.

11 August 2010

Is your data accessible?

Public servants with responsibility for publishing government data have decisions to make when it comes to making that data available to citizens on the internet. Along with readability and usability, which I have covered in previous posts, a third aspect of open data is accessibility.

The purpose of releasing data as open data is to enable people to use your data. To use that data they have to get it from your computer to their computer. There are a variety of ways to do that, but in 2010 that means the Internet and the HTTP protocol. If your datasets are very large, anonymous FTP would likely also be acceptable to many developers. However, HTTP is by far the simpler protocol to use. It has many advantages over FTP from a developer's perspective, and it is just as easy to set up from a publisher's perspective.

Accessibility is just as important as readability. Where poor readability imposes a one-time cost on developers, poor accessibility imposes an ongoing transactional cost. As a developer, I can write scripts to decode data provided in proprietary formats like XLS or SHP (so long as that's still legal in Canada - locks today, proprietary formats tomorrow?). It's still costly in terms of my time, but once a script is written, I can run it over and over again with no effort. Poor accessibility, on the other hand, sometimes means that I can't readily automate the process. If I can't automate it, then every time I want to use your data, I incur a cost in my time to manually download it. That may be fine for users who only want to download your data once. But for the developers you want to encourage to use your data as a platform for building valuable applications, it's a barrier they likely won't cross due to the high transactional cost.
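To make the automation point concrete, here is a minimal sketch of what fetching a dataset looks like when it sits at a plain HTTP address with nothing in the way. The function name download_dataset, the URL and the file paths are all made up for illustration; only the Python standard library is used.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url: str, dest: Path) -> Path:
    """Fetch the resource at `url` and save it to `dest`.

    `download_dataset` is a hypothetical helper, not part of any library.
    """
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so a large dataset never has to fit in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call with a placeholder URL (replace with a real dataset address):
# download_dataset("http://data.example.gov/datasets/parks.csv", Path("data/parks.csv"))
```

A script like this can be run from a scheduler every night at no cost in anyone's time. Put a login screen, an agreement check box or a mail back step in front of the file and a few lines like these are no longer enough.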

In some cases folks use login screens or mail back mechanisms to track who is accessing our government data. In some cases there are check boxes for so-called "agreements" or "contracts" that are meant to force people into some sort of agreement before they use our data. The worst, of course, are cost recovery models where we are forced to pay for our data twice: first as taxpayers and again as users.

Open data is emerging from an era where the status quo belief was that government data had to be locked down. Whether or not that was ever true is debatable, but it's clearly not true in 2010.

Open data is likely going to be re-purposed and distributed in different forms anyway, in ways that login screens and mail back mechanisms won't track. Once publishers realize that, the question becomes: what are they really measuring that they couldn't measure with just a simple, unobtrusive web site and access log?
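As a rough illustration of how much an ordinary access log already tells a publisher, the sketch below counts successful downloads per file. It assumes the web server writes the widely used Common Log Format; the function name count_downloads and the sample paths are hypothetical.

```python
import re
from collections import Counter

# Picks the path and status code out of a Common Log Format line such as:
# 203.0.113.7 - - [11/Aug/2010:10:15:32 -0700] "GET /data/parks.csv HTTP/1.1" 200 5120
REQUEST = re.compile(r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_downloads(log_lines):
    """Count successful (status 200) GET requests per path.

    `count_downloads` is a hypothetical helper written for this example.
    """
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group("status") == "200":
            counts[match.group("path")] += 1
    return counts
```

Passing the function an open log file, for example count_downloads(open("access.log")), gives a per-dataset tally without asking users for anything.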

When we as publishers talk about accessibility, as with many aspects of open data, it's useful to remind ourselves of the reason we are doing open data in the first place. Making data accessible means making it as easy as possible for developers to gain access to and download the data. Not so you can pass some test of "openness" (although there are good reasons to do that which I will cover in a future post), but so people use your data. You want people to use your data. That's the point.