22 July 2010

Thoughts from OSCON 2010

I am currently attending OSCON 2010 (the Open Source Convention) in Portland, Oregon. It's a conference for free and open source software enthusiasts, developers, hackers and users of all levels. There are about 5,000 people attending this year. I have met a lot of people here: some who are passionate about free software, and some who are learning more about it and how it can provide value to their companies.

It's difficult to over-estimate the impact that free and open source software (FOSS) has had on computing and the world in general.  First, of course, it powers the internet itself.  If you use the internet, you use free and open source software.  From the underlying protocols to email to ftp to web sites, it's all powered by free and open source software.

Practically every major web site you can think of (Google, Facebook, Wikipedia, Twitter, Foursquare, Google Maps, ... ) makes heavy use of free and open source software.  These companies measure traffic in many millions of users and billions of pages per month.

The Apache Web Server, for example, has been the most popular web server since April 1996 and powers almost 70 percent of all websites on the planet.  There are free and open source operating systems, programming languages, office productivity suites, collaboration suites, web browsers, file and print servers and much more.  There is a free and open source version of practically any software you can think of (and many that you haven't thought of).

And yet, here we are in 2010 and some are still not convinced that open source is suitable for government use.  They are not convinced that this software developed by communities of generous and smart people is reliable and secure or supported enough for their purposes compared to proprietary solutions such as Internet Explorer.  They put all of their trust in single vendor solutions and rely on companies like Microsoft and Oracle, and believe the stories told by such companies about open source software... that story goes something like this:  "It's not enterprise ready... it's of varying quality... there is no support for it... you want to have one throat to choke."

Why aren’t governments using open source software anywhere and everywhere possible?  Why do governments continue to seek out solutions with lock-in to certain vendors?  Why would we continue to believe the big vendors that promise to be nice?  Why do we citizens continue to pay millions upon millions of dollars for software?

Governments are unlike other corporations in that they are making decisions not for their own benefit, but for the benefit of us, the citizens. They don't take that responsibility lightly, so decisions are made with great care and they often don't give themselves permission to try new things - or if they do, they do THAT with great care and concern because they don't want to make any mistakes with our resources. Trying something innovative occurs as risky, and so the status quo is long lived and new approaches are discouraged.

Governments appear to be the last holdout of proprietary software and, as a result, are missing out on an opportunity to engage with and support the communities that support all of us.  The rest of the world has figured out that free and open source software is the most secure, the most reliable, the most innovative and the most cost effective software available.  Leading internet companies that earn millions of dollars in revenues and could choose anything they want for their software needs are choosing open source software.  We should let our governments know that we want them to choose free and open source software too.

The problem with free and open source software is this:  It's hard to make a lot of money with free software.  And, without a lot of money you can't own a public relations team and you can't spend a lot of money on armies of sales people and technical sales people with pre-written business cases and white papers and other collateral convincing people to use your products.  Without a lot of money, you can't schmooze and throw hosted year end parties for your key clients in every major city.

Instead, with free and open source software, you put everything into the product and let the product speak for itself.  You assume that people actually want things to work better.  You build communities of people who are passionate about your product - not because it makes them look good - not because it's easier – not even because it's free - but because it provides exceptional value.

07 July 2010

Is your data readable?

In talking with clients and colleagues about open data and open government, this is the one question that comes up over and over again. The word “data” means a collection or body of facts that represent the qualitative or quantitative attributes of a variable or set of variables, but what does “open data” mean?

To answer this question I like to look at what we are trying to achieve by opening data. The promise of open data is that if we make government administrative data available to the public, value will be created in ways that we may or may not be able to imagine. The value will be created by using the data. So, what is open data? Ultimately, it’s data you can use. In this series of blog posts I will explore the various ways data can be made more usable.

What makes data usable?

In a previous post I proposed some dimensions that move toward a usability scale. In this post I propose a minimum standard of usability. In other words, what are the absolute minimum requirements that must be satisfied in order to consider something open data? To answer this question one could look at the dimensions of usability individually and decide for each one, what would be the minimum level of usability below which data is not usable.

One of the main measures of usability is readability.  In other words, how easy is it to read?

For example, this list of cities with their geographic areas and populations is data:


Data collected into rows and columns in this way is typically called a data set (or dataset). By putting this dataset in my blog post in a table I have made it available to you, but because I made it available as a screenshot of my spreadsheet, reading it would be difficult and error prone and would require expensive software or scripting. That makes it pretty much unusable by you.

Another method in use by governments today is to publish data as a PDF formatted document. This is marginally better than posting as an image. It’s technically possible to extract the data from PDF files, as I have demonstrated in a previous post, but it’s still expensive, time consuming and error prone.

What I could do instead is make that same data available as an HTML table in this blog post, like this:




City        Area     Population
Victoria    19.68    78057
Vancouver   114.67   578041
Kelowna     211.69   120812


Technically, this is a level better than both images and PDF files but it will still get me low points on the usability scale because in order to read it a programmer still has to write a script specifically for reading this data from my blog post, a time consuming and wasteful process. If you’re unfortunate enough to need to read data from an HTML page, another previous blog post describes how to do this.
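To give a feel for the effort involved, here is a minimal sketch in Python of the kind of script a programmer would have to write. The markup in PAGE is hypothetical (I am assuming a plain table with th and td cells), and the names TableScraper and scrape_table are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: an assumed layout for a table embedded in a blog page.
PAGE = """
<html><body>
<p>Some blog text.</p>
<table>
  <tr><th>City</th><th>Area</th><th>Population</th></tr>
  <tr><td>Victoria</td><td>19.68</td><td>78057</td></tr>
  <tr><td>Vancouver</td><td>114.67</td><td>578041</td></tr>
  <tr><td>Kelowna</td><td>211.69</td><td>120812</td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    scraper = TableScraper()
    scraper.feed(html)
    # The first row is the header; the remaining rows are the data.
    header, *body = scraper.rows
    # Every value arrives as text, so the script has to decide the types itself.
    return [
        {"city": city, "area": float(area), "population": int(population)}
        for city, area, population in body
    ]


cities = scrape_table(PAGE)
```

Even this small example depends on the exact layout of the page, so it would likely break the moment the page design changes.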

To really improve the usability of this data it makes sense to publish it in a format that makes the data easily accessible. Many people are familiar with spreadsheets, which are a popular tool for reading and manipulating tabular data, so making data available in spreadsheet format makes it more usable in the sense that people can obtain spreadsheet programs to read it. For example, here is the same data published in the open .ODS format supported by a wide variety of spreadsheet software providers, and here it is published in the XLS format, a proprietary format controlled by the Microsoft corporation.

The advantage to publishing in spreadsheet format is that while still requiring specialized scripts and software to read, at least the rows and columns are well defined which translates into fewer errors.  This is what I would consider the minimum bar for usable open data.  It's not as usable as I would like, but it is usable without too much risk.  In other words, if you have data in this format already and you don't have the budget to reformat it before publishing it, don't delay the release, just publish it as is.

Ideally, though, data is published in formats specifically designed for the purpose of information sharing, and that’s where the CSV, XML and JSON formats come in.

The CSV version of my dataset looks like this:
"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812

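As a rough sketch of how little code it takes to read the CSV version, here is an example using Python's standard csv module. The function name read_cities is just a name chosen for this illustration.

```python
import csv
import io

CSV_TEXT = '''"City","Area","Population"
"Victoria",19.68,78057
"Vancouver",114.67,578041
"Kelowna",211.69,120812
'''


def read_cities(text):
    # DictReader uses the first line as the column names.
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "city": row["City"],
            "area": float(row["Area"]),
            "population": int(row["Population"]),
        }
        for row in reader
    ]


cities = read_cities(CSV_TEXT)
largest = max(cities, key=lambda c: c["population"])
print(largest["city"])  # prints "Vancouver"
```
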
The XML version looks like this:
<dataset>
 <data>
  <row><city>Victoria</city><area>19.69</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>
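
Reading the XML version is about as short. This sketch uses Python's standard xml.etree.ElementTree module and assumes each row carries the area in its own area element alongside city and population. The function name read_cities is made up for this illustration.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<dataset>
 <data>
  <row><city>Victoria</city><area>19.69</area><population>78057</population></row>
  <row><city>Vancouver</city><area>114.67</area><population>578041</population></row>
  <row><city>Kelowna</city><area>211.69</area><population>120812</population></row>
 </data>
</dataset>"""


def read_cities(text):
    root = ET.fromstring(text)
    # Each <row> element becomes one record; values are converted from text.
    return [
        {
            "city": row.findtext("city"),
            "area": float(row.findtext("area")),
            "population": int(row.findtext("population")),
        }
        for row in root.iter("row")
    ]


cities = read_cities(XML_TEXT)
```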

and the JSON version looks like this:
[
 {"city": "Victoria", "population": 78057, "area": 19.69},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]
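
JSON is probably the least work of the three, because the parser hands back numbers and strings already typed. A minimal sketch with Python's standard json module:

```python
import json

JSON_TEXT = """[
 {"city": "Victoria", "population": 78057, "area": 19.69},
 {"city": "Vancouver", "population": 578041, "area": 114.67},
 {"city": "Kelowna", "population": 120812, "area": 211.69}
]"""

# json.loads returns a list of dictionaries; no type conversion is needed.
cities = json.loads(JSON_TEXT)
total_population = sum(city["population"] for city in cities)
```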

While not quite as pretty as the other human-readable formats, CSV, XML and JSON are open formats that provide structure, making it very easy for programs to read the data. They are also well supported in almost all modern programming languages, so any programmer who wants to use your data can do so easily and accurately with free software and very little programming. And as a side benefit, it's very easy and inexpensive to publish your administrative data in these formats using free software.
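
To show how inexpensive publishing can be, here is a small sketch that writes the same dataset out as CSV and JSON using only Python's standard library. The CITIES list stands in for whatever your own system holds (an assumption for this example), and to_csv and to_json are names invented here.

```python
import csv
import io
import json

# Assumed in-memory records; a real publisher would pull these from its own systems.
CITIES = [
    {"city": "Victoria", "area": 19.68, "population": 78057},
    {"city": "Vancouver", "area": 114.67, "population": 578041},
    {"city": "Kelowna", "area": 211.69, "population": 120812},
]


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["city", "area", "population"],
        quoting=csv.QUOTE_NONNUMERIC,  # quote text, leave numbers bare
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    return json.dumps(records, indent=1)


csv_text = to_csv(CITIES)
json_text = to_json(CITIES)
```

Writing csv_text and json_text to files next to the existing spreadsheet is then a one-line step each.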

Publishing data in these open formats makes it easy for people to use open data. While publishing in HTML format is readable and is what I would consider the bare minimum for usability, other formats, depending on how it is done, can make it much easier. And if your organization is serious about engaging people to collaborate and create value from the data, it will want to make the data as usable as possible, and making the data readable is one part of doing that.