28 April 2010

Data is Green

Of the three R's of responsible consumerism, I think the second R holds the most promise. The idea is that if you must buy something, then make it something durable and use it up.  This appeals to me.  If I spend a few extra bucks and purchase the durable good over the cheap one, I get to enjoy a superior product for longer.  And, as a bonus, I get to spend less of my life shopping.


On a recent trip to a small Mexican town I was struck by many cultural differences.  One of these was the extent to which things are used up.  Regardless of the motivation, it was clear that the residents of the town where I was staying were using things long after the point at which I personally would have discarded them.  From houses to automobiles to plastic containers to electronics to clothing and even food, things I would have been comfortable discarding continued to be used long after they would have made their way out of my life.

The Implications of Reuse for Data
As I was thinking about Reuse, it really struck me what an opportunity we have with our data.

We spend a lot of time and attention filling out forms and understanding terminology and concepts so that when we sign them, we know what we're doing and we get the services we need.  We do this when we want to interact with both governments and businesses.  We also provide the funds to enable government to collect all sorts of data about our resources, the places we live and the events that occur in our world so that we can be safe and secure and our resources utilized effectively.

We also spend money indirectly through taxes and fees to have that data stored, protected, backed up and maintained by those public and corporate entities.  It's mind-boggling to think about all the places my data resides and how many times my name, address and phone number are stored.

Yet the data we give is typically used by only the one organization that collected it, and only for the purpose it was collected for.

Why not find a way to store data so that it could be reused FOREVER by multiple parties?

Open Data and Reusability
If we were to add up all of the time, energy and resources required to store and manage our resource, geographical, financial and personal information, locked up as it is in hundreds or thousands of different locations, the impact on our environment must be significant.

Open Data allows that same data to be reused freely and infinitely for multiple purposes, so that we can maximize its value.  It also allows system developers to reduce duplication, because they don't have to recreate data that already exists.

I think Open Data represents an ultimate opportunity for reuse.  If data had a colour, it would have to be green.

19 April 2010

OpenDataBC: Accessing the A4CA Data Catalogue

The OpenDataBC Series: In this series of posts I will describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites.  My goal is to encourage people who might not otherwise consider interacting directly with data to give it a try and see for themselves that it's easy, and maybe even fun.  :-)

Last week I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest.  I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects.

The A4CA data listed in the catalogue includes a range of formats and technologies.  Some are easier to work with than others.  Being able to browse the catalogue online is great but I really want to have a closer look and maybe do some analysis to find the data that is easy to work with.  For that I need to download the data so I can work with it in spreadsheet and/or database form.  In a perfect world, this data would be available on the site as a downloadable feed with its own URL, so programmers could simply point at the URL and get the data in a machine-readable format such as XML.

The A4CA catalogue provides a download in CSV form, which is easy to work with, but unfortunately the link to that data is hidden behind a Flash widget, so there is no way to download the file directly.  The page itself, however, does provide the data to the browser in the form of a table.  The table shows only 100 records at a time, and another Flash widget allows the user to page through the data, 100 records at a time, without a screen refresh.  That means the data is already in the browser, all 540 rows of it.  It's just a matter of scraping it out with a bit of code.

How it's done:
First, checking the robots.txt file for the site (www.gov.bc.ca) reveals that the site allows almost any type of automated program, including this one, so that's great.
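That check can be automated too.  Here's a minimal sketch using Python's standard robot-file parser (the module is called robotparser in Python 2 and urllib.robotparser in Python 3), run against a made-up permissive robots.txt rather than the live site:

```python
from urllib import robotparser  # in Python 2.x this module is just "robotparser"

# A hypothetical robots.txt, similar in spirit to a permissive site's file.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The catalogue page is fair game; anything under /private/ is off limits.
print(rp.can_fetch('*', 'http://data.gov.bc.ca/data.html'))      # True
print(rp.can_fetch('*', 'http://data.gov.bc.ca/private/stuff'))  # False
```

In practice you would point the parser at the real file with set_url() and read(), but parsing a string keeps the example self-contained.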

Prerequisites
My personal language of choice for this sort of task is Python, which comes pre-installed on most Linux distributions and modern Macs and can be easily installed on Windows.  In my case I am using Python 2.6 under Ubuntu 9.10.  In addition to Python I am using Beautiful Soup, which is an excellent library for scraping web sites.

The Code
The first thing we need to do in our program is to import the modules we need.  We are going to need urllib2 to grab the page and BeautifulSoup to parse it.  That's just one line:
import urllib2, BeautifulSoup

Next, we need to go out to the web and grab the HTML page and store it as a string called page:
page = urllib2.urlopen('http://data.gov.bc.ca/data.html')

Next, we create an object called soup using the BeautifulSoup library.
soup = BeautifulSoup.BeautifulSoup(page)

At this point we have the page loaded and we can read whatever parts of it we want using the methods provided by the soup object.

I am particularly interested in the table in the middle of the page, which contains the data I am after.  Looking at the raw HTML from inside my browser I see that there is only one table on this page and that its ID is set to 'example', so it's pretty easy to find using the find method provided by the soup object.
data_table = soup.find('table',id='example')

We also need a place to store our results.  I'll use a list for that.
records = []

Now that we have the table, we just want to cycle through its rows and pull the data out.  For that we can use the Python for statement with the findAll method provided by the data_table object that we created.  For each row that we iterate through, we want to grab the text stored in each table cell.  This is easily accomplished by creating a list containing all of the cells in the row and then taking the parts we want to work with from that list.  Here's the code:

for row in data_table.findAll('tr'):
    if row.find('td'):
        cols = row.findAll('td')
        records.append([
            cols[0].text,
            cols[1].text,
            cols[2].text,
            cols[2].a['href'],
            cols[3].text,
            cols[3].a['href'],
            cols[4].text,
            ])

Pulling the text out of the table cells is as easy as accessing the .text member.  Two of the cells in each row contain links, which I wanted to capture as well, so I accessed those through the .a member and then read the href attribute, which is where links are stored in HTML.

Now, each row in our records list contains one row from the table, with the cell contents and links separated out.  This is a good start to making this data more usable for my purposes.
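For instance, the records list can be dumped straight to a CSV file with Python's standard csv module.  A minimal sketch, using one made-up sample row and hypothetical column names (the real catalogue's headings may differ):

```python
import csv

# Hypothetical column names; the real A4CA table's headings may differ.
header = ['id', 'title', 'dataset', 'dataset_link',
          'metadata', 'metadata_link', 'category']

# One made-up row in the same shape the scraper produces.
records = [
    ['1', 'Sample dataset', 'Download', 'http://example.com/data.csv',
     'More info', 'http://example.com/info', 'Climate'],
]

# Python 3 style shown here; in Python 2 open the file with mode 'wb'
# and omit the newline argument.
with open('a4ca_catalog.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(records)
```

From there the file opens directly in any spreadsheet, which is exactly the form I wanted.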

Next, I plan to do some data cleaning and then start to do some analysis on it to get a feel for what's available in the A4CA catalogue.

And finally, here is the entire program:

def read_a4ca_catalog():
    import urllib2, BeautifulSoup

    page = urllib2.urlopen('http://data.gov.bc.ca/data.html')
    soup = BeautifulSoup.BeautifulSoup(page)

    records = []

    data_table = soup.find('table',id='example')
    for row in data_table.findAll('tr'):

        if row.find('td'):
            cols = row.findAll('td')
            records.append([
               cols[0].text,
               cols[1].text,
               cols[2].text,
               cols[2].a['href'],
               cols[3].text,
               cols[3].a['href'],
               cols[4].text,
               ])
    return records

if __name__ == '__main__':
    for record in read_a4ca_catalog():
        print record

With Python and BeautifulSoup it's easy to extract data from a site and I would encourage anyone to give it a try.  It's easier than you might think.

Now that we have the data in a form we can work with, how do we clean it up and make it more useful?  I'll cover that in the next article of this series.

17 April 2010

Real Time Notification System

At OpenGovWest I had the opportunity to hear about and discuss a host of innovative ideas involving open data and open government.  One of the most impressive examples that was discussed was OneBusAway, an excellent service that provides real time arrivals of transit buses.  When Brian Ferris introduced himself it was clear the whole room, including me, thought his service was a shining example of what was possible when open data is given a chance.

Brian's app is great for folks who use public transit now, and it makes mass transit even more convenient than it already is, which goes some distance toward reducing carbon emissions.  I got to thinking about that app and what else could be done with transportation and real time notification.  I wondered what it would be like if grocery stores had rolling mini-marts that worked the same way, notifying you via an application or text message that they were getting close.  They could carry the basics (eggs, cheese, bread, milk, fresh fruits and vegetables) and you would be able to pop out to the street and get what you need.  No more time and gas wasted, and again, less carbon.

What if, indeed.

Fast forward one month, and I experienced this system first hand.  I found out that not only has this system been implemented, it's been in place for many years.  Some small Mexican villages have a scalable, just-in-time goods delivery system, complete with real time notification.  Goods ranging from bottled water to propane to fresh fruit and vegetables, fish, cheese and baked goods are transported throughout the city streets.  Families and businesses are notified 2 to 5 minutes in advance of the precise goods that will soon be passing through the neighbourhood.  The notification technology used is clean, inexpensive and emissions free, and all Mexican citizens and visitors are able to use this service for free.

It uses an oscillation of energy that moves through air, water, and other matter, in waves of pressure.  That's right.  Sound.

How does it work?

Vendors travel through the streets of the village on foot, by bicycle or in slow moving vehicles.  As they travel along, they transmit, either verbally or through a recording, a sound that is unique to them.  If it's a company, the sound might be their trademark; if it's an individual entrepreneur, she might have her own sound or she might simply announce what she is selling.  Consumers can hear these sounds from the streets, sidewalks or inside their homes, and because the sounds are distinct, they easily know what's coming.  The sounds are also loud and the vendors travel at a low rate of speed, so folks typically have several minutes to get their money together and meet the vendor at the door, saving time, gas, money and the environment.

I think we sometimes get so caught up in our technology that we forget that there are often simpler, more basic solutions to some of the challenges we face today.  And many times, given the chance, these solutions will evolve on their own, without any grand design or oversight.  Instead of waiting for Apple to ship the next device, subscribing to a 3G cell plan, downloading the latest Twitter client and then tweeting to my friends about what I am thinking, maybe I will just invite them to go for a walk or a bike ride so we can chat.

Adios amigos.

13 April 2010

On Apple's SDK License Changes

IBM used to sell mainframe computers where, if you wanted more functionality, they would send an IBM technician to install a huge hardware module in your computer.  It was a closed, controlled system: IBM could do basically whatever they wanted in terms of making hardware available, and the way they decided to go was closed.  If you wanted that extra functionality, there was only one place to get it: IBM.

Fast forward 50 years.  It seems a lot of developers are frustrated by the new Apple license changes.  It's getting more closed than before.  It’s easy to get frustrated when the rules change and it seems like you don’t have any say in how that goes, but, ultimately, that's the nature of a closed platform.  It's not a democracy; Apple makes the rules.

You can however choose whether or not to support it with your dollars, your attention or your time.  Closed works for Apple, and I love what Apple has done with mobile applications, music distribution and video rentals so I do spend some of my dollars there.

Ultimately though, I support open platforms because I think in the long run, they work better, they scale better, they cost less and they are the right thing to do.  That's why I choose to spend a lot of my money, attention and time working on Android and Linux and evangelizing open source, open data and open government.

11 April 2010

Don’t Make Me Register

You’ve worked hard to create your open data catalog and you’ve spent endless hours and resources creating the data that you are making available.  You want to make sure your data is used appropriately and you want to be able to protect your intellectual property.   You decide you want to put up a registration page so you can track who is downloading what data and how often they do so.  That way, you’re covered.  Sounds like a good strategy, right?  Wrong.

When you make registration mandatory people leave your site.

Here’s why:

When people are presented with an e-mail address form to access content the first question they are likely to ask themselves is, “why do you need this?”  Some sites will offer some sort of reasoning, like, “so we can keep you informed”, but then the person wonders, if that’s true, why is it mandatory?   What if I don’t want you to keep me informed?  It’s clearly not the real reason.  Strike one. You have just lost some trust with this person.

People are used to being lied to, so some will press on.  They know you don’t need their e-mail address but you are insisting on it anyway, so, the next logical question is: “What are you going to do with it?”.  A number of reasons come to mind but the one that will often come up first is you want to send them spam, or worse, you want to sell their email address to others who will send them spam.  So, now the person is faced with another decision:  “Should I use my real email address or should I use the one I reserve for sites that I don’t want to get spam from, or should I go to hotmail and make a new one for this site?”

If they manage to stay interested in your site for this long and go through with the e-mail registration, then at least you know they are highly motivated to get at your content.  If that’s your goal, to attract only highly motivated people (unfortunately this includes vandals) and eliminate the rest, then putting up a registration form might not be a bad way to go.

But there is a group of people in the middle, between the highly motivated people and the people who will never use your stuff.  This group in the middle lives in an area called The Long Tail.  Promoting the generation of value from content like open data is all about The Long Tail, and if you want the people in the middle to participate and create value for you, you want to make their participation as easy as possible, so that your population of users is as large as possible, making the long tail as long as possible.

In the middle are the people who are motivated enough to check your site out and maybe even create something cool but who aren’t willing to potentially compromise their email address to do so.  By making them jump through hoops without a valid reason you are destroying trust at the very beginning of the relationship and potentially turning them away.

Using the word “mashup” makes the work sound kind of fun, like it’s easy.  It’s not.  And currently, there are relatively few people who can take your data, combine it with some more data or perhaps some code, and make something truly valuable.  I am not saying you have to cater to these people, but if you really want them to choose your site to spend their time on, consider making it as easy as possible for them.

If your goal is to foster citizen engagement and to provide ways for people to contribute, and maybe even crowd-source your data to produce a valuable outcome, you should be looking to make the process as easy as possible and keep that tail as long as possible.  Requiring users to register to download your data tells them that you value their e-mail address more than you value their contribution, which, if true, means you might want to rethink your strategy.  There are easier ways to get people’s email addresses.

And if you have functionality that actually is valuable to the user, and requires people to sign in, consider a friendlier alternative like OpenID.