20 May 2010

OpenDataBC: Extracting Data from A4CA PDFs

In this OpenDataBC series of posts, I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. In the first article of this series, I described how to write a script to scrape catalog data from web pages. In the second article I described how to write a program to transform the data. In this article, I describe how to convert a PDF document into useable data.

As part of the Apps for Climate Action Contest, the Province of BC released over 500 datasets in the Climate Action Data Catalogue. It's an impressive amount of data pulled from an array of sources both within BC and elsewhere.

In an ideal “open data” world, all of that data would be in easily machine-readable formats that we could use to write programs directly. While that would be great, the reality today is a bit different. Much of the data that is made publicly available these days is in formats that are harder to use. For example, some of the data in the catalogue was released in PDF format. PDF is a document presentation format, originally developed and controlled by Adobe; it is designed for displaying pages to people rather than for exchanging data between programs.

An interesting fact is that it takes extra effort to get data from its raw form into PDF format. In other words, to publish data in an open format rather than in PDF format actually saves time, effort and money – up front. However, PDF became well established in the pre-open world, so a lot of data is already published using it. To switch existing software to publish in an open format might take time. As a result, at least temporarily, we need to find ways to get at the data in the PDF files.

In this post I describe how to do that. Looking through the available datasets in the catalogue, one that I find interesting is “Transit Ridership in Metro Vancouver”. The data is produced by TransLink, is published as a PDF, and looks like this:



What I am interested in is the number of passenger trips by year for the past few years. I am going to leave out the SeaBus and the West Coast Express as I am mostly interested in the buses and the SkyTrain.

What I would like is a dataset in a CSV file. The program will work essentially as follows:

  • read the PDF file from the source web site
  • extract the data from the PDF file into a list in memory
  • write the list in memory out to a CSV file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on most Linux and modern Mac machines and can be easily installed on Windows. It also uses the pdfminer library (introduced below), which you may need to install separately.

The Code
The first thing we need to do is read the PDF file into memory. The simple way to do that in Python is to use the urllib2 library and read the entire PDF from the original web site. Tying the script to the actual location of the file means we don't have to store the original file anywhere ourselves. If Metro Vancouver decided to move the URL we would have to adjust our code, but we're probably only going to run this code once, so it's not a big deal. To read the PDF file into a memory variable we do this:

import urllib2
url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
pdf = urllib2.urlopen(url).read()
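
A quick optional check (my addition, not part of the original script) that the download worked: continuing from the pdf variable above, the result should be a sizeable string of bytes beginning with the PDF signature.

print len(pdf)     # size of the downloaded file in bytes
print pdf[:5]      # a valid PDF begins with the '%PDF-' signature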

Now that we have the PDF file in memory, I want to parse it and turn it into raw text. To do this I use a free, open source Python library called pdfminer. I have created a function called pdf_to_text for this purpose. Here's the function:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    # pdfminer needs a seekable file-like object, so put the raw PDF
    # bytes into a StringIO buffer
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    # the extracted text is collected in a second StringIO buffer
    outfp = StringIO.StringIO() 

    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 

    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t

The pdf_to_text function starts by importing the components required to do the conversion. The pdfminer library provides a lot of functionality; in this example we use only a small fraction of it to get at the content of the PDF. The function that actually does the work is process_pdf. It takes a PDFResourceManager object, a TextConverter object and a file object as parameters, so the code before that call sets up those parameters. I use a StringIO object rather than passing the urllib2 response in directly because the PDF converter needs the seek method for random access, which the urllib2 response does not support. Putting the data into a StringIO object, which does support seek, solves that.
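
To see the difference for yourself, here is a small sketch (my own illustration, not part of the script) comparing the object urllib2 returns with a StringIO buffer:

import urllib2
import StringIO

url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
response = urllib2.urlopen(url)
print hasattr(response, 'seek')    # False: the urllib2 response is not seekable
buf = StringIO.StringIO(response.read())
print hasattr(buf, 'seek')         # True: StringIO supports random access
buf.seek(0)                        # which is what pdfminer needs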

When the pdf_to_text function is called with the contents of a PDF file it returns a string containing lines of text with each line containing one of the elements (numbers or labels) of the PDF file. Here's what it looks like on my system:



Now that we have the data in text format, we want to pull out the numbers we are interested in. I am interested in the labels on the left, which start on line 6 of the extracted text, the first numeric column (BUS), which starts on line 75, and the second numeric column (SKYTRAIN), which starts on line 144.

To start the process of extracting rows of data from the text file, I first split lines of the text file into a list like this:

lines = text.splitlines() 
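
The offsets I mentioned above (6, 75 and 144) are what I see on my machine; if your version of pdfminer lays the text out differently, a quick aid (my addition, not part of the final program) is to print each line with its index and look for where the labels and numeric columns begin:

for i, line in enumerate(lines):
    print i, line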

Then I create a simple function called grab_one_row which, besides having a very clever name, knows the relative placement of the three columns, pulls one whole row at a time from the text, and returns it as a tuple. Here is the function:

def grab_one_row(lines,n): 
    # the BUS and SKYTRAIN values sit 69 and 138 lines below their label
    return (lines[n], lines[n+69], lines[n+138])

Armed with that function, I can now collect most of the rows I am interested in with a simple list comprehension:

rows = [grab_one_row(lines,i) for i in range(6,26)] 

In the original PDF, the data for 2008 is placed further down the page, so that last row needs to be added with a separate line of code like this:

rows.append(grab_one_row(lines,39)) 

Now the rows list contains all of the data we are interested in, in a form we can easily deal with. We just need to write it out to a CSV file to complete our work. To do that I created the rows_to_csv function. Here it is:

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    # the first row grabbed holds the column labels, so write it out as-is
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        # strip the thousands separators and convert the counts to long integers
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 
    f.close()

I wanted the resulting CSV file to contain actual numbers rather than strings of digits for the numeric values. The loop in this function strips out the commas that were in the numbers in the PDF file and then converts each value to a long integer before writing it to the CSV file.
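
For example, with a made-up value of the kind that appears in the PDF text:

print long('56,789,012'.replace(',', ''))    # prints 56789012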

The resulting CSV file now looks like this:



This result is a lot easier to deal with than the original PDF file. Arguably, a small file such as this could also be converted by copying the table from the PDF and pasting it into an OpenOffice.org spreadsheet. The nice thing about doing it as a script like the one above is that the same technique works for very large PDF files that would be too onerous to convert by hand.

Here is the entire program with all of the code together at once:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    # pdfminer needs a seekable file-like object, so put the raw PDF
    # bytes into a StringIO buffer
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    # the extracted text is collected in a second StringIO buffer
    outfp = StringIO.StringIO() 

    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 

    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t 
    
def grab_one_row(lines,n): 
    return (lines[n],lines[n+69],lines[n+138]) 

def rows_to_csv(rows,filename): 
    # write the clean data out to a file 
    import csv 
    f = open(filename,'w') 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    # the first row grabbed holds the column labels, so write it out as-is
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        # strip the thousands separators and convert the counts to long integers
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 
    f.close()

def run(): 
    import urllib2 
    url         = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf' 
    outfilename = 'translink_bus_skytrain_trips_1989_2008.csv' 
    
    pdf = urllib2.urlopen(url).read() 
    text = pdf_to_text(pdf) 
    
    lines = text.splitlines() 
    rows = [grab_one_row(lines,i) for i in range(6,26)] 
    rows.append(grab_one_row(lines,39)) 

    rows_to_csv(rows,outfilename) 
    
if __name__ == '__main__': 
    run() 

And you can find the resulting CSV file here.

Once again, Python comes through for us. Clearly it's not as easy to convert a PDF file as it is to rip a table out of an HTML file, but the fact that it's possible at all makes it something we can work with. And part of the beauty of “open” is that now that I have done it, others don't have to. And I, in turn, will benefit from other contributors to the open ecosystem. If we all do a bit, it's an “everyone wins” scenario.

17 May 2010

Facebook Steps Out of the Way

Like most people, I was a bit surprised by Facebook's recent changes with regard to privacy.  I don't think they have done anything wrong, but as a user I allowed myself to be lulled into a false sense of security.  Like most people, I believed that they wouldn't mess with the privacy settings of my account much, allowing me to control who got to see the personal information I put on Facebook on my terms.  I had agreed to their user agreement, which stated they could change the terms at any time, but I didn't really pay attention to that fine print.

When they made the more recent change to allow my friends graph and my content to be harvested I realized something.  Not that Facebook is evil or bad, but that they offer a service that I thought was one thing, but it is something else.  I thought it was a way for me to connect with my friends and share my data with them, but actually it is a way for Facebook to profit from our personal data.  Or, as Tim Spalding so eloquently put it, "Why do free social networks tilt inevitably toward user exploitation? Because you're not their customer, you're their product."

For me it's not a big deal that my Facebook content is now available to anyone, I don't store anything particularly private there anyway.  But now, my behaviour has changed and I find myself using it even less than I did before.  Not so much because of the loss of control over my data or the fact that they didn't give me a cut, but because this is a company that is volatile with respect to its user policies.  Frankly, I just don't want to put in the time required to keep up with their changes.  So, I take my privacy into my own hands and limit what I place on Facebook.

On the other hand, I find the recent changes to Facebook pretty exciting.  I think an awesome opportunity has opened up now for a service to emerge that allows people to connect with their friends and at the same time protects their privacy.  400 million Facebook users sharing information is a testament to the fact that people want to connect online.  The recent outcries and Facebook account deletions point to the fact that people also value privacy... i.e., there is clearly a market for connecting AND protecting privacy.

Facebook doesn't offer that service, but until now people were not sure if they did or they didn't.  And that ambiguity prevented other firms from offering that service because competing with Facebook was just a non-starter.  Now, thanks to Facebook's recent changes, it's clear.  They don't.

And in that gap between what people want and what is available, lies opportunity.

It's now clear, Facebook is not in the privacy business.  By stepping out of the way they make room for others that want to offer privacy as a key value proposition.

Personally, I would like to see a new type of social platform emerge.  Something taking ideas from status.net and webfinger and Diaspora.  I want a distributed social platform that I can host with any hosting provider and that would allow me to connect with my friends.  The difference is, I own it.  So long as my friends and I have this system installed somewhere, our systems would talk to each other seamlessly.

This way we would not be dependent on, or at the mercy of, any one vendor or their privacy policy changes.  We would be able to move our accounts to another host of our choosing, anytime we want, and we would be able to lock down or even delete our data any time.

The software would be free and open source as well.  If anyone wanted to add some functionality or just contribute, that would be possible too.  Think self-hosted WordPress, but with social networking instead of blogging.

Rather than one massive "walled garden" users would each have their own garden in a community with other garden owners.  They would be able to share their data with whomever they choose.

I am grateful to Facebook and everything it has done to connect people.  It's truly an awesome service.

Ultimately though, it IS my data.  And I still want to share it on my terms.

11 May 2010

Innovators Isolation

One of the things that all innovators face at some level is a sense of isolation. By definition innovators are working on things that have never been worked on before. And they make up only about 2.5% of the population. If they participate in a specialized industry, it's pretty unlikely they'll get to work with other innovators in their field, never mind find someone who understands what it is they're so passionate about.

About four years ago I made a decision to attend four conferences per year. I consider it part of my ongoing professional development in two ways:

  1. Training - there are no formal training courses available for the skills I need for my work. I read any books that are available. Conference workshops and sessions provide me with the most current and relevant information about what other innovators are doing in my field.
  2. Connecting - when I attend a conference, I consciously choose to meet and connect with other attendees and presenters. I often get to connect with the inventors of some of the most exciting new technologies. It's typical to find folks exchanging ideas, talking about what we've actually done, what worked, what didn't work and what we're thinking about for the future.  There are people who talk and there are people who ship. These people ship.

Next week I will attend Google I/O and later this year I will attend OSCON and possibly FOWA. Although I don't expect to find a workshop at these conferences covering specifically what I am currently up to (privacy-enhancing distributed enterprise service topologies, dataset forking technologies, probabilistic data linkage techniques), I do expect to find lots of people interested in the cutting edge of whatever it is they're passionate about.

I expect to find people who are daring to think differently. I will share my crazy ideas. I will hear other folks’ crazy ideas. I expect we'll have thoughts about each other's crazy ideas and I expect that after all that, we will acknowledge some of the ideas as crazy or just dumb. And some will seem not quite as crazy as they did before.

But more importantly, we will get the sense that we're not the only ones with crazy ideas, and we're not the only ones that have no idea what's on TV anymore, because we are working on something that we think is cool and could possibly even change the world.

06 May 2010

freedom debt and the decline of lock-in




Early adopters are starting to notice there's a cost to cool when it's supplied by a single vendor.  At first it's all fun and games.  Then you realize that you no longer own your music collection, you no longer own your social network, and you no longer own your data.

Time after time we see organizations start out challenging the status quo, getting huge, abusing their user base, breaking trust, declining, and finally becoming the legacy they fought against.  Facebook innovated past MySpace, Microsoft innovated past the IBM mainframe and Apple innovated past the IBM PC.  Years later, Facebook's commoditization of users and user content,  Microsoft's crushing domination of the PC market, and Apple's rigid control over anything that comes into contact with its products, including the internet, are all examples of lock-in in action.

They all start with great innovations and intentions but eventually the temptation to use lock-in as a strategy becomes too much to resist.  

It's the classic Innovator's Dilemma.

To amass a huge user base, organizations start out providing something of value, or at least perceived value.  That's what gets people to join up.  The more compelling and universally recognized the value is, the more users.  Having first-mover advantage is great here because it means there is no competition.

Once people have joined, there's a need to retain them as customers.  

Option one is continuing to provide value over and above what their competitors offer.  Now that their competitors know what they're up to, it's only a matter of time before the innovation is imitated and possibly exceeded.  Continuing to innovate (taking risks) after they've already captured an audience can be difficult, and it's something that doesn't come naturally to large organizations.

Option two, which is what most organizations default to, is the strategy of intentional 'lock-in'.  Lock-in is the practice of structuring the customer relationship so that it's difficult or impossible for users to switch to another vendor: holding user data hostage so that it can be entered but not retrieved, preventing people from moving their purchased music to another platform, or creating a walled garden environment where, if you want to play, you have to play within the walls, and leaving means leaving your toys behind.  These are just a few of the tactics used to discourage people from exercising their freedom of choice.

Notice there's a common thread in option two: none of these tactics are in your best interest.  They are in the best interest of the organization.  

Jim Zemlin said it well at LinuxWorld a couple of years ago.  He likened using a certain well known set of products to volunteering to go to jail.  He pointed out that the jail looked a lot like a 4-star hotel room with video on demand and a great view, clean and neat, one that most of us would find rather luxurious... but it's still a jail.

Freedom Debt
By offering you something shiny now, organizations get you to give up a bit of your freedom in the future.  "Take this shiny phone.  It's free", they say. "Don't concern yourself with where your data is stored; we'll take care of it."  

Right.

As we continue to wake up to what lock-in means, and to realize that choosing products and vendors that lock us in is essentially borrowing freedom from the future, I think we'll start to make different choices.

The New Organization
The trends around open source, open data, open government, open protocols, open APIs and open communications indicate that more and more organizations recognize that people are aware of the consequences of giving up their freedom.  These organizational models treat lock-in as something to avoid, and use that as a lever to distinguish themselves from the legacy organizational models they disrupt.

This not only makes those organizations more competitive, it also means their goals are more aligned with ours, and gives us something more valuable for the dollars or attention we give in return.  

And the big upside for them is that openness and our knowing we can leave at any time builds a kind of trust and loyalty that the old model can't hope to compete with.

02 May 2010

OpenDataBC #2: The A4CA Data Catalogue Transformer

In the OpenDataBC series of posts I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites.  My goal is to encourage people who might not otherwise consider interacting directly with data to give it a try and see for themselves that it's easy, and maybe even fun.  :-)

Three weeks ago I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest.  I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects.  When I started to look through the datasets, I realized I was going to need to get the catalogue into either a spreadsheet or a database to be able to really see what was available.

In the previous article of this series, I described how to write a script to scrape the catalog data from the web page that contains it.  In this article I describe how to write a program to transform the data.

The goal of this program is to read the live catalog from the A4CA site and make it available in a useable form.  The program will do some data cleaning and write the data out to a file that can be read by a spreadsheet or database program.  I have decided to output the catalog data in comma-delimited form, otherwise known as a csv file.

While writing the last article, I noticed that the downloaded data had some broken lines, extra spaces, and values encoded with HTML entities (for example, '&#8805;' instead of '≥').  I want to be able to view the data in a plain text editor and in a simple spreadsheet program.  This means cleaning all of these extra lines and codes out so the data is easy to work with and understand during analysis.

The way this program will work is basically as follows:
  • read the data from the catalog page
  • clean the data
  • write the clean data out to a csv file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on all Linux and modern Mac machines and can be easily installed on Windows.

The Code
One of the first things we are going to need to do is grab the catalog data.  I saved my code from the last article in a module called read_a4ca_catalog. To make that module available I just import it.  We are also going to need a few other modules in our program so I import those at the same time.  The string module will help us when we want to clean the data.  We'll use the csv module when we want to write our output to a file.

import read_a4ca_catalog
import string
import csv

Now that we have all of the basic modules loaded we are set to start coding.

Reading the dataset is ridiculously easy now that we have a module specifically written to do that.  It's a one-liner that uses our module from last time:

raw_data = read_a4ca_catalog.read_a4ca_catalog()

Now we have our data in a list variable and we need to clean it.  For that we're going to go through the data row by row, cleaning each row and placing it into a new list called clean_data.

clean_data = []
for row in raw_data:
    clean_data.append( cleanup_row(row) )

Notice that the code calls a function called cleanup_row.  That function takes one row from the raw catalog data, goes through it cell by cell, and cleans the data in each cell.  Cleaning the data consists of replacing the encoded entities with readable alternatives, replacing doubled spaces with single spaces, and removing any characters we don't want.  We create a string called legal_chars ahead of time to specify which characters we consider legal for the output cells.  Here's the code:

legal_chars  = string.digits + string.ascii_letters + string.punctuation + ' '
def cleanup_row(row):
    def cleanup_element(element):
        t1 = element.replace('  ',' ').replace('&amp;','&').replace('&#8804;','<=').replace('&#8805;','>=').replace('&lt;','<').replace('&gt;','>')
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]
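
To see what that does, here is a made-up row (the cell values are invented purely for illustration) run through the function:

print cleanup_row(['Fisheries &amp; Oceans', 'Emissions &#8805; 100 kt', 'PDF'])
# prints: ['Fisheries & Oceans', 'Emissions >= 100 kt', 'PDF']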

We need to place this function above the code that uses it. In my program I have placed it right below the import statements.

Finally, we want to write the clean data out to a file.  For that I create an output file and call upon the Python csv module to write each row out to the file.   Just before I write out the rows of data, I write a row of labels so that the csv file has headings.

f = open('a4ca_catalog.csv','w')
writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)
f.close()

That's it. We now have a .csv file that can be easily read by pretty much any spreadsheet or database program.

Here's the entire program with all the bits in order:
import read_a4ca_catalog
import string
import csv

legal_chars  = string.digits + string.ascii_letters + string.punctuation + ' '
def cleanup_row(row):
    def cleanup_element(element):
        t1 = element.replace('  ',' ').replace('&amp;','&').replace('&#8804;','<=').replace('&#8805;','>=').replace('&lt;','<').replace('&gt;','>')
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]

# read the a4ca catalog from the live web site
raw_data = read_a4ca_catalog.read_a4ca_catalog()

# clean the data
clean_data = []
for row in raw_data:
    clean_data.append( cleanup_row(row) )

# write the clean data out to a file
f = open('a4ca_catalog.csv','w')
writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)
f.close()


And you can find the resulting csv file here.

The nice thing about doing this with a program rather than manually is that if the catalog is updated from time to time, all we have to do is run our program and we have a nice clean spreadsheet to work with that has the latest and greatest data.

Python comes with a lot of tools built in for manipulating data, and it makes short work of jobs like this. Now that the data is in spreadsheet form, we can look at it to see what different types of data there are, what format it's in and where it comes from, and we can do cross-tabs and graphs to help us visualize those aspects.
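
As a tiny example of the kind of summary that becomes easy once the data is in CSV form, here is a sketch (my addition, assuming the a4ca_catalog.csv file produced above with its seven columns) that tallies the catalogue entries by format:

import csv

# Count how many catalogue entries there are for each data format.
f = open('a4ca_catalog.csv')
reader = csv.reader(f)
header = reader.next()                 # skip the heading row
counts = {}
for row in reader:
    fmt = row[6]                       # 'format' is the last column written above
    counts[fmt] = counts.get(fmt, 0) + 1
f.close()

for fmt, n in sorted(counts.items()):
    print fmt, n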