20 May 2010

OpenDataBC: Extracting Data from A4CA PDFs

In this OpenDataBC series of posts, I describe how to use some of the data that is being made available by the government of British Columbia on http://data.gov.bc.ca and related web sites. In the first article of this series, I described how to write a script to scrape catalog data from web pages. In the second article I described how to write a program to transform the data. In this article, I describe how to convert a PDF document into usable data.

As part of the Apps for Climate Action Contest, the Province of BC released over 500 datasets in the Climate Action Data Catalogue. It's an impressive amount of data pulled from an array of sources both within BC and elsewhere.

In an ideal “open data” world, all of that data would be in easily machine-readable formats that we could feed to programs directly. While that would be great, the reality today is a bit different. Much of the data that is made publicly available these days is in formats that are harder to use. For example, some of the data in the Climate Action Data Catalogue was released in PDF format. PDF was developed by Adobe as a proprietary format and, although it has since been published as an ISO standard, it is designed for presenting pages to people rather than for exchanging data between programs.

An interesting fact is that it takes extra effort to get data from its raw form into PDF. In other words, publishing data in an open format rather than in PDF actually saves time, effort and money up front. However, PDF became well established in the pre-open-data world, so a lot of data is already published with it, and switching existing software to publish in an open format takes time. As a result, at least temporarily, we need to find ways to get at the data inside PDF files.

In this post I describe how to do that. Looking through some of the available datasets in the catalogue, one that I find interesting is “Transit Ridership in Metro Vancouver”. The data is produced by TransLink, is published in PDF format, and looks like this:

[Image: the Transit Ridership PDF, a table of annual passenger trip totals for bus, SkyTrain, SeaBus and West Coast Express services]

What I am interested in is the number of passenger trips by year for the past few years. I am going to leave out the SeaBus and the West Coast Express, as I am mostly interested in the buses and the SkyTrain.

What I would like is a dataset in a CSV file. The program will work essentially as follows:

  • read the PDF file from the source web site
  • extract the data from the PDF file into a list in memory
  • write the list in memory out to a CSV file

Prerequisites
The following code requires the Python programming language, which comes pre-installed on Linux and modern Mac machines and can be easily installed on Windows. It also requires the pdfminer library, a free, open source Python package that can be downloaded from its project page or installed from the Python Package Index.
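A quick sanity check that both prerequisites are in place is to run a couple of lines in the interpreter; the second line fails with an ImportError if pdfminer is not installed:

import sys
print sys.version    # the code in this post is written for Python 2.x
import pdfminer      # raises ImportError if the library is not installed
print 'pdfminer is available'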

The Code
The first thing we need to do is read the PDF file into memory. The simple way to do that in Python is to use the urllib2 library and read the entire PDF from the original web site. Tying the script to the actual location of the file means we never have to store the original file anywhere ourselves. If Metro Vancouver decided to move the URL we would have to adjust our code, but we're probably only going to run this code once, so that's not a big deal. To read the PDF file into a memory variable we do this:

import urllib2

url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
pdf = urllib2.urlopen(url).read()

Now that we have the PDF file in memory, I want to parse the PDF file and turn it into raw text. To do this I use a free, open source Python library called pdfminer. I have created a function called pdf_to_text for this purpose. Here's the function:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    # put the raw PDF bytes into a seekable file-like object 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    # the converted text will accumulate here 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t

The pdf_to_text function starts by importing the components required to do the conversion. The pdfminer library provides a lot of functionality; in this example we use only a small fraction of it to do what we need, which is to get at the content of the PDF. The function that actually does the work is process_pdf. It takes a PDFResourceManager object, a TextConverter object and a file object as parameters, so the code before that call sets those up. I use a StringIO object rather than passing the urllib2 response in directly because the PDF converter needs the seek method for random access, which the urllib2 response object does not support. To gain this ability I put the data into a StringIO object, which supports seek.
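To see the difference, here is a minimal sketch; it fetches the same URL as above purely for demonstration:

import urllib2
import StringIO

url = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf'
resp = urllib2.urlopen(url)
print hasattr(resp, 'seek')    # False: the response object only reads forward

fp = StringIO.StringIO(resp.read())
fp.seek(0)                     # works: StringIO supports random access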

When the pdf_to_text function is called with the contents of a PDF file, it returns a string containing lines of text, each line holding one element (a number or a label) from the PDF file. Here's what it looks like on my system:

[Output: the converted text, one element per line — the row labels first, then the BUS column values, then the SKYTRAIN column values]

Now that we have the data in text format, we want to pull out the numbers that we are interested in. I am interested in the labels on the left, which start on line 6, the first numeric column (BUS), which starts on line 75, and the second numeric column (SKYTRAIN), which starts on line 144. (These numbers refer to zero-based positions in the list of lines.)
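Those positions were found simply by printing the converted text with an index next to each line. A quick exploratory snippet like this does the job (assuming text holds the output of pdf_to_text):

# print each line of the converted text with its index,
# to locate where each column of the table begins
for i, line in enumerate(text.splitlines()):
    print i, line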

To start the process of extracting rows of data from the text, I first split it into a list of lines like this:

lines = text.splitlines() 

Then I create a simple function called grab_one_row which, besides having a very clever name, knows the relative placement of the three columns, pulls one whole row at a time from the text, and returns it as a tuple. Here is the function:

def grab_one_row(lines,n): 
    # the BUS value sits 69 lines below its label; the SKYTRAIN value sits 138 lines below 
    return (lines[n],lines[n+69],lines[n+138]) 

Armed with that function, I can now collect most of the rows I am interested in with a simple list comprehension:

rows = [grab_one_row(lines,i) for i in range(6,26)] 

In the original PDF, the data for 2008 is placed further down the page, so the last row needs to be added with a separate line of code like this:

rows.append(grab_one_row(lines,39)) 

Now the rows list contains all of the data we are interested in, in a form we can easily deal with. We just need to write it out to a CSV file to complete our work. To do that I created the rows_to_csv function. Here it is:

def rows_to_csv(rows,filename): 
    # write the clean data out to a CSV file 
    import csv 
    f = open(filename,'wb')    # binary mode stops the csv module adding blank lines on Windows 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    # the first row holds the column labels, so write it through unchanged 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        # strip the thousands separators and convert each value to a number 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 
    f.close()

I wanted the resulting CSV file to contain actual numbers rather than strings of digits. The loop at the end of this function strips out the commas that appear in the numbers in the PDF file and converts each value to a long integer before writing it to the CSV file.
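For example, here is what that conversion does to a single value (the number shown is illustrative, not taken from the file):

s = '1,234,567'                 # a value formatted the way the PDF formats numbers
n = long(s.replace(',', ''))    # strip the thousands separators, then convert
print n                         # prints 1234567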

The resulting CSV file now looks like this:

[Listing: the resulting CSV file, a header row followed by one row per year with numeric BUS and SKYTRAIN values]

This result is a lot easier to deal with than the original PDF file. Arguably, a small file such as this one could also be converted with the OpenOffice spreadsheet by copying from the PDF and pasting into the spreadsheet. The nice thing about doing it with a script, as above, is that the same technique works for very large PDF files that would be too onerous to convert by hand.

Here is the entire program with all of the code together at once:

def pdf_to_text(data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    # put the raw PDF bytes into a seekable file-like object 
    fp = StringIO.StringIO() 
    fp.write(data) 
    fp.seek(0) 
    # the converted text will accumulate here 
    outfp = StringIO.StringIO() 
    
    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 
    
    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t 
    
def grab_one_row(lines,n): 
    # the BUS value sits 69 lines below its label; the SKYTRAIN value sits 138 lines below 
    return (lines[n],lines[n+69],lines[n+138]) 

def rows_to_csv(rows,filename): 
    # write the clean data out to a CSV file 
    import csv 
    f = open(filename,'wb')    # binary mode stops the csv module adding blank lines on Windows 
    writer = csv.writer(f,delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    # the first row holds the column labels, so write it through unchanged 
    writer.writerow(rows[0]) 
    for row in rows[1:]: 
        # strip the thousands separators and convert each value to a number 
        writer.writerow((row[0],long(row[1].replace(',','')),long(row[2].replace(',','')))) 
    f.close()

def run(): 
    import urllib2 
    url         = 'http://www.metrovancouver.org/about/publications/Publications/KeyFacts-TransitRidership1989-2008.pdf' 
    outfilename = 'translink_bus_skytrain_trips_1989_2008.csv' 
    
    pdf = urllib2.urlopen(url).read() 
    text = pdf_to_text(pdf) 
    
    lines = text.splitlines() 
    rows = [grab_one_row(lines,i) for i in range(6,26)] 
    rows.append(grab_one_row(lines,39)) 

    rows_to_csv(rows,outfilename) 
    
if __name__ == '__main__': 
    run() 

You can find the resulting CSV file here.

Once again, Python comes through for us. Clearly it's not as easy to convert a PDF file as it is to rip a table out of an HTML file, but being possible at all makes it something we can work with. And part of the beauty of “Open” is that now that I have done it, others don't have to. And I in turn will benefit from other contributors to the open ecosystem. If we all do a bit, it's an “everyone wins” scenario.
