Last week I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest. I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects.
The A4CA data listed in the catalogue includes a range of formats and technologies. Some are easier to work with than others. Being able to browse the catalogue online is great but I really want to have a closer look and maybe do some analysis to find the data that is easy to work with. For that I need to download the data so I can work with in it spreadsheet and / or database form. In the perfect world, this data would be available on the site as a downloadable feed, with its own URL so programmers could simply point at the URL and get the data in XML format that goes with it.
How it's done:
First, checking the robots.txt file for the site (www.gov.bc.ca) file reveals that the site allows for almost any type of program including this one, so that's great.
My personal language of choice for this sort of task is Python, which comes pre-installed with all Linux and modern Mac machines and can be easily installed on Windows. In my case I am using Python 2.6 under Ubuntu 9.10. In addition to Python I am using Beautiful Soup which is an excellent library for scraping web sites.
The first thing we need to do in our program is to import the modules we need. We are going to need urllib2 to grab the page and BeautifulSoup to parse it. That's just one line:
import urllib2, BeautifulSoup
Next, we need to go out to the web and grab the HTML page and store it as a string called page:
page = urllib2.urlopen('http://data.gov.bc.ca/data.html')
Next, we create an object called soup using the BeautifulSoup library.
soup = BeautifulSoup.BeautifulSoup(page)
At this point we have the page loaded and we can read whatever parts of it we want using the methods provided by the soup object.
I am particularly interested in the table in the middle of the page, that contains the data I am after. Looking at the raw HTML code from inside my browser I see that there is only one table on this page and that it's ID is set to 'example', so that makes it pretty easy to find using the find method provided by the soup object.
data_table = soup.find('table',id='example')
We also need a place to store our results. I'll use an array for that.
records = 
Now that we have the table, we just want to cycle through the rows of the table and pull the data out. For that we can use the Python for statement with the method findAll provided by the data_table object that we created. With each row that we iterate through, we want to grab the text that is stored in each table cell. This is easily accomplished by creating an array containing all of the cells in the row and then taking the parts we want to work with from that array. Here's the code:
for row in data_table.findAll('tr'): if row.find('td'): cols = [a for a in row.findAll('td')] records.append([ cols.text, cols.text, cols.text, cols.a['href'], cols.text, cols.a['href'], cols.text, ])
Pulling the text out of the table cells is as easy as accessing the .text member. Two of the cells in each row have links, which I wanted to capture as well, so I accessed those by using the .a member and then access the href attribute which is where links are stored in HTML.
Now, each row in our records array contains one row from the table with the cell contents and links separated out. This is a good start to making this data more usable for my purposes.
Next, I plan to do some data cleaning and then start to do some analysis on it to get a feel for what's available in the A4CA catalogue.
And finally, here is the entire program:
def read_a4ca_catalog(): import urllib2, BeautifulSoup page = urllib2.urlopen('http://data.gov.bc.ca/data.html') soup = BeautifulSoup.BeautifulSoup(page) records =  data_table = soup.find('table',id='example') for row in data_table.findAll('tr'): if row.find('td'): cols = [a for a in row.findAll('td')] records.append([ cols.text, cols.text, cols.text, cols.a['href'], cols.text, cols.a['href'], cols.text, ]) return records if __name__ == '__main__': for record in read_a4ca_catalog(): print record
With Python and BeautifulSoup it's easy to extract data from a site and I would encourage anyone to give it a try. It's easier than you might think.
Now that we have the data in a form we can work with, how do we clean it up and make it more useful? I'll cover that in the next article of this series.