Three weeks ago I sat down to have a closer look at the datasets that were released as part of the Province of BC's Apps 4 Climate Action (A4CA) contest. I am very excited about the prospect of open data in BC and wanted to see what was available that might be interesting to use for various projects. When I started to look through the datasets, I realized I was going to need to load the data into either a spreadsheet or a database to really see what was available.
In the previous article of this series, I described how to write a script to scrape the catalog data from the web page that contains it. In this article I describe how to write a program to transform the data.
The goal of this program is to read the live catalog from the A4CA site and make it available in a usable form. The program will do some data cleaning and write the data out to a file that can be read by a spreadsheet or database program. I have decided to output the catalog data in comma-delimited form, otherwise known as a CSV file.
While writing the last article, I noticed that the downloaded data had some broken lines, extra spaces, and cells encoded with HTML entities (for example: '&ge;' instead of '>='). I want to be able to view the data in a plain text editor and in a simple spreadsheet program. This means cleaning all of these extra lines and codes out so the data is easy to work with and understand during analysis.
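To illustrate what that cleaning has to do, here is a small sketch using a made-up cell value (not one from the actual catalog): the entities get swapped for readable equivalents and the doubled spaces collapsed.

```python
# A hypothetical raw cell value, showing the kind of HTML entities
# and doubled spaces found in the downloaded catalog data.
dirty = 'Emissions  &ge; 100&nbsp;tonnes &amp; rising'

# Replace entities with readable equivalents...
clean = (dirty.replace('&nbsp;', ' ')
              .replace('&amp;', '&')
              .replace('&ge;', '>='))

# ...then collapse runs of spaces down to single spaces.
while '  ' in clean:
    clean = clean.replace('  ', ' ')

print(clean)  # Emissions >= 100 tonnes & rising
```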
The way this program will work is basically as follows:
- read the data from the catalog page
- clean the data
- write the clean data out to a csv file
The following code requires the Python programming language, which comes pre-installed on most Linux distributions and modern Macs and can be easily installed on Windows.
One of the first things we are going to need to do is grab the catalog data. I saved my code from the last article in a module called read_a4ca_catalog. To make that module available I just import it. We are also going to need a few other modules in our program so I import those at the same time. The string module will help us when we want to clean the data. We'll use the csv module when we want to write our output to a file.
```python
import read_a4ca_catalog
import string
import csv
```
Now that we have all of the basic modules loaded we are set to start coding.
Reading the dataset is ridiculously easy now that we have a module specifically written to do that. It's a one-liner that uses our module from last time:
```python
raw_data = read_a4ca_catalog.read_a4ca_catalog()
```
Now we have our data in a list variable and we need to clean it. For that we're going to go through the data row by row, cleaning each row and placing it into a new list called clean_data.

```python
clean_data = []
for row in raw_data:
    clean_data.append(cleanup_row(row))
```
Notice that the code calls a function called cleanup_row. That function takes one row from the raw catalog data and goes through it cell by cell, cleaning the data in each cell. Cleaning the data consists of replacing the encoded entities with readable alternatives, replacing duplicate spaces with single spaces, and removing all characters that we don't want. We create a string called legal_chars ahead of time, to specify which characters we consider legal for the output cells. Here's the code:
```python
legal_chars = string.digits + string.ascii_letters + string.punctuation + ' '

def cleanup_row(row):
    def cleanup_element(element):
        t1 = (element.replace('&nbsp;', ' ')
                     .replace('&amp;', '&')
                     .replace('&le;', '<=')
                     .replace('&ge;', '>=')
                     .replace('&lt;', '<')
                     .replace('&gt;', '>'))
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]
```
We need to place this function above the code that uses it. In my program I have placed it right below the import statements.
Finally, we want to write the clean data out to a file. For that I create an output file and call upon the Python csv module to write each row out to the file. Just before I write out the rows of data, I write a row of labels so that the csv file has headings.
```python
f = open('a4ca_catalog.csv', 'w')
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)
f.close()
```
That's it. We now have a .csv file that can be easily read by pretty much any spreadsheet or database program.
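As a quick sanity check on that claim, here is a small round-trip sketch using an in-memory buffer and a made-up row rather than the real catalog file. The QUOTE_ALL setting means that even cells containing commas or embedded quotes survive the trip through the file intact.

```python
import csv
import io

# A made-up row containing a comma and embedded quotes --
# the kind of cell that breaks naive comma-splitting.
rows = [['category', 'title'],
        ['Air', 'PM2.5 readings, hourly "raw" data']]

# Write with the same settings used for the catalog file.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Reading it back yields exactly what we wrote.
buf.seek(0)
print(list(csv.reader(buf)) == rows)  # True
```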
Here's the entire program with all the bits in order:
```python
import read_a4ca_catalog
import string
import csv

legal_chars = string.digits + string.ascii_letters + string.punctuation + ' '

def cleanup_row(row):
    def cleanup_element(element):
        t1 = (element.replace('&nbsp;', ' ')
                     .replace('&amp;', '&')
                     .replace('&le;', '<=')
                     .replace('&ge;', '>=')
                     .replace('&lt;', '<')
                     .replace('&gt;', '>'))
        return ''.join([c for c in t1 if c in legal_chars])
    return [cleanup_element(element) for element in row]

# read the a4ca catalog from the live web site
raw_data = read_a4ca_catalog.read_a4ca_catalog()

# clean the data
clean_data = []
for row in raw_data:
    clean_data.append(cleanup_row(row))

# write the clean data out to a file
f = open('a4ca_catalog.csv', 'w')
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['category','subtopic','title','link','agency','agencylink','format'])
for row in clean_data:
    writer.writerow(row)
f.close()
```
And you can find the resulting csv file here.
The nice thing about doing this with a program rather than manually is that if the catalog is updated from time to time, all we have to do is run our program and we have a nice clean spreadsheet to work with that has the latest and greatest data.
Python comes with a lot of tools built in for manipulating data, and it makes short work of jobs like this. Now that the data is in spreadsheet form, we can look at it to see what different types of data there are, what format it's in and where it comes from, and we can do cross tabs and graphs to help us visualize those aspects.
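For example, a first cross tab — counting how many datasets come in each format — takes only a few lines with the csv module and collections.Counter. The sample rows below are made up for illustration; against the real output you would point csv.DictReader at a4ca_catalog.csv instead of an in-memory string.

```python
import csv
import io
from collections import Counter

# Made-up sample in the same shape as a4ca_catalog.csv.
sample = '''"category","subtopic","title","link","agency","agencylink","format"
"Air","Quality","PM2.5","link1","MoE","alink1","csv"
"Water","Levels","Rivers","link2","MoE","alink2","xls"
"Air","Quality","Ozone","link3","MoE","alink3","csv"
'''

# Tally the "format" column to see which file types dominate.
counts = Counter(row['format'] for row in csv.DictReader(io.StringIO(sample)))
print(counts.most_common())  # [('csv', 2), ('xls', 1)]
```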