Byte 2 2016

  • Description: Your final product will be an exploration of a data set of your choice ([yourname]-explore.appspot.com)
  • Source Code: See https://github.com/jmankoff/data, in Assignments/jmankoff-explore. 
  • Hand In: Blackboard (see questions at end of assignment)

Overview

In this project, you will create a new application that explores static data available online through Google Fusion Tables (tutorials). This assignment has the following learning goals:
  • Displaying information using Jinja
  • Using SQL to query Google Fusion Tables
  • Using new data formats including csv and JSON
  • Reading and writing from files
  • Conducting exploratory data analysis

Detailed Instructions for Explore Data Byte

Although we have now created a visualization, we do not yet have programmatic access to the data. Our next step is to download data from Fusion Tables into Python. To do so, we will make use of the Google API Client Library. There is great getting started documentation for the api_client_library that you should look over.

Register your Application and download the OAuth keys

Go to the Google API Console and make sure you are in the correct project for this byte. Then find the Google APIs area and turn on the Fusion Tables API. You will also need to create 'Credentials' for accessing the data from a server (a 'server API key'). I named mine 'jmankoff-explore'. Leave the box for IP addresses blank and then press OK. This will generate a new key. (Ideally we would restrict the IP addresses; to do so you would need to know the IP addresses of your local computer and of Google's appspot servers. That is left as an exercise for the reader.)

Be sure to save the resulting key! You can just copy it off their website and into your code. It should be something like: XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX

We'll store the key in our code.

API_KEY = 'XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX'

Using Discovery to Handle API Access

Since we already have the API key, all we really have to do is make sure that we use it every time we request data. Since we are using one of Google's own APIs (instead of a third party's), we can leverage much more powerful libraries than last time around to access data. In particular, we will follow the description in the getting started introduction for building a 'service object'.

Before we start, let's make sure we've included the correct set of libraries in our application. To move forward, we will need to use a number of libraries from several different locations: Google-installed libraries can be found at webapp2_extras (documentation for webapp2_extras) or at google.appengine.api (documentation for google.appengine). Additional Google-supported APIs are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by Google, similarly to how we installed feedburner in the last project. You should familiarize yourself with these libraries, as you may find yourself needing them at times over the course of the semester. For Fusion Tables access we will need:

  • json will be used to decode certain response objects (from SQL queries of google fusion tables). JSON is an open text based standard for data, similar to XML in its uses.
  • jinja2 (as with the previous byte)
  • urllib (a standard library for working with urls)
  • build (from apiclient.discovery). To get this: First, download the Google API Python Client, unzip it, and copy the subdirectory 'googleapiclient' into the [yourname]-explore project directory. Build depends on some other python libraries. You will need to:
    • download six and move the file 'six.py' into the lib directory of your application
    • download httplib2 and move the directory 'httplib2' (found inside the directory 'python2') into the lib directory of your python application
    • download uritemplate and move the directory 'uritemplate' into the lib directory of your python application
    • download oauth2client and move the directory 'oauth2client' into the lib directory of your python application 
    • add the following to the libraries portion of your app.yaml file, along with webapp2 and jinja2 (a sketch of the full section appears just after this list):
      -  name: ssl
         version: latest
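
Taken together, the libraries section of your app.yaml might then look something like this (a sketch; the 'latest' version pins are assumptions, and your file may list other built-in libraries as well):

libraries:
- name: webapp2
  version: latest
- name: jinja2
  version: latest
- name: ssl
  version: latest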
Now, set up your application with the correct set of libraries:
# Imports
import os
import jinja2
import webapp2
import logging
import json
import urllib

# this is used for constructing URLs to Google's APIs
from googleapiclient.discovery import build

Now we can prepare to connect to fusion tables:

# This uses discovery to create an object that can talk to the 
# fusion tables API using the developer key
service = build('fusiontables', 'v1', developerKey=API_KEY)

At this point in the tutorial we need to have in mind specific things that we want to do with the fusiontables API. We will need to know the 'address' of the table we created. To find this, open your fusion table's File menu and click on About This Table. The ID at the bottom is the key you will need to access the data programmatically (it will be something like: NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX).

Store it in a variable:

TABLE_ID = 'NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX'

Now you can begin to explore the information you can retrieve from the table. The API structure is provided as nested method calls in the service library (as documented in the tutorial on building and calling a service). You'll want to look at the documented set of API calls available as part of the APIs explorer to find out what specific method (API) calls you can make. For example, 

request = service.column().list(tableId=TABLE_ID) 

will get a list of column names for TABLE_ID

It can be a bit confusing to tell what the correct API calls supported by your service will be. First, it is important to remember that the parameters are case sensitive (so tableID will fail for example). Second, if you look at other parts of the documentation, you may be confused about what is supported. For example, in the documentation on "Using the API" it looks as though you would need to specify service.tables(tableID=TABLE_ID) but this will not work correctly (or even compile).
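
A quick way to sanity-check that the service object is set up correctly is to execute that column-list request and log what comes back. Here is a minimal sketch, assuming the response follows the documented Column list format (the column entries live under an 'items' key, each with a 'name'):

request = service.column().list(tableId=TABLE_ID)
response = request.execute()
# log the name of each column in the table
for column in response.get('items', []):
    logging.info(column['name'])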

Exploring the data

Now that we have the data, let's explore it. First, be aware that the data is in JSON format. This format is well supported by python using a dictionary structure. We'll want to get a feel for the data using a few different tools:

  • Summarizing the raw information in different ways using descriptive statistics
  • Graphing the data
  • Asking specific questions that we might expect certain answers to or to quantify our intuition
  • Re-organizing the data into different groupings that might be more informative

While there are other tools we will develop as the course goes on, these are an excellent starting place.

First, make sure you understand the structure of the data. Each row is a different animal that arrived at the pound. What do the columns mean? What range of values do they have? Are those values categorical? Numerical? Text? Take the time to look at each column. You can use Google Fusion Tables to create a chart summarizing each column one by one. When you do, what discoveries do you make? You will probably quickly find that the Fusion Tables interface leaves much to be desired. The table itself is too large to get a sense of by scrolling through it. So what alternatives do we have?

We will download the data into Python to explore it. Let's go ahead and create a query that will get this information, using SQL, a special-purpose language for querying relational databases. There is a Fusion Tables version of SQL, and we'll be using the SELECT command. The format for a query is:

https://www.googleapis.com/fusiontables/v1/query?sql=query
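
For reference, you could also hit this endpoint directly rather than going through the service object. A sketch of that approach (assuming Python 2's standard urllib and urllib2 modules, with API_KEY and TABLE_ID as defined above) might look like this; in what follows we will let the service object do this work for us:

import urllib
import urllib2

sql = "SELECT * FROM " + TABLE_ID + " LIMIT 2"
url = ('https://www.googleapis.com/fusiontables/v1/query?'
       + urllib.urlencode({'sql': sql, 'key': API_KEY}))
raw_json = urllib2.urlopen(url).read()  # the response body is JSON text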

Let's download some sample data (the first two dogs in the table)

def get_all_data(self):
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 2"
    response = service.query().sql(sql=query).execute()
    logging.info(response['columns'])
    logging.info(response['rows'])

    return response

And look at the results in the log:

INFO     2014-01-07 19:31:41,439 main.py:88] [u'Outcome', u'AnimalID', u'AnimalType', u'IntakeDate', u'IntakeYear', u'IntakeMonth', u'Name', u'Breed', u'Age', u'Sex', u'SpayNeuter', u'Size', u'Color', u'IntakeType', u'OutcomeSubtype', u'ZipFound', u'Latitude', u'Longitude', u'ZipPlaced', u'OutcomeDate', u'OutcomeYear', u'OutcomeMonth', u'icon type', u'IconName']

INFO     2014-01-07 19:31:41,439 main.py:89] [[u'', u'A346562', u'DOG', u'2011-04-16T00:00:00Z', u'2011', u'April', u'DIXIE', u'', u'Older than 1 year', u'MALE', u'true', u'MED', u'BLACK', u'Owner Surrender', u'', u'40218', 38.1938, -85.6573, u'NaN', u'2011-04-16T00:00:00Z', u'2011', u'April', u'NaN', u''], [u'', u'A364651', u'DOG', u'2011-03-11T00:00:00Z', u'2011', u'March', u'KAYLEE', u'YORKSHIRE TERR - POODLE TOY', u'Older than 1 year', u'FEMALE', u'true', u'TOY', u'BLACK', u'Impound', u'', u'40202', 38.2529, -85.7514, u'40218', u'2011-04-01T00:00:00Z', u'2011', u'April', u'NaN', u'']]

All of those 'u' annotations mean that we are dealing with unicode strings (with the exception of the latitude and longitude data). Some of these strings are legitimate (the column names can all remain strings; the AnimalID and AnimalType should be strings; and so on). Others, though, would be easier to manipulate if they were in other formats (such as the dates). Let's start, however, with some exploratory data analysis.
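
For example, Python's datetime module can turn those date strings into datetime objects once you are ready to work with them. Here is a small sketch (the format string matches the values logged above, such as u'2011-04-16T00:00:00Z'):

from datetime import datetime

def parse_date(value):
    # e.g. u'2011-04-16T00:00:00Z' -> datetime(2011, 4, 16, 0, 0)
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')

intake = parse_date('2011-04-16T00:00:00Z')
print intake.year, intake.month  # prints: 2011 4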

Showing the data on the web 

We'd like to be able to see what has been loaded from Fusion Tables, and the log file is not entirely satisfactory. So we're going to also make sure we can see it on the web.  One of the most powerful aspects of Jinja is its ability to display dynamic information provided by Python. We can pass one or more variables to Jinja by placing them in context:

return template.render(headers=cols, content=rows)

Next, update the 'index.html' file to show the information. Since we are using bootstrap, you should consult the bootstrap table reference for details on how the table is represented in html. Note the use of {% ... %}. This indicates some logic that should be executed (in this case a for loop). The contents of {{ ... }} are replaced with their value.

<div class="table-responsive">
  <table class="table table-striped table-hover">
    <thead class="thead-inverse">
     <tr>
{% for item in headers %}
         <th>{{ item }}</th>
{% endfor %}
     </tr>
    </thead>
    <tbody>
{% for row in content %}
     <tr>
  {% for datum in row %}
            <td>{{ datum }} </td>
  {% endfor %}
     </tr>
{% endfor %}
    </tbody>
   </table>
</div> 

The resulting output looks like this:

Optional: Making your web page responsive 

The table that we just created is useful for getting a sense of your data, but it can be unwieldy to work with. In addition, it is not the best way to visualize your data. To make the web page more responsive and useful to the viewer, there are several steps. First, we want some way to select which columns of data are displayed. We will use 'ajax' to do this as follows:

First, we need to add some buttons to the web page (one for each header). We do this as follows:


    <div class="button-toolbar" role="toolbar" id="headers" name="headers">
      {% for column in allheaders %}
      <button type="button" class = "btn-space btn btn-info" id="{{column}}" onclick="updateTable('{{column}}')">{{column}}</button>
      {% endfor %}
    </div>
    
    <div id="animals">
      <table id="data" class="table table-striped table-hover">
        <thead class="thead-inverse">
          <tr id="cols">
          </tr>
        </thead>
        <tbody id="rows">
        </tbody>
      </table>
    </div>

Note that we use a loop very similar to the one before, but now instead of table entries we are adding buttons. The value btn-info corresponds to a certain bootstrap button styling. Here, each button corresponds to a column of our table, and we store that information in the button's 'id' in addition to displaying it. Another new thing here is the onclick attribute. This specifies what javascript function should be called when the button is clicked. Also notice that the table we create is empty -- we will populate it only as buttons are clicked.

Much farther down in your file, you can create a simple method for onclick that can help with debugging:   

<!-- Bootstrap core JavaScript
    ================================================== -->
    <!-- Placed at the end of the document so the pages load faster -->
    <script src="js/jquery-1.11.3.min.js"></script>
    <script src="js/bootstrap.min.js"></script>
    <script>
function updateTable(str) {
    console.log("in update header");
    console.log(str);
}
    </script>
Now you can test it out. It should look something like this:

And when you click on buttons, you can see the messages in the javascript console, which you can open from your browser's developer tools:


Now all of the remaining magic happens in your javascript function and in main.py. First, the updateTable needs to indicate to the user that the button has been selected. Here, I flip its color (if it was selected it now isn't and vice versa). Note the use of console.log for debugging and checking things:

button = document.getElementById(str);
console.log(button.className);
if (button.className.includes("btn-info")) {
    button.className = "btn btn-space btn-primary";
} else {
    button.className = "btn btn-space btn-info";
}
Second, I collect the names of all of the buttons that are selected in a single array. A lot of this code is error checking -- that the button has a valid id; that it is not already in my array. Only the var isPrimary is actually testing whether the button is selected or not.
// buttons is the collection of header buttons gathered earlier (not shown here)
cols = [];
for (button in buttons) {
    id = buttons[button].id;
    if (buttons[button].id != null) {
        var isPrimary = buttons[button].className.includes("btn-primary");
        if (isPrimary && (cols.indexOf(id) == -1)) {
            cols.push(id);
        }
    }
}

Finally, the ajax call needs to package this up and send it to main.py. Important things here: We've used the special URL _update_table, which we will have to handle in main.py. In addition, we've specified that we are using http POST, and sending json data. Finally, we've specified a function that will handle the returned values from main.py (this function right now simply logs to the console again for debugging purposes).
$.ajax({
    url: '/_update_table',
    dataType: 'json',
    contentType: 'application/json',
    data: JSON.stringify({'cols' : cols}),
    type: 'POST',
    success: function (res, status) {
        console.log("success running");
        // On Success
    }
});

Now we have to update main.py to handle the json request. This is fairly simple. First, I added a helper method that can construct a query from a list of column names:
def make_query(cols):
    string_cols = ""
    if cols == []:
        cols = ['*']
    for col in cols:
        string_cols = string_cols + ", " + col
    string_cols = string_cols[2:len(string_cols)]
    query = "SELECT " + string_cols + " FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 2"
    return query
This will only return 2 rows, and is slightly simpler than the final make_query function I created, but you can see how it works -- if no column is specified, it selects all columns. Otherwise it selects only those specified. 
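
For reference, here is a sketch of what a fuller version might look like, with the base query and the row limit passed in as parameters. The name query_animals and this exact signature are assumptions on my part (the final version I wrote may differ slightly), and the routes below assume that get_all_data now takes the resulting query string as an argument rather than hard-coding its own query:

# a base query with a placeholder for the column list, for example:
query_animals = "SELECT {0} FROM " + TABLE_ID + " WHERE AnimalType = 'DOG'"

def make_query(cols, base_query, limit):
    # fill in the column list (or '*' if none were selected) and append a row limit
    if cols == []:
        cols = ['*']
    string_cols = ", ".join(cols)
    return base_query.format(string_cols) + " LIMIT " + str(limit)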

Second, I modify the '/' route to only send over the headers (so we know what to put on the buttons):
@app.route('/')
def index():
    template = JINJA_ENVIRONMENT.get_template('templates/index.html')
    request = service.column().list(tableId=TABLE_ID)
    allheaders = get_all_data(make_query([], query_animals, 1))
    logging.info(allheaders)
    return template.render(allheaders=allheaders['columns'])

Finally, I add a route for the new POST method which makes a query and returns the full data (similar to what we were previously doing in '/'). 
@app.route('/_update_table', methods=['POST']) 
def update_table():
    logging.info(request.get_json())
    cols = request.json['cols']
    result = get_all_data(make_query(cols, query_animals, 100))
    return json.dumps({'content' : result['rows'], 'headers' : result['columns']})

Now we can go back to testing and see in the console that this returns valid data, and finally finish our code by providing a more complete success function. A lot of the 'ajax' magic happens here -- after extracting the columns and rows from the returned data, we have to use javascript to modify the page contents to show these results. First, we find the table (which we had labeled with the id "data" when we created it), its headers (id "cols") and body (id "rows"). Next we delete everything currently in the table. Finally, we fill it with new column names (the "th" elements) and actual facts about animals (the "tr" and "td" elements). All of these new elements have to be appended to the correct part of the document object model as they are created. 
success: function (res, status) {
    var cols = res["headers"];
    var rows = res["content"];

    // get the table, columns and rows
    table = document.getElementById("data");
    headers = document.getElementById("cols");
    body = document.getElementById("rows");

    // delete current contents
    headers.innerHTML = '';
    body.innerHTML = '';

    for (col in cols) {
        cell = document.createElement("th");
        cell.innerHTML = cols[col];
        headers.appendChild(cell);
    }
    for (row in rows) {
        tr = document.createElement("tr");
        for (item in rows[row]) {
            td = document.createElement("td");
            td.innerHTML = rows[row][item];
            tr.appendChild(td);
        }
        body.appendChild(tr);
    }
},

That's it! This is now an interactive table which can have columns added and deleted at will. Next year this part of byte 2 will be required and I'll add some information on how to use this method to dynamically populate a chart (using https://google-developers.appspot.com/chart/). But that's all for this year :).

Debugging as you go

Partway through this assignment, I discovered that column names containing spaces or other special characters cause problems in Fusion Tables when you access them programmatically. For this reason, change 'Estimated Age' to 'Age'; 'Outcome Type' to 'Outcome'; and 'Animal Type' to 'Animal' in the Google Fusion Tables interface. This may seem like a strange requirement, but many bugs arise for these sorts of reasons, and that's why debugging systematically so that you check all of your assumptions (and discover the assumptions of others) is so important.

The work we are doing is especially complex because it involves multiple moving parts. It is helpful to know about several options for tracing what is happening: 
  • When you are exploring solutions to a problem of this sort, I highly recommend using logging.info(...) frequently to double check that things are proceeding as you expect (see the sketch just after this list).
  • Don't forget to view the log itself, which will contain not only your printouts but also other errors that arise, exceptions that are thrown, and so on.
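
For example, a couple of lines like these inside get_all_data (a sketch) make it easy to confirm what query was sent and how much data actually came back:

logging.info("query was: %s", query)
logging.info("got %d rows back", len(response['rows']))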

Exploratory Data Analysis in Python

The remainder of this assignment is concerned with characterizing the data. To do all of this, we will create a python file just for us (not for use in your google appspot application). We will also want to download the data locally so that we can write it to a file and then experiment easily and iteratively with this. Next, we will want to summarize the data. Finally, we may want to visualize aspects of it. 

Downloading the Data

It is relatively easy to download JSON data and store it in a file. This will be the first part of our new exploratory analysis file, which we could store in '[yourname]-explore.py' and execute at the command line by calling "python [yourname]-explore.py". Note that there are some small changes in what we import (this is because at least on my local machine, webapp2 is not visible to my command line python). You may have to install some of these libraries if your python cannot see them by default, typically "pip install [library]" will work. 

import csv
import httplib2
from apiclient.discovery import build
import urllib
import json

# This API key is provided by google as described in the tutorial
API_KEY = '...'

# This is the table id for the fusion table
TABLE_ID = '...'

try:
    fp = open("data.json")
    response = json.load(fp)
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 10"
    response = service.query().sql(sql=query).execute()
    fp = open("data.json", "w+")
    json.dump(response, fp)

The try/except here checks whether the data is already downloaded by trying to open the file 'data.json'. If the data is downloaded, it simply loads it out of the file. If not, it connects to Google Fusion Tables and downloads it, just as we do in our regular Google appspot code. At the end of all of this, response contains the parsed JSON as a dictionary, exactly as in our regular Google appspot code. Once you are sure this is working, go ahead and delete the file 'data.json' and remove the words "LIMIT 10" from your code, and you can download the entire data set.

If you print len(response['rows']) you can find out how many rows are in the data set (remember, add this to the end of your explore.py code and run python explore.py to see the output). 

Now that you have a way to download the data, you will want to use python to help you explore it. At a minimum, you should 

  • write code to print out the number of occurrences of each unique value for each column you plan to analyze. Use this to decide which columns you will ignore, and whether you want to combine any values on further analysis. You can also use this to eyeball potential outliers. Here is some sample code I wrote to do this for my data set. One thing I noticed was that because this data was hand entered, there were a lot of empty values (data that was just missing a value for Age, for example).
summary = {}                   # this will be our summary of the data
columns = response['columns']  # the names of all columns
rows = response['rows']        # the actual data

# we'll ignore some columns
ignore = [u'Outcome', u'AnimalID', u'AnimalType', u'Name', u'IconName', u'IntakeDate', u'OutcomeDate', u'Latitude', u'Longitude', u'Breed']

for i in range(3, len(columns)):  # loop over the columns (the first three are in our ignore list anyway)
    if columns[i] in ignore: continue

    answers = {}  # will store unique values for this column

    for row in rows:
        key = row[i]
        # convert any unicode values to ascii, and any empty strings
        # to a string called 'EMPTY' we can use as a key
        if type(key) is unicode: key = row[i].encode('ascii', 'ignore')
        if key == '': key = 'EMPTY'

        try:                # increase the count if the key already exists
            answers[key] = answers[key] + 1
        except KeyError:    # or set it to 1 if it does not
            answers[key] = 1

    summary[columns[i]] = answers  # store the result in summary

print summary

This format is still fairly difficult to deal with, so I wrote some of these summary statistics out as csv files. In addition to making csv files and printing out statistics, I used pyplot to graph some of the numeric data. Note that for Mac users, installing things like matplotlib (which provides pyplot) is easy with 'pip install'; Windows users have in the past reported having more trouble. One student suggested the following sequence (I haven't tested it): (1) find your library executable at http://www.lfd.uci.edu/~gohlke/pythonlibs/ -- make sure it matches your python/machine details; (2) install it on your machine; (3) run python and try importing the library you just installed. You can check its version with print [library].__version__
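
As an illustration, here is a sketch that writes the per-value counts for one column out as a csv file and then scatterplots latitude against longitude. It assumes the summary, columns, and rows variables built above, that matplotlib is installed, and the column names ('Age', 'Latitude', 'Longitude') used in my table:

import csv
import matplotlib.pyplot as plt

# write the counts for the 'Age' column to a csv file
with open('age_summary.csv', 'wb') as csvfile:  # 'wb' because Python 2's csv module wants binary mode
    writer = csv.writer(csvfile)
    writer.writerow(['value', 'count'])
    for value, count in summary['Age'].items():
        writer.writerow([value, count])

# scatterplot of the latitude/longitude pairs, skipping missing values
lat = columns.index('Latitude')
lng = columns.index('Longitude')
points = [(row[lng], row[lat]) for row in rows
          if row[lat] != 'NaN' and row[lng] != 'NaN']
plt.scatter([p[0] for p in points], [p[1] for p in points])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()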

For example, here is a scatterplot of the range of latitude and longitude values in the data. Although you won't be able to use the same code for your data set, it should give you an idea of some things you can do in your own analysis. The code can be found as 'explore.py' in the github repository for assignment 2 and is carefully commented. It also contains suggestions for further analysis of these data that could be useful for other data sets. Note that when using pyplot multiple times in the same script, if you close the first plot window the next will appear.

Questions you should be able to answer when you hand this in

When you are done, you should be able to answer these questions and provide evidence to support your answers. You should do this in your appspot application in the 'about.html' and 'quality.html' files. You may use hand drawn sketches, pyplot generated plots, and live interactive graphics to illustrate your answers.
  • Is the data complete? If not, what is missing and what concerns does it raise?
  • Is the data coherent? 
    • Does it make sense relative to itself? 
    • Does it match your expectations? 
    • Is the data distributed across its possible values in a way that makes sense?
  • Is the data correct?
    • What aspects of this data can you verify?
    • What did you find?
  • Is the data accountable?
    • Where did it come from?
    • Who has access to it?