Homework Assignments

The homeworks in this course are broken into project bytes, which cover different topics. In addition, there will be two larger projects. You will do about one project byte every one to two weeks. The primary language we will use is Python, which has great support for data processing. We will use Google Apps to create a portfolio of applications as part of the assigned projects. You may use other platforms at your own risk. The project bytes tend to build on each other and help set you up to complete each bigger project.

All bytes are to be done as individual work. It is expected that students may assist each other with conceptual issues, but not provide code. If you use example code, you must explicitly acknowledge this. If you are unsure about these boundaries, ask.

Byte 1: Answering A Question with Fusion Table Data

Byte 2: Exploring and Cleaning Data and Using an API

Byte 3: Visualizing Mobile Data 

Project #1: Develop A Visual Data Narrative

Byte 4: Machine Learning

Final Project

Old Materials (For Inspiration Only!)

Older or in-progress Bytes not being used this year (could be a source of inspiration or instruction for final projects)

Byte 6: Working with Big Data

posted Mar 22, 2014, 7:00 PM by Jen Mankoff   [ updated Mar 23, 2015, 4:43 PM ]

Description: Your final product will visualize and reason about public Vital Birth Statistics data.

Overview

To circumvent some of the difficulties (and costs!) of setting up a large data set, and because the course is primarily focused on analytics, we will be using Google's BigQuery for this assignment. You may find this article on getting started with Google BigQuery useful. We'll make use of the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). A range of interesting questions can be explored in that data set. There is also a great tutorial on creating a graph of some data from this set that may be helpful to read. Our work will focus on visualizing some basic statistics and developing a classifier. We will be 'closing the loop' in this assignment, meaning that once the classifier is trained, we will let people use it within Google appspot. Learning goals for this assignment include:
  • Working with a massive data set
  • Becoming familiar with BigQuery
  • Applying machine learning algorithms that can function with very large data 
  • Thinking about how to close the loop and actually use a classifier we produce 

The Tutorial

Before we can begin using BigQuery, you will need to create a [yourname]-byte6 application in Google Appspot using the same methods we've used in the other bytes, and turn on the BigQuery API in the Google Developers Console:


You will also need to install the Google Cloud SDK on your local machine, and add the following libraries to your appspot application: 'apiclient', 'httplib2', 'oauth2client' and 'uritemplate'. The simplest way to do this is to copy those directories over from [yourname]-byte3. Your app.yaml file should reference the following handlers and libraries (so that we can visualize results and so on):
handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico
- url: /stylesheets
  static_dir: stylesheets
- url: /d3
  static_dir: d3
- url: /data
  static_dir: data
  application_readable: true
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: "2.5.2"
- name: MySQLdb
  version: "latest"
- name: jinja2
  version: latest
- name: numpy
  version: latest
- name: django
  version: latest

Accessing the data set at the command line

To circumvent some of the difficulties (and costs!) of setting up a large data set in BigQuery (which limits how much data you can upload per day, and charges for data sets), we'll be using the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). We can see that a range of interesting questions can be explored.

A quick way to try out different queries of interest is using bq shell --project_id=XXXXX at the command line. Once you install the Google Cloud SDK you should have access to the program bq. In my case, I have to start a new shell in Terminal and then run two scripts Google provides inside the installation directory for the Cloud SDK ('completion.bash.inc' and 'path.bash.inc'). Once I run those scripts, I can type at the prompt and see the following:

$ bq shell --project_id=jmankoff-byte6

Welcome to BigQuery! (Type help for more information.)

jmankoff-byte6> 

Now I can enter a query such as:

jmankoff-byte6>SELECT state, count(*) FROM [publicdata:samples.natality] GROUP by state

and the response looks something like the following table (elided for brevity). Note the large entry under NULL -- we never get away from data set problems, do we!

+-------+----------+
| state |   f0_    |
+-------+----------+
| AL    |  1957392 |
| AK    |   324183 |
| AZ    |  1837630 |
...
| WI    |  2294354 |
| WY    |   218735 |
| NULL  | 16998008 |
+-------+----------+
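If the NULL rows get in the way of your exploration, you can filter them out and sort the counts directly in the query. Something along these lines should work in BigQuery's SQL dialect (a sketch for experimentation, not part of the assignment's required queries):

jmankoff-byte6>SELECT state, COUNT(*) AS total FROM [publicdata:samples.natality] WHERE state IS NOT NULL GROUP BY state ORDER BY total DESC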

Creating a Cloud Storage Area of Your Own

If you wish to make use of Google's cloud database on your own (something that was not required in Byte 4, but is required if you want to use further storage in future projects, including this one), you will need to enable billing with Google. Simply click on the Cloud Storage Browser (shown at left) in the Google Developers Console 
... and you will see the following message:
You will need to click 'enable billing' and enter your credit card information. If you are concerned about costs, know that all the data stored in Byte 4's mobile data database has only cost me $30 so far this semester. The details of Google's pricing for Cloud SQL and pricing for BigQuery differ, but we are using public BigQuery data, so we only need to worry about Cloud SQL in this Byte. 

Once you enable billing, you can create a new instance where we can store some data.
Let's call it 'birthstats':



Accessing Data from Within Google Appspot

The command prompt is a useful place to experiment, but what we really want to do is access the same data from within Google Appspot. We will need to make use of OAuth2.0 to access data from within Google Appspot. For this, we will first need to know certain information about the application you have created. Start by clicking on [yourname]-byte6 in Google Developer Console. 

To get the credentials for OAuth2 click on "APIs & auth" and then 'Credentials'. From there you can click 'download JSON' to download a client secrets file. 

You should also double check under permissions that the following permissions are enabled:

Note that the top email starts with a number. This is your 'project number' (you can also see it if you click on the 'Overview' tab). Make sure to record it in 'main.py' under BigQuery API settings:
# BigQuery API Settings
PROJECT_NUMBER        = 'XXXXXXXXXXXX' 

Next we will need to specify the correct credentials and set up the authorization. Since we are only using a public data set, we want the readonly credentials mentioned on Google's BigQuery Authentication Documentation:
credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery.readonly')
# authorize an Http object with the app's service account credentials
http        = credentials.authorize(httplib2.Http())
service     = build("bigquery", "v2", http=http)
Now, assuming we have things set up for jinja2, in get(self) we can make a query and send the results to the 'index.html' template:

query     = {'query':'SELECT state, count(*) FROM [{0}] GROUP by state;'.format(_DATABASE_NAME), 'timeoutMs':10000}

# service is the oauth2 setup that we created above
jobRunner = service.jobs()
# project number is the project number you should have defined in your app
reply     = jobRunner.query(projectId=_PROJECT_NUMBER,body=query).execute()
# similar to the google SQL work we did in byte4, the stuff we care about 
# is in rows
rows = reply[u'rows']
states = []
for row in rows:
    name = row[u'f'][0][u'v']
    num = row[u'f'][1][u'v']
    # make sure we have a string in every case
    if name == None: name = u'None'
    state = {'state':unicode.encode(name), 'total':int(num)}
    states = states + [state]

context = {"states": states}
# and render the response
self.render_response('index.html', context)
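The snippet above assumes a render_response helper is available on the handler. If you are building main.py from scratch rather than copying over byte 3's sample code (which already defines something equivalent), a minimal version based on webapp2_extras.jinja2 looks roughly like this (a sketch; the class name is a placeholder):

import webapp2
from webapp2_extras import jinja2

class BaseHandler(webapp2.RequestHandler):
    @webapp2.cached_property
    def jinja2(self):
        # a Jinja2 renderer cached in the app registry
        return jinja2.get_jinja2(app=self.app)

    def render_response(self, _template, context):
        # render the template with the given context dictionary and write it out
        self.response.write(self.jinja2.render_template(_template, **context))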

Finally we can show the table using jinja2:
<!DOCTYPE html>
<html>
<head>
  <link href="stylesheets/d3.css" rel="stylesheet" type="text/css">
  <meta charset="UTF-8">
  <title>Data Pipeline Project Byte 6 -- Big Data</title>
</head>
 <body>
   <h1>Data Pipeline Project Byte 6</h1>
   <h2>Query Results</h2>
   <ul>
   {% for state in states %}
      <li>{{state['state']}} : {{state['total']}}</li>
   {% endfor %}
   </ul>
 </body>
</html>

An excerpt from the results:


Debugging Hint

Unfortunately, the oauth2 queries will only work on the Appspot server. To avoid having to do all of our testing after uploading to the server each time, we can capture the query result from the server log and save it in 'data/states.json' (provided with your source code). Then we can modify our application to check whether or not we are running on the server and either run the original query or just load the 'states.json' file depending on our status. Eventually when we go interactive this will not work, but it will help tremendously at this stage. I have created a function run_query(query_string, filename), which decides whether to run the query or load a file depending on where the code is running: 
 if (os.getenv('SERVER_SOFTWARE') and
     os.getenv('SERVER_SOFTWARE').startswith('Google App Engine/')):
     query = {'query':query_string, 'timeoutMs':10000}
     jobRunner = service.jobs()
     return jobRunner.query(projectId=_PROJECT_NUMBER, body=query).execute()
 else:
     # open the data stored in a file called filename
     try:
        fp = open(filename)
        return simplejson.load(fp)
     except IOError:
        logging.info("failed to load states.json file")
        return None

I had some trouble creating 'states.json' by hand so eventually I pasted it into a variable using python at the command line (e.g. data = <paste>) and then typed: 
>>> fp = open("data/states.json", 'w+')
>>> json.dump(data,fp)

Some additional debugging ideas: I find it useful to debug my python JSON parsing at a command line prompt locally (by grabbing the JSON produced by queries from the server log after uploading my application for the first time). Equally useful is to debug your SQL queries using the shell mentioned above. This will save you many iterations (and uploads when working on the server) and make it much easier to identify and fix problems. For example, I copied the query results into a 'data' variable at my python prompt and then went through the following sequence of debugging exercises:
>>> rows = data[u'rows']
>>> row = rows[0]
>>> row
{u'f': [{u'v': u'AL'}, {u'v': u'1957392'}]}
>>> state = row[u'f'][0][u'v']
>>> num = row[u'f'][1][u'v']
>>> "state %s num %s " % state, num
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "state %s num %s " % state num
  File "<stdin>", line 1
    "state %s num %s " % state num
                                 ^
SyntaxError: invalid syntax
>>> "state %s " % state 
u'state AL '
>>> unicode.decode(state)
u'AL'
>>> unicode.encode(state)
'AL'
>>> num = row[u'f'][1][u'v']
>>> int(num)
1957392

This last bit was after my web page [yourname]-byte6.appspot.com returned the following error. I traced it to the very last state returned (by printing each state and number to the server log) and realized that the empty state values were returning some sort of special value instead of a string:


>>> if state = None: state = u'None'
  File "<stdin>", line 1
    if state = None: state = u'None'
              ^
SyntaxError: invalid syntax
>>> if state == None: state = u'None'
... 
>>> state
u'AL'
>>> state = None
>>> if state == None: state = u'None'
... 
>>> state
u'None'

Homework Part 1: Visualizing your data

At this point, you have all of the information you need in order to query and visualize historical data about births in the U.S. since the 1970s. Rather than walk you through things you already know how to do, I will simply encourage you to play around and discover some interesting new things. You can read up more on the BigQuery API for Python, and explore the sample queries to get some ideas for what to do with the data. You don't absolutely have to stick with the birth data (other public data sets are available); however, the second half of this assignment is designed around using machine learning on the same data, so you may find it valuable to do so. 

Machine Learning with Big Data

For the machine learning part of this assignment, you will attempt to predict the health of a baby being born. Health of babies is measured using an Apgar score, which ranges from 0 (unresponsive and not breathing) to 10 (perfect). We will use 7 as a cutoff for 'unhealthy'. In order to make your life much, much easier, I have implemented a number of helper functions for the machine learning work you need to do. The key ones you should look at in the code, all well commented, include:
  • get_data (which will fetch a data set you can use for optimization, testing, or training) and return it as two arrays, the feature array, and the labels. get_data is also used to get a test set. We separate the test set and the training set by assigning odd rows to one and even rows to the other from the database. We use a random number as the index to ensure that we get different rows each time we run it. 
  • make_query_string (Helper function that makes a string based on the features and predict value that can retrieve data for get_data. If you want to use non-numeric features, you will have to modify this function).  
  • run_query (you already heard about this)
  • apgar_1min_test (labels an apgar value as 'healthy' (1) or 'unhealthy' (0))
  • loss_func (this is the standard loss function discussed in class. You won't need to modify it)
  • gradient_descent (this is the standard logistic regression gradient descent function discussed in class. You shouldn't need to modify it)
  • stochastic_gradient_descent (this is the function that collects up examples of X and y labels and calls gradient_descent on them repeatedly to find an approximate minimum. If you want to experiment with the regularization parameter, you might consider making a similar function which pulls down a single optimization set and then tries a bunch of different lam values on it. However, you should not need to directly modify this function, except possibly to change the batch size to something smaller so things run quickly.) 
  • apply_model (this function applies a machine learning model to a test data set that is passed in to it. it calculates the accuracy and returns it). 
You may also want to make note of these global variables:
  • _MAX_ROWS is the maximum number of rows that will be drawn from the database in any given query
  • features is the set of features that will be drawn from the database
  • _APGAR_CUTOFF is the cutoff used for the apgar value 
  • _TIMEOUT controls how long it is ok to wait for things to finish without causing an error

The gradient_descent algorithm normally loops repeatedly over the same data set. This is an appropriate way to train logistic regression using gradient descent. However, a better approach for big data is to select a different random subset each time we loop over the data (i.e., to train over small batches). This does require additional queries, but has the nice property that we do not have to load the entire data set into memory in order to complete the learning process. It also works nicely in situations where parallelization is possible. This is a variant of stochastic gradient descent, which is supported by mainstream toolkits (including scikit-learn). A great tutorial on stochastic gradient descent, with Python code that you can use, can be found on deeplearning.net.
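If you would like to sanity-check your results against an off-the-shelf implementation, scikit-learn's SGDClassifier supports exactly this kind of mini-batch training through partial_fit. The sketch below is for local experimentation only (scikit-learn is not available in your appspot application), and get_batch is a hypothetical stand-in for a function like get_data above:

import numpy as np
from sklearn.linear_model import SGDClassifier

# logistic regression trained with stochastic gradient descent, one mini-batch
# at a time (sketch; get_batch() is hypothetical and should return a feature
# matrix and a label vector, much like get_data does in the sample code)
clf = SGDClassifier(loss='log', alpha=0.01)
classes = np.array([0, 1])  # unhealthy / healthy

for i in range(100):  # number of mini-batches
    X_batch, y_batch = get_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)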

Note that the heart of logistic regression is a simple loop in which we repeatedly recalculate the weights (up to a maximum number of times) and check whether we have reached a stop condition. The following code is a sample of stochastic gradient descent, which uses one example to train the model in each iteration. You can also try batch gradient descent, which trains your model using all the examples you have retrieved in each iteration.
for i in xrange(iterations):
  for r in xrange(_MAX_ROWS):
    # run the next iteration of gradient descent
    weights = old_weights + step*(self.loss_func(X[r], old_weights) - y_labels[r])*X[r]
    weights = weights - lam*2.*step*old_weights

Note that a more efficient implementation of this could take advantage of numpy's matrix manipulation abilities. That is left as an (optional) exercise for the student in this instance and throughout the sample code. 
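As an aside, one possible vectorized version of the inner loop (under the assumption that loss_func(x, w) computes the logistic function of the dot product of x and w, as discussed below) replaces the per-row Python loop with a couple of numpy matrix operations. This mirrors the sample update rule above rather than deriving a new one:

# vectorized sketch of one pass over the batch (np is numpy, as in the sample code)
predictions = 1.0 / (1.0 + np.exp(-np.dot(X, old_weights)))   # one prediction per row
weights = old_weights + step * np.dot(predictions - y_labels, X)
weights = weights - lam * 2. * step * old_weights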

In stochastic gradient descent we add an outer loop to this:

for i in xrange(batch):
  # collect some good sample data
  bad_births, bad_births_labels = self.get_data(features, predict, prediction_test, zeros_string)
  good_births, good_births_labels = self.get_data(features, predict, prediction_test, ones_string)
  y_labels = np.concatenate([bad_births_labels, good_births_labels])
  X = np.concatenate([bad_births, good_births])

  # and run gradient descent on it
  weights = self.gradient_descent(weights, X, y_labels)

# return a model at the end
model = lambda x: self.loss_func(x, weights)<.5
return model, weights

Note the importance of the loss function both in the gradient descent algorithm and in building our final model. As discussed in class, it is fairly straightforward: 

logistic = lambda s: 1.0 / (1.0 + numpy.exp(-s))

We can use this to calculate the accuracy in apply_model. Or we can simply make a single prediction, if we want to put this online for others to use. 
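To make the tasks below concrete, here is a rough sketch of how accuracy, precision, and recall can be computed from a model and a held-out test set, plus a ZeroR baseline. The names X_test, y_test, and y_train are hypothetical stand-ins for arrays returned by get_data; the real apply_model in the sample code may be organized differently:

# evaluate a 0/1 model on a test set (sketch)
predictions = [model(x) for x in X_test]
tp = sum(1 for p, y in zip(predictions, y_test) if p == 1 and y == 1)
fp = sum(1 for p, y in zip(predictions, y_test) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(predictions, y_test) if p == 0 and y == 1)
accuracy  = sum(1 for p, y in zip(predictions, y_test) if p == y) / float(len(y_test))
precision = tp / float(tp + fp) if (tp + fp) > 0 else 0.0
recall    = tp / float(tp + fn) if (tp + fn) > 0 else 0.0

# ZeroR baseline: always predict the majority class seen in the training labels
majority = 1 if sum(y_train) >= len(y_train) / 2.0 else 0
zeror_model = lambda x: majority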

Your job is to use this code to further explore how predictable a baby's health is (you can switch to a different data set if you want, of course; it is probably best to use another one freely available on BigQuery). My accuracy is around 88% so far on the data I've played with (with a more extensive set of features than I've given you). Specifically, you should do the following:
  1. implement a zeror learner which returns a model that can be used to calculate zeror accuracy using apply_model for comparison to the regression model
  2. extend apply_model to also calculate precision and recall
  3. Show the results on your web page (you may want to consider how to explain the weights in terms of which features are important). Extra kudos if you check for significant difference between zeror and your regression model and report that too.
In some ways the specifics of the implementation don't matter -- we are simply fitting to data and then predicting using what we learned. In other ways they matter quite a bit, however. Let's look at our data and how we connect it to this algorithm.

Handin Expectations

The handin expectations for this assignment are based around your working byte6:
1) Your byte should show some exploration of the data set (beyond the table already included with the assignment), and show the results of the zeror and regression learners for accuracy, precision and recall.
2) Your byte should show the regression results as described above 

Going Further (non required source of fun ideas)

There is lots more fun stuff you can do with this Byte. For example, you could improve the speed by using better matrix math and/or adding ajax support for showing incremental results as the stochastic gradient descent algorithm is running (you'll want to significantly decrease _MAX_ROWS and increase batch size in that case). Or you could implement a user interface that lets someone set parameters themselves (as if they were a pregnant mom) and see the prediction.

You could also create an optimization set, and use it to try some things that improve the accuracy further (depending on your coding skill, you could play with a different data set, play with different values for the regularization parameter lam, try adding different features, add support for non-numeric features, and so on).

Google BigQuery provides access to several public data sets. One is GitHub's commits (including the comments). This blog post looks at the emotions in those comments.

There are lots of other data sets one could upload to a service like BigQuery. One example data set that could be a lot of fun to work with is described in this paper on Meme tracking in the global news cycle. The meme dataset can be downloaded from snap and a description of the structure is there as well. It's quite simple. Each file includes: The URL of the document (labeled P) and the time of the post (labeled T) and one or more key phrases  (labeled Q for quote) as well as zero or more hyperlinks to other documents in the data set (labeled L). 
Before the data can be loaded into BigQuery, it needs to be formatted according to the BigQuery guidelines. One key requirement is that it be denormalized, luckily already a property of this data set. However, it is not in JSON format, so the first step is to write a quick script to convert it.

How could you improve on the user experience? What visualizations might you add? How could you improve on the speed?

If you want to use Google's prediction API, you can do quite a bit even without scikit-learn (including online learning of regression). To use it, however, you have to move subsets of the data into regular cloud storage from the BigQuery storage.  

Byte 2: Exploring Data

posted Jan 7, 2014, 8:12 AM by Jen Mankoff   [ updated Jan 23, 2017, 6:59 PM by Nikola Banovic ]

  • Description: Your final product will be an exploration of data set of your choice ([yourname]-explore.appspot.com)
  • Source Code: See https://github.com/jmankoff/data, in Assignments/jmankoff-explore. 
  • Hand In: Canvas (see questions at end of assignment)

Overview

In this project, you will create a new application that explores static data available online through Google Fusion Tables (tutorials, examples, and public tables). This assignment has the following learning goals:
  • Accessing data
  • Displaying information using Jinja (useful for later Bytes and Projects)
  • Using SQL to query Google Fusion Tables
  • Using new data formats including CSV and JSON
  • Reading and writing from files
  • Conducting data clean up and exploratory data analysis

Detailed Instructions for Explore Data Byte

Although we have created a visualization in the previous byte, we do not yet have programmatic access to the data. Our next step is to download data from fusion tables into Python. To do so, we will make use of the Google Cloud Console. There is great getting started documentation for the api_client_library that you should look over.

Register your Application and Download the API Keys

Go to the Google Cloud Console and make sure you are in the correct byte. Then find the API Manager area, and turn on the Fusion Tables API (hint: you may have to search for it in the Google Cloud Console). You will also need to create 'Credentials' for accessing the data from a Web server (an 'API key'). I named mine 'jmankoff-explore'. Leave the box for IP addresses blank and then press OK. This will generate a new key (ideally we would restrict the IP address, and to do so you would need to know the IP addresses for your local computer and for google's appspot servers. That is left as an exercise to the reader).

Be sure to save the resulting key! You can just copy it off their website and into your code. It should be something like: XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX

We'll store the key in our code.

API_KEY = 'XXxxXxXxNXXXxXxxNxNNxxXxxXNxN_XNxXXXNX'

Using Discovery to Handle API Access

Since we already have the API key, all we really have to do is make sure that we use it every time we request data. Since we are using one of Google's own APIs (instead of a third party) we can leverage much more powerful libraries than last time around to access data. In particular, we will follow the description in the getting started introduction for building a 'service object'

Before we start, let's make sure we've included the correct set of libraries in our application. To move forward, we will need to use a number of libraries taken from several different locations: Google installed libraries can be found at webapp2_extras (documentation for webapp2_extras) or at google.appengine.api (documentation for google.appengine). Additional google supported apis are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by google. You should familiarize yourself with these libraries as you may find yourself needing them at times over the course of the semester. For Fusion Table access:

  • json will be used to decode certain response objects (from SQL queries of google fusion tables). JSON is an open text based standard for data, similar to XML in its uses.
  • jinja2 (as with the previous byte)
  • urllib (a standard library for working with urls)
  • build (from apiclient.discovery). To get this: First, download the Google API Python Client, unzip it, and copy the subdirectory 'googleapiclient' into the [yourname]-explore project directory. Build depends on some other python libraries. You will need to 
    • download six and move the file 'six.py' into the lib directory of your application
    • download httplib2 and move the directory 'httplib2' (found inside the directory 'python2') into the lib directory of your python application
    • download uritemplate and move the directory 'uritemplate' into the lib directory of your python application
    • download oauth2client and move the directory 'oauth2client' into the lib directory of your python application 
    • add the following to the libraries portion of your app.yaml file (along with webapp2 and jinja2)
      -  name: ssl
         version: latest
Now, set up your application (main.py) with the correct set of libraries:
# Imports
import os
import jinja2
import webapp2
import logging
import json
import urllib

# this is used for constructing URLs to google's APIS
from googleapiclient.discovery import build

Now we can prepare to connect to fusion tables:

# This uses discovery to create an object that can talk to the 
# fusion tables API using the developer key
service = build('fusiontables', 'v1', developerKey=API_KEY)

At this point in the tutorial we need to have in mind specific things that we want to do with the fusiontables API. We will need to know the 'address' of the table we created. To find this, open your fusion table's File menu and click on About This Table. The ID at the bottom is the key you will need to access the data programmatically (it will be something like: NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX).

Store it in a variable:

TABLE_ID = 'NxxxNXxXxxNxXXXXNXxXXXxXxxxNxXxNxXxxXxxX'

Now you can begin to explore the information you can retrieve from the table. The API structure is provided as nested method calls in the service library (as documented in the tutorial on building and calling a service). You'll want to look at the documented set of API calls available as part of the APIs explorer to find out what specific method (API) calls you can make. For example, 

request = service.column().list(tableId=TABLE_ID) 

will get a list of column names for TABLE_ID.
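To actually see those column names, execute the request and walk the response. The column entries should each carry a 'name' field, though this sketch is hedged against minor differences in the response format:

response = request.execute()
# log each column name returned by the Fusion Tables API
for column in response.get('items', []):
    logging.info(column['name'])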

It can be a bit confusing to tell what the correct API calls supported by your service will be. First, it is important to remember that the parameters are case sensitive (so tableID will fail for example). Second, if you look at other parts of the documentation, you may be confused about what is supported. For example, in the documentation on "Using the API" it looks as though you would need to specify service.tables(tableID=TABLE_ID), but this will not work correctly (or even compile).

Using the data

Now that we have the data, let's explore it. First, be aware that the data is in JSON format. This format is well supported by python using a dictionary structure. We'll want to get a feel for the data using a few different tools:

  • Summarizing the raw information in different ways using descriptive statistics
  • Graphing the data
  • Asking specific questions that we might expect certain answers to or to quantify our intuition
  • Re-organizing the data into different groupings that might be more informative

While there are other tools we will develop as the course goes on, these are an excellent starting place.

First, make sure you understand the structure of the data. Each row is a different animal that arrived at the pound. What do the columns mean? What range of values do they have? Are those values categorical? Numerical? Text? Take the time to look at each column. You can use Google Fusion Tables to create a chart summarizing each column one by one. When you do, what discoveries do you make? You will probably quickly find that the fusion tables interface leaves much to be desired. The table itself is too large to get a sense of by scrolling through it. So what alternatives do we have?

We will download the data into Python to explore it. Let's go ahead and create a query that will get this information, using sql, a special-purpose language for querying relational databases. There is a fusion tables version of SQL and we'll be using the SELECT command. The format for a query is:

https://www.googleapis.com/fusiontables/v1/query?sql=query

Let's download some sample data for the first two dogs in the table (note that your column names will likely be different):

def get_all_data(self):
    query = "SELECT * FROM " + TABLE_ID + " WHERE  AnimalType = 'DOG' LIMIT 2"
    response = service.query().sql(sql=query).execute()
    logging.info(response['columns'])
    logging.info(response['rows'])
        
    return response

And look at the results in the log:

INFO     2014-01-07 19:31:41,439 main.py:88] [u'Outcome', u'AnimalID', u'AnimalType', u'IntakeDate', u'IntakeYear', u'IntakeMonth', u'Name', u'Breed', u'Age', u'Sex', u'SpayNeuter', u'Size', u'Color', u'IntakeType', u'OutcomeSubtype', u'ZipFound', u'Latitude', u'Longitude', u'ZipPlaced', u'OutcomeDate', u'OutcomeYear', u'OutcomeMonth', u'icon type', u'IconName']

INFO     2014-01-07 19:31:41,439 main.py:89] [[u'', u'A346562', u'DOG', u'2011-04-16T00:00:00Z', u'2011', u'April', u'DIXIE', u'', u'Older than 1 year', u'MALE', u'true', u'MED', u'BLACK', u'Owner Surrender', u'', u'40218', 38.1938, -85.6573, u'NaN', u'2011-04-16T00:00:00Z', u'2011', u'April', u'NaN', u''], [u'', u'A364651', u'DOG', u'2011-03-11T00:00:00Z', u'2011', u'March', u'KAYLEE', u'YORKSHIRE TERR - POODLE TOY', u'Older than 1 year', u'FEMALE', u'true', u'TOY', u'BLACK', u'Impound', u'', u'40202', 38.2529, -85.7514, u'40218', u'2011-04-01T00:00:00Z', u'2011', u'April', u'NaN', u'']]

All of those 'u' annotations mean that we are dealing with unicode strings (with the exception of the latitude and longitude data). Some of these strings are legitimate (the column names can all remain strings; the AnimalID and AnimalType should be strings; and so on). However others would be easier to manipulate if they were in other formats (such as the dates). Let's start, however with some exploratory data analysis. 
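As one small example of the kind of clean up you might eventually do: a date string in the format shown above can be turned into a proper datetime object, which is easier to compare and group. The exact format string here is an assumption based on the log output above:

from datetime import datetime

# convert one of the intake dates from the log output into a datetime object
intake = datetime.strptime('2011-04-16T00:00:00Z', '%Y-%m-%dT%H:%M:%SZ')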

Showing the data on the web 

We'd like to be able to see what has been loaded from Fusion Tables, and the log file is not entirely satisfactory. So we're going to also make sure we can see it on the web.  One of the most powerful aspects of Jinja is its ability to display dynamic information provided by Python. We can pass one or more variables to Jinja by placing them in context in the appropriate handler (handler for index.html in main.py in this case):

return template.render(headers=cols, content=rows)
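In the handler for index.html, cols and rows can come straight out of the query response returned by get_all_data. A sketch (exact names such as JINJA_ENVIRONMENT depend on how your byte loads its templates):

# inside the handler for index.html (sketch)
template = JINJA_ENVIRONMENT.get_template('templates/index.html')
response = self.get_all_data()
cols = response['columns']   # column names from the Fusion Tables response
rows = response['rows']      # one list per animal
return template.render(headers=cols, content=rows)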

Next, update the 'index.html' file to show the information. Since we are using bootstrap, you should consult the bootstrap table reference for details on how the table is represented in html. Note the use of {% ... %}. This indicates some logic that should be executed (in this case a for loop). The contents of {{ ... }} are replaced with their value.

<div class="table-responsive">
  <table class="table table-striped table-hover">
    <thead class="thead-inverse">
     <tr>
{% for item in headers %}
         <th>{{ item }}</th>
{% endfor %}
     </tr>
    </thead>
    <tbody>
{% for row in content %}
     <tr>
  {% for datum in row %}
            <td>{{ datum }} </td>
  {% endfor %}
     </tr>
{% endfor %}
    </tbody>
   </table>
</div> 

The resulting output looks like this:

The table that we just created is useful for getting a sense of your data, but it can be unwieldy to work with. In addition, it is not the best way to visualize your data. Instead, it is often better to perform data clean up outside of the application and then use clean data in your application.

Choose your adventure: conducting data clean up and exploratory data analysis

The remainder of this assignment is concerned with characterizing the data, but you must first choose how you want to clean up and explore your data. Below you will find two options. The first option (Beginner to Intermediate level) uses a GUI application to load, clean up, and explore your data. The other option (Expert level) uses a Python Notebook and a number of Python libraries. Choose your adventure based on your expertise. Note that you should choose only one option!

Exploratory Data Analysis in Wrangler (Beginner-Intermediate Level)

In this adventure you will use Wrangler to unwrangle your data. The Wrangler page has a video tutorial to help you get started. Make sure you watch and follow this tutorial to get familiar with the tool.

The first step is to download the data from your Fusion Table. You can do this by opening your table in the web browser, and then clicking on the "Download..." button in the File menu. Select CSV and save the file to your local machine. Then load the data into Wrangler and clean up your data using the Wrangler UI.

Once you are done cleaning up the data, export it by clicking on the "Export" button and then selecting Comma Separated Values. Once you have downloaded the data, load it into a new Fusion Table (using the Fusion Tables UI, by selecting "New Table" from the File menu).

Then use the same visualization technique as in Byte 1 to display your Fusion Table data in your appspot application.

Exploratory Data Analysis in Python (Expert Level)

In this adventure you will use Google Data Labs Notebook to clean up and explore the data. To do all of this, we will create a Python Notebook just for us (not for use in your google appspot application). We will also want to download the data locally so that we can write it to a file and then experiment easily and iteratively with this. Next, we will want to summarize the data. Finally, we may want to visualize aspects of it.

To create a Python Notebook, we will use Google Data Labs (Google's version of Jupyter), which connects to your Google Cloud Platform projects. Google enables you to run the notebook both locally and in the cloud (mostly for large jobs), but the cloud version may incur charges. For this byte we will stick with the local version. Please follow this tutorial to install and configure your local Google Data Labs.

Once you install and run your local Google Data Labs instance (HINT: remember to connect it to your Google Cloud Project for this byte), you can connect to it by going to http://localhost:8081/. Note that all of the documentation and tutorials on how to use Google Data Labs will be included with the installation and accessible as notebooks. You can access these resources by going to http://localhost:8081/tree/datalab/docs

Once you are familiar with the basics of Python Notebooks go to the notebooks directory and start a new notebook (you could name it "[your name]-exploration"):

It is relatively easy to download Fusion Table data in JSON format and store it in a file. This will be the first part of our new exploratory analysis file, which we could download and store in our new notebook. Note that there are some small changes in what we import. You may have to install some of these libraries if your python cannot see them by default, typically "pip install [library]" will work. 

import csv
import httplib2
from apiclient.discovery import build
import urllib
import json

# This API key is provided by google as described in the tutorial
API_KEY = '...'

# This is the table id for the fusion table
TABLE_ID = '...'

try:
    fp = open("data.json")
    response = json.load(fp)
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE AnimalType = 'DOG' LIMIT 10"
    response = service.query().sql(sql=query).execute()
    fp = open("data.json", "w+")
    json.dump(response, fp)

The try/except here checks whether the data is already downloaded by trying to open the file 'data.json'. If the data is downloaded, it simply loads it out of the file. If not, it connects to google fusion tables and downloads it, just as we do in our regular Google appspot code. At the end of all of this, response contains the parsed json as a dictionary just exactly as in our regular Google appspot code. Once you are sure this is working, go ahead and remove the 'data.json' and the words "LIMIT 10" from your code and you can download the entire data set.  If you print len(response['rows']) you can find out how many rows are in the data set. 

Now that you have a way to download the data, you will want to use python to help you explore it. However, JSON data is notoriously difficult to view, so we will transfer the data into Pandas (a Python library made specifically for data analysis) and create a Pandas data frame. First, import Pandas and related libraries into your notebook (Pandas is installed by default):

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

Then load the JSON data into a data frame:
data_df = pd.DataFrame(response[u'rows'], columns = response[u'columns'])

and display a few rows to make sure everything works:

data_df.head()

Pandas allows you to quickly access your data. Here is an example of how to access two columns from the original data frame:

lat_lon_df = data_df[['Latitude', 'Longitude']]
lat_lon_df.head()

We can also quickly clean up the data. For example, we can convert Latitude and Longitude to numeric values and remove empty values (all in one line). The code below applies the to_numeric() function to each column in the data frame, then removes all NaN values:

lat_lon_df = lat_lon_df.apply(lambda x: pd.to_numeric(x, errors='ignore')).dropna()

Now that we have the data in a data frame, we can use Seaborn (a Python library for statistical data visualization) to visualize and explore the data. Seaborn also comes installed with the Google Data Labs. For example, use Seaborn to display a pairplot to visualize Latitude and Longitude as both individual histograms and scatter plots:

sns.pairplot(lat_lon_df)

Although you won't be able to use the same code for your data set, the sample notebook should give you an idea of some things you can do in your own analysis. It can be found as 'explore.ipynb' in the github repository for assignment 2 and is carefully commented. You can also find ideas for different Seaborn plots in this gallery.

Once you have the plots, save them as images and include them in your appspot application.
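For example, the pairplot above can be written straight to a PNG file (the file name here is just a placeholder) and then copied into your application's static directory:

# save the pairplot to an image file for inclusion in your appspot app
grid = sns.pairplot(lat_lon_df)
grid.savefig('lat_lon_pairplot.png')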

Questions you should be able to answer when you hand this in

When you are done, you should be able to answer these questions and provide evidence to support your answers. You should do this in your appspot application in the 'about.html' and 'quality.html' files. You may use hand drawn sketches, generated plots, and live interactive graphics to illustrate your answers.
  • Is the data complete? If not, what is missing and what concerns does it raise?
  • Is the data coherent? 
    • Does it make sense relative to itself? 
    • Does it match your expectations? 
    • Is the data distributed across its possible values in a way that makes sense?
  • Is the data correct?
    • What aspects of this data can you verify?
    • What did you find?
  • Is the data accountable?
    • Where did it come from?
    • Who has access to it?

Byte 3: Mobile Data

posted Dec 4, 2013, 12:24 PM by Jen Mankoff   [ updated Feb 17, 2017, 1:08 PM ]

Description: Your final product will be a web page that tells a story about your mobile data (e.g., common locations you visit and the activities that you perform at those locations) using visualizations.

Sample Source code: jmankoff-mobile
Learning Goals: 
  • Collecting your own data
  • Working with data about a familiar topic
  • Working with mobile data
  • Visualizing Data

Set up for Mobile Byte

Setting up for Byte 3 is a multi-step process that depends on what platform you are using (iPhone versus Android). You will need to do the following tasks:

Set up a sample byte, named [uniqueid]-mobile.appspot.com
  1. Install AWARE on your phone (required)
  2. Set up a cloud database for your byte's data (optional)
  3. Make sure you have the data. There are two possibilities here:
    • Connect AWARE to your cloud database and test your set-up (data will be automatically moved there for you)
    • Upload your data to your google Cloud Storage (not until Mobile byte is assigned)

Installing AWARE on your phone (required)

AWARE is open source and you might find it interesting to expand the set of sensors and capabilities of this application as part of a final project. Many off the shelf sensors are included that require no programming. Additional things can be done with programming. However since mobile application development is not a major focus of this class, our tutorial will not include further information about AWARE development.
If you are an Android user, you should download and install the default AWARE Android client. Note that the client may use battery, but it will not cost you any of your data plan because, by default, it only uploads data when you have Wi-Fi access. 
If you are an iPhone user, you can download and install AWARE using iTunes.

Setting up a Cloud Database for your Byte's Data (optional)

You will need to create a Cloud SQL database for your application in order to collect data from the aware framework in a format that you can compute over. Also note that once the data collection is over, you can just change the study status to closed and your client will stop uploading/collecting information. The data stays in the database unless you delete it. The Google documentation describes the steps you have to do. Use the menu at the top left of your cloud console for your byte (https://console.cloud.google.com/home/dashboard?project=[uniqueid]-mobile) to  create a Cloud SQL database  (select SQL, shown at bottom left, and then create instance, using the First Generation option).



You will need to keep track of the name of the database you created in Step 3 of  create a Cloud SQL database for your application. You might, for example, call it 'mobile-data.' For most other options, you can use the defaults. At the very bottom, you will need to specify 'Authorized Networks.' Add 80.69.77.149 as the authorized IP. You should also add the IP of a machine you have access to so that you can access the database directly from your laptop for setup and debugging purposes. In my case, I used a CMU server (anything that is not just dhcp will probably be fine). 

Finally note down the IP the instance has obtained, which will be needed in the aware server. You can see that on the listing of sql cloud instances associated with your byte (for example, I can see mine at https://console.cloud.google.com/sql/instances?project=jmankoff-mobile). You should also note down the instance connection name (mine is visible below as jmankoff-mobile:use-central1:mobile)


For testing, you will also need to set up a root password (which you can do on the access control page, just below the 'request an ip' button). As you can see in this image, you can do so by clicking on 'Access Control', then 'Users' and then 'change root password'

Once you create an instance, you will also create a database. Go to the Databases tab and click on "New database":

Then enter your database name:


Test your connection from your local machine. You can use MySQLClient, or at the command prompt (assuming you have mysql installed), you can type:
mysql --host=instance-IP --user=root --password

The result should look something like this:

Once you are at the mysql prompt, you can set up a new user if you want (instead of root) to use for remote access. 

CREATE USER 'newuser'@'80.69.77.149' IDENTIFIED BY 'password';
You will need to give this new user (or root) access to the database you created:

GRANT ALL PRIVILEGES ON data.* TO 'newuser'@'80.69.77.149';
You may also want to do this for your own IP address so that you can test everything by logging out and logging in again as 'newuser'. 

Connecting Aware to your cloud database (ONLY if you choose to setup your Cloud database)

Once you have AWARE installed and you've set up a database, it is time to connect the two of them together. To do so, you will need to create a 'study' (of yourself) that specifies what you wish AWARE to sense. To do this, you need to go to the website for study creation. To access this website, you will need to accept a security certificate. On my Mac, this caused the browser to try to stop me from proceeding:
I clicked on advanced and chose to proceed. This led me to a 'sign in with google' button. Once you do that you will be able to see the dashboard. 
Click on 'Create Study' at right. Give it whatever name and description you would like. 


Then select 'Remote Server'. You will need to provide the details of your google data store in order to complete this task (as described above, hostname, port, database, username and password). The hostname can be the IP address for your cloud database (which you used to test things out above). The port is always 3306 (the default for MySQL). The database will be whatever you named your database (as mentioned above) and the username and password are specified under access control as described above. The username is 'root' (or [newuser] if you created a new user) and the password is what you set. 



The next page that you see summarizes the study and also gives you the opportunity to set parameters.


You should click the pencil to the far right of 'Sensors' and customize what data you would like to collect. A minimum requirement for this byte is 'google activity recognition' and location (GPS or google fused location). Be sure to check the box for what you want and click on the green check mark at the far right to save. 

Joining the Study

Once you have finished creating the study, you will need to join it. Open the aware client on your phone and press the menu button. Choose 'Studies'. This will launch a QR code reader. 
 


Meanwhile, on your screen, you should press 'Show QRcode' (it's just above edit, as shown in the screen shot above). Note that you can get to this screen anytime by logging into the website for study creation and clicking on your study.  


Scan the code with your phone to add it to the study. Once you scan the QRCode on your phone, follow any instructions it may provide: install a plugin, activate the accessibility services (See Increasing Availability). If you are connected over Wi-Fi, you should see your phone on the list of Devices.


Finding a phone's ID [optional]

If you end up collecting data from multiple people (say as a final project) you may need to separate them from each other. The ID for a phone is found as follows:

Open the AWARE Sensors icon from the mobile phone's launcher and you should be able to see your device ID as AWARE Device ID, a UUID (Unique Universal ID) alphanumeric value that was assigned to your client when you installed the AWARE client.


Increasing Availability (Android Users Only)

Once you install the application, it should start logging data to the server. However, if you have an Android phone, be aware that Android kills "non-essential" services that are running in the background when there is an issue with memory, battery, and so on. If you want to ensure that your application is running at all times, you must activate the AWARE framework in the Accessibility services (in your phone settings). This is not required, but will increase the reliability of your data. Note this will cause your phone to talk to you on Samsung devices only due to an accessibility bug. 

Samsung Bug (Android Users Only)

- Go to Settings -> Accessibility -> Talkback « ON »
- Select « Settings » from the bottom of the page
- Uncheck all Talkback options
- Switch Talkback « Off » again
- Go back to Settings -> Application manager -> All
- Disable Google TTS (text to speech)
- Disable Samsung TTS (text to speech)

If you’re experiencing the same issue on another phone, try this :

- Go to Settings -> App Manager -> Select GoogleTTS -> Disable
- Go to Settings -> App Manager -> Select SamsungTTS (on a Samsung phone) -> Disable
 

Testing your setup

To test if your phone is correctly working and hooked up, you can use the same website used to set up the study to send your phone an ESM message. This is only enabled once you have a phone connected to the study, and can be found at the bottom of the same screen where you edit the sensors. Here's what it looks like:



If everything goes well (i.e., your device is currently connected to the server, the server is up and running), you should see a question on your phone’s screen.

You can also test the database link by selecting the Broadcasts tab, and sending a request to sync database. The image below shows how this is done, with the user's phone that should sync selected. 
Finally, you should double check that there is actually data in your google cloud database. You can do this by logging into the mysql server remotely (mysql --host=instance-IP --user=root --password) or by checking the cloud console's log of read and write operations, as here:

Main Mobile Byte Tasks and Handin Expectations

Data from your phone will be in the cloud database you created. When developing on the Google Cloud Shell you will always need to 'deploy' your code and then test it online at [uniqueid]-mobile.appspot.com. When testing online, you can take advantage of Google's cloud logging service to see a trace of what happens, which is very useful for debugging. Access it using the same left hand menu that got you to SQL.

There are two paths through this assignment, described below under 'Non Programmer Path' and 'Advanced User Path'
  • For those with minimal coding skills, we encourage you to export your data as a csv, and then load it into Lyra (documentation here). From Lyra you can export static images (less preferable) or vega-lite specifications with your data embedded, which you can then load into your application as interactive visualizations (more preferable). We will update this with more details and a demo in class on Thursday.
  • For others, you can work directly with Vega-lite. Some details are below.
  • Only for advanced users: Note that you could access your SQL database locally from your laptop if you have a static IP that you set up as an 'Authorized Network' when you set up your cloud instance. You can decide which approach to take.  

Non Programmer Path:

To export your data, you will need to use your google cloud console. You'll need to know the name of the database you used to collect AWARE data as well as the tables you wish to export to do this. Your first step will be to make sure that you know those names. Here are some commands that you can type into cloud shell to accomplish this. Note that you will need to use the same username you used when you set up the link between aware and the cloud shell, and you'll need to have set up permissions so that you're allowed to access that database from your cloud shell. Neither of those tasks are covered here but most likely you already set up a username and shell access when you set up aware.

gcloud beta sql connect [instance name] --user=[username]
example:
Next, you will need to check the table names and pick one. This is standard mysql, and here is a way to do that (with images showing applied examples). Words capitalized in my examples are SQL commands. Sometimes you will see these in lowercase in the images; SQL doesn't really care. However, when you name a database or table, you have to always keep the case exactly the same, so SQL does care for those.

SHOW DATABASES;
In this student's case, mobiledata contains the aware data. We'll call that [database] in example text instead so you know to use your database name there.
USE [database];
SHOW tables;
Once you have listed your tables, you should choose one to export, which we will call [table] (an example might be locations or battery).

Finally, you need to go to your cloud console and export the data, as described here.
Here is a picture of what that would look like once set up properly to export the locations table as a csv file. Note that you have to click on 'advanced options' to enter the database name. Also note that you can write other more complex SQL queries if you wish.

Once the export is complete, you will get a browser notification. You have to go to a completely different part of the cloud console to download the exported file (sorry, folks!). To get to it you have to go to the storage area, and select the bucket for your database. 



You'll see either one file (a log), if your export failed, or two files (a log and a csv file) if your export succeeded. You can download the log file to find out why the export failed, or share it with us on piazza, if you run into trouble.

Once you have downloaded your CSV file, you can import it into wrangler to clean it, to Excel if you want to play around with different ways to visualize it, or directly into lyra to begin designing your final interactive visualization. 

Note that to import the data into Lyra, you either need to paste raw values in, or use a URL. If you want to use a URL, in the interface shown one image previously in this tutorial, you can check the checkbox on the right (after the date) and it will make the data file publicly accessible. You can right click on the link provided and give that to Lyra. This image shows the part of Lyra used to upload data. 



Once your chart is complete, Lyra will let you export it using the share icon in the middle bottom of the screen. You can export it as a PNG, which you can embed in your handin website on appspot as a static visualization. 
If you click 'with inlined data values' instead, you will get a json file, which you can then paste into your index.html template in place of the sample data shown in the sample code. To do this, you should find the beginning of the script tag, up through the semicolon:
   
<script>
 // Assign the specification to a local variable vlSpec.
 var vlSpec = {
   "data": {
     "values": [
{% for row in queries[0].results %}
          {"date": "{{row[0]}}", "count": {{row[1]}}},
        {% endfor %}
     ]
   },
   "mark": "bar",
   "encoding": {
     "x": {
      "field": "date", "type": "nominal",
   "axis": {
       "title": "Date"
    }},
     "y": {
       "field": "count", "type": "quantitative",
       "axis": {
         "title": "Record Count"
       }
     }
   }
 };

You will want to delete everything inside the outermost {}; so first delete it:
<script>
  // Assign the specification to a local variable vlSpec.
  var vlSpec = {



};

and then paste your stuff in the blank space. Now when you deploy your app, you should see the interactive visualization! Not only that, if you are doing it this way you don't even have to deploy to the cloud to debug -- it should work on the dev server too! Have fun.

Advanced User Path:

The library we will be using to access the data is MySQLdb (documentation). Note: you must have the MySQL-python package installed on your system (even on your Google Shell), but not in your appspot application directory. This can be done with something like sudo apt-get install python-mysqldb, or by using pip to install it. You will also need to add it to 'app.yaml' (or copy it over from my github version jmankoff-mobile):

libraries:
- name: MySQLdb
  version: "latest"

and, in 'main.py': 

import MySQLdb

Now we can access the database. To do so you will need your DB instance connection name, which you can get from the Overview tab for your instance in the Cloud Console or by executing gcloud sql instances describe [your-db-instance] in the Cloud Shell. The following code will show the names of all the tables in the database:
_INSTANCE_NAME = '[your-instance-name]'
_DB_NAME = 'test' # or whatever name you choose
_USER = 'root'    # or whichever MySQL user you created for the instance
_PASSWD = 'your-password'

if (os.getenv('SERVER_SOFTWARE') and
    os.getenv('SERVER_SOFTWARE').startswith('Google App Engine/')):
    # running on App Engine: connect through the Cloud SQL unix socket
    _DB = MySQLdb.connect(unix_socket='/cloudsql/' + _INSTANCE_NAME, db=_DB_NAME, user=_USER, passwd=_PASSWD, charset='utf8')
else:
    # running locally (e.g. against the dev server): connect over TCP
    _DB = MySQLdb.connect(host='127.0.0.1', port=3306, db=_DB_NAME, user=_USER, charset='utf8')


@app.route('/')
def index():
    template = JINJA_ENVIRONMENT.get_template('templates/index.html')

    cursor = _DB.cursor()
    cursor.execute('SHOW TABLES')
    
    logging.info(cursor.fetchall())

    return template.render()

Yours may differ from mine, which were:
('applications_crashes',) ('applications_foreground',) ('applications_history',) ('applications_notifications',) ('aware_device',) ('esms',) ('locations',) ('mqtt_messages',) ('mqtt_subscriptions',) ('plugin_google_activity_recognition',) ('plugin_mode_of_transportation',)

If you have any trouble accessing your DB instance from your application, make sure that your application has been granted access to the DB. This is usually automatic, but it is worth checking: if you do not see the image below in the Access Control tab of your DB instance, you need to add your Application ID under 'Access Control':


Here are the details about some example tables:

Table: plugin_google_activity_recognition -- contains activities recognized by Google [your table may not include all of these varieties]
Columns:
  • _id: integer - primary key, auto incremented, just numbers each row
  • timestamp: double - unixtime in milliseconds of when we collected this data point
  • device_id: varchar - the unique ID of the device that sent this data.
  • activity_type: integer - a constant number assigned by Google for each activity_name (from here: http://developer.android.com/reference/com/google/android/gms/location/DetectedActivity.html). Values are:
    • in_vehicle: 0
    • on_bicycle: 1
    • on_foot: 2
    • still: 3
    • tilting: 5
    • unknown: 4
  • activity_name: varchar - a human-readable description of the physical activity. Possible values are: in_vehicle, on_bicycle, on_foot, still, unknown, tilting.
    • in_vehicle: the user is in a car/bus/train
    • on_bicycle: the user is biking
    • on_foot: the user is walking/running
    • still: the user is stationary somewhere
    • unknown: Google's algorithm cannot tell what the user is doing
    • tilting: the user has the phone in their hands or on a desk and it is moving slightly.
  • confidence: integer - a value from 0 to 100 indicating the likelihood that the user is performing this activity. The larger the value, the more consistent the data used to perform the classification is with the detected activity.
  • activities: JSON Array of JSON objects describing other potential activities the user might be doing. Each object has two values: activity (which contains the activity_name) and confidence (as above). The confidences of all detected activities for a classification sum to <= 100. This means that a confidence >= 75 indicates that the detected activity is very likely correct, while a value <= 50 indicates that another activity may be just as or more likely (see the parsing sketch just after this table description).
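
Since the activities column stores its candidates as a JSON string, you will typically decode it before using it. Here is a minimal, hedged sketch of such a parser (the function name and field names are taken from the description above, not from the course code):

import json

def parse_activities(activities_json):
    # the column holds something like
    # '[{"activity": "on_foot", "confidence": 20}, ...]'
    try:
        candidates = json.loads(activities_json)
    except (ValueError, TypeError):
        return []  # empty or malformed rows yield no candidates
    pairs = [(c.get('activity'), c.get('confidence', 0)) for c in candidates]
    # return the most likely alternative first
    return sorted(pairs, key=lambda p: p[1], reverse=True)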

Table: locations - the data from Google Fused Locations [your table may not include all of these columns]
Columns:
  • _id: integer - primary key, auto incremented, just numbers each row
  • timestamp: double - unixtime in milliseconds of when we collected this data point (a conversion sketch follows this table description)
  • device_id: varchar - the unique ID of the device that sent this data.
  • double_latitude: double - the latitude of coordinates
  • double_longitude: double - the longitude of coordinates
  • double_bearing: double - the direction (in degrees) the user is heading (used to know which way the user is moving).
  • double_speed: double - how fast the user is moving, in meters/second.
  • double_altitude: double - the user's altitude above sea level, in meters.
  • provider: varchar - how this location fix was established. One of: fused, network, gps.
  • accuracy: double - how accurate the location fix is, in meters.
  • label: varchar - a label for this location, if one exists. Some plugins use it to label a specific location (e.g., home, work, etc.).
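
Since both tables store timestamps in milliseconds, here is a tiny, hedged sketch of the Python-side conversion (the SQL-side conversion with FROM_UNIXTIME is shown in the next section):

from datetime import datetime

def to_datetime(timestamp_ms):
    # the timestamp columns are unixtime in milliseconds, so divide by 1000
    return datetime.utcfromtimestamp(timestamp_ms / 1000.0)

# e.g. to_datetime(1387266120000) -> datetime(2013, 12, 17, 7, 42) in UTC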

Exploring the data

We will illustrate how to explore the data in the context of location data, which many of you probably collected. When exploring the data, you will always want to select only data that matches the device ID you care about. For those of you doing this assignment who do not have your own Android devices, my device id is '785434b8-ce03-46a2-b003-b7be938d5ff4'. This is also the default device ID used in the code. You can see all of the device IDs currently registered by using the query:

SELECT DISTINCT device_id FROM locations

The word DISTINCT in this query simply ensures that only unique values for the column device_id are returned.

There are a few queries you may find useful to use for exploring the data initially. For example, 

SELECT FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d') AS day_with_data, COUNT(*) AS records FROM locations WHERE device_id = '670b950e-f821-4e40-bb6c-547e4499c9c5' GROUP by day_with_data; 

will show the number of location records collected on each day. This query creates a string for each row (record) in the 'locations' table which specifies the year, month, and day that row was recorded. The rows are grouped based on this string, and then counted (a sketch of running this query from Python follows the list below). In more detail:

  • We've already introduced SELECT in Byte 2. Normally, SELECT takes the names of columns (or *) as arguments, which specifies which columns of the table should be selected. 
  • The command FROM_UNIXTIME(...) is used instead of a column name. It takes as input a unix timestamp (number of seconds since January 1, 1970) and converts it to a format that will be useful in our query -- namely Year-Month-Day format. The 'timestamp' column of the locations table is stored in milliseconds since 1/1/1970, so we first divide by 1000.
  • The command AS simply provides a name for what we just did (in this case the name 'day_with_data' for the formatted timestamps we are producing)
  • COUNT(*) will count the number of records in a table. However, because we end by saying GROUP BY, the count will be divided up into one count per group (the number of rows in each group). Groups are defined using 'day_with_data', which was the name we gave our timestamps. 
  • FROM specifies the table in which all of this will be found
  • WHERE device_id = '...' specifies that we should only look at records with that specific device_id (i.e. records recorded by my phone, or yours if you change the id)
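
As a concrete (and hedged) sketch of running this query from Python, reusing the _DB connection opened in the Advanced User Path code above (variable names here are illustrative, not the course code):

day = "FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d')"
device_id = '785434b8-ce03-46a2-b003-b7be938d5ff4'  # substitute your own device id
query = ("SELECT {0} AS day_with_data, COUNT(*) AS records "
         "FROM locations WHERE device_id = '{1}' "
         "GROUP BY day_with_data").format(day, device_id)

cursor = _DB.cursor()
cursor.execute(query)
for day_with_data, records in cursor.fetchall():
    logging.info('%s: %s records', day_with_data, records)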

The result, for my data, looked something like this at the end of January (of course, this will change over time).

#Days with data from location data
('2013-12-17', 415L)
('2013-12-18', 1216L)
('2013-12-19', 241L)
('2013-12-20', 81L)
('2013-12-21', 820L)
('2014-01-08', 371L)
('2014-01-09', 1110L)

If you simply deploy the code to appspot at this point, you should see results similar to this:


It is also possible to trace the duration of events. Because of the complexity of these queries, I'm going to show the Python code used to construct them; building the query in Python keeps it clearer because variables can stand in for the unix time formatting and other key pieces. The key thing to realize about this code is that we are simply putting together a string that will then be used as a query. The variables _ID and _ACTIVITY are globals defined at the top of 'main.py': _ID holds the device id for my phone, and _ACTIVITY is the name of the table in which activity information is stored.

# turns a unix timestamp into Year-month-day format
day = "FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d')"
# turns a unix timestamp into Hour:minute format
time_of_day = "FROM_UNIXTIME(timestamp/1000,'%H:%i')"
# calculates the difference between two timestamps in seconds
elapsed_seconds = "(max(timestamp)-min(timestamp))/1000"
# the name of the table our query should run on
table = _ACTIVITY
# turns a unix timestamp into Year-month-day Hour:minute format
day_and_time_of_day = "FROM_UNIXTIME(timestamp/1000, '%Y-%m-%d %H:%i')"
# Groups the rows of a table by day and activity (so there will be one
# group of rows for each activity that occurred each day).
# For each group of rows, the day, time of day, activity name, and
# elapsed seconds (difference between maximum and minimum) are calculated.
query = "SELECT {0} AS day, {1} AS time_of_day, activity_name, {2} AS time_elapsed_seconds FROM {3} WHERE device_id='{4}' GROUP BY day, activity_name, {5}".format(day, time_of_day, elapsed_seconds, table, _ID, day_and_time_of_day)

This query will show, in order, each activity that occurred on each day for which data was recorded, along with the duration of that activity (a sketch of running it appears below).
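
A small, hedged sketch of running the constructed query and printing readable durations (the column order matches the SELECT above; the course code structures this differently):

cursor = _DB.cursor()
cursor.execute(query)
for day, time_of_day, activity_name, seconds in cursor.fetchall():
    # convert the elapsed seconds into hours and minutes for readability
    hours, remainder = divmod(int(seconds), 3600)
    minutes = remainder // 60
    logging.info('%s %s: %s for %dh%02dm', day, time_of_day,
                 activity_name, hours, minutes)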

Identifying Common Locations (Optional)

Here is an example of what you can do with your own data. One of the first things we will want to do with mobile data is identify common locations. This is most easily done by combining work in SQL with work in Python. In particular, we will first run a query:
query = "SELECT double_latitude, double_longitude FROM {0} WHERE device_id = '{1}'".format(_LOCATIONS, _ID)

To make life easier, I have written a helper function in 'main.py' that runs a query. The function also checks if the query failed and logs as much information as possible using logging.info() on failure. It returns the query results in a list, and is called make_query (you can find it in the source code on github). Once we have the locations in a list, we need to identify unique locations. For this, I have again written a helper function, called bin_locations, which takes a list of locations as input and compares them pairwise. The algorithm used is fairly simple: Store the first location in a dictionary under the bin name 1. For each new location, check its distance from each stored location already in the dictionary. If it is within epsilon of any of the stored locations, ignore it (it is not 'new'). If it is further than epsilon from all stored locations, add it to the dictionary under a new bin name -- it is a new location. Distance is calculated by the helper function distance_on_unit_sphere(..) which is based on John Cook's implementation and explanation. 
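
If you want a feel for what those helpers do before reading the github code, here is a minimal sketch (the actual bin_locations and distance_on_unit_sphere in the course repository may differ in their details):

import math

def distance_on_unit_sphere(lat1, long1, lat2, long2):
    # great-circle distance between two points on a sphere of radius 1
    # (following the John Cook implementation referenced above)
    degrees_to_radians = math.pi / 180.0
    phi1 = (90.0 - lat1) * degrees_to_radians
    phi2 = (90.0 - lat2) * degrees_to_radians
    theta1 = long1 * degrees_to_radians
    theta2 = long2 * degrees_to_radians
    cos = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2) +
           math.cos(phi1) * math.cos(phi2))
    return math.acos(min(1.0, max(-1.0, cos)))

def bin_locations(locations, epsilon):
    # locations is a list of (latitude, longitude) rows from the query above;
    # epsilon is the largest unit-sphere distance at which two points still
    # count as the same place
    bins = {}
    bin_name = 1
    for lat, lon in locations:
        for stored_lat, stored_lon in bins.values():
            if distance_on_unit_sphere(lat, lon, stored_lat, stored_lon) < epsilon:
                break  # within epsilon of a known place: not a new location
        else:
            bins[bin_name] = (lat, lon)  # far from every stored place: a new bin
            bin_name += 1
    return bins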

The results when I ran bin_locations on my data from a few years ago were: 

 latitude      longitude     hand-calculated address                   my explanation of this
 40.4520251    -79.9434955   436 Morewood Avenue, Pittsburgh, PA       Near CMU / my home
 40.4355582    -79.8637900   400-503 Sherwood Road, Wilkinsburg, PA    On way to Hidden Valley
 40.4429779    -79.7819721   Penn Lincoln Parkway, Monroeville, PA     On way to Hidden Valley
 40.3551721    -79.6836222   Pennsylvania Turnpike, Irwin, PA          On way to Hidden Valley
 40.2643373    -79.6554965   1937 Main Street, New Stanton             On way to Hidden Valley
 ...           ...           ...                                       ... several more of these

This summary immediately highlights several issues. Solving the first is required for this assignment. The remainder are optional (although they would be very important if you were to work with this data further than this byte takes you).
  • First, the granularity here is too coarse. In particular, everything happening within Pittsburgh is grouped at a single address (436 Morewood Avenue). This means that epsilon is probably not correct and needs to be tweaked.
  • Second, we cannot differentiate between places where I spend a lot of time and places where I am just traveling through based on the data given. I have looked up the associated addresses by hand using http://www.gps-coordinates.net/ and interpreted them for you (on the right), but nothing automatic is taking place. A very simple fix would be to keep track of the number of matches for each location; places where I spend a lot of time should have many more values logged. This would be relatively easy to add to the information stored in each bin (see the sketch after this list).
  • The third problem is that the 'label' for a location is whatever shows up first in the log in the general area of that popular spot. This is not as big an issue as the other two, but could be fixed by keeping a count for all of the lat/lon pairs within epsilon of the first location found, and then using the most popular one as the label.
  • Lastly, investigating this data will be hard if we are always calculating addresses by hand (since lat/lon is essentially meaningless to the reader). It would make your life easier to do this automatically, and the Google Geocoding API is one way to do so.
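
Here is one way the counting idea from the list above might look, building on the bin_locations sketch shown earlier (again, an illustration rather than the course code):

def bin_locations_with_counts(locations, epsilon):
    # bins maps a bin name to [latitude, longitude, count]; popular places
    # accumulate large counts, pass-through places stay near 1
    bins = {}
    bin_name = 1
    for lat, lon in locations:
        for info in bins.values():
            if distance_on_unit_sphere(lat, lon, info[0], info[1]) < epsilon:
                info[2] += 1  # a known place: just record another visit
                break
        else:
            bins[bin_name] = [lat, lon, 1]
            bin_name += 1
    return bins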

Associating Activities with Locations (Optional)

At this point, after working with my code and adding to it, you should have a reasonably descriptive list of locations (i.e., a reasonable epsilon), possibly grouped into those that are visited a lot (common locations) and those that are visited only occasionally. Your next step is to explore what activities take place in and around these locations. To do this, you first need to make queries that will get locations and activities, both organized by day and hour. You can also collect the elapsed time:

time_of_day = "FROM_UNIXTIME(timestamp/1000,'%H:%i')"
day = "FROM_UNIXTIME(timestamp/1000,'%Y-%m-%d')"
query = "SELECT {0} as day, {1} as time_of_day, double_latitude, double_longitude FROM {2} WHERE device_id = '{3}' GROUP BY day, time_of_day".format(day, time_of_day, _LOCATIONS, _ID)
locations = make_query(cursor, query)

day_and_time_of_day = "FROM_UNIXTIME(timestamp/1000, '%Y-%m-%d %H')"
elapsed_seconds = "(max(timestamp)-min(timestamp))/1000"
query = "SELECT {0} as day, {1} as time_of_day, activity_name, {2} as time_elapsed_seconds FROM  {3} WHERE device_id = '{4}' GROUP BY day, activity_name, {5}".format(day, time_of_day, elapsed_seconds, _ACTIVITY, _ID, day_and_time_of_day)
activities = make_query(cursor, query)

# now we want to associate activities with locations. This will update the
# bins list with activities.
group_activities_by_location(bins, locations, activities, _EPSILON)

Once you have locations, activities, and bins, you can look up the location for each activity (using the day and hour that the activity occurred at as an index into the list of locations) and then look up the bin for the activity (using its location). A method for doing this is in the group_activities_by_location function in your code.
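
To make the matching concrete, here is a rough, hedged sketch of what such a function might do (use the group_activities_by_location from the course code; this version assumes the row orders produced by the two queries above and bins whose values are lists of the form [lat, lon, ...]):

def group_activities_by_location(bins, locations, activities, epsilon):
    # index locations by (day, hour) so each activity can be matched to
    # wherever the phone was during that hour
    location_by_hour = {}
    for day, time_of_day, lat, lon in locations:
        location_by_hour[(day, time_of_day[:2])] = (lat, lon)

    unmatched = []
    for day, time_of_day, activity_name, elapsed_seconds in activities:
        loc = location_by_hour.get((day, time_of_day[:2]))
        if loc is None:
            unmatched.append([activity_name, elapsed_seconds])
            continue
        # attach the activity to the first bin within epsilon of that location
        for info in bins.values():
            if distance_on_unit_sphere(loc[0], loc[1], info[0], info[1]) < epsilon:
                info.extend([activity_name, elapsed_seconds])
                break
    return unmatched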
  
The result looks something like this. Each item has a lat, a lon, and then a series of activities and durations. Note that some locations don't seem to have an activity, and the very last set of things are activities for which we could not find a location that matched. Determining the cause of this is left as an exercise to the curious (we are no longer in the 'quality checking' phase of the semester, though that is certainly a legitimate and necessary thing to do with mobile data).

[40.4520251, -79.943495499999997, 'in_vehicle', 39392.071000000004, 'on_foot', 39727.334000000003, 'still', 70414.203999999998, 'tilting', 70699.740000000005, 'unknown', 68273.095000000001, 'in_vehicle', 81884.464000000007, 'on_foot', 40043.336000000003, 'still', 84955.171000000002, 'tilting', 79536.949999999997, 'unknown', 80886.019, 'still', 86292.365999999995, 'in_vehicle', 26267.181, 'on_foot', 69574.019, 'still', 86338.226999999999, 'tilting', 72176.975999999995, 'unknown', 69511.421000000002, 'in_vehicle', 77076.141000000003, 'on_bicycle', 47379.989999999998, 'on_foot', 76726.036999999997, 'still', 80446.138999999996, 'tilting', 77615.462, 'unknown', 77532.361000000004]
[40.435558200000003, -79.863789999999995]
[40.442977900000002, -79.781972100000004]
[40.355172099999997, -79.683622200000002]
[40.264337300000001, -79.655496499999998, 'in_vehicle', 61206.953000000001, 'on_foot', 59838.290000000001, 'still', 85463.236000000004, 'tilting', 79628.938999999998, 'unknown', 80889.201000000001]
[40.159720700000001, -79.479924699999998]
[40.128560100000001, -79.404786000000001, 'on_foot', 253.934]
[40.042347599999999, -79.228534199999999, 'on_bicycle', 8648.982, 'in_vehicle', 5743.6329999999998, 'still', 86355.557000000001, 'tilting', 82030.725000000006, 'unknown', 81165.418000000005]
[40.121983999999998, -79.303508800000003]
[40.210303699999997, -79.579402999999999]
[40.409026599999997, -79.719704500000006]
[['in_vehicle', 4357.4719999999998], ['on_foot', 10664.239], ['still', 28324.585999999999], ['tilting', 26078.298999999999], ['unknown', 28031.917000000001], ['still', 38368.474999999999], ['tilting', 0.0], ['unknown', 0.0]]

Creating a Dashboard for Visual Data Exploration using Vega-Lite

You will now develop a narrative story about your mobile data. Visual representations of the data allow you to inspect it more quickly than looking at table printouts, and you will go beyond the simple visualizations you made using Google Fusion Tables. To do this, you will use Vega-Lite, a language for defining and creating visualizations that builds on D3 (a JavaScript-based visualization framework).

To "install" Vega-Lite, you need to add links to the Vega-Lite scripts to your template files. For example, you could add the following to the header of your index.html template:
<!-- Vega-lite visualization libraries -->
<script src="//d3js.org/d3.v3.min.js"></script>
<script src="//vega.github.io/vega/vega.js"></script>
<script src="//vega.github.io/vega-lite/vega-lite.js"></script>
<script src="//vega.github.io/vega-editor/vendor/vega-embed.js" charset="utf-8"></script>
    
<style media="screen">
   /* Add space between vega-embed links  */
    .vega-actions a {
      margin-right: 5px;
    }
</style>

Then, in your template body, you need to add an element in which to display the visualization:
<div id="vis"></div>

And finally, you need to add another script element below that specifies the data and how to visualize it on your page:

   <script>
 // Assign the specification to a local variable vlSpec.
 var vlSpec = {
   "data": {
     "values": [
            {"date": "2017-02-01", "count": 3},
            {"date": "2017-02-02", "count": 2},
            {"date": "2017-02-03", "count": 5}, 
            {"date": "2017-02-04", "count": 6}, 
            {"date": "2017-02-05", "count": 2}, 
            {"date": "2017-02-06", "count": 3}, 
            {"date": "2017-02-07", "count": 3}, 
            {"date": "2017-02-08", "count": 4}, 
            {"date": "2017-02-09", "count": 1} 
     ]
   },
   "mark": "bar",
   "encoding": {
     "x": {
       "field": "date", "type": "nominal",
       "axis": {
         "title": "Date"
       }
     },
     "y": {
       "field": "count", "type": "quantitative",
       "axis": {
         "title": "Record Count"
       }
     }
   }
 };
 var embedSpec = {
   mode: "vega-lite",  // Instruct Vega-Embed to use the Vega-Lite compiler
   spec: vlSpec
   // You can add more vega-embed configuration properties here.
   // See https://github.com/vega/vega/wiki/Embed-Vega-Web-Components#configuration-propeties for more information.
 };
 // Embed the visualization in the container with id `vis`
 vg.embed("#vis", embedSpec, function(error, result) {
   // Callback receiving the View instance and parsed Vega spec
   // result.view is the View, which resides under the '#vis' element
 });
  </script>

In the above code, vlSpec is a variable that contains all of the information about your graph (including the data). In fact, you could copy its contents and paste them into the Vega-Lite editor to view it:


The Editor provides you with example charts that you could use to kick off your visualization and to view it with dummy data. Here is an example chart that you can select from the drop down:




Now we want to replace this hard-coded dummy data with your actual data from the database. To do this, we use Jinja2 to insert the data from our database into the page. Replace the "data" block above with the following code, which takes the data from your first query and adds it to the page (a sketch of the matching server-side code follows the snippet):
"data": {
    "values": [
    {% for row in queries[0].results %}
        {"date": "{{row[0]}}", "count": {{row[1]}}},
    {% endfor %}
    ]
}
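
On the server side, the template needs to receive a 'queries' variable whose first element has a 'results' list of (day, count) rows. Here is a minimal, hedged sketch of what that might look like in 'main.py' (the class name and day_count_query are illustrative, not the names used in the course code):

class QueryResults(object):
    def __init__(self, results):
        self.results = results  # a list of (day, count) tuples

@app.route('/')
def index():
    cursor = _DB.cursor()
    cursor.execute(day_count_query)  # the per-day count query built earlier
    per_day = QueryResults(cursor.fetchall())
    template = JINJA_ENVIRONMENT.get_template('templates/index.html')
    # queries[0].results is what the Jinja loop above iterates over
    return template.render({'queries': [per_day]})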

Once you create and display your visualization, you will have an option to modify the chart in the editor again. You can use this to speed up your debugging and fixing process.

OPTIONAL READING: Ideally, you might consider displaying the data back in a mobile setting using an application development tool such as Titanium or PhoneGap; however, we will not cover that in this tutorial, since our emphasis in this class is not on mobile app development.

Hand In Expectations

 Your assignment is to provide a visual narrative of your mobile data. You will use visualizations to explore the relationship between different aspects of your data (e.g., activities and time of day, locations and time of day). For this Byte you should:
  • Deploy a working version of your code that guides the user through your narrative.
  • As always you will need to write some reasonable text 'about this data set' and comment on 'data quality' and the different decisions you had to make when preparing the data set (e.g., what epsilon you picked for binning locations and why, how you imputed missing data, and how you removed outliers).
  • You will need to be able to explain in general terms what you learned about the relationships in your data. The best answers would talk about the relationship between the question the visualization answers, the design choices that are made about the visualization, and how those choices help the user to get the most from the visualization.
  • When you create your visualizations, be sure to give them some sort of interactive aspect.

Project #1: Develop A Visual Data Narrative

posted Nov 19, 2013, 4:25 PM by Jen Mankoff   [ updated Apr 6, 2017, 10:47 AM by Nikola Banovic ]

Overview

In this project, you will develop a data visualization that tells a story about a data set of your choice. Whereas the Bytes were designed to teach you about the plumbing necessary to visualize data, the goal of this project is to explore the art of visual portraiture. You may work in groups on this project. 

Learning Goals

  1. Develop experience with narrative construction in the context of data visualization. Your final product will be judged based on the clarity of the narrative it represents. 
  2. Develop experience with selecting data that can effectively tell the story you wish to tell. Your final product will be judged based on the strength and appropriateness of the data you select. You may want to show dynamic information (such as showing the user information about themselves collected from twitter or facebook.com). Alternatively, you may want to use static data such as the many interesting sets on Google Fusion Tables. Here are some additional sources of interesting static data sets that could be uploaded to Google Fusion Tables.
  3. Develop experience with using visualization techniques to both support a narrative and accurately represent the data. Your final product will be judged based on the strength, appropriateness, and truthfulness of the visualizations you select. You may use your choice of visualization tools. Regardless of what tool you use, you may also want to explore the blog infosthetics.com for creative ideas for your visualization. The specific data and visualization type you choose will depend on the data set and the story you are trying to tell. 
  4. Develop experience with showing your visualization to a user to get feedback on it. You will be judged on the quality of the user feedback you present, and the impact it had on your project. 

    What You Will Hand In

    You will hand in a web page that addresses all 4 points above. That is, it should tell a story (either using text or video) about a visualization (which should be available on the web page). In addition, it should document the sources and approach. Finally it should document what you showed users, what you learned, and how that impacted your final results. You will hand in a link to this web page (which can be hosted on Google Appspot). In addition, you will prepare a poster for presenting in class (and you can bring your laptop for a live demo).

    BE SURE to put your names on your poster. In addition, attendance is required for the poster presentation session.

    Extra credit available at the discretion of the grader if you develop a complex custom visualization, support a high degree of interactivity, integrate complex data together from multiple sources, or develop a narrative that is especially deep and multifaceted.

    Student Midterm Projects


    Visualizing your data: Byte 4

    posted Nov 11, 2013, 8:37 AM by Jen Mankoff   [ updated Mar 3, 2016, 10:32 AM ]

    • Description: Your final product will be a visualization of an animal adoption data set ([uniqueid]-viz.appspot.com)
    • Hand In: Fill out the peer grading form on blackboard. 

    Overview

    In this project, you will create a visualization about the data from the Louisville Animal Metro Services (or a data set of your choice). In order to do this, you will need to make use of the D3 visualization toolkit. Learning goals include:
    • Using a Javascript visualization library (D3)
    • Passing data from your python code to javascript running in the user's browser
    • Designing a visualization of your data

    Detailed Instructions for Visualization Byte

    This Byte should ideally build on the data set you used in your data exploration Byte (Byte 2), or Byte 3 if you prefer. An example can be found at jmankoff-viz.appspot.com. You should start by setting up a second google application, called [uniqueid]-byte4.appspot.com. You may use either a fusion table or a database. We walk you through the use of Google Charts and D3 in this tutorial. However, there are many other platforms you may want to investigate in the future. D3 has derivatives such as NVD3 and Vida; other options are HighCharts and gRaphäel. You can see a comparison of the full set of options at socialcompare.com, and this 2012 article from datamarket is also helpful. 

    Creating a Custom Visualization in Google Charts

    For this first phase of the tutorial, we will start with one of the simplest options for visualization that is available, Google's Charts API.

    To use Google Charts, we will need to get data from your database (or fusion table) through the Python code in 'main.py' and send it through Jinja all the way to Google Charts (which is embedded in a webpage using the JavaScript language).

    There are three important pieces to visualizing the data: gathering the data, setting up the plumbing to display it in a chart, and adding the chart itself to 'index.html'. In the interest of having something to show as soon as possible (which supports debugging), we will do this in reverse order.

    Adding a custom visualization to your web page

    We will first create a chart in 'index.html' showing fake data. To do this, you can literally copy the javascript code found in Google's chart API documentation into 'index.html'. In particular, we will copy the code for a column chart in between the <head></head> portion of 'index.html'. In order to display the chart, we need to add <div id="chart_div"></div> somewhere in the body of 'index.html' as well. At the end we should have something like this:


    Notice that the data for this chart is defined directly in the javascript we copied over, in the lines that say:

    var data = google.visualization.arrayToDataTable([
        ['Year', 'Sales', 'Expenses'],
        ['2004',  1000,      400],
        ['2005',  1170,      460],
        ['2006',  660,       1120],
        ['2007',  1030,      540]
     ]);

    In addition, the title and axis specifications are found in the javascript (and can be customized further):

    var options = {
        title: 'Company Performance',
        hAxis: {title: 'Year', titleTextStyle: {color: 'red'}}
        };

    We will want to replace this with our own chart. Since we know something about what our data will look like, let's first create fake data that is more realistic: 

    data = [
      ['Age', 'Adopted', 'Euthanized'],
      ['< 6 months',  1000,      400],
      ['6-12 months',  1170,      460],
      ['12-5 years',  660,       1120],
      ['>5 years',  1030,      540]
    ]

    You may find that the labels on the horizontal axis are cut off with this data. I updated my options as follows: 

    var options = {
    title: 'Animal Outcomes based on Age at Arrival',
    width: 400, height: 200,
    chartArea: {height: '50%'},
    hAxis: {title: 'Age', titleTextStyle: {color: 'red'}}
    };

    Setting up the plumbing for passing data from 'main.py' to the visualization

    Our next goal is to move the fake data to Python and successfully pass it to the JavaScript we just added to 'index.html'. 

    1) Place data into a table in 'main.py.' As it turns out the data structure syntax is identical in python and javascript so we can literally copy the data = [... definition above into 'main.py' 

    2) Next we need to JSON encode the data (this will turn it into a simple string); and store it in the context to pass to jinja (which will pass it on to 'index.html').

    Taking these two steps together, we get:

    @app.route('/')
    def index():
        data = [... # all the stuff above ]
        # JSON encode the data so it reaches the template as a plain string
        data = json.encode(data)
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        return template.render({'data': data})

    3) Finally, we need to update the javascript in 'index.html' to retrieve the data. This simply requires us to write {{data|safe}} wherever we want to access the data. For example:

    var data = google.visualization.arrayToDataTable({{data|safe}})

    When you are passing information back and forth from your Python code to Jinja to JavaScript for the visualization, it will be important to understand what information is available on both ends. You'll want to use your browser's console to debug this (along with the 'console.log' function in JavaScript). In Chrome, you access the console using an operating-system-specific key combination.

    Debugging Hints

    The flow of information in this code is multi-faceted. You are (hopefully) loading data from somewhere in Python, and packaging it up to send to JavaScript. Inside of JavaScript you may do further processing and visualize the data, which creates DOM elements. Because of these complexities, you need to trace errors across several possible locations. If there is an error in your Python code, it is most easily caught by looking at the Google Appspot log file, where you can print things out using the familiar logging.info(). Crashes will also show up in the same log if they come from your Python code.

    Assuming that your code doesn't crash somewhere in python, you may also need to debug on the javascript side. For this, you will want to use the javascript console, to which you can write (from within javascript scripts) using console.log(). Crashes in your javascript code will also show up in your console. As discussed in class, you can also inspect the DOM using the elements tab that shows up among the developer tools that include your console. You may have to go back and forth between debugging in your browser and in your python log files. 

    Using real data to show the relationship between age and outcome

    Although we have now created a custom visualization, it only functions with the fake data we gave it. Our next step is to hook it up to the data. 

    We can use the same code as from Byte 2 to load the full data set directly from Google Fusion Tables. However, I have found that the speed of Fusion Tables can be variable, to say the least. An alternative is to use the same mechanism as in 'explore.py' from Byte 2 to load the data from a file. You will need to download the data set into a file (such as 'data.json') using the code from explore.py, because a Google App Engine application is not allowed to write to disk (it can write to a data store, but we will not be covering that in this class). 

    Once you have a file with data in it (you could just use the one from Byte 2), it needs to be placed into a static directory. We'll need to create a directory ('data/') inside [uniqueid]-byte4 and place 'data.json' in that directory. We'll also need to update app.yaml to tell google about the directory and make it application readable. NOTE: If you choose to do this, GOOGLE WILL CHARGE YOU A SMALL FEE FOR THE SPACE on an ongoing basis.

    handlers:
    - url: /favicon\.ico
      static_files: favicon.ico
      upload: favicon\.ico
    - url: /data
      static_dir: data
      application_readable: true
    - url: .*
      script: main.app

    Next, we'll use python to collect the parts of the data we care about (without serial SQL queries). For example, to map ages to outcomes we need to initialize a dictionary that contains an entry for each age, something like this:
        age_by_outcome = {}
        for age in ages:
            outcome_vals = {'Age':age}
            for outcome in outcomes:
                outcome_vals[outcome]= 0
            age_by_outcome[age] = outcome_vals

    and then fill it with data. We are simply looping through all of the rows of data and counting up information (a sketch of the supporting definitions this loop assumes follows the code).
         
        # find the column id for ages
        ageid = columns.index(u'Age')
        
        # find the column id for outcomes
        outcomeid = columns.index(u'Outcome')

        # loop through each row
        for row in rows: 
            # get the age of the dog in that row
            age = age_mapping[row[ageid]]
            # get the outcome for the dog in that row
            outcome = row[outcomeid]

            if age not in ages: age = 'Unspecified'
            if outcome not in outcomes: outcome = 'Other'
            
            # now record what we found
            age_by_outcome[age][outcome] += 1
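
    The loop above assumes a few supporting definitions. Here is a hedged sketch of what they might look like (the buckets match the output shown further below, but the mapping entries are hypothetical; the real age_mapping should be built from the raw age strings that actually appear in the data set):

        # illustrative only -- adjust the buckets and the mapping to your data
        ages = ['<6mo', '6mo-1yr', '1yr-6yr', '>7yr', 'Unspecified']
        outcomes = ['Adopted', 'Euthanized', 'Foster', 'Returned to Owner',
                    'Transferred to Rescue Group', 'Other']
        # maps each raw age string found in the data onto one of the buckets
        age_mapping = {'Puppy': '<6mo', 'Adult': '1yr-6yr'}  # hypothetical entries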

    Moving to D3

    To use D3 (much more sophisticated than Google Charts), first download the latest version from the D3 website and unzip it into the [yourname]-byte4 directory. Next, be sure to update your 'app.yaml' file so that your application knows where to find d3. We'll also want to make use of CSS stylesheets when using d3, so we'll add a directory for stylesheets to 'app.yaml' as well. You should change the handlers section to look like this:

    handlers:
    - url: /favicon\.ico
      static_files: favicon.ico
      upload: favicon\.ico
    - url: /data
      static_dir: data
      application_readable: true
    - url: /d3
      static_dir: d3

    - url: /stylesheets
      static_dir: stylesheets

    - url: .*
      script: main.app

    Scott Murray's D3 Fundamentals tutorial (or his free online book) will acquaint you with the basics of D3 (you may also find d3 tips and tricks useful). At a minimum, you'll want to produce a bar chart similar to the one we produced up above using Google's chart capabilities. However D3 can do so much more! This section of the tutorial will walk you through how I created the stacked bar chart at jmankoff-byte3.appspot.com (which I based on mbostock's example stacked bar chart). 

    First, I organized the data in main.py into a list of dictionaries, containing the number of animals in each outcome. Here is what the final output looks like in the log:
    [{'Foster': 0, 'Returned to Owner': 0, 'Age': '<6mo', 'Adopted': 0, 'Euthanized': 0, 'Other': 0, 'Transferred to Rescue Group': 0}, {'Foster': 0, 'Returned to Owner': 0, 'Age': '6mo-1yr', 'Adopted': 0, 'Euthanized': 0, 'Other': 0, 'Transferred to Rescue Group': 0}, {'Foster': 0, 'Returned to Owner': 0, 'Age': '1yr-6yr', 'Adopted': 0, 'Euthanized': 0, 'Other': 0, 'Transferred to Rescue Group': 0}, {'Foster': 0, 'Returned to Owner': 0, 'Age': '>7yr', 'Adopted': 0, 'Euthanized': 0, 'Other': 0, 'Transferred to Rescue Group': 0}, {'Foster': 0, 'Returned to Owner': 0, 'Age': 'Unspecified', 'Adopted': 0, 'Euthanized': 0, 'Other': 0, 'Transferred to Rescue Group': 0}]
    This is created using about 30 lines of code in 'main.py'. The key section of that code is already listed above; the remainder simply sets up the structure necessary for this to work. Once we have done this, we pass it to 'index.html' as context:
           # add it to the context being passed to jinja
           variables = {'data':json.encode(age_by_outcome)}
           
           # and render the response
           template = JINJA_ENVIRONMENT.get_template('templates/index.html')
           return template.render(variables)

    Now we need to set up index.html. First we need to tell it about d3:
      <script type="text/javascript" src="d3/d3.v3.js"></script>

    Next we start on the script for displaying the data. First we move the data into variables accessible to javascript:
     
      <script>
           // ----------- EVERY CHART NEEDS DATA --------------
           // this is the data we passed from main.py
           // the format for data is: 
           // [{outcome1: amount1, ..., outcomen: amountn,
           // Age:'<6mo'}, ..., {outcome1: amount1, ... , Age: '>7yr'}]
           var data = {{data|safe}}
           // now collect the possible ages for use in axes of the chart
           var ages = data.map(function (d) {return d.Age});
           console.log(ages);
           // and outcomes, for the same reason
           var outcomes = d3.keys(data[0]).filter(function(key) { return key !== "Age"; });
           console.log(outcomes);


    Now we can easily loop through the data to calculate information we will need later for graph creation. We want to create a graph that stacks rectangles for each outcome on top of each other. This means that only the first outcome is at position y=0; the remaining ones will be positioned higher based on the amount of data in each previous outcome. Looping is done in JavaScript by saying: 
           data.forEach(function(d) {

    We calculate a y0 and y1 (bottom and top) position for each rectangle. We also calculate the total height of all of the stacked bars (from the bottom of the bottom bar (0) to the top of the top bar). 
     // the y0 position (lowest position) for the first stacked bar will be 0 
     var y0 = 0;
     // we'll store everything in a list of dictionaries, d.outcomes
     d.outcomes = outcomes.map(function(name) {
       // each outcome has a name, a y0 position (its bottom),
           // and a y1 position (its top).
       res = {name: name, y0: y0, y1: y0 + d[name]};
           // and we also have to update y0 for the next rectangle.
       y0 = y0 + d[name];
       return res;});
     // we also store the total height for this stacked bar
     d.total = d.outcomes[d.outcomes.length - 1].y1;

    The next section of the d3 code, labeled 
           // ----------- EVERY CHART NEEDS SOME SETUP --------------
    sets up the axes and color scales. You should check the d3 documentation to understand more about what is going on here. For color picking, it can be helpful to use a site such as colorbrewer2.org

    The meat of any D3 visualization happens through DOM manipulation. D3 uses an SVG element for drawing, which in this case we place inside of the body of the HTML. In D3, a series of commands can be carried out as serial method calls, so for example we set up the svg using:
     // the svg element is for drawing. We set its size based 
     // on the margins defined earlier
     var svg = d3.select("#viz").append("svg")
         .attr("width", width + margin.left + margin.right)
         .attr("height", height + margin.top + margin.bottom)
           // and add a group that is inside the margins
       .append("g")
         .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
    D3 also has a very unusual way of looping through data -- you simply reference it as yet another function call using something like .data(data). In the sample code we first create a group DOM element for each bar by looping through the ages. Note that we select all '.Age' elements before we have created them (that happens in .append("g")).
       // Create a group for each age
       var age = svg.selectAll(".Age")
          .data(data)
          .enter().append("g")
            .attr("class", "g")
            .attr("x_position", function (d) {return x(d.Age);})
            .attr("transform", function(d) {return "translate(" + x(d.Age) + ",0)"; });

    Next we create a rectangle for each outcome. Again, we are selecting all rects before we actually append them to the visualization. This is non-intuitive but allows d3 code to be written without loops. 

           // create a rectangle for each outcome (for each age)
           age.selectAll("rect")
                // bind the outcome data for that age to that rectangle
               .data(function(d) { return d.outcomes; })
             .enter().append("rect")
                 .attr("width", x.rangeBand())
                 // use the outcome data to determine y position and height
                 .attr("y", function(d) { return y(d.y1); })
                 .attr("height", function(d) { return y(d.y0) - y(d.y1); })
                 // use the color scale to determine the fill color
                 .attr("fill", function(d) { return color(d.name); })
                 
    At this point, you should be able to display a stacked bar chart in your browser generated using the code we just went through. However it is easy to add a little bit of interactivity. First, let's create a style sheet ('d3.css') which we reference in 'index.html' as:
      <link href="stylesheets/d3.css" rel="stylesheet" type="text/css">

     Next we can make our bars respond to hovering:
    rect {
            -moz-transition: all 0.3s;
            -o-transition: all 0.3s;
            -webkit-transition: all 0.3s;
            transition: all 0.3s;
    }

    rect:hover {
            fill: orange;
    }

    This is nice, but what if we want tooltips as well? A simple way to do this is to create a hidden div that we position and show based on mouse over events. To do this we need to add to our stylesheet:
    #tooltip.hidden {
            display: none;
    }

    and add the div to our HTML inside the <body>:
       <div id="tooltip" class="hidden">
           <p><strong>Number of Animals:</strong></p>
           <p><span id="value">100</span></p>
        </div>
     
    Finally, we need to add two more function calls to how we define our "rects":
       .on("mouseover", function(d) {
           //Get this bar's x/y values, then augment for the tooltip
           var xPosition = parseFloat(d3.select(this.parentNode).attr("x_position")) + 
               x.rangeBand() / 2;
           var yPosition = parseFloat(d3.select(this).attr("y")) +   14;

           //Update the tooltip position and value
           d3.select("#tooltip")
               .style("left", xPosition + "px")
               .style("top", yPosition + "px")
               .select("#value")
               .text(d.y1-d.y0 + " animals were " + d.name + ".");

                //Show the tooltip (it's a div that is otherwise always hidden)
                d3.select("#tooltip").classed("hidden", false);
                })
       // and cause it to disappear when the mouse exits 
       .on("mouseout", function(d) {
                d3.select("#tooltip").classed("hidden", true)});

    When you create your visualization, be sure to give it some sort of interactive aspect. To facilitate this, the data structure we introduce above was selected to match the structure of the data loaded from a multi-column csv file typical of other d3 tutorials. Here is an explanation of how d3 loads csv files and what the data looks like. This means you can compare our approach to other d3 tutorials that make charts from data that has rows and columns. Examples are mbostock's grouped bar chart and delimited.io's multi series charts. You will also find this data structure similar to single series data tutorials (except that they have only two columns rather than many, the labels and the values). An example is this excellent d3 tutorial. You can even go so far as loading csv files in appspot if you want to. 

    Hand In Expectations

    You will be asked three things about your handin. 

    First, does the code work and support interaction with the visual narrative in some fashion?

    Second, what narrative story does your visualization tell? The best answers would talk about the relationship between the question the visualization answers, the design choices that are made about the visualization, and how those choices help the user to get the most from the visualization. 

    Third, what did you do to improve the D3 chart in particular. A minimal change would be to improve the meaningfulness of the existing visualization. As it stands, the data you are displaying is hard to interpret because it's raw numbers. For example, it's hard to get a sense of the difference in percentage of animals with different outcomes because there were just fewer puppies than older animals that the shelter dealt with. This could be addressed in the visualization by changing the types of numbers used. You could also play with alternative visualizations (non bar charts); visualizations of other aspects of the data set; and other types of interactivity. For example, the D3 book we pointed you at walks through how to update the visualization when the user clicks on something like a radiobutton. You could allow the user to change the style or content of the visualization.

    Byte 6 [IN PROGRESS]: Working With Text

    posted Oct 28, 2013, 12:22 PM by Jen Mankoff   [ updated Mar 24, 2014, 10:34 PM ]

    • Description: Build an application that can download twitter data. Make sure the data is "clean" and in the right language. ...
    • Due date: See Blackboard
    • Grading: More details to be provided. 


    Overview

    In this project, you will create a new application that displays data from your Twitter account. You will need to use OAuth to access this information. For this assignment, we are not yet going to start dealing with language (a common use of twitter data). Instead we would like to explore the numerical side of data analysis. For this we will be counting tweets that match a search string. This assignment has the following learning goals:
    • Calculating basic statistics
    • Data cleaning
    • ...

    Detailed instructions for Byte 5

    Byte 5 will build on the concepts used in Bytes 1-4. You should start by setting up a second google application, called [yourname]-byte5.appspot.com. 

    Setting things up on the Twitter side 

    We will be using the streaming API from Twitter, which is a RESTful API. We are going to use Twitter's application-only authentication, which is much simpler than application + user authentication, and can give us access to search the tweet stream. To gain access to the Twitter API, you first need an authorization (a special password just for this application) from Twitter (you will need a Twitter account to obtain this). There are several types of authorizations supported by Twitter. For our purposes, the dev.twitter.com option will work well. 
     
    Go ahead and create your application, calling it [yourname]-byte2. When you create this application, you will be asked to provide a URL for the application. You can use 'http://[yourname]-byte2.appspot.com' (without the quotes, of course). For further details on how to set up these applications, see the instructions provided by Twitter (you can always go back and change these settings). When you are done creating the application, you should have a set of tokens called CONSUMER_KEY and CONSUMER_SECRET. Be sure to save both tokens.   

    Importing all of the necessary libraries

    There are a number of python libraries for Twitter. However, the learning goals for this project are for you to be able to access any API, not just Twitter. Thus, we will be using lower level libraries to do authentication ourselves. To move forward, we will need to use a number of libraries taken from several different locations: Google-installed libraries can be found at webapp2_extras (documentation for webapp2_extras) or at google.appengine.api (documentation for google.appengine). Additional Google-supported APIs are at https://developers.google.com/api-client-library/python/apis/. Additionally, we sometimes have to "manually" install libraries, including some provided by Google, similarly to how we installed feedparser in the last project. You should familiarize yourself with these libraries as you may find yourself needing them at times over the course of the semester. For Twitter access, we will be using 
    • httplib2 from the Google APIs client library. You can get this from Google; be aware that the installation instructions are a bit confusing: you need to scroll down to the section titled "App Engine" and then follow the instructions to download the latest version of google-api-python-client-gae-N.M.zip and install it in your [yourname]-byte2 directory (I unzipped the file, and then copied the subdirectory for httplib2 into that directory). This is similar to the way we installed feedparser in byte 1. 
    • json and jinja2 from webapp2_extras 
    • base64 (a standard library for encoding base 64 required by the oauth protocol)
    • urllib (a standard library for working with urls)
    When you are done setting all of this up, the header of your main.py file should look something like this:

    import base64
    import webapp2
    import logging
    from webapp2_extras import jinja2
    from webapp2_extras import json
    import httplib2
    import urllib

    Debugging as you go

    The work we are about to embark on can be somewhat complex, and involves multiple moving parts. It is helpful to know about several options for tracing what is happening: 
    • When you are exploring solutions to a problem of this sort, we highly recommend using logging.info(...) frequently to double check that things are proceeding as you expect.
    • Don't forget to view the log itself, which will have not only your printouts, but also other bugs that arise, what exceptions happen, and so on
    • You can try out the get and post requests you make if you direct them to RequestBin rather than Twitter. Once you run the 'GET' or 'POST', refresh the page that RequestBin gives you and you can see exactly what you sent to Twitter. Compare this to the tutorial to see if you have everything right.

    Implementing the application only OAuth protocol

    For our implementation, we will follow the documentation on Twitter for application-only authentication, which very clearly lays out the process for requesting authorization. To make things simple, we will store the key information we need to access Twitter in four global variables. Note that this is not a very secure way to do things; having a "secrets" file that you load them from would be an improvement, for example.

    # this is the URL we will use to get our authorization token (stage one of the authorization)
    TWITTER_REQUEST_TOKEN_URL = 'https://api.twitter.com/oauth2/token'
    # this is the search URL we will use at the end
    TWITTER_STREAM_API_PATH   = 'https://api.twitter.com/1.1/search/tweets.json'

    # this is the set of keys we need to get our authorization token
    TWITTER_CONSUMER_KEY      = 'HxmQOAxoU9fnmrLOx7rbw'
    TWITTER_CONSUMER_SECRET   = '9awHN0NEsFuVcwZHaLqKIjPi9r6FIN2D52rcgw8'

    In addition, we need to create an http client which is also going to be global so that we can send 'POST' and 'GET' requests:
    # this is how we will send requests to twitter to get our authorization token
    http = httplib2.Http()

    The next step is to define one function for each step of the OAuth process. Note that right now we are doing no error checking, but it would be easy to check for errors by looking at the "response" portion of each http request. The first function will be called:
    def oauth_step1_get_authorization_token(self):

    It is a straight setup of the headers described in the twitter tutorial for application only authorization, including specific text for 'Content-Type', 'Accept-encoding', and 'Authorization'. The most complex part of this, shown below, is setting up the authorization correctly:
    credentials = "{0}:{1}".format(key, secret)        
    credentials = base64.b64encode(credentials)
    headers['Authorization'] = "Basic {0}".format(credentials)

    Once all the headers are set up, we simply call
    resp, content = http.request(token_request_url, 
                                 'POST', headers=headers, body=content)

    content will be text in serialized JSON format. We need to decode it and retrieve the access token. You may want to read up on how Python supports JSON for storing and accessing content once decoded.
    tokens = json.decode(content)
    return tokens['access_token']

    Note that the access token (tokens['access_token']) in this type of OAuth does not change. You could store it in a file and retrieve it later; unless you specifically ask Twitter to revoke the token, it will keep working. 
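
    Pulling the pieces above together, the whole first step might look something like the following sketch (error handling omitted; the header values follow Twitter's application-only tutorial, and httplib2 transparently decompresses the gzipped response):

    def oauth_step1_get_authorization_token(self):
        # build the Basic auth header from the consumer key and secret
        credentials = "{0}:{1}".format(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
        credentials = base64.b64encode(credentials)
        headers = {'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                   'Accept-encoding': 'gzip',
                   'Authorization': "Basic {0}".format(credentials)}
        # ask for an application-only bearer token
        content = 'grant_type=client_credentials'
        resp, content = http.request(TWITTER_REQUEST_TOKEN_URL, 'POST',
                                     headers=headers, body=content)
        tokens = json.decode(content)
        return tokens['access_token']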

    The second function will be called 'auth_step2_make_api_request'. It takes as input an authorization token and a set of search terms. Once again we need to set headers following the  twitter tutorial for application only authorization, including 'Accept-encoding' and the 'Authorization' header (this time using the token): 
    headers['Authorization'] = "Bearer {0}".format(token)

    However, we also need to construct a search URL using parameters supported by Twitter. For example, we can specify that we only want 20 results using count=20. This also requires encoding the terms so that they will work in the URL, taking care of characters like spaces using urllib.quote().
    terms = urllib.quote(terms)
    url = "{0}?q={1}&result_type=recent&count=20".format(
                                               TWITTER_STREAM_API_PATH, terms)

    Finally, we run the search and decode the results:
    resp, content = http.request(url, 'GET', headers=headers)
    tokens = json.decode(content)

    Understanding and displaying tweets

    Now that we have the JSON-decoded list of tweets retrieved using our search, it's time to display them. tokens contains two things: a list of tweets (stored as 'statuses') and other metadata ('search_metadata'). First, however, we need to understand how the JSON stores tweets. Tweets include a surprisingly complex set of information, made harder to understand by the presence of deprecated (old, unsupported) metadata. Luckily, Twitter provides comprehensive documentation. For our application, we will want something like the 'text' attribute and the 'user'->'name' attribute. 

    However, before we proceed, it's important to understand that the results make use of Unicode, a standard for representing an international character set. Reading a great tutorial on Unicode in Python is a good place to start, as it can be very confusing. Unfortunately, Python 2.7 does not use Unicode by default in its strings, so we need to deal with this explicitly (Python provides a Unicode tutorial). Jinja does expect Unicode, so we will need to do some conversions along the way. 

    # We need a place to store the tweets so we can 
    # pass them to jinja
    tweets = [] 
    # Next, we need to iterate through the tokens in 
    # the status portion of our results:
    for tweet in tokens['statuses']:
        # next we extract the text for the tweet and unicode it
        text = unicode(tweet['text'])
        # and unicode the user's name
        name = unicode(tweet['user']['name'])
        # finally we add the tweet to our list of tweets
        tweets.append([text, name])
    # and add the result to our context
    context = {"search": terms, "tweets": tweets}
    # so that we can render the tweets
    self.render_response('index.html', **context)

    Cleaning the data

    Because we are dealing with textual data here, cleaning and examining the data is not easily handled using statistical techniques and graphs. Instead we need to start by looking at the content of the text, the characters on the screen and so on. Below are some results for a search for "xmas". There are two problems illustrated below (see if you can figure out what might be problematic before you read further). 


    The first tweet shown below is not in English. This may not be a problem for everyone, but for me, twitter results I cannot read are not all that useful. The second one has squares in it (just after the "...christmas."). Those are probably characters that are not being displayed correctly. Let's work through how to address each of these. 

    Which Language Do I Want?

    Identifying the language of a text could be done in several ways. However, in the case of twitter, we are lucky that every tweet comes with a 'lang' indicator (as described in twitter's tweet documentation). We simply need to call
    lang = tweet['lang']
    and then check 
    if lang == 'en':
    to decide whether to display a tweet. After adding this to the post function in 'main.py' and double checking that it works correctly, foreign-language tweets disappear from my results.
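    For example, the display loop from above could skip non-English tweets like this (a sketch; I use append here, which is equivalent to the list concatenation used earlier):

    tweets = []
    for tweet in tokens['statuses']:
        # skip anything twitter has not tagged as English
        if tweet['lang'] != 'en':
            continue
        text = unicode(tweet['text'])
        name = unicode(tweet['user']['name'])
        tweets.append([text, name])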

    Unicode wasn't enough

    To identify the source of the problem with the "boxes" mixed in with text, I turned to the log. I searched for "...christmas" in the log and found the original text as follows:
    u'...christmas. \U0001f385\U0001f384 #christmas #xmas #family #instagood #like @ Home http://t.co/WPu6Sn12t5'

    I did some detective work and double checked the first code at http://unicodelookup.com (I had to convert it to hexadecimal and look up 0x0001f385). Sure enough, it is undefined there. The next question is: what does it encode, if not Unicode? Amazingly, I found the answer further down in my log file:

    It looks like an emoticon encoding. Further research suggested that the problem was not my code but rather the font set associated with the browser I was using. Sure enough -- boxes in chrome, nothing in firefox, and icons in safari (and searching for the same tweet in Twitter's own interface I saw the same results).
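    If you want to do this kind of detective work programmatically, a small sketch like the following logs the code point of every non-ASCII character in a tweet (note that on a 'narrow' Python build, characters outside the Basic Multilingual Plane show up as two surrogate code points rather than one):

    # log the numeric code point of each non-ascii character
    for ch in text:
        if ord(ch) > 127:
            logging.info("non-ascii character: U+%04X", ord(ch))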

    Counting Tweets

    For this assignment, we are not yet going to start dealing with language directly. Instead we would like to explore the numerical side of data analysis. Let's start by printing the total number of tweets, and then organizing them by day, week, and month. 

    First we need to get the date of each tweet, which can be found in tweet['created_at']. Time is provided in UTC format, as in Wed Aug 27 13:08:45 +0000 2008. Parsing the date and converting it into local time required a fair amount of googling on my part, because Python's standard date and time libraries do not handle timezones very intuitively. In the end, after reading extensively, and thanks to a stackoverflow comment pointing at work done by Jeff Miller in blackbirdpy, I learned that the best tool for parsing the date is in the email.utils library. I tell this story by way of noting that we all have to search for solutions sometimes! Here's the import:

    # libraries needed for parsing and using dates
    import datetime
    import email.utils

    And the helper function that I wrote based on blackbirdpy (add this near the top of 'main.py', outside of any class, so it can be called directly):

    # copied and modified from blackbirdpy (https://github.com/jmillerinc/blackbirdpy)
    def timestamp_string_to_datetime(text):
        """Convert a string timestamp of the form 'Wed Jun 09 18:31:55 +0000 2010' into a Python datetime object in local time."""
        tm_array = email.utils.parsedate_tz(text)
        tweet_created_datetime = datetime.datetime(*tm_array[:6]) - datetime.timedelta(seconds=tm_array[-1])
        tweet_local_datetime = tweet_created_datetime + \
                               (datetime.datetime.now() - datetime.datetime.utcnow())
        return tweet_local_datetime
     
    We can now extract the date of the tweet, in local time, by calling: 
    # the date of the tweet
    date = timestamp_string_to_datetime(tweet["created_at"])



    We may want to handle retweeted tweets differently. We can find them by checking tweet['retweeted_status'], which will exist only if the tweet is a retweet: 
    not_original = 'retweeted_status' in tweet
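    Putting the date parsing and the retweet check together, counting tweets per day might look like the following sketch (using collections.Counter from the standard library, which is my choice here rather than something the byte requires):

    from collections import Counter

    tweets_per_day = Counter()
    retweets = 0
    for tweet in tokens['statuses']:
        if 'retweeted_status' in tweet:
            retweets += 1                      # tally retweets separately
            continue
        date = timestamp_string_to_datetime(tweet['created_at'])
        tweets_per_day[date.date()] += 1       # key the counter by calendar day (local time)

    logging.info("original tweets: %d, retweets: %d",
                 sum(tweets_per_day.values()), retweets)
    for day, count in sorted(tweets_per_day.items()):
        logging.info("%s: %d tweets", day, count)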


    Thoughts about ways to take this further (depending on your interests and skillset):

    A working application based on this tutorial will be sufficient to get full credit for this byte. An example can be found at http://jmankoff-byte2.appspot.com/. You can get extra credit if you impress the grader. Below are some ideas:
    • Right now, we don't cache the authorization token. This causes occasional errors, because twitter limits how often we can request the token, and it is silly since the token doesn't change. We could store it in a file, in which case we only have to request it once, or we could store it as a session variable.
    • The type of authorization we implemented is limited to only part of Twitter's API. To access per-user data, you need to do per-user authorization and then modify your authorization implementation to follow Twitter's guidelines for obtaining access tokens on behalf of a user. Note that for testing on your local machine in this case you'll want to create a second twitter application (call it [yourname]-byte2-dev), which would have to redirect to that machine instead of the appspot application.
    • No error checking: The authentication procedure can go wrong for many different reasons. It would be a good idea to check the responses coming back from twitter for standard errors before parsing the content, and handle the errors appropriately. 
    • We have created a basic text display for the tweets. How about replacing this with correct tweet parsing to show the images as images and link all the hash tags to new searches for that hash tag?
    • Apply natural language processing to the tweets, for example with NLTK (http://www.nltk.org/)
    • Try Google's hosted sentiment predictor at https://developers.google.com/prediction/docs/gallery#hosted_model

    Some questions you should be able to answer

    • What are some other ways you might have been able to address the language problem (if Twitter had not provided a solution)?
    • What are some ways to eliminate spam tweets that do not require natural language processing?

    Byte 1: Displaying an RSS Feed

    posted Oct 23, 2013, 7:47 AM by Jen Mankoff   [ updated Jan 4, 2016, 8:58 PM ]

    • Description: Your final product will be a web-enabled application that can query and display an RSS feed (it should live at something like [yourpseudonym]-rss.appspot.com)
    • Source Code: See https://github.com/jmankoff/data, in Assignments/jmankoff-rss
    • For some assignments we will provide complete or partial source code that you can look at. It is recommended that you try to construct your own source code using the tutorial and only refer to the provided code as needed. This is especially important because we build up the source code iteratively in the tutorial, gradually replacing portions of it, while the provided source code shows only a single view (the final version). In addition, it will rarely be the case that you can use that source code entirely unmodified to complete an assignment. 

    Overview

    In this project, you will create a small application that displays data from an RSS feed. The work you do in this project byte is something you will build on throughout this class. This assignment has the following learning goals:

    • Setting up your environment
    • A first experience with Python
    • Learning how to acquire data from an external source
    • A first experience with the RSS format
    • A first experience with programmable HTML
    • A first experience with forms
    • A first experience setting up a question and deciding what data helps to answer the question.

    Detailed instructions for Byte 1

    This project requires you to use Python 2.7 (please note the version number) and some additional libraries that are available for Python. To learn more about Python, you may want to explore www.pythontutor.com. The textbook Introduction to Computing and Programming in Python is an excellent introductory book aimed at non-programmers.

    Setting Up Python using Google Apps

    Google Apps is a development environment that will let you place your code on the web with relative ease. An excellent "Getting Started" tutorial will walk you through the initial creation of a simple application that displays plain text on the web. 
    • You will need to select your language (select Python)
    • A framework (select Flask). You will also need to supply a project name (I used 'jmankoff-rss'). When you create a Google web application, you need a unique identifier for it that no one else on the web has used. A good approach for the assignments in this class is to prefix them with a unique id you choose (you can use your username, but then students grading you may know who you are; an anonymous id is fine too). 
    • Install Google App Engine on your machine (allow it to make command symlinks, as you will want to do some things at the command line). 
    The tutorial is quite detailed and helpful. Be sure to follow it all the way through until you can load your website on the web.  

    Setting up a locally hosted version of your website

    To load and debug things locally, you will need to tell the Google App Engine Launcher about your website. The Launcher application looks something like this. To add my app (jmankoff-rss), I clicked on the + button on the bottom left, navigated to the appropriate directory, and filled in the other details. 


    You can ignore "Admin Port" for now. "Port" (which you would have designated at set up time) is the port on which your local application is running. You have to press the 'Run' button to actually visit this application. The default is 8080, in which case you can view the results of your code at http://localhost:8080/.

    Installing Libraries

    Before we start, let's make sure we've included the correct set of libraries in our application. For this project we will be using:

    • Bootstrap. Head over to http://getbootstrap.com/getting-started/ to download it. Unzip it, and move the subdirectories (css, fonts, js) into the main directory of your application ('jmankoff-rss/' in my case).
    • JQuery. Download the compressed production version at https://jquery.com/download/, and place it in the 'js' directory that you just added to your application. 
    • Jinja2 (the Python Template Engine). Jinja is already installed, but you do need to tell appspot you are using it (see below)
    Now, set up your app.yaml with the correct information:
    # Handlers define how to route requests to your application.
    handlers:
    - url: /js
      static_dir: js
      application_readable: true
    - url: /fonts
      static_dir: fonts
      application_readable: true
    - url: /css
      static_dir: css
      application_readable: true
    - url: /templates
      static_dir: templates
      application_readable: true

    # ... and further down after a bunch of comments
    libraries:
    - name: jinja2
      version: latest
    - name: webapp2
      version: latest

    Customizing your Application

    To begin serving HTML pages, follow the Jinja2 tutorial. We've already updated the app.yaml file. Now we have to update the main.py file to follow through. Add the following lines at the top of the file:
    # Imports
    import os
    import jinja2
    import webapp2
    import logging

    JINJA_ENVIRONMENT = jinja2.Environment(
        loader=jinja2.FileSystemLoader(os.path.dirname(__file__)),
        extensions=['jinja2.ext.autoescape'],
        autoescape=True)

    And change the hello() route as follows:
    @app.route('/')
    def hello():
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        return template.render()


    Create a directory inside the [yourname]-byte1 folder named 'templates' and put a file named 'index.html' inside. 'index.html' should contain the following html.
    <!DOCTYPE html>
    <html>
    <head>
    <title>Byte 1 Tutorial</title>
    </head>
    <body>
    <h1>Data Pipeline Project Byte 1 Example</h1>
    </body>
    </html>
    The result, when you load it, should look like this:

    Data Pipeline Project Byte 1 Example

    Now we want to add some bootstrap styling. The sample index.html and about.html files provided with your byte source code are based on a bootstrap theme; you can view more themes on the getting started page at http://getbootstrap.com/getting-started/ and download source code for example themes. Just be aware that you will need to modify these themes to reflect the directory structure of your Google Appspot application. Specifically, you should use 'css/...' to refer to css files and 'js/...' to refer to javascript. 

    Debugging your Application

    You can quickly and easily test your scripts as you go using a local web page. The Google App Engine Launcher gives you the information you need to do this. Edit main.py, save it, and check that your application is running (there should be a small green arrow to the left of it in the application launcher):


    As before, "Port" (which you would have designated at set up time) is the port on which your local application is running. The default is 8080, in which case you can view the results of your code at http://localhost:8080/ [Note: my port in the image above is 8082 because I changed the default, and my corresponding URL would be localhost:8082]. The "Logs" window is also extremely helpful. You can output debugging text there by using the Python command logging.info().

    Thus, a very good debugging and editing cycle is [Edit main.py] [reload local web page] [check results and log to make sure your code is doing what you think it is] [rinse and repeat]
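    For example (purely illustrative), a log statement dropped into the hello() route will show up in the Logs window every time you reload the page:

    import logging

    logging.info("rendering the index page")  # appears in the Logs window on each page load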

    Collecting information from your RSS feed in your application

    Now that you can show HTML on the web using Bootstrap, it is time to show dynamic information from your RSS feed. To do so you will need to construct a url as follows:
     http://www.bing.com/search?q=[search term]&format=rss
    For example, if you want to search for dogs, use:
     http://www.bing.com/search?q=dog&format=rss
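    If your search term contains spaces or other special characters, you may also want to URL-encode it before building the URL. A small sketch using urllib.quote from the standard library:

    import urllib

    term = urllib.quote("hot dogs")   # becomes 'hot%20dogs'
    url = "http://www.bing.com/search?q=" + term + "&format=rss"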

    You will need to modify main.py to take data from this feed and display it on the page. There are several ways to do this, ranging from downloading and parsing the raw XML yourself to using a third-party library that specializes in feeds. This homework will walk you through the latter solution. We will use the feedparser library. Feedparser documentation is available at http://pythonhosted.org/feedparser

    Because feedparser is not part of the Python standard library, we need to make sure Google has access to it. This requires downloading it and copying it (specifically, the file feedparser.py) into the same directory as main.py. Once this is done, you should be able to add import feedparser to the top of main.py.

    Now that we have a way of testing the code that you write, let's talk about how to parse a feed. The basic approach is as follows: 
    import feedparser
    import logging

    feed = feedparser.parse("http://www.bing.com/search?q=[search term]&format=rss")

    for item in feed["items"]:
        logging.info(item)

    Displaying the Feed Contents in Jinja

    First, collect the information. We will take advantage of a Python shorthand for building a list with a loop (a list comprehension):

    data = [{"link": item.link, "title": item.title, "description": item.summary_detail} for item in feed["items"]]

    One of the most powerful aspects of Jinja is its ability to display dynamic information provided by Python. We can pass one or more variables to Jinja by placing them in context:
    return template.render(feed=data)
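    Putting these pieces together, the hello() route might now look something like the following sketch (the hard-coded "dog" search is just a placeholder; a later step replaces it with a user-supplied term):

    @app.route('/')
    def hello():
        # fetch the feed, pull out the fields we care about, and hand them to jinja
        feed = feedparser.parse("http://www.bing.com/search?q=dog&format=rss")
        data = [{"link": item.link, "title": item.title,
                 "description": item.summary_detail} for item in feed["items"]]
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        return template.render(feed=data)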

    Next, update the 'index.html' file to show the information:
    <h2>Feed Contents</h2>
    <div class="panel panel-primary">
    {% for item in feed %}
      <div class="panel-heading">
        <h3 class="panel-title"><a href="{{ item["link"] }}">{{ item["title"] }}</a></h3>
      </div>
      <div class="panel-body">
        {{ item["description"]["value"] }}
      </div>
    {% endfor %}
    </div>

    Note the use of {% ... %}. This indicates some logic that should be executed (in this case a for loop). The contents of {{ ... }} are replaced with their value.  The resulting output looks like this:

    Letting the user control the search

    We will start with a very simple form that you can add to your 'index.html' file: 
    <form action="search" method="POST">
      Search Term: <input name="search_term" value="cats">
      <input type="submit" value="Enter Search Term">
    </form>

    We will also need a way to display the search results. This involves placing some additional logic inside the 'index.html' file to display the search terms (which aids debugging) and the results. We use the if / endif statements for error checking: if the term isn't present, the page will still render. 
    {% if search %}
    <p>Searching for {{search}}</p>
    {% endif %}

    Now we need to collect the form data. Since we chose Flask, this means adding a route to 'main.py' that accepts the POST request from the search form (you will also need from flask import request at the top of the file):
    @app.route('/search', methods=['POST'])
    def search():
        logging.info("post")
        term = request.form["search_term"]

        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        return template.render(search=term)

    Note that the input name specified in 'index.html' and the key used in request.form need to match up for Jinja to show anything. In the code above, the search term will show up (see below), but since we have not provided any results, that part of the web page will not render. 

    When this is done, after you type a search term in, the web page at http://localhost:8080/ should show the following: 

    [... and so on]

    Finally, we need to actually use the search term. Replace the last two lines of search() with the following, which builds the new search URL, fetches the feed, and passes both the results and the search term to the template:
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        url = "http://www.bing.com/search?q=" + term + "&format=rss"
        feed = feedparser.parse(url)
        data = [{"link":item.link, "title":item.title, "description":item.summary_detail} for item in feed["items"]]
    return template.render(feed=data, search=term)

    Finally, if we want to show the current search term in the form input, we change it to 
    <form action="search" method="POST">
      Search Term: <input name="search_term" value="{{search}}"><br>
      <input type="submit" value="Enter Search Term">
    </form>

    and change the hello() route in 'main.py' to pass in "cat" as the default search term (along with whatever default feed data you are already displaying there):
    return template.render(feed=data, search="cat")

    Now we should have a working search form. Here is an example showing the results of a search for "cats"

    Setting up a question and answering it

    At this point you should have a working version of code that answers a question I set out (what news articles are available about this topic?), corresponding to the reference application http://jmankoff-rss.appspot.com/. The deeper thinking in this assignment requires that you select RSS data and display its contents in a way that corresponds to a question you have designed and answered. Note that the example code does not demonstrate this. This is the first step (and of course it will eventually be much more iterative) in any data pipeline: figuring out what data you need to answer your question. You should:
    1. Identify a question and ensure that it is clear to the person viewing your assignment what that question is (this will probably involve modifying index.html to show the question). 
    2. Identify data that you think can help answer that question (you are limited by what was introduced in this byte to data sources that are searchable and return an RSS feed; if you want to go beyond that, see below). Here are some possibilities: 
      • Learn how to use Bing to filter the search down to a specific type of page or topic. For example, to search just news about Obama on Bing, you can use http://www.bing.com/news/search?q=obama&format=rss
      • Use the search term to filter which items from an RSS feed are displayed, doing the filtering in your own code rather than on someone else's server
      • Feedspot (paid service) or Feedly (paid service, connects to IFTTT which could enable interesting things)
      • Trulia (real estate search -- I searched for 'Pittsburgh, PA')
      • Yahoo! finance (I searched for 'CAT') 
    3. Display the results on the web page.
    When this assignment was originally written, RSS was everywhere, and Yahoo! Pipes even let you turn non-RSS content into RSS. There is unfortunately much less available right now, so if you find good feed ideas, post them to the class! 

    Questions you should be able to answer when you hand this in

    What question does your application help to answer, and how does it let the end user answer that question? 

    Describe the flow of information from the end user (who enters a search term) until it is displayed back to the end user, in terms of the specific components relevant to the assignment (end user; jinja2; main.py; the RSS feed source). 

    Where is the URL for the working version of your assignment?

    Taking it further (optional things to think about):

    A working application based on this tutorial will be sufficient to get full credit for this byte. A reference version can be tried out at http://jmankoff-byte1.appspot.com/. You can get extra credit if you impress the grader. Below are some ideas:
    • Display the results on a map, or show summary statistics
    • Dynamically update your page as new information arrives
    • Playing with other types of actions. For example, Pocket (which you can hook up to IFTTT to do interesting things) produces an RSS feed. The link for your Pocket account will be something like 'https://getpocket.com/users/[yourpocketusername]/feed/'. Or you can use a form to trigger IFTTT actions, as in the next bullet.
    • Trigger IFTTT recipes from your webpage using the 'Maker' channel (use http://requestb.in/ to debug)
    • Something else you think of that we didn't :)

