Byte 6: Big Data v1

Description: Your final product will visualize and reason about public Vital Birth Statistics 
Due Date: 4/3

Overview

To circumvent some of the difficulties (and costs!) of setting up a large data set, and because the course is primarily focusing on analytics, we will be using Google's BigQuery for this assignment. We'll make use of the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). A range interesting questions can be explored in that data set. There is also a great tutorial on creating a graph of some data from this set that may be helpful to read. Our work will focus on visualizing some basic statistics and developing a classifier. We will be 'closing the loop' in this assignment, meaning that once the classifier is trained, we will let people use it within Google appspot. Learning goals for this assignment include:
  • Working with a massive data set
  • Becoming familiar with BigQuery
  • Applying machine learning algorithms that can function with very large data 
  • Thinking about how to close the loop and actually use a classifier we produce 

The Tutorial

Before we can begin using BigQuery, you will need to create a [yourname]-byte6 application in Google Appspot using the same methods we've used in the other bytes, and turn on the BigData API in the Google Developer's Console:


You will also need to install the Google Cloud SDK on your local machine. You will also need to add the following libraries to your appspot application: 'apiclient', 'httplib2', 'oauth2client' and 'uritemplate'. The simplest way to do this is to copy those directories over from [yourname]-byte3. Your app.yaml file should reference the following handlers and libraries (so that we can visualize results and so on)
handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico
- url: /stylesheets
  static_dir: stylesheets
- url: /d3
  static_dir: d3
- url: /data
  static_dir: data
  application_readable: true
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: "2.5.2"
- name: MySQLdb
  version: "latest"
- name: jinja2
  version: latest
- name: numpy
  version: latest
- name: django
  version: latest

Accessing the data set at the command line

To circumvent some of the difficulties (and costs!) of setting up a large data set in BigQuery (which limits how much data you can upload per day, and charges for data sets), we'll be using the public Vital Birth Statistics data set that BigQuery provides (taken from the CDC under the following restrictions). We can see that a range interesting questions can be explored.

A quick way to try out different queries of interest is using bq shell at the command line. Once you install the Google Cloud SDK you should have access to the program bq. In my case, I have to start a new shell in Terminal and then run two scripts Google provides inside the installation directory for the Cloud SDK ('completion.bash.inc' and 'path.bash.inc'). Once I run those scripts, I can type at the prompt and see the following:

$ bq shell

Welcome to BigQuery! (Type help for more information.)

jmankoff-byte6> 

Now I can enter a query such as:

jmankoff-byte6>SELECT state, count(*) FROM [publicdata:samples.natality] GROUP by state

and the response looks something like the following table (elided for brevity). Note the large entry under NULL -- we never get away from data set problems, do we!

+-------+----------+

| state |   f0_    |

+-------+----------+

| AL    |  1957392 |

| AK      324183 |

| AZ    |  1837630 |

...

| WI    |  2294354 |

| WY      218735 |

| NULL  | 16998008 |

+-------+----------+

Creating a Cloud Storage Area of Your Own

If you wish to make use of Google's cloud database on your own (something that was not required in Byte 4 but is if you want to use further storage in future projects including this one), you will need to enable billing with Google. Simply click on the Cloud Storage Browser (shown at left) in the Google Developers Console 
... and you will see the following message:
You will need to click 'enable billing' and enter your credit card information. If you are concerned about costs, know that all the data stored in Byte 4's mobile data database has only cost me $30 so far this semester. The details on Google's billing scheme for pricing for Cloud SQL and pricing for BigQuery differ, but we are using public BigQuery data so we only need to worry about Cloud SQL in this Byte. 

Once you enable billing, you can create a new instance where we can store some data.
Let's call it 'birthstats':



Accessing Data from Within Google Appspot

The command prompt is a useful place to experiment, but what we really want to do is access the same data from within Google Appspot. We will need to make use of OAuth2.0 to access data from within Google Appspot. For this, we will first need to know certain information about the application you have created. Start by clicking on [yourname]-byte6 in Google Developer Console. 

To get the credentials for OAuth2 click on "APIs & auth" and then 'Credentials'. From there you can click 'download JSON' to download a client secrets file. 

You should also double check under permissions that the following permissions are enabled:

Note that the top email starts with a number. This is your 'project number' (you can also see it if you click on the 'Overview' tab). Make sure to record it in 'main.py' under BigQuery API settings:
# BigQuery API Settings
PROJECT_NUMBER        = 'XXXXXXXXXXXX' 

Next we will need to specify the correct credentials and set up the authorization. Since we are only using a public data set, we want the readonly credentials mentioned on Google's BigQuery Authentication Documentation:
credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery.readonly')
service     = build("bigquery", "v2", http=http)
now assuming we have things set up for jinja2, in get(self) we can make a query and send the results to the 'index.html' template:

query     = {'query':'SELECT state, count(*) FROM [{0}] GROUP by state;'.format(_DATABASE_NAME), 'timeoutMs':10000}

# service is the oauth2 setup that we created above
jobRunner = service.jobs()
# project number is the project number you should have defined in your app
reply     = jobRunner.query(projectId=_PROJECT_NUMBER,body=query).execute()
# similar to the google SQL work we did in byte4, the stuff we care about 
# is in rows
rows = reply[u'rows']
states = {}
for row in rows:
    name = row[u'f'][0][u'v']
    num = row[u'f'][1][u'v']
    # make sure we have a string in every case
    if name == None: name = u'None'
    state = {'state':unicode.encode(name)], 'total':int(num)}
    states = states + [state]

context = {"states": states}
# and render the response
self.render_response('index.html', context)

Finally we can show the table using jinja2:
<!DOCTYPE html>
<html>
<head>
  <link href="stylesheets/d3.css" rel="stylesheet" type="text/css">
  <meta charset="UTF-8">
  <title>Data Pipeline Project Byte 6 -- Big Data</title>
</head>
 <body>
   <h1>Data Pipeline Project Byte 6</h1>
   <h2>Query Results</h2>
   <ul>
   {% for state in states %}
      <li>{{state['state']}} : {{state['total']}}</li>
   {% endfor %}
 </body>
</html>

An excerpt from the results:


Debugging Hint

Unfortunately, the oauth2 queries will only work on the Appspot server. To avoid having to do all of our testing after uploading to the server each time, we can capture the query result from the server log and save it in 'data/states.json' (provided with your source code). Then we can modify our application to check whether or not we are running on the server and either run the original query or just load the 'states.json' file depending on our status. Eventually when we go interactive this will not work, but it will help tremendously at this stage. I have created a function run_query(query_string, filename), which decides whether to run the query or load a file depending on where the code is running: 
 if (os.getenv('SERVER_SOFTWARE') and
     os.getenv('SERVER_SOFTWARE').startswith('Google App Engine/')):
     query = {'query':query_string, 'timeoutMs':10000
     jobRunner = service.jobs()
     return jobRunner.query(projectID=_PROJECT_NUMBER, body=query).execute()
 else:
     # open the data stored in a file called filename
     try:
        fp = open(filename)
        return simplejson.load(fp)
     except IOError:
        logging.info("failed to load states.json file")
        return None

I had some trouble creating 'states.json' by hand so eventually I pasted it into a variable using python at the command line (e.g. data = <paste>) and then typed: 
>>> fp = open("data/states.json", 'w+')
>>> json.dump(data,fp)

Some additional debugging ideas: I find it useful to debug my python JSON parsing at a command line prompt locally (by grabbing the JSON produced by queries from the server log after uploading my application for the first time). Equally useful is to debug your SQL queries using the shell mentioned above. This will save you many iterations (and uploads when working on the server) and make it much easier to identify and fix problems. For example, I copied the query results into a 'data' variable at my python prompt and then went through the following sequence of debugging exercises:
>>> rows = data[u'rows']
>>> row = rows[0]
>>> row
{u'f': [{u'v': u'AL'}, {u'v': u'1957392'}]}
>>> state = row[u'f'][0][u'v']
>>> num = row[u'f'][1][u'v']
>>> "state %s num %s " % state, num
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "state %s num %s " % state num
  File "<stdin>", line 1
    "state %s num %s " % state num
                                 ^
SyntaxError: invalid syntax
>>> "state %s " % state 
u'state AL '
>>> unicode.decode(state)
u'AL'
>>> unicode.encode(state)
'AL'
>>> num = obj[u'f'][1][u'v']
>>> int(num)
1957392

and this last bit was after my web page [yourname]-byte6.appspot.com returned the following error and I traced it to the very last state returned by looking at my log printout of each state and number on the server log and realized that the empty state values were returning some sort of special value instead of a string:


>>> if state = None: state = u'None'
  File "<stdin>", line 1
    if state = None: state = u'None'
              ^
SyntaxError: invalid syntax
>>> if state == None: state = u'None'
... 
>>> state
u'AL'
>>> state = None
>>> if state == None: state = u'None'
... 
>>> state
u'None'

Homework Part 1: Visualizing your data

At this point, you have all of the information you need in order to query and visualize historical data about births in the U.S. since the 1970s. Rather than walk you through things you already know how to do, I will simply encourage you to play around and discover some interesting new things. You can read up more on the big query api for python, and explore the sample queries to get some ideas for what to do with the data.  You don't absolutely have to stick with the birth data (other public data sets are available), however the second half of this assignment is designed around using machine learning on the same data, so you may find it valuable to do so. 

Machine Learning with Big Data

This is a non-required part of the assignment. I spent so long debugging I decided it was better not to put it on your shoulders in a week! However I still think the information below could be of use to those who might want to go in this direction for the final project.

There is no direct way to run our algorithm on the BigQuery table we've been using. We cannot download the whole table (or we would not need to use BigQuery) and we cannot run our code on Google's BigQuery servers or write it in the BigQuery language. So instead, we will need to download a sample to train on. We will need to decide how large this sample should be, but something like 1,000 records might seem reasonable to start with (in fact, when I was debugging, I was working with 10s to 100s of records). 

Since we want to have high quality training data we may want to limit our results to those with no null values. Since we want to have a high quality sample, we will need to randomly select rows of data. The following query meets both criterion, and you can use it in your assignment. The HASH(NOW()) call is key to this -- each time it is run, NOW() is slightly different, and we also try to ensure that it is adjusted by parameters specific to the current row. By ordering our data set based on that hash, and then taking a certain number of rows, we get random rows. However this is slow and on your own data a simpler approach would be to add a column to your table containing random numbers. 

query_string = "SELECT born_alive_alive, born_alive_dead, born_dead, child_race, alcohol_use, ever_born, father_age, father_race, mother_age, mother_birth_state, mother_married, plurality, year, weight_pounds, apgar_1min, HASH(NOW()-source_year-gestation_weeks*10000+year-child_race*10-mother_age*100-record_weight*10) as hash_val FROM [{0}] WHERE born_alive_alive IS NOT NULL AND born_alive_dead IS NOT NULL AND born_dead IS NOT NULL AND child_race IS NOT NULL AND  alcohol_use IS NOT NULL AND ever_born IS NOT NULL AND father_age IS NOT NULL AND father_race IS NOT NULL AND mother_age IS NOT NULL AND mother_birth_state IS NOT NULL AND mother_married IS NOT NULL AND plurality IS NOT NULL AND year IS NOT NULL AND weight_pounds IS NOT NULL AND apgar_1min IS NOT NULL  AND apgar_1min != 99 ORDER BY hash_val LIMIT {1}".format(_DATABASE_NAME, num_records)

Since this query selects a random subset of the data, we can run it again to get testing data. Ideally we'd want to ensure there was no overlap, but since we cannot easily add an index to Google's public data, we're going to assume that the data set is large enough that this is not highly likely. 

The normal logistic regression algorithm loops repeatedly over the same data set. This is an appropriate way to train logistic regression using gradient descent. However, a better approach for big data is to select a different random subset each time we loop over the data (i.e. to train over small batches). This does require additional queries but has the nice property that we do not have to load the entire data set into memory in order to complete the learning process. It also works nicely in situations where parallelization is possible. This is a variant of stochastic gradient descent, which is supported by mainstream toolkits (including scikit-learn). A great tutorial on stochastic gradient descent with python code that you can use, can be found on deeplearning.net

Note that one of the problems I ran into is that the randomization method I've chosen for this data is very slow because it has to compute a hash function over the entire data set in order to pick a random subset. If you were working with non-public data you would simply compute the hash a single time and then be able to run very fast queries over the hashed data to select random subsets. 

For those who are curious, my goal was to have you implement logistic regression in numpy. Note that the heart of logistic regression is a simple loop in which we repeatedly recalculate the weights (up to a maximum number of times) and check whether we have reached a stop condition:

for i in range(max_iterations):
  # run the next iteration of gradient descent
  weights, stop_condition_met = self.gradient_descent(X, y, weights, eta=10.0)
  if stop_condition_met: break

The gradient_descent algorithm calculates two key values: the error (error), and a new weights, and using those a stop condition. Finally, when we are ready to test the final set of weights that is produced we need a way to calculate probabilities, which is where the sigmoid function comes in:

logistic = lambda s: 1.0 / (1.0 + numpy.exp(-s))

We can use this to calculate the predictions for a test set if we want to calculate things like accuracy
pred_y = logistic(weights.dot(test_X.T))

Or we can simply make a single prediction.

In some ways the specifics of the implementation don't matter -- we are simply fitting to data and then predicting using what we learned. In other ways they matter quite a bit, however. Let's look at our data and how we connect it to this algorithm.

Handin Expectations

The handin expectations for this assignment are based around (1) your working [yourname]-byte4 and (2) your understanding of the concepts.
1) Your byte should show some exploration of the data set (beyond the table already included with the assignment).
2)  You should read about how to upload data into Google BigQuery using MapReduce and be prepared to answer a question about this. You may find this article on getting started with google Big Query useful.
4) The sample byte has a very limited interactive experience. Wouldn't it be nice if you could press a 'train' button and then test it out by entering your own data (perhaps using drop down menus) and seeing what would be predicted? You don't have to implement any of this, but you should be prepared to explain what is learned by the regression how to use that result to make predictions about arbitrary new information provided by the user. 

Going Further (non required source of fun ideas)

Google BigQuery provides access to several public data sets. One is GitHub's commits (including the comments). This blog post looks at the emotions in those comments.

There are lots of other data sets one could upload to a service like BigQuery. One example data set that could be a lot of fun to work with is described in this paper on Meme tracking in the global news cycle. The meme dataset can be downloaded from snap and a description of the structure is there as well. It's quite simple. Each file includes: The URL of the document (labeled P) and the time of the post (labeled T) and one or more key phrases  (labeled Q for quote) as well as zero or more hyperlinks to other documents in the data set (labeled L). 
Before the data can be loaded into BigQuery, it needs to be formatted according to the BigQuery guidelines. One key requirement is that it be denormalized, luckily already a property of this data set. However, it is not in JSON format, so the first step is to write a quick script to convert it.

How could you improve on the user experience? What visualizations might you add? How could you you improve on the speed?

If you want to use Google's prediction API, you can do quite a bit even without scikit-learn (including online learning of regression). To use it, however, you have to move subsets of the data into regular cloud storage from the BigQuery storage.  
Comments