Byte 5: Machine Learning and Statistics

  • Description: This assignment is designed to help you get comfortable applying Statistics and Machine Learning related skills.


For this assignment you will use the animal shelter data set used in Byte 2 and Byte 3. Your goal is to predict the outcome for an animal based on its characteristics (such as breed, gender, age, and so on). The learning goals for this assignment include:
  • Using libraries that can support statistics and machine learning rather than 'doing it yourself'
  • Using a statistical test to check for significant differences in things that might predict adoption
  • Using a visualization to sanity check the results you are finding
  • Developing a feature set to be used for prediction based
  • Dividing your data into an optimization and test set
  • Using Naive Bayes and Decision Trees to train a classifier
  • Using 10-fold cross validation to test your classifier
  • Understanding your results and comparing them to a baseline classifier
  • Iterating on your features to improve the results
  • Showing that your results are significantly better than baseline 
  • Developing a report documenting your findings

Preparatory Work

Install SciPy, and SciKit-Learn (here are some additional installation hints), which we will use for our machine learning. Note, I ran into a small problem with the installation failing due to having the very latest version of XCode and Mavericks. If you have this version, you may have to download the Mavericks version of Command Line Tools from late October 2013 and then follow the directions to ensure that your Command Line Tools take precedence over XCode  by deleting XCode. A very good alternative may be using iPython notebook, installation and set up is described in the Byte 4 map assignment. 

You will also need to import a bunch of libraries:
import httplib2
from apiclient.discovery import build
import urllib
import json
import csv
import matplotlib.pyplot as plt 
import numpy as np
import scipy.stats
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn import cross_validation
from sklearn import preprocessing
from sklearn.dummy import DummyClassifier
In addition, you will need to download the data set for this assignment. To do this we will use code very similar to Byte2 to save a json file to disk and only query the fusion table if the file cannot be found. We will create one file for dogs and one for cats, you will need to add your fusion tables keys to the code to access them

# open the data stored in a file called "data.json"
    fp = open("data/dogs.json")
    dogs = json.load(fp)
    fp = open("data/cats.json")
    cats = json.load(fp)

# but if that file does not exist, download the data from fusiontables
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE  AnimalType = 'DOG'"
    dogs = service.query().sql(sql=query).execute()
    fp = open("data/dogs.json", "w+")
    json.dump(dogs, fp)
    query = "SELECT * FROM " + TABLE_ID + " WHERE  AnimalType = 'CAT'"
    cats = service.query().sql(sql=query).execute()
    fp = open("data/cats.json", "w+")
    json.dump(cats, fp)

Checking for Significant Difference

The first step in this assignment will be to test some of the theories that came out of our earlier work Byte 3. In particular, we will test two theories: (1) that adoptions are more common among dogs than cats and (2) that adoptions are more common among young dogs than old dogs.  The source code provided with the assignment will walk you through the first of these tests but not the second.

For this type of data, which is categorical, instead of using a t-test we will need to use a Chi-Square test. Although we did not discuss this test in depth in class, we will not be implementing it, only using it. You should be comfortable picking tests like this based on how appropriate they are for your data, and the choice of test is laid out in the slide deck on introductory statistics (last slide). The assumptions for a Chi-Square test are very minimal and you should convince yourself that we meet them: Random sampling; population is 10x as large as its sample; variables are categorical; the expected frequency count for each variable is at least 5. The only one that we need to address is the last one -- in our data, some things are very rare (such as the outcome 'Found Report Expired'), and we can fix this just by combining them. 

The steps for doing a significance test, as discussed in class, are as follows. I've slightly modified them because the library does some things for us that we had to do by hand in the class example. 
1) Set an alpha value (we'll use p=.05)
2) Define our hypotheses:
    Hypothesis 1: cat_outcomes = dog_outcomes
    Hypothesis 2: cat_outcomes != dog_outcomes
3) Calculate the statistics (we'll use scipy.stats.chi2_contingency)
4) State decision rule: 
     Since chi2_contingency returns a p value, this is simply to compare the returned p value to alpha (.05)
5) State conclusion: Which hypothesis is rejected?

Of these steps, only step (3) requires any serious work. The text below will focus on that. 
In order to use scipy.stats.chi2_contingency we need to create an array containing the observed frequencies of cat outcomes and dog outcomes, something like this:
 Adopted  ...  Euthanized
 Dogs  #adopted  ...  #euthanized
 Cats  #adopted  ...  #euthanized

To fill in this table, we'll need to write some code that loops through the dog and cat data. Below I provide example code just for cats. First we set up the outcomes array: cat_outcomes = np.array([0.0,0,0,0,0,0]) Note that the first element is 0.0 so that everything is handled as a real number instead of integer. This will avoid certain types of rounding errors. 

Next we need to loop through all the rows and increment the correct outcome count. In order to avoid counts that are very low, we have selected a subset of the outcomes to count up, and we will put everything else in an 'Other' category. We store the outcomes we care about in an array called outcome_types:

outcome_types = ['Returned to Owner', 'Transferred to Rescue Group', 'Adopted', 'Foster', 'Euthanized']

and we compare to this array in our loop to decide what to increment. The try statement just catches outcomes not in our outcome_types array, in which case we increment the outcome for other (#5 in the cat_outcomes array).
rows = cats['rows'] # the actual data 
for cat in rows:
    # get the outcome stored for this cat
    value = cat[0]
        i = outcome_types.index(value)
        # one of outcome_types
        cat_outcomes[i] = cat_outcomes[i] + 1
    except ValueError:
        # everything else
        cat_outcomes[5] = cat_outcomes[5] + 1

Once we do this for both dogs and cats, we want to sanity check the data using a quick bar plot:
# plot the data to see what it looks like
fig, ax = plt.subplots()
index = np.arange(6)
bar_width = 0.35
opacity = 0.4
rects1 =, dog_outcomes, bar_width, alpha=opacity, color='b', label='Dogs')
rects2 =, cat_outcomes, bar_width, alpha=opacity, color='r', label='Cats')
plt.title('Number of animals by animal type and outcome type')
plt.xticks(index + bar_width, outcome_labels)

Which looks something like this:

Finally, we are ready to run our statistics. First we create the outcomes array which contains our observed values:
Observed = np.array([cat_outcomes, dog_outcomes])
Then we run the Chi-Squared test and capture the values (𝒳2, p, the number of degrees of freedom and the expected frequencies table generated as part of the test). For this analysis we care primarily about 𝒳2 and p: 
X_2, p, dof, expected= scipy.stats.chi2_contingency(Observed)
print "CHI-squared: ", X_2, "p = ", p

For your homework, you should run a Chi-Squared test testing whether young animals are more likely to be adopted than old animals. For this test, you can focus only on dogs if you like. You will need to decide what to do with all the animals for whom age is not known. You can put them in a separate category, delete them, or impute them (if you want to impute, you may want to (1) read the debate on this at numpy and (2) use the library for this provided by scikit-learn). In my reference implementation (which is not provided as part of this byte) I chose to put them in a separate category ('unknown'). Your code will need to check the age of each animal and then update an outcome number in the appropriate outcomes array. Aside from the fact that (if you follow my approach) you may have 3 outcome arrays instead of 2, this is very very similar to the previous test. 

Your hand in should include a chart produced by your code (similar to the one below) and a statement of the results giving the 𝒳2 and p values and stating whether the difference between the groups is significant or not. 

Phase 2: Machine learning

Now that we have some evidence that there things like Age and Animal Type may influence adoption outcomes, it seems worthwhile to try to predict outcome based on such things. To do so, we will need an array containing all the data for both dogs and cats, in random order. Because we will eventually want to divide this data up and explore it, we will save this to disk. This ensures that we don't re-randomize it each time we run the script and inadvertently cause the optimization set to contain different data each time (which would subvert it's whole purpose and bias the results).
    fp = open("data/random_dogs_and_cats.json")
    all_data = np.array(json.load(fp))

# but if that file does not exist, download the data from fusiontables
except IOError:
# make an array of all data about cats and dogs
    all_data = cats['rows'] + dogs['rows']
    # randomize it so the cats aren't all first
    fp = open("data/random_dogs_and-cats.json", "w+")
    json.dump(all_data, fp)
    all_data = np.array(all_data)

So all_data now contains the data which we will develop into an appropriate structure for our machine learning. The first thing we need to do with it is pull out a selection of the data that contains our features.

Develop a feature set

We could simply use the entire all_data array for our machine learning. However, one of the columns is our outcome measure (the class we are trying to predict) so at a minimum we need to remove that from all_data. In addition, there are features which may introduce bias (because they are closely associated with the outcome) or are too difficult to work with (because they have such a large number of possible values that they never or almost never repeat, thus not being very predictive). The initial set of features I suggest we use is: 
features = ['AnimalType', 'IntakeMonth', 'Breed', 'Age', 'Sex', 'SpayNeuter', 'Size', 'Color', 'IntakeType']

Now to select only that subset of the data, we will make use of numpy's array slicing and indexing capabilities, which will let us handle this task very efficiently in a single line of code. To do this, we need the column numbers of the columns which we will include. You can determine this by looking printing out the list of columns gotten using something like print cats['columns'] (or dogs['columns']) and then counting the columns to find the index of each feature listed above. Alternatively, you could write a loop to do this automatically (as in the sample code provided for this byte). 

Given a list of columns that we want to keep, called use_data in the source code provided, it is a simple matter to slice and dice all_data. We will need an X variable (the features) and a matching y variable (the classes). Each element of y indicates the correct classification for the corresponding row of X. out_index is just the column index for the class variable.
# Now we create a new array that only has the columns we care about in it
X = all_data[:, use_data]
# This is the column with the outcome values
y = all_data[:, out_index]

Prepare the class variable

While y is approximately what we want, too many classes will make it very difficult to achieve any reasonable classification accuracy. I suggest narrowing it down to just a few classes. You can limit it more, but a good start is to have a class for Euthanized, one for animals that are placed in a home/rescue center/returned to owner, and a class for everything else.

Outcomes = ["Euth.", "Home", "Other"]

It is very simple to update the labels in y -- simply use 
y[y=="Found Report Expired"] = "Other"
... and so on for each of the classes that you wish to replace. I put everything except 'Adopted' and 'Returned to Owner' under 'Other'.

Dividing the data into an optimization set and a test set

The last step before we begin to actually try to classify our data is to create appropriate sets. In an ideal world, we would pull out a special 'validation' set that is used only at the end to generate final numbers, an 'optimization' set which is used for most of the hard work of trying to identify the best algorithm, parameters, features, and so on, and a 'train' set which is used to prepare the classifier eventually to be tested on the 'validation' set. However, for this assignment we will simply use an 'optimization' set and a 'train/test' set for reporting final numbers. 

We will put 20% of the data in the optimization set. Since the data is in a random order we can simply pick the first 20%, using numpy's slicing and dicing facilities again:
nrows = len(all_data)
percent = len(X)/20
X_opt = X[:percent, :]
y_opt = y[:percent]
X_rest = X[percent:, :]
y_rest = y[percent:]

Using Orange

I will not be able to walk through a detailed example in Orange, but for those of you interested in using it, at this point in the tutorial you would switch over. I have updated the tutorial to save out two CSV files (one for an optimization set and one for the rest (validation) set of the data) in your 'data/' directory. I have also created a sample Orange script (which you would need to update to point at *your* .csv files instead of mine). Below is an image of it. I have added labels and arrows showing things you may want to double click on to explore this further. Orange has some very nice visualization features and I encourage everyone in the class to at least play with it (you can use the 'Orange-example.ows' file included with the assignment).

Rather than walk you through every part of it, I encourage you to double click on different circles to see more of what is going on inside them. One of the great features of Orange is that it is scriptable. It is accessible both directly from python if you install it using pip (similar to how we access scikit-learn below) as well as scriptable using python from within Orange. It is likely that this byte will replace all references to scikit-learn with Orange next year. In the meantime, enjoy what I have been able to make accessible :). 

Prepare the features for use in machine learning

For the machine learning algorithms we'd like to use, the features must be numerical. However all of the features listed above are categorical (textual) in nature. Luckily, the SciKit-Learn library which we will use for our machine learning provides a preprocessing tool for handling this situation. First, however, we need a list of all of the labels that show up across every feature. Luckily the '' program that we wrote for byte2. Rather than re-running that, I have placed all of the related '.csv' files in the 'data' directory in byte4 along with the '.json' files created throughout this tutorial. Similarly to selecting features (columns) you can either construct a list of all of the names found in every relevant '.csv' file by hand, or automatically. My code automatically creates a labels list. The preprocessing library we want to use is called LabelEncoder and we use it as follows:
le = preprocessing.LabelEncoder()
X = le.transform(X)

The new X will have numbers instead of strings inside it (but not to worry, we can use le to convert any of those numbers back into the appropriate string using le.inverse_transform

Training a classifier

We are finally ready to train and test a classifier. I will demonstrate with the optimization set, which is the set you should use to try to improve your numbers. We will compare the accuracy of two classifiers: A baseline Dummy classifier which simply picks the most frequent class (across the entire data set) no matter what; and a Naïve Bayes classifier:
dc = DummyClassifier(strategy='most_frequent',random_state=0)
gnb = GaussianNB()

Because we want to use 10-fold cross validation, we will use sklearn.cross_validation.StratifiedKFold, configured to divide the data into 10 folds. This module works by passing in both the data and the classifier. For example:
skf = cross_validation.StratifiedKFold(y_opt, 10)
This can be used to loop through a series of test and train sets as follows:
# loop through the folds
for train, test in skf:
    # extract the train and test sets
    X_train, X_test = X_opt[train], X_opt[test]
    y_train, y_test = y_opt[train], y_opt[test]
    # train the classifiers
    dc =, y_train)
    gnb =, y_train)

    # test the classifiers
    dc_pred = dc.predict(X_test)
    gnb_pred = gnb.predict(X_test)
You will then need to decide how you want to evaluate the effectiveness of these classifiers. You can calculate accuracy or an combination of things such as precision, recall and F score. Exploring these metrics (which were discussed in class) and other things such as the confusion matrix for your classification will help you to understand what types of errors are occurring and to iterate on things like features. Kappa is unfortunately not supported (you would have to implement it yourself). Metrics are supported by Scikit-learn using code such as:
dc_accuracy = metrics.accuracy_score(y_test, dc_pred)
In the sample code provided with the assignment, accuracy, precision and recall are calculated and printed for each fold of the 10 fold cross validation. In addition, an array of  accuracy scores is collected as follows:
    gnb_acc_scores = gnb_acc_scores + [gnb_accuracy]
    dc_acc_scores = dc_acc_scores + [dc_accuracy]

... so that we can print out the mean accuracy across all 10 folds after we are done.
print "============================================="
print " Results of optimization "
print "============================================="
print "Dummy Mean accuracy: ", np.mean(dc_acc_scores)
print "Naive Bayes Mean accuracy: ", np.mean(gnb_acc_scores)

Iterating and checking for significant difference

At this point you have a very first pass at a classifier. What you do next depends on your iteration, exploring the data, and test functions that try different things out. All of this will need to take place inside your optimization set. Example things you may want to try: Switching classifiers to something else discussed in class (such as decision trees); adding more features and/or modifying features (for example, can you come up with a replacement for 'Breed' that has fewer different entries but still captures the important facets of that feature?); and modifying the parameters of the classifier you are training (often using a loop and training it repeatedly with different parameter values to identify the value leading to the maximum score). Once you have decided on the best features, classifier, and parameters, you can switch to X_rest and y_rest and generate metrics for the true accuracy, precision, and recall of your work. Some great examples of how you might calculate and graph key metrics can be found on the scikit-learn examples page.

One thing that is very important is to look at the impact of different features on the classification outcome, when that is possible. The information about features differs from classifier to classifier (for example, in decision trees, the tree itself is key to interpreting the results). For Naive Bayes, there is no easy way to understand the relative influence of different features. However with a small number of features it is possible to try leaving different features out and see how this impacts accuracy. The features that have the biggest negative impact on accuracy are most important. For Decision Trees, you can export the entire decision tree as follows (you will probably want to embed this in your cross validation loop along with everything else of course). 

dt = tree.DecisionTreeClassifier(max_depth=5)
dt =, y_train)
tree.export_graphviz( dt, out_file = '', feature_names=features)
To view it you will can use an online .dot file viewer such as or (based on google appspot even :). You can also download GraphVizNote that the features are numerical and it is not entirely straightforward to convert from those back to an understanding of the original text features. However this will still give you a sense of how different features are being combined to solve the problem of fitting the tree. 

In order to judge whether your innovations you will need to test whether the difference in accuracy (or F score, or any other metric) is significant. Because these numbers are ratio, not categorical, we will use a regular t-test instead of a Chi-Squared test. Because the two samples are related we will use the scipy support for ttest_rel:

diff = np.mean(dc_acc_scores) - np.mean(gnb_acc_scores)
t, prob = scipy.stats.ttest_rel(dc_acc_scores, gnb_acc_scores)

When you run this, you should see a significant increase in accuracy between the two classifiers.

Hand In Expectations

The tutorial above gets you to the point where you can test for difference between dogs and cats, and generate a Naïve Bayes classifier that is pretty naïve, but still better than the dummy classifier. You will be asked the following questions at hand in:

Is there a significant difference in outcomes at different ages:
  • What is the hypothesis H0?
  • Show a chart comparing outcomes for each group
  • State your results giving the 𝒳2 and p values and stating whether H0 is rejected or not. 
Are decision trees any more accurate at classifying outcomes than Naive Bayes:
  • State the accuracy of Naive Bayes when applied to X_rest using 10fold cross validation
  • State the accuracy of Decision Trees when applied to X_rest using 10fold cross validation
  • State whether these accuracies are significantly different (give a p value)
  • Include a picture of the decision tree and talk about what you learned from it
What was the best classifier you were able to create:
  • What features did you create or modify?
  • What were the accuracy, precision and recall of the result on your 'rest' (non optimization) set?