
Byte 5 Results

posted Apr 7, 2015, 1:37 PM by Jen Mankoff   [ updated Apr 7, 2015, 1:37 PM ]
Hi all,

A brief summary of the Byte 5 results. Overall, really nice job, all of you!

First off, not everyone was specific about which subset of the data they analyzed (only cats, only dogs, or cats and dogs) or how missing values were handled (eliminate those rows? impute values? etc.). Those choices alone produced quite different results, both in accuracy and in the structure of your decision trees, which is something to consider going forward. Since I mostly don't know which choices each of you made, I present the results below as if these differences didn't exist; for the same reason, you may have trouble replicating each other's results.
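To make the two missing-value strategies concrete, here is a minimal pandas sketch contrasting them. The column names and values are illustrative stand-ins, not the actual Byte 5 field names:

```python
# Sketch: two common ways to handle missing values before training.
# "Size" and "Outcome" are illustrative columns, not the real schema.
import pandas as pd

df = pd.DataFrame({
    "Size": ["SMALL", None, "MED", "MED", None],
    "Outcome": ["homed", "homed", "euthanized", "other", "homed"],
})

# Option 1: eliminate rows with a missing value.
dropped = df.dropna(subset=["Size"])

# Option 2: impute with the most frequent value (the mode).
imputed = df.copy()
imputed["Size"] = imputed["Size"].fillna(df["Size"].mode()[0])

print(len(dropped))               # 3 rows survive
print(imputed["Size"].tolist())   # ['SMALL', 'MED', 'MED', 'MED', 'MED']
```

Either choice is defensible, but stating which one you made is what lets someone else reproduce your tree.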

Interesting things people tried with the features:
  • Categorizing breed into pure, mix, combo & unknown (plus many other variants on simplifying breed across various assignments)
  • Changing IntakeMonth to the average ambient temperature in Louisville, KY for the given month, to account for periodicity and to test whether weather plays a role in adoption
  • Reducing 'Size' to 3 categories (Small, Medium, Large), plus other variations on this idea
  • Reducing 'Color' to 10 categories (Black, Blue, Pattern, Mix, etc.)
  • Adding 'Season' as a new feature with 4 categories (spring, summer, autumn, winter); it turns out accuracy rises when season is included
  • Reducing 'Age' to 3 categories (Young, Old, Unknown)
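Two of the transformations above can be sketched as plain lookup tables. The season and size groupings below follow the descriptions in this post; the record layout itself is a hypothetical stand-in:

```python
# Sketch: collapsing IntakeMonth into a Season feature and Size into
# three buckets, following the groupings described in this post.

SEASONS = {
    "Feb": "Spring", "Mar": "Spring", "Apr": "Spring",
    "May": "Summer", "Jun": "Summer", "Jul": "Summer",
    "Aug": "Autumn", "Sep": "Autumn", "Oct": "Autumn",
    "Nov": "Winter", "Dec": "Winter", "Jan": "Winter",
}

SIZE_MAP = {
    "PUPPY": "SMALL", "TOY": "SMALL", "SMALL": "SMALL", "KITTN": "SMALL",
    "MED": "MED",
    "LARGE": "LARGE", "X-LRG": "LARGE",
}

def simplify(record):
    """Return a copy of one animal record with the simplified features."""
    out = dict(record)
    out["Season"] = SEASONS[record["IntakeMonth"]]
    # Empty or unknown Size values fall back to the most frequent value.
    out["Size"] = SIZE_MAP.get(record["Size"], "MED")
    return out

print(simplify({"IntakeMonth": "Sep", "Size": "KITTN"}))
# {'IntakeMonth': 'Sep', 'Size': 'SMALL', 'Season': 'Autumn'}
```

Fewer, more populated categories give the decision tree fewer near-empty branches to overfit on, which is why several of these simplifications improved accuracy.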
Interesting analyses people did:
  • Showing a confusion matrix to explore which outcomes caused the biggest accuracy problems (e.g., 'other' in one case)
  • Many people tried removing features. While this can sometimes have value, I prefer to see specific reasons for removing a feature, and many of you did this to the exclusion of other options such as pruning the decision tree or engineering new, more informative features
  • Several people tried an SVM classifier and found that it beat out the others
  • A 'Dummy' classifier achieved accuracies of around 50%
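Both the confusion matrix and the dummy baseline are one-liners in scikit-learn. The tiny synthetic labels below are stand-ins for the real shelter outcomes, chosen so the majority class makes up half the data and the "most frequent" dummy lands at exactly 50%:

```python
# Sketch: a DummyClassifier baseline plus a confusion matrix, on toy
# stand-in labels (5 homed, 3 euthanized, 2 other).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array(["homed"] * 5 + ["euthanized"] * 3 + ["other"] * 2)
X = np.zeros((10, 1))  # features are irrelevant to a dummy baseline

dummy = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = dummy.predict(X)  # always guesses "homed"

print(accuracy_score(y_true, y_pred))  # 0.5

# Rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=["euthanized", "homed", "other"])
print(cm)
```

Reading down the confusion matrix's columns shows where a classifier dumps its mistakes; here everything lands in the "homed" column, which is exactly the pattern that made 'other' the biggest accuracy problem in one submission.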
A sampling of best machine learning scores you achieved:

| Classifier | Problem | Accuracy | Precision | Recall | F-Score | Author | Notes |
|---|---|---|---|---|---|---|---|
| Decision Tree | 3-way | .85 | .85 | .85 | | Chin Yang Oh | Features were Age, AnimalType, SpayNeuter, Size, IntakeType; max depth was 7. |
| Decision Tree | 3-way | | .65 / .67 / .54 | .85 / .76 / .26 | .74 / .71 / .35 | Anna Kasunic | Created several new features similar to those described above in this post. Scores are shown per category, in order: euthanized, homed, other. |
| Decision Tree | 3-way | .66 | Euth. .71 / Home .69 / Other .52 | Euth. .81 / Home .79 / Other .33 | | Rohit Thekkanal | As the height of the tree was increased from 3 to 6, the number of features used increased from 3 to 9, which improved accuracy by 5%. Putting breed into 3 categories (EMPTY, NON MIX, MIX) also improved accuracy. |
| Decision Tree | 3-way | .6672 | Euth. .75 / Home .62 / Other .63 | Euth. .84 / Home .89 / Other .24 | | Carol Cheng | Modified "IntakeMonth" and "Size": Feb., Mar., and Apr. became Spring; May, June, and July became Summer; Aug., Sep., and Oct. became Autumn; and Nov., Dec., and Jan. became Winter. Also reduced "Size" to three values: "PUPPY", "TOY", "SMALL", and "KITTN" became "SMALL"; "X-LRG" became "LARGE"; and empty values became the most frequent value, "MED". |
| Decision Tree | Binary | .8 | .8 | .8 | | Ankit Dhingra | Used ['AnimalType', 'Age', 'SpayNeuter', 'Size', 'IntakeType']; Age and Size were mapped to a smaller set. |
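For anyone who wants to reproduce the shape of these experiments, here is a minimal sketch of the pipeline behind the scores above: fit a depth-limited decision tree and report accuracy, precision, and recall. The data here is random stand-in data, not the shelter dataset, so the printed scores are meaningless; the real runs used features like Age, AnimalType, SpayNeuter, Size, and IntakeType:

```python
# Sketch: depth-limited decision tree with the standard score report.
# X and y are random placeholders for the encoded shelter features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 3))  # 3 already-encoded categorical features
y = rng.choice(["euthanized", "homed", "other"], size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=7 matches the depth reported in the first row of the table.
clf = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_tr, y_tr)

prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="weighted", zero_division=0)
print(f"accuracy={clf.score(X_te, y_te):.2f} "
      f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Note that `average="weighted"` collapses the per-class scores into one number; pass `average=None` instead to get the per-category breakdowns shown in several rows of the table.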