Byte 2 Results: Data Cleaning and Exploration

posted Feb 4, 2015, 7:27 PM by Jen Mankoff
Hi folks,

I spent a little time today looking over what you handed in for Byte 2. Fascinating stuff. Some really nice websites as well as insights. 

Not about cats and dogs: A couple of you asked to look at alternative data sets. Here's what was explored

Websites about cats and dogs

Some sample expectations/surprises about cats and dogs (my comments in italics)
  • Euthanization
    • 'If animals were euthanized, they did not receive a placement. This changed my assumptions of how I understood the metro services to work, and basically opened up a whole lot of other questions for me. For example, I'm wondering now if Animal Services is targeting specific areas in searching for strays because more strays are concentrated in those areas, or because it is convenient for them for other reasons.'
    • 'Dogs 1 year or younger were the only group of dogs that had an outcome other than euthanized as the most prevalent outcome. (This is especially jarring considering that when these dogs are considered as part of the whole population, euthanization is still the most common outcome, meaning it must be extremely normal among other dogs).'
    • 'My expectation was that, the mortality rates in cats and dogs will be similar to the rates as mentioned in the paper titled, "Birth and death rate estimates of cats and dogs in U.S. households and related factors"*. The paper states that crude death rates in cats are 8.3 cat deaths/100 cats and 7.9 dog deaths/100 dogs.... [In the Kentucky data set,] the average mortality rates in cats are 41 cats/100 cats and in dogs are 22 dogs/100 dogs.' Interesting question: Is the paper measuring only natural deaths, or also euthanization due to lack of adoptability?
    • 'I had recently read a paper where someone said that darker colored dogs were euthanized more than the lighter one. I really wanted to challenge that assumption but was shocked [to see it was true].' Interesting question: did you normalize by the number of dark dogs brought into the shelter? Are dark dogs brought in more (if they are) because they are harder to keep? Possible answer found in another student's analysis: ' colored pets were dominant in the data.' 
    • 'I expected pitbulls would get placed in homes less frequently (which proved true), and dogs with good reputations like Labradors and golden retrievers would have the highest placement.... Smaller breeds [of dogs] seemed to get placed most frequently.'
    • 'There were around 2000 cats euthanized were under the subtype of "medical" while dogs under the same subtype were just 600. That's where cats got twice [the euthanization rate] than dogs.'
    • 'I assume the animals surrendered by owners would be a good health state compared to stray animals. Hence the proportion of owner surrendered animals euthanized would be lesser than stray animals euthanized. After analyzing the data, I saw that the percentage of animals euthanized for owner surrendered dogs was 48% while for stray dogs it was 42%. This was a very surprising finding. I have attached the pie chart and tables to show this point'
  • Time of year: Does it affect outcomes? 
    • One student found no. 
    • Another found that 'December, November and October have lesser data compared to other months.' 
    • Another student made it the focus of their website, and perhaps that sheds better light on the question?
    • 'One expectation I had about the data is about the chances of a cat to be adopted depending on when it was taken during the year. I first thought that whether a cat arrived in the center in January or, say, September had no effect on its chances to be adopted. Yet, this assumption proved to be wrong. As the figure attached shows, the percentage of animals that were adopted depends largely on the month of the year it arrived in the center. 25% of the animals that arrived in December, for example, were adopted, against only 8% for those that arrived in September.'
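Several of the questions above (dark versus light dogs, owner surrender versus stray, month of arrival) come down to comparing a rate within each group rather than raw counts. Here is a minimal Python sketch of that normalization. The field names `Color` and `OutcomeType`, the value `EUTH`, and the sample rows are all made up for illustration; they are not taken from the Kentucky data set, so adjust them to match your copy.

```python
from collections import Counter

def outcome_rate_by_group(rows, group_key, outcome_key, outcome_value):
    """Return {group: fraction of that group's rows with the given outcome}.

    Dividing by each group's own intake count is the normalization step:
    a group can have the most euthanized animals simply because it is
    the largest group brought in.
    """
    totals = Counter()
    hits = Counter()
    for row in rows:
        group = row.get(group_key) or "UNKNOWN"  # keep missing values visible
        totals[group] += 1
        if row.get(outcome_key) == outcome_value:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical sample rows (invented values, not real shelter data).
sample = [
    {"Color": "BLACK", "OutcomeType": "EUTH"},
    {"Color": "BLACK", "OutcomeType": "ADOPTION"},
    {"Color": "BLACK", "OutcomeType": "EUTH"},
    {"Color": "BLACK", "OutcomeType": "ADOPTION"},
    {"Color": "WHITE", "OutcomeType": "EUTH"},
    {"Color": "WHITE", "OutcomeType": "ADOPTION"},
]

rates = outcome_rate_by_group(sample, "Color", "OutcomeType", "EUTH")
for color, rate in sorted(rates.items()):
    print(f"{color}: {rate:.0%} euthanized")
```

In this made-up sample, twice as many black dogs as white dogs are euthanized, yet the rate is the same for both groups, which is exactly the distinction the normalization question is getting at. Passing a month-of-intake field as `group_key` and an adoption value as `outcome_value` would give the per-month comparison in the same way.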
Some sample data cleaning problems about cats and dogs
  • Missing values: Many of you reported on these. Empty values represent ~9% of the ZipWhereFound column, ~26% of the Age column, ~53% of the Breed column, ~9% of the Longitude column, ~35% of the ZipWherePlaced column, ~9% of the Latitude column, ~22% of the OutcomeSubtype column, and <1% of each of the other columns. Interestingly, one reason students discovered for the missing ZipWherePlaced values was that euthanized dogs don't get a value for that...
  • Ages were estimates. How accurate?
  • Outcomes:
    • Transfer: "Transferred to Rescue Group" is not really an outcome but more of an intermediary step, so we don't have more data about those dogs. Maybe they were transferred to a no-kill shelter? Or maybe they were transferred somewhere else that was also overcrowded; there's no way to tell.
    • Outcome Subtype: 'One issue I noticed is the lack of coherence in the column of Outcome subtype. There are entries such as "Behavior", also Heartworm which may refer to the common illness on dogs, but there are also web/TV Metro/At vet etc that seem like information distributing channels. It appears that there are multiple entry types that should not be categorized all under the same category of Outcome Subtype. This incoherent factor may contribute to ill-defined interpretation of the outcome of the data set.'
  • Color: 'someone with no knowledge about the keys to the database might have a tough time understanding which one amongst BR and BRN stands for BROWN and which one for BRINDLE, both of which are mentioned in the database as independent colors. The abbreviations should be well described and I am not even sure if they are consistently used throughout.' Another student found that some dogs are labeled 'BLUE'! Yet another found almost 40 different values for color!
  • Script error? Perhaps my summarization script is missing some data? 'The total number of dogs rescued was 9635. However, when I add up the number of dogs in each category, I get 9619. Hence, the data does not add up.'
  • Breed: 
    • "I expected the attribute 'Breed' in the data would give further information of the animal. The breed of the animal is important because it can provide a lot of information like the characteristics of the animal and the habits of the animal. ... First, the combination of breed name and 'MIX' makes the attribute difficult to be classified into groups. For example, if I try to draw a bar chart for this attribute, most groups will only have one instance in it, which makes the users of the data hard to figure out how many different breeds inside the data. A better idea would be recording the dominant breed of the animal in the attribute 'Breed' and then adding a new attribute called 'Mixed' which contains two labels 'Yes' and 'No'."
    • "Second, the order of the breeds across labels is not consistent, for example, 'DOMESTIC SH - SIAMESE' and 'SIAMESE - DOMESTIC SH' are basically the same label, however, due to the difference of the order, the two will be considered different labels."
    • Cat breeds missing at 97%! 
  • Location: 'I expected that I'd see a lot of variety in the "zip where found" data, and a lot of clustering around the "zip where placed" axis, since I assumed there were maybe a couple of shelter locations in the metro area. I actually found the opposite. The points were fairly centralized along the "where found" axis, and there was more dispersion for the "where placed" axis.' Explanation provided by another student: 'I expected the location data (latitude, longitude) to give more specificity than the zipcode information (Zip Where Found), but this was not the case. All the datapoints with a particular zipcode share the same latitude and longitude. This can be seen in the map as multiple placemarks overlap each other.'
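Two of the cleaning problems above lend themselves to a quick check in code: the share of empty values per column, and breed labels that differ only in word order or in an added 'MIX'. A rough Python sketch follows. It assumes the data has been loaded as a list of dictionaries (for example with `csv.DictReader`), that empty cells are empty strings, that 'MIX' appears as a separate word, and that breed parts are separated by ' - ' as in the 'DOMESTIC SH - SIAMESE' example. The helper names, the sample rows and the zip codes in them are hypothetical.

```python
from collections import Counter

def missing_share(rows, columns):
    """Return {column: fraction of rows whose value is empty or absent}."""
    shares = {}
    for column in columns:
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        shares[column] = empty / len(rows) if rows else 0.0
    return shares

def normalize_breed(label):
    """Split a breed label into (canonical_breed, is_mix).

    Assumption: 'MIX' is a standalone word and breeds are joined by ' - '.
    Sorting the parts makes 'A - B' and 'B - A' map to the same label.
    """
    words = (label or "").upper().split()
    is_mix = "MIX" in words
    text = " ".join(word for word in words if word != "MIX")
    parts = sorted(part for part in text.split(" - ") if part)
    return " - ".join(parts), is_mix

# Hypothetical sample rows (invented values, not real shelter data).
sample = [
    {"Breed": "DOMESTIC SH - SIAMESE", "Age": "2", "ZipWherePlaced": ""},
    {"Breed": "SIAMESE - DOMESTIC SH", "Age": "", "ZipWherePlaced": "40202"},
    {"Breed": "BEAGLE MIX", "Age": "1", "ZipWherePlaced": ""},
    {"Breed": "", "Age": "", "ZipWherePlaced": "40203"},
]

shares = missing_share(sample, ["Breed", "Age", "ZipWherePlaced"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} empty")

# Count breeds after normalization, skipping rows with no breed at all.
breed_counts = Counter(
    normalize_breed(row["Breed"])[0] for row in sample if row["Breed"]
)
print(breed_counts.most_common())
```

With the sample rows, the two Siamese labels collapse into a single group of two, and 'BEAGLE MIX' becomes the breed 'BEAGLE' with the mix flag set, which is close to the separate 'Mixed' attribute one student suggested. This does not pick a dominant breed for you; it only removes the ordering problem.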