Cleaning up Animal Shelter Data

posted Feb 6, 2014, 5:56 AM by Jen Mankoff   [ updated Feb 25, 2014, 12:19 PM ]
I thought for Byte 2 it might be fun to summarize what we all learned together. Here are some of those things:

Expectations about the data:
  • When dogs are abandoned: Some of you expected that dogs would be abandoned young (before bonding)
  • Outcomes: Some were surprised by the number of dogs euthanized. 
  • Animal arrivals are fairly uniformly distributed, as one might expect. However, there is a slight trend of more animals arriving in the shelter in summer than in winter months. Thoughts on why?
  • Many of the animals have names. Why? (I suspect the shelter is giving the animals names to help them be adopted)
  • Many young dogs are euthanized. Why? Reasons are missing from the data, and breed is hard to compare because of issues mentioned below.
  • Some expected outcomes to be similar for cats and dogs, but cats are much more likely to be euthanized.
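If you want to check the seasonal trend and the cat/dog difference yourself, here is a minimal sketch using only the Python standard library. The column names (AnimalType, IntakeDate, OutcomeType), the "EUTH" outcome label, and the sample rows are all made up for illustration; the real shelter export may name and code these differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the real column names and values may differ.
SAMPLE_ROWS = [
    {"AnimalType": "DOG", "IntakeDate": "2013-01-15", "OutcomeType": "ADOPTION"},
    {"AnimalType": "DOG", "IntakeDate": "2013-07-02", "OutcomeType": "EUTH"},
    {"AnimalType": "CAT", "IntakeDate": "2013-07-20", "OutcomeType": "EUTH"},
    {"AnimalType": "CAT", "IntakeDate": "2013-08-05", "OutcomeType": "EUTH"},
    {"AnimalType": "CAT", "IntakeDate": "2013-12-11", "OutcomeType": "ADOPTION"},
    {"AnimalType": "DOG", "IntakeDate": "2013-06-30", "OutcomeType": "RETURN"},
]


def arrivals_by_month(rows):
    """Count intakes per calendar month (1 = January ... 12 = December)."""
    counts = Counter()
    for row in rows:
        month = datetime.strptime(row["IntakeDate"], "%Y-%m-%d").month
        counts[month] += 1
    return counts


def euthanasia_rate_by_type(rows, euthanasia_label="EUTH"):
    """Return the fraction of rows per animal type whose outcome is euthanasia."""
    totals = Counter()
    euthanized = Counter()
    for row in rows:
        totals[row["AnimalType"]] += 1
        if row["OutcomeType"] == euthanasia_label:
            euthanized[row["AnimalType"]] += 1
    return {animal: euthanized[animal] / totals[animal] for animal in totals}


if __name__ == "__main__":
    print(dict(arrivals_by_month(SAMPLE_ROWS)))
    print(euthanasia_rate_by_type(SAMPLE_ROWS))
```

With the real data you would read the rows with csv.DictReader instead of the inline list, and compare the summer counts against the winter ones.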
Things that seem right:
  • Ages are not numerical -- because they are estimates. Some questions about why these groupings (I would argue it has to do with expected adoptability). Also, it was pointed out that age could be ambiguous as it is described (though I would argue the shelter staff know how to interpret it).
  • Outcome dates are greater than Intake dates
  • Zip code found has a 99% overlap with the Louisville area
  • There is no particular relationship between zip found and zip placed (domain knowledge says this makes sense to me)
  • Some animalIDs are repeated. Based on other properties of these rows, it seems that some animals leave and return multiple times.
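Two of the checks above (outcome dates not preceding intake dates, and repeated animal IDs) are easy to script. This is only a sketch: the column names (AnimalID, IntakeDate, OutcomeDate), the ISO date format, and the sample rows are assumptions for illustration, not the actual layout of the shelter file.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows; real column names and date formats may differ.
SAMPLE_ROWS = [
    {"AnimalID": "A001", "IntakeDate": "2013-03-01", "OutcomeDate": "2013-03-10"},
    {"AnimalID": "A002", "IntakeDate": "2013-04-05", "OutcomeDate": "2013-04-05"},
    {"AnimalID": "A001", "IntakeDate": "2013-06-12", "OutcomeDate": "2013-06-20"},
    {"AnimalID": "A003", "IntakeDate": "2013-05-09", "OutcomeDate": "2013-05-01"},
]


def rows_with_outcome_before_intake(rows):
    """Return rows whose outcome date is earlier than the intake date.

    Same-day outcomes are treated as valid here; that is a choice of this
    sketch, not something the data dictionary tells us.
    """
    suspicious = []
    for row in rows:
        intake = date.fromisoformat(row["IntakeDate"])
        outcome = date.fromisoformat(row["OutcomeDate"])
        if outcome < intake:
            suspicious.append(row)
    return suspicious


def repeated_ids(rows):
    """Return {animal_id: number_of_rows} for IDs that appear more than once."""
    counts = Counter(row["AnimalID"] for row in rows)
    return {animal_id: n for animal_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    print(rows_with_outcome_before_intake(SAMPLE_ROWS))
    print(repeated_ids(SAMPLE_ROWS))
```

For the repeated IDs, looking at whether the intake/outcome intervals of the same animal overlap would help separate real return visits from duplicate rows.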
Errors and Problems in the data:
  • 37% of the age data is missing!
  • Spatial data is missing many values (over 2000 in the case of longitude)
  • Some zip codes are missing (esp. for zip found); some are invalid possibly due to typographical errors.
  • Outcome subtype, which can help us understand what goes on in the shelter, is missing in over 2000 rows.
  • Some euthanized animals also have a zip placed value 
  • Some columns are redundant (could introduce errors)
  • There are over 500 distinct breeds, some of which are redundant; and 75% of records have no value (across cats and dogs?)
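Several of the problems above come down to counting missing or malformed values. Here is a rough sketch of how one might measure them. The column names (Age, Breed, ZipFound) and the sample rows are hypothetical, and the "exactly five digits" zip rule is a simplifying assumption that only catches obvious typos.

```python
# Hypothetical sample rows; real column names and values may differ.
SAMPLE_ROWS = [
    {"Age": "", "Breed": "Labrador Retriever", "ZipFound": "40202"},
    {"Age": "1 year", "Breed": "labrador retriever ", "ZipFound": ""},
    {"Age": "", "Breed": "", "ZipFound": "4020"},
    {"Age": "6 months", "Breed": "Beagle", "ZipFound": "40211"},
]


def is_missing(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return not (value or "").strip()


def missing_fraction(rows, column):
    """Fraction of rows in which the given column has no value."""
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return missing / len(rows)


def distinct_breeds(rows):
    """Distinct breed names after trimming whitespace and ignoring case."""
    return {
        row["Breed"].strip().upper()
        for row in rows
        if not is_missing(row.get("Breed"))
    }


def invalid_zips(rows, column="ZipFound"):
    """Non-missing zip values that are not exactly five digits."""
    invalid = []
    for row in rows:
        value = (row.get(column) or "").strip()
        if value and not (len(value) == 5 and value.isdigit()):
            invalid.append(value)
    return invalid


if __name__ == "__main__":
    for name in ("Age", "Breed", "ZipFound"):
        print(name, missing_fraction(SAMPLE_ROWS, name))
    print(distinct_breeds(SAMPLE_ROWS))
    print(invalid_zips(SAMPLE_ROWS))
```

Normalizing case and whitespace only removes the most trivial breed duplicates; spelling variants and mixed-breed labels would still need manual grouping.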
Some visualizations to explore that highlight a few of these points: