
Some sample Byte 3 projects

posted Feb 24, 2014, 9:06 AM by Jen Mankoff   [ updated Feb 25, 2014, 3:41 AM ]
What sort of things seem to influence intake and outcomes for animals in a shelter? Here are some conclusions that visualizations provided by students completing Byte 4 suggest. 

(Yanan Jian)
(Runyun Zhang)
(Abdel Bourai)

These visualizations still leave us with many questions: for some of them it is unclear which animals were included in the analysis, or how unknown values were dealt with. Additionally, they do not provide any statistical analysis of the conclusions I suggest. Finally, data cleaning is not clearly described on these web pages, and in some cases it could be of great value (for example, combining similar breeds rather than allowing them to be split out could change the results of the first visualization). Nonetheless, these examples help to illustrate the value of visualization for helping us begin to ask (and answer) questions about our data.

I also want to share one of the nicest and most complete answers about how to prepare the data. It is a great summary of how to go about not only identifying problems with data but also cleaning them. I made a few small edits to the presentation, but the words are all those of one of your fellow students (Harsh), a CIT Masters student (INI).

1. Surveying the data

I first surveyed what kind of data was missing. Was the data missing because of a relation to some other field? My conclusion was that here the data are missing completely at random (MCAR) [1]. When we say that data are missing completely at random, we mean that the probability that an observation (Xi) is missing is unrelated to the value of Xi or to the value of any other variable. Looking into the nature of the missing data helps us choose how to clean it.
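One informal way to carry out this kind of survey is to compare how often a field is missing across the values of another field. The sketch below is only an illustration: the records, the field names, and the set of markers treated as "missing" are all made up and are not taken from the shelter data. Similar rates across groups are consistent with MCAR but do not prove it.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"animal": "dog", "age": "<6mo", "outcome": "Adopted"},
    {"animal": "dog", "age": "", "outcome": "Adopted"},
    {"animal": "cat", "age": "Unspecified", "outcome": "Euthanized"},
    {"animal": "cat", "age": ">7yr", "outcome": "Adopted"},
]

# Assumption: these markers all mean "value is missing".
MISSING = {"", "Unspecified", None}


def missing_rate_by_group(rows, target, group_by):
    """Share of rows whose `target` field is missing, per value of `group_by`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row[group_by]
        totals[key] += 1
        if row[target] in MISSING:
            missing[key] += 1
    return {key: missing[key] / totals[key] for key in totals}


# Is a missing age related to the kind of animal?
rates = missing_rate_by_group(records, target="age", group_by="animal")
print(rates)
```

If the missing rate differed sharply between groups, that would be a hint that the data are not MCAR and that simply deleting rows could bias the results.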

2. Available options:
  • Listwise Deletion (Simplest Approach): This means we simply omit those cases with missing data and run our analyses on the remainder of the data. In this case many of the Unspecified Outcome values were blank. Also, many of the ages were not specified.
    Advantage: Under the assumption that the data are missing completely at random, it results in unbiased parameter estimates.
    Disadvantage: Listwise deletion substantially decreases the available sample size, which means a loss of power; if the data are not MCAR, it also introduces a bias.
  • Hot Deck Imputation: The process of replacing missing data with randomly chosen values, typically drawn from other, similar records in the same data set. For example, I would substitute a random age bin for a missing age.
    My conclusion: This could have been used on my data set; however, the number of unspecified values was large, and I did not want to rely on the randomness of my browser (the java function is pseudo-random at best). That would have meant statistical implications and a bias in my results.
  • Mean Substitution: The idea here is to substitute a mean for the missing data. This is usually used for numeric values, and the values here were non-numeric (for example, the ages "<6mo" and "6 months to 1 year", and missing Outcome values such as blank outcomes, No show, etc.), so I could not use it directly. I could have substituted numbers for the text values and then calculated the mean, but I could not verify whether that is an acceptable method of substitution.
  • Regression Imputation / Machine Learning: Ideally you would observe the relations between variables, build a statistical model, and then predict the missing value with some probability, much as in machine learning. For example, you might observe that dogs older than 7 years most frequently have the outcome "Euthanized", so you could guess that any dog older than 7 years with a missing outcome was euthanized. This is actually a really good approach, but I did not have enough time to write code and test it out for this.
There are other techniques, but these are what I looked at and considered primarily given the short amount of time.
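To make the options above concrete, here is a minimal sketch of each one on a tiny made-up data set. The rows, the function names, and the use of `None` as the missing marker are assumptions for illustration. Two substitutions are worth naming plainly: because age bins are categorical, the mode stands in for the mean, and a simple "most frequent outcome for this age bin" lookup stands in for a real regression model.

```python
import random
from collections import Counter

# Hypothetical rows of (age_bin, outcome); None marks a missing value.
rows = [
    ("<6mo", "Adopted"),
    ("<6mo", "Adopted"),
    (">7yr", "Euthanized"),
    (">7yr", "Euthanized"),
    (">7yr", None),
    (None, "Adopted"),
]


def listwise_delete(data):
    """Keep only the rows in which no field is missing."""
    return [row for row in data if None not in row]


def hot_deck_age(data, rng):
    """Replace each missing age with an age drawn at random from observed rows."""
    donors = [age for age, _ in data if age is not None]
    return [(age if age is not None else rng.choice(donors), outcome)
            for age, outcome in data]


def mode_substitute_age(data):
    """Categorical stand-in for mean substitution: fill with the most common age bin."""
    observed = Counter(age for age, _ in data if age is not None)
    mode = observed.most_common(1)[0][0]
    return [(age if age is not None else mode, outcome) for age, outcome in data]


def conditional_mode_outcome(data):
    """Crude stand-in for regression imputation: fill a missing outcome with
    the most frequent outcome among complete rows in the same age bin."""
    by_age = {}
    for age, outcome in data:
        if age is not None and outcome is not None:
            by_age.setdefault(age, Counter())[outcome] += 1
    filled = []
    for age, outcome in data:
        if outcome is None and age in by_age:
            outcome = by_age[age].most_common(1)[0][0]
        filled.append((age, outcome))
    return filled


# A seeded generator makes the random draws repeatable between runs.
rng = random.Random(0)
print(listwise_delete(rows))
print(hot_deck_age(rows, rng))
print(mode_substitute_age(rows))
print(conditional_mode_outcome(rows))
```

Seeding the generator does not make the substituted values any more "correct"; it only makes the analysis reproducible, which helps when comparing the imputed results against listwise deletion.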

Conclusion: The best approach was Listwise Deletion (deleting entries), since the data seem to be missing completely at random and the deleted records form only a small fraction of the total data set. I also removed the Unspecified column from Age, since I felt this data was not helpful to the user given the question: what is the relationship between age and outcome?
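As a rough sketch of what this final step might look like, the code below drops incomplete records, reports what fraction was dropped, and then tabulates outcomes within each age bin. The records and the missing-value markers are invented for illustration; the real data set and its field values may differ.

```python
from collections import Counter, defaultdict

# Hypothetical (age_bin, outcome) pairs; "" and "Unspecified" mark missing values.
records = [
    ("<6mo", "Adopted"),
    ("<6mo", "Adopted"),
    ("<6mo", "Euthanized"),
    (">7yr", "Euthanized"),
    ("Unspecified", "Adopted"),
    (">7yr", ""),
]
MISSING = {"", "Unspecified"}

# Listwise deletion: keep a record only if both fields are present.
complete = [(age, outcome) for age, outcome in records
            if age not in MISSING and outcome not in MISSING]

# Worth checking before trusting the result: how much data was thrown away?
dropped_fraction = 1 - len(complete) / len(records)

# Count outcomes within each age bin.
counts = defaultdict(Counter)
for age, outcome in complete:
    counts[age][outcome] += 1

# Convert the counts to shares so age bins of different sizes are comparable.
shares = {
    age: {outcome: n / sum(c.values()) for outcome, n in c.items()}
    for age, c in counts.items()
}

print(f"dropped {dropped_fraction:.0%} of records")
for age, by_outcome in shares.items():
    print(age, by_outcome)
```

If the dropped fraction turns out to be large, the "small fraction" assumption behind listwise deletion no longer holds and one of the imputation options may deserve a second look.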

References:

[1] This answer contains elements referenced from the website: "Treatment of Missing Data": http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

[2] http://en.wikipedia.org/wiki/Imputation_(statistics)
