Daniel Neill spoke in class on Thursday. He is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory, and he is also affiliated with the Machine Learning Department and the Robotics Institute. His research focuses on novel statistical and computational methods for discovering emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. This is a growing area of interest, and he mentioned the Data Science for Social Good summer program at the University of Chicago as an example of the broader interest in this sort of work. He also mentioned a joint PhD program between MLD and Heinz (I think) in machine learning and policy, a course on large-scale data analysis for policy, a course on event and pattern detection, and a seminar series.

Dr. Neill spoke about his research at the intersection of machine learning and policy, analyzing real-world city-scale data. He focused on two case studies: early detection of emerging disease outbreaks, and identification of upcoming crime hot spots.

In the medical domain, he discussed pattern detection by subset scan (looking for subsets of the data that indicate something is going on). The overall goal is not only to detect emerging events but also to pinpoint where and when they started and to generally characterize the event. An example is detecting increased sales of pediatric electrolytes north of Columbus (an indication of a gastrointestinal disease outbreak) using over-the-counter medication sales. This is accomplished by testing a large set of alternative hypotheses H_1 (constructed over data streams D, subsets of locations S, and time windows W) against the null hypothesis H_0 that no outbreak has occurred. By computing the likelihood ratio score F(D,S,W) = Pr(Data | H_1(D,S,W)) / Pr(Data | H_0) for each alternative H_1, it is possible to generate a p-value for whether to reject H_0.
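To make the scan concrete, here is a minimal sketch of scoring candidate subsets of locations with an expectation-based Poisson likelihood ratio. This is an assumed form of the score F and a brute-force enumeration for illustration only; the actual statistics, data streams, and search in Dr. Neill's work are richer.

```python
import math
from itertools import combinations

def poisson_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for a region:
    positive only when the observed count exceeds the expected baseline."""
    if count <= baseline or baseline <= 0:
        return 0.0
    return count * math.log(count / baseline) + baseline - count

def naive_subset_scan(counts, baselines, max_size=3):
    """Exhaustively score subsets of locations up to max_size,
    returning the best-scoring subset (the strongest H_1 candidate)."""
    n = len(counts)
    best_score, best_subset = 0.0, ()
    for k in range(1, max_size + 1):
        for subset in combinations(range(n), k):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            s = poisson_score(c, b)
            if s > best_score:
                best_score, best_subset = s, subset
    return best_subset, best_score
```

For example, with observed counts [10, 2, 12, 3] against baselines [4, 2, 4, 3], locations 0 and 2 are jointly anomalous and the scan picks out exactly that pair. A p-value would then come from comparing the best score to the distribution of best scores under H_0 (e.g., on historical or simulated data).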
This can be improved by generating simulated data sets to have more examples under H_0 (or, better yet, by using historical data). This initial work was extended to use topic modeling (specifically, LDA) to dynamically update the data streams D, thus detecting emerging spatial patterns of keywords; individual cases can then be classified to topics. In one test, they were able to detect outbreaks in 5.3 days vs. 10.9 days, more than twice as fast as the standard prodrome-based method.

Underlying this approach are enormous sets of candidate subsets H_1 that could require intensive computation to scan. However, Dr. Neill briefly described a result showing that, for a certain and very useful class of scoring statistics, the scan can be done in linear time (proportional to the number of data records, not the number of subsets). He did not go into enough depth for me to summarize it here, but the gist is that the records are sorted by a priority (which depends on the likelihood ratio statistic F(S, T)) and then scanned in such a way that the highest-scoring subsets are guaranteed to be found.

Dr. Neill then talked about wanting to identify any differentially affected subpopulations (e.g., by gender, age, risk behaviors, etc.). These are observed discrete-valued attributes, and we want to know which subset of values for each attribute matters. In other words, we optimize over each attribute (holding all the others fixed) and iterate until convergence. This gives a local maximum of the score function, and standard techniques (such as multiple random restarts) can be used to search for the global maximum.

Case study 2 focused on crime prediction in Chicago. By using the algorithms described above to detect emerging clusters, those clusters can be used as features for prediction. He described being able to predict up to one week ahead with high accuracy. The predictions are very fine-grained (by block and day) and can identify emerging hotspots, not just persistent high-crime neighborhoods. In addition, by using logistic regression, one can learn which leading-indicator types are most relevant for prediction, and other features such as weather can be included.
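The linear-time idea can be sketched as follows. This is a minimal illustration assuming an expectation-based Poisson score (my assumption, not necessarily the exact statistic used): records are sorted by the priority ratio of observed count to expected count, and then only the n prefixes of that ordering need to be scored, rather than all 2^n subsets, because for statistics with the right property the optimal subset is guaranteed to be one of those prefixes.

```python
import math

def poisson_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio statistic."""
    if count <= baseline or baseline <= 0:
        return 0.0
    return count * math.log(count / baseline) + baseline - count

def ltss_scan(counts, baselines):
    """Linear-time subset scanning sketch: sort records by priority
    c_i / b_i, then score only the n prefixes of the sorted order.
    For scoring functions satisfying the required property, the
    best-scoring subset overall is among these prefixes."""
    n = len(counts)
    order = sorted(range(n),
                   key=lambda i: counts[i] / baselines[i],
                   reverse=True)
    best_score, best_subset = 0.0, ()
    c = b = 0.0
    for k, i in enumerate(order, 1):
        c += counts[i]          # running count over the prefix
        b += baselines[i]       # running baseline over the prefix
        s = poisson_score(c, b)
        if s > best_score:
            best_score, best_subset = s, tuple(sorted(order[:k]))
    return best_subset, best_score
```

On the same toy data as before (counts [10, 2, 12, 3], baselines [4, 2, 4, 3]), this prefix scan recovers the same best subset as the exhaustive search, but with only n score evaluations after sorting.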
A fascinating lecture, and clearly just a first look at a range of things worth reading up on further.
