Class Calendar 2014

Lectures (Tentative Schedule, lectures in Blue are final)

Lectures will take place in NSH 1305
Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, create a (free) GitHub account, open the URL above, and click 'Fork'; from then on a copy will show up under your account. GitHub has a GUI client that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.

Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings. 

Tues 1/13 Introduction & Overview of Data Science Pipeline 

Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data. 
Homework: Byte 1 Assigned 

Thurs 1/15 Scoping Projects; Asking good Questions

Description: How do we decide what questions to ask of the data?

Tues 1/20 Structured vs Unstructured Data

Description: Discussion of properties of data, and a practical overview of XML, JSON, SQL, etc.
Optional Readings: Stonebraker & Hellerstein: What Goes Around Comes Around (a historical view of different classes of data modeling)
Homework: Byte 1 Due; Byte 2 Assigned 

Thurs 1/22 Acquiring Data

Description: Sampling issues; Pros and cons of different sources of data; Practical overview of APIs and OAuth

Tues 1/27 Understanding and Cleaning your Data

Description:  The four Cs (Correctness, Coherence, Completeness, and AcCountability); Case studies: Mouse data; Location Data
Readings: 

Thurs 1/29 Visualizing and Exploring your Data

Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
Readings: 
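
As a rough illustration of the stem-and-leaf plots named in the description above, here is a minimal sketch in Python. It assumes non-negative integers, with the tens digits as the stem and the ones digit as the leaf; the function name and the sample values are hypothetical and are not taken from the lecture materials.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return text rows of a stem-and-leaf plot (stem = tens, leaf = ones)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    rows = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows


sample = [12, 15, 21, 23, 23, 30, 47]  # hypothetical values
for row in stem_and_leaf(sample):
    print(row)
```

Each row keeps the individual values readable while still showing the shape of the distribution, which is the main appeal of this plot for small data sets.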

Tues 2/3: Information Visualization Overview & Introduction to D3 

Description: Overview of key concepts in information visualization and a brief discussion of D3.
Readings: 

Homework: Byte 2 Due; Byte 3 Assigned

Thursday 2/5 Information Visualization Case Study: StepGreen

Description: Discussion of four studies that influenced the design of the StepGreen.org website.
Readings: 

Tues 2/10 Guest Lecture by Daniel Neill

Description: Daniel Neill is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory. He is also associated with the Machine Learning Department and Robotics Institute.  His research is focused on novel statistical and computational methods for discovery of emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. Slides available on Blackboard.

Thurs 2/12 Mobile Data & Map Data

Description: Discussion of what can be accomplished with mobile data collection. Description of Byte 4 (Mobile Byte or Map Byte) 
Slides:[9 Mobile][Slides for Byte 4]
Reading: Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., ... & Campbell, A. T. (2014, September). Studentlife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3-14). ACM. [Video]
Homework: Byte 3 Due Friday 2/14; Byte 4 Assigned

Tues 2/17 Large-volume Geographic Data: [Guest lecture by Randy Sargent] 

Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. No Slides

Sites visited: 


Explorables site (includes links to some of the below)

Gigapan Obama Inauguration

Racial dot map of the U.S.

AirNow air quality exploration

Lights at Night

Oil and gas drilling in selected states of the U.S.   (limited to states we've scraped data from.  If you want to join the scraping effort let me know!)

A year of fires

Wind map of Earth

Time Machine

Whole-Earth Time-lapse (be sure to zoom out and into other spots)

EVA 3-d high-dimensional data exploration

 


Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google.  As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently-released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from big bang to present.

Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies.  Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.

Thurs 2/19 Overview of Statistical Hypothesis Testing

Description: Discussed the overall difference between frequency-based (frequentist) hypothesis testing and process-based (Bayesian) hypothesis testing, as well as limitations and tests for special situations (multiple comparisons, violated assumptions, and so on).
Homework: Byte 4 Due; First Project Assigned
Readings: 
Optional Readings:

Tues 2/24 T-Tests, Correlation and Regression

Description: Discussion of the t-test, the math and assumptions underlying it, and the process for using it. Discussion of correlation and regression, the math and assumptions underlying them, and how to use them. Also discussed limitations and paradoxes (such as Simpson's paradox).
Readings: 
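
To make the Simpson's paradox mentioned in the description above concrete, here is a small sketch in Python. The counts are illustrative numbers chosen for this sketch, not course data, and the group and treatment labels are hypothetical. Treatment A has the higher success rate inside each subgroup, yet B looks better once the subgroups are pooled, because A was applied mostly to the harder cases.

```python
# (successes, trials) per treatment within each subgroup; illustrative counts.
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials


def pooled(treatment):
    """Success rate for a treatment after combining all subgroups."""
    successes = sum(counts[g][treatment][0] for g in counts)
    trials = sum(counts[g][treatment][1] for g in counts)
    return rate(successes, trials)


for group, by_treatment in counts.items():
    a = rate(*by_treatment["A"])
    b = rate(*by_treatment["B"])
    print(f"{group}: A={a:.3f} B={b:.3f}")
print(f"pooled: A={pooled('A'):.3f} B={pooled('B'):.3f}")
```

The reversal comes from the unequal group sizes, which is why a correlation or regression fitted on pooled data can point the opposite way from every subgroup.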

Thurs 2/26 Finish Stats & Start Classification Basics & Metrics 

Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on. 
Readings: 
Slides: [Slides]
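
The point above about keeping a separate test set can be sketched as follows. This is a minimal standard-library illustration, not the procedure used in class; the function names, the 20% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out a fraction of them as a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


rows = list(range(10))  # stand-in for labeled examples
train, test = train_test_split(rows)
print(len(train), len(test))
```

The idea is to experiment (choose features, tune parameters) using only `train`, and to compute a metric such as `accuracy` on `test` once at the end, so the reported number is not inflated by decisions made while looking at the same data.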

Tues 3/3 Project I Meetings: 9-12:45

Description: Meetings with Project I Groups to discuss project progress and goals. 

Thurs 3/5 Classification Algorithms

Description: Discussion of Decision Trees, Naïve Bayes, and Regression
http://pdf.aminer.org/001/202/088/evaluating_learning_algorithms_composed_by_a_constructive_meta_learning_scheme.pdf
Homework: Byte 5 Assigned

3/9-3/13: Spring Break

Tues 3/17 Project 1 Poster Session

Thursday 3/19 Infrastructure for Big Data

Description: We talked about infrastructure issues for big data
[Slides]
Readings: 

Tues 3/24: Visualizing Big Data

Description: Discussion about Visualization of Big Data & Byte 6 option 1. 
Readings:

Thurs 3/26: Social Network Analytics 

Description: Discussion of social network analytics 
[Slides][Description of Byte 6 -- social networking]
Readings: 

HW: Byte 5 Due; Byte 6 Assigned

Tues 3/31: Guest Lecture: Afsaneh Doryab (Tracking Individual Behavior)

Description: 
[Slides]
Readings: None

Thurs 4/2: Final Project Planning (Individual Meetings 9-12)

Tues 4/7: Guest Lecture: Aaron Steinfeld (Tiramisu) 

HW: Byte 6 Due
Description: The focus of the lecture was on the issues faced when crowdsourcing. Slides available on Blackboard. 
Readings: Tomasic et al. Motivating Contribution in a Participatory Sensing System via Quid-Pro-Quo. To appear in CSCW 2014. [Blackboard] 
Bio: Aaron is an associate research professor in the Robotics Institute at Carnegie Mellon and the co-director of the Rehabilitation Engineering Research Center on Accessible Public Transportation. He earned his Ph.D., M.S. and B.S. in industrial & operations engineering from the University of Michigan (1999, 1994 and 1993, respectively) and completed a postdoctoral position at the University of California, Berkeley (2000). Steinfeld’s interest is focused around constrained user interfaces and operator assistance, predominantly in the realms of human-robot interaction, rehabilitation, transportation and intelligent systems. He is interested in how to enable timely and appropriate interaction when interfaces are restricted through design, tasks, the environment, time pressures, and/or user abilities. He works on the Tiramisu project. 

Tiramisu Transit is a crowd-powered transit information system developed by researchers to improve users' transit experiences and transit accessibility. With Tiramisu - literally Italian for "pick me up" - anyone waiting at a bus stop with a smartphone can see which buses or light rail vehicles are due to arrive next and, thanks to the signals from riders already aboard, get an idea of how long they have to wait. When a rider first activates the app, Tiramisu displays the nearest stops and a list of buses or light rail vehicles that are scheduled to arrive. The list includes arrival times, based either on historical data for that route or on real-time reports from riders. When the desired vehicle arrives, the user indicates the level of "fullness" and then presses a button, allowing their phone to share an ongoing GPS trace with the Tiramisu server.  Once aboard, the rider can use Tiramisu to find out which stop is next and to report problems, positive experiences and suggestions.

Thurs 4/9: Guest Lecture by Anind Dey

Description: Discussion of Intelligible machine learning
[Lecture Slides: TBD]
Readings [tentative]:

Tues 4/14 9-12: Final Project Planning [Sign up links]

Thurs 4/16: No Class (Carnival)

Tues 4/21: Guest Lecture by Nikola Banovic on Extracting Routines 

Thurs 4/23: Final Exam Review Session 

Tues 4/28: Final Project Presentations [select timeslots]

Thurs 4/30: Final Project Presentations [select timeslots]

Final Exam: Take Home (to be discussed further in class)
