
Class Calendar 2016

Contents

  1. Lectures (Tentative Schedule, lectures in Blue are final)
  2. Tues 1/12 Introduction & Overview of Data Science Pipeline
  3. Thurs 1/14 Scoping Projects; Asking good Questions & Selecting Data Sources
  4. Tues 1/19 Structured vs Unstructured Data
  5. Thurs 1/21 Exploring Your Data
  6. Tues 1/26 Understanding and Cleaning Your Data
  7. Thurs 1/28 Guest Lecture: Polo Chau: Data Cleaning and Integration
  8. Thursday 2/11: Information Visualization Overview
  9. Tuesday 2/16: Perception and Information Visualization; Practical Introduction to D3
  10. Tuesday 2/23 The Role of Narrative in Visualization
  11. Tues 2/2 Acquiring Data From People
  12. Thursday 2/4: Acquiring Data From Mobile Devices & The Web
  13. Tuesday 2/9: Issues With Big Data Quality and Sampling
  14. Thursday 2/18: Guest Lecture: Chinmay Kulkarni
  15. Thursday 2/25: Visualizing Big Data
  16. Tuesday 3/1 Byte 4 In Class Work Day
  17. Thursday 3/3 Guest Lecture: Randy Sargent: Large Volume Geographic Data
  18. 3/7-3/11: Spring Break: Homework: Byte 4 due on Tuesday 3/8
  19. Tuesday 3/15: Causality, Bayesian Inference & Statistical Hypothesis Testing
  20. Thursday 3/17: Guest Lecture
  21. Tuesday 3/22: Finish Causality, further discussion of Bayesian Inference; Introduction of Regression
  22. Thursday 3/24: Classification Basic & Metrics
  23. Tuesday 3/29: More Classification Algorithms & Metrics
  24. Thursday 3/31: Project Part I Presentations
  25. Tuesday 4/5: More Classification Algorithms
  26. Readings: Introduction to the algorithms
  27. Thursday 4/7: Classification of Big Data / Revisiting Big Query / Usable ML
  28. Tues 4/12 9-12: Project Checkin [Sign up links]
  29. ATTENDANCE REQUIRED
  30. Thurs 4/14: No Class (Carnival)
  31. Tues 4/19: Final Exam Review Session
  32. Thurs 4/21: Project Checkin [Sign up links]
  33. ATTENDANCE REQUIRED
  34. Tues 4/26: Final Project Presentations [select timeslots]
  35. ATTENDANCE REQUIRED
  36. Thurs 4/28: Final Project Presentations [select timeslots]
  37. ATTENDANCE REQUIRED
  38. Final Exam: Take Home (to be discussed further in class)
  39. http://v.isits.in/; Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, Julie A. Kientz, Understanding Quantified-Selfers’ Practices in Collecting and Exploring Personal Data. To Appear (CHI 2014); Mining the Quantified Self: Personal Knowledge Discovery as a Challenge for Data Science. Fawcett Tom. Big Data. January 2016, 3(4): 249-266. doi:10.1089/big.2015.0049. http://online.liebertpub.com/doi/full/10.1089/big.2015.0049; Optional: Take a look at the following 'show and tell' talks: http://quantifiedself.com/2013/07/mark-wilson-on-synthesizing-data/ and http://quantifiedself.com/2013/12/chris-bartley-understanding-chronic-fatigue/

Lectures (Tentative Schedule, lectures in Blue are final)

Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, get yourself a (free) GitHub account, open the URL above, and click 'fork'; from then on the repository will show up in your account. GitHub has a GUI that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.

Tues 1/12 Introduction & Overview of Data Science Pipeline

Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data. 
Homework: Byte 1 Assigned; Byte 3 setup Assignment
Slides: [github]

Thurs 1/14 Scoping Projects; Asking good Questions & Selecting Data Sources

Slides: [github]

Tues 1/19 Structured vs Unstructured Data

Description: Discussion of properties of data; practical overview of XML/JSON/SQL/etc.; practical overview of APIs and OAuth
Readings
        Required: Stonebraker & Hellerstein: What Goes Around Comes Around  Pages 1-2 (sections I and II); Section V (The Entity-Relationship Era); IX (The Object-Relational Era); X (Semi-Structured Data) (a historical view of different classes of data modeling)
Slides: [github]

Wednesday 1/20: Byte 1 Due; Byte 3 install due; 

Thurs 1/21 Exploring Your Data

Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
Readings: 
Homework: Byte 2 Assigned 
Slides: [github]
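
As a rough illustration of the stem & leaf plots named in the description above, the sketch below builds one from a short list of integers. The function name `stem_and_leaf`, the `leaf_unit` parameter, and the sample values are assumptions made up for this example; they are not taken from the course materials.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Return the lines of a stem-and-leaf plot for non-negative integers.

    Each value is split into a stem (value // leaf_unit) and a leaf
    (value % leaf_unit); leaves that share a stem are listed in sorted order.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // leaf_unit].append(v % leaf_unit)

    lines = []
    # Walk every stem in range so gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


# Hypothetical sample data, invented for this example.
sample = [12, 15, 21, 24, 24, 38, 41, 47, 47, 49]
for line in stem_and_leaf(sample):
    print(line)
```

Reading the rows sideways gives a quick histogram that still shows every individual value, which is why the two plot types are often introduced together.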

Friday 1/22: Byte 1 Peer Grading Due

Tues 1/26 Understanding and Cleaning Your Data

Description:  The four Cs (Correctness, Coherence, Completeness, and AcCountability); Practical overview of survey question design issues

Readings: 
Slides: [github]

Thurs 1/28 Guest Lecture: Polo Chau: Data Cleaning and Integration

Suggested "reading":
* I will mention (and play a demo video of) a tool called OpenRefine, a free, open source tool for both cleaning and integration.

* If we have time, I hope to also show a video of "Wrangler" from Jeff Heer's group.
That paper is good to read regardless.

Slides: See Blackboard

Thursday 2/11: Information Visualization Overview

Description: Overview of key concepts in information visualization; Testing visualizations; StepGreen Case Study
Readings: 
Reading question: Post a link or screenshot of a data visualization, and analyze how it addresses Tufte's six principles. 
Slides: [github]

Tuesday 2/16: Perception and Information Visualization; Practical Introduction to D3

Description: Overview of human perceptual factors affecting information visualization and a brief discussion of D3 
Readings:
Possible reading question: Although pattern detection is typically simpler with a graphical interface, are we missing out on interesting numerical relationships by allowing both the machine and the human analyst to focus only on what they do "best"?
Slides: [github][github]

Tuesday 2/23 The Role of Narrative in Visualization 

Readings:
Optional: 
Potential Reading Questions:
  • In what ways can factors external to the visualization itself, such as internalized knowledge and conventions at the individual and community level, interact with the rhetorical strategies used in a narrative visualization to influence interpretation?
  • How do communicative and explorative rhetorical strategies effectively work together in a narrative visualization?
  • Section 2.3 in the Hullman paper mentions how subtle changes in framing may influence, or otherwise solicit a particular opinion from, the user. Can you find any examples of visualizations that do this?
Homework: Byte 3 due;  Byte 4 (visualizing your data) assigned
Slides: [github]

Tues 2/2 Acquiring Data From People

Description: Case study of data cleaning; Discussion of data sampling issues
Slides: [github]

Thursday 2/4: Acquiring Data From Mobile Devices & The Web

Description: Discussion of what can be accomplished with mobile data collection and other forms of sensed data. Description of Byte 3 (Mobile Byte) 
Reading: ProactiveTasks: the short of mobile device use sessions. Nikola Banovic, Christina Brant, Jennifer Mankoff, and Anind K. Dey. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). ACM, New York, NY, USA, 243-252. PDF
Slides: [github][github]

Tuesday 2/9: Issues With Big Data Quality and Sampling

Description: Infrastructure issues for big data; Sampling and Quality 
Readings: 
Homework: Byte 2 due; Byte 3 Mobile Assigned
Slides: [github]

Thursday 2/18: Guest Lecture: Chinmay Kulkarni 

Readings: None
Title: Advanced common-sense: Making sense of data that is invisible, ugly, or incomplete

This talk will be informal, and based on my own experience making sense of data from large-scale education applications. Through this talk, I want to remind you of data issues that are not emphasized in traditional data-processing pipelines; e.g. how do you estimate data that is hard to get? How do you sanity check data or run simple experiments that validate your hypotheses? Much of this is "common sense," and should be combined with other techniques learned in class.

Thursday 2/25: Visualizing Big Data

Description: Discussion about Visualization of Big Data 
Readings:
Slides: [github]

Tuesday 3/1 Byte 4 In Class Work Day

Description: In-class work day/office hours
Readings: 

Thursday 3/3 Guest Lecture: Randy Sargent: Large Volume Geographic Data

Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. 

Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google.  As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently-released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from the Big Bang to the present.

Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies.  Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.

3/7-3/11: Spring Break: Homework: Byte 4 due on Tuesday 3/8

Tuesday 3/15: Causality, Bayesian Inference & Statistical Hypothesis Testing

Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, and tests for special situations (multiple comparisons, when assumptions are violated, and so on). Also discussed limitations and paradoxes (such as Simpson's paradox).
Readings: 
Reading Questions:
        1) Why is it important to estimate the likelihood of an outcome in the population and how might you do that?
        2) What are some examples of things that you might have data about for which process knowledge is your best option, rather than the frequency analysis typical of most statistics (e.g. randomized clinical trials, t-tests, etc.)?

Optional Readings:
HW: Project Part I assigned
Slides: [github]
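
To make Simpson's paradox from the description above concrete, here is a minimal sketch with invented counts: option A has the higher success rate within each subgroup, yet option B has the higher rate once the subgroups are pooled. The group names, the labels "A" and "B", and all numbers are hypothetical and chosen only so that the reversal appears.

```python
# Hypothetical (successes, trials) counts, invented for this example.
counts = {
    "easy cases": {"A": (90, 100), "B": (800, 1000)},
    "hard cases": {"A": (300, 1000), "B": (20, 100)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


def pooled_rate(option):
    """Success rate for one option after merging all subgroups."""
    successes = sum(group[option][0] for group in counts.values())
    trials = sum(group[option][1] for group in counts.values())
    return rate(successes, trials)


# Within each subgroup, A beats B ...
for name, group in counts.items():
    print(name, {opt: round(rate(*group[opt]), 2) for opt in ("A", "B")})

# ... but pooling reverses the comparison, because A was mostly tried
# on hard cases and B mostly on easy ones.
print("pooled", {opt: round(pooled_rate(opt), 2) for opt in ("A", "B")})
```

The reversal comes entirely from the unequal group sizes, which is one reason knowledge of the process that generated the data matters alongside the frequencies themselves.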

Thursday 3/17: Guest Lecture

Mike Blackhurst

University of Pittsburgh

University Center for Social and Urban Research

Tuesday 3/22: Finish Causality, further discussion of Bayesian Inference; Introduction of Regression

Description:  Discussion of causality and regression, the math and assumptions underlying regression, and how to use it. 
Readings: 
Slides: [github]

Thursday 3/24: Classification Basic & Metrics 

Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on. 
Readings: 
Slides:
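
The train/test idea in the description above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the course's own code: the helper names `train_test_split` and `metrics`, the toy threshold "classifier", and the generated data are all hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical data: one feature x, with label 1 when x >= 10.
rows = [(x, int(x >= 10)) for x in range(20)]
train, test = train_test_split(rows)

# "Training": place the threshold midway between the class means,
# using the training rows only.
mean_pos = sum(x for x, y in train if y == 1) / sum(y == 1 for _, y in train)
mean_neg = sum(x for x, y in train if y == 0) / sum(y == 0 for _, y in train)
threshold = (mean_pos + mean_neg) / 2

# Evaluation: the held-out test rows played no part in choosing the threshold.
predicted = [int(x >= threshold) for x, _ in test]
print(metrics([y for _, y in test], predicted))
```

Because the threshold is chosen without looking at the test rows, the printed metrics estimate how the classifier would do on unseen data rather than how well it memorized the data it was tuned on.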

Tuesday 3/29: More Classification Algorithms & Metrics

Description: Discussion of Decision Trees, Naïve Bayes, and Regression

Thursday 3/31: Project Part I Presentations

Tuesday 4/5: More Classification Algorithms

Thursday 4/7: Classification of Big Data / Revisiting Big Query / Usable ML

[Big Data Slides][Usable ML Slides: Blackboard]

    Tues 4/12 9-12: Project Checkin [Sign up links]

    ATTENDANCE REQUIRED

    Thurs 4/14: No Class (Carnival)

    Tues 4/19: Final Exam Review Session

    Thurs 4/21: Project Checkin [Sign up links]

    ATTENDANCE REQUIRED 

    Tues 4/26: Final Project Presentations [select timeslots] 

    ATTENDANCE REQUIRED

    Thurs 4/28: Final Project Presentations [select timeslots] 

    ATTENDANCE REQUIRED

    Final Exam: Take Home (to be discussed further in class)


    Things cut from the class: