Class Calendar


Contents

  1. Lectures (Tentative Schedule, lectures in Blue are tentative)
  2. Tues 1/17 Introduction & Overview of Data Science Pipeline [Jen]
  3. Thurs 1/19 Scoping Projects; Asking good Questions & Selecting Data Sources [Nikola]
  4. Tues 1/24 Structured vs Unstructured Data [Jen]
  5. Thurs 1/26 Theory and Practice of Data Cleaning [Nikola]
  6. Tues 1/31 Data Sampling: Acquiring the right data
  7. Thurs 2/2 Exploring Imperfect Data: Plots and Distributions [Jen]
  8. Tuesday 2/7: Big Data of One [Nikola]
  9. Thursday 2/9: Information Visualization Overview [Nikola]
  10. Tuesday 2/14: Perception and Information Visualization [Jen]; Guide to Byte 3 [Nikola]
  11. Thursday 2/16: The Role of Narrative in Visualization [Jen]
  12. Tuesday 2/21: Byte 3 Help Day [Nikola]
  13. Thursday 2/23: Visualizing Big Data [Jen]
  14. Tuesday 2/28: Midterm Project Guidance Meetings [Sign up links]
  15. Thursday 3/2: Iterative Design [Nikola]
  16. Tuesday 3/7: Guest Lecture (Medical Informatics, Adam Perer)
  17. Thursday 3/9: Midterm Project Guidance Meetings [Sign up links]
  18. Friday 3/10-Sunday 3/19 Spring Break (No Classes)
  19. Tuesday 3/21: Classification Basics & Algorithms [Jen]
  20. Thursday 3/23: Midterm Project Presentations
  21. Tuesday 3/28: Classification Metrics & Practical Null Hypothesis Testing [Nikola]
  22. Thursday 3/30: Usable ML [Jen]
  23. Tuesday 4/4: Classification and Regression Algorithms and Classification of Big Data [Nikola]
  24. Thursday 4/6: Guest Lecture: Mayank Goel
  25. Tuesday 4/11: Integrating Classification into Interactive Systems [Nikola]
  26. Thursday 4/13: Final Project Meetings [Sign Up]
  27. Tuesday 4/18: Causality, Bayesian Inference & Statistical Hypothesis Testing [Nikola]
  28. Thursday 4/20: No Class (Carnival)
  29. Tuesday 4/25: Finish Causality, further discussion of Bayesian Inference; Introduction of Regression [Nikola]
  30. Thursday 4/27: Final Project Meetings [Sign Up]
  31. Tuesday 5/2: Final Exam Review
  32. Thursday 5/4: Final Exam
  33. Finals Period: Final Project Presentations
  34. Things cut from the class:

Lectures (Tentative Schedule, lectures in Blue are tentative)

Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, create a free GitHub account, open the URL above, and click 'fork'; from then on the repository will show up in your account. GitHub has a GUI client that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and only use this repository to see my version of each project.
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.

Tues 1/17 Introduction & Overview of Data Science Pipeline [Jen]

Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data. 
Learning Goals
Homework: Byte 1 Assigned; Byte 3 setup Assignment
Slides: [Introduction]

Thurs 1/19 Scoping Projects; Asking good Questions & Selecting Data Sources [Nikola]

Description: How to decide what questions to ask of the data; pros and cons of different sources of data. Case-study based; includes mobile data.
Learning Goals: Learn how to ask a question that can be answered with data and explain how the question being answered affects the rest of the pipeline.
Reading:
  • ProactiveTasks: the short of mobile device use sessions. Nikola Banovic, Christina Brant, Jennifer Mankoff, and Anind K. Dey. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). ACM, New York, NY, USA, 243-252. PDF
  • Understanding the Challenges of Mobile Phone Usage Data. Karen Church, Denzil Ferreira, Nikola Banovic, and Kent Lyons. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15). ACM, New York, NY, USA, 504-514. PDF

Tues 1/24 Structured vs Unstructured Data [Jen]

Description: Discussion of properties of data; practical overview of XML, JSON, SQL, etc.; practical overview of APIs and OAuth.
Learning Goals
Readings:
  • Required: Google's Introduction to (semi)-structured data.
  • Required: [on Canvas] Chapter 1 of Data Modeling Essentials (read sections 1.3, 1.4, 1.6 & 1.11)
  • Optional: Stonebraker & Hellerstein: What Goes Around Comes Around. Pages 1-2 (sections I and II); Section V (The Entity-Relationship Era); IX (The Object-Relational Era); X (Semi-Structured Data) (a historical view of different classes of data modeling)
Homework: Byte 1 Due; Byte 3 install due; Byte 2 Assigned

Thurs 1/26 Theory and Practice of Data Cleaning [Nikola]

Description:  The four Cs (Correctness, Coherence, Completeness, and AcCountability); Practical overview of survey question design issues
Learning Goals: Understand and describe the four Cs of data quality, and explain causes of and fixes for quality issues for each of them.
Readings: Reading Question: List and briefly discuss one example of how bad data can affect the data pipeline.
  • McCallum: Bad Data: Chapter 7 [on Canvas]
  • Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11). ACM, New York, NY, USA, 3363-3372. PDF
Homework: Byte 1 Peer Grading Due
Slides: [Data Quality]

Tues 1/31 Data Sampling: Acquiring the right data

Description: Discussion of data sampling issues
Slides: [github]

Thurs 2/2 Exploring Imperfect Data: Plots and Distributions [Jen]

Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
Readings: Reading question: When is it valuable to read raw data without plots, and how can plotting your data help you to identify data to read?

Pearson: Mining Imperfect Data: Chapter 1

Optional: Gelman: Exploratory Data Analysis for Complex Models (read the article, discussions of article optional)
Slides: [Exploratory Visualization]
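
As a small illustration of the stem-and-leaf plots mentioned in the description above, here is a minimal sketch using only the Python standard library. The function name `stem_and_leaf` and the sample values are hypothetical and not taken from the lecture; the sketch assumes non-negative integers and uses the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers into (stem, leaves) rows; the last digit is the leaf."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Include empty stems so gaps in the distribution stay visible.
    return [(s, rows.get(s, [])) for s in range(min(rows), max(rows) + 1)]

# Hypothetical sample, for illustration only.
sample = [12, 15, 21, 23, 23, 38, 41, 47, 59]
for stem, leaves in stem_and_leaf(sample):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, each row keeps the raw values, so the plot shows the shape of the distribution while the data can still be read directly.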

Tuesday 2/7: Big Data of One [Nikola]

Description: Big Data and how it relates to Big Data of One--people's personal data collected using mobile devices and other forms of sensed data.
Learning Goals: Be able to define Big Data and list major challenges in collecting and consuming Big Data.                  
Reading: Reading Question: Pick and briefly discuss one challenge that people in the quantified-self movement face when trying to understand their data.
  • Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, and Julie A. Kientz. 2014. Understanding quantified-selfers' practices in collecting and exploring personal data. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems (CHI '14). ACM, New York, NY, USA, 1143-1152. DOI=10.1145/2556288.2557372 http://doi.acm.org/10.1145/2556288.2557372
Homework: Byte 2 Due

Thursday 2/9: Information Visualization Overview [Nikola]

Description: Overview of key concepts in information visualization; Testing visualizations; StepGreen Case Study; Description of Byte 3 (Mobile/Visualization Byte)
Learning Goals:

Readings: Reading Question: Post a link or screenshot of a data visualization, and analyze how it addresses Tufte's six principles. 
Slides: [Overview of Info Viz  Case Study on Location]
Homework: Byte 2 Peer Grading Due; Byte 3 (Mobile/Visualization Byte) Assigned

Tuesday 2/14: Perception and Information Visualization [Jen]; Guide to Byte 3 [Nikola]

Description: Overview of human perceptual factors affecting information visualization and a brief discussion of D3 
Readings: Reading Question: Although pattern detection is typically simpler with a graphical interface, are we missing out on interesting numerical relationships by allowing both the machine and the human analyst to focus only on what they do "best"?
  • Ware: Visual Thinking for Design: Chapter 9 (The Dance of Meaning) [on Canvas]
  • Satyanarayan, Arvind, et al. "Vega-lite: A grammar of interactive graphics." IEEE Transactions on Visualization and Computer Graphics 23.1 (2017): 341-350.
  • Optional: Ware: Visual Thinking for Design: Chapter 1 (Visual Queries)

Thursday 2/16: The Role of Narrative in Visualization [Jen]

Readings:
Potential Reading Questions:
  • In what ways can factors external to the visualization itself, such as internalized knowledge and conventions at the individual and community level, interact with the rhetorical strategies used in a narrative visualization to influence interpretation?
  • How do communicative and explorative rhetorical strategies effectively work together in a narrative visualization?
  • Section 2.3 in the Hullman paper mentions how subtle changes in framing may influence, or otherwise solicit, a particular opinion from the user. Can you find any examples of visualizations that do this?

Tuesday 2/21: Byte 3 Help Day [Nikola]

Description: In-class help with Byte 3.

Readings (in lieu of Big Data Quality and Sampling): 

Slides: [Byte 3 part 1 Byte 3 Part 2]

Thursday 2/23: Visualizing Big Data [Jen]

Description: Discussion about Visualization of Big Data 
Readings:
Slides: [Visualizing Big Data]
Homework: Byte 3 due;  midterm project assigned here [requires iterative design]

Tuesday 2/28: Midterm Project Guidance Meetings [Sign up links]

Homework: Byte 3 Peer Grading Due

Thursday 3/2: Iterative Design [Nikola]

Description: Discussion of HCI principles, rapid prototyping, and getting feedback from end users.
Readings:
  • (Required) Buxton, Bill. Sketching user experiences: getting the design right and the right design. Morgan Kaufmann, 2010. [Canvas]
  • (Required) Nielsen, Jakob. "Iterative user-interface design." Computer 26, no. 11 (1993): 32-41. https://www.nngroup.com/articles/iterative-design/
  • (Optional) Nielsen, Jakob. "Discount usability: 20 years." Jakob Nielsen's Alertbox. https://www.nngroup.com/articles/discount-usability-20-years/

Tuesday 3/7: Guest Lecture (Medical Informatics, Adam Perer)

Thursday 3/9: Midterm Project Guidance Meetings [Sign up links]

Friday 3/10-Sunday 3/19 Spring Break (No Classes)

Tuesday 3/21: Classification Basics & Algorithms [Jen]

Description: Discussed the basic process by which classifiers are trained and used. Talked about the importance of a train/test split in which the test set is kept separate from the data you experiment on. Mentioned accuracy. Introduced some algorithms students will use in Byte 4 (algorithms ultimately useful with larger data sets).
Readings: 
Homework: Byte 4 Assigned; Discuss Byte 4 (Interactive Machine Learning)
Slides: [github]
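
To make the train/test separation described above concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the toy data, the helper names (`train_test_split`, `fit_threshold`, `predict`, `accuracy`), and the very simple threshold classifier, which is not one of the Byte 4 algorithms.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """'Train' by placing a threshold midway between the two class means."""
    small = [x for x, label in train if label == "small"]
    big = [x for x, label in train if label == "big"]
    return (sum(small) / len(small) + sum(big) / len(big)) / 2

def predict(threshold, x):
    return "big" if x >= threshold else "small"

def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

# Toy data (hypothetical): one numeric feature and a label.
data = [(x, "big" if x >= 10 else "small") for x in range(20)]
train, test = train_test_split(data)

threshold = fit_threshold(train)  # uses training rows only
test_accuracy = accuracy([predict(threshold, x) for x, _ in test],
                         [label for _, label in test])
print(f"threshold={threshold:.2f}, held-out accuracy={test_accuracy:.2f}")
```

The point of the design is that the test rows never influence the fitted threshold, so the reported accuracy is an estimate of performance on unseen data rather than on data the model was tuned to.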

Thursday 3/23: Midterm Project Presentations

Tuesday 3/28: Classification Metrics & Practical Null Hypothesis Testing [Nikola]

Description: Discussion about how to compare algorithms and what metrics to use (accuracy, precision and recall, kappa, f-score). Introduce practical null hypothesis testing (e.g., t-tests) as a rough check on whether differences are real. 
Learning goals: Be able to choose the best algorithm that will generalize to unseen data.
Readings: No readings for this lecture.
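
The metrics named in the description (accuracy, precision, recall, f-score, kappa) can all be computed from the four cells of a binary confusion matrix. The following is a minimal sketch under that assumption; the function names and the example labels are hypothetical, the f-score shown is F1, and kappa is Cohen's kappa.

```python
def confusion_counts(truth, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = (((tp + fp) / n) * ((tp + fn) / n)
                + ((fn + tn) / n) * ((fp + tn) / n))
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# Hypothetical labels, for illustration only.
truth     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(*confusion_counts(truth, predicted))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy is 0.8 but kappa is noticeably lower, because a classifier guessing with the same label proportions would already agree with the truth about half the time.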

Slides: [GitHub]

Thursday 3/30: Usable ML [Jen]

Slides: [Usable ML Slides: Blackboard]

Tuesday 4/4: Classification and Regression Algorithms and Classification of Big Data [Nikola]

Description: Overview of different algorithms and their applications. Considerations for classification of Big Data.

Slides: [GitHub]

Thursday 4/6: Guest Lecture: Mayank Goel

Homework: Byte 4 Due; Final Projects Assigned.
Reading: de Greef, Lilian, et al. "Bilicam: using mobile phones to monitor newborn jaundice." Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014.

Tuesday 4/11: Integrating Classification into Interactive Systems [Nikola]

Description: Getting labels. Making predictions. Assessing accuracy over time. Real world prediction problems. 

[Slides]

Homework: Final Project Proposals Due on paper in class

Thursday 4/13: Final Project Meetings [Sign Up]

Tuesday 4/18: Causality, Bayesian Inference & Statistical Hypothesis Testing [Nikola]

Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, limitations, and tests for special situations (multiple comparisons, when assumptions are violated, and so on). Also discussed limitations and paradoxes (such as Simpson's paradox).

Readings: 
Optional Readings:
HW: Project Part I assigned
Slides: [github]
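
Simpson's paradox, mentioned in the description above, is easiest to see with numbers. The sketch below uses illustrative counts (not data from the lecture) for two hypothetical treatments, A and B, split by case size: A has the higher success rate within each subgroup, yet B has the higher rate once the subgroups are pooled, because A was applied mostly to the harder cases.

```python
# Illustrative (successes, trials) counts per treatment and subgroup.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def subgroup_rate(treatment, subgroup):
    successes, trials = counts[treatment][subgroup]
    return successes / trials

def pooled_rate(treatment):
    """Success rate after adding up all subgroups for one treatment."""
    successes = sum(s for s, _ in counts[treatment].values())
    trials = sum(n for _, n in counts[treatment].values())
    return successes / trials

for subgroup in ("small", "large"):
    print(subgroup,
          f"A={subgroup_rate('A', subgroup):.3f}",
          f"B={subgroup_rate('B', subgroup):.3f}")
print("pooled", f"A={pooled_rate('A'):.3f}", f"B={pooled_rate('B'):.3f}")
```

Which comparison is the right one depends on the causal story behind the data, which is why the paradox comes up alongside causality.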

Thursday 4/20: No Class (Carnival)

Tuesday 4/25: Finish Causality, further discussion of Bayesian Inference; Introduction of Regression [Nikola]

Description:  Discussion of causality and regression, the math and assumptions underlying regression, and how to use it. 
Readings: 
Slides: [github]
Homework: Byte 4 (Machine Learning with Big Data) due. Discussion of Final Project. 

Thursday 4/27: Final Project Meetings [Sign Up]

Tuesday 5/2: Final Exam Review

Thursday 5/4: Final Exam

Finals Period: Final Project Presentations

Things cut from the class:

  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
  • Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., ... & Campbell, A. T. (2014, September). Studentlife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3-14). ACM. [Video]
  • B. Lee, C. Plaisant, C. Sims Parr, J.-D. Fekete, N. Henry, "Task Taxonomy for Graph Visualization", Proc. of BELIV '06, April '06, pp. 1-5.
  • A. Perer, B. Shneiderman, "Balancing Systematic and Flexible Exploration of Social Networks," IEEE Trans. on Visualization and Computer Graphics, Vol. 12, No. 5, Sep.-Oct. 2006, pp. 693-700 
  • F. Viegas, S. Golder, and J. Donath, "Visualizing Email Content: Portraying Relationships from Conversational Histories", Proceedings of CHI 2006, Montreal, Canada, April 2006, pp. 979-988. 
  • M. Wattenberg and J. Kriss, "Designing for Social Data Analysis," IEEE Transactions on Visualization and Computer Graphics Vol. 12, No. 4, Jul.-Aug. 2006, pp. 549-557. 
  • http://v.isits.in/