Description

DS 121 is the second course in the three-course sequence (DS 120, 121, 122) that introduces students to the theoretical foundations of Data Science. DS 121 introduces key concepts from Linear Algebra (vector spaces, independence, orthogonality, and matrix factorizations). The DS theme running through the course is exploratory data analysis, which enables a better understanding of the data at hand. The course links mathematical concepts with computational thinking, specifically through problem sets that require students to answer mathematically posed questions using computation.

Effective Fall 2021, this course fulfills a single unit in each of the following BU Hub areas: Digital/Multimedia Expression, Quantitative Reasoning I, Critical Thinking.

Prerequisites: DS 110 and 120, or equivalents.

General Information

Lectures and Zoom livestream
The course instructor is Prof. Mayank Varia. This class meets on Tuesdays and Thursdays at 3:30-4:45pm in room PSY B33.

If you cannot attend a lecture in person, you can find the links to the Zoom livestream, recording, and lecture notes on the Course schedule page: https://piazza.com/class/l7p7fjsa9so6yg?cid=6 (just look to the right of this text).
Discussion section
Discussion sections will be led by TA Harshit Agrawal. There are two discussion sections: Mon 2:30-3:20pm in EOP 266, and Mon 4:40-5:30pm in EOP 260.
Gradescope
All homework assignments must be submitted to Gradescope: https://www.gradescope.com/courses/438338 (use entry code J364R6 to sign up). This is the only way to have your homework graded. Homework is typically due on Fridays at 8pm Eastern Time. Assignments will be accepted up to 12 hours late with a 10% grade reduction; assignments submitted more than 12 hours late will not be accepted.
Academic honesty policy
You must adhere to BU’s Academic Conduct Code at all times. Please be sure to read it here: https://www.bu.edu/academics/policies/academic-conduct-code. In particular: cheating on an exam, passing off another student’s work as your own, or plagiarism of writing or code are grounds for a grade reduction in the course and referral to BU’s Academic Conduct Committee. If you have any questions about the policy, please ask me in person or via a private Piazza note immediately, before taking an action that might be a violation.
Collaboration policy
The goal of homework is to learn. Therefore, I encourage you to use any and all resources that can help you learn the material: computers/calculators, Piazza, lecture notes, textbooks, other websites, and your fellow classmates. There are only a few rules to keep in mind.

1. You must document on your homework submission: (a) the names of any other students you worked with, (b) any websites you used besides the ones listed in this syllabus, and (c) any code you have used from other sources.

2. You may not directly copy solutions from anyone else, or give your solutions to someone else to copy.

Basically: sharing ideas with attribution is fine, but sharing answers is not.

The goal of tests is for you to show me what you have learned. As a result, any form of collaboration is strictly prohibited. Computers and notes are also forbidden during tests unless I explicitly state otherwise. (That said, I encourage you to collaborate with classmates when studying lecture materials and preparing for tests.)

Announcements

Final project
11/22/2022, 8:01:17 AM

The goal of the final project is to give you the opportunity to further explore one of the topics covered in the course. Concretely, your objective is to analyze a dataset using any of the techniques we have covered this semester such as clustering, regression, principal components analysis, matrix factorization, or more.
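
To make the scope concrete, here is a minimal sketch of one such analysis: principal components analysis computed from the SVD with NumPy. The dataset is synthetic and the helper name `pca` is made up for this illustration; your own project would load a real dataset instead, and this is only one of many reasonable ways to set up the computation.

```python
import numpy as np

# Synthetic stand-in for a real dataset: 200 objects (rows), 3 features (columns).
# The third feature is nearly a linear combination of the first two.
rng = np.random.default_rng(seed=0)
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.01 * rng.normal(size=200)
X = np.column_stack([base, third])

def pca(X, k):
    """Project the rows of X onto the top-k principal components (hypothetical helper)."""
    centered = X - X.mean(axis=0)               # PCA works on mean-centered data
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)             # fraction of variance per component
    scores = centered @ Vt[:k].T                # coordinates in the top-k basis
    return scores, Vt[:k], explained

scores, components, explained = pca(X, k=2)
print("fraction of variance per component:", np.round(explained, 4))
```

Because the third feature is almost determined by the other two, nearly all of the variance should land in the first two components, which is the kind of observation a project report would then explain and interpret.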

 

You should find a dataset from a research paper, a Python repository, a blog post, or any other source (see some suggestions below). Then, you must either run a new analysis from scratch on this dataset, or reproduce the result of a prior work and add at least one extension. Note that it’s perfectly fine to use ideas from other papers or websites, but you must (a) cite any sources used and (b) describe concretely the parts of your project report that are similar to, or different from, the prior work.

 

Project Report Description: Your final report should be a Jupyter notebook containing the following sections:

  1. Introduction: Provide an easy-to-understand summary of what you’ve done and why it matters to you. This section should be understandable even by people who have not taken this course.

  2. Data: Describe the dataset you have studied. Explain what the objects and features are, show some samples, etc. If appropriate, describe where the dataset comes from and where it is applied in practice.

  3. Methodology: Here, you should state the analysis techniques that you used in the project. Make sure to explain what the algorithm you’re using does, and why you chose this particular strategy for analyzing the data. If appropriate, state a hypothesis that you plan to test.

 

  4. Analysis: Show the analysis itself, and include any charts/graphs/tables that help to visualize what you’ve done. This section should contain the code for implementing your methodology. For example, a project using matrix factorizations could include an implementation of one or more of the factorizations we discussed in class (or a related one).

  5. Results: Explain the takeaways from your analysis. For example, does your analysis support or refute the claim you were intending to study, and did your algorithm behave as expected? If your project is building upon someone else’s work, make sure to use this space to compare and contrast your findings with other works.

  6. Conclusion: Summarize the work you’ve done and the outcomes you’ve discovered.

  7. References: I’ll repeat, make sure to cite your work! You can use any textbook, website, paper, or other resource as long as you cite it. Using prior work without citing it is plagiarism and will be handled as stated in the course syllabus.
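
As an illustration of what "an implementation of a factorization" in the Analysis section might look like, here is a minimal sketch of LU decomposition without row exchanges, written with NumPy. The function name `lu_no_pivot` and the example matrix are made up for this sketch, and it assumes every pivot is nonzero; production libraries use pivoting for numerical stability, so treat this as a teaching example rather than a reference implementation.

```python
import numpy as np

def lu_no_pivot(A):
    """Factor a square matrix A as L @ U without row exchanges.

    Assumes every pivot is nonzero; raises ValueError otherwise.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)                               # unit lower-triangular factor
    U = A.copy()                                # reduced to upper-triangular form
    for j in range(n - 1):
        if np.isclose(U[j, j], 0.0):
            raise ValueError("zero pivot encountered; this sketch does not pivot")
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]         # multiplier that eliminates U[i, j]
            U[i, j:] -= L[i, j] * U[j, j:]      # row_i <- row_i - multiplier * row_j
    return L, U

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
L, U = lu_no_pivot(A)
print(L)
print(U)
print(np.allclose(L @ U, A))                    # checks that the factors reproduce A
```

In a report, a check like the last line is worth keeping next to the implementation, since it shows the reader that the factorization actually reconstructs the original matrix.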

Here is the timeline for the project:

 

  • Wednesday, November 30: By this day, please send me a private Piazza note describing the project that you plan to pursue. That is, describe the dataset you want to analyze and your initial ideas for the kinds of data analysis you think would be relevant to understanding this dataset.

  • Monday, December 12: Submit your final project on Gradescope.

 

The proposal: This is a brief summary of what topic you would like to tackle for the project. Deciding and scoping what exactly you want to do is very much part of the project itself. I am happy to chat during office hours or by appointment about ideas you may have and provide relevant references. The proposal should provide an overview of the topic you are choosing.

Your project will be graded as follows:

  • 5% for submitting the initial project plan by Nov 30. (You will get full credit here for submitting any plan by Nov 30, even if we have some discussion about it afterward.)

  • 15% for each of the six sections in the report. We will grade each section based on how well your code uses the techniques learned in class, and how well your writing documents this work.

  • 5% for including a references section. (See more details below.)
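
The breakdown above adds up to 100%: 5% for the plan, 6 × 15% = 90% for the report sections, and 5% for references. The short sketch below just spells out that arithmetic. It assumes the six graded sections are Introduction through Conclusion, and that the plan and references components are all-or-nothing; the function name and the per-section scores are hypothetical.

```python
# Weights taken from the grading breakdown above.
PLAN_WEIGHT = 0.05
SECTION_WEIGHT = 0.15
REFERENCES_WEIGHT = 0.05

# Assumption: the six graded sections are the ones listed before References.
GRADED_SECTIONS = ["Introduction", "Data", "Methodology",
                   "Analysis", "Results", "Conclusion"]

def project_grade(plan_on_time, section_scores, has_references):
    """Return the project grade as a percentage between 0 and 100.

    section_scores maps each graded section name to a score in [0, 1].
    """
    total = PLAN_WEIGHT * (1.0 if plan_on_time else 0.0)
    total += sum(SECTION_WEIGHT * section_scores[name] for name in GRADED_SECTIONS)
    total += REFERENCES_WEIGHT * (1.0 if has_references else 0.0)
    return round(100 * total, 2)

# Hypothetical example: on-time plan, 80% on every section, references included.
example = project_grade(True, {name: 0.8 for name in GRADED_SECTIONS}, True)
print(example)
```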

 

Remember that it is crucial to include a references section and to properly compare your project with any prior work. Your bibliography entries should include author(s), title, and a website link (if appropriate).

 

Copying of ideas from prior work without citation will be considered plagiarism, and it is grounds for receiving a grade of 0 on the project along with further disciplinary action as described in the syllabus. As a general rule: cite liberally whenever you quote or paraphrase ideas from another paper, in every single sentence/paragraph where you do so. If you have any questions about this policy, send a private Piazza note to ask us.

 

Resources: 

 

Algorithms and Libraries: Useful libraries for data science in Python include:

 

[Machine Learning]

[Data Processing]

[Visualizations]

 

Examples of applications can be found on:

 

Datasets: You can use any open dataset. Some common sources include:

 

[Dataset aggregators for Machine Learning projects]

[Open Data]

Course schedule
9/5/2022, 4:48:35 PM

This page contains the lesson plan for DS 121 lectures.

| Week | Topic | Reading | Homework |
|------|-------|---------|----------|
| 1 | Vectors | Boyd-Vandenberghe Chapter 1 and 2.1; 3Blue1Brown video 1 and video 2 | |
| 2 | Matrices | Boyd-Vandenberghe Chapter 6; 3Blue1Brown video 3 and video 4 | HW1 due 9/16 |
| 3 | Geometry | Boyd-Vandenberghe Chapter 3; 3Blue1Brown video 5 and video 6 | HW2 due 9/23 |
| 4 | Clustering | Boyd-Vandenberghe Chapter 4 | HW3 due 9/30 |
| 5 | Clustering | (no new reading, just review for the test) | TEST on 10/6 |
| 6 | LU decomposition | Aggarwal Section 2.4-2.5; 3Blue1Brown video 7 and video 8 | HW4 due 10/14 |
| 7 | Subspaces & Orthogonality | Aggarwal Sections 2.6-2.7.2; Boyd-Vandenberghe Chapter 10; 3Blue1Brown video 9 | HW5 due 10/21 |
| 8 | Regressions | Boyd-Vandenberghe Chapter 12 | HW6 due 10/28 |
| 9 | Markov chains & Eigenvalues | Deisenroth-Faisal-Ong Section 4.2; 3Blue1Brown video 13 and video 14 | HW7 due 11/4 |
| 10 | PageRank | Aggarwal Sections 10.1-10.3 and 10.6 | TEST 2 on 11/8 |
| 11 | Diagonalization | Deisenroth-Faisal-Ong Section 4.4 | HW8 due 11/18 |
| 12 | SVD | Aggarwal Sections 7.1-7.4 | |
| 13 | PCA | Aggarwal Sections 8.1-8.2 | HW9 due 12/2 |
| 14 | Classifiers | | PROJECT due 12/12; EXAM on 12/19 |

Staff Office Hours
| Name | When? | Where? |
|------|-------|--------|
| Mayank Varia | | |
| Harshit Agrawal | | |
| Andy Yang | | |
| Daniel Cho | | |
| Lisa Wobbes | | |