Description
This course examines how data analysis technologies can be used to improve decision-making. The aim is to study the fundamental principles and techniques of data science. We will examine real-world examples and cases to place data science techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science. In addition, the course includes hands-on work with the Python programming language and its associated data analysis libraries.
General Information
Content
Thanks to advances in Internet computing and to software tools that make it easy to process and analyze data at scale, we can now extract valuable insights from the vast amount of data generated daily. As a result, both the business and scientific worlds are undergoing a revolution fueled by one of the most sought-after job profiles: the data scientist.
This course covers the fundamental steps of the data science pipeline:
* Data Wrangling
- Data acquisition (scraping, crawling, parsing, etc.)
- Data manipulation
- The many sources of data problems (and how to fix them): missing data, incorrect data, inconsistent representations
- Data quality testing
* Data Interpretation
- Working with "found data" (design of observational studies)
- Machine learning in practice (supervised and unsupervised, feature engineering, more data vs. advanced algorithms, curse of dimensionality, etc.)
- Text mining: vector space model, topic models, word embedding
- Social network analysis (influencers, community detection, etc.)
* Data Visualization
- Introduction to different plot types (1, 2, and 3 variables), layout best practices, network and geographical data
- Visualization to diagnose data problems, scaling visualization to large datasets, visualizing uncertain data
* Reporting
- Results reporting, infographics
- How to publish reproducible results
- Anonymization, ethical concerns
The students will learn the techniques during the ex-cathedra lectures and will be introduced, in the lab sessions, to the software tools required to complete the homework assignments.
In parallel, the students will carry out a semester-long project in agile teams of 3 students. The outcome of this team effort will be a project portfolio that will be made public (and available as open source).
At the end of the semester, students will also take a 2-hour final exam in a classroom.
Learning Outcomes
By the end of the course, the student must be able to:
- Construct a coherent understanding of the techniques and software tools required to perform the fundamental steps of the Data Science pipeline.
- Perform data acquisition (data formats, dataset fusion, Web scrapers, REST APIs, open data, big data platforms, etc.)
- Perform data wrangling (fixing missing and incorrect data, data reconciliation, data quality assessments, etc.)
- Perform data interpretation (knowledge extraction, critical thinking, team discussions, ad-hoc visualizations, etc.)
- Perform result dissemination (reporting, visualizations, publishing reproducible results, ethical concerns, etc.)
Teaching methods
- Physical in-class recitations and lab sessions
- Homework assignment and/or midterm
- Lab assignments
- Course project
Transversal skills
- Evaluate one's own performance in the team, receive and respond appropriately to feedback.
- Give feedback (critique) in an appropriate fashion.
- Demonstrate the capacity for critical thinking
- Write a scientific or technical report
Expected student activities
Students are expected to:
- Attend the lectures and lab sessions
- Complete lab assignments
- Conduct the class team project
- Read/watch the pertinent material before a lecture
- Engage during class and present their results in front of their classmates
Assessment methods
- 20% continuous assessment during the semester (lab assignments)
- 40% final project, done in groups of 3
- 40% final exam
Course Book
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, by Provost & Fawcett (O’Reilly, 2013; updated 2019)
http://data-science-for-biz.com/
This book covers the fundamental material that will provide the basis for you to think and communicate about data science and business analytics. We will complement the book with discussions of applications, cases, and demonstrations, and possibly some additional readings or notes for material that is not covered in the book.
An optional book that is particularly useful for those interested in the “hands-on” component of the class:
Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition
by Sebastian Raschka & Vahid Mirjalili
https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939
Lectures
Tuesday: 9:00 - 11:30, Room 146, FST 01
Friday: 8:30 - 10:00, Room 146, FST 01
Labs
Wednesday: 11:30 - 13:00, Room 148, FST 01
Staff Office Hours
George Pallis
Pavlos Antoniou
Moysis Symeonidis