High-Dimensional Data Analysis and Machine Learning

Course title - Intitulé du cours: High-dimensional data analysis and machine learning
Level / Semester - Niveau / semestre: M1 / S2
School - Composante: École d’Économie de Toulouse
Teacher - Enseignant responsable: MONDON Camille
Other teacher(s) - Autre(s) enseignant(s): LAURENT Thibault
Lecture Hours - Volume Horaire CM: 12
TA Hours - Volume horaire TD: 0
TP Hours - Volume horaire TP: 18
Course Language - Langue du cours: English - Anglais
TA and/or TP Language - Langue des TD et/ou TP: English - Anglais

Teaching staff contacts – Coordonnées de l’équipe pédagogique

Teacher | E-mail | Website
Camille Mondon | camille.mondon@tse-fr.eu | https://camillemondon.com/
Thibault Laurent | thibault.laurent@tse-fr.eu | http://www.thibault.laurent.free.fr/

Course Objectives – Objectifs du cours

This course is particularly relevant for students who want to pursue further studies or a career in data science. It does not cover advanced theory; instead, every method and algorithm is described, implemented in R, and its results are analyzed in detail. The course is not difficult, but it requires steady work throughout the semester.

Students are expected to develop skills in computational statistics and to combine efficient programming with relevant statistical methods. The class therefore includes a large number of practical applications. The course is divided into two parts.

The first part, given by Camille Mondon, is organized as follows:

Lesson | Topic | Field | Typical Use | Supervised / Unsupervised
1 | Introduction + Principal Component Analysis (PCA) | Statistics / Data Analysis | Dimension reduction | Unsupervised
2 | Linear Discriminant Analysis (LDA) | Statistics / Data Analysis | Classification, interpretation | Supervised
3 | Classification and Regression Trees (CART) | Machine Learning | Nonlinear prediction | Supervised
4 | Bootstrap | Statistics / Machine Learning | Uncertainty estimation | -
5 | Bagging | Machine Learning | Variance reduction | Supervised
6 | Random Forests | Machine Learning | Robust prediction | Supervised
7 | Questions | - | - | -
8 | Oral exams | - | - | -

All methods are implemented using R (optionally Python).
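
As a small illustration of the kind of exercise lesson 4 (Bootstrap) points to, the following is a minimal sketch of bootstrap uncertainty estimation, written in Python (the optional language mentioned above) with the standard library only. The function name `bootstrap_standard_error`, its parameters, and the toy data are hypothetical and are not taken from the course materials; the sketch simply resamples the data with replacement and measures how much the mean varies across resamples.

```python
import random
import statistics


def bootstrap_standard_error(sample, n_resamples=1000, seed=0):
    """Estimate the standard error of the sample mean by bootstrap resampling."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        resampled_means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)


# Toy data, invented for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
se_bootstrap = bootstrap_standard_error(data)
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap SE: {se_bootstrap:.3f}, formula SE: {se_formula:.3f}")
```

For the mean, the bootstrap estimate can be compared with the textbook formula (standard deviation divided by the square root of the sample size); the two should be close. The value of the bootstrap is that the same resampling loop applies to statistics for which no such formula is available.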

The second part gives an introduction to parallel computing for dealing with big data.

Prerequisites – Pré-requis

Proficiency in R programming, and knowledge of descriptive statistics and principal component analysis.

Practical information about the sessions – Modalités pratiques de gestion du cours

The first part consists of eight weekly 3-hour sessions. Lecture notes and slides are made available to students, but attending every session is strongly recommended in order to be able to implement the statistical methods and interpret the results.

The second part consists of two 3-hour sessions.

Warning

Personal laptops are accepted at the student’s own risk (some sessions take place in a computer room). Students are expected to participate actively in class. Late arrivals and absences will be reported and can result in a grade penalty.

Grading system – Modalités d’évaluation

The first part (eight lectures) is evaluated as follows:

Topic | Grade (%) | Grade (/20) | Condition | When
Quizzes | 30% | 6 | 1 point per quiz | At the beginning of each practical session except the first
Assignments | 10% | 2 | 2 points if all assignments are submitted on Moodle on time | On the Sunday before each lesson (except the first)
Final project | 60% | 12 | See relevant section on the website | Sunday March 1 (written), Monday March 2 (oral)

Important

Attendance is required on March 2.

Bibliography/references – Bibliographie/références

Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. New York, NY: Springer. https://doi.org/10.1007/978-1-4419-9650-3.
Husson, François, Sébastien Lê, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/b21874.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Williams, Graham. 2011. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. New York, NY: Springer. https://doi.org/10.1007/978-1-4419-9890-3.

Session planning – Planification des séances

Mondays from January 5 to March 16 (except February 23) between 15:30 and 18:30.

Distance learning – Enseignement à distance

Distance learning can be provided, when necessary, through interactive virtual classrooms, MCQ tests and other online exercises or assignments, chatrooms, and forums.