High-Dimensional Data Analysis and Machine Learning
| Course title - Intitulé du cours | High-dimensional data analysis and machine learning |
|---|---|
| Level / Semester - Niveau / semestre | M1 / S2 |
| School - Composante | École d’Économie de Toulouse |
| Teacher - Enseignant responsable | MONDON Camille |
| Other teacher(s) - Autre(s) enseignant(s) | LAURENT Thibault |
| Lecture Hours - Volume Horaire CM | 12 |
| TA Hours - Volume horaire TD | 0 |
| TP Hours - Volume horaire TP | 18 |
| Course Language - Langue du cours | English - Anglais |
| TA and/or TP Language - Langue des TD et/ou TP | English - Anglais |
Teaching staff contacts – Coordonnées de l’équipe pédagogique
| Teacher | Email | Website |
|---|---|---|
| Camille Mondon | camille.mondon@tse-fr.eu | https://camillemondon.com/ |
| Thibault Laurent | thibault.laurent@tse-fr.eu | http://www.thibault.laurent.free.fr/ |
Course Objectives – Objectifs du cours
This course is particularly relevant for students who are interested in pursuing their studies and career as data scientists. It does not contain advanced theory, but all methods and algorithms are described and implemented using R, and the results are analyzed in detail. The course is not conceptually difficult, but it requires steady work throughout the semester.
Students are expected to develop skills in computational statistics and to be able to combine efficient programming with relevant statistical methods. The class will therefore include a large number of practical applications. The course is divided into two parts.
The first part, given by Camille Mondon, is organized as follows:
| Lesson | Topic | Field | Typical Use | Supervised / Unsupervised |
|---|---|---|---|---|
| 1 | Introduction + Principal Component Analysis (PCA) | Statistics / Data Analysis | Dimension reduction | Unsupervised |
| 2 | Linear Discriminant Analysis (LDA) | Statistics / Data Analysis | Classification, interpretation | Supervised |
| 3 | Classification and Regression Trees (CART) | Machine Learning | Nonlinear prediction | Supervised |
| 4 | Bootstrap | Statistics / Machine Learning | Uncertainty estimation | – |
| 5 | Bagging | Machine Learning | Variance reduction | Supervised |
| 6 | Random Forests | Machine Learning | Robust prediction | Supervised |
| 7 | Questions | | | |
| 8 | Oral exams | | | |
All methods are implemented using R (optionally Python).
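As an illustration of the kind of implementation covered in the practical sessions, here is a minimal sketch of the bootstrap (lesson 4) used to estimate the standard error of a sample mean. It is written in Python with the standard library only. The function name `bootstrap_se`, its parameters, and the toy data are assumptions made for this example, not course material.

```python
import random
import statistics


def bootstrap_se(sample, statistic=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by nonparametric bootstrap.

    Hypothetical helper for illustration: resample with replacement,
    recompute the statistic on each resample, and return the standard
    deviation of the replicates.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


# Toy data, invented for the example.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
se = bootstrap_se(data)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

Bagging (lesson 5) reuses the same resampling step: instead of recomputing a summary statistic on each resample, a predictor is fitted on each one and the predictions are averaged.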
The second part gives an introduction to parallel computing for handling big data.
Prerequisites – Pré-requis
Proficiency in R programming and knowledge of descriptive statistics and principal component analysis.
Practical information about the sessions – Modalités pratiques de gestion du cours
For the first part, there are 8 weekly sessions of 3 hours. The lecture notes and the slides are made available to students, but it is highly recommended not to miss any session in order to be able to implement the statistical methods and interpret the results.
The second part consists of 2 sessions of 3 hours.
Personal laptops are accepted at the student’s own risk (some sessions take place in a computer room). Students are expected to participate actively in class. Late arrivals and absences will be recorded and can result in a grade penalty.
Grading system – Modalités d’évaluation
The first part (eight lectures) is evaluated as follows:
| Topic | Grade (%) | Grade (/20) | Condition | When |
|---|---|---|---|---|
| Quizzes | 30% | 6 | 1 point per quiz | At the beginning of each practical session except the first. |
| Assignments | 10% | 2 | 2 points if all assignments are submitted on Moodle on time. | On the Sunday before each lesson (except the first). |
| Final project | 60% | 12 | See relevant section on the website | Sunday March 1 (written), Monday March 2 (oral) |
Attendance is required on March 2.
Bibliography/references – Bibliographie/références
Session planning – Planification des séances
Mondays from January 5 to March 16 (except February 23) between 15:30 and 18:30.
Distance learning – Enseignement à distance
Distance learning can be provided, when necessary, by implementing interactive virtual classrooms, MCQ tests and other online exercises / assignments, chatrooms and forums.