Projects
1 Objectives and constraints
The objectives of this project are:
- To form a 4-person group.
- To introduce and summarise a dataset (chosen from the list in Section 2).
- To apply and interpret the results of all six core methods studied in the course:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- Bootstrap
- Bagging
- Random Forests
- To explain, apply and interpret the results of one supplementary method selected from the list in Section 3.
- To submit a reproducible report before February 22.
- To present the work in an oral examination on March 2.
You must use either R or Python for your experiments.
Each choice must be justified:
- Why this dataset?
- Why are the core methods applicable?
- Why this supplementary method?
Start with a simple exploratory data analysis, then apply the methods and interpret the results in a statistically sound way.
Large language models may only be used after the code and report are complete, and strictly for sentence-level English correction.
2 Choice of dataset
Each group must download a multivariate dataset from the following list of sources:
- World Bank Open Data — global development indicators, socio-economic and environmental data.
- SNCF Open Data — datasets from the French national railway company (schedules, stations, traffic).
- Toulouse Métropole Open Data — urban and transport datasets for Toulouse metropolitan area.
- data.gouv.fr — central French government portal, hosting datasets from national and local administrations.
- Open Collectivités — datasets on French local authorities, public finances, and administrative structures.
- European Union Open Data Portal — datasets from EU institutions covering economy, environment, health, and more.
- UNdata — United Nations statistics and indicators across multiple global domains.
- UCI Machine Learning Repository — classic repository of datasets for ML research and teaching.
- OpenStreetMap — collaborative project creating free geographic data (maps, locations, infrastructure).
- Data.gov (USA) — US government open data portal covering health, energy, transportation, and more.
- European Environment Agency (EEA) Open Data — environmental data for Europe: air, water, climate, biodiversity.
- OECD Data — international statistics on economy, education, health, and more.
You may also use the following search tools, but make sure that the dataset you find comes from an official source and, if it has been curated, that you understand how:
- Google Dataset Search — search engine for datasets across the web.
- Kaggle Datasets — large repository of open datasets, often used for machine learning competitions.
- Your selected dataset must contain at least:
- 80 observations
- 6 numerical variables
- 1 categorical variable
- If there are only quantitative variables, there might be a way to create categories from a numerical variable.
- When there are time series, please restrict the study to one year.
- If cleaning or preprocessing is required, describe and justify it clearly.
Before starting the project, the chosen dataset must be approved by the teacher.
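The size requirements listed above can be checked quickly before asking for approval. The following is only a minimal sketch, assuming the data are loaded into a pandas DataFrame: the helper names `check_dataset` and `add_tercile_category`, the tercile labels, and the synthetic demo data are illustrative choices and not part of the project requirements. Binning into terciles is one possible way to create categories from a numerical variable; other cut points may suit your data better.

```python
import random

import pandas as pd

# Minimums taken from the requirements above.
MIN_ROWS, MIN_NUMERIC, MIN_CATEGORICAL = 80, 6, 1


def check_dataset(df: pd.DataFrame) -> dict:
    """Compare the shape of df with the minimum requirements.

    Assumption: every column that is neither numeric nor a date is
    treated as categorical. Check this by hand on your own data.
    """
    n_numeric = df.select_dtypes(include="number").shape[1]
    n_categorical = df.select_dtypes(exclude=["number", "datetime"]).shape[1]
    return {
        "observations": len(df) >= MIN_ROWS,
        "numerical": n_numeric >= MIN_NUMERIC,
        "categorical": n_categorical >= MIN_CATEGORICAL,
    }


def add_tercile_category(df: pd.DataFrame, column: str, new_column: str) -> pd.DataFrame:
    """Return a copy of df with a 3-level category built from the terciles of column."""
    out = df.copy()
    out[new_column] = pd.qcut(out[column], q=3, labels=["low", "medium", "high"])
    return out


# Synthetic demo: 100 observations, 6 numerical variables, no categorical one.
random.seed(0)
demo = pd.DataFrame(
    {f"x{i}": [random.gauss(0, 1) for _ in range(100)] for i in range(6)}
)
print(check_dataset(demo))

# After binning x0, the categorical requirement is also met.
demo_with_category = add_tercile_category(demo, "x0", "x0_level")
print(check_dataset(demo_with_category))
```

If you create a category this way, say so in the report and justify the cut points, since the variable used for binning is no longer independent of the class labels.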
3 Supplementary methods
Each group must select one method from the following list:
| Method | Short description | Main reference | R / Python implementation | Difficulty |
|---|---|---|---|---|
| Kernel PCA | Nonlinear PCA using kernels | Schölkopf et al. 1998, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem” | R: kernlab • Python: sklearn | ★★★ |
| Sparse PCA | PCA with sparsity in loadings | Zou, Hastie, Tibshirani 2006, “Sparse Principal Component Analysis” | R: elasticnet / PMA • Python: sklearn | ★★★ |
| ICS (Invariant Coordinate Selection) | Outlier-oriented dimension reduction | Tyler et al. 2009, “Invariant Coordinate Selection” | R: ICS, ICSOutlier • Python: icspylab | ★★ |
| QDA | Quadratic class boundaries (nonlinear DA) | Friedman 1989, “Regularized Discriminant Analysis” | R: MASS::qda • Python: sklearn | ★ |
| Gaussian Mixture Models | Model-based clustering via mixtures | McLachlan & Peel 2000, “Finite Mixture Models” | R: mclust • Python: sklearn | ★★ |
| Spectral clustering | Graph Laplacian–based clustering | Ng, Jordan, Weiss 2002, “On Spectral Clustering” | R: kernlab • Python: sklearn | ★★★ |
| Multidimensional Scaling (MDS) | Distance-preserving embedding | Kruskal & Wish 1978, “Multidimensional Scaling” | R: cmdscale, smacof • Python: sklearn | ★★ |
| t-SNE | Nonlinear visualization preserving local neighborhoods | van der Maaten & Hinton 2008, “Visualizing Data using t-SNE” | R: Rtsne • Python: sklearn | ★★★ |
| UMAP | Fast manifold learning for visualization | McInnes, Healy, Melville 2018, “UMAP: Uniform Manifold Approximation and Projection” | R: umap • Python: umap-learn | ★★★ |
| Isolation Forest | Anomaly detection using random partitioning | Liu, Ting, Zhou 2008, “Isolation Forest” | R: isotree • Python: sklearn | ★★ |
| Gradient Boosting | Boosting with additive trees | Friedman 2001, “Greedy Function Approximation: Gradient Boosting” | R: gbm, xgboost • Python: sklearn | ★★★ |
| AdaBoost | Adaptive boosting for classification | Freund & Schapire 1997, “A Decision-Theoretic Generalization of On-Line Learning” | R: adabag • Python: sklearn | ★★ |
| Extra-Trees | Extremely randomized trees ensemble | Geurts, Ernst, Wehenkel 2006, “Extremely Randomized Trees” | R: extraTrees • Python: sklearn | ★★ |
| k-Nearest Neighbors (k-NN) | Classification by majority vote of neighbors | Cover & Hart 1967, “Nearest Neighbor Pattern Classification” | R: class::knn • Python: sklearn | ★ |
| Regularized LDA | Shrinkage version of LDA for small-sample problems | Friedman 1989, “Regularized Discriminant Analysis” | R: klaR::rda | ★ |
| k-means clustering | Partitioning clustering | MacQueen 1967 | R: kmeans • Python: sklearn | ★ |
| PAM / k-medoids | Robust alternative to k-means | Kaufman & Rousseeuw 1990 | R: cluster::pam • Python: pyclustering | ★ |
4 Project report format
- The report must be a Quarto (or R Markdown) project.
- The output must compile into a single PDF of at most 20 pages.
- Include:
- introduction and objectives
- dataset description and preprocessing
- contributions of group members
- methodology explanation
- interpretation of results
- conclusion and perspectives
- Code must be clean, organized and commented.
- Reproducibility is essential. This means that if you copy your project folder to a new computer, it should build the final project files without error. Common mistakes to avoid:
- Forgetting to embed all code and data files.
- Caching results and using them instead of the data source.
- Using an absolute path to the data file.
- Using `install.packages` without checking whether the package is already installed.
- Not setting the seed when a method based on random number generation is used.
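For groups working in Python, three of the pitfalls above (absolute paths, unchecked package installation, missing seed) could be handled along the following lines. This is a sketch using only the standard library: the file name `data/dataset.csv` and the helper names `missing_packages` and `bootstrap_indices` are hypothetical and should be adapted to your project.

```python
import importlib.util
import random
from pathlib import Path

# Relative path, resolved from the project folder.
# "data/dataset.csv" is a placeholder name, not a required layout.
DATA_FILE = Path("data") / "dataset.csv"


def missing_packages(names):
    """Return the top-level packages in names that are not installed."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def bootstrap_indices(n, seed):
    """Draw n row indices with replacement, using a fixed seed."""
    rng = random.Random(seed)  # local generator: the seed is explicit
    return [rng.randrange(n) for _ in range(n)]


# Report missing packages instead of reinstalling everything blindly.
# The second name is deliberately fictitious.
print(missing_packages(["json", "package_that_does_not_exist_xyz"]))

# The same seed gives the same resample on any machine.
print(bootstrap_indices(10, seed=42) == bootstrap_indices(10, seed=42))
```

Listing the missing packages and stopping with a clear message is usually safer than installing them automatically in the middle of a report build.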
Contributions should be as detailed as possible (for each part of your project and oral presentation), as they will be used for grading. If you have no idea how to write them, you can refer to CRediT. If you used LLMs, you must include them in the contributions and explain precisely how and why they were used.
Remark. If you have difficulties uploading all the necessary sources to build your project on Moodle (for instance if the dataset is too large), please contact me in order to find a solution.
5 Oral defense
Each group will also complete an oral defense, organized as follows:
- Each student has 3 minutes to present their contribution to the project and 3 minutes to answer questions from me and from one student of another group.
- Notes are not allowed, but that does not mean that the text should be written on the slides either.
- If your presentation runs over time, I will stop you and deduct points.
- Altogether, the defense of each project should last less than 25 minutes.
You need to prepare slides for your presentation. On the slides, present your data and some exploratory results, explain your objectives, present the different steps of your statistical analysis and comment on your results. Don’t forget to conclude. Use graphics to describe your data and convey findings. Don’t forget to number your slides.
- You will have to ask a question to a student from the same session who worked on a different project. At the beginning of each defense, I will give the names of the students who will ask questions and of the students they will question.