Projects
1 Objectives and constraints
The objectives of this project are:
- To form a 4-person group.
- To introduce and summarise a dataset (chosen from the list in Section 2).
- To apply and interpret the results of all six core methods studied in the course:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- Bootstrap
- Bagging
- Random Forests
- To explain, apply and interpret the results of one supplementary method selected from the list in Section 3.
- To submit a reproducible report before February 22.
- To present the work in an oral examination on March 2.
You must use either R or Python for your experiments.
Each choice must be justified:
- Why this dataset?
- Why are the core methods applicable?
- Why this supplementary method?
Begin with simple exploratory data analysis, then apply the methods and interpret the results in a statistically sound way.
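For illustration only, here is a minimal R sketch of that workflow; the file data/mydata.csv and the categorical column group are hypothetical placeholders, not a real dataset from Section 2:

```r
# Illustrative workflow sketch; "data/mydata.csv" and the column "group"
# are hypothetical placeholders.
library(randomForest)  # randomForest()

df  <- read.csv(file.path("data", "mydata.csv"))
num <- df[, sapply(df, is.numeric)]

# 1. Simple exploratory data analysis
summary(df)
pairs(num)

# 2. One core method: PCA on the standardized numerical variables
pca <- prcomp(num, scale. = TRUE)
summary(pca)  # proportion of variance explained per component

# 3. Another core method: a random forest predicting the categorical variable
df$group <- as.factor(df$group)
set.seed(1)   # random forests rely on random number generation
rf <- randomForest(group ~ ., data = df)
print(rf)     # out-of-bag error estimate
```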
Large language models may only be used after the code and report are complete, and strictly for sentence-level English correction.
2 Choice of dataset
Each group must download a multivariate dataset from the following list of sources:
- World Bank Open Data — global development indicators, socio-economic and environmental data.
- SNCF Open Data — datasets from the French national railway company (schedules, stations, traffic).
- Toulouse Métropole Open Data — urban and transport datasets for Toulouse metropolitan area.
- data.gouv.fr — central French government portal, hosting datasets from national and local administrations.
- Open Collectivités — datasets on French local authorities, public finances, and administrative structures.
- European Union Open Data Portal — datasets from EU institutions covering economy, environment, health, and more.
- UNdata — United Nations statistics and indicators across multiple global domains.
- UCI Machine Learning Repository — classic repository of datasets for ML research and teaching.
- OpenStreetMap — collaborative project creating free geographic data (maps, locations, infrastructure).
- Data.gov (USA) — US government open data portal covering health, energy, transportation, and more.
- European Environment Agency (EEA) Open Data — environmental data for Europe: air, water, climate, biodiversity.
- OECD Data — international statistics on economy, education, health, and more.
You may also use the following search tools, but take care to trace any dataset back to an official source and, if the data have been curated, to understand how:
- Google Dataset Search — search engine for datasets across the web.
- Kaggle Datasets — large repository of open datasets, often used for machine learning competitions.
- Your selected dataset must contain at least:
- 80 observations
- 6 numerical variables
- 1 categorical variable
- If the dataset contains only quantitative variables, you can create a categorical variable by discretizing a numerical one (see the sketch after this list).
- When there are time series, please restrict the study to one year.
- If cleaning or preprocessing is required, describe and justify it clearly.
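For instance, a minimal R sketch of such a discretization, assuming a hypothetical numerical column named income:

```r
# Creating a categorical variable from a numerical one; "income" is a
# hypothetical column name.
df$income_class <- cut(df$income,
                       breaks = quantile(df$income, probs = c(0, 1/3, 2/3, 1)),
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)
table(df$income_class)  # check the resulting class sizes
```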
Before starting the project, the chosen dataset must be approved by the teacher.
3 Supplementary methods
Each group must select one method from the following list:
| Method | Short description | Main reference | R / Python implementation | Difficulty |
|---|---|---|---|---|
| Kernel PCA | Nonlinear PCA using kernels | Schölkopf et al. 1998, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem” | R: kernlab • Python: sklearn | ★★★ |
| Sparse PCA | PCA with sparsity in loadings | Zou, Hastie, Tibshirani 2006, “Sparse Principal Component Analysis” | R: elasticnet / PMA • Python: sklearn | ★★★ |
| ICS (Invariant Coordinate Selection) | Outlier-oriented dimension reduction | Tyler et al. 2009, “Invariant Coordinate Selection” | R: ICS, ICSOutlier • Python: icspylab | ★★ |
| QDA | Quadratic class boundaries (nonlinear DA) | Friedman 1989, “Regularized Discriminant Analysis” | R: MASS::qda • Python: sklearn | ★ |
| Gaussian Mixture Models | Model-based clustering via mixtures | McLachlan & Peel 2000, “Finite Mixture Models” | R: mclust • Python: sklearn | ★★ |
| Spectral clustering | Graph Laplacian–based clustering | Ng, Jordan, Weiss 2002, “On Spectral Clustering” | R: kernlab • Python: sklearn | ★★★ |
| Multidimensional Scaling (MDS) | Distance-preserving embedding | Kruskal & Wish 1978, “Multidimensional Scaling” | R: cmdscale, smacof • Python: sklearn | ★★ |
| t-SNE | Nonlinear visualization preserving local neighborhoods | van der Maaten & Hinton 2008, “Visualizing Data using t-SNE” | R: Rtsne • Python: sklearn | ★★★ |
| UMAP | Fast manifold learning for visualization | McInnes, Healy, Melville 2018, “UMAP: Uniform Manifold Approximation and Projection” | R: umap • Python: umap-learn | ★★★ |
| Isolation Forest | Anomaly detection using random partitioning | Liu, Ting, Zhou 2008, “Isolation Forest” | R: isotree • Python: sklearn | ★★ |
| Gradient Boosting | Boosting with additive trees | Friedman 2001, “Greedy Function Approximation: Gradient Boosting” | R: gbm, xgboost • Python: sklearn | ★★★ |
| AdaBoost | Adaptive boosting for classification | Freund & Schapire 1997, “A Decision-Theoretic Generalization of On-Line Learning” | R: adabag • Python: sklearn | ★★ |
| Extra-Trees | Extremely randomized trees ensemble | Geurts, Ernst, Wehenkel 2006, “Extremely Randomized Trees” | R: extraTrees • Python: sklearn | ★★ |
| k-Nearest Neighbors (k-NN) | Classification by majority vote of neighbors | Cover & Hart 1967, “Nearest Neighbor Pattern Classification” | R: class::knn • Python: sklearn | ★ |
| Regularized LDA | Shrinkage version of LDA for small-sample problems | Friedman 1989, “Regularized Discriminant Analysis” | R: klaR::rda | ★ |
| k-means clustering | Partitioning clustering | MacQueen 1967 | R: kmeans • Python: sklearn | ★ |
| PAM / k-medoids | Robust alternative to k-means | Kaufman & Rousseeuw 1990 | R: cluster::pam • Python: pyclustering | ★ |
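As an indication of the effort involved, here is a minimal sketch of a call to one of these methods, Kernel PCA via kernlab; the input matrix num and the kernel parameter sigma = 0.1 are illustrative assumptions, not tuned recommendations:

```r
# Kernel PCA sketch with kernlab; sigma is illustrative, not tuned.
library(kernlab)

num <- scale(df[, sapply(df, is.numeric)])  # standardized numerical variables
kp  <- kpca(num, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)
plot(rotated(kp), xlab = "KPC1", ylab = "KPC2")  # observations in kernel PC space
```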
4 Project report format
- Report must be a Quarto (or R Markdown) project.
- Output must compile into one PDF ≤ 20 pages.
- Include:
- introduction and objectives
- dataset description and preprocessing
- contributions of group members
- methodology explanation
- interpretation of results
- conclusion and perspectives
- Code must be clean, organized and commented.
- Reproducibility is essential: if you move your project folder to a new computer, it should build the final project files without error. Common mistakes that break reproducibility include the following (a setup sketch addressing them follows this list):
- forgetting to embed all code and data files;
- caching results and using them instead of the data source;
- using an absolute path to the data file;
- calling install.packages() without checking whether the package is already installed;
- not setting the seed when a method based on random number generation is used.
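A minimal sketch of such a setup chunk, assuming a hypothetical data file data/mydata.csv inside the project folder and an illustrative package list:

```r
# Reproducible setup sketch; the file name and the package list are
# placeholders for whatever the project actually uses.
pkgs    <- c("MASS", "rpart", "randomForest")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)  # install only what is absent

set.seed(2025)  # fix the RNG for bootstrap, bagging and random forests

data_path <- file.path("data", "mydata.csv")  # relative path, never absolute
df <- read.csv(data_path)
```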
Contributions should be as detailed as possible (covering each part of the project and the oral presentation), as they will be used for grading. If you are unsure how to write them, you can refer to the CRediT taxonomy. If you used LLMs, you must include them in the contributions and explain precisely how and why they were used.
Remark. If you have difficulties uploading all the necessary sources to build your project on Moodle (for instance if the dataset is too large), please contact me in order to find a solution.
5 Oral examination
Each group will also complete an oral defense:
- 10–15 minutes presentation
- 5 minutes of questions on:
- dataset
- core methods
- chosen supplementary method
- interpretation and limitations