Projects

1 Objectives and constraints

The objectives of this project are:

  1. To form a 4-person group.
  2. To introduce and summarise a dataset (chosen from the list in Section 2).
  3. To apply and interpret the results of all six core methods studied in the course:
    • Principal Component Analysis (PCA)
    • Linear Discriminant Analysis (LDA)
    • Classification and Regression Trees (CART)
    • Bootstrap
    • Bagging
    • Random Forests
  4. To explain, apply and interpret the results of one supplementary method selected from the list in Section 3.
  5. To submit a reproducible report before February 22.
  6. To present the work in an oral examination on March 2.

You must use either R or Python for your experiments.

Important

Each choice must be justified:

  • Why this dataset?
  • Why are the core methods applicable?
  • Why this supplementary method?

Use simple exploratory data analysis first, then apply the methods, and interpret results in a statistically sound way.
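The workflow above (explore first, then apply methods, then interpret) can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the Wine data bundled with scikit-learn as a stand-in, which is not a substitute for an approved dataset from Section 2, and it covers only three of the six core methods (PCA, LDA, Random Forests). The names `SEED`, `scores`, `lda_acc` and `rf_acc` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # illustrative value; fixing it makes the random forest reproducible

# Stand-in data only (assumption): replace with your approved dataset.
X, y = load_wine(return_X_y=True)

# 1. Simple exploratory summary before any modelling.
print("shape:", X.shape)
print("class counts:", np.bincount(y))
print("smallest / largest feature std:", X.std(axis=0).min(), X.std(axis=0).max())

# 2. PCA on standardized variables (the scales differ a lot between features).
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pca.fit_transform(X)
explained = pca.named_steps["pca"].explained_variance_ratio_
print("variance explained by PC1, PC2:", explained)

# 3. LDA and a random forest, compared with 5-fold cross-validated accuracy.
lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
rf = RandomForestClassifier(n_estimators=200, random_state=SEED)
lda_acc = cross_val_score(lda, X, y, cv=5)
rf_acc = cross_val_score(rf, X, y, cv=5)
print("LDA accuracy: %.3f +/- %.3f" % (lda_acc.mean(), lda_acc.std()))
print("RF  accuracy: %.3f +/- %.3f" % (rf_acc.mean(), rf_acc.std()))
```

Reporting the spread across folds next to the mean is one simple way to avoid over-interpreting a small difference between two methods.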

Warning

Large language models may only be used after the code and report are complete, and strictly for sentence-level English correction.

2 Choice of dataset

Each group must download a multivariate dataset from the following list of sources:

  • World Bank Open Data — global development indicators, socio-economic and environmental data.
  • SNCF Open Data — datasets from the French national railway company (schedules, stations, traffic).
  • Toulouse Métropole Open Data — urban and transport datasets for Toulouse metropolitan area.
  • data.gouv.fr — central French government portal, hosting datasets from national and local administrations.
  • Open Collectivités — datasets on French local authorities, public finances, and administrative structures.
  • European Union Open Data Portal — datasets from EU institutions covering economy, environment, health, and more.
  • UNdata — United Nations statistics and indicators across multiple global domains.
  • UCI Machine Learning Repository — classic repository of datasets for ML research and teaching.
  • OpenStreetMap — collaborative project creating free geographic data (maps, locations, infrastructure).
  • Data.gov (USA) — US government open data portal covering health, energy, transportation, and more.
  • European Environment Agency (EEA) Open Data — environmental data for Europe: air, water, climate, biodiversity.
  • OECD Data — international statistics on economy, education, health, and more.

You can use these tools, but take care to choose an official dataset and, if the data has been curated, to understand how:

Criteria to ensure project feasibility
  • Your selected dataset must contain at least:
    • 80 observations
    • 6 numerical variables
    • 1 categorical variable
  • If the dataset contains only quantitative variables, you may be able to create a categorical variable by binning a numerical one.
  • If the data are time series, restrict the study to a single year.
  • If cleaning or preprocessing is required, describe and justify it clearly.
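The size criteria above can be checked programmatically before asking for approval. The sketch below is a minimal example using pandas. The helper `check_feasibility`, the column names and the synthetic data are all hypothetical. It treats every column that is neither numeric nor a datetime as categorical, which is an assumption you may need to adapt to your data.

```python
import numpy as np
import pandas as pd


def check_feasibility(df, min_obs=80, min_num=6, min_cat=1):
    """Count observations, numerical and categorical columns (hypothetical helper)."""
    n_obs = len(df)
    n_num = df.select_dtypes(include="number").shape[1]
    # Assumption: anything that is not numeric or datetime counts as categorical.
    n_cat = df.select_dtypes(exclude=["number", "datetime"]).shape[1]
    ok = n_obs >= min_obs and n_num >= min_num and n_cat >= min_cat
    return {"n_obs": n_obs, "n_numerical": n_num, "n_categorical": n_cat, "ok": ok}


# Synthetic stand-in: 100 observations, 6 numerical variables, no categories.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 6)), columns=[f"x{i}" for i in range(6)])
before = check_feasibility(df)
print(before)

# Create a categorical variable by cutting one numerical variable at its terciles.
df["x0_level"] = pd.qcut(df["x0"], q=3, labels=["low", "mid", "high"])
after = check_feasibility(df)
print(after)
```

Binning at quantiles gives roughly balanced classes. If you bin at other thresholds, justify them in the report, since the choice affects every supervised method applied afterwards.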
Important

Before starting the project, the chosen dataset must be approved by the teacher.

3 Supplementary methods

Each group must select one method from the following list:

| Method | Short description | Main reference | R / Python implementation | Difficulty |
|---|---|---|---|---|
| Kernel PCA | Nonlinear PCA using kernels | Schölkopf et al. 1998, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem” | R: kernlab • Python: sklearn | ★★★ |
| Sparse PCA | PCA with sparsity in loadings | Zou, Hastie, Tibshirani 2006, “Sparse Principal Component Analysis” | R: elasticnet / PMA • Python: sklearn | ★★★ |
| ICS (Invariant Coordinate Selection) | Outlier-oriented dimension reduction | Tyler et al. 2009, “Invariant Coordinate Selection” | R: ICS, ICSOutlier • Python: icspylab | ★★ |
| QDA | Quadratic class boundaries (nonlinear DA) | Friedman 1989, “Regularized Discriminant Analysis” | R: MASS::qda • Python: sklearn | |
| Gaussian Mixture Models | Model-based clustering via mixtures | McLachlan & Peel 2000, “Finite Mixture Models” | R: mclust • Python: sklearn | ★★ |
| Spectral clustering | Graph Laplacian–based clustering | Ng, Jordan, Weiss 2002, “On Spectral Clustering” | R: kernlab • Python: sklearn | ★★★ |
| Multidimensional Scaling (MDS) | Distance-preserving embedding | Kruskal & Wish 1978, “Multidimensional Scaling” | R: cmdscale, smacof • Python: sklearn | ★★ |
| t-SNE | Nonlinear visualization preserving local neighborhoods | van der Maaten & Hinton 2008, “Visualizing Data using t-SNE” | R: Rtsne • Python: sklearn | ★★★ |
| UMAP | Fast manifold learning for visualization | McInnes, Healy, Melville 2018, “UMAP: Uniform Manifold Approximation and Projection” | R: umap • Python: umap-learn | ★★★ |
| Isolation Forest | Anomaly detection using random partitioning | Liu, Ting, Zhou 2008, “Isolation Forest” | R: isotree • Python: sklearn | ★★ |
| Gradient Boosting | Boosting with additive trees | Friedman 2001, “Greedy Function Approximation: Gradient Boosting” | R: gbm, xgboost • Python: sklearn | ★★★ |
| AdaBoost | Adaptive boosting for classification | Freund & Schapire 1997, “A Decision-Theoretic Generalization of On-Line Learning” | R: adabag • Python: sklearn | ★★ |
| Extra-Trees | Extremely randomized trees ensemble | Geurts, Ernst, Wehenkel 2006, “Extremely Randomized Trees” | R: extraTrees • Python: sklearn | ★★ |
| k-Nearest Neighbors (k-NN) | Classification by majority vote of neighbors | Cover & Hart 1967, “Nearest Neighbor Pattern Classification” | R: class::knn • Python: sklearn | |
| Regularized LDA | Shrinkage version of LDA for small-sample problems | Friedman 1989, “Regularized Discriminant Analysis” | R: klaR::rda | |
| k-means clustering | Partitioning clustering | MacQueen 1967 | R: kmeans • Python: sklearn | |
| PAM / k-medoids | Robust alternative to k-means | Kaufman & Rousseeuw 1990 | R: cluster::pam • Python: pyclustering | |

4 Project report format

  • The report must be a Quarto (or R Markdown) project.
  • The output must compile into a single PDF of at most 20 pages.
  • Include:
    • introduction and objectives
    • dataset description and preprocessing
    • contributions of group members
    • methodology explanation
    • interpretation of results
    • conclusion and perspectives
  • Code must be clean, organized and commented.
  • Reproducibility is essential: if you copy your project folder to a fresh computer, it should build the final project files without error.
Some common traps breaking reproducibility
  • Forgetting to embed all code and data files.
  • Caching results and reusing them instead of recomputing from the data source.
  • Using an absolute path to the data file.
  • Using install.packages without checking whether the package is already installed.
  • Not setting the seed when a method based on random number generation is used.
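For groups working in Python, three of the traps above (absolute paths, unchecked package installation, unset seeds) can be avoided along the following lines. This is a sketch using only the standard library. The file name `data/dataset.csv`, the seed value and the helper names are hypothetical. In R, the analogous tools would be relative paths, `requireNamespace()` and `set.seed()`.

```python
import importlib.util
import random
from pathlib import Path

SEED = 2024  # hypothetical value; any fixed integer works

# Relative path (hypothetical file name): resolved from the project folder,
# so it still works after the folder is copied to another machine.
DATA_FILE = Path("data") / "dataset.csv"


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def seeded_sample(seed, k=5):
    """Draw k integers from a generator initialised with a fixed seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(k)]


print("absolute path?", DATA_FILE.is_absolute())
print("missing:", missing_packages(["json", "surely_not_installed_pkg_xyz"]))
print("same draws with same seed?", seeded_sample(SEED) == seeded_sample(SEED))
```

Checking for missing packages at the top of the report and stopping with a clear message is usually safer than installing them silently during the build.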
Important

Contributions should be as detailed as possible (for each part of your project and oral presentation), as they will be used for grading. If you are unsure how to write them, you can refer to the CRediT taxonomy. If you used LLMs, you must list them in the contributions and explain precisely how and why they were used.

Remark

If you have difficulty uploading to Moodle all the sources needed to build your project (for instance if the dataset is too large), please contact me so that we can find a solution.

5 Oral examination

Each group will also complete an oral defense:

  • 10–15 minute presentation
  • 5 minutes of questions on:
    • dataset
    • core methods
    • chosen supplementary method
    • interpretation and limitations