Projects

Author: Camille Mondon

Affiliation: Toulouse School of Economics

1 Objectives and constraints

The objectives of this project are:

To form a 4-person group.
To introduce and summarise a dataset (chosen from the list in Section 2).
To apply and interpret the results of all six core methods studied in the course:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- Bootstrap
- Bagging
- Random Forests
To explain, apply and interpret the results of one supplementary method selected from the list in Section 3.
To submit a reproducible report before February 22.
To present the work in an oral examination on March 2.

You must use either R or Python for your experiments.

Important

Each choice must be justified:

Why this dataset?
Why are the core methods applicable?
Why this supplementary method?

Use simple exploratory data analysis first, then apply the methods, and interpret results in a statistically sound way.
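
As a rough illustration of this workflow, the Python sketch below runs the six core methods once on synthetic data. Everything specific in it is an assumption made for the example rather than a course requirement: the data generated by `make_classification`, the seed, the 100 trees, the 5-fold cross-validation, and the choice of the mean of the first variable as the bootstrapped statistic. It shows only the mechanics; the exploratory analysis, justification and interpretation still have to be done on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

SEED = 0  # arbitrary fixed seed so that the run can be repeated

# Synthetic stand-in for a real dataset: 200 observations,
# 6 numerical variables and a 3-class categorical response.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=SEED,
)

# PCA on standardised variables: share of variance per component.
pca = make_pipeline(StandardScaler(), PCA()).fit(X)
explained = pca[-1].explained_variance_ratio_

# LDA, CART, bagging and random forests compared by cross-validated accuracy.
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=SEED),
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100, random_state=SEED
    ),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=SEED),
}
accuracy = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in classifiers.items()
}

# Bootstrap: percentile interval for the mean of the first variable.
rng = np.random.default_rng(SEED)
idx = rng.integers(0, len(X), size=(1000, len(X)))  # resample rows with replacement
boot_means = X[idx, 0].mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(np.round(explained, 3))
print({name: round(score, 3) for name, score in accuracy.items()})
print(round(ci_low, 3), round(ci_high, 3))
```

Cross-validated accuracy is used here only to put the four classifiers on a common scale; a held-out test set or another criterion may suit a given dataset better.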

Warning

Large language models may only be used after the code and report are complete, and strictly for sentence-level English correction.

2 Choice of dataset

Each group must download a multivariate dataset from the following list of sources:

World Bank Open Data — global development indicators, socio-economic and environmental data.
SNCF Open Data — datasets from the French national railway company (schedules, stations, traffic).
Toulouse Métropole Open Data — urban and transport datasets for the Toulouse metropolitan area.
data.gouv.fr — central French government portal, hosting datasets from national and local administrations.
Open Collectivités — datasets on French local authorities, public finances, and administrative structures.
European Union Open Data Portal — datasets from EU institutions covering economy, environment, health, and more.
UNdata — United Nations statistics and indicators across multiple global domains.
UCI Machine Learning Repository — classic repository of datasets for ML research and teaching.
OpenStreetMap — collaborative project creating free geographic data (maps, locations, infrastructure).
Data.gov (USA) — US government open data portal covering health, energy, transportation, and more.
European Environment Agency (EEA) Open Data — environmental data for Europe: air, water, climate, biodiversity.
OECD Data — international statistics on economy, education, health, and more.

You may also use the following tools, but take care to pick an official dataset and, if it has been curated, to understand how the curation was done:

Google Dataset Search — search engine for datasets across the web.
Kaggle Datasets — large repository of open datasets, often used for machine learning competitions.

Criteria to ensure project feasibility

Your selected dataset must contain at least:
- 80 observations
- 6 numerical variables
- 1 categorical variable
If the dataset contains only quantitative variables, you may be able to create categories from a numerical variable.
If the data are time series, restrict the study to a single year.
If cleaning or preprocessing is required, describe and justify it clearly.
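
The criteria above can be checked programmatically before asking for approval. The following Python sketch assumes the data are already loaded in a pandas `DataFrame`; the helper `check_feasibility`, the synthetic columns `x1` to `x7`, and the cut into terciles are hypothetical choices for illustration. It also shows one possible way to derive a categorical variable from a numerical one.

```python
import numpy as np
import pandas as pd


def check_feasibility(df, min_rows=80, min_numeric=6, min_categorical=1):
    """Count rows, numerical columns and non-numerical columns of `df`."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    # Every non-numerical column is counted as a candidate categorical variable.
    n_categorical = df.select_dtypes(exclude="number").shape[1]
    return {
        "n_rows": len(df),
        "n_numeric": n_numeric,
        "n_categorical": n_categorical,
        "feasible": (
            len(df) >= min_rows
            and n_numeric >= min_numeric
            and n_categorical >= min_categorical
        ),
    }


# Hypothetical all-numerical dataset: 100 observations, 7 numerical variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(100, 7)), columns=[f"x{i}" for i in range(1, 8)]
)
before = check_feasibility(df)  # no categorical variable yet

# Create categories by cutting x1 into terciles, then drop x1 so that the
# new class is not trivially predicted from the variable that defines it.
df["x1_level"] = pd.qcut(df["x1"], q=3, labels=["low", "medium", "high"])
df = df.drop(columns="x1")
after = check_feasibility(df)

print(before)
print(after)
```

Terciles are only one option; cut points with a substantive meaning for the data are usually easier to justify in the report.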

Important

Before starting the project, the chosen dataset must be approved by the teacher.

3 Supplementary methods

Each group must select one method from the following list:

Method	Short description	Main reference	R / Python implementation	Difficulty
Kernel PCA	Nonlinear PCA using kernels	Schölkopf et al. 1998 — “Nonlinear Component Analysis as a Kernel Eigenvalue Problem”	R: `kernlab` • Python: `sklearn`	★★★
Sparse PCA	PCA with sparsity in loadings	Zou, Hastie, Tibshirani 2006 — “Sparse Principal Component Analysis”	R: `elasticnet` / `PMA` • Python: `sklearn`	★★★
ICS (Invariant Coordinate Selection)	Outlier-oriented dimension reduction	Tyler et al. 2009 — “Invariant Coordinate Selection”	R: `ICS`, `ICSOutlier` • Python: `icspylab`	★★
QDA	Quadratic class boundaries (nonlinear DA)	Friedman 1989 — “Regularized Discriminant Analysis”	R: `MASS::qda` • Python: `sklearn`	★
Gaussian Mixture Models	Model-based clustering via mixtures	McLachlan & Peel 2000 — “Finite Mixture Models”	R: `mclust` • Python: `sklearn`	★★
Spectral clustering	Graph Laplacian–based clustering	Ng, Jordan, Weiss 2002 — “On Spectral Clustering”	R: `kernlab` • Python: `sklearn`	★★★
Multidimensional Scaling (MDS)	Distance-preserving embedding	Kruskal & Wish 1978 — “Multidimensional Scaling”	R: `cmdscale`, `smacof` • Python: `sklearn`	★★
t-SNE	Nonlinear visualization preserving local neighborhoods	van der Maaten & Hinton 2008 — “Visualizing Data using t-SNE”	R: `Rtsne` • Python: `sklearn`	★★★
UMAP	Fast manifold learning for visualization	McInnes, Healy, Melville 2018 — “UMAP: Uniform Manifold Approximation and Projection”	R: `umap` • Python: `umap-learn`	★★★
Isolation Forest	Anomaly detection using random partitioning	Liu, Ting, Zhou 2008 — “Isolation Forest”	R: `isotree` • Python: `sklearn`	★★
Gradient Boosting	Boosting with additive trees	Friedman 2001 — “Greedy Function Approximation: A Gradient Boosting Machine”	R: `gbm`, `xgboost` • Python: `sklearn`	★★★
AdaBoost	Adaptive boosting for classification	Freund & Schapire 1997 — “A Decision-Theoretic Generalization of On-Line Learning”	R: `adabag` • Python: `sklearn`	★★
Extra-Trees	Extremely randomized trees ensemble	Geurts, Ernst, Wehenkel 2006 — “Extremely Randomized Trees”	R: `extraTrees` • Python: `sklearn`	★★
k-Nearest Neighbors (k-NN)	Classification by majority vote of neighbors	Cover & Hart 1967 — “Nearest Neighbor Pattern Classification”	R: `class::knn` • Python: `sklearn`	★
Regularized LDA	Shrinkage version of LDA for small-sample problems	Friedman 1989 — “Regularized Discriminant Analysis”	R: `klaR::rda`	★
k-means clustering	Partitioning clustering	MacQueen 1967	R: `kmeans` • Python: `sklearn`	★
PAM / k-medoids	Robust alternative to k-means	Kaufman & Rousseeuw 1990	R: `cluster::pam` • Python: `pyclustering`	★

4 Project report format

The report must be a Quarto (or R Markdown) project.
The output must compile into a single PDF of at most 20 pages.
Include:
- introduction and objectives
- dataset description and preprocessing
- contributions of group members
- methodology explanation
- interpretation of results
- conclusion and perspectives
Code must be clean, organized and commented.
Reproducibility is essential: if you copy your project folder to a fresh computer, it should build the final report without error.

Some common traps breaking reproducibility

Forgetting to embed all code and data files.
Caching results and reusing them instead of recomputing from the data source.
Using an absolute path to the data file.
Using install.packages without checking whether the package is already installed.
Not setting the seed when a method based on random number generation is used.
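
For groups working in Python, three of these traps can be avoided along the following lines. This is a minimal sketch using only the standard library; the file name `data/dataset.csv`, the seed value and the helper names `missing_packages` and `bootstrap_means` are hypothetical. The package check is a Python analogue of testing whether a package is installed before calling `install.packages` in R.

```python
import importlib.util
import random
from pathlib import Path

# Relative path: resolved from the project folder, so it still works
# after the folder is copied to another computer.
DATA_PATH = Path("data") / "dataset.csv"  # hypothetical file name

SEED = 2024  # arbitrary value; what matters is that it is fixed and reported


def missing_packages(names):
    """Return the packages in `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def bootstrap_means(values, n_resamples, seed):
    """Bootstrap the sample mean with an explicitly seeded generator."""
    rng = random.Random(seed)  # local generator, independent of global state
    values = list(values)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)]


# Report missing dependencies instead of reinstalling them on every run.
print(missing_packages(["json", "surely_not_a_real_package_xyz"]))

# The same seed gives the same resamples, hence the same results.
first = bootstrap_means(range(10), n_resamples=50, seed=SEED)
second = bootstrap_means(range(10), n_resamples=50, seed=SEED)
print(first == second)
```

Passing the seed explicitly to each function that draws random numbers keeps the results independent of the order in which the report's code chunks are run.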

Important

Contributions should be as detailed as possible (for each part of your project and oral presentation), as they will be used for grading. If you are unsure how to write them, you can refer to CRediT (Contributor Roles Taxonomy). If you used LLMs, you must include them in the contributions and explain precisely how and why they were used.

Remark

If you have difficulties uploading all the necessary sources to build your project on Moodle (for instance if the dataset is too large), please contact me in order to find a solution.

5 Oral examination

Each group will also complete an oral defense:

10–15 minute presentation
5 minutes of questions on:
- dataset
- core methods
- chosen supplementary method
- interpretation and limitations