Introduction

Published

January 5, 2026

1 Big Data

1.1 Definition

Big Data refers to datasets whose size, speed, or complexity make them difficult to collect, store, process, or analyze with traditional tools. The challenge is primarily computational, but it has strong implications for statistics and machine learning.

Big Data is commonly summarized by the 3 Vs:

  • Volume: massive number of observations \(n\)
  • Velocity: speed at which new data is generated
  • Variety: different types/structures of data from various origins.

Other Vs are sometimes mentioned, although they are not discussed in this course:

  • Veracity: uncertainty, noise, and data quality issues
  • Value: impact and value added by a potential data analysis

1.2 Volume

1.2.1 The Global Datasphere

  • Global data production is measured in zettabytes (\(10^{21}\) bytes = 1 billion terabytes).
  • By the end of 2024, the global volume of data was estimated at about 21 terabytes per person on average.
  • The volume of data created worldwide roughly doubles every 2–3 years (Reinsel, Gantz, and Rydning 2018).
Code
# Years
years <- 2010:2029

# Data volume in Zettabytes
datasphere_zb <- c(
  2.0, 5.0, 6.5, 9.0, 12.5, 15.5, 18.0, 26.0, 33.0, 41.0,
  64.2, 79.0, 97.0, 120.0, 173.4, 181.0, 221.0, 291.0, 394.0, 527.5
)

# Colors: historical (<=2024) blue, projections (>2024) orange
colors <- ifelse(years <= 2024, "blue", "orange")
Code
# Plotly bar chart of the global datasphere
library(plotly)

plot_ly(
  x = ~years,
  y = ~datasphere_zb,
  type = "bar",
  marker = list(color = colors)
) %>%
  layout(
    xaxis = list(title = "Year"),
    yaxis = list(title = "Data Volume (Zettabytes)")
  )
Figure 1: Volume of data or information created, captured, copied, and consumed worldwide from 2010 to 2029. Source: Taylor (2025).

1.2.2 AI and Training Data Explosion

  • The size of datasets used to train language models doubles approximately every six months.
  • Modern AI models (e.g. large language models) are trained on hundreds of gigabytes to terabytes of data, a scale unseen in classical statistics.
Figure 2: Growth of AI training dataset sizes. Source: Rahman and Owen (2024).

1.3 Velocity

1.3.1 The rising frequency of data generation

The global population is increasingly online, which means that every minute, a huge amount of data is generated on the internet.

The following figure illustrates the amount of data generated every minute of every day:

  • emails sent,
  • social media posts,
  • video streaming,
  • online transactions.
Figure 3: Data generated every minute on the internet. Source: Domo (2024).

1.3.2 An increasing number of connected devices

  • The Internet of Things (IoT) is defined in Recommendation ITU-T Y.2060 (06/2012) as a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.
  • The IoT consists of smart devices connected over a network that operate without relying on human interaction (e.g. sensors, smart appliances, and connected infrastructure).
  • As of 2025, the number of connected IoT devices reached 21.1 billion globally (excluding consumer devices like PCs, laptops, cellphones, etc.)
  • IoT devices generate continuous streams of data, often at very high frequency.
Figure 4: Global IoT market forecast (in billions of connected IoT devices). Source: Sinha (2025).

1.4 Variety

1.4.1 Where does Big Data come from?

  • Digital platforms: clicks, searches, recommendations
  • Finance: transactions, high‑frequency trading
  • Public administration: tax, social security, census
  • Science: satellites, genomics, climate modeling
  • Industry: sensors, supply chains, logistics

Big Data is often collected for operational purposes, not for statistical inference.

Figure 5: 2018 Enterprise Datasphere by Industry (1 EB = 1 million TB). Source: Reinsel, Gantz, and Rydning (2018).

Data is often private, but some platforms provide public access to large datasets as part of the Open Data movement.

1.4.2 Complex data structures

In this course, we will often assume that each of the \(n\) observations can be represented as a vector
\[ X_i \in \mathbb{R}^p. \] In other words, a fixed number of numerical variables is measured for each individual.

However, many modern datasets do not fit into this framework. Instead of vectors, observations may be:

  • functions
  • probability distributions
  • networks or graphs
  • combinations of heterogeneous objects.

Such data are referred to as complex data.

Most classical methods covered in this course (PCA, LDA, CART, Random Forests) assume that we observe independent random vectors in a finite-dimensional Euclidean space.

Complex data invalidate this assumption. We need to understand their structure before applying any statistical or machine learning method.

This leads to new representations, metrics and notions of variability that are beyond the scope of this course.

1.4.2.1 Functional Data

In functional data analysis (FDA), each observation is a function: \[ f_i(t), \quad t \in \mathcal{T}, \] where \(\mathcal{T}\) is typically time, space, or wavelength.

Examples:

  • daily electricity consumption curves,
  • intraday financial price curves,
  • yield curves in economics,
  • temperature curves over a year.

Each observation is not a vector, but an entire curve.

Key challenges:

  • Infinite-dimensional nature of the data
  • Smoothness and regularity assumptions
  • Temporal dependence within each observation.

Typical tools:

  • Functional PCA
  • Basis expansions (Fourier, splines, wavelets)
  • Functional regression

Functional data have become increasingly common with the rise of IoT devices that monitor variables at very high frequency; such measurements are often better analyzed as continuous curves than as isolated numerical values.
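
To make basis expansions and functional PCA more concrete, here is a minimal R sketch on simulated curves; the grid, signal, and noise level are illustrative assumptions rather than course data, and the "functional" PCA below is simply an ordinary PCA applied to the smoothed, discretized curves.

Code
# Simulated curves observed on a discrete grid (illustrative data only)
set.seed(1)
t_grid <- seq(0, 1, length.out = 100)   # observation grid
n <- 20                                 # number of curves
curves <- t(sapply(1:n, function(i) {
  sin(2 * pi * t_grid) + rnorm(1, sd = 0.5) * cos(2 * pi * t_grid) +
    rnorm(length(t_grid), sd = 0.1)     # smooth signal plus measurement noise
}))

# Smooth each curve with a spline (one simple form of basis expansion)
smoothed <- t(apply(curves, 1, function(y) smooth.spline(t_grid, y)$y))

# Crude functional PCA: ordinary PCA on the smoothed, discretized curves
fpca <- prcomp(smoothed, center = TRUE)
summary(fpca)$importance[, 1:3]         # variance explained by the first components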

1.4.2.2 Distributional Data

In distributional data analysis (DDA), each observation is a probability distribution: \[ \pi_i (x), \quad x \in \mathcal X \] (taking nonnegative values and integrating to 1) rather than a single realization.

Examples:

  • income distributions across regions,
  • age distributions in populations,
  • uncertainty-aware measurements,
  • empirical distributions of prices or returns,
  • temperature distributions over a year.

Key challenges:

  • Standard distances (Euclidean) are meaningless
  • How to average distributions?

Typical tools:

  • Wasserstein (optimal transport) distances
  • Fréchet means of distributions
  • Aitchison structure (similar to compositional data).
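
As a minimal illustration of the Wasserstein idea in one dimension, the sketch below compares two simulated empirical distributions using base R only; it relies on the fact that, for univariate distributions with equal sample sizes, the \(W_1\) distance reduces to matching sorted observations (the samples themselves are illustrative assumptions).

Code
# Two empirical distributions (simulated for illustration)
set.seed(1)
x <- rnorm(500, mean = 0, sd = 1)
y <- rnorm(500, mean = 2, sd = 1.5)

# In 1-D, W_1 is the L1 distance between quantile functions;
# with equal sample sizes this amounts to matching sorted observations
w1 <- mean(abs(sort(x) - sort(y)))
w1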

1.4.2.3 Graph and Network Data

In graph data, each observation is a network: \[ G_i = (V_i, E_i), \] with nodes \(V_i\) and edges \(E_i\).

Examples: social networks, trade networks.

Key challenges:

  • Size and topology vary across graphs
  • Dependence structure is intrinsic
  • Graphs encode relational information, not just attributes.

Typical tools:

  • Graph Laplacians and spectral methods
  • Network embeddings
  • Graph neural networks.
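
The following base-R sketch shows what a graph Laplacian and its spectrum look like for a small undirected graph; the adjacency matrix is an arbitrary toy example, not data from the course.

Code
# Toy undirected graph on 5 nodes (adjacency matrix chosen for illustration)
A <- matrix(c(0, 1, 1, 0, 0,
              1, 0, 1, 0, 0,
              1, 1, 0, 1, 0,
              0, 0, 1, 0, 1,
              0, 0, 0, 1, 0), nrow = 5, byrow = TRUE)
D <- diag(rowSums(A))   # degree matrix
L <- D - A              # unnormalized graph Laplacian
eigen(L)$values         # the number of (near-)zero eigenvalues equals
                        # the number of connected components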

1.4.2.4 Other kinds of complex data

  • Textual Data
    • Documents, articles, social media posts
    • Represented via embeddings or topic models
    • High-dimensional, sparse, and structured
  • Image and Signal Data
    • Images as high-dimensional arrays
    • Strong spatial dependence
    • Often low-dimensional structure hidden in high dimensions
  • Longitudinal and Panel Data
    • Repeated measurements over time
    • Correlation across time and individuals
    • Widely used in economics and social sciences.

2 High‑Dimensional Data

2.1 Definition

A dataset is high‑dimensional when the number of variables \(p\) is large relative to the number of observations \(n\): \[ p \ge n \quad \text{or} \quad p \gg n. \]

This is often summarized as ‘large \(p\), small \(n\)’ (Johnstone and Titterington 2009).

2.2 Examples

  • Gene expression: thousands of genes, few patients
  • Economic indicators: many correlated macro or firm‑level variables

Very often, complex data is also high-dimensional:

  • Text analysis: words as variables
  • Images: pixels as variables.

2.3 Statistical consequences

  • Observations are all approximately equidistant (Giraud 2021)
  • Sample covariance matrices are singular when \(p > n\)
  • Estimators have high variance
  • Overfitting (i.e. performing well on training data but poorly on unseen data) becomes unavoidable without constraints
  • Classical asymptotics (with \(p\) fixed and \(n \to \infty\)) no longer apply.

These issues motivate dimension reduction and regularization (Bühlmann and Van De Geer 2011).
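
Two of the consequences listed above, distance concentration and singular covariance matrices, can be seen directly in a small simulation; the values of \(n\) and \(p\) below are arbitrary illustrative choices.

Code
# Simulated Gaussian data with many more variables than observations
set.seed(1)
n <- 30; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)

# Pairwise distances concentrate: the max/min ratio is close to 1
d <- dist(X)
c(min = min(d), max = max(d), ratio = max(d) / min(d))

# The sample covariance matrix is singular when p > n: its rank is at most n - 1
S <- cov(X)
qr(S)$rank   # far below p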

2.4 Big Data vs High‑Dimensional Data

| Aspect | Big Data | High-Dimensional Data |
|---|---|---|
| Main challenge | Computation | Statistical inference |
| Typical \(n\) | Very large | Small to moderate |
| Typical \(p\) | Moderate | Large |
| Key risk | Scalability | Overfitting |
| Typical tools | Distributed systems | PCA, sparsity |

Some methods in this course remain applicable beyond the strict \(n \gg p\) case; others become difficult to interpret as \(p\) grows.

3 Basics in data exploration

Definition 1 (Observations and variables)  

  • A unit (or observation, or case) is the basic entity on which measurements are made (e.g. students, firms, countries, years).

  • A variable is a characteristic measured on the units (e.g. gender, age, number of employees, gross national product, unemployment rate).

  • A variable is either categorical (qualitative), e.g. country of birth or gender, or numerical (quantitative), e.g. GNP or age.

3.1 Univariate statistical analysis

Univariate statistical analysis focuses on the description of a single variable (\(p=1\)), through a finite sample of \(n\) observations.

3.1.1 Categorical variables

Graphical representations: Graphical tools are part of data analysis as they provide intuition about the data set before formal analysis.

  • Bar chart: displays frequencies or proportions of categories.
  • Pie chart: visualizes proportions (mainly descriptive).

Numerical indicators: we usually work with the relative frequencies of the categories; the mode is the most frequent category.
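
A minimal base-R sketch of these tools, on a made-up categorical variable:

Code
# Made-up categorical variable (country codes chosen for illustration)
x <- factor(c("FR", "DE", "FR", "IT", "FR", "DE"))
table(x)                              # absolute frequencies
prop.table(table(x))                  # relative frequencies; the mode is the most frequent level
barplot(prop.table(table(x)), ylab = "Relative frequency")   # bar chart
pie(table(x))                         # pie chart (mainly descriptive)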

3.1.2 Numerical variables

Graphical representations:

  • Histogram: approximates the distribution of the variable.
  • Boxplot: summarizes location, dispersion, and outliers.
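
The same two plots can be produced in base R; the data below are simulated and the distribution is an arbitrary choice.

Code
# Simulated numerical variable
set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)
hist(x, main = "", xlab = "x")    # histogram: approximates the distribution
boxplot(x, horizontal = TRUE)     # boxplot: location, dispersion, outliers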

3.1.2.1 Mean (location)

Definition 2 (Sample mean) The sample mean is defined as: \[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]

It summarizes the central tendency of the observed data.

Proposition 1  

  • \(\bar X\) is an unbiased estimator of the expected value \(\mu = \mathbb{E}[X]\): \[ \mathbb{E}[\bar X] = \mu \]
  • \(\bar X\) is consistent: \[ \bar X \xrightarrow[n \to \infty]{} \mu \]
  • Its variance decreases with the sample size: \[ \operatorname{Var}(\bar X) = \frac{\operatorname{Var}(X)}{n} \]
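
These properties are easy to check by Monte Carlo simulation; in the sketch below the distribution, sample size, and number of replications are arbitrary illustrative choices.

Code
# Monte Carlo check of E[xbar] = mu and Var(xbar) = Var(X) / n
set.seed(1)
mu <- 5; sigma2 <- 4; n <- 50
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sqrt(sigma2))))
c(mean_of_xbar = mean(xbar),     # close to mu = 5
  var_of_xbar  = var(xbar),      # close to sigma2 / n = 0.08
  theory       = sigma2 / n)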

3.1.2.2 Variance (dispersion)

Definition 3 (Sample variance) The unbiased sample variance is: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \]

It measures the dispersion of the data around the sample mean.

Definition 4 (Population variance) The corresponding population quantity is: \[ \operatorname{Var}(X) = \mathbb E \big[(X - \mathbb EX)^2\big] \]

Proposition 2  

  • \(S^2\) is an unbiased estimator of \(\operatorname{Var}(X)\): \[ \mathbb{E}[S^2] = \operatorname{Var}(X) \]
  • \(S^2\) is consistent as \(n \to \infty\).
  • The variance is sensitive to extreme observations.

3.1.2.3 Standard deviation (scale)

Definition 5 (Sample standard deviation) The sample standard deviation is: \[ S = \sqrt{S^2} \]

It is expressed in the same units as the variable and is often easier to interpret than the variance.

Definition 6 (Population standard deviation) The corresponding population quantity is: \[ \sigma(X) = \sqrt{\operatorname{Var}(X)} \]

Proposition 3  

  • \(S\) is a biased estimator of \(\sigma\).
  • \(S\) is nevertheless consistent.
  • Like the variance, it is sensitive to outliers.
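
A small simulation illustrates the contrast between \(S^2\) (unbiased) and \(S\) (biased); the normal distribution and the small sample size \(n = 5\) are chosen only to make the bias of \(S\) visible.

Code
# Monte Carlo comparison of S^2 and S as estimators (illustrative settings)
set.seed(1)
sigma <- 2; n <- 5
sims <- replicate(10000, {
  x <- rnorm(n, sd = sigma)
  c(S2 = var(x), S = sd(x))
})
rowMeans(sims)                  # mean of S2 is close to sigma^2 = 4,
                                # mean of S falls below sigma = 2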

3.2 Bivariate statistical analysis

Bivariate statistical analysis studies the relationship between two variables observed on the same individuals.
The appropriate tools depend on whether the variables are categorical or numerical.

3.2.1 Graphical representations

  • Two categorical variables: stacked or juxtaposed bar charts are used to compare joint and conditional distributions.

  • Two numerical variables: a scatterplot visualizes the form, strength, and direction of the relationship.

  • One numerical and one categorical variable: parallel boxplots compare the distribution of the numerical variable across categories.

Graphical representations are an essential first step to detect association, trends, or anomalies.
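
A minimal base-R sketch of two of these plots, using simulated data and an artificial grouping variable:

Code
# Simulated data: one grouping factor, two numerical variables
set.seed(1)
g <- factor(sample(c("A", "B"), 100, replace = TRUE))
x <- rnorm(100)
y <- ifelse(g == "A", 1, 2) + 0.8 * x + rnorm(100, sd = 0.5)

plot(x, y, col = as.integer(g))   # scatterplot of two numerical variables, colored by group
boxplot(y ~ g)                    # parallel boxplots: numerical variable across categories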

3.2.2 Two categorical variables

The relationship between two categorical variables is assessed using the chi-square test of independence, which compares observed frequencies to those expected under independence.

This test evaluates whether the variables are statistically independent.
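
In R, the test is available through chisq.test(); the contingency table below is made up purely for illustration.

Code
# Chi-square test of independence on a small (made-up) contingency table
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
chisq.test(tab)   # compares observed counts to counts expected under independence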

3.2.3 Two numerical variables

3.2.3.1 Covariance

Definition 7 (Sample covariance) Given observations \((X_i, Y_i)\) for \(i = 1, \dots, n\), the sample covariance is: \[ \operatorname{cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y). \]

Covariance measures the joint variability of the two variables but depends on their units.

3.2.3.2 Correlation coefficient

Definition 8 (Sample correlation) The Pearson correlation coefficient is defined as: \[ \rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{S_X S_Y}, \] where \(S_X\) and \(S_Y\) are the sample standard deviations of \(X\) and \(Y\).

Correlation is a standardized, unit-free measure of linear dependence.
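
The unit-dependence of the covariance and the invariance of the correlation can be checked on simulated data; the scale change below (multiplying \(X\) by 100) is an arbitrary illustration.

Code
# Covariance changes with the units, correlation does not (simulated data)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.5)
c(cov = cov(x, y), cor = cor(x, y))
c(cov = cov(100 * x, y), cor = cor(100 * x, y))   # rescaling x inflates cov, leaves cor unchanged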

Remark.

  • \(\rho(X,Y) \in [-1,1]\),
  • \(|\rho|\) measures the strength of linear association,
  • the sign indicates the direction of the relationship,
  • \(\rho(X,Y) = 0\) does not necessarily imply independence.
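
The last point can be illustrated with the classical construction \(Y = X^2\) for \(X\) symmetric around zero: \(Y\) is a deterministic function of \(X\), yet their correlation is close to zero.

Code
# Zero correlation without independence (standard illustrative construction)
set.seed(1)
x <- rnorm(10000)
y <- x^2            # y depends deterministically on x, so they are not independent
cor(x, y)           # nevertheless close to 0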

4 Statistics, ML, Data Analysis, Data Mining, or AI?

4.1 A question of vocabulary

There is no real consensus on the vocabulary around data science, but we can characterize its subfields in terms of their respective goals and tools.

| Field | Core Question | Characteristics |
|---|---|---|
| Statistics | What can we infer about the world? | Often model-based (parametric or not). Emphasis on bias, variance, confidence intervals. |
| Machine Learning | How well can we predict? | Algorithmic and optimization-driven. Focus on generalization performance. Often trades interpretability for accuracy. |
| Data Analysis | What does the data show? | Exploratory and descriptive. Visualization, summaries, diagnostics. |
| Data Mining | What patterns exist? | Pattern discovery in large datasets. Clustering, association rules, anomaly detection. |
| Artificial Intelligence | How can machines learn from data? | Broader field encompassing ML algorithms and data-driven decision-making. Reasoning, pattern recognition, or adaptation. |

Modern data science blends all of these perspectives into one field (James et al. 2021).

There is an important distinction between:

  • Unsupervised methods: no output variable (no labeled training set), such as PCA and clustering.
  • Supervised methods: an output variable is observed on a training set, such as linear or logistic regression, discriminant analysis, decision trees, and neural networks. If the output variable is categorical, the supervised method is a classification method; if it is numerical, it is a regression method.

4.2 Methods explored in this course

| Method | Field | Typical Use | Supervised / Unsupervised |
|---|---|---|---|
| PCA | Statistics / Data Analysis | Dimension reduction | Unsupervised |
| LDA | Statistics / Data Analysis | Classification, interpretation | Supervised |
| CART | Machine Learning | Nonlinear prediction | Supervised |
| Bootstrap | Statistics / Machine Learning | Uncertainty estimation | |
| Bagging | Machine Learning | Variance reduction | Supervised |
| Random Forests | Machine Learning | Robust prediction | Supervised |

References

Bühlmann, Peter, and Sara Van De Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-20192-9.
Domo. 2024. “Data Never Sleeps 12.0.” Domo. https://www.domo.com/learn/infographic/data-never-sleeps-12.
Giraud, Christophe. 2021. Introduction to High-Dimensional Statistics. 2nd ed. New York: Chapman; Hall/CRC. https://doi.org/10.1201/9781003158745.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Johnstone, Iain M., and D. Michael Titterington. 2009. “Statistical Challenges of High-Dimensional Data.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4237–53. https://doi.org/10.1098/rsta.2009.0159.
Rahman, Robi, and David Owen. 2024. “The Size of Datasets Used to Train Language Models Doubles Approximately Every Six Months.” Epoch AI. https://epoch.ai/data-insights/dataset-size-trend.
Reinsel, David, John Gantz, and John Rydning. 2018. “The Digitization of the World from Edge to Core.” US44413318. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.
Sinha, Satyajit. 2025. “Number of Connected IoT Devices Growing 14% to 21.1 Billion.” IoT Analytics. https://iot-analytics.com/number-connected-iot-devices/.
Taylor, Petroc. 2025. “Data Generation Volume Worldwide 2010-2029.” Statista. https://www.statista.com/statistics/871513/worldwide-data-created/.