Introduction

Published

January 5, 2026

1 Big Data

1.1 Definition

Big Data refers to datasets whose size, speed, or complexity make them difficult to collect, store, process, or analyze with traditional tools. The challenge is primarily computational, but it has strong implications for statistics and machine learning.

Big Data is commonly summarized by the 3 Vs:

  • Volume: massive number of observations \(n\)
  • Velocity: speed at which new data is generated
  • Variety: different types/structures of data from various origins.

Other Vs are sometimes mentioned, although they are not discussed in this course:

  • Veracity: uncertainty, noise, and data quality issues
  • Value: impact and value added by a potential data analysis

1.2 Volume

1.2.1 The Global Datasphere

  • Global data production is measured in zettabytes (\(10^{21}\) bytes = 1 billion terabytes).
  • By the end of 2024, the global volume of data was estimated at about 21 terabytes per person on average.
  • The volume of data created worldwide roughly doubles every 2–3 years (Reinsel, Gantz, and Rydning 2018).
Code
# Years
years <- 2010:2029

# Data volume in Zettabytes
datasphere_zb <- c(
  2.0, 5.0, 6.5, 9.0, 12.5, 15.5, 18.0, 26.0, 33.0, 41.0,
  64.2, 79.0, 97.0, 120.0, 173.4, 181.0, 221.0, 291.0, 394.0, 527.5
)

# Colors: historical (<=2024) blue, projections (>2024) orange
colors <- ifelse(years <= 2024, "blue", "orange")
Code
# Plotly bar chart of the global datasphere
library(plotly)

plot_ly(
  x = ~years,
  y = ~datasphere_zb,
  type = "bar",
  marker = list(color = colors)
) %>%
  layout(
    xaxis = list(title = "Year"),
    yaxis = list(title = "Data Volume (Zettabytes)")
  )
Figure 1: Volume of data or information created, captured, copied, and consumed worldwide from 2010 to 2029. Source: Taylor (2025).

1.2.2 AI and Training Data Explosion

  • The size of datasets used to train language models doubles approximately every six months.
  • Modern AI models (e.g. large language models) are trained on hundreds of gigabytes to terabytes of data, a scale unseen in classical statistics.
Figure 2: Growth of AI training dataset sizes. Source: Rahman and Owen (2024).

1.3 Velocity

1.3.1 The rising frequency of data generation

The global population is increasingly online, which means that every minute, a huge amount of data is generated on the internet.

The following figure illustrates the amount of data generated every minute of every day:

  • emails sent,
  • social media posts,
  • video streaming,
  • online transactions.
Figure 3: Data generated every minute on the internet. Source: Domo (2024).

1.3.2 An increasing number of connected devices

  • The Internet of Things (IoT) is defined in Recommendation ITU-T Y.2060 (06/2012) as a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.
  • The IoT consists of smart devices connected over a network that operate without relying on human interaction (e.g. sensors, smart appliances, and connected infrastructure).
  • As of 2025, the number of connected IoT devices reached 21.1 billion globally (excluding consumer devices like PCs, laptops, cellphones, etc.)
  • IoT devices generate continuous streams of data, often at very high frequency.
Figure 4: Global IoT market forecast (in billions of connected IoT devices). Source: Sinha (2025).

1.4 Variety

1.4.1 Where does Big Data come from?

  • Digital platforms: clicks, searches, recommendations
  • Finance: transactions, high‑frequency trading
  • Public administration: tax, social security, census
  • Science: satellites, genomics, climate modeling
  • Industry: sensors, supply chains, logistics

Big Data is often collected for operational purposes, not for statistical inference.

Figure 5: 2018 Enterprise Datasphere by Industry (1 EB = 1 million TB). Source: Reinsel, Gantz, and Rydning (2018).

Data is often private, but some platforms provide public access to large datasets as part of the Open Data movement.

1.4.2 Complex data structures

In this course, we will often assume that each of the \(n\) observations can be represented as a vector
\[ X_i \in \mathbb{R}^p. \] In other words, a fixed number of numerical variables is measured for each individual.

However, many modern datasets do not fit into this framework. Instead of vectors, observations may be:

  • functions
  • probability distributions
  • networks or graphs
  • combinations of heterogeneous objects.

Such data are referred to as complex data.

Most classical methods covered in this course (PCA, LDA, CART, Random Forests) assume that we observe independent random vectors in a finite-dimensional Euclidean space.

Complex data invalidate this assumption. We need to understand their structure before applying any statistical or machine learning method.

This leads to new representations, metrics and notions of variability that are beyond the scope of this course.

1.4.2.1 Functional Data

In functional data analysis (FDA), each observation is a function: \[ f_i(t), \quad t \in \mathcal{T}, \] where \(\mathcal{T}\) is typically time, space, or wavelength.

Examples:

  • daily electricity consumption curves,
  • intraday financial price curves,
  • yield curves in economics,
  • temperature curves over a year.

Each observation is not a vector, but an entire curve.

Key challenges:

  • Infinite-dimensional nature of the data
  • Smoothness and regularity assumptions
  • Temporal dependence within each observation.

Typical tools:

  • Functional PCA
  • Basis expansions (Fourier, splines, wavelets)
  • Functional regression

Functional data have become increasingly common with the rise of IoT devices that monitor variables at very high frequency; such measurements are often better analyzed as continuous curves than as isolated numerical values.
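
To make basis expansions and functional PCA more concrete, here is a minimal R sketch on simulated curves; the grid, signal, and noise level are illustrative assumptions rather than course data, and the "functional" PCA below is simply an ordinary PCA applied to the smoothed, discretized curves.

Code
# Simulated curves observed on a discrete grid (illustrative data only)
set.seed(1)
t_grid <- seq(0, 1, length.out = 100)   # observation grid
n <- 20                                 # number of curves
curves <- t(sapply(1:n, function(i) {
  sin(2 * pi * t_grid) + rnorm(1, sd = 0.5) * cos(2 * pi * t_grid) +
    rnorm(length(t_grid), sd = 0.1)     # smooth signal plus measurement noise
}))

# Smooth each curve with a spline (one simple form of basis expansion)
smoothed <- t(apply(curves, 1, function(y) smooth.spline(t_grid, y)$y))

# Crude functional PCA: ordinary PCA on the smoothed, discretized curves
fpca <- prcomp(smoothed, center = TRUE)
summary(fpca)$importance[, 1:3]         # variance explained by the first components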

1.4.2.2 Distributional Data

In distributional data analysis (DDA), each observation is a probability distribution: \[ \pi_i (x), \quad x \in \mathcal X \] (taking nonnegative values and integrating to 1) rather than a single realization.

Examples:

  • income distributions across regions,
  • age distributions in populations,
  • uncertainty-aware measurements,
  • empirical distributions of prices or returns,
  • temperature distributions over a year.

Key challenges:

  • Standard distances (Euclidean) are meaningless
  • How to average distributions?

Typical tools:

  • Wasserstein (optimal transport) distances
  • Fréchet means of distributions
  • Aitchison structure (similar to compositional data).
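
As a minimal illustration of the Wasserstein idea in one dimension, the sketch below compares two simulated empirical distributions using base R only; it relies on the fact that, for univariate distributions with equal sample sizes, the \(W_1\) distance reduces to matching sorted observations (the samples themselves are illustrative assumptions).

Code
# Two empirical distributions (simulated for illustration)
set.seed(1)
x <- rnorm(500, mean = 0, sd = 1)
y <- rnorm(500, mean = 2, sd = 1.5)

# In 1-D, W_1 is the L1 distance between quantile functions;
# with equal sample sizes this amounts to matching sorted observations
w1 <- mean(abs(sort(x) - sort(y)))
w1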

1.4.2.3 Graph and Network Data

In graph data, each observation is a network: \[ G_i = (V_i, E_i), \] with nodes \(V_i\) and edges \(E_i\).

Examples: social networks, trade networks.

Key challenges:

  • Size and topology vary across graphs
  • Dependence structure is intrinsic
  • Graphs encode relational information, not just attributes.

Typical tools:

  • Graph Laplacians and spectral methods
  • Network embeddings
  • Graph neural networks.
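
The following base-R sketch shows what a graph Laplacian and its spectrum look like for a small undirected graph; the adjacency matrix is an arbitrary toy example, not data from the course.

Code
# Toy undirected graph on 5 nodes (adjacency matrix chosen for illustration)
A <- matrix(c(0, 1, 1, 0, 0,
              1, 0, 1, 0, 0,
              1, 1, 0, 1, 0,
              0, 0, 1, 0, 1,
              0, 0, 0, 1, 0), nrow = 5, byrow = TRUE)
D <- diag(rowSums(A))   # degree matrix
L <- D - A              # unnormalized graph Laplacian
eigen(L)$values         # the number of (near-)zero eigenvalues equals
                        # the number of connected components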

1.4.2.4 Other kinds of complex data

  • Textual Data
    • Documents, articles, social media posts
    • Represented via embeddings or topic models
    • High-dimensional, sparse, and structured
  • Image and Signal Data
    • Images as high-dimensional arrays
    • Strong spatial dependence
    • Often low-dimensional structure hidden in high dimensions
  • Longitudinal and Panel Data
    • Repeated measurements over time
    • Correlation across time and individuals
    • Widely used in economics and social sciences.

2 High‑Dimensional Data

2.1 Definition

A dataset is high‑dimensional when the number of variables \(p\) is large relative to the number of observations \(n\): \[ p \ge n \quad \text{or} \quad p \gg n. \]

This is often summarized as ‘large \(p\), small \(n\)’ (Johnstone and Titterington 2009).

2.2 Examples

  • Gene expression: thousands of genes, few patients
  • Economic indicators: many correlated macro or firm‑level variables

Very often, complex data is also high-dimensional:

  • Text analysis: words as variables
  • Images: pixels as variables.

2.3 Statistical consequences

  • Observations are all approximately equidistant (Giraud 2021)
  • Sample covariance matrices are singular when \(p > n\)
  • Estimators have high variance
  • Overfitting (i.e. performing well on training data but poorly on unseen data) becomes unavoidable without constraints
  • Classical asymptotics (with \(p\) fixed and \(n \to \infty\)) no longer apply.

These issues motivate dimension reduction and regularization (Bühlmann and Van De Geer 2011).
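
Two of the consequences listed above, distance concentration and singular covariance matrices, can be seen directly in a small simulation; the values of \(n\) and \(p\) below are arbitrary illustrative choices.

Code
# Simulated Gaussian data with many more variables than observations
set.seed(1)
n <- 30; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)

# Pairwise distances concentrate: the max/min ratio is close to 1
d <- dist(X)
c(min = min(d), max = max(d), ratio = max(d) / min(d))

# The sample covariance matrix is singular when p > n: its rank is at most n - 1
S <- cov(X)
qr(S)$rank   # far below p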

2.4 Big Data vs High‑Dimensional Data

| Aspect | Big Data | High-Dimensional Data |
|---|---|---|
| Main challenge | Computation | Statistical inference |
| Typical \(n\) | Very large | Small to moderate |
| Typical \(p\) | Moderate | Large |
| Key risk | Scalability | Overfitting |
| Typical tools | Distributed systems | PCA, sparsity |

Some methods in this course remain applicable beyond the strict \(n \gg p\) case; others become difficult to interpret as \(p\) grows.

3 Basics in data exploration

Definition 1 (Observations and variables)  

  • A unit (or observation, or case) is the basic entity on which measurements are made (e.g. students, firms, countries, years).

  • A variable is a characteristic measured on the units (e.g. gender, age, number of employees, gross national product, unemployment rate).

  • A variable is either categorical (qualitative), e.g. country of birth or gender, or numerical (quantitative), e.g. GNP or age.

3.1 Univariate statistical analysis

Univariate statistical analysis focuses on the description of a single variable (\(p=1\)), through a finite sample of \(n\) observations.

3.1.1 Categorical variables

Graphical representations: Graphical tools are part of data analysis as they provide intuition about the data set before formal analysis.

  • Bar chart: displays frequencies or proportions of categories.
  • Pie chart: visualizes proportions (mainly descriptive).

Numerical indicators: we usually work with the relative frequencies of the categories; the mode is the most frequent category.
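
A minimal base-R sketch of these tools, on a made-up categorical variable:

Code
# Made-up categorical variable (country codes chosen for illustration)
x <- factor(c("FR", "DE", "FR", "IT", "FR", "DE"))
table(x)                              # absolute frequencies
prop.table(table(x))                  # relative frequencies; the mode is the most frequent level
barplot(prop.table(table(x)), ylab = "Relative frequency")   # bar chart
pie(table(x))                         # pie chart (mainly descriptive)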

3.1.2 Numerical variables

Graphical representations:

  • Histogram: approximates the distribution of the variable.
  • Boxplot: summarizes location, dispersion, and outliers.
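
The same two plots can be produced in base R; the data below are simulated and the distribution is an arbitrary choice.

Code
# Simulated numerical variable
set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)
hist(x, main = "", xlab = "x")    # histogram: approximates the distribution
boxplot(x, horizontal = TRUE)     # boxplot: location, dispersion, outliers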

3.1.2.1 Mean (location)

Definition 2 (Sample mean) The sample mean is defined as: \[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]

It summarizes the central tendency of the observed data.

Proposition 1  

  • \(\bar X\) is an unbiased estimator of the expected value \(\mu = \mathbb{E}[X]\): \[ \mathbb{E}[\bar X] = \mu \]
  • \(\bar X\) is consistent: \[ \bar X \xrightarrow[n \to \infty]{} \mu \]
  • Its variance decreases with the sample size: \[ \operatorname{Var}(\bar X) = \frac{\operatorname{Var}(X)}{n} \]
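
These properties are easy to check by Monte Carlo simulation; in the sketch below the distribution, sample size, and number of replications are arbitrary illustrative choices.

Code
# Monte Carlo check of E[xbar] = mu and Var(xbar) = Var(X) / n
set.seed(1)
mu <- 5; sigma2 <- 4; n <- 50
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sqrt(sigma2))))
c(mean_of_xbar = mean(xbar),     # close to mu = 5
  var_of_xbar  = var(xbar),      # close to sigma2 / n = 0.08
  theory       = sigma2 / n)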

3.1.2.2 Variance (dispersion)

Definition 3 (Sample variance) The unbiased sample variance is: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \]

It measures the dispersion of the data around the sample mean.

Definition 4 (Population variance) The corresponding population quantity is: \[ \operatorname{Var}(X) = \mathbb E \big[(X - \mathbb EX)^2\big] \]

Proposition 2  

  • \(S^2\) is an unbiased estimator of \(\operatorname{Var}(X)\): \[ \mathbb{E}[S^2] = \operatorname{Var}(X) \]
  • \(S^2\) is consistent as \(n \to \infty\).
  • The variance is sensitive to extreme observations.

3.1.2.3 Standard deviation (scale)

Definition 5 (Sample standard deviation) The sample standard deviation is: \[ S = \sqrt{S^2} \]

It is expressed in the same units as the variable and is often easier to interpret than the variance.

Definition 6 (Population standard deviation) The corresponding population quantity is: \[ \sigma(X) = \sqrt{\operatorname{Var}(X)} \]

Proposition 3  

  • \(S\) is a biased estimator of \(\sigma\).
  • \(S\) is nevertheless consistent.
  • Like the variance, it is sensitive to outliers.
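
A small simulation illustrates the contrast between \(S^2\) (unbiased) and \(S\) (biased); the normal distribution and the small sample size \(n = 5\) are chosen only to make the bias of \(S\) visible.

Code
# Monte Carlo comparison of S^2 and S as estimators (illustrative settings)
set.seed(1)
sigma <- 2; n <- 5
sims <- replicate(10000, {
  x <- rnorm(n, sd = sigma)
  c(S2 = var(x), S = sd(x))
})
rowMeans(sims)                  # mean of S2 is close to sigma^2 = 4,
                                # mean of S falls below sigma = 2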

3.2 Bivariate statistical analysis

Bivariate statistical analysis studies the relationship between two variables observed on the same individuals.
The appropriate tools depend on whether the variables are categorical or numerical.

3.2.1 Graphical representations

  • Two categorical variables: stacked or juxtaposed bar charts are used to compare joint and conditional distributions.

  • Two numerical variables: a scatterplot visualizes the form, strength, and direction of the relationship.

  • One numerical and one categorical variable: parallel boxplots compare the distribution of the numerical variable across categories.

Graphical representations are an essential first step to detect association, trends, or anomalies.
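
A minimal base-R sketch of two of these plots, using simulated data and an artificial grouping variable:

Code
# Simulated data: one grouping factor, two numerical variables
set.seed(1)
g <- factor(sample(c("A", "B"), 100, replace = TRUE))
x <- rnorm(100)
y <- ifelse(g == "A", 1, 2) + 0.8 * x + rnorm(100, sd = 0.5)

plot(x, y, col = as.integer(g))   # scatterplot of two numerical variables, colored by group
boxplot(y ~ g)                    # parallel boxplots: numerical variable across categories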

3.2.2 Two categorical variables

The relationship between two categorical variables is assessed using the chi-square test of independence, which compares observed frequencies to those expected under independence.

This test evaluates whether the variables are statistically independent.
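
In R, the test is available through chisq.test(); the contingency table below is made up purely for illustration.

Code
# Chi-square test of independence on a small (made-up) contingency table
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
chisq.test(tab)   # compares observed counts to counts expected under independence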

3.2.3 Two numerical variables

3.2.3.1 Covariance

Definition 7 (Sample covariance) Given observations \((X_i, Y_i)\) for \(i = 1, \dots, n\), the sample covariance is: \[ \operatorname{cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y). \]

Covariance measures the joint variability of the two variables but depends on their units.

3.2.3.2 Correlation coefficient

Definition 8 (Sample correlation) The Pearson correlation coefficient is defined as: \[ \rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{S_X S_Y}, \] where \(S_X\) and \(S_Y\) are the sample standard deviations of \(X\) and \(Y\).

Correlation is a standardized, unit-free measure of linear dependence.
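
The unit-dependence of the covariance and the invariance of the correlation can be checked on simulated data; the scale change below (multiplying \(X\) by 100) is an arbitrary illustration.

Code
# Covariance changes with the units, correlation does not (simulated data)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.5)
c(cov = cov(x, y), cor = cor(x, y))
c(cov = cov(100 * x, y), cor = cor(100 * x, y))   # rescaling x inflates cov, leaves cor unchanged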

Remark.

  • \(\rho(X,Y) \in [-1,1]\),
  • \(|\rho|\) measures the strength of linear association,
  • the sign indicates the direction of the relationship,
  • \(\rho(X,Y) = 0\) does not necessarily imply independence.
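
The last point can be illustrated with the classical construction \(Y = X^2\) for \(X\) symmetric around zero: \(Y\) is a deterministic function of \(X\), yet their correlation is close to zero.

Code
# Zero correlation without independence (standard illustrative construction)
set.seed(1)
x <- rnorm(10000)
y <- x^2            # y depends deterministically on x, so they are not independent
cor(x, y)           # nevertheless close to 0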

4 Statistics, ML, Data Analysis, Data Mining, or AI?

4.1 A question of vocabulary

There is no real consensus on the vocabulary around data science, but we can characterize its subfields in terms of their respective goals and tools.

| Field | Core Question | Characteristics |
|---|---|---|
| Statistics | What can we infer about the world? | Often model-based (parametric or not). Emphasis on bias, variance, confidence intervals. |
| Machine Learning | How well can we predict? | Algorithmic and optimization-driven. Focus on generalization performance. Often trades interpretability for accuracy. |
| Data Analysis | What does the data show? | Exploratory and descriptive. Visualization, summaries, diagnostics. |
| Data Mining | What patterns exist? | Pattern discovery in large datasets. Clustering, association rules, anomaly detection. |
| Artificial Intelligence | How can machines learn from data? | Broader field encompassing ML algorithms and data-driven decision-making. Reasoning, pattern recognition, or adaptation. |

Modern data science blends all of these perspectives into one field (James et al. 2021).

There is an important distinction between:

  • Unsupervised methods: no output variable (no labeled training set), such as PCA and clustering.
  • Supervised methods: an output variable is observed on a training set, such as linear or logistic regression, discriminant analysis, decision trees, and neural networks. If the output variable is categorical, the supervised method is a classification method; if it is numerical, it is a regression method.

4.2 Methods explored in this course

| Method | Field | Typical Use | Supervised / Unsupervised |
|---|---|---|---|
| PCA | Statistics / Data Analysis | Dimension reduction | Unsupervised |
| LDA | Statistics / Data Analysis | Classification, interpretation | Supervised |
| CART | Machine Learning | Nonlinear prediction | Supervised |
| Bootstrap | Statistics / Machine Learning | Uncertainty estimation | |
| Bagging | Machine Learning | Variance reduction | Supervised |
| Random Forests | Machine Learning | Robust prediction | Supervised |

References

Bühlmann, Peter, and Sara Van De Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-20192-9.
Domo. 2024. “Data Never Sleeps 12.0.” Domo. https://www.domo.com/learn/infographic/data-never-sleeps-12.
Giraud, Christophe. 2021. Introduction to High-Dimensional Statistics. 2nd ed. New York: Chapman; Hall/CRC. https://doi.org/10.1201/9781003158745.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Johnstone, Iain M., and D. Michael Titterington. 2009. “Statistical Challenges of High-Dimensional Data.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906): 4237–53. https://doi.org/10.1098/rsta.2009.0159.
Rahman, Robi, and David Owen. 2024. “The Size of Datasets Used to Train Language Models Doubles Approximately Every Six Months.” Epoch AI. https://epoch.ai/data-insights/dataset-size-trend.
Reinsel, David, John Gantz, and John Rydning. 2018. “The Digitization of the World from Edge to Core.” US44413318. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.
Sinha, Satyajit. 2025. “Number of Connected IoT Devices Growing 14% to 21.1 Billion.” IoT Analytics. https://iot-analytics.com/number-connected-iot-devices/.
Taylor, Petroc. 2025. “Data Generation Volume Worldwide 2010-2029.” Statista. https://www.statista.com/statistics/871513/worldwide-data-created/.