From a course by Davy Paindaveine and Nathalie Vialaneix
Last updated on October 1, 2024
Efron (1979)
Let \(X_1, \dots, X_n \sim P_\theta\) i.i.d. Let \(\hat\theta = \hat\theta (X_1, \dots, X_n)\) be an estimator for \(\theta\).
One often wants to evaluate the variance \(\operatorname{Var} [\hat\theta]\) to quantify the uncertainty of \(\hat\theta\).
The bootstrap is a powerful, broadly applicable method: it is nonparametric and it can deal with small \(n\).
James et al. (2021)
Now, if historical data \(X_1=(Y_1,Z_1),\ldots,X_n=(Y_n,Z_n)\) are available, then we can estimate \(\lambda_\text{opt}\) by \[ \hat{\lambda}_\text{opt} = \frac{\widehat{\text{Var}[Y]}-\widehat{\text{Cov}[Y,Z]}}{\widehat{\text{Var}[Y]}+\widehat{\text{Var}[Z]}-2\widehat{\text{Cov}[Y,Z]}} \] where the hatted quantities are the usual sample variances and covariance computed from \((Y_1,Z_1),\ldots,(Y_n,Z_n)\).
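For concreteness, here is a minimal R sketch of this plug-in estimator, assuming the observations are stored in two numeric vectors y and z (hypothetical names):

```r
# Hedged sketch: plug-in estimate of lambda_opt from two numeric vectors y and z
lambda_hat <- function(y, z) {
  (var(y) - cov(y, z)) / (var(y) + var(z) - 2 * cov(y, z))
}
```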
We generated 1000 samples from the population and computed \(\hat{\lambda}_\text{opt}\) for each sample.
Here: \[ \widehat{\text{Std}[\hat{\lambda}_\text{opt}]} \approx 0.077, \qquad \bar{\lambda}_\text{opt} \approx 0.331 \ \ (\approx \lambda_\text{opt} = \frac13 = 0.333) \]
(This could also be used to estimate quantiles of \(\hat{\lambda}_\text{opt}\).)
In practice, only one sample is available; the bootstrap instead draws \(B\) bootstrap samples from this single sample and computes \(\hat{\lambda}_\text{opt}^{*b}\) for each \(b=1,\ldots,B\). This provides \[ \widehat{\text{Std}[\hat{\lambda}_\text{opt}]}^* \approx 0.079 \]
(This could again be used to estimate quantiles of \(\hat{\lambda}_\text{opt}\).)
Results are close: \(\widehat{\text{Std}[\hat{\lambda}_\text{opt}]}\approx 0.077\) and \(\widehat{\text{Std}[\hat{\lambda}_\text{opt}]}^*\approx 0.079\).
Above, each bootstrap sample \((X_1^{*b},\ldots,X_n^{*b})\) is obtained by sampling (uniformly) with replacement among the original sample \((X_1,\ldots,X_n)\).
Possible uses: estimating the expectation, the variance, or quantiles of the sampling distribution of a statistic \(T=T(X_1,\ldots,X_n)\).
Possible uses when \(T\) is an estimator of \(\theta\): estimating the bias and the standard error of \(T\), or building confidence intervals for \(\theta\).
Bootstrap estimates of \(\mathbb E [\bar{X}]\) and \(\text{Var}[\bar{X}]\) are then given by \[ \widehat{\mathbb{E}[\bar{X}]}^* = \frac{1}{B}\sum_{b=1}^B \bar{X}^{*b} \qquad \text{and} \qquad \widehat{\text{Var}[\bar{X}]}^* = \frac{1}{B-1}\sum_{b=1}^B \Big( \bar{X}^{*b} - \frac{1}{B}\sum_{b'=1}^B \bar{X}^{*b'} \Big)^2, \] where \(\bar{X}^{*b}\) denotes the sample mean of the \(b\)th bootstrap sample \((X_1^{*b},\ldots,X_n^{*b})\).
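In R, these can be computed with a hand-written resampling loop; a minimal sketch, assuming the original sample is stored in a numeric vector x (a hypothetical name):

```r
# Hedged sketch: hand-written bootstrap for the sample mean,
# assuming the original sample is stored in a numeric vector x
set.seed(1)
B          <- 1000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))  # the B values X-bar^{*b}
mean(boot_means)   # bootstrap estimate of E[X-bar]
var(boot_means)    # bootstrap estimate of Var[X-bar]
```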
The practical sessions will explore how well such estimates behave.
A better strategy is to use the boot function from the boot package. The boot function typically takes 3 arguments:
data: the original sample
statistic: a user-defined function with the statistic to bootstrap
R: the number \(B\) of bootstrap samples to consider

If the statistic is the mean, then a suitable user-defined function is
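the following (a minimal sketch; the name mean_fun is ours, and boot passes the data together with the indices of the resampled observations):

```r
# statistic to bootstrap: the mean of the resampled observations
mean_fun <- function(data, indices) mean(data[indices])
```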
The bootstrap estimate of \(\text{Var}[\bar{X}]\) is then
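obtained as follows (a sketch, again assuming the original sample is stored in a numeric vector x):

```r
library(boot)
set.seed(1)
res <- boot(data = x, statistic = mean_fun, R = 1000)  # B = 1000 bootstrap samples
var(res$t[, 1])   # res$t stores the R bootstrap replicates of the mean
```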
Breiman (1996)
The bootstrap has other uses than those described above.
In particular, it allows us to design ensemble methods in statistical learning.
Bagging (Bootstrap Aggregating), which is the most famous approach in this direction, can be applied to both regression and classification.
Below, we mainly focus on bagging of classification trees, but it should be clear that bagging of regression trees can be performed similarly.
Breiman et al. (1984)
sex entry time cens
1 Male 782 127 1
2 Male 1020 108 1
3 Male 856 113 1
4 Male 915 42 1
Predict sex \(\in \{\text{Male}, \text{Female} \}\) on the basis of two numerical predictors (entry, time) and a binary one (cens).
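The analyses below can be reproduced with the channing data set shipped with the boot package (462 Channing House residents); a minimal sketch of loading it, where the name ch is ours:

```r
library(boot)            # the channing data set ships with the boot package
data(channing)
ch <- channing[, c("sex", "entry", "time", "cens")]  # keep the variables used here
dim(ch)                  # 462 observations of 4 variables
```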
In Part 1 of this course, we learned about a special type of classifier \(\phi_{\mathcal{S}}\), namely the classification tree.
(+) Interpretability
(+) Flexibility
(–) Stability
(–) Performance
The process of averaging will reduce variability, hence improve stability. Recall indeed that, if \(U_1,\ldots,U_n\) are uncorrelated with variance \(\sigma^2\), then \[ \text{Var}[\bar{U}]=\frac{\sigma^2}{n}. \] Since unpruned trees have low bias (but high variance), this reduced variance will lead to a low value of \[ \text{MSE}=\text{Var}+(\text{Bias})^2, \] which will ensure a good performance.
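A quick numerical check of this variance reduction, using simulated uncorrelated variables (a sketch; the sample sizes are arbitrary):

```r
# averaging n uncorrelated variables with variance sigma^2 = 1 divides the variance by n
set.seed(1)
n <- 50
u <- matrix(rnorm(10000 * n), ncol = n)  # each row is one draw of (U_1, ..., U_n)
var(rowMeans(u))                         # close to 1/n = 0.02
```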
How to perform this averaging?
Denote as \(\phi_{\mathcal{S}}(x)\) the predicted class for predictor value \(x\) returned by the classification tree associated with sample \(\mathcal{S}=\{(X_i,Y_i), \ i=1,\ldots,n\}\).
Bagging of this tree considers predictions from \(B\) bootstrap samples \[\begin{matrix} \mathcal{S}^{*1} &=& ((X_1^{*1},Y_1^{*1}), \ldots, (X_n^{*1},Y_n^{*1})) & \leadsto & \phi_{\mathcal{S}^{*1}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*b} &=& ((X_1^{*b},Y_1^{*b}), \ldots, (X_n^{*b},Y_n^{*b})) & \leadsto & \phi_{\mathcal{S}^{*b}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*B} &=& ((X_1^{*B},Y_1^{*B}), \ldots, (X_n^{*B},Y_n^{*B})) & \leadsto & \phi_{\mathcal{S}^{*B}}(x) \\ \end{matrix}\] then proceeds by majority voting (i.e., the most frequently predicted class wins): \[ \phi_{\mathcal{S}}^\text{Bagging}(x) = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(x)=k \} \]
For \(x=\,\)(entry, time, cens)\(\,=\,\)(782, 127, 1), the three bootstrap trees predict Female, Male, and Male, respectively, so majority voting returns the prediction Male.
Of course, \(B\) is usually much larger (\(B=500\)? \(B=1000\)?), which requires automating the process (through, e.g., the boot function).
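As an illustration, here is a hedged sketch of bagging classification trees by hand, using a plain lapply loop with rpart rather than boot; the objects train (a training data frame with the variables above) and x_new (a one-row data frame to classify) are assumptions:

```r
library(rpart)
set.seed(1)
B <- 500
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(train), replace = TRUE)              # bootstrap sample S^{*b}
  rpart(sex ~ entry + time + cens, data = train[idx, ],
        control = rpart.control(cp = 0, minsplit = 2))    # grow a deep, unpruned tree
})
# majority vote over the B trees for the new observation x_new
votes <- sapply(trees, function(tr) as.character(predict(tr, newdata = x_new, type = "class")))
names(which.max(table(votes)))
```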
We repeat \(M=1000\) times the following experiment: randomly split the channing data set into a training set (of size 300) and a test set (of size 162); train a single classification tree and a bagging classifier on the training set; compute their misclassification rates on the test set.
This provides \(M=1000\) test errors for the direct (single-tree) approach, and \(M=1000\) test errors for the bagging approach.
Several strategies to estimate prediction accuracy of a classifier:
(1) Compute a test error (as above): Partition the data set \(\mathcal{S}\) into a training set \(\mathcal{S}_\text{train}\) (to train the classifier) and a test set \(\mathcal{S}_\text{test}\) (on which to evaluate the misclassification rate \(e_\text{test}\)).
(2) Compute an \(L\)-fold cross-validation error:
Partition the data set \(\mathcal{S}\) into \(L\) folds \(\mathcal{S}_{\ell}\), \(\ell=1,\ldots,L\). For each \(\ell\), evaluate the test error \(e_{\text{test},\ell}\) associated with training set \(\mathcal{S}\setminus\mathcal{S}_{\ell}\) and test set \(\mathcal{S}_{\ell}\).
The quantity \[ e_\text{CV}=\frac{1}{L}\sum_{\ell=1}^L e_{\text{test},\ell} \] is then the (\(L\)-fold) ‘cross-validation error’.
(3) Compute the Out-Of-Bag (OOB) error:
For each observation \(X_i\) from \(\mathcal{S}\), define the OOB prediction as \[ \phi_{\mathcal{S}}^\text{OOB}(X_i) = \underset{k\in\{1,\ldots,K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(X_i)=k \textrm{ and } (X_i,Y_i)\notin \mathcal{S}^{*b} \} \] This is a majority voting discarding, quite naturally, bootstrap samples that use \((X_i,Y_i)\) to train the classification tree. The OOB error is then the corresponding misclassification rate \[ e_\text{OOB}=\frac{1}{n}\sum_{i=1}^n \mathbb{1}[ \phi_{\mathcal{S}}^\text{OOB}(X_i) \neq Y_i] \]
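In practice, the OOB error comes essentially for free. A hedged sketch using the randomForest function with mtry set to the number of predictors (which reduces the forest to bagging-of-trees), on the data frame ch built earlier; its $predicted component contains the OOB predictions:

```r
library(randomForest)
set.seed(1)
bag <- randomForest(sex ~ entry + time + cens, data = ch, mtry = 3, ntree = 500)
mean(bag$predicted != ch$sex)   # e_OOB: bag$predicted holds the out-of-bag predictions
```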
Ho (1995)
In a random forest, each tree is grown on a bootstrap sample but, at each split, only a random subset of \(m\) of the \(p\) predictors is considered. Using \(m=p\) would simply provide bagging-of-trees. Using \(m\) small is appropriate when there are many correlated predictors. Common practice: take \(m\approx\sqrt{p}\) for classification and \(m\approx p/3\) for regression.
In both cases, results are actually not very sensitive to \(m\).
We repeated \(M=1000\) times the following experiment: randomly split the channing data set into a training set (of size 300) and a test set (of size 162); train a single classification tree, a bagging classifier, and a random forest on the training set, the random forest being fitted with the randomForest function in R with default parameters (\(B=500\) trees, \(m\approx \sqrt p\)); compute the misclassification rates on the test set. This provided \(M=1000\) test errors for the direct (single-tree) approach, \(M=1000\) test errors for the bagging approach, and \(M=1000\) test errors for the random forest approach.
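A hedged sketch of a single replication (assuming the data frame ch from above; a bagging column could be added with randomForest(..., mtry = 3) as in the OOB sketch):

```r
library(rpart)
library(randomForest)
set.seed(1)
idx   <- sample(nrow(ch), 300)       # training set of size 300
train <- ch[idx, ]
test  <- ch[-idx, ]                  # the remaining 162 observations

tree <- rpart(sex ~ entry + time + cens, data = train)         # single classification tree
rf   <- randomForest(sex ~ entry + time + cens, data = train)  # defaults: 500 trees, m ~ sqrt(p)
c(tree = mean(predict(tree, newdata = test, type = "class") != test$sex),  # test errors
  rf   = mean(predict(rf,   newdata = test)                 != test$sex))
```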
Because bagging-of-trees and random forests are much less interpretable than single classification trees, the following variable importance measure is useful.
The importance \(v_j\) of the \(j\)th predictor is measured as follows.
For each tree (i.e., for any \(b=1,\ldots,B\)), compute the accuracy of \(\phi_{\mathcal{S}^{*b}}\) on its out-of-bag observations, then randomly permute the values of the \(j\)th predictor among these out-of-bag observations and recompute the accuracy. The importance \(v_j\) is the average, over the \(B\) trees, of the resulting decrease in accuracy.
(A similar measure is used for regression, based on MSEs).
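With the randomForest package, this permutation importance is obtained by fitting with importance = TRUE; a minimal sketch on ch (assumed as above):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(sex ~ entry + time + cens, data = ch, importance = TRUE)
importance(rf, type = 1)   # mean decrease in accuracy: one value v_j per predictor
varImpPlot(rf, type = 1)   # graphical display of the importances
```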