```r
# keep only the response (sex) and the three predictors
library(boot)
data(channing)
channing <- channing[, c("sex", "entry", "time", "cens")]
n <- nrow(channing)
channing[sample(1:n, 4), ]   # preview four randomly chosen rows
```

|     | sex    | entry | time | cens |
|---|---|---|---|---|
| 79  | Male   | 1010  | 2    | 1    |
| 257 | Female | 941   | 3    | 0    |
| 246 | Female | 811   | 87   | 0    |
| 292 | Female | 979   | 37   | 0    |
From a course by Davy Paindaveine and Nathalie Vialaneix
Predict sex \(\in \{\text{Male}, \text{Female} \}\) on the basis of two numerical predictors (entry, time) and a binary one (cens).
In Chapter 3 of this course, we learned about a special type of classifiers \(\phi_{\mathcal{S}}\), namely classification trees (Breiman et al. 1984).
The process of averaging will reduce variability and hence improve stability. Recall indeed that, if \(U_1,\ldots,U_n\) are uncorrelated with common variance \(\sigma^2\), then \[ \text{Var}[\bar{U}]=\frac{\sigma^2}{n}. \] Since unpruned trees have low bias (but high variance), this reduction in variance will lead to a low value of \[ \text{MSE}=\text{Var}+(\text{Bias})^2, \] which will ensure good performance.
How to perform this averaging?
Denote as \(\phi_{\mathcal{S}}(x)\) the predicted class for predictor value \(x\) returned by the classification tree associated with sample \(\mathcal{S}=\{(X_i,Y_i), \ i=1,\ldots,n\}\).
Bagging of this tree considers predictions from \(B\) bootstrap samples \[\begin{matrix} \mathcal{S}^{*1} &=& ((X_1^{*1},Y_1^{*1}), \ldots, (X_n^{*1},Y_n^{*1})) & \leadsto & \phi_{\mathcal{S}^{*1}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*b} &=& ((X_1^{*b},Y_1^{*b}), \ldots, (X_n^{*b},Y_n^{*b})) & \leadsto & \phi_{\mathcal{S}^{*b}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*B} &=& ((X_1^{*B},Y_1^{*B}), \ldots, (X_n^{*B},Y_n^{*B})) & \leadsto & \phi_{\mathcal{S}^{*B}}(x) \\ \end{matrix}\] then proceeds by majority voting (i.e., the most frequently predicted class wins): \[ \phi_{\mathcal{S}}^\text{Bagging}(x) = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(x)=k \} \]
For \(x=\) (entry, time, cens) \(= (782, 127, 1)\), a majority of the bootstrap trees predict Male, so the bagging classifier classifies \(x\) as Male.
Of course, \(B\) is usually much larger (\(B=500\)? \(B=1000\)?), which requires automating the process (through, e.g., the boot function).
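To fix ideas, here is one possible way to automate the loop over bootstrap samples, using fully grown rpart trees rather than the boot function; \(B\), the rpart control settings, and the new observation x_new below are illustrative choices, not the course's own code.

```r
library(rpart)

B <- 500                                            # number of bootstrap samples
x_new <- data.frame(entry = 782, time = 127, cens = 1)

votes <- character(B)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)             # bootstrap sample S*b
  tree_b <- rpart(sex ~ entry + time + cens,        # unpruned tree grown on S*b
                  data = channing[idx, ], method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))
  votes[b] <- as.character(predict(tree_b, newdata = x_new, type = "class"))
}

table(votes)                                        # votes of the B trees
names(which.max(table(votes)))                      # majority vote = bagging prediction
```

Growing the trees fully (cp = 0, minsplit = 2) keeps their bias low, which is precisely the situation where averaging pays off.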
We repeat the following experiment \(M=1000\) times: randomly split the data set into a training set and a test set, train both the single classification tree and the bagging classifier on the training set, and evaluate their misclassification rates on the test set. This provides \(M=1000\) test errors for the direct (single-tree) approach and \(M=1000\) test errors for the bagging approach.
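A sketch of one way this comparison could be coded is given below; the 2/3–1/3 split, the number of trees, and the helper bag_predict are illustrative assumptions, not the course's own implementation.

```r
library(rpart)

# hypothetical helper: majority vote over B trees grown on bootstrap samples of `train`
bag_predict <- function(train, test, B = 100) {
  votes <- sapply(1:B, function(b) {
    idx <- sample(nrow(train), replace = TRUE)
    tree_b <- rpart(sex ~ entry + time + cens, data = train[idx, ],
                    method = "class",
                    control = rpart.control(cp = 0, minsplit = 2))
    as.character(predict(tree_b, newdata = test, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))
}

M <- 1000
err_tree <- err_bag <- numeric(M)
for (m in 1:M) {
  test_id <- sample(n, round(n / 3))                # 2/3 training, 1/3 test
  train <- channing[-test_id, ]
  test  <- channing[test_id, ]
  single <- rpart(sex ~ entry + time + cens, data = train, method = "class")
  err_tree[m] <- mean(predict(single, test, type = "class") != test$sex)
  err_bag[m]  <- mean(bag_predict(train, test) != test$sex)
}

boxplot(list(tree = err_tree, bagging = err_bag), ylab = "test error")
```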
Several strategies exist to estimate the prediction accuracy of a classifier:
(1) Compute a test error (as above): Partition the data set \(\mathcal{S}\) into a training set \(\mathcal{S}_\text{train}\) (to train the classifier) and a test set \(\mathcal{S}_\text{test}\) (on which to evaluate the misclassification rate \(e_\text{test}\)).
(2) Compute an \(L\)-fold cross-validation error:
Partition the data set \(\mathcal{S}\) into \(L\) folds \(\mathcal{S}_{\ell}\), \(\ell=1,\ldots,L\). For each \(\ell\), evaluate the test error \(e_{\text{test},\ell}\) associated with training set \(\mathcal{S}\setminus\mathcal{S}_{\ell}\) and test set \(\mathcal{S}_{\ell}\).
| | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|---|---|---|---|---|---|
| Run \(\ell\)=1 | Test | Train | Train | Train | Train |
| Run \(\ell\)=2 | Train | Test | Train | Train | Train |
| Run \(\ell\)=3 | Train | Train | Test | Train | Train |
| Run \(\ell\)=4 | Train | Train | Train | Test | Train |
| Run \(\ell\)=5 | Train | Train | Train | Train | Test |
Then the (\(L\)-fold) ‘cross-validation error’ is: \[ e_\text{CV}=\frac{1}{L}\sum_{\ell=1}^L e_{\text{test},\ell} \]
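For instance, a 5-fold cross-validation error for a single classification tree could be computed along the following lines (a sketch; the random fold assignment and the rpart tree are illustrative choices):

```r
library(rpart)

L <- 5
folds  <- sample(rep(1:L, length.out = n))          # random assignment to folds
e_test <- numeric(L)
for (l in 1:L) {
  train <- channing[folds != l, ]                   # S \ S_l
  test  <- channing[folds == l, ]                   # S_l
  tree_l <- rpart(sex ~ entry + time + cens, data = train, method = "class")
  e_test[l] <- mean(predict(tree_l, test, type = "class") != test$sex)
}
e_CV <- mean(e_test)                                # L-fold cross-validation error
e_CV
```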
(3) Compute the Out-Of-Bag (OOB) error:
For each observation \(X_i\) from \(\mathcal{S}\), define the OOB prediction as \[ \phi_{\mathcal{S}}^\text{OOB}(X_i) = \underset{k\in\{1,\ldots,K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(X_i)=k \textrm{ and } (X_i,Y_i)\notin \mathcal{S}^{*b} \} \] This is a majority voting discarding, quite naturally, bootstrap samples that use \((X_i,Y_i)\) to train the classification tree. The OOB error is then the corresponding misclassification rate \[ e_\text{OOB}=\frac{1}{n}\sum_{i=1}^n \mathbb{1}[ \phi_{\mathcal{S}}^\text{OOB}(X_i) \neq Y_i] \]
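A possible implementation of the OOB error for the bagged trees is sketched below; again, rpart with these control settings is an illustrative choice rather than the course's own code.

```r
library(rpart)

B <- 500
pred <- matrix(NA_character_, nrow = n, ncol = B)   # pred[i, b] = phi_{S*b}(X_i), NA if i is in S*b
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)             # bootstrap sample S*b
  tree_b <- rpart(sex ~ entry + time + cens, data = channing[idx, ],
                  method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))
  oob <- setdiff(1:n, idx)                          # observations out of S*b
  pred[oob, b] <- as.character(predict(tree_b, channing[oob, ], type = "class"))
}

# majority vote restricted to the trees for which observation i is out-of-bag
oob_pred <- apply(pred, 1, function(v) names(which.max(table(v))))
e_OOB <- mean(oob_pred != channing$sex)             # OOB misclassification rate
e_OOB
```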
