library(boot)
data(channing)
channing <- channing[, c("sex", "entry", "time", "cens")]
channing[1:4, ]

   sex entry time cens
1 Male   782  127    1
2 Male  1020  108    1
3 Male   856  113    1
4 Male   915   42    1
Predict sex \(\in \{\text{Male}, \text{Female}\}\) on the basis of two numerical predictors (entry, time) and a binary one (cens).
In Part 1 of this course, we learned about a special type of classifier \(\phi_{\mathcal{S}}\), namely classification trees (Breiman et al. 1984); a minimal single-tree fit for the channing data is sketched after the list below.
(+) Interpretability
(+) Flexibility
(–) Stability
(–) Performance
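For reference, such a tree could be fitted and used for prediction as in the following sketch (it assumes the rpart package, which is not used in the original notes; any tree implementation from Part 1 would do):

# Single classification tree on the channing data (sketch; assumes 'rpart').
library(rpart)
tree <- rpart(sex ~ entry + time + cens, data = channing, method = "class")
predict(tree,
        newdata = data.frame(entry = 782, time = 127, cens = 1),
        type = "class")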
The process of averaging will reduce variability and hence improve stability. Recall indeed that, if \(U_1,\ldots,U_n\) are uncorrelated with variance \(\sigma^2\), then \[ \text{Var}[\bar{U}]=\frac{\sigma^2}{n}. \] Since unpruned trees have low bias (but high variance), this reduced variance will lead to a low value of \[ \text{MSE}=\text{Var}+(\text{Bias})^2, \] which will ensure good performance.
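For completeness, the variance of the average follows directly from the zero-correlation assumption, since all covariance terms vanish: \[ \text{Var}[\bar{U}] = \text{Var}\Big[\frac{1}{n}\sum_{i=1}^n U_i\Big] = \frac{1}{n^2}\sum_{i=1}^n \text{Var}[U_i] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}. \]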
How to perform this averaging?
Denote as \(\phi_{\mathcal{S}}(x)\) the predicted class for predictor value \(x\) returned by the classification tree associated with sample \(\mathcal{S}=\{(X_i,Y_i), \ i=1,\ldots,n\}\).
Bagging of this tree considers predictions from \(B\) bootstrap samples \[\begin{matrix} \mathcal{S}^{*1} &=& ((X_1^{*1},Y_1^{*1}), \ldots, (X_n^{*1},Y_n^{*1})) & \leadsto & \phi_{\mathcal{S}^{*1}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*b} &=& ((X_1^{*b},Y_1^{*b}), \ldots, (X_n^{*b},Y_n^{*b})) & \leadsto & \phi_{\mathcal{S}^{*b}}(x) \\ \vdots & & & & \vdots \\ \mathcal{S}^{*B} &=& ((X_1^{*B},Y_1^{*B}), \ldots, (X_n^{*B},Y_n^{*B})) & \leadsto & \phi_{\mathcal{S}^{*B}}(x) \\ \end{matrix}\] then proceeds by majority voting (i.e., the most frequently predicted class wins): \[ \phi_{\mathcal{S}}^\text{Bagging}(x) = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(x)=k \} \]
For \(x=\,\)(entry, time, cens)\(\,=\,\)(782, 127, 1), three trees grown on three bootstrap samples predict Female, Male, and Male, respectively; majority voting therefore returns Male as the bagging prediction.
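A hand-rolled version of this majority-voting scheme might look as follows (a sketch only; it again assumes rpart for the individual trees and uses a small \(B\) purely for illustration):

# Bagging 'by hand': one tree per bootstrap sample, then a majority vote
# over the B predicted classes (sketch; assumes 'rpart').
library(rpart)

set.seed(1)
B <- 25
n <- nrow(channing)
x_new <- data.frame(entry = 782, time = 127, cens = 1)

votes <- character(B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)          # bootstrap sample S^{*b}
  tree_b <- rpart(sex ~ entry + time + cens,
                  data = channing[idx, ], method = "class")
  votes[b] <- as.character(predict(tree_b, x_new, type = "class"))
}
table(votes)                         # how often each class is predicted
names(which.max(table(votes)))       # majority-vote (bagging) prediction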
Of course, \(B\) is usually much larger (\(B=500\)? \(B=1000\)?), which requires automating the process (through, e.g., the boot function).
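One convenient off-the-shelf alternative to looping (or to the boot function mentioned above) is the randomForest package (an assumption of this sketch, not part of the original notes): with mtry set to the number of predictors, every split considers all predictors, so the random forest reduces to bagged trees.

# Bagging via 'randomForest' (sketch; assumes the package is installed).
# mtry = 3 = number of predictors, so the forest coincides with bagging.
library(randomForest)

set.seed(1)
bag <- randomForest(sex ~ entry + time + cens, data = channing,
                    ntree = 500, mtry = 3)
predict(bag, newdata = data.frame(entry = 782, time = 127, cens = 1))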
We repeat \(M=1000\) times the following experiment: randomly split the data set into a training set and a test set, train both a single classification tree and its bagged version on the training set, and evaluate the misclassification rate of each on the test set.
This provides \(M=1000\) test errors for the direct (single-tree) approach, and \(M=1000\) test errors for the bagging approach.
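The experiment can be sketched as follows (assumptions: rpart and randomForest as above, a 70/30 train/test split, and a smaller \(M\) to keep the illustration fast):

# Repeated train/test comparison of a single tree vs. bagging (sketch).
library(rpart)
library(randomForest)

set.seed(1)
M <- 100                                   # M = 1000 in the text
n <- nrow(channing)
err_tree <- err_bag <- numeric(M)

for (m in 1:M) {
  train <- sample(n, round(0.7 * n))       # assumed 70/30 split
  test  <- setdiff(1:n, train)

  tree <- rpart(sex ~ ., data = channing[train, ], method = "class")
  err_tree[m] <- mean(predict(tree, channing[test, ], type = "class")
                      != channing$sex[test])

  bag <- randomForest(sex ~ ., data = channing[train, ],
                      ntree = 200, mtry = 3)
  err_bag[m] <- mean(predict(bag, channing[test, ]) != channing$sex[test])
}

c(single_tree = mean(err_tree), bagging = mean(err_bag))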
Several strategies to estimate the prediction accuracy of a classifier:
(1) Compute a test error (as above): Partition the data set \(\mathcal{S}\) into a training set \(\mathcal{S}_\text{train}\) (to train the classifier) and a test set \(\mathcal{S}_\text{test}\) (on which to evaluate the misclassification rate \(e_\text{test}\)).
(2) Compute an \(L\)-fold cross-validation error:
Partition the data set \(\mathcal{S}\) into \(L\) folds \(\mathcal{S}_{\ell}\), \(\ell=1,\ldots,L\). For each \(\ell\), evaluate the test error \(e_{\text{test},\ell}\) associated with training set \(\mathcal{S}\setminus\mathcal{S}_{\ell}\) and test set \(\mathcal{S}_{\ell}\).
The quantity \[ e_\text{CV}=\frac{1}{L}\sum_{\ell=1}^L e_{\text{test},\ell} \] is then the (\(L\)-fold) ‘cross-validation error’.
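A direct computation of this quantity for the single-tree classifier might look like this (a sketch; rpart assumed, \(L=10\) folds chosen for illustration):

# L-fold cross-validation error for a single classification tree (sketch).
library(rpart)

set.seed(1)
L <- 10
n <- nrow(channing)
fold <- sample(rep(1:L, length.out = n))   # random fold assignment
err <- numeric(L)

for (l in 1:L) {
  train <- channing[fold != l, ]
  test  <- channing[fold == l, ]
  tree  <- rpart(sex ~ ., data = train, method = "class")
  err[l] <- mean(predict(tree, test, type = "class") != test$sex)
}

e_CV <- mean(err)                          # L-fold cross-validation error
e_CV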
(3) Compute the Out-Of-Bag (OOB) error (specific to bagging procedures):
For each observation \(X_i\) from \(\mathcal{S}\), define the OOB prediction as \[ \phi_{\mathcal{S}}^\text{OOB}(X_i) = \underset{k\in\{1,\ldots,K\}}{\operatorname{argmax}} \# \{ b:\phi_{\mathcal{S}^{*b}}(X_i)=k \textrm{ and } (X_i,Y_i)\notin \mathcal{S}^{*b} \} \] This is majority voting that, quite naturally, discards the bootstrap samples in which \((X_i,Y_i)\) was used to train the classification tree. The OOB error is then the corresponding misclassification rate \[ e_\text{OOB}=\frac{1}{n}\sum_{i=1}^n \mathbb{1}[ \phi_{\mathcal{S}}^\text{OOB}(X_i) \neq Y_i] \]
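The OOB error can be computed by keeping track of which observations each bootstrap sample leaves out (a hand-rolled sketch, again assuming rpart; a randomForest fit such as the one above also reports an OOB error estimate directly, e.g. through its err.rate component):

# Out-of-bag error for bagged trees, computed 'by hand' (sketch).
library(rpart)

set.seed(1)
B <- 200
n <- nrow(channing)
classes <- levels(channing$sex)
# votes[i, k] = number of trees for which observation i is out-of-bag and
# whose prediction for X_i equals classes[k]
votes <- matrix(0, n, length(classes), dimnames = list(NULL, classes))

for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE)                 # bootstrap sample
  oob  <- setdiff(1:n, idx)                            # left-out observations
  tree <- rpart(sex ~ entry + time + cens,
                data = channing[idx, ], method = "class")
  pred <- predict(tree, channing[oob, ], type = "class")
  for (k in classes)
    votes[oob[pred == k], k] <- votes[oob[pred == k], k] + 1
}

oob_pred <- classes[max.col(votes)]        # majority vote per observation
e_OOB <- mean(oob_pred != channing$sex)
e_OOB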