Case Study in Python: Recruitment Agency

Last updated on January 19, 2026

1 About trees

1.1 Introduction

1.2 Examples

  1. Predict the land use of a given area (agriculture, forest, etc.) from satellite imagery, meteorological data, socio-economic information, price information, etc.

  2. Determine how computer performance is related to a number of variables describing the features of a PC (the cache size, the cycle time, the memory size, and the number of channels; the last two were not measured directly, only their minimum and maximum values).

Risk of heart attack

Predict high risk of heart attack:

  • University of California: a study of patients admitted after a heart attack.
  • 19 variables collected during the first 24 hours for the 215 patients who survived that period.
  • Question: can the high-risk patients (those who will not survive 30 days) be identified?
Figure 1: A binary tree to identify patients at high risk of heart attack. Source: (Breiman et al. 1984, fig. 1.1).

1.3 Vocabulary

Figure 2: A (reversed) real tree.

1.4 When to use decision trees?

2 Procedure

2.1 Overview

2.2 Notations

2.3 Growing the tree

The need for a homogeneity criterion

Homogeneity of a split

Building a split: small classification example detailed

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import dtreeviz as dt

# Load the recruitment dataset: one row per candidate.
candidates = pd.read_csv("data/recruitment.csv")
candidates

# Fit a classification tree predicting the result ("Res") from
# experience ("Exp"), diploma ("Dip") and test score ("Test").
features = ["Exp", "Dip", "Test"]
output = "Res"
clf = DecisionTreeClassifier().fit(candidates[features], candidates[output])

# Export the fitted tree in Graphviz format for rendering.
export_graphviz(
    clf,
    "images/recruitment_tree.dot",
    feature_names=features,
    class_names=["Rejected", "Accepted"],
    filled=True,
)

# Visualise the same tree with dtreeviz.
clf_viz = dt.model(
    clf,
    candidates[features],
    candidates[output],
    feature_names=features,
    target_name=output,
    class_names=["Rejected", "Accepted"],
)
clf_viz.view()
Figure 3: The decision tree fitted to the recruitment data.
The fitted tree (text rendering of the Graphviz output):

Dip <= 2.5  (gini = 0.5, samples = 10, value = [5, 5], class = Rejected)
├── True:  Exp <= 4.5  (gini = 0.278, samples = 6, value = [5, 1], class = Rejected)
│   ├── True:  Rejected  (gini = 0.0, samples = 4, value = [4, 0])
│   └── False: Test <= 2.5  (gini = 0.5, samples = 2, value = [1, 1])
│       ├── True:  Rejected  (samples = 1, value = [1, 0])
│       └── False: Accepted  (samples = 1, value = [0, 1])
└── False: Accepted  (gini = 0.0, samples = 4, value = [0, 4])

Code
# Pairwise views of the partition the tree induces on the feature space.
clf_viz.ctree_feature_space(features=["Exp", "Dip"])
clf_viz.ctree_feature_space(features=["Exp", "Test"])
clf_viz.ctree_feature_space(features=["Dip", "Test"])

Computing Gini indices
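The Gini indices reported by scikit-learn in Figure 3 can be recomputed by hand. A minimal sketch, using the class counts from the tree above (root: 5 rejected, 5 accepted; split on Dip <= 2.5 into children with counts [5, 1] and [0, 4]):

```python
# Gini index of a node: G = 1 - sum_k p_k^2,
# where p_k is the proportion of class k in the node.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Root node of Figure 3: 5 rejected, 5 accepted candidates.
print(round(gini([5, 5]), 3))   # 0.5

# Children of the split on Dip <= 2.5.
left, right = [5, 1], [0, 4]
print(round(gini(left), 3))     # 0.278
print(round(gini(right), 3))    # 0.0

# Homogeneity of the split: child impurities weighted by child sizes.
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(round(weighted, 3))       # 0.167
```

The values 0.5 and 0.278 match those displayed at the root and left child of the tree; the split is chosen to minimise the weighted impurity of the children.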

2.4 Stopping the tree
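Stopping rules (pre-pruning) can be expressed directly as scikit-learn hyperparameters. A sketch on a synthetic dataset (the data and parameter values are illustrative, not from the recruitment example):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Without constraints, the tree grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Common stopping rules, one hyperparameter each:
stopped = DecisionTreeClassifier(
    max_depth=3,                 # stop beyond a maximum depth
    min_samples_split=10,        # do not split nodes with < 10 samples
    min_impurity_decrease=0.01,  # require a minimal impurity gain per split
    random_state=0,
).fit(X, y)

print(full.get_depth(), stopped.get_depth())
```

Stopping too early risks missing useful splits, which motivates the alternative of growing a large tree and pruning it afterwards (next section).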

2.5 Pruning the tree

Why and how pruning?

Figure 4: Why pruning?

A penalized criterion
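The cost-complexity criterion of Breiman et al. (1984) trades training error against tree size. A sketch of the standard form, where $R(T)$ is the training error of tree $T$, $|\widetilde{T}|$ its number of leaves, and $\alpha \ge 0$ the complexity penalty:

```latex
C_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|
```

For $\alpha = 0$ the full tree minimises $C_\alpha$; as $\alpha$ grows, smaller and smaller subtrees become optimal.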

Iterative pruning

Choosing the optimal subtree
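In scikit-learn, the sequence of critical $\alpha$ values and the corresponding pruned subtrees are available through cost-complexity pruning. A sketch on a synthetic dataset, selecting the subtree by cross-validation (data and selection strategy are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Critical alpha values at which nested subtrees become optimal.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit one pruned subtree per alpha; score each by cross-validation.
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
]

# Keep the subtree whose alpha maximises the cross-validated score.
best_alpha = path.ccp_alphas[max(range(len(scores)), key=scores.__getitem__)]
best = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, best.get_n_leaves())
```

Larger values of `ccp_alpha` yield smaller trees; the cross-validated score picks the subtree that generalises best along the pruning path.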

Pruning a tree with the rpart R function

References

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification And Regression Trees. 1st ed. Routledge. https://doi.org/10.1201/9781315139470.