Case Study in Python: Recruitment Agency

Last updated on January 19, 2026

1 About trees

1.1 Introduction

1.2 Examples

  1. Predict the land use of a given area (agriculture, forest, etc.) from satellite imagery, meteorological data, socio-economic information, price information, etc.

  2. Determine how computer performance is related to a number of variables describing the features of a PC (the cache size, the cycle time, the memory size, and the number of channels; the last two were not measured directly, only their minimum and maximum values).

Risk of heart attack

Predict high risk of heart attack:

  • University of California: a study of patients admitted after a heart attack.
  • 19 variables collected during the first 24 hours for the 215 patients who survived that period.
  • Question: can the high-risk patients (those who will not survive 30 days) be identified?
Figure 1: A binary tree to identify patients at high risk of heart attack. Source: (Breiman et al. 1984, fig. 1.1).

1.3 Vocabulary

Figure 2: A (reversed) real tree.

1.4 When to use decision trees?

2 Procedure

2.1 Overview

2.2 Notations

2.3 Growing the tree

The need for a homogeneity criterion

Homogeneity of a split

Building a split: small classification example detailed

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import dtreeviz as dt

# Load the recruitment dataset: one row per candidate.
candidates = pd.read_csv("data/recruitment.csv")
candidates

# Fit a classification tree predicting the result ("Res") from
# experience ("Exp"), diploma ("Dip") and test score ("Test").
features = ["Exp", "Dip", "Test"]
output = "Res"
clf = DecisionTreeClassifier().fit(candidates[features], candidates[output])

# Export the fitted tree in Graphviz format for rendering.
export_graphviz(
    clf,
    "images/recruitment_tree.dot",
    feature_names=features,
    class_names=["Rejected", "Accepted"],
    filled=True,
)

# Visualise the same tree with dtreeviz.
clf_viz = dt.model(
    clf,
    candidates[features],
    candidates[output],
    feature_names=features,
    target_name=output,
    class_names=["Rejected", "Accepted"],
)
clf_viz.view()
Figure 3: The decision tree fitted to the recruitment data.
The fitted tree (text rendering of the Graphviz output):

Dip <= 2.5  (gini = 0.5, samples = 10, value = [5, 5], class = Rejected)
├── True:  Exp <= 4.5  (gini = 0.278, samples = 6, value = [5, 1], class = Rejected)
│   ├── True:  Rejected  (gini = 0.0, samples = 4, value = [4, 0])
│   └── False: Test <= 2.5  (gini = 0.5, samples = 2, value = [1, 1])
│       ├── True:  Rejected  (samples = 1, value = [1, 0])
│       └── False: Accepted  (samples = 1, value = [0, 1])
└── False: Accepted  (gini = 0.0, samples = 4, value = [0, 4])

Code
# Pairwise views of the partition the tree induces on the feature space.
clf_viz.ctree_feature_space(features=["Exp", "Dip"])
clf_viz.ctree_feature_space(features=["Exp", "Test"])
clf_viz.ctree_feature_space(features=["Dip", "Test"])

Computing Gini indices
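The Gini indices reported by scikit-learn in Figure 3 can be recomputed by hand. A minimal sketch, using the class counts from the tree above (root: 5 rejected, 5 accepted; split on Dip <= 2.5 into children with counts [5, 1] and [0, 4]):

```python
# Gini index of a node: G = 1 - sum_k p_k^2,
# where p_k is the proportion of class k in the node.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Root node of Figure 3: 5 rejected, 5 accepted candidates.
print(round(gini([5, 5]), 3))   # 0.5

# Children of the split on Dip <= 2.5.
left, right = [5, 1], [0, 4]
print(round(gini(left), 3))     # 0.278
print(round(gini(right), 3))    # 0.0

# Homogeneity of the split: child impurities weighted by child sizes.
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(round(weighted, 3))       # 0.167
```

The values 0.5 and 0.278 match those displayed at the root and left child of the tree; the split is chosen to minimise the weighted impurity of the children.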

2.4 Stopping the tree
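Stopping rules (pre-pruning) can be expressed directly as scikit-learn hyperparameters. A sketch on a synthetic dataset (the data and parameter values are illustrative, not from the recruitment example):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Without constraints, the tree grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Common stopping rules, one hyperparameter each:
stopped = DecisionTreeClassifier(
    max_depth=3,                 # stop beyond a maximum depth
    min_samples_split=10,        # do not split nodes with < 10 samples
    min_impurity_decrease=0.01,  # require a minimal impurity gain per split
    random_state=0,
).fit(X, y)

print(full.get_depth(), stopped.get_depth())
```

Stopping too early risks missing useful splits, which motivates the alternative of growing a large tree and pruning it afterwards (next section).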

2.5 Pruning the tree

Why and how pruning?

Figure 4: Why pruning?

A penalized criterion
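The cost-complexity criterion of Breiman et al. (1984) trades training error against tree size. A sketch of the standard form, where $R(T)$ is the training error of tree $T$, $|\widetilde{T}|$ its number of leaves, and $\alpha \ge 0$ the complexity penalty:

```latex
C_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|
```

For $\alpha = 0$ the full tree minimises $C_\alpha$; as $\alpha$ grows, smaller and smaller subtrees become optimal.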

Iterative pruning

Choosing the optimal subtree
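In scikit-learn, the sequence of critical $\alpha$ values and the corresponding pruned subtrees are available through cost-complexity pruning. A sketch on a synthetic dataset, selecting the subtree by cross-validation (data and selection strategy are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Critical alpha values at which nested subtrees become optimal.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit one pruned subtree per alpha; score each by cross-validation.
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
]

# Keep the subtree whose alpha maximises the cross-validated score.
best_alpha = path.ccp_alphas[max(range(len(scores)), key=scores.__getitem__)]
best = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, best.get_n_leaves())
```

Larger values of `ccp_alpha` yield smaller trees; the cross-validated score picks the subtree that generalises best along the pruning path.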

Pruning a tree with the rpart R function

References

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification And Regression Trees. 1st ed. Routledge. https://doi.org/10.1201/9781315139470.