Case Study in R: Recruitment agency

Last updated on January 12, 2026

1 Generalities

1.1 Data

Code
# Data
library(MASS)
recruitment <- read.csv("data/recruitment.csv", row.names = 1)

# Transformation of the variable to predict in a factor
recruitment$Res[which(recruitment$Res == 0)] <- "Rejected"
recruitment$Res[which(recruitment$Res == 1)] <- "Accepted"
recruitment$Res <- factor(recruitment$Res, levels = c("Rejected", "Accepted"), ordered = T)
recruitment
Id Dip Test Exp Res
A 1 5 4 Rejected
B 2 3 3 Rejected
C 1 4 5 Accepted
D 2 3 4 Rejected
E 1 4 4 Rejected
F 4 3 4 Accepted
G 3 4 4 Accepted
H 1 1 5 Rejected
I 3 2 5 Accepted
J 5 4 4 Accepted
Code
library(plotly)

fig <- plot_ly(recruitment, x = ~Dip, y = ~Test, z = ~Exp, color = ~Res, colors = c("#BF382A", "#0C4B8E"))
fig <- fig %>% layout(scene = list(xaxis = list(title = "Diploma"),
                     yaxis = list(title = "Aptitude Test"),
                     zaxis = list(title = "Experience")))
fig
Figure 1

1.2 Objectives

  • First objective: explain the variable Res using the three scores obtained by the candidates, and obtain a graphical representation that separates the good candidates from the others using only these three scores.

  • Second objective: predict whether a new candidate will be good or not from their three scores (without knowing the variable Res for this candidate).

1.3 Linear discriminant functions (or discriminant variables)

2 Discriminant Analysis

2.1 Between-group and within-group covariances

Code
x <- recruitment[, 2:4]
y <- recruitment[, 5]
n <- nrow(x)

covB <- Reduce("+", lapply(split(x, y), function(g) nrow(g) * (colMeans(g) - colMeans(x)) %*% t(colMeans(g) - colMeans(x)))) / n
rownames(covB) <- rownames(cov(x))
covB |> as.data.frame()
Table 1: Between-group covariance.
Dip Test Exp
Dip 0.81 0.09 0.18
Test 0.09 0.01 0.02
Exp 0.18 0.02 0.04
Code
covW <- Reduce("+", lapply(split(x, y), function(g) (nrow(g)-1) * cov(g))) / n
covW |> as.data.frame()
Table 2: Within-group covariance.
Dip Test Exp
Dip 1.00 -0.08 -0.34
Test -0.08 1.20 -0.28
Exp -0.34 -0.28 0.32
Code
((n - 1) / n * cov(x)) |> as.data.frame()
Table 3: Total covariance.
Dip Test Exp
Dip 1.81 0.01 -0.16
Test 0.01 1.21 -0.26
Exp -0.16 -0.26 0.36
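
These three matrices are linked by the usual decomposition: the total covariance is the sum of the between-group and within-group covariances. This can be checked directly on the matrices just computed (a minimal sketch):

Code
# Huygens decomposition: total covariance = between + within
covT <- (n - 1) / n * cov(x)
all.equal(covT, covB + covW, check.attributes = FALSE)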

2.2 Mahalanobis Distance
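
As an illustration on these data, the Mahalanobis distance between the two group means can be computed with the within-group covariance matrix covW obtained above; a minimal sketch using stats::mahalanobis():

Code
# Group means of the three scores
m_rej <- colMeans(x[y == "Rejected", ])
m_acc <- colMeans(x[y == "Accepted", ])

# Squared Mahalanobis distance between the two group means,
# with the within-group covariance covW as metric
mahalanobis(m_acc, m_rej, covW)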

2.3 Steps of the analysis

Computing the discriminant variables

Code
# LDA
lda_fit <- lda(Res ~ Dip + Test + Exp, recruitment)
lda_fit
Call:
lda(Res ~ Dip + Test + Exp, data = recruitment)

Prior probabilities of groups:
Rejected Accepted 
     0.5      0.5 

Group means:
         Dip Test Exp
Rejected 1.4  3.2 4.0
Accepted 3.2  3.4 4.4

Coefficients of linear discriminants:
           LD1
Dip  1.2375669
Test 0.6320478
Exp  2.1780117
Code
eigen(solve(covW) %*% covB)
eigen() decomposition
$values
[1] 3.250668e+00+0.000000e+00i 2.460960e-17+8.974585e-17i
[3] 2.460960e-17-8.974585e-17i

$vectors
             [,1]                   [,2]                   [,3]
[1,] 0.4790158+0i -0.1314896+0.07112118i -0.1314896-0.07112118i
[2,] 0.2446421+0i -0.2431902-0.64009065i -0.2431902+0.64009065i
[3,] 0.8430268+0i  0.7132985+0.00000000i  0.7132985+0.00000000i

Only the first eigenvalue (about 3.25) is meaningful: with two groups, the between-group covariance matrix has rank 1, so the other eigenvalues are numerically zero (hence the tiny complex values above). The coefficient vector returned by lda() (the CAN1/LD1 vector) is proportional to the corresponding eigenvector, rescaled so that LD1 has unit within-group variance:

Code
as.data.frame(lda_fit$scaling)
LD1
Dip 1.2375669
Test 0.6320478
Exp 2.1780117
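
The proportionality with the eigenvector can be checked directly (up to an overall sign, the three ratios should be essentially identical); a quick sketch:

Code
# Ratio between the lda() coefficients and the leading eigenvector
v1 <- Re(eigen(solve(covW) %*% covB)$vectors[, 1])
lda_fit$scaling[, 1] / v1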

The first observation and the means are:

Code
x[1, ]
Dip Test Exp
1 5 4
Code
as.data.frame(t(colMeans(x)))
Dip Test Exp
2.3 3.3 4.2

The CAN1 score of the first observation is:

Code
x_cen <- sweep(x, 2, colMeans(x))
sum(lda_fit$scaling * x_cen[1,])
[1] -0.969958

It can be obtained directly using:

Code
predict(lda_fit, recruitment)$x[1]
[1] -0.969958

Global quality of discrimination

Code
lda_fit$svd^2 / (1 + lda_fit$svd^2)
[1] 0.9629703

Dimension choice

In our small example there are only two groups, so there is a single discriminant variable, which we keep.
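
This can be read off the fitted object, since there are at most min(g - 1, p) discriminant variables; a quick check:

Code
# At most min(g - 1, p) discriminant variables; here min(2 - 1, 3) = 1
min(nlevels(y) - 1, ncol(x))
ncol(lda_fit$scaling)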

Interpretation of the discriminant variables

Code
sum(diag(cov(x)))
[1] 3.755556
Code
cor(predict(lda_fit, recruitment)$x, x) |> as.data.frame()
Dip Test Exp
LD1 0.7649719 0.103956 0.381172
Code
sum(cor(x, predict(lda_fit, recruitment)$x)^2)
[1] 0.741281
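
Squaring these correlations gives the share of each score's variance accounted for by LD1; the diploma score is by far the most strongly related to the discriminant axis:

Code
# Squared correlations between LD1 and each original score
cor(predict(lda_fit, recruitment)$x, x)^2 |> as.data.frame()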

Scatterplot of the observations on the discriminant axes

Because there is only one discriminant variable:

Code
plot(lda_fit)
Figure 2

Interpretation

3 Allocation (or classification) rule

The objective is to predict if a new candidate will meet the needs of the company.

3.1 Geometric classification rule
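
The geometric rule assigns a candidate to the group whose mean is closest in Mahalanobis distance, with the within-group covariance as metric; with equal priors it coincides with the rule used by lda(). A minimal sketch on the training data:

Code
# Distance of each candidate to the two group means (within-group metric)
m_rej <- colMeans(x[y == "Rejected", ])
m_acc <- colMeans(x[y == "Accepted", ])
d_rej <- mahalanobis(x, m_rej, covW)
d_acc <- mahalanobis(x, m_acc, covW)

# Assign each candidate to the nearest group mean
ifelse(d_acc < d_rej, "Accepted", "Rejected")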

3.2 Probabilistic classification rules

The first observation is classified as:

Code
lda_pred <- predict(lda_fit, recruitment)
lda_pred$class[1]
[1] Rejected
Levels: Rejected Accepted

Its true class was:

Code
recruitment[1, 5]
[1] Rejected
Levels: Rejected < Accepted
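
The probabilistic rule relies on the posterior probabilities of the two classes, which can be inspected directly:

Code
# Posterior probabilities for each candidate (rounded for readability)
round(lda_pred$posterior, 3)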
Code
library(MASS)
library(plotly)
library(dplyr)

# Fit LDA
lda_fit <- lda(Res ~ Dip + Test + Exp, data = recruitment)

# Grid
x_seq <- seq(min(recruitment$Dip), max(recruitment$Dip), length.out = 15)
y_seq <- seq(min(recruitment$Test), max(recruitment$Test), length.out = 15)
z_seq <- seq(min(recruitment$Exp), max(recruitment$Exp), length.out = 15)

grid <- expand.grid(Dip = x_seq, Test = y_seq, Exp = z_seq)
grid$Res_pred <- predict(lda_fit, grid)$class

# 3D scatter plot of original points
fig <- plot_ly(
  recruitment,
  x = ~Dip, y = ~Test, z = ~Exp,
  color = ~Res,
  colors = c("#BF382A", "#0C4B8E"),
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 5)
)

# Add decision regions as semi-transparent points
fig <- fig %>%
  add_trace(
    data = grid,
    x = ~Dip, y = ~Test, z = ~Exp,
    color = ~Res_pred,
    colors = c("#FF9999", "#9999FF"),
    type = "scatter3d",
    mode = "markers",
    marker = list(size = 3, opacity = 0.15),
    showlegend = FALSE
  )

fig
Figure 3: A 3D visualisation of the rule.
Code
library(candisc)
library(ggplot2)
library(cowplot)

# Predictor names
predictors <- c("Dip", "Test", "Exp")
n <- length(predictors)

all_plots <- list()

for (i in 1:n) {
  for (j in 1:n) {
    if (i == j) {
      # Diagonal: just show variable name
      p <- ggplot() +
        annotate("text", x = 0.5, y = 0.5, label = predictors[i], size = 6, fontface = "bold") +
        theme_void()  # no axes, ticks, grid
    } else {
      # Off-diagonal: LDA decision region for predictor pair
      f <- as.formula(paste(predictors[i], "~", predictors[j]))
      p <- plot_discrim(lda_fit, f, resolution = 400) +
        theme_minimal() +
        theme(legend.position = "none")
    }
    all_plots <- c(all_plots, list(p))
  }
}

# Arrange all plots in an n x n matrix
plot_grid(plotlist = all_plots, ncol = n)
Figure 4: A matrix of 2D scatterplots to visualise the rule.

3.3 Quality of classification

Confusion matrix

Code
table(recruitment$Res, lda_pred$class) |> unclass() |> as.data.frame()
Rejected Accepted
Rejected 5 0
Accepted 0 5
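
This confusion matrix is computed on the training data, so it is optimistic. A leave-one-out estimate can be obtained with the CV option of lda(), for example:

Code
# Leave-one-out cross-validated predictions
lda_cv <- lda(Res ~ Dip + Test + Exp, data = recruitment, CV = TRUE)
table(recruitment$Res, lda_cv$class) |> unclass() |> as.data.frame()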

ROC and AUC

In the case of two classes:

Code
library(ROCR)
pred <- prediction(lda_pred$posterior[, 2], recruitment$Res)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
Figure 5: Perfect classification, as it is computed on the training data.
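
The corresponding AUC can be obtained from the same prediction object; it equals 1 here since the two classes are perfectly separated on the training data:

Code
# Area under the ROC curve
performance(pred, "auc")@y.values[[1]]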

4 Invariant Coordinate Selection (ICS)

ICS and the discriminant subspace

Figure 6: LDA, PCA, and ICS compared for two groups. Source: Colombe Becquart.
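
As an illustration, the invariant coordinates of the three scores can be computed with the ICS package; the sketch below assumes the classical cov/cov4 scatter pair used by default (function names may differ between package versions):

Code
library(ICS)

# Invariant Coordinate Selection on the three scores
ics_fit <- ics(as.matrix(x))      # default scatter pair: cov and cov4
ic <- ics.components(ics_fit)     # invariant coordinates IC.1, ..., IC.3

# The extreme invariant coordinates often reveal the group structure
boxplot(ic$IC.1 ~ y, ylab = "IC.1")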

5 Conclusion