Case Study in R: Recruitment agency

Last updated on January 12, 2026

1 Generalities

1.1 Data

Code
# Data
library(MASS)
recruitment <- read.csv("data/recruitment.csv", row.names = 1)

# Transformation of the variable to predict in a factor
recruitment$Res[which(recruitment$Res == 0)] <- "Rejected"
recruitment$Res[which(recruitment$Res == 1)] <- "Accepted"
recruitment$Res <- factor(recruitment$Res, levels = c("Rejected", "Accepted"), ordered = T)
recruitment
Id Dip Test Exp Res
A 1 5 4 Rejected
B 2 3 3 Rejected
C 1 4 5 Accepted
D 2 3 4 Rejected
E 1 4 4 Rejected
F 4 3 4 Accepted
G 3 4 4 Accepted
H 1 1 5 Rejected
I 3 2 5 Accepted
J 5 4 4 Accepted
Code
library(plotly)

fig <- plot_ly(recruitment, x = ~Dip, y = ~Test, z = ~Exp, color = ~Res, colors = c("#BF382A", "#0C4B8E"))
fig <- fig %>% layout(scene = list(xaxis = list(title = "Diploma"),
                     yaxis = list(title = "Aptitude Test"),
                     zaxis = list(title = "Experience")))
fig
Figure 1

1.2 Objectives

  • First objective: explain the variable Res using the three scores obtained by the candidates, and obtain a graphical representation that separates the good candidates from the others using only these three scores.

  • Second objective: predict whether a new candidate will be good or not from their three scores (without knowing the variable Res for this candidate).

1.3 Linear discriminant functions (or discriminant variables)

2 Discriminant Analysis

2.1 Between-group and within-group covariances

Code
x <- recruitment[, 2:4]
y <- recruitment[, 5]
n <- nrow(x)

covB <- Reduce("+", lapply(split(x, y), function(g) nrow(g) * (colMeans(g) - colMeans(x)) %*% t(colMeans(g) - colMeans(x)))) / n
rownames(covB) <- rownames(cov(x))
covB |> as.data.frame()
Table 1: Between-group covariance.
Dip Test Exp
Dip 0.81 0.09 0.18
Test 0.09 0.01 0.02
Exp 0.18 0.02 0.04
Code
covW <- Reduce("+", lapply(split(x, y), function(g) (nrow(g)-1) * cov(g))) / n
covW |> as.data.frame()
Table 2: Within-group covariance.
Dip Test Exp
Dip 1.00 -0.08 -0.34
Test -0.08 1.20 -0.28
Exp -0.34 -0.28 0.32
Code
((n - 1) / n * cov(x)) |> as.data.frame()
Table 3: Total covariance.
Dip Test Exp
Dip 1.81 0.01 -0.16
Test 0.01 1.21 -0.26
Exp -0.16 -0.26 0.36
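
These three matrices are linked by the usual decomposition: the total covariance is the sum of the between-group and within-group covariances. This can be checked directly on the matrices just computed (a minimal sketch):

Code
# Huygens decomposition: total covariance = between + within
covT <- (n - 1) / n * cov(x)
all.equal(covT, covB + covW, check.attributes = FALSE)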

2.2 Mahalanobis Distance
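
As an illustration on these data, the Mahalanobis distance between the two group means can be computed with the within-group covariance matrix covW obtained above; a minimal sketch using stats::mahalanobis():

Code
# Group means of the three scores
m_rej <- colMeans(x[y == "Rejected", ])
m_acc <- colMeans(x[y == "Accepted", ])

# Squared Mahalanobis distance between the two group means,
# with the within-group covariance covW as metric
mahalanobis(m_acc, m_rej, covW)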

2.3 Steps of the analysis

Computing the discriminant variables

Code
# LDA
lda_fit <- lda(Res ~ Dip + Test + Exp, recruitment)
lda_fit
Call:
lda(Res ~ Dip + Test + Exp, data = recruitment)

Prior probabilities of groups:
Rejected Accepted 
     0.5      0.5 

Group means:
         Dip Test Exp
Rejected 1.4  3.2 4.0
Accepted 3.2  3.4 4.4

Coefficients of linear discriminants:
           LD1
Dip  1.2375669
Test 0.6320478
Exp  2.1780117
Code
eigen(solve(covW) %*% covB)
eigen() decomposition
$values
[1] 3.250668e+00+0.000000e+00i 2.460960e-17+8.974585e-17i
[3] 2.460960e-17-8.974585e-17i

$vectors
             [,1]                   [,2]                   [,3]
[1,] 0.4790158+0i -0.1314896+0.07112118i -0.1314896-0.07112118i
[2,] 0.2446421+0i -0.2431902-0.64009065i -0.2431902+0.64009065i
[3,] 0.8430268+0i  0.7132985+0.00000000i  0.7132985+0.00000000i

Only the first eigenvalue (about 3.25) is meaningful: with two groups, the between-group covariance matrix has rank 1, so the other eigenvalues are numerically zero (hence the tiny complex values above). The coefficient vector returned by lda() (the CAN1/LD1 vector) is proportional to the corresponding eigenvector, rescaled so that LD1 has unit within-group variance:

Code
as.data.frame(lda_fit$scaling)
LD1
Dip 1.2375669
Test 0.6320478
Exp 2.1780117
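
The proportionality with the eigenvector can be checked directly (up to an overall sign, the three ratios should be essentially identical); a quick sketch:

Code
# Ratio between the lda() coefficients and the leading eigenvector
v1 <- Re(eigen(solve(covW) %*% covB)$vectors[, 1])
lda_fit$scaling[, 1] / v1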

The first observation and the means are:

Code
x[1, ]
Dip Test Exp
1 5 4
Code
as.data.frame(t(colMeans(x)))
Dip Test Exp
2.3 3.3 4.2

The CAN1 score of the first observation is:

Code
x_cen <- sweep(x, 2, colMeans(x))
sum(lda_fit$scaling * x_cen[1,])
[1] -0.969958

It can be obtained directly using:

Code
predict(lda_fit, recruitment)$x[1]
[1] -0.969958

Global quality of discrimination

Code
lda_fit$svd^2 / (1 + lda_fit$svd^2)
[1] 0.9629703

Dimension choice

In our small example there are only two groups, so there is a single discriminant variable, which we keep.
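
This can be read off the fitted object, since there are at most min(g - 1, p) discriminant variables; a quick check:

Code
# At most min(g - 1, p) discriminant variables; here min(2 - 1, 3) = 1
min(nlevels(y) - 1, ncol(x))
ncol(lda_fit$scaling)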

Interpretation of the discriminant variables

Code
sum(diag(cov(x)))
[1] 3.755556
Code
cor(predict(lda_fit, recruitment)$x, x) |> as.data.frame()
Dip Test Exp
LD1 0.7649719 0.103956 0.381172
Code
sum(cor(x, predict(lda_fit, recruitment)$x)^2)
[1] 0.741281
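
Squaring these correlations gives the share of each score's variance accounted for by LD1; the diploma score is by far the most strongly related to the discriminant axis:

Code
# Squared correlations between LD1 and each original score
cor(predict(lda_fit, recruitment)$x, x)^2 |> as.data.frame()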

Scatterplot of the observations on the discriminant axes

Because there is only one discriminant variable:

Code
plot(lda_fit)
Figure 2

Interpretation

3 Allocation (or classification) rule

The objective is to predict if a new candidate will meet the needs of the company.

3.1 Geometric classification rule
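
The geometric rule assigns a candidate to the group whose mean is closest in Mahalanobis distance, with the within-group covariance as metric; with equal priors it coincides with the rule used by lda(). A minimal sketch on the training data:

Code
# Distance of each candidate to the two group means (within-group metric)
m_rej <- colMeans(x[y == "Rejected", ])
m_acc <- colMeans(x[y == "Accepted", ])
d_rej <- mahalanobis(x, m_rej, covW)
d_acc <- mahalanobis(x, m_acc, covW)

# Assign each candidate to the nearest group mean
ifelse(d_acc < d_rej, "Accepted", "Rejected")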

3.2 Probabilistic classification rules

The first observation is classified as:

Code
lda_pred <- predict(lda_fit, recruitment)
lda_pred$class[1]
[1] Rejected
Levels: Rejected Accepted

Its true class was:

Code
recruitment[1, 5]
[1] Rejected
Levels: Rejected < Accepted
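
The probabilistic rule relies on the posterior probabilities of the two classes, which can be inspected directly:

Code
# Posterior probabilities for each candidate (rounded for readability)
round(lda_pred$posterior, 3)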
Code
library(MASS)
library(plotly)
library(dplyr)

# Fit LDA
lda_fit <- lda(Res ~ Dip + Test + Exp, data = recruitment)

# Grid
x_seq <- seq(min(recruitment$Dip), max(recruitment$Dip), length.out = 15)
y_seq <- seq(min(recruitment$Test), max(recruitment$Test), length.out = 15)
z_seq <- seq(min(recruitment$Exp), max(recruitment$Exp), length.out = 15)

grid <- expand.grid(Dip = x_seq, Test = y_seq, Exp = z_seq)
grid$Res_pred <- predict(lda_fit, grid)$class

# 3D scatter plot of original points
fig <- plot_ly(
  recruitment,
  x = ~Dip, y = ~Test, z = ~Exp,
  color = ~Res,
  colors = c("#BF382A", "#0C4B8E"),
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 5)
)

# Add decision regions as semi-transparent points
fig <- fig %>%
  add_trace(
    data = grid,
    x = ~Dip, y = ~Test, z = ~Exp,
    color = ~Res_pred,
    colors = c("#FF9999", "#9999FF"),
    type = "scatter3d",
    mode = "markers",
    marker = list(size = 3, opacity = 0.15),
    showlegend = FALSE
  )

fig
Figure 3: A 3D visualisation of the rule.
Code
library(candisc)
library(ggplot2)
library(cowplot)

# Predictor names
predictors <- c("Dip", "Test", "Exp")
n <- length(predictors)

all_plots <- list()

for (i in 1:n) {
  for (j in 1:n) {
    if (i == j) {
      # Diagonal: just show variable name
      p <- ggplot() +
        annotate("text", x = 0.5, y = 0.5, label = predictors[i], size = 6, fontface = "bold") +
        theme_void()  # no axes, ticks, grid
    } else {
      # Off-diagonal: LDA decision region for predictor pair
      f <- as.formula(paste(predictors[i], "~", predictors[j]))
      p <- plot_discrim(lda_fit, f, resolution = 400) +
        theme_minimal() +
        theme(legend.position = "none")
    }
    all_plots <- c(all_plots, list(p))
  }
}

# Arrange all plots in an n x n matrix
plot_grid(plotlist = all_plots, ncol = n)
Figure 4: A matrix of 2D scatterplots to visualise the rule.

3.3 Quality of classification

Confusion matrix

Code
table(recruitment$Res, lda_pred$class) |> unclass() |> as.data.frame()
Rejected Accepted
Rejected 5 0
Accepted 0 5
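
This confusion matrix is computed on the training data, so it is optimistic. A leave-one-out estimate can be obtained with the CV option of lda(), for example:

Code
# Leave-one-out cross-validated predictions
lda_cv <- lda(Res ~ Dip + Test + Exp, data = recruitment, CV = TRUE)
table(recruitment$Res, lda_cv$class) |> unclass() |> as.data.frame()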

ROC and AUC

In the case of two classes:

Code
library(ROCR)
pred <- prediction(lda_pred$posterior[, 2], recruitment$Res)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
Figure 5: Perfect classification, as it is computed on the training data.
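
The corresponding AUC can be obtained from the same prediction object; it equals 1 here since the two classes are perfectly separated on the training data:

Code
# Area under the ROC curve
performance(pred, "auc")@y.values[[1]]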

4 Invariant Coordinate Selection (ICS)

ICS and the discriminant subspace

Figure 6: LDA, PCA, and ICS compared for two groups. Source: Colombe Becquart.
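
As an illustration, the invariant coordinates of the three scores can be computed with the ICS package; the sketch below assumes the classical cov/cov4 scatter pair used by default (function names may differ between package versions):

Code
library(ICS)

# Invariant Coordinate Selection on the three scores
ics_fit <- ics(as.matrix(x))      # default scatter pair: cov and cov4
ic <- ics.components(ics_fit)     # invariant coordinates IC.1, ..., IC.3

# The extreme invariant coordinates often reveal the group structure
boxplot(ic$IC.1 ~ y, ylab = "IC.1")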

5 Conclusion