Lecture notes

From a course by Anne M. Ruiz and Colombe Becquart.

Published: January 12, 2026

1 Generalities

1.1 Data

Typical applications of discriminant analysis include credit scoring, quality control, and risk prediction.

We assume that the data consist of \(n\) observations described by \(p\) quantitative variables. The observations belong to \(k < p\) classes or groups, which are represented by a qualitative variable with \(k\) categories. The data are stored in a table with \(n\) rows (observations) and \(p+1\) columns (variables, including the qualitative variable).

Let \(X^1, X^2, \ldots, X^p\) denote the \(p\) quantitative variables, which will be considered as explanatory variables. Let \(Y\) denote the qualitative variable, which is the variable to be explained.
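
As a purely illustrative sketch (this dataset and the column name Y are assumptions, not part of the course), such an \(n \times (p+1)\) table can be set up in Python as follows:

```python
# A minimal sketch of the data layout, using the iris data purely as an illustration:
# n observations, p quantitative variables X^1, ..., X^p, and one qualitative variable Y.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = iris.frame.rename(columns={"target": "Y"})  # n rows and p + 1 columns
X = data.drop(columns="Y")                         # the p explanatory variables
Y = data["Y"]                                      # the qualitative variable (k = 3 classes)
print(data.shape)                                  # (150, 5), i.e. n = 150 and p = 4
print(Y.value_counts())                            # class sizes n_1, ..., n_k
```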

1.2 Objectives

The primary objective of discriminant analysis (DA) is to separate the \(k\) classes as effectively as possible using the \(p\) explanatory variables. To achieve this, DA constructs new variables called discriminant variables. These variables can then be used to produce graphical representations that illustrate the separation of the \(k\) groups based on the \(p\) explanatory variables.

The second objective of DA is to classify a new observation, described by these \(p\) quantitative variables, into one of the \(k\) groups (also called group allocation).


1.3 Linear discriminant variables (heuristics)

A linear discriminant variable is a linear combination of the \(p\) explanatory variables that separates the \(k\) groups as effectively as possible.

In general, at most \(k-1\) discriminant functions are sufficient to separate \(k\) classes.


2 Discriminant Analysis

2.1 Between-group and within-group variances

In addition to the mean vector and covariance matrix of the entire dataset, DA computes them separately for each class.

Let \(n_1, n_2, \ldots, n_k\) denote the number of observations in each class. Let \[ \overline{X^1}^{(1)}, \overline{X^1}^{(2)}, \ldots, \overline{X^1}^{(k)} \] denote the mean of variable \(X^1\) within each class, and similarly for \(X^2, \ldots, X^p\). Let \[ \overline{X}^{(1)}, \overline{X}^{(2)}, \ldots, \overline{X}^{(k)} \] denote the mean vectors for each class.

In the same manner, we compute the covariance matrices for each class, denoted \(V^{(j)}, j=1,\ldots,k\).

The total covariance matrix for all observations is: \[ V = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})(X_i - \overline{X})^\top. \]

We define two important matrices: the within-group and between-group covariance matrices.

Definition 1  

  • The within-class covariance matrix is \[ W = \frac{1}{n} \sum_{j=1}^k n_j V^{(j)}. \]

  • The between-class covariance matrix is \[ B = \frac{1}{n} \sum_{j=1}^k n_j (\overline{X}^{(j)} - \overline{X}) (\overline{X}^{(j)} - \overline{X})^\top. \]

Proposition 1 (Analysis of variance decomposition) The total covariance matrix is the sum of within-class and between-class covariance matrices: \[ V = W + B. \]
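
A minimal numpy sketch of Definition 1 and Proposition 1, assuming the observations are stored in an \((n, p)\) array with a vector of group labels; the simulated data at the end are made-up values used only to make the check runnable.

```python
import numpy as np

def within_between(X, y):
    """Within-class (W) and between-class (B) covariance matrices,
    with the 1/n normalization of Definition 1."""
    n, p = X.shape
    xbar = X.mean(axis=0)                                 # overall mean vector
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(y):
        Xg = X[y == g]
        n_g, xbar_g = Xg.shape[0], Xg.mean(axis=0)        # class size and class mean vector
        Vg = (Xg - xbar_g).T @ (Xg - xbar_g) / n_g        # class covariance V^(g)
        W += n_g / n * Vg
        B += n_g / n * np.outer(xbar_g - xbar, xbar_g - xbar)
    return W, B

# Check of Proposition 1 on simulated data (illustrative values only):
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
y = np.repeat([0, 1, 2], 30)
W, B = within_between(X, y)
V = np.cov(X, rowvar=False, bias=True)                    # total covariance with 1/n
print(np.allclose(V, W + B))                              # True
```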

In discriminant analysis, we seek discriminant variables that maximize the between-group variance while minimizing the within-group variance. This contrasts with PCA, which maximizes total variance without considering group structure.


Remark 3. In regression terms, TSS (total sum of squares) corresponds to \(V\), ESS (explained sum of squares) corresponds to \(B\), and RSS (residual sum of squares) corresponds to \(W\).

2.2 Mahalanobis Distance

For two vectors \(u\) and \(v\) in \(\mathbb{R}^p\), the Euclidean distance is \[ d^2(u,v) = \sum_{i=1}^p (u_i - v_i)^2 = (u-v)^\top (u-v). \]

Definition 2 (Mahalanobis distance) The Mahalanobis distance between \(u\) and \(v\) with respect to the within-class covariance \(W\) is \[ d_W^2(u,v) = (u-v)^\top W^{-1} (u-v). \]
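
A short sketch of Definition 2 (the helper name mahalanobis_sq is ours):

```python
import numpy as np

def mahalanobis_sq(u, v, W):
    """Squared Mahalanobis distance d_W^2(u, v) = (u - v)^T W^{-1} (u - v)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff @ np.linalg.solve(W, diff))

# With W equal to the identity, d_W^2 reduces to the squared Euclidean distance:
print(mahalanobis_sq([1.0, 2.0], [0.0, 0.0], np.eye(2)))  # 5.0
```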

2.3 Steps of the analysis

2.3.1 Computing discriminant variables

We start by computing the eigendecomposition of: \[ W^{-1} B \] (instead of the correlation matrix \(R\) in PCA).

Proposition 2  

  • \(W^{-1} B\) is diagonalizable and has non-negative eigenvalues.
  • At most \(k-1\) eigenvalues of \(W^{-1} B\) are non-zero.

Definition 3 The discriminant axes \(v_1, \dots, v_{k-1}\) are the eigenvectors of \(W^{-1} B\) associated with the eigenvalues sorted in decreasing order. The components of these vectors are called the raw canonical coefficients.

Definition 4 The discriminant variables \(\operatorname{CAN}^1, \dots, \operatorname{CAN}^{k-1}\) are obtained as linear combinations of the centered initial variables using the raw canonical coefficients: \[\operatorname{CAN}^j = \sum_{\ell=1}^p v_j^\ell (X^\ell - \overline{X}^\ell)\]

Proposition 3 The discriminant variables are uncorrelated.
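
A sketch of the computations in Definitions 3 and 4, assuming W and B come from the earlier sketch; the eigenvalues are returned because they feed the quality measures discussed next.

```python
import numpy as np

def discriminant_analysis(X, W, B, k):
    """Discriminant axes (raw canonical coefficients) and discriminant variables
    CAN^1, ..., CAN^{k-1}, from the eigendecomposition of W^{-1} B."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
    eigvals, eigvecs = eigvals.real, eigvecs.real        # eigenvalues are real and >= 0
    order = np.argsort(eigvals)[::-1][:k - 1]            # keep the k - 1 largest
    lambdas = eigvals[order]                             # lambda_1 >= ... >= lambda_{k-1}
    axes = eigvecs[:, order]                             # columns v_1, ..., v_{k-1}
    CAN = (X - X.mean(axis=0)) @ axes                    # CAN^j = sum_l v_j^l (X^l - mean)
    return lambdas, axes, CAN
```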

2.3.2 Global quality of discrimination

The squared canonical correlation coefficient: \[ \frac{\operatorname{Var}_B(\operatorname{CAN}^j)}{\operatorname{Var}(\operatorname{CAN}^j)} =\frac{\lambda_j}{1+\lambda_j} \] (where \(\lambda_j\) is the eigenvalue associated with the \(j\)-th discriminant variable) measures the proportion of variance of a discriminant variable explained by the between-group variance, analogous to \(R^2\) in regression. Values close to 1 indicate strong discrimination.
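
For instance, from hypothetical eigenvalues of \(W^{-1}B\) (made-up values, e.g. as returned by the sketch above):

```python
import numpy as np

# Squared canonical correlations lambda_j / (1 + lambda_j) for hypothetical eigenvalues.
lambdas = np.array([3.2, 0.3])
print(lambdas / (1 + lambdas))   # about [0.76, 0.23]: CAN1 discriminates much better than CAN2
```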

2.3.3 Dimension choice

When more than two discriminant variables exist, we must choose how many to retain. The importance of a discriminant variable decreases with its order, and the proportion of explained inertia, measured by the eigenvalues, can be used as a selection criterion.
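
A sketch of the explained-inertia criterion, again with made-up eigenvalues:

```python
import numpy as np

# Proportion of inertia explained by each discriminant variable,
# computed from hypothetical eigenvalues sorted in decreasing order.
lambdas = np.array([3.2, 0.3])
explained = lambdas / lambdas.sum()
print(explained)            # about [0.91, 0.09]
print(explained.cumsum())   # cumulative proportion used to decide how many axes to keep
```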

2.3.4 Interpretation of the discriminant variables

Coefficients (raw canonical coefficients) indicate the contribution of each variable. Correlations between the initial variables and discriminant variables can also be visualized using scatterplots.

2.3.5 Scatterplot of the observations on the discriminant axes

We plot the discriminant variables (CAN1, CAN2, . . . ), which correspond to the coordinates of the observations on these new variables, and use different symbols or colors to visualize the categories of the variable we explain.

If we have many discriminant variables, we can plot different scatterplots: (CAN1,CAN2), (CAN1,CAN3),… (similar to PCA).

Remark: With 2 groups there is only one discriminant variable. In that case, plot a histogram with different colors depending on the value of the categorical variable.
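
A plotting sketch (matplotlib assumed); CAN and y are simulated placeholders standing in for the quantities computed above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Scatterplot of the observations on the first two discriminant axes, one colour per group.
# CAN and y are fake placeholders; in practice they come from the discriminant analysis.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
CAN = rng.normal(size=(90, 2)) + np.c_[y, -y]   # fake coordinates, shifted per group

for g in np.unique(y):
    plt.scatter(CAN[y == g, 0], CAN[y == g, 1], label=f"group {g}")
plt.xlabel("CAN1")
plt.ylabel("CAN2")
plt.legend()
plt.show()
```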

2.3.6 Interpretation

All the observations can be interpreted (there is no quality-of-representation issue). We interpret the scatterplot in terms of discrimination of the groups (which axis discriminates which group? how good is the discrimination?). We also comment on the groups that are not well separated.


Remark. In some cases the discrimination is not perfect: some of the groups overlap on the scatterplots, and this also deserves comment.

3 Allocation (or classification) rule

We now consider the second objective of discriminant analysis. Once the discriminant variables have been obtained (leading to the best separation between groups), we consider a new observation for which only the explanatory variables are known, the explained variable (the category of the qualitative variable, i.e. the group) being unknown. Our aim is to allocate this new observation to one of the \(k\) groups (prediction objective).

There are several ways to allocate (or classify) an observation to one of the \(k\) possible groups. We give details on an intuitive (geometric) allocation rule; many others are based on probabilistic arguments, and we briefly present one of them.

3.1 Geometric classification rule

The idea is simple: for a given observation, compute its discriminant variables and allocate the observation to the group with the nearest mean. Because the distances are calculated on the discriminant variables (using the usual Euclidean formula), the rule takes the Mahalanobis distance into account. To implement this allocation rule, we need the mean of each class and the values of the observation, all expressed in terms of the discriminant variables:

\[f(x) = \arg\min_{1 \leq g \leq k} d_W^2 (x, \overline{X}^{(g)})\]
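
A direct sketch of this rule, written with the within-class covariance \(W\) and the class mean vectors (the function name geometric_rule is ours):

```python
import numpy as np

def geometric_rule(x, class_means, W):
    """Allocate x to the group g minimizing d_W^2(x, mean_g).
    class_means: (k, p) array whose rows are the group mean vectors."""
    x = np.asarray(x, dtype=float)
    W_inv = np.linalg.inv(W)
    d2 = [(x - m) @ W_inv @ (x - m) for m in class_means]
    return int(np.argmin(d2))   # index of the allocated group
```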


Example 1 If the group means are symmetric around 0, a positive CAN1 value indicates assignment to group 1, and a negative value indicates assignment to group 2.


Remark 4. The geometric classification rule leads to the prediction of a group for a new observation. This rule may lead to errors if the groups are not well separated. Its major drawback is that it does not take into account the possibility of unbalanced groups (for rare events, for instance). The following rule corrects this problem.

3.2 Probabilistic classification rules

If we assume a Gaussian distribution within each group and known prior probabilities, we can compute the posterior probabilities for each observation and assign it to the group with the highest posterior probability (Bayes rule). When all the within-group covariance matrices are assumed equal to \(W\) and the priors are estimated by the group frequencies \(n_g/n\), we obtain the rule: \[f(x) = \arg\min_{1 \leq g \leq k} \left\{ d_W^2 (x, \overline{X}^{(g)}) - 2 \log \frac{n_g}{n} \right\}\]
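
A sketch of this corrected rule, in the same spirit as the geometric one; the class sizes are passed in order to form the priors \(n_g/n\).

```python
import numpy as np

def probabilistic_rule(x, class_means, W, class_sizes):
    """Allocate x to the group g minimizing d_W^2(x, mean_g) - 2 log(n_g / n),
    i.e. the Bayes rule under Gaussian groups with common covariance W."""
    x = np.asarray(x, dtype=float)
    n = np.sum(class_sizes)
    W_inv = np.linalg.inv(W)
    scores = [(x - m) @ W_inv @ (x - m) - 2 * np.log(n_g / n)
              for m, n_g in zip(class_means, class_sizes)]
    return int(np.argmin(scores))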

3.3 Quality of classification

3.3.1 Confusion matrix

In order to measure the quality of a classification rule, one applies the rule to observations for which the group is already known and compares the predictions with the actual values in a confusion table.

If the categorical variable has \(k\) categories and we classify \(n\) observations, the confusion matrix has \(k\) columns corresponding to the \(k\) actual categories and \(k\) rows corresponding to the \(k\) predicted categories. Each of the \(n\) observations is counted in the corresponding cell of the table, so the table contains the cross-tabulated counts.

Table 1: A confusion table for \(k\) classes.
                    actual class 1   actual class 2   ⋯   actual class k
predicted class 1   \(n_{11}\)       \(n_{12}\)       ⋯   \(n_{1k}\)
predicted class 2   \(n_{21}\)       \(n_{22}\)       ⋯   \(n_{2k}\)
⋮                   ⋮                ⋮                    ⋮
predicted class k   \(n_{k1}\)       \(n_{k2}\)       ⋯   \(n_{kk}\)

The sum of all \(n_{ij}\) is \(n\).
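
A sketch computing such a table with pandas, keeping predicted classes in rows and actual classes in columns as in Table 1 (the labels below are made up):

```python
import pandas as pd

# Hypothetical actual and predicted classes for a handful of observations.
actual    = pd.Series([1, 1, 1, 2, 2, 3, 3, 3, 3], name="actual")
predicted = pd.Series([1, 1, 2, 2, 2, 3, 3, 1, 3], name="predicted")

# Cross-tabulated counts n_ij, with predicted classes in rows as in Table 1.
confusion = pd.crosstab(predicted, actual)
print(confusion)
print(confusion.to_numpy().sum())   # equals n, the number of classified observations
```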

In the case of only two classes, we have:

Figure 1: Confusion table. Source: Fawcett (2006).

Remark 5. Sensitivity = True Positive Rate (TPR) = Hit rate = Recall.

Specificity = 1 - False Positive Rate (FPR).

3.3.2 ROC and AUC

In the case where \(k=2\), we can represent the classification rule of LDA as a ROC (receiver operating characteristic) curve. Each classification rule is represented as a point in the ROC space, which is the \([0,1]^2\) square where the x-axis is the FPR and the y-axis is the TPR.

The diagonal corresponds to flip-a-coin classification, whereas the left and upper boundaries of the square correspond to perfect classification.

Since LDA produces scores (posterior probabilities) rather than binary class labels, we can choose a threshold to convert scores into class assignments. By varying this threshold from the extreme that assigns all observations to class 1 to the extreme that assigns all to class 2, we obtain a ROC curve, representing all possible classification rules associated with the considered discriminant analysis.

Then, the AUC (area under the curve) provides a global measurement of the performance of LDA as a classification method.
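
A sketch with scikit-learn, using made-up labels and scores standing in for LDA posterior probabilities of the positive class:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and scores (e.g. LDA posterior probabilities of class 1).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one ROC point per threshold
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")           # flip-a-coin diagonal
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
print(roc_auc_score(y_true, scores))               # global performance measure
```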

4 Invariant Coordinate Selection (ICS)

If the groups are unknown a priori, the between-group and within-group covariance matrices cannot be computed, so it is impossible to perform a discriminant analysis.

However, Invariant Coordinate Selection (ICS) can recover the discriminant subspace without knowing the groups, by replacing the between-group and within-group covariance matrices with two scatter matrices.

4.1 General principle

The objective of Invariant Coordinate Selection is to maximize a ratio of two scatter matrices that plays the role of the ratio of \(B\) to \(W\): \[\max_{v \in \mathbb{R}^p} \frac{v^\top\operatorname{Cov}_4[X]v}{v^\top\operatorname{Cov}[X]v}\]

4.2 Scatter matrices

Let \(X\) be a random vector of dimension \(p\) with distribution function \(F_X\). A scatter matrix of \(X\), denoted \(V(F_X)\), is a \(p \times p\) symmetric positive definite matrix that is affine equivariant. Let \(\mathcal{P}_p\) be the set of all symmetric positive definite matrices of order \(p\).

\[\operatorname{Cov}[X]=\mathbb{E}[(X - \mathbb{E}(X))(X - \mathbb{E}(X))^\top]\]
\[\operatorname{Cov}_4[X]=\frac{1}{p+2} \mathbb{E}[d^2(X - \mathbb{E}(X))(X - \mathbb{E}(X))^\top]\] where \(d^2\) is the square of the Mahalanobis distance: \[d^2={(X - \mathbb{E}(X))^\top\operatorname{Cov}[X]^{-1}(X - \mathbb{E}(X))}\]
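
An empirical version of \(\operatorname{Cov}_4\) can be sketched as follows (expectations replaced by sample means; the helper name cov4 is ours):

```python
import numpy as np

def cov4(X):
    """Empirical Cov_4 scatter matrix: (1 / (p + 2)) times the average of
    d_i^2 (x_i - xbar)(x_i - xbar)^T, where d_i^2 is the squared Mahalanobis
    distance of x_i with respect to the empirical covariance."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    d2 = np.einsum("ij,jk,ik->i", Xc, cov_inv, Xc)   # squared Mahalanobis distances
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))
```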

4.3 Component calculations

To perform this comparison, ICS relies on the joint diagonalization of \(V_1\) and \(V_2\). \[\begin{aligned} H^\top V_1H &= I_p \\ H^\top V_2H &= \Delta \end{aligned}\] where \(V_1, V_2 \in \mathcal{P}_p\), \(\Delta = \text{diag}(\rho_1, \dots, \rho_p)\), \(\rho_1, \dots, \rho_p\) being the eigenvalues of \(V_1^{-1}V_2\) sorted in descending order, and \(H = (h_1, \dots, h_p)\) is a matrix containing the corresponding eigenvectors.

The invariant components are: \(Z = H^\top(Y - \mathbb{E}(Y))\)
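
A sketch of the joint diagonalization with \(V_1 = \operatorname{Cov}\) and \(V_2 = \operatorname{Cov}_4\), reusing the cov4 helper from the previous sketch; scipy's generalized symmetric eigensolver yields exactly \(H^\top V_1 H = I_p\) and \(H^\top V_2 H = \Delta\).

```python
import numpy as np
from scipy.linalg import eigh

def ics(X):
    """Invariant components with V1 = Cov and V2 = Cov_4 (cov4 defined above):
    solve V2 h = rho V1 h, so that H^T V1 H = I_p and H^T V2 H = diag(rho)."""
    V1 = np.cov(X, rowvar=False, bias=True)
    V2 = cov4(X)
    rho, H = eigh(V2, V1)                     # generalized symmetric eigenproblem
    order = np.argsort(rho)[::-1]             # eigenvalues in descending order
    rho, H = rho[order], H[:, order]
    Z = (X - X.mean(axis=0)) @ H              # invariant components, one row per observation
    return rho, H, Z
```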

4.4 Discriminant subspace

Let \(Y\) be a random vector following a mixture of \(k\) Gaussian distributions with the same covariance matrix: \[Y \sim \sum_{j=1}^{k}\alpha_j N_p(\mu_j, \Gamma)\] where \(\alpha_j > 0\) for \(j=1,\dots, k\) and \(\sum_{j=1}^{k}\alpha_j = 1\), \(\mu_j \in \mathbb{R}^p\) for \(j=1,\dots, k\), and \(\Gamma \in \mathcal{P}_p\).

The discriminant subspace \(\phi\) is given by: \[\phi = \langle \phi_1, \ldots, \phi_k \rangle\] where \(\phi_j = \Gamma^{-1} (\mu_j - \mu)\) for \(j=1,\ldots, k\) and \(\mu = \sum_{j=1}^{k}\alpha_j \mu_j\) is the overall mean.
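
For a hypothetical mixture with made-up parameters, these spanning directions can be computed as follows (here \(\mu\) is taken as the overall mean \(\sum_j \alpha_j \mu_j\)):

```python
import numpy as np

# Directions spanning the discriminant subspace for a made-up Gaussian mixture
# with k = 2 groups in R^2 (alpha, mu_j and Gamma are illustrative values only).
alpha = np.array([0.7, 0.3])
mus = np.array([[0.0, 0.0], [3.0, 1.0]])        # rows: mu_1, mu_2
Gamma = np.array([[1.0, 0.3], [0.3, 1.0]])
mu = alpha @ mus                                # overall mean sum_j alpha_j mu_j
phi = np.linalg.solve(Gamma, (mus - mu).T).T    # rows: phi_j = Gamma^{-1} (mu_j - mu)
print(phi)                                      # the rows are proportional: dim(phi) = k - 1 = 1
```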

4.5 Properties

  1. The first and last invariant components generate the Fisher discriminant subspace, under certain conditions.

    • A certain number of eigenvalues will be equal, while the remainder will be either larger or smaller than the equal ones.

    • The first (last) components are associated with the largest (smallest) eigenvalues.

    • Thus, the focus of our analysis will be on the largest and/or smallest eigenvalues.

  2. The components associated with non-equal eigenvalues are invariant under affine transformations, up to a factor of \(\pm 1\).

5 Conclusion

Discriminant analysis is appropriate when:

  • Observations are described by quantitative variables and belong to distinct groups.
  • The goal is to separate these groups using quantitative variables.
  • We wish to classify new observations into one of the groups.
  • The number of groups \(k\) is smaller than the number of variables \(p\).

Remark 6. This lecture covers linear discriminant analysis (canonical). Quadratic discriminant analysis is an extension that allows for group-specific covariance matrices.

References

Fawcett, Tom. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters, ROC Analysis in Pattern Recognition, 27 (8): 861–74. https://doi.org/10.1016/j.patrec.2005.10.010.