*Already understand how LDA works? *Jump forward to the code*!*

The Linear Discriminant Analysis Algorithm (LDA) is a Machine Learning method used to categorize two or more groups based on their features. This algorithm discovers a feature subspace that optimizes the group separability. This process uses mainly statistical and probabilistic methods since it relies on Bayes’ Theorem, which allows calculating an event probability based on its relationship with another event.

Suppose you’re working with a high-dimensional data set in which we want to classify some observations, for example, the healthcare datasets have a lot of features for each patient — such as blood pressure, height, weight, etc. -. In that case, this algorithm comes in very handy since it will discover a linear projection that will allow it to optimize the data space into a lower-dimensional subspace that maximizes some measure of class separation.

LDA algorithm can be understood in a general way with the following example: let’s say we have a wheel factory and have two events: the wheel approved quality control (green), and the wheel didn’t approve the quality control (red), and it’s all based on a single feature. Let’s say the wheel curvature.

We can see that with a single feature, there’s some overlapping, and we can’t make a decision or have a clear idea of the situation here. But what does this overlapping mean?

This graphic gives us information about a false negative, which would be a good wheel that failed the quality control — green circle -, and two false positives, some bad wheels that approved the quality control.

To avoid that, we can add another feature, let’s call it wheel diameter, this way, we have a better scatterplot that will help to compute the LDA algorithm.

With this scatterplot, it’s nearly impossible to draw a straight line that separates the groups efficiently, but with LDA we can reduce this 2-dimensional plane into a 1-dimensional one.

The LDA algorithm will create a new axis. The new axis will maximize the distance between the means of both groups and minimize the variance within the individual group. Also, this method will project all the data onto this new axis, making a 1-Dimensional separated dataset. It can show us as result a plot like the following.

This way the algorithm gives us better information about the wheels based on the two features but showed as only one, giving us a perfect classification that will help us in our decision-making process about our wheel quality control system.

If the distributions’ means become shared, the LDA Algorithm will fail due to the difficulties in identifying a new axis that makes both classes linearly separable. Non-linear discriminant analysis is employed in such cases.

But how can Linear Discriminant Analysis do all this job? We’ll see all the mathematical explanations with the Python Tutorial, using a well-known library called scikit-learn and its Iris dataset.

To start with the tutorial first let’s clarify that we’re using Anaconda distribution, in case you don’t, maybe you’ll need to install the necessary *pips, *in case you haven’t, also we’re going to use the Iris Dataset given by the well-known scikit-learn library. So we’re going to do the corresponding imports. You can follow along in this Jupyter Notebook.

The next thing we’re going to do is read the data set and set some variables for their posterior use.

We also create a dictionary that will contain the values of the corresponding indexes and their meaning according to the Iris type.

Now that we have all the data loaded, we’re ready to start with LDA computing. It consists of five steps:

- - Computing the d-dimensional mean vectors

We’re going to start with a simple computation of the mean vectors of the three different Iris classes. This dataset contains four features: sepal length, sepal width, petal length, and petal width; and has 50 samples of each Iris class.

2.- Computing the Scatter Matrices

For this algorithm to work, we need to compute two scatter matrices: the within-class scatter matrix and the between-class scatter matrix.

A scatter matrix is a statistic — a computed value based on a sample used for statistical purposes — used to estimate the covariance matrix. If you want to know more about the covariance matrix, you can read a bit about it in my previous PCA Algorithm article in section 2: “ Covariance Matrix Computation” (PCA Algorithm Tutorial in Python. Principal Component Analysis (PCA) | by Anthony Barrios | Accel.AI | Medium).

The within-class scatter matrices and the between-class scatter matrix provide all of the essential information regarding the relationship within and between groups, respectively. These are generalizations of sums of squares in the analysis of their variance.

The within-class scatter matrix is computed with the following equation:

Where

C is the number of classes,

Si is the scatter matrix for every class, calculated by

Here,

X is the value vector per instance per class, which means it will include 150 samples.

mi represents the mean vector of class i.

Now, the between-class scattered matrix uses the following formula:

With this equation, we measure the scatter of the whole dataset. Where

C is the number of classes,

Ni is the size of the respective class,

mi is the mean per class, and

m is the overall mean of the dataset.

3.- Solving the generalized eigenvalue problem for the Matrix

If you’re unfamiliar with eigenvalues and eigenvectors, I recommend reading Section 3 of my PCA Article: “Calculate the Eigendecomposition of the Covariance Matrix” (PCA Algorithm Tutorial in Python. Principal Component Analysis (PCA) | by Anthony Barrios | Accel.AI | Medium). It will assist you in better understanding.

A generalized eigenvalue problem consists in determining nonzero vectors x and a number lambda, which potentially can be complex, such that

Ax = λBx

Where x is an eigenvector and λ its eigenvalue; A and B are symmetric matrices.

Solving this problem might help us in different application fields such as chemistry, systems, and control theory, among others. In this case, we will use it to obtain the linear discriminants.

You may recall from linear algebra, that eigenvectors and eigenvalues are giving information about a linear transformation distortion. The eigenvectors tell us the direction of this distortion while the eigenvalues are the scaling factors that describe the magnitude of the distortion.

In this case, the eigenvalues have plenty of value since they tell us how much information is allowed in our new axes that LDA is computing. In order to avoid mistakes with the calculations we made, we can do an extra step, a quick check of the eigendecomposition is correct and satisfy the equation:

Where A is the generalized eigenvalue problem, v is our eigenvector and λ is our eigenvalue.

This simple function searches for similarities in the inserted equation and our values resulting from the previous steps. They don’t have to be the exact values, the function has a delimited error margin that can tell us if the values are ok to work with. In the contrary case, the program will throw an exception, you could try it by changing the *eigv *variable for a random number like 6.

4.- Selecting linear discriminants for the new feature subspace

Since this is a dimensional reduction algorithm, we want to drop the eigenvectors containing the least information. To do this, we need to sort the eigenvectors from highest to lowest corresponding eigenvalues and choose the number of eigenvectors we want to conserve.

Looking at the eigenvalues, we can see that two of them are closer to 0 than the others. It occurs because in LDA the number of linear discriminants should be at most the number of classes minus one, which means that the last eigenvectors should be 0. This doesn’t mean that they don’t have information, but that the information is not relevant to the work. To watch the variance in this dataset let’s run the following code block:

We can see that the first eigenpair is the most informative one and the last two don’t represent much of the data.

Now we’re going to proceed to construct the eigenvector matrix, that we’re calling W. It has a d x k dimension (in this case 4 x 2, since we have 4 features and the 2 most informative eigenpairs).

5.- Transforming the samples onto the new subspace

Finally, we’re going to use W to transform the samples into a 2-dimensional feature subspace. The only thing we need to do is compute the following equation:

Where X is the matrix containing the data of our original data set and Y will be the new k-dimensional subspace that will contain all the samples.

And that’s it. We’ve done with the computing of the LDA Algorithm. To watch how this worked, we’re going to create a function that will make a plot with the new subspaces for the Iris dataset.

When we call the function plot_step_lda, it will show us this plot:

We can see that linear discriminant 1 gives us a better class separation than the second one, this confirms the computation we made in the previous step and the percentage function that told us that the second eigenpair has only a 0.88% of the information.

**In conclusion**

The Linear Discriminant Analysis Algorithm is a great approach for both dimensional reduction and classification in Machine Learning, with different applications such as Face Recognition, Customer Identification, and Medical Fields to classify the patient disease state. Checking all the steps we can notice it shares some similarities with the PCA Algorithm, except for the fact that LDA is better suitable for some other cases of special classification e.g multi-class classification. If you want to continue on this Machine Learning path I recommend reading my previous articles and conducting more research on this interesting and powerful algorithm.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

References:

*Linear Discriminant Analysis in Python | Machine Learning*. (n.d.). Retrieved October 13, 2022, from https://python-course.eu/machine-learning/linear-discriminant-analysis-in-python.php- Epperly, E. (2021, December 14).
*The Elegant Geometry of Generalized Eigenvalue Perturbation Theory — Ethan Epperly*. Retrieved October 13, 2022, from https://www.ethanepperly.com/index.php/2021/12/14/the-elegant-geometry-of-the-generalized-eigenvalue-perturbation-theory/ - GeeksforGeeks. (2022, August 24).
*ML | Linear Discriminant Analysis*. Retrieved October 13, 2022, from https://www.geeksforgeeks.org/ml-linear-discriminant-analysis/ - Ho, G. (2017, December 30).
*Linear Discriminant Analysis for Starters*. ⁂ George Ho. Retrieved October 13, 2022, from https://www.georgeho.org/lda/ *Linear Discriminant Analysis (LDA) in Machine Learning — Javatpoint*. (n.d.). www.javatpoint.com. Retrieved October 13, 2022, from https://www.javatpoint.com/linear-discriminant-analysis-in-machine-learning- Nor, N. M., Hussain, M. A., & Hassan, C. R. C. (2015). Process Monitoring and Fault Detection in Non-Linear Chemical Process Based On Multi-Scale Kernel Fisher Discriminant Analysis.
*12th International Symposium on Process Systems Engineering and 25th European Symposium on Computer Aided Process Engineering*, 1823–1828. https://doi.org/10.1016/b978-0-444-63577-8.50149-2 - Raschka, S. (2014, August 3).
*Linear Discriminant Analysis*. Dr. Sebastian Raschka. Retrieved October 13, 2022, from https://sebastianraschka.com/Articles/2014_python_lda.html - StatQuest with Josh Starmer. (2016, July 10).
*StatQuest: Linear Discriminant Analysis (LDA) clearly explained.*[Video]. YouTube. Retrieved October 13, 2022, from https://www.youtube.com/watch?v=azXCzI57Yfc