## Introduction to Discriminant Analysis

Discriminant Analysis (DA) is used to predict the probability that an observation belongs to a given class based on one or more predictor variables. The method works with both continuous and categorical predictor variables, but its outcome variable must be categorical.

### Compared to PCA

Discriminant Analysis is often used for dimensionality reduction in pattern recognition or classification in machine learning. This might sound similar to Principal Component Analysis (PCA), as both try to find linear combinations of variables that explain the data. However, note that DA is *supervised learning*, whereas PCA is *unsupervised*. Recall that in supervised learning the training dataset is labelled, while in unsupervised learning it is not.

### Compared to Logistic Regression

Although similar to logistic regression, LDA offers some advantages.

- LDA works better than logistic regression when the classes are well separated, a situation in which logistic regression estimates become very unstable.
- LDA is more stable than logistic regression when the sample size \(n\) is small.
- LDA also handles more than two response classes.
- LDA provides low-dimensional views of the data.
- On the other hand, LDA requires that the distribution of the predictors \(X\) is approximately normal within each of the classes.

There are multiple types of Discriminant Analyses as outlined below. The most common type of DA is Linear Discriminant Analysis (LDA), which is DA in its simplest form. We'll begin by going through a simple, two-dimensional example in which LDA is applied.

### How LDA works

Let's say you're given the task to figure out whether a person is overweight or skinny. You decide to look at one variable (height) and see if it's good enough to discriminate between overweight and skinny people.

Height isn't the best discriminatory variable by itself. Now let's introduce another variable, weight, into the picture. By plotting height vs. weight, now we're able to see a much clearer separation.

Note that the one-dimensional graph of height is simply the same data points projected onto the x-axis.

Now how do we figure out the best discriminatory line which separates between the two classes? That's where LDA comes in.

LDA calculates the mean and standard deviation of each subgroup, then finds the line that maximizes the distance between the subgroup means while minimizing the variation within each subgroup.

With two dimensions this produces a separating line; with three dimensions, a plane; and in higher dimensions, which we can't visualize as humans, a hyperplane.
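To make this concrete, here is a minimal sketch in R using simulated height and weight data for two groups (all numbers are made up for illustration); `lda()` comes from the MASS package, which ships with base R.

```
library(MASS)

set.seed(42)
# Simulate two groups that differ mainly in weight (illustrative values only)
n <- 100
skinny     <- data.frame(height = rnorm(n, 170, 8), weight = rnorm(n, 60, 6), class = 'skinny')
overweight <- data.frame(height = rnorm(n, 172, 8), weight = rnorm(n, 95, 8), class = 'overweight')
people <- rbind(skinny, overweight)

# LDA finds the linear combination of height and weight that best separates
# the two group means relative to the within-group variation
fit <- lda(class ~ height + weight, data = people)
fit$scaling  # coefficients of the discriminant axis (LD1)
```

Since the simulated groups overlap heavily in height but not in weight, the `weight` coefficient dominates the discriminant, mirroring the intuition above.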

Here is a list of common discriminant analyses:

- **Linear Discriminant Analysis** (LDA), also called **Fisher's Discriminant**, uses linear combinations of predictors to predict the class of a given observation. It assumes that the predictor variables are normally distributed and that the classes have identical variances.
- **Multiple Discriminant Analysis** (MDA) compresses a multivariate signal to produce a low-dimensional signal. It doesn't directly perform classification, but it helps lower the number of variables in a dataset with a large number of predictors.
- **Quadratic Discriminant Analysis** (QDA) does not assume that the groups have equal covariance matrices. However, the assumption that the variables are normally distributed still holds.
- **Mixture Discriminant Analysis** (MDA) assumes that each class is a Gaussian mixture of subclasses.
- **Flexible Discriminant Analysis** (FDA) uses nonlinear combinations of predictors, such as splines.
- **Regularized Discriminant Analysis** (RDA) uses regularization (or shrinkage) to improve the estimate of the covariance matrices. It is most useful when you have a large number of predictors and not enough training samples.

## Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a dimensionality reduction technique that reduces the number of dimensions while retaining as much class-discriminatory information as possible. It does this by finding the new axis that maximizes the distance between classes while minimizing the variance within classes. With this projection, LDA can identify which variables are most important in distinguishing among classes and can classify new samples.

### Key Assumptions

Before using LDA, you must make sure that your data meets the following criteria.

- **Multivariate Normality**: The predictor data follow a normal distribution within each class. This can be tested with R's MVN package. (The original Fisher's linear discriminant analysis did not make this assumption.)
- **Homoscedasticity**: Variances among group variables are the same across levels of predictors. This can be tested with Box's M statistic. If this assumption does not hold, quadratic discriminant analysis can be used instead.
- **Independence**: Participants are assumed to be randomly sampled, and a participant's score on one variable is assumed to be independent of all other participants' scores on that variable.
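As a rough sketch, the first two assumptions can be checked in R. This assumes the MVN package and the biotools package (whose `boxM()` function runs Box's M test) are installed:

```
library(MVN)       # Mardia's multivariate normality test
library(biotools)  # boxM(): Box's M test for equal covariance matrices

# Normality should hold within each class, so test per species
mvn(iris[iris$Species == 'setosa', 1:4])

# Homoscedasticity: compare covariance matrices across all three species
boxM(iris[, 1:4], iris$Species)
# A small p-value suggests unequal covariances, in which case QDA
# may be more appropriate than LDA
```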

### Preparing your data for LDA

- Clearly define what your output variable is.
- Ensure that each variable follows a Gaussian (normal, bell-shaped) distribution.
- If a variable does not follow a normal distribution, apply a univariate transformation such as a log transform.
- Remove any outliers.
- Standardize your variable values so they have a mean of 0 and a standard deviation of 1. This will allow you to see which variables were "most important" in your model.
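The standardization step might look like this in R, using the built-in `scale()` function on the iris measurements as a stand-in dataset (the commented-out log transform is shown only for illustration):

```
# Standardize: mean 0, sd 1 for each numeric column
iris.scaled <- iris
iris.scaled[, 1:4] <- scale(iris[, 1:4])

# If a variable were right-skewed, a log transform could help, e.g.:
# iris.scaled$Petal.Width <- log(iris$Petal.Width)

round(colMeans(iris.scaled[, 1:4]), 10)  # all ~0
apply(iris.scaled[, 1:4], 2, sd)         # all 1
```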

## Linear Discriminant Analysis in R with the Iris Dataset

Let's try running LDA in R with the \(\text{iris}\) dataset. This dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

We'll also be using a few R libraries: tidyverse, GGally, and MVN, so make sure those are installed and ready to go.

```
install.packages('tidyverse') # For filtering dataset
install.packages('GGally') # For visualizing data
install.packages('MVN') # For multivariate normality tests
library(tidyverse)
library(GGally)
library(MVN)
```

### 1) Define what the output variable is

In this case, we'll be attempting to classify the type of iris plant.

```
levels(iris$Species)
[1] "setosa" "versicolor" "virginica"
```

### 2) Ensure Multivariate Normality

In order to make sure all our variables have normal distributions, we could run a normality test, such as Shapiro-Wilk, on each variable in each class. However, R's MVN package takes care of this for us.

First, let's split our iris dataset by Species.

```
setosa <- iris %>% filter(Species == 'setosa')
versicolor <- iris %>% filter(Species == 'versicolor')
virginica <- iris %>% filter(Species == 'virginica')
```

For the sake of brevity, we're only going to be showing the setosa subset, but you should test this out for all three classes.

It's always a good idea to visualize our data before running any analyses, so let's plot some histograms. You can easily plot a histogram matrix using GGally's ggpairs function:

```
ggpairs(data = iris[, 1:4], title = "Iris Dataset Overview",
        mapping = ggplot2::aes(colour = iris$Species),
        lower = list(combo = wrap("facethist", binwidth = 1)))
```

You can see that all the variables' histograms look fairly normally distributed except for Petal.Width's. Let's see if the MVN package picks up on that.

#### Multivariate Normality using Mardia's MVN test

Now, to call mvn on our dataset, simply pass it through as a parameter.

```
> mvn(setosa[,1:4])
$multivariateNormality
             Test        Statistic           p value Result
1 Mardia Skewness 25.6643445196298 0.177185884467652    YES
2 Mardia Kurtosis 1.29499223711605 0.195322907441935    YES
3             MVN             <NA>              <NA>    YES

$univariateNormality
          Test     Variable Statistic p value Normality
1 Shapiro-Wilk Sepal.Length    0.9777  0.4595    YES
2 Shapiro-Wilk  Sepal.Width    0.9717  0.2715    YES
3 Shapiro-Wilk Petal.Length    0.9550  0.0548    YES
4 Shapiro-Wilk  Petal.Width    0.7998  <0.001    NO

$Descriptives
              n  Mean   Std.Dev Median Min Max 25th  75th       Skew   Kurtosis
Sepal.Length 50 5.006 0.3524897    5.0 4.3 5.8  4.8 5.200 0.11297784 -0.4508724
Sepal.Width  50 3.428 0.3790644    3.4 2.3 4.4  3.2 3.675 0.03872946  0.5959507
Petal.Length 50 1.462 0.1736640    1.5 1.0 1.9  1.4 1.575 0.10009538  0.6539303
Petal.Width  50 0.246 0.1053856    0.2 0.1 0.6  0.2 0.300 1.17963278  1.2587179
```

The default multivariate test is Mardia's MVN test, which calculates Mardia's multivariate skewness and kurtosis coefficients along with their statistical significance. We see that the data are multivariate normal, and that all but one variable pass the univariate normality test. Since Petal.Width doesn't seem to be univariate normal, we can exclude it from our LDA model.

#### Multivariate Normality using Q-Q Plot

Another way we can test for multivariate normality is by looking at the Q-Q plot. Here, "Q" stands for quantile, and the plot is a graphical way to evaluate the agreement between two probability distributions (your data's and the one you're assuming, in our case the normal distribution). The two axes of the plot are theoretical quantiles and observed quantiles. If the observed data points fall along the y = x line, then we can assume multivariate normality.
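With the MVN package, this Q-Q plot can be requested directly through the `multivariatePlot` argument (this assumes `setosa` was created by the filtering step above):

```
library(MVN)
# Draws a chi-square Q-Q plot of squared Mahalanobis distances alongside
# the test results; points close to the y = x line support multivariate
# normality
mvn(setosa[, 1:4], multivariatePlot = "qq")
```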

### 3) Remove any outliers

We can detect outliers by setting showOutliers = TRUE and looking under the $multivariateOutliers section. We see below that observations 19, 23, 34, and 38 of the versicolor subset are outliers. Upon repeating this with virginica, we see that 19, 32, 23, and 18 are outliers of that subset. (Subset row numbers map back to iris row numbers by adding 50 for versicolor and 100 for virginica.)

```
> mvn(versicolor[,1:4], multivariateOutlierMethod='adj', showOutliers=T)
...
$multivariateOutliers
   Observation Mahalanobis Distance Outlier
19          19               31.902    TRUE
23          23               15.761    TRUE
34          34               13.673    TRUE
38          38               13.388    TRUE
```

### 4) Standardize data points

In order to see which variables were most informative in our LDA analysis, we have to scale our variables so that each has a mean of 0 and a standard deviation of 1.

### 5) Implement the model

```
# lda() and ldahist() come from the MASS package
> library(MASS)
# Taking outliers out
> data.cleaned = iris[-c(132, 119, 69, 73, 84, 88),]
# Scale each predictor to mean 0, sd 1
> data.cleaned[,1:4] = apply(data.cleaned[,1:4], 2, scale)
# Take out Petal.Width - train the model on the other variables
> fit.lda.cleaned = lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length,
                        data = data.cleaned)
> fit.lda.cleaned.values = predict(fit.lda.cleaned)
> ldahist(fit.lda.cleaned.values$x[,1], g = data.cleaned$Species)
```
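Before interpreting the coefficients, it's worth a quick sanity check of how well the fitted model classifies the training data; `predict()` on an `lda` fit stores the predicted classes in `$class`:

```
# Confusion matrix: predicted vs. actual species on the training data
table(Predicted = fit.lda.cleaned.values$class,
      Actual = data.cleaned$Species)
```

Keep in mind that training-set accuracy is optimistic; MASS's lda() also supports leave-one-out cross-validation via CV = TRUE for a fairer estimate.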

Now, to see which variable was most informative, we can take a look at the fitted model.

```
> fit.lda.cleaned
...
Coefficients of linear discriminants:
LD1 LD2
Sepal.Length -0.8913939 -0.5961525
Sepal.Width -0.5431295 1.3542638
Petal.Length 5.9632281 1.2448066
...
```

We can see from the coefficients of linear discriminants that, under LD1, the variable with the largest absolute coefficient is Petal.Length. Is this variable actually the most informative? As a quick check, we can plot boxplots of each predictor against the species.

```
library(gridExtra)
sl <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  ggtitle('Sepal Length vs. Species') +
  geom_boxplot()
sw <- ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  ggtitle('Sepal Width vs. Species') +
  geom_boxplot()
pl <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  ggtitle('Petal Length vs. Species') +
  geom_boxplot()
grid.arrange(sl, sw, pl, nrow = 1)
```

We can see from the three boxplots that Petal.Length is indeed the most discriminatory variable: its distributions overlap the least across species.

### LDA Plot

To sum this up, let's plot LD1 and LD2 to visualize how well the model is able to distinguish among groups.

```
> newdata <- data.frame(type = data.cleaned[,5], lda = fit.lda.cleaned.values$x)
> ggplot(newdata) +
    geom_point(aes(lda.LD1, lda.LD2, colour = type), size = 2.5) +
    ggtitle('Distinguishing Species within the iris Dataset with LDA')
```