A Developer's Guide to Fisher's Linear Discriminant: Intuition, Math, and Python

Fisher's Linear Discriminant (FLD) is a classical tool for dimensionality reduction and classification that every data scientist and machine learning practitioner should have in their toolkit. As a full-stack developer, you may encounter FLD as a key component in a variety of ML pipelines, from computer vision to text analysis to bioinformatics.

In this guide, we'll dive deep into the intuition behind FLD, the mathematics of how it works, and how to implement it efficiently in Python. Along the way, we'll see how FLD relates to other key concepts in ML and how to use it effectively in real-world applications.

The Intuition: Maximizing Class Separation

At its core, FLD is a technique for finding a linear projection of high-dimensional data that maximizes the separation between classes. To build some intuition, let's consider a toy binary classification problem:


Figure 1: A binary classification problem in 2D. The goal of FLD is to find a 1D projection (shown by the dashed line) that maximizes the separation between the class means while minimizing the variance within each class.

In this case, we have two classes (red and blue) in a 2D feature space. Our goal is to find a 1D projection of the data (i.e. a line) where the projected classes are maximally separated. Intuitively, this means we want to find a projection where:

  1. The distance between the projected class means (μ1 and μ2) is large.
  2. The variance of the projected data within each class (σ1^2 and σ2^2) is small.

FLD provides a precise mathematical formulation of this intuition, which allows us to find the optimal projection for any number of classes and features.
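
To make this concrete, here is a tiny sketch that measures exactly these two quantities for one candidate projection. The 2D Gaussian classes and the direction w are arbitrary choices for illustration, not part of any standard dataset:

import numpy as np

rng = np.random.default_rng(0)

# Two toy 2D classes (means and spread chosen purely for illustration)
X_red = rng.normal(loc=[0.0, 0.0], scale=0.6, size=(200, 2))
X_blue = rng.normal(loc=[2.5, 1.0], scale=0.6, size=(200, 2))

# A candidate projection direction (unit vector)
w = np.array([1.0, 0.5])
w /= np.linalg.norm(w)

# Project each class onto w and measure the two quantities from the list above
z_red, z_blue = X_red @ w, X_blue @ w
mean_gap = abs(z_red.mean() - z_blue.mean())   # want this large
within_var = z_red.var() + z_blue.var()        # want this small
print(f"mean gap: {mean_gap:.2f}, within-class variance: {within_var:.2f}")

FLD turns this trial-and-error search over directions w into an optimization problem with a closed-form solution.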

The Math: Scatter Matrices and Eigenvalue Problems

To formalize the intuition above, FLD defines two scatter matrices:

  • The between-class scatter matrix S_B measures the variance of the class means:

$$S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T$$

where C is the number of classes, N_c is the number of points in class c, μ_c is the mean of class c, and μ is the overall mean of the data.

  • The within-class scatter matrix S_W measures the variance of the data within each class:

$$S_W = \sum_{c=1}^{C} \sum_{i=1}^{N_c} (x_i^c - \mu_c)(x_i^c - \mu_c)^T$$

where x_i^c is the i-th point in class c.

The goal of FLD is to find a projection vector w that maximizes the ratio of the between-class scatter to the within-class scatter:

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

Intuitively, this ratio is large when the class means are far apart (large numerator) and the classes are tightly clustered (small denominator).

It can be shown that the optimal w is the solution to the generalized eigenvalue problem:

$$S_B w = \lambda S_W w$$

where λ is the eigenvalue corresponding to the eigenvector w. In practice, we typically compute the top C-1 eigenvectors to obtain a projection matrix W that maps the data to a (C-1)-dimensional subspace.
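
In the special two-class case the eigenvalue problem collapses to a closed form: S_B w always points along μ_1 - μ_2, so the optimal direction is w ∝ S_W^{-1}(μ_1 - μ_2). Here is a minimal sketch, reusing the toy classes from the earlier snippet (for two classes, the rank-one matrix (μ_1 - μ_2)(μ_1 - μ_2)^T differs from the S_B defined above only by a scalar factor, so the optimal w is unchanged):

# Class means and pooled within-class scatter for the two toy classes above
mu_red, mu_blue = X_red.mean(axis=0), X_blue.mean(axis=0)
S_W = (X_red - mu_red).T @ (X_red - mu_red) + (X_blue - mu_blue).T @ (X_blue - mu_blue)

# Closed-form Fisher direction for two classes: w ∝ S_W^{-1} (mu_1 - mu_2)
w_opt = np.linalg.solve(S_W, mu_red - mu_blue)
w_opt /= np.linalg.norm(w_opt)

# Evaluate the Fisher criterion J(w) for this direction
S_B = np.outer(mu_red - mu_blue, mu_red - mu_blue)
J = (w_opt @ S_B @ w_opt) / (w_opt @ S_W @ w_opt)
print(f"optimal direction: {w_opt}, J(w): {J:.3f}")

For more than two classes there is no such shortcut, and we solve the generalized eigenvalue problem directly, as in the implementation below.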

Interestingly, FLD is closely related to Linear Discriminant Analysis (LDA), and the two names are often used interchangeably. The main distinction is that LDA, viewed as a generative classifier, additionally assumes Gaussian classes with equal covariance matrices, whereas Fisher's criterion makes no distributional assumptions; when the equal-covariance assumption holds, both yield the same projection direction. For a detailed discussion of the relationships between FLD, LDA, and other methods, see [1].

Implementation in Python

Implementing FLD in Python is straightforward using the numpy and scipy libraries. Here's a minimal example:

import numpy as np
from scipy.linalg import eigh

def fld(X, y, reg=0.0):
    """Compute the FLD projection matrix for data X (n x d) and labels y.

    reg adds a small ridge term to S_W, which keeps the generalized
    eigenproblem well-posed when S_W is singular or ill-conditioned.
    """
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        # Between-class scatter: class-size-weighted outer products of mean offsets
        S_B += X_c.shape[0] * np.outer(mean_c - overall_mean, mean_c - overall_mean)
        # Within-class scatter: scatter of each class around its own mean
        S_W += (X_c - mean_c).T @ (X_c - mean_c)

    # Solve the generalized eigenvalue problem S_B w = lambda S_W w
    evals, evecs = eigh(S_B, S_W + reg * np.eye(d))

    # Sort eigenvectors by decreasing eigenvalue and keep the top C - 1,
    # the maximum number of informative discriminant directions
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:len(classes) - 1]]

This implementation has complexity O(d^2 * n + d^3), where d is the number of features and n is the number of samples. For high-dimensional data (large d), the d^3 term from the eigendecomposition can become a bottleneck. In these cases, randomized approximation methods can be used to compute only the top eigenvectors efficiently [2].
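
Short of full randomized methods, one easy saving is to ask LAPACK for only the top C - 1 generalized eigenpairs instead of the full spectrum. With SciPy >= 1.5 this is a drop-in change to the eigh call inside fld; it avoids computing eigenvectors we never use, although the cubic cost of factorizing S_W remains (a sketch, not a change in the resulting projection):

# Inside fld, replace the full decomposition with a partial one:
d = S_B.shape[0]
n_components = len(np.unique(y)) - 1  # only C - 1 directions are informative

# subset_by_index asks for the eigenpairs with the largest indices,
# i.e. the largest eigenvalues, since eigh returns them in ascending order
evals, evecs = eigh(S_B, S_W, subset_by_index=[d - n_components, d - 1])
W = evecs[:, ::-1]  # reorder so the largest eigenvalue comes first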

It's also worth noting that this implementation assumes that S_W is invertible, which may not be the case if the number of features is larger than the number of samples (or if some features are constant). In practice, regularization techniques, such as the small ridge term exposed through the reg argument above, are often used to ensure stability.
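
If you'd rather not roll your own regularization, scikit-learn's LinearDiscriminantAnalysis exposes a shrinkage estimator of the within-class covariance through its 'lsqr' and 'eigen' solvers. A brief sketch, where the n_components value and the X_train/y_train names are assumptions for illustration:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# 'auto' picks the Ledoit-Wolf shrinkage intensity; only the
# 'lsqr' and 'eigen' solvers support shrinkage
lda = LDA(solver='eigen', shrinkage='auto', n_components=2)
X_proj = lda.fit_transform(X_train, y_train)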

Using FLD for Visualization

One common use case for FLD is visualization of high-dimensional data. By projecting to a 2D FLD subspace, we can visually inspect the class structure of the data and identify any interesting patterns or outliers.

As an example, let's use FLD to visualize the famous MNIST handwritten digits dataset:

from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt

# Load MNIST as plain numpy arrays
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # Scale pixel values to [0, 1]

# Compute the FLD projection matrix and keep the top 2 components
# (the ridge term keeps S_W invertible despite constant border pixels)
W = fld(X, y, reg=1e-3)[:, :2]

# Project the data to 2D
X_proj = X @ W

# Plot the projected data, one color per digit class
fig, ax = plt.subplots(figsize=(8, 8))
for digit in range(10):
    mask = (y == str(digit))
    ax.scatter(X_proj[mask, 0], X_proj[mask, 1],
               label=str(digit), s=5, alpha=0.5)
ax.legend(markerscale=4, ncol=2)
ax.set_xlabel('FLD component 1')
ax.set_ylabel('FLD component 2')
fig.tight_layout()
plt.show()


Figure 2: MNIST digits projected onto the top 2 FLD components. Note how FLD separates the classes quite effectively, even though the original data has 784 dimensions!

This visualization suggests that the digit classes are relatively well-separated in the FLD subspace, although there is still some overlap between certain digits (e.g. 4 and 9). Interestingly, the FLD components seem to correspond to meaningful visual features of the digits, such as the presence of loops or straight lines.
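
One way to sanity-check that reading (assuming the projection matrix W from the snippet above) is to reshape each FLD component back into a 28 x 28 image and inspect it as a map of pixel weights:

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
for i, ax in enumerate(axes):
    # Each column of W assigns a weight to each of the 784 pixels
    ax.imshow(W[:, i].reshape(28, 28), cmap='coolwarm')
    ax.set_title(f'FLD component {i + 1}')
    ax.axis('off')
fig.tight_layout()
plt.show()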

This kind of exploratory data analysis can be a valuable first step in building classification models or identifying potential issues with the data (e.g. mislabeled examples). However, it's important to keep in mind that FLD is a linear method, so it may not capture all of the relevant structure in complex datasets.

Using FLD in a Machine Learning Pipeline

As a full-stack developer working on an end-to-end machine learning system, you may want to use FLD as a preprocessing step to reduce the dimensionality of your feature space before training a classifier or other model. This can be especially useful when working with high-dimensional data like images, text, or genomic sequences.

Here's a sketch of what an ML pipeline using FLD might look like:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load and preprocess data
X, y = load_my_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)

# Dimensionality reduction with PCA and FLD
pca = PCA(n_components=100).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

lda = LDA(n_components=9).fit(X_train_pca, y_train)
X_train_lda = lda.transform(X_train_pca)  
X_test_lda = lda.transform(X_test_pca)

# Train and evaluate classifier
clf = RandomForestClassifier().fit(X_train_lda, y_train)
y_pred = clf.predict(X_test_lda)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")

In this example, we first standardize the data to zero mean and unit variance, which is important for PCA and FLD. We then use PCA to reduce the dimensionality to 100 components, followed by FLD to further reduce to 9 components (assuming a 10-class problem). Finally, we train a random forest classifier on the FLD features and evaluate its performance on the test set.

Of course, the specific steps and parameters in your pipeline will depend on the nature of your data and the requirements of your application. In practice, it's a good idea to experiment with different dimensionality reduction and classification methods to see what works best. You may also want to use techniques like cross-validation and grid search to tune your hyperparameters.
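
As a sketch of that last point, the whole chain can be wrapped in a single scikit-learn Pipeline and tuned with cross-validated grid search. The parameter grid, fold count, and estimators below are illustrative choices (continuing from the snippet above), not recommendations:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=100)),
    ("lda", LDA(n_components=9)),
    ("clf", RandomForestClassifier()),
])

# Illustrative grid: tune the PCA width and forest size together
param_grid = {
    "pca__n_components": [50, 100, 200],
    "clf__n_estimators": [100, 300],
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")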

Advanced Topics and Extensions

There are many ways to extend and generalize the basic FLD method. Some important variations include:

  • Kernel FLD: Instead of finding a linear projection, kernel FLD finds a nonlinear transformation of the data to a high-dimensional feature space, where classes are more separable. This can be done efficiently using the "kernel trick" [3]; see the sketch after this list.

  • Incremental FLD: In some applications, data may arrive in a streaming fashion, making it infeasible to compute the full scatter matrices. Incremental FLD algorithms can update the projection matrix on-the-fly as new data points are observed [4].

  • Heteroscedastic FLD: The standard FLD method assumes that all classes have equal covariance matrices. Heteroscedastic FLD relaxes this assumption and can learn a more flexible decision boundary [5].

  • Sparse FLD: In high-dimensional problems with many irrelevant features, it can be beneficial to learn a sparse projection matrix that selects only a subset of informative features. This can be achieved by adding L1 regularization to the FLD objective [6].
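
To make the kernel idea a bit more tangible, below is a compact two-class sketch in the spirit of [3]: everything is expressed through the n x n kernel matrix, with a small ridge term for numerical stability. The RBF kernel and the gamma and reg values are our own assumptions for illustration, not choices prescribed by the paper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_fld_binary(X, y, gamma=1.0, reg=1e-3):
    """Two-class kernel FLD sketch; returns the dual coefficients alpha."""
    classes = np.unique(y)
    assert len(classes) == 2, "this sketch covers the two-class case only"

    K = rbf_kernel(X, X, gamma=gamma)   # n x n Gram matrix
    n = K.shape[0]

    N = reg * np.eye(n)                 # ridge term keeps N well-conditioned
    kernel_means = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        K_c = K[:, idx]                 # n x n_c block of kernel columns
        n_c = len(idx)
        kernel_means.append(K_c.mean(axis=1))
        # Within-class scatter expressed in the kernel-induced feature space
        N += K_c @ (np.eye(n_c) - np.full((n_c, n_c), 1.0 / n_c)) @ K_c.T

    # Two-class closed form: alpha is proportional to N^{-1} (m_1 - m_2)
    alpha = np.linalg.solve(N, kernel_means[0] - kernel_means[1])
    return alpha

# Projections of new points: scores = rbf_kernel(X_new, X, gamma=gamma) @ alpha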

Exploring these extensions is a great way to deepen your understanding of FLD and expand your ML toolbox.

Conclusion and Resources

In this guide, we've taken a deep dive into Fisher's Linear Discriminant, from the underlying intuition and mathematics to practical implementation and application details. We've seen how FLD can be used for both dimensionality reduction and classification, and how it fits into the broader ecosystem of machine learning methods.

Whether you're a data scientist, ML researcher, or software engineer, understanding FLD and related techniques is a valuable asset. With the foundations we've covered here, you're well-equipped to apply FLD to your own problems and continue learning about this fascinating area of machine learning.

To learn more, check out these resources:

  • Duda, Hart, and Stork's "Pattern Classification" [1] is a classic textbook that covers FLD, LDA, and many other topics in depth.
  • The scikit-learn documentation [7] provides a good overview of FLD and its relationship to other dimensionality reduction methods.
  • For a more advanced treatment of FLD and kernel methods, see Mika et al.'s "Fisher Discriminant Analysis with Kernels" [3].
  • To dive deeper into the mathematics of FLD and other projection methods, check out Fukunaga's "Introduction to Statistical Pattern Recognition" [8].

I hope you've enjoyed this guide and found it informative. Happy learning!

References

[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2nd edition, 2000.

[2] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[3] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 41–48. IEEE, 1999.

[4] H. Zhao and P. C. Yuen. Incremental linear discriminant analysis for face recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(1):210–221, 2008.

[5] Z. Sun and T. Tan. Heteroscedastic linear discriminant analysis. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2015.

[6] L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

[7] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[8] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
