Christian Zuniga, PhD
Principal component analysis (PCA) is an unsupervised, linear technique for dimensionality reduction first developed by Pearson in 1901 ,,. It is widely used in many areas of data mining such as visualization, image processing and anomaly detection. It is based on the fact that data may have redundancies in its representation. Data refers to a collection of similar objects and their features. An object could be a house and the features the location, the number of bedrooms, the square footage, and any other characteristic that can be recorded of the house. In PCA analysis, redundancy in the data refers to linear correlation among features. Knowledge of one feature reveals some knowledge of another feature. PCA may use this redundancy to form a smaller set of features, called principal components that can approximate well the data.
Figure 1 shows the general idea. The data is represented as a matrix X with N objects (like houses) and F features (like square footage). PCA linearly transforms the features into a new set and retains the G most relevant features where G < F. The new features are called the principal components. The new data matrix Y is Y = PX, where P is a G by F projection matrix. The first principal component captures most of the variance of the data. Each additional principal component is made to capture the remaining variance and is uncorrelated or orthogonal to other principal components.
The cars dataset from UC Irvine will be used as an example4. This set contains 9 features for 392 cars of various makes and models. Figure 2 shows two sample features, ‘acceleration’ plotted vs. ‘horsepower’. Acceleration is given in the time taken for a car to accelerate from 0 to 60 mph. The figure shows the two features have opposite trends, or are negatively correlated. This is not surprising since higher horsepower should result in smaller times.
Figure 2 shows that most of the variation of the features is concentrated along a line labeled ‘P1’. The remainder of the variation is along a second line labeled ‘P2’. The lines can be characterized by unit vectors vj = [a1j, a2j] (j=1,2) that give the lines’ orientations. The lines’ displacements from the origin do not matter since the data will later be centered at zero. Each point represents a car and can also be represented by a vector xi = [horsepoweri, accelerationi], where the subscript corresponds to the ith car. Each point can be projected onto a line Pj by the inner product of vj and xi as shown in Figure 3.
This new feature p1 is the first principal component and is a linear combination of the original two features, horsepower and acceleration. In general it will not have a more descriptive name but one could be given to clarify the concept. One option is to think of the combination of ‘horsepower’ and ‘acceleration’ as the ‘performance’ of the car.
The question is then how to find the coefficients a11, and a21 of vector v1, which gives the direction of the best-fit line P1. This line should be as close to all points as possible, minimizing the average distance J to all the points.
The solution lies in the covariance matrix of the features SX. Specifically; the eigenvectors of SX give the required vectors v1 and v2. To calculate the covariance matrix, the mean of each feature is subtracted from each row. To put the features on a similar scale, they should also be divided by their standard deviation. This is done to prevent the analysis from capturing uninteresting directions in the data. After this preprocessing of the data matrix X, the covariance matrix is and F by F matrix:
For the car data set, using only the features ‘acceleration’ and ‘horsepower’, the covariance matrix is.
The off-diagonal term -0.69 shows the cross-covariance between horsepower and acceleration, which is negative as implied by Figure 1. The diagonal terms show the auto-covariance of each feature and have value 1 because of the pre-scaling.
Using any linear algebra solver readily gives the eigenvectors of the covariance matrix. The eigenvalues are (1.69, 0.31). Their sum is the total variance, which is 2. Vector v1 is [0.707, -0.707] and captures 84.5% of the total variance (1.69/2). Figure 4 shows the resulting directions of v1 and v2.
The projection matrix P can be made with v1 and v2 as rows. If both vectors are kept, there is no loss in representation. The new representation would have a covariance matrix.
Since P has the eigenvectors of Sx as rows, the right hand side results in a diagonal matrix of the eigenvalues of Sx.
For dimensionality reduction, only v1 would be used. The new representation, matrix Y, would have a single feature, the first principle component (Y = v1X).
PCA can be applied to any number of features. The car data set has the additional features: ‘cylinders’, ‘displacement’, and ‘weight’ for a total of 5 features. Other features are categorical and one ‘mpg’ is usually the target variable of interest. The same process can be done to obtain the principal components. Each principal component is a linear combination of these 5 features. Standard scaling is applied on the features before making the covariance matrix Sx (now a 5 by 5 matrix). The eigenvalues and eigenvectors are found and the eigenvectors used to make the projection matrix P. Figure 5 shows the percentage of variance explained by each component. Again the first component captures over 80% of the total variance. Instead of 5 features, 1–2 principal components may be enough for various purposes.
For example, Figure 6 shows a plot of principal component 2 vs. principal component 1. Together they capture about 96% of the total variance. The figure shows there may be 3 clusters, or groups of cars. These may correspond to different types of cars such as sports cars, sedans, and trucks. Confirming with the make of the car might clarify this. A clustering algorithm like k-means may be applied to quantify the clusters. This shows a common application of PCA in dimensionality reduction where fewer features help with many machine learning algorithms.
To summarize, PCA is a linear, dimensionality reduction technique that forms new features using linear combinations of the original. These new features, the principal components, maximize the total variance capture and are uncorrelated with each other. The eigenvectors of the covariance matrix are used to transform the data matrix. In practice if there are many features, forming the covariance matrix may be computationally expensive and an SVD of the data matrix is used.
 Gilbert Strang “Linear Algebra and Learning from Data” Wellesley Cambridge Press 2019
 Deisenroth, et. Al. “Mathematics for Machine Learning” to be published Cambridge University Press. https://mml-book.com
 Shlens “A Tutorial on Principal Component Analysis” 2014 https://arxiv.org/abs/1404.1100