### Introduction

Principal component analysis (PCA) is a linear algebra technique that creates new design features from the original design features such that feature correlation, or collinearity, is removed. This is useful in regression analysis when collinearity exists and understanding the relationships among variables is important: collinearity can affect both the regression coefficients and their uncertainties, which can obscure those relationships. In clustering analysis, PCA can be used either when collinearity exists or when a large number of features or dimensions exists and dimension reduction is desired.

Formally, PCA is the diagonalization of the covariance matrix. The new design features are the data projected onto the eigenvectors, and the eigenvalue of each eigenvector is the variance explained by that new feature. Below, we will work through a simple example to develop some intuition and then follow that up with an intermediate example in another blog post.
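Before diving in, here is a minimal sketch of that formal statement, on throwaway synthetic data (all variable names here are hypothetical): in the eigenvector basis the covariance matrix becomes diagonal, and the eigenvalues sum to the total variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# two correlated features, one column per observation
x1 = rng.random(500)
x2 = x1 + rng.normal(scale=0.1, size=500)
X = np.vstack([x1, x2])           # shape (2, n): features in rows

C = np.cov(X)                     # 2x2 covariance matrix
evals, evecs = np.linalg.eigh(C)  # eigh: for symmetric matrices, ascending eigenvalues

# diagonalization: E^T C E is (numerically) the diagonal matrix of eigenvalues
assert np.allclose(evecs.T @ C @ evecs, np.diag(evals))

# total variance is preserved: the eigenvalues sum to the trace of C
assert np.isclose(evals.sum(), np.trace(C))
```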

### Simple Example

#### Create Some Data

Let’s set up a simple linear regression example with two ‘independent’ variables or features and one dependent variable or outcome. Here we have quotes around the word independent because we will impose a strong linear correlation between them.

```python
import numpy as np
from scipy.stats import pearsonr

# create dataset...
# number of independent data points or rows...
_rows = 50
_col = 2

# parameters for white noise...
_bias = 0
_scale = 0.1

# create feature variables...
X = np.random.random(size=_rows * _col).reshape(_rows, _col)
X[:, 1] = X[:, 0] + np.random.normal(loc=_bias, scale=_scale, size=_rows)

# create coefficients...
beta = np.random.random(size=X.shape[1])

# transpose X to get the correct design matrix...
X = X.transpose()

# calculate dependent observations...
Y = np.dot(beta, X)

# add white noise to observations...
Y += np.random.normal(loc=_bias, scale=_scale, size=_rows)
```

Let’s take a quick look at our simple dataset…

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# create 3D scatter plot...
ax.scatter(X[0], X[1], Y, color='r', marker='o')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('Y')

# create 2D projections...
ax.plot(X[0], Y, 'b.', zdir='y', zs=1)
ax.plot(X[1], Y, 'b.', zdir='x', zs=1)
ax.plot(X[0], X[1], 'ko', zdir='z', zs=-0.5, alpha=0.25)

ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.0)
ax.set_zlim(-0.5, 2.5)
```

#### Measuring Collinearity and Correlation

The above plot shows y = f(x1, x2) in red, along with the projections of the data onto the 2D planes in blue. The correlation between the two supposedly independent variables is clearly visible in the gray projection. We can calculate the linear correlation coefficient…

```python
from scipy.stats import pearsonr

print(pearsonr(X[0], X[1]))
```

Output:

```
(0.9458890134715866, 4.1675937498103316e-25)
```

The Pearson’s R value is 0.946 with a corresponding p-value of approximately zero, i.e. strong correlation with essentially zero probability of a type-I error.
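Another common way to quantify collinearity, feature by feature, is the variance inflation factor (VIF). Here is a sketch using statsmodels’ `variance_inflation_factor` on freshly generated data of the same shape as above (the threshold of 10 is a rule of thumb, not a law):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.random(50)
x2 = x1 + rng.normal(scale=0.1, size=50)

# design matrix with an intercept column, rows = observations
A = np.column_stack([np.ones(50), x1, x2])

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i
# on the remaining features; values above ~10 commonly flag collinearity
vifs = [variance_inflation_factor(A, i) for i in (1, 2)]
print(vifs)
```

With the strong linear dependence between the two features, both VIFs come out well above 1.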

#### Manual PCA

In this section we implement PCA by hand to get a better feel for what’s going on; using a library like scikit-learn makes the process a bit more user-friendly.

Let’s use PCA to see if we can reduce the dimensionality of this problem from two features to one. We will diagonalize the covariance matrix of X and inspect both the eigenvalues and the eigenvectors. Notice that we call np.linalg.eigh() and not np.linalg.eig(): first, because we can, since a covariance matrix is always symmetric; and second, because eigh() returns the eigenvalues, and their eigenvectors, in ascending order.

```python
cov_X = np.cov(X)
eigenvalues, eigenvectors = np.linalg.eigh(cov_X)

print('eigenvalues: ', eigenvalues)
print('eigenvectors: ', eigenvectors)
```

Output:

```
eigenvalues:  [0.00481936 0.17493158]
eigenvectors:  [[-0.74127979  0.67119615]
 [ 0.67119615  0.74127979]]
```

The above eigenvalues, 0.0048 and 0.1749, differ by well over an order of magnitude, which is expected for this example. What does this mean? The larger eigenvalue, 0.1749, is the variance explained by the second eigenvector, which therefore accounts for most of the total variance. The first eigenvector has an eigenvalue of 0.0048 and explains only a trivial fraction of the total variance. Thus, we should be able to disregard the first eigenvector and still predict the dependent variable with minimal additional error.
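The “fraction of variance explained” claim can be checked directly: each eigenvalue divided by the sum of all eigenvalues gives that component’s share of the total variance. Using the eigenvalues printed above:

```python
import numpy as np

# eigenvalues from the eigh() output above
eigenvalues = np.array([0.00481936, 0.17493158])

ratio = eigenvalues / eigenvalues.sum()
print(ratio)  # -> [0.02681 0.97319] (rounded): ~97% of the variance is in the second component
```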

What do the new features, let’s call them *Z*, look like? With *E* the matrix of eigenvectors from the diagonalization of the covariance matrix,

*Z = E · X*

Or, component by component,

*Z1 = E11 · X1 + E12 · X2*

*Z2 = E21 · X1 + E22 · X2*

Let’s calculate them explicitly and re-plot the data with the new independent variables,

```python
# transform into the new independent variables...
Z = np.dot(eigenvectors, X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(Z[0], Z[1], Y, color='r', marker='o')

ax.plot(Z[0], Y, 'b.', zdir='y', zs=2.5)
ax.plot(Z[1], Y, 'b.', zdir='x', zs=-0.5)
ax.plot(Z[0], Z[1], 'ko', zdir='z', zs=-0.5, alpha=0.25)

ax.set_xlabel('z1')
ax.set_ylabel('z2')
ax.set_zlabel('y')

ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-0.5, 2.5)
ax.set_zlim(-0.5, 2.5)
```

Notice that the major fraction of the total variance is along the *Z2* axis, and that *Z1* is simply white noise. We have successfully transformed the design features using PCA and we can now explain the observational data without using the *Z1* data.

#### Check New Vectors for Collinearity

We can also recalculate the linear correlation between *Z1* and *Z2* to double check that the collinearity has been removed.

```python
print(pearsonr(Z[0], Z[1]))
```

Output:

```
(-7.0718568840985767e-17, 1.0)
```

And the Pearson’s R correlation coefficient is zero to within floating-point precision, again confirming that PCA has removed the linear correlation between the independent variables.
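As mentioned earlier, scikit-learn wraps all of these steps. A sketch of the equivalent decomposition with `sklearn.decomposition.PCA`, on synthetic data of the same shape; note that scikit-learn expects observations in rows (the transpose of this post’s X) and centers the data for you:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.random(50)
x2 = x1 + rng.normal(scale=0.1, size=50)
X = np.column_stack([x1, x2])   # rows = observations

pca = PCA(n_components=2)
Z = pca.fit_transform(X)        # scores: centered data projected onto the eigenvectors

# components come sorted by explained variance, largest first,
# and the transformed features are uncorrelated by construction
print(pca.explained_variance_ratio_)
print(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])
```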

#### Finishing Up

After removing features that do not contribute significantly to the variance, we next want to perform linear regression and develop a predictive model. Let’s fit *Z2* and *Y* using ordinary least squares and then evaluate our predictions of *Y*. In general, we should split the data in order to cross-validate; in this simple one-parameter example, however, there is little risk of overfitting. In the output below, we can see that the R-squared value is 0.993, which is to be expected for such a simple example.

```python
# OLS regression...
import statsmodels.api as sm

print(sm.OLS(Y, Z[1]).fit().summary())
```

Output:

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.993 |
| Model: | OLS | Adj. R-squared: | 0.993 |
| Method: | Least Squares | F-statistic: | 6753. |
| Date: | Fri, 11 Aug 2017 | Prob (F-statistic): | 3.69e-54 |
| Time: | 13:54:32 | Log-Likelihood: | 48.327 |
| No. Observations: | 50 | AIC: | -94.65 |
| Df Residuals: | 49 | BIC: | -92.74 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| x1 | 1.2777 | 0.016 | 82.176 | 0.000 | 1.246 1.309 |