Introduction

Principal component analysis (PCA) is a linear algebra technique that creates new design features from the original ones such that feature correlation, or collinearity, is removed. This is useful in regression analysis when collinearity exists and understanding the relationships among variables is important. Collinearity can affect both the regression coefficients and the uncertainty of those coefficients, and this can obscure relationships. In clustering analysis, PCA can be used either when collinearity exists or when there are many features or dimensions and dimension reduction is desired.

Formally, PCA is the diagonalization of the covariance matrix. The new design features are the eigenvectors, and the eigenvalue for each eigenvector is the variance explained by this new feature.  Below, we will work through a simple example to develop some intuition and then follow that up with an intermediate example in another blog post.
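More concretely, diagonalizing the covariance matrix means finding an orthogonal matrix E, whose rows are the eigenvectors, such that

E · cov(X) · Eᵀ = Λ

where Λ is the diagonal matrix of eigenvalues; each eigenvalue is the variance of the data along its eigenvector. This is the same convention used for the transformation Z = E · X later in the post.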

Simple Example

Create Some Data

Let’s set up a simple linear regression example with two ‘independent’ variables, or features, and one dependent variable, or outcome. The word independent is in quotes because we will impose a strong linear correlation between the two features.

import numpy as np
from scipy.stats import pearsonr

# create dataset...

# number of independent data points or rows...
_rows = 50
_col = 2
# parameters for white noise...
_bias = 0
_scale = 0.1

# create feature variables
X = np.random.random(size=_rows*_col).reshape(_rows, _col)
X[:,1] = X[:,0] + np.random.normal(loc=_bias, scale=_scale, size=_rows)

# create coefficients...
beta = np.random.random(size=X.shape[1])

# transpose X to get the correct design matrix
X = X.transpose()

# calculate dependent observations...
Y = np.dot(beta, X)
# add white noise to observations...
Y += np.random.normal(size=_rows, loc=_bias, scale=_scale)

Let’s take a quick look at our simple dataset…

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# create 3D scatter plot
ax.scatter(X[0], X[1], Y, color='r', marker='o')

ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('Y')

# create 2D projections
ax.plot(X[0], Y, 'b.', zdir='y', zs=1)
ax.plot(X[1], Y, 'b.', zdir='x', zs=1)
ax.plot(X[0], X[1], 'ko', zdir='z', zs=-0.5, alpha=0.25) 

ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.0) 
ax.set_zlim(-0.5, 2.5)

[Figure 3d_pca_image_1: 3D scatter of (x1, x2, Y) with the 2D projections onto each plane]

Measuring Collinearity and Correlation

The above plot shows y = f(x1, x2) in red, along with the projections of the data onto the 2D planes in blue. The correlation between the two supposedly independent variables is clearly visible in the gray projection. We can calculate the linear correlation coefficient…

from scipy.stats import pearsonr

print(pearsonr(X[0], X[1]))

Output:

(0.9458890134715866, 4.1675937498103316e-25)

The Pearson’s r value is 0.946 with a corresponding p-value of approximately zero, i.e. a strong correlation with essentially zero probability of a type I error.
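As a supplementary diagnostic that is not part of the original analysis, we could also compute variance inflation factors (VIFs) with statsmodels. A VIF much greater than one (rules of thumb usually flag values above about 5 or 10) indicates that a feature is nearly a linear combination of the other features. A minimal sketch, reusing the variable names from above:

# supplementary check: variance inflation factor for each feature
from statsmodels.stats.outliers_influence import variance_inflation_factor

# statsmodels expects observations in rows, so stack the feature rows of X as
# columns and include an intercept column
design = np.column_stack([np.ones(_rows), X[0], X[1]])
print([variance_inflation_factor(design, i) for i in range(1, design.shape[1])])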

Manual PCA

In this section we implement PCA by hand to get a better feel for what’s going on; using a library like scikit-learn makes the process a bit more user-friendly.

Let’s use PCA to see if we can reduce the dimensionality of this problem from two features to one. We will diagonalize the covariance matrix of X and inspect both the eigenvalues and the eigenvectors. Notice that we call np.linalg.eigh() and not np.linalg.eig(): first, because we can, since a covariance matrix is always symmetric, and second, because eigh() returns the eigenvalues (and their eigenvectors) in ascending order.

cov_X = np.cov(X)
eigenvalues, eigenvectors = np.linalg.eigh(cov_X)
print('eigenvalues: ', eigenvalues)
print('eigenvectors: ', eigenvectors)

Output:

eigenvalues:  [0.00481936  0.17493158]

eigenvectors:  [[-0.74127979 0.67119615]
                [ 0.67119615 0.74127979]]

The above eigenvalues, 0.0048 and 0.1749, are significantly different — orders of magnitude different — which is expected for the example. What does this mean?  The larger eigenvalue, 0.1749, is the variance explained by the second eigenvector.  The second eigenvector explains most of the total variance.  The first eigenvector has an eigenvalue of 0.0048 and explains a trivial amount of the total variance.  Thus, we should be able to disregard the first eigenvector and still predict the dependent variable with minimal additional error.
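A quick way to quantify this, as a small addition to the walkthrough, is to express each eigenvalue as a fraction of the total variance:

# fraction of the total variance explained by each eigenvector
explained = eigenvalues / eigenvalues.sum()
print(explained)   # roughly [0.03, 0.97] for the eigenvalues above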

What do these new features, let’s call them Z, look like? E is the matrix of eigenvectors from the diagonalization of the covariance matrix.

Z = E · X

Or,

Z1 = E11 · X1 + E12 · X2
Z2 = E21 · X1 + E22 · X2

Or let’s calculate them explicitly and re-plot the data with the new independent variables,

# transform into the new independent variables...
# np.linalg.eigh() returns the eigenvectors as columns, so transpose so that each
# row of the transformation matrix is an eigenvector, matching Z = E · X above
Z = np.dot(eigenvectors.T, X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(Z[0], Z[1], Y, color='r', marker='o')
ax.plot(Z[0], Y, 'b.', zdir='y', zs=2.5)
ax.plot(Z[1], Y, 'b.', zdir='x', zs=-0.5)
ax.plot(Z[0], Z[1], 'ko', zdir='z', zs=-0.5, alpha=0.25)

ax.set_xlabel('z1')
ax.set_ylabel('z2')
ax.set_zlabel('y')

ax.set_xlim(-0.5, 2.5) 
ax.set_ylim(-0.5, 2.5)
ax.set_zlim(-0.5, 2.5)

[Figure 3d_pca_image_2: 3D scatter of (z1, z2, Y) with the 2D projections onto each plane]

Notice that the major fraction of the total variance is along the Z2 axis, and that Z1 is simply white noise.  We have successfully transformed the design features using PCA and we can now explain the observational data without using the Z1 data.
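As a sanity check, and a small addition to the original walkthrough, the sample variance of each transformed feature should reproduce its eigenvalue:

# the variance of each transformed feature equals its eigenvalue
print(np.var(Z[0], ddof=1), np.var(Z[1], ddof=1))   # approximately 0.0048 and 0.1749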

Check New Vectors for Collinearity

We can also recalculate the linear correlation between Z1 and Z2 to double check that the collinearity has been removed.

print(pearsonr(Z[0], Z[1]))

Output:

(-7.0718568840985767e-17, 1.0)

And the Pearson’s r correlation coefficient is zero (to machine precision), again confirming that PCA has removed the linear correlation between the independent variables.
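Equivalently, another small check that is not in the original post, the covariance matrix of Z should be diagonal, with the eigenvalues on the diagonal:

# the off-diagonal entries should be zero to machine precision
print(np.cov(Z))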

Finishing Up

After removing features that do not contribute significantly to the variance, the next step is to perform linear regression and develop a predictive model. Let’s fit Z2 and Y using ordinary least squares and then evaluate our predictions of Y.  In general we should split the data in order to cross-validate; however, with a single feature and such a simple, linear dataset, overfitting is not a concern here. In the output below, we can see that the R-squared value is 0.993, which is to be expected for such a simple example.

# OLS regression on the dominant component, Z2
import statsmodels.api as sm
sm.OLS(Y, Z[1]).fit().summary()

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.993
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     6753.
Date:                Fri, 11 Aug 2017   Prob (F-statistic):           3.69e-54
Time:                        13:54:32   Log-Likelihood:                 48.327
No. Observations:                  50   AIC:                            -94.65
Df Residuals:                      49   BIC:                            -92.74
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.2777      0.016     82.176      0.000       1.246     1.309
==============================================================================
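For comparison, as an extra experiment beyond the original post, we could fit the same model on the original, collinear features. The overall fit is similar, but with this much collinearity the individual coefficient estimates typically come with much larger standard errors, which is exactly the problem PCA helped us avoid:

# comparison: OLS on the original collinear features
# (X is shaped (2, n_rows) at this point, so transpose back to rows-as-observations)
sm.OLS(Y, X.T).fit().summary()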
