Introduction

When independent variables in a multivariate regression model exhibit strong correlation, multicollinearity exists. This is problematic when evaluating the estimates of the linear coefficients: multicollinearity makes those estimates unstable, and that instability shows up as additional uncertainty (larger standard errors) in the regression coefficients. The concern is that the inflated uncertainty of correlated variables can lead to the conclusion that those variables are statistically insignificant.

In the following sections, we will create several data sets with and without multicollinearity, explore the consequences with respect to statistical significance, and discuss how to identify and mitigate multicollinearity.

Generation of Data

%matplotlib notebook    # plotting in a jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model

First, we generate a 10-dimensional dataset with no multicollinearity by sampling each variable independently from a standard Gaussian distribution.

data = {}
for i in range(10):
    data['x' + str(i)] = np.random.normal(loc=0, scale=1, size=500)

X = pd.DataFrame.from_dict(data)

Next we plot the correlation matrix M, where Mij is the Pearson's r correlation coefficient for the ith and jth variables. Alternatively, we could plot the covariance matrix and obtain a similar visualization.

plt.figure()
plt.imshow(X.corr())
plt.colorbar()
plt.title("Pearson's R Correlation Matrix")
plt.ylabel('independent variables')
plt.xlabel('independent variables')

Correlation values range from -1 to 1; those extremes indicate perfect negative and positive correlation, respectively, while values near zero suggest no correlation between variables. The off-diagonal elements in the figure below are all near zero, leading us to believe that there is no multicollinearity in this dataset.

[Figure: Pearson's R correlation matrix for the uncorrelated dataset; off-diagonal elements are near zero]

We can now introduce multicollinearity by redefining a few of the variables as linear combinations of others.

X.x7 = -1 * X.x3  # full linear dependence
X.x5 = 0.25 * X.x1 + 0.75 * X.x5  # partial linear dependence

Notice in the correlation plot below that M37 and M73 are -1.0, and that M15 and M51 now show a clear positive correlation (the mixing above gives a correlation of 0.25/√(0.25² + 0.75²) ≈ 0.32).

[Figure: Pearson's R correlation matrix after introducing collinearity; the x3/x7 and x1/x5 off-diagonal elements stand out]

How to Quantify the Severity of Collinearity: Variance Inflation Factor

The variance inflation factor (VIF) allows us to quantify the degree of multicollinearity numerically. The VIF is calculated for each independent (feature) variable, and deviations from unity suggest that the variable is correlated with the others.

For each variable Xi, we fit a linear model that tries to explain Xi in terms of the other n - 1 independent variables.

X_i = \beta_0 + \sum_{j \neq i} \beta_j X_j

From there we calculate the VIF using the R² from the fit to the above equation. If R² is zero, i.e. none of the variance in Xi is explained by the other n - 1 variables, then VIFi = 1.0. As R² increases from zero, VIFi increases to reflect that fact.

\mathrm{VIF}_i = \frac{1}{1 - R_i^2}

Let’s calculate the VIF values for a few of our variables from the data set above.

_X = X.drop(['x5'], axis=1)
_y = X.x5

lm = linear_model.LinearRegression(fit_intercept=True)
lm.fit(_X, _y)
R2 = lm.score(_X, _y)
print(R2, 1.0 / (1.0 - R2))

For X5, the R² from this fit is 0.1555, and the corresponding VIF value is 1.179, which suggests that multicollinearity will inflate the standard error of this regression coefficient estimate. The X1 feature also has a VIF value of 1.182, which is expected since X1 was used to construct X5.

For the X7 feature, the R² is 1.0 and the VIF value is infinite. In practice we rarely see perfect correlation because of real-world noise, but very large VIF values are certainly possible. The X3 VIF value is also infinite.

All other features in this dataset have VIF values floating just above unity.

for col_i in X.columns:
    lm.fit(X.drop([col_i], axis=1), X[col_i])
    print(col_i, 1 / (1 - lm.score(X.drop([col_i], axis=1), X[col_i])))

output:

x0 1.0085963855
x1 1.18205441212
x2 1.00560189773
x3 inf
x4 1.0108068519
x5 1.17863000174
x6 1.02102453797
x7 inf
x8 1.0079748323
x9 1.0168992924
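
As a cross-check, the same numbers can be obtained with the variance_inflation_factor helper from the statsmodels package. This is a minimal sketch assuming statsmodels is installed; adding an explicit intercept column is the recommended usage, even though our features are already centered.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# prepend an intercept column ('const'); the remaining columns keep their order
exog = sm.add_constant(X)
for i, col in enumerate(X.columns, start=1):  # index 0 is the constant
    print(col, variance_inflation_factor(exog.values, i))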

Generate y Dataset

Before we can explore how multicollinearity affects the standard error of our regression coefficients, we must generate some dependent observation values. To do this, we generate an array of beta coefficients, calculate X·Β, and add some Gaussian-distributed errors.

betas = np.random.random(size=X.shape[1])
y = np.dot(X, betas) + np.random.normal(0, 1, size=X.shape[0])

The betas variable is an array of random numbers in [0, 1), and our noise is sampled from a standard normal distribution centered about zero with a standard deviation of unity.

Solve for the Regression Coefficients

We can now solve for the regression coefficients using the ordinary least squares method. More precisely, we calculate unbiased estimates of the Β coefficients: unless you created your own data as we have above, you never know the true coefficients; we can only estimate them.

\hat{\beta} = (X^T X)^{-1} X^T y

Let’s code up the above equation in Python and solve for the regression coefficients.

Xt = X.transpose()
XtX = np.dot(Xt, X)
XtX_inv = np.linalg.inv(XtX)
betas_est = np.dot(np.dot(XtX_inv, Xt), y)

We run this, and Python raises a ‘LinAlgError: Singular matrix’ error. What happened? The perfect multicollinearity between X3 and X7 makes the XTX matrix non-invertible, so we cannot solve the problem with this method. We would need either a more robust analytic method, such as the singular value decomposition, or a numerical solution.
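
One possible workaround, sketched below but not the route taken in the rest of this post: the Moore-Penrose pseudoinverse (computed via the singular value decomposition) or numpy's least-squares solver both return a minimum-norm solution even when XTX is singular. Keep in mind that the individual coefficients of perfectly collinear columns such as X3 and X7 are not separately identifiable.

# np.linalg.pinv builds the pseudoinverse from the SVD and yields the
# minimum-norm least squares solution despite the singular X'X matrix
betas_pinv = np.dot(np.linalg.pinv(X), y)

# np.linalg.lstsq solves the same least squares problem numerically
betas_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]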

Let’s create a slightly different data set without perfect multicollinearity and calculate our regression coefficients. Again, we create a correlation between the X5 and X1 variables.

X = pd.DataFrame.from_dict(data)
X.x5 = 0.25 * X.x5 - 0.75 * X.x1

y = np.dot(X, betas) + np.random.normal(0, 1, size=X.shape[0])
betas_est = np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(), X)), X.transpose()), y)
# actual betas
print(betas)
# estimation of betas via regression
print(betas_est)

output:

[ 0.87262849,  0.10874099,  0.00457641,  0.77028961,  0.10735083, 0.36264465,  0.84870342,  0.46439193,  0.56277177,  0.15009397]
[ 0.84553638,  0.40624893,  0.02285992,  0.75580565,  0.12319431, 0.65499404,  0.80318128,  0.42259867,  0.56750491,  0.14147711]

The above output shows the actual Β (top array) along with the estimated Β (bottom array). The β1 and β5 estimates deviate substantially from the actual values; this is a consequence of the extra variance introduced by the collinearity. We can also see that β2 is estimated to be larger than the actual value; however, we will see in the next section that β2 is statistically insignificant.

Inflation of Standard Error in the Regression Coefficients

Let’s first calculate the covariance matrix of the regression coefficients under the usual ordinary least squares assumptions.

\Sigma_{\beta} = (X^T X)^{-1} \sigma^2

In the above expression, σ² is the variance of the residuals (the error term). Its definition is shown below, where yi is the ith observed value and ŷi is the ith predicted value.

\sigma^2 = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2

Coding this up in Python:

# residual variance (sigma^2) and covariance matrix of the coefficient estimates
var_y = np.sum((y - np.dot(X, betas_est))**2) / X.shape[0]
covar = np.linalg.inv(np.dot(X.transpose(), X)) * var_y

The variances of the regression coefficients are the diagonal elements of the covariance matrix. And the standard errors are the square root of the variances.

print(np.sqrt(covar.diagonal()))

output:

[ 0.04288585  0.13963199  0.04381037  0.04239948  0.04348278  0.1806739, 0.04478235  0.04263621  0.04420973  0.04450978]

We see that β1 and β5 exhibit significantly higher uncertainties when compared to the other independent variables or features.  This is a result of the multicollinearity.

Fortunately, those standard errors are not large enough to make us incorrectly conclude that β1 and β5 are statistically indistinguishable from zero; however, that outcome is possible depending on the data.

Testing Statistical Significance of Regression Coefficients

Regression coefficients can be tested for statistical significance using a two-tailed t test with the null hypothesis H: βi = 0 and n - p degrees of freedom, where p is the number of estimated coefficients.

t_i = \hat{\beta}_i / s(\hat{\beta}_i), where s(\hat{\beta}_i) is the standard error of \hat{\beta}_i

For β2, the test statistic is 0.02285992/0.04381037 = 0.52, which clearly fails the significance test. We know that without even calculating the p-value, because the standard error is larger than the estimated value.

For β4, the test statistic is 0.12319431/0.04348278 = 2.83, with a corresponding p-value of 0.0048, which is statistically significant. We reject H and conclude that β4 is non-zero, i.e. there is a statistically significant relationship between y and X4.
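
Rather than checking coefficients one at a time, we can compute the t statistics and two-tailed p-values for all of them at once. This is a minimal sketch assuming scipy is available, with the degrees of freedom taken as the number of observations minus the number of estimated coefficients.

from scipy import stats

std_err = np.sqrt(covar.diagonal())
t_stats = betas_est / std_err
dof = X.shape[0] - X.shape[1]  # n observations minus p estimated coefficients
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-tailed test
for name, t, p in zip(X.columns, t_stats, p_values):
    print(name, round(t, 2), round(p, 4))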

One can see that, with multicollinearity, the inflated standard error can push the test statistic down enough that we incorrectly conclude there is no statistically significant relationship between y and Xi.
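
To see that mechanism in action, here is an illustrative sketch (the 0.95/0.05 mixing is an arbitrary choice and the exact numbers depend on the random draw) that makes X5 nearly a copy of X1 and recomputes the standard error and t statistic for β5.

# with near-perfect collinearity the standard error of beta_5 inflates, and
# its t statistic can fall below the significance threshold even though the
# true beta_5 is non-zero
X_strong = pd.DataFrame.from_dict(data)
X_strong.x5 = 0.95 * X_strong.x1 + 0.05 * X_strong.x5
y_strong = np.dot(X_strong, betas) + np.random.normal(0, 1, size=X_strong.shape[0])

XtX_inv_strong = np.linalg.inv(np.dot(X_strong.transpose(), X_strong))
b_strong = np.dot(np.dot(XtX_inv_strong, X_strong.transpose()), y_strong)
resid_var = np.sum((y_strong - np.dot(X_strong, b_strong))**2) / X_strong.shape[0]
se_strong = np.sqrt((XtX_inv_strong * resid_var).diagonal())

print(betas[5], b_strong[5], se_strong[5], b_strong[5] / se_strong[5])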
