### Introduction

When independent variables in a multiple regression model are strongly correlated, multicollinearity exists. This is problematic when evaluating the estimates of the linear coefficients: multicollinearity destabilizes the final coefficient estimates, and that instability manifests as additional uncertainty in the estimated regression coefficients. The concern is that this perceived increase in uncertainty for correlated variables can lead to the conclusion that those variables are statistically insignificant.

In the following sections, we will create several data sets with and without multicollinearity, explore the related consequences with respect to statistical significance, and discuss how to identify and mitigate multicollinearity.

### Generation of Data

```python
import matplotlib.pyplot as plt
%matplotlib notebook  # using jupyter notebook
import numpy as np
import pandas as pd
```

First we generate a 10-dimensional dataset that has no multicollinearity by sampling each variable independently from a standard Gaussian distribution.

```python
data = {}
for i in range(10):
    data[''.join(['x', str(i)])] = \
        np.random.normal(loc=0, scale=1, size=500)
X = pd.DataFrame.from_dict(data)
```

And we plot the correlation matrix *M*, where *M _{ij}* is the Pearson's R correlation value for the *i*^{th} and *j*^{th} variables. Alternatively, we could plot the covariance matrix and obtain a similar visualization.

```python
plt.figure()
plt.imshow(X.corr())
plt.colorbar()
plt.title("Pearson's R Correlation Matrix")
plt.ylabel('independent variables')
plt.xlabel('independent variables')
```

Correlation values range from -1 to 1. Those extremes indicate perfect negative and positive correlation, respectively. Values near zero suggest no correlation exists between variables. The off-diagonal elements in the image below exhibit near-zero correlation values, leading us to believe that there is no multicollinearity in this dataset.
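As a quick sanity check on those interpretations (a minimal sketch, not part of the original analysis; the seed is arbitrary), `np.corrcoef` reproduces both extremes and the near-zero case:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
a = rng.normal(size=500)
b = rng.normal(size=500)

r_independent = np.corrcoef(a, b)[0, 1]   # near 0 for independent samples
r_negative = np.corrcoef(a, -a)[0, 1]     # -1: perfect negative correlation
r_positive = np.corrcoef(a, 2 * a)[0, 1]  # +1: perfect positive correlation
```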

We can now introduce multicollinearity by redefining a few of the variables as linear combinations of others.

```python
X.x7 = -1 * X.x3                   # full linear dependence
X.x5 = 0.25 * X.x1 + 0.75 * X.x5   # partial linear dependence
```

Notice in the below correlation plot that *M _{37}* and *M _{73}* are -1.0, and *M _{15}* and *M _{51}* are 0.25.

### How to Quantify the Severity of Collinearity: Variance Inflation Factor

The variance inflation factor (VIF) allows us to numerically quantify the degree of multicollinearity. The VIF is calculated for each independent (feature) variable, and deviations from unity suggest that the variable is correlated with other variables.

For each variable, we solve a linear model in which we try to explain the variable *X _{i}* in terms of the other *n-1* independent variables.

X_{i} = β_{0} + Σ_{j≠i} β_{j} · X_{j}

From there we calculate VIF_{i} using the *R ^{2}* result from the fit to the above equation. If *R ^{2}* is zero, i.e. none of the variance in *X _{i}* is explained by the other *n-1* variables, then VIF_{i} = 1.0. As *R ^{2}* increases from zero, VIF_{i} increases to reflect that fact.

VIF_{i} = 1 / (1 - R_{i}^{2})

Let’s calculate the VIF values for a few of our variables from the data set above.

```python
from sklearn import linear_model

_X = X.drop(['x5'], axis=1)
_y = X.x5
lm = linear_model.LinearRegression(fit_intercept=True)
lm.fit(_X, _y)
R2 = lm.score(_X, _y)
print(R2, 1.0 / (1.0 - R2))
```

For *X _{5}*, the R_{5}^{2} is 0.1555, and the corresponding VIF_{5} value is 1.179, which suggests that multicollinearity will inflate the standard error of our regression coefficient estimates. The *X _{1}* feature also has a VIF value of 1.182, and this is expected.

For the *X _{7}* feature, R_{7}^{2} is 1.0 and the VIF_{7} value is infinity. Normally we wouldn't get perfect correlation because of real-world noise; however, large VIF values are possible. The *X _{3}* VIF value is also infinity.

All other features in this dataset have VIF values floating just above unity.

```python
for col_i in X.columns:
    lm.fit(X.drop([col_i], axis=1), X[col_i])
    print(col_i, 1 / (1 - lm.score(X.drop([col_i], axis=1), X[col_i])))
```

output:

```
x0 1.0085963855
x1 1.18205441212
x2 1.00560189773
x3 inf
x4 1.0108068519
x5 1.17863000174
x6 1.02102453797
x7 inf
x8 1.0079748323
x9 1.0168992924
```

### Generate y Dataset

Before we can explore how multicollinearity affects the standard error of our regression coefficients, we must generate some dependent observation values. To do this, we generate an array of beta coefficients, calculate *X·Β*, and introduce some Gaussian-distributed errors.

```python
betas = np.random.random(size=X.shape[1])
y = np.dot(X, betas) + np.random.normal(0, 1, size=X.shape[0])
```

The *betas* variable is an array of random numbers in [0, 1), and our noise is sampled from a standard normal distribution centered about zero with a standard deviation of unity.

### Solve for the Regression Coefficients

We can now solve for the regression coefficients using the ordinary least squares method. More precisely, we should say that we can now calculate the unbiased estimates of the Β coefficients. Unless you created your own data, as we have above, you never really know the true coefficients; we simply calculate estimates.

Β = (X^{T}X)^{-1}X^{T}y

Let’s code up the above equation in python and solve for the regression coefficients…

```python
Xt = X.transpose()
XtX = np.dot(Xt, X)
XtX_inv = np.linalg.inv(XtX)
betas_est = np.dot(np.dot(XtX_inv, Xt), y)
```

We run this and Python returns a 'LinAlgError: Singular matrix' error. What happened? The perfect multicollinearity between X_{3} and X_{7} makes the *X ^{T}X* matrix non-invertible, so we cannot solve the problem with this method. We would need either a more robust analytic method, such as singular value decomposition, or a numerical solution.
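For instance (a sketch on small synthetic data rather than the exact frame above), `np.linalg.lstsq` solves the least-squares problem via an SVD-based routine, so it does not raise on a rank-deficient design; it reports the deficient rank and returns the minimum-norm solution among the infinitely many least-squares fits:

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
X_demo[:, 2] = -X_demo[:, 1]        # perfect collinearity: X^T X is singular
beta_true = np.array([1.0, 2.0, 0.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=100)

# inv(X^T X) would fail here; lstsq still returns a least-squares solution
beta_hat, residuals, rank, sv = np.linalg.lstsq(X_demo, y_demo, rcond=None)
```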

Let’s create a slightly different data set without perfect multicollinearity and calculate our regression coefficients. Again, we create a correlation between the X_{5} and X_{1} variables.

```python
X = pd.DataFrame.from_dict(data)
X.x5 = 0.25 * X.x5 - 0.75 * X.x1
y = np.dot(X, betas) + np.random.normal(0, 1, size=X.shape[0])
betas_est = np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(), X)), X.transpose()), y)
```

```python
# actual betas
print(betas)
# estimation of betas via regression
print(betas_est)
```

output:

```
[ 0.87262849, 0.10874099, 0.00457641, 0.77028961, 0.10735083, 0.36264465,
  0.84870342, 0.46439193, 0.56277177, 0.15009397]
[ 0.84553638, 0.40624893, 0.02285992, 0.75580565, 0.12319431, 0.65499404,
  0.80318128, 0.42259867, 0.56750491, 0.14147711]
```

The above output shows the actual Β (top array) along with the estimated Β (bottom array). The β_{1} and β_{5} estimates are significantly inflated with respect to the actual values; this is a consequence of the collinearity. We can also see that β_{2} is estimated to be larger than the actual value; however, we will see in the next section that β_{2} is statistically insignificant.

### Inflation of Standard Error in the Regression Coefficients

Let’s first calculate the covariance matrix for the regression coefficients, under the typical ordinary least squares assumptions.

Σ_{β}= (X^{T}X)^{-1}·σ^{2},

In the above expression, σ^{2} is the variance of the residuals (the error variance). Its definition is shown below, where *y _{i}* is the *i*^{th} observed value and *ŷ _{i}* is the *i*^{th} predicted *y* value.

σ^{2} = Σ_{i} (y_{i} - ŷ_{i})^{2} / N

Coding this up in python:

```python
var_y = np.sum((y - np.dot(X, betas_est))**2) / X.shape[0]
covar = np.linalg.inv(np.dot(X.transpose(), X)) * var_y
```

The variances of the regression coefficients are the diagonal elements of the covariance matrix. And the standard errors are the square root of the variances.

```python
print(np.sqrt(covar.diagonal()))
```

output:

```
[ 0.04288585  0.13963199  0.04381037  0.04239948  0.04348278  0.1806739
  0.04478235  0.04263621  0.04420973  0.04450978]
```

We see that β_{1} and β_{5} exhibit significantly higher uncertainties when compared to the other independent variables or features. This is a result of the multicollinearity.

Fortunately, those standard errors are not large enough to lead us to incorrectly conclude that β_{1} and β_{5} are statistically indistinguishable from zero; however, that outcome is possible depending on the data.

### Testing Statistical Significance of Regression Coefficients

Regression coefficients can be tested for statistical significance using a two-tailed *t* test with the null hypothesis, H_{0}: β_{i} = 0, and *n-p* degrees of freedom, where *p* is the number of fitted coefficients (*n-2* in the simple one-predictor case).

T_{i}= β_{i}/ s(β_{i}), where s(β_{i}) is the standard error of β_{i}

For β_{2}, we have a test statistic of 0.02285992 / 0.04381037 = 0.52, which clearly fails the significance test. We know that without even calculating the p-value, because the standard error is larger than the estimated value.

For β_{4}, we have a test statistic of 0.12319431 / 0.04348278 = 2.83, which has a corresponding p-value of 0.0048 and is statistically significant. We reject H_{0} and conclude that β_{4} is non-zero, i.e. there is a statistically significant relationship between *y* and *X _{4}*.
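The quoted p-value can be reproduced with `scipy.stats` (a sketch; the degrees of freedom, n − p = 500 − 10 = 490, are an assumption based on the 500-sample, 10-coefficient data generated earlier):

```python
from scipy import stats

t_stat = 0.12319431 / 0.04348278           # test statistic for beta_4
df = 500 - 10                              # assumed: 500 samples, 10 fitted coefficients
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value, ~0.0048
```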

One can see that with multicollinearity, the inflated standard error can push the test statistic down, such that we incorrectly conclude that there is no statistically significant relationship between *y* and *X _{i}*.