### Introduction

Linear Regression is a simple, yet powerful method for understanding linear relationships between variables in a collection of observations. For a given *i*th observation there is a set of *n* independent variables *{x_1, x_2, x_3, …, x_n}_i* and one dependent variable *y_i*, and the assumption is that *y_i* is a linear function of the independent variables.

*y_i = f(**x**) = β_0 + β_1·x_1 + β_2·x_2 + β_3·x_3 + … + β_n·x_n = β_0 + ∑_j β_j·x_j*

Each set of *x_i* is typically a row in a data set or design matrix **X**, and the coefficients *β* are a measure of how that particular *x_j* affects *y_i*. We assume that all the observations have the same linear relationship, and thus share a common set of coefficients *β*. We further simplify the linear relationship in matrix form.

**Y** = **X** · **β**

And thus the goal here is to understand the relationships within the data, i.e. calculate the coefficients *β*. This can be done either analytically or numerically. We will first discuss the analytical solution.

If we multiply both sides of the above equation by **X**^{T}, which is the transpose of **X**:

**X**^{T}**Y** = **X**^{T}**X** · **β**

Then multiply both sides by (**X**^{T}**X**)^{-1}, which is the inverse of **X**^{T}**X** and is defined such that (**X**^{T}**X**)^{-1} · (**X**^{T}**X**) = **I**, the identity matrix:

(**X**^{T}**X**)^{-1} · **X**^{T}**Y** = (**X**^{T}**X**)^{-1} · (**X**^{T}**X**) · **β**

(**X**^{T}**X**)^{-1} · **X**^{T}**Y** = **β**

We now have an analytic solution for the set of coefficients **β**.

### Example using sklearn

###### (caveat: sklearn.linear_model.LinearRegression calls scipy.linalg.lstsq, which calls an LAPACK routine based on Singular Value Decomposition (SVD), a more robust algorithm for solving this equation; to be covered more in the future…)

Let’s generate some fake data in Python and then fit it using the scikit-learn library.

```python
import numpy as np

X = np.random.random(1000).reshape(500, 2)
beta = np.array([-1.25, 1.50])
```

*X* is now a design matrix with 500 rows of randomly assigned values for the two independent variables. And for the sake of generating fake data, we have assigned *β_1* and *β_2* the values -1.25 and 1.50, respectively.

```python
y = np.dot(X, beta)
y += np.ones_like(y)
y += 0.1 * np.random.normal(1, 1, size=len(y))
```

First we calculate **y**, then add a constant offset *β_0 = 1*, and finally throw in some random noise. Plotting up our data…

```python
import matplotlib.pyplot as plt
%matplotlib notebook  # using Jupyter Notebook

plt.figure()
plt.title('example data')
plt.subplot(211)
plt.plot(X[:, 0], y, '.')
plt.ylabel('Y')
plt.xlabel('x1')
plt.subplot(212)
plt.plot(X[:, 1], y, '.')
plt.ylabel('Y')
plt.xlabel('x2')
```

```python
from sklearn import linear_model

lm = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lm.fit(X, y)
```

Here we import the *linear_model* module from the *sklearn* library and create an instance of the *LinearRegression* class. We also explicitly set the *fit_intercept* parameter to *True* and the *normalize* parameter to *False*. Then we call the *fit* method to calculate the regression coefficients. Finally, we call the *score* method to calculate the *R^{2}* value of our fit.

```python
print(lm.coef_)
print(lm.intercept_)
print(lm.score(X, y))
```

The resulting output should be something similar to array([-1.25472623, 1.50384261]) for the coefficients, which is within an acceptable tolerance of the known values of those coefficients. And for the intercept, our value is 1.101767, again within an acceptable tolerance of the known value. The *R^{2}* value is 0.97, suggesting that we have a great fit and that 97% of the variance is explained by the independent variables.
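For comparison with the SVD-based routine mentioned in the caveat above, the same fit can be sketched with np.linalg.lstsq directly (the seed here is arbitrary, so the recovered values will be close to, but not identical to, the numbers quoted above):

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed for reproducibility
X = rng.random_sample((500, 2))
beta = np.array([-1.25, 1.50])
y = X.dot(beta) + 1.0 + 0.1 * rng.normal(1, 1, size=500)

# prepend a column of ones so lstsq fits the intercept as well
A = np.column_stack([np.ones(len(y)), X])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.1, -1.25, 1.5]
```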

```python
print(lm.predict(np.array([[0.5, 0.5]])))
```

Now we have the power to predict new *y* values given an observation of *x_1* and *x_2*. For example, assume both *x_1* and *x_2* are equal to 0.5; then the new *y* value is predicted to be array([ 1.22632577]).

### Example Python Class for Analytical Calculation

###### (This Python class is more or less just for instruction… most libraries use a different, more robust analytical method to solve this problem.)

Below is a Python class that solves the above equation for the linear regression coefficients. The linear algebra occurs in the *_solve_ols* method, where we calculate (**X**^{T}**X**), the XTX variable, along with its inverse (**X**^{T}**X**)^{-1}, the XTX_i variable. Then the coefficients are calculated as shown above, **β** = (**X**^{T}**X**)^{-1} · **X**^{T}**Y**.

This class also allows us to check whether the residuals, or the noise, are gaussian distributed. And we also have a method *calc_r_squared* to return the *R^{2}* value.

```python
import numpy as np
from scipy.stats import kstest, norm


class OLS(object):

    def __init__(self, X, y, fit_intercept=True):
        self.X = X
        self.y = y
        self.fit_intercept = fit_intercept
        self._coeff = None

    def fit(self, check_residuals=True, threshold=0.05):
        if self.fit_intercept:
            self._add_intercept()
        self._solve_ols()
        if check_residuals:
            print('checking residuals...')
            if self._check_residuals(threshold):
                print('...residuals are gaussian distributed at %3.2f...' % threshold)
            else:
                print('...residuals are Not gaussian distributed...')

    def _add_intercept(self):
        '''add a column of 1s in the X matrix...'''
        self.X = np.insert(self.X, 0, np.ones_like(self.y), axis=1)

    def _solve_ols(self):
        '''matrix solution for OLS...'''
        XT = self.X.transpose()
        XTX = np.dot(XT, self.X)
        XTX_i = np.linalg.inv(XTX)
        self._coeff = np.dot(np.dot(XTX_i, XT), self.y)

    def _calculate_residuals(self):
        return self.y - np.dot(self.X, self._coeff)

    def _check_residuals(self, threshold=0.05):
        '''check residuals using ks_test for normality...'''
        residuals = self._calculate_residuals()
        mu, std = np.mean(residuals), np.std(residuals)

        def g_cdf(x):
            return norm.cdf(x, mu, std)

        # standard 2-sided ks test...
        t_stat, p_value = kstest(residuals, g_cdf)
        # returns True for gaussian noise
        return p_value > threshold

    def calc_r_squared(self, adjusted=True):
        '''returns the standard R2 value...'''
        n_obs, n_var = self.X.shape
        y_ = np.mean(self.y)
        p = np.dot(self.X, self._coeff)
        ss_t = np.sum(np.square(self.y - y_))
        ss_e = np.sum(np.square(self.y - p))
        r2 = 1.0 - ss_e / ss_t
        if adjusted:
            return 1.0 - (1.0 - r2) * ((n_obs - 1) / (n_obs - n_var))
        return r2
```
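For reference, the adjusted *R^{2}* returned by *calc_r_squared* penalizes the plain *R^{2}* for the number of columns in **X** (note that *n_var* counts the intercept column once it has been inserted):

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n_{\mathrm{obs}} - 1}{n_{\mathrm{obs}} - n_{\mathrm{var}}}
```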

### Scaling of the Analytic Linear Regression Solution

The analytic solution requires matrix transposition, matrix multiplication, and matrix inversion. The transposition step scales as *O(mn)*, where we have *m* rows or observations and *n* independent variables. The first matrix multiplication step, **X**^{T}**X**, scales as *O(n^{2}m)*. The inversion of **X**^{T}**X** scales as *O(n^{3})* assuming a basic inversion algorithm; other inversion algorithms exist that provide *O(n^{2.373})* performance (see Wikipedia for more details). The next matrix multiplication, (**X**^{T}**X**)^{-1} · **X**^{T}, also scales as *O(n^{2}m)*. We see here that the analytic solution to linear regression is in general going to scale as *O(n^{2}m)*. If for some reason *n > m*, then this solution will scale as *O(n^{3})*.

For problems where *n* and *m* are very large, the computational cost of an analytic solution is prohibitive. In those cases, numerical solutions such as gradient descent can be implemented.
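As an illustrative sketch of the numerical alternative (the function name and step size here are made up, not from any library), batch gradient descent repeatedly steps the coefficients along the negative gradient of the mean squared error:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=5000):
    '''illustrative batch gradient descent for least-squares regression'''
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iter):
        grad = (2.0 / m) * X.T.dot(X.dot(beta) - y)  # gradient of the MSE cost
        beta -= lr * grad
    return beta

# intercept column plus one independent variable; y = 1 + 2*x exactly
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))  # converges toward [1.0, 2.0]
```

Each iteration costs only *O(nm)*, so for very large problems many cheap passes can beat one *O(n^{2}m)* analytic solve.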

### Related Topics

Cross-validation should always be used to evaluate a model’s true predictive power. This is typically done by splitting the data into both a training set and a testing set. Future posts will discuss this topic further.

There are several assumptions about the data we have implicitly made in this blog post.

- Linear regression assumes that each observation is independent of all others and that the noise can be modeled as a gaussian centered about zero with a constant variance. If the noise does not have a constant variance (i.e. is not **homoscedastic**), then either a non-linear transformation of the dependent variable or a weighted linear regression is necessary.
- We have assumed no **multicollinearity**, i.e. that the independent variables are not correlated with each other. If multicollinearity exists, then either regularization or principal component analysis should be considered.
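As a hedged sketch of the regularization option (ridge regression, with a made-up *λ* and synthetic nearly collinear data), the closed-form solution simply adds *λI* to **X**^{T}**X** before solving:

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = X.dot(np.array([1.0, 1.0])) + 0.1 * rng.normal(size=200)

lam = 1.0  # illustrative regularization strength
# ridge: beta = (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T.dot(X) + lam * np.eye(2), X.T.dot(y))
print(beta_ridge)  # both coefficients stay near the true value of 1.0
```

Without the *λI* term, the nearly singular **X**^{T}**X** makes the plain inverse numerically unstable: the two coefficients can wander far from 1.0 even though their sum stays near 2.0.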