Introduction

Linear regression is a simple yet powerful method for understanding linear relationships between variables in a collection of observations.  For a given ith observation there is a set of n independent variables {x1, x2, x3, …, xn}i and one dependent variable yi, and the assumption is that yi is a linear function of the independent variables.

yi = f(x1, x2, …, xn) = β0 + β1·x1 + β2·x2 + β3·x3 + … + βn·xn = β0 + ∑j βj·xj

Each observation's set of x values is typically a row in a data set or design matrix X, and each coefficient βj is a measure of how the corresponding xj affects yi. We assume that all the observations share the same linear relationship, and thus a common set of coefficients β. We can further simplify the linear relationship by writing it in matrix form, where the intercept β0 is absorbed by adding a column of ones to X.

Y = X · β

The goal here is thus to understand the relationships within the data, i.e. to calculate the coefficients β. This can be done either analytically or numerically. We will first discuss the analytical solution.

First, we multiply both sides of the above equation on the left by XT, the transpose of X:

XTY = XTX · β

Then we multiply both sides by (XTX)-1, the inverse of XTX, which is defined such that (XTX)-1 · (XTX) = I, the identity matrix:

(XTX)-1 · XT Y = (XTX)-1 · (XTX) β
(XTX)-1 · XT Y = β

We now have an analytic solution for the set of coefficients β.
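
In NumPy, this normal-equation solution is only a few lines. Below is a minimal sketch on a tiny made-up data set (the names X_design, y_obs, and beta_hat are purely illustrative); in practice np.linalg.solve or a dedicated least-squares routine is preferred over forming the inverse explicitly.

import numpy as np

# a tiny made-up data set: a column of ones for the intercept plus one independent variable
X_design = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
y_obs = np.array([1.0, 3.1, 4.9, 7.2])

XTX = X_design.T.dot(X_design)                 # X^T X
XTX_inv = np.linalg.inv(XTX)                   # (X^T X)^-1
beta_hat = XTX_inv.dot(X_design.T).dot(y_obs)  # beta = (X^T X)^-1 X^T Y
print(beta_hat)                                # roughly [1.0, 2.0]: intercept and slope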

Example using sklearn

(Caveat: sklearn.linear_model.LinearRegression calls np.linalg.lstsq, which in turn calls a LAPACK routine based on Singular Value Decomposition (SVD), a more robust algorithm for solving this equation. SVD will be covered more in a future post…)
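
As a point of reference for that caveat, here is a minimal sketch of solving the same small system directly with np.linalg.lstsq, reusing the made-up data from the previous snippet:

import numpy as np

# same tiny made-up data set as above
X_design = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
y_obs = np.array([1.0, 3.1, 4.9, 7.2])

# lstsq minimizes ||X·beta - y||^2 using an SVD-based LAPACK routine
beta_hat, residuals, rank, singular_values = np.linalg.lstsq(X_design, y_obs, rcond=None)
print(beta_hat)   # same coefficients as the normal-equation solution, roughly [1.0, 2.0]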

Let’s generate some fake data in python… and then fit it using the scikit-learn library.

import numpy as np
X = np.random.random(1000).reshape(500, 2)
beta = np.array([-1.25, 1.50])

X is now a design matrix with 500 rows, each containing randomly assigned values for the two independent variables. And for the sake of generating fake data, we have assigned β1 and β2 the values -1.25 and 1.50, respectively.

y = np.dot(X, beta)
y += np.ones_like(y)
y += 0.1 * np.random.normal(1, 1, size=len(y))

First we calculate y, then add a constant offset β0 = 1, and finally throw in some gaussian noise. Note that the noise term 0.1 · N(1, 1) has a mean of 0.1, so it also adds a small additional offset that will show up in the fitted intercept.

Plotting up our data…

import matplotlib.pyplot as plt
# enable interactive plotting when working in a Jupyter Notebook:
%matplotlib notebook

plt.figure()
plt.subplot(211)
plt.title('example data')
plt.plot(X[:, 0], y, '.')
plt.ylabel('Y')
plt.xlabel('x1')
plt.subplot(212)
plt.plot(X[:, 1], y, '.')
plt.ylabel('Y')
plt.xlabel('x2')
plt.tight_layout()

[Figure: the generated fake data, plotting y against x1 (top panel) and against x2 (bottom panel).]

from sklearn import linear_model
lm = linear_model.LinearRegression(fit_intercept=True, normalize=False)
lm.fit(X, y)

Here we import the linear_model module from the sklearn library and create an instance of the LinearRegression class.  We explicitly set fit_intercept to True and normalize to False.  Then we call the fit method to calculate the regression coefficients, and below we call the score method to calculate the R2 value of our fit.

print(lm.coef_)
print(lm.intercept_)
print(lm.score(X, y))

The resulting output should be something similar to array([-1.25472623, 1.50384261]) for the coefficients, which is within an acceptable tolerance of the known values. The intercept comes out to about 1.101767, consistent with the known offset β0 = 1 plus the 0.1 mean of the added noise. The R2 value is 0.97, suggesting that we have a good fit and that 97% of the variance is explained by the independent variables.

print(lm.predict([np.array([0.5, 0.5])]))

Now we have the power to predict new y values given an observation of x1 and x2. For example, if both x1 and x2 are equal to 0.5, then the new y value is predicted to be array([ 1.22632577]).

Example Python Class for Analytical Calculation

(This python class is more or less just for instruction… most libraries use a different, more robust analytical method to solve this problem.)

Below is a python class that solves the above equation for the linear regression coefficients. The linear algebra occurs in the _solve_ols method, where we calculate XTX (the XTX variable) along with its inverse (XTX)-1 (the XTX_i variable), and then calculate the coefficients as shown above: β = (XTX)-1 · XT Y.

This class also allows us to check that the residuals, or the noise, are gaussian distributed, and it provides a calc_r_squared method to return the R2 value.

import numpy as np
from scipy.stats import kstest, norm

class OLS(object):
    def __init__(self, X, y, fit_intercept=True):
        self.X = X
        self.y = y
        self.fit_intercept = fit_intercept
        self._coeff = None

    def fit(self, check_residuals=True, threshold=0.05):
        if self.fit_intercept:
            self._add_intercept()
        self._solve_ols()
        
        if check_residuals:
            print('checking residuals...')
            if self._check_residuals(threshold):
                print('...residuals are gaussian distributed at %3.2f...' % threshold)
            else:
                print('...residuals are Not gaussian distributed...')
                
    def _add_intercept(self):
        '''add a column of 1s in the X matrix...'''
        self.X = np.insert(self.X, 0, np.ones_like(self.y), axis=1)

    def _solve_ols(self):
        '''matrix solution for OLS...'''
        XT = self.X.transpose()
        XTX = np.dot(XT, self.X)
        XTX_i = np.linalg.inv(XTX)
        self._coeff = np.dot(np.dot(XTX_i, XT), self.y)

    def _calculate_residuals(self):
        return self.y - np.dot(self.X, self._coeff)

    def _check_residuals(self, threshold=0.05):
        '''check residuals using ks_test for normality...'''
        residuals = self._calculate_residuals()
        mu, std = np.mean(residuals), np.std(residuals)

        def g_cdf(x):
            return norm.cdf(x, mu, std)

        # standard 2-sided ks test...
        t_stat, p_value = kstest(residuals, g_cdf)
        # returns True for gaussian noise
        return p_value > threshold

    def calc_r_squared(self, adjusted=True):
        '''returns the standard R2 value...'''
        n_obs, n_var = self.X.shape
        y_ = np.mean(self.y)
        p = np.dot(self.X, self._coeff)

        ss_t = np.sum(np.square(self.y - y_))
        ss_e = np.sum(np.square(self.y - p))

        r2 = 1.0 - ss_e/ss_t
        if adjusted:
            return 1.0 - (1.0 - r2) * ((n_obs - 1) / (n_obs - n_var))
        return r2
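
For completeness, here is how the class above might be used on the fake data generated earlier. The exact numbers will vary with the random noise; note that the fitted coefficients are stored in the _coeff attribute.

ols = OLS(X, y, fit_intercept=True)
ols.fit(check_residuals=True)   # prints whether the residuals pass the KS test for normality
print(ols._coeff)               # approximately [1.10, -1.25, 1.50]: intercept first, then the two coefficients
print(ols.calc_r_squared())     # adjusted R2, approximately 0.97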

Scaling of the Analytic Linear Regression Solution

The analytic solution requires matrix transposition, matrix multiplication, and matrix inversion. The transposition step scales as O(mn), where we have m rows or observations and n independent variables. The first matrix multiplication step, XTX, scales as O(n^2·m).  The inversion of XTX scales as O(n^3) assuming a basic inversion algorithm; other inversion algorithms exist that provide O(n^2.373) performance (see wikipedia for more details). The next matrix multiplication, (XTX)-1 · XT, scales as O(n^2·m).  We see here that the analytic solution to linear regression is in general going to scale as O(n^2·m).  If for some reason n > m, then this solution will scale as O(n^3).

For problems where n and m are very large, the computational cost of an analytic solution is prohibitive.  In those cases, numerical solutions such as gradient descent can be used instead.
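
As a point of comparison, below is a minimal sketch of batch gradient descent for the same least-squares problem, reusing the fake X and y from earlier. The function name, learning rate, and iteration count are arbitrary illustrative choices, not part of any library.

import numpy as np

def gradient_descent_ols(X, y, learning_rate=0.1, n_iters=5000):
    '''iteratively minimize the mean squared error ||X·beta - y||^2 / n...'''
    n_obs, n_var = X.shape
    beta = np.zeros(n_var)
    for _ in range(n_iters):
        residuals = np.dot(X, beta) - y
        # gradient of the mean squared error with respect to beta
        gradient = 2.0 * np.dot(X.T, residuals) / n_obs
        beta -= learning_rate * gradient
    return beta

# reuse the fake data from above, with an intercept column added
X_design = np.insert(X, 0, np.ones(len(y)), axis=1)
print(gradient_descent_ols(X_design, y))   # approximately [1.10, -1.25, 1.50]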

Related Topics

Cross-validation should always be used to evaluate a model’s true predictive power. This is typically done by splitting the data into a training set and a testing set.  Future posts will discuss this topic further.
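
As a quick illustration of the idea (a simple hold-out split rather than full cross-validation), scikit-learn's train_test_split can fit the model on one portion of the data and score it on the rest. The split fraction and random seed below are arbitrary illustrative values.

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# hold out 25% of the fake data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

lm = linear_model.LinearRegression(fit_intercept=True)
lm.fit(X_train, y_train)
print(lm.score(X_test, y_test))   # R2 on data the model has never seen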

There are several assumptions about the data we have implicitly made in this blog post.

  1. Linear regression assumes that each observation is independent of all others and that the noise can be modeled as a gaussian centered about zero with a constant variance.  If the noise does not have a constant variance (i.e. not homoscedastic), then either a non-linear transformation of the dependent variable or a weighted linear regression is necessary.
  2. We have assumed no multicollinearity, i.e. that the independent variables are not correlated with each other. If multicollinearity exists, then either regularization or principal component analysis should be considered (see the sketch after this list).
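
As an example of the regularization option mentioned in item 2, scikit-learn's Ridge estimator adds an L2 penalty on the coefficients; the penalty strength alpha below is an arbitrary illustrative value.

from sklearn.linear_model import Ridge

# ridge regression: ordinary least squares plus an L2 penalty alpha * ||beta||^2
ridge = Ridge(alpha=1.0, fit_intercept=True)
ridge.fit(X, y)
print(ridge.coef_, ridge.intercept_)   # coefficients are shrunk slightly toward zero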