Principal Component Analysis


Principal component analysis (PCA) is a linear algebra technique that creates new design features from the original design features, such that feature correlation or collinearity is removed. This is useful in regression analysis when collinearity exists and understanding the relationships among variables is important: collinearity can affect both the regression coefficients and the uncertainty of those coefficients, which can obscure relationships. In clustering analysis, PCA can be used either when collinearity exists or when the number of features or dimensions is large and dimension reduction is desired.

Formally, PCA is the diagonalization of the covariance matrix. The new design features are the eigenvectors of that matrix, and the eigenvalue associated with each eigenvector is the variance explained by that new feature. Below, we will work through a simple example to develop some intuition and then follow it up with an intermediate example in another blog post.

Simple Example

Create Some Data

Let’s set up a simple linear regression example with two ‘independent’ variables, or features, and one dependent variable, or outcome. We put quotes around the word independent because we will impose a strong linear correlation between them.

import numpy as np
from scipy.stats import pearsonr

# create dataset...

# number of independent data points or rows...
_rows = 50
_col = 2
# parameters for white noise...
_bias = 0
_scale = 0.1

# create feature variables
X = np.random.random(size=_rows*_col).reshape(_rows, _col)
X[:,1] = X[:,0] + np.random.normal(loc=_bias, scale=_scale, size=_rows)

# create coefficients...
beta = np.random.random(size=X.shape[1])

# transpose X to get the correct design matrix
X = X.transpose()

# calculate dependent observations...
Y = np.dot(beta, X)
# add white noise to observations...
Y += np.random.normal(size=_rows, loc=_bias, scale=_scale)

Let’s take a quick look at our simple dataset…

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# create 3D scatter plot
ax.scatter(X[0], X[1], Y, color='r', marker='o')


# create 2D projections
ax.plot(X[0], Y, 'b.', zdir='y', zs=1)
ax.plot(X[1], Y, 'b.', zdir='x', zs=1)
ax.plot(X[0], X[1], 'ko', zdir='z', zs=-0.5, alpha=0.25) 

ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.0) 
ax.set_zlim(-0.5, 2.5)


Measuring Collinearity and Correlation

The above plot shows y = f(x1, x2) in red, along with the projections of the data onto the 2D planes in blue. The correlation between the two supposedly independent variables is clearly visible in the gray projection. We can calculate the linear correlation coefficient…

from scipy.stats import pearsonr

print pearsonr(X[0], X[1])


(0.9458890134715866, 4.1675937498103316e-25)

The Pearson’s R value is 0.946 with a corresponding p-value of approximately zero, i.e. strong correlation with essentially zero probability of a type-I error.

Manual PCA


Let’s use PCA to see if we can reduce the dimensionality of this problem from 2 features to 1 feature. We will diagonalize the covariance matrix of X and inspect both the eigenvalues and the eigenvectors. Notice that we call np.linalg.eigh() and not np.linalg.eig(): one, because we can (the covariance matrix is always symmetric), and two, because eigh() returns the eigenvalues, and their eigenvectors, sorted in ascending order.

cov_X = np.cov(X)
eigenvalues, eigenvectors = np.linalg.eigh(cov_X)
print 'eigenvalues: ', eigenvalues
print 'eigenvectors: ', eigenvectors


eigenvalues:  [0.00481936  0.17493158]

eigenvectors:  [[-0.74127979 0.67119615]
                [ 0.67119615 0.74127979]]

The above eigenvalues, 0.0048 and 0.1749, are significantly different — orders of magnitude different — which is expected for the example. What does this mean?  The larger eigenvalue, 0.1749, is the variance explained by the second eigenvector.  The second eigenvector explains most of the total variance.  The first eigenvector has an eigenvalue of 0.0048 and explains a trivial amount of the total variance.  Thus, we should be able to disregard the first eigenvector and still predict the dependent variable with minimal additional error.
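The "variance explained" interpretation can be made concrete with a few lines of numpy; this sketch simply re-uses the eigenvalues printed above (exact values will vary from run to run):

```python
import numpy as np

# eigenvalues from the diagonalization above
eigenvalues = np.array([0.00481936, 0.17493158])

# fraction of the total variance explained by each eigenvector
explained = eigenvalues / eigenvalues.sum()
print(explained)  # ~[0.027, 0.973]
```

The second eigenvector accounts for roughly 97% of the total variance, which is why the first can be safely discarded.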

What do these new eigenvectors — let’s call them Z — look like? E is the matrix of eigenvectors from the diagonalization of the covariance matrix.

Z = E · X


Z1 = E11 · X1 + E12 · X2
Z2 = E21 · X1 + E22 · X2

Or let’s calculate them explicitly and re-plot the data with the new independent variables,

# transform into the new independent variables...
Z = np.dot(eigenvectors, X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(Z[0], Z[1], Y, color='r', marker='o')
ax.plot(Z[0], Y, 'b.', zdir='y', zs=2.5)
ax.plot(Z[1], Y, 'b.', zdir='x', zs=-0.5)
ax.plot(Z[0], Z[1], 'ko', zdir='z', zs=-0.5, alpha=0.25)


ax.set_xlim(-0.5, 2.5) 
ax.set_ylim(-0.5, 2.5)
ax.set_zlim(-0.5, 2.5)


Notice that the major fraction of the total variance is along the Z2 axis, and that Z1 is simply white noise.  We have successfully transformed the design features using PCA and we can now explain the observational data without using the Z1 data.

Check New Vectors for Collinearity

We can also recalculate the linear correlation between Z1 and Z2 to double check that the collinearity has been removed.

print pearsonr(Z[0], Z[1])


(-7.0718568840985767e-17, 1.0)

And the Pearson’s R correlation coefficient is zero, again confirming that PCA has removed the linear correlation between the independent variables.

Finishing Up

After removing features that do not contribute significantly to the variance, we would next perform linear regression and develop a predictive model. Let’s fit Z2 and Y using ordinary least squares and then evaluate our predictions of Y.  In general, we should split the data in order to cross-validate; however, in this example, it’s impossible to overfit. In the output below, we can see that the R-squared value is 0.993, which is to be expected for such a simple example.

# OLS Regression
import statsmodels.api as sm
sm.OLS(Y, Z[1]).fit().summary()


OLS Regression Results
Dep. Variable: y R-squared: 0.993
Model: OLS Adj. R-squared: 0.993
Method: Least Squares F-statistic: 6753.
Date: Fri, 11 Aug 2017 Prob (F-statistic): 3.69e-54
Time: 13:54:32 Log-Likelihood: 48.327
No. Observations: 50 AIC: -94.65
Df Residuals: 49 BIC: -92.74
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 1.2777 0.016 82.176 0.000 1.246 1.309

Multicollinearity and Clustering


Clustering, or unsupervised learning, techniques are powerful methods for finding underlying patterns that may not be obvious, leading to, for example, customer segments that can be used in future marketing campaigns. These algorithms can, in general, be broken down into centroid-based, distribution-based, and density-based methods. Each algorithm or class of algorithms excels under certain circumstances. A good 2-dimensional visualization of these algorithms can be found on the scikit-learn clustering page.

One similarity among most of these algorithms is that pair-wise distance metrics are computed to quantify the closeness, or similarity, of independent data points. The smaller the distance metric between two data points, the more similar those data points appear.
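A minimal sketch of such a distance computation, using scipy’s pdist on three hypothetical points (the points are illustrative only):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# three hypothetical data points, two features each
points = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [3.0, 4.0]])

# condensed vector of pairwise Euclidean distances,
# expanded into the full symmetric distance matrix
D = squareform(pdist(points, metric='euclidean'))
print(D[0, 1], D[0, 2])  # 1.0 5.0 -- point 0 is more similar to point 1
```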

Effect of Multicollinearity

The existence of strong correlation between or amongst the features of a design matrix can lead to non-optimal clustering results. Why might this happen? Well, as discussed above, the clustering algorithms measure a distance metric or similarity between data points, which is in turn used to create groups or clusters of similar data points. When two or more features are highly correlated, those features have a stronger influence on the distance calculation than they should and can affect the grouping.

Multicollinearity should be removed from the design matrix prior to clustering. Additionally, the features of the design matrix should be standardized.  A non-standardized design matrix will also lead to non-optimal clustering results, in which one or more of the features dominates the distance calculations. On a side note, similar issues occur in convex optimizations, such as gradient descent.
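A minimal standardization sketch, z-scoring each feature by hand (sklearn’s StandardScaler performs the same transformation):

```python
import numpy as np

rng = np.random.RandomState(0)
# hypothetical design matrix: feature 0 spans ~[0, 1], feature 1 spans ~[0, 1000]
X = np.column_stack([rng.random_sample(100), 1000 * rng.random_sample(100)])

# z-score standardization: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# both features now have zero mean and unit variance,
# so neither dominates a distance calculation
print(X_std.mean(axis=0), X_std.std(axis=0))
```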

One method to remove multicollinearity from data is via Principal Component Analysis (PCA).  This technique creates new features from linear combinations of the original features, such that the collinearity is removed.  Typically, one or more of the new features is predominately noise and can be removed from the transformed design matrix; this is why PCA is referred to as a feature reduction technique.

Linear Correlation and The Covariance Matrix


For various exploratory efforts, understanding variable correlation is a necessary analysis step.  Measuring correlation helps to build intuition about the problem at hand and can be used to identify simple multicollinearity if present.

The covariance matrix is the basis for understanding linear correlation. Each element of the matrix is defined as the following, where E[B] is the expectation value for the vector B and < B > is the mean value of the vector B.

Σij = Cov(Xi, Xj) = E[(Xi - <Xi>) (Xj - <Xj>)]

Generate Dataset

Below we generate a 2D dataset of random numbers, each vector of length 10.

import numpy as np
X = np.random.normal(0, 1, 20).reshape(2, 10)

Calculating Covariance

Let’s begin by calculating the Σij ourselves using the equation above…

# Cov(0,0)
i, j = 0, 0
Cov_00 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print Cov_00

i, j = 0, 1
Cov_01 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print Cov_01

i, j = 1, 1
Cov_11 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print Cov_11



Alternatively, both the numpy and pandas libraries have either built-in functions or methods for calculating covariance.

# using the numpy.cov() function
print np.cov(X)

# using the pandas.DataFrame.cov() method
import pandas as pd
df_X = pd.DataFrame(X.transpose())
print df_X.cov()


# from numpy
[[ 1.44065095  0.0642379 ]
 [ 0.0642379   0.37446063]]

# from pandas
          0         1
0  1.440651  0.064238 
1  0.064238  0.374461

One should immediately notice that the covariance values from numpy and pandas disagree with our initial calculation — Σ00 = 1.44 for numpy and pandas, versus Σ00 = 1.29 for our calculation.  What could be the cause of this difference? Both numpy and pandas calculate the unbiased estimate of the covariance, while we naively calculated the biased estimate.  The covariance is a measurement of the population, and the data we have is a sample of that population.  For large sample sizes, the covariance of the sample approaches that of the population.  For small sample sizes, the biased estimate will, on average, underestimate the population covariance, hence the term biased.

How do we calculate the unbiased covariance? In our calculation above, we used the mean as the expectation value, which yielded biased covariance values. The unbiased value is calculated by dividing by n-1, instead of n…

Σij = Cov(Xi, Xj) = 1/(n-1) · Σn (Xi,n - <Xi>) · (Xj,n - <Xj>)

More information on biased and unbiased estimators can be found on Wikipedia.

Interpretation of Covariance

The covariance matrix Σ is a symmetric matrix that provides information about how one independent variable varies with another, i.e. correlation. If two variables are truly independent, then their covariance will be close to zero. If the two variables tend to increase together, then the covariance will be positive. Likewise, if one variable tends to decrease as the other increases, then the covariance will be negative.
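A small synthetic demonstration of these three cases (the variables here are hypothetical, constructed only to show the sign of the covariance):

```python
import numpy as np

rng = np.random.RandomState(42)
a = rng.normal(size=1000)

b_pos = a + 0.1 * rng.normal(size=1000)   # tends to increase with a
b_neg = -a + 0.1 * rng.normal(size=1000)  # tends to decrease as a increases
b_ind = rng.normal(size=1000)             # truly independent of a

print(np.cov(a, b_pos)[0, 1])  # positive
print(np.cov(a, b_neg)[0, 1])  # negative
print(np.cov(a, b_ind)[0, 1])  # near zero
```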

The absolute size of a particular covariance element Σij depends both on the variability of the individual variables and on the degree of correlation between those variables. This obscures our ability to glance at the covariance values and quickly judge the degree of correlation.

Correlation via Pearson’s R Value

Linear correlation is measured by normalizing the covariance value by the standard deviations of the two variables.  This metric is referred to as Pearson’s R.

Rij = Cov(Xi, Xj) / ( σi · σj )

The magnitude of this number is now bounded between -1 and 1. Again, uncorrelated variables have values near zero, and variables with strong correlation approach +/-1, depending on whether the correlation is positive or negative.

In python we can calculate the Pearson’s R value with the help of the numpy library. Notice that we pass the ddof parameter a value of 1; this makes sure that numpy returns the unbiased estimate of the standard deviation.

S = np.cov(X)

R_00 = S[0][0] / (np.std(X[0], ddof=1) * np.std(X[0], ddof=1))
R_01 = S[0][1] / (np.std(X[0], ddof=1) * np.std(X[1], ddof=1))

print R_00
print R_01



Alternatively, pandas makes life much easier, assuming the data are represented as a pandas.DataFrame object.  There are different methods to calculate correlation; Pearson’s method calculates linear correlation.  Here we explicitly pass the method parameter the string value ‘pearson’.

print df_X.corr(method='pearson')


         0        1
0  1.00000  0.08746
1  0.08746  1.00000

Practical Calculations of Correlation and Covariances

With most of the below examples, distinguishing between biased and unbiased estimators becomes less important.  As datasets grow larger and larger, n and n-1 are for all practical purposes equal.

SQL (biased estimation):

SELECT AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB
FROM database.table

SQL with Group By (biased estimation):

SELECT category, AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB
FROM database.table GROUP BY category

SQL (unbiased estimation):

It is possible to calculate the unbiased covariance value…

SELECT
    COUNT(*) as n_datapoints,
    AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB
FROM database.table

then with paper and pencil…

covunbiased(A,B) = n / (n-1) · covbiased(A,B)
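The rescaling can be verified numerically; a quick sketch comparing the hand-rolled biased estimate against numpy’s default unbiased one:

```python
import numpy as np

rng = np.random.RandomState(1)
a = rng.normal(size=10)
b = rng.normal(size=10)
n = len(a)

# biased estimate, dividing by n (what the SQL query above computes)
cov_biased = np.mean(a * b) - np.mean(a) * np.mean(b)

# rescale by n / (n - 1) to recover the unbiased estimate
cov_unbiased = n / (n - 1.0) * cov_biased

# matches numpy's default (unbiased) covariance
print(cov_unbiased, np.cov(a, b)[0, 1])
```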

HiveQL (biased estimation):

select
    category, covar_pop(col_a, col_b) as cov_ab
from
    database.table group by category;

HiveQL (unbiased estimation):

select
    category, covar_samp(col_a, col_b) as cov_ab
from
    database.table group by category;

HiveQL, Pearson’s R:

select
    category, corr(col_a, col_b) as r_ab
from
    database.table group by category;


pySpark, covariance:

# using a dataframe object
cov_ab = df.stat.cov('col_a', 'col_b')

pySpark, Pearson’s R:

# using a dataframe object
r_ab = df.stat.corr('col_a', 'col_b')

# using two rdds
from pyspark.mllib.stat import Statistics
r_ab = Statistics.corr(rdd_a, rdd_b, method='pearson')



Multicollinearity and Regression

When independent variables in a multivariate regression model exhibit strong correlation, multicollinearity exists.  This is problematic when evaluating the estimates of the linear coefficients.  Multicollinearity creates an instability in the final estimates of the coefficients, and that instability appears as additional uncertainty in the estimates of the regression coefficients.  The concern is that the perceived increase in uncertainty of correlated variables can lead to the conclusion that those variables are statistically insignificant.

In the following sections, we will create several data sets with and without multicollinearity, explore the related consequences with respect to statistical significance, and discuss how to identify and mitigate multicollinearity.

Generation of Data

import matplotlib.pyplot as plt
# using jupyter notebook...
%matplotlib notebook
import numpy as np
import pandas as pd

First we generate a 10-dimensional dataset that has no multicollinearity by sampling from a standardized gaussian distribution.

data = {}
for i in range(10):
    data[''.join(['x', str(i)])] =\
        np.random.normal(loc=0, scale=1, size=500)

X = pd.DataFrame.from_dict(data)

And we plot the correlation matrix M, where Mij is the Pearson’s R Correlation Value for the ith and jth variables. Alternatively, we could also plot the covariance matrix, obtaining a similar visualization.

plt.imshow(X.corr(), vmin=-1, vmax=1, interpolation='nearest')
plt.title("Pearson's R Correlation Matrix")
plt.ylabel('independent variables')
plt.xlabel('independent variables')
plt.colorbar()

Correlation values can range from -1 to 1. Those extremes indicate perfect negative and positive correlation, respectively. Values near zero suggest that no correlation exists between variables.  The off-diagonal elements in the image below exhibit near-zero correlation values, leading us to believe that there is no multicollinearity in this dataset.


We can now introduce multicollinearity by redefining a few of the variables as linear combinations of others.

X.x7 = -1 * X.x3  # full linear dependence
X.x5 = 0.25 * X.x1 + 0.75 * X.x5  # partial linear dependence

Notice in the below correlation plot that M37 and M73 are -1.0, and that M15 and M51 are roughly 0.3.


How to Quantify the Severity of Collinearity: Variance Inflation Factor

The variance inflation factor (VIF) allows us to numerically quantify the degree of multicollinearity. The VIF parameter is calculated for each independent or feature variable, and deviations from unity suggest correlation with other variables.

For each variable, we solve a linear model where we try to explain the variable Xi in terms of the other n-1 independent variables.

Xi = β0 + Σj≠i βj · Xj

From there, we calculate the VIF using the R2 from the fit to the above equation.  If R2 is zero, i.e. none of the variance in Xi is explained by the other n-1 variables, then VIFi = 1.0.  As R2 increases from zero, VIFi increases to reflect that fact.

VIFi = 1 / ( 1 - Ri2 )

Let’s calculate the VIF values for a few of our variables from the data set above.

from sklearn import linear_model

_X = X.drop(['x5'], axis=1)
_y = X.x5

lm = linear_model.LinearRegression(fit_intercept=True, normalize=False), _y)
R2 = lm.score(_X, _y)
print R2, 1.0 / (1.0 - R2)

For X5, R52 is 0.1555, and the corresponding VIF5 value is 1.179, which suggests that multicollinearity will inflate the standard error of our regression coefficient estimates. The X1 feature has a similar VIF value of 1.182, which is expected, since X5 was constructed from X1.

For the X7 feature, R72 is 1.0 and the VIF7 value is infinite. Normally, we wouldn’t see perfect correlation because of real-world noise; however, very large VIF values are possible. The X3 VIF value is also infinite.

All other features in this dataset have VIF values floating just above unity.

for col_i in X.columns:, axis=1), X[col_i])
    print col_i, 1 / (1 - lm.score(X.drop([col_i], axis=1), X[col_i]))


x0 1.0085963855
x1 1.18205441212
x2 1.00560189773
x3 inf
x4 1.0108068519
x5 1.17863000174
x6 1.02102453797
x7 inf
x8 1.0079748323
x9 1.0168992924
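As an aside, statsmodels ships a helper that performs the same per-column regression. A sketch on hypothetical data mimicking the x1/x5 construction above (variable names are illustrative; note that this helper does not add an intercept, which is fine here since the features are centered at zero):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x5 = 0.25 * x1 + 0.75 * rng.normal(size=500)  # partial linear dependence
x9 = rng.normal(size=500)                     # independent feature
exog = np.column_stack([x1, x5, x9])

# VIF for each column: regress it on the remaining columns
vifs = [variance_inflation_factor(exog, i) for i in range(exog.shape[1])]
print(vifs)  # x1 and x5 float above unity; x9 stays near 1
```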

Generate y Dataset

Before we can explore how multicollinearity affects the standard error in our regression coefficients, we must generate some dependent observation values. To do this, we generate an array of beta coefficients, calculate X·Β, and introduce some gaussian-distributed errors.

betas = np.random.random(size=X.shape[1])
y =, betas) + np.random.normal(0, 1, size=X.shape[0])

The betas variable is an array of random numbers between [0, 1), and our noise is sampled from a standardized normal distribution centered about zero with a standard deviation of unity.

Solve for the Regression Coefficients

We can now solve for the regression coefficients using the ordinary least squares method.  More correctly, we should say that we can now calculate the unbiased estimates of the Β coefficients.  Unless you created your own data, as we have above, you never really know what the true coefficients are; we simply calculate estimates.

Β = (XTX)-1XTy

Let’s code up the above equation in python and solve for the regression coefficients…

Xt = X.transpose()
XtX =, X)
XtX_inv = np.linalg.inv(XtX)
betas_est =, Xt), y)

We run this and python returns a ‘LinAlgError: Singular matrix’ error.  What happened? The perfect multicollinearity between X3 and X7 creates a situation where the XTX matrix is not invertible, so we cannot solve the problem with this method.  We must instead use either a more robust analytic method, such as singular value decomposition, or a numerical solution.

Let’s create a slightly different data set without perfect multicollinearity and calculate our regression coefficients. Again, we create a correlation between the X5 and X1 variables.

X = pd.DataFrame.from_dict(data)
X.x5 = 0.25 * X.x5 - 0.75 * X.x1

y =, betas) + np.random.normal(0, 1, size=X.shape[0])
betas_est =, X)), X.transpose()), y)
# actual betas
print betas
# estimation of betas via regression
print betas_est


[ 0.87262849,  0.10874099,  0.00457641,  0.77028961,  0.10735083, 0.36264465,  0.84870342,  0.46439193,  0.56277177,  0.15009397]
[ 0.84553638,  0.40624893,  0.02285992,  0.75580565,  0.12319431, 0.65499404,  0.80318128,  0.42259867,  0.56750491,  0.14147711]

The above output shows the actual Β (top array) along with the calculated Β (bottom array). The β1 and β5 estimates are significantly inflated with respect to the actual values; this is a consequence of the collinearity. We can also see that, in this example, β2 is estimated to be larger than the actual value; however, we will see in the next section that β2 is statistically insignificant.

Inflation of Standard Error in the Regression Coefficients

Let’s first calculate the covariance matrix for the regression coefficients, asserting the typical ordinary least squares assumptions.

Σβ = (XTX)-1·σ2,

In the above expression, σ2 is the variance of the dependent variable. The definition of the variance is shown below, where yi is the ith observed value and ŷi is the ith predicted value.

σ2 = Σi (yi - ŷi)2 / N

Coding this up in python:

var_y = np.sum((y -, betas_est))**2) / X.shape[0]
covar = np.linalg.inv(, X)) * var_y

The variances of the regression coefficients are the diagonal elements of the covariance matrix. And the standard errors are the square root of the variances.

print np.sqrt(covar.diagonal())


[ 0.04288585  0.13963199  0.04381037  0.04239948  0.04348278  0.1806739, 0.04478235  0.04263621  0.04420973  0.04450978]

We see that β1 and β5 exhibit significantly higher uncertainties when compared to the other independent variables or features.  This is a result of the multicollinearity.

Fortunately, those standard errors are not large enough for us to incorrectly conclude that β1 and β5 are statistically indistinguishable from zero; however, that outcome is possible depending on the data.

Testing Statistical Significance of Regression Coefficients

Regression coefficients can be tested for statistical significance using a two-tailed t test with the null hypothesis, H: βi = 0, and, for simple regression, n-2 degrees of freedom.

Ti = βi / s(βi), where s(βi) is the standard error of βi

For β2, we have a test statistic of 0.02285992/0.04381037 = 0.52, which clearly fails the significance test. We know this without even calculating the p-value, because the standard error is larger than the estimated value.

For β4, we have a test statistic of 0.12319431/0.04348278 = 2.83, with a corresponding p-value of 0.0048, which is statistically significant. We reject H and conclude that β4 is non-zero, i.e. there is a statistically significant relationship between y and X4.
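The p-value quoted above can be reproduced with scipy’s t distribution; a quick sketch (n = 500 observations, as in the data generated earlier):

```python
from scipy import stats

n = 500         # observations in the dataset above
t_stat = 2.83   # test statistic for beta_4

# two-tailed p-value: probability of a statistic at least this extreme
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(p_value < 0.05)  # True -- reject the null hypothesis
```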

One can see that, with multicollinearity, there is the possibility that the inflated standard error pushes the test statistic down, such that we incorrectly conclude that there is no statistically significant relationship between y and Xi.

pyODBC and SQL

Getting Data with pyODBC

Moving data from a structured database via either a SQL query or a HiveQL query to a local machine is often desirable for deeper exploratory analysis. For small and medium sized data sets, transferring data straight to RAM is ideal, without the intermediate step of saving the query results to a file. pyODBC is a python library that enables database connections; it requires that the ODBC driver is installed, along with any other required database-specific drivers.

Below is an example of using pyODBC library with SQL Server:

import pyodbc

db_cnx = pyodbc.connect(r'''Driver={SQL Server};
                            Server=server_address;
                            Database=database_name;''')

db_cursor = db_cnx.cursor()
query = '''select * from database.table limit 100;'''
db_cursor.execute(query)
query_results = db_cursor.fetchall()

First we create a database connection object by instantiating the pyodbc.connect class with all the connection details. Here server_address is the actual address of your server, and database_name is the actual name of the database for the connection.

Then we create a cursor object and begin to use the execute method to run our queries. To return query results, we have to call the fetchall method, which returns a tuple of tuples back to python, i.e. query_results = ((row0), (row1), (row2), …).

Warning: There is the possibility of crashing your computer by completely filling the RAM.  When working with new data, I tend to watch the RAM utilization on the Activity Monitor during data pulls, killing the process if necessary.
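One way to bound memory use is to stream results in chunks with the DB-API fetchmany method instead of fetchall. A sketch using the standard library’s sqlite3 as a stand-in for pyodbc (both expose the same cursor interface, so the pattern carries over):

```python
import sqlite3

# in-memory database standing in for a real server connection
cnx = sqlite3.connect(':memory:')
cursor = cnx.cursor()
cursor.execute('create table demo (value integer)')
cursor.executemany('insert into demo values (?)', [(i,) for i in range(25)])

cursor.execute('select value from demo')
chunk_size, n_rows = 10, 0
while True:
    rows = cursor.fetchmany(chunk_size)   # at most chunk_size rows in RAM
    if not rows:
        break
    n_rows += len(rows)                   # stand-in for per-chunk processing
print(n_rows)  # 25
```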

Simple Database Interface Class

Below is a database interface class that I personally used for SQL Server queries before finding out that pandas has built-in support for SQL queries. Nearly all my exploratory work is done in python using pandas, thus this class was written to convert the returned rows into a pandas dataframe object by default.  The current version parses the query to get the correct column names, and does not support ‘select * …’ queries when the data_frame flag is set to True.

import pandas as pd
import pyodbc

class DatabaseInterface(object):
    def __init__(self, database, server):
        self.database = database
        self.server = server
        self.driver = '{SQL Server}'
        self.connection = pyodbc.connect('''Driver=%s;
                                            Server=%s;
                                            Database=%s;
                               ''' % (self.driver, self.server, self.database))
        self.cursor = self.connection.cursor()

    def execute(self, query, data_frame=True):
        self.cursor.execute(query)
        results = self.cursor.fetchall()

        if not data_frame:
            return results
        else:
            column_names = self._parse_query(query)
            return self._build_data_frame(results, column_names)

    def _build_data_frame(self, data, column_names):
        dictionary = {str(column): [] for column in column_names}
        for data_row in data:
            for i, data_point in enumerate(data_row):
                dictionary[str(column_names[i])].append(data_point)
        return pd.DataFrame.from_dict(dictionary)

    def _parse_query(self, query):
        parsed_string = query.split()
        column_names = []
        for string in parsed_string:
            string = string.strip().replace(',', '').replace('[', '').replace(']', '')
            if string.upper() == 'SELECT':
                continue
            elif 'FROM' in string.upper():
                if '*' in column_names:
                    raise Exception('cannot build dataframe with arbitrary column names...')
                return column_names
            else:
                # split removes table names from joined queries...
                column_names.append(string.split('.')[-1])
Example usage…

db = DatabaseInterface(my_database, my_server)
sql_query = '''select column_a, column_b from database.table'''
sql_results = db.execute(sql_query)

Practical Apache Hive


Apache Hive is an open-source big-data software project that allows SQL-style queries of data stored in the Hadoop file system.  Those SQL-style queries, or HiveQL queries, are converted into a collection of MapReduce jobs which are executed on the Hadoop cluster.

There are several options for the Hive execution engine that is responsible for running the collection of MapReduce jobs.  The simplest is the MapReduce engine, which executes the collection of jobs as traditional MapReduce. Apache Tez is another execution engine available for Hive queries.  The difference between these two engines, from a purely practical standpoint, is that Tez speeds up query execution in part by limiting unnecessary disk writes between MapReduce jobs, whereas the MapReduce engine always writes to disk after each MapReduce job. There are other differences between these two execution engines, but those details are beyond the scope of practical Hive. More recently, Hive queries can be executed using the Apache Spark engine; however, this requires that Spark be installed on the cluster.  Both Hive on Tez and Hive on Spark execute queries more quickly than the original MapReduce engine.

Basic Queries

select * from database.table limit 100;

This is one of the simplest HiveQL queries, where we have selected all the columns available in database.table and we only want the first 100 rows. Queries like this are great for getting a preview of the data structure and quality. This query should return almost instantaneously, since no MapReduce jobs are created.

select column_a, column_b from database.table limit 100;

Here we select only specific columns.  Again, this should return almost instantaneously… no MapReduce jobs are created.

select column_a, column_b from database.table where column_a = 'HIVE';

This HiveQL query uses a where clause to sub-set the data and only returns rows where the value in the column_a is equal to the string HIVE.  The where clause initiates the creation of MapReduce jobs, and thus this query will utilize an execution engine.  The execution time for this query will depend on the database.table size, the available hardware resources, and the execution engine.

set hive.execution.engine=tez;
select column_one, column_two from database.table
where
     (column_one = 'HIVE') and (column_two = 'TEZ');

Where clause conditions can be strung together as shown above.  Again, the where clause initiates the creation of MapReduce jobs. The indented structure and the parentheses are not necessary, but they do provide clarity as queries become increasingly complex.  Notice that an initial command was passed telling Hive to use the Tez execution engine.

select
     a.column_key as new_column_key,
     a.column_one as new_column_one,
     a.column_two as new_column_two,
     b.column_one as new_column_three,
     b.column_two as new_column_four
from
     (select column_one, column_two, column_key from database.table_47828) as a
inner join
     (select column_one, column_two, column_key from database.table_n3232) as b
on a.column_key = b.column_key;

Let’s break down the above query.

(select column_one, column_two, column_key from database.table_47828) as a

First we are selecting a sub-set of the columns available from database.table_47828, and creating a temporary alias, a, of that result, which we will reference shortly.

(select column_one, column_two, column_key from database.table_n3232) as b

Similarly, we select a sub-set of the columns in database.table_n3232 and alias that result as b.

from
    (select column_one, column_two, column_key from database.table_47828) as a
inner join
    (select column_one, column_two, column_key from database.table_n3232) as b
on a.column_key = b.column_key;

Next, we take the results from the individual queries and execute an inner join to combine rows that share a common value in the column_key column.  Since each table has a column named column_key in this example, we make use of the temporary aliases we created earlier.

select
     a.column_key as new_column_key,
     a.column_one as new_column_one,
     a.column_two as new_column_two,
     b.column_one as new_column_three,
     b.column_two as new_column_four

The final piece is requesting which columns to return and how to return them.  In this example, we rename each column using the as new_column_x syntax.  Notice that we again make use of the temporary table aliases when referencing columns.  This is necessary when tables have columns with the same name.

Left outer join, right outer join, and full outer join work similarly in Hive.  Further information on the syntax of Hive joins can be found in the Hive documentation.

Other Useful Queries

describe database.table;

This query returns a list of the available columns and the data type of each column.  If you use a graphical interface such as DbVisualizer, then this data is typically already available.

select distinct(column_one) from database.table;
select count(distinct(column_one)) from database.table;

The top query returns the unique values from column_one within database.table. Personally, I find this query useful for checking date ranges and looking for missing data within a given data set.  The bottom query counts the number of distinct values in a given column.
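The distinct pattern can be sketched in plain Python over hypothetical values:

```python
# hypothetical values in column_one
column_one = ['2017-01-01', '2017-01-02', '2017-01-01', '2017-01-03']

distinct_values = set(column_one)    # select distinct(column_one) ...
n_distinct = len(distinct_values)    # select count(distinct(column_one)) ...
```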

select column_one, count(*) from database.table group by column_one;

Alternatively, we can group the dataset by the unique values in column_one and return the number of rows or entries having each particular value.  This query shows the group by clause and a corresponding aggregation function, which is a very common query pattern.  There are many built-in aggregation functions in Hive that are useful.  The following is an abbreviated list:

sum(column), avg(column), var_pop(column), stddev_pop(column), corr(column_one, column_two), percentile_approx(column, percentile_target).
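As a sketch, the group-by-and-count pattern above maps onto Python's collections.Counter (hypothetical values):

```python
from collections import Counter

# hypothetical values in column_one
column_one = ['a', 'b', 'a', 'a', 'c']

# select column_one, count(*) from ... group by column_one
counts = Counter(column_one)
# counts maps each unique value to its number of rows
```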

     cast(regexp_replace(column_one, '[A-Z]', '') as int) as new_column

The above query demonstrates both the cast() function and the use of a regular expression in HiveQL.  First, the values from column_one are stripped of any uppercase alphabetic characters and then cast into the data type int.  This type of syntax is useful in a more complex query, such as an inner join where we need to modify a column prior to using it as the join key.
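A rough Python equivalent of that expression, using the standard re module on a made-up value:

```python
import re

# hypothetical raw value in column_one
raw_value = 'ID1234'

# regexp_replace(column_one, '[A-Z]', '') followed by cast(... as int)
new_column = int(re.sub('[A-Z]', '', raw_value))
```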

Linear Regression Basics


Linear Regression is a simple, yet powerful method for understanding linear relationships between variables in a collection of observations.  For a given ith observation, there is a set of n independent variables {x1, x2, x3, …, xn}i and one dependent variable yi, and the assumption is that yi is a linear function of the independent variables.

yi = f(xi) = β0 + β1·x1 + β2·x2 + β3·x3 + … + βn·xn = β0 + ∑j βj·xj

Each set {x1, …, xn}i is typically a row in a data set or design matrix X, and each coefficient βj is a measure of how that particular xj affects yi. We assume that all the observations have the same linear relationship, and thus share a common set of coefficients β. We can write the linear relationship more compactly in matrix form.

Y = X · β

Thus the goal here is to understand the relationships within the data, i.e. to calculate the coefficients β. This can be done either analytically or numerically. We will first discuss the analytical solution.

We start by multiplying both sides of the above equation by XT, the transpose of X.

XTY = XTX · β

Then multiply both sides by (XTX)-1, the inverse of XTX, which is defined such that (XTX)-1 · (XTX) = I, the identity matrix.

(XTX)-1 · XT Y = (XTX)-1 · (XTX) β
(XTX)-1 · XT Y = β

We now have an analytic solution for the set of coefficients β.

Example using sklearn

(caveat — sklearn.linear_model.LinearRegression calls np.linalg.lstsq which calls an LAPACK routine using Singular Value Decomposition (SVD), which is a more robust algorithm for solving this equation — to be covered more in the future…)

Let’s generate some fake data in python… and then fit it using the scikit-learn library.

import numpy as np
X = np.random.random(1000).reshape(500, 2)
beta = np.array([-1.25, 1.50])

X is now a design matrix with 500 rows with randomly assigned values for the two independent variables. And for the sake of generating fake data, we have assigned β1 and β2 to the values -1.25 and 1.50, respectively.

y =, beta)
y += np.ones_like(y)                            # constant offset beta_0 = 1
y += 0.1 * np.random.normal(0, 1, size=len(y))  # gaussian noise

First we calculate y, then add a constant offset β0 = 1, and finally throw in some random noise.

Plotting up our data…

import matplotlib.pyplot as plt
# using a Jupyter Notebook...
%matplotlib notebook
plt.title('example data')
plt.plot(X[:,0], y, '.')
plt.plot(X[:,1], y, '.')


from sklearn import linear_model
lm = linear_model.LinearRegression(fit_intercept=True, normalize=False), y)

Here we import the linear_model module from the sklearn library and create an instance of the LinearRegression class.  We also explicitly set the fit_intercept variable to True and the normalize variable to False.  Then we call the fit method to calculate the regression coefficients. Finally, we call the score method to calculate the R2 value of our fit.

print(lm.coef_)
print(lm.intercept_)
print(lm.score(X, y))

The resulting output should be something similar to array([-1.25472623, 1.50384261]) for the coefficients, which is within an acceptable tolerance to the known values of those coefficients. And for the intercept, our value is 1.101767 — again within an acceptable tolerance to the known value.  The R2 value is 0.97, suggesting that we have a great fit and that 97% of the variance is explained by the independent variables.

print(lm.predict([np.array([0.5, 0.5])]))

Now we have the power to predict new y values given an observation of x1 and x2. For example, assume both x1 and x2 are equal to 0.5; then the new y value is predicted to be array([ 1.22632577]).

Example Python Class for Analytical Calculation

(This python class is more or less just for instruction… most libraries will use a different analytical method to solve this problem that is more robust.)

Below is a python class that solves the above equation for the linear regression coefficients. The linear algebra occurs in the call to the _solve_ols method, where we calculate XTX (the XTX variable) along with its inverse (XTX)-1 (the XTX_i variable). The coefficients are then calculated as shown above, β = (XTX)-1 · XT Y.

This class also allows us to check whether the residuals, i.e. the noise, are gaussian distributed. And we also have a method calc_r_squared to return the R2 value.

import numpy as np
from scipy.stats import kstest, norm

class OLS(object):
    def __init__(self, X, y, fit_intercept=True):
        self.X = X
        self.y = y
        self.fit_intercept = fit_intercept
        self._coeff = None

    def fit(self, check_residuals=True, threshold=0.05):
        if self.fit_intercept:
            self._add_intercept()
        self._solve_ols()
        if check_residuals:
            print('checking residuals...')
            if self._check_residuals(threshold):
                print('...residuals are gaussian distributed at %3.2f...' % threshold)
            else:
                print('...residuals are not gaussian distributed...')

    def _add_intercept(self):
        '''add a column of 1s in the X matrix...'''
        self.X = np.insert(self.X, 0, np.ones_like(self.y), axis=1)

    def _solve_ols(self):
        '''matrix solution for OLS...'''
        XT = self.X.transpose()
        XTX =, self.X)
        XTX_i = np.linalg.inv(XTX)
        self._coeff =, XT), self.y)

    def _calculate_residuals(self):
        return self.y -, self._coeff)

    def _check_residuals(self, threshold=0.05):
        '''check residuals using ks_test for normality...'''
        residuals = self._calculate_residuals()
        mu, std = np.mean(residuals), np.std(residuals)

        def g_cdf(x):
            return norm.cdf(x, mu, std)

        # standard 2-sided ks test...
        t_stat, p_value = kstest(residuals, g_cdf)
        # returns True for gaussian noise
        return p_value > threshold

    def calc_r_squared(self, adjusted=True):
        '''returns the standard R2 value...'''
        n_obs, n_var = self.X.shape
        y_ = np.mean(self.y)
        p =, self._coeff)

        ss_t = np.sum(np.square(self.y - y_))
        ss_e = np.sum(np.square(self.y - p))

        r2 = 1.0 - ss_e/ss_t
        if adjusted:
            return 1.0 - (1.0 - r2) * ((n_obs - 1.0) / (n_obs - n_var))
        return r2

Scaling of the Analytic Linear Regression Solution

The analytic solution requires matrix transposition, matrix multiplication, and matrix inversion. The transposition step scales as O(mn), where we have m rows or observations and n independent variables. The first matrix multiplication step, XTX, scales as O(n2m).  The inversion of XTX scales as O(n3), assuming a basic inversion algorithm; other inversion algorithms exist that provide O(n2.373) performance (see wikipedia for more details). The next matrix multiplication, (XTX)-1 · XT, scales as O(n2m).  We see here that the analytic solution to linear regression is in general going to scale as O(n2m).  If for some reason n > m, then this solution will scale as O(n3).

For problems where n and m are very large, the computational cost of an analytic solution is prohibitive.  In those cases, numerical solutions such as gradient descent can be implemented.

Related Topics

Cross-validation should always be used to evaluate a model’s true predictive power. This is typically done by splitting the data into both a training set and a testing set.  Future posts will discuss this topic further.

There are several assumptions about the data we have implicitly made in this blog post.

  1. Linear regression assumes that each observation is independent of all others and that the noise can be modeled as a gaussian centered about zero with a constant variance.  If the noise does not have a constant variance (i.e. not homoscedastic), then either a non-linear transformation of the dependent variable or a weighted linear regression is necessary.
  2. We have assumed no multicollinearity, i.e. that the independent variables are not correlated with each other. If multicollinearity exists, then either regularization or principal component analysis should be considered.