Introduction

In many exploratory analyses, understanding the correlation between variables is a necessary step. Measuring correlation helps build intuition about the problem at hand and can be used to identify multicollinearity if it is present.

The covariance matrix is the basis for understanding linear correlation. Each element of the matrix is defined as follows, where E[B] denotes the expectation value of a vector B and <B> denotes its mean value.

Σij = Cov(Xi, Xj) = E[(Xi - <Xi>) (Xj - <Xj>)]

Generate Dataset

Below we generate a 2D dataset of random numbers: two vectors, each of length 10.

import numpy as np

# two rows of 10 samples drawn from a standard normal distribution
X = np.random.normal(0, 1, 20).reshape(2, 10)

Calculating Covariance

Let’s begin by calculating Σij ourselves using the equation above…

# Cov(0,0)
i, j = 0, 0
Cov_00 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print(Cov_00)

# Cov(0,1)
i, j = 0, 1
Cov_01 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print(Cov_01)

# Cov(1,1)
i, j = 1, 1
Cov_11 = np.mean((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j])))
print(Cov_11)

Output:

1.29658585406
0.0578141080647
0.337014567052

Alternatively, both the numpy and pandas libraries provide built-in functions or methods for calculating covariance.

# using the numpy.cov() function
print(np.cov(X))

# using the pandas.DataFrame.cov() method
import pandas as pd
df_X = pd.DataFrame(X.transpose())
print(df_X.cov())

Output:

# from numpy
[[ 1.44065095  0.0642379 ]
 [ 0.0642379   0.37446063]]

# from pandas
          0         1
0  1.440651  0.064238 
1  0.064238  0.374461

One should immediately notice that the covariance values from numpy and pandas disagree with our initial calculation: Σ00 = 1.44 for numpy and pandas versus Σ00 = 1.29 for our calculation. What is the cause of this difference? Both numpy and pandas calculate the unbiased estimate of the covariance, while we naively calculated the biased estimate. The covariance is a property of the population, and the data we have is only a sample of that population. For large sample sizes, the covariance of the sample approaches that of the population. For small sample sizes, the biased estimate will, on average, underestimate the actual population covariance, hence the term biased.
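As a quick sanity check, numpy's cov() function accepts a bias flag; passing bias=True normalizes by n instead of n-1 and should reproduce our hand-calculated values above.

# biased covariance matrix, normalized by n rather than n-1
print(np.cov(X, bias=True))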

How do we calculate the unbiased covariance? In our calculation above we used np.mean, which divides by n and therefore yields the biased values. The unbiased estimate is obtained by dividing the summed products of deviations by n-1 instead of n, where k below indexes the n samples…

Σij = Cov(Xi, Xj) = 1/(n-1) · Σk (Xi,k - <Xi>) · (Xj,k - <Xj>)
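We can verify this by hand, continuing with the X generated above (a minimal sketch; np.sum replaces np.mean here so that we control the denominator ourselves):

# unbiased covariance by hand: divide the summed deviation products by n-1
n = X.shape[1]
i, j = 0, 1
Cov_01_unbiased = np.sum((X[i] - np.mean(X[i])) * (X[j] - np.mean(X[j]))) / (n - 1)
print(Cov_01_unbiased)  # should match np.cov(X)[0, 1]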

More information on biased and unbiased estimators can be found on Wikipedia.

Interpretation of Covariance

The covariance matrix Σ is a symmetric matrix that describes how one variable varies with another, i.e. correlation. If two variables are truly independent, their covariance will be close to zero. However, if the two variables tend to increase together, the covariance will be positive. Likewise, if one variable tends to decrease as the other increases, the covariance will be negative.
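To make the sign behavior concrete, here is a small illustrative sketch (the names x, y_pos, and y_neg are ours, not part of the dataset above):

# toy example: positively and negatively correlated pairs
x = np.random.normal(0, 1, 1000)
noise = np.random.normal(0, 0.1, 1000)
y_pos = x + noise    # y tends to increase with x
y_neg = -x + noise   # y tends to decrease as x increases
print(np.cov(x, y_pos)[0, 1])  # positive, near +1
print(np.cov(x, y_neg)[0, 1])  # negative, near -1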

The absolute size of a particular covariance element Σij depends both on the variability of the individual variables and on the degree of correlation between them. This obscures our ability to glance at the covariance values and quickly judge the degree of correlation.

Correlation via Pearson’s R Value

Linear correlation is measured by normalizing the covariance value by the standard deviations of the two variables.  This metric is referred to as Pearson’s R.

Rij = Cov(Xi, Xj) / ( σi · σj )

This value is bounded between -1 and +1. Again, uncorrelated variables have values near zero, and strongly correlated variables approach +1 or -1 depending on whether the correlation is positive or negative.

In Python we can calculate Pearson's R value with the help of the numpy library. Notice that we pass the ddof parameter a value of 1; this ensures that the standard deviations are normalized by n-1, matching the unbiased covariance returned by np.cov.

S = np.cov(X)

R_00 = S[0, 0] / (np.std(X[0], ddof=1) * np.std(X[0], ddof=1))
R_01 = S[0, 1] / (np.std(X[0], ddof=1) * np.std(X[1], ddof=1))

print(R_00)
print(R_01)

Output:

1.0
0.087459875298
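As a side note, numpy can also produce the full matrix of Pearson's R values in a single call via corrcoef, which should agree with the values computed above:

# full Pearson correlation matrix in one call
print(np.corrcoef(X))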

Alternatively, pandas makes life much easier, assuming the data are represented as a pandas.DataFrame object. There are different methods for calculating correlation; Pearson's method measures linear correlation. Here we explicitly pass the method parameter the string value 'pearson'.

print(df_X.corr(method='pearson'))

Output:

         0        1
0  1.00000  0.08746
1  0.08746  1.00000

Practical Calculations of Correlation and Covariance

For most of the examples below, distinguishing between biased and unbiased estimators becomes less important. As datasets grow larger, n and n-1 become, for all practical purposes, equal.

SQL (biased estimation):

SELECT AVG(col_a * col_b) - AVG(col_a) * AVG(col_b) AS cov_AB
FROM database.table

SQL with Group By (biased estimation):

SELECT category, AVG(col_a * col_b) - AVG(col_a) * AVG(col_b) AS cov_AB
FROM database.table
GROUP BY category

SQL (unbiased estimation):

It is possible to calculate the unbiased covariance value…

SELECT
    COUNT(*) as n_datapoints,
    AVG(col_a * col_b) - AVG(col_a) * AVG(col_b) AS cov_AB
FROM
    database.table

then with paper and pencil…

cov_unbiased(A,B) = n/(n-1) · cov_biased(A,B)
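If you would rather skip the paper and pencil, the correction is a one-liner in Python (the numbers below are placeholders standing in for whatever the query returns):

# hypothetical query results: sample count and biased covariance
n_datapoints, cov_ab_biased = 10, 0.0578
cov_ab_unbiased = n_datapoints / (n_datapoints - 1) * cov_ab_biased
print(cov_ab_unbiased)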

HiveQL (biased estimation):

SELECT
    category, covar_pop(col_a, col_b) AS cov_ab
FROM
    database.table
GROUP BY category

HiveQL (unbiased estimation):

SELECT
    category, covar_samp(col_a, col_b) AS cov_ab
FROM
    database.table
GROUP BY category

HiveQL, Pearson’s R:

SELECT
    category, corr(col_a, col_b) AS r_ab
FROM
    database.table
GROUP BY category

pySpark:

# using a dataframe object
cov_ab = df.stat.cov('col_a', 'col_b')

pySpark, Pearson’s R:

# using a dataframe object
r_ab = df.stat.corr('col_a', 'col_b')

# using two rdds
from pyspark.mllib.stat import Statistics
r_ab = Statistics.corr(rdd_a, rdd_b, method='pearson')