### Introduction

For various exploratory efforts, understanding variable correlation is a necessary analysis step. Measuring correlation helps to build intuition about the problem at hand and can be used to identify simple multicollinearity if present.

The covariance matrix is the basis for understanding linear correlation. Each element of the matrix is defined as the following, where E[*B*] is the expectation value for the vector *B* and < B > is the mean value of the vector *B*.

Σ_{ij}= Cov(X_{i}, X_{j}) = E[(X_{i}- <X_{i}>) (X_{j}- <X_{j}>)]

### Generate Dataset

Below we generate a 2D dataset of random numbers, each vector of length 10.

import numpy as np X = np.random.normal(0, 1, 20).reshape(2, 10)

### Calculating Covariance

Let’s begin by calculating the Σ_{ij} ourselves using the equation above…

# Cov(0,0) i, j = 0, 0 Cov_00 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_00 #Cov(0,1) i, j = 0, 1 Cov_01 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_01 #Cov(1,1) i, j = 1, 1 Cov_11 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_11

Output:

1.29658585406 0.0578141080647 0.337014567052

Alternatively, both the *numpy* and *pandas* libraries have either built-in functions or methods for calculating covariance.

# using the numpy.cov() function print np.cov(X) # using the pandas.dataframe.cor() method df_X = pd.DataFrame(X.transpose()) print df_X.cov()

Output:

# from numpy [[ 1.44065095 0.0642379 ] [ 0.0642379 0.37446063]] # from pandas 0 1 0 1.440651 0.064238 1 0.064238 0.374461

One should immediately notice that the covariance values from *numpy* and *pandas* disagree with our initial calculation — Σ_{00} = 1.44 for numpy and pandas, and Σ_{00} = 1.29 for our calculation. What could be the cause of this difference? Both *numpy* and *pandas* calculate the **unbiased** estimation of the covariance, while we naively calculated the **biased** estimation of the covariance. The covariance is a measurement of the **population**. The data we have is a **sample** of that population. For large sample sizes, the covariance of the sample approaches that of the population. For small samples sizes, the biased estimation will always be smaller than the actual population covariance, hence the term biased.

How to calculate the unbiased covariance? In our above calculation, we used the mean as the expectation value, which yielded biased covariance values. The unbiased value is calculated by dividing by *n-1*, instead of *n…*

Σ_{ij}= Cov(X_{i}, X_{j}) = 1/(n-1) · Σ_{n}(X_{i,n}- <X_{i}>) · (X_{j,n}- <X_{i}>)

More information on biased and unbiased estimators can be found on wikipedia right here.

### Interpretation of Covariance

The covariance matrix Σ is a symmetric matrix, that provides information about how one independent variable will vary with another independent variable, i.e. correlation. If the two variables are truly independent, then the covariance will be close to zero. However, if the two variables tend to increase together, then the covariance will be a positive number. And likewise, if as one variable increases, the other variable tends to decrease, then the covariance value will be negative.

The absolute size of the a particular covariance element Σ_{ij} will depend on both on the variability of the individual variables and the degree of correlation of those variables. This tends to obscure our ability to quickly glance at the covariance values and understand the degree of correlation.

### Correlation via Pearson’s R Value

Linear correlation is measured by normalizing the covariance value by the standard deviations of the two variables. This metric is referred to as Pearson’s R.

R_{ij}= Cov(X_{i}, X_{j}) / ( σ_{i}· σ_{j})

The magnitude of this number is now bound between -1 and 1. Again, uncorrelated variables have values near zero. And variables with strong correlation will approach +/-1 depending on the whether the correlation is positive or negative.

In python we can calculate the Pearson’s R value with the help of the *numpy* library. Notice, that we have pass the *ddof *parameter a value of 1 — this makes sure that *numpy* returns an unbiased estimation of the standard deviation.

S = np.cov(X) R_00 = S[0][0] / (np.std(X[0], ddof=1) * np.std(X[0], ddof=1)) R_01 = S[0][1] / (np.std(X[0], ddof=1) * np.std(X[1], ddof=1)) print R_00 print R_01

Output:

1.0 0.087459875298

Alternatively, *pandas* makes life much easier, assuming the data are represented as a pandas.DataFrame object. There are different methods to calculate correlation, Pearson’s method calculates linear correlation. Here we explicitly pass the *method* variable a string value of ‘pearson’.

print df_X.corr( method='pearson')

Output:

0 1 0 1.00000 0.08746 1 0.08746 1.00000

### Practical Calculations of Correlation and Covariances

With most of the below examples, distinguishing biased and unbiased estimators becomes less important. As datasets grow larger and large, *n* and *n-1* are for all practical purposes equal.

#### SQL (biased estimation):

SELECT AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table

#### SQL with Group By (biased estimation):

SELECT category, AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table GROUP BY category

#### SQL (unbiased estimation):

It is possible to calculate the unbiased covariance value…

SELECT COUNT(*) as n_datapoints, AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table

then with paper and pencil…

cov_{unbiased}(A,B) = n / (n-1) · cov_{biased}(A,B)

#### HiveQL (biased estimation):

select category, covar_pop(col_a, col_b) as cov_ab FROM database.table group by category

#### HiveQL (unbiased estimation):

select category, covar_samp(col_a, col_b) as cov_ab FROM database.table group by category

#### HiveQL, Pearson’s R:

select category, corr(col_a, col_b) as r_ab FROM database.table group by category

#### pySpark:

# using a dataframe object cov_ab = df.stat.cov('col_a', 'col_b')

#### pySpark, Pearson’s R:

# using a dataframe object r_ab = df.stat.corr('col_a', 'col_b') # using two rdds from pyspark.mllib.stat import Statistics r_ab = Statistics.corr(rdd_a, rdd_b, method='pearson')