For various exploratory efforts, understanding variable correlation is a necessary analysis step. Measuring correlation helps to build intuition about the problem at hand and can be used to identify simple multicollinearity if present.
The covariance matrix is the basis for understanding linear correlation. Each element of the matrix is defined as the following, where E[B] is the expectation value for the vector B and < B > is the mean value of the vector B.
Σij = Cov(Xi, Xj) = E[(Xi - <Xi>) (Xj - <Xj>)]
Below we generate a 2D dataset of random numbers, each vector of length 10.
import numpy as np X = np.random.normal(0, 1, 20).reshape(2, 10)
Let’s begin by calculating the Σij ourselves using the equation above…
# Cov(0,0) i, j = 0, 0 Cov_00 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_00 #Cov(0,1) i, j = 0, 1 Cov_01 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_01 #Cov(1,1) i, j = 1, 1 Cov_11 = np.mean((X[i] - np.mean(X[i])*(X[j] - np.mean(X[j])) print Cov_11
1.29658585406 0.0578141080647 0.337014567052
Alternatively, both the numpy and pandas libraries have either built-in functions or methods for calculating covariance.
# using the numpy.cov() function print np.cov(X) # using the pandas.dataframe.cor() method df_X = pd.DataFrame(X.transpose()) print df_X.cov()
# from numpy [[ 1.44065095 0.0642379 ] [ 0.0642379 0.37446063]] # from pandas 0 1 0 1.440651 0.064238 1 0.064238 0.374461
One should immediately notice that the covariance values from numpy and pandas disagree with our initial calculation — Σ00 = 1.44 for numpy and pandas, and Σ00 = 1.29 for our calculation. What could be the cause of this difference? Both numpy and pandas calculate the unbiased estimation of the covariance, while we naively calculated the biased estimation of the covariance. The covariance is a measurement of the population. The data we have is a sample of that population. For large sample sizes, the covariance of the sample approaches that of the population. For small samples sizes, the biased estimation will always be smaller than the actual population covariance, hence the term biased.
How to calculate the unbiased covariance? In our above calculation, we used the mean as the expectation value, which yielded biased covariance values. The unbiased value is calculated by dividing by n-1, instead of n…
Σij = Cov(Xi, Xj) = 1/(n-1) · Σn (Xi,n - <Xi>) · (Xj,n - <Xi>)
More information on biased and unbiased estimators can be found on wikipedia right here.
Interpretation of Covariance
The covariance matrix Σ is a symmetric matrix, that provides information about how one independent variable will vary with another independent variable, i.e. correlation. If the two variables are truly independent, then the covariance will be close to zero. However, if the two variables tend to increase together, then the covariance will be a positive number. And likewise, if as one variable increases, the other variable tends to decrease, then the covariance value will be negative.
The absolute size of the a particular covariance element Σij will depend on both on the variability of the individual variables and the degree of correlation of those variables. This tends to obscure our ability to quickly glance at the covariance values and understand the degree of correlation.
Correlation via Pearson’s R Value
Linear correlation is measured by normalizing the covariance value by the standard deviations of the two variables. This metric is referred to as Pearson’s R.
Rij = Cov(Xi, Xj) / ( σi · σj )
The magnitude of this number is now bound between -1 and 1. Again, uncorrelated variables have values near zero. And variables with strong correlation will approach +/-1 depending on the whether the correlation is positive or negative.
In python we can calculate the Pearson’s R value with the help of the numpy library. Notice, that we have pass the ddof parameter a value of 1 — this makes sure that numpy returns an unbiased estimation of the standard deviation.
S = np.cov(X) R_00 = S / (np.std(X, ddof=1) * np.std(X, ddof=1)) R_01 = S / (np.std(X, ddof=1) * np.std(X, ddof=1)) print R_00 print R_01
Alternatively, pandas makes life much easier, assuming the data are represented as a pandas.DataFrame object. There are different methods to calculate correlation, Pearson’s method calculates linear correlation. Here we explicitly pass the method variable a string value of ‘pearson’.
print df_X.corr( method='pearson')
0 1 0 1.00000 0.08746 1 0.08746 1.00000
Practical Calculations of Correlation and Covariances
With most of the below examples, distinguishing biased and unbiased estimators becomes less important. As datasets grow larger and large, n and n-1 are for all practical purposes equal.
SQL (biased estimation):
SELECT AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table
SQL with Group By (biased estimation):
SELECT category, AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table GROUP BY category
SQL (unbiased estimation):
It is possible to calculate the unbiased covariance value…
SELECT COUNT(*) as n_datapoints, AVG(col_a *col_b) - AVG(col_a) * AVG(col_b) as cov_AB FROM database.table
then with paper and pencil…
covunbiased(A,B) = n / (n-1) · covbiased(A,B)
HiveQL (biased estimation):
select category, covar_pop(col_a, col_b) as cov_ab FROM database.table group by category
HiveQL (unbiased estimation):
select category, covar_samp(col_a, col_b) as cov_ab FROM database.table group by category
HiveQL, Pearson’s R:
select category, corr(col_a, col_b) as r_ab FROM database.table group by category
# using a dataframe object cov_ab = df.stat.cov('col_a', 'col_b')
pySpark, Pearson’s R:
# using a dataframe object r_ab = df.stat.corr('col_a', 'col_b') # using two rdds from pyspark.mllib.stat import Statistics r_ab = Statistics.corr(rdd_a, rdd_b, method='pearson')