Introduction

Clustering, a family of unsupervised learning techniques, is a powerful way to find underlying patterns that may not be obvious. These patterns can yield, for example, customer segments that can be targeted in future marketing campaigns. In general, these algorithms fall into centroid-based, distribution-based, and density-based methods. Each algorithm or class of algorithms excels under certain circumstances. A good 2-dimensional visualization of these algorithms can be found on the scikit-learn page.

One underlying similarity among most of these algorithms is that pairwise distance metrics are computed to quantify the closeness or similarity of individual data points. The smaller the distance between two data points, the more similar they appear.
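To make the idea of pairwise distances concrete, here is a minimal sketch using Euclidean distance from the Python standard library. The three "customers" and their feature values are made up for illustration, and Euclidean distance is only one of several metrics a clustering algorithm might use.

```python
import math

# Hypothetical customers described by (age, visits per month); values are invented.
points = {
    "a": (25.0, 4.0),
    "b": (27.0, 5.0),
    "c": (60.0, 1.0),
}

def pairwise_distances(pts):
    """Return {(name_i, name_j): Euclidean distance} for every unordered pair."""
    names = sorted(pts)
    return {
        (p, q): math.dist(pts[p], pts[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }

distances = pairwise_distances(points)

# The pair with the smallest distance is the most similar under this metric.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

Customers "a" and "b" end up closest, which matches the intuition that they have similar ages and visit frequencies.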

Effect of Multicollinearity

Strong correlation between or among the features of a design matrix can lead to non-optimal clustering results. Why might this happen? As discussed above, clustering algorithms measure a distance or similarity between data points, which is in turn used to create groups or clusters of similar data points. When two or more features are highly correlated, they carry largely the same information, so that information is effectively counted more than once in the distance calculation. Those features then have a stronger influence on the distances than they should, and this can affect the grouping.
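The effect can be seen in a small sketch. The points below are invented, and the "correlated" feature is an exact copy of the first feature, which is the most extreme case of collinearity. Adding the copy changes which point is nearest to `x`.

```python
import math

# Three hypothetical points with two features each.
x = (0.0, 0.0)
y = (3.0, 0.0)
z = (0.0, 4.0)

def add_copy_of_first_feature(p):
    """Append a perfectly correlated duplicate of the first feature."""
    return (p[0], p[1], p[0])

# With the original features, y is closer to x than z is (3 versus 4).
before = (math.dist(x, y), math.dist(x, z))

# With the duplicated feature, the first feature is counted twice,
# so y moves away from x while z stays where it was.
x3, y3, z3 = (add_copy_of_first_feature(p) for p in (x, y, z))
after = (math.dist(x3, y3), math.dist(x3, z3))

print("before:", before)
print("after: ", after)
```

The underlying data did not change, yet the nearest neighbor of `x` flips from `y` to `z` once the redundant feature is included.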

Multicollinearity should be removed from the design matrix prior to clustering. Additionally, the features of the design matrix should be standardized. A non-standardized design matrix will also lead to non-optimal clustering results, in which the features with the largest numeric ranges dominate the distance calculations. On a side note, similar scaling issues arise in gradient-based optimization, such as gradient descent, where poorly scaled features slow convergence.
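As a rough illustration of standardization, the sketch below rescales each column to zero mean and unit variance (a z-score). The income and age values are invented, NumPy is assumed to be available, and the helper name `standardize` is hypothetical. scikit-learn's `StandardScaler` performs the same kind of transformation.

```python
import numpy as np

# Hypothetical design matrix: columns are (annual income in dollars, age in years).
X = np.array([
    [30_000.0, 25.0],
    [32_000.0, 60.0],
    [90_000.0, 27.0],
    [88_000.0, 58.0],
])

def standardize(X):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

Z = standardize(X)

# Before scaling, rows 0 and 1 look close because income dwarfs the 35-year age gap.
raw_gap = np.linalg.norm(X[0] - X[1])
# After scaling, both features contribute on a comparable scale.
scaled_gap = np.linalg.norm(Z[0] - Z[1])

print(raw_gap, scaled_gap)
```

Without scaling, the distance between the first two rows is driven almost entirely by the 2,000 dollar income difference. After scaling, the large age difference is reflected as well.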

One method to remove multicollinearity from data is Principal Component Analysis (PCA). This technique creates new features from linear combinations of the original features, such that the new features are mutually uncorrelated and the collinearity is removed. Typically, one or more of the new features carries very little variance, is predominantly noise, and can be removed from the transformed design matrix; this is why PCA is referred to as a feature reduction technique.
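A minimal sketch of this idea follows, using a singular value decomposition in NumPy instead of a dedicated PCA class. The synthetic data, the random seed, and the 1% variance cutoff are all assumptions chosen for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic design matrix: the second column is nearly collinear with the first.
a = rng.normal(size=n)
X = np.column_stack([
    a,
    a + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize first so that no feature dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                       # data expressed in the new features
explained_ratio = S**2 / np.sum(S**2)   # share of total variance per component

# Drop components that explain almost no variance (cutoff is an arbitrary assumption).
keep = explained_ratio > 0.01
reduced = scores[:, keep]

print(explained_ratio.round(4))
print(reduced.shape)
```

The columns of `scores` are uncorrelated with one another, and the component that captured only the small difference between the two collinear columns is discarded, leaving a reduced matrix that can be passed to a clustering algorithm.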
