Introduction

Clustering is the task of grouping similar observations in a dataset. It is useful in business problems where you have to categorize similar customers to understand their needs and design a new offering that suits those needs. For instance, when a credit card company wants to offer new credit cards, it needs to understand how many types of customers there are and what their requirements are.

Clustering is an unsupervised learning technique, i.e. it does not need a target variable. All it requires is a set of features, and the algorithm will group the data points according to how similar their values are.

 

Distance Measure

 

Before learning clustering, we need to learn how the distance between two points is measured, as distance is the core criterion for forming clusters. Points that are less distant from each other will fall in the same cluster, and points that are more distant will fall in different clusters. There are multiple types of distance measures. Two of them are mentioned below:

  1. Euclidean distance: \(d_{euclidean} = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \)
  2. Manhattan distance: \(d_{manhattan} = \sum_{i=1}^N |x_i - y_i| \)

where x & y are N-dimensional vectors. In our case, x & y will be two data points with N features. The difference in each feature's value is used to compute the distance between the two data points. In the two-dimensional case below, there are only 2 features.

\[ \text{Euclidean Distance} = \sqrt{ (X_1 - X_2)^2 + (Y_1 - Y_2)^2 } \]
\[ \text{Manhattan Distance} = |X_1 - X_2| + |Y_1 - Y_2| \]
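As a quick illustration, below is a minimal sketch of both measures using NumPy; the points x and y are made-up two-dimensional values chosen just for this example.

```python
import numpy as np

# Two hypothetical data points with 2 features each
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(x - y))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```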

Linkage

 

Linkage is a way to determine how the distance between two clusters is defined. Three commonly used types of linkage are:

1. Ward: Ward measures distance on the basis of variance. Ward linkage defines the distance between two clusters as the amount by which the sum of squares (SS) increases when the two clusters are merged.

\[ \text{Ward Linkage} = SS_{12} - (SS_{1} + SS_{2}) \]

where

  • \(SS_{1}\) : sum of squares of cluster 1
  • \(SS_{2}\) : sum of squares of cluster 2
  • \(SS_{12}\) : sum of squares when cluster 1 & cluster 2 are combined

2. Complete: According to this linkage, the distance between two clusters is the distance between their two most distant observations.

3. Average: Average linkage defines the distance between two clusters as the average distance from each point in one cluster to every point in the other cluster.

At each step, the clustering algorithm merges the pair of clusters that minimizes the linkage criterion.
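To make the three linkages concrete, here is a minimal sketch using scikit-learn's AgglomerativeClustering, which exposes them through its linkage parameter; the tiny dataset is made up, with two clearly separated groups.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# A made-up dataset with two well-separated groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# "ward" merges the pair whose merge increases the sum of squares least;
# "complete" uses the most distant pair; "average" uses the mean pairwise distance
for linkage in ["ward", "complete", "average"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, labels)
```

On such clearly separated data all three linkages agree; on noisier data, the choice of linkage can change the resulting clusters.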

 

Types of clustering algorithms

 

There are many types of clustering algorithms. In this lesson, we will learn 3 types that are widely used in industry:

  1. Centroid Based
  2. Connectivity Based
  3. Density Based
 

Impact of Outliers

 

Clusters are very sensitive to outliers. If a value lies far from the rest of the data, it can significantly distort cluster formation. Let's understand this with a k-means clustering example. In the picture below, turning just one point into an outlier changes the clusters: two points move to other clusters because of their proximity to the new centroids. A single extreme outlier can pull a centroid away significantly.

[Figure: k-means clusters before and after one point becomes an outlier]
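You can reproduce the effect with a small sketch like the one below (the data is made up); after one point is moved far away, k-means reorganizes the clusters around it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two compact, made-up groups of points
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centroid per group, as expected

# Turn one point into an extreme outlier and refit
X_outlier = X.copy()
X_outlier[-1] = [50.0, 50.0]
kmeans_out = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_outlier)
print(kmeans_out.cluster_centers_)  # the outlier captures a centroid of its own,
                                    # forcing the two real groups into one cluster
```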

Standardization

 

Standardization is very important in clustering because features may be measured in entirely different units. A difference of 1 in the number of houses two people own may matter far more than a difference of 1000 dollars in their annual salaries, yet on the raw scale the salary difference dominates the distance calculation. If features are not standardized, clusters become elongated along the features with large variance (large spread). Algorithms such as k-means, which try to minimize the within-cluster sum of squares (i.e., within-cluster variance, also called inertia) and assume that clusters are convex and isotropic (round in all directions), tend to produce very poor clusters in that case.
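A common remedy is to standardize each feature to zero mean and unit variance before clustering, for example with scikit-learn's StandardScaler; the salary and house-count values below are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales:
# column 0 = annual salary in dollars, column 1 = number of houses owned
X = np.array([[40000.0, 1],
              [42000.0, 3],
              [90000.0, 1],
              [95000.0, 2]])

# Rescale each feature to zero mean and unit variance so that
# neither feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```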

 

Clustering Algorithm Performance Evaluation

 

There are multiple ways to check clustering quality. There are primarily two properties of good clusters:

  1. How close data points within a cluster are to each other
  2. How far data points are from the data points of other clusters
 

Silhouette Coefficient

The Silhouette Coefficient measures how well similar data points are clustered together and dissimilar data points are separated from each other. A higher coefficient value signifies better clustering.

\[ \text{Silhouette Coefficient for a single datapoint} = \frac{b - a}{\max(a, b)} \]

where

  • a: the mean distance between a data point and all other points in the same cluster.
  • b: the mean distance between a data point and all other points in the next nearest cluster.

The overall Silhouette Coefficient is calculated as the mean of the Silhouette Coefficients of all data points.
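In practice you rarely compute a and b by hand; here is a minimal sketch using scikit-learn's silhouette_score on a made-up dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# A made-up dataset with two well-separated groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean Silhouette Coefficient over all data points (ranges from -1 to 1)
print(silhouette_score(X, labels))
```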
 

Calinski-Harabasz Index

This is the ratio of the between-cluster variance to the within-cluster variance. A higher score signifies better clusters.

The mathematical formula is

\[ \text{Calinski-Harabasz Index} = \frac{\text{Total Variance}_{between}}{\text{Total Variance}_{within}} \times \frac{N-K}{K-1} \]
  • K = number of clusters
  • N = total number of data points
  • \(\text{Total Variance}_{within}\) = overall within-cluster variance (sum of the variances of all data points around their respective cluster centroids)
  • \(\text{Total Variance}_{between}\) = overall between-cluster variance (variance of the cluster centroids around the global centroid of all data points)
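scikit-learn implements this measure as calinski_harabasz_score; the sketch below reuses the same kind of made-up data as before.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Ratio of between-cluster to within-cluster dispersion; higher is better
print(calinski_harabasz_score(X, labels))
```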
 

How to determine the number of clusters

 

Determining the number of clusters depends upon the following criteria:

  1. Business Decision
  2. Quality of clusters
    1. Elbow Method
    2. Silhouette Coefficient Method
 

Business Decision

Sometimes, it is a business call how many clusters the algorithm should produce. If a credit card company wants (or only has the capacity) to launch 3 types of credit cards, the number of clusters will have to be 3, no matter what. At times, domain knowledge plays a key role in deciding the number of clusters.

 

Elbow Method

The sum of all within-cluster variances decreases as we increase the number of clusters. If we draw a graph of K vs. variance, it looks something like this:

[Figure: within-cluster variance vs. K]

Basically, the graph looks something like an elbow. The point beyond which the reduction in variance is no longer significant should be chosen as the final K value. We will draw this graph in the k-means algorithm chapter.
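As a preview, here is a minimal sketch of how such a graph can be produced, using synthetic data from scikit-learn's make_blobs; the inertia_ attribute of a fitted KMeans model is the within-cluster sum of squares.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 underlying groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()  # the elbow appears near K = 4
```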

 

Silhouette Coefficient Method

When we draw a graph of the Silhouette Coefficient against the number of clusters, it looks something like this:

[Figure: Silhouette Coefficient vs. K]

The K with the highest Silhouette Coefficient determines the final value. Notice that there is no fixed increasing or decreasing pattern in the graph; the coefficient may increase or decrease for subsequent K values.
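A minimal sketch of this method, again on synthetic make_blobs data, looks like the following; the Silhouette Coefficient is only defined for 2 or more clusters, so the loop starts at K = 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 underlying groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # highest score indicates the best K
```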

 

In the next 3 chapters, we will learn 3 different clustering algorithms.
