Why feature selection in clustering is important
Knowingly or unknowingly, we deal with groups all the time when we work with datasets. Companies do it when they want to segment their customers into different groups so that they can position their products. Universities are grouped according to their rankings, quality of research, etc. Employers group their employees into different categories when they want to evaluate the performance of their associates and identify good performers. Clustering can be thought of as a method (algorithm) for uncovering these groupings within datasets, and it is usually treated as a topic under unsupervised learning.
Regardless of the application domain, these groupings/clusterings are done with the help of some features/variables. It is not difficult to imagine that not all the features/variables are important when groupings are done. For example, for performance evaluation, it might be more relevant for an organization to group its employees based on features that are relevant to their jobs. So, to judge the performance of a person working in a sales department, it is important to evaluate how many sales the person closed in that year rather than putting too much focus on their driving skills (unless driving skill is necessary for that job).
Effect of unimportant features: When we work with data, we come across different types of features. A lot of times, features get included because of the data collection method. We know that different websites collect enormous amounts of information (features: age, location, etc.) about their customers; however, there is no way to tell in advance which of these features are important. So we inadvertently end up with datasets that contain unimportant features. The negative effect of unimportant features on supervised machine learning and clustering algorithms is well studied. For a simple demonstration, I have used a couple of examples. The first example is the well-known Iris dataset, which contains data for three species (3 clusters) of the iris plant. There are 150 observations with 4 measurements (features): Sepal Width, Sepal Length, Petal Width, and Petal Length. In the plot below, we can see that all 3 clusters are easily separable when we look along Petal Length and Sepal Length (right plot). However, when we use a different pair of features to look into the data, the red and the green clusters get mixed up.
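If you want to reproduce this view yourself, here is a minimal sketch using scikit-learn's built-in copy of the Iris dataset. The feature pair in the left panel (Sepal Width vs. Sepal Length) is my assumption for the "mixed up" view; the original plot may have used a different pair.

```python
# A minimal sketch of the Iris comparison described above,
# using scikit-learn and matplotlib (assumed to be installed).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # 150 observations, 4 features, 3 species

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: Sepal Width vs. Sepal Length -- two species overlap here.
axes[0].scatter(X[:, 1], X[:, 0], c=y)
axes[0].set_xlabel("Sepal Width")
axes[0].set_ylabel("Sepal Length")

# Right: Petal Length vs. Sepal Length -- the 3 clusters separate cleanly.
axes[1].scatter(X[:, 2], X[:, 0], c=y)
axes[1].set_xlabel("Petal Length")
axes[1].set_ylabel("Sepal Length")

plt.tight_layout()
plt.show()
```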
Now, it is not always wise to treat the labels of the observations as the natural clusters in the data. To be more certain about the negative effect of unimportant features, I generated a simulated dataset with four clusters and five features. Out of these five features, only two are important. Since this is a simulation, we control exactly which features carry the cluster structure.
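Here is one way such a dataset could be simulated with NumPy. The cluster centers and noise scales below are illustrative choices on my part, not the exact values behind the plots in this post.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_cluster = 50
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])  # 4 clusters in (X1, X2)

# Important features X1, X2: points scattered around distinct centers.
informative = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(n_per_cluster, 2)) for c in centers
])

# Unimportant features X3, X4, X5: pure noise, identical across clusters.
noise = rng.normal(loc=0.0, scale=2.0, size=(len(informative), 3))

X = np.hstack([informative, noise])              # shape (200, 5)
labels = np.repeat(np.arange(4), n_per_cluster)  # true cluster assignments
```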
As we can see, the four clusters are easily identifiable along the X1 and X2 features. However, along the other, unimportant features (X3, X4, X5), some of the clusters get mixed up with one another, making it hard to separate all four. In reality, there are likely to be many more features in the dataset. If we run a clustering algorithm on all the features, it is likely that the algorithm will be tricked by the presence of the unimportant ones and produce a suboptimal result. To put it into a business context: if a company segments its customer data hoping to find clusters, the presence of unimportant features might lead to wrong conclusions and, in turn, to poor decisions about resource allocation and prioritization. This can lead to substantial losses for the company.
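We can put a number on this claim. A quick check, assuming the X and labels arrays from the sketch above: run k-means on all five features and again on only the two informative ones, and score each against the true assignments with the adjusted Rand index (1.0 means a perfect recovery of the clusters).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

km_all = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
km_good = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X[:, :2])

print("ARI, all 5 features:", adjusted_rand_score(labels, km_all.labels_))
print("ARI, X1 and X2 only:", adjusted_rand_score(labels, km_good.labels_))
# With enough noise in X3-X5, the first score is typically the lower one.
```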
What can be done: It is not always possible to just plot the data like I did (it is not even recommended) and determine which of the features seem to be important. This is especially challenging when there are many features. We need a computational method so that an algorithm itself can identify and discard the unimportant features. Many such algorithms are available, and the right one depends on the use case.
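To make the idea concrete, here is one naive illustration (my own sketch, not a method from the papers below): a greedy wrapper that adds one feature at a time, keeping whichever addition yields the best silhouette score under k-means. It assumes the X array from the earlier simulation; real feature selection methods for clustering are considerably more sophisticated.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_feature_selection(X, n_clusters, n_features_to_pick):
    """Greedily pick the feature subset that clusters most cleanly."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features_to_pick:
        scored = []
        for f in remaining:
            cols = selected + [f]
            preds = KMeans(n_clusters=n_clusters, n_init=10,
                           random_state=0).fit_predict(X[:, cols])
            scored.append((silhouette_score(X[:, cols], preds), f))
        best_score, best_f = max(scored)  # keep the best-scoring addition
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

print(greedy_feature_selection(X, n_clusters=4, n_features_to_pick=2))
# On the simulated data above, this tends to pick X1 and X2 (indices 0, 1).
```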
The following papers are excellent resources on the different feature selection algorithms for clustering. They are mostly review papers, which means they will not necessarily go through each algorithm in detail; however, interested readers can always check the original papers they cite.
- “A review of clustering techniques and developments” (https://www.sciencedirect.com/science/article/abs/pii/S0925231217311815)
- “Feature Selection for Clustering: A Review” (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.8115&rep=rep1&type=pdf)