A new permutation-based feature selection algorithm for clustering

Muhammad Maruf Sazed
Jan 6, 2021 · 6 min read

Feature selection is an important topic in machine learning, and it is relevant for both supervised and unsupervised problems. There are probably thousands of articles on the internet covering different aspects of feature selection, so I am not going to repeat what feature selection is and why it is important. Most of those discussions, however, revolve around supervised machine learning. Feature selection is also relevant for unsupervised problems, including clustering; I recently wrote a short overview on why feature selection is important for clustering (https://marufsazed.medium.com/why-feature-selection-in-clustering-is-important-de4c5c907c29).

In this article, I will introduce a new feature selection algorithm that I worked on with my supervisor as part of my master’s research. The advantage of this new algorithm is that it is very flexible and can be used alongside different clustering algorithms. It is also easier to implement than some of the other feature selection algorithms available for clustering. The relevant R code can be found in the following GitHub repo: https://github.com/MuhammadMarufSazed/Permutation_feature_selection_in_clustering

Before I introduce the algorithm, I will try to explain the intuition behind the main idea. The idea is similar to what the random forest algorithm does to rank important features. However, random forest does that for supervised problems, where class labels are present. We implemented a similar idea for clustering, where class labels are not present.

For demonstration, let us consider the following toy dataset.

A toy example: a dataset with two features and a class label. There are two classes: toddler and grown-up.

There are two features, weight and height, in the dataset, and two classes: toddler and grown-up. Just by looking at the data we can see that the values for toddlers are quite different from those for grown-ups. The darker blue rows correspond to toddlers and the light blue row to grown-ups. Since we have a label in this case, we can use any machine learning algorithm to predict the class labels from the features weight and height. We can then calculate the prediction error by comparing the predictions to the given class labels (ideally we should split the data into training and test sets, but for simplicity we will not go into that). Now, if we randomly shuffle one of the feature columns (please refer to the following picture) and fit the same machine learning algorithm again, what can we expect to happen to the prediction error?

Shuffling a feature will have a negative impact on the prediction error.

It is reasonable to expect that the prediction error will get worse with this shuffled dataset. Even by looking at the shuffled dataset we can see that the “relationship” between the features and the class labels is distorted by the shuffling. I look at this approach as if we are working with a combination lock. There is only one arrangement of numbers that can open the lock, and that corresponds to the original dataset that we started with. We can shuffle one of the combination dials, or we can shuffle several of them. In either case, the correct combination is distorted, and we will not be able to open the lock.

Our approach is similar to how a combination lock works.

Back to the toy dataset: we can do the random shuffling multiple times, calculate the prediction error each time, and then take the mean of those prediction errors. We can do the same for every feature. At this point we assume that the mean prediction error will be highest for the feature that is the most important. This is a reasonable assumption, since when we shuffle a feature it essentially becomes noise. So, by turning the most important feature into noise, we lose a significant amount of information, which hurts the prediction algorithm the most. As a result, the prediction error will be worse than what we had with the original, unshuffled dataset. These mean prediction errors can therefore serve as an indication of how important each feature is to the algorithm.
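
Here is a minimal sketch of that supervised intuition in R. The data and the choice of classifier (a k-nearest-neighbour classifier from the class package) are mine, purely for illustration: one feature is informative, the other is pure noise, so shuffling the informative one should hurt the error much more.

library(class)  # provides knn(); ships with standard R installations

set.seed(1)
n <- 100
label <- factor(rep(c("A", "B"), each = n / 2))
dat <- data.frame(
  informative = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 4)),  # separates the classes
  noise       = rnorm(n)                                            # carries no class information
)

# Misclassification rate of a simple k-nearest-neighbour classifier
err <- function(d) mean(knn(train = d, test = d, cl = label, k = 5) != label)

err_orig <- err(dat)  # error on the original data

# Mean error after repeatedly permuting each feature in turn
perm_err <- sapply(names(dat), function(f) {
  mean(replicate(50, {
    d <- dat
    d[[f]] <- sample(d[[f]])  # turn this feature into noise
    err(d)
  }))
})

err_orig  # low, since both features are intact
perm_err  # much higher for 'informative' than for 'noise'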

For unsupervised problems we do not have class labels, but clustering algorithms optimize different objective functions. For example, k-means minimizes the sum of squared distances from the cluster means, while model-based clustering with Gaussian mixture models uses the log-likelihood or the Bayesian Information Criterion (BIC) as the objective. So, whichever clustering algorithm we decide to use, we can identify the features for which the mean value of the objective function over the shuffles differs considerably from the value of the objective function on the original dataset. This lets us rank the features by their importance.
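
For concreteness, this is what those objective values look like in R (a small sketch; it assumes the mclust package is installed and uses the built-in iris measurements purely as stand-in data):

library(mclust)  # provides Mclust()

X <- scale(iris[, 1:4])  # stand-in data, four numeric features

km <- kmeans(X, centers = 3, nstart = 25)
km$tot.withinss  # k-means objective: total within-cluster sum of squares

mb <- Mclust(X, G = 3)
mb$bic     # model-based clustering objective: BIC of the selected model
mb$loglik  # or the maximized log-likelihood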

I will now introduce the actual algorithm more formally. As I mentioned before, the key idea is to take any clustering algorithm (e.g. model-based clustering, k-means, spectral clustering, etc.) and apply it to the given dataset. The detailed steps are as follows:

Step 1: Apply the clustering algorithm to the original dataset and record the value of its objective function.

Step 2: Randomly shuffle each feature, one after the other, multiple times; apply the clustering algorithm to each shuffled dataset and record the value of the objective function. For each feature, take the mean of these values and calculate its relative absolute distance from the value obtained on the original dataset. We refer to this as the importance measure.

Step 3: Sort the features by the calculated importance measures.

Step 4: Select a number of features from the top of the ranking.
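
To make the steps concrete, here is a simplified sketch in R that uses k-means as the clustering algorithm and the total within-cluster sum of squares as the objective. The function name permutation_importance and its arguments are mine, for illustration only; the actual implementation is in the GitHub repo linked above.

permutation_importance <- function(X, k, n_perm = 30) {
  X <- as.matrix(X)

  # Step 1: cluster the original data and record the objective value
  obj_orig <- kmeans(X, centers = k, nstart = 25)$tot.withinss

  # Step 2: permute each feature n_perm times, re-cluster, and compute the
  # relative absolute distance of the mean objective from the original value
  importance <- sapply(seq_len(ncol(X)), function(j) {
    obj_perm <- replicate(n_perm, {
      Xp <- X
      Xp[, j] <- sample(Xp[, j])  # shuffle feature j only
      kmeans(Xp, centers = k, nstart = 25)$tot.withinss
    })
    abs(mean(obj_perm) - obj_orig) / abs(obj_orig)
  })
  names(importance) <- colnames(X)

  # Step 3: rank the features by the importance measure
  sort(importance, decreasing = TRUE)
}

# Step 4: from the ranked output, keep the top features (e.g. the top m,
# or those whose importance measure exceeds a chosen threshold).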

I implemented the algorithm on several simulated and real datasets, and it performed satisfactorily in the majority of these situations. The following figure shows the outcome for the body dataset from the R gclus library. I applied two types of clustering algorithms: model-based clustering with Mclust and k-means. We can see that the permutation feature selection algorithm ranks the features according to their importance. In this case Mclust seems to do a better job, as it is able to utilize the information in the dataset better than k-means. This example shows that we can substantially reduce the feature space (the number of features) by applying the feature selection algorithm. Obviously, the actual clustering algorithm plays an important role in the final outcome of this feature selection procedure.

Ranking of features for the body data set. The features marked in green are selected.
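
As a rough idea of how this could look in code, here is a hypothetical usage sketch that runs the k-means-based permutation_importance() function from the sketch above on the body data. The exact preprocessing and settings behind the figure may differ, and I am assuming the class label column in the gclus version of the data is named Gender.

library(gclus)  # assumes the package is installed
data(body)

X <- scale(body[, setdiff(names(body), "Gender")])  # drop the class label column
ranked <- permutation_importance(X, k = 2, n_perm = 30)
ranked  # features ordered by the importance measure

Note that this runs k-means many times (one fit per feature per permutation), so it can take a few minutes on this dataset.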

One of the challenges of this method is that it is computationally expensive: ranking the features takes many permutations and repeated clustering runs. So, if the clustering algorithm itself is computationally expensive, the whole feature selection procedure will take a long time. However, we can improve the running time by parallelizing the computations. The other challenge is that the permutation-based method can be influenced by the presence of highly correlated features, so we need to be careful when applying this technique to datasets with many correlated features.
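
As a sketch of what that parallelization could look like, the permutations for a given feature can be farmed out with the base parallel package. The variable names below are the same illustrative ones used in the earlier sketch. mclapply() forks worker processes, so it runs in parallel on Unix-like systems only; on Windows, makeCluster() plus parLapply() plays the same role.

library(parallel)

X <- scale(iris[, 1:4])  # stand-in data
j <- 1                   # index of the feature being permuted
k <- 3                   # number of clusters

obj_perm <- unlist(mclapply(1:30, function(i) {
  Xp <- X
  Xp[, j] <- sample(Xp[, j])  # permute feature j
  kmeans(Xp, centers = k, nstart = 25)$tot.withinss
}, mc.cores = max(1, detectCores() - 1)))

mean(obj_perm)  # the mean objective value that feeds the importance measure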
