IMPROVING THE ACCURACY OF IMBALANCED DATASET USING K-MEANS CLUSTERING

Adnan Saeed; Dr. Anwar Ali Sanjrani; Syed Khalid Shah Bukhari; Shabeer Ahmad

Keywords:

Accuracy improvement, imbalanced dataset, K-Means clustering

Abstract

Class imbalance is a significant obstacle in predictive modelling, frequently producing biased models that perform poorly on minority classes. Conventional classification methods tend to favour the majority class, leading to suboptimal recall and precision for the critical minority outcomes. To address this issue, this paper proposes a prediction approach that combines unsupervised K-Means clustering with supervised classification algorithms. The key idea is to capture underlying group-level behavioural patterns through clustering and then supply the resulting cluster labels as auxiliary features in the classification pipeline. The aim of this hybrid approach is to augment the feature space, enhance model sensitivity to the minority class, and ultimately improve overall predictive power. Experiments on two real-world customer churn datasets showed that models trained with cluster labels consistently performed better on all key performance metrics. Most notably, combining K-Means clustering with the K-Nearest Neighbours (KNN) classifier yielded a significant performance gain over the baseline models, in which each technique was used on its own. The proposed framework is an effective and practical way to mitigate the challenge of data imbalance in customer churn prediction and similar classification problems.
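
The hybrid idea described in the abstract can be illustrated with a minimal sketch in Python using scikit-learn. This is not the authors' implementation: the synthetic dataset, the number of clusters (4), the number of neighbours (5), the one-hot encoding of cluster labels, and the helper name `add_cluster_features` are all assumptions chosen for illustration, and the result on synthetic data says nothing about the results reported in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 10% minority class).
# This is a stand-in, NOT the churn datasets used in the paper.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# K-Means and KNN are both distance based, so scale the features first.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit K-Means on the training split only, to avoid leaking test information.
# n_clusters=4 is an assumed value; in practice it would be tuned.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train_s)


def add_cluster_features(X_scaled):
    """Append one-hot encoded K-Means cluster labels as auxiliary features."""
    labels = kmeans.predict(X_scaled)
    one_hot = np.eye(kmeans.n_clusters)[labels]
    return np.hstack([X_scaled, one_hot])


X_train_aug = add_cluster_features(X_train_s)
X_test_aug = add_cluster_features(X_test_s)

# Baseline KNN on the original features vs. KNN on the augmented features.
baseline = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
hybrid = KNeighborsClassifier(n_neighbors=5).fit(X_train_aug, y_train)

recall_baseline = recall_score(y_test, baseline.predict(X_test_s))
recall_hybrid = recall_score(y_test, hybrid.predict(X_test_aug))
print(f"Minority recall, KNN only:      {recall_baseline:.3f}")
print(f"Minority recall, K-Means + KNN: {recall_hybrid:.3f}")
```

One-hot encoding is used here because raw cluster indices have no meaningful order, and a distance-based classifier such as KNN would otherwise treat cluster 3 as "further" from cluster 0 than cluster 1 is. Whether the paper encodes the labels this way is not stated in the abstract.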