An Interpretable Measure of Dataset Complexity for Imbalanced Classification Problems

Jonatan Møller Nuutinen Gøttcke*, Colin Bellinger, Paula Branco, Arthur Zimek

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

The class imbalance problem is associated with harmful classification bias and presents itself in a wide variety of important applications of supervised machine learning. Measures have been developed to determine the imbalance complexity of datasets with imbalanced classes. The most common such measure is the Imbalance Ratio (IR). It is, however, widely accepted that the complexity of a classification task is the combined result of class imbalance and other factors, such as class overlap. Thus, in order to accurately assess the complexity of a problem, data complexity measures ought to account for more than the simple IR. In this paper, we demonstrate that IR has a weak correlation with classifier performance in terms of macro-averaged recall, G-mean score, and precision. Other, more complete measures, such as the adapted N1 and N3 measures, use neighborhood information to assess overlap. These measures show a strong negative correlation with classifier performance, but their reported values are hard to interpret. This motivates a new measure that estimates overlap complexity and returns a value with a clear interpretation. Here we propose such a measure based on the number of minority instances entangled in a Tomek Link. The proposed measure is evaluated on a large selection of synthetic and real datasets and is found to be as good as or better than the best competitors in terms of its negative correlation with mean classifier performance.
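
The abstract does not give the exact formula of the proposed measure, so the following is only a minimal sketch of the two quantities it mentions: the Imbalance Ratio, and one plausible reading of a Tomek-Link-based measure, namely the fraction of minority instances that take part in a Tomek Link. The function names, the Euclidean distance, and the normalisation by minority-class size are assumptions made for illustration, not the authors' definition.

```python
from collections import Counter
from math import dist


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def nearest_neighbor(points, i):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def minority_tomek_fraction(points, labels, minority):
    """Assumed measure: share of minority instances that belong to a Tomek Link."""
    linked = {k for pair in tomek_links(points, labels) for k in pair}
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    return sum(i in linked for i in minority_idx) / len(minority_idx)


# Toy 1-D dataset: class 0 is the majority, class 1 the minority.
points = [(0.0,), (1.1,), (2.0,), (3.0,), (3.4,), (10.0,), (10.3,)]
labels = [0, 0, 0, 0, 1, 1, 1]

print(imbalance_ratio(labels))
print(tomek_links(points, labels))
print(minority_tomek_fraction(points, labels, minority=1))
```

Under this reading the value is directly interpretable: in the toy data one of the three minority points sits next to a majority point and forms a Tomek Link with it, while the other two lie in a region of their own. The brute-force neighbor search is quadratic and intended only for small illustrations.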

Original language: English
Title of host publication: 2023 SIAM International Conference on Data Mining, SDM 2023
Number of pages: 9
Publisher: Society for Industrial and Applied Mathematics
Publication date: 2023
Pages: 253-261
ISBN (Electronic): 9781611977653
Publication status: Published - 2023
Event: 2023 SIAM International Conference on Data Mining, SDM 2023 - Minneapolis, United States
Duration: 27 Apr 2023 to 29 Apr 2023

Conference

Conference: 2023 SIAM International Conference on Data Mining, SDM 2023
Country/Territory: United States
City: Minneapolis
Period: 27/04/2023 to 29/04/2023
Sponsor: IBM Corp, SIAM Activity Group on Data Science
