Jonatan Møller Nuutinen Gøttcke*, Colin Bellinger, Paula Branco, Arthur Zimek
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
The class imbalance problem is associated with harmful classification bias and arises in a wide variety of important applications of supervised machine learning. Measures have been developed to determine the imbalance complexity of datasets with imbalanced classes. The most common such measure is the Imbalance Ratio (IR). It is, however, widely accepted that the complexity of a classification task is the combined result of class imbalance and other factors, such as class overlap. Thus, to assess the complexity of a problem accurately, data complexity measures ought to account for more than the simple IR. In this paper, we demonstrate that IR has a weak correlation with classifier performance in terms of macro-averaged recall, G-mean score, and precision. Other, more complete measures, such as the adapted N1 and N3 measures, use neighborhood information to assess overlap. These measures show a strong negative correlation with classifier performance, but their reported values are hard to interpret. This motivates a new measure that estimates overlap complexity and returns a value with a clear interpretation. Here we propose such a measure based on the number of minority instances entangled in a Tomek Link. The proposed measure is evaluated on a large selection of synthetic and real datasets and is found to be as good as or better than the best competitors in terms of its negative correlation with mean classifier performance.
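To make the two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes IR is the majority class count divided by the minority class count, and that a Tomek link is a pair of instances from different classes that are each other's nearest neighbor under Euclidean distance. The abstract does not give the exact formula of the proposed measure, so the normalization used here (the fraction of minority instances that take part in a Tomek link) is an assumption, and the function names `imbalance_ratio` and `tomek_link_minority_fraction` are hypothetical.

```python
import numpy as np


def imbalance_ratio(y):
    """Largest class count divided by smallest class count (assumed IR definition)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


def tomek_link_minority_fraction(X, y, minority_label):
    """Fraction of minority instances that take part in a Tomek link.

    Assumption: this normalization is illustrative only; the paper's exact
    definition may differ.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)

    nearest = dist.argmin(axis=1)
    index = np.arange(len(X))

    # Tomek link: mutual nearest neighbors that carry different labels.
    mutual = nearest[nearest] == index
    in_link = mutual & (y != y[nearest])

    return in_link[y == minority_label].mean()


# Toy data: five majority points (label 0) and two minority points (label 1).
X = np.array([[0.0], [1.1], [2.3], [3.6], [10.4], [10.0], [20.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1])

print(imbalance_ratio(y))
print(tomek_link_minority_fraction(X, y, minority_label=1))
```

In the toy data, one minority point sits next to a majority point while the other is isolated, so the two quantities describe different things: IR reflects only the class counts, whereas the Tomek-link fraction reflects how many minority instances lie directly against the other class. The pairwise distance matrix is quadratic in the number of instances, so a nearest-neighbor index would be the natural replacement for larger datasets.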
Original language | English |
---|---|
Title of host publication | 2023 SIAM International Conference on Data Mining, SDM 2023 |
Number of pages | 9 |
Publisher | Society for Industrial and Applied Mathematics |
Publication date | 2023 |
Pages | 253-261 |
ISBN (Electronic) | 9781611977653 |
DOIs | |
Publication status | Published - 2023 |
Event | 2023 SIAM International Conference on Data Mining, SDM 2023, Minneapolis, United States. Duration: 27 Apr 2023 → 29 Apr 2023 |
Conference | 2023 SIAM International Conference on Data Mining, SDM 2023 |
---|---|
Country/Territory | United States |
City | Minneapolis |
Period | 27/04/2023 → 29/04/2023 |
Sponsor | IBM Corp, SIAM Activity Group on Data Science |