Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem

Bibliographic Details
Title: Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem
Authors: Nam, Yunbi; Han, Sunwoo
Publication year: 2023
Collection: Computer Science; Statistics
Subject terms: Statistics - Machine Learning, Computer Science - Machine Learning, Statistics - Methodology
Description: Random Forest (RF) is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing is a well-known technique for dealing with the class imbalance problem. However, its effect on RF variable importance has not been actively studied. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class-imbalanced situations with small sample sizes, while under-sampling fails to differentiate important variables from non-informative ones. We then propose a variable selection algorithm that utilizes RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that our proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problems.
Comment: 20 pages, 3 figures
Document type: Working Paper
Access URL: http://arxiv.org/abs/2312.10573
Accession number: edsarx.2312.10573
Database: arXiv
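
Illustrative note: the description mentions two ideas, over-sampling the minority class before measuring importance and selecting variables using a confidence interval on their importance. The abstract does not give the paper's actual procedure, so the following is only a minimal sketch under stated assumptions. The function names `random_oversample` and `select_by_ci`, the normal-approximation interval, and the "lower bound above zero" rule are all hypothetical choices made for illustration. The importance scores are assumed to come from repeated RF fits done elsewhere.

```python
import random
import statistics


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (plain random over-sampling)."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


def select_by_ci(importance_runs, z=1.96):
    """Keep variables whose approximate confidence-interval lower bound
    is above zero (an assumed rule, not taken from the paper).

    importance_runs maps a variable name to a list of importance scores,
    one per repeated fit (at least two scores per variable).
    """
    selected = []
    for name, scores in importance_runs.items():
        mean = statistics.mean(scores)
        half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
        if mean - half_width > 0:
            selected.append(name)
    return selected
```

In use, one would balance the training data with `random_oversample`, fit a Random Forest several times on the balanced data, collect each variable's importance per fit, and pass those scores to `select_by_ci`.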