Academic Journal

Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

Bibliographic Details
Title: Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
Authors: Lanera C; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan, 18, 35131, Padova, Italy., Berchialla P; Department of Clinical and Biological Sciences, University of Torino, Torino, Italy., Sharma A; Department of Biological Sciences and Bioengineering, Indian Institute of Technology Kanpur, Kanpur, India., Minto C; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan, 18, 35131, Padova, Italy., Gregori D; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan, 18, 35131, Padova, Italy., Baldi I; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan, 18, 35131, Padova, Italy. ileana.baldi@unipd.it.
Source: Systematic reviews [Syst Rev] 2019 Dec 06; Vol. 8 (1), pp. 317. Date of Electronic Publication: 2019 Dec 06.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: BioMed Central Country of Publication: England NLM ID: 101580575 Publication Model: Electronic Cited Medium: Internet ISSN: 2046-4053 (Electronic) Linking ISSN: 20464053 NLM ISO Abbreviation: Syst Rev Subsets: PubMed not MEDLINE; MEDLINE
Imprint Name(s): Original Publication: London : BioMed Central
Abstract: Background: The growing number of medical literature and textual data in online repositories led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the outperforming strategy to screen articles in PubMed for inclusion in systematic reviews.
Methods: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and oversampling with 50:50 and 35:65 positive to negative class ratios and none as a benchmark. We used textual data of 14 systematic reviews as case studies. Difference between cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy.
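The resampling strategies named in the Methods can be illustrated with a minimal sketch. It assumes binary labels (1 = relevant citation, 0 = irrelevant) and uses only the Python standard library. The function name `resample_to_ratio` and its parameters are hypothetical and are not taken from the article. A 50:50 ratio corresponds to `positive_share=0.5` and a 35:65 ratio to `positive_share=0.35`. The abstract does not say at which point in the cross-validation the authors resampled; applying resampling to training folds only is a common precaution and is assumed here.

```python
import random
from typing import List, Sequence


def resample_to_ratio(
    labels: Sequence[int],
    positive_share: float,
    method: str,
    seed: int = 0,
) -> List[int]:
    """Return row indices of a randomly resampled training set.

    positive_share: target fraction of positives (0.5 for 50:50, 0.35 for 35:65).
    method: "under" drops random negatives; "over" duplicates random positives.
    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 0.0 < positive_share < 1.0:
        raise ValueError("positive_share must lie strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    if method == "under":
        # Keep every positive; sample negatives without replacement.
        n_neg = round(len(pos) * (1 - positive_share) / positive_share)
        n_neg = min(n_neg, len(neg))
        keep = pos + rng.sample(neg, n_neg)
    elif method == "over":
        # Keep every record; draw extra positives with replacement.
        n_pos = round(len(neg) * positive_share / (1 - positive_share))
        n_pos = max(n_pos, len(pos))
        keep = pos + rng.choices(pos, k=n_pos - len(pos)) + neg
    else:
        raise ValueError("method must be 'under' or 'over'")

    rng.shuffle(keep)
    return keep


# Illustrative (made-up) training fold: 7 relevant and 93 irrelevant abstracts.
fold_labels = [1] * 7 + [0] * 93
under_idx = resample_to_ratio(fold_labels, positive_share=0.35, method="under")
over_idx = resample_to_ratio(fold_labels, positive_share=0.5, method="over")
```

Undersampling shrinks the training set, while oversampling enlarges it, which is consistent with the abstract's remark that undersampling may be preferable from a computational perspective.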
Results: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was prevalently above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65.
Conclusions: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
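The pooling step mentioned in the Methods (fixed-effect meta-analysis of delta AUCs across systematic reviews) can be sketched with the standard inverse-variance estimator. The abstract does not state which estimator or software the authors used, so this is an assumption. The function name `fixed_effect_pool` is hypothetical, and the numbers in the usage lines are invented for illustration and are not results from the article.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_pool(
    deltas: Sequence[float],
    std_errors: Sequence[float],
) -> Tuple[float, float, Tuple[float, float]]:
    """Inverse-variance fixed-effect pooling of per-review delta AUCs.

    Returns the pooled estimate, its standard error, and a 95% confidence
    interval based on the normal approximation.
    """
    if len(deltas) != len(std_errors) or not deltas:
        raise ValueError("deltas and std_errors must be non-empty and equal in length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Each review is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, deltas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    z = 1.959963984540054  # 97.5th percentile of the standard normal
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Made-up delta AUCs (with minus without resampling) for three reviews.
example_deltas = [0.02, 0.04, 0.01]
example_ses = [0.01, 0.02, 0.015]
pooled, pooled_se, ci = fixed_effect_pool(example_deltas, example_ses)
```

Under this model every review is assumed to share one true delta AUC, so reviews with smaller standard errors dominate the pooled value.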
References: J Am Med Inform Assoc. 2017 Nov 1;24(6):1165-1168. (PMID: 28541493)
Clin Infect Dis. 2014 Jun;58(12):1649-57. (PMID: 24647016)
Res Synth Methods. 2018 Dec;9(4):602-614. (PMID: 29314757)
Syst Rev. 2015 Jan 14;4:5. (PMID: 25588314)
J Biomed Inform. 2014 Oct;51:242-53. (PMID: 24954015)
BMC Bioinformatics. 2010 Jan 26;11:55. (PMID: 20102628)
J Clin Epidemiol. 2017 Nov;91:31-37. (PMID: 28912003)
BMC Med Inform Decis Mak. 2018 Jun 25;18(1):46. (PMID: 29940927)
J Integr Bioinform. 2011 Sep 16;8(3):177. (PMID: 21926440)
Int J Med Inform. 2017 Jan;97:120-127. (PMID: 27919371)
J Clin Epidemiol. 2018 Nov;103:22-30. (PMID: 29981872)
J Med Internet Res. 2013 Jun 26;15(6):e122. (PMID: 23803299)
Contributed Indexing: Keywords: Classification; Indexed search engine; Machine learning; Text mining; Unbalanced data, systematic review
Entry Date(s): Date Created: 20191208 Latest Revision: 20200108
Update Code: 20221213
PubMed Central ID: PMC6896747
DOI: 10.1186/s13643-019-1245-8
PMID: 31810495
Database: MEDLINE
Description
ISSN: 2046-4053
DOI:10.1186/s13643-019-1245-8