Academic Journal

Simplifying Text Mining Activities: Scalable and Self-Tuning Methodology for Topic Detection and Characterization

Bibliographic Details
Title: Simplifying Text Mining Activities: Scalable and Self-Tuning Methodology for Topic Detection and Characterization
Authors: Evelina Di Corso, Stefano Proto, Bartolomeo Vacchetti, Paolo Bethaz, Tania Cerquitelli
Source: Applied Sciences, Vol 12, Iss 10, p 5125 (2022)
Publisher Information: MDPI AG, 2022.
Publication Year: 2022
Collection: LCC:Technology
LCC:Engineering (General). Civil engineering (General)
LCC:Biology (General)
LCC:Physics
LCC:Chemistry
Subject Terms: textual data, unsupervised learning, self-tuning algorithms, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Description: In recent years, the number and heterogeneity of large scientific datasets have been growing steadily. Moreover, the analysis of these data collections is not a trivial task. There are many algorithms capable of analyzing large datasets, but parameters need to be set for each of them. Moreover, larger datasets also mean greater complexity. All this leads to the need to develop innovative, scalable, and parameter-free solutions. The goal of this research activity is to design and develop an automated data analysis engine that effectively and efficiently analyzes large collections of text data with minimal user intervention. Both parameter-free algorithms and self-assessment strategies have been proposed to suggest algorithms and specific parameter values for each step that characterizes the analysis pipeline. The proposed solutions have been tailored to text corpora characterized by variable term distributions and different document lengths. In particular, a new engine called ESCAPE (enhanced self-tuning characterization of document collections after parameter evaluation) has been designed and developed. ESCAPE integrates two different solutions for document clustering and topic modeling: the joint approach and the probabilistic approach. Both methods include ad hoc self-optimization strategies to configure the specific algorithm parameters. Moreover, novel visualization techniques and quality metrics have been integrated to analyze the performances of both approaches and to help domain experts interpret the discovered knowledge. Both approaches are able to correctly identify meaningful partitions of a given document corpus by grouping them according to topics.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2076-3417
Relation: https://www.mdpi.com/2076-3417/12/10/5125; https://doaj.org/toc/2076-3417
DOI: 10.3390/app12105125
Access URL: https://doaj.org/article/62bf56d2b1f44231adbcba1b5accb89e
Accession Number: edsdoj.62bf56d2b1f44231adbcba1b5accb89e
Database: Directory of Open Access Journals