Academic Journal

MLSeq: Machine learning interface for RNA-sequencing data.

Bibliographic Details
Title: MLSeq: Machine learning interface for RNA-sequencing data.
Authors: Goksuluk D; Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey., Zararsiz G; Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey. Electronic address: gokmenzararsiz@hotmail.com., Korkmaz S; Department of Biostatistics, School of Medicine, Trakya University, 22030, Edirne, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey., Eldem V; Department of Biology, Faculty of Science, Istanbul University, 34452, Istanbul, Turkey., Zararsiz GE; Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey., Ozcetin E; Department of Industrial Engineering, Faculty of Engineering, Hitit University, 19030, Corum, Turkey., Ozturk A; Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey., Karaagaoglu AE; Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey.
Source: Computer methods and programs in biomedicine [Comput Methods Programs Biomed] 2019 Jul; Vol. 175, pp. 223-231. Date of Electronic Publication: 2019 Apr 29.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: Elsevier Scientific Publishers Country of Publication: Ireland NLM ID: 8506513 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1872-7565 (Electronic) Linking ISSN: 01692607 NLM ISO Abbreviation: Comput Methods Programs Biomed Subsets: MEDLINE
Imprint Name(s): Publication: Limerick : Elsevier Scientific Publishers
Original Publication: Amsterdam : Elsevier Science Publishers, c1984-
MeSH Terms: Machine Learning* ; Software* ; Sequence Analysis, RNA/*methods ; Algorithms ; Discriminant Analysis ; Gene Expression Profiling ; Humans ; Linear Models ; Poisson Distribution ; Programming Languages ; RNA ; Support Vector Machine
Abstract: Background and Objective: In the last decade, RNA-sequencing technology has become the method of choice, preferred over microarray technology for gene-expression-based classification and differential expression analysis because it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of algorithms and programs available for classifying RNA-sequencing data is limited. For this reason, we developed MLSeq to bring together both frequently used classification algorithms and novel approaches and to make them available for the classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the Bioconductor network.
Methods: Classification of RNA-sequencing data is not straightforward because raw data must be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) applying algorithms proposed directly for the RNA-sequencing data structure, or (ii) transforming RNA-sequencing data to bring it distributionally closer to the microarray data structure and then applying algorithms developed for microarray data. Moreover, we proposed novel algorithms through MLSeq, such as voom-based (voom is an acronym for variance modelling at the observational level) nearest shrunken centroids (voomNSC) and diagonal linear discriminant analysis (voomDLDA).
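The filtering and transformation steps mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it is not MLSeq code: the function names `filter_genes` and `log_cpm`, the `min_total` threshold, and the toy count matrix are all hypothetical. The transformation shown is the log2 counts-per-million formula commonly associated with voom, assumed here for illustration; the abstract itself does not spell out the formula.

```python
import math

def filter_genes(counts, min_total=10):
    """Keep genes (rows) whose total count across samples is at least min_total.

    min_total is a hypothetical threshold chosen only for this example.
    """
    return [row for row in counts if sum(row) >= min_total]

def log_cpm(counts):
    """Log2 counts-per-million transformation of a genes x samples count matrix.

    Assumed formula: log2((count + 0.5) / (library_size + 1) * 1e6),
    where library_size is the column sum of the matrix passed in.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2((row[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
         for j in range(n_samples)]
        for row in counts
    ]

# Toy raw counts: 4 genes (rows) x 3 samples (columns); values are made up.
raw = [
    [10, 20, 30],
    [0, 0, 1],
    [500, 400, 450],
    [90, 180, 119],
]

filtered = filter_genes(raw, min_total=10)  # drops the near-empty second gene
transformed = log_cpm(filtered)             # continuous values on a log2 scale
```

After this step the data are continuous rather than integer counts, which is the sense in which strategy (ii) moves RNA-sequencing data toward a microarray-like structure before a classifier is fitted.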
Materials: Three real RNA-sequencing datasets (i.e., cervical cancer, lung cancer and aging datasets) were used to evaluate model performance. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC) and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. The algorithms were compared using classification accuracy and sparsity on an independent test set.
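The two comparison criteria can be sketched as follows. This is an illustrative Python example, not the authors' evaluation code: the function names, class labels, and feature indices are hypothetical, and sparsity is assumed here to mean the fraction of all features that a classifier selects.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sparsity(selected_features, n_features):
    """Fraction of all features retained by a sparse classifier (lower = sparser)."""
    return len(set(selected_features)) / n_features

# Hypothetical held-out test labels and predictions.
y_true = ["tumor", "normal", "tumor", "tumor"]
y_pred = ["tumor", "normal", "normal", "tumor"]
acc = accuracy(y_true, y_pred)

# Hypothetical model that kept 3 of 50 candidate features.
frac_selected = sparsity([0, 3, 7], n_features=50)
```

Both quantities are computed on samples that were not used for model fitting, so they describe how a classifier generalizes rather than how well it fits the training data.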
Results: The algorithms based on discrete distributions performed better on the cervical cancer and aging data, with accuracies above 0.92. On the lung cancer data, most algorithms performed similarly, with accuracies of 0.88, except SVM, which achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, on the aging data, the sparse classifiers were not able to select an optimal subset of features.
Conclusion: MLSeq is a comprehensive and easy-to-use interface for the classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks through a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data.
(Copyright © 2019 Elsevier B.V. All rights reserved.)
Contributed Indexing: Keywords: Classification; Linear discriminant analysis; Negative Binomial; Poisson; RNA-Sequencing
Substance Nomenclature: 63231-63-0 (RNA)
Entry Date(s): Date Created: 20190521 Date Completed: 20191206 Latest Revision: 20191217
Update Code: 20221213
DOI: 10.1016/j.cmpb.2019.04.007
PMID: 31104710
Database: MEDLINE