Patterns in big data bioinformatics: Understanding complex diseases with interpretable machine learning
Title: | Patterns in big data bioinformatics: Understanding complex diseases with interpretable machine learning |
---|---|
Authors: | Garbulowski, Mateusz |
Contributors: | Komorowski, Jan, Dr.; Urbanowicz, Ryan J., Assistant Professor |
Source: | Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. |
Subject terms: | complex diseases, big data, machine learning, transcriptomics, life sciences, rough sets, Bioinformatics, Bioinformatik |
Description: | Alterations in the flow of genetic information may lead to complex diseases. Such changes are measured with various omics techniques that usually produce so-called “big data”. Using interpretable machine learning (ML), we retrieved patterns from transcriptomics data sets. Specifically, we employed rule-based ML to identify associations among features and a decision in a combinatorial manner, i.e. a co-prediction. We developed tools and methods that can be applied by a large community of bioinformaticians and demonstrated their usability through a variety of studies. In paper I, we developed the R.ROSETTA package, which provides an environment for rule-based ML relying on rough sets. R.ROSETTA is an R wrapper of the ROSETTA toolkit that extends its functions with various analytical solutions. The package was tested on a microarray gene expression case-control study of autism. The estimated models were highly accurate and provided lists of possible interactions among genes. Moreover, benchmarking revealed that R.ROSETTA was among the best-performing rule- and decision tree-based methods. In paper II, we applied R.ROSETTA together with the VisuNet package. We used both tools to perform a rule-based network analysis of autism spectrum disorder (ASD) subtypes, based on microarray gene expression measurements of ASD patients and controls from three data sets. We demonstrated that rule-based modelling is an efficient approach to merging multiple cohorts. Furthermore, we estimated centrality distances among the produced subnetworks, which revealed dissimilarities between ASD subtypes and controls. Finally, we discovered a highly probable interaction between the EMC4 and TMEM30A genes. In paper III, we applied our tools to an RNA-seq-based gene expression analysis of Acute Myeloid Leukemia (AML). We aimed at discovering gene expression patterns that differ between AML diagnosis and relapse. Specifically, we applied a rule-based network analysis to validate independent cohorts. Our study revealed that overexpressed CD6 and underexpressed INSR are highly co-predictive genes associated with AML relapse. Finally, we demonstrated arc diagrams as a novel way of visualizing co-predictors. In paper IV, we analyzed glioma grading by performing a comprehensive ML analysis of RNA-seq data sets. We extensively preprocessed the data sets and removed a strong batch effect that occurred between glioma grades. Afterwards, we performed an ML evaluation on single-sample gene set enrichment scores, which revealed the most accurate collections and annotations that distinguish glioma grades. Among others, we found cell cycle, Fanconi anemia and cholesterol-related pathways associated with glioma progression. Finally, we discovered several co-enrichment mechanisms among annotations. |
File description: | electronic |
Access URL: | https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-455316 https://uu-se.zoom.us/j/68396034138 https://uu.diva-portal.org/smash/get/diva2:1600808/FULLTEXT01.pdf https://uu.diva-portal.org/smash/get/diva2:1600808/PREVIEW01.jpg |
Database: | SwePub |
ISBN: | 9151313073 9789151313078 |
---|---|
ISSN: | 16516214 |