Evolved Features for DNA Sequence Classification and Their Fitness Landscapes
Title: | Evolved Features for DNA Sequence Classification and Their Fitness Landscapes |
---|---|
Authors: | Suprakash Datta, Wendy Ashlock |
Source: | IEEE Transactions on Evolutionary Computation. 17:185-197 |
Publication info: | Institute of Electrical and Electronics Engineers (IEEE), 2013. |
Publication year: | 2013 |
Subject terms: | Sequence, Finite-state machine, Fitness landscape, Computer science, Evolutionary algorithm, Overfitting, Machine learning, Theoretical Computer Science, Random forest, Computational Theory and Mathematics, Genetic algorithm, Artificial intelligence, Cluster analysis, Software |
Description: | A key problem in genomics is the classification and annotation of sequences in a genome. A major challenge is identifying good sequence features. Evolutionary algorithms have the potential to search a large space of features and automatically generate useful ones. This paper proposes a two-stage method that generates features using multiple replicates of a genetic algorithm operating on an augmented finite state machine, called a side effect machine (SEM), and then selects a small diverse feature set using several methods, including a novel method called dissimilarity clustering. We apply our method to three problems related to transposable elements and compare the results to those using k-mer features. We are able to produce a small set of interesting and comprehensible features that create random forest classifiers more accurate and less prone to overfitting than those created using k-mer features. We analyze the SEM fitness landscapes and discuss the use of different fitness functions. |
ISSN: | 1941-0026 1089-778X |
Access URL: | https://explore.openaire.eu/search/publication?articleId=doi_________::8ffcf9cb58ab869adbc0f0a42378b507 https://doi.org/10.1109/tevc.2012.2207120 |
Rights: | CLOSED |
Accession number: | edsair.doi...........8ffcf9cb58ab869adbc0f0a42378b507 |
Database: | OpenAIRE |
---|
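
The abstract describes features generated by a side effect machine (SEM), an augmented finite state machine, without giving its details. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the "side effect" is a counter on each state that is incremented whenever the state is entered, and that the normalized counts form the feature vector. The names `sem_features`, `GC_MACHINE`, and `ALPHABET` are hypothetical, and the hand-written transition table stands in for one that a genetic algorithm would evolve.

```python
ALPHABET = "ACGT"  # assumed input alphabet; other symbols are skipped

def sem_features(seq, transitions, start=0):
    """Run a hypothetical side effect machine over seq.

    transitions[state][i] gives the next state on reading ALPHABET[i].
    Assumption: the side effect is a per-state visit counter, and the
    feature vector is the fraction of steps spent entering each state.
    """
    counts = [0] * len(transitions)
    state = start
    for base in seq:
        idx = ALPHABET.find(base)
        if idx < 0:
            continue  # ignore ambiguous symbols such as N
        state = transitions[state][idx]
        counts[state] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

# Hand-written two-state machine: state 1 is entered on C or G,
# state 0 on A or T, so the second feature equals the GC fraction.
GC_MACHINE = [
    [0, 1, 1, 0],  # from state 0: A->0, C->1, G->1, T->0
    [0, 1, 1, 0],  # from state 1: same targets
]

if __name__ == "__main__":
    print(sem_features("ACGT", GC_MACHINE))
```

Under these assumptions each machine with n states yields n numeric features per sequence, which could be passed to a classifier in the same way as k-mer frequency vectors.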