Academic Journal

Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach.

Bibliographic Details
Title: Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach.
Authors: Dlamini GS; IBM Research Johannesburg 2001 South Africa., Muller SJ; IBM Research Johannesburg 2001 South Africa., Meraba RL; IBM Research Johannesburg 2001 South Africa., Young RA; IBM Research Johannesburg 2001 South Africa., Mashiyane J; IBM Research Johannesburg 2001 South Africa., Chiwewe T; IBM Research Johannesburg 2001 South Africa., Mapiye DS; IBM Research Johannesburg 2001 South Africa.
Source: IEEE access : practical innovations, open solutions [IEEE Access] 2020 Oct 15; Vol. 8, pp. 195263-195273. Date of Electronic Publication: 2020 Oct 15 (Print Publication: 2020).
Publication Type: Journal Article
Language: English
Journal Info: Publisher: Institute of Electrical and Electronics Engineers Country of Publication: United States NLM ID: 101639462 Publication Model: eCollection Cited Medium: Print ISSN: 2169-3536 (Print) Linking ISSN: 21693536 NLM ISO Abbreviation: IEEE Access Subsets: PubMed not MEDLINE
Imprint Name(s): Original Publication: Piscataway, NJ : Institute of Electrical and Electronics Engineers, 2013-
Abstract: The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
(This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.)
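Illustrative note (not part of the original record): the abstract describes transforming genome sequences into dinucleotide relative frequencies before classification. The sketch below shows one plausible reading of that step, assuming "relative frequency" means the count of each of the 16 dinucleotides over overlapping windows divided by the total number of valid dinucleotides. The function name `dinucleotide_frequencies` and the handling of ambiguous bases (skipped) are assumptions for illustration; the paper's exact definition and preprocessing may differ.

```python
from itertools import product

BASES = "ACGT"
# All 16 possible dinucleotides in a fixed order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return a dict mapping each dinucleotide to its relative frequency.

    Hypothetical helper: counts overlapping pairs and skips any pair
    containing a character other than A, C, G, or T (for example N).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        # No valid dinucleotides: return an all-zero feature vector.
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}


# Toy input, not real genome data.
features = dinucleotide_frequencies("ACGT")
```

The resulting 16-value vector per genome could then serve as the feature row for a gradient-boosted classifier such as XGBoost, which is the model family the abstract names.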
Contributed Indexing: Keywords: Alignment-free sequence analysis; COVID-19; XGBoost; dinucleotide frequencies; feature representations; genomic signatures; human pathogens; machine learning
Entry Date(s): Date Created: 20220103 Latest Revision: 20220104
Update Code: 20240829
PubMed Central ID: PMC8675546
DOI: 10.1109/ACCESS.2020.3031387
PMID: 34976561
Database: MEDLINE