Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum.

Bibliographic Details
Title: Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum.
Authors: Madugula SS; Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States., Pujar P; Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States., Bharani N; Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States., Wang S; Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States., Jayasinghe-Arachchige VM; Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States., Pham T; Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas., Mashburn D; Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States., Artilis M; Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States., Liu J; Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States.; Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas.
Source: bioRxiv: the preprint server for biology [bioRxiv] 2024 Jan 23. Date of Electronic Publication: 2024 Jan 23.
Publication Type: Preprint
Language: English
Journal Info: Country of Publication: United States; NLM ID: 101680187; Publication Model: Electronic; Cited Medium: Internet; NLM ISO Abbreviation: bioRxiv; Subsets: PubMed not MEDLINE
Abstract: The recent development of CRISPR-Cas technology holds promise for correcting gene-level defects in genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that, assisted by a guide RNA, can edit the gene of interest. However, Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, which hinder their widespread application as gene-editing tools. There is therefore a need to identify novel Cas proteins with improved editing properties, which in turn requires understanding the underlying features that govern the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with the Cas9 and Cas12 families and to identify the features that distinguish each family from the other. We built separate Random Forest (RF) binary classifiers to distinguish Cas12 proteins and Cas9 proteins from non-Cas proteins, using the complete protein feature spectrum (13,495 features) encoding physicochemical, topological, constitutional, and coevolutionary information of Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All models were evaluated rigorously on test and independent datasets. The Cas12 and Cas9 binary models achieved overall accuracies of 95% and 97%, respectively, on their independent datasets, while the multiclass classifier achieved an F1 score of 0.97. We observed that Quasi-sequence-order descriptors, such as Schneider-lag descriptors, and Composition descriptors, such as charge, volume, and polarizability, are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family. Four of the important descriptors identified for Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are conserved across all the Cas9 proteins and are located within different catalytically important domains of the Cas9 protein structure. Of these four, DHI and HHA are well known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. The important descriptors we identified enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems with enhanced gene-editing properties.
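Illustrative note: the Tripeptide Composition (TPC) descriptors named in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the record does not say which feature-extraction tool was used, the function name `tripeptide_composition` is hypothetical, the normalization (count divided by the number of valid 3-residue windows) is one common convention, and the toy sequence is made up rather than taken from a real Cas9 protein.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; 20**3 = 8,000 possible tripeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_composition(sequence):
    """Map each of the 8,000 tripeptides to its frequency in `sequence`.

    Assumption: frequency = count / number of overlapping 3-residue
    windows made up only of standard amino acids.
    """
    seq = sequence.upper()
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    # Skip windows containing non-standard residues (e.g. X, B, Z).
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    return {t: (counts[t] / total if total else 0.0) for t in TRIPEPTIDES}


# Made-up toy sequence that happens to contain DHI, PWN, PYY and HHA.
toy_sequence = "MDHIPWNAPYYHHA"
tpc = tripeptide_composition(toy_sequence)
for tripeptide in ("PWN", "PYY", "HHA", "DHI"):
    print(tripeptide, round(tpc[tripeptide], 4))
```

A feature vector like `tpc` (one value per tripeptide) is the kind of input a Random Forest classifier could be trained on; the abstract's full feature spectrum also includes other descriptor families not shown here.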
Competing Interests: The authors declare no competing financial interests.
Comments: Update in: J Chem Inf Model. 2024 Jun 24;64(12):4897-4911. doi: 10.1021/acs.jcim.4c00625. (PMID: 38838358)
Grant Information: R35 GM133657 United States GM NIGMS NIH HHS
Entry Dates: Date Created: 20240208; Latest Revision: 20240624
Update Code: 20240624
PubMed Central ID: PMC10849529
DOI: 10.1101/2024.01.22.576286
PMID: 38328240
Database: MEDLINE