Academic Journal

Structure-based protein function prediction using graph convolutional networks.

Bibliographic Details
Title: Structure-based protein function prediction using graph convolutional networks.
Authors: Gligorijević V; Center for Computational Biology, Flatiron Institute, New York, NY, USA. vgligorijevic@flatironinstitute.org., Renfrew PD; Center for Computational Biology, Flatiron Institute, New York, NY, USA., Kosciolek T; Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.; Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland., Leman JK; Center for Computational Biology, Flatiron Institute, New York, NY, USA., Berenberg D; Center for Computational Biology, Flatiron Institute, New York, NY, USA.; Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA., Vatanen T; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; The Liggins Institute, University of Auckland, Auckland, New Zealand., Chandler C; Center for Computational Biology, Flatiron Institute, New York, NY, USA., Taylor BC; Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, USA., Fisk IM; Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA., Vlamakis H; Broad Institute of MIT and Harvard, Cambridge, MA, USA., Xavier RJ; Broad Institute of MIT and Harvard, Cambridge, MA, USA.; Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.; Gastrointestinal Unit, and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.; Center for Microbiome Informatics and Therapeutics, MIT, Cambridge, MA, USA., Knight R; Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.; Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA.; Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA., Cho K; Center for Data Science, New York University, New York, NY, USA.; CIFAR Azrieli Global Scholar, New York, NY, USA., Bonneau R; Center for Computational Biology, Flatiron Institute, New York, NY, USA. rb133@nyu.edu.; Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA. rb133@nyu.edu.; Center for Data Science, New York University, New York, NY, USA. rb133@nyu.edu.; Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA. rb133@nyu.edu.
Source: Nature communications [Nat Commun] 2021 May 26; Vol. 12 (1), pp. 3168. Date of Electronic Publication: 2021 May 26.
Publication Type: Comparative Study; Evaluation Study; Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.
Language: English
Journal Info: Publisher: Nature Pub. Group Country of Publication: England NLM ID: 101528555 Publication Model: Electronic Cited Medium: Internet ISSN: 2041-1723 (Electronic) Linking ISSN: 20411723 NLM ISO Abbreviation: Nat Commun Subsets: MEDLINE
Imprint Name(s): Original Publication: [London] : Nature Pub. Group
MeSH Terms: Deep Learning*; Models, Biological*; Protein Structure, Tertiary*; Computational Biology/*methods; Proteins/*physiology; Amino Acid Sequence; Databases, Protein/statistics & numerical data; Datasets as Topic; Models, Molecular; Proteins/ultrastructure; Structure-Activity Relationship
Abstract: The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .
Grant Information: P30 DK043351 United States DK NIDDK NIH HHS
Substance Nomenclature: 0 (Proteins)
Entry Date(s): Date Created: 20210527 Date Completed: 20210609 Latest Revision: 20230202
Update Code: 20231215
PubMed Central ID: PMC8155034
DOI: 10.1038/s41467-021-23303-9
PMID: 34039967
Database: MEDLINE