Academic Journal

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.

Bibliographic Details
Title: OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.
Authors: Ahdritz G; Department of Systems Biology, Columbia University, New York, NY, USA.; Harvard University, Cambridge, MA, USA., Bouatta N; Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA. nbouatta@gmail.com., Floristean C; Department of Systems Biology, Columbia University, New York, NY, USA., Kadyan S; Department of Systems Biology, Columbia University, New York, NY, USA., Xia Q; Department of Systems Biology, Columbia University, New York, NY, USA., Gerecke W; Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA., O'Donnell TJ; Icahn School of Medicine at Mount Sinai, New York, NY, USA., Berenberg D; Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA., Fisk I; Flatiron Institute, New York, NY, USA., Zanichelli N; OpenBioML, Cambridge, MA, USA., Zhang B; Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA., Nowaczynski A; NVIDIA, Santa Clara, CA, USA., Wang B; NVIDIA, Santa Clara, CA, USA., Stepniewska-Dziubinska MM; NVIDIA, Santa Clara, CA, USA., Zhang S; NVIDIA, Santa Clara, CA, USA., Ojewole A; NVIDIA, Santa Clara, CA, USA., Guney ME; NVIDIA, Santa Clara, CA, USA., Biderman S; EleutherAI, New York, NY, USA.; Booz Allen Hamilton, McLean, VA, USA., Watkins AM; Prescient Design, Genentech, New York, NY, USA., Ra S; Prescient Design, Genentech, New York, NY, USA., Lorenzo PR; NVIDIA, Santa Clara, CA, USA., Nivon L; Cyrus Bio, Seattle, WA, USA., Weitzner B; Outpace Bio, Seattle, WA, USA., Ban YA; Arzeda, Seattle, WA, USA., Chen S; Rutgers University, New Brunswick, NJ, USA., Zhang M; University of Illinois at Urbana-Champaign, Champaign, IL, USA., Li C; Microsoft, Redmond, WA, USA., Song SL; Microsoft, Redmond, WA, USA., He Y; Microsoft, Redmond, WA, USA., Sorger PK; Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA., Mostaque E; Stability AI, Los Altos, CA, USA., Zhang Z; Rutgers 
University, New Brunswick, NJ, USA., Bonneau R; Prescient Design, Genentech, New York, NY, USA., AlQuraishi M; Department of Systems Biology, Columbia University, New York, NY, USA. m.alquraishi@columbia.edu.
Source: Nature methods [Nat Methods] 2024 May 14. Date of Electronic Publication: 2024 May 14.
Publication Model: Ahead of Print
Publication Type: Journal Article
Language: English
Journal Information: Publisher: Nature Pub. Group Country of Publication: United States NLM ID: 101215604 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1548-7105 (Electronic) Linking ISSN: 15487091 NLM ISO Abbreviation: Nat Methods Subsets: MEDLINE
Imprint Name(s): Original Publication: New York, NY : Nature Pub. Group, c2004-
Abstract: AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.
(© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.)
References: Anfinsen, C. B. Principles that govern the folding of protein chains. Science 181, 223–230 (1973). (DOI: 10.1126/science.181.4096.223; PMID: 4124164)
Dill, K. A., Ozkan, S. B., Shell, M. S. & Weikl, T. R. The protein folding problem. Annu. Rev. Biophys. 37, 289–316 (2008). (DOI: 10.1146/annurev.biophys.37.092707.153558; PMID: 18573083; PMCID: PMC2443096)
Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006 (2015). (DOI: 10.1093/bioinformatics/btu791; PMID: 25431331)
Golkov, V. et al. Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images. In Advances in Neural Information Processing Systems (eds Lee, D. et al.) (Curran Associates, 2016).
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017). (DOI: 10.1371/journal.pcbi.1005324; PMID: 28056090; PMCID: PMC5249242)
Liu, Y., Palmedo, P., Ye, Q., Berger, B. & Peng, J. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst. 6, 65–74 (2018). (DOI: 10.1016/j.cels.2017.11.014; PMID: 29275173)
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020). (DOI: 10.1038/s41586-019-1923-7; PMID: 31942072)
Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021). (DOI: 10.1038/s42256-021-00348-5; PMID: 34368623; PMCID: PMC8340610)
Šali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993). (DOI: 10.1006/jmbi.1993.1626; PMID: 8254673)
Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725–738 (2010). (DOI: 10.1038/nprot.2010.5; PMID: 20360767; PMCID: PMC2849174)
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). (DOI: 10.1038/s41586-021-03819-2)
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022). (DOI: 10.1038/s41592-022-01488-1; PMID: 35637307; PMCID: PMC9184281)
Baek, M. Adding a big enough number for ‘residue_index’ feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). Twitter twitter.com/minkbaek/status/1417538291709071362?lang=en (2021).
Tsaban, T. et al. Harnessing protein folding neural networks for peptide–protein docking. Nat. Commun. 13, 176 (2022). (DOI: 10.1038/s41467-021-27838-9; PMID: 35013344; PMCID: PMC8748686)
Roney, J. P. & Ovchinnikov, S. State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101 (2022). (DOI: 10.1103/PhysRevLett.129.238101; PMID: 36563190)
Baltzis, A. et al. Highly significant improvement of protein sequence alignments with AlphaFold2. Bioinformatics 38, 5007–5011 (2022).
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022). (DOI: 10.1038/s41467-022-28865-w; PMID: 35273146; PMCID: PMC8913741)
Wayment-Steele, H. K., Ovchinnikov, S., Colwell, L. & Kern, D. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. Nature 625, 832–839 (2024). (DOI: 10.1038/s41586-023-06832-9; PMID: 37956700)
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021). (DOI: 10.1038/s41586-021-03828-1; PMID: 34293799; PMCID: PMC8387240)
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021). (DOI: 10.1093/nar/gkab1061; PMCID: PMC8728224)
Callaway, E. ‘The entire protein universe’: AI predicts shape of nearly every known protein. Nature 608, 15–16 (2022). (DOI: 10.1038/d41586-022-02083-2; PMID: 35902752)
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
Ahdritz, G. et al. OpenProteinSet: training data for structural biology at scale. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 4597-4609 (Curran Associates, 2023).
Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8026–8037 (Curran Associates, 2019).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. GitHub github.com/google/jax (2018).
Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20 3505–3506 (Association for Computing Machinery, 2020).
Charlier, B., Feydy, J., Glaunès, J., Collin, F.-D. & Durif, G. Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22, 1–6 (2021).
Falcon, W. & the PyTorch Lightning team. PyTorch Lightning (PyTorch Lightning, 2019).
Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 16344–16359 (Curran Associates, 2022).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017). (DOI: 10.1093/nar/gkw1081; PMID: 27899574)
wwPDB Consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2018). (DOI: 10.1093/nar/gky949)
Haas, J. et al. Continuous automated model evaluation (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86, 387–398 (2018). (DOI: 10.1002/prot.25431; PMID: 29178137)
Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013). (DOI: 10.1093/bioinformatics/btt473; PMID: 23986568; PMCID: PMC3799472)
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997). (DOI: 10.1016/S0969-2126(97)00260-8; PMID: 9309224)
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021). (DOI: 10.1093/nar/gkaa1079; PMID: 33237325)
Andreeva, A., Kulesha, E., Gough, J. & Murzin, A. G. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 48, D376–D382 (2020). (DOI: 10.1093/nar/gkz1064; PMID: 31724711)
Saitoh, Y. et al. Structural basis for high selectivity of a rice silicon channel Lsi1. Nat. Commun. 12, 6236 (2021). (DOI: 10.1038/s41467-021-26535-x; PMID: 34716344; PMCID: PMC8556265)
Mota, D. C. A. M. et al. Structural and thermodynamic analyses of human TMED1 (p241) Golgi dynamics. Biochimie 192, 72–82 (2022). (DOI: 10.1016/j.biochi.2021.10.002; PMID: 34634369)
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) (Curran Associates, 2017).
Rabe, M. N. & Staats, C. Self-attention does not need O(n²) memory. Preprint at https://doi.org/10.48550/arXiv.2112.05682 (2021).
Cheng, S. et al. FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming 417–430 (Association for Computing Machinery, 2024).
Li, Z. et al. Uni-Fold: an open-source platform for developing protein folding models beyond AlphaFold. Preprint at bioRxiv https://doi.org/10.1101/2022.08.04.502811 (2022).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003). (DOI: 10.1093/nar/gkg571; PMID: 12824330; PMCID: PMC168977)
Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011). (DOI: 10.1371/journal.pone.0028766; PMID: 22163331; PMCID: PMC3233603)
Sułkowska, J. I., Morcos, F., Weigt, M., Hwa, T. & Onuchic, J. N. Genomics-aided structure prediction. Proc. Natl Acad. Sci. USA 109, 10340–10345 (2012). (DOI: 10.1073/pnas.1207864109; PMID: 22691493; PMCID: PMC3387073)
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems (eds Oh, A. H. et al.) 30016–30030 (NeurIPS, 2022).
Tay, Y. et al. Scaling laws vs model architectures: how does inductive bias influence scaling? In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 12342–12364 (Association for Computational Linguistics, 2023).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). (DOI: 10.1126/science.ade2574; PMID: 36927031)
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019). (DOI: 10.1038/s41592-019-0598-1; PMID: 31636460; PMCID: PMC7067682)
Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at bioRxiv https://doi.org/10.1101/2022.07.21.500999 (2022).
Singh, J., Paliwal, K., Litfin, T., Singh, J. & Zhou, Y. Predicting RNA distance-based contact maps by integrated deep learning on physics-inferred secondary structure and evolutionary-derived mutational coupling. Bioinformatics 38, 3900–3910 (2022). (DOI: 10.1093/bioinformatics/btac421; PMID: 35751593; PMCID: PMC9364379)
Baek, M., McHugh, R., Anishchenko, I., Baker, D. & DiMaio, F. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 21, 117–121 (2024). (DOI: 10.1038/s41592-023-02086-5; PMID: 37996753)
Pearce, R., Omenn, G. S. & Zhang, Y. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. Preprint at bioRxiv https://doi.org/10.1101/2022.05.15.491755 (2022).
McPartlon, M., Lai, B. & Xu, J. A deep SE(3)-equivariant model for learning inverse protein folding. Preprint at bioRxiv https://doi.org/10.1101/2022.04.15.488492 (2022).
McPartlon, M. & Xu, J. An end-to-end deep learning method for protein side-chain packing and inverse folding. Proc. Natl Acad. Sci. USA 120, e2216438120 (2023).
Knox, H. L., Sinner, E. K., Townsend, C. A., Boal, A. K. & Booker, S. J. Structure of a B12-dependent radical SAM enzyme in carbapenem biosynthesis. Nature 602, 343–348 (2022). (DOI: 10.1038/s41586-021-04392-4; PMID: 35110734; PMCID: PMC8950224)
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004). (DOI: 10.1002/prot.20264; PMID: 15476259)
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, 2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).
Wang, G. et al. HelixFold: an efficient implementation of AlphaFold2 using PaddlePaddle. Preprint at https://doi.org/10.48550/arXiv.2207.05477 (2022).
Yuan, J. et al. OneFlow: redesign the distributed deep learning framework from scratch. Preprint at https://doi.org/10.48550/arXiv.2110.15032 (2021).
Ovchinnikov, S. Weekend project! So now that OpenFold weights are available. I was curious how different they are from AlphaFold weights and if they can be used for AfDesign evaluation. More specifically, if you design a protein with AlphaFold, can OpenFold predict it (and vice-versa)? (1/5). Twitter twitter.com/sokrypton/status/1551242121528520704?lang=en (2022).
Wei, X. et al. The α-helical cap domain of a novel esterase from gut Alistipes shahii shaping the substrate-binding pocket. J. Agric. Food Chem. 69, 6064–6072 (2021). (DOI: 10.1021/acs.jafc.1c00940; PMID: 33979121)
Carroll, B. L. et al. Caught in motion: human NTHL1 undergoes interdomain rearrangement necessary for catalysis. Nucleic Acids Res. 49, 13165–13178 (2021). (DOI: 10.1093/nar/gkab1162; PMID: 34871443; PMCID: PMC8682792)
Grant Information: U54-CA225088 U.S. Department of Health & Human Services | NIH | National Cancer Institute (NCI); OAC-2106661 National Science Foundation (NSF); R35GM150546 U.S. Department of Health & Human Services | NIH | National Institute of General Medical Sciences (NIGMS); OAC-2112606 National Science Foundation (NSF); R35 GM150546 United States GM NIGMS NIH HHS
Entry Dates: Date Created: 20240514 Latest Revision: 20240607
Update Code: 20240607
DOI: 10.1038/s41592-024-02272-z
PMID: 38744917
Database: MEDLINE