Academic Journal

Gene Set Summarization Using Large Language Models.

Bibliographic Details
Title: Gene Set Summarization Using Large Language Models.
Authors: Joachimiak MP; Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA., Caufield JH; Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA., Harris NL; Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA., Kim H; Robert Bosch LLC, Sunnyvale, CA 94085, USA., Mungall CJ; Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.
Source: ArXiv [ArXiv] 2024 Jul 04. Date of Electronic Publication: 2024 Jul 04.
Publication Type: Journal Article; Preprint
Language: English
Journal Info: Country of Publication: United States NLM ID: 101759493 Publication Model: Electronic Cited Medium: Internet ISSN: 2331-8422 (Electronic) Linking ISSN: 23318422 NLM ISO Abbreviation: ArXiv Subsets: PubMed not MEDLINE
Abstract: Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLMs perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
Grant Information: R24 OD011883 United States OD NIH HHS; RM1 HG010860 United States HG NHGRI NIH HHS; U24 HG012212 United States HG NHGRI NIH HHS
Entry Dates: Date Created: 20230609 Latest Revision: 20240718
Update Code: 20240718
PubMed Central ID: PMC10246080
PMID: 37292480
Database: MEDLINE