Academic Journal

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Bibliographic Details
Title: Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.
Authors: Gharavi E; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA., LeRoy NJ; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA., Zheng G; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA., Zhang A; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA., Brown DE; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.; Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA., Sheffield NC; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.; Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
Source: Bioengineering (Basel, Switzerland) [Bioengineering (Basel)] 2024 Mar 08; Vol. 11 (3). Date of Electronic Publication: 2024 Mar 08.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: MDPI AG Country of Publication: Switzerland NLM ID: 101676056 Publication Model: Electronic Cited Medium: Print ISSN: 2306-5354 (Print) Linking ISSN: 23065354 NLM ISO Abbreviation: Bioengineering (Basel) Subsets: PubMed not MEDLINE
Imprint Name(s): Original Publication: Basel, Switzerland : MDPI AG, [2014]-
Abstract: As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
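The three retrieval tasks named in the abstract can be illustrated with a small sketch. It assumes that region sets and metadata labels already have vectors in one shared space, and it uses cosine similarity as the closeness measure; the abstract says only "embedding distance computations", so that choice is an assumption. The vectors, label names, file names, and the helper functions `cosine` and `nearest` are hypothetical and are not taken from the paper or its software. Training the embeddings is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, candidates, k=1):
    """Return the names of the k candidates most similar to query_vec."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical co-embeddings: labels and region sets share one 3-d space.
label_vecs = {
    "K562": [0.9, 0.1, 0.0],
    "liver": [0.0, 0.2, 0.9],
}
region_set_vecs = {
    "set_A.bed": [0.8, 0.2, 0.1],
    "set_B.bed": [0.1, 0.1, 0.95],
    "set_C.bed": [0.7, 0.3, 0.0],
}

# Task 1: retrieve region sets related to a query string (here, a known label).
hits_for_label = nearest(label_vecs["liver"], region_set_vecs, k=1)

# Task 2: suggest a label for a database region set.
suggested_label = nearest(region_set_vecs["set_C.bed"], label_vecs, k=1)

# Task 3: retrieve region sets similar to a query region set (excluding itself).
others = {n: v for n, v in region_set_vecs.items() if n != "set_A.bed"}
similar_sets = nearest(region_set_vecs["set_A.bed"], others, k=1)

print(hits_for_label, suggested_label, similar_sets)
```

All three tasks reduce to the same nearest-neighbor lookup; only the source of the query vector and the pool of candidates change.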
Grant Information: R01 HG012558 United States HG NHGRI NIH HHS; R35 GM128636 United States GM NIGMS NIH HHS
Contributed Indexing: Keywords: chromatin; computational genomics; embeddings; functional genomics; genomic intervals; information retrieval; metadata; representation learning; search
Entry Date(s): Date Created: 20240327 Latest Revision: 20240330
Update Code: 20240330
PubMed Central ID: PMC10967841
DOI: 10.3390/bioengineering11030263
PMID: 38534537
Database: MEDLINE