Academic Journal

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.

Bibliographic Details
Title: Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.
Authors:
LeRoy NJ; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.
Smith JP; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
Zheng G; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.
Rymuza J; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
Gharavi E; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.
Brown DE; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.; Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA.
Zhang A; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.
Sheffield NC; Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA.; Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.; Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.; School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.; Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
Source: NAR genomics and bioinformatics [NAR Genom Bioinform] 2024 Jul 05; Vol. 6 (3), pp. lqae073. Date of Electronic Publication: 2024 Jul 05 (Print Publication: 2024).
Publication Type: Journal Article
Language: English
Journal Info: Publisher: Oxford University Press Country of Publication: England NLM ID: 101756213 Publication Model: eCollection Cited Medium: Internet ISSN: 2631-9268 (Electronic) Linking ISSN: 26319268 NLM ISO Abbreviation: NAR Genom Bioinform Subsets: PubMed not MEDLINE
Imprint Name(s): Original Publication: [Oxford] : Oxford University Press, [2019]-
Abstract: Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are available for public use on Hugging Face: https://huggingface.co/databio.
(© The Author(s) 2024. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.)
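The workflow the abstract describes (learn region embeddings from co-occurrence in reference data, reuse them to embed unseen cells, then annotate cell types) can be illustrated with a small sketch. The abstract does not spell out the training procedure, so this example swaps in a truncated SVD of a region co-occurrence matrix as a stand-in for the learned region embeddings. All function names and the toy data are hypothetical, and nothing here is the geniml API.

```python
import numpy as np

def region_embeddings(cells_by_regions: np.ndarray, dim: int) -> np.ndarray:
    """Embed regions from how often they co-occur in the same reference cell.

    Stand-in technique (assumption): truncated SVD of the co-occurrence matrix.
    """
    x = cells_by_regions.astype(float)
    cooc = x.T @ x                    # regions x regions co-occurrence counts
    np.fill_diagonal(cooc, 0.0)       # ignore a region co-occurring with itself
    u, s, _ = np.linalg.svd(cooc)
    return u[:, :dim] * np.sqrt(s[:dim])

def cell_embeddings(cells_by_regions: np.ndarray, region_vecs: np.ndarray) -> np.ndarray:
    """Represent each cell as the mean vector of its accessible regions."""
    x = cells_by_regions.astype(float)
    n_open = np.maximum(x.sum(axis=1, keepdims=True), 1.0)  # avoid divide-by-zero
    return (x @ region_vecs) / n_open

def annotate(query_vecs, ref_vecs, ref_labels, k=3):
    """Label query cells by majority vote over the k most cosine-similar reference cells."""
    def unit(v):
        return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)
    sims = unit(query_vecs) @ unit(ref_vecs).T
    nearest = np.argsort(-sims, axis=1)[:, :k]
    votes = []
    for row in nearest:
        labels, counts = np.unique(ref_labels[row], return_counts=True)
        votes.append(labels[np.argmax(counts)])
    return np.array(votes)

# Toy reference data (hypothetical): type "A" cells open regions 0-2, type "B" cells open regions 3-5.
reference = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])
ref_labels = np.array(["A", "A", "A", "B", "B", "B"])

# "Pre-train" once on the reference, then reuse the region vectors for unseen cells.
region_vecs = region_embeddings(reference, dim=2)
ref_vecs = cell_embeddings(reference, region_vecs)

query = np.array([
    [0, 1, 1, 0, 0, 0],   # unseen cell resembling type A
    [0, 0, 0, 1, 0, 1],   # unseen cell resembling type B
])
predicted = annotate(cell_embeddings(query, region_vecs), ref_vecs, ref_labels, k=3)
print(predicted.tolist())
```

In this sketch the region vectors are computed only from the reference matrix. The query cells are embedded by reusing those vectors, with no retraining, which is the transfer step; on this toy data the two query cells are labeled A and B.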
Grant Information: R01 HG012558 United States HG NHGRI NIH HHS; R35 GM128636 United States GM NIGMS NIH HHS
Entry Date(s): Date Created: 20240708 Latest Revision: 20240709
Update Code: 20240709
PubMed Central ID: PMC11224678
DOI: 10.1093/nargab/lqae073
PMID: 38974799
Database: MEDLINE