Academic Journal

Inferring whole-genome histories in large population datasets.

Bibliographic Details
Title: Inferring whole-genome histories in large population datasets.
Authors: Kelleher J; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK. jerome.kelleher@bdi.ox.ac.uk., Wong Y; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK., Wohns AW; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK., Fadil C; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK., Albers PK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK., McVean G; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK.
Source: Nature genetics [Nat Genet] 2019 Sep; Vol. 51 (9), pp. 1330-1338. Date of Electronic Publication: 2019 Sep 02.
Publication Type: Journal Article; Research Support, Non-U.S. Gov't
Language: English
Journal Info: Publisher: Nature Pub. Co Country of Publication: United States NLM ID: 9216904 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1546-1718 (Electronic) Linking ISSN: 10614036 NLM ISO Abbreviation: Nat Genet Subsets: MEDLINE
Imprint Name(s): Original Publication: New York, NY : Nature Pub. Co., c1992-
MeSH Terms: Algorithms* ; Evolution, Molecular* ; Genetics, Population* ; Genome, Human* ; Pedigree* ; Selection, Genetic* ; Computer Simulation ; Datasets as Topic ; Haplotypes ; Humans ; Models, Genetic ; Mutation ; Polymorphism, Single Nucleotide ; Population Density
Abstract: Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
Comments: Erratum in: Nat Genet. 2019 Nov;51(11):1660. (PMID: 31591513)
Comment in: Nat Methods. 2019 Nov;16(11):1077. (PMID: 31673154)
References: Darwin, C. Charles Darwin’s Notebooks, 1836–1844: Geology, Transmutation of Species, Metaphysical Enquiries (Cambridge Univ. Press, 1987).
Haeckel, E. Generelle Morphologie der Organismen (G. Reimer, 1866).
Hinchliff, C. E. et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc. Natl Acad. Sci. USA 112, 12764–12769 (2015). (PMID: 10.1073/pnas.1423041112)
Felsenstein, J. Inferring Phylogenies (Sinauer Associates, 2004).
Yang, Z. & Rannala, B. Molecular phylogenetics: principles and practice. Nat. Rev. Genet. 13, 303–314 (2012). (PMID: 10.1038/nrg3186)
Morrison, D. A. Genealogies: pedigrees and phylogenies are reticulating networks not just divergent trees. Evol. Biol. 43, 456–473 (2016). (PMID: 10.1007/s11692-016-9376-5)
Ragan, M. A. Trees and networks before and after Darwin. Biol. Direct 4, 43 (2009). (PMID: 10.1186/1745-6150-4-43)
Griffiths, R. C. The two-locus ancestral graph. Lect. Notes Monogr. Ser. 18, 100–117 (1991). (PMID: 10.1214/lnms/1215459289)
Griffiths, R. C. & Marjoram, P. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3, 479–502 (1996). (PMID: 10.1089/cmb.1996.3.479)
Minichiello, M. J. & Durbin, R. Mapping trait loci by use of inferred ancestral recombination graphs. Am. J. Hum. Genet. 79, 910–922 (2006). (PMID: 10.1086/508901)
Arenas, M. The importance and application of the ancestral recombination graph. Front. Genet. 4, 206 (2013). (PMID: 37962703796270)
Gusfield, D. ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks (MIT Press, 2014).
Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014). (PMID: 10.1371/journal.pgen.1004342)
Bordewich, M. & Semple, C. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 8, 409–423 (2005). (PMID: 10.1007/s00026-004-0229-z)
Wang, L., Zhang, K. & Zhang, L. Perfect phylogenetic networks with recombination. J. Comput. Biol. 8, 69–78 (2001). (PMID: 10.1089/106652701300099119)
Hein, J. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98, 185–200 (1990). (PMID: 10.1016/0025-5564(90)90123-G)
Song, Y. S. & Hein, J. Constructing minimal ancestral recombination graphs. J. Comput. Biol. 12, 147–169 (2005). (PMID: 10.1089/cmb.2005.12.147)
Gusfield, D., Eddhu, S. & Langley, C. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinform. Comput. Biol. 02, 173–213 (2004). (PMID: 10.1142/S0219720004000521)
Gusfield, D., Bansal, V., Bafna, V. & Song, Y. S. A decomposition theory for phylogenetic networks and incompatible characters. J. Comput. Biol. 14, 1247–1272 (2007). (PMID: 10.1089/cmb.2006.0137)
Kuhner, M. K., Yamato, J. & Felsenstein, J. Maximum likelihood estimation of recombination rates from population data. Genetics 156, 1393–1401 (2000). (PMID: 14613171461317)
Fearnhead, P. & Donnelly, P. Estimating recombination rates from population genetic data. Genetics 159, 1299–1318 (2001). (PMID: 14618551461855)
Song, Y. S., Wu, Y. & Gusfield, D. Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution. Bioinformatics 21, i413–i422 (2005). (PMID: 10.1093/bioinformatics/bti1033)
Parida, L., Melé, M., Calafell, F., Bertranpetit, J. & The Genographic Consortium Estimating the ancestral recombinations graph (ARG) as compatible networks of SNP patterns. J. Comput. Biol. 15, 1133–1153 (2008). (PMID: 10.1089/cmb.2008.0065)
O’Fallon, B. D. ACG: rapid inference of population history from recombining nucleotide sequences. BMC Bioinformatics 14, 40 (2013). (PMID: 10.1186/1471-2105-14-40)
Mirzaei, S. & Wu, Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics 33, 1021–1030 (2016). (PMID: 58600235860023)
Cardona, G., Rosselló, F. & Valiente, G. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics 9, 532 (2008). (PMID: 10.1186/1471-2105-9-532)
McGill, J. R., Walkup, E. A. & Kuhner, M. K. GraphML specializations to codify ancestral recombinant graphs. Front. Genet. 4, 146 (2013). (PMID: 10.3389/fgene.2013.00146)
Kelleher, J., Etheridge, A. M. & McVean, G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 12, e1004842 (2016). (PMID: 10.1371/journal.pcbi.1004842)
Kelleher, J., Thornton, K. R., Ashander, J. & Ralph, P. L. Efficient pedigree recording for fast population genetics simulation. PLoS Comput. Biol. 14, e1006581 (2018). (PMID: 10.1371/journal.pcbi.1006581)
The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature 526, 68–74 (2015). (PMID: 10.1038/nature15393)
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016). (PMID: 10.1038/nature18964)
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). (PMID: 10.1038/s41586-018-0579-z)
Stephens, Z. D. et al. Big data: astronomical or genomical? PLoS Biol. 13, e1002195 (2015). (PMID: 10.1371/journal.pbio.1002195)
Ané, C. & Sanderson, M. J. Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst. Biol. 54, 146–157 (2005). (PMID: 10.1080/10635150590905984)
Danecek, P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–2158 (2011). (PMID: 10.1093/bioinformatics/btr330)
Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014). (PMID: 10.1093/bioinformatics/btu014)
Pedersen, B. S. & Quinlan, A. R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33, 1867–1869 (2017). (PMID: 10.1093/bioinformatics/btx057)
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009). (PMID: 10.1093/bioinformatics/btp163)
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003). (PMID: 1470419814704198)
Kendall, M. & Colijn, C. Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol. Biol. Evol. 33, 2735–2743 (2016). (PMID: 10.1093/molbev/msw124)
Shchur, V., Ziganurova, L. & Durbin, R. Fast and scalable genome-wide inference of local tree topologies from large number of haplotypes based on tree consistent PBWT data structure. Preprint at bioRxiv https://doi.org/10.1101/542035 (2019).
Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. https://doi.org/10.1038/s41588-019-0484-x (2019). (PMID: 10.1038/s41588-019-0484-x)
Kimura, M. & Ota, T. The age of a neutral mutant persisting in a finite population. Genetics 75, 199–212 (1973). (PMID: 12129971212997)
Griffiths, R. C. & Tavaré, S. The age of a mutation in a general coalescent tree. Stoch. Models 14, 273–295 (1998). (PMID: 10.1080/15326349808807471)
Ormond, L., Foll, M., Ewing, G. B., Pfeifer, S. P. & Jensen, J. D. Inferring the age of a fixed beneficial allele. Mol. Ecol. 25, 157–169 (2016). (PMID: 10.1111/mec.13478)
Nakagome, S. et al. Estimating the ages of selection signals from different epochs in human history. Mol. Biol. Evol. 33, 657–669 (2016). (PMID: 10.1093/molbev/msv256)
Smith, J., Coop, G., Stephens, M. & Novembre, J. Estimating time to the common ancestor for a beneficial allele. Mol. Biol. Evol. 35, 1003–1017 (2018). (PMID: 10.1093/molbev/msy006)
Albers, P. K. & McVean, G. Dating genomic variants and shared ancestry in population-scale sequencing data. Preprint at bioRxiv https://doi.org/10.1101/416610 (2018).
Keightley, P. D. & Jackson, B. C. Inferring the probability of the derived vs. the ancestral allelic state at a polymorphic site. Genetics 209, 897–906 (2018). (PMID: 60282446028244)
Lunter, G. Haplotype matching in large cohorts using the Li and Stephens model. Bioinformatics 35, 798–806 (2019). (PMID: 10.1093/bioinformatics/bty735)
Fisher, R. A. A fuller theory of ‘junctions’ in inbreeding. Heredity 8, 187–197 (1954). (PMID: 10.1038/hdy.1954.17)
Jombart, T., Kendall, M., Almagro-Garcia, J. & Colijn, C. treespace: statistical exploration of landscapes of phylogenetic trees. Mol. Ecol. Resour. 17, 1385–1392 (2017). (PMID: 10.1111/1755-0998.12676)
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011). (PMID: 10.1093/bioinformatics/btq706)
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5, e1000695 (2009). (PMID: 10.1371/journal.pgen.1000695)
Haller, B. C. & Messer, P. W. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol. Biol. Evol. 36, 632–637 (2019). (PMID: 10.1093/molbev/msy228)
Haller, B. C., Galloway, J., Kelleher, J., Messer, P. W. & Ralph, P. L. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol. Ecol. Resour. 19, 552–566 (2019). (PMID: 10.1111/1755-0998.12968)
Oliphant, T. E. A guide to NumPy (Trelgol Publishing, 2006).
McKinney, W. et al. Data structures for statistical computing in Python. Proc. 9th Python in Science Conference 51–56 (2010).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90 (2007). (PMID: 10.1109/MCSE.2007.55)
Regions in the European Union–Nomenclature of Territorial Units for Statistics–NUTS 2013/EU-28 (Eurostat, 2011).
Grant Information: United Kingdom WT_ Wellcome Trust; 100956 United Kingdom WT_ Wellcome Trust
Entry Date(s): Date Created: 20190904 Date Completed: 20200123 Latest Revision: 20220420
Update Code: 20221213
PubMed Central ID: PMC6726478
DOI: 10.1038/s41588-019-0483-y
PMID: 31477934
Database: MEDLINE