Concatenation and Species Tree Methods Exhibit Statistically Indistinguishable Accuracy under a Range of Simulated Conditions

João Tonini; Andrew Moore; David Stern; Maryia Shcheglovitova; Guillermo Ortí

doi:10.1371/currents.tol.34260cc27551a527b124ec5f6334b6be

Concatenation and Species Tree Methods Exhibit Statistically Indistinguishable Accuracy under a Range of Simulated Conditions

March 9, 2015 · Systematics

Tonini J, Moore A, Stern D, Shcheglovitova M, Ortí G. Concatenation and Species Tree Methods Exhibit Statistically Indistinguishable Accuracy under a Range of Simulated Conditions. PLOS Currents Tree of Life. 2015 Mar 9 . Edition 1. doi: 10.1371/currents.tol.34260cc27551a527b124ec5f6334b6be.

XML

Authors

João Tonini Department of Biological Sciences, The George Washington Univerisity, Washington, District of Columbia, USA.
Andrew Moore Department of Biological Sciences, The George Washington University, Washington, District of Columbia, USA.
David Stern Computational Biology Institute, Department of Biological Sciences, The George Washington University, Washington, District of Columbia, USA.
Maryia Shcheglovitova Department of Geography & Environmental Systems, University of Maryland Baltimore County, Baltimore, MD, USA.
Guillermo Ortí Department of Biological Sciences, The George Washington Univerisity, Washington, District of Columbia, USA.

Abstract

Phylogeneticists have long understood that several biological processes can cause a gene tree to disagree with its species tree. In recent years, molecular phylogeneticists have increasingly foregone traditional supermatrix approaches in favor of species tree methods that account for one such source of error, incomplete lineage sorting (ILS). While gene tree-species tree discordance no doubt poses a significant challenge to phylogenetic inference with molecular data, researchers have only recently begun to systematically evaluate the relative accuracy of traditional and ILS-sensitive methods. Here, we report on simulations demonstrating that concatenation can perform as well or better than methods that attempt to account for sources of error introduced by ILS. Based on these and similar results from other researchers, we argue that concatenation remains a useful component of the phylogeneticist’s toolbox and highlight that phylogeneticists should continue to make explicit comparisons of results produced by contemporaneous and classical methods.

Funding Statement

The first three authors contributed equally to the study. A.J.M., D.S., and J.F.R.T. acknowledge support from The George Washington University. A.J.M. and M.S are supported by National Science Foundation Graduate Research Fellowship Program Grants No. DGE-1246908 and DGE-1144243. J.F.R.T. is supported by fellowship from the Science without Borders program (CAPES/Brazil). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

For decades, phylogeneticists have recognized the potential for discordance between a gene tree and its species tree ¹^,². Maddison’s seminal insights on the topic ³ illustrate how discordance may arise from any of the following sources: gene duplication and extinction (“hidden paralogy”), lateral gene transfer, and incomplete lineage sorting (ILS). Given the computational and theoretical difficulties of adequately modeling gene descent, phylogeneticists have traditionally not attempted to reconstruct species phylogenies with models that explicitly account for sources of gene trees-species tree discord, and have instead opted to increase the total number of loci included in their concatenated data matrices under the expectation that, on average, the signal of multiple genes will recover the true species tree.

In recent years, however, molecular systematists have begun eschewing traditional supermatrix methods in favor of explicit modeling of the processes that may generate discordance. Although issues related to gene duplication and lateral gene transfer have received some attention ⁴^,⁵, most efforts have concentrated on accounting for the effects of ILS by modeling the multispecies coalescent ⁶. The barrage of new methods now available to account for error due to ILS (Table 1) and the general transition away from supermatrix methods seem to follow the predicted emergence of a new paradigm in molecular systematics ⁷.

Table 1. Species tree methods accounting for incomplete lineage sorting in molecular systematics

Note that this list is not exhaustive.
Method	Details
BATWING⁸	Uses Metropolis–Hastings algorithms to estimate population histories and mutation rates. The ancestral DNA data are inferred at each node of the genealogical tree. Different types of sequence data can be analyzed. The program estimates demographic parameters, but do not allow for the effects of either recombination or selection.
*BEAST⁶	Explicitly models the multispecies coalescent, coestimating multiple gene trees within a shared species tree and effective population sizes of extant and extinct populations in a Bayesian Markov chain Monte Carlo framework.
BEST⁹^,¹⁰^,¹¹	Estimates species tree topology, divergence times, and population sizes from gene trees under a multispecies coalescent model in a Bayesian framework; unlike *BEAST, BEST estimates each gene tree individually, then uses importance sampling to infer the species tree.
BUCKy¹²	Uses non-parametric clustering of genes to estimate concordance factors based on proportions of genes supporting a clade; these factors provide the basis for inferring species-trees. Implemented in a Bayesian phylogenetic framework, and thus integrates over gene tree uncertainty.
GLASS¹³	An estimator of pairwise divergence times under the multispecies coalescent. Applies pairwise estimates by single-linkage clustering to generate species trees with branch lengths, using multiple unlinked loci and allow the use of several alleles from each population.
iGLASS¹⁴	Improves GLASS estimation of divergence times by deriving the expected waiting time until the first interspecific coalescence occurs among independent loci for a pair of taxa.
MCMCcoal¹⁵	A Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. Uses multiple loci, the topology of the species tree is assumed known, and a MCMC algorithm integrates over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. It can handle any species tree and allows different numbers of sequences at different loci.
MDC¹⁶	Parsimony-based method that heuristically searches for the species tree that minimizes the number of deep coalescences implied by the gene trees, and thus does not model the coalescent process.
MP-EST¹⁷	Uses pseudo-likelihood function of the species tree to obtain maximum pseudo-likelihood estimates of species trees, with branch lengths in coalescent units. It assumes no gene flow or horizontal gene transfer, but method is robust to a small amount of the former.
NJst¹⁸	The distance between two species is defined as the average number of internodes between two species across gene trees, then the species tree is estimated by the neighbor joining tree built from the distance matrix.
SNAPP¹⁹	Uses a polynomial-time algorithm that computes the likelihood of a species tree directly from the sequences under a finite-sites model of mutation effectively integrating over all possible gene trees. The method applies to unlinked biallelic markers and it is implemented in a Markov chain Monte Carlo sampler for inferring species trees, divergence dates, and population sizes.
STAR²⁰	Uses summary statistics of coalescence times by ordering the expected ranks of the coalescences among sequences, which is consistent with the ancestral order of populations in the species tree. It is resistant to variable substitution rates along the branches in gene trees.
STEAC²⁰	Uses summary statistics of average coalescence times to estimate the species tree. The gene trees are generated from multilocus sequences using a consistent method for gene tree estimation (e.g. maximum likelihood) without the molecular clock assumption.
STEM²¹	Analytically derives the maximum likelihood species tree for a set of gene trees with branch-length information, explicitly modeling discord as a function of the coalescent process.
SVDquartets²²	Infers relationships among quartets of taxa under the coalescent model using techniques from algebraic statistics; it accounts for mutational and coalescent variance. Uncertainty in the estimated relationships is quantified using the nonparametric bootstrap.
STELLS²³	Uses coalescent-based maximum likelihood approach for inferring the species tree from a set of gene tree topologies (no branch lengths), which are used to estimated branch lengths in coalescent units.; the algorithm assumes free recombination between genes. It assumes the gene tree topologies are correctly inferred, therefore does not model uncertainty of gene tree estimation from sequences.

Inadequate model specification is known to introduce systematic bias into phylogenetic analyses, and thus the drive to account for potential error introduced by ILS has merit on theoretical grounds ¹⁶^,²⁴^,²⁵. However, the conditions under which traditional supermatrix approaches to molecular systematics necessarily result in less accurate topologies than methods accounting for population genetic processes remain unresolved, and recent empirical studies have reported high congruence between supermatrix and species tree methods ²⁶^,²⁷^,²⁸.

Moreover, while theoretical studies have demonstrated the existence of species trees for which discordant gene trees are more likely than genealogies that agree with the species tree (the “anomalous gene trees” [AGTs] of Degnan and Rosenberg ²⁴), such species histories are apparently unlikely to mislead most phylogenetic analyses. Huang and Knowles ²⁹ have shown that, for species trees containing anomaly zones (i.e. short internal branches that result in AGTs), the probability of observing a polytomous gene tree is greater than the probability of recovering an anomalous gene tree. Thus, for species trees with short, deep branches, gene trees are more likely to be uninformative than to be outright misleading.

Given such empirical and theoretical findings, traditional supermatrix approaches may still be adequate for densely sampled data matrices and for clades evolving under non-extreme rates of change. However, simulation studies evaluating the accuracy of phylogenetic methods that explicitly treat error due to ILS do not always compare these methods to concatenation ³⁰. Those simulation studies that do make comparisons to concatenation have not demonstrated uniformly superior accuracy of ILS-sensitive methods across all simulated conditions ³¹, and often use simple models of sequence evolution and/or few taxa ⁶^,¹²^,²⁵, conditions that do not reflect typical empirical studies.

Furthermore, simulation studies concluding that concatenation is statistically inconsistent in the presence of anomaly zones and gene tree discordance often do not demonstrate superior accuracy by species tree methods under the same conditions ³², and are thus uninformative as to the practical merits of supermatrix and species-tree paradigms. Even when concatenation is compared to species-tree methods, it is not always specified whether apparent differences in performance are statistically significant ¹²; but see ³³ and discussion below of other recent simulation studies that explicitly assess the accuracy of concatenation vis-à-vis species-tree methods.

Huang et al. ³⁰ demonstrate how two sources of error – mutational variance and coalescent variance – impact the estimation of species trees using methods that account for ILS, but do not make direct comparisons to concatenation. Many empiricists still rely on concatenation for producing phylogenetic hypotheses for many taxa (e.g. ) and calibrating phylogenies with fossil information ³⁴^,³⁵^,³⁶. Empiricists are likely to continue using concatenation for the foreseeable future, until methods that co-estimate genes trees and species trees *BEAST⁶ become more computationally feasible ³⁷, and until truly synthetic and well-understood approaches to simultaneously account for the various sources of gene tree-species tree discord are developed. Thus, it is crucial that the relative accuracy of concatenation, vis-à-vis species-tree approaches, be explicitly delimited.

Herein we extend the experimental design of Huang et al. ³⁰ to include traditional supermatrix methodologies, and explicitly assess the relative accuracies of species-tree methods and concatenation in estimating phylogeny. We find that the species-tree methods fail to outperform concatenation, and argue that, as a matter of best practices, studies involving methods sensitive to sources of gene tree-species tree discord should continue to compare results against those obtained by concatenation.

Methods

Huang et al. ³⁰ performed simulations to quantify the effects of mutational and coalescent variance in species-tree estimation. Specifically, they compared the performance of two methods – STEM ²¹ and MDC ¹⁶ – in resolving the true species tree under a number of conditions, but did not present comparisons to the supermatrix approach. Our goal was to explicitly make these comparisons. The relevant steps of the simulations performed by Huang et al. ³⁰ involved 1) generating a species tree under a uniform speciation model, 2) simulating coalescent gene trees for each species tree, 3) simulating DNA sequences under a specified model of nucleotide evolution along the branches of each gene trees, 4) estimating gene trees from the simulated DNA matrix, 5) estimating species trees from the estimated gene trees using MDC and STEM, and 6) calculating the discord between the true species tree and the estimated species tree. Our simulations followed the same trajectory, except that we do not estimate gene trees separately and instead estimate species trees directly from concatenated matrices. Rather than rerun the simulations of Huang et al. ³⁰ in full, we obtained from the authors their calculated measures of discord between results obtained for each species tree method and the true tree under a variety of conditions. We then compared these discord measurements to values that we produced by identically parameterized simulations analyzed under concatenation.

We obtained 50 species trees of eight taxa used by Huang et al. ³⁰, which those authors generated in the Mesquite software package ³⁸ under a uniform speciation model. We simulated 540 coalescent gene trees for each of the 50 species tree under a neutral coalescent model, constant population size, and no migration in the program ms ³⁹ using the script provided by Huang et al. ³⁰. For each of the 50 species trees, the simulations produced gene trees at total tree depths of 1N and 10N generations; these values reflect a range of total lineage duration. We used only the subset of all simulated trees with a single individual per species, in accordance with the common practice for supermatrix methods. For each gene tree, sequences of 1000 base pairs were simulated using Seq-Gen ⁴⁰ under an HKY model of nucleotide substitution ⁴¹, a transition-transversion rate ratio of 3.0, gamma-distributed rate heterogeneity shape parameter of 0.8, and nucleotide frequencies of A=0.3, C=0.2, T=0.3, and G=0.2 for the ancestral sequence, in accordance with the model of evolution used by Huang et al. ³⁰. These simulations assume a single clock-like rate of sequence evolution across the tree.

Huang et al. ³⁰ evaluate the effects of increasing the number of loci in the data matrix on the accuracy of MDC and STEM in obtaining the true tree, measuring the accuracy of trees produced with matrices of 3, 9, and 27 genes. To make our supermatrix analyses comparable to those of Huang et al. ³⁰, the 540 genes of each of the 50 species trees were concatenated into matrices of 3, 9, and 27 genes; for each species tree at each tree depth, this procedure produced 180 matrices with 3 loci (1N/3 loci and 10N/3 loci), 60 matrices with 9 loci (1N/9 loci and 10N/9 loci), and 20 matrices with 27 loci (1N/27 loci and 10N/27 loci). These concatenated data sets were analyzed in MrBayes 3.2.2 ⁴² under the same model of nucleotide substitution used to produce the sequences (i.e. HKY), a uniform molecular clock model, and Dirichlet distributed nucleotide frequencies. MrBayes was run for each tree estimation until the standard deviation of split frequencies fell below 0.01. Estimating species trees using the substitution model under which they were simulated ensured that model misspecification was not an additional source of error in comparing method accuracy. After discarding the first 25% of the posterior as burn-in, we sampled every 100 trees from the posterior distribution to produce a majority rule consensus tree with all compatible clades added (using the “allcompat” command in MrBayes). The use of a consensus tree for subsequent analyses of discord between methods poses a potential problem, as calculations of topological discord (described below) may be biased in trees with polytomous clades. This potentiality is not addressed by Huang et al. ³⁰, and their measure of discord was also based on the 50% majority rule consensus with all compatible clades. However, adding all compatible clades to the majority rule consensus ensures that most trees are fully resolved, and manual inspection of a representative subset of our estimated species trees found no polytomous clades. Encouragingly, recent research indicates that the majority-rule consensus is expected to minimize the discord between the estimated and the true tree ⁴³.

Topological discord between estimated species trees and their true species tree was measured using the Robinson-Foulds (RF) distance ⁴⁴ in the R package phangorn ⁴⁵. The RF statistic measures the distance between two unrooted trees as the summation of the number of bipartitions unique to the first tree and the number of bipartitions unique to the second tree. At each tree depth, mean discord was calculated by averaging RF distances for each of the 180, 60, and 20 concatenated matrices associated with each of the 50 species trees. We then took the average of the means for all 50 species trees for each matrix size, such that the mean discord for matrices consisting of 3 loci, 9 loci, and 27 loci was estimated from a total of 9,000, 3,000, and 1,000 phylogenetic analyses, respectively. We note that Huang et al. ³⁰ ran 20 independent replicates for each species tree at each permutation of tree depth and sampling design; we did not adopt this procedure here, as a single iteration of the simulations produced sufficiently large sample sizes. ANOVA statistical tests were used to evaluate differences in mean RF distances within and across methods using the values obtained by Huang et al. ³⁰. We did not investigate whether clade support, rather than topological disparity, differed significantly between concatenation and species tree methods, but this topic merits further research. The scripts used to perform the simulations are deposited at http://dx.doi.org/10.6084/m9.figshare.1326720.

We acknowledge that eight-taxon matrices are significantly smaller than the matrices of most molecular phylogenetic analyses – though still larger than those used in many simulation analyses – and also recognize that STEM and MDC are not the most sophisticated ILS-sensitive species tree methods available. Our goal, however, was to perform a general test of the importance of considering concatenation when choosing among phylogenetic estimation methods. For the sake of comparability to Huang et al. ³⁰, who seek explicitly to make recommendations regarding method choice, we did not expand their basic approach.

Results

Comparisons among methods of a given matrix size demonstrate that mean discord values are statistically indistinguishable, with two exceptions: 1N/3 loci and 1N/9 loci (Table 2, Fig. 1). For 1N/3 loci, MDC performed significantly worse than the other methods under consideration, while at 1N/9 loci, STEM exhibited statistically superior performance. Within methods, increasing the number of loci results in a significant increase in accuracy for both 1N and 10N simulations (Table 2, Fig. 1). Across all simulations, concatenation is as or more accurate than either STEM or MDC, except at 1N/9 loci, where STEM appears, on average, to produce slightly less discordant trees than concatenation.

Under the simulated conditions, increasing the number of loci sampled improved accuracy across all inference methods, as is expected for methods that exhibit statistical consistency. Likewise, all methods showed increased accuracy for analyses of taxa related by longer effective internal branches, i.e. branches characterized by smaller ancestral populations, less ancestral polymorphism, and/or increased generational turnover (Fig. 1). Still, the RF values were surprisingly high at 1N, reflecting the inherent difficulty in confidently resolving recent divergences.

Table 2. Results of ANOVA tests for differences in accuracy among methods (concatenation, STEM, MDC) with different numbers of loci and between methods with the same number of loci

Accuracy measured by the topological distance (Robinson-Foulds distance) between the true and the obtained tree. Significant differences denoted with asterisks. See also Figure 1.
	3 loci	9 loci	27 loci	Concatenation	MDC	STEM
1N	<0.001*	0.0255*	0.6384	<0.0001*	<0.0001*	<0.0001*
10N	0.3601	0.8233	0.8505	<0.0002	<0.0001*	<0.0011*

Fig. 1: Mean and standard deviation estimated from a total of 9,000, 3,000, and 1,000 phylogenetic analyses for the 3 loci, 9 loci, and 27 loci analyses, respectively.

The Phylogeneticist’s Toolbox

Our simulations demonstrate that concatenation can perform as well or better than methods that attempt to account for sources of error introduced by ILS (Table 2, Fig. 1). This result corroborates some recent empirical findings that concatenation and species-tree methods can produce near-identical results ²⁷^,²⁸^,³³^,⁴⁶, and is concordant with simulation studies indicating that ILS-sensitive methods do not consistently outperform concatenation under all conditions ³¹^,³³^,⁴⁶^,⁴⁷^,⁴⁸^,⁴⁹^,⁵⁰. Indeed, concatenation can outperform species-tree methods when levels of ILS are low, few loci are used, or gene trees have low phylogenetic signal ³³^,⁴⁷, conditions that likely reflect the reality of many phylogeneticists working with mitochondrial and nuclear markers generated by traditional methods of sequencing to estimate large scale phylogenetic relationships ³⁴^,³⁵^,³⁶^,⁵¹^,⁵²^,⁵³. In addition, simulations by Patel et al. ⁴⁹ and analysis of mammalian empirical data ⁵⁰ indicate that concatenated analyses benefit from increased power vis-à-vis “shortcut” coalescence methods (i.e. those that estimate gene trees and species trees separately) when trying to resolve deeper divergences, especially when more loci are used. In these cases, high mutational error and gene tree heterogeneity are significantly more detrimental to shortcut coalescent methods than they are to concatenation, resulting in more accurate concatenation-based trees despite the incorrect assumption that all loci share a common topology. Binning techniques that use concatenation to produce “supergene trees” as input for coalescence-based analyses may improve the accuracy of species tree estimation by alleviating the bias introduced by using numerous small loci ⁵⁴.

Gene tree-species tree discordance is a well-documented phenomenon, and we do not question that it poses a significant challenge to accurate inference of phylogeny; however, the growing body of research indicating that concatenation is not uniformly worse than discordance-sensitive methods should give phylogeneticists pause. That concatenation remains reasonably accurate under a range of conditions suggests that it retains practical utility. Where results from concatenation are concordant with those obtained from discordance-sensitive methods, the phylogeneticist can be reasonably confident that the resultant phylogenetic hypothesis is robust. Where concatenation and discordance-sensitive methods disagree significantly, the disparity may indicate regions of the tree in need of focused attention and might suggest possible biological factors confounding phylogeny estimation ⁵⁵, or may indicate critically violated assumptions in one or both approaches ⁵⁰. In these cases, morphological, physiological or ecological evidence may provide support for one hypothesis over the other.

Though many methods have been developed to account for error due to ILS (Table 1), other biological processes (hidden paralogy and horizontal gene transfer) can confound accurate phylogenetic inference; however, empirical researchers cannot know a priori the relative contributions of these processes to the genealogical history of a lineage, or to what extent discordance among gene trees may be due to methodological or sampling error (e.g. poor taxonomic sampling; model misspecification, species misidentification, long branch misplacement, arbitrary resolution of gene trees) ⁵⁰^,⁵⁶. While concatenation has been rightly criticized for not accounting for known sources of phylogenetic error, shortcut coalescence methods assume that all error is due to incomplete lineage sorting – that is, that gene trees are known with certainty ⁵⁰. To date, no species tree method simultaneously accounts for all sources of gene discordance ⁵⁷; until well-understood, synthetic approaches to co-estimating gene trees and species tree and accounting for all sources of gene tree-species tree discord are developed, the continued use of concatenation has practical merit insofar as it avoids making a priori assumptions about the extent to which various sources of uncertainty have influenced the evolutionary history of the taxa under consideration. Thus, we argue that concatenation remains a useful component of the phylogeneticist’s toolbox as a “null” methodology against which results from alternative methods of inference can be compared and interpreted.

That concatenation remains an important part of the phylogeneticist’s toolbox is borne out by recent analyses indicating that the method exhibits greater power to overcome sampling error and discrepant patterns of homoplasy ⁵⁰, and that “concatalescence” methods can make explicit use of concatenation in improving the quality of input trees prior to coalescence-based estimation of the species tree ⁵⁴. Moreover, estimation of time-calibrated phylogenies combining molecular and morphological data (extant and fossil) is only possible in concatenation analyses, and concatenation has been shown to be more resilient to the effects of missing data than species tree methods ⁵⁸^,⁵⁹.

We did not test improvement in accuracy of species tree estimation by using multiple individuals of the same species. New species-tree methods that have been developed to handle phylogenomic data of hundreds of loci have also used one individual per species when testing for accuracy and performance ⁴⁶, a common practice in studies aiming to infer large-scale interspecific relationships ³⁴^,³⁶^,⁵¹^,⁵²^,⁵³. Although our results and the findings of phylogenomic studies indicate concatenation exhibits reasonable accuracy under certain conditions, we recognize that phylogenetic studies using either supermatrix or species-tree methods may benefit from including multiple individuals per species, especially for recent divergences ³⁰^,⁴⁹, as the additional data help account for intraspecific allelic diversity when estimating interspecific relationships.

Research is only beginning to shed light on the relative accuracies of species-tree methods and concatenation for genomic-scale data matrices with thousands of taxa. Simulations indicate that coalescent-based methods improve accuracy for genome-scale matrices ³³, though adding loci for which gene tree estimates are poor may be more detrimental for ILS-sensitive methods than for concatenation ⁴⁹^,⁵⁰. Recent empirical research suggests that rapid radiations may make some internal nodes permanently intractable, regardless of methodology or sampling (e.g. few taxa and genes ⁶⁰, many taxa and few genes ⁶¹, or few taxa and many genes ²⁷).

The phylogeneticist’s toolbox continues to grow as our understanding of the dynamics of evolving populations improves, and the last year has witnessed the development of numerous methodological tools and considerations for practicing systematists. For example, recent research has produced posterior predictive model checking for coalescent models ⁶² and new methods of data exploration to reveal the causes and consequences of patterns of gene tree discordance ⁶³^,⁶⁴. We do not dispute that the proliferation and refinement of new methodologies is crucial to the continued success of phylogenetic research. Our purpose here is to demonstrate the importance of explicit comparison of contemporaneous and classical methodologies, and to highlight that the addition of theoretically important parameters may not always produce practical gains. We conclude that concatenation exhibits statistically comparable accuracy under a range of sampling and tree depth conditions vis-à-vis some existing species tree methods, and urge molecular phylogeneticists to thoroughly evaluate the performance of methods that model gene tree-species tree discord against concatenation.

Competing Interests

The authors have declared that no competing interests exist.

Acknowledgements

We thank Huateng Huang for helpful information in executing the simulations. Alex Pyron, Ricardo Betancur-R., Laura Kubatko, Frank Anderson, Mark Holder, The George Washington University systematics discussion group and three anonymous referees provided valuable suggestions. A.J.M., D.S., and J.F.R.T. acknowledge support from The George Washington University. A.J.M. is supported by National Science Foundation Graduate Research Fellowship Program Grant No. DGE-1246908. J.F.R.T. is supported by the Science without Borders program (CAPES/Brazil).

References

Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Syst Biol. 1979 Jun;28(2):132-163. doi: 10.1093/sysbio/28.2.132
Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988 Sep;5(5):568-83. PubMed PMID:3193878.
Maddison WP. Gene trees in species trees. Syst Biol. 1997 Sep;46(3):523-536. doi: 10.1093/sysbio/46.3.523
Boussau B, Szöllosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013 Feb;23(2):323-30. PubMed PMID:23132911.
Nakhleh L. Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol Evol. 2013 Dec;28(12):719-28. PubMed PMID:24094331.
Heled J, Drummond AJ. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010 Mar;27(3):570-80. PubMed PMID:19906793.
Edwards SV. Is a new and general theory of molecular systematics emerging? Evolution. 2009 Jan;63(1):1-19. PubMed PMID:19146594.
Wilson IJ, Weale ME, Balding DJ. Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J R Statist Soc. A 2003 Apr;166(2):155-201. http://www.jstor.org/stable/3559656
Liu L, Pearl DK. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007 Jun;56(3):504-14. PubMed PMID:17562474.
Liu L. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics. 2008 Nov 1;24(21):2542-3. PubMed PMID:18799483.
Liu L, Pearl DK, Brumfield RT, Edwards SV. Estimating species trees using multiple-allele DNA sequence data. Evolution. 2008 Aug;62(8):2080-91. PubMed PMID:18462214.
Larget BR, Kotha SK, Dewey CN, Ané C. BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics. 2010 Nov;26(22):2910-2911. doi: 10.1093/bioinformatics/btq539
Mossel E, Roch S. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan;7(1):166–171.
Jewett EM, Rosenberg NA. iGLASS: an improvement to the GLASS method for estimating species trees from gene trees. J Comput Biol. 2012 Mar;19(3):293-315. PubMed PMID:22216756.
Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003 Aug;164(4):1645-1656.
Maddison WP, Knowles LL. Inferring phylogeny despite incomplete lineage sorting. Syst Biol. 2006 Feb;55(1):21-30. PubMed PMID:16507521.
Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010 Oct;10:302. doi:10.1186/1471-2148-10-302
Liu L, Yu L. Estimating Species Trees from Unrooted Gene Trees. Syst Biol. 2011 Oct;60(5):661-667. doi: 10.1093/sysbio/syr027
Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a Full Coalescent Analysis. Mol Biol Evol. 2012 Aug;29(8):1917-1932. doi: 10.1093/molbev/mss086
Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009 Oct;58(5):468-477. doi: 10.1093/sysbio/syp031
Kubatko LS, Carstens BC, Knowles LL. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009 Apr;25(7):971-973. doi: 10.1093/bioinformatics/btp079
Chifman J, Kubatko LS. Quartet inference from SNP data under the coalescent model. Bioinformatics. 2014 Dec;30(23):3317-3324. doi: 10.1093/bioinformatics/btu530
Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution. 2012 Mar;66(3):763-75. PubMed PMID:22380439.
Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006 May;2(5):e68. doi: 10.1371/journal.pgen.0020068
Liu L, Edwards SV. Phylogenetic analysis in the anomaly zone. Syst Biol. 2009 Aug;58(4):452-460. doi: 10.1093/sysbio/syp034
Berv JS, Prum RO. A comprehensive multilocus phylogeny of the Neotropical cotingas (Cotingidae, Aves) with a comparative evolutionary analysis of breeding system and plumage dimorphism and a revised phylogenetic classification. Mol Phylogenet Evol. 2014 Dec;81:120-36. PubMed PMID:25234241.
Pyron RA, Hendry CR, Chou VM, Lemmon EM, Lemmon AR, Burbrink FT. Effectiveness of phylogenomic data and coalescent species-tree methods for resolving difficult nodes in the phylogeny of advanced snakes (Serpentes: Caenophidia). Mol Phylogenet Evol. 2014 Dec;81:221-31. PubMed PMID:25193610.
Thompson AW, Betancur-R R, López-Fernández H, Ortí G. A time-calibrated, multi-locus phylogeny of piranhas and pacus (Characiformes: Serrasalmidae) and a comparison of species tree methods. Mol Phylogenet Evol. 2014 Dec;81:242-57. PubMed PMID:25261120.
Huang H, Knowles LL. What is the danger of the anomaly zone for empirical phylogenetics? Syst Biol. 2009 Oct;58(5):527-36. PubMed PMID:20525606.
Huang H, He Q, Kubatko LS, Knowles LL. Sources of error inherent in species-tree estimation: impact of mutational and coalescent effects on accuracy and implications for choosing among different methods. Syst Biol. 2010 Oct;59(5):573-83. PubMed PMID:20833951.
DeGiorgio M, Degnan JH. Fast and Consistent Estimation of Species Trees Using Supermatrix Rooted Triples. Mol Biol Evol. 2010 Mar;27(3):552-569. doi: 10.1093/molbev/msp250
Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007 Feb;56(1):17-24. PubMed PMID:17366134.
Mirarab S, Bayzid MS, Warnow T. Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. Syst Biol. 2014 Aug 26. PubMed PMID:25164915.
Pyron RA, Wiens JJ. A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians. Mol Phylogenet Evol. 2011 Nov;61(2):543-83. PubMed PMID:21723399.
Betancur-R R, Broughton RE, Wiley EO, Carpenter K, López JA, Li C, Holcroft NI, Arcila D, Sanciangco M, Cureton Ii JC, Zhang F, Buser T, Campbell MA, Ballesteros JA, Roa-Varon A, Willis S, Borden WC, Rowley T, Reneau PC, Hough DJ, Lu G, Grande T, Arratia G, Ortí G. The tree of life and a new classification of bony fishes. PLoS Curr. 2013 Apr 18;5. PubMed PMID:23653398.
Pyron RA, Burbrink FT, Wiens JJ. A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evol Biol. 2013 Apr;13:93. doi:10.1186/1471-2148-13-93
Zimmermann T, Mirarab S, Warnow T. BBCA: Improving the scalability of *BEAST using random binning. BMC Genomics. 2014 Oct;15(Suppl 6):S11. doi:10.1186/1471-2164-15-S6-S11
Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary analysis. 2008. Version 2.5 http://mesquiteproject.org
Hudson RR. Generating samples under a Wright-Fisher neutral model. Bioinformatics. 2002 Feb;18(2):337-338. doi: 10.1093/bioinformatics/18.2.337
Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997 Jun;13(3):235-8. PubMed PMID:9183526.
Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160-74. PubMed PMID:3934395.
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012 May;61(3):539-42. PubMed PMID:22357727.
Holder MT, Sukumaran J, Lewis PO. A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. Syst Biol. 2008 Oct;57(5):814-21. PubMed PMID:18853367.
Robinson DF, Foulds LR. Comparison of Phylogenetic Trees. Math Biosci. 1981 Feb;53(1-2):131-147. doi:10.1016/0025-5564(81)90043-2
Schliep KP. 2011. phangorn: Phylogenetic analysis in R. Bioinformatics. 2011 Feb;27(4):592-593. doi: 10.1093/bioinformatics/btq706
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014 Sep 1;30(17):i541-8. PubMed PMID:25161245.
Bayzid MS, Warnow T. Naive binning improves phylogenomic analyses. Bioinformatics. 2013 Sep 15;29(18):2277-84. PubMed PMID:23842808.
Leaché AD, Rannala B. The Accuracy of Species Tree Estimation under Simulation: A Comparison of Methods. Syst Biol. 2010 Mar;60(2):126-137. doi: 10.1093/sysbio/syq073
Patel S, Kimball R, Braun EL. Error in Phylogenetic Estimation for Bushes in the Tree of Life. J Phylogen Evolution Biol. 2013 Jun;1(2):1-10. doi: 10.4172/2329-9002.1000110
Gatesy J, Springer MS. Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol Phylogenet Evol. 2014 Nov;80:231-66. PubMed PMID:25152276.
Meredith RW, Janečka JE, Gatesy J, Ryder OA, Fisher CA, Teeling EC, Goodbla A, Eizirik E, Simão TL, Stadler T, Rabosky DL, Honeycutt RL, Flynn JJ, Ingram CM, Steiner C, Williams TL, Robinson TJ, Burk-Herrick A, Westerman M, Ayoub NA, Springer MS, Murphy WJ. Impacts of the Cretaceous Terrestrial Revolution and KPg extinction on mammal diversification. Science. 2011 Oct 28;334(6055):521-4. PubMed PMID:21940861.
Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO. The global diversity of birds in space and time. Nature 2012 Nov;491:444-448. doi: 10.1038/nature11631
Ronquist F, Klopfstein S, Vilhelmsen L, Schulmeister S, Murray DL, Rasnitsyn AP. A total-evidence approach to dating with fossils, applied to the early radiation of the hymenoptera. Syst Biol. 2012 Dec 1;61(6):973-99. PubMed PMID:22723471.
Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science. 2014 Dec 12;346(6215):1250463. PubMed PMID:25504728.
Xi Z, Liu L, Rest JS, Davis CC. Coalescent versus concatenation methods and the placement of Amborella as sister to water lilies. Syst Biol. 2014 Nov;63(6):919-32. PubMed PMID:25077515.
Betancur-R R, Naylor GJ, Ortí G. Conserved genes, sampling error, and phylogenomic inference. Syst Biol. 2014 Mar;63(2):257-62. PubMed PMID:24293251.
Szöllősi GJ, Tannier E, Daubin V, Boussau B. The Inference of Gene Trees with Species Trees. Syst Biol. 2015 Jan;64(1):e42-e62. doi: 10.1093/sysbio/syu048
Hovmöller R, Knowles LL, Kubatko LS. Effects of missing data on species tree estimation under the coalescent. Mol Phylogenet Evol. 2013 Dec;69(3):1057-62. PubMed PMID:23769751.
Jiang W, Chen SY, Wang H, Li DZ, Wiens JJ. Should genes with missing data be excluded from phylogenetic analyses? Mol Phylogenet Evol. 2014 Nov;80:308-18. PubMed PMID:25124098.
Vidal N, Delmas AS, David P, Cruaud C, Couloux A, Hedges SB. The phylogeny and classification of caenophidian snakes inferred from seven nuclear protein-coding genes. C R Biol. 2007 Feb;330(2):182-7. PubMed PMID:17303545.
Pyron RA, Burbrink FT, Colli GR, de Oca AN, Vitt LJ, Kuczynski CA, Wiens JJ. The phylogeny of advanced snakes (Colubroidea), with discovery of a new subfamily and comparison of support methods for likelihood trees. Mol Phylogenet Evol. 2011 Feb;58(2):329-42. PubMed PMID:21074626.
Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC. Poor Fit to the Multispecies Coalescent is Widely Detectable in Empirical Data. Syst Biol. 2014 May;63(3):322-333. doi: 10.1093/sysbio/syt057
McVay JD, Carstens BC. Phylogenetic model choice: justifying a species tree or concatenation analysis. J Phylogen Evolution Biol. 2013 Jul;1:114. doi: 10.4172/2329-9002.1000114
Zwickl DJ, Stein JC, Wing RA, Ware D, Sanderson MJ. Disentangling methodological and biological sources of gene tree discordance on Oryza (Poaceae) chromosome 3. Syst Biol. 2014 Sep;63(5):645-59. PubMed PMID:24721692.