Tandy Warnow is the Founder Professor of Engineering at the University of Illinois in Urbana-Champaign. Her work is focused on methods for phylogenetic estimation, including large-scale multiple sequence alignment and multi-locus species tree estimation.

Incomplete lineage sorting (ILS), modelled by the multi-species coalescent, is a process that results in a gene tree being different from the species tree. Because ILS is expected to occur for at least some loci within genome-scale analyses, the evaluation of species tree estimation methods in the presence of ILS is of great interest. Performance on simulated and biological data has suggested that concatenation analyses can result in the wrong tree with high support under some conditions, and a recent theoretical result by Roch and Steel proved that concatenation using unpartitioned maximum likelihood analysis can be statistically inconsistent in the presence of ILS. In this study, we survey the major species tree estimation methods, including the newly proposed “statistical binning” methods, and discuss their theoretical properties. We also note that there are two interpretations of the term “statistical consistency”, and discuss the theoretical results proven under both interpretations.

Estimating species trees from multiple loci is commonly performed using concatenation methods, in which multiple sequence alignments from different genomic regions are concatenated into one large supermatrix, and then a tree is estimated on the supermatrix. Yet, incomplete lineage sorting (ILS) (modelled by the multi-species coalescent model) can result in different loci having topologically different phylogenies, with high probability of gene tree incongruence when the effective population size is large and the time between speciation events is small

Because concatenation can result in incorrect trees, coalescent-based "species tree methods" have been developed that are statistically consistent under the multi-species coalescent model. However, there are two meanings for "statistical consistency under the multi-species coalescent model", and so it is important to distinguish between them.

Thus, what Roch and Steel proved

On the other hand, many species tree estimation methods have been developed that are provably statistically consistent in the first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST

Only a few coalescent-based species tree estimation methods have been proven to be statistically consistent in the second (i.e., stronger) sense of the term, which establishes that the species tree estimated by the method converges to the true species tree as the number of loci is allowed to increase, even when the sequence length per locus is bounded.

Simulation studies evaluating the relative performance of species tree estimation methods in comparison to concatenation analysis have had mixed results (e.g., sometimes concatenation is more accurate, sometimes coalescent-based methods are more accurate, and sometimes the differences are not statistically significant).

In two recent papers

Statistical binning has the following steps:

1. Compute gene trees with bootstrapping on each locus, using maximum likelihood.
2. Use the bootstrap support on the edges of each gene tree to determine for each pair of genes whether they are likely to have a common gene tree topology, and build an "incompatibility graph" to represent this information (so that each node represents a gene, and two genes are connected by an edge if the topological differences between their gene trees are considered statistically significant according to the test).
3. Partition the vertices of the graph into sets of approximately the same size, so that no two vertices in any set are adjacent; these are called the "bins".
4. For each bin, concatenate the alignments in the bin, and compute a fully partitioned maximum likelihood tree on the bin; these are called the "supergene trees".
5. Apply the preferred summary method (e.g., ASTRAL or MP-EST) to the set of supergene trees to compute the species tree.
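The graph-partitioning step above can be illustrated with a minimal sketch. It assumes that the pairwise incompatibility test has already been run and that its outcome is given as a list of conflicting gene pairs. The function names and the simple greedy heuristic are illustrative assumptions, not the procedure used in the statistical binning software.

```python
def build_incompatibility_graph(genes, conflicts):
    """Return an adjacency map: gene -> set of genes judged incompatible with it.

    `conflicts` is a hypothetical input: pairs of genes whose gene trees
    differ significantly according to some bootstrap-based test.
    """
    graph = {g: set() for g in genes}
    for g, h in conflicts:
        graph[g].add(h)
        graph[h].add(g)
    return graph


def greedy_balanced_bins(graph):
    """Partition genes into bins so that no bin contains two adjacent genes.

    Greedy heuristic (an assumption made for this sketch): visit genes with
    the most conflicts first and place each into the smallest existing bin
    that contains none of its neighbours, opening a new bin if none fits.
    """
    bins = []
    for gene in sorted(graph, key=lambda g: (-len(graph[g]), g)):
        candidates = [b for b in bins if not (graph[gene] & b)]
        if candidates:
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
conflicts = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g4", "g5")]
graph = build_incompatibility_graph(genes, conflicts)
bins = greedy_balanced_bins(graph)
for b in bins:
    print(sorted(b))
```

Each resulting bin is an independent set in the incompatibility graph, and choosing the smallest admissible bin keeps the bin sizes close to one another. The heuristic does not guarantee the fewest or the most balanced bins.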

Note that the only difference between weighted and unweighted statistical binning is that weighted statistical binning replicates the supergene trees by the number of genes in the bin, and unweighted statistical binning does not do this. This difference is essential to the theoretical properties of the two methods. As proven in

As shown in

Thus, statistical binning used with a coalescent-based summary method provides a blend of concatenation and coalescent-based methods: supergene trees are computed on concatenated alignments using fully partitioned maximum likelihood analyses, and then a coalescent-based summary method (such as MP-EST) is applied to these supergene trees to estimate a species tree. However, importantly, all maximum likelihood analyses used in statistical binning are fully partitioned, and this is critically important to the statistical properties that can be proven about these methods.
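The difference between weighted and unweighted statistical binning noted earlier amounts to how many copies of each supergene tree are passed to the summary method. The sketch below shows only that replication step; the function name, the bin contents, and the Newick strings are hypothetical placeholders.

```python
def trees_for_summary_method(bins, supergene_trees, weighted=True):
    """Build the list of trees handed to the summary method.

    bins            -- list of collections of gene identifiers
    supergene_trees -- one tree (here a Newick string) per bin, same order
    weighted        -- if True, repeat each supergene tree once per gene in
                       its bin; if False, include each supergene tree once
    """
    if len(bins) != len(supergene_trees):
        raise ValueError("need exactly one supergene tree per bin")
    trees = []
    for members, tree in zip(bins, supergene_trees):
        copies = len(members) if weighted else 1
        trees.extend([tree] * copies)
    return trees


example_bins = [{"g1", "g4", "g6"}, {"g2"}]
example_trees = ["((a,b),(c,d));", "((a,c),(b,d));"]
weighted_input = trees_for_summary_method(example_bins, example_trees, weighted=True)
unweighted_input = trees_for_summary_method(example_bins, example_trees, weighted=False)
print(len(weighted_input), len(unweighted_input))
```

With weighting, the distribution of tree topologies seen by the summary method reflects the number of genes behind each supergene tree rather than the number of bins.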

Liu et al.

Although their simulation study only examined naive binning

We can definitively answer this question with respect to the first meaning of statistical consistency: as shown in

However, as shown in

But what about the second interpretation of statistical consistency, where the number of sites per locus is fixed, but the number of loci increases? Are pipelines that use weighted statistical binning, followed by a summary method such as MP-EST or ASTRAL, statistically consistent under this second meaning (i.e., the strong sense of statistical consistency)? Does the Roch and Steel result help shed any light on this issue? Before we can answer this, we need to understand the difference between unpartitioned and fully partitioned maximum likelihood analyses.

Suppose we are given multiple sequence alignments for

In a fully partitioned Jukes-Cantor maximum likelihood analysis, we no longer assume that all the sites evolve down a single Jukes-Cantor model tree; instead, we assume that the different parts of the concatenated alignment each evolve down their own model tree, and the only constraint we make is that the model trees for the different parts share the same tree topology. Hence, we allow the numeric model parameters (i.e., branch lengths) to differ between the different loci. Therefore, the result of a fully partitioned Jukes-Cantor maximum likelihood analysis on a concatenated alignment from

We will show that fully partitioned maximum likelihood analyses and unpartitioned maximum likelihood analyses have very different theoretical properties. First, consider how maximum likelihood evaluates a sequence alignment where the sequences for the different species are all identical (e.g., an input of 100 sequences each identical to AACATAG). It is easy to see that all tree topologies are equally good under Jukes-Cantor maximum likelihood, and so cannot be distinguished under the maximum likelihood criterion. We will refer to alignments of this type as "invariant alignments" and loci that have invariant alignments as "invariant loci". Now consider a concatenated alignment based on

Roch and Steel

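Consider an input where the first p loci are invariant and the (p+1)th locus has a variable sequence alignment. By the argument provided by Roch and Steel, as p increases, the tree returned by an unpartitioned maximum likelihood analysis converges to the maximum parsimony tree for the (p+1)th locus.

For the same data, but under a fully partitioned maximum likelihood analysis, the first p loci are invariant and so have no impact on the choice of tree topology; the result is determined entirely by the (p+1)th locus. Thus, for all values of p, the fully partitioned maximum likelihood analysis will always be the Jukes-Cantor maximum likelihood analysis of the (p+1)th locus. Thus, as p increases, the unpartitioned maximum likelihood analysis converges to the maximum parsimony analysis of the (p+1)th locus, but the fully partitioned maximum likelihood analysis will be the maximum likelihood analysis of the (p+1)th locus. Since Jukes-Cantor maximum likelihood and maximum parsimony are not identical methods (i.e., they can return different trees on the same dataset), this shows that fully partitioned and unpartitioned maximum likelihood analyses are different methods, and can return different trees.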
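The claim that invariant loci cannot influence a fully partitioned analysis can be checked numerically on a toy case. The sketch below computes Jukes-Cantor site likelihoods with Felsenstein's pruning algorithm on rooted four-taxon trees. The tree encoding, the function names, and the use of a single fixed branch length (rather than optimized branch lengths) are assumptions made only for illustration.

```python
import math

STATES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing state y after time t, starting at x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods of the data below `node`, one value per state.

    A leaf is a taxon name (str); an internal node is a tuple of
    (child, branch_length) pairs. `site` maps taxon name -> nucleotide.
    """
    if isinstance(node, str):
        return [1.0 if s == site[node] else 0.0 for s in STATES]
    result = [1.0] * len(STATES)
    for child, t in node:
        below = partial(child, site)
        for a, x in enumerate(STATES):
            result[a] *= sum(jc_prob(x, y, t) * below[b] for b, y in enumerate(STATES))
    return result


def site_likelihood(tree, site):
    """Likelihood of one site pattern under uniform root frequencies."""
    return sum(0.25 * p for p in partial(tree, site))


def four_taxon_tree(left, right, t):
    """Rooted tree ((x,y),(z,w)) in which every branch has length t."""
    def cherry(x, y):
        return ((x, t), (y, t))
    return ((cherry(*left), t), (cherry(*right), t))


TOPOLOGIES = {
    "ab|cd": (("a", "b"), ("c", "d")),
    "ac|bd": (("a", "c"), ("b", "d")),
    "ad|bc": (("a", "d"), ("b", "c")),
}

invariant_site = {"a": "A", "b": "A", "c": "A", "d": "A"}
variable_site = {"a": "A", "b": "A", "c": "C", "d": "C"}

for name, (left, right) in TOPOLOGIES.items():
    # The invariant locus gets its own branch lengths; length 0 gives likelihood 1/4.
    inv = site_likelihood(four_taxon_tree(left, right, 0.0), invariant_site)
    # The variable locus is scored at an arbitrary fixed branch length of 0.1.
    var = site_likelihood(four_taxon_tree(left, right, 0.1), variable_site)
    print(name, inv, var)
```

Because the probability that all leaves show the same nucleotide can never exceed the probability that a single leaf shows it (1/4 under Jukes-Cantor), each invariant site contributes the same maximum value to every topology when its locus has its own branch lengths. The ranking of topologies in this toy setting therefore comes only from the variable site.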

To summarize, the theorem by Roch and Steel established that an

This article has focused on what has been established theoretically for some standard methods for estimating species trees (i.e., concatenation using maximum likelihood and summary methods such as MP-EST and ASTRAL), as well as for the newer approach of weighted statistical binning, followed by a summary method. Because the term "statistical consistency" has been used in two different ways in the literature, we have summarized what is known under each meaning: the weaker sense where both parameters (sequence length per locus and number of loci) increase, and the stronger sense where only the number of loci increases but the sequence length per locus is bounded, perhaps by a constant. We have also clarified that Roch and Steel's theorem about concatenation using maximum likelihood being statistically inconsistent is restricted to unpartitioned maximum likelihood.
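The two meanings can be written compactly. The notation below is introduced only for illustration and is an informal reading of the definitions summarized above: $T^{*}$ denotes the true species tree and $\hat{T}_{m,k}$ the tree returned by a method given $m$ loci with $k$ sites per locus.

```latex
% Notation is an assumption of this sketch, not taken from the article.
% First (weaker) sense: both the number of loci m and the number of
% sites per locus k increase.
\Pr\bigl[\hat{T}_{m,k} = T^{*}\bigr] \longrightarrow 1
  \quad \text{as } m \to \infty \text{ and } k \to \infty .

% Second (stronger) sense: only the number of loci m increases, while the
% number of sites per locus is bounded (here held at a fixed value k).
\text{for every fixed } k:\qquad
\Pr\bigl[\hat{T}_{m,k} = T^{*}\bigr] \longrightarrow 1
  \quad \text{as } m \to \infty .
```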

So, what do we know about statistical consistency (for either sense) of standard techniques for estimating species trees in the presence of ILS? Methods that have been proven to be statistically consistent in the first sense include the standard summary methods (e.g., MP-EST, ASTRAL, the population tree from BUCKy

However, even for this weak sense of statistical consistency (where both the number of sites per locus and number of loci increase), the statistical consistency of many methods is still unknown. In particular, Roch and Steel's theorem does not establish statistical inconsistency for fully partitioned maximum likelihood. Because the loci can have different tree topologies under the multi-species coalescent model, it seems likely that maximum likelihood analyses, even if fully partitioned, will be found to be inconsistent. However, the proof for an unpartitioned analysis being inconsistent provided by Roch and Steel in

In terms of the second definition (and stronger sense) of statistical consistency, where only the number of loci increases but the number of sites per locus is bounded, very little has been established. In fact, the only established results are that unpartitioned maximum likelihood and unweighted statistical binning are both inconsistent. Finally, we do not know whether any of the standard summary methods, fully partitioned maximum likelihood analyses, or weighted statistical binning, are statistically consistent under this second definition.

In fact, proving that a method is statistically consistent or inconsistent under the second definition is very difficult; the only methods that have been proven to be statistically consistent under this definition (which constrains the number of sites per locus to be bounded) are explicitly designed to have this property, but all (to our knowledge) require some additional constraints (e.g., the same constant rate of evolution across all loci

In other words, the major phylogenomic estimation methods in common use that are designed to address incomplete lineage sorting -- summary methods such as MP-EST, ASTRAL, and the population tree from BUCKy, and co-estimation methods such as *BEAST -- have been proven to be statistically consistent only in the first sense, where both the number of loci and number of sites increase. None of them have been proven to be consistent in the second sense, where the sequence lengths per locus are bounded. Similarly, the weighted statistical binning method is statistically consistent in this first sense, but it is not known if it is statistically consistent in the second sense.

It is also clear that Roch and Steel's theorem (as well as the observations provided by their proof) is limited to unpartitioned maximum likelihood. Therefore, it cannot help us understand the theoretical properties of fully partitioned maximum likelihood, and hence also cannot help us understand the theoretical properties of methods that use fully partitioned maximum likelihood (such as weighted statistical binning). Therefore, any attempt to establish the statistical consistency or inconsistency of these methods will need to use other mathematical arguments.

In other words, the established theory regarding species tree estimation methods is limited. Yet, performance in practice (i.e., on simulated and biological data) suggests that many methods (including concatenation) provide good accuracy under some model conditions. Unfortunately, as discussed in

Finally, despite the interest in coalescent-based species tree estimation, the current methods are still in their infancy, and new methods will need to be developed in order to obtain highly accurate species trees under realistic conditions. While the focus of this article has been on the performance of concatenation analyses, and the implications for coalescent-based summary methods that use weighted statistical binning, alternative approaches have been developed that do not require the estimation of gene trees or supergene trees (e.g.,

We present the current status with respect to statistical consistency (of the first or second kind) of some standard phylogenomic estimation methods. The first column is for the first meaning of statistical consistency, which states that the species tree estimated by the method will converge to the true species tree as the number of loci and number of sites per locus both increase. The second column is for the second meaning, which states that the species tree estimated by the method will converge to the true species tree as the number of loci increases, even for a bounded number of sites per locus. We also cite the paper in which the theoretical result is established.

Method | Consistency - first kind | Consistency - second kind |
---|---|---|
MP-EST | YES | UNKNOWN |
ASTRAL | YES | UNKNOWN |
Unpartitioned concatenated maximum likelihood | NO | NO |
Fully partitioned maximum likelihood | UNKNOWN | UNKNOWN |
Unweighted statistical binning followed by consistent summary method (e.g., ASTRAL) | NO | NO |
Weighted statistical binning followed by consistent summary method (e.g., ASTRAL) | YES | UNKNOWN |
*BEAST | YES | UNKNOWN |

An outline of the proof of the main theorem is as follows: We show that the expected proportion of sites that are constant can be made arbitrarily large with low rates of evolution (the lower bounds are formalized in Claim 4) and that the empirical frequencies of site patterns are concentrated around the expected values (Claim 2). When there is a large enough number of invariable sites, it can be shown that likelihood scores and parsimony scores converge to the same answer (formalized in Claim 1). Thus trees that have a better parsimony score have a better likelihood under these scenarios. Therefore, it suffices to show that parsimony is not statistically consistent under arbitrarily low rates of evolution.

The authors have declared that no competing interests exist.

The author thanks the two anonymous reviewers, whose suggestions improved the manuscript.