Theoretically, we now know a great deal about phylogenetic estimation. We know, for example, that when sequences evolve with only substitutions (and no indels) under models such as the General Time Reversible (GTR) model, accurate estimation of trees is guaranteed with high probability, provided that appropriate methods (such as maximum likelihood) are used and the sequences are “long enough”.

However, sequence evolution includes indels, and so a phylogenetic analysis typically begins with a set of sequences of different lengths. A phylogenetic analysis of such a dataset must therefore either align the sequences before applying a phylogenetic estimation method or co-estimate the alignment and tree at the same time. (A third approach estimates the tree without ever estimating an alignment, but is much less in vogue.)

In this paper, we discuss theoretical and empirical aspects of phylogeny estimation in the presence of indels, focusing on two-phase methods using maximum likelihood (henceforth referred to as "ML"). For completeness, we include a discussion of co-estimation methods.

Methods that estimate trees and alignments from unaligned sequences (i.e., co-estimation methods) have also been developed, including POY

Despite the progress in co-estimation methods, most phylogenies are estimated in two phases: first an alignment is estimated, and then a tree is computed on the alignment. Once the sequences are aligned, a decision must be made about how to treat the gaps in the alignment. In current practice, the following are the dominant gap-treatments:

(1) Remove all sites in which any gap appears, thus reducing to a gap-free alignment with fewer sites,

(2) Assign an additional “fictitious” state for each gap,

(3) Code all the gaps in the alignment, and treat the presence/absence of gaps as a binary character (complementing the original sequence alignment character data), and

(4) Treat the gaps as missing data.
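The first three treatments can be illustrated on a toy alignment. The following is a minimal sketch, not the implementation used by any particular software package: the function names, the `X` symbol for the fictitious state, and the toy sequences are all assumptions made for illustration, and gaps are written with an ASCII hyphen. The fourth treatment needs no recoding, since it only changes how the likelihood is computed.

```python
GAP = "-"  # assumption: gaps are written as ASCII hyphens


def remove_gapped_sites(alignment):
    """Treatment 1: keep only the columns in which no sequence has a gap."""
    names = list(alignment)
    columns = zip(*(alignment[n] for n in names))
    kept = [col for col in columns if GAP not in col]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


def gap_as_extra_state(alignment, symbol="X"):
    """Treatment 2: recode every gap as an additional (fictitious) state."""
    return {n: s.replace(GAP, symbol) for n, s in alignment.items()}


def simple_gap_coding(alignment):
    """Treatment 3, simplest variant: one binary presence/absence character
    for each column that contains at least one gap (1 = residue, 0 = gap)."""
    names = list(alignment)
    columns = list(zip(*(alignment[n] for n in names)))
    gapped = [col for col in columns if GAP in col]
    return {
        n: "".join("0" if col[i] == GAP else "1" for col in gapped)
        for i, n in enumerate(names)
    }


# Hypothetical toy alignment (three sequences, four sites).
alignment = {"s1": "AC-T", "s2": "A--T", "s3": "ACGT"}

print(remove_gapped_sites(alignment))  # only the gap-free columns survive
print(gap_as_extra_state(alignment))   # gaps become the extra state "X"
print(simple_gap_coding(alignment))    # binary characters appended to the data
```

On this toy input, the first treatment discards half of the sites, which shows how quickly the data can shrink when gaps are common.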

The first option of removing all sites with gaps has the advantage of being statistically consistent for models in which the substitution process and the mechanism producing insertions and deletions are independent. Its disadvantage is that it removes data, and it can leave sequence alignments with so few sites that they are phylogenetically uninformative. Although this is unlikely on small datasets, on large nucleotide datasets it can lead to empty alignments.

The second option of assigning an additional state for each gap presents other challenges. By definition, the true alignment represents positional homology, and hence two sequences that both have a nucleotide at a site constrain all nodes on the evolutionary path between them to also have a nucleotide at that site. The substitution process must therefore be extended carefully to handle the additional fictitious state properly, and ensuring that the resulting model makes phylogenetic sense is rather complicated. Finally, when the indel process can insert and delete several nucleotides at a time, the sites within the alignment no longer evolve independently, making this treatment invalid.

The third option, of coding each gap (maximal contiguous collection of dashes) in the alignment, includes a collection of techniques, ranging from extremely simple (create a single binary presence/absence character for each position that contains any gap) to very complex techniques. Software to automatically produce these additional binary characters encoding the gaps in a given alignment includes GapCoder

We begin with the Jukes-Cantor (JC) model of DNA sequence evolution

Letting the tree topology T and alignment A be fixed, we define ML_{JC} (A,T) := sup_{θ} P(A|(T,θ)). That is, ML_{JC}(A,T) is the supremum of all likelihood scores obtained for JC model trees with the same fixed tree topology T (but allowing θ to vary). Although the likelihood is continuous, the supremum may not actually be achieved for some θ because the range of values allowed for this parameter is not a closed set; that is, the supremum may be approached by parameter values θ for which some of the p(e) are arbitrarily close to the boundary values 0 or 3/4. Finally, we can talk about the JC ML tree for a fixed gap-free alignment A, as the tree T such that the likelihood ML_{JC}(A,T) is maximized over all trees.

Maximum likelihood (ML) inference of the parameter T under the JC model is defined as follows:

Input: sequence alignment A containing no gaps

Output: all model trees T such that ML_{JC}(A,T) is maximized

We now discuss ML analysis when the input sequence alignment contains gaps. As discussed above, there are several different ways of treating gaps, but the standard technique treats them as missing data. The same algorithmic approach is used as when the input alignment does not contain gaps, except that the likelihood calculation must also be able to work with gapped sequences.
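To make the missing-data treatment concrete, the following is a minimal sketch of the pruning (dynamic programming) computation of a JC likelihood on a fixed rooted tree, in which a gapped leaf is assigned conditional likelihood 1 for every state. The tree encoding (nested tuples of `(child, p)` pairs, where `p` plays the role of p(e) on the edge above `child`), the function names, the edge values, and the toy alignment are all assumptions made for illustration; production software works with log-likelihoods and optimizes the edge parameters.

```python
STATES = "ACGT"


def jc_transition(p):
    """JC transition matrix for an edge with substitution probability p."""
    return [[1.0 - p if i == j else p / 3.0 for j in range(4)] for i in range(4)]


def partial(node, site):
    """Probability of the data below `node`, for each possible state at `node`."""
    if isinstance(node, str):           # leaf, identified by its sequence name
        char = site[node]
        if char == "-":                 # gap treated as missing data:
            return [1.0] * 4            # every state is equally compatible
        return [1.0 if s == char else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, p in node:               # internal node: tuple of (child, p)
        matrix = jc_transition(p)
        below = partial(child, site)
        for i in range(4):
            result[i] *= sum(matrix[i][j] * below[j] for j in range(4))
    return result


def site_likelihood(tree, site):
    """Average over the uniform root distribution of the JC model."""
    return sum(0.25 * value for value in partial(tree, site))


def alignment_likelihood(tree, alignment):
    """Product of site likelihoods (sites are assumed independent)."""
    names = list(alignment)
    likelihood = 1.0
    for k in range(len(alignment[names[0]])):
        likelihood *= site_likelihood(tree, {n: alignment[n][k] for n in names})
    return likelihood


# Hypothetical three-leaf tree ((s1, s2), s3) with illustrative edge values.
tree = (((("s1", 0.1), ("s2", 0.1)), 0.05), ("s3", 0.2))
alignment = {"s1": "AC-", "s2": "A--", "s3": "ACG"}
print(alignment_likelihood(tree, alignment))
```

A site in which only one leaf is ungapped has likelihood 1/4 regardless of the tree or the edge parameters, because a gapped leaf contributes a factor of 1 at every node above it.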

It is well known that ML is statistically consistent for the GTR model (and hence for its submodels, such as Jukes-Cantor), when the data are generated by the GTR model and the optimization problem is solved exactly. However, we will show that ML, treating gaps as missing data, can be inconsistent under these conditions, when the input sequence alignments contain gaps. In other words, we will prove that ML can produce the wrong tree, under some conditions on the input sequence alignment.

Let S be a set of DNA sequences in an alignment A. We will say that the alignment A is "monotypic" if each site in A contains exactly one nucleotide type among its non-gap entries (that is, ignoring gaps, the site is all A’s, all C’s, all T’s, or all G’s). In particular, we do not allow any site to be entirely gapped. For example, the following is a monotypic alignment:

s1 = A − −

s2 = − C −

s3 = A − −

s4 = − − −

s5 = − − −

s6 = − − T

s7 = A − −
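The definition can be checked mechanically. The following is a minimal sketch, assuming that the alignment is given as a mapping from sequence names to equal-length strings and that gaps are written with an ASCII hyphen; the function name `is_monotypic` is introduced here for illustration only.

```python
def is_monotypic(alignment, gap="-"):
    """Return True if every site has exactly one nucleotide type among its
    non-gap entries (so a site that is entirely gapped is not allowed)."""
    rows = list(alignment.values())
    for column in zip(*rows):           # assumes all rows have equal length
        nucleotides = {char for char in column if char != gap}
        if len(nucleotides) != 1:
            return False
    return True


# The seven-sequence example above, with ASCII hyphens for gaps.
example = {
    "s1": "A--",
    "s2": "-C-",
    "s3": "A--",
    "s4": "---",
    "s5": "---",
    "s6": "--T",
    "s7": "A--",
}
print(is_monotypic(example))
```

Note that individual sequences (here s4 and s5) may be entirely gapped; the definition only constrains the sites.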

The following results were established in

For a monotypic alignment A and any tree topology T, ML_{JC}(A,T) = (1/4)^{R}, where R is the number of sites in the alignment A.

Proof. This result follows from Lemma 1 in (the supremum ML_{JC}(A,T) is realized by a sequence of parameter values in which all the p(e) converge to 0). For this setting of the substitution parameters, the probability of the data at a site is just the probability of picking the correct state for that site, which is 1/4 under the JC model. Because sites evolve independently, the ML score of the alignment, given the tree T, is (1/4)^{R}, where R is the number of sites in A.
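The value (1/4)^{R} can also be seen from a short direct argument, sketched here under the standard assumptions that the JC root distribution is uniform and that sites evolve independently. The symbols x_i (the single nucleotide type at site i) and \ell_i (any ungapped leaf at site i) are introduced for this sketch only.

```latex
% Site i of a monotypic alignment: every ungapped entry equals x_i, and at
% least one leaf \ell_i is ungapped. With gaps treated as missing data, the
% site likelihood is the marginal probability of the ungapped entries.
\begin{align*}
P(\text{site } i \mid T,\theta)
  &= P(\text{every ungapped leaf shows } x_i \mid T,\theta) \\
  &\le P(\text{leaf } \ell_i \text{ shows } x_i \mid T,\theta) = \tfrac{1}{4},
\end{align*}
% because the uniform distribution is stationary under JC. As all p(e) \to 0,
% every leaf agrees with the root with probability tending to 1, so the bound
% is approached at every site simultaneously:
\begin{align*}
ML_{JC}(A,T)
  = \sup_{\theta} \prod_{i=1}^{R} P(\text{site } i \mid T,\theta)
  = \left(\tfrac{1}{4}\right)^{R}.
\end{align*}
```

Nothing in this argument depends on the topology T, which is why every tree attains the same score.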

Proof. This result follows from Theorem 2 in

This theorem indicates a potential problem with treating gaps as missing data. If the mechanism generating the data has a high probability of producing aligned sequences that are monotypic for some parameter values, then it will be difficult to reliably infer the underlying phylogenetic tree if the gaps are treated merely as missing data rather than features of the data that are informative about the path that evolution has taken. More specifically, for those models of evolution for which monotypic alignments have non-zero probability, ML, treating gaps as missing data, may not be statistically consistent.

Theorem 1 shows that treating gaps as missing data can result in meaningless phylogenetic estimates, even when the data are analyzed under maximum likelihood for the correct model: in the extreme case in which the substitution probabilities are all zero, all trees are equally good solutions to maximum likelihood. In other words, what this theoretical result shows is that under an extreme condition in which substitution probabilities are zero,

We now compare this observation from

In contrast to this theoretical result, we consider the performance in practice of statistically-based methods such as ML and Bayesian MCMC. Simulation studies when sequences evolve without indels have shown that these statistically-based methods produce highly accurate trees, typically better than trees estimated using maximum parsimony or distance-based methods (see, for example,

Of particular interest here, however, is the observation in these studies that even when analyzing the true alignment, the error rate for ML trees increases with the rate of evolution (see, for example,

The theorem in this paper holds under the assumption of all substitution probabilities being zero (the case where only indels but no substitutions occur). Thus, this theoretical result can be criticized as being applicable only to a biologically unrealistic case. A careful reader could therefore ask "Is maximum likelihood, treating gaps as missing data, provably statistically consistent for model conditions with substitutions and not just indels?" The answer is that there are no published theoretical results establishing statistical consistency for maximum likelihood when gaps are treated as missing data. However, as was shown in

These results add to the growing literature about theoretical guarantees (or lack thereof) in phylogenetic analysis. Unfortunately, what we now know is that theoretical guarantees for phylogeny estimation have only been established for very restrictive conditions: indel-free evolution (so that alignment is not an issue) with well-behaved site substitution models. This raises the real possibility that the standard likelihood-based methods of analysis (e.g., MrBayes

From an empirical viewpoint, multiple sequence alignment estimation on nucleotide datasets is difficult, especially on large datasets

Clearly, what is needed is the creation of statistically-based methods that treat both indels and substitutions in a statistically consistent manner, and that can run on large datasets (with at least hundreds but preferably thousands of sequences). Guarantees of statistical consistency do not necessarily yield good performance in practice, but they can lead to methods with good empirical performance (and have the potential to produce much more accurate results than statistically inconsistent methods). Therefore, an effort should be made to develop such methods, and to test them on both biological and simulated data, in order to evaluate their accuracy under realistic conditions. Until then, phylogenetic analyses can certainly be based upon standard two-phase approaches, but biologists should use these standard methods with caution, recognizing that even the best of the current two-phase methods do not have statistical guarantees.

The author has declared that no competing interests exist.

The author thanks Steve N. Evans for stimulating and helpful discussions, and the two anonymous referees for their comments.