Tree of Life – PLOS Currents Tree of Life
Tue, 21 Aug 2018 20:48:56 +0000

A Post-pleistocene Calibrated Mutation Rate from Insect Museum Specimens
Fri, 13 Jul 2018 14:00:33 +0000

Quantifying the age of recent species divergence events can be challenging in the absence of calibration points within many groups. The katydid species Neoconocephalus lyristes provides the opportunity to calibrate a post-Pleistocene, taxa specific mutation rate using a known biogeographic event, the Mohawk-Hudson Divide. DNA was extracted from pinned museum specimens of N. lyristes from both Midwest and Atlantic populations and the mitochondrial gene COI sequenced using primers designed from extant specimens. Coalescent analyses using both strict and relaxed molecular clock models were performed in BEAST v1.8.2. The assumption of a strict molecular clock could not be rejected in favor of the relaxed clock model as the distribution of the standard deviation of the clock rate strongly abutted zero. The strict molecular clock model resulted in an intraspecific calculated mutation rate of 14.4-17.3 %/myr, a rate substantially higher than the common rates of sequence evolution observed for insect mitochondrial DNA sequences. The rate, however, aligns closely with mutation rates estimated from other taxa with similarly recent lineage divergence times.



In recent years, many examples of rapid speciation and diversification occurring during the last glacial cycle (i.e., within 500 kyr BP 1), or even after the last glacial maximum (LGM, 19 kyr BP 2) have been described. Arguably, the most impressive examples of rapid diversification are the cichlid radiation events within the African Rift Valley, where a small number of founding species diversified into hundreds of species after the LGM 3,4,5. Other examples include the old world pea aphids 6, North American songbirds 7,8,9, and the threespine sticklebacks of British Columbia 10; in some cases, significant diversification arose in as little as 50 years 11.

The accurate timing of diversification events allows us to better understand the mechanisms leading to phenotypic diversification and/or speciation. Molecular clock techniques allow the timing of diversification events based on estimates of the rate of genetic mutations per unit time 12,13. Mutation rates are gene specific and can vary between lineages and through time within a lineage 14,15. Therefore, accurate dating using a molecular clock requires reliable calibration of the rate of sequence evolution for that particular group, time interval, and gene. Rates can be calibrated using nodes dated from fossils 16 and from biogeographic vicariance events 17,18.

Estimates of the rate of nucleotide evolution vary greatly with the age of the calibration point, with younger calibration points resulting in higher rate estimates 19,20,21. Fossils and most biogeographic events are ancient (millions of years old) and are appropriate for the dating of similarly ancient events. The few available rate estimates using very young calibration points (<200 kyr 22,23,24) suggest an exponential increase of estimated rates with decreasing calibration age 25,26,27; additional data are needed to support this pattern. The exponential pattern of estimates is likely an artifact of the estimation methods and does not reflect true differences in rates of nucleotide evolution 19. One reason for the small number of estimates for recent lineage divergences is that suitably recent calibration points are scarce 25, since these events are too recent to use fossil evidence.

Here we use a postglacial vicariance event to calibrate a lineage specific mutation rate for North American Neoconocephalus katydids. At a time following the LGM, water from the North American Great Lakes drained through the Mohawk-Hudson Outlet to the Atlantic coast 28. Wetland habitats formed within the Hudson and Mohawk Valleys, which allowed coastal plain species to expand their ranges into the wetlands surrounding the Great Lakes 29. The opening of the St. Lawrence Seaway (10,750-10,600 14C yr BP 30), diverted melt water and led to the drying of the wetlands in the Mohawk-Hudson outlet. This vicariance event left disjunct wetland habitats in the Midwest (mainly bogs and fens) and along the Atlantic Coast (bogs and marsh habitat). Such disjunct ranges matching this pattern are found in plant, reptile, amphibian, and insect species possessing a coastal plain affinity 29,31,32. Neoconocephalus lyristes is an example of such a habitat specialist, limited to bog and fen wetlands. The species’ described range follows the pattern of the Mohawk-Hudson Divide, with isolated populations in the Great Lakes area 31 and North Atlantic Coast (33 Fig. 1).


Fig. 1: Historic collection sites for N. lyristes overlaid with hypothesized range.

Sites are modified from 34,35 based on literature and collection records. The collection localities of museum samples used in this study are indicated in red.

The eleven North American Neoconocephalus katydid species possess markedly little genetic variation despite their high diversity of species-specific call patterns and may be an example of a recent species radiation 36,37. The accurate timing of this radiation will help identify the evolutionary mechanisms leading to rapid species diversification observed in this group. Neoconocephalus lyristes provides a unique opportunity among species of Neoconocephalus for the calibration of a post-Pleistocene mutation rate as gene flow between these two disjunct ranges likely ceased with the draining of the Mohawk-Hudson Outlet (10,750-10,600 14C yr BP 30). Here we sequenced mtDNA from museum specimens representing both populations and estimated an intraspecific mutation rate using a coalescent Bayesian method.


Over three years of searching previous collection sites we found only a single extant population of N. lyristes, in Cedar Bog Nature Preserve, Urbana, OH, USA. Due to the apparent local extinction of N. lyristes from most of its Midwest range and its entire Atlantic range, we used museum samples collected in the first half of the 20th century. We selected 18 dried N. lyristes specimens from the Hebard Collection at the Academy of Natural Sciences of Drexel University for DNA extraction and analysis. Specimens represent samples from both Atlantic Coastal and Midwest populations (Fig. 1), with collection dates ranging from 1905 to 1932. We used a non-destructive method for DNA extraction (modified from 38). A hind leg was removed and placed in a 1.5 ml microcentrifuge tube fully submerged in one ml of digestion buffer: 3 mM CaCl2, 2% sodium dodecyl sulphate (SDS), 40 mM dithiothreitol (DTT), 250 mg/ml proteinase K, 100 mM Tris buffer pH 8 and 100 mM NaCl (quantities represent final concentrations). Hind legs were incubated overnight (17-19 hrs.) at 55°C. Following digestion we removed the hind legs from the buffer and placed them in 100% EtOH for two hours to stop enzymatic activity. Extraction of DNA contained in the buffer was completed using the standard Qiagen DNeasy Blood & Tissue Kit (Qiagen Inc., Valencia, CA, USA) extraction method.

Amplification took place in a laboratory without prior exposure to DNA that could be amplified by primers used in this study. Polymerase chain reaction (PCR) prep was performed in a UV hood. All equipment and surfaces were sanitized with a 10% bleach solution and tools were sanitized in a UV Stratalinker 1800. For this study we designed six overlapping primer pairs (Appendix: Supplemental Table 1) around non-variable regions of the mitochondrial gene cytochrome oxidase I (COI). These primers were based upon extant N. lyristes, N. robustus, and N. bivocatus COI sequences and designed using the Primer3 39 plugin in Geneious v6.0.5 40. Each primer pair amplified approximately 150 bp; combined, they provide complete coverage of the 743 bp target region.
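The tiling of the six amplicons can be checked with a short script. The sketch below is illustrative only: it assumes that the number in each primer name from Supplemental Table 1 (for example lyF68 or lyR197) is a coordinate on the COI reference, which the text does not state explicitly, and the helper names are hypothetical.

```python
# Amplicon tiling check for the six primer pairs in Supplemental Table 1.
# ASSUMPTION: the integer in each primer name is a position on the COI
# reference sequence; this is inferred from the names, not stated in the text.
PRIMER_PAIRS = [
    ("lyF68", "lyR197"),
    ("lyF187", "lyR336"),
    ("lyF317", "lyR466"),
    ("lyF440", "lyR589"),
    ("lyF545", "lyR694"),
    ("lyF666", "lyR811"),
]


def name_to_position(primer_name: str) -> int:
    """Extract the integer embedded in a primer name such as 'lyF68'."""
    return int("".join(ch for ch in primer_name if ch.isdigit()))


def tiling_report(pairs):
    """Return (spans, overlaps, total_span) for consecutive amplicons."""
    spans = [(name_to_position(fwd), name_to_position(rev)) for fwd, rev in pairs]
    # Overlap = end of the previous amplicon minus start of the next one.
    overlaps = [
        prev_end - next_start
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]
    total_span = spans[-1][1] - spans[0][0]
    return spans, overlaps, total_span


spans, overlaps, total_span = tiling_report(PRIMER_PAIRS)
for pair, (start, end) in zip(PRIMER_PAIRS, spans):
    print(pair, "start-to-end distance:", end - start)
print("overlaps between consecutive amplicons:", overlaps)
print("first forward to last reverse position:", total_span)
```

Under this naming assumption every consecutive pair of amplicons overlaps, and the distance from the first forward to the last reverse position matches the 743 bp target region reported in the text.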

PCR amplification was performed on an Eppendorf Mastercycler gradient (Eppendorf-Brinkman Instruments Inc., Westbury, NY, USA) using Taq DNA polymerase (Platinum Taq, Invitrogen Inc., Carlsbad, CA, USA). All primers were used at a concentration of 10 mM. Thermocycling conditions for all six primer sets were as follows: hot start at 94°C for 2 min; 40 cycles of denaturation at 94°C for 30 sec, annealing at 56°C for 30 sec, and extension at 72°C for 40 sec; and a final extension at 72°C for 7 min. Amplified PCR products were prepared for sequencing using an ExoI/SAP enzymatic cleanup (2.75 μl 10x SAP buffer, 0.5 μl SAP, 0.25 μl ExoI per 20 μl of PCR product) incubated at 37°C for 30 min, followed by 80°C for 15 min to inactivate enzymes. Sequencing was performed at the DNA Core Facility, University of Missouri, Columbia, MO, USA on an ABI 3730 DNA Analyzer, using standard Big Dye Terminator cycle sequencing chemistry (Applied Biosystems, Foster City, CA, USA). Sequences were edited, aligned and trimmed in Geneious v6.0.5 41. We used a global alignment with free end gaps and a 70% similarity rule. Regions of sequence with high ambiguity were labeled as missing. One individual with greater than ten percent ambiguity (m017) was removed from the analysis. Individual m007 failed to amplify. We successfully sequenced COI from 16 individuals.
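For bench planning, the thermocycling profile and cleanup recipe given above reduce to simple arithmetic. The following is a minimal sketch using only the values stated in the text; the function names are hypothetical, and instrument ramp times are ignored, so the real run time would be longer.

```python
# Thermocycling hold times from the text, in seconds (ramp times ignored).
HOT_START_S = 2 * 60
CYCLE_STEPS_S = {"denaturation": 30, "annealing": 30, "extension": 40}
N_CYCLES = 40
FINAL_EXTENSION_S = 7 * 60

# ExoI/SAP cleanup volumes in microlitres per 20 ul of PCR product (from the text).
CLEANUP_PER_20_UL = {"10x SAP buffer": 2.75, "SAP": 0.5, "ExoI": 0.25}


def program_hold_time_s() -> int:
    """Total programmed hold time of the PCR profile, in seconds."""
    return HOT_START_S + N_CYCLES * sum(CYCLE_STEPS_S.values()) + FINAL_EXTENSION_S


def cleanup_volumes(pcr_product_ul: float) -> dict:
    """Scale the per-20-ul cleanup recipe to an arbitrary product volume."""
    factor = pcr_product_ul / 20.0
    return {reagent: vol * factor for reagent, vol in CLEANUP_PER_20_UL.items()}


print("hold time (min):", program_hold_time_s() / 60)
print("cleanup mix for 10 ul of product:", cleanup_volumes(10.0))
```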

We evaluated substitution models using jModelTest v0.1.1 42 and found GTR+G to be a suitable model. Phylogenetic analyses were conducted using a coalescent method as implemented in BEAST v1.8.2 43; XML input files were formatted using BEAUti v1.7.4 43. Our analysis assumed a constant population size for the coalescent inferences 44. We ran this analysis to convergence, performing ten runs of twenty million generations each, sampled every two thousand generations. We assessed convergence through visual inspection of posterior values among the ten runs in Tracer v1.5 45. This analysis was performed using both a strict 46 and a relaxed molecular clock model 13. The Midwest individuals were run both unconstrained and constrained to monophyly. The constrained run ensured that the age calibration point was assigned to the correct node in all trees 21. To evaluate the influence of the prior settings on the posterior samples, we repeated the analysis as above but without any sequence data.
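The chain settings above determine how many posterior samples are available before and after burn-in. A minimal sketch of that bookkeeping follows; the burn-in fraction is a hypothetical parameter, since the text does not report the value used.

```python
# MCMC settings reported in the text.
GENERATIONS_PER_RUN = 20_000_000
SAMPLE_EVERY = 2_000
N_RUNS = 10


def retained_samples(burnin_fraction: float) -> int:
    """Posterior samples kept across all runs after discarding burn-in.

    burnin_fraction is an ASSUMPTION for illustration; the source does not
    state how much of each chain was discarded.
    """
    if not 0.0 <= burnin_fraction < 1.0:
        raise ValueError("burnin_fraction must be in [0, 1)")
    per_run = GENERATIONS_PER_RUN // SAMPLE_EVERY
    kept_per_run = per_run - round(per_run * burnin_fraction)
    return kept_per_run * N_RUNS


print("samples per run:", GENERATIONS_PER_RUN // SAMPLE_EVERY)
print("combined samples, no burn-in:", retained_samples(0.0))
print("combined samples, 10% burn-in (hypothetical):", retained_samples(0.1))
```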

Using the radiocarbon date of 10,750±150 14C yr BP, the end of the 150-300 year period of steady melt water flow following the final large flood through the Hudson Valley at 10,900 14C yr BP 30, we calibrated the calendar age of the Mohawk-Hudson Divide. We performed the radiocarbon to calendar age conversion using the IntCal13 curve in OxCal v4.2 online 47. The age estimate was fixed to the highest likelihood value within the 95% confidence interval, yielding a calibrated date of 10,739.5 cal BP. Because the divide is a known biogeographic barrier, we allowed the node age prior probability of the Midwest clade to vary along a normal distribution, with the calibrated date as the mean age and a standard deviation of one thousand years. This allows for the possibility of lineage divergence prior to the biogeographic event, as well as the overestimation of the event’s age 48. The Euclidean mean and standard deviation priors were set to exponential with mean values of 10 and 0.3 respectively. Convergence of MCMC runs was visualized using Tracer v1.5 45 to ensure that all runs converged. With Tracer v1.5 we ascertained the average mutation rate between populations of N. lyristes based on the Mohawk-Hudson calibration. Runs were combined in LogCombiner v1.8.2 49 and a maximum clade credibility consensus tree was formed in TreeAnnotator v1.7.4 50.
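To see how much latitude the node age prior allows, the central 95% interval of a normal distribution with the stated mean and standard deviation can be computed directly. This is a sketch using Python's standard library, not output from BEAST; the variable names are illustrative.

```python
from statistics import NormalDist

# Node age prior as described in the text: normal, mean 10,739.5 cal BP,
# standard deviation 1,000 years.
node_age_prior = NormalDist(mu=10_739.5, sigma=1_000.0)

# Central 95% interval of the prior (2.5th and 97.5th percentiles).
lower = node_age_prior.inv_cdf(0.025)
upper = node_age_prior.inv_cdf(0.975)

print(f"95% prior interval: {lower:.1f} to {upper:.1f} cal BP")
```

The interval spans roughly two thousand years on either side of the calibrated date, which is the room the prior leaves for divergence before the event or for error in the event's age.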


We successfully sequenced 743 bp of the mitochondrial gene COI from sixteen individuals (5 from Midwest and 11 from Atlantic Coast populations, Table 1). Sequence similarity among the 16 samples ranged from 92.0% to 99.8%. We found the greatest diversity within the Atlantic population. The Midwest clade fell within the larger clade of Atlantic Coast N. lyristes (Fig. 2). This observation is congruent with the hypothesized biogeographic history of the species, in which the Midwest populations diverged from the ancestral Atlantic population.
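Pairwise similarity values of this kind can be computed from aligned sequences. The sketch below skips any site where either sequence carries an ambiguity code or gap; this handling is an assumption for illustration, and the authors' software may treat such sites differently. The two short sequences are made-up examples, not data from the study.

```python
UNAMBIGUOUS = set("ACGT")


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned sites where both bases are A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip ambiguity codes (N, R, Y, ...) and gaps in either sequence.
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS:
            compared += 1
            if a == b:
                matches += 1
    if compared == 0:
        raise ValueError("no unambiguous sites to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments: one ambiguous site, one mismatch.
example_a = "ACGTNACGTA"
example_b = "ACGTAACGTT"
print(f"{percent_identity(example_a, example_b):.1f}% identity")
```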

Table 1: Museum specimen list

N. lyristes pinned specimens obtained from the Hebard collection at the Academy of Natural Sciences of Drexel University. Included is all relevant data from specimen label, as well as the ambiguities present in final sequences. (*) denotes samples removed from analysis for failed amplification or excess ambiguity.

Study reference # | Locality | Collection Date | Collected/ID by | Ambiguities (#/743 bp)
m001 | Cape May Court House, NJ | 1914 | Hebard | 0
m002 | Cape May Court House, NJ | 1914 | Hebard | 1
m003 | Cape May Court House, NJ | 1914 | Hebard | 0
m004 | Cape May Court House, NJ | 1914 | Hebard | 1
m005 | Cape May Court House, NJ | 1914 | Hebard | 0
m006 | Cape May Court House, NJ | 1914 | Hebard | 24
m007 | Cedar Swamp, OH | 1929 | Unknown | N/A*
m008 | Cedar Swamp, OH | 1932 | Edward S. Thomas | 25
m009 | Cedar Swamp, OH | 1932 | Edward S. Thomas | 0
m010 | Chicago, IL (Beach IL) | 1906 | Unknown | 2
m011 | Chicago, IL (S. of Jackson Park) | 1905 | Unknown | 18
m012 | Chicago, IL (S. of Jackson Park) | 1905 | Unknown | 15
m013 | Whitesbog, NJ | 1923 | Det. D.C. Rentz (1974) | 0
m014 | Whitesbog, NJ | 1923 | H. Fox | 0
m015 | Whitesbog, NJ | 1923 | Unknown | 0
m016 | Whitesbog, NJ | 1923 | H. Fox | 0
m017 | Whitesbog, NJ | 1923 | Unknown | 103*
m018 | Whitesbog, NJ | 1923 | Unknown | 1


Fig. 2: Consensus tree from coalescence analysis using a strict molecular clock model and Midwest clade constrained to monophyly

Nodes possessing <0.85 posterior probabilities were collapsed. Red star represents the Mohawk-Hudson Divide, with the prior of the node age set to a normal distribution with a mean age of 10,739.5 cal BP. The Midwest specimen m010 fell outside of the Midwest clade prior to constraining the group to monophyly.

Using the unconstrained coalescence model, four out of five Midwestern individuals formed a clade within the larger clade of Atlantic Coast N. lyristes. One Midwest individual (m010) grouped among Atlantic individuals (Appendix: Supplemental Fig. 1). In order to prevent the age calibration point from being assigned to the wrong nodes in some trees we constrained the Midwest clade to monophyly in further analyses 21. The resulting constrained consensus tree (Fig. 2) is congruent with the hypothesized biogeographic history of the species, with the Midwest population diverging from the ancestral Atlantic population.

Using a relaxed clock model, we obtained branch-specific mutation rates between 14.4 and 37.5 %/myr from the consensus of the ten runs. The average rate of mutation among branches was 15.8 %/myr, ranging from 15.7 to 15.9 %/myr among the ten independent runs. The distribution of the standard deviation of the clock rate strongly abutted zero when the relaxed molecular clock was used (Fig. 3). This indicates support for a constant rate of substitution, so a strict molecular clock was used 51. The strict molecular clock analysis produced a tree (Fig. 2) with a similar, but not identical, topology to the relaxed clock’s consensus tree. The relationships among Midwest animals and their relationship to the Atlantic clade remained unchanged, with minor changes in the relationships among Atlantic individuals. The strict consensus tree, with the Midwest clade constrained to monophyly, possessed an average mutation rate of 17.3 %/myr, with mutation rates between the ten runs. Predictably, a slower rate of 14.4 %/myr was obtained when the same analysis was run with individual m010 removed. These two rates, while diverging slightly, both indicate a rate of mutation significantly faster than most reported in the literature 19,20.
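To put these rates on a more tangible scale, they can be converted into the number of substitutions expected along one lineage across the 743 bp fragment since the calibrated divergence date. The sketch below assumes that the %/myr values are per-lineage substitution rates per site, which the text does not state explicitly; the function name is hypothetical.

```python
def expected_substitutions(rate_percent_per_myr: float, years: float, sites: int) -> float:
    """Expected substitutions along one lineage.

    ASSUMPTION: rate_percent_per_myr is a per-lineage rate, expressed as a
    percentage of sites substituted per million years.
    """
    rate_per_site_per_year = rate_percent_per_myr / 100.0 / 1_000_000
    return rate_per_site_per_year * years * sites


CALIBRATED_AGE_YEARS = 10_739.5  # calibrated date from the text
FRAGMENT_LENGTH_BP = 743         # sequenced COI fragment

for rate in (14.4, 17.3):
    n = expected_substitutions(rate, CALIBRATED_AGE_YEARS, FRAGMENT_LENGTH_BP)
    print(f"{rate} %/myr -> about {n:.2f} substitutions per lineage")
```

Under this assumption both rates correspond to only one or two substitutions per lineage over the fragment, which illustrates how few changes underlie a rate estimate at such a recent timescale.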


Fig. 3: Distribution of the standard deviation rates from relaxed clock analysis

Includes data from ten combined runs (twenty million generations sampled every two thousand trees) using a relaxed molecular clock model. Units for the clock rate are in substitutions per site per million years. The distribution strongly abuts zero, indicating support for a strict molecular clock 51.


Here, we focused on the calibration of an intraspecific mutation rate at a very recent timescale. Evolutionary rates calibrated across divergent timescales can be markedly different 25, with younger calibration dates (<1 Mya) showing substantially higher estimates of divergence rates than older lineages 19,20. In mammals, for example, the age of the calibration dates shows a negative relationship with estimates for molecular evolutionary rates 19,52. Metastudies utilizing insect mtDNA rates estimated from both inter- and intraspecific calibrations show a similar pattern to that observed in mammals 20,21. Available data suggest an exponential increase of estimated rates 22,23,24 with decreasing calibration age (Fig. 4). The exponential pattern of estimates is likely an artifact of the estimation methods and does not reflect true differences in rates of nucleotide evolution 19.


Fig. 4: Estimates of evolutionary rates (%/myr) plotted against calibration age (myr)

The black data points were obtained directly from 20,21. The red point represents the mutation rate estimate from this study. Note both axes are in log scale.

The sequence variation among populations has two components, fixed substitutions between them that have accumulated since divergence and current within population variation 19. The fixed substitutions among lineages represent the actual evolutionary divergence. Most of the within population genetic variation will be removed over time by genetic drift and selection and therefore only a small fraction will ultimately contribute to lineage divergence 26,53. For young divergence times, the within species variation will contribute a much larger fraction of the total nucleotide differences, as only few fixed substitutions have accumulated. For ancient divergence times, in contrast, the same amount of within population variation would be dwarfed by fixed substitutions accumulated since divergence 27,54. Thus, short calibration times should lead to gross overestimations of evolutionary divergence rates, while ancient calibration times (>1 Mya 54) should provide much more realistic estimates.
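The argument above can be made concrete with a toy calculation. This is a deliberately simplified sketch, not the model used in this study or in the cited work: it treats observed divergence as a fixed amount of ancestral within-population variation plus substitutions accumulating at a constant long-term rate, and it uses hypothetical parameter values throughout.

```python
def apparent_rate(long_term_rate: float, ancestral_diversity: float, age_myr: float) -> float:
    """Apparent per-lineage rate (%/myr) under a toy model.

    observed divergence (%) = ancestral_diversity + 2 * long_term_rate * age
    apparent rate (%/myr)   = observed divergence / (2 * age)
    """
    if age_myr <= 0:
        raise ValueError("age_myr must be positive")
    observed = ancestral_diversity + 2.0 * long_term_rate * age_myr
    return observed / (2.0 * age_myr)


# HYPOTHETICAL values chosen only to illustrate the shape of the relationship.
LONG_TERM_RATE = 1.0        # %/myr
ANCESTRAL_DIVERSITY = 0.3   # % pairwise difference present at the split

for age in (0.01, 0.1, 1.0, 10.0):
    rate = apparent_rate(LONG_TERM_RATE, ANCESTRAL_DIVERSITY, age)
    print(f"calibration age {age:>5} myr -> apparent rate {rate:.2f} %/myr")
```

In this toy model the apparent rate equals the long-term rate plus a term that grows as the calibration age shrinks, so the same amount of standing variation inflates young calibrations strongly and old calibrations hardly at all.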

Insect mtDNA rates of mutation:

We estimated a mutation rate for COI at 14.4-17.3 %/myr, using the strict molecular clock model and a very recent calibration time. Our estimate is significantly higher than the commonly assumed mtDNA mutation rate of 1.15 %/myr 55,56, which was based on much older divergence times. Estimates of substitution rates calibrated from the age of the Mid-Aegean Trench (9-12 Mya), for example, within an insect model range from 1.0-2.7 %/myr dependent upon application of various substitution and clock models 21. Our estimated rate of 14.4-17.3 %/myr, on the other hand, aligns with estimates found using similarly recent calibration dates (Fig. 4). A mutation rate of 19.2 %/myr was estimated for the European butterfly Parnassius mnemosyne, calibrated with a vicariance event at 10,000 years BP 23. Intermediate calibration dates resulted in intermediate estimates of evolutionary rates. The mutation rate for the North American ground beetle (Nebria) was estimated at 5.7 %/myr, using a vicariance event dated to 150,000 years BP 22. Our estimates for N. lyristes fit into the exponential pattern previously described (Fig. 4). Thus, this study agrees with the slower estimates of Orthopteran mtDNA sequence evolution and may serve as an internal calibration point for Neoconocephalus diversification.
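The size of the gap between these estimates and the commonly assumed rate is easy to express as a fold difference. The snippet below uses only the rates quoted in this section; the dictionary keys are informal labels, not names used in the cited studies.

```python
REFERENCE_RATE = 1.15  # commonly assumed mtDNA rate, %/myr (from the text)

# Rates quoted in this section, in %/myr.
ESTIMATES = {
    "N. lyristes, m010 removed": 14.4,
    "N. lyristes, all samples": 17.3,
    "Parnassius mnemosyne": 19.2,
    "Nebria": 5.7,
}

fold_difference = {
    label: rate / REFERENCE_RATE for label, rate in ESTIMATES.items()
}

for label, fold in fold_difference.items():
    print(f"{label}: {fold:.1f}x the reference rate")
```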

The high mutation rate inferred from our data set could be due to problems in the mathematical models underlying the molecular clock. This seems unlikely, since both fixed clock and relaxed clock models lead to nearly identical results. Furthermore, the close fit of our data point to data from previous work conducted with a variety of methods 20,21 suggests that our particular methods were not responsible for the high estimated mutation rate.

As more evidence accumulates supporting the occurrence of postglacial species diversification, the need for appropriate tools for timing these events grows. This will in part include the utilization of young vicariance events for molecular clock calibration. Geologically supported postglacial vicariance events within North America are lacking for many taxonomic groups 57,58. The Mohawk-Hudson Divide provides a recent biogeographic vicariance event, with the potential for the calibration of lineage-specific mutation rates for a number of plant, amphibian, reptile, and insect groups.

Use of museum samples:

The use of ancient DNA (aDNA) samples can be hindered by severe degradation 38,59. In this study two of the eighteen samples could not be sequenced successfully. These two samples were neither the oldest nor from the same locality. Severe degradation of DNA, beyond that in the other sixteen samples, or a mismatch in primer binding sites may account for failed amplification (Table 1). In those samples that were sequenced successfully, ambiguities were high. While this is likely due to the degraded nature of aDNA, the coamplification of nuclear pseudogenes could also lead to such ambiguities. The amplification of relatively short (150 bp) segments increases the likelihood of amplifying pseudogenes that are not amplified when targeting longer sequences. Nuclear pseudogenes of COI, while not noted in Neoconocephalus, have been found in other Orthopterans 60. We found no internal stop codons within our COI sequences. As internal stop codons are common in pseudogenes, it is unlikely that our data are affected by their presence. Our primers were developed from COI reference sequences from three extant Neoconocephalus species. Amplification would therefore not be affected by sequence degradation, as may be the case if primers are developed from the aDNA itself. One concern with the use of aDNA is sequence degradation, with post-mortem C-to-U deamination 61 reflected in a higher than expected percentage of thymine in the resulting sequences. We compared the percentages of nucleotides in sequences from our museum specimens and from live-collected N. lyristes, which were almost identical (e.g., GC content 35.1% vs. 36.2%). This indicates that sequence degradation has minor, if any, influence on our results.
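Both checks described above, the scan for in-frame stop codons and the base composition comparison, are straightforward to script. The sketch below is illustrative rather than the authors' procedure: it uses the stop codons of the invertebrate mitochondrial genetic code (NCBI translation table 5), the reading frame offset is an assumption that must be set for the actual alignment, and the example sequences are made up.

```python
# Stop codons in the invertebrate mitochondrial genetic code (NCBI table 5).
# In this code TGA encodes tryptophan and AGA/AGG encode serine.
INVERT_MT_STOPS = {"TAA", "TAG"}


def in_frame_stop_positions(seq: str, frame: int = 0) -> list:
    """Return 0-based start positions of in-frame stop codons.

    frame (0, 1 or 2) is the offset of the first complete codon; choosing it
    correctly for a gene fragment is left to the user (ASSUMPTION).
    """
    seq = seq.upper()
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in INVERT_MT_STOPS
    ]


def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        raise ValueError("no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical fragments for demonstration only.
print(in_frame_stop_positions("ATGTAAGGA"))  # contains an in-frame TAA
print(in_frame_stop_positions("ATGTGAGGA"))  # TGA is not a stop in table 5
print(f"GC content: {gc_content('GGCCAATTNN'):.1f}%")
```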

In this study museum specimens replaced extant samples, necessitated by the rarity, or likely local extinction, of N. lyristes from most of its known range. Despite the additional challenges of working with museum specimens, aDNA can replace extant specimens when collection is either not possible because of extinction 61,62 or broad resampling is untenable 63,64.

With advances in the amplification of ancient DNA 65,66,67, museum collections are also opening up areas of study that are not possible with extant data alone 62,67,68. Ancient DNA can be utilized in the calibration of molecular clocks through dating tip ages 69. Samples from multiple time points can provide additional information about the genetic and demographic changes in groups over time 70. Ancient DNA has been used in the reconstruction and timing of many mammal groups 52,70,71,72, but remains underutilized in the timing of insect lineages despite the abundance of specimens in museums. Several of the problems associated with the use of aDNA can be overcome by next generation sequencing (NGS). For example, NGS has the capability to target short and degraded DNA samples 73. NGS also allows for the sequencing of whole genomes from aDNA 64,74,75 and less destructive sampling techniques from museum samples 76,77,78.

Data Availability

All supplementary data are available at figshare:

Nucleotide sequences are available at GenBank: Accession numbers KU881748 – KU881763

Competing Interests

The authors have declared that no competing interests exist.

Corresponding Author

Gideon Ney:


Due to Tree of Life Editor unavailability, a member of the PLOS ONE Editorial Board, Wolfgang Arthofer (University of Innsbruck, Austria), rendered the final decision on this paper.


Supplemental Table 1: Table of primers designed for amplification of N. lyristes COI sequences

Primers were designed from reference sequences of extant N. lyristes, N. bivocatus, and N. robustus.

Primer name Primer sequence
lyF68 (forward) 5’-GGA ATT GCA CAT GCT GGA GC-3’
lyR197 (reverse) 5’-GTG ATA TTC CTG GGG CAC GT-3’
lyF187 (forward) 5’-ACG TGC CCC AGG AAT ATC AC-3’
lyR336 (reverse) 5’-CCG GCA GGA TCA AAG AAT GA-3’
lyF317 (forward) 5’-TCA TTC TTT GAT CCT GCC GGA-3’
lyR466 (reverse) 5’-GGC TTC CTT TTT CCC ACT TTC T-3’
lyF440 (forward) 5’-AGT CAA GAA AGT GGR AAA AAG GA-3’
lyR589 (reverse) 5’-AGC TGA AGT AAA ATA RGC TCG TG-3’
lyF545 (forward) 5’-ACA GTA GGA ATG GAT GTT GAT ACA C-3’
lyR694 (reverse) 5’-GCC TAG AGC TCA TAA AAG GGA AG-3’
lyF666 (forward) 5’-ACA GTC CTT CCC TTT TAT GAG CT-3’
lyR811 (reverse) 5’-AGA TAG AAC ATA ATG GAA ATG GGC T-3’


Supplemental Fig. 1: Consensus tree using a strict molecular clock and the Midwest clade unconstrained. Node values represent posterior probabilities calculated from eighteen million total trees. Red taxa represent Midwest samples and black taxa Atlantic samples.

Chloroplast Genome Sequence Annotation of Dendrobium nobile (Asparagales: Orchidaceae), an Endangered Medicinal Orchid from Northeast India
Fri, 19 May 2017 10:05:46 +0000

Orchidaceae constitutes one of the largest families of angiosperms. Owing to the significance of orchids in plant biology, market needs and current sustainable technology levels, basic research on the biology of orchids and their applications in the orchid industry is increasing. Although chloroplast (cp) genomes continue to be evolutionarily informative, there is very limited information available on orchid chloroplast genomes in public repositories. Here, we report the complete cp genome sequence of Dendrobium nobile from Northeast India (Orchidaceae, Asparagales), bearing the GenBank accession number KX377961, which will provide valuable information for future research on orchid genomics and evolution, as well as the medicinal value of orchids. Phylogenetic analyses using Bayesian methods recovered a monophyletic grouping of all Dendrobium species (D. nobile, D. huoshanense, D. officinale, D. pendulum, D. strongylanthum and D. chrysotoxum). The relationships recovered among the representative orchid species from the four subfamilies, i.e., Cypripedioideae, Epidendroideae, Orchidoideae and Vanilloideae, were consistent within the family Orchidaceae.



Chloroplasts are specialized intracellular organelles in which photosynthesis occurs, and they originated via an endosymbiotic relationship with cyanobacteria. Though most chloroplast genes are believed to have been transferred to the nucleus during evolution, their genomes have maintained fairly conserved structures and gene contents throughout their evolutionary lineage 1. While the complete plastid genomes of tobacco and liverworts were the first to be determined, as of April 2017, the complete chloroplast (cp) genomes of 1161 GenBank accessions from land plants have been reported in the National Centre for Biotechnology Information (NCBI) Organelle Genome Resources. Typically, the cp genomic size of land plants varies between 120 and 220 kb, with a pair of inverted repeats (IRs) that separate the genome into a large single copy (LSC) region and a small single copy (SSC) region 3. Variation in the size of cp genomes among plant lineages is generally observed in the mutable IR region. The cp genomes of land plants usually contain approximately 110–120 genes, which mostly participate in photosynthesis or gene expression 4,5. Information regarding gene content, polycistronic transcription units, sequence insertion or deletion, transition or transversion, and nucleotide repeats may help resolve evolutionary relationships in the kingdom Plantae (Viridiplantae) 6,7,8.

The uniparental inheritance and non-recombinant nature of cp genomes make them potentially useful tools for inferring evolutionary and ancient phylogenetic relationships. Additionally, cp DNA data are easily obtainable from bulk DNA extractions, as multiple copies of these genomes are present in each cell, and they exhibit considerable sequence and structural variations within and between plant species 9. Several chloroplast markers have been harnessed for phylogenetic analyses and taxonomic systematizations 10. Complete cp genome sequencing and annotation provides important sequence information about suitable plastid DNA markers for the classification of plant species.

The advent of high-throughput sequencing has recently facilitated rapid advancements in the field of chloroplast genomics. Previously, such studies were performed on isolated chloroplasts, in which the entire chloroplast genome was amplified by rolling circle amplification. Recent progress in next-generation sequencing (NGS) technologies has paved the way for faster and cheaper methods to sequence organellar genomes 11,12. There are multiple NGS platforms available for organelle genome sequencing, of which the Illumina platform is widely used, as it emphasizes the use of rolling circle amplification products 12. At the time of writing this manuscript, cp genomes from 66 orchid species have been reported, according to NCBI Organellar genome records ([Organism: exp]%20NOT%20genome[PROP]%20AND%20non_genome[filter]). In the subfamily Epidendroideae, the genus Dendrobium contains nearly 1200 species, yet the cp DNA sequences of only six Dendrobium species have been determined and deposited in GenBank.

Dendrobium nobile Lindl. is one of the most widespread species within the genus Dendrobium and is among the best-known plants used in traditional Chinese medicine. D. nobile is an epiphytic or lithophytic plant native to the Indian subcontinent (Northeast India (including Assam and Sikkim), Bangladesh, Nepal and Bhutan), southern China (including Tibet), and Indochina (Myanmar, Thailand, Laos and Vietnam). Various parts of the plant are widely used as analgesics, antipyretics, and tonics to nourish the stomach in traditional medicine. Denbinobin, a natural product isolated from D. nobile, has a unique phenanthrene quinone skeleton and displays antitumor and anti-inflammatory activities 13.

D. nobile is listed in the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) Appendix II, indicating that it is vulnerable to extinction if proper measures are not taken to control anthropogenic activities. In light of the need to preserve native species and use their chloroplast genome information in various molecular biology studies, we determined the cp genome sequence of D. nobile (GenBank Accession number: KX377961) from India and report it for the first time.

Another unpublished D. nobile chloroplast genome from China bearing the accession number KT591465 is archived in GenBank. A comparative study between these two genomes and the genome of the closely related species Dendrobium officinale reveals many evolutionary hotspots in the plastid genome, which are very useful for developing molecular markers to distinguish other Dendrobium species. Comparative chloroplast genomic analysis will be very useful for marker-assisted commercial breeding programs, chloroplast genetic engineering and systematic biology studies of Dendrobium species within the family Orchidaceae.


Plant materials and DNA extraction

D. nobile plant specimens were collected from the National Research Centre for Orchids at Gangtok, Sikkim (Northeast India). A voucher specimen was deposited at the Department of Botany, North-Eastern Hill University, Shillong, India. Young fresh leaves were taken from orchids that were grown in a greenhouse. High molecular weight DNA was extracted using a modified CTAB buffer and purified using a column (Qiagen GmbH, Germany). The DNA quantity was assessed using a spectrophotometer (Nanodrop 2000, Thermo Fisher, USA), and the DNA integrity was assessed by gel electrophoresis (using 0.8% Agarose, Sigma, USA). The quality and quantity of the genomic DNA were assessed using agarose gel electrophoresis, Nanodrop and Qubit detection methods.

NGS Library preparation

Both paired-end and mate-pair libraries were prepared. Approximately 4 µg of Qubit-quantified DNA was used for tagmentation. The tagmented sample was cleaned using AMPURE XP beads (Beckman Coulter #A63881) and subjected to strand displacement. The 2-5 kb and 8-13 kb strand-displaced samples were size-selected using gel electrophoresis and subjected to circularization overnight. The linear DNA was then digested using DNA exonuclease. The circularized DNA molecules were sheared using a Covaris microTUBE with the S220 system (Covaris, Inc., Woburn, MA, USA) to obtain fragments of 300 to 1000 bp. The sheared DNA was subjected to M280 Streptavidin beads (Thermo Fisher Scientific, Waltham, MA) containing biotinylated junction adapters for purification. End repair, A-tailing, and adapter ligations were performed on the bead-DNA complex. The adapter-ligated sample was amplified for 15 PCR cycles (denaturation at 98˚C for 30 sec, cycling (98˚C for 10 sec, 65˚C for 30 sec and 72˚C for 30 sec) and a final extension at 72˚C for 5 min) and cleaned up using AMPURE XP beads (Beckman Coulter #A63881). The prepared library was quantified using Qubit and validated for quality by running an aliquot on D1000 ScreenTape (Agilent). The libraries were amplified for 9-11 cycles according to the Nextflex protocol and were quantified and sequenced on an Illumina NextSeq500 (Illumina, USA).

Data processing

The data quality of the Illumina WGS raw reads (151 bp x 2) was assessed using the FastQC tool. The raw reads were pre-processed using Perl scripts for adapter clipping and low-quality filtering. Reference Dendrobium chloroplast genomes (D. officinale, Accession: KJ862886; D. huoshanense, Accession: NC_028430 and D. strongylanthum, Accession: NC_027691) were retrieved from the NCBI-GenBank database. Adapter-clipped and low-quality trimmed processed reads were aligned to Dendrobium cp genomes using the BWA-MEM algorithm14 with the default parameter settings. Aligned reads were extracted, and k-mer–based de-novo assembly was achieved using the SPAdes-3.6.0 program (k-mer sizes of 21, 33, 55 and 77) with the default parameter setting. The quality of the assembled genome was assessed by read alignment and genome coverage calculations using Samtools and Bcftools15.
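The map-then-assemble workflow described above can be sketched as a list of command lines. This is only an illustration under stated assumptions: the file names, the output directory layout, the use of `samtools view -F 4` to keep mapped reads, and the concatenation of the three reference genomes into one FASTA file are all hypothetical choices, since the text does not say how the aligned reads were extracted. The function only builds the commands and does not run them.

```python
from typing import List


def build_commands(ref_fasta: str, r1: str, r2: str, outdir: str) -> List[List[str]]:
    """Return command lines for a map-then-assemble sketch (nothing is executed).

    ref_fasta is assumed to hold the reference chloroplast genomes in one
    FASTA file; r1/r2 are the pre-processed paired-end reads.
    """
    sam = f"{outdir}/aln.sam"
    bam = f"{outdir}/mapped.bam"
    m1 = f"{outdir}/cp_R1.fastq"
    m2 = f"{outdir}/cp_R2.fastq"
    return [
        # Index the reference, then align with BWA-MEM default settings.
        ["bwa", "index", ref_fasta],
        ["bwa", "mem", "-o", sam, ref_fasta, r1, r2],
        # Keep only mapped reads (-F 4 drops records flagged as unmapped).
        ["samtools", "view", "-b", "-F", "4", "-o", bam, sam],
        # Convert the retained alignments back to paired FASTQ files.
        ["samtools", "fastq", "-1", m1, "-2", m2, bam],
        # Assemble with the k-mer sizes quoted in the text.
        ["spades.py", "-1", m1, "-2", m2, "-k", "21,33,55,77",
         "-o", f"{outdir}/spades"],
    ]


if __name__ == "__main__":
    for cmd in build_commands("dendrobium_refs.fasta", "R1.fastq", "R2.fastq", "out"):
        print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run` once the tools are installed; keeping the commands as data makes the sketch easy to inspect before anything is executed.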

Genome annotation

Protein-coding and ribosomal RNA genes were annotated using the Basic Local Alignment Search Tool (BLAST; BLASTN, PHI-BLAST and BLASTX)16, CGAP17 and DOGMA18. The boundaries of each annotated gene were manually determined by comparison with orthologous genes from other orchid cp genomes. The tRNA genes were predicted using ARAGORN19. The circular genome maps were drawn using OGDRAW, followed by manual modification20. The sequencing data and gene annotations were submitted to GenBank, and their accession number (KX377961) was acquired.

Nucleotide sequence diversity and phylogenetic application of cp genomes in the family Orchidaceae

Whole chloroplast genome datasets from plant species representing four subfamilies (Cypripedioideae, Epidendroideae, Orchidoideae and Vanilloideae) in the family Orchidaceae were aligned, and a comparative genome rearrangement was separately drawn using MAUVE21 with default parameters. The combined matrix was utilized for the phylogenomic analyses. A Bayesian inference (BI) tree was constructed using two independent Metropolis-coupled Markov chain Monte Carlo (MCMCMC) runs using MrBayes 3.2.622. Two parallel Bayesian analyses with four chains each, partitioned by the DNA region, were run for 50 million generations. Trees were constructed using a general time reversible substitution model (GTR) with substitution rates estimated by MrBayes 3.2.6. MCMCMC sampling was performed with two incrementally heated chains that were combinatorially run for 100,000 generations. Coalescence of the substitution rate and the rate model parameters were also examined. The average standard deviation of the split frequencies was calculated, and generations were added until the standard deviation value was below 0.01. Posterior probabilities indicated clade support (100%). A cladogram with the posterior probabilities for each clade and a phylogram with mean branch lengths were generated and subsequently examined using FigTree v1.3.1. The phylogenetic groupings in the family Orchidaceae were colour-coded based on sub-family groupings.


The complete cp genome of D. nobile was determined from a whole genome project initiative of the same species using paired-end and mate-pair data from Illumina HiSeq (2 × 150 bp) and Illumina NextSeq500 (2 × 75 bp), respectively. Further, the aligned Illumina reads were separated and assembled using CLC Main Workbench version 7.7.1. The Indian monoisolate D. nobile chloroplast genome is circular; is 152,018 bp long; and has a 62.53% A + T content, and it is similar to the previously determined chloroplast DNA of D. nobile from China (accession number KT591465; 153,660 bp; 62.50%). A total of 134 unique genes were successfully annotated, including 79 protein-coding genes, 8 rRNA genes, 7 pseudogenes and 38 tRNA genes (Fig. 1). The cp genome also comprised LSC (1-84,944; 84,944 bp), SSC (111,230-125,733; 14,504 bp), and two IR regions of 26,285 bp each: IRA (84,945-111,229) and IRB (125,734-152,018). A total of 20 genes were duplicated in the IR, 81 genes in the LSC and 11 genes in the SSC regions of the genome. In total, there were 12 genes {rps16, atpF, rpoC1, ycf3, rps12 (2), clpP, petB, rpl2 (2), ndhB (2)} with introns.
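As a quick consistency check on the quadripartite structure reported above, the region coordinates can be verified to tile the 152,018 bp circle without gaps or overlaps. The sketch below uses only the coordinates given in the text; the function names are illustrative.

```python
# Region coordinates (1-based, inclusive) as reported for KX377961.
REGIONS = {
    "LSC": (1, 84_944),
    "IRA": (84_945, 111_229),
    "SSC": (111_230, 125_733),
    "IRB": (125_734, 152_018),
}
GENOME_LENGTH = 152_018


def region_lengths(regions):
    """Length of each region from inclusive start/end coordinates."""
    return {name: end - start + 1 for name, (start, end) in regions.items()}


def tiles_genome(regions, genome_length):
    """True if the regions cover 1..genome_length with no gap or overlap."""
    spans = sorted(regions.values())
    if spans[0][0] != 1 or spans[-1][1] != genome_length:
        return False
    return all(nxt[0] == prev[1] + 1 for prev, nxt in zip(spans, spans[1:]))


if __name__ == "__main__":
    lengths = region_lengths(REGIONS)
    print(lengths)
    print("sum:", sum(lengths.values()), "tiles genome:", tiles_genome(REGIONS, GENOME_LENGTH))
```

The two inverted repeats come out at 26,285 bp each, and the four regions sum to the full genome length.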

Fig. 1: Annotated gene map of Dendrobium nobile chloroplast genome. Genes shown inside the circle are transcribed clockwise, and genes shown outside the circle are transcribed counterclockwise. Genes belonging to different functional groups are colour-coded. A pair of inverted repeats (IRA and IRB) separates the genome into large single-copy (LSC) and small single-copy (SSC) regions in the inner circle; ψ indicates an ndh pseudogene.

Chloroplast sequences have been used in deep phylogenetic analyses because of their low substitution rates23. Complete chloroplast genomes have often been utilized to resolve relationships among angiosperms24. However, whole-genome sequencing using sparse sampling can result in long-branch artefacts and incorrect evolutionary reconstructions. Previous studies on the complete chloroplast genomes of D. officinale and six other orchid species have highlighted deep phylogenomic analyses based on their chloroplast genome organization25. Luo et al. (2014) achieved consistent results for the relationships among Phalaenopsis (Aeridinae), Cymbidium (Cymbidiinae), Dendrobium (Dendrobiinae), Oncidium and Erycina (Oncidiinae) within the subfamily Epidendroideae. Their analysis revealed structural similarities, but differences in IR/SSC junctions and ndh genes were also reported, which can be used as markers to identify species of orchids25.

In the present study, we sequenced the entire D. nobile cp genome from plant material collected from Northeast India. Twenty-three whole cp genome sequences spanning four subfamilies in the family Orchidaceae were also retrieved from GenBank.

A comparative whole genome rearrangement showing homologous alignment segments was drawn using all 23 known cp DNA sequences. Each genome is displayed horizontally, and homologous segments are shown as coloured blocks that are connected across genomes (Fig. 2A). Inverted segments in the genomes are represented by blocks with a downward shift relative to the reference genome. Sequence regions covered by a coloured block are entirely collinear and homologous among the genomes. The breakpoints of genome rearrangements are represented by boundaries of coloured blocks unless a sequence has been gained or lost in the breakpoint region. A Bayesian phylogenetic tree with 1000 bootstrap values was computed.

In our analysis of the relationships among the four subfamilies within the family Orchidaceae, all of the representative orchid species from each subfamily were well resolved into monophyletic clades. The analysis further exhibited a congruent monophyletic grouping of Dendrobium species (D. nobile, D. huoshanense, D. officinale, D. pendulum, D. strongylanthum and D. chrysotoxum) in the overall tree topology (Fig. 2B).


Fig. 2: Bayesian phylogenetic tree of the family Orchidaceae, reconstructed based on whole chloroplast genomes. A. Whole chloroplast genome alignment of 23 orchid species representing four subfamilies, Cypripedioideae, Epidendroideae, Orchidoideae and Vanilloideae, in the family Orchidaceae. Each genome panel contains the name, the sequence coordinates for the genome, and a single black horizontal centre line with coloured block outlines appearing above and below. Each block is homologous and internally free of genomic rearrangement and is connected by lines to similarly coloured blocks depicting comparative homology across genomes. B. Phylogenetic trees from the whole genome alignment matrix yielded monophyletic groupings of the four orchid subfamilies. Posterior probability/bootstrap values are indicated near the nodes, which are quite supportive of the overall tree topology. The taxon-wise GenBank accession numbers for the published sequences are as follows: Corallorhiza mertensiana (NC_025661.1), Corallorhiza odontorhiza (NC_025664.1), Cymbidium ensifolium (NC_028525.1), Cymbidium kanran (NC_029711.1), Cymbidium sinense (NC_021430.1), Cypripedium formosanum (NC_026772.1), Cypripedium japonicum (NC_027227.1), Dendrobium chrysotoxum (NC_028549.1), Dendrobium huoshanense (NC_028430.1), Dendrobium nobile (NC_029456.1), Dendrobium officinale (NC_024019.1), Dendrobium pendulum (NC_029705.1), Dendrobium strongylanthum (NC_027691.1), Dendrobium nobile NE_India (KX377961), Goodyera fumata (NC_026773.1), Goodyera procera (NC_029363.1), Masdevallia coccinea (NC_026541.1), Masdevallia picturata (NC_026777.1), Paphiopedilum armeniacum (NC_026779.1), Paphiopedilum niveum (NC_026776.1), Phalaenopsis aphrodite (NC_007499.1), Phalaenopsis equestris (NC_017609.1), and Vanilla planifolia (NC_026778.1).

Nucleotide sequence diversity and SSR analysis

A detailed comparative account of the nucleotide sequence statistics is outlined in Tables 1-5. This description includes atomic counts for single and double stranded DNA; nucleotide counts; A, T, G, C content and frequencies; codon usage and frequency; and nucleotide count in the codon positions of D. nobile genomes reported from Northeast India and China and of the reference genome D. officinale. Nucleotide sequence diversity, codon statistics from coding regions, AT/GC counts and simple sequence repeats (SSR) were computed for selected Dendrobium species using CLC Workbench 7.7.1. The complete D. nobile Indian isolate cp DNA (KX377961) comprises 152,018 bases, whereas the Chinese isolate (LC011413) is 150,793 bp in length. The nucleotide variation for A and T and the G-C percentage were higher in the Indian isolate than in the Chinese isolate. A total of 79 CDS, 22 exons, 132 genes, 2 repeat regions, 8 rRNAs and 38 tRNAs were reported from the cp DNA from the Indian isolate, whereas the Chinese isolate comprises 73 CDS, 127 genes, 2 repeat regions, 8 rRNAs and 39 tRNAs (Tables 1-5). In total, 46, 42, and 46 SSRs were identified for D. officinale, D. huoshanense and D. nobile, respectively. The SSR analysis also revealed types of nucleotide repeats (mono-, di- and tetra-) in the cp DNA of Dendrobium species, which can serve as potential barcode markers for species discrimination (Fig. 3).
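To illustrate what detecting mono-, di- and tetranucleotide SSRs involves, a minimal regular-expression scan is sketched below. It is not the CLC Workbench procedure used in the study: the minimum repeat numbers (10, 5 and 3) are assumed thresholds chosen for illustration, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed minimum repeat counts per motif length (illustrative only).
MIN_REPEATS: Dict[int, int] = {1: 10, 2: 5, 4: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq: str, min_repeats: Dict[int, int] = MIN_REPEATS) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for each SSR found."""
    seq = seq.upper()
    found = []
    for k, n in min_repeats.items():
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip e.g. "AA" or "ATAT", which belong to shorter motif classes.
            if is_primitive(motif):
                found.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(found)


if __name__ == "__main__":
    demo = "GG" + "A" * 10 + "CC" + "AT" * 5 + "GG"
    print(find_ssrs(demo))
```

On the toy sequence in the usage example the scan reports one mononucleotide run and one dinucleotide run; real SSR counts depend strongly on the thresholds chosen.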


Fig. 3: SSR analysis of the cp DNA from four Dendrobium species. Three types of SSRs (mono-, di- and tetranucleotide repeats) were revealed in the cp DNA of D. nobile, D. officinale, D. huoshanense and D. strongylanthum. Different colour codes represent SSRs in different Dendrobium species.

Table 1. Sequence information for the cp DNA of Dendrobium nobile isolates from India and China.

Information LC011413 KX377961
Sequence type DNA DNA
Length 150,793bp circular 152,018bp circular
Organism Dendrobium nobile Dendrobium nobile
Name LC011413 KX377961
Description Dendrobium nobile chloroplast DNA, complete genome. Dendrobium nobile chloroplast, complete genome.
Modification Date 02-AUG-2016 10-AUG-2016
Weight (single-stranded) 46.552 MDa 46.932 MDa
Weight (double-stranded) 93.156 MDa 93.912 MDa

Table 2: Nucleotide counts for the cp DNA of Dendrobium nobile isolates from India and China.

Nucleotide LC011413 KX377961
Adenine (A) 46,152 46,576
Cytosine (C) 28,739 28,853
Guanine (G) 27,891 28,039
Thymine (T) 48,009 48,381
Purine (R) 0 31
Pyrimidine (Y) 0 20
Adenine or cytosine (M) 0 56
Guanine or thymine (K) 0 31
Cytosine or guanine (S) 0 3
Adenine or thymine (W) 0 28
Not adenine (B) 0 0
Not cytosine (D) 0 0
Not guanine (H) 0 0
Not thymine (V) 0 0
Any nucleotide (N) 2 0
C + G 56,630 56,892
A + T 94,161 94,957

Table 3: Annotated genomic features for the cp DNA of Dendrobium nobile isolates from India and China.

Feature type LC011413 KX377961
CDS 73 79
Exon 0 22
Gene 127 132
Misc. feature 9 2
Repeat region 2 2
Source 1 1
rRNA 8 8
tRNA 39 38

Table 4: Nucleotide frequencies in the cp DNA of Indian and Chinese Dendrobium nobile isolates.

Nucleotide LC011413 KX377961
Adenine (A) 0.306 0.306
Cytosine (C) 0.191 0.190
Guanine (G) 0.185 0.184
Thymine (T) 0.318 0.318
Purine (R) 0.000 0.000
Pyrimidine (Y) 0.000 0.000
Adenine or cytosine (M) 0.000 0.000
C + G 0.376 0.374
A + T 0.624 0.625
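The frequencies in Table 4 follow directly from the counts in Table 2 divided by the sequence lengths in Table 1. The short sketch below reproduces that arithmetic; the dictionary layout and function name are illustrative.

```python
# Counts from Table 2 and lengths from Table 1.
COUNTS = {
    "LC011413": {"A": 46152, "C": 28739, "G": 27891, "T": 48009, "length": 150793},
    "KX377961": {"A": 46576, "C": 28853, "G": 28039, "T": 48381, "length": 152018},
}


def frequencies(counts):
    """Per-base, C + G and A + T frequencies, rounded to three decimals."""
    n = counts["length"]
    freqs = {base: round(counts[base] / n, 3) for base in "ACGT"}
    freqs["C+G"] = round((counts["C"] + counts["G"]) / n, 3)
    freqs["A+T"] = round((counts["A"] + counts["T"]) / n, 3)
    return freqs


if __name__ == "__main__":
    for accession, counts in COUNTS.items():
        print(accession, frequencies(counts))
```

Because the denominator is the full sequence length, ambiguous bases (R, Y, M, K, S, W and N in Table 2) are included in the total but not in any of the A, C, G or T frequencies.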

Table 5: Codon usage frequency in the cp DNA of Dendrobium nobile isolates from India and China.

Codon LC011413 KX377961
AAA 0.04 0.04
AAC 0.01 0.01
AAG 0.02 0.02
AAT 0.04 0.04
ACA 0.02 0.02
ACC 0.01 0.01
ACG 0.01 0.01
ACT 0.02 0.02
AGA 0.02 0.02
AGC 0.00 0.00
AGG 0.01 0.01
AGT 0.02 0.02
ATA 0.02 0.02
ATC 0.02 0.02
ATG 0.02 0.02
ATT 0.04 0.04
CAA 0.03 0.03
CAC 0.01 0.01
CAG 0.01 0.01
CAT 0.02 0.02
CCA 0.01 0.01
CCC 0.01 0.01
CCG 0.00 0.00
CCT 0.02 0.02
CGA 0.01 0.01
CGC 0.00 0.00
CGG 0.00 0.00
CGT 0.01 0.01
CTA 0.01 0.02
CTC 0.01 0.01
CTG 0.01 0.01
CTT 0.02 0.02
GAA 0.04 0.04
GAC 0.01 0.01
GAG 0.01 0.01
GAT 0.03 0.03
GCA 0.02 0.02
GCC 0.01 0.01
GCG 0.01 0.01
GCT 0.02 0.02
GGA 0.03 0.03
GGC 0.01 0.01
GGG 0.01 0.01
GGT 0.02 0.02
GTA 0.02 0.02
GTC 0.01 0.01
GTG 0.01 0.01
GTT 0.02 0.02
TAA 0.00 0.00
TAC 0.01 0.01
TAG 0.00 0.00
TAT 0.03 0.03
TCA 0.02 0.02
TCC 0.01 0.01
TCG 0.01 0.01
TCT 0.02 0.02
TGA 0.00 0.00
TGC 0.00 0.00
TGG 0.02 0.02
TGT 0.01 0.01
TTA 0.03 0.03
TTC 0.02 0.02
TTG 0.02 0.02
TTT 0.03 0.03
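Codon usage frequencies such as those in Table 5 are obtained by counting in-frame triplets across the coding sequences and dividing by the total number of codons. The sketch below shows the calculation for a single coding sequence; it is an illustration, not the CLC Workbench implementation, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Relative frequency of each in-frame codon in one coding sequence.

    Trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    print(codon_frequencies("ATGAAAAAATAA"))
```

For a whole genome, the same counting would be pooled over all annotated CDS features before dividing.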


Chloroplast genome sequences serve as valuable assets in herbal medicine. As many medicinal plants are highly endangered and rare in nature, little information is available to confirm their identity. Bio-barcodes derived from chloroplast genomes are quite useful for identifying species varieties and resources. Functional and structural annotations of gene content, gene organization, and chloroplast genome sequences have been used as important markers in systematic research. This report determined the complete chloroplast genome sequence of D. nobile from Northeast India. We found structural similarities among the taxa of different subfamilies of Orchidaceae and also identified differences in IR/SSC junctions and ndh genes from other orchid plastid genomes. Our phylogenetic analyses reveal that D. nobile is most closely related to D. officinale and D. pendulum. In addition, relationships among subfamilies in the family Orchidaceae were resolved in the present study. The highly divergent genes in the cp genomes identified in this study can be used as potential molecular markers in phylogenetic analyses. In summary, the results of this study will further our understanding of the evolution, molecular biology and genetic improvement of the medicinal orchid D. nobile.

Data Availability

The entire chloroplast sequence is available from NCBI GenBank with the accession number KX377961.

Competing Interests

The authors have declared that no competing interests exist.

Corresponding Authors

Devendra Kumar Biswal, Bioinformatics Centre, North-Eastern Hill University, Shillong- 793022, Meghalaya, India


Pramod Tandon, Biotech Park, Kursi Road, Lucknow- 226021, Uttar Pradesh, India


Red Algal Phylogenomics Provides a Robust Framework for Inferring Evolution of Key Metabolic Pathways Fri, 02 Dec 2016 17:00:49 +0000 Red algae comprise an anciently diverged, species-rich phylum with morphologies that span unicells to large seaweeds. Here, leveraging a rich red algal genome and transcriptome dataset, we used 298 single-copy orthologous nuclear genes from 15 red algal species to erect a robust multi-gene phylogeny of Rhodophyta. This tree places red seaweeds (Bangiophyceae and Florideophyceae) at the base of the mesophilic red algae with the remaining non-seaweed mesophilic lineages forming a well-supported sister group. The early divergence of seaweeds contrasts with the evolution of multicellular land plants and brown algae that are nested among multiple, unicellular or filamentous sister lineages. Using this novel perspective on red algal evolution, we studied the evolution of the pathways for isoprenoid biosynthesis. This analysis revealed losses of the mevalonate pathway on at least three separate occasions in lineages that contain Cyanidioschyzon, Porphyridium, and Chondrus. Our results establish a framework for in-depth studies of the origin and evolution of genes and metabolic pathways in Rhodophyta.



Red algae (Rhodophyta) form a monophyletic lineage containing ~7,000 described species1 that exhibit a wide variety of morphological and ultra-structural forms and have complex reproductive strategies. The Cyanidiophytina (e.g., Galdieria and Cyanidioschyzon) include extremophiles that thrive in volcanic areas surrounding hot springs. In contrast, their mesophilic sisters (Rhodophytina) are globally distributed from freshwater environments to open oceans and deep oceans (>200 m) to the intertidal zone. Despite a highly reduced core gene inventory that resulted from an ancient phase of genome reduction2, red algae represent one of the few eukaryotic lineages that have evolved complex multicellularity3, typified by red seaweeds such as Porphyra and Gracilaria. Red seaweeds account for ~95% of known red algal taxa and are important sources of agricultural (e.g., nori) and industrial products (e.g., agar and carrageenan).

Studies of red algal systematics have largely relied on a handful of plastid and nuclear genes4,5,6,7,8 and focused on a broad diversity of lineages within the Florideophyceae9,10. One of the major findings of these analyses is the separation of Cyanidiophytina from the Rhodophytina4,8. Whereas Cyanidiophytina contain only two known families (Cyanidiaceae and Galdieraceae), Rhodophytina encompass six classes: Bangiophyceae, Florideophyceae, Compsopogonophyceae, Porphyridiophyceae, Rhodellophyceae, and Stylonematophyceae4. Excluding the well-supported monophyly of Bangiophyceae and Florideophyceae (hereafter, collectively referred to as red seaweeds), relationships among the remaining classes remain controversial4,5,6,7,8.

In this study, we applied phylogenomics to a rich genomic dataset to erect a robust red algal tree of life. The dataset encompassed 298 orthologous nuclear-encoded genes from all major red algal lineages. In contrast to previous phylogenies built using smaller datasets4,5,6,7,8, our results support a fundamental, ancient split between red seaweeds and non-seaweed lineages among mesophiles. We discuss the implication of this new perspective on red algal phylogeny to understanding the evolution of multicellularity in red algae, and demonstrate the utility of this phylogenetic framework to infer the evolution of the mevalonate (MVA) pathway of isoprenoid biosynthesis in Rhodophyta.


Construction of single-copy orthologous gene alignments

We created a local database that includes protein sequences (translated from EST or predicted from genome sequences) from 15 red algal taxa2,11,12,13,14,15,16 (Fig. 1A) and 3 green algae17,18,19 (Table 1, Appendix 1). This database, after removing short sequences with length <100 amino acids, was used in a self-query using BLASTp (e-value cutoff = 1e-5). The BLASTp search output was used as input for OrthoMCL20 with parameters (evalueExponentCutoff = -10, percentMatchCutoff = 40, inflation = 1.5) to construct orthologous gene families. Among these families, we searched for single-copy orthologous genes with one gene copy per species (allowing missing data in up to three red algae and in no more than one green alga). For each orthologous gene family, the corresponding sequences were retrieved and aligned using MUSCLE (version 3.8.31) under the default settings21. The alignments were then trimmed using TrimAl (version 1.4)22 in automated mode (-automated) and then ‘polished’ with T-COFFEE (version 9.03)23 to remove poorly aligned residues (conservation score ≤ 5) among the aligned blocks. A total of 298 single-gene alignments (length >150 amino acids and with ≥15 sequences) were retained for downstream analysis.
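The single-copy selection rule described above (at most one gene per species, missing from no more than three red algae and no more than one green alga) can be expressed as a simple filter. The sketch below is an illustration only: the data structure (species name mapped to a list of gene identifiers) and the function name are assumptions, not the authors' script.

```python
from typing import Dict, List, Set


def is_single_copy_family(
    family: Dict[str, List[str]],
    red_species: Set[str],
    green_species: Set[str],
    max_missing_red: int = 3,
    max_missing_green: int = 1,
) -> bool:
    """Apply the single-copy rule to one orthologous gene family.

    family maps each species to the gene identifiers it contributes.
    """
    # Any species with more than one copy disqualifies the family.
    if any(len(genes) > 1 for genes in family.values()):
        return False
    present = {species for species, genes in family.items() if len(genes) == 1}
    missing_red = len(red_species - present)
    missing_green = len(green_species - present)
    return missing_red <= max_missing_red and missing_green <= max_missing_green


if __name__ == "__main__":
    red = {f"red{i}" for i in range(15)}
    green = {"green1", "green2", "green3"}
    complete = {species: [f"{species}_g1"] for species in red | green}
    print(is_single_copy_family(complete, red, green))
```

The length (>150 amino acids) and sequence-number (≥15) criteria applied to the final alignments would be separate filters downstream of this step.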

Construction of the multi-protein phylogeny

The 298 single-copy gene alignments were concatenated into a super-protein alignment. A phylogenetic tree was inferred using Phylobayes (version 3.3)24 under the CAT model25. This is a mixture model that takes into consideration site-specific evolutionary properties (such as rate and profile) within the alignment25. The CAT model generally fits data significantly better than one-matrix models such as LG and WAG. We set up two chains that ran in parallel and assessed convergence periodically using ‘bpcomp’ and ‘tracecomp’ functions. Convergence assessments were done based on sampled trees (taking one from every 10 trees) following burn-in equal to 20% of the entire length of the chain. The two chains were stopped when they converged to an acceptable level that allows good qualitative measurement of the posterior consensus. According to the user instructions, an acceptable run corresponds to a maximum discrepancy across all bipartitions (maxdiff <0.3) when monitored with the ‘bpcomp’ function, and statistical discrepancies <0.3 and effective sizes >50 for all parameters when monitored with the ‘tracecomp’ function.
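The sampling scheme and stopping rule described above amount to simple arithmetic on the chain. The sketch below shows which tree indices would be used after a 20% burn-in with one tree in every 10, and how the quoted thresholds could be applied to values read from the convergence reports. It does not parse any program output; the function and argument names are hypothetical.

```python
from typing import List, Sequence


def post_burnin_samples(n_trees: int, burnin_percent: int = 20, every: int = 10) -> List[int]:
    """Indices of trees kept after discarding the burn-in and thinning."""
    start = n_trees * burnin_percent // 100
    return list(range(start, n_trees, every))


def run_is_acceptable(
    maxdiff: float,
    discrepancies: Sequence[float],
    effective_sizes: Sequence[float],
    max_discrepancy: float = 0.3,
    min_effective_size: float = 50,
) -> bool:
    """Apply the thresholds quoted in the text to already-extracted values."""
    return (
        maxdiff < max_discrepancy
        and all(d < max_discrepancy for d in discrepancies)
        and all(e > min_effective_size for e in effective_sizes)
    )


if __name__ == "__main__":
    print(post_burnin_samples(100))
    print(run_is_acceptable(0.12, [0.05, 0.21], [310, 95]))
```

Because the burn-in is a fraction of the current chain length, the set of retained trees shifts each time the chains are extended and reassessed.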

Construction of coalescence model-based species trees

We built a coalescence model-based red algal phylogeny with 100 replicates following Seo’s method26. For each replicate, we randomly sampled 298 genes with replacement. For each sampled alignment, a pseudo-alignment was generated by random sampling of amino acid sites from the original alignment with replacement. Only one green algal sequence (as outgroup) was retained with the priority given to Chlamydomonas reinhardtii, Chlorella variabilis, and Micromonas RCC299 in order. A ML tree was built for each pseudo-alignment using IQtree (version 0.9.6)27 under the best-fit amino acid evolutionary model selected on the fly (-m TEST). The resulting 298 ML trees, rooted with outgroup sequences, were then used for maximum pseudo-likelihood tree construction using MP-EST (version 1.4) under the default settings28. This procedure was repeated 100 times and the resulting 100 maximum pseudo-likelihood trees were summarized under majority rule using the ‘consense’ function in Phylip.
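The two-stage resampling described above (genes with replacement, then sites within each sampled gene with replacement) can be sketched as follows. This is an illustration of the resampling step only, under the assumption that each alignment is held as a mapping from taxon name to aligned sequence; tree building, outgroup selection and species-tree estimation are not shown, and the names are hypothetical.

```python
import random
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned sequence (equal lengths)


def resample_sites(alignment: Alignment, rng: random.Random) -> Alignment:
    """Draw alignment columns with replacement to form a pseudo-alignment."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        taxon: "".join(seq[c] for c in columns)
        for taxon, seq in alignment.items()
    }


def bootstrap_replicate(alignments: List[Alignment], rng: random.Random) -> List[Alignment]:
    """One replicate: resample genes, then resample sites within each gene."""
    sampled_genes = [rng.choice(alignments) for _ in alignments]
    return [resample_sites(gene, rng) for gene in sampled_genes]


if __name__ == "__main__":
    rng = random.Random(1)
    genes = [
        {"taxonA": "ACDE", "taxonB": "ACDF"},
        {"taxonA": "GGH", "taxonB": "GGK"},
    ]
    for pseudo in bootstrap_replicate(genes, rng):
        print(pseudo)
```

Sampling whole columns keeps the residues of all taxa at a site together, so each pseudo-alignment preserves the site patterns of its source gene while varying their frequencies.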

Phylogenetic analyses of mevalonate pathway genes

Galdieria sulphuraria proteins in the MVA pathway (module identifier: M00095) and the methylerythritol phosphate (MEP) pathway (module identifier: M00096) were retrieved from the KEGG database29 and used as queries against NCBI (nr) using BLASTp (e-value cutoff = 1e-5). The representative sequences (e.g., from Metazoa and land plants) were retrieved from Genbank. Local BLASTp searches (e-value cutoff = 1e-5) were done against the red algal database described above, followed by retrieval of the significant hits. Galdieria phlegrea sequences were retrieved from the previous study30. Each G. sulphuraria query, together with the homologs (from Genbank and our local database), was aligned using MUSCLE (version 3.8.31)21 under the default settings. The alignment was trimmed using trimAl (version 1.4)22 in the automated mode (-automated). ML trees were built using IQtree (version 0.9.6)27 under the best amino acid evolutionary model selected using (-m TEST) with branch support values estimated using 1,500 ultrafast bootstrap replicates (-bb 1500). The resulting trees were manually inspected. Distantly related paralogs (if any) were removed manually and the trees were rebuilt following the procedure described above.

Validation of gene losses in red algae

We searched for the G. sulphuraria MVA and MEP proteins in a red algal nucleotide database (genome and transcriptome) using tBLASTn (e-value cutoff = 1e-5). The homologous protein sequences translated from the hit nucleotide sequences were collected using an in-house script. For each query sequence, the translated proteins corresponding to the three top bit-score hits and the three top-identity (query-hit identity) hits were incorporated into the single-gene ML tree building procedure described above. Distantly related homologs were manually identified and removed. Red algal sequences that were monophyletic with G. sulphuraria were considered to be orthologs.
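The candidate-selection step above (the three top bit-score hits plus the three top-identity hits per query) can be sketched as a small helper. The in-house script is not available, so the record layout and names below are hypothetical; only the selection rule comes from the text.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str      # identifier of the hit sequence
    bitscore: float   # alignment bit score
    identity: float   # query-hit percent identity


def candidate_hits(hits: List[Hit], top_n: int = 3) -> List[str]:
    """Union of the top-N hits by bit score and the top-N hits by identity."""
    by_score = sorted(hits, key=lambda h: h.bitscore, reverse=True)[:top_n]
    by_identity = sorted(hits, key=lambda h: h.identity, reverse=True)[:top_n]
    selected, seen = [], set()
    for hit in by_score + by_identity:
        if hit.subject not in seen:
            seen.add(hit.subject)
            selected.append(hit.subject)
    return selected


if __name__ == "__main__":
    hits = [
        Hit("s1", 500, 40.0),
        Hit("s2", 400, 45.0),
        Hit("s3", 300, 50.0),
        Hit("s4", 200, 90.0),
        Hit("s5", 100, 30.0),
    ]
    print(candidate_hits(hits))
```

Taking both rankings lets a short but highly similar fragment (common in transcriptome data) enter the tree alongside the longest, highest-scoring matches; between three and six sequences are passed on per query.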

Results and Discussion

Red algal tree of life

We constructed single-gene alignments for a total of 298 one-to-one orthologous genes (98,494 amino acid positions in total) that are conserved in 15 red algal and 3 green algal taxa (see Methods). Analysis of the concatenated super-protein alignment under the CAT model led to a highly supported phylogenetic tree that received 1.00 posterior probability for all interior nodes (Fig. 1A). This tree confirmed the early split between Cyanidiophytina and Rhodophytina4,8 and monophyletic relationship between Bangiophyceae and Florideophyceae4,8. The relationships within Florideophyceae are consistent with previous analyses10,31 with Hildenbrandiophycidae (Hildenbrandia) in the basal position. Nemaliophycidae (Palmaria) is sister to the monophyletic group containing Corallinophycidae (Calliarthron) and Rhodymeniophycidae (Chondrus)10,31. The remaining non-seaweed mesophilic lineages formed a robust monophyletic group, with Stylonematophyceae in the basal position. Compsopogonophyceae formed a sister group to the monophyletic Porphyridiophyceae and Rhodellophyceae.

Concatenation-based analysis has previously been shown in some instances to result in inflated statistical support for incorrect topologies32 due to heterogeneity across genes and gene-specific evolution, such as gene duplication33. To minimize this problem, we used a tree summarization approach that does not rely on the concatenation of multiple single-gene alignments. This method takes a population of single-gene trees as input and estimates the species tree using a coalescence model28. This analysis led to the same tree topology (Fig. 1A) as the concatenation-based analysis, with high bootstrap support for the monophyletic group comprising red seaweeds (bootstrap support = 100%) and non-seaweed mesophilic red algae (bootstrap support = 90%). The relationships among non-seaweed red algal lineages are, however, weakly supported (bootstrap support = 49-51%). Taken together, our phylogenomic analyses strongly support a separation between seaweeds and non-seaweed lineages at the base of mesophilic red algae (Fig. 1A).

Fig. 1: Red algal phylogenomics

(A) A phylogenetic tree inferred from a concatenated 298-protein alignment. The outgroup species are not shown. Statistical supports (separated by a back slash) for each branch are derived from the super-protein analysis (posterior probability) and from the coalescence model-based analysis (bootstrap support). (B) Schematic representation of the positions of red seaweeds and land plants (thick branches) in red algae and Viridiplantae, respectively. The phylogenies are derived from this study (panel I), Scott et al (Ref. 6, panel II) and Leliaert et al. (Ref. 35, panel III). The arrows indicate genome reduction (GR). Bangiophyceae (Bangio.), Compsopogonophyceae (Compsopogo.), Cyanidiophyceae (Cyanidio.), Florideophyceae (Florideo.), Porphyridiophyceae (Porphyridio.), Stylonematophyceae (Stylonemato.), Coleochaetophyceae (Coleochaeto.), Chlorokybophyceae (Chlorokybo.), Klebsormidiophyceae (Klebsormidio.), Mesostigmatophyceae (Mesostigmato.), Zygnematophyceae (Zygnemato.).

The early divergence of red seaweeds within mesophilic red algae is consistent with the antiquity of the Bangiophyceae. A putative bangiophyte (Bangiomorpha pubescens) has been found in rocks dated at ca. 1.2 billion years old34. This result suggests the existence of distinct fates for the two lineages that split from the common ancestor of mesophilic red algae. One lineage remained unicellular with the development of filaments in some species (e.g., Rhodochaete and Purpureofilum), whereas the other developed complex filamentous plant bodies leading to red seaweeds with bi- and tri-phasic life cycles. The basal position of red seaweeds among mesophiles (Fig. 1B, scenario I) contrasts with previous analyses4,5,6,8,10,34 (Fig. 1B, scenario II) and with the highly derived position of land plants in Viridiplantae35. Land plants evolved within streptophyte green algae that have simpler morphological forms (Fig. 1B, scenario III)35. Likewise, kelps (phaeophytes) are also nested among multiple unicellular stramenopile lineages36. Among mesophilic red algae, red seaweeds appear to have ‘recovered’ from the extensive genome reduction they shared with the red algal ancestor2 and that was further exacerbated in Cyanidiophytina due to their extremophilic lifestyles30. The early (>1 billion year old) emergence of the multicellular lineage is all the more remarkable when placed in the context of the early evolution of red algae. Alternatively, the lack of simpler lineages (in terms of morphology) in the red seaweed clade may suggest their high rate of extinction or the existence of yet unknown species that remain to be discovered in this clade. An early emergence of a peculiar type of multicellularity (green seaweed Palmophyllophyceae) was also discovered recently in the basal position of Chlorophyta37.

Parallel losses of MVA pathway

To demonstrate the usefulness of this novel perspective on red algal phylogeny, we used the reference tree to elucidate the evolution of the isopentenyl pyrophosphate (IPP) biosynthetic pathway. IPP is the building block of isoprenoids, which comprise a large diversity of lipids found in all three domains of life. In photosynthetic eukaryotes, two independent pathways exist to produce IPP: the cytosolic and peroxisome-localized MVA pathway and the plastid MEP pathway38. Whereas the MEP pathway is conserved across many species, the MVA pathway has been lost in green algae (Chlorophyta)38 and in some red algal lineages such as C. merolae16 and P. purpureum12. Our analysis of red algal sequence data (see Methods) showed that the MEP pathway is present in all examined lineages. The minor gene losses that were found are most likely explained by missing data commonly associated with transcriptome datasets (Fig. 3, Appendix 2). In contrast, the MVA pathway is largely absent (3rd to 6th enzymes in the pathway, Fig. 2A) in most red algal lineages except the Stylonematophyceae (Rhodosorus marinus and Purpureofilum apyrenoidigerum) and G. sulphuraria. Presence of the MVA pathway in G. sulphuraria39 and Cyanidium caldarium40 is supported with genetic and biochemical evidence39,40. This result suggests that loss of the MVA pathway is more widespread than previously thought. The red algal origin of the MVA genes in Stylonematophyceae is supported with phylogenetic data (see Methods). For example, in the phylogeny of HMG-CoA reductase (HMGR, Fig. 2B), R. marinus and P. apyrenoidigerum form a monophyletic group with Galdieria species, whereas no other red algae were present in this clade. A similar pattern is found for other MVA pathway genes that were lost in most red algal species (Fig. 4, Appendix 3).


Fig. 2: MVA pathway in red algae

(A) The distribution of MVA pathway genes across red algal species. Black and open circles denote the presence and absence of the genes, respectively. For each gene, the gray boxes indicate gene presence for the corresponding classes. Arrows indicate genome reduction. Red vertical bars indicate gene losses. ACAT (acetyl-CoA acetyltransferase), HMGS (hydroxymethylglutaryl-CoA synthase), HMGR (3-hydroxy-3-methylglutaryl-CoA reductase), MVK (mevalonate kinase), PMK (phosphomevalonate kinase), MVD (mevalonate decarboxylase), IDI (isopentenyl-diphosphate delta-isomerase). (B) A ML tree of HMGR. The taxa in red color: red algae, green: Viridiplantae, orange: chromalveolates, brown: Opisthokonta.

Absence of the MVA pathway in all five sampled red seaweeds suggests it was most likely lost in their common ancestor. BLASTp searches (e-value cutoff = 10) against nucleotide databases (expressed sequence tag and transcriptome shotgun assembly) in NCBI did not return any significant hits to MVA pathway genes from Bangiophyceae and Florideophyceae. In addition, the losses in C. merolae, P. purpureum, and C. crispus, for which both transcriptome and genome data are available, are well supported. Given the red algal phylogeny (Fig. 1A), these losses unambiguously resulted from three parallel events (Fig. 2A). Under this scenario, the MVA pathway survived the ancient phases of genome reduction (arrows, Fig. 2A) and underwent gene loss more recently, after the split of the seaweed and non-seaweed lineages. MVA pathway loss in C. merolae likely resulted from an additional phase of genome reduction specific to this lineage30 (Fig. 2A). The selective forces that led to the retention or loss of the MVA pathway across the mesophilic red algal lineages are presently unknown. Nonetheless, MVA pathway loss suggests that IPP biosynthesis is dependent on the plastid MEP pathway and requires transporters for the export of IPP from the plastid to the cytosol38. The MVA pathway was also lost in Chlorophyta (including most unicellular green algae)38 and G. sulphuraria is physiologically distinct from mesophilic species. For this reason, the discovery of possible MVA pathway-containing and -absent lineages among mesophilic red algae provides an algal model for studying the evolution of isoprenoid biosynthesis and intracellular trafficking among compartments.


Our phylogenomic analyses resulted in a well-supported red algal phylogeny that provides new insights into the evolution of red seaweeds. Our results will allow more accurate reconstruction of evolutionary events (e.g., gene family evolution2 and molecular calibration10) and provide a framework to map the distribution of red algal functions and traits. Further efforts are needed to substantiate the relationships among non-seaweed mesophilic red algae with high quality genome data from these taxa41.

Data Availability

The multi-protein alignment is available for download (ID: 20087) from TreeBASE (

Competing Interests

The authors have declared that no competing interests exist.

Corresponding Author

Huan Qiu, Department of Ecology, Evolution and Natural Resources, Rutgers University, New Brunswick, NJ 08901, USA.


Appendix 1

Table 1

Algal genome and transcriptome data used for the phylogenomic analysis

Classification | Species | Source | Data type | MMETSP ID
Seaweed | Hildenbrandia rubra | Ref. 2 | Transcriptome |
Seaweed | Palmaria palmata | Ref. 2 | Transcriptome |
Seaweed | Calliarthron tuberculosum | Ref. 14 | Partial genome |
Seaweed | Chondrus crispus | Ref. 13 | Whole genome |
Seaweed | Porphyra umbilicalis | Ref. 14 | Transcriptome |
Mesophiles | Purpureofilum apyrenoidigerum | Ref. 2 | Transcriptome |
Mesophiles | Rhodochaete pulchella | Ref. 2 | Transcriptome |
Mesophiles | Rhodosorus marinus | Ref. 11 | Transcriptome | MMETSP0315
Mesophiles | Rhodella maculata | Ref. 11 | Transcriptome | MMETSP0167
Mesophiles | Compsopogon coeruleus | Ref. 11 | Transcriptome | MMETSP0312
Mesophiles | Erythrolobus australicus | Ref. 11 | Transcriptome | MMETSP1353
Mesophiles | Timspurckia oligopyrenoides | Ref. 11 | Transcriptome | MMETSP1172
Mesophiles | Porphyridium purpureum | Ref. 12 | Transcriptome |
Extremophiles | Galdieria sulphuraria | Ref. 15 | Transcriptome |
Extremophiles | Cyanidioschyzon merolae | Ref. 16 | Whole genome |
Green algae | Chlorella variabilis | Ref. 17 | Whole genome |
Green algae | Chlamydomonas reinhardtii | Ref. 18 | Whole genome |
Green algae | Micromonas pusilla | Ref. 19 | Whole genome |

Appendix 2


Fig. 3: Distribution of the MEP pathway across red algal lineages

Black and open circles denote the presence and absence of the genes, respectively. For each gene, the gray boxes indicate the gene presence for the corresponding classes. DXS (1-deoxy-d-xylulose 5-phosphate synthase), DXR (1-deoxy-d-xylulose 5-phosphate reductoisomerase), MCT (2-C-methyl-d-erythritol 4-phosphate cytidylyltransferase), CMK (C-methyl-d-erythritol kinase), MDS (2-C-methyl-d-erythritol 2,4-cyclodiphosphate synthase), HDS (4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase), HDR (4-hydroxy-3-methylbut-2-en-1-yl diphosphate reductase), IDI (isopentenyl-diphosphate isomerase).

Appendix 3

Fig. 4: ML trees for six MVA pathway genes

The taxa in red color: red algae, green: Viridiplantae, orange: chromalveolates, brown: Opisthokonta.

How Really Ancient Is Paulinella Chromatophora? Tue, 15 Mar 2016 16:40:57 +0000 The ancestor of Paulinella chromatophora established a symbiotic relationship with cyanobacteria related to the Prochlorococcus/Synechococcus clade. This event has been described as a second primary endosymbiosis leading to a plastid in the making. Based on the rate of pseudogene disintegration in the endosymbiotic bacterium Buchnera aphidicola, it was suggested that the chromatophore in P. chromatophora has a minimum age of ~60 Myr. Here we revisit this estimate by using a lognormal relaxed molecular clock on the 18S rRNA of P. chromatophora. Our time estimates show that, depending on the assumptions made to calibrate the molecular clock, P. chromatophora diverged from heterotrophic Paulinella spp. ~90 to 140 Myr ago, thus establishing a maximum date for the origin of the chromatophore.



Mitochondria and plastids evolved from free-living bacteria by symbiogenesis more than one billion years ago1. Both events boosted the evolution of eukaryotes by expanding their metabolic abilities. Primary endosymbiosis leading to organelles (i.e., mitochondria and plastids) was thought to be unique in the history of life until the recent discovery of an independent primary endosymbiosis in Paulinella chromatophora2. This thecate filose amoeba hosts in its cytoplasm photosynthetic organelles of cyanobacterial origin, called chromatophores. Phylogenetic analysis of 16S rRNA showed that the chromatophores originated from marine α-cyanobacteria from the Prochlorococcus/Synechococcus clade. In contrast, “classical” plastids of plants and algae evolved from an ancient unknown lineage of cyanobacteria3.

The genome of the cyanobacterium Synechococcus WH5701, one of the closest known free-living relatives of the chromatophore with a sequenced genome, has 2917 protein-coding genes4. In contrast, the chromatophores of P. chromatophora strains FK01 and M0880/a contain 841 and 867 protein-coding genes, respectively5,6. The genes remaining in the chromatophore suggest a strong metabolic interdependence with the amoebal nucleocytoplasm. The discovery that proteins of the chromatophore photosynthetic apparatus are encoded in the host genome and imported back into the cyanobacterial-derived compartment reinforces the suggestion that the chromatophore is a bona fide primary organelle7.

It is likely that the origin of the chromatophore is one or two orders of magnitude more recent than the establishment of the primary plastids of plants and algae. But how ancient is the chromatophore? Since there is no fossil record of P. chromatophora or its close relatives, it is difficult to give a precise answer to this question. However, an initial guess suggested the chromatophore has a minimum age of 60 Myr. This was based on the proposal that in Buchnera aphidicola (the endosymbiotic bacterium of aphids) a pseudogene needs between 40 and 60 Myr to disintegrate completely8. Since the genome of the chromatophore has pseudogenes, the same tempo was extrapolated to the chromatophore6.

Surprisingly, the suggestion that the chromatophore in P. chromatophora has a minimum age of 60 Myr was altered in part of the subsequent literature into the statement that P. chromatophora is ~60 Myr old. Clearly, this is an assertion that requires clarification and further analysis. The recent identification by single cell genomics of non-photosynthetic close relatives of P. chromatophora9 and novel attempts to date the origin of major eukaryotic groups10,11 offer an opportunity to revisit the timing of the origin of this extraordinary symbiosis.


Taxon sampling. Based on previously published phylogenetic analyses, we retrieved a sample of 18S rRNA sequences from rhizaria and stramenopila from the SILVA database12. These included sequences from: i) the phylogenetic analysis whereby close relatives of P. chromatophora were identified9; ii) the phylogenetic analyses describing divergence times of major eukaryotic groups10,11; and iii) sequences from phylogenetic analyses of euglyphids13 and diatoms14. We did not include sequences from foraminifera because their high rate of evolution results in extremely long branches. Sequences were aligned with MUSCLE15. The final alignment contained 43 sequences and 1128 aligned positions without gaps. The phylogenetic tree reconstructed from these sequences (see below) is in general terms congruent with published phylogenetic analyses.

Molecular clock analyses. BEAUTi was used to prepare xml files. Time trees were inferred by using a lognormal relaxed molecular clock as implemented in BEAST 216. To select the model of evolution we used jModelTest17. The Tamura and Nei 1993 model (TrN) plus gamma distribution and an estimated proportion of invariant sites was rated best by the Bayesian information criterion (BIC). Therefore, all the analyses were conducted under this model of evolution. In addition, we used a Yule tree prior for all analyses.

To calibrate the clock, we relied upon several sources of information (Fig. 2). These included organismal as well as chemical fossils. We also used previously published estimates of the divergence time between rhizaria and stramenopila. These sources of information were organized into four different calibration schemes. All four schemes used information from the non-controversial fossil record (evidence a, b, c and d). However, the schemes differ in the use of soft evidence from previous molecular clock analyses (evidence e.1 and e.2) and in the use of evidence from vase-shaped microfossils (VSM) described by18. VSM were originally assigned to rhizaria. However, a later molecular clock analysis favored a re-interpretation of VSM as members of amoebozoa10. Thus, the four calibration schemes were: (a, b, c, d, e.1); (a, b, c, d, e.1, f); (a, b, c, d, e.2); (a, b, c, d).

Priors on the age of nodes were adjusted to reflect confidence in the divergence times. For instance, we assigned a strong prior to the origin of rhizosolenid diatoms because the origin of this lineage has been dated to 91.5 ± 1.5 Myr based on chemical fossils14. The 95% probability distribution in this case lies between 91.5 and 97.0 Myr. Other calibrations received weaker priors, as indicated by the larger number of years spanned by the 2.5% to 97.5% quantiles in the shape column of Table 1. Priors used for the origin of pennate diatoms and for the origin of the earliest diatoms were based on those used by11 and represent minimal divergence times. The prior used for the origin of Euglyphidae was based on the supposition that the group is much older than the fossil evidence, as suggested by19. This prior also represents a minimal divergence time. Finally, to assign a prior to the divergence time of rhizaria from stramenopila we first averaged the BEAST time estimates published by11 and then parametrized a normal distribution to include the variability of these estimates. We used this normal distribution as a prior. We constructed a second prior for the same divergence of rhizaria from stramenopila, this time based on the inference published by10. All sequences within each prior were constrained to be monophyletic. The root of the tree was determined by making rhizaria monophyletic.

We evaluated each calibration scheme by checking for convergence and ESS > 200 in Tracer after running chains of length 10^10 and sampling every 10^4 generations. We discarded 10% of generations as burn-in. Finally, we conducted all analyses without sequences (i.e., sampling from priors) to determine the impact of priors and to test whether sequences are informative on estimated divergence dates. Trees were obtained with TreeAnnotator and visualised with FigTree. We provide BEAUTi xml files of the four calibration schemes for BEAST 2 analyses here.

Results and discussion

In Fig. 1 we show an estimate of the divergence time of P. chromatophora obtained by using a lognormal relaxed clock on 18S rRNA. To calibrate this tree we relied on: a) the origin of rhizosolenid diatoms, which is known with high confidence (91.5 ± 1.5 Myr)14; b) a minimal divergence time of pennate diatoms (80 Myr)11,20; c) a minimal divergence time for diatoms (133.9 Myr)11,21; d) a minimal divergence time of Euglyphidae (40 Myr)19; and e.1) an estimate of the divergence time of rhizaria from stramenopila (~1232 Myr)11. This gave us an estimate for the origin of P. chromatophora of 93.6 Myr (38.1 – 138.2).


Fig. 1: The origin of P. chromatophora in time according to SSU rRNA lognormal molecular clock.

Tree calibrated under the scheme (a, b, c, d, e.1). Small open red circles indicate calibrations based on the fossil record (a, b, c, and d); the small open blue circle indicates the calibration based on a previous estimate (e.1); the full red circle indicates the estimated age of P. chromatophora. Notice that the divergence time of Plasmodiophora brassicae (black open circle) is consistent with the proposed time of origin of forams. Species names in orange: stramenopila; species names in blue and green: rhizaria.


Fig. 2: Calibration constraints.

Evidence: used to derive priors for calibration constraints; Min: Minimal divergence time (Offset parameter in BEAUTi) in Myr; Distribution: Used to model each prior together with their respective parameter values (alpha α and beta β for gamma γ and mean and sigma σ for normal distributions); Shape: shows the median for each distribution and the values containing the 2.5% and 97.5% quantiles. Stronger priors span few years between the 2.5% and 97.5% quantiles.

However, other calibration schemes are possible. For instance, we can use the report of vase-shaped microfossils (VSM) of ~742 to 770 Myr (calibration f) that were described tentatively as members of Euglyphida18 together with calibrations a, b, c, d and e.1, to get an estimate of 141.4 Myr (48.7 – 210.4) for the origin of P. chromatophora. Alternatively, we can follow the suggestions that VSMs belong to amoebozoa and that rhizaria diverged from stramenopila 754.1 Myr (639.8 – 903.3)10 together with calibration points a, b, c, d and e.2 to get an estimate of 55.8 Myr (25.4 – 78.6) for the origin of P. chromatophora. Finally, we can estimate the divergence of P. chromatophora based only on calibration points a, b, c and d, thus avoiding controversial VSM fossils (calibration f) and the soft information provided by previous estimates of the divergence between rhizaria and stramenopila (calibrations e.1 and e.2). In this way, we get an estimate of 47.9 Myr (28.8 – 64.9).

Which of the four time estimates is closer to the true divergence time of P. chromatophora? Foraminifera offer a clue. Foraminifera belong to rhizaria and, like diatoms, have a detailed fossil record. However, their 18S sequences are too divergent to include them in our analysis. The first appearance of forams in the fossil record is dated at 545 Myr22. Recent molecular time estimates of forams suggest this group originated 770 Myr (650 – 920)23. A phylogenetic analysis based on 15 concatenated genes shows that forams are a sister group of Plasmodiophora brassicae, with Gromia oviformis as an outgroup11. Therefore, the origin of the lineage leading to P. brassicae in our tree has to be at least as old as the proposed origin of forams. Looking at our time estimates, only the two schemes that are based on a time of ~1232 Myr for the origin of rhizaria (i.e., those that use calibration e.1) result in a date for the origin of P. brassicae that is congruent with the estimated origin of forams (Fig. 3).


Fig. 3: Concordance between different time divergence inferences and earliest forams in the fossil record.

Only calibration schemes assuming a time divergence of rhizaria from stramenopila of ~1232 Myr (e.1) are consistent with the fossil record and proposed time divergence of forams. The horizontal black broken line represents the date of the oldest foram fossil at 545 Myr. The blue broken line represents the estimated origin of forams 770 Myr (650 – 920).

Although the 95% confidence intervals are rather large, the estimates of 93.6 and 141.4 Myr are consistent with the original suggestion that the chromatophore in P. chromatophora has a minimum age of 60 Myr6, and with a more recent proposal of a maximum age of 200 Myr for the divergence of P. chromatophora from Euglypha rotunda24.

If our estimates are correct, the two strains of P. chromatophora diverged from each other about 45.7 – 64.7 Myr ago. The genomes of the chromatophores in these two strains differ by about 33 genes5. This gives us a rate of ~0.25 to 0.36 gene inactivations per million years since the divergence of the two strains. This shows that genome reduction in the chromatophore is a process that continues slowly but steadily. Whether these genes are being lost by genetic drift or natural selection is still a matter of research. Further refinements of the time of origin of P. chromatophora and its chromatophores are expected as our understanding of the evolution of rhizaria and the eukaryotic tree of life improves.
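The rate quoted above can be approximately reproduced with a short calculation. The sketch below rests on an assumption that is not stated in the text: that the ~33 gene differences are shared equally between the two strain lineages, so each lineage accounts for about 16.5 inactivations since the split. Under that assumption the bounds of the divergence interval give values close to the reported range. The function and variable names are illustrative only.

```python
# Worked arithmetic for the gene-inactivation rate (illustrative sketch).
# ASSUMPTION (not stated in the text): the ~33 gene differences are split
# evenly between the two strain lineages that diverged from each other.

GENE_DIFFERENCE = 33               # genes differing between the two chromatophore genomes
DIVERGENCE_MYR = (45.7, 64.7)      # estimated interval for the split of the two strains


def inactivations_per_myr(gene_difference, divergence_myr, lineages=2):
    """Gene inactivations per lineage per million years."""
    return gene_difference / lineages / divergence_myr


# The younger bound gives the faster rate, the older bound the slower rate.
fast_rate = inactivations_per_myr(GENE_DIFFERENCE, DIVERGENCE_MYR[0])
slow_rate = inactivations_per_myr(GENE_DIFFERENCE, DIVERGENCE_MYR[1])

print(f"{slow_rate:.3f} to {fast_rate:.3f} inactivations per Myr per lineage")
```

Without the division by two lineages, the same inputs would give roughly 0.51 to 0.72 inactivations per Myr, which does not match the reported figures; this is why the per-lineage reading is assumed here.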

EAPhy: A Flexible Tool for High-throughput Quality Filtering of Exon-alignments and Data Processing for Phylogenetic Methods Wed, 05 Aug 2015 14:45:54 +0000 Recently developed molecular methods enable geneticists to target and sequence thousands of orthologous loci and infer evolutionary relationships across the tree of life. Large numbers of genetic markers benefit species tree inference but visual inspection of alignment quality, as traditionally conducted, is challenging with thousands of loci. Furthermore, due to the impracticality of repeated visual inspection with alternative filtering criteria, the potential consequences of using datasets with different degrees of missing data remain nominally explored in most empirical phylogenomic studies. In this short communication, I describe a flexible high-throughput pipeline designed to assess alignment quality and filter exonic sequence data for subsequent inference. The stringency criteria for alignment quality and missing data can be adapted based on the expected level of sequence divergence. Each alignment is automatically evaluated based on the stringency criteria specified, significantly reducing the number of alignments that require visual inspection. By developing a rapid method for alignment filtering and quality assessment, the consistency of phylogenetic estimation based on exonic sequence alignments can be further explored across distinct inference methods, while accounting for different degrees of missing data.



High-Throughput Sequencing (HTS) has revolutionised the field of phylogenetics by enabling researchers to question the evolutionary relationships between taxa with large-scale multi-locus datasets 1,2. The development of these methods has been driven by a realisation that the inclusion of many genetic markers helps to account for stochastic coalescent histories of individual genes 3,4,5,6. Species tree inference methods use the multispecies coalescent model to estimate potential gene tree – species tree discordance and large numbers of unlinked loci represent a greater sample of the gene tree distribution underlying the true species tree 6. However, while phylogenetic estimation might improve by sequencing many loci 4,5,6,7, the requirement for high-quality sequence alignments remains unchanged and is fundamental for the correct inference of phylogenetic hypotheses. Existing alignment methods can be extrapolated for use with large-scale multi-locus datasets, but visual inspection of each alignment, the traditional approach for assessing alignment quality, is challenging with thousands of sequenced loci 8. As a consequence of the impracticality of visual inspection, the impact of missing data in large phylogenomic datasets is often nominally explored and the potential consequences of distinct alignment filtering criteria remain unknown. Nonetheless, contradicting opinions coexist 9,10,11 regarding the effect of missing data on phylogenetic inference and it is therefore advisable to quantify the sensitivity of empirical phylogenetic hypotheses to data filtering choices. Thus we need workflows that automate (as far as possible) the assessment of alignment quality and the consequences (in terms of missing data) of making different choices about filtering criteria. Ideally, such a workflow would facilitate the conversion of individual contiguous sequences (‘contigs’) into quality-filtered alignments, and help to minimise the demand for visual inspection.

The need for a high-throughput alignment filtering system emerged with the recent advance in molecular methods to target and sequence large numbers of orthologous loci. Since whole-genome sequencing is still too costly for most research labs that focus on non-model organisms, genome reduction protocols have been developed that isolate large numbers of orthologous loci across the genome of closely related and deeply divergent taxa 1,8. There are two increasingly popular genome reduction methods that specifically focus on exonic sequence regions and can generate genetic markers suitable for phylogenomic inference. Transcriptome sequencing is a cost-effective method that does not require the a priori availability of genomic resources. RNA is extracted from the same tissue in different target species and, with the expected expression of similar genes, orthologous loci are isolated and sequenced for phylogenetic comparison. An alternative method, exon-capture 12, is a target enrichment approach 13,14,15,16 that benefits from an increasing number of readily available genomic resources and enables the design of study-specific capture systems. The use of exonic sequence regions for phylogenetic inference, generated by transcriptome sequencing or exon-capture, is promising and has been successfully demonstrated at different levels of divergence across the tree of life 17,18,19,20. The tremendous increase in the scale of available exonic loci benefits inference methods, but also requires a significant investment in the development of bioinformatic resources to process such data.

Whereas several excellent bioinformatic pipelines have been constructed for processing raw sequence data and conducting sequence assembly 13,14,15,16,19, a bioinformatic scheme is needed for subsequent alignment, alignment quality assessment and alignment filtering. Most published studies still conduct visual inspection of alignment quality and account for missing data by dividing datasets into a limited number of categories, either manually (e.g. 14,18) or automatically (e.g. 21). Recently, Misof et al.19 developed a method to assess alignment quality in an extensive study that used transcriptomes to infer the phylogeny of insects. They identified potentially erroneous alignments by calculating the BLOSUM62 distance between each amino acid sequence and the best reciprocal hit of a reference taxon. A distance calculation based on a BLOSUM matrix was warranted, due to a significant level of protein divergence between most taxa. The BLOSUM alignment score matrix values the alignment of each amino acid pair differently, representing the likelihood of amino acid substitutions, but lacks resolution when the expected level of protein divergence between two sequences is limited. However, although this has not been tested previously, it can be expected that at shallower levels of divergence subtle misalignments might actually have more significant consequences for phylogenetic estimation than when inferring relationships between distant taxa, stressing the need to identify such misalignments. When assessing alignments with limited levels of sequence divergence, the exact number of clustered amino acid changes is likely a better indicator of alignment quality than the overall BLOSUM62 distance score. In this short communication, I describe a flexible high-throughput pipeline for quality assessment of exonic sequence alignments and subsequent filtering of missing data.
The pipeline is specifically designed to be flexible and process both population and phylogenetic level data, but the method developed by Misof et al.19 will likely be more effective at deep phylogenetic scales.

EAPhy, exon alignment for phylogenetics, was developed to process exonic sequence data for phylogenetic inference, but is valuable for any type of analysis that requires high-quality filtered alignments (e.g. population genomics or molecular evolution). In this manuscript I will focus on its application for phylogenetic inference. The first objective of the pipeline is to quantify alignment quality and highlight just those loci that require visual inspection. By translating exonic nucleotide alignments into amino acid alignments, EAPhy infers the relative quality of sequences and alignments by assuming that most mutations within exons are silent. In addition, the identification of regions that harbor an excessive cluster of amino acid replacements distinct from a summary reference sequence is used as a proxy for alignment quality. Similarly, insertions and deletions that result in frame shifts and the introduction of multiple stop-codons are unlikely to represent true biological events, and such alignments should be addressed. The pipeline can be adapted based on the expected level of divergence between taxa by adjusting the stringency of filtering criteria. The second objective is to provide a user-friendly method to account for missing data. By enabling filtering criteria for missing data to vary, the consistency of phylogenetic estimation can be quantified across different levels of missing data. Lastly, EAPhy was designed to generate alignments of different sorts (haplotype, diplotype and SNP based), in the formats required for most commonly used inference software, and to facilitate the further exploration of distinct analysis methods.
With the development of a high-throughput method for alignment filtering and processing, the overarching aim of this pipeline is to reduce the bioinformatic burden of data analysis involving exonic sequence alignments and ultimately promote further research into the (in)congruences between inference methods, while accounting for different degrees of missing data.
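The translation-based check described above can be illustrated with a minimal sketch. It is not EAPhy's own code: the function names and the `max_stops` threshold are hypothetical, and the standard genetic code is assumed. The sketch translates an in-frame nucleotide sequence codon by codon and flags sequences whose translation contains more stop codons than allowed.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (assumed here).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nuc_seq):
    """Translate a sequence that starts in first codon frame.

    A full-gap codon ('---') becomes '-', any other codon that is not in
    the table (partial gaps, ambiguity codes) becomes 'X'. Trailing bases
    that do not fill a codon are ignored.
    """
    usable = len(nuc_seq) - len(nuc_seq) % 3
    residues = []
    for i in range(0, usable, 3):
        codon = nuc_seq[i:i + 3].upper()
        residues.append("-" if codon == "---" else CODON_TABLE.get(codon, "X"))
    return "".join(residues)


def needs_inspection(nuc_seq, max_stops=1):
    """Flag a sequence whose translation has more stop codons than allowed.

    `max_stops` is a hypothetical threshold name, not an EAPhy parameter.
    """
    return translate(nuc_seq).count("*") > max_stops


print(translate("ATGGCCTAA"))
print(needs_inspection("ATGTAATGATAG"))
```

A single terminal stop codon is tolerated with the default threshold, whereas a sequence carrying several stops, as would result from a frame shift, is flagged for inspection.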

Overview of Methods

EAPhy consists of a collection of scripts that takes as input a set of unaligned sequences for an arbitrary number of species and loci. It generates multiple sequence alignments using existing alignment software and subsequently filters these alignments according to a number of user-specified criteria. The final output consists of quality-filtered multiple sequence alignments, allowing different degrees of missing data as preferred, and a list of alignments that still require visual inspection. The output files are automatically exported in the input format of commonly used phylogenetic inference programs. The complete package is freely available at

EAPhy can be run on most individual computers (i.e. does not require a cluster set-up) and an individual run for a modest dataset can be completed within hours. The pipeline has been used with exon-capture datasets involving tens to hundreds of individuals and thousands of loci, and finished within six hours on a Macintosh desktop computer with a 3.1 GHz Intel i7 processor (2012) and 16 GB of RAM. It is important for the user to adopt filtering criteria suitable for the dataset analyzed (level of divergence and data quality), but if filtering criteria have been carefully reviewed EAPhy should be able to handle larger datasets than currently tested. For EAPhy to function appropriately, I advise running the pipeline initially with a small subset of the data and replicating the analysis with alternative filtering criteria. If filtering and flagging of alignments works well, then the analysis can be extended to the complete dataset. The importance of specifying appropriate filtering criteria should not be underestimated, since misspecification of filtering criteria will result in a significantly reduced dataset or alternatively a dataset that equals the input data, regardless of potential low-quality alignments.

EAPhy is not designed to identify individual sequencing errors that are often associated with HTS datasets, but will identify sequence regions with excessive non-synonymous substitutions (potential ‘low-coverage’ sequences) if these have not been filtered out beforehand and appear anomalous in the resulting alignments. Several excellent pipelines have been developed to filter raw sequence data and generate assemblies, and the starting point of this pipeline requires assembled individual contigs that start in first codon frame, for each presumed orthologous locus. A complete overview of the pipeline is outlined in Figure 1 and a general description of the most important components is provided here.

Specification of configuration script

At the onset of each EAPhy run, the system path to an align program executable and all filtering criteria for downstream analysis are specified in a single configuration script. Muscle 22 is the default aligner used by EAPhy, but should be installed by the user independently from downloading EAPhy. Alternative alignment software can be used but requires modifications of several scripts. The EAPhy pipeline is designed as a set of modules that can be executed independently or in consecutive order as a complete analysis (Fig. 1). This provides a straightforward system to reiterate specific components of the pipeline, with alternative filtering criteria for alignment quality or missing data. A complete description of all filtering parameters can be found in the manual and is part of the EAPhy package that can be downloaded from GitHub.


Fig. 1: A schematic overview of the EAPhy workflow

The user specifies filtering instructions in a single configuration script (1). EAPhy will first subset the target individuals from each locus and create new contig files (2). With an existing aligner, new alignments are created for each locus (3) and are subsequently processed and checked (4). Alignments are highlighted that do not fulfill the filtering criteria and can be visually inspected. If deemed appropriate for inclusion, manually checked alignments can be added to the filtered alignment list (5). Alternatively, EAPhy can automatically continue with the alignments that passed filtering and exclude the problematic loci. Once the complete collection of filtered loci has been identified, final alignments are generated (6). If diplotype sequence data was used and heterozygous positions coded according to IUPAC format, concatenated (7) and SNP (8) alignments can be generated if required.

Missing data – within alignments

The effect of missing data on phylogenetic inference is not well understood and contradictory opinions coexist 9,10,11. Phylogenetic estimation is likely unbiased with large numbers of loci, if there are no systematic differences in sequence length between individuals for any given locus. However, when missing data are non-randomly distributed and specific individuals have systematically shorter contigs or are completely missing for specific loci, maximum-likelihood (ML) estimation might cluster individuals by sequence length rather than by sequence similarity at complete positions. At sites with missing data, the probability of observing an ‘A’, ‘T’, ‘C’ or ‘G’ is set to 1 and ML will group taxa together for which there is more signal and less uncertainty (Stamatakis, pers. comm.). Thus the effect of missing data is not limited to small sequence datasets but should also be accounted for and characterized in large-scale datasets. With the development of EAPhy, I do not advocate either discarding or including incomplete sites, but rather provide the opportunity to account for missing data by generating datasets where different filtering criteria have been enforced.

Missing data within individual sequences are particularly prevalent at the beginning and end of alignments (‘jagged edges’), since individual contig sequences often differ in length. Once alignments have been constructed using an existing aligner (e.g. Muscle 22), EAPhy will first address missing data by processing alignments in accordance with stringency criteria specified in the configuration script (Fig. 2). First, potential gaps within individual sequences are removed to yield long consecutive sequences (Fig. 2.1). EAPhy converts all sequence alignments into amino acid alignments and then uses a ‘jump-sliding window’ approach to assess the presence of potential non-consecutive sequence stretches that are often prevalent at the start/end of individual sequences. A jump-sliding window approach was developed since a conventional sliding window approach would remove the complete individual sequence if the first window contained more missing data than allowed. Each window is assessed for the presence of amino acid sequence gaps and if a window contains more gaps than allowed, the complete window is removed for that individual sequence. In-frame gaps (i.e. triplet insertions) are retained if the number of codon gaps per window does not exceed the number allowed. The window then ‘jumps’ a sequence distance of half the window size plus one codon and the process is repeated. By converting a nucleotide alignment into its amino acid equivalent, EAPhy specifically takes into account the codon structure of exonic sequences. When nucleotide sequence data are removed by codon, the remaining sequence is still in the correct frame and codon position can still be inferred for each nucleotide position.
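The jump-sliding window step can be illustrated with a minimal Python sketch. This is one possible reading of the description above and not EAPhy's own code: the function names, the `window` and `max_gap_ratio` parameters and the decision to count gaps on the unmodified sequence are all assumptions made for illustration.

```python
def jump_window_mask(aa_seq, window=8, max_gap_ratio=0.5, gap="-"):
    """Replace every gap-rich window of an amino acid sequence with gaps.

    Hypothetical helper: gap counts are taken from the original sequence,
    so masking one window does not influence the next window's count.
    """
    keep = [True] * len(aa_seq)
    step = window // 2 + 1  # the 'jump': half the window size plus one codon
    for start in range(0, len(aa_seq), step):
        chunk = aa_seq[start:start + window]
        if chunk.count(gap) / len(chunk) > max_gap_ratio:
            for i in range(start, start + len(chunk)):
                keep[i] = False
    return "".join(c if k else gap for c, k in zip(aa_seq, keep))


def mask_codons(nt_seq, masked_aa, gap="-"):
    """Blank out whole codons wherever the amino acid mask holds a gap,
    so the remaining nucleotides stay in frame."""
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]
    return "".join("---" if aa == gap else codon
                   for codon, aa in zip(codons, masked_aa))


# A short fragment ('AC') isolated by a long gap is removed; the rest is kept.
masked = jump_window_mask("AC------DEFGHIKLMNPQ", window=8, max_gap_ratio=0.5)
print(masked)
```

Because removal happens per amino acid position and is then mapped back to whole codons, the surviving nucleotide sequence keeps its reading frame, which is the property the paragraph above relies on.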

After individual sequences have been trimmed for missing data, EAPhy then assesses missing data between individuals by evaluating the amount of missing data for each amino acid alignment column (Fig. 2.2). The algorithm used is similar to the jump-sliding window approach, but now focuses on the amount of missing data within each amino acid alignment column. The window length in amino acid columns and the amount of missing data allowed within each column can be specified in the configuration script. The algorithm evaluates for each amino acid column whether the number of individuals with missing data exceeds the specified cut-off. If more than half of the columns in a given window have more missing data than allowed, the columns in the first half of the window are removed from the alignment. The window then ‘jumps’ a sequence distance of half the window size plus one codon and the process is repeated. Amino acid columns at the end of alignments are removed if they have not been evaluated because the specified window length exceeds the number of remaining columns. When alignments have been filtered for missing data within and between individuals, EAPhy evaluates the presence of single nucleotide insertions by assessing the number of sequenced individuals for each nucleotide alignment column. If the number of sequenced individuals is below a user-specified cut-off, the site is assumed to be a sequencing error and removed from the alignment.
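As a rough illustration of the column-wise filter, the sketch below follows the description literally: a column is 'bad' when too many individuals are missing, the first half of a window is dropped when more than half of its columns are bad, and trailing columns that never fall inside a full window are dropped. The function name and parameters are hypothetical, and EAPhy's actual handling of window boundaries may differ.

```python
def filter_columns(aln, window=4, max_missing_ratio=0.5, gap="-"):
    """Return the amino acid alignment with gap-rich column blocks removed.

    aln is a list of equal-length strings (one per individual).
    """
    n_seqs, n_cols = len(aln), len(aln[0])
    # A column is 'bad' if the fraction of individuals with a gap is too high.
    bad = [sum(seq[c] == gap for seq in aln) / n_seqs > max_missing_ratio
           for c in range(n_cols)]
    step = window // 2 + 1  # jump of half the window size plus one
    drop = set()
    evaluated_to = 0
    for start in range(0, n_cols - window + 1, step):
        if sum(bad[start:start + window]) > window / 2:
            drop.update(range(start, start + window // 2))
        evaluated_to = start + window
    # Columns at the end that were never part of a full window are removed.
    drop.update(range(evaluated_to, n_cols))
    kept = [c for c in range(n_cols) if c not in drop]
    return ["".join(seq[c] for c in kept) for seq in aln]


example = ["---CDEFG",
           "-K-CDEFG",
           "---CDE-G"]
print(filter_columns(example, window=4, max_missing_ratio=0.5))
```

Note that under this reading an alignment shorter than one window is removed entirely, which is why the window length should be chosen with the shortest loci in mind.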


Fig. 2: An exemplary overview of the three main filtering steps conducted during alignment filtering and quality assessment

First, potential gaps within individual sequences are removed to yield long consecutive sequences (1). By converting nucleotide codons into amino acids, the number of missing amino acids is assessed for each window using a ‘jump-sliding-window’ approach. If fewer amino acids are missing for a given window than the specified ‘jump-window gap ratio’, the complete corresponding nucleotide stretch is retained (see green frames). If more amino acids are missing for a window, the complete nucleotide stretch is removed (see red frame). Secondly, the amount of missing data for each amino acid alignment column in a given window is quantified (2). If more than half of the amino acid columns in a window miss more individuals than the specified ‘column gap ratio’, all corresponding nucleotide columns are removed (see red frame). Lastly, the quality of the resulting alignment is assessed by comparing each individual sequence to a consensus sequence (3). Following a common sliding window approach for each individual sequence, the number of amino acids that differ from the consensus is quantified. If, for each window, the number of differing amino acids is less than the specified ‘difference ratio’, the alignment is retained (see green frames). If any window of any individual within an alignment fails this criterion, the alignment is flagged for visual inspection. In addition, the number of stop codons is quantified for each sequence, and if any individual sequence contains more stop codons than a specified cut-off number (e.g. > 1), the alignment is also flagged for visual inspection.

Alignment quality

Once each alignment has been filtered for missing data, EAPhy then inspects the alignment quality by translating the nucleotide sequences and evaluating the resulting amino acid alignment (Fig. 2.3). First, if the number of stop codons for any individual sequence exceeds a user specified cut-off value (e.g. > 1), the alignment is flagged for visual inspection. Subsequently, a general consensus sequence is estimated for each alignment, and each individual sequence is compared to the consensus sequence in a ‘normal’ sliding-window approach. The window length is specified in the configuration script and each individual sequence is compared to the consensus sequence by sliding window. For each window, the number of amino acids distinct from the consensus is quantified and if greater than the proportion specified in the configuration script, the alignment is flagged for visual inspection.
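The two quality checks described here (stop codon count and sliding-window comparison with a consensus) can be sketched as follows. This is an illustrative approximation only: the majority-rule consensus, the parameter names and the choice to ignore gaps when counting differences are assumptions, not documented EAPhy behaviour.

```python
from collections import Counter


def consensus(aa_aln, gap="-"):
    """Majority-rule consensus; all-gap columns yield a gap."""
    cons = []
    for col in zip(*aa_aln):
        residues = [r for r in col if r != gap]
        cons.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(cons)


def flag_alignment(aa_aln, window=5, max_diff_ratio=0.4, max_stops=1,
                   gap="-", stop="*"):
    """Return True if the alignment should be inspected visually."""
    # Check 1: too many stop codons in any individual sequence.
    if any(seq.count(stop) > max_stops for seq in aa_aln):
        return True
    # Check 2: any window that deviates too much from the consensus.
    cons = consensus(aa_aln, gap)
    for seq in aa_aln:
        for start in range(0, len(seq) - window + 1):
            diffs = sum(1 for a, b in zip(seq[start:start + window],
                                          cons[start:start + window])
                        if a != gap and a != b)
            if diffs / window > max_diff_ratio:
                return True
    return False


print(flag_alignment(["MKTAYIAK", "MKTAYIAK", "MKWWWIAK"]))
```

Flagging rather than deleting keeps the final decision with the user, in line with the visual inspection step of the workflow.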

Finally, phylogenetic inference is dependent on the comparison of orthologous genetic markers and comparing potential paralogous loci might yield confounded estimates of relationship. EAPhy assumes that the sequenced contigs for each locus are orthologous but has an additional option to potentially identify paralogous loci, by identifying markers with excessive levels of average individual heterozygosity. The user can inspect the distribution of average individual heterozygosity across all loci and based on this observation make an informed decision whether to exclude a certain percentage of loci with the highest level of average individual heterozygosity.
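A simple way to picture the heterozygosity screen is given below. The sketch assumes that heterozygous positions are coded with the two-base IUPAC ambiguity symbols (R, Y, S, W, K, M) and that the user chooses what fraction of loci to exclude; the function names and the `top_fraction` parameter are hypothetical.

```python
HET_CODES = set("RYSWKM")  # two-base IUPAC ambiguity codes


def mean_individual_heterozygosity(aln, gap="-"):
    """Average, over individuals, of the fraction of called sites
    that carry a heterozygous IUPAC code."""
    values = []
    for seq in aln:
        called = [b for b in seq.upper() if b not in (gap, "N", "?")]
        if called:
            values.append(sum(b in HET_CODES for b in called) / len(called))
    return sum(values) / len(values) if values else 0.0


def loci_to_exclude(loci, top_fraction=0.05):
    """Names of the loci with the highest average individual heterozygosity.

    loci maps a locus name to its nucleotide alignment (list of strings).
    """
    ranked = sorted(loci,
                    key=lambda name: mean_individual_heterozygosity(loci[name]),
                    reverse=True)
    return ranked[:int(len(ranked) * top_fraction)]


loci = {"locusA": ["ACGT", "ACGT"], "locusB": ["ARGT", "AYGT"]}
print(loci_to_exclude(loci, top_fraction=0.5))
```

In practice the distribution of these values across loci would be inspected first, and the cut-off chosen from that distribution rather than fixed in advance.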

Concatenation and SNP selection

After visual inspection and filtering of flagged alignments, the collection of final high quality alignments can then be used for a variety of phylogenetic estimation methods. Gene trees can be inferred based on single alignments and a concatenated maximum likelihood tree can be estimated based on all alignments combined. Since all alignment filtering was conducted by codon, each nucleotide can still be assigned its correct codon position. PartitionFinder 23 estimates the optimal partitioning scheme across all sequence positions and the appropriate substitution model for each partition. A PartitionFinder input file is automatically created with each gene and codon position of the concatenated alignment specified.

In addition to sequence-based alignments, EAPhy will also generate concatenated alignments that include polymorphic sites exclusively. SNAPP 24 is a species tree method that uses unlinked biallelic markers, instead of sequence-based alignments, and EAPhy can generate alignments with a biallelic SNP randomly sampled from each locus. It will verify whether polymorphic sites are biallelic and ignore polymorphic sites with more than two allelic states. Alternatively, SNP alignments can be constructed where every single SNP is considered, regardless of allele count, or with all SNPs across all loci concatenated. If a study is geared towards recovering population structure, such alignments can be used in analyses that model allele frequencies (e.g. 25).
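The selection of one biallelic SNP per locus can be sketched in a few lines of Python. The sketch is deliberately simplified and is not EAPhy's implementation: it treats anything other than A, C, G or T (including IUPAC heterozygote codes) as missing, and the function names and `seed` parameter are illustrative assumptions.

```python
import random


def biallelic_sites(aln, valid="ACGT"):
    """Indices of alignment columns with exactly two allelic states."""
    sites = []
    for i, col in enumerate(zip(*aln)):
        alleles = {b for b in col if b in valid}
        if len(alleles) == 2:
            sites.append(i)
    return sites


def sample_snp_per_locus(loci, seed=0):
    """Pick one random biallelic site from each locus that has one.

    loci maps a locus name to its nucleotide alignment (list of strings).
    """
    rng = random.Random(seed)  # seeded so that the sampling is repeatable
    picked = {}
    for name, aln in loci.items():
        sites = biallelic_sites(aln)
        if sites:
            picked[name] = rng.choice(sites)
    return picked


# Column 1 is biallelic (C/T); column 3 has three states and is skipped.
print(sample_snp_per_locus({"locus1": ["ACGT", "ACGA", "ATGC"]}))
```

Sampling a single site per locus is what keeps the resulting markers approximately unlinked, which is the requirement stated for SNAPP above.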

Missing data – number of sequenced individuals

Sequencing success can vary among individual samples. If specific individuals are systematically underrepresented and miss data for many loci, it is possible that the phylogenetic placement of such taxa is ambiguous and the investigator would prefer to exclude these samples. Thus, the potential impact of missing individuals across loci should be accounted for. EAPhy attempts to highlight where this is likely by: a) providing alternative datasets with different numbers of missing individuals allowed and b) providing summary statistic output files quantifying the number of loci sequenced for each individual. This enables the investigator to further explore the potential effects of missing data on phylogenetic inference.

In Summary

The first objective of developing EAPhy was to provide a flexible and rigorous tool to generate reliable alignments, while minimizing the need for extensive visual inspection. Secondly, EAPhy was designed to allow filtering criteria for missing data to vary and investigate the impact of missing data on phylogenetic estimation. Lastly, EAPhy creates a large number of desired input formats for subsequent analysis, enabling the exploration of distinct inference methods. Negating the effort of manual alignment filtering and processing, EAPhy will hopefully stimulate further research into the potential consequences of applying alternative criteria for missing data and datatype, and how this might ultimately result in (in)congruent estimates of phylogenetic relationships across methods. The simultaneous development of novel molecular approaches to sequence orthologous genetic markers and bioinformatic methods to analyze such data, will ultimately provide us with the tools to generate a phylogenetic framework for all taxa across the tree of life.

Competing Interests

The author has declared that no competing interests exist.

Visualising Geophylogenies in Web Maps Using GeoJSON Tue, 23 Jun 2015 13:00:40 +0000 This article describes a simple tool to display geophylogenies on web maps including Google Maps and OpenStreetMap. The tool reads a NEXUS format file that includes geographic information, and outputs a GeoJSON format file that can be displayed in a web map application.



The increasing number of georeferenced sequences in GenBank 1 and the growth of DNA barcoding 2 means that the raw material to create geophylogenies 3 is readily available. However, constructing visualisations of phylogenies and geography together can be tedious. Several early efforts at visualising geophylogenies focussed on using existing GIS software 4, or tools such as Google Earth 5,6,7 . While the 3D visualisations enabled by Google Earth are engaging, it’s not clear that they are easy to interpret. Another tool, GenGIS 12,13 , supports 2D visualisations where the phylogeny is drawn flat on the map, avoiding some of the problems of Google Earth visualisations. However, like Google Earth, GenGIS requires the user to download and install additional software on their computer.

By comparison, web maps such as Google Maps 15 and OpenStreetMap 16 are becoming ubiquitous and work in most modern web browsers. They support displaying user-supplied data, including geometrical information encoded in formats such as GeoJSON, making them a lightweight alternative to 3D geophylogeny viewers. This paper describes a tool that makes use of the GeoJSON format and the capabilities of web maps to create quick and simple visualisations of geophylogenies.

2D layout of geophylogenies

The following discussion assumes that we have a phylogeny, and that most (if not all) of the OTUs in that phylogeny are associated with a point locality for which we know the latitude and longitude.

In order to draw a geophylogeny on a web map we need to solve three problems. The first, relatively trivial problem is to place the localities of the OTUs on the map (I shall refer to these as “occurrences”).

The second is to draw the phylogeny. Typically when drawing an evolutionary tree we compute x and y coordinates for a device where these coordinates have equal units and are linear in both horizontal and vertical dimensions, such as a computer screen or printer. In web maps coordinates are expressed in terms of latitude and longitude, and in the widely-used “web mercator” projection the y-axis (latitude) is non-linear. Furthermore, on a web map the user can zoom in and out, so pixel-based coordinates only make sense with respect to a particular zoom level.

Fig. 1: Web map tile

The single 256 × 256 pixel tile representing the globe at zoom level 0, showing the pixel coordinates for the top left corner, the centre (corresponding to longitude 0, latitude 0), and the bottom right. Map tiles by CartoDB under CC-BY 3.0 license.

Web maps use “tiles” of a fixed size to represent the globe. Each tile is typically 256 × 256 pixels in size, and the number of tiles comprising a map is 2^zoom × 2^zoom = 4^zoom, where zoom is the zoom level. At zoom level 0 the map comprises a single tile (Fig. 1), at zoom level 1 the map comprises 4 tiles, at zoom level 2 sixteen tiles, and so on. To accommodate the web mercator projection, we first compute a geographic bounding box for the tree based on the bounding box that encloses the occurrences, then offset that box so that the tree is drawn, say, below the occurrences. We can then convert the longitude λ and latitude Ω coordinates of the bounding box to pixels x and y at zoom level 0 using the following formulae:

Note that the maximum latitude that can be displayed in the web mercator projection is 85.051129° north and south. The tree drawing is then laid out within that bounding box, with the nodes positioned in terms of pixels. Once pixel coordinates have been computed for the whole tree, they are then converted back to latitude and longitude values:

Expressing the tree in terms of latitude and longitude coordinates means that the rendering of the tree as the user zooms in and out is handled automatically by the web map application.
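For readers who want to experiment with the conversion, the sketch below uses the standard Web Mercator relations for a single 256 × 256 pixel tile at zoom level 0. It is a generic formulation and not necessarily identical to the formulae used in the tool; the function names are chosen here for illustration.

```python
import math

TILE_SIZE = 256        # pixels per tile side at zoom level 0
MAX_LAT = 85.051129    # latitude limit of the web mercator projection


def latlon_to_pixel(lon, lat, tile_size=TILE_SIZE):
    """Longitude/latitude in degrees -> pixel (x, y) at zoom level 0."""
    lat = max(-MAX_LAT, min(MAX_LAT, lat))  # clamp to the displayable range
    x = (lon + 180.0) / 360.0 * tile_size
    phi = math.radians(lat)
    merc = math.log(math.tan(math.pi / 4 + phi / 2))
    y = (0.5 - merc / (2 * math.pi)) * tile_size  # y grows southwards
    return x, y


def pixel_to_latlon(x, y, tile_size=TILE_SIZE):
    """Pixel (x, y) at zoom level 0 -> longitude/latitude in degrees."""
    lon = x / tile_size * 360.0 - 180.0
    merc = math.pi * (1 - 2 * y / tile_size)
    lat = math.degrees(math.atan(math.sinh(merc)))
    return lon, lat


# The centre of the tile corresponds to longitude 0, latitude 0.
print(latlon_to_pixel(0.0, 0.0))
```

Pixel coordinates at a higher zoom level follow by multiplying x and y by 2 raised to the zoom level, which is why a layout computed once at zoom level 0 and stored as latitude and longitude can be redrawn at any zoom.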

If we want to provide the user with a visual connection between each occurrence on the map and the location of the corresponding OTU in the phylogeny, we can draw a line connecting the two. These lines may criss-cross, creating visual clutter; reducing this clutter is the third problem. To make the diagram more comprehensible, I adopt the approach used by GenGIS 12,13 to reorder the nodes in the tree to minimise the number of crossings 8. As an additional feature, if a taxon is represented by more than one occurrence, we can enclose the set of occurrences by a convex polygon to represent the range of that taxon.
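The convex polygon enclosing a taxon's occurrences can be computed with any convex hull routine. The sketch below uses Andrew's monotone chain algorithm on (longitude, latitude) pairs treated as planar coordinates; this choice of algorithm is an assumption made for illustration, since the text does not say which method the tool uses.

```python
def convex_hull(points):
    """Convex hull of 2D points, returned in counter-clockwise order.

    points is an iterable of (x, y) tuples, e.g. (longitude, latitude).
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive if o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# The interior point (1, 1) is not part of the hull.
print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
```

To use the result as a polygon ring, the first vertex would be repeated at the end so that the ring is closed.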

Having computed a layout, we then need to render that on a web map. There are a number of different web maps available, each with their own API. Rather than tie the visualisation to a particular API, we can use a standardised output format, such as GeoJSON, to encode the layout, so that users can pick which web map they wish to use for the visualisation.


GeoJSON 17 is a format for encoding geographic data in JSON (JavaScript Object Notation). It includes various geometry types (such as Point, LineString, and Polygon), and is supported by a number of online mapping tools, including Google Maps 15 and Leaflet 18 . A GeoJSON document comprises a set of one or more features, each of which has a geometry and additional properties. Using the GeoJSON geometry types we can encode occurrences (Point), the tree (a set of LineString), and taxon distributions (Polygon) in GeoJSON, then have the entire visualisation rendered by the web application. The GeoJSON specification does not, by itself, include any information on how to display the objects encoded in a GeoJSON document (e.g., what colour to use for a line), but some informal standards have emerged, such as storing CSS styles as properties.
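As a concrete illustration, the following Python sketch builds a small GeoJSON document containing one feature of each geometry type mentioned above. The coordinates, the `name` property and the `stroke` and `fill` style properties are made-up placeholders for illustration, not output from the tool.

```python
import json


def feature(geometry_type, coordinates, **properties):
    """Build a GeoJSON Feature; coordinates are [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": geometry_type, "coordinates": coordinates},
        "properties": properties,
    }


collection = {
    "type": "FeatureCollection",
    "features": [
        # An occurrence.
        feature("Point", [-155.5, 19.6], name="OTU 1"),
        # One branch of the tree drawing.
        feature("LineString",
                [[-155.5, 17.0], [-155.5, 16.0], [-156.3, 16.0]],
                stroke="#000000"),
        # A taxon range; the ring is closed (first position == last).
        feature("Polygon",
                [[[-155.5, 19.6], [-156.3, 20.8], [-157.9, 21.4],
                  [-155.5, 19.6]]],
                fill="#ff0000"),
    ],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Note that GeoJSON positions are ordered longitude first, then latitude, which is the reverse of the order many mapping APIs use in their own function calls.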

Input format

In order to create the visualisation we also need a way to input a phylogeny and the geographic localities. The approach taken here is to use the NEXUS format 9, and the GEOGRAPHIC datatype introduced by the Mesquite Cartographer package 14. While some might argue that XML represents the future of phylogenetic file formats 10, NEXUS is easy to manually edit and hence facilitates debugging and exploring the software. Given a set of OTUs, the tool expects a NEXUS file with a TREES block describing a tree, followed by a CHARACTERS block encoding the location of each OTU. Each OTU is typically a DNA sequence. Sets of sequences may belong to the same taxon (e.g., a species or a DNA barcode BIN 2). Following Mesquite, this information can be stored in an ALTTAXNAMES command in a NOTES block.

Figure 2 shows a NEXUS file for the widely used Banza example 11,19.

Fig. 2: NEXUS file for Hawaiian Banza

NEXUS file for Hawaiian Banza, with geographical data encoded in the CHARACTERS block.


I have implemented a NEXUS to GeoJSON converter using PHP. The code parses the NEXUS file, computes a bounding box based on the distribution of the OTUs, draws the tree, and exports the result in GeoJSON. The code is available on GitHub. Code for the examples in this article is also available, and a live demo includes examples of visualising geophylogenies using both Google Maps (Fig. 3) and Leaflet (Fig. 4).

Fig. 3: Geophylogeny for South American rodent

Geophylogeny for DNA barcodes for the rodent Proechimys guyannensis, showing two distinct clusters that are geographically allopatric (data from BOLD, map tiles by CartoDB under CC-BY 3.0 license).

Fig. 4: Geophylogeny for Hawaiian katydids

Geophylogeny for Hawaiian katydids (genus Banza) displayed using the Leaflet framework with map tiles by CartoDB under CC-BY 3.0 license.


At present the method described here requires a middle layer (written in PHP) that resides on a web server and converts the NEXUS file to GeoJSON. An obvious extension would be to port that code to JavaScript and have the entire tool function within the web-browser client.

Although lacking some of the functionality of more specialised software such as GenGIS, an advantage of a web map-based tool is that it brings phylogenies into an environment already familiar to users of biodiversity data, such as the GBIF portal. Many users will have already encountered points on maps, and layers (e.g., of environmental data, or estimated species distributions). By representing phylogeny in GeoJSON we open the way for phylogenetic information to be incorporated into these maps.

Another reason GeoJSON is attractive is that, being a JSON document, it could be stored and indexed in a document database such as CouchDB 20, which I’ve used elsewhere for taxonomic and phylogenetic data 21. Hence we could imagine being able to quickly build a database of geophylogenies that can be queried both taxonomically and spatially. This would be one way to tackle the challenge of Kidd’s call for a “map of life” 3.

Competing interests

The authors have declared that no competing interests exist.

Concatenation Analyses in the Presence of Incomplete Lineage Sorting Fri, 22 May 2015 16:00:52 +0000 Incomplete lineage sorting (ILS), modelled by the multi-species coalescent, is a process that results in a gene tree being different from the species tree. Because ILS is expected to occur for at least some loci within genome-scale analyses, the evaluation of species tree estimation methods in the presence of ILS is of great interest. Performance on simulated and biological data has suggested that concatenation analyses can result in the wrong tree with high support under some conditions, and a recent theoretical result by Roch and Steel proved that concatenation using unpartitioned maximum likelihood analysis can be statistically inconsistent in the presence of ILS. In this study, we survey the major species tree estimation methods, including the newly proposed “statistical binning” methods, and discuss their theoretical properties. We also note that there are two interpretations of the term “statistical consistency”, and discuss the theoretical results proven under both interpretations.



Estimating species trees from multiple loci is commonly performed using concatenation methods, in which multiple sequence alignments from different genomic regions are concatenated into one large supermatrix, and then a tree is estimated on the supermatrix. Yet, incomplete lineage sorting (ILS) (modelled by the multi-species coalescent model) can result in different loci having topologically different phylogenies, with high probability of gene tree incongruence when the effective population size is large and the time between speciation events is small 3. Most importantly, Roch and Steel 1 recently proved that using unpartitioned maximum likelihood to estimate a species tree on a concatenated alignment from different loci can converge to a tree other than the species tree as the number of loci increases, even if the sequence length per locus is allowed to increase; in other words, unpartitioned maximum likelihood can be statistically inconsistent (and even positively misleading, a stronger statement) under the multi-species coalescent model. Furthermore, simulation studies have shown that species trees estimated using concatenation can result in incorrect trees with high support 2. Thus, both theory and empirical studies show that the use of concatenation to estimate a species tree from multiple loci can lead to incorrect phylogenetic estimations.

Because concatenation can result in incorrect trees, coalescent-based “species tree methods” have been developed that are statistically consistent under the multi-species coalescent model. However, there are two meanings for “statistical consistency under the multi-species coalescent model”, and so it is important to distinguish between them.

  • Statistical consistency under the multi-species coalescent model – weak version. The most common use of the term asserts that the tree estimated by the species tree method will converge in probability to the true species tree as the number of sites per locus and the number of loci both increase 3.
  • Statistical consistency under the multi-species coalescent model – strong version. The other use of the term asserts that the estimated tree will converge in probability to the true species tree as the number of loci increases, but limits the sequence length per locus (perhaps to a constant number of sites).

The first use of the term is clearly the weaker condition, since it makes stronger assumptions.

Thus, what Roch and Steel proved 1 is that unpartitioned maximum likelihood is statistically inconsistent in both senses. However, their proof does not extend to fully partitioned maximum likelihood analyses, which allow the numeric model parameters to change between the different loci in the concatenated alignment. Indeed, it is not yet established whether fully partitioned maximum likelihood analyses are inconsistent or consistent under either of these interpretations of the meaning of statistical consistency.

On the other hand, many species tree estimation methods have been developed that are provably statistically consistent in the first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees and species trees, and others (e.g., SVDquartets 15) estimate the tree using a single site from each locus within a set of unlinked loci; however, most of the commonly used methods estimate the species tree by combining estimated gene trees. These “summary methods” (such as MP-EST 12 and ASTRAL 4) are statistically consistent under the multi-species coalescent model in the first (i.e., weaker) sense of the term, and are increasingly popular due to their relative speed and ease of use.

Only a few coalescent-based species tree estimation methods have been proven to be statistically consistent in the second (i.e., stronger) sense of the term, which establishes that the species tree estimated by the method converges to the true species tree as the number of loci is allowed to increase, even when the sequence length per locus is bounded 7,16,18,19. However, it is not known whether the commonly used summary methods such as MP-EST are statistically consistent under the second sense of the term.

Simulation studies evaluating the relative performance of species tree estimation methods in comparison to concatenation analysis have had mixed results (e.g., sometimes concatenation is more accurate, sometimes coalescent-based methods are more accurate, and sometimes the differences are not statistically significant 4,5,6,8,9,10,11). In addition, coalescent-based summary methods have been shown to have reduced accuracy in the presence of substantial gene tree estimation error 4,5,6,8,10,11. Furthermore, the proofs of statistical consistency for standard coalescent-based summary methods (such as MP-EST and ASTRAL) have assumed that all the input gene trees are estimated without any error 7. Since estimated gene trees are often imperfect, the assumption of 100% accurate gene trees is unlikely to be biologically realistic 8.

Statistical Binning

In two recent papers 10,11, we presented techniques designed to improve coalescent-based species tree estimation by improving gene tree estimation when estimated gene trees have insufficient accuracy as a result of limited phylogenetic signal (e.g., because of low rates of evolution or sequence lengths that are too short). These two techniques partition the loci into sets, so that each set should contain loci that are deemed likely to have a common evolutionary tree. The first of these techniques 11 is called “unweighted statistical binning” and the second 10 is called “weighted statistical binning”; the use of weighting in statistical binning addresses a weakness we identified in unweighted statistical binning. Each type of statistical binning uses a fully partitioned maximum likelihood analysis to estimate a tree on each bin, and then these “supergene” trees are given to a summary method (such as MP-EST) to compute a species tree. Hence, statistical binning is a combination of concatenation and coalescent-based species tree estimation.

Statistical binning has the following steps:

  1. Compute gene trees with bootstrapping on each locus, using maximum likelihood.
  2. Use the bootstrap support on the edges of each gene tree to determine for each pair of genes whether they are likely to have a common gene tree topology, and build an “incompatibility graph” to represent this information (so that each node represents a gene, and two genes are connected by an edge if the topological differences between their gene trees are considered statistically significant according to the test).
  3. Partition the vertices of the graph into sets of approximately the same size, so that no two vertices in any set are adjacent; these are called the “bins”.
  4. For each bin, concatenate the alignments in the bin, and compute a fully partitioned maximum likelihood tree on the bin; these are called the “supergene trees”. If performing a weighted statistical binning, then replicate the supergene tree by the number of genes in its bin. If performing an unweighted statistical binning, then do not replicate the supergene trees.
  5. Apply the preferred summary method (e.g., ASTRAL or MP-EST) to the set of supergene trees to compute the species tree.
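Steps 3 and 4 of the list above can be illustrated with a small Python sketch. It uses a plain greedy colouring that places each gene in the smallest bin containing none of its incompatible genes; this is a stand-in chosen for brevity and is not the balanced colouring heuristic of the published statistical binning software. The function names, gene labels and Newick strings are all hypothetical.

```python
def bin_genes(genes, conflicts):
    """Partition genes into bins so that no bin holds two conflicting genes.

    conflicts is a list of (gene_a, gene_b) edges of the incompatibility graph.
    """
    neighbours = {g: set() for g in genes}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    bins = []
    # Place the most constrained genes (highest degree) first.
    for g in sorted(genes, key=lambda g: -len(neighbours[g])):
        candidates = [b for b in bins if not (neighbours[g] & b)]
        if candidates:
            min(candidates, key=len).add(g)  # keep bin sizes similar
        else:
            bins.append({g})
    return bins


def supergene_tree_list(supergene_trees, bins, weighted=True):
    """Input for the summary method: in the weighted variant each supergene
    tree is repeated once per gene in its bin; otherwise it appears once."""
    trees = []
    for tree, members in zip(supergene_trees, bins):
        trees.extend([tree] * (len(members) if weighted else 1))
    return trees


bins = bin_genes(["g1", "g2", "g3", "g4"], [("g1", "g2")])
trees = supergene_tree_list(["(A,(B,C));", "((A,B),C);"], bins, weighted=True)
print(bins, len(trees))
```

The only difference between the two variants is the replication in `supergene_tree_list`: with `weighted=True` the summary method sees one tree per original gene, whereas with `weighted=False` it sees one tree per bin.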

Note that the only difference between weighted and unweighted statistical binning is that weighted statistical binning replicates the supergene trees by the number of genes in the bin, and unweighted statistical binning does not do this. This difference is essential to the theoretical properties of the two methods. As proven in 10, pipelines based on weighted statistical binning followed by summary methods such as MP-EST or ASTRAL (that will converge to the true species tree as the number of true gene trees increases) are statistically consistent using the first definition, which allows the number of sites per locus as well as the number of loci to increase. However, unweighted statistical binning – even if followed by MP-EST or ASTRAL – is not statistically consistent under the first definition!

As shown in 10, MP-EST and ASTRAL used with weighted statistical binning were typically more accurate than when used without weighted statistical binning with respect to the estimation of species tree topologies and branch lengths, on datasets simulated under the multi-species coalescent model. However, there are some cases where using weighted statistical binning reduced accuracy, but these are limited to small numbers of taxa with very high levels of ILS.

Thus, statistical binning used with a coalescent-based summary method provides a blend of concatenation and coalescent-based methods: supergene trees are computed on concatenated alignments using fully partitioned maximum likelihood analyses, and then a coalescent-based summary method (such as MP-EST) is applied to these supergene trees to estimate a species tree. However, importantly, all maximum likelihood analyses used in statistical binning are fully partitioned, and this is critically important to the statistical properties that can be proven about these methods.

Liu et al. 17 argue against the use of statistical binning, saying “We show that approaches such as binning, designed to augment the signal in species tree analyses, can distort the distribution of gene trees and are inconsistent”. Their claim to have shown that phylogenomic pipelines using naive binning 21 (which is not the same as statistical binning) are inconsistent is invalid, since their argument rests on simulations (their own, and by reference to other simulations). However, statistical consistency or inconsistency is a property of behavior as the amount of data goes to infinity, and simulations are limited to finite data. Therefore, simulations by definition cannot prove statistical consistency or inconsistency. Hence, their paper does not establish inconsistency for naive binning, nor for any kind of binning approach.

Although their simulation study only examined naive binning 21, Liu et al. also expressed concerns about the potential for statistical binning to have a deleterious effect on species tree estimation. Therefore, their study does raise the following question: Are statistical binning pipelines statistically consistent under the multi-species coalescent model?

We can definitively answer this question with respect to the first meaning of statistical consistency: as shown in 10, phylogenomic pipelines that use weighted statistical binning followed by coalescent-based summary methods such as MP-EST will converge in probability to the true species tree as the number of loci and sites per locus both increase. We sketch the proof, to illustrate how the different aspects of the algorithm design are used to ensure statistical consistency. First, 10 showed that as the sequence length per locus increases, estimated gene trees converge to the true gene tree, and their bootstrap support values also converge to 100%; hence, the binning procedure produces bins that contain genes with the same gene tree topology with probability converging to 1. Recall that in a weighted statistical binning analysis, the loci that are placed in the same bin are then analyzed under a fully partitioned maximum likelihood analysis. Under the assumption that all the loci in the same bin have evolved down the same tree, a fully partitioned maximum likelihood analysis will converge to the tree associated to the bin as the length of the sequences for each locus increases. Then, in a weighted statistical binning analysis, this “supergene” tree would be replicated as many times as the number of loci in its bin. Finally, as the number of loci increases, the distribution of the supergene trees will converge to the true gene tree distribution defined by the species tree. Therefore, if the supergene trees are analyzed using a summary method (e.g., MP-EST or ASTRAL) that is statistically consistent given true gene trees, the pipeline described would be statistically consistent (in the first sense) under the multi-species coalescent model.

However, as shown in 10, pipelines using unweighted statistical binning (which does not replicate the supergene trees) are very different. In particular, as the number of genes increases, the estimated distribution of supergene trees will converge to the flat distribution in which every possible gene tree topology appears exactly once. Hence, no matter how these supergene trees are combined, the true species tree cannot be estimated with high probability. Therefore, pipelines that use unweighted statistical binning are not consistent even in the weak sense.
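The contrast between the two variants can be illustrated with a small sketch. It is not the binning software itself: the function names and the toy bins are hypothetical, and the supergene tree of each bin is assumed to be already computed. It only shows how replicating each supergene tree by its bin size (weighted) preserves the gene tree frequencies, whereas keeping one copy per bin (unweighted) flattens them.

```python
from collections import Counter

def supergene_tree_multiset(bins, weighted):
    """Count supergene tree topologies passed on to the summary method.

    bins: list of (supergene_topology, number_of_genes_in_bin) pairs
          (hypothetical input; the supergene trees are assumed given).
    weighted: if True, replicate each supergene tree once per gene in its bin;
              if False, keep a single copy per bin.
    """
    counts = Counter()
    for topology, n_genes in bins:
        counts[topology] += n_genes if weighted else 1
    return counts

def frequencies(counts):
    """Convert topology counts into relative frequencies."""
    total = sum(counts.values())
    return {topology: c / total for topology, c in counts.items()}

# Toy example: three bins over three taxa, with unequal bin sizes.
toy_bins = [("((A,B),C)", 6), ("((A,C),B)", 2), ("((B,C),A)", 2)]
weighted_freqs = frequencies(supergene_tree_multiset(toy_bins, weighted=True))
unweighted_freqs = frequencies(supergene_tree_multiset(toy_bins, weighted=False))
```

In this toy input the weighted frequencies keep the 6:2:2 proportions of the underlying genes, while the unweighted frequencies are equal for all three topologies, so the dominant topology can no longer be identified from them.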

But what about the second interpretation of statistical consistency, where the number of sites per locus is fixed, but the number of loci increases? Are pipelines that use weighted statistical binning, followed by a summary method such as MP-EST or ASTRAL, statistically consistent under this second meaning (i.e., the strong sense of statistical consistency)? Does the Roch and Steel result help shed any light on this issue? Before we can answer this, we need to understand the difference between unpartitioned and fully partitioned maximum likelihood analyses.


Suppose we are given multiple sequence alignments for p different loci, and we wish to compute a maximum likelihood tree on the concatenated alignment under the Jukes-Cantor site evolution model, where a Jukes-Cantor model tree consists of a rooted binary tree T and numeric model parameters (the branch lengths of the tree). In an unpartitioned Jukes-Cantor maximum likelihood analysis we assume all the sites within the alignment evolve down a single Jukes-Cantor model tree, and we seek the tree and its numeric model parameters that is most likely to have generated the observed data.

In a fully partitioned Jukes-Cantor maximum likelihood analysis, we no longer assume that all the sites evolve down a single Jukes-Cantor model tree; instead, we assume that the different parts of the concatenated alignment each evolves down its own model tree, and the only constraint we make is that the model tree for the different parts share the same tree topology. Hence, we allow the numeric model parameters (i.e., branch lengths) to differ between the different loci. Therefore, the result of a fully partitioned Jukes-Cantor maximum likelihood analysis on a concatenated alignment from p loci is a set of p (potentially different) Jukes-Cantor model trees, each sharing the same tree topology. Equivalently, the result is a single tree topology T, but also p branch lengths for each branch in T.

We will show that fully partitioned maximum likelihood analyses and unpartitioned maximum likelihood analyses have very different theoretical properties. First, consider how maximum likelihood evaluates a sequence alignment where the sequences for the different species are all identical (e.g., an input of 100 sequences each identical to AACATAG). It is easy to see that all tree topologies are equally good under Jukes-Cantor maximum likelihood, and so cannot be distinguished under the maximum likelihood criterion. We will refer to alignments of this type as “invariant alignments” and loci that have invariant alignments as “invariant loci”. Now consider a concatenated alignment based on p loci, where all the loci but one are invariant. Under a fully partitioned Jukes-Cantor maximum likelihood analysis of such a dataset, because the p-1 invariant multiple sequence alignments fit every tree topology equally well, they do not impact the fully partitioned Jukes-Cantor maximum likelihood analysis of the concatenated alignment, and the result is identical to the Jukes-Cantor maximum likelihood analysis of the single variable multiple sequence alignment (see 10 for details). However, as Roch and Steel showed (see below), unpartitioned maximum likelihood analyses converge (as the number of loci increases) to a maximum parsimony analysis of the single variable multiple sequence alignment. Thus, unpartitioned and fully partitioned maximum likelihood analyses behave very differently, and theoretical results (positive or negative) about one do not imply the same results for the other.
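The claim that invariant loci do not affect a fully partitioned analysis can be written out as a short derivation. The notation is introduced here for illustration only and is not taken from 10: A_i denotes the alignment of locus i, n_i its number of sites, and \theta_i its locus-specific branch lengths on the shared topology T.

```latex
% Fully partitioned log-likelihood of topology T for loci A_1, ..., A_p,
% with separate branch lengths \theta_i for each locus:
\ell_{\mathrm{FP}}(T) \;=\; \sum_{i=1}^{p} \max_{\theta_i} \log L(A_i \mid T, \theta_i)

% For an invariant locus A_i with n_i sites, every site is a constant pattern.
% The probability of a constant pattern is at most the probability of the state
% observed at a single leaf, which is 1/4 under Jukes-Cantor, with equality
% when all branch lengths are zero. Hence, for every topology T:
\max_{\theta_i} \log L(A_i \mid T, \theta_i) \;=\; n_i \log \tfrac{1}{4}

% If loci 1, ..., p-1 are invariant and locus p is variable:
\ell_{\mathrm{FP}}(T) \;=\;
  \underbrace{\sum_{i=1}^{p-1} n_i \log \tfrac{1}{4}}_{\text{independent of } T}
  \;+\; \max_{\theta_p} \log L(A_p \mid T, \theta_p)
```

Since the first term is the same for every topology, the topology maximizing the fully partitioned likelihood is the Jukes-Cantor maximum likelihood topology for the variable locus alone, for every value of p.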

Roch and Steel’s proof that concatenation is statistically inconsistent under the multi-species coalescent model

Roch and Steel 1 prove that unpartitioned maximum likelihood is statistically inconsistent under the multi-species coalescent model, under the assumption that the gene sequence evolution model is the r-state symmetric model (i.e., models such as Jukes-Cantor, in which substitutions between every pair of distinct states are equiprobable). They establish this proof by showing that for some model species tree with very high levels of ILS and very low rates of evolution (in which all loci have the same sequence length), with high probability, nearly all loci will be invariant (i.e., their sequence alignments will have no changes on them). They then show that under the r-state symmetric model, unpartitioned maximum likelihood will converge to maximum parsimony as the number of loci increases (see Appendix). Roch and Steel then argue that under these conditions, as the number of loci increases, maximum parsimony will converge to a tree that is different from the species tree. In other words, unpartitioned maximum likelihood on the concatenated alignment will be positively misleading because maximum likelihood will be identical to maximum parsimony under some conditions (for large enough numbers of loci), and maximum parsimony will be positively misleading under the multi-species coalescent model.

Consider an input where the first p-1 loci have invariant multiple sequence alignments, and the pth locus has a variable sequence alignment. By the argument provided by Roch and Steel, as p→∞, an unpartitioned maximum likelihood analysis will converge to a maximum parsimony analysis of the concatenated alignment. Since p-1 of these loci are invariant, the unpartitioned Jukes-Cantor maximum likelihood analysis of the concatenated alignment converges to the maximum parsimony analysis of the pth locus.

For the same data, but under a fully partitioned maximum likelihood analysis, the first p-1 loci have no impact on the Jukes-Cantor maximum likelihood analysis, and so the result is identical to Jukes-Cantor maximum likelihood on the pth locus. Thus, for all values of p, the fully partitioned maximum likelihood analysis will always be the Jukes-Cantor maximum likelihood of the pth locus. Thus, as p→∞, the unpartitioned maximum likelihood analysis will converge to the maximum parsimony analysis of the pth locus but the fully partitioned maximum likelihood analysis will be the maximum likelihood analysis of the pth locus. Since Jukes-Cantor maximum likelihood and maximum parsimony are not identical methods (i.e., they can return different trees on the same dataset), this shows that fully partitioned and unpartitioned maximum likelihood analyses are different methods, and can return different trees.

To summarize, the theorem by Roch and Steel established that an unpartitioned maximum likelihood analysis can be inconsistent under the multi-species coalescent model, but their result does not apply to a fully partitioned maximum likelihood analysis. Most importantly, their proof uses the fact that the maximum likelihood analysis is unpartitioned to show that as the number of loci increases and nearly all the loci are invariant, the unpartitioned maximum likelihood analysis converges to maximum parsimony. This statement is explicitly not true for fully partitioned analyses, which are unaffected by invariant loci. Hence, their entire argument is restricted to unpartitioned maximum likelihood, and their proof does not apply to a fully partitioned maximum likelihood analysis.

Summary and Discussion

This article has focused on what has been established theoretically for some standard methods for estimating species trees (i.e., concatenation using maximum likelihood and summary methods such as MP-EST and ASTRAL), as well as for the newer approach of weighted statistical binning, followed by a summary method. Because the term “statistical consistency” has been used in two different ways in the literature, we have summarized what is known under each meaning: the weaker sense where both parameters (sequence length per locus and number of loci) increase, and the stronger sense where only the number of loci increases but the sequence length per locus is bounded, perhaps by a constant. We have also clarified that Roch and Steel’s theorem about concatenation using maximum likelihood being statistically inconsistent is restricted to unpartitioned maximum likelihood.

So, what do we know about statistical consistency (for either sense) of standard techniques for estimating species trees in the presence of ILS? Methods that have been proven to be statistically consistent in the first sense include the standard summary methods (e.g., MP-EST, ASTRAL, the population tree from BUCKy 22, etc.), methods that estimate trees directly from alignments, such as *BEAST and SVDquartets, and also weighted statistical binning, paired with standard summary methods. On the negative side, Roch and Steel’s theorem establishes that unpartitioned maximum likelihood can be statistically inconsistent. A separate argument shows that unweighted statistical binning is inconsistent in this first sense.

However, even for this weak sense of statistical consistency (where both the number of sites per locus and number of loci increase), the statistical consistency of many methods is still unknown. In particular, Roch and Steel’s theorem does not establish statistical inconsistency for fully partitioned maximum likelihood. Because the loci can have different tree topologies under the multi-species coalescent model, it seems likely that maximum likelihood analyses, even if fully partitioned, will be found to be inconsistent. However, the proof for an unpartitioned analysis being inconsistent provided by Roch and Steel in 1 depends on using an unpartitioned analysis, and so establishing the inconsistency (or consistency, as the case may be) of fully partitioned maximum likelihood analysis requires a different mathematical argument.

In terms of the second definition (and stronger sense) of statistical consistency, where only the number of loci increases but the number of sites per locus is bounded, very little has been established. In fact, the only established results are that unpartitioned maximum likelihood and unweighted statistical binning are both inconsistent. Finally, we do not know whether any of the standard summary methods, fully partitioned maximum likelihood analyses, or weighted statistical binning, are statistically consistent under this second definition.

In fact, to try to prove a method is statistically consistent or inconsistent under the second definition is very difficult; the only methods that have been proven to be statistically consistent under this definition (which constrains the number of sites per locus to be bounded) are explicitly designed to have this property, but all (to our knowledge) require some additional constraints (e.g., the same constant rate of evolution across all loci 19 , or a strict molecular clock 7). Proofs of inconsistency under the second definition have been established, but again only for unpartitioned maximum likelihood (proven by Roch and Steel) and unweighted statistical binning. Attempts to prove statistical consistency or inconsistency under this second definition have so far failed for any of the standard methods.

In other words, the major phylogenomic estimation methods in common use that are designed to address incomplete lineage sorting — summary methods such as MP-EST, ASTRAL, and the population tree from BUCKy, and co-estimation methods such as *BEAST — have been proven to be statistically consistent only in the first sense, where both the number of loci and number of sites increase. None of them have been proven to be consistent in the second sense, where the sequence lengths per locus are bounded. Similarly, the weighted statistical binning method is statistically consistent in this first sense, but it is not known if it is statistically consistent in the second sense.

It is also clear that Roch and Steel’s theorem (as well as the observations provided by their proof) is limited to unpartitioned maximum likelihood. Therefore, it cannot help us understand the theoretical properties of fully partitioned maximum likelihood, and hence also cannot help us understand the theoretical properties of methods that use fully partitioned maximum likelihood (such as weighted statistical binning). Therefore, any attempt to establish the statistical consistency or inconsistency of these methods will need to use other mathematical arguments.

In other words, the established theory regarding species tree estimation methods is limited. Yet, performance in practice (i.e., on simulated and biological data) suggests that many methods (including concatenation) provide good accuracy under some model conditions. Unfortunately, as discussed in 7,8,10,13, most simulation studies have explored performance only on very small datasets (e.g., with at most ten species) and under unrealistic conditions. For example, 14 examined performance on a model species tree with only 5 species and very high levels of ILS, where sequence evolution was under a strict molecular clock, and used 1000 sites per locus (so that the sequence lengths per locus are too large to avoid recombination events). Instead, to understand the relative performance of coalescent-based summary methods and concatenation, more extensive analyses based on biologically realistic conditions (and hence based on short sequences, or modeling sequence evolution with recombination within loci) are needed.

Finally, despite the interest in coalescent-based species tree estimation, the current methods are still in their infancy, and new methods will need to be developed in order to obtain highly accurate species trees under realistic conditions. While the focus of this article has been on the performance of concatenation analyses, and the implications for coalescent-based summary methods that use weighted statistical binning, alternative approaches have been developed that do not require the estimation of gene trees or supergene trees (e.g., 10,11,15,16,18,19,20). Given the theoretical and empirical challenges in producing accurate gene trees, these approaches may provide the best accuracy for genome-scale phylogenomic analysis.

Statistical consistency (both senses) of standard species tree estimation techniques

Statistical consistency of some standard methods

We present the current status with respect to statistical consistency (of the first or second kind) of some standard phylogenomic estimation methods. The first column is for the first meaning of statistical consistency, which states that the species tree estimated by the method will converge to the true species tree as the number of loci and number of sites per locus both increase. The second column is for the second meaning, which states that the species tree estimated by the method will converge to the true species tree as the number of loci increases, even for bounded number of sites per locus. We also cite the paper in which the theoretical result is established.

Method | Consistency – first kind | Consistency – second kind
Unpartitioned concatenated maximum likelihood | NO (1) | NO (1)
Fully partitioned maximum likelihood | UNKNOWN | UNKNOWN
Unweighted statistical binning followed by consistent summary method (e.g., ASTRAL) | NO (10) | NO (10)
Weighted statistical binning followed by consistent summary method (e.g., ASTRAL) | YES (10) | UNKNOWN

Appendix 1: Quote from Roch and Steel’s Paper

An outline of the proof of the main theorem is as follows: We show that the expected proportion of sites that are constant can be made arbitrary large with low rates of evolution (the lower bounds are formalized in Claim 4) and that the empirical frequencies of site patterns is concentrated around the expected values (Claim 2). When there are a large enough number of invariable sites, it can be shown that likelihood scores and parsimony scores converge to the same answer (formalized in Claim 1). Thus trees that have better parsimony score have better likelihood under these scenarios. Therefore, it suffices to show that parsimony is not statistically consistent under arbitrary low rates of evolution.

Sebastien Roch and Mike Steel, “Likelihood-based tree reconstruction on a concatenation of sequence datasets can be statistically inconsistent”, Theoretical Population Biology 100 (2015): 56-62

Competing Interests

The authors have declared that no competing interests exist.

One Tree to Link Them All: A Phylogenetic Dataset for the European Tetrapoda Fri, 08 Aug 2014 10:49:46 +0000


The use of phylogenetic data in ecological analyses has grown rapidly in the last decades, giving rise to new disciplines such as community phylogenetics, which incorporate information on species relatedness into the study of community structure1,2, as well as to studies of the large-scale distribution of species and their phylogenetic diversity3,4. Additionally, the integration of ecological and evolutionary information holds promise to improve ecological forecasting in the current context of climate and land change and biodiversity loss5,6. Since the pioneering work of Dan Faith7, conservation biology has recognized the importance of considering phylogenetic diversity as a relevant feature for conservation8,9. The EDGE framework is, in this regard, an important initiative that combines the evolutionary distinctiveness of species (i.e. the evolutionary contribution of a species to the tree of life) with global endangerment risk assessments to derive conservation priorities10. Recent works have also focused on how future climate and land use change could further jeopardize the tree of life in certain parts of the world11,12.

To foster the development of these emergent fields and timely questions, detailed and broadly sampled phylogenetic hypotheses are needed to appropriately integrate evolutionary information into ecological and conservation studies. Recent phylogenomic studies have improved our understanding of the evolutionary relationships within the main Tetrapoda groups, especially at high levels such as families and orders. For instance, Roelants and colleagues13 clarified the relationships between global amphibians at the family level, while Pyron et al.14 achieved a similar result for Squamata, sampling all families and sub-families. Concerning birds, Hackett et al.15 elucidated the inter-ordinal relationships of extant birds, and a later study16 confirmed the partly controversial results found by Hackett and colleagues. Despite these achievements, we still lack detailed species-level phylogenies for such groups. Moreover, there is a lack of phylogenies for particular regions (but see 17), as systematists mainly focus on building species-level phylogenies for entire clades. Although complete taxonomic sampling is of obvious interest, research areas such as community phylogenetics and conservation planning do not specifically require it, but rather complete spatial, or biogeographic, sampling. In other words, ecological studies that wish to integrate evolutionary data usually require a phylogenetic hypothesis for the entire species pool under study, which might be along a specific gradient18 or a continental scale assessment11,19. For instance, incorporating phylogenetic diversity in reserve design or gap analysis only requires a complete phylogenetic tree for the entire group within the region of interest (see for example 19, 85). It should however be noted that since the complete coverage only concerns Europe, estimates of phylogenetic uniqueness are biased, and this bias should be accounted for in the analysis of the data (e.g. 86).

For that purpose, we here construct and provide a phylogenetic dataset for all Tetrapoda species that occur in the entire European sub-continent (including Turkey) built on relevant phylogenetic data in GenBank and consensus tree knowledge, based on a supermatrix-supertree mixed approach20. We also check the congruence of the phylogenies obtained with previous evolutionary studies.


Squamata and Testudinae

The list of European Squamata species was extracted from Maiorano et al21. DNA sequences of 7 nuclear (BDNF, c-mos, NT3, PDC, R35, RAG-1, RAG-2) and 6 mitochondrial loci (12S, 16S, COI, cytB, ND2, ND4) were downloaded from GenBank with PHLAWD22. These regions have been shown to be useful for phylogenetic inference in previous studies of squamates according to Pyron et al14. Only 16 species of a total of 239 had no molecular data available in GenBank. In addition to Squamata species, we included 3 levels of outgroup taxa: Sphenodon punctata (closest living relative to Squamata); all 10 species of European turtles, two crocodilians (Alligator and Crocodylus) and two birds (Dromaius and Gallus); and finally two mammals (Mus and Pan). GenBank accession numbers are detailed in Table S1 (Appendix 1).

For each region, DNA sequences were aligned with MAFFT23 and checked by eye with Seaview24. Ambiguous alignment positions were trimmed with trimAl25. All the regions were concatenated in a supermatrix with FASconCAT26. The phylogenetic inference analysis was conducted with RAxML v. 7.8.127 using the GTRGAMMA model and employing the rapid hill-climbing algorithm28; we searched for 100 Maximum Likelihood trees applying a family tree constraint for squamates based on Pyron et al14. Bootstrapping was conducted with 1000 replicates to assess clade support.
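The concatenation step can be sketched as follows. This is only an illustration of what building a supermatrix means, not the FASconCAT implementation: the function name is hypothetical, and padding taxa that lack a locus with the gap character "-" is an assumption made here.

```python
def concatenate_loci(loci, missing="-"):
    """Concatenate per-locus alignments into a supermatrix.

    loci: list of dicts, each mapping taxon name -> aligned sequence
          for one locus (all sequences within a locus have equal length).
    missing: character used to pad taxa absent from a locus
             (assumption for this sketch).
    Returns a dict mapping taxon name -> concatenated sequence.
    """
    taxa = sorted({taxon for locus in loci for taxon in locus})
    parts = {taxon: [] for taxon in taxa}
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must be aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa without data for this locus receive a block of padding.
            parts[taxon].append(locus.get(taxon, missing * width))
    return {taxon: "".join(blocks) for taxon, blocks in parts.items()}

# Toy example with two loci; the second locus is missing for one taxon.
toy_loci = [
    {"Lacerta": "ACGT", "Podarcis": "ACGA"},
    {"Lacerta": "GG"},
]
supermatrix = concatenate_loci(toy_loci)
```

Every taxon ends up with a sequence of the same total length, which is what a tree inference program expects from a concatenated alignment.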

The 100 ML trees were dated with penalized-likelihood as implemented in r8s29; we constrained 5 nodes based on fossil information extracted from Mulcahy et al.30: we set a minimum and a maximum age of 256 and 300 mya respectively for the most recent common ancestor (mrca) of all Reptilia31,32, a minimum and a maximum age of 239 and 250 mya respectively for the mrca of Birds and crocodilians32, a minimum age for the mrca of Lepidosauria of 223 mya33,34, a minimum age of 111 mya for the stem branch of Amphisbaenidae34,35, and a minimum age of 93 mya for the stem branch of Alethinophidia36. The best smoothing value was determined by a cross-validation procedure, following 29.
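The five calibrations can be written down as a small table of age bounds, together with a check that a dated tree respects them. This is a sketch only: it does not reproduce r8s input syntax, the dictionary keys and function name are hypothetical, and the node ages in the example are invented for illustration. Only the bounds come from the text above.

```python
# (minimum age, maximum age) in million years ago; None means no bound.
SQUAMATA_CALIBRATIONS = {
    "Reptilia (mrca)": (256.0, 300.0),
    "Birds + crocodilians (mrca)": (239.0, 250.0),
    "Lepidosauria (mrca)": (223.0, None),
    "Amphisbaenidae (stem)": (111.0, None),
    "Alethinophidia (stem)": (93.0, None),
}

def violated_calibrations(node_ages, calibrations):
    """Return the names of calibrated nodes whose age falls outside its bounds.

    node_ages: dict mapping calibration name -> estimated age (mya).
    calibrations: dict mapping calibration name -> (min_age, max_age).
    """
    violated = []
    for name, (min_age, max_age) in calibrations.items():
        age = node_ages[name]
        too_young = min_age is not None and age < min_age
        too_old = max_age is not None and age > max_age
        if too_young or too_old:
            violated.append(name)
    return violated

# Invented node ages, for illustration only (not results from this study).
example_ages = {
    "Reptilia (mrca)": 280.0,
    "Birds + crocodilians (mrca)": 252.0,  # exceeds the 250 mya maximum
    "Lepidosauria (mrca)": 240.0,
    "Amphisbaenidae (stem)": 120.0,
    "Alethinophidia (stem)": 95.0,
}
problems = violated_calibrations(example_ages, SQUAMATA_CALIBRATIONS)
```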

The data matrix and the phylogenetic tree with the highest likelihood are available in Treebase (accession number: S15708).


For Amphibians, we include here the phylogenetic tree constructed for a previous study19. The list of European Amphibian species was extracted from Maiorano et al21. We retrieved from GenBank sequences of phylogenetically informative regions that were available for at least 30% of the species: 9 mitochondrial (12S, 16S, COI, cytb, ND1, ND2, ND4, tRNA-Leu, tRNA-Val) and 2 nuclear (RAG-1, rho) regions. We found relevant molecular data for all species, but we excluded the two hybrid species Pelophylax grafi and Pelophylax hispanicus. We included Xenopeltis unicolor, Gallus gallus and Mus musculus as outgroups to root the tree. For each region, alignment was conducted with four programs (Clustal37, Kalign38, MAFFT23, MUSCLE39). The best resulting alignment was selected based on Mumsa38, and checked visually. Ambiguous regions of each alignment were removed with trimAl25. All regions were concatenated in a supermatrix with FASconCAT. As with Squamata, we obtained 100 ML phylogenetic trees by conducting a phylogenetic inference analysis with RAxML, this time applying a family-level tree constraint based on Roelants et al13. A bootstrap analysis was conducted with 1000 replicates to assess clade support.

We dated the 100 ML trees with penalized-likelihood (r8s) using the following fossil data to constrain minimum ages for selected nodes: 155 mya for the crown-origin of salamanders40, 170 mya for Bombianura41, 250 mya for Batrachia42, 110 mya for the split of Pelobatidae and Pelodytidae families43, 145 mya for the split of Pelobatidae and Neobatrachia43, and 61 mya for the split of Plethodontidae and Proteidae44. Additionally, we set a minimum and maximum age (312-330 mya) for the split between diapsid (Gallus gallus, Xenopeltis unicolor) and synapsid amniotes (Mus musculus), based on Benton and Donoghue45.

The data matrix and the phylogenetic tree with the highest likelihood are available in Treebase (accession number: S13561).


We include here 100 dated phylogenetic trees for 430 species of European breeding birds from Roquet et al20. This phylogenetic dataset was built upon sequences retrieved from GenBank for 10 mitochondrial gene regions (12S, ATP6, ATP8, COII, COIII, ND1, ND3, ND4, ND5, ND6) and six nuclear ones (28S, c-mos, c-myc, RAG-1, RAG-2, ZENK). The alignment procedure was the same as for Amphibians. We also performed 100 ML phylogenetic inference searches and standard bootstrapping (1000 replicates) with RAxML, applying a tree constraint at the ordinal level based on Hackett et al15. The 100 trees were dated with penalized likelihood (r8s) applying fossil calibrations for 14 clades (Table S2, Appendix 2). The best ML tree can be found in Treebase (study number 10770).


The phylogenetic data included here for mammals are based on the supertree of Fritz and colleagues46; specifically, we extracted the resampled dataset of 100 fully resolved phylogenetic trees from Kuhn et al.47, where polytomies of the supertree from Fritz et al.46 were randomly resolved applying a birth-death model to simulate branch lengths. Then, for each tree, we replaced the Carnivora clade with the update from a recent study48, which provides a better resolution and increases the sampling from 252 sp to all Carnivora species (286 sp). Finally, we removed all non-European species. These modifications of the phylogenetic trees were done with the R package ape.

Phylogenetic inference

As stated before, for each taxon group except mammals we have conducted 100 ML inferences with RAxML. Every inference begins with a different starting tree, which is built by adding sequences one by one in random order, identifying their optimal location on the tree under the parsimony optimality criterion. Since sequences are added in random order, it is very likely that a different starting tree is generated at every search87,88. RAxML searches were then performed with the “lazy subtree rearrangement” method (a variant of the subtree pruning and regrafting method) under a ML framework. Like all heuristic search strategies, the Maximum Likelihood search strategy employed by RAxML is not guaranteed to find the most probable tree in tree space, and because of that, it is important to conduct multiple searches from different starting trees. To check whether all the searches converged on trees with similar likelihoods, we performed the Shimodaira-Hasegawa test89 (SH) implemented in RAxML. In all cases, the likelihoods of the trees of the same group of taxa were not significantly different (p < 0.01). This increases our confidence that the trees found are close or similar to the most likely tree, and that they do not result from the algorithm getting stuck in a local optimum.

Supertree construction

The trees cited above were combined, after pruning the outgroups, by joining them with the R package ape; to do so we set divergence ages between these main groups based on the information retrieved from the Timetree website49: the divergence age between mammals and sauropsids (i.e. birds, turtles and squamates) was set to 324 mya, and the divergence age between amphibians and the rest of the groups was set to 361 mya. To build the final tetrapod tree, we randomly selected one tree from the 100 trees available for each group. We repeated this approach 100 times to get 100 realisations of the tetrapod tree. These combinations were done randomly since the likelihoods of the trees of each group were not significantly different according to the SH test. The 100 dated trees of each group and the 100 dated supertrees for all European Tetrapoda are available from the Dryad digital repository (DOI: X).
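The joining and resampling logic can be sketched as follows. The authors worked with the R package ape; the sketch below is written in Python on Newick strings purely to illustrate the arithmetic. The function names, the toy trees, and the crown ages are hypothetical, and sauropsids are treated as a single already-assembled subtree, which is a simplification. Only the two divergence ages (324 and 361 mya) come from the text. The stem branch leading to each group is the divergence age minus that group's crown age, so that all tips remain at the present.

```python
import random

AMNIOTE_SPLIT_MYA = 324.0  # mammals vs. sauropsids (from the text)
TETRAPOD_ROOT_MYA = 361.0  # amphibians vs. all other groups (from the text)

def join_groups(mammals, sauropsids, amphibians):
    """Join three dated group trees into one dated tetrapod tree.

    Each argument is a pair (newick_without_semicolon, crown_age_mya).
    Returns a Newick string with stem branch lengths chosen so that the
    amniote node sits at 324 mya and the root at 361 mya.
    """
    (m_nwk, m_age), (s_nwk, s_age), (a_nwk, a_age) = mammals, sauropsids, amphibians
    amniotes = "({}:{:g},{}:{:g})".format(
        m_nwk, AMNIOTE_SPLIT_MYA - m_age,
        s_nwk, AMNIOTE_SPLIT_MYA - s_age,
    )
    return "({}:{:g},{}:{:g});".format(
        amniotes, TETRAPOD_ROOT_MYA - AMNIOTE_SPLIT_MYA,
        a_nwk, TETRAPOD_ROOT_MYA - a_age,
    )

def sample_supertrees(pools, n_draws, seed=0):
    """Draw one tree per group at random, n_draws times, and join each draw.

    pools: dict with keys "mammals", "sauropsids", "amphibians", each a list
           of (newick_without_semicolon, crown_age_mya) pairs.
    """
    rng = random.Random(seed)
    return [
        join_groups(
            rng.choice(pools["mammals"]),
            rng.choice(pools["sauropsids"]),
            rng.choice(pools["amphibians"]),
        )
        for _ in range(n_draws)
    ]

# Toy two-tip trees with invented crown ages, for illustration only.
toy_tree = join_groups(
    ("(Mus:90,Lepus:90)", 90.0),
    ("(Gallus:280,Lacerta:280)", 280.0),
    ("(Rana:300,Salamandra:300)", 300.0),
)
```

With pools of 100 trees per group and n_draws set to 100, sample_supertrees would yield 100 joined trees, mirroring the resampling described above.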

Results and Discussion


The study of Pyron et al.14 constituted a major advance in our understanding of the phylogenetic relationships between the main lineages of Squamata. Their study had a broad taxonomic and molecular sampling: they included members of all currently recognized families and subfamilies, for which 7 nuclear and 5 mitochondrial loci were analysed. Here, we took advantage of the knowledge derived from that study by incorporating a family-level tree constraint based on their results. We also performed the analysis without the tree constraint (results not shown); the results were congruent with the first analysis, but the lack of a family-level tree constraint yielded low bootstrap (BS) support for the deepest nodes.

Our phylogenetic results are largely congruent with those of Pyron and colleagues14. We have similar levels of strong nodal supports except for the relationships between genera of Lacertidae; 67.8% of the nodes had a strong support (BS > 70%, Fig. 1, Appendix 3) and 13.1% of the nodes had a moderate support (BS 50-70%). In accordance with their study, we detected that some genera are not monophyletic: Ablepharus (Scincidae), Cyrtopodion (Gekkonidae), Zamenis (Colubridae). We also found strong evidence that Hierophis and Dolichophis (Colubridae) are not monophyletic genera, as D. cypriensis (which was not included in 14) is nested within Hierophis with a 100% BS.

Available dating studies on Squamata differ considerably in their age estimates. For instance, a recent study30 estimated the squamate crown group to be c. 180 mya, while two other studies estimated the same group to be c. 240 mya50,51. Our estimates of divergence times are in general roughly similar to those of Kumazawa’s study50. It has been suggested30 that the use of only mitochondrial regions (which is not the case here) may bias the results towards older ages, but differences in methodology and in taxon and molecular sampling make it difficult to identify all the causes of those discrepancies.


The phylogenetic inference analysis for the amphibians yielded a particularly robust topology: 83.5% of the nodes showed strong support (BS > 70%, Fig. 2, Appendix 3). Supported nodes of our ML trees were congruent with previous phylogenetic studies52,53,54,55,56,57. Concerning the divergence age estimates, we obtained younger ages for the deepest nodes than Roelants and colleagues13; for instance, Batrachia was estimated with r8s to be c. 330 mya in that study, while we estimated it to be c. 300 mya. In contrast, we retrieved older ages for the shallowest nodes (e.g. we estimated that the divergence between Salamandra and Pleurodeles occurred 100 mya, whereas Roelants and colleagues estimated it to have occurred c. 75 mya). These differences might be linked to differences in molecular and taxon sampling: Roelants et al. sampled only one species per genus, and several families that were included in their work are not present in Europe and thus were not included in our supermatrix.


Supported nodes of our ML trees are congruent with previous phylogenetic studies (Anseriformes: Donne-Goussé et al.58, Eo et al.59, Gonzalez et al.60; Galliformes: Gutierrez et al.61, Dimcheff et al.62, Crowe et al.63, Kimball et al.64, Kriegs et al.65, Lislevand et al.66; Gruiformes: Fain et al.67; Procellariiformes: Penhallurick and Wink68; Ardeidae: Sheldon et al.69; Accipitridae: Lerner and Mindell70, Griffiths et al.71; Charadriiformes: Paton et al.72, Thomas et al.73, Pons et al.74, Bridge et al.75, Paton and Baker76, Fain and Houde77; Passeriformes: Alström et al.78, Nguembock et al.79, Treplin et al.80; Piciformes-Coraciiformes: Johansson et al.81, Benz et al.82; Strigiformes: Wink et al.83); 68.7% of the nodes had strong BS support (BS > 70%, Fig. 3, Appendix 3), and an additional 12.4% had moderate support (BS = 50-70%). Divergence age estimates were, in general, congruent with those obtained by Brown et al.84.


Modifying the most recent mammal supertree available in the literature47 with the updated Carnivora clade48 increased the phylogenetic resolution (only nine polytomies remain in the updated Carnivora clade) and allowed a higher species sampling.

The importance of accounting for phylogenetic uncertainty

Phylogenetic information is sometimes incorporated in ecological analyses based on a single phylogenetic tree, assuming the tree is known without error. Any phylogenetic tree estimate is unlikely to be an exact representation of the true phylogeny, owing to possible biases or uncertainties arising from molecular and taxon sampling, sequence alignment, homoplasy, or long-branch attraction90,91. For all these reasons, it is important to include phylogenetic uncertainty in order to avoid overestimating our confidence in subsequent analyses (i.e. obtaining confidence intervals that are too narrow). This type of uncertainty can be accounted for in two ways: by using a single consensus tree (in which unsupported nodes are collapsed into polytomies), or by running the analyses with a range of trees and later summarising the results11, 93. The first approach (i.e. the consensus tree) may not be preferable, as polytomies can influence the results of tree-based statistical analyses (e.g. see 92 for the influence of phylogenetic resolution on several community phylogenetics indices) and do not allow full exploration of the variation in ecological patterns resulting from phylogenetic uncertainty. Moreover, not all methods have been adapted to allow for polytomies; some of them require completely bifurcating trees (e.g. the EDGE index10). We therefore highly recommend accounting for phylogenetic uncertainty by including a set of high-probability trees.
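The second approach can be illustrated with a minimal Python sketch, offered under stated assumptions rather than as the workflow used here. A tree is encoded, purely for illustration, as a list of (branch length, set of descendant tips) pairs; Faith's phylogenetic diversity stands in for whatever tree-based statistic is of interest; and the function names `faith_pd` and `summarise_over_trees`, as well as the three toy trees, are hypothetical.

```python
import math
import statistics

def faith_pd(tree, community):
    """Sum the lengths of all branches leading to at least one community member.

    `tree` is a list of (branch_length, descendant_tips) pairs: a deliberately
    simple, hypothetical encoding of a rooted, dated tree.
    """
    return sum(length for length, tips in tree if tips & community)

def summarise_over_trees(trees, statistic, lower=0.025, upper=0.975):
    """Apply `statistic` to every tree and summarise the spread of the values."""
    values = sorted(statistic(tree) for tree in trees)
    last = len(values) - 1
    # Conservative empirical interval: floor the lower rank, ceil the upper rank.
    interval = (values[math.floor(lower * last)], values[math.ceil(upper * last)])
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "interval": interval,
    }

# Three toy chronograms for tips A-D (topology and branch lengths vary).
trees = [
    [(1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}), (2.0, {"C"}),
     (2.0, {"A", "B", "C"}), (4.0, {"D"})],
    [(1.0, {"A"}), (1.0, {"C"}), (1.0, {"A", "C"}), (2.0, {"B"}),
     (2.0, {"A", "B", "C"}), (4.0, {"D"})],
    [(2.0, {"A"}), (2.0, {"B"}), (0.5, {"A", "B"}), (2.5, {"C"}),
     (2.0, {"A", "B", "C"}), (4.5, {"D"})],
]
summary = summarise_over_trees(trees, lambda tree: faith_pd(tree, {"A", "C"}))
```

With a real set of 100 trees, the same loop would be run once per tree and the interval reported alongside the point estimate, so that the spread due to phylogenetic uncertainty remains visible in downstream results.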

Data availability statement

The 100 dated supertrees for all European Tetrapoda and the 100 dated trees of each taxon group (amphibians, birds, mammals, squamates and turtles) are available from the Dryad digital repository (DOI: X).


We provide here a phylogenetic dataset consisting of 100 chronograms of European Tetrapoda species as a tool for ecological studies that aim to incorporate an evolutionary perspective, and for phylogenetic conservation assessment. This phylogenetic dataset is in general agreement with previous studies, and we expect it to be a coarse approximation of the “true” Tetrapoda evolutionary tree. Instead of providing the best ML tree for every group, we provide 100 trees (available in the Dryad repository), as running analyses with several trees allows phylogenetic uncertainty to be taken into account. Regarding the taxonomic sampling, the large majority of species are included (91%). On the other hand, some molecular regions have low sampling; thus, this dataset will be useful until a substantial amount of molecular data becomes available for a considerable number of species.

Competing Interests

The authors have declared that no competing interests exist.

DNA Barcoding of Marine Copepods: Assessment of Analytical Approaches to Species Identification Mon, 23 Jun 2014 17:16:44 +0000


Marine copepods represent a predominant component of the zooplankton throughout the world oceans in both abundance and biomass 1,2. There are more than 2,500 described species of planktonic marine copepods, with species distributions ranging from shallow, brackish, estuarine waters to deep ocean (abysso- and hadopelagic) zones 3. Copepods exhibit a wide variety of biogeographical patterns, from very limited distributions to cosmopolitan and global-ocean ones.

Their high species diversity, together with their relatively small size and apparent similarity among different forms, has made the morphological identification and quantification of copepod species a challenging task 4. In addition, it is likely that there are large numbers of cryptic species within what are now considered recognized species, especially for geographically-widespread taxa 5,6,7.

Considerable effort has been focused on the development and use of genetic approaches to identifying and discriminating marine species in the past ~20 years (reviewed by Bucklin et al. 8). Use of a fragment of the cytochrome c oxidase subunit I (COI) gene for discrimination and identification of animal species, i.e., DNA barcoding 9,10, has moved rapidly from novelty to widespread use, although it has not been free of controversy. Objections have focused on uses of barcodes beyond the original intent as a species assignment tool, including DNA taxonomy 11,12, ecological assessment 13, and species discovery 14. Recent improvements in methods for statistical analysis of barcode data 13,15,16,17 and growing focus on the appropriate use and limitations of barcode analysis 18 are advancing the field of DNA barcoding.

Recent DNA barcoding studies of marine planktonic copepods have focused on examination of species-level diversity in particular regions of the ocean 19,20,21,22, and also on particular – usually problematical – taxa 23,24,25,26,27,28,29. Other studies have used DNA barcodes for biogeographical or phylogeographical analyses 30,31,32,33,34. A number of studies have revealed cryptic species 5,33,35,36,37.

This study provides 800 new barcode sequences for 63 copepod species not included in previous studies. These new barcoding records increase both the depth of sampling and also the geographical coverage of existing records, and continue progress toward a taxonomically-comprehensive database or library of DNA barcode sequences for all species of the groups or lineages of interest. Importantly, this study examines a variety of statistical and analytical approaches used for barcode data, and provides new information about the strengths, weaknesses and limitations of DNA barcodes for discrimination and identification of copepod species. A particular focus of this study is the impact of any missing data (i.e., species not represented in the barcode database) on the accuracy and reliability of species identifications. Finally, we offer new guidance and a conceptual framework for continued barcoding efforts to meet challenges of species identification of copepods, one of the most ecologically important and systematically complex groups of marine zooplankton.


Samples analyzed

Sequences of the COI barcoding region 38 were determined for identified individual specimens collected from various sources from 1992 to 2011, and archived at the University of New Hampshire (1992-2005) or the University of Connecticut (2005-2011). All specimen and collection metadata are included in the GenBank entries. Appropriate reference is made to previously published sequence data. Laboratory protocols (DNA purification, PCR amplification, and sequencing) are as described in previous publications by the authors 20,35.

DNA sequence data analysis

Sequences were analyzed using Molecular Evolutionary Genetics Analysis (MEGA) Ver. 5 39. Sequences were aligned using ClustalW 40, as implemented in MEGA, using the corresponding amino acid translated version. This procedure allows better resolution by removing gap ambiguities, ensures designation of the correct codon reading frame, and minimizes risks of including nuclear pseudogenes with mitochondrial origins, known as numts 41,42. Initial tree runs were used to check for very divergent sequences (i.e., potential numts), which were removed prior to analysis. A total of eight individual sequences clustered in a single, highly supported, independent clade that comprised a mixture of species from different orders (two) and families (six) of the Subclass Copepoda (five Calanoida and three Harpacticoida). When compared with the final number (1,381 COI sequences), this value can be considered low, although it is necessary to recognize that these eight sequences were those that had passed all initial filters (for example, they did not code for a stop codon, nor were they extremely aberrant when translated into their corresponding amino-acid sequence).

Descriptive statistics

Three different alignments were prepared for analysis in order to study the influence and possible bias due to variation in sequence length and heterogeneity in levels of sequence divergence along the barcode region. The analyzed alignments will be referred to as follows: 1) original alignment, including all 1,381 sequences of any length; 2) standard barcode alignment, including only sequences of >500 bp (see Barcode of Life); and 3) unique barcode alignment, considering only a 400 bp portion (positions 96 to 497) of the barcode region and including only a single copy of all the different haplotypes for each species (576 sequences in all). The unique alignment was subjected to a sliding window analysis of nucleotide diversity (π) using DnaSP Ver. 5 43. Two runs were performed with 10 bp step size and window lengths of 10 bp and 100 bp; results were compared to visualize differences in π along the analyzed region.
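The sliding window calculation can be sketched in a few lines of Python. This is an illustrative simplification, not the DnaSP implementation: π is taken here as the mean proportion of differing sites over all sequence pairs within each window, gaps and ambiguity codes are treated as ordinary characters, and the function names and toy alignment are hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs or not seqs[0]:
        raise ValueError("need at least two non-empty aligned sequences")
    differences = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return differences / (len(pairs) * len(seqs[0]))

def sliding_window_pi(seqs, window, step):
    """Return (window_start, pi) pairs; window_start is a 0-based position."""
    length = len(seqs[0])
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in seqs]))
        for start in range(0, length - window + 1, step)
    ]

# Toy alignment: the first half is invariant, the second half is variable.
alignment = ["AAAAAAAA", "AAAAAAAT", "AAAATAAT"]
profile = sliding_window_pi(alignment, window=4, step=4)
```

For the analysis described above, the corresponding calls would use `step=10` with `window=10` and `window=100` on the unique alignment; a short window exposes local peaks and troughs, while a long window smooths them into a regional trend.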

Genetic distances within species, genera, families, and orders and between orders were calculated in MEGA using the Kimura 2-parameter (K2P) model 44 for each of the three alignments previously described. Mann-Whitney U tests were carried out based upon the unique alignment distance matrix to compare distances within versus between species and between taxonomic levels. Although K2P was the second-best fit for the dataset (the best corresponded to GTR+I+Γ), this model was used to allow direct comparison with previously published barcoding studies, which most frequently used this metric 8, despite growing criticism of this metric for barcode analysis 45,46. On the other hand, the same studies have shown that the choice of evolutionary model does not affect success rates of species identifications 45,46; uncorrected p-distances perform as well as any model (but see Fregin et al. 47).
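For reference, the K2P distance between two aligned sequences is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites showing transitions and transversions, respectively. The Python sketch below implements this formula next to the uncorrected p-distance. It is a minimal illustration, not MEGA's code: sites where either sequence carries anything other than A, C, G or T are simply skipped (an assumption), and the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_substitutions(s, t):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    compared = transitions = transversions = 0
    for a, b in zip(s.upper(), t.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return compared, transitions, transversions

def p_distance(s, t):
    """Uncorrected proportion of differing sites."""
    compared, transitions, transversions = count_substitutions(s, t)
    return (transitions + transversions) / compared

def k2p_distance(s, t):
    """Kimura 2-parameter distance; undefined (ValueError) for saturated pairs."""
    compared, transitions, transversions = count_substitutions(s, t)
    P = transitions / compared
    Q = transversions / compared
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# One transition (A/G) and one transversion (A/C) in ten sites.
d_k2p = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
d_p = p_distance("AAAAAAAAAA", "GAAAAAAAAC")
```

The corrected distance is always at least as large as the p-distance, and the two converge for closely related sequences, which is consistent with the observation that the choice of model matters little for within-species comparisons.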

Barcoding resolution

Initially, two automated statistical techniques for barcoding approaches to species identity assignment were evaluated: 1) automated identification of significant clades after tree reconstruction; and 2) genetic distance-based assignment by the Basic Local Alignment Search Tool (BLAST) method 48. Parallel analyses were carried out on the three alignments. In addition, a non-automated technique for species assignment was considered: the “best close match” 49 combines best match criteria and maximum within-species distance thresholds. Similar to the BLAST approach, this technique analyzes each query individually and identifies the closest sequence within a flexible threshold adapted to each dataset. Although computationally intensive and potentially time-consuming, this approach has been shown to out-perform automated and much more complicated methods 16, especially when the sequences are highly variable and many species are represented by one or a few sequences.
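The logic of best close match can be sketched as follows, as a minimal Python illustration of the decision rule rather than the published implementation. The query is assigned to the species of its nearest library sequence only if that nearest distance falls within the threshold; ties between different species are reported as ambiguous. The threshold is passed in as a parameter because it must be derived from each dataset, uncorrected p-distance is used as an assumed default metric, and the function names and toy library are hypothetical.

```python
def p_distance(s, t):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def best_close_match(query, library, threshold, distance=p_distance):
    """Classify `query` against `library`, a list of (species, sequence) pairs.

    Returns ("match", species), ("ambiguous", None) or ("no match", None).
    """
    scored = [(distance(query, seq), species) for species, seq in library]
    best = min(d for d, _ in scored)
    if best > threshold:
        # Nearest barcode is too distant: decline to identify.
        return ("no match", None)
    candidates = {species for d, species in scored if d == best}
    if len(candidates) == 1:
        return ("match", candidates.pop())
    # Equally close barcodes from more than one species.
    return ("ambiguous", None)

library = [
    ("sp1", "AAAAAAAAAA"),
    ("sp1", "AAAAAAAAAT"),
    ("sp2", "CCCCCAAAAA"),
]
result_loose = best_close_match("AAAAAAAATT", library, threshold=0.15)
result_strict = best_close_match("AAAAAAAATT", library, threshold=0.05)
```

The "no match" outcome is the key design choice: a query from a species absent from the library is left unidentified rather than being forced onto its nearest neighbour.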

Neighbor-Joining (NJ) trees 50 were reconstructed in MEGA using the K2P evolutionary model for the standard and unique alignments. Maximum Likelihood (ML) phylogenetic tree analyses were done using RAxML Ver. 7.2.8 51 under the GTR+I+Γ model for the three datasets. This model-based method (ML) allows inclusion of non-overlapping sequences in the same analysis, which is not possible with distance-based methods, such as NJ. In addition, there is a growing concern about the validity and adequacy of both NJ and K2P for barcode analysis, especially when compared with methods like ML under the best fit evolutionary model 45,46,47. The NJ and ML trees were compared for the standard and unique alignments to evaluate consistency of the results. Confidence level was estimated for both methods as percentage recovery after 10,000 bootstraps. Putative species were inferred using the Poisson tree processes model (PTP) on the ML trees 52. These putative species are equivalent to molecular operational taxonomic units (MOTU 53) and were compared with the morphologically-identified species (OTUs).

For the BLAST approach, jMOTU 15 was used on the original, standard and unique alignments. The minimum alignment length (i.e., overlap between sequence pairs) for analysis of the original dataset was set at 100 bp. The standard alignment showed minimum overlap of ~350 bp between pairs of sequences; 400 bp was common to all sequences in the unique alignment. The BLAST filter was 85 for all analyses. The tree results, resolved MOTUs, and identified OTUs were compared for the three alignments.

Fig. 1: Frequency distribution of the 1,381 sequences from the original dataset by length (in base pairs).

A total of 1,141 sequences (82%) fulfilled the minimum of 500 bp length definition of a gold standard barcode.

Species-by-species analyses

Taxa showing discrepancies between MOTUs and OTUs were selected for additional analysis, when possible based on available data, to examine possible reasons (e.g., variation among geographic areas or populations, cryptic speciation) for the observed disparities between morphological and molecular data. Analyses included parsimony haplotype networks (gene genealogies) using TCS Ver. 1.2.1 54 and calculation of FST distances between samples or regions using Arlequin Ver. 3.5 55.


This study reports a total of 800 new DNA barcode sequences for identified specimens of 63 species not included in previous studies. These new data were analyzed with 581 previously published sequences, yielding a total dataset of 1,381 sequences with an average length of 578.9 ± 84.3 bp (range: 105 – 658 bp); 82% of the sequences were > 500 bp (Fig. 1). The sequences originated from 195 different taxa or OTUs, including 71 genera, 37 families and 4 orders. Of the 1,381 total sequences, 1,354 belonged to the Order Calanoida (see Supplementary data S1.Alignment.fas).

Descriptive statistics

The sliding window analysis of the unique alignment using the 100 bp window length showed that nucleotide diversity (π) was lower toward the 5’ end of the barcode region, but was relatively constant and higher in the half of the region toward the 3’ end (Fig. 2). For the analysis using the 10 bp window length, the results were markedly irregular, reflecting variation among different domains along the COI barcode region, with moderately conserved regions separated by highly variable ones.

Fig. 2: Sliding window analyses of nucleotide diversity (π) along the “unique” alignment (see Methods).

The horizontal line indicates the average π for the fragment, 0.206. Analyses with window lengths of 10 bp and 100 bp were run with a 10 bp step size; both analyses showed lower variability on the 5’ end of the amplified fragment.

Based on the unique alignment, K2P distances between species were larger than those within species (p < 0.001), but some overlap was observed and no clear barcode gap 56 was identified. Some species showed high divergences between conspecific individuals, while in other cases there were no differences between individuals of different species (see Supplementary data). The range of variation of distances was reduced for the standard and original alignments, but there was some overlap of within- and between-species distances, which was more pronounced when comparing higher taxonomic levels (genus and above).
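Whether a barcode gap exists can be checked directly from the pairwise distances. The Python sketch below assumes the usual reading of the term, namely that the smallest between-species distance exceeds the largest within-species distance; it uses uncorrected p-distances for brevity where the analysis above used K2P, and the function names and toy records are hypothetical.

```python
from itertools import combinations

def p_distance(s, t):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def barcode_gap(records, distance=p_distance):
    """Return (max within-species, min between-species, gap present?).

    `records` is a list of (species, sequence) pairs.
    """
    within, between = [], []
    for (sp1, seq1), (sp2, seq2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(distance(seq1, seq2))
    if not within or not between:
        raise ValueError("need both within- and between-species comparisons")
    return max(within), min(between), min(between) > max(within)

clear_gap = barcode_gap([
    ("a", "AAAAAAAAAA"), ("a", "AAAAAAAAAT"),
    ("b", "CCCCCAAAAA"), ("b", "CCCCCAAAAT"),
])
no_gap = barcode_gap([("a", "AAAA"), ("a", "AATT"), ("b", "AAAT")])
```

In the second toy case one species is more variable internally than it is distinct from its neighbour, which is the situation described above for some copepod taxa.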

Analysis of the unique alignment revealed low densities of K2P distances between individuals from 0.05 to 0.15, and very low densities between 0.08 and 0.09 (Fig. 3). Overlap of within- and between-species distances was still observed (Fig. 3). Distances within and between higher-level groups also showed overlap, although these were significantly different when analyzed using multiple U tests (p < 0.001 in all cases).

Tree-based analysis of barcodes

The Maximum Likelihood trees based on the unique and standard alignments showed similar results to those of the NJ trees in terms of resolution and discrimination of clades, albeit with some differences in bootstrap values (see Supplementary data). Overall, the ML tree showed better grouping of closely-related taxa and higher recovery of deeper nodes than the NJ analysis.

Fig. 3: Frequency distribution (in percentages) of Kimura-2-Parameter (K2P) distances by taxonomic level: species, genus, family, order and subclass.

Overlap of within- and between-species distances was still observed. Distances within and between higher-level groups also showed overlap.

Automated tree-based analyses of the unique alignment resolved 227 MOTUs for the ML tree and 241 for the NJ tree. When the tree was examined by eye, these MOTUs could be reduced to 65 distinct species-specific clusters, each with more than one sequence separated by short internal branches. Bootstrap recovery was > 98% (100% in most cases; see Supplementary data). A number of taxa showed fragmentation (i.e., separation of clusters within the species grouping), indicating geographic differentiation or cryptic speciation; these clusters were identified as different putative species by the PTP analysis (Supplementary data). In contrast, there were highly supported clades comprising sequences from different species, including species of Calanus (C. helgolandicus Claus 1863 and C. euxinus Hulsemann 1991; C. agulhensis De Decker, Kaczmaruk & Marska 1991 and C. sinicus Brodsky 1965); Centropages (C. typicus Kröyer 1849 and C. chierchiae Giesbrecht 1889); Acartia (A. tonsa Dana 1849 and A. hudsonica Pinhey 1926); Pleuromamma (P. gracilis Claus 1863 and P. piseki Farran 1929); as well as a Paracalanus Boeck 1964 species clade.

The standard alignment identified most species with 99 – 100% bootstrap confidence; the PTP automated method identified 222 putative species from the ML tree and 277 for the NJ tree. In general, the automated tree approach failed to group conspecific individuals when the depth of sampling in that taxon was low, even though the species were grouped in a single clade with high bootstrap support in other tree-based analyses. Inclusion of additional sequences allowed better resolution of MOTUs, especially for clades showing variable or complex results (e.g., Acartia tonsa / hudsonica, Pleuromamma gracilis / piseki, and Paracalanus species).

Fig. 4: Number of MOTU inferred from the three alignments at a range of cut-offs (x-axis) expressed as percentage (relative to the mean length of each dataset) of differences between sequences.

The lower panel shows in detail the box from the upper panel.

ML analysis of the original alignment yielded results that were fully comparable to those for the standard and unique alignments; PTP identified 249 putative species. Deeper nodes showed better support when additional individuals were analyzed. Even the shortest sequences (17 sequences of 105 – 288 bp; Fig. 1) were placed in the correct species clade, with no decrease in the confidence value. The automated method again yielded variable results for taxa with small numbers of individuals. In some cases, analysis of additional individuals segregated closely-related sister species into different clades, although the monophyly of the morphological species was retained in all cases. When genetic distances between conspecific individuals were moderately high (5-7%), PTP analysis failed to identify these as a single putative species (Supplementary data).

BLAST analysis of barcodes

Results of the BLAST analyses carried out in jMOTU showed more sensitivity to sequence homogeneity than did the tree-based analyses (Fig. 4). For the unique and the standard alignments, there were marked shifts resulting in the attenuation of the slope, indicating a within-species MOTU threshold of 2.5 – 3% (Fig. 4). In contrast, the original alignment did not show this shift, instead showing progressive attenuation of the slope of the curve. For the sake of argument, if we apply a 3% sequence difference threshold level for species differentiation 9, all three alignments gave similar results: 225 clusters were identified from the unique alignment, 229 from the standard, and 228 from the original alignment, although there were differences in the taxa comprising the MOTUs between the analyses. For all three alignments, the number of MOTUs detected exceeded the number of OTUs. For the unique and standard alignments, each MOTU contained only one OTU (with the same exceptions indicated for the tree analyses). However, this analysis is not equivalent to a standard BLAST analysis of a single sequence, since the constraints (minimum length, percentage, etc.) are based on averages calculated for the analyzed dataset. In a standard BLAST analysis, for which these thresholds are based on the query sequence length, even the shorter sequences are properly identified.
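The effect of a fixed percentage cut-off on the number of clusters can be illustrated with a simple stand-in for threshold-based MOTU assignment. The Python sketch below is not jMOTU's algorithm: it assumes single-linkage clustering on uncorrected p-distances over a full-length alignment, with a union-find structure to merge sequences; the function names and toy sequences are hypothetical.

```python
def p_distance(s, t):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def cluster_motus(seqs, cutoff):
    """Single-linkage clusters: sequences within `cutoff` of each other are merged.

    Returns a sorted list of clusters, each a list of indices into `seqs`.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if p_distance(seqs[i], seqs[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

seqs = [
    "A" * 100,
    "C" * 2 + "A" * 98,   # 2% from the first sequence
    "C" * 4 + "A" * 96,   # 4% from the first, 2% from the second
    "G" * 50 + "A" * 50,  # highly divergent
]
motus_3pct = cluster_motus(seqs, cutoff=0.03)
motus_1pct = cluster_motus(seqs, cutoff=0.01)
```

Repeating the clustering over a range of cut-offs and plotting the number of clusters against the cut-off gives a curve of the kind shown in Fig. 4. Note that under single linkage the first and third toy sequences join the same cluster at 3% even though they differ by 4%, because the second sequence bridges them; chaining of this kind is one reason cluster membership can differ between alignments at the same threshold.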

Best close match analysis of barcodes

Clades or MOTUs were analyzed individually using best close match, with a primary focus on those for which there were discrepancies between MOTUs and OTUs. The results indicated that, in nearly all cases when more than one individual of the same species was included in the analyzed dataset, the closest match was the same species. Exceptions to this outcome included closely-related species pairs of Calanus (C. helgolandicus and C. euxinus; C. agulhensis and C. sinicus); Centropages (C. typicus and C. chierchiae); and Acartia (A. tonsa and A. hudsonica). In sum, although the MOTU/OTU concordance can be improved in comparison with automated procedures in a flexible (but subjective) fashion, total agreement between morphological and molecular species assignment methods was not possible.

Species-by-species analyses

Taxa showing discrepancies between MOTUs and OTUs were selected for additional analysis, when possible based on available data, to examine possible reasons (e.g., variation among geographic areas or populations, cryptic speciation) for the observed disparities between morphological and molecular data. These taxa are studied in detail in Appendix 1.


The growing use of DNA barcodes to discriminate and identify marine animal species has included many studies on zooplankton and a number of studies of planktonic copepods (see Bucklin et al. 8 for a review). This study presents results of comparative analysis of a large dataset of 1,381 barcode sequences for 195 copepod species, including 800 new barcode sequences for 63 copepod species not included in any previous study. Evaluations include ML and NJ automated tree-based, BLAST, and “best close match” analyses of three different sequence alignments, varying the analyzed sequence domain and the numbers of individuals per species. We report here our conclusions regarding the reliability and resolution of diverse statistical approaches to species identification of planktonic marine copepods based on DNA barcodes.

The “best close match” 49 yielded the best results in terms of establishing a species threshold that avoids false positive results – even without a previously-identified and barcoded individual of the same species. Although individuals may fail to be assigned to a species, incorrect species assignments are avoided, unless the distance between two morphologically identified OTUs is zero. This analysis avoids a frequent error of NJ trees, for which individuals of different species may cluster together – albeit on long branches – with bootstrap support equal or very close to 100% when one or more of the species is missing from the analyzed barcode dataset 12,18.

The poor performance of the tree-based automated method compared to the others can be attributed to the unbalanced dataset (large disparity in numbers of individuals among species), with low coverage in many cases. Such imbalance is known to limit the performance of PTP and other similar tree-based delimitation methods 52. On the other hand, the failure of the PTP approach resulted in a lack of power to identify species, and not in the much less desirable error of wrong species assignment.

The high levels of genetic diversity within species and the limited number of species for which DNA barcodes are available make character-based diagnosis 11,12 very unlikely to succeed. This approach may be appropriate for much-studied and well-defined groups of taxa, where much of the variability has been characterized. Especially when a small number of nucleotide substitutions are used as taxon identifiers or as a step in a taxonomic key, accurate species identification will require complete knowledge of variability both among populations of a species and among species of the group of interest. We are still far away from this goal for marine copepods, due to their large effective population sizes (on the order of 10^8 57) and exceptional genetic diversity among eukaryotes 58.

Although the coincidence of these two features (large population sizes and exceptionally high genetic diversity) may seem counter-intuitive 59, it should be considered in light of factors inherent to the marine planktonic environment. The large distributional ranges of most species, in many cases across multiple ocean basins, might facilitate the isolation of lineages, while still allowing migration and continuous exchange of individuals across the distributional range. The short generation time of these species (usually weeks to several months, rarely multiple years) makes it impossible for individuals to migrate across the entire range of the species in a single or a few generations. Thus, both oceanographic barriers and isolation by distance may result in population differentiation at large scales and among different ocean basins. However, if analyzed at fine scales, allele frequency differences would show continuous variation, with stronger differences at hydrographic or biogeographical barriers only 35,86,87,88.

One of the most powerful applications of barcoding for marine copepods is the analysis of the entire zooplankton community through high-throughput DNA sequencing of environmental samples, which can out-perform even trained morphological analysts 60. Recent technical advances allow determination of the long sequences necessary for accurate identification of species in mixed assemblages. Limitations include inefficient amplification of the COI barcode region in samples containing diverse taxa, resulting from variability in the amplification priming sites 61, which hinders annealing of consensus primers. However, the higher affinity and amplification success rates of more conserved genes carry the associated problem of under-estimating the real diversity of species in a community 62,63,64, due to low levels of sequence divergence and lack of discrimination between closely related species. The low affinity of the consensus COI barcoding primers of Folmer et al. 38 can be countered by the design of suites of group-specific primers; copepod-specific primers have been designed for this purpose 20,65. In the very near future, environmental barcoding approaches may employ nested sets of species- and group-specific amplification and sequencing primers and protocols to ensure reliable, accurate, cost-effective, and rapid assessments of species-level diversity of pelagic communities, including the taxonomically complex and ecologically important copepods.


Automated statistical analyses allow species identification and detection of species boundaries based on DNA barcodes 16,17. However, our results showed a large discrepancy between the numbers of OTUs and MOTUs (e.g., for the original alignment, 195 morphological species versus 249 / 228 putative species on the ML tree / jMOTU, respectively). Given that the jMOTU numbers are based on a 3% discrimination threshold, the discrepancy may in some cases be due to unrecognized cryptic species. In other cases, the discrepancy may reflect strong population structuring of widely-distributed species, perhaps combined with incomplete sampling of populations across the geographic range (see discussion in Bergsten et al. 38,66). These errors could be corrected by non-automated approaches, such as best close match or examining the problematic clades on the tree by eye, although such approaches are not suited to larger datasets. It is not rare for marine copepods to show genetic differences of over 5% between individuals within and between populations 31,35. Although those cases may be easily resolved by considering the geographical reference (collection location) and/or closely-related species, detection by automated analysis is difficult without geographically and taxonomically dense and balanced sampling. In other cases, the putative species delimitation would be biased by over-sampled taxa 52. Marine planktonic copepods are excellent examples of the inherent challenges of sampling highly abundant, widely distributed populations: high spatial resolution and geographically-extensive sampling are needed for a perfect match between OTUs and MOTUs. Nevertheless, despite under-sampling of intraspecific variation in the dataset analyzed here, there were no false positives (i.e., assignment of the wrong species to an individual), and the genetically closest individual to any specimen identified using a barcode almost always (with a few notable exceptions) belonged to the same species.

A criticism of metazoan barcoding is the reliance on a single gene, rather than multiple molecular markers. In fact, analyses of additional genes do not always yield the same results, and caution is advised when using only one or a few genetic markers. Additional sources of error include sample sizes, geographical coverage, and sampling bias. In sum, many problems associated with barcoding result not from the COI barcoding region itself, but from relying on a single molecular marker without the necessary consideration of its inevitable limitations, since any gene, even a very conserved one, will have strengths and pitfalls 64. It is possible that there may be better regions for species assignment, and longer sequences do provide better accuracy and reliability 61, but our results confirm that even very short COI fragments (< 150 bp) show acceptable levels of accuracy for species identification. Further, although average COI divergence is significantly higher for deeper taxonomic levels, there is no consistent relationship between divergence and taxonomic level. COI shows marked saturation and erosion of the phylogenetic signal for deeper nodes 67.

A primary limitation of barcoding is the widespread problem of incorrect species identification in published datasets, which markedly reduces reliability and usefulness of the approach 18,49,68. This problem was detected in our dataset by comparison with data from GenBank and other public databases. In other cases, when the obvious morphological differences between the two species made misidentification unlikely, errors may result from laboratory procedures. Solutions include approaches that allow independent confirmation of identifications, e.g., inclusion of images, retention of voucher specimens for later examination, and ratings on the accuracy of taxonomists 69. Another solution is simply to continue to populate databases and increase taxon sampling densities both systematically and geographically, thus allowing recognition of errors at the time of data submission.


This study presents new DNA barcode data for marine copepods (800 sequences for 63 species not previously sequenced) and reports the results of new analyses of a larger dataset (1,381 sequences for 195 copepod species). Our conclusions include recommendations to improve the accuracy and feasibility of using DNA barcodes for species identification of marine planktonic copepods, including: 1) availability of PCR and sequencing primers suited to the targeted species; 2) availability of a taxonomically-comprehensive DNA barcode database linking DNA sequences to accurately identified specimens; 3) increased density of taxon sampling; and 4) near-complete coverage for the group of interest. In particular, comprehensive databases are needed for environmental barcoding efforts (i.e., barcoding of unsorted environmental samples) that seek to characterize species-level diversity of marine zooplankton assemblages and ecosystems.

Increasingly sophisticated approaches to statistical analysis of the barcode region of the mitochondrial cytochrome c oxidase subunit I (COI) gene have resulted in new appreciation for the strengths and weaknesses of this genetic marker for species assignment of planktonic copepods. An important result is that – for all analytical approaches – accurate identification requires inclusion in the analyzed dataset of a barcode sequence for that species. The lack of a complete DNA barcode library is thus the most limiting factor for accurate and reliable discrimination and identification of species of planktonic copepods. In fact, DNA barcodes are currently available for only ~ 400 copepod species, including many parasitic and freshwater taxa. In addition, extensive coverage of species diversity is especially critical for efficient resolution on large datasets using automated methods. Fortunately, many barcoding studies have focused on ecologically important, abundant, and/or geographically widespread species and species groups, making the available DNA barcode data particularly useful. Species that are rare or geographically restricted may remain unidentifiable using barcodes for the foreseeable future.

Reweaving the Tapestry: a Supertree of Birds Mon, 09 Jun 2014 13:30:55 +0000


The class Aves contains an estimated 10,000 extant species1 occupying almost every geographical location, from ocean to desert. They originated within theropod dinosaurs during the Jurassic period2, with the earliest recognised stem group bird being the iconic Archaeopteryx lithographica, of which a number of 150 million year old fossils have been discovered in the famous Solnhofen lagerstätte of Germany. Regular new discoveries, particularly from the vertebrate rich Cretaceous deposits of China, continue to improve our understanding of the earliest birds3,4,5. Modern birds experienced a rapid radiation early in their evolutionary history, though the timing of this is contentious6, resulting in the remarkable diversity that we see today. This rapid radiation of deeper branches is, however, the main confounding factor in our attempts to find the “true” avian tree of life.

Birds are of great interest in a range of fields such as comparative biology, conservation and macroevolutionary studies. They are an economically important group, providing food for humans, as well as fertilizer, and some species are kept as pets. Yet human activity may be partly to blame for the 1,313 species currently on the IUCN Red List of threatened species7; there is a real risk of extinction to many bird species and much effort is being directed towards issues in conservation. Phylogenies are an important tool in conservation, as highlighted by Nee and May8, and allow testing of hypothetical extinction models to assess the loss of “phylogenetic diversity”. Large, well-resolved phylogenies are also vital when attempting to answer important macroevolutionary and macroecological questions, yet surprisingly few attempts have been made to reconstruct the phylogeny of all birds. No comprehensive tree including fossil taxa has yet been published, although it is well known that the inclusion or exclusion of fossil taxa can affect the resulting phylogenetic tree. Many previous comparative studies have been based on Sibley and Ahlquist’s “tapestry”9, constructed using the much-criticised technique of DNA-hybridisation. Although a massive achievement at the time, this phylogeny contained just 1083 taxa, around a quarter of all birds, with most taxa at genus-level. Comparative studies based on the tapestry include analyses of the tempo and mode of bird evolution10, the effect of generation time on rates of avian molecular evolution11, the evolution of avian mating systems and the association between mating systems and pair-bond length12. The dependence of these comparative analyses on the tapestry is troubling as there are concerns about the validity of the method used13,14,15.
Although there have been more recent attempts at an inclusive bird phylogeny based on large molecular data sets, these are still largely incomplete16. A recent large phylogeny of birds17 contains 9993 taxa, but the use of the results of previous studies as a backbone is potentially problematic. In addition, approximately one third of the taxa were added manually a posteriori. These recent attempts were based on molecular loci and therefore, by definition, excluded extinct taxa. The inclusion of fossils is vital for macroevolutionary studies and investigations into the origins of modern birds.

There are two approaches used for creating large phylogenies. One is the supermatrix or “total evidence” method18,19,20. Here, all characters and taxa make up a single large matrix. A major drawback of this approach is that some types of data cannot be combined (e.g. immunological distance data and DNA-hybridisation data) and that combination of these data types introduces subjective decisions and is very time consuming. There is also the potential for a large amount of missing data when combining information in this way21. Bird systematists have employed hard and soft body morphology, behaviour, allozymes, nucleotide sequences, and DNA-hybridisation to elucidate avian phylogeny. Consequently, a supermatrix approach would a priori eliminate many of these data sources. The other approach is the supertree: supertree methods offer a practicable way of synthesising large numbers of smaller overlapping phylogenies. These “source trees” are built with primary data (e.g., character sets obtained from morphological features or from gene sequencing) and can be constructed using any method and contain any number of taxa. Many supertree methods also allow conflict between source trees. Therefore supertrees give the widest possible view of phylogeny, both in terms of taxonomic coverage and the types of data incorporated. There are some well-documented issues in using supertree methods to construct large phylogenies, such as data quality and the reliance upon secondary rather than primary information22,23. We attempt to minimise these issues where possible by the use of a strict and robust data processing protocol22,24. Whilst this will not eliminate all possible issues, it allows the construction of an inclusive and large phylogeny. In the future, combining supertree and supermatrix methods so that they complement each other is a potential way to resolve some of the pitfalls of each method25.

Supertrees have now been produced for many groups of taxa including dinosaurs26, tetrapods27,28, grasses29, mammals30 and crocodiles31. Supertrees have also been produced for avian subsets such as the tube-nose seabirds32, shorebirds33, oscine songbirds34, the fowls35, and a 980 taxon supertree across all extant orders36 but no comprehensive supertree has yet been constructed for all of Aves. Our aim is to combine data from all sources, including both fossil and extant taxa, to create an inclusive phylogeny of birds that will help elucidate their origins and aid conservationists in concentrating their efforts in preserving so-called “biodiversity hotspots”.


Source tree collection

Potential source trees were identified initially from online resources. The Web of Knowledge Science Citation Index was searched using the search terms: phylog*, taxonom*, systematic* and clad* in conjunction with all scientific and common names of birds from order to family level. These searches were carried out from the year 1976 up to 2009. This is a significant update, an additional two years’ worth of data (118 published papers), compared to the tree of Davis24. See conclusions for further comments on the scope, and limitations, of our search. Following the initial search all papers potentially containing phylogenetic trees were examined. In addition, the reference lists of these papers were checked to obtain any further potential source trees. All source trees, along with associated meta-data, were recorded in their original form exactly as they appeared in the source references. Meta-data includes information about source trees such as bibliographic details, characters used (molecular, morphological, behavioural etc.), methods used for tree building, and the taxa included in the analysis. These data were stored in XML file format while the trees were recorded using TreeView37. At this stage no corrections were made for synonyms or any other apparent errors or inconsistencies in the source trees.

Data quality is a big challenge in supertree construction22,23; therefore, a strict protocol for data processing was implemented based on that first described by Bininda-Emonds et al.22. This protocol was followed with some modifications24 and implemented using the Supertree Tool Kit (STK) software38. Source trees needed to meet several criteria for inclusion in the analysis: 1) it should be explicit that the author’s intention was to construct a phylogeny, 2) the characters and taxa used in the analysis must be clearly identifiable and 3) the tree should be based on an analysis of a novel, independent dataset. We defined non-independence as two or more studies that use the same character data and have identical taxa, or two or more studies that use the same character data and where one taxon set is a subset of the other. In the case where one was a subset of the other, the less comprehensive tree was removed from the dataset. Where this was not possible, trees were combined to create a single tree for inclusion in the supertree analysis. Identification of potentially redundant data was automated through use of the STK software. Despite our first criterion, we included a taxonomic backbone or “seed tree” as a source tree, as studies have shown that the inclusion of a large, taxonomically complete tree greatly improves the performance of supertree methods39. This approach is far more conservative than placing constraints on the dataset; the poorly resolved taxonomy tree acts only to improve overlap and will not over-ride stronger signals within the dataset. We created a very conservative tree, only including taxa that could be unambiguously assigned at order level, compiled using Howard and Moore Catalogue of Birds of the World40. Orders that are in a state of flux were either excluded entirely (e.g., the “Bucconiformes”) or, in other cases, the core members of the order were included but taxa whose membership of that order is uncertain were excluded (e.g., the Pelecaniformes).
Fossil taxa were added to the backbone tree using the Paleobiology Database as a guide.
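The non-independence rule described above can be illustrated with a minimal Python sketch. It is not the STK implementation: the function name `non_independent_pairs`, the study labels and the representation of each study as a pair of sets (character data, taxa) are assumptions made purely for illustration.

```python
from itertools import combinations


def non_independent_pairs(studies):
    """Flag pairs of studies that share character data and have identical or nested taxon sets.

    `studies` maps a (hypothetical) study label to a tuple
    (characters, taxa), both given as frozensets.
    Returns a list of (smaller_or_first, larger_or_second, relation) tuples,
    where relation is "identical" or "subset".
    """
    pairs = []
    for (name_a, (chars_a, taxa_a)), (name_b, (chars_b, taxa_b)) in combinations(
        studies.items(), 2
    ):
        if chars_a != chars_b:
            continue  # different character data: treated as independent
        if taxa_a == taxa_b:
            pairs.append((name_a, name_b, "identical"))
        elif taxa_a < taxa_b:
            pairs.append((name_a, name_b, "subset"))  # name_a is less comprehensive
        elif taxa_b < taxa_a:
            pairs.append((name_b, name_a, "subset"))  # name_b is less comprehensive
    return pairs


# Illustrative input only; labels and taxa are placeholders.
studies = {
    "s1": (frozenset({"cytb"}), frozenset({"A", "B", "C"})),
    "s2": (frozenset({"cytb"}), frozenset({"A", "B"})),
    "s3": (frozenset({"morphology"}), frozenset({"A", "B"})),
    "s4": (frozenset({"cytb"}), frozenset({"A", "B", "C"})),
}
flagged = non_independent_pairs(studies)
```

In this sketch the first element of a "subset" pair is the less comprehensive tree, i.e. the one that would be removed under the protocol; "identical" pairs are candidates for combination into a single tree.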

Nomenclature and taxonomic consistency

OTUs (operational taxonomic units) were standardised to avoid the inclusion of higher taxa and vernacular names that would artificially inflate the number of taxa in the analysis, and of synonyms and misspellings that could lead to inconsistencies. Names were standardised according to Howard and Moore40, chosen for its conservative approach. Paraphyletic taxa were dealt with using the STK38, which calculates all possible positions of paraphyletic taxa in a source tree. Once all the possible, non-identical, permutations have been calculated, a mini-supertree can be created from them. Higher taxa and vernacular names were removed from source trees by substituting the constituent taxa in a polytomy. Where possible the actual species that the authors intended to represent were used. Where this was not indicated in the source reference, all taxa that make up the higher taxon or vernacular name were used, but only those that were already present in the dataset, to avoid artificial inflation of the number of taxa. Definitions for higher taxa were also according to Howard and Moore. Some substitutions were necessarily very large; for example, a number of source trees contained the taxon “Neornithes”, which requires the substitution of virtually every taxon contained within the supertree. The STK contains a tool that enables these to be substituted automatically using a text file containing a user-defined list of substitutions. This substitution stage should not introduce taxa that are not already contained within the dataset; the STK guards against this by checking the substitution file that is to be used against the source data and indicating any potential problem taxa. The final step for taxonomic/nomenclatural standardisation is the replacement of generic taxa with specific taxa, again in the form of a polytomy containing all taxa of a given genus, with the caveat that they are already present in the dataset.
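The substitution of a higher taxon by a polytomy of its constituent taxa can be sketched as follows. This is a simplified illustration and not the STK tool itself: the function name `substitute_polytomy` is hypothetical, the tree is assumed to be a Newick string without branch lengths or quoted labels, and the example taxa are chosen only for demonstration.

```python
import re


def substitute_polytomy(newick, higher_taxon, members, dataset_taxa):
    """Replace a higher-taxon tip label with a polytomy of its members.

    Only members already present in `dataset_taxa` are substituted, so the
    step cannot inflate the number of taxa. Members that are absent are
    returned separately so they can be reviewed.
    """
    present = sorted(m for m in members if m in dataset_taxa)
    skipped = sorted(m for m in members if m not in dataset_taxa)
    if not present:
        return newick, skipped  # nothing to substitute; leave the tree unchanged
    if len(present) == 1:
        replacement = present[0]
    else:
        replacement = "(" + ",".join(present) + ")"
    # Match the label only where it stands alone as a tip label.
    pattern = r"(?<=[(,])" + re.escape(higher_taxon) + r"(?=[,)])"
    return re.sub(pattern, replacement, newick), skipped


tree = "((Struthio_camelus,Anatidae),Gallus_gallus);"
members = ["Anas_platyrhynchos", "Cygnus_olor", "Anser_anser"]
dataset = {"Anas_platyrhynchos", "Anser_anser", "Gallus_gallus", "Struthio_camelus"}
new_tree, skipped = substitute_polytomy(tree, "Anatidae", members, dataset)
# new_tree == "((Struthio_camelus,(Anas_platyrhynchos,Anser_anser)),Gallus_gallus);"
# skipped  == ["Cygnus_olor"]
```

Returning the skipped members mirrors the idea of flagging potential problem taxa before the substitution file is applied.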
Once nomenclature had been standardised it was possible to check that the source trees had sufficient taxonomic overlap; we required each source tree to have at least two taxa in common with at least one other source tree21. After all data processing had been completed, checks were carried out to ensure that no errors had been introduced during data processing. Again this was implemented using the STK, which checks the tree files against the meta-data held about the source trees. This guards against both software and human errors. After these final checks the dataset contained 6326 taxa from 1036 source trees. See additional file 141 for a list of papers containing the source trees and additional file 241 for a Nexus file containing all source trees.
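The overlap requirement (at least two taxa shared with at least one other source tree) is straightforward to express in code. The sketch below is an illustration under assumptions, not the STK check: the function name `insufficient_overlap`, the tree labels and the representation of each source tree as a set of taxon names are all hypothetical.

```python
def insufficient_overlap(source_trees, min_shared=2):
    """Return labels of source trees sharing fewer than `min_shared` taxa with every other tree.

    `source_trees` maps a tree label to the set of taxon names it contains.
    """
    flagged = []
    for name, taxa in source_trees.items():
        # Largest number of taxa shared with any other source tree.
        best = max(
            (
                len(taxa & other_taxa)
                for other_name, other_taxa in source_trees.items()
                if other_name != name
            ),
            default=0,
        )
        if best < min_shared:
            flagged.append(name)
    return flagged


# Illustrative input only.
trees = {
    "t1": {"Gallus gallus", "Anas platyrhynchos", "Struthio camelus"},
    "t2": {"Gallus gallus", "Anas platyrhynchos"},
    "t3": {"Gallus gallus", "Apteryx australis"},
}
poorly_connected = insufficient_overlap(trees)  # ["t3"]
```

Here `t3` shares only one taxon with each of the other trees and would therefore fail the overlap check.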

Supertree construction

The most commonly used supertree method is Matrix Representation with Parsimony (MRP)42. Although more supertree methods have become available over the last few years, many of them with software implementations (e.g., SuperFine43, Matrix Representation with Compatibility44, Minimum Flip45, Modified MinCut46), they tend to be slow and unable to deal with large datasets within a reasonable time frame. We chose to use MRP for this analysis as it is still the only supertree construction method able to deal with a dataset of this size. In matrix-based supertree methods all taxa subtended by a given node in a source tree are scored as “1”, taxa not subtended by that node are scored as “0”, and taxa not present in that source tree are scored as “?”. Trees are rooted with a hypothetical, all-zero outgroup47. We used standard Baum and Ragan MRP coding42 and the matrix creation was automated using the STK software38. See additional file 341 for the matrix in TNT format.
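The 1/0/? scoring described above can be made concrete with a small Python sketch. It is not the STK code used in this study: the function name `mrp_matrix`, the outgroup label `MRP_outgroup` and the input representation (each source tree given as its taxon set plus a list of clades, one per internal node) are assumptions for illustration.

```python
def mrp_matrix(source_trees):
    """Build a Baum and Ragan style MRP matrix.

    `source_trees` is a list of (taxa, clades) tuples, where `taxa` is the set
    of taxa in a source tree and `clades` is a list of sets, each holding the
    taxa subtended by one internal node of that tree.
    Returns a dict mapping each taxon (plus an all-zero outgroup) to its
    character string.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    rows["MRP_outgroup"] = []
    for taxa, clades in source_trees:
        for clade in clades:  # one matrix column per internal node
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")  # subtended by this node
                else:
                    rows[taxon].append("0")  # in the tree but outside this clade
            rows["MRP_outgroup"].append("0")
    return {name: "".join(chars) for name, chars in rows.items()}


# Two toy source trees: ((A,B),C) and (B,(C,D)).
toy_trees = [
    ({"A", "B", "C"}, [{"A", "B"}]),
    ({"B", "C", "D"}, [{"C", "D"}]),
]
matrix = mrp_matrix(toy_trees)
# {'A': '1?', 'B': '10', 'C': '01', 'D': '?1', 'MRP_outgroup': '00'}
```

Each column corresponds to one node of one source tree, so the width of the matrix grows with the total number of internal nodes across all source trees, which is why large datasets become computationally demanding.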

The matrix was analysed with TNT48 using the “xmult level=10” option, an aggressive search strategy devised to find the shortest trees in as little time as possible. The analysis was run on the Imperial College supercomputer CX1. We ran the analysis on 100 cores for 144 hours, which is equivalent to over two years of computational time on a single core. Each core ran an independent analysis, using a different random starting point for the heuristic search. In this way as much of the phylo-space as possible can be covered in as short a time as possible.

Support values were not calculated for the supertree. An attempt was made to calculate QS values49 for the tree, however CX1 ran out of memory after five days and was unable to complete the analysis. The calculation of V values50 faced similar computational limitations. Traditional support measures such as bootstrap and jack-knifing are of debatable relevance to supertrees51 and would face the same computational limitations.


The analysis found nine MPTs of length 28834 (additional file 441). Some areas of the tree were poorly resolved, with some odd taxon placement. On closer inspection, many of these taxa were observed to be those that are poorly constrained within the source trees or poorly represented within the dataset (see discussion); we therefore calculated an agreement subtree. We were unable to calculate a Maximum Agreement Subtree in PAUP* 4.0b1052 due to memory limitations, so we used the Approximate Agreement Subtree function in TNT48. This function uses a heuristic that accurately obtains an agreement subtree but does not guarantee to find the one with the greatest number of taxa (i.e. the MAST). The agreement subtree contained 5380 taxa and the resolution was greatly improved. Figure 1 shows the whole supertree with higher taxa indicated. This figure gives an indication of the size of the tree and the relative sizes of clades. For a simplified order-level tree see Figure 2. For an electronic version in which the whole tree can be viewed in detail see additional file 541.

Fig. 1: Agreement subtree calculated for nine MPTs of length 28834.

Fig. 2: Simplified summary supertree showing order-level relationships

Silhouettes are from


Taxonomic coverage, data types and resolution

The resulting supertree contains approximately two thirds of all known birds synthesised from source data from the years 1976 to 2008. The number of source trees from each year shows that the majority of data are derived from 2000 onwards (Figure 3). The vast majority of the source data comes from molecular sources (Figure 4) with cytochrome b being the single highest contributor to the data set with 38.9% (403 trees) of source trees built from cytochrome b sequences. See additional file 641 for further information on the composition of the data set. Published molecular studies more than doubled in the period from 2000 to 2009, quickly becoming the largest source of available data. The overall topology of the supertree is more consistent with molecular hypotheses, possibly due to the strong bias towards molecular analyses in the source data. Figure 5 shows the overall distribution of taxa sampled in source trees in the form of a data availability matrix. The density of data sampling is excellent with a large densely sampled area and very few trees and taxa with poor sampling. Resolution of the tree is very high (99.85%).

Fig. 3: Distribution of source trees by year of publication.

The number of avian phylogenies published, and included in the supertree analysis, is heavily skewed towards recent years with relatively few trees from pre-1995.

Fig. 4: Distribution of source trees by data type.

Fig. 5: Data availability matrix for the supertree source data.

Source trees are sorted vertically and taxa are sorted horizontally both by frequency. Each dot represents the presence of a taxon in a given source tree. The most frequently occurring taxon, Gallus gallus, is at the bottom. The bottom left hand corner shows the most densely sampled area where many taxa are found in many source trees.

MRP performance and novel clades

The MRP method cannot provide new information on relationships that is not already present in the source trees, but it is a convenient and fast method for summarising the current state of phylogenetic knowledge. Although some spurious relationships may be recovered, e.g., novel clades as discussed below, the majority of the relationships found in the supertree are well-supported by the source trees (see main discussion for details).

A small number (approximately 3%) of taxa are placed in novel clades; i.e. clades that are not supported by the source data. These novel clades tend to occur at the bases of large clades near the tips of the trees and therefore only affect the lower level relationships. The vast majority are found within the order Passeriformes, an order that has historically posed the biggest challenge to avian systematists34. An examination of these taxa and the corresponding source trees showed that these taxa, without exception, are characterised by one or more of the following:

  1. Presence in only a small number of source trees.
  2. Variable position within source trees.
  3. Commonly placed in source trees as part of a polytomy.
  4. Often/only present as an outgroup.

See additional file 741 for a list of problematic taxa and their occurrence within the source trees. Post-Mesozoic fossil taxa are particularly poorly represented by the source data and as a result are commonly found in novel clades. Palaeopsittacus and Psittacopes, for example, are each only represented by one source tree53,54; therefore the algorithm is unable to accurately place them. Another observation is that MRP has a tendency to place fossil taxa in highly derived positions, e.g., the Cretaceous anseriform fossil Vegavis is placed within a clade of extant ducks despite there being no source trees to support this relationship. These poorly constrained taxa are a big problem for supertree analyses and these “novel clades” are a well-known, but problematic, property of Matrix Representation with Parsimony49. Algorithms and software are becoming available to help reduce this problem by identifying potential rogue taxa either prior to running an analysis or a posteriori55 but, as is the case with support values, they cannot cope with the size of the present dataset and we find that the analysis is again necessarily limited by computational constraints. The size of the data set also makes manual identification of these taxa, as we have done here, extremely time-consuming and cumbersome.

Another problem is that many of the algorithms available only identify taxa that may be placed incorrectly as a result of many possible, equally parsimonious positions in the tree (e.g., Safe Taxonomic Reduction56), but this does not appear to be the sole problem with MRP. What is really needed is a leaf-based measure of support that would readily identify potential rogue taxa that occupy positions in the tree for which there is no evidence in the underlying data. In the meantime we suggest that caution is used, particularly if fossil taxa are to be used for obtaining clade divergence dates.

Tree topology

The tree is well-resolved and stable at both order and family level with the majority of families and orders resolved as monophyletic; see discussion below for exceptions.

Deep divergences

The extinct Mesozoic birds are placed at the base of the tree with Archaeopteryx lithographica occupying the most basal position. Within these the Enantiornithes (“opposite birds”) form a distinct monophyletic clade. The Enantiornithes represent a separate radiation to the Ornithurae (the direct ancestors of modern birds) that subsequently became extinct at the Cretaceous-Tertiary boundary57,58,59,60. The earliest divergences of birds are amongst the least controversial61 and here the supertree supports the split of the crown group modern birds (Neornithes) into Palaeognathae + Neognathae, with a further split of Neognathae into the Galloanserae landfowl/waterfowl clade + all other modern birds (Neoaves)62,63, as opposed to Sibley and Ahlquist’s9 non-monophyletic Neognathae in which the Galloanserae are sister group to the Palaeognathae.


The extinct Tertiary palaeognaths Lithornis and Palaeotis are basal to the extant palaeognath taxa. The supertree supports the hypothesis of Tinamidae + all other palaeognaths64. The extinct Madagascan elephant bird Aepyornis appears within the Struthioniformes at the base of the Struthionidae + Rheidae while the Dinornithidae of New Zealand appear at the base of the Struthioniformes clade. The New Zealand ratites, Apterygidae and Dinornithidae, do not form a monophyletic group. This has been suggested by Houde14 and Cooper et al.65 to have implications for vicariance biogeography providing evidence for a second colonisation of New Zealand by kiwis.


The Galliformes + Anseriformes clade is well-supported by molecular works66. Within the Galliformes, the supertree supports the more recent view of Megapodidae as sister to the Cracidae + remaining Galliformes, which all constitute monophyletic families rather than the more traditional placement of a Megapodidae + Cracidae clade as sister to the rest of the Galliformes9. The Anseriformes are split into the three well-defined traditional, monophyletic families: Anhimidae, Anseranatidae and Anatidae.

“Waterbird” assemblage

The supertree recovers the “waterbird” clade containing the “Pelecaniformes”, “Ciconiiformes”, Procellariiformes, Sphenisciformes and Gaviiformes as found in the large molecular analyses of Ericson et al.67 and Hackett et al.16. Morphological evidence also supports this clade68. The Livezey and Zusi phylogeny68 also places the orders Phoenicopteriformes, Podicipediformes and Phaethontiformes within this assemblage, as can be seen in the supertree.

Within this assemblage the traditional “Pelecaniformes” are split into two groups, one comprising the Pelecanidae, the other the Fregatidae + Sulidae + Anhingidae + Phalacrocoracidae. In addition the Pelecanidae are grouped with the “ciconiiform” families Ardeidae, Balaenicipitidae and Scopidae. These findings are consistent with recent molecular studies16,67,69,70,71,72,73 in which it is proposed that the “Pelecanidae” group retains the name Pelecaniformes while the second group be given the name “Phalacrocoraciiformes”74. Some analyses also place Threskiornithidae with the Pelecaniformes which would result in only the Ciconiidae remaining in the Ciconiiformes. The placement here of Threskiornithidae + Ciconiidae may simply reflect the recent state of flux of these taxa.

The sister group relationship of Sphenisciformes + Procellariiformes has support from both morphology68 and molecular data16. There is limited evidence for the placing of Gaviiformes with these taxa75 but the relative positions of all three orders within the “waterbird” assemblage is far from resolved. There are a number of well-known fossil penguins (e.g., Delphinornis, Marambiornis, Perudyptes) which are all placed basally within the Sphenisciformes in the supertree. The Procellariiformes consist of well established monophyletic families.

The Phoenicopteriformes + Podicipediformes clade found by the supertree, termed “Mirandornithes” by Sangster76, is well supported by a large number of molecular studies16,67,69,72,77,78.


The positioning of the Gruiformes + Otididae and the Charadriiformes (including the Turnicidae) as sister groups to the “waterbird” assemblage is congruent with recent molecular hypotheses16,67,73. The Turnicidae were traditionally placed in the Gruiformes9,79 but are now understood to be part of the Charadriiformes63. It is less certain that the Cariamidae are genuinely part of the Gruiformes + Otididae clade. The Cariamidae were also part of the traditional “Gruiformes” but may actually be more closely related to the falconiform birds61. The core Gruiformes found here are composed of the Psophidae, Gruidae, Heliornithidae, Aramidae and Rallidae. Within the Charadriiformes the most basal lineages include the plovers and allies: Chionidiae, Burhinidae, Pluvianidae, Recurvirostridae, Ibidorhynchidae, Haematopidae and Charadriidae. The supertree divides the remaining Charadriiformes into monophyletic gull and sandpiper lineages.


Eurypygidae + Rhynchochetidae is well supported by morphological and molecular data16,67,68,80,81,82. The Columbidae + Pteroclididae is less certain but relatively well established83,84. The position of these taxa with the Mesitornithidae as part of the “Metaves” is suggested by recent molecular studies16,67. The supertree however does not support the “Metaves” clade and it has been suggested that the “Metaves” does not constitute a monophyletic group as discussed in Mayr61. The results found here are more congruent with Chojnowski et al.’s80 findings of an affinity between Columbiformes and the core Gruiformes + “waterbird” assemblage. The supertree places the extinct Raphidae (dodo and solitaire) within the Columbiformes.


The relationships of the hoatzin are controversial and poorly understood. Opisthocomus has previously been placed with the Cuculiformes (cuckoos, coucals and anis)85,86,87, the Gruiformes (crakes and rails) and Musophagiformes88,89. The supertree supports the Opisthocomus + Musophagiformes relationship. Other putative close relatives include the Columbiformes63 but the supertree does not recover this relationship.


Mayr90 coined the term “Strisores” for the clade containing Caprimulgiformes and Apodiformes that has received a great deal of support from molecular data16,67,71,90. The supertree supports the monophyly of this proposed clade and places it as sister to the “landbird” assemblage as in Pratt et al. 200991, rather than as part of the “Metaves” as proposed elsewhere16,67,70 or as a polyphyletic group73. The sister group relationship of the “caprimulgiform” taxon Aegothelidae and the Apodiformes, resulting in a paraphyletic “Caprimulgiformes”, is well-supported by molecular and morphological data16,67,69,90,91,92,93,94. The Apodiformes contain a monophyletic Apodidae and Trochilidae, as in Sibley and Ahlquist’s9 “Trochiliformes” for hummingbird taxa. The association between the Apodiformes and “Trochiliformes” has long been recognised63,93,95,96 and is not contradicted by any of the source trees.


The Falconiformes and Accipitriformes represent a single lineage in the supertree. The Falconiformes and Strigiformes are united as in analyses based on osteology68,97. They are not however placed within the “landbird” assemblage as in recent large molecular studies16,67. New World vultures (Cathartidae) are not placed with the Ciconiiformes as in some early works but neither are they placed with the Old World Vultures (Accipitriformes)16,67 supporting the proposal that they might require an order level designation (“Cathartiformes”)67.


Recent large molecular analyses have proposed a “landbird” clade; the supertree recovers part of this clade but not in its entirety. The supertree does support a monophyletic clade containing the Coraciiformes, Alcediniformes and Piciformes, which is well supported by molecular data16,67. The affinities of the Leptosomatidae are not well-understood61; the supertree places them within this “landbird” assemblage with the Coraciiformes. The Trogoniformes are a taxon for which higher level relationships are poorly understood; in the supertree they are placed as sister to the Coraciiformes + Alcediniformes + Piciformes clade, with the Coliiformes and Cuculiformes also placed in this clade. The former is supported by molecular data68,98; however, the latter is not well-supported.

The Piciformes are split into two distinct clades: one comprising the monophyletic families Ramphastidae, Capitonidae, Megalaimidae (previously included within Capitonidae), Lybiidae and Semnornithidae99,100,101,102, and the second containing the monophyletic Picidae (woodpeckers) and the Indicatoridae (honeyguides), as in Simpson and Cracraft101, Swierczewski and Raikow102 and Lanyon and Zink100.

The coraciiform clade contains the Brachypteraciidae, Coraciidae, Meropidae, Alcedinidae, Todidae and Momotidae. The Bucerotiformes (Bucorvidae, Bucerotidae and Phoeniculidae) are placed in a clade sister to the Piciformes. The Hoopoe, Upupa epops, is also placed within the Bucerotiformes, in contrast to Sibley and Ahlquist’s9 suggestion of elevating it to a new order, “Upupiformes”.

The Psittaciformes are traditionally considered to have no close living relatives9 but the supertree is consistent with more recent analyses that place them as the sister taxon to the Passeriformes16,67,69,89.


The Passeriformes contain the majority of extant bird species and have undergone extensive reorganisation within the last decade. The supertree supports the division into three suborders: New Zealand Wrens (Acanthisitti) + all other Passeriformes (Tyranni + Passeri). Monophyly of the Old and New World suboscines is well documented103,104 and, as expected, the supertree splits the Tyranni (suboscines) into Old World (Eurylaimides) and New World (Tyrannides) groups, all of which contain well-established monophyletic families, the one exception being the Eurylaimidae, which is now understood to be polyphyletic105. In the supertree Smithornis and Calyptomena fall outwith the main Eurylaimidae clade. The neotropical Sapayoa aenigma was traditionally placed in the New World suboscines but has more recently been placed in the Old World suboscines in varying positions; the supertree places it at the base of the main Eurylaimidae clade (containing Eurylaimus)106,107,108. The New World suboscines are further split into two monophyletic superfamilies: the “bronchophone” suboscines and the Furnarioidea. The Oligocene fossils Zygodactylus and Primozygodactylus danielsi are placed at the base of the Passeriformes.

Sibley and Ahlquist9 split the Passeri into the Corvida and the Passerida but, while the Passerida is retained, it is now known that the “Corvida” do not comprise a monophyletic group69,109,110,111. Basal within the Passeri are the Menuridae and Atrichornithidae, sometimes designated as the superfamily Menuroidea9,112. The supertree also supports the superfamily status of the previously incertae sedis Ptilonorhynchoidea (Climacteridae + Ptilonorhynchidae)9,112,113, and supports a relationship between the Orthonychidae and Pomatostomidae. The Meliphagoidea contains a monophyletic Maluridae, Pardalotidae, Acanthizidae and Meliphagidae.

The large, well-supported110 superfamily Corvoidea includes the corvid birds that have radiated out from the Australo-Papuan region and diversified worldwide. As found in the previously published oscine supertree34, Melanocharis and Paramythia berrypeckers and Toxorhamphus longbills appear to belong to Corvoidea rather than to Passeroidea as suggested by Sibley and Ahlquist9 and Monroe and Sibley1. Other lineages placed within this clade are well-established members of the core Corvoidea, including the Campephagidae, Paradisaeidae, Monarchidae, Oriolidae, Dicruridae, Laniidae and Corvidae. The Picathartidae + Chaetopidae + Eupetidae clade (a possible superfamily) and the Petroicidae are at the base of the large infraorder Passerida. This placement of the Petroicidae reflects recent views on their position within the oscine birds110,114.

The supertree supports the identification of a number of recently proposed superfamilies within the monophyletic Passerida clade in addition to Sibley and Ahlquist’s9 original three: Sylvioidea, Muscicapoidea and Passeroidea. At the base of the Passerida are the Sylvioidea and the possible superfamily Paroidea. The Hyliotidae have recently been split from Sylviidae115 and are placed as sister to the Sylvioidea in the supertree. The Sylvioidea families have undergone a great deal of change in recent years; the supertree supports many of the newly suggested families and the new delimitation of traditional families, for example the splitting of the “Timaliidae” into a core timaliid clade and a number of newly recognised lineages such as the Pellorneidae and Leiothrichidae116, and the splitting of the “Sylviidae” to recognise new families such as the Locustellidae and the Cisticolidae117,118. Well-supported members of the Sylvioidea include Alaudidae, Hirundinidae and Pycnonotidae109,117,119,120, while the supertree supports the inclusion of the Zosteropidae within the Timaliidae34.

The Muscicapoidea and Certhioidea form a clade with the proposed Bombycilloidea and Reguloidea superfamilies. Relationships within the Muscicapoidea are well supported by a number of analyses9,109,115,118,119,121, and the supertree recovers the traditional families Mimidae, Cinclidae, Sturnidae, Turdidae and Muscicapidae, with the Buphagidae and Rhabdornithidae also placed as distinct families.

The Passeroidea is the largest passeriform superfamily. Along with finches, sparrows, weavers etc., it contains the nine-primaried oscines: songbirds with nine easily identifiable primary feathers on each wing. The nine-primaried oscines are a large radiation that contains approximately 10% of all extant species of birds122 and form a strongly supported monophyletic clade109,110,122. The supertree does not place the Peucedramidae within the nine-primaried oscines but with the Prunellidae. All the families are resolved as monophyletic with the exception of the Thraupidae/Cardinalidae clade, which has undergone extensive reorganisation in recent years123. The supertree was unable to resolve the position of the Icteridae: it places them as sister to either the Parulidae or the Emberizidae, and both placements have support from recent analyses109,119,124. In the simplified family-level tree we have collapsed these three families to a trichotomy to reflect this uncertainty, which seems likely to be a reflection of the varying position of the Icteridae in source trees rather than a true biological relationship. The supertree supports the separation of the estrildid finches and the true sparrows into two families, the Estrildidae and Passeridae, as in Christidis and Boles74. The Dicaeoidea (Nectariniidae + Dicaeidae) and Promeropidae are at the base of the Passeroidea. These may represent independent superfamilies or may be included as part of the Passeroidea.
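
The collapse of conflicting resolutions into a trichotomy can be illustrated with a minimal sketch. This is not the procedure used to build the supertree; it is a toy strict consensus, written in Python under the assumption that trees are represented as nested tuples of family names. The function names and the two example topologies are hypothetical and chosen only for illustration.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return every internal-node clade as a frozenset of leaf names."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clades(child)
    return found


def strict_consensus_clades(trees):
    """Keep only the clades that are present in every input tree."""
    return set.intersection(*(clades(tree) for tree in trees))


# Two conflicting (illustrative) resolutions of the same three families.
tree_a = (("Icteridae", "Parulidae"), "Emberizidae")
tree_b = (("Icteridae", "Emberizidae"), "Parulidae")

shared = strict_consensus_clades([tree_a, tree_b])
```

Here the only clade shared by both toy trees is the one containing all three families, so neither sister-group pairing survives and the node is left as an unresolved trichotomy.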


The supertree is the first published species-level supertree of birds. It is also the first comprehensive phylogeny of birds to include fossils, both recently extinct and Mesozoic taxa, which are of vital importance for analyses requiring an understanding of the deep evolutionary history of birds. It is not intended to be the final word in avian systematics, nor is it intended to be used as a basis for re-evaluating avian taxonomy. It does, however, provide a platform upon which further research can be based and will hopefully provide a useful resource for researchers studying avian macroevolution, conservation, biodiversity, comparative biology and character evolution. An earlier version of the supertree24 has already been used in a large number and variety of evolutionary studies125,126,127,128,129,130,131,132,133,134, and it is anticipated that this updated tree will provide a basis for further research of this nature and may be of particular use to macroevolutionary studies due to the inclusion of fossil taxa. We acknowledge that many additional papers have been published since our data collection ceased; avian systematics is a rapidly moving field. This tree does, however, represent a significant update compared to Davis24, and we anticipate that a further update will be published in the future; for now this tree is still the only large avian phylogeny available with broad taxonomic coverage containing both fossil and extant taxa. This work highlights areas in which systematic knowledge is poor or inconsistent, suggesting a possible focus for future phylogenetic studies. We also identify the need for leaf-based measures of support to aid identification of rogue taxa in supertree analyses. The supertree represents a first attempt at a species-level avian supertree and will no doubt be improved upon as further data and better algorithms become available.

Availability of supporting data

All supplementary data are available at figshare:

Competing Interest Statement

The authors declare that no competing interests exist.

]]> 0