Recently developed molecular methods enable geneticists to target and sequence thousands of orthologous loci and infer evolutionary relationships across the tree of life. Large numbers of genetic markers benefit species tree inference, but visual inspection of alignment quality, as traditionally conducted, is challenging with thousands of loci. Furthermore, because repeated visual inspection with alternative filtering criteria is impractical, the potential consequences of using datasets with different degrees of missing data remain poorly explored in most empirical phylogenomic studies. In this short communication, I describe a flexible high-throughput pipeline designed to assess alignment quality and filter exonic sequence data for subsequent inference. The stringency criteria for alignment quality and missing data can be adapted based on the expected level of sequence divergence. Each alignment is automatically evaluated against the specified stringency criteria, significantly reducing the number of alignments that require visual inspection. With a rapid method for alignment filtering and quality assessment, the consistency of phylogenetic estimation based on exonic sequence alignments can be further explored across distinct inference methods, while accounting for different degrees of missing data.
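As a rough illustration of threshold-based filtering for missing data, the following Python sketch drops sequences that exceed a missing-data cutoff and rejects a locus if too few taxa remain. It is not the pipeline described above: the function names, the set of characters counted as missing, and the default thresholds are all assumptions chosen for the example.

```python
from typing import Dict, Optional

# Assumption: gaps, '?', and the ambiguity codes N/X count as missing data.
MISSING = set("-?NnXx")

def missing_fraction(seq: str) -> float:
    """Fraction of sites in one aligned sequence that are missing."""
    if not seq:
        return 1.0
    return sum(ch in MISSING for ch in seq) / len(seq)

def filter_alignment(aln: Dict[str, str],
                     max_seq_missing: float = 0.5,
                     min_taxa: int = 4) -> Optional[Dict[str, str]]:
    """Drop sequences with too much missing data.

    Returns the filtered alignment, or None if fewer than
    `min_taxa` sequences pass (hypothetical thresholds).
    """
    lengths = {len(s) for s in aln.values()}
    if len(lengths) > 1:
        raise ValueError("sequences are not aligned to equal length")
    kept = {name: seq for name, seq in aln.items()
            if missing_fraction(seq) <= max_seq_missing}
    return kept if len(kept) >= min_taxa else None
```

Running the same set of loci through `filter_alignment` with several values of `max_seq_missing` would give datasets with different degrees of missing data that can then be compared across inference methods.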
Incomplete lineage sorting (ILS), modelled by the multi-species coalescent, is a process that results in a gene tree being different from the species tree. Because ILS is expected to occur for at least some loci within genome-scale analyses, the evaluation of species tree estimation methods in the presence of ILS is of great interest. Performance on simulated and biological data has suggested that concatenation analyses can result in the wrong tree with high support under some conditions, and a recent theoretical result by Roch and Steel proved that concatenation using unpartitioned maximum likelihood analysis can be statistically inconsistent in the presence of ILS. In this study, we survey the major species tree estimation methods, including the newly proposed “statistical binning” methods, and discuss their theoretical properties. We also note that there are two interpretations of the term “statistical consistency”, and discuss the theoretical results proven under both interpretations.
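To make the effect of ILS concrete, the sketch below evaluates the standard three-taxon result under the multi-species coalescent: for a rooted species tree whose internal branch has length t in coalescent units, a gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t). The function name is an assumption made for this example and is not taken from the study.

```python
import math

def three_taxon_gene_tree_probs(t: float) -> dict:
    """Gene tree topology probabilities for a rooted three-taxon species tree.

    t is the internal branch length in coalescent units.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant_each = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * discordant_each,
        "discordant_each": discordant_each,
    }
```

As t approaches zero, all three topologies become equally likely, which is the regime in which gene trees frequently disagree with the species tree.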
Phylogeneticists have long understood that several biological processes can cause a gene tree to disagree with its species tree. In recent years, molecular phylogeneticists have increasingly foregone traditional supermatrix approaches in favor of species tree methods that account for one such source of error, incomplete lineage sorting (ILS). While gene tree-species tree discordance no doubt poses a significant challenge to phylogenetic inference with molecular data, researchers have only recently begun to systematically evaluate the relative accuracy of traditional and ILS-sensitive methods. Here, we report on simulations demonstrating that concatenation can perform as well as or better than methods that attempt to account for sources of error introduced by ILS. Based on these and similar results from other researchers, we argue that concatenation remains a useful component of the phylogeneticist’s toolbox and highlight that phylogeneticists should continue to make explicit comparisons of results produced by contemporary and classical methods.
Correction: In Table 1, an incorrect value was provided for the “Contigs” value for “Filamoeba nolandi” (Column 3, row 2). The corrected table is provided below.

Table 1: Transcriptome statistics for Amoebozoa used in case study.

Taxon              ATCC   Contigs  SSU  Bacterial  Eukaryotic  Unknown  Genes
Filamoeba nolandi  50430  21671    17   2409       12205       7057     171
Pessonella sp. […]
Owing to the ever-increasing availability of phylogenetically informative data, the last decade has seen an upsurge of ecological studies incorporating information on evolutionary relationships among species. However, detailed species-level phylogenies, which are necessary for comprehensive large-scale eco-phylogenetic analyses, are still lacking for many large groups and regions. Here, we provide a dataset of 100 dated phylogenetic trees for all European tetrapods based on a mixture of supermatrix and supertree approaches. Phylogenetic inference was performed separately for each of the main Tetrapoda groups of Europe except mammals (i.e. amphibians, birds, squamates and turtles) by means of maximum likelihood (ML) analyses of a supermatrix, applying a tree constraint at the family (amphibians and squamates) or order (birds and turtles) level based on consensus knowledge. For each group, we inferred 100 ML trees to provide a phylogenetic dataset that accounts for phylogenetic uncertainty, and assessed node support with bootstrap analyses. Each tree was dated using penalized likelihood and fossil calibration. The trees obtained were well supported by existing knowledge and previous phylogenetic studies. For mammals, we modified the most complete supertree dataset available in the literature to include a recent update of the Carnivora clade. As a final step, we merged the phylogenetic trees of all groups to obtain a set of 100 phylogenetic trees for all European Tetrapoda species for which data were available (91%). We provide this phylogenetic dataset (100 chronograms) for comparative analyses and for macro-ecological or community ecology studies that aim to incorporate phylogenetic information while accounting for phylogenetic uncertainty.
More than 2,500 species of copepods (Class Maxillopoda; Subclass Copepoda) occur in the marine planktonic environment. The exceptional morphological conservation of the group, with numerous sibling species groups, makes the identification of species challenging, even for expert taxonomists. Molecular approaches to species identification have allowed rapid detection, discrimination, and identification of species based on DNA sequencing of single specimens and environmental samples. Despite the recent development of diverse genetic and genomic markers, the barcode region of the mitochondrial cytochrome c oxidase subunit I (COI) gene remains a useful and, in some cases, unequaled diagnostic character for species-level identification of copepods. This study reports 800 new barcode sequences for 63 copepod species not included in any previous study and examines the reliability and resolution of diverse statistical approaches to species identification based upon a dataset of 1,381 barcode sequences for 195 copepod species. We explore the impact of missing data (i.e., species not represented in the barcode database) on the accuracy and reliability of species identifications. Among the tested approaches, the best close match analysis resulted in accurate identification of all individuals to species, with no errors (false positives), and out-performed automated tree-based or BLAST-based analyses. This comparative analysis yields a new understanding of the strengths and weaknesses of DNA barcoding and confirms the value of DNA barcodes for species identification of copepods, including both individual specimens and bulk samples. Continued integrative morphological-molecular taxonomic analysis is needed to produce a taxonomically comprehensive database of barcode sequences for all species of marine copepods.
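The best close match criterion can be sketched as follows, using one common formulation from the barcoding literature: a query is assigned to the species of its nearest reference barcode if that distance falls within a threshold, is called ambiguous if equally near barcodes belong to different species, and is left unidentified otherwise. The study does not specify its implementation, so the function names, the use of uncorrected p-distance, and the 3% default threshold are assumptions for illustration only.

```python
def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance over sites where both sequences have a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)

def best_close_match(query, library, threshold=0.03):
    """Classify a query barcode against a reference library.

    library is a list of (species, aligned_sequence) pairs.
    Returns (status, species_or_None); threshold is hypothetical.
    """
    if not library:
        raise ValueError("reference library is empty")
    dists = [(p_distance(query, seq), sp) for sp, seq in library]
    best = min(d for d, _ in dists)
    if best > threshold:
        return ("no_match", None)
    species = {sp for d, sp in dists if d == best}
    if len(species) == 1:
        return ("identified", species.pop())
    return ("ambiguous", None)
```

Under this formulation, a query from a species missing from the library tends to fall into the "no_match" class rather than being forced onto its nearest neighbour, which is why the criterion is sensitive to database coverage.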
As phylogenetic data become increasingly available, along with associated data on species’ genomes, traits, and geographic distributions, the need to ensure data availability and reuse becomes more and more acute. In this paper, we provide ten “simple rules” that we view as best practices for data sharing in phylogenetic research. These rules will help lead towards a future phylogenetics where data can easily be archived, shared, reused, and repurposed across a wide variety of projects.
Our knowledge of the avian tree of life remains uncertain, particularly at deeper levels, due to the rapid diversification early in the evolutionary history of birds. Birds are the most abundant land vertebrates on the planet and have been of great historical interest to systematists. They are also economically and ecologically important and as a result are intensively studied, yet despite their importance and interest to humans, around 13% of taxa are currently on the endangered species list, perhaps as a result of human activity. Despite all this, no comprehensive phylogeny that includes both extinct and extant species currently exists. Here we present a species-level supertree of Aves, constructed using the Matrix Representation with Parsimony method, containing approximately two thirds of all species from nearly 1000 source phylogenies with a broad taxonomic coverage. The source data for the tree were collected and processed according to a strict protocol to ensure robust and accurate data handling. The resulting tree topology is largely consistent with molecular hypotheses of avian phylogeny. We identify areas that are in broad agreement with current views on avian systematics and also those that require further work. We also highlight the need for leaf-based support measures to enable the identification of rogue taxa in supertrees. This is a first attempt at a supertree of both extinct and extant birds. It is not intended to be utilised in an overhaul of avian systematics or as a basis for taxonomic re-classification, but it provides a strong basis for further studies on macroevolution, conservation, biodiversity, comparative biology and character evolution; in particular, the inclusion of fossils will allow the study of bird evolution and diversification throughout deep time.
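The matrix coding step of Matrix Representation with Parsimony can be sketched as follows. Each non-root clade in each source tree becomes one binary character: taxa inside the clade are scored 1, other taxa in that source tree are scored 0, and taxa absent from that source tree are scored as missing (?). This is a minimal illustration of the standard coding only, not the protocol used for the avian supertree; the nested-tuple tree representation and the function names are assumptions for the example.

```python
def leaves(node):
    """Set of leaf names under a node (a leaf is a str, a clade is a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clades(tree):
    """Leaf sets of all internal nodes except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found

def mrp_matrix(trees):
    """Return {taxon: character string} with one column per source clade."""
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {t: [] for t in taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for t in taxa:
                rows[t].append("1" if t in clade else "0" if t in present else "?")
    return {t: "".join(chars) for t, chars in rows.items()}
```

The resulting matrix would then be analysed with a parsimony program; in practice an all-zero outgroup row is commonly added to root the analysis, which this sketch omits.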
Understanding the evolutionary relationships of all eukaryotes on Earth remains a paramount goal of modern biology, yet analyzing homologous sequences across 1.8 billion years of eukaryotic evolution is challenging. Many existing tools for identifying gene orthologs are inadequate when working with heterogeneous rates of evolution and endosymbiotic/lateral gene transfer. Moreover, genomic-scale sequencing, which was once the domain of large sequencing centers, has advanced to the point where small laboratories can now generate the data needed for phylogenomic studies. This has opened the door for increased taxonomic sampling as individual research groups have the ability to conduct genome-scale projects on their favorite non-model organism.
Here we present some of the tools developed, and insights gained, as we created a pipeline that combines data-mining from public databases and our own transcriptome data to study the eukaryotic tree of life. The first steps of a phylogenomic pipeline involve choosing taxa and loci, and making decisions about how to handle alleles, paralogs and non-overlapping sequences. Next, orthologs are aligned for analyses including gene tree reconstruction and concatenation for supermatrix approaches. To build our pipeline, we created scripts written in Python that integrate third-party tools with custom methods. As a test case, we present the placement of five amoebae on the eukaryotic tree of life based on analyses of transcriptome data. Our scripts are available on GitHub and may be used as-is for automated large-scale phylogenomic analyses, or adapted for use in other types of studies.
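The concatenation step for a supermatrix can be illustrated with a short Python sketch that joins per-locus alignments and pads taxa missing from a locus with gap characters. This is not taken from the authors' scripts; the function name, the dictionary-based alignment representation, and the 1-based partition coordinates are assumptions made for the example.

```python
def concatenate(alignments, missing="-"):
    """Concatenate per-locus alignments into a supermatrix.

    alignments is a list of dicts mapping taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds the
    1-based inclusive (start, end) columns of each locus.
    """
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    matrix = {t: [] for t in taxa}
    partitions = []
    start = 1
    for i, aln in enumerate(alignments):
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {i} is empty or not aligned")
        length = lengths.pop()
        for t in taxa:
            # Taxa absent from this locus are filled with missing data.
            matrix[t].append(aln.get(t, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    return {t: "".join(parts) for t, parts in matrix.items()}, partitions
```

Keeping the partition boundaries alongside the supermatrix makes it possible to run partitioned analyses or to recover individual loci for gene tree reconstruction.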
Analyses on the scale of all eukaryotes present challenges not necessarily found in studies of more closely related organisms. Our approach will be of relevance to others for whom existing third-party tools fail to fully answer desired phylogenetic questions.