Benchmark datasets and software for developing and testing methods for large-scale multiple sequence alignment and phylogenetic inference

Abstract

We have assembled a collection of web pages that contain benchmark datasets and software tools to enable the evaluation of the accuracy and scalability of computational methods for estimating evolutionary relationships. They provide a resource to the scientific community for development of new alignment and tree inference methods on very difficult datasets. The datasets are intended to help address three problems: multiple sequence alignment, phylogeny estimation given aligned sequences, and supertree estimation. Datasets from our work include empirical datasets with carefully curated alignments suitable for testing alignment and phylogenetic methods for large-scale systematics studies. Links to other empirical datasets, lacking curated alignments, are also provided. We also include simulated datasets with properties typical of large-scale systematics studies, including high rates of substitutions and indels, and we include the true alignment and tree for each simulated dataset. Finally, we provide links to software tools for generating simulated datasets, and for evaluating the accuracy of alignments and trees estimated on these datasets. We welcome contributions to the benchmark datasets from other researchers.

Funding Statement

The research was supported by the US National Science Foundation DEB 0733029, and by Microsoft Research through support to TW.

One of the principal goals of the National Science Foundation’s Assembling the Tree of Life (AToL) initiative is “[a]ssembly of a framework phylogeny, or Tree of Life, for all major lineages of life.” [1] Much of that effort has focused on accumulating and analyzing data for the major taxonomic groups. However, because of the scale of the problems (numbers of species and amount of sequence information), the initiative has also required development of methods for sequence alignment, phylogenetic inference and supertree estimation that can handle hundreds, thousands or even tens of thousands of sequences. In the last decade, many new methods have been developed to address these challenging computational problems, including RAxML [2], GARLI [3], POY [4], SATé [5], and MrBayes [6]. However, evaluations of the efficacy of these methods for large-scale alignment and tree estimation (required for highly accurate estimations of the Tree of Life) have lagged behind method development.

To facilitate testing of large-scale alignment and phylogeny estimation methods, we have assembled a collection of web pages of (1) benchmark datasets and (2) software appropriate for creating new simulated benchmark datasets (http://www.cs.utexas.edu/users/phylo/datasets/). Because these datasets have been assembled with an eye to their usefulness for Tree of Life-scale projects, only datasets that have large numbers of taxa and/or present other difficulties for phylogenetic reconstruction and alignment (e.g., high rates of substitution and insertions and deletions) are included. The datasets we provide range in size from a few hundred to more than 300,000 sequences. The datasets are broken down into sets most appropriate for three types of phylogenetic problems: phylogenetic estimation given aligned sequences, supertree estimation, and multiple sequence alignment. Some datasets are appropriate for more than one type of problem and therefore are referenced more than once. Reference information and links are provided for all published datasets.

Benchmarks for phylogenetic estimation

The benchmark datasets for phylogenetic estimation are both empirical and simulated. They have been used in large-scale systematics studies, and so present challenges for maximum likelihood, maximum parsimony and Bayesian estimation. A subset of the empirical datasets (Table 1) includes curated alignments and reference trees (generated using RAxML version 7.0.4 [2]). Reference trees have been assessed by bootstrapping, with edges having less than 75% support contracted. The remaining empirical datasets lack curated alignments and reference trees, but are appropriate for assessing the ability of alignment and phylogenetic software to operate on large and/or difficult datasets. They can also be used to compare how well algorithms solve particular optimality criteria, e.g., maximum parsimony or maximum likelihood. The empirical datasets include both nucleic acid and amino acid sequence data.
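The contraction of weakly supported edges can be sketched as follows. This is a minimal illustration, not the procedure used to build the reference trees: the `Node` class, the `contract_low_support` function, and the toy tree are hypothetical, and bootstrap support is assumed to be stored as a percentage on each internal node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not tied to any particular phylogenetics library."""
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # bootstrap support (%) of the edge above this node
    children: List["Node"] = field(default_factory=list)


def contract_low_support(node: Node, threshold: float = 75.0) -> Node:
    """Return a copy of the tree in which internal edges below `threshold` are contracted."""
    new_children = []
    for child in node.children:
        child = contract_low_support(child, threshold)
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < threshold:
            # Contract the edge: the grandchildren attach directly to this node,
            # which turns the resolved clade into part of a polytomy.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Serialize the topology (with support labels) in Newick-like form."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else str(int(node.support))
    return f"({inner}){label}"


# Toy example: the (C,D) clade has 60% support and is collapsed.
tree = Node(children=[
    Node(support=90, children=[Node("A"), Node("B")]),
    Node(support=60, children=[Node("C"), Node("D")]),
    Node("E"),
])
print(to_newick(contract_low_support(tree)))  # ((A,B)90,C,D,E)
```

A contracted reference tree of this kind makes no claim about relationships that the data do not support, so an estimated tree is not penalized for resolving them differently.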

Table 1. Empirical datasets and their properties.

| Dataset (a) | Gene | Taxonomic Range | Number of Taxa | Number of Characters (b) | Percentage Indels | Average Gap Length |
|---|---|---|---|---|---|---|
| 16S.B.ALL | 16S rRNA | Bacteria | 27,643 | 6,857 | 80.0 | 4.9 |
| 16S.T | 16S rRNA | The three domains of life plus mitochondria and chloroplasts | 7,350 | 11,856 | 87.4 | 12.1 |
| 16S.3 | 16S rRNA | The three domains of life | 6,323 | 8,716 | 82.1 | 9.4 |
| 16S.M.aa_ag (c) | 16S rRNA | Mitochondria | 1,028 | 4,907 | 82.6 | 22.0 |
| 16S.M | 16S rRNA | Mitochondria | 901 | 4,722 | 78.1 | 17.2 |
| 23S.M | 23S rRNA | Mitochondria | 278 | 10,738 | 83.7 | 31.9 |
| 23S.M.aa_ag (c) | 23S rRNA | Mitochondria | 263 | 10,305 | 83.5 | 34.2 |
| 23S.E.aa_ag (c) | 23S rRNA | Eukaryotes nuclear | 144 | 8,619 | 61.1 | 13.5 |
| 23S.E | 23S rRNA | Eukaryotes | 117 | 9,079 | 59.7 | 12.6 |

(a) Unless otherwise noted, all datasets in this table are taken from Cannone et al. [7]. Curated alignments were produced by Cannone et al. using covariation and secondary structure. The reference trees reported on our web site were generated using RAxML version 7.0.4. Complete run parameters and program commands are listed on the web site.
(b) The number of columns in the aligned dataset.
(c) [8]
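The last two columns of Table 1 summarize how gappy each curated alignment is. A minimal sketch of how such statistics could be computed is given below. The definitions are assumptions made for illustration, since the exact definitions used for the table are not stated here: "percentage indels" is taken to be the share of alignment cells that are gap characters, and "average gap length" the mean length of contiguous gap runs within a row. The function name `gap_statistics` and the toy alignment are hypothetical.

```python
import re
from typing import Dict, Tuple


def gap_statistics(alignment: Dict[str, str], gap_chars: str = "-") -> Tuple[float, float]:
    """Return (percentage of gap cells, mean length of contiguous gap runs).

    `alignment` maps sequence names to aligned rows of equal length.
    Both definitions are assumptions for illustration only.
    """
    rows = list(alignment.values())
    if not rows or not rows[0] or len({len(r) for r in rows}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")

    total_cells = len(rows) * len(rows[0])
    gap_run = re.compile("[" + re.escape(gap_chars) + "]+")
    # Length of every maximal run of gap characters, over all rows.
    run_lengths = [len(m.group()) for row in rows for m in gap_run.finditer(row)]

    gap_cells = sum(run_lengths)
    percent_gaps = 100.0 * gap_cells / total_cells
    mean_run = gap_cells / len(run_lengths) if run_lengths else 0.0
    return percent_gaps, mean_run


toy = {"s1": "AC--GT", "s2": "A---GT", "s3": "ACGTGT"}
percent, mean_len = gap_statistics(toy)
print(f"{percent:.1f}% gaps, mean gap length {mean_len:.1f}")
```

In the toy alignment 5 of 18 cells are gaps, in two runs of lengths 2 and 3, so the mean run length is 2.5.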

Simulated datasets (Table 2) were taken from three sources and include both amino acid and nucleic acid sequences. They vary widely in the number of sequences, the rate of substitution, and the sizes and rates of indels.

Table 2. Simulated sequence datasets and some of their properties.

| Dataset | Source | Data Type (a) | Number of Taxa | Number of Characters (b) | Software (c) |
|---|---|---|---|---|---|
| FastTree | Price et al. [9]; Price et al. [10] | AA; NA | 250; 1,250; 5,000; 78,132 | N/A | Rose [11] |
| SATé | Liu et al. [5] | NA | 100; 500; 1,000 | 1,000 | SeqGen [12]; Rose [11] |
| RNASim | kim.bio.upenn.edu/software/csd.shtml | NA (SSU rRNA) | 128; 256; 512; 1,024; 2,048; 4,096; 8,192; 16,384; 1,000,000 | 1,542 | RNASim [13] |

(a) AA = amino acid; NA = nucleic acid
(b) The number of characters in the root sequence
(c) The software used to generate the datasets
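Because each simulated dataset comes with its true tree, an estimated tree can be scored by comparing bipartitions. The sketch below computes one common measure, the fraction of non-trivial bipartitions of the true tree that are absent from the estimated tree. It is an illustration only and is not one of the evaluation tools linked from the web pages: the nested-tuple tree encoding and the function names are hypothetical.

```python
from typing import FrozenSet, Set, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, tuple]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """All leaf labels below (and including) this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions, each stored as the side that excludes a fixed anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if len(side) >= 2 and len(all_leaves - side) >= 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def missing_branch_rate(true_tree: Tree, estimated_tree: Tree) -> float:
    """Fraction of true-tree bipartitions that the estimated tree fails to recover."""
    if leaf_set(true_tree) != leaf_set(estimated_tree):
        raise ValueError("trees must have identical leaf sets")
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)


true_tree = (("A", "B"), ("C", "D"), "E")
estimated = (("A", "B"), "C", ("D", "E"))
print(missing_branch_rate(true_tree, estimated))  # 0.5
```

Storing each split as the side that excludes a fixed anchor leaf makes the comparison independent of where either tree is rooted.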

Benchmarks for multiple sequence alignment

Most of the benchmark datasets for multiple sequence alignment are the same as those for phylogenetic estimation. Both empirical and simulated datasets are provided. Taken as a whole, these datasets have properties that are typical both of markers currently used in large-scale phylogeny estimation and of markers that have evolved under high rates of indels, and are thus extremely difficult to align.

In addition to the simulated datasets for phylogeny estimation, we include a simulated amino acid collection [14] whose datasets were generated using Rose [11] and range from 20 to 100 taxa. Several additional empirical amino acid datasets are also included (BAliBASE [15], OXBench [16], PREFAB [17], SABmark [18]).

The empirical benchmark datasets for testing multiple sequence alignment include datasets with highly reliable curated sequence alignments that have been carefully validated by the community. The gold standard for this sort of dataset is Robin Gutell’s Comparative RNA Website (CRW) [19]. The curated alignments provided in CRW are based upon secondary structural information, which is particularly helpful where the mature rRNA is double stranded due to sequence complementarity.
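A curated alignment serves as the reference against which an estimated alignment is scored. One widely used measure is the sum-of-pairs (SP) score: the fraction of residue pairs aligned in the reference that are also aligned in the estimate. The sketch below is a minimal illustration under stated assumptions, not one of the evaluation tools linked from the web pages. The function names are hypothetical, both alignments are assumed to contain the same ungapped sequences, and the pair sets grow quadratically with the number of sequences, so this form is suitable only for small examples.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Residue = Tuple[str, int]  # (sequence name, index of the residue in the ungapped sequence)


def residue_pairs(alignment: Dict[str, str], gap: str = "-") -> Set[Tuple[Residue, Residue]]:
    """All pairs of residues that share a column in the alignment."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all rows must have the same length")

    next_index = {n: 0 for n in names}  # how many residues of each sequence were seen so far
    pairs: Set[Tuple[Residue, Residue]] = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != gap:
                present.append((n, next_index[n]))
                next_index[n] += 1
        # Names are visited in sorted order, so each pair has one canonical orientation.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference: Dict[str, str], estimated: Dict[str, str]) -> float:
    """Fraction of reference residue pairs recovered by the estimated alignment."""
    ref_pairs = residue_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & residue_pairs(estimated)) / len(ref_pairs)


reference = {"s1": "AC-GT", "s2": "ACTGT"}
estimated = {"s1": "ACG-T", "s2": "ACTGT"}
print(sp_score(reference, estimated))  # 0.75
```

In this example the estimate misplaces the G of `s1`, so three of the four reference pairs are recovered.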

Benchmarks for supertree methods

Finally, we provide benchmarks for testing supertree methods. As with the other benchmark collections, we provide both empirical (Table 3) and simulated [20] supertree datasets, and include datasets with different properties (such as the number of source trees, and the taxon sampling strategies used to produce the source trees).

Table 3. Empirical supertree datasets.

| Dataset | Taxonomic Range | Total Taxa | Number of Source Trees |
|---|---|---|---|
| McMahon and Sanderson [21] | Comprehensive papilionoid legumes | 2,228 | 39 |
| Cardillo et al. [22] | Marsupials | 267 | 158 |
| Beck et al. [23] | Placental mammals | 116 | 726 |
| Kennedy and Page [24] | Seabirds | 121 | 7 |
| Wojciechowski et al. [25] | Temperate herbaceous papilionoid legumes | 558 | 19 |

Software for generating datasets

The collection of simulation software is useful for three aspects of producing simulated datasets: generating simulated phylogenetic trees (e.g., r8s [26] and Mesquite [27]); evolving sequences on phylogenetic trees (particularly with tools that model both substitution and indel events); and post-processing model trees so that they deviate from the model assumptions (such as the molecular clock).
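The third aspect, post-processing a model tree so that it no longer obeys a molecular clock, can be sketched as follows. This is an illustration under stated assumptions and does not reproduce the behavior of any tool in the collection: the dictionary-based tree encoding, the function names, and the choice of a log-uniform multiplier per branch are all hypothetical.

```python
import math
import random
from typing import Dict


def deviate_from_clock(node: dict, rng: random.Random, max_factor: float = 2.0) -> dict:
    """Return a copy of the tree with each branch length multiplied by a random factor.

    Factors are drawn log-uniformly from [1/max_factor, max_factor] (an assumed
    distribution), so lengthening and shortening are equally likely.
    """
    bound = math.log(max_factor)
    factor = math.exp(rng.uniform(-bound, bound))
    return {
        "name": node.get("name"),
        "length": node["length"] * factor,
        "children": [deviate_from_clock(c, rng, max_factor) for c in node.get("children", [])],
    }


def root_to_tip_distances(node: dict, so_far: float = 0.0) -> Dict[str, float]:
    """Path length from the root to every leaf; all equal on a clock-like tree."""
    here = so_far + node["length"]
    children = node.get("children", [])
    if not children:
        return {node["name"]: here}
    distances: Dict[str, float] = {}
    for child in children:
        distances.update(root_to_tip_distances(child, here))
    return distances


# A clock-like (ultrametric) toy tree: every leaf is 2.0 units from the root.
clock_tree = {"length": 0.0, "children": [
    {"length": 1.0, "children": [
        {"name": "A", "length": 1.0},
        {"name": "B", "length": 1.0},
    ]},
    {"name": "C", "length": 2.0},
]}

print(root_to_tip_distances(clock_tree))
print(root_to_tip_distances(deviate_from_clock(clock_tree, random.Random(0))))
```

After perturbation the root-to-tip distances generally differ among leaves, which is the signature of a non-clock-like tree. Sequences evolved on such a tree are harder for methods that assume rate constancy.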

Conclusions

We hope that having these benchmark datasets in a single location will facilitate research on large-scale phylogenetic methods and enhance the reproducibility of work based upon the datasets. We also hope that posting simulation tools that are capable of generating large-scale phylogenetic datasets will promote the generation of new benchmark datasets for use by the community.

We conceive of these pages as an evolving resource for the community and welcome input that would improve or expand them. We invite readers to contact us if they wish to contribute benchmark datasets or software to this resource or if they know of additional datasets or software that should be added to the pages. We will be happy to work with the laboratories providing them, and will either store them locally or provide links to their sites.

Competing Interests

The authors have declared that no competing interests exist.

Acknowledgements

The authors thank the editors and two anonymous reviewers for their very helpful suggestions on the manuscript. Correspondence should be sent to CRL.

References

  1. National Science Foundation, Available from:
  2. Stamatakis, A. 2006. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21):2688-2690.
  3. Zwickl, D.J. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. In Section of Integrative Biology, School of Biological Sciences, University of Texas at Austin: Austin.
  4. Varón, A., L.S. Vinh, and W.C. Wheeler. 2010. POY version 4: phylogenetic analysis using dynamic homologies. Cladistics 26: in press.
  5. Liu, K., et al. 2009. Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees. Science 324(5934):1561-1564.
  6. Ronquist, F. and J.P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19(12):1572-1574.
  7. Cannone, J. J., S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D'Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Müller, N. Pande, Z. Shang, N. Yu, R. R. Gutell. 2002. The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BMC Bioinformatics 3:2.
  8. Elgavish, T., J. J. Cannone, J. C. Lee, S. C. Harvey, R. R. Gutell. 2001. AA.AG@Helix.Ends: A:A and A:G Base-pairs at the Ends of 16 S and 23 S rRNA Helices. Journal of Molecular Biology 310:735-753.
  9. Price, M. N., P. S. Dehal, A. P. Arkin. 2009. FastTree: Computing Large Minimum-Evolution Trees with Profiles instead of a Distance Matrix. Molecular Biology and Evolution 26:1641-1650.
  10. Price, M. N., P. S. Dehal, A. P. Arkin. 2010. FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE 5(3):e9490.
  11. Stoye, J., D. Evers, F. Meyer. 1998. Rose: generating sequence families. Bioinformatics 14:157-163.
  12. Rambaut, A. and N.C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235.
  13. Guo, S., L.-S. Wang, J. Kim. 2009. Large-scale simulation of RNA macroevolution by an energy-dependent fitness model. arXiv:0912.2326v1 [q-bio.PE].
  14. Wang, L.-S., J. Leebens-Mack, P. K. Wall, K. Beckmann, C. W. dePamphilis, T. Warnow. 2009. The Impact of Multiple Protein Sequence Alignment on Phylogenetic Estimation. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB).
  15. Thompson, J. D., P. Koehl, R. Ripp, O. Poch. 2005. BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins 61:127-136.
  16. Raghava, G. P. S., S. M. J. Searle, P. C. Audley, J. D. Barber, G. J. Barton. 2003. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4:47.
  17. Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res. 32:1792-1797.
  18. Van Walle. Available from: http://bioinformatics.vub.ac.be/databases/databases.html
  19. Gutell, R. Available from: http://www.rna.ccbb.utexas.edu/CAR/
  20. Swenson, M. S., F. Barbançon, C. Linder, and T. Warnow. 2010. A simulation study comparing supertree and combined analysis methods using SMIDGen. Algorithms for Molecular Biology 5.
  21. McMahon, M. and M. Sanderson. 2006. Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Systematic Biology 55:818-836.
  22. Cardillo, M., O. R. P. Bininda-Emonds, E. Boakes, and A. Purvis. 2004. A species-level phylogenetic supertree of marsupials. Journal of Zoology 264:11-31.
  23. Beck, R. M. D., O. R. P. Bininda-Emonds, M. Cardillo, F. G. R. Liu, and A. Purvis. 2006. A higher-level MRP supertree of placental mammals. BMC Evolutionary Biology 6:93.
  24. Kennedy, M. and R. Page. 2002. Seabird supertrees: combining partial estimates of procellariiform phylogeny. The Auk 119:88-108.
  25. Wojciechowski, M., M. Sanderson, K. Steele, and A. Liston. 2000. Molecular phylogeny of the "temperate herbaceous tribes" of papilionoid legumes: a supertree approach. Advances in Legume Systematics 9:277-298.
  26. Sanderson, M.J. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19(2):301-302.
  27. Maddison, W.P. and D.R. Maddison. 2010. Mesquite: a modular system for evolutionary analysis. Version 2.74.
