Introduction

Scientists are making rapid progress building the Tree of Life, which promises to provide new insights into evolutionary relationships across space and time. We now have available large-scale phylogenies (e.g. 12) and are accumulating massive amounts of character and distributional data (e.g. plant traits 3; mammalian traits 4; species occurrence data 5; gene interactions 6; microbiomes 7). Together, these resources provide researchers with unique opportunities to investigate processes that drive biological diversification at broad and deep phylogenetic scales (e.g. 8910). Developing tools that enable comparative analysis of various types of character data across large-scale phylogenies is essential to further our understanding of fundamental evolutionary processes.

As a “first draft” of the Tree of Life emerges, it becomes increasingly clear that we lack the resources necessary to interpret the evolution of character data across phylogenies at large scales. This is largely due to a lack of tools for the integration and visualization of massive amounts of spatial, temporal, and character data and for placing this information into the comparative framework provided by phylogenetic trees 11. Comparative analyses are hindered by the lack of scalable programs for studying trait evolution, the heterogeneity of data formats, and the inability to integrate multiple analyses to enable novel discoveries 12. Tools that successfully integrate very large phylogenetic trees with vast amounts of data associated with branch tips and nodes will enable evolutionary biologists to visualize and investigate the processes that drive the evolution of organismal diversity.

Here we present an initial description of a new open-source software package, Arbor, which is currently in development. Funded by the NSF AVAToL program, Arbor will allow us to harvest massive amounts of information from phylogenetic trees to facilitate new discoveries about the evolution of life on Earth.

Arbor: a new vision for comparative analyses

To bring comparative biology up to speed with the emerging Tree of Life, we are developing Arbor, an extendable platform for analyzing and visualizing evolutionary processes in a phylogenetic framework. Using Arbor workflows, researchers will be able to investigate evolutionary questions across very large phylogenetic trees containing very large numbers of terminal units (e.g. species, populations, individuals) using diverse types of character data (e.g. traits, biogeography, ecological associations, species interactions). The goal of Arbor is to provide biologists with an immediate and intuitive interface for testing comparative evolutionary questions across the Tree of Life and to enable them to quickly and efficiently communicate their findings to scientific and general audiences.

We are currently developing and implementing flexible and extendable tools that interact through an intuitive visual workspace in Arbor. As designed, Arbor workflows have three main benefits. First, Arbor will be flexible: the workspace allows end users to link together custom-designed sets of statistical analyses geared toward the types of questions inherent to their data. Second, Arbor will be extendable: software developers are able to quickly and easily add functionality without altering the details of Arbor’s core implementation. Third, Arbor will be sharable: users will be able to save their Arbor workflows and share them with others. As such, Arbor will be “next generation” software that enables tree-thinking and novel analyses at broad spatial and temporal scales.

The Arbor development team (see below) brings together a variety of expertise in both biology and computer science. The ongoing development of Arbor focuses on four components. First, we are developing the “core” of Arbor. This core will be a phylogenetic comparative analysis platform with a flexible user interface that will leverage existing software, algorithms and web services to allow scientists to address comparative evolutionary hypotheses at every scale across the Tree of Life. Second, we are implementing shared data integration and basic analysis tools for Arbor. These modular tools will provide basic visualization and fast, scalable algorithms for comparative analyses across a given phylogeny (i.e. lineage-specific trait diversification) or between two topologies (i.e. co-diversification, species interactions). Third, we are using a set of three test cases to develop Arbor into software that can solve fundamental problems in evolutionary biology. These test cases will stimulate rapid development of numerous Arbor modules that are directly inspired by biological applications across the Tree of Life. Fourth, we are developing four outreach initiatives to ensure that Arbor has a broader impact in advancing science, education and public education in comparative evolutionary biology.

The Arbor core: workflows and web services promote flexible and dynamic analyses

The Arbor interface will be a radical departure from currently existing software for comparative analyses. Users will interact with Arbor in an intuitive, visual way; building an Arbor workflow will have more in common with building a structure using Legos than with programming a computer. By using a modular open-source interface and web services, we will encourage the development of new tools and simultaneously enhance interoperability among existing tools.

Arbor takes inspiration from existing workflow software, especially ParaView 13 (also see Galaxy 14 and the iPlant Discovery Environment33). Users of workflow software create problem-specific workflows visually by connecting a series of data manipulations and analyses in a series of steps. The underlying software system is able to automatically coordinate the invocation of all of these steps during workflow execution (see the test case described below for an example workflow design).

We are designing Arbor as scalable software, with tree-of-life-scale analyses in mind from the beginning. Arbor will be able to efficiently process both small and large datasets, ultimately extending to analyses across the entire Tree of Life. This scalable performance will come from the use of improved algorithms (e.g. 15) along with both fine-grained parallelism (using multithreaded execution of individual steps when possible in a workflow) and coarse-grained parallelism (executing multiple steps of a workflow simultaneously on separate cores). Arbor will run efficiently to develop and test workflow designs using small and medium-size datasets on laptop and personal computers. Then, after a workflow idea has been expressed and validated, the workflow can be run against tree-of-life scale data using a scalable computing resource. To do this, the user will transfer the workflow from the local computer to an Arbor instance running on a large, multiprocessing computer, a computation cluster (e.g., XSEDE; www.xsede.org), or a similar cloud computational service to run a full-scale analysis.

Arbor will be modular software, extendable through a plug-in architecture that allows new analysis modules to be added to the program (similar to the modular design of Mesquite 16). Arbor modules will be able to include algorithms written in a number of languages (e.g. R, Python, Perl, C, or C++). Researchers who develop software for comparative analyses will have an immediate user base if they integrate their program as an Arbor module, promoting and encouraging active use of Arbor by the programming community. Arbor’s modular construction will allow new algorithms to become quickly available and usable with data; scientists will be able to carry out comparative analyses that have not yet been imagined.

Instead of spending effort redeveloping yet another implementation of known algorithms, the Arbor team will focus initially on integrating with existing external applications. Arbor will be able to work alongside proven systems like Mesquite, MrBayes, BEAST, PHYLIP, ape, and others. At any point along the way, Arbor’s visualization tools can be used to examine intermediate steps along an analysis workflow. All of the analyses in a workflow will be coordinated through the Arbor interface. This integration represents an entirely new way to build novel comparative analyses.

Arbor will include a web services infrastructure, providing the ability to integrate with existing web-hosted services that offer trees and data for analysis or perform algorithms useful for Arbor workflows (e.g. Lifemapper, CIPRES, MAFFT, RAXML, UniFrac, TreeBase, PorToL, OneZoom, and iPLANT). We will also make specific connections to the other AVAToL projects (Open Tree of Life and Next-Generation Phenomics). These connections will be facilitated by Arbor’s dataIntegrator module, which will allow Arbor to access, integrate, and analyze phylogenetic, geospatial, ontogenetic, and morphological data together in a single workflow.

Users will be able to export and share their workflows so that entire analyses can be quickly replicated with new or updated data. Shared workflows will be stored at the DRYAD digital repository (datadryad.org) or a similar archive, so that users who generate practical or compelling workflows receive direct credit via citations in scientific publications. We will encourage buy-in from user communities by providing an interactive and intuitive visual exploratory workspace with a user-driven toolkit. This toolkit can be expanded to include novel analytical tools as later modules are developed to serve the scientific community. Furthermore, we plan to quantify the impact of Arbor in real time. Tool use could be tracked to demonstrate user action and efficiency, providing feedback on which programs are being used and combined. Workflows will capture the provenance of ideas by automatically generating a list of references to be cited by end-users.

What will I be able to do with Arbor?

Arbor will begin by implementing the most commonly used existing comparative algorithms. We have started from the most flexible current platform for comparative methods, the statistical computer programming language R, home to several widely used phylogenetic analyses packages including ape 17, GEIGER 18, picante 19, and diversitree 20. The comparative algorithms from these packages – especially ape – will form the initial basis for Arbor’s core analysis infrastructure. We will also focus on developing and implementing new algorithms that can be applied efficiently to very large datasets (e.g. 15).

Arbor will allow users to link analyses as a series of steps in a workflow. To facilitate this process, we will spend effort to extend the flexible and scalable data structures already implemented in ParaView and/or VTK (the Visualization Toolkit 22) to be useful for a wide array of evolutionary analyses. Phylogenetic tree data structures are initially based on designs implemented in the ape and phylobase R packages. Our project will study and refine the design of tree and data structures for scalability and efficiency.

We will include powerful tree-based selection and query operations in Arbor’s analytical capabilities through the treeManipulator module. For example, Arbor will be able to locate sets of species that co-occur at a particular place and time, and to plot those species on a phylogenetic tree. This type of analysis, combining phylogenetic, geospatial, temporal, and other evolutionary traits, will take advantage of the rich filtering and selection operations implemented in Arbor’s analysis framework. These steps can then be chained together to form problem-specific workflows.

A strength of Arbor will be the phyloVis module, which will use a mature set of interactive graphical tools already extensively tested by the scientific visualization and data-mining communities. VTK-based visualizations have previously been implemented in interactive user interfaces running on all major computing platforms (Windows, Mac, Linux, and various custom supercomputer architectures). The VTK library itself is open-source and has been steadily evolving and expanding since its initial releases. In the past few years, the Visualization Toolkit has been extended to include informatics and information visualization tools to VTK, making VTK a premier information visualization platform that includes pre-built interactive visualization views 23. The phyloVis module will take advantage of these views to include visualizations specific to comparative analyses. We will also allow users to interface with OneZoom, a new tool for visualizing very large phylogenetic trees 11.

We will employ a set of three test cases to develop Arbor into software that can address fundamental questions in evolutionary biology. Test cases have been selected for their overarching biological relevance, but also to deliberately challenge and drive software development. These test cases include: (I) The Evolutionary Process of Spatial Diversification, using species distribution data, phylogenetic relationships, and temporal data to understand processes underlying biogeographic patterns and ecological niche differentiation; (II) The Evolution of Symbiotic Communities, using natural evolutionary replicates to understand the tempo and mode of evolution in species interactions and the evolution of phylogenetic community structure; (III) The Evolution of Complex Interactions, using novel evolutionary models and analytical algorithms to understand functional diversification during macroevolution and the evolution of interaction networks. Below we describe one test case in more detail to give an idea of the potential of the Arbor framework for enabling novel analyses.

Arbor Test Case: The Evolutionary Process of Spatial Diversification

The geographic range of a species is a crucial feature of its evolution. Species’ ranges are determined by both contemporary and historical factors 24. Currently, species distribution models (SDMs) focus on the impact of contemporary abiotic factors in structuring species’ ranges 25. By contrast, biogeographic models (such as Lagrange 26) consider how species’ ranges change through time due to dispersal, speciation, and extinction 2728. Arbor will allow users to integrate both approaches to allow comprehensive analyses of the evolution of species’ ranges over time.

For this test case, we will combine locality data with dated phylogenetic trees and paleoclimate data to model species distributions over time. Our test case analyses will focus on two main goals: (I) including historical data in SDMs; and (II) including information about species’ niches in biogeographic models.

To include historical data in Species Distribution Models (SDMs), the Arbor data integration module (dataIntegrator) will query occurrence data (e.g, VertNet [vertnet.org], GBIF [www.gbif.org], iDigBio [www.idigbio.org], or others) to obtain locality information for individuals from many species within a clade, and climate data servers to obtain climate data for each locality. An intuitive visual output will enable the user to effectively curate occurrence data, selecting data points that are relevant for the analysis. A distribution model for each of these species using the curated occurrence data will be constructed using an openModeller module (based on openmodeller.sourceforge.net) and LifeMapper (lifemapper.org), an existing set of web services and tools. openModeller will use the integration capabilities of Arbor to incorporate both current distribution data and the phylogenetic relationships of species into SDMs. Using these data, the openModeller module will be able to project species ranges onto past climatic layers using species’ distribution models combined with paleoclimatic data, given the phylogenetic relationships of the species and the temporal scale of their ancestral distributions. The results can then be visualized using the phyloVis module.

Fig. 1: Proposed phyloFlow workflow for the test case developed in the text. Information processing flows from remote sources to five phyloFlow modules.

First, the dataIntegrator (A) combines phylogenetic, distributional, and climactic data sources. openModeller (B) then creates species distribution models, and Lagrange (C) tests historical biogeographic hypotheses incorporating those models. Next, phyloView allows geographic (D) and phylogenetic (E) visualization of results. Outputs are directed to the dataOutput module (F), which can save results both locally and remotely.

We show a hypothetical workflow for this test case in Figure 1. The entire workflow has already been demonstrated (outside of Arbor) in a pilot study 29 using a virtual species on a spatially realistic matrix of 144 cells, and climate modeled in 130 time steps (back to 140,000 years before present). We will also be able to use the comparative capabilities of Arbor to constrain niche models to follow patterns expected if species’ response to abiotic factors follows evolutionary history (i.e., closely related species have niches that are more similar to one another than distantly related species). This will allow the user to directly test for phylogenetic structure in historic and extant species distribution patterns. We will use the New World genus Aphelocoma (scrub jays), which has 9 phylogenetic species, as a biological application of this test case. The phylogeny of the clade is well established 30 and predictions of Pleistocene ranges have been made 31, but without consideration to the historical trajectories of climatic factors and the phylogenetic history of the clade. We will use the new methods and algorithms implemented in Arbor to model scrub jay species distributions while incorporating paleoclimatic and phylogenetic data.

To include information about species’ niches in biogeographic models, we will use the Lagrange module to test hypotheses about the processes that govern range changes over time. Currently, the Lagrange program can fit process-based models of changes in species ranges through time by modeling speciation, extinction, and dispersal among a set of geographic regions. One can either assume that dispersal rates among these regions are constant or make modifications based on hypothesized patterns of geographic connections through time. The Lagrange module in Arbor will allow us to modify transition rates as a function of historical climate, directly testing whether geographic range shifts affect biogeographic processes. As a supplemental biological application, we will use this approach to investigate the diversification of Heliconia plants and their hummingbird pollinators. As these are Andean taxa that diversify during uplift activity, this test case will address the common biogeographical hypothesis that the Andean uplift and subsequent diversification of ecological niche space led to the in situ diversification of local lineages, and that current species distribution is largely driven by historical distributions and shifts in climatic factors 32.

Arbor outreach

Our outreach and education efforts integrate four main goals to ensure that Arbor plays a meaningful role in advancing science and education in comparative evolutionary biology. As part of the development of Arbor, we will (1) create novel educational programs targeted to K-12 and the general public in collaboration with several public science centers; (2) explicitly involve diverse researchers and students, providing training both in software development and in using Arbor through hackathons and workshops; (3) engage undergraduates as well as graduate students and postdoctoral fellows in the development of Arbor; and (4) leverage and augment research conducted by other NSF-supported Tree of Life and Dimensions of Biodiversity projects.

We plan to develop a software interface, selected data sets, and lesson plans that will allow middle school, high school, and undergraduate students to engage with authentic scientific data as they develop tree-thinking skills. This interface will be a streamlined version of Arbor with a graphically appealing interface and more limited data analysis options to facilitate student use. Rich but appropriate data sets will be selected from the parent project so that students can explore a variety of evolutionary scenarios (e.g., co-diversification, ecological niches through time) and investigate multiple questions. Lesson plans will provide background information and explicit suggestions for instructors regarding how these materials can best be incorporated into classrooms to teach evolution, systematics, comparative biology, and the process of science.

The Arbor team

The arbor team includes both biologists and computer scientists. Biologists include Bob Thacker (University of Alabama at Birmingham), Chelsea Specht (University of California, Berkeley), Jorge Soberon (University of Kansas), and Luke Harmon (University of Idaho). We span a range of biological expertise and interests, including symbiosis, genetics, developmental biology, species distributions, and comparative methods. Computer scientists include Charlie Hughes (University of Central Florida) and Curt Lisle (KnowledgeVis). We are also working closely with a team of programmers from KitWare, led by Wes Turner and Jeff Baumes. The computer scientists have extensive expertise with workflow software, visualization, user interaction and statistical analyses of large datasets.

Community involvement: how you can get involved now with Arbor

Although the Arbor team is small now, we will grow more inclusive through a series of planned training courses and hackathons. Training courses will be geared towards teaching end-users how to obtain and analyze data using Arbor workflows, while hackathons will help programmers get up to speed and create new modules for Arbor. The first of these meetings will take place in late 2013. We will also be offering ‘sabbaticals’ to postdoctoral scholars who are actively generating phylogenetic and comparative data that could be analyzed within Arbor. These 3-6 month appointments allow postdocs to work closely with the Arbor team to integrate their data into Arbor and design novel Arbor workflows that are suited to their particular questions, thereby immediately developing the capacity of Arbor to serve the scientific community. We encourage community feedback and involvement at every step of the process; the goal of the Arbor team is to produce efficient and open-source software that will enable scientists, students, and educators to understand the Tree of Life in new ways. Everyone is free to provide comments or suggestions either here on the PLOS Currents page or via our website (http://www.arborworkflows.com/).

Conclusion

We are rapidly approaching a “rough draft” for the Tree of Life. As phylogenetic trees become more reliable, larger, and more complete, the need for integrative approaches to analyze and understand those trees and their evolutionary significance becomes critical. Arbor will allow users to test key evolutionary hypotheses with methods that are new and flexible using massive amounts of data. Arbor’s most unique feature is allowing users to create sharable and citable workflows that combine methods in new and unanticipated ways. We are building the Tree of Life, but we do not yet know what it can tell us about the process of evolution. Arbor will allow us to harvest massive amounts of new information from the Tree of Life and facilitate new discoveries about organismal evolution.