Best Practices for Data Sharing in Phylogenetic Research

Karen Cranston; Luke J. Harmon; Maureen A. O'Leary; Curtis Lisle

doi:10.1371/currents.tol.bf01eff4a6b60ca4825c69293dc59645

As phylogenetic data become increasingly available, along with associated data on species’ genomes, traits, and geographic distributions, the need to ensure data availability and reuse becomes more and more acute. In this paper, we provide ten “simple rules” that we view as best practices for data sharing in phylogenetic research. These rules will help lead towards a future phylogenetics where data can easily be archived, shared, reused, and repurposed across a wide variety of projects.

The phenotype represents a critical interface between the genome and the environment in which organisms live and evolve. Phenotypic characters are also a rich source of biodiversity data for tree building, and they enable scientists to reconstruct the evolutionary history of organisms, including most fossil taxa, for which genetic data are unavailable. Therefore, phenotypic data are necessary for building a comprehensive Tree of Life. In contrast to molecular sequencing, which has become faster and cheaper through recent technological advances, phenotypic data collection often remains prohibitively slow and expensive. The next-generation phenomics project is a collaborative, multidisciplinary effort to leverage advances in image analysis, crowdsourcing, and natural language processing to develop and implement novel approaches for discovering and scoring the phenome, the collection of phenotypic characters for a species. This research represents a new approach to data collection that has the potential to transform phylogenetics research and to enable rapid advances in constructing the Tree of Life. Our goal is to assemble large phenomic datasets built using new methods and to provide the public and scientific community with tools for phenomic data assembly that will enable rapid and automated study of phenotypes across the Tree of Life.

We describe our efforts to develop a software package, Arbor, that will enable scientific research in all aspects of comparative biology. This software will enable developmental biologists, geneticists, ecologists, geographers, paleobiologists, educators, and students to analyze diverse types of comparative data at multiple phylogenetic and spatiotemporal scales using an intuitive visual interface. Arbor’s user-defined workflows will be exported and shared so that entire analyses can be quickly replicated with new or updated data. Arbor will also be designed to easily and seamlessly expand to include novel analytical tools as they are developed. Here we describe the core components of Arbor, as well as provide details of one proposed test case to illustrate the software’s key functionality.

In August 2011, a week-long NSF-sponsored workshop focusing on the Tree of Life (ToL) took place in Lake Placid, New York. This workshop, called AVAToL (Assembling, Visualizing, and Analyzing the Tree of Life), was the first application of NSF’s Ideas Lab concept to systematics. In this article we outline the history and motivation for the Ideas Lab approach and its application to the ToL, explain the nuts and bolts of the Ideas Lab process, and look to the potential contributions of AVAToL-funded projects to help enable the future of ToL and, more broadly, comparative biological research.

Best Practices for Data Sharing in Phylogenetic Research

Next-generation phenomics for the Tree of Life

Arbor: Comparative Analysis Workflows for the Tree of Life

The Ideas Lab Concept, Assembling the Tree of Life, and AVAToL