Phylogenetic trees are used by researchers across multiple fields of study to display historical relationships between organisms or genes. Trees are used to examine the speciation process in evolutionary biology, to classify families of viruses in epidemiology, to demonstrate co-speciation in host and pathogen studies, and to explore genetic changes occurring during the disease process in cancer, among other applications. Due to their complexity and the amount of data they present in visual form, phylogenetic trees have generally been difficult to render for publication and challenging to directly interact with in digital form. To address these limitations, we developed PhyloPen, an experimental novel multi-touch and pen application that renders a phylogenetic tree and allows users to interactively navigate within the tree, examining nodes, branches, and auxiliary information, and annotate the tree for note-taking and collaboration. We present a discussion of the interactions implemented in PhyloPen and the results of a formative study that examines how the application was received after use by practicing biologists — faculty members and graduate students in the discipline. These results are to be later used for a fully supported implementation of the software where the community will be welcomed to participate in its development.
Background
Understanding the evolutionary relationships of all eukaryotes on Earth remains a paramount goal of modern biology, yet analyzing homologous sequences across 1.8 billion years of eukaryotic evolution is challenging. Many existing tools for identifying gene orthologs are inadequate when working with heterogeneous rates of evolution and endosymbiotic/lateral gene transfer. Moreover, genomic-scale sequencing, which was once the domain of large sequencing centers, has advanced to the point where small laboratories can now generate the data needed for phylogenomic studies. This has opened the door for increased taxonomic sampling as individual research groups have the ability to conduct genome-scale projects on their favorite non-model organism.
Results
Here we present some of the tools developed, and insights gained, as we created a pipeline that combines data-mining from public databases and our own transcriptome data to study the eukaryotic tree of life. The first steps of a phylogenomic pipeline involve choosing taxa and loci, and making decisions about how to handle alleles, paralogs and non-overlapping sequences. Next, orthologs are aligned for analyses including gene tree reconstruction and concatenation for supermatrix approaches. To build our pipeline, we created scripts written in Python that integrate third-party tools with custom methods. As a test case, we present the placement of five amoebae on the eukaryotic tree of life based on analyses of transcriptome data. Our scripts are available on GitHub and may be used as-is for automated analyses of large scale phylogenomics, or adapted for use in other types of studies.
Conclusion
Analyses on the scale of all eukaryotes present challenges not necessarily found in studies of more closely related organisms. Our approach will be of relevance to others for whom existing third-party tools fail to fully answer desired phylogenetic questions.
Felsenstein’s pruning algorithm allows one to calculate the probability of any particular data pattern arising on a phylogeny given a model of character evolution. Here we present a similar dynamic programming algorithm. Our algorithm treats the tree and model as known. The algorithm makes it feasible to calculate the probability that a randomly selected character will be a member of a particular class of character patterns. Specifically, we are interested in binning patterns by the number of parsimony steps and the set of states observed at the tips of the tree. This algorithm was developed to expand the range of data set sizes that can be used with Waddell et al.’s marginal testing approach for assessing the adequacy of a model. The algorithms introduced can also be used in likelihood calculations which correct for ascertainment biases. For example, Lewis introduced an Mkv model which corrects for the lack of constant sites. The probability of a constant pattern arising can be calculated using the algorithm that we present, or by enumerating all possible constant patterns and calculating the probability of each one. Because the number of constant data patterns is small, both methods are efficient. However, elaborations of the Mkv model (such as those in Nylander et al) require calculating the probability of parsimony-uninformative patterns arising. For large trees and characters with many possible character states, the number of possible parismony-uninformative patterns is immense. In these cases, the algorithms introduced here will be more efficient. The algorithm has been implemented in open source software written in C++.