Building a Phylogenomic Pipeline for the Eukaryotic Tree of Life – Addressing Deep Phylogenies with Genome-Scale Data

Background
Understanding the evolutionary relationships of all eukaryotes on Earth remains a paramount goal of modern biology, yet analyzing homologous sequences across 1.8 billion years of eukaryotic evolution is challenging. Many existing tools for identifying gene orthologs are inadequate when working with heterogeneous rates of evolution and endosymbiotic/lateral gene transfer. Moreover, genomic-scale sequencing, which was once the domain of large sequencing centers, has advanced to the point where small laboratories can now generate the data needed for phylogenomic studies. This has opened the door for increased taxonomic sampling as individual research groups have the ability to conduct genome-scale projects on their favorite non-model organism.

Results
Here we present some of the tools developed, and insights gained, as we created a pipeline that combines data-mining from public databases and our own transcriptome data to study the eukaryotic tree of life. The first steps of a phylogenomic pipeline involve choosing taxa and loci, and making decisions about how to handle alleles, paralogs and non-overlapping sequences. Next, orthologs are aligned for analyses including gene tree reconstruction and concatenation for supermatrix approaches. To build our pipeline, we created scripts written in Python that integrate third-party tools with custom methods. As a test case, we present the placement of five amoebae on the eukaryotic tree of life based on analyses of transcriptome data. Our scripts are available on GitHub and may be used as-is for automated analyses of large-scale phylogenomics, or adapted for use in other types of studies.
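
The concatenation step mentioned above can be illustrated with a minimal Python sketch. It is not taken from the authors' scripts: the function name `concatenate`, the dict-based alignment representation, and the use of "?" to pad taxa missing from a locus are all assumptions made for illustration.

```python
def concatenate(alignments, missing="?"):
    """Concatenate per-locus alignments into a supermatrix.

    Each alignment is a dict mapping taxon name -> aligned sequence
    (hypothetical representation). Taxa absent from a locus are padded
    with the `missing` character. Returns (supermatrix, partitions),
    where partitions holds 1-based inclusive (start, end) columns.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, aln in enumerate(alignments):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {index} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Pad taxa that were not sampled for this locus.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    locus1 = {"A": "ACGT", "B": "AC-T"}
    locus2 = {"A": "GG", "C": "GA"}
    matrix, parts = concatenate([locus1, locus2])
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

Recording the partition boundaries alongside the matrix keeps the option of assigning a separate model to each locus in downstream analyses.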

Conclusion
Analyses on the scale of all eukaryotes present challenges not necessarily found in studies of more closely related organisms. Our approach will be of relevance to others for whom existing third-party tools fail to fully answer desired phylogenetic questions.

Next-generation phenomics for the Tree of Life

The phenotype represents a critical interface between the genome and the environment in which organisms live and evolve. Phenotypic characters are also a rich source of biodiversity data for tree building, and they enable scientists to reconstruct the evolutionary history of organisms, including most fossil taxa, for which genetic data are unavailable. Therefore, phenotypic data are necessary for building a comprehensive Tree of Life. In contrast to molecular sequencing, which has become faster and cheaper through recent technological advances, phenotypic data collection often remains prohibitively slow and expensive. The next-generation phenomics project is a collaborative, multidisciplinary effort to leverage advances in image analysis, crowdsourcing, and natural language processing to develop and implement novel approaches for discovering and scoring the phenome, the collection of phenotypic characters for a species. This research represents a new approach to data collection that has the potential to transform phylogenetics research and to enable rapid advances in constructing the Tree of Life. Our goal is to assemble large phenomic datasets built using new methods and to provide the public and scientific community with tools for phenomic data assembly that will enable rapid and automated study of phenotypes across the Tree of Life.

Arbor: Comparative Analysis Workflows for the Tree of Life

We describe our efforts to develop a software package, Arbor, that will enable scientific research in all aspects of comparative biology. This software will enable developmental biologists, geneticists, ecologists, geographers, paleobiologists, educators, and students to analyze diverse types of comparative data at multiple phylogenetic and spatiotemporal scales using an intuitive visual interface. Arbor’s user-defined workflows will be exported and shared so that entire analyses can be quickly replicated with new or updated data. Arbor will also be designed to easily and seamlessly expand to include novel analytical tools as they are developed. Here we describe the core components of Arbor, as well as provide details of one proposed test case to illustrate the software’s key functionality.

The Tree of Life and a New Classification of Bony Fishes

The tree of life of fishes is in a state of flux because we still lack a comprehensive phylogeny that includes all major groups. The situation is most critical for a large clade of spiny-finned fishes, traditionally referred to as percomorphs, whose uncertain relationships have plagued ichthyologists for over a century. Most of what we know about the higher-level relationships among fish lineages has been based on morphology, but a rapid influx of molecular studies is changing many established systematic concepts. We report a comprehensive molecular phylogeny for bony fishes that includes representatives of all major lineages. DNA sequence data for 21 molecular markers (one mitochondrial and 20 nuclear genes) were collected for 1410 bony fish taxa, plus four tetrapod species and two chondrichthyan outgroups (total 1416 terminals). Bony fish diversity is represented by 1093 genera, 369 families, and all traditionally recognized orders. The maximum likelihood tree provides unprecedented resolution and high bootstrap support for most backbone nodes, defining for the first time a global phylogeny of fishes. The general structure of the tree is in agreement with expectations from previous morphological and molecular studies, but significant new clades arise. Most interestingly, the high degree of uncertainty among percomorphs is now resolved into nine well-supported supraordinal groups. The order Perciformes, considered by many a polyphyletic taxonomic waste basket, is defined for the first time as a monophyletic group in the global phylogeny. A new classification that reflects our phylogenetic hypothesis is proposed to facilitate communication about the newly found structure of the tree of life of fishes. Finally, the molecular phylogeny is calibrated using 60 fossil constraints to produce a comprehensive time tree.
The new time-calibrated phylogeny will provide the basis for and stimulate new comparative studies to better understand the evolution of the amazing diversity of fishes.

Multi-locus phylogenetic analysis reveals the pattern and tempo of bony fish evolution

Over half of all vertebrates are “fishes”, which exhibit enormous diversity in morphology, physiology, behavior, reproductive biology, and ecology. Investigation of fundamental areas of vertebrate biology depends critically on a robust phylogeny of fishes, yet evolutionary relationships among the major actinopterygian and sarcopterygian lineages have not been conclusively resolved. Although a consensus phylogeny of teleosts has been emerging recently, it has been based on analyses of various subsets of actinopterygian taxa, but not on a full sample of all bony fishes. Here we conducted a comprehensive phylogenetic study on a broad taxonomic sample of 61 actinopterygian and sarcopterygian lineages (with a chondrichthyan outgroup) using a molecular data set of 21 independent loci. These data yielded a resolved phylogenetic hypothesis for extant Osteichthyes, including 1) reciprocally monophyletic Sarcopterygii and Actinopterygii, as currently understood, with polypteriforms as the first diverging lineage within Actinopterygii; 2) a monophyletic group containing gars and bowfin (= Holostei) as sister group to teleosts; and 3) the earliest diverging lineage among teleosts being Elopomorpha, rather than Osteoglossomorpha. Relaxed-clock dating analysis employing a set of 24 newly applied fossil calibrations reveals divergence times that are more consistent with paleontological estimates than those of previous studies. Establishing a new phylogenetic pattern with accurate divergence dates for bony fishes illustrates several areas where the fossil record is incomplete and provides critical new insights on diversification of this important vertebrate group.

Phylogenetic Analysis of Six-Domain Multi-Copper Blue Proteins

Multicopper blue proteins, composed of several repeated copper-binding domains similar to one-domain cupredoxin-like proteins, are found in almost all organisms. They are classified into three groups based on their two-, three- or six-domain organization. We found orthologs of chordate six-domain copper-binding proteins in animals, plants, bacteria and archaea. Phylogenetic analysis of 183 multicopper blue proteins, together with a comparison of their copper-binding sites, suggests that all modern six-domain blue proteins originated from a common ancestral six-domain protein through gene duplication and the loss of copper-binding sites caused by amino acid substitutions.

The Ideas Lab Concept, Assembling the Tree of Life, and AVAToL

In August 2011, a week-long NSF-sponsored workshop focusing on the Tree of Life (ToL) took place in Lake Placid, New York. This workshop, called AVAToL (Assembling, Visualizing, and Analyzing the Tree of Life), was the first application of NSF’s Ideas Lab concept to systematics. In this article we outline the history and motivation for the Ideas Lab approach and its application to the ToL, explain the nuts and bolts of the Ideas Lab process, and look to the potential contributions of AVAToL-funded projects to help enable the future of ToL and, more broadly, comparative biological research.

An Algorithm for Calculating the Probability of Classes of Data Patterns on a Genealogy

Felsenstein’s pruning algorithm allows one to calculate the probability of any particular data pattern arising on a phylogeny given a model of character evolution. Here we present a similar dynamic programming algorithm. Our algorithm treats the tree and model as known. The algorithm makes it feasible to calculate the probability that a randomly selected character will be a member of a particular class of character patterns. Specifically, we are interested in binning patterns by the number of parsimony steps and the set of states observed at the tips of the tree. This algorithm was developed to expand the range of data set sizes that can be used with Waddell et al.’s marginal testing approach for assessing the adequacy of a model. The algorithms introduced can also be used in likelihood calculations that correct for ascertainment biases. For example, Lewis introduced an Mkv model that corrects for the lack of constant sites. The probability of a constant pattern arising can be calculated using the algorithm that we present, or by enumerating all possible constant patterns and calculating the probability of each one. Because the number of constant data patterns is small, both methods are efficient. However, elaborations of the Mkv model (such as those in Nylander et al.) require calculating the probability of parsimony-uninformative patterns arising. For large trees and characters with many possible character states, the number of possible parsimony-uninformative patterns is immense. In these cases, the algorithms introduced here will be more efficient. The algorithm has been implemented in open source software written in C++.
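
As a point of reference for the enumeration approach described above, the following Python sketch applies the standard pruning algorithm under a symmetric k-state Mk model, sums the probabilities of the k constant patterns, and uses that sum to condition a pattern probability on being variable, as in the Mkv correction. It is not the new algorithm of this paper, which bins patterns by parsimony steps and observed states. The nested-tuple tree format and all function names are assumptions made for illustration.

```python
import math


def mk_probs(k, t):
    """Mk model: probability of ending in the same / a given different state
    after branch length t (expected changes per character)."""
    decay = math.exp(-k * t / (k - 1))
    same = 1.0 / k + (k - 1.0) / k * decay
    diff = 1.0 / k - decay / k
    return same, diff


def partials(node, tip_states, k):
    """Conditional likelihoods at a node (Felsenstein pruning).

    A tip is a name (str); an internal node is a tuple of
    (child, branch_length) pairs (hypothetical representation).
    """
    if isinstance(node, str):
        return [1.0 if s == tip_states[node] else 0.0 for s in range(k)]
    result = [1.0] * k
    for child, t in node:
        child_partial = partials(child, tip_states, k)
        same, diff = mk_probs(k, t)
        total = sum(child_partial)
        for s in range(k):
            # Sum over child states x of P(s -> x) * L_child(x).
            result[s] *= same * child_partial[s] + diff * (total - child_partial[s])
    return result


def pattern_prob(tree, tip_states, k):
    """Probability of one data pattern, with equal root frequencies 1/k."""
    return sum(p / k for p in partials(tree, tip_states, k))


def tip_names(node):
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in tip_names(child)]


def constant_prob(tree, k):
    """Enumerate the k constant patterns and sum their probabilities."""
    names = tip_names(tree)
    return sum(pattern_prob(tree, {n: s for n in names}, k) for s in range(k))


def mkv_pattern_prob(tree, tip_states, k):
    """Pattern probability conditioned on the pattern being variable."""
    return pattern_prob(tree, tip_states, k) / (1.0 - constant_prob(tree, k))


if __name__ == "__main__":
    tree = (((("A", 0.1), ("B", 0.1)), 0.05), ("C", 0.2))
    print(constant_prob(tree, 2))
    print(mkv_pattern_prob(tree, {"A": 0, "B": 0, "C": 1}, 2))
```

Enumeration is cheap here because there are only k constant patterns; the same strategy does not scale to parsimony-uninformative patterns, which is the case the abstract's algorithm targets.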

Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent

Background
Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.
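
To make the "gap as missing data" treatment concrete, here is a minimal Python sketch for two sequences under the Jukes-Cantor (JC69) model. A gapped tip is given a conditional likelihood of 1 for every state, which sums over all possible nucleotides at that position. The two-sequence setting, the JC69 model, and the function names are assumptions chosen for brevity; they are not the model analyzed in this paper.

```python
import math

STATES = "ACGT"


def jc69(t):
    """JC69 probabilities of the same / a given different base after distance t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def tip_vector(ch):
    """Conditional likelihoods at a tip; a gap or '?' is compatible with every state."""
    if ch in ("-", "?"):
        return [1.0] * 4
    return [1.0 if s == ch else 0.0 for s in STATES]


def site_likelihood(a, b, t):
    """Likelihood of one aligned column for two sequences separated by distance t."""
    same, diff = jc69(t)
    va, vb = tip_vector(a), tip_vector(b)
    total = 0.0
    for i in range(4):
        for j in range(4):
            p = same if i == j else diff
            total += 0.25 * va[i] * p * vb[j]
    return total


if __name__ == "__main__":
    for t in (0.1, 1.0):
        print(t, site_likelihood("A", "A", t), site_likelihood("A", "-", t))
```

In this two-sequence case the likelihood of a gapped column is the same for every value of t, so such columns contribute no information about the distance: the indel event itself is not modeled.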

Results
We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data.

Conclusions
Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.

Phylogenetic discordance of human and canine carcinoembryonic antigen (CEA, CEACAM) families, but striking identity of the CEA receptors will impact comparative oncology studies.

Comparative oncology aims at speeding up developments for both human and companion animal cancer patients. Following this line, carcinoembryonic antigen (CEA, CEACAM5) could be a therapeutic target not only for human but also for canine (Canis lupus familiaris; dog) patients. CEACAM5 interacts with CEA-receptor (CEAR) in the cytoplasm of human cancer cells. Our aim was, therefore, to phylogenetically verify the antigenic relationship of CEACAM molecules and CEAR in human and canine cancer.
Anti-human CEACAM5 antibody Col-1, previously applied for cancer diagnosis in dogs, reacted immunohistochemically with 23 out of 30 canine mammary cancer samples. In immunoblot analyses Col-1 specifically detected human CEACAM5 at 180 kDa in human colon cancer cells HT29, and the canine antigen at 60, 120, or 180 kDa in CF33 and CF41 mammary carcinoma cells as well as in spontaneous mammary tumors. Although according to the phylogeny canine CEACAM1 molecules should be most closely related to human CEACAM5, Col-1 did not react with canine CEACAM1, -23, -24, -25, -28 or -30 transfected into canine TLM-1 cells. By flow cytometry the Col-1 target molecule was localized intracellularly in canine CF33 and CF41 cells, in contrast to membranous and cytoplasmic expression of human CEACAM5 in HT29. Col-1 incubation had no effect on either canine or human cancer cell proliferation. Yet, Col-1 treatment decreased AKT-phosphorylation in canine CF33 cells, possibly suggestive of anti-apoptotic function, whereas Col-1 increased AKT-phosphorylation in human HT29 cells. We further report a 99% amino acid similarity of human and canine CEA receptor (CEAR) within the phylogenetic tree. CEAR could be detected in four canine cancer cell lines by immunoblot and intracellularly in 10 out of 10 mammary cancer specimens from dogs by immunohistochemistry. Whether the specific canine Col-1 target molecule may, as a functional analogue to human CEACAM5, act as a ligand to canine CEAR remains to be defined. This study demonstrates the limitations of comparative oncology due to the complex functional evolution of the different CEACAM molecules in humans versus dogs. In contrast, CEAR may be a comprehensive interspecies target for novel cancer therapeutics.