Background
Understanding the evolutionary relationships of all eukaryotes on Earth remains a paramount goal of modern biology, yet analyzing homologous sequences across 1.8 billion years of eukaryotic evolution is challenging. Many existing tools for identifying orthologous genes perform poorly when rates of evolution are heterogeneous or when genes have been acquired through endosymbiotic or lateral gene transfer. Moreover, genome-scale sequencing, once the domain of large sequencing centers, has advanced to the point where small laboratories can generate the data needed for phylogenomic studies. This has opened the door to increased taxonomic sampling, as individual research groups can now conduct genome-scale projects on their favorite non-model organisms.
Results
Here we present some of the tools developed, and insights gained, as we created a pipeline that combines data mined from public databases with our own transcriptome data to study the eukaryotic tree of life. The first steps of a phylogenomic pipeline involve choosing taxa and loci, and deciding how to handle alleles, paralogs and non-overlapping sequences. Next, orthologs are aligned for analyses that include gene tree reconstruction and concatenation for supermatrix approaches. To build our pipeline, we wrote Python scripts that integrate third-party tools with custom methods. As a test case, we present the placement of five amoebae on the eukaryotic tree of life based on analyses of transcriptome data. Our scripts are available on GitHub and may be used as-is for automated, large-scale phylogenomic analyses, or adapted for use in other types of studies.
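The concatenation step mentioned above can be illustrated with a minimal sketch. This is not the authors' script: the function name `concatenate_alignments`, the taxon labels, and the toy sequences are hypothetical, and padding absent taxa with gap characters is an assumed convention (some workflows use `?` for missing data instead).

```python
from typing import Dict, List


def concatenate_alignments(
    alignments: List[Dict[str, str]], gap: str = "-"
) -> Dict[str, str]:
    """Concatenate per-gene alignments into a single supermatrix.

    Each alignment maps a taxon name to its aligned sequence. A taxon
    that is absent from a gene is padded with gap characters so that
    every row of the supermatrix has the same length.
    """
    # Collect every taxon seen in any gene alignment.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}

    for aln in alignments:
        # All sequences within one alignment must share a single length.
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must share one length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the aligned sequence if present, otherwise a gap block.
            parts[taxon].append(aln.get(taxon, gap * width))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy input: two gene alignments with partly overlapping taxa.
gene_a = {"taxon1": "MKV-L", "taxon2": "MKVAL"}
gene_b = {"taxon2": "GGT", "taxon3": "GAT"}
matrix = concatenate_alignments([gene_a, gene_b])
```

In this sketch, taxa missing from a gene still appear in the supermatrix, which reflects the kind of decision about non-overlapping sequences that a pipeline has to make explicitly.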
Conclusion
Analyses on the scale of all eukaryotes present challenges not necessarily found in studies of more closely related organisms. Our approach should be relevant to researchers whose phylogenetic questions are not fully addressed by existing third-party tools.