Domains characterize proteins structurally, evolutionarily and functionally
Behzadi and Vingron put forward an iterative procedure (BV) to reconstruct ancestral domain compositions using the phylogenetic relationship within domain families
The algorithm consists of the following steps:
1. Input a species tree, a set of domain trees and extant domain compositions, i.e. a partition of the domain trees' leaf set.
2. Recursively map domain tree nodes to species tree nodes at the least common ancestor (LCA) of all child domain nodes, starting at the leaf level.
3. Recursively walk through the species tree nodes (bottom-up). In all child species, the domains are already partitioned. Establish correspondence between domain nodes in the child species and those in the current species (relabeling of domain nodes), then find an optimal partition in the current species which is closest to all child partitions in terms of the weighted number of fusions and deletions.
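The LCA mapping in the second step can be illustrated with a minimal sketch. The data layout is an assumption made for illustration only: the species tree is a child-to-parent dictionary, a domain tree is a nested tuple of leaf names, and `leaf_species` assigns each domain leaf to its extant species. None of these names come from the published implementation.

```python
def ancestors(parent, node):
    """Return the path from node up to the root, node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, a, b):
    """Least common ancestor of two species tree nodes."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def map_domain_tree(parent, domain_node, leaf_species, mapping=None):
    """Map every domain tree node to a species node, bottom-up.

    A domain node is either a leaf name (str) or a tuple of child nodes.
    Internal nodes are mapped to the LCA of the species of their children.
    """
    if mapping is None:
        mapping = {}
    if isinstance(domain_node, str):
        mapping[domain_node] = leaf_species[domain_node]
        return mapping
    species = None
    for child in domain_node:
        map_domain_tree(parent, child, leaf_species, mapping)
        child_species = mapping[child]
        species = child_species if species is None else lca(parent, species, child_species)
    mapping[domain_node] = species
    return mapping


# Hypothetical toy input: species tree ((A, B)AB, C)root.
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
leaf_species = {"d_A": "A", "d_B": "B", "d_C": "C"}
domain_tree = (("d_A", "d_B"), "d_C")
mapping = map_domain_tree(parent, domain_tree, leaf_species)
```

In this toy input the inner domain node `("d_A", "d_B")` is assigned to species `AB` and the domain root to species `root`.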
Previous analyses of MDPs have decided against the use of phylogenetic trees for domains
We implemented BV, tested it and identified critical issues that need to be addressed for successful reconstruction of phylogenies of MDPs using BV. Due to the large number of possible domain combinations, a good set of partitions cannot be found by brute-force enumeration. We implemented a heuristic called weak edge erosion, which yields close to optimal solutions faster than simulated annealing suggested by
In the following, we present our findings in detail, provide an implementation and show how to use it in practice. First, BV and the individual improvements are introduced formally.
The algorithm BV uses the information of domain trees and the composition of extant proteins to infer the domain composition of ancestral proteins
Phylogenetic trees are inferred for each domain family individually. Reconciling domain trees for each domain family using the known species tree assigns domain nodes to each species node
which can be arbitrarily defined.
Behzadi and Vingron suggest an additive measure given by
thus we solve
where
The number of merges to have all domains of
Assigning costs for union and deletion yields the partitioning score
The entire tree is reconstructed in a bottom-up pass including the root.
As the number of possible partitions for domains into genes grows rapidly, complete enumeration is only possible for the simplest cases and not suitable for real applications. For BV, it was suggested to use simulated annealing to solve the partitioning problem
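To make the growth concrete: the number of ways to partition n labelled elements into non-empty blocks is the Bell number B_n. The short sketch below computes Bell numbers with the Bell triangle; the function name is ours and is not part of the published program.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above to the one on its left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


counts = bell_numbers(10)
```

Ten domains already admit `counts[10] == 115975` partitions, which illustrates why exhaustive enumeration is limited to very small instances.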
Weak edge erosion is a hierarchical graph clustering method based on the idea of attacking a network of affinities between elements at its weakest points and recursively creating clusters by separating network components. The affinity graph is defined as follows:
Consider an undirected loop-free graph
Cluster boundaries are induced by regions sparse in edges. However, dividing a set to create partitions translates to cutting through a large number of edges in a clique, so minimal cuts and related concepts of connectivity are of limited use. In particular, min-cut tends to separate single vertices from cliques, thus creating a suboptimal partition.
To obtain meaningful cuts for our purpose, we introduce the concept of
This first condition reflects our preference for violating weaker affinities first. If the edge weights are equal, we want to exploit the weakening effect of high vertex weights, thus the weakness order is determined by the heavier vertices:
If these are equal as well, the relation between the lighter vertices decides:
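The three-level comparison can be sketched as a lexicographic sort key. This is an illustration under stated assumptions, not the published implementation: edges are stored in a dictionary keyed by vertex pairs, a vertex weight is taken to be the sum of its incident edge weights, and a higher vertex weight is assumed to make an edge weaker at the second and third level.

```python
def vertex_weights(edges):
    """Vertex weight = sum of the weights of all incident edges (assumed)."""
    weights = {}
    for (u, v), w in edges.items():
        weights[u] = weights.get(u, 0) + w
        weights[v] = weights.get(v, 0) + w
    return weights


def weakness_key(edge, edges, vweights):
    """Smaller key = weaker edge.

    1. lower edge weight first,
    2. on ties, the edge whose heavier endpoint is heavier,
    3. on further ties, the edge whose lighter endpoint is heavier.
    """
    u, v = edge
    heavy = max(vweights[u], vweights[v])
    light = min(vweights[u], vweights[v])
    return (edges[edge], -heavy, -light)


def weakest_edges(edges):
    """Return all edges that share the minimal weakness key, sorted."""
    vweights = vertex_weights(edges)
    keys = {e: weakness_key(e, edges, vweights) for e in edges}
    best = min(keys.values())
    return sorted(e for e, k in keys.items() if k == best)


# Hypothetical path graph a-b-c-d-e with edge weights 2, 1, 1, 3.
edges = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 3}
weakest = weakest_edges(edges)
```

Both weight-1 edges tie at the first level; the edge `("c", "d")` is selected because its heavier endpoint `d` (weight 4) outweighs the heavier endpoint `b` (weight 3) of the other edge.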
The concept of edge weakness is illustrated in Figure 2. As the cost function is additive, we can find an approximate solution to the partitioning problem by splitting
Weights are depicted by the multiplicity of lines. The edge
If edges tie, such as
If the upper nodes tie as well, as for
Walking through the cut-tree in a bottom-up procedure, we decide whether the block in the parent node or the two set blocks in its children yield a better score
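A minimal sketch of this bottom-up decision follows. It assumes that a cut-tree node is a triple `(block, left, right)` with `None` children at the leaves, and that the score is an additive cost where lower is better. The cost function `toy_cost` is purely hypothetical and merely stands in for the partitioning score.

```python
from itertools import combinations


def best_partition(node, block_cost):
    """Return (cost, blocks) of the best partition below a cut-tree node.

    The children are solved first; the parent block is kept unless the
    combined solutions of its two children are strictly cheaper.
    """
    block, left, right = node
    own_cost = block_cost(block)
    if left is None or right is None:
        return own_cost, [block]
    left_cost, left_blocks = best_partition(left, block_cost)
    right_cost, right_blocks = best_partition(right, block_cost)
    if left_cost + right_cost < own_cost:
        return left_cost + right_cost, left_blocks + right_blocks
    return own_cost, [block]


# Hypothetical affinities: a-b and c-d belong together.
linked = {frozenset("ab"), frozenset("cd")}


def toy_cost(block):
    """One unit per block plus two units per unlinked pair inside it."""
    unlinked = sum(1 for pair in combinations(sorted(block), 2)
                   if frozenset(pair) not in linked)
    return 1 + 2 * unlinked


def leaf(name):
    return (frozenset(name), None, None)


tree = (frozenset("abcd"),
        (frozenset("ab"), leaf("a"), leaf("b")),
        (frozenset("cd"), leaf("c"), leaf("d")))
cost, blocks = best_partition(tree, toy_cost)
```

With this toy cost the blocks `{a, b}` and `{c, d}` are kept intact, whereas the root block is split because it contains four unlinked pairs. Because the cost is additive over blocks, each subtree can be decided independently of the rest of the tree.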
Each node represents a set block of the vertices it contains. Vertex weights are denoted by the number of peripheries. Note that this weight changes during the process according to the sum of incident edges (e.g. in vertex
and
However, this does not suffice to disconnect the graph, so vertex weights are recalculated:
and all edges with a weight of 1 are removed,
The complete procedure is summarized in Algorithm 1:
The original authors of BV propose simulated annealing to solve the underlying partitioning problem
Due to the short length of domain sequences, the inferred domain phylogenies often disagree with each other and with the species tree. The BV algorithm was proposed for ideal data and does not consider errors in the underlying domain topologies. A practical consequence of such errors is that the order of speciation and duplication events between adjacent domains does not agree. These conflicts can lead to duplicate nodes in the reconstructed composition, for example nodes that have successors within the same protein. The BV algorithm will then produce MDPs with a high partition score due to additional copies of the conflicting domains. Our algorithm aims to produce a partition with an improved score by applying nearest neighbor interchanges on the conflicting domain trees. For each modification the ancestral composition is reconstructed and the modification with the lowest score
Species trees were generated following the Yule-Harding model. An initial domain composition of three families in a single protein was placed in the root and passed to its children, whose protein domain compositions underwent evolutionary events of genes (fusion, duplication) and domains (gain, duplication, loss). Subtree pruning and regrafting (SPR) operations were applied on the domain trees to create perturbed input data.
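As an illustration of the first simulation step, a Yule-Harding topology can be grown by repeatedly choosing a leaf uniformly at random and splitting it into two children. The sketch below generates the topology only (no branch lengths and no evolutionary events on domains); the function name and the tree representation are assumptions and do not reflect the simulation code used for the paper.

```python
import random


def yule_tree(n_leaves, rng=None):
    """Grow a random binary tree topology with n_leaves leaves.

    Returns (children, leaves): children maps every node id to the list
    of its child ids, leaves lists the ids of the current leaf nodes.
    """
    if n_leaves < 1:
        raise ValueError("n_leaves must be at least 1")
    rng = rng or random.Random()
    children = {0: []}  # node 0 is the root
    leaves = [0]
    next_id = 1
    while len(leaves) < n_leaves:
        # Pick a leaf uniformly at random and let it speciate.
        node = leaves.pop(rng.randrange(len(leaves)))
        left, right = next_id, next_id + 1
        next_id += 2
        children[node] = [left, right]
        children[left] = []
        children[right] = []
        leaves.extend([left, right])
    return children, leaves


children, leaves = yule_tree(50, random.Random(1))
```

A tree with 50 leaves always has 99 nodes in total, since every split adds exactly two nodes and one leaf.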
For an empirical evaluation we ran global models of the PFAM database
As we cannot obtain true ancestral multi-domain protein compositions, we relied on simulations to test the validity of the algorithm and to inspect its performance when errors are introduced. The practical performance was evaluated on known instances of MDPs, available on the accompanying website. A brief, non-trivial example is presented in Fig. 1.
To assess the performance of the algorithm on controlled input, species and domain phylogenies were simulated. The reconstructed domain compositions were compared to the original compositions using the partition distance measure, which ranges between zero and the size of the compositions
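For small compositions, a partition distance can be computed by brute force. The sketch assumes the common definition of the measure, namely the minimum number of elements that must be removed so that two partitions of the same set become identical. Whether this matches the exact variant used in the evaluation is an assumption. The function name is ours.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of elements to remove so that p and q coincide.

    p and q are lists of blocks over the same element set. The blocks are
    matched one-to-one so that the total overlap is maximal; all elements
    outside the matched overlaps have to be removed. Brute force over all
    matchings, hence only suitable for a handful of blocks.
    """
    p = [set(block) for block in p]
    q = [set(block) for block in q]
    n_elements = sum(len(block) for block in p)
    size = max(len(p), len(q))
    # Pad the shorter partition with empty blocks to allow a full matching.
    p += [set()] * (size - len(p))
    q += [set()] * (size - len(q))
    best_overlap = max(
        sum(len(p[i] & q[j]) for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return n_elements - best_overlap


true_composition = [{"a", "b"}, {"c", "d"}]
reconstructed = [{"a", "b", "c"}, {"d"}]
distance = partition_distance(true_composition, reconstructed)
```

Here the distance is 1, because removing element `c` makes both partitions identical. For larger inputs the matching step would normally be solved as an assignment problem rather than by enumeration.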
An automated correction of incongruent domain phylogenies is effective for cases with minor errors. Manual curation is advised if domain phylogenies cannot be inferred reliably. To this end, internal nodes with high scores are tagged for assessment.
The run times of our implementation are short. We simulated 50 replicates of input data containing 50 taxa and no regrafting operations and ran the implementation under Linux on an AMD 64 X2 3200+ processor. The erosion heuristic inferred an MDP reconstruction in 2.06 s ± 0.25 s, while the simulated annealing procedure took 153 s ± 22 s to achieve comparable solutions. In practice, the calculation of reliable domain families bounds the performance.
JMJ domains are found in proteins involved in chromatin remodeling complexes often together with domain families such as ARID, PHD and PLUN-1
Our solution for the inference of phylogenies of multi-domain proteins provides a simple and easy-to-use interface. The weak edge erosion heuristic provides considerable speed-up over simulated annealing while maintaining comparable solution quality. Beyond the application to MDPs, such methods could be applied to the reconstruction of partial homologous units such as bacterial operons or protein complexes. Future work will be directed at improving the quality of the tree reconciliation.
The program was successfully tested under Python 2.6 and 2.7 on Windows, Mac OS X and Linux. It receives input for species and domain trees as well as parameters in Nexus format. The output can be visualized using GraphViz. The source code with additional figures and examples can be found at
Implementation: MH and JW. Weak edge erosion: JW. Domain analysis: ST, CS, RK. Example data: IK, CS. Simulations: MH and RK. Wrote the paper: MH, JW and RK.
The authors have declared that no competing interests exist.