# Abstract

Phylogenies of multi-domain proteins have to incorporate macro-evolutionary events, which dramatically increases the complexity of their construction.

We present an application to infer ancestral multi-domain proteins given a species tree and domain phylogenies. As the individual domain phylogenies are often incongruent, we provide diagnostics for the identification and reconciliation of implausible topologies. We implement and extend an algorithmic approach suggested by Behzadi and Vingron (2006).

# Funding Statement

The authors have no support or funding to report.

## Introduction

Domains characterize proteins structurally, evolutionarily and functionally [1]. More than half of the proteins in prokaryotes and about 80 percent of the proteins in eukaryotes are composed of multiple domains [2]. About 200 domains in eukaryotes occur in diverse architectures [3] and provide a challenge for phylogenetic inference, as proteins can be composed of non-homologous elements. Evolutionary events such as the fusion of proteins or the loss of domains need to be considered in phylogenetic analyses of multi-domain proteins (MDPs).

Behzadi and Vingron put forward an iterative procedure (BV) to reconstruct ancestral domain compositions using the phylogenetic relationship within domain families [4]. See Figure 1 for an overview. Their algorithm minimizes the number of the macro-evolutionary events *protein fusion* and *domain loss* using a set-theoretic formulation and is independent of the order of the domains in the proteins.

The algorithm consists of the following steps:

Previous analyses of MDPs have decided against the use of phylogenetic trees for domains [5] or relied on establishing phylogenies only for domain trees with high internal bootstrap support [6]. Recently, an alternative approach for the reconstruction of MDPs including domain trees was proposed [7].

We implemented BV, tested it, and identified critical issues that need to be addressed for the successful reconstruction of MDP phylogenies with BV. Owing to the large number of possible domain combinations, a good set of partitions cannot be found by brute-force enumeration. We implemented a heuristic called weak edge erosion, which yields close-to-optimal solutions faster than the simulated annealing suggested in [4]. In practical applications, most domain trees turned out to be incongruent with each other and with the species tree. We implemented a simple procedure to detect and rectify problematic cases.

In the following, we present our findings in detail, provide an implementation and show how to use it in practice. First, BV and the individual improvements are introduced formally.

## Methods

**The algorithm by Behzadi and Vingron**

The algorithm BV uses the information of domain trees and the composition of extant proteins to infer the domain composition of ancestral proteins [4]. The original publication contains a worked example recommended for further study. Let be a species tree in which is the parent species of and . Each species, e.g. , is assigned a set of domains that belong to a domain family . Let be a partition of called a *domain composition family* of *domain compositions*. As these are sets, the ordering of domains within a gene is ignored.

Phylogenetic trees are inferred for each domain family individually. Reconciling domain trees for each domain family using the known species tree assigns domain nodes to each species node in . The nodes which have a direct child in at least one of the descendant species are the elements of . Let be the set relabeled in a way that each domain in receives the name of its ancestral domain in , and let be the according domain composition family. If there is a duplication event in , the two domains map to the same ancestral domain in , thus . Since is known only for the leaves, the problem is to find partitions for the inner nodes, i.e. domain composition families of ancestral species that minimize some cost

which can be arbitrarily defined.

Behzadi and Vingron suggest an additive measure given by

thus we solve

where measures the number of deletions and merges that are necessary to transform the elements of , the domain compositions of the parent, to the child domain composition . Ignore all that do not contain any domain in and store the indices of the remaining in

The number of merges to have all domains of in one set is therefore . The resulting set could contain other domains that need to be deleted as they are not in the child domain composition; their number is

Assigning costs for union and deletion yields the partitioning score

The entire tree is reconstructed in a bottom-up pass including the root.
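The merge and deletion counts described above can be illustrated with a minimal Python sketch. The function names `transform_cost` and `partition_score`, the representation of partitions as lists of sets, and the unit costs for merging and deletion are assumptions made for illustration; they are not taken from the BV publication or from our implementation.

```python
def transform_cost(parent_partition, child_composition,
                   merge_cost=1.0, delete_cost=1.0):
    """Cost of deriving one (relabeled) child composition from the parent.

    parent_partition: iterable of sets of ancestral domain names.
    child_composition: set of ancestral domain names after relabeling.
    The unit costs are illustrative assumptions.
    """
    child = frozenset(child_composition)
    # Keep only the parent sets that share at least one domain with the child.
    touched = [frozenset(block) for block in parent_partition
               if child & frozenset(block)]
    if not touched:
        return 0.0
    # Merging k sets into one takes k - 1 unions.
    merges = len(touched) - 1
    # Domains that arrive with the merged sets but are absent in the child.
    merged = frozenset().union(*touched)
    deletions = len(merged - child)
    return merge_cost * merges + delete_cost * deletions


def partition_score(parent_partition, child_families):
    """Additive score: sum the cost over all compositions of all children."""
    return sum(transform_cost(parent_partition, composition)
               for family in child_families
               for composition in family)
```

For a parent partition `[{"a", "b"}, {"c"}, {"d", "e"}]`, the child composition `{"a", "c"}` requires one merge and one deletion (of `b`), whereas `{"d", "e"}` is obtained at no cost.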

**Heuristics for partitioning**

As the number of possible partitions for domains into genes grows rapidly, complete enumeration is only possible for the simplest cases and not suitable for real applications. For BV, it was suggested to use simulated annealing to solve the partitioning problem [4]. We explored a deterministic algorithm we call weak edge erosion (see below) to find a suitable partition.
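To give a sense of this growth: the number of ways to partition a set of n elements is the Bell number B(n). The following short sketch computes Bell numbers with the Bell triangle; the function name `bell_numbers` is chosen here for illustration only.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], the numbers of set partitions."""
    row = [1]          # current row of the Bell triangle
    bells = [1]        # B(0) = 1
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells
```

Ten domains can already be partitioned in `bell_numbers(10)[-1]` = 115975 ways, which makes exhaustive enumeration impractical for realistic inputs.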

**Weak Edge Erosion**

Weak edge erosion is a hierarchical graph clustering method based on the idea of attacking a network of affinities between elements at its weakest points and recursively creating clusters by separating network components. The affinity graph is defined as follows:

Consider an undirected loop-free graph with a vertex set . An edge between vertices is assigned a cost according to the number of sets in the child partitions that contain as a subset after relabeling. Note that each set induces a clique in this affinity graph. The edge weight measures the affinity of two vertices to occur together in a set. Thus removing edges corresponds to breaking the affinity between elements. To find a good approximation for the partitioning problem, we let store in a tree node , cut the graph and store the resulting components in child nodes of . These nodes are processed recursively until the vertex set in a node scores 0 with the score . Then the nodes are checked in a bottom-up pass as to whether their subset or the subsets of their children score better, and this solution is passed to the parent.
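As an illustration of the affinity graph, the sketch below counts for every pair of ancestral domains how many relabeled child compositions contain both. The function name `affinity_graph` and the dictionary representation of weighted edges are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def affinity_graph(child_families):
    """Edge weights of the affinity graph.

    child_families: one list of relabeled compositions (sets) per child.
    Returns a dict mapping a sorted vertex pair (u, v) to the number of
    compositions that contain both u and v.
    """
    weights = Counter()
    for family in child_families:
        for composition in family:
            # Every composition induces a clique on its domains.
            for u, v in combinations(sorted(set(composition)), 2):
                weights[(u, v)] += 1
    return dict(weights)
```

With the child families `[[{"a", "b", "c"}, {"d"}], [{"a", "b"}, {"c", "d"}]]`, the pair `("a", "b")` receives weight 2 and all other occurring pairs receive weight 1.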

Cluster boundaries are induced by regions sparse in edges. But dividing a set to create partitions translates to cutting through a large number of edges in a clique, so minimal cuts and related concepts of connectivity are of limited use. Particularly, min-cut tends to separate single vertices from cliques, thus creating a suboptimal partition.

To obtain meaningful cuts for our purpose, we introduce the concept of *weak edges*. Let each vertex be labeled by the sum of weights of all its incident edges. From the perspective of an edge, high vertex weights mean that the vertices have a strong connection to a set of other vertices, so edges are regarded as weaker the higher their vertex weights are. We thus define a total order of weakness: let be two edges, and be two vertices such that. is weaker than , denoted by , if the corresponding edge has a lower weight:

This first condition reflects our preference for violating weaker affinities first. If the edge weights are equal, we want to exploit the weakening effect of high vertex weights, thus the weakness order is determined by the heavier vertices:

If these are equal as well, the relation between the lighter vertices decides:

The concept of edge weakness is illustrated in Figure 2. As the cost function is additive, we can find an approximate solution to the partitioning problem by splitting into nested recursively and combining the local results. We use Algorithm 1 to find a good partition of with a cost . The weakest edges are removed until the graph is decomposed into two or more components. A cut tree of is built such that a node contains and its children contain the connected components. See Figure 3 for an example.
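The three-level weakness order described above can be expressed as a lexicographic sort key. The sketch below is an illustration under assumed names (`vertex_weights`, `weakness_key`, `weakest_edges`) and an assumed edge representation as a dict from vertex pairs to weights.

```python
def vertex_weights(edges):
    """Label every vertex with the sum of its incident edge weights."""
    weights = {}
    for (u, v), w in edges.items():
        weights[u] = weights.get(u, 0) + w
        weights[v] = weights.get(v, 0) + w
    return weights


def weakness_key(edge, edges, vweights):
    """Smaller key means weaker edge.

    1. lower edge weight,
    2. on ties, higher weight of the heavier end vertex,
    3. on further ties, higher weight of the lighter end vertex.
    """
    u, v = edge
    heavy = max(vweights[u], vweights[v])
    light = min(vweights[u], vweights[v])
    return (edges[edge], -heavy, -light)


def weakest_edges(edges):
    """All edges that share the minimal weakness key."""
    vweights = vertex_weights(edges)
    keys = {e: weakness_key(e, edges, vweights) for e in edges}
    minimum = min(keys.values())
    return sorted(e for e, k in keys.items() if k == minimum)
```

For the edges `{("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 2, ("b", "d"): 2}`, both weight-1 edges have a heavier end vertex of weight 4, so the lighter vertices decide and `("b", "c")` is the weakest edge.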

Figure 2. Weights are depicted by the multiplicity of lines. The edge has a weight of 1, node has a combined weight of incident edges of 3, and a of 2. The upper row contains the heavier nodes incident to the central edge. The lower the edge weight, the weaker this edge, e.g.

If edges tie, such as , and , the heavier nodes decide:

If the upper nodes tie as well, as for , the weights of the lighter nodes decide:

Walking through the cut-tree in a bottom-up procedure, we decide whether the block in the parent node or the two set blocks in its children yield a better score , and pass this local solution upwards until the root contains an approximate solution for the partitioning problem.
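The recursion and the bottom-up comparison can be sketched as follows. This is an illustrative sketch and not a transcription of Algorithm 1: the function names, the unit costs, and in particular `local_score`, which evaluates a set of blocks only against the child compositions that touch them, are assumptions made to keep the example self-contained.

```python
def vertex_weights(edges):
    """Sum of incident edge weights for every vertex."""
    weights = {}
    for (u, v), w in edges.items():
        weights[u] = weights.get(u, 0) + w
        weights[v] = weights.get(v, 0) + w
    return weights


def weakness(edge, edges, vweights):
    """Sort key of the weakness order: smaller means weaker."""
    a, b = vweights[edge[0]], vweights[edge[1]]
    return (edges[edge], -max(a, b), -min(a, b))


def components(vertices, edges):
    """Connected components as a list of frozensets."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, result = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(neighbours[node] - comp)
        seen |= comp
        result.append(frozenset(comp))
    return result


def split(vertices, edges):
    """Erode all currently weakest edges until the graph is disconnected."""
    edges = dict(edges)
    while True:
        parts = components(vertices, edges)
        if len(parts) > 1:
            return parts
        vweights = vertex_weights(edges)  # recalculated in every round
        keys = {e: weakness(e, edges, vweights) for e in edges}
        weakest = min(keys.values())
        edges = {e: w for e, w in edges.items() if keys[e] != weakest}


def local_score(blocks, compositions, merge_cost=1, delete_cost=1):
    """Merges plus deletions for all compositions touching the blocks."""
    covered = frozenset().union(*blocks)
    total = 0
    for comp in compositions:
        if not (comp & covered):
            continue
        touched = [b for b in blocks if b & comp]
        merged = frozenset().union(*touched)
        total += merge_cost * (len(touched) - 1)
        total += delete_cost * len(merged - comp)
    return total


def erode(vertices, edges, compositions):
    """Return (blocks, score): keep a block or the solution of its parts."""
    whole = [frozenset(vertices)]
    whole_score = local_score(whole, compositions)
    if whole_score == 0 or len(whole[0]) == 1:
        return whole, whole_score
    solution = []
    for part in split(whole[0], edges):
        inner = {e: w for e, w in edges.items()
                 if e[0] in part and e[1] in part}
        solution.extend(erode(part, inner, compositions)[0])
    split_score = local_score(solution, compositions)
    if split_score < whole_score:
        return solution, split_score
    return whole, whole_score
```

With the compositions `{a, b}` (twice), `{c, d}` (twice) and `{b, c}` (once), the single weak edge between `b` and `c` is eroded first, and the blocks `{a, b}` and `{c, d}` are kept because splitting them further would not lower the score.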

Figure 3. Each node represents a set block of the vertices it contains. Vertex weights are denoted by the number of peripheries. Note that this weight changes during the process according to the sum of incident edges (e.g. in vertex ). The weakest edges that are attacked during an erosion are denoted by dashed lines. Final erosions of the remaining cliques in the leaves are omitted, since they do not yield a better score. In the root, the weakest edges are and , as their weights are 1, and have the maximum number of incident edge weights:

and is the maximum lighter vertex adjacent to and :

However, this does not suffice to disconnect the graph, so vertex weights are recalculated:

,

and all edges with a weight of 1 are removed, and . Relying on min-cut instead of erosion would fail to create set blocks , and , as it would cut off , and as single vertices.

The complete procedure is summarized in Algorithm 1:

**Simulated annealing**

The original authors of BV propose simulated annealing to solve the underlying partitioning problem [8]. We implemented a dynamic cooling schedule, which sets most parameters automatically [9]. Starting from a valid domain configuration, the fusion and fission of groups of domains and the swap of individual domains are used to generate related domain compositions. The simulated annealing procedure is then used to minimize the score of the domain composition.
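A minimal sketch of such an annealing loop is given below. It deliberately replaces the dynamic cooling schedule with a simple geometric schedule, and all names and parameter values (`neighbour`, `anneal`, `steps`, `t_start`, `alpha`) are assumptions for illustration rather than settings of our implementation. The score function is passed in by the caller.

```python
import math
import random


def neighbour(partition, rng):
    """Propose a related partition by fusion, fission or a swap."""
    blocks = [set(b) for b in partition]
    move = rng.choice(["fuse", "split", "swap"])
    if move == "fuse" and len(blocks) > 1:
        a, b = rng.sample(range(len(blocks)), 2)
        blocks[a] |= blocks[b]
        del blocks[b]
    elif move == "split":
        big = [b for b in blocks if len(b) > 1]
        if big:
            block = rng.choice(big)
            size = rng.randint(1, len(block) - 1)
            part = set(rng.sample(sorted(block), size))
            block -= part          # in-place: shrinks the block in `blocks`
            blocks.append(part)
    elif move == "swap" and len(blocks) > 1:
        a, b = rng.sample(range(len(blocks)), 2)
        x = rng.choice(sorted(blocks[a]))
        y = rng.choice(sorted(blocks[b]))
        blocks[a].remove(x)
        blocks[b].remove(y)
        blocks[a].add(y)
        blocks[b].add(x)
    return [frozenset(b) for b in blocks]


def anneal(partition, score, steps=2000, t_start=2.0, alpha=0.995, seed=0):
    """Simulated annealing with geometric cooling (a simplification)."""
    rng = random.Random(seed)
    current = [frozenset(b) for b in partition]
    current_score = score(current)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = score(candidate) - current_score
        # Always accept improvements; accept deteriorations with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
        temperature *= alpha
    return best, best_score
```

The loop keeps track of the best partition seen so far, so the returned score is never worse than that of the starting configuration.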

**Identification and curation of implausible domain trees**

Due to their short length, the inferred domain phylogenies often disagree with each other and with the species tree. The BV algorithm was proposed for ideal data and does not consider errors in the underlying domain topologies. A practical consequence of such errors is that the order of speciation and duplication events between adjacent domains does not agree. These conflicts can lead to duplicate nodes in the reconstructed composition, for example nodes that have successors within the same protein. The BV algorithm will then produce MDPs with a high partition score due to additional copies of the conflicting domains. Our algorithm aims to produce a partition with an improved score by applying nearest neighbor interchanges on the conflicting domain trees. For each modification the ancestral composition is reconstructed and the modification with the lowest score is used as a correction.
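The correction step can be illustrated with a small sketch that enumerates nearest neighbor interchanges on a rooted binary tree and keeps the candidate with the lowest score. Trees are represented as nested 2-tuples with strings as leaves; this representation, the function names, and the caller-supplied `score` function (standing in for a full reconstruction of the ancestral composition) are assumptions for illustration.

```python
def nni_neighbours(tree):
    """All rooted trees one nearest neighbor interchange away from `tree`."""
    if not isinstance(tree, tuple):
        return []                      # a leaf has no internal edge
    left, right = tree
    result = []
    if isinstance(left, tuple):
        a, b = left
        # Exchange the sibling subtree `right` with a or with b.
        result += [((right, b), a), ((a, right), b)]
    if isinstance(right, tuple):
        a, b = right
        # Exchange the sibling subtree `left` with a or with b.
        result += [(a, (left, b)), (b, (a, left))]
    # Interchanges around edges deeper in the tree.
    result += [(t, right) for t in nni_neighbours(left)]
    result += [(left, t) for t in nni_neighbours(right)]
    return result


def best_correction(tree, score):
    """Return the tree or NNI neighbour with the lowest score."""
    candidates = [tree] + nni_neighbours(tree)
    return min(candidates, key=score)
```

A rooted binary tree with n leaves has 2(n - 2) such neighbours, so the three-leaf tree `(("A", "B"), "C")` yields exactly two alternatives.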

**Simulation of MDP phylogenies**

Species trees were generated following the Yule-Harding model. An initial domain composition of three families in a single protein was placed in the root and passed to its children, whose protein domain compositions underwent evolutionary events of genes (fusion, duplication) and domains (gain, duplication, loss). Subtree pruning and regrafting (SPR) operations were applied on the domain trees to create perturbed input data.
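Tree shapes under the Yule-Harding model arise when every speciation splits a lineage chosen uniformly at random among the current leaves. The following sketch generates such a topology; the function name `yule_tree`, the integer node identifiers, and the returned data structure are assumptions for illustration, and branch lengths are not modelled.

```python
import random


def yule_tree(n_leaves, seed=None):
    """Grow a rooted binary topology by repeatedly splitting a random leaf.

    Returns (children, leaves): `children` maps each internal node id to
    its two child ids, `leaves` lists the ids of the extant species.
    """
    rng = random.Random(seed)
    children = {}
    leaves = [0]            # start with the root as the only lineage
    next_id = 1
    while len(leaves) < n_leaves:
        # Pick one extant lineage uniformly and let it speciate.
        node = leaves.pop(rng.randrange(len(leaves)))
        children[node] = (next_id, next_id + 1)
        leaves.extend([next_id, next_id + 1])
        next_id += 2
    return children, leaves
```

A call such as `yule_tree(50, seed=1)` yields a topology with 50 leaves and 49 internal nodes, matching the number of taxa used in the performance test.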

**Construction of domain phylogenies**

For an empirical evaluation we ran global models of the PFAM database [10] with HMMER’s hmmpfam and hmmalign [11] against the UniProt/Swiss-Prot database [12] to identify domains and construct alignments. Maximum Likelihood trees were inferred from the alignments using PhyML [13]. The trees were rooted using Notung [14].

## Results and Discussion

As we cannot obtain true ancestral multi-domain protein compositions, we relied on simulations to test the validity of the algorithm and to inspect its performance when errors are introduced. The practical performance was evaluated on known instances of MDPs, available on the accompanying website. A brief, non-trivial example is presented in Fig. 1.

**Simulations**

To assess the performance of the algorithm on controlled input, species and domain phylogenies were simulated. The reconstructed domain compositions were compared to the original compositions using the partition distance measure, which ranges between zero and the size of the compositions [15]. Most ancestral partitions can be reconstructed perfectly, but not all events can be mapped correctly. For example, the gain of a new domain family and the subsequent loss in one of its immediate children cannot be reconstructed accurately. The high standard deviation suggests that a few compositions differ strongly from their simulated counterparts. SPR operations lower the quality of the reconstruction.
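The partition distance is the minimum number of elements that must be removed so that two partitions of the same set become identical; it equals the number of elements minus the best one-to-one matching of blocks by overlap. The sketch below finds that matching by brute force over permutations, which is only feasible for a handful of blocks; an assignment solver would be needed for larger inputs. The function name is an assumption for illustration.

```python
from itertools import permutations


def partition_distance(p, q):
    """Partition distance between two partitions of the same element set."""
    p = [frozenset(b) for b in p]
    q = [frozenset(b) for b in q]
    n = sum(len(b) for b in p)
    # Pad the shorter partition with empty blocks so both have equal length.
    size = max(len(p), len(q))
    p += [frozenset()] * (size - len(p))
    q += [frozenset()] * (size - len(q))
    # Best matching: maximise the number of elements kept in matched blocks.
    kept = max(sum(len(a & b) for a, b in zip(p, perm))
               for perm in permutations(q))
    return n - kept
```

For example, `[{"a", "b"}, {"c", "d"}]` and `[{"a", "b", "c"}, {"d"}]` have distance 1, since removing `c` makes both partitions identical.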

An automated correction of incongruent domain phylogenies is effective for cases with minor errors. Manual curation is advised if domain phylogenies cannot be inferred reliably. To this end, internal nodes with high scores are tagged for assessment.

**Performance**

The run times of our implementation are short. We simulated 50 replicates of input data containing 50 taxa and no regrafting operations and ran the implementation under Linux on an AMD 64 X2 3200+ processor. The erosion heuristic inferred an MDP reconstruction in 2.06 s ± 0.25 s, while the simulated annealing procedure took 153 s ± 22 s to achieve comparable solutions. In practice, the calculation of reliable domain families bounds the performance.

**Application**

JMJ domains are found in proteins involved in chromatin remodeling complexes, often together with domain families such as ARID, PHD and PLUN-1 [16]. They provide a concise example, parts of which are presented in Fig. 1. Only three species were selected for display; using a larger number of species is advisable for the inference of individual domain phylogenies.

## Conclusion

Our solution for the inference of phylogenies of multi-domain proteins provides a simple and easy-to-use interface. The weak edge erosion heuristic provides considerable speed-up over simulated annealing while maintaining comparable solution quality. Beyond the application on MDPs, such methods could be applied to reconstruction of partial homologous units such as bacterial operons or protein complexes. Future work will be directed at improving the quality of the tree reconciliation.

## Availability and requirements

The program was successfully tested under Python 2.6 and 2.7 on Windows, Mac OS X and Linux. It receives input for species and domain trees as well as parameters in Nexus format. The output can be visualized using GraphViz. The source code with additional figures and examples can be found at http://virulence.molgen.mpg.de/cocos/ and is freely available under a BSD license.

## Author contributions

Implementation: MH and JW. Weak edge erosion: JW. Domain analysis: ST, CS, RK. Example data: IK, CS. Simulations: MH and RK. Wrote the paper: MH, JW and RK.

## Competing interests

The authors have declared that no competing interests exist.

# References

- Doolittle RF. The origins and evolution of eukaryotic proteins. Philos Trans R Soc Lond B Biol Sci. 1995 Sep 29;349(1329):235-40. PubMed PMID: 8577833.
- Apic G, Gough J, Teichmann SA. An insight into domain combinations. Bioinformatics. 2001;17 Suppl 1:S83-9. PubMed PMID: 11472996.
- Basu MK, Carmel L, Rogozin IB, Koonin EV. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 2008 Mar;18(3):449-61. Epub 2008 Jan 29. PubMed PMID: 18230802; PubMed Central PMCID: PMC2259109.
- Behzadi B, Vingron M. Reconstructing Domain Compositions of Ancestral Multi-domain Proteins. Lecture Notes in Computer Science, 2006; Volume 4205/2006, 1-10

- Song N, Sedgewick RD, Durand D. Domain architecture comparison for multidomain homology identification. J Comput Biol. 2007 May;14(4):496-516. PubMed PMID: 17572026.
- Forslund K, Henricson A, Hollich V, Sonnhammer EL. Domain tree-based analysis of protein architecture evolution. Mol Biol Evol. 2008 Feb;25(2):254-64. Epub 2007 Nov 19. PubMed PMID: 18025066.
- Wiedenhoeft J, Krause R, Eulenstein O. The Plexus Model for the Inference of Ancestral Multi-Domain Proteins. IEEE/ACM Trans Comput Biol Bioinform. 2011 Jan 27. [Epub ahead of print] PubMed PMID: 21282868.
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22. Epub 2009 Nov 17. PubMed PMID: 19920124; PubMed Central PMCID: PMC2808889.
- Huang MD, Romeo F, Sangiovanni-Vincentelli A. An efficient general cooling schedule for simulated annealing. Proceedings of the IEEE International Conference on Computer-Aided Design. 1986.
- Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009 Oct;23(1):205-11. PubMed PMID: 20180275.
- UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009 Jan;37(Database issue):D169-74. Epub 2008 Oct 4. PubMed PMID: 18836194; PubMed Central PMCID: PMC2686606.
- Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003 Oct;52(5):696-704. PubMed PMID: 14530136.
- Chen K, Durand D, Farach-Colton M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J Comput Biol. 2000;7(3-4):429-47. PubMed PMID: 11108472.
- Gusfield D. Partition-distance: A problem and class of perfect graphs arising in clustering, Information Processing Letters, Volume 82, Issue 3, 16 May 2002, Pages 159-164, ISSN 0020-0190, DOI: 10.1016/S0020-0190(01)00263-0.

- Balciunas D, Ronne H. Evidence of domain swapping within the jumonji family of transcription factors. Trends Biochem Sci. 2000 Jun;25(6):274-6. PubMed PMID: 10838566.