Design of a Parallel Graph-Based Protein Sequence Clustering Algorithm Mohammed Omer Haj Assayony & Nur’Aini Abdul Rashid School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
[email protected],
[email protected]
Abstract Clustering protein sequences is becoming important in helping biologists analyze the large protein sequences produced by wet lab experiments. Graph based partitioning methods which is an important and stable algorithm in computer science can be used to cluster protein sequences such that each identified subgraph can be considered as a cluster. Each cluster represents a family of protein sequences or protein sequences that shared a common attribute. Since the size of protein sequence databases increases 1.5 times yearly, a fast and efficient graph based protein sequence clustering method is much needed. We proposed a parallel approach in graph-based clustering methods by improving the performance of an existing algorithm ProtClust by using parallel methods. We presented the design of a parallel method which will be the basis of our experiments for protein sequence clustering.
1. Introduction The goal of protein sequence clustering is to group a huge number of sequences into groups such that sequences in each group are very similar to each other whereas sequences in different groups are dissimilar. This process enables biologists to study structure and functions of protein sequences efficiently and enables them to detect the functions and structure of new sequences easily. Recently developed clustering algorithm based on graph theory produced good results in clustering data sequences, especially protein clustering [1]. Those algorithms convert the clustering problem into a wellknown problem in graph theory and then use traditional algorithms and techniques to deal with. The problems in graph theory were studied by mathematicians and computer scientists in different fields and a lot of work was done in designing efficient
978-1-4244-2328-6/08/$25.00 © 2008 IEEE
algorithms for them. Computer scientists also performed a lot of effort in designing and implementing efficient parallel algorithms for many of these problems [10]. This work is about improving the performance of protein sequence clustering algorithm proposed in [9] by introducing parallel methods in the critical parts of the algorithm.
2. Related Work Clustering algorithms based on graph theory were developed recently and have succeeded in clustering a large set of protein sequences into accurate clusters depending on the similarity between the sequences. Those algorithms can be divided, based on graph properties, into four main groups: 1. Algorithms based on strongly connected components of directed graphs. 2. Algorithms based on biconnected components of undirected graphs. 3. Algorithms based on graph partitioning of weighted graphs. 4. Algorithms based on graph completeness property. Cluster of Orthologous Groups (COG), a strongly connected component based graph algorithm, was proposed by Tatusov et al. in [7]. The algorithm was used in clustering proteins from completely sequenced genomes and constructed a database of protein families called COG. The main step in the clustering process in COG is a construction of directed graphs called BeTs and the resulting cluster requires BeTs to be at least strongly connected component [3]. Bolten et al. developed a clustering algorithm for predicting the protein structure where transitivity plays a curtail role [4]. The algorithm has two interesting features where the first is that it exploits the transitive property of the similarity efficiently, as the transitive is the best feature of similarity used to detect remote
homologues proteins. The second features is the construction of clusters is dependen on detecting strongly connected components of a graph, the feature that has efficient sequential algorithms and many parallel algorithms work efficiently on different architectures. ProClust algorithm [9] introduced two improvements to enhance the resulting clusters is an extention of Bolten et al[4]. First, ProClust refines the similarity graph G by removing some edges based on based on score significance value defined by the user. Second, ProClust performs post-processing steps to merge related clusters that contains in one SCOP family. The algorithm builds profiles depending on multiple alignment for all resulting clusters contains more than 20 sequences. Depending on the profiles the algorithm decides which clusters should be merged. The next set of algorithms is based on the biconnected component of a directed graph. A Biconnected components and Articulation points based Grouping of sequences (BAG) uses two graph properties: biconnected components and articulation points to cluster a large number of protein from multiple genome [3]. The main step in this version of BAG is the determination of biconnected components and articulation points the complexity of the algorithm depends on the complexity of the algorithm used in determining them. Applying BAG on clustering entire protein sequences from different set of genomes succeeded in producing protein families in efficient way. A partitioning of a weighted graph is to partition the graph into subgraphs by removing some edges from it. The set of edges whose removal disconnected a connected graph into two subgraphs is called cut and its weight is defined as the sum of cut edges weights. The partitioning depends on removing edges whose cut weight is minimal to divide the graph into subgraphs. Other partitioning approaches add another condition: the vertices in the resulting subgraphs should be roughly equal [5]. The partitioning with the last condition is an NP-complete problem and is handled by heuristic techniques. One of the efficient heuristic algorithm dealing with was proposes by Fiduccia and Mattheyses on 1982, called FM algorithm. CGP, which depends on FM algorithm, was used to cluster all mouse proteins in SWISS-PROT which includes 4,482 entries. The resulting families were very near from InterPro families [5]. A p-quasi complete graph is a graph structure introduced by Matsuda et al. in [8]. A graph G is a pquasi complete graph if the degree of every vertex in G is at least p.( |V| - 1 ), 0 < p