Comparative interactomics


FEBS Letters 579 (2005) 1828–1833

FEBS 29325

Minireview

Comparative interactomics

Gianni Cesareni*,1, Arnaud Ceol, Caius Gavrila, Luisa Montecchi Palazzi, Maria Persico, Maria Victoria Schneider

Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica, 00133 Rome, Italy

Accepted 31 January 2005
Available online 8 February 2005

Edited by Robert Russell and Giulio Superti-Furga

Abstract

The behavior, morphology and response to stimuli in biological systems are dictated by the interactions between their components. These interactions, as we observe them now, are therefore shaped by genetic variations and selective pressure. Similar to what has been achieved by comparing genome structures and protein sequences, we hope to obtain valuable information about systems' evolution by comparing the organization of interaction networks and by analyzing their variation and conservation. Equally significantly, we can learn whether and how to extend the network information obtained experimentally in well-characterized model systems to different organisms. We conclude from our analysis that, despite the recent completion of several high throughput experiments aimed at the description of complete interactomes, the available interaction information is not yet of sufficient coverage and quality to draw any biologically meaningful conclusion from the comparison of different interactomes. Thus, the transfer of network information obtained from simple organisms to evolutionarily distant species should be carried out and considered with caution. By using smaller, higher-confidence datasets, a larger fraction of interactions is shown to be conserved; this suggests that with the development of more accurate experimental and informatic approaches, we will soon be in a position to study network evolution.

© 2005 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.

Keywords: Protein interaction; Evolution; Networks; High-throughput experiment

1. Introduction

In recent years sequencing projects have produced the sequence of entire genomes and of the corresponding inferred proteomes at an unprecedented rate. The assembly, organization and comparison of this mass of data has given a considerable impulse to a relatively new discipline often referred to as comparative genomics [1]. The development of high throughput data-collection approaches has in turn generated information, albeit at a much lower rate, about most of a cell's macromolecular components and about their interactions.

* Corresponding author. E-mail address: [email protected] (G. Cesareni).

1 All the authors have contributed equally to this review.

In particular, semi-automated yeast two-hybrid and complex purification approaches, complemented by a large number of 'traditional' low throughput experiments published over the years in the scientific literature, have provided for the first time a glimpse into the global interactome of some simple organisms [2–7]. Such a description of the functional and physical interactions between the several thousand proteins in a proteome, although incomplete, is a necessary step toward the understanding of a biological system. Nevertheless, even this limited task, when we plan to extend it to a number of interesting biological systems, represents a daunting endeavor if we bear in mind the significant effort already devoted to the study of the interactome of a single simple organism (Saccharomyces cerevisiae).

In this contribution, we ask whether the combination of 'comparative genomics' and 'interactomics' can produce synergistic effects, with the aim of identifying an interaction core that is conserved in evolution and of drawing more reliable protein networks. In Fig. 1, we have schematically illustrated three strategies that exploit the possibility of comparing homologous sequences from several different organisms and grouping them into families of evolutionarily related proteins (orthology groups). We can, for instance, use the notion that functional patterns, including those involved in protein interaction, are more conserved in evolution to assess the likelihood that any interaction discovered in vitro could play a physiological role. This information could then be used to obtain a more dependable protein network (Fig. 1, strategy 1). Alternatively, we may want to investigate, by comparing two experimentally determined networks, whether the connections between the proteins are conserved in evolution and to what extent we can use this conservation to infer new interactomes from existing experimentally defined protein networks (Fig. 1, strategies 2 and 3).
Most of these approaches are rather sensitive to the algorithm used to identify orthologs or paralogs originating from duplication events occurring either before or after speciation. Although this is an important issue, we will not discuss it here but refer to other authors [8–10].

0014-5793/$30.00 © 2005 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.febslet.2005.01.064

Fig. 1. Schematic representation of three approaches that take advantage of the availability of large genomic datasets to infer reliable protein interaction networks.

2. Exploiting sequence conservation and sequence variation to infer and validate protein interactions

Protein interactions pose constraints on the evolution of genome organization and of protein sequences. In fact, several informatic methods make use of these evolutionary relics to infer or validate protein interactions. For instance, one method assumes that proteins that are consistently present or absent together in different proteome sets are likely to interact functionally [11], while a second method searches proteomes for proteins that are covalently joined in a single peptide chain and interprets this fusion event as evidence that the two proteins interact either physically or functionally [12,13].

Besides this class of methods, which explore genome organization, a second class focuses on conservation or variation at the amino acid sequence level. Pazos and Valencia [14] have exploited 'correlated sequence variations' in multiple alignments of orthologous protein pairs to infer protein interactions. This is an extension of a method, dubbed 'correlated mutations', previously developed to predict proximal pairs of residues in folded proteins [15]. The underlying rationale is that pairs of residues that are part of two interacting surfaces tend to co-evolve, with changes in one protein being remedied by compensatory mutations in the partner protein. The main limitation of this approach, sometimes referred to as 'in silico 2-hybrid', is the need for alignments with good species coverage for every protein pair under study. In its favor, the method, in the cases in which it is applicable, does provide topological information about the interaction complex.

Most experimental or informatic approaches, whenever the 'interaction thresholds' are kept high, tend to overpredict. It is therefore useful to have solid filtering methods to skim an inferred network of interest by sorting out interactions that are unlikely to occur.
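The co-occurrence idea behind the first method [11] can be illustrated with a minimal Python sketch. It is not the published scoring scheme: the profiles are hypothetical toy data, the protein names are invented, and the use of a simple Hamming distance with a `max_mismatches` cutoff is an assumption made only for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def co_occurring_pairs(profiles, max_mismatches=0):
    """Return protein pairs whose profiles differ in at most max_mismatches genomes."""
    pairs = []
    for p, q in combinations(sorted(profiles), 2):
        if hamming(profiles[p], profiles[q]) <= max_mismatches:
            pairs.append((p, q))
    return pairs

# Hypothetical toy data: one column per genome,
# 1 = a homolog is present in that genome, 0 = absent.
profiles = {
    "P1": (1, 0, 1, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 1, 0, 1),
    "P4": (1, 0, 1, 0, 0),
}

strict = co_occurring_pairs(profiles)                     # identical profiles only
relaxed = co_occurring_pairs(profiles, max_mismatches=1)  # tolerate one difference
print(strict)
print(relaxed)
```

Relaxing the cutoff enlarges the candidate list, which mirrors the trade-off between coverage and overprediction discussed above.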
Whenever an interaction can be mapped to a simple peptide or a simple interaction interface, as in the case of protein interaction modules, one can apply the principle that peptides matching functional motifs are more likely to be conserved than peptides matching a random amino acid pattern of comparable complexity. In fact, by looking at the conservation of a set of functional motifs taken from the PROSITE database [16] in fourteen different yeast proteomes, one observes a much slower decrease of sequence conservation when compared to the corresponding decrease in the conservation of a comparable scrambled pattern (Fig. 2). These findings cannot be uncritically extended to the identification of peptide motifs involved in protein interaction. In this case co-evolution could force sequence divergence by compensatory changes of the domain-peptide partners. Nevertheless, by using sequence conservation as a filtering tool, we were able to increase the performance of the WISE approach [17] in the inference of the protein interaction network mediated by 14-3-3 proteins in S. cerevisiae. Furthermore, Beltrao and Serrano (personal communication), using a set of eleven yeast genomes, showed that combining comparative proteomics and secondary structure information can greatly increase the performance of consensus-based predictions of SH3 targets.

3. Network comparison

Work over the past 50 years has revealed that molecular mechanisms underlying fundamental biological processes are conserved in evolution and that models worked out from experiments carried out in simple organisms can often be extended to more complex organisms. This observation forms the basis for using interaction networks derived from experiments in model organisms to obtain information about interactions that may occur between the orthologous proteins in different organisms. Walhout et al. [18] originally proposed the interolog concept to transfer protein interactions across species, and Matthews et al. [19], after mapping yeast interactions onto the Caenorhabditis elegans proteome, verified them by yeast two-hybrid. Later Gavin et al. [2] pointed out that proteins that are part of a metazoan ortholog set preferentially interact with proteins of the same set. They also showed that the products of essential genes have a propensity to associate and that ortholog networks and essential protein networks overlap significantly.
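The interolog transfer described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in [18] or [19]: the identifiers are invented, and the choice to expand one-to-many ortholog assignments into every combination of target proteins is an assumption.

```python
def transfer_interologs(interactions, orthologs):
    """Infer target-species interactions from source-species interactions.

    interactions: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs:    dict mapping a source protein to a list of target orthologs.
    An interaction is transferred only if both partners have an ortholog.
    """
    inferred = set()
    for a, b in interactions:
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                # frozenset makes the pair unordered: (x, y) equals (y, x)
                inferred.add(frozenset((target_a, target_b)))
    return inferred

# Hypothetical toy data (identifiers are invented for illustration).
source_interactions = [("yA", "yB"), ("yB", "yC"), ("yC", "yD")]
ortholog_map = {"yA": ["wA"], "yB": ["wB1", "wB2"], "yC": ["wC"]}

inferred = transfer_interologs(source_interactions, ortholog_map)
for pair in sorted(sorted(p) for p in inferred):
    print(pair)
```

The interaction yC–yD is not transferred because yD has no assigned ortholog, which is one reason why an inferred network covers only part of the source network.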


[Figure 2: plot titled "Average conservation of PROSITE patterns". Y-axis: % pattern conservation (0 to 100). X-axis: number of species considered (2 to 15). Series: true positive matches, false positive matches, random matches.]

Fig. 2. Conservation of peptides matching functional patterns in 14 yeast species. The 14 yeast proteomes were clustered in orthology groups by the Inparanoid algorithm. All the pairwise outputs of Inparanoid were merged to generate a set of orthologous sequences for every S. cerevisiae ORF. EMMA from the EMBOSS package (http://www.rfcgr.mrc.ac.uk/Software/EMBOSS/) was then used to obtain multiple alignments of all the orthology groups. Overall, 5652 multiple alignments were generated. By parsing the prosite.dat file we then collected all the entries described by a pattern documented by at least one true positive match in S. cerevisiae (500 patterns). We next measured the conservation of S. cerevisiae peptides matching one of the patterns in multiple alignments and compared it to the conservation of peptides matching a non-functional inverted pattern obtained by simple right-left inversion (ABC becomes CBA). The conservation of pattern matching peptides classified as true positives or false positives in the Prosite database are plotted separately. Of the 500 patterns only 50 have an inverted pattern with at least one match in S. cerevisiae. The graph shows pattern conservation as a function of the number of species considered (species have been added with an order corresponding to increasing evolutionary distance from S. cerevisiae). The percentage of match conservation is the ratio of the number of species containing an exact match in the same protein sequence range divided by the number of species considered.
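The conservation measure defined in the legend of Fig. 2 can be illustrated with a short Python sketch. It is a simplified reading of the legend, not the code used for the figure. Several points are assumptions: the pattern converter handles only basic PROSITE syntax (no `<` or `>` anchors), "same protein sequence range" is interpreted as the alignment columns spanned by the reference match, the reference sequence itself is excluded from the species count, and the pattern and alignment are invented toy data.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern such as 'R-x(2)-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # x = any residue
            core = "."
        elif core.startswith("{"):       # {P} = any residue except P
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def invert_pattern(pattern):
    """Right-left inversion of the pattern elements (ABC becomes CBA)."""
    return "-".join(reversed(pattern.rstrip(".").split("-")))

def match_conservation(alignment, regex):
    """Percentage of non-reference sequences with an exact match in the same
    alignment columns as the first match in the reference (first) sequence.
    Returns None if the reference has no match."""
    ref = alignment[0]
    hit = re.search(regex, ref.replace("-", ""))
    if hit is None:
        return None
    # Map ungapped reference positions back to alignment columns.
    cols = [i for i, c in enumerate(ref) if c != "-"]
    start, end = cols[hit.start()], cols[hit.end() - 1] + 1
    others = alignment[1:]
    conserved = sum(
        1 for seq in others
        if re.fullmatch(regex, seq[start:end].replace("-", ""))
    )
    return 100.0 * conserved / len(others)

# Hypothetical toy alignment; the first sequence plays the role of S. cerevisiae.
alignment = ["MKRAASLQ", "MKRGASLQ", "MKRAAPLQ", "MKR-ASLQ", "MRRAATLQ"]
pattern = "R-x(2)-[ST]-{P}"

functional = match_conservation(alignment, prosite_to_regex(pattern))
inverted = match_conservation(alignment, prosite_to_regex(invert_pattern(pattern)))
print(functional, inverted)
```

In this toy case the inverted pattern has no match in the reference sequence, so no conservation value is returned for it; the legend notes that only a minority of inverted patterns match at least once in S. cerevisiae.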

This observation is consistent with the idea that proteins that are conserved in evolution play an essential physiological role and form a central component of a universal eukaryotic core interactome. To estimate the fraction of common interactions in two distantly related organisms we set out to compare the yeast and the Drosophila networks, the only two organisms whose interactomes have been explored with systematic experimental approaches aimed at being exhaustive. The yeast experimental interaction network includes 4597 proteins connected by more than 14 000 interactions, while the Drosophila interactome covers 7000 proteins connected by 20 664 interactions.

Our biological insight tells us that many functional interactions, namely those involved in processes common to eukaryotic organisms, should be conserved. In principle, by comparing the yeast and Drosophila interactomes we should be able to estimate the fraction of conserved interactions between these two different eukaryotic organisms. However, a number of reasons could lead to underestimating the degree of network overlap. Indeed, comparison of different high throughput experiments in yeast indicates that only a small fraction of interactions are supported by more than one experiment [20]. This observation could have different explanations. One possibility is that none of the individual experiments has reached saturation. Alternatively, the different approaches may have produced a significant number of false positives or may have missed a significant number of true interactions. As a consequence, even a significant overlap between the physiological networks of two organisms may result in a much smaller one when two incomplete and inaccurate experimental networks are compared. This is illustrated schematically in Fig. 3.

To estimate the degree of overlap between the yeast and the Drosophila interactomes, we have used the Inparanoid algorithm to assign 2198 of the 20 032 Drosophila melanogaster proteins and 2000 of the 6224 S. cerevisiae proteins to 1966 orthology groups (OG). We have then created two OG interaction networks (OGN) by mapping, whenever possible, each interaction described experimentally in the two organisms to an interaction between the orthology groups of the two partner proteins. The two OGN, derived from the yeast and the Drosophila interactomes, consist of 4096 and 1024 interactions, respectively. When the two are compared only 87 OG interactions are found to be common, that is 2% and 8% of the OGN derived from each organism's experimental network. This relatively disappointing figure likely reflects the low quality of the available experimental networks. In fact, the overlap rises to 5% and 24%, which is higher than the overlap between the independent yeast high throughput experiments [20], when we use two OGN derived from Drosophila and yeast networks including a smaller number of high confidence interactions (5101 for Drosophila, 2491 for yeast).
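The mapping of protein interactions onto orthology-group networks and the overlap calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the analysis: the protein and OG identifiers are invented, each protein is assumed to belong to at most one orthology group, and interactions are treated as unordered pairs. The last lines only recompute the percentages from the counts reported in the text (87 shared interactions out of 4096 and 1024).

```python
def to_og_network(interactions, protein_to_og):
    """Map protein-protein interactions to orthology-group (OG) interactions.
    Interactions with a partner lacking an OG assignment are dropped."""
    network = set()
    for a, b in interactions:
        if a in protein_to_og and b in protein_to_og:
            network.add(frozenset((protein_to_og[a], protein_to_og[b])))
    return network

def overlap_percentages(net1, net2):
    """Return (shared count, % of net1 shared, % of net2 shared)."""
    shared = len(net1 & net2)
    return shared, 100.0 * shared / len(net1), 100.0 * shared / len(net2)

# Hypothetical toy data (identifiers invented for illustration).
yeast_interactions = [("y1", "y2"), ("y2", "y3"), ("y3", "y9")]
yeast_og = {"y1": "OG1", "y2": "OG2", "y3": "OG3"}
fly_interactions = [("f1", "f2"), ("f1", "f3"), ("f2", "f4"), ("f3", "f4")]
fly_og = {"f1": "OG1", "f2": "OG2", "f3": "OG3", "f4": "OG4"}

yeast_ogn = to_og_network(yeast_interactions, yeast_og)
fly_ogn = to_og_network(fly_interactions, fly_og)
toy_overlap = overlap_percentages(yeast_ogn, fly_ogn)
print(toy_overlap)

# Percentages recomputed from the counts reported in the text.
reported = (100.0 * 87 / 4096, 100.0 * 87 / 1024)
print([round(p, 1) for p in reported])
```

Because each percentage is taken relative to the size of one network, the same number of shared interactions gives different values for the two organisms.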

Fig. 3. The cartoon represents two interactomes in the protein interaction space. The overlap between the 'real' (physiological) and the experimentally determined protein interaction space is shown by arrows. The regions containing the false positives (FP) and false negatives (FN) are also indicated.

4. Inferring interactomes

These reservations considered, mapping the interactions experimentally determined in model organisms onto the human proteome would still represent a valuable tool to broaden our understanding of the protein mesh in such an interesting organism. To this end, we have developed HomoMINT, a web-accessible tool (http://mint.bio.uniroma2.it/mint/) that extends protein–protein interactions experimentally verified in model organisms to the orthologous proteins in Homo sapiens. Similar to other approaches [21], the orthology groups in HomoMINT are obtained by the "reciprocal best hit method" as implemented in the Inparanoid algorithm [8]. To eliminate some of the noise produced by Inparanoid, we further filtered the orthology groups by applying a string matching algorithm that identifies wrongly assigned proteins whose domain architecture differs, by a predefined value, from the domain architecture of the main human ortholog. By this approach all the proteins in the MINT database [22] (approximately 17 000) were mapped to 16 531 orthology groups. These include proteins that participate in interactions described in the high throughput datasets of yeast, Drosophila and nematode, as well as several more proteins from other species that were shown to be involved in interactions by low throughput experiments. By this approach (Fig. 4, bottom), we mapped the 40 944 interactions stored in the MINT database to 10 801 HomoMINT interactions, whenever the two partners of an interaction discovered in model organisms could both be assigned to a human orthology group. HomoMINT is updated each day to take into account revised proteome dataset identifiers and new interactions added to the database (to be described in detail elsewhere).

The human interactome has not yet been systematically investigated by high throughput experiments. Nevertheless, several low throughput experiments providing evidence of protein interactions have been published in the scientific literature over the past decades. This represents a dataset of approximately the same size as, albeit much more dependable than, those obtained from the results of high throughput experiments carried out in model organisms. However, it is not readily accessible. Recently, a number of databases have started to compile this information and organize it in a common computer-readable format [22–27]. Although this curation effort is far from being complete, by combining all the interactions currently deposited in five major databases, we were able to assemble a human interactome of 11 354 independent interactions (Table 1 and Fig. 4, top). This network is likely to have some bias in the coverage of the interaction space due to the scientific community's interest in investigating specific biological domains or to a biased selection of the journal articles compiled by the databases. Nevertheless, the graph representing the interactions deposited in the databases that we have considered shares a number of characteristics with those obtained from model organisms or with that of HomoMINT: they are all scale-free networks and have comparable diameters, average clustering coefficients and mean path lengths (unpublished observations).

The overlap between the human experimental network and the one inferred from model organisms (HomoMINT) is 369 interactions, corresponding to 3.4% of the HomoMINT interactions. This suggests that both networks only cover a small fraction of the real interactome and that either or both are affected by a large number of false positives. Most of the HomoMINT network (90%) is inferred from interactions that have been obtained by high throughput experiments, while only 10% is inferred from higher confidence experiments. Interestingly, the set of high confidence interactions covers more than 60% of the intersection between HomoMINT and the experimental network.

Fig. 4. Schematic representation of the contribution of the interaction data curated by four different databases to the assembly of the 'Experimental Human Interactome' described in this review (top part of the figure). In the lower part of the figure the numbers in reverse color in the rectangles represent the size of the model organism interactomes that were used to infer the HomoMINT network. In the small rectangles we have reported, for each model organism, the number of interactions that could be mapped to the human proteome.

Table 1
Contribution of the different protein interaction databases to the human experimental network

Database   N of Inter (b)   Intersection with (a)
                            MINT    Intact   Bind    DIP     HPRD
MINT        3086            -       1249     154     296     324
Intact      2280            1249    -        76      138     88
Bind        2645            154     76       -       115     299
DIP          970            296     138      115     -       257
HPRD        4994            324     88       299     257     -
Total      11354

(a) The figures in the columns refer to the numbers of human interactions that are common to any two databases.
(b) Unique binary interactions imported from each database.
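The kind of bookkeeping summarized in Table 1 can be sketched in Python: interactions from several sources are reduced to unordered pairs, merged into a non-redundant set, and intersected pairwise. This is only an illustrative sketch. The database names and interactions below are invented, and the assumption that identifiers are already mapped to a common namespace hides what is in practice the hardest part of merging curated databases. The final line recomputes the 3.4% overlap from the counts reported in the text (369 shared interactions out of 10 801 HomoMINT interactions).

```python
from itertools import combinations

def normalize(pairs):
    """Represent each binary interaction as an unordered pair."""
    return {frozenset(p) for p in pairs}

def merge_report(databases):
    """Return the non-redundant union and the pairwise intersection sizes."""
    sets = {name: normalize(pairs) for name, pairs in databases.items()}
    union = set().union(*sets.values())
    pairwise = {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }
    return union, pairwise

# Hypothetical toy data (database names and identifiers are invented).
databases = {
    "DB_A": [("P1", "P2"), ("P2", "P3")],
    "DB_B": [("P2", "P1"), ("P3", "P4")],
    "DB_C": [("P3", "P2"), ("P4", "P3"), ("P5", "P6")],
}

union, pairwise = merge_report(databases)
print(len(union), pairwise)

# Overlap percentage recomputed from the counts reported in the text.
homomint_overlap_pct = 100.0 * 369 / 10801
print(round(homomint_overlap_pct, 1))
```

Treating pairs as unordered matters here: ("P1", "P2") in one source and ("P2", "P1") in another count as the same interaction, so the union is smaller than the sum of the individual contributions, as is also the case for the totals in Table 1.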

5. Conclusions

The growing number of sequenced genomes has prompted the development of comparative tools that have great potential for our understanding of biological processes and their evolution. Here, we have discussed the possibility of exploiting the availability of complete catalogs of the protein products of several organisms and the increase in information about protein interactions to compare different interactomes. Systematic network comparison and discovery of conserved interactions can help us answer some interesting biological questions: Are networks of protein interactions conserved as protein sequences and structures are? Is there a minimal interaction core that is conserved in different species? If so, how large is it? We have shown that the coverage and reliability of presently available interaction datasets are not sufficient to provide a trustworthy answer to these questions and that the overlap between the networks of S. cerevisiae and D. melanogaster is smaller than one would expect. Interestingly, the percentage overlap increases when more reliable datasets are considered. This also indicates that observation of similar interactions in two networks increases the likelihood of their biological significance.

Besides these fundamental questions, one can use the observed conservation of protein interactions to transfer functional information experimentally verified in model organisms to less characterized genomes and proteomes. Yu et al. [28] assessed statistically the transferability of protein–protein and protein–DNA interactions by analyzing the relationship between sequence similarity and interaction conservation. In general, they found a sigmoidal relationship between sequence similarity and interaction conservation. They considered 14 911 interactions in H. pylori, S. cerevisiae, C. elegans and D. melanogaster and found that an interaction is very likely to be conserved in a different organism if the joint sequence identity (geometric mean of the individual sequence identities) between the protein partner pairs in the two organisms is >80%.

In this short review, we have only compared networks by looking at the fraction of conserved interactions. A fertile area of research, while waiting for more extensive and more reliable network data, is concerned with methods to compare approximately matching simple subgraph motifs by combining network topology and protein sequence similarity [29,30].

Acknowledgments: The work reported here is supported by AIRC (Associazione Italiana per la Ricerca sul Cancro) and by the European Union FP6 'Interaction Proteome'. We thank L. Serrano and P. Beltrao for sharing data before publication. We also thank G. Chillemi from CASPUR for help and generous gift of computing time.

References

[1] Huynen, M.A., Snel, B. and van Noort, V. (2004) Comparative genomics for reliable protein-function prediction from genomic data. Trends Genet. 20, 340–344.
[2] Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147.
[3] Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183.
[4] Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98, 4569–4574.
[5] Uetz, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (see comments).
[6] Li, S., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science 303, 540–543.
[7] Giot, L., et al. (2003) A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736.
[8] Remm, M., Storm, C.E. and Sonnhammer, E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052.
[9] Li, L., Stoeckert Jr., C.J. and Roos, D.S. (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189.
[10] Cannon, S.B. and Young, N.D. (2003) OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies. BMC Bioinform. 4, 35.
[11] Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96, 4285–4288.
[12] Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O. and Eisenberg, D. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science 285, 751–753.
[13] Enright, A.J., Iliopoulos, I., Kyrpides, N.C. and Ouzounis, C.A. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90.
[14] Pazos, F. and Valencia, A. (2002) In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 47, 219–227.
[15] Gobel, U., Sander, C., Schneider, R. and A (1994) Correlated mutations and residue contacts in proteins. Proteins 18, 309–317.
[16] Hulo, N., et al. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res. 32 (Database issue), D134–D137.
[17] Landgraf, C., Panni, S., Montecchi-Palazzi, L., Castagnoli, L., Schneider-Mergener, J., Volkmer-Engert, R. and Cesareni, G. (2004) Protein interaction networks by proteome peptide scanning. PLoS Biol. 2, E14.
[18] Walhout, A.J., Sordella, R., Lu, X., Hartley, J.L., Temple, G.F., Brasch, M.A., Thierry-Mieg, N. and M (2000) Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116–122.
[19] Matthews, L.R., Vaglio, P., Reboul, J., Ge, H., Davis, B.P., Garrels, J., Vincent, S. and Vidal, M. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or "interologs". Genome Res. 11, 2120–2126.
[20] von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S. and Bork, P. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403.
[21] Lehner, B. and Fraser, A.G. (2004) A first-draft human protein-interaction map. Genome Biol. 5, R63.
[22] Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M. and Cesareni, G. (2002) MINT: a Molecular INTeraction database. FEBS Lett. 513, 135–140.
[23] Bader, G.D., Betel, D. and Hogue, C.W. (2003) BIND: the biomolecular interaction network database. Nucleic Acids Res. 31, 248–250.
[24] Hermjakob, H., et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res. 32 (Database issue), D452–D455.
[25] Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M. and Eisenberg, D. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303–305.
[26] Hermjakob, H., et al. (2004) The HUPO PSI's molecular interaction format – a community standard for the representation of protein interaction data. Nat. Biotechnol. 22, 177–183.
[27] Peri, S., et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 13, 2363–2371.
[28] Yu, H., et al. (2004) Annotation transfer between genomes: protein–protein interologs and protein–DNA regulogs. Genome Res. 14, 1107–1118.
[29] Koyuturk, M., Grama, A. and Szpankowski, W. (2004) An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics 20 (Suppl 1), I200–I207.
[30] Kelley, B.P., Sharan, R., Karp, R.M., Sittler, T., Root, D.E., Stockwell, B.R. and Ideker, T. (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. USA 100, 11394–11399.
