Hierarchical analysis of large-scale two-dimensional gel electrophoresis experiments


DOI 10.1002/pmic.200300533

Proteomics 2003, 3, 1930–1935

Technical Brief

Amit Rubinfeld, Tsipi Keren-Lehrer, Gil Hadas, Zeev Smilansky

Compugen Ltd., Tel Aviv, Israel

Large-scale two-dimensional gel experiments have the potential to identify proteins that play an important role in elucidating cell mechanisms and in various stages of drug discovery. Such experiments, typically including hundreds or even thousands of related gels, are notoriously difficult to perform, and analysis of the gel images has until recently been virtually impossible. In this paper we describe a scalable computational model that permits the organization and analysis of a large gel collection. The model is implemented in Compugen’s Z4000 system. Gels are organized in a hierarchical, multidimensional data structure that allows the user to view a large-scale experiment as a tree of numerous simpler experiments, and to carry out the analysis one step at a time. Analyzed sets of gels form processing units that can be combined into higher level units in an iterative framework. The different conditions at the core of the experiment design, termed the dimensions of the experiment, are transformed from a multidimensional structure to a single hierarchy. The higher level comparison is performed with the aid of a synthetic “adaptor” gel image, called a Raw Master Gel (RMG). The RMG allows data from an entire set of gels to be presented as a single gel image, thereby enabling the iterative process. Our model includes a flexible experimental design approach that allows the researcher to choose the condition to be analyzed a posteriori. It also enables data reuse, that is, performing several different analysis designs on the same experimental data. The stability and reproducibility of a protein can be analyzed by tracking it up or down the hierarchical dimensions of the experiment.

Keywords: Experiment design / Image analysis / Large scale two-dimensional gel experiments / Multidimensional gel experiments / Two-dimensional gel electrophoresis / Z4000

Correspondence: Amit Rubinfeld, Compugen Ltd., 72 Pinchas Rosen St., Tel Aviv 69512, Israel E-mail: [email protected] Fax: +972-3-765-8555

Abbreviations: ETF, experiment tree folder; RMG, raw master gel

© 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

2-DE [1, 2] is currently the most widely used technique in comparative proteome studies [3, 4]. Two-dimensional comparative protein maps are typically used to spot proteins that are differentially expressed in various cells or disease states in order to increase the efficiency of drug discovery. 2-DE has limitations, however [5–7]. Two major ones are low sensitivity, which limits the ability to analyze proteins of medium to low abundance [8, 9], and the labor-intensive process of sample preparation, gel running and staining [10]. Alternative methods of protein separation have been developed recently to address these weaknesses [11–14]. Recent advances in 2-DE technology have increased the separation capacity of the technique [15, 16], but inconsistent and sometimes poor reproducibility remains a problem. In addition to the studied differences between proteomes (the controlled variations), other differences resulting from the 2-DE method itself are expressed (the uncontrolled variations). Thus, detecting differences in protein expression between two samples and investigating patterns of proteins through a set of gels becomes a complex task.

It is widely accepted that comparing gel images and extracting meaningful information requires computerized analysis programs [17]. Various software packages for 2-D gel analysis have been developed [18–26]. In order to suppress uncontrolled variations and stabilize the data, gel repeats are created from the same sample. The technique of repeating gels and ordering them in a multidimensional structure yields large-scale experiments. Large Scale Biology, for example, has reported running about 1000 gels per week [27]. However, the analysis of such an immense number of gel images is almost impossible, since it takes longer and is more difficult than running the gels. In fact, the ability to run more 2-D gels than can be analyzed is currently recognized as a key bottleneck. To the best of our knowledge, no system has yet managed to fully analyze multidimensional experiments.

In the present paper we describe a method for analyzing a multidimensional large-scale gel collection, implemented in Compugen’s Z4000 software. This method utilizes the registration method already implemented in Compugen’s Z3 software [25, 26] and introduces a technique for hierarchical analysis of a large gel collection. In Z4000, gels are organized in a hierarchical, multidimensional data structure that allows the user to view a large-scale experiment as a tree of numerous simpler experiments, and to carry out the analysis one step at a time. Analyzed sets of gels form processing units that can be combined into higher level units in an iterative framework. A flexible experimental design approach allows the condition to be analyzed to be chosen a posteriori. It also allows reuse of data, that is, performing several different analysis designs on the same experimental data, so users can mine their experimental data without having to repeat the entire analysis. A protein can be tracked up or down the hierarchical levels of the experiment in order to analyze its stability and reproducibility. The system outputs spot data for statistical analysis and supports visual verification of the results.

Z4000 implements the analysis model for a typical hierarchical experiment design. We term the tree levels the dimensions of the experiment, where each dimension is one of the experiment’s attributes. The repeats in each dimension are termed the conditions. The lower hierarchy dimensions group together repeats of gels taken from the same conditions or the same sample. These dimensions usually show low variation in the proteome, and are used to suppress uncontrolled variations.
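The relation between dimensions, conditions and the resulting tree of gels can be illustrated with a small sketch. This is only an illustration under stated assumptions: the dimension names, the condition labels and the helper functions `build_tree` and `count_gels` are hypothetical and are not part of Z4000.

```python
# Hypothetical dimensions, ordered from the top of the hierarchy to the bottom.
# Names and condition labels are illustrative only.
DIMENSIONS = [
    ("dose", ["low", "high"]),
    ("animal", ["a1", "a2", "a3"]),
    ("gel_repeat", ["g1", "g2"]),
]


def build_tree(dimensions):
    """Nest the conditions of each dimension under those of the one above.

    Returns a dict of dicts; the lowest level maps each condition to a
    gel identifier built from the path of conditions leading to it.
    """
    def expand(level, path):
        _name, conditions = dimensions[level]
        node = {}
        for cond in conditions:
            new_path = path + (cond,)
            if level == len(dimensions) - 1:
                node[cond] = "/".join(new_path)   # a leaf is one gel
            else:
                node[cond] = expand(level + 1, new_path)
        return node
    return expand(0, ())


def count_gels(tree):
    """Count the leaves (gels) of the tree."""
    if isinstance(tree, str):
        return 1
    return sum(count_gels(child) for child in tree.values())


tree = build_tree(DIMENSIONS)
n_gels = count_gels(tree)   # product of the number of conditions per dimension
```

Reordering the entries of `DIMENSIONS` yields a different tree over the same set of gels, which is the sense in which the choice of the compared condition can be deferred.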
The upper dimensions are used to compare different conditions in the search for significant changes (Fig. 1). In order to analyze the data, we use the hierarchical structure and the dependencies between dimensions to reverse the order of the hierarchy. This can be likened to transposing a matrix: from a tree of gels, each containing a list of spots, we compose a list of trees, each characterizing a spot (protein) along the experiment. The reversed representation allows us to analyze the extracted data, track proteins throughout the experiment to analyze their stability and reproducibility, recognize trends and clusters of similar behavior, and apply statistical analyses.

Figure 1. An example of a tree display of an experiment that studies the effects of a certain drug on rat blood cells (partially expanded). Rats are treated with 1, 2, 5 and 20 mg of the drug, five animals in each group. Blood samples are collected 5 s, 1 min and 5 min after treatment. Each extraction is repeated three times, and every protein extract is separated on five 2-D gels. Hence, the experiment contains 5 × 3 × 5 × 3 × 5 = 1125 gels. The lower three dimensions are gel, sample and animal repeats. (A) The doses are compared and the time differences are suppressed in order to learn which effects are dose-dependent, and (B) the time points are compared and the dose differences are suppressed in order to learn which effects are time-dependent.

The workflow of analyzing a multidimensional experiment is described as pseudocode in Fig. 2 and illustrated in Fig. 3. In the first step, the multidimensional experiment structure is transformed into a single hierarchy structure of a tree. In this structure, each set of gels that share common experimental attributes is grouped into an experiment tree folder (ETF) and analyzed separately. The analyzed ETF, termed a Gel Assembler, is the basic computational unit of the experiment. Gel Assemblers are combined into higher level Gel Assemblers in an iterative manner. In the top level ETF, the different conditions of this dimension are compared. The product of this process is a data set of all the protein spots, arranged in a recursive manner. The analysis is carried out in small steps, which allows good control of the process, reduces the chance of errors, and dramatically reduces the demands on computational resources, so that the analysis can be performed on a standard PC.

Figure 2. The analysis workflow of a multidimensional experiment is described as pseudocode.

Figure 3. The analysis workflow of an ETF with N gels. A reference gel is selected from the set; which gel is chosen has no significance, since the product of the ETF is an RMG, composed from the pixel level of the entire gel set. Registration is performed between the reference gel and all the other gels, followed by spot detection and matching, based on the registration data and the characteristics of the spots. The differential expression between the matched spots is computed from the pixel level. The product of this step is a list of the expression levels of each spot along the gel set. An RMG is composed from the gel images, based on registration and supported by the matching. Spots are detected on the RMG and linked to the corresponding spots of the gels in the gel set. Finally, the RMG and the spot data are encapsulated in a Gel Assembler.

There are several novel features to our approach: composing the Raw Master Gel (RMG) from sets of gels sharing common conditions, the Gel Assembler as representative of the ETF, and the hierarchical analysis of ETFs. The following sections describe the first two features in detail.

The RMG is a synthetic “adaptor” gel image, created from a set of gels that share some common conditions. The RMG is composed from the pixel levels of the images and not from the features (spots). The RMG emphasizes the repetitive patterns on the gels, and is used to highlight the common parts of the expressed proteome on gel replicates. The RMG suppresses the nonrepetitive patterns, which are the expression of the uncontrolled variations of the gels.
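The bottom-up workflow described above, in which each ETF is analyzed into a Gel Assembler that then stands in for a single gel one level up, can be sketched as a recursion. This is a minimal sketch under loud assumptions: the `GelAssembler` class and the `analyze` function are hypothetical names, the images are tiny grids that are taken to be already aligned, and `compose_rmg` is a stand-in (a plain pixel-wise median) rather than the registration-based composition used in Z4000.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class GelAssembler:
    """Hypothetical container: a representative image plus its sources."""
    name: str
    rmg: list                                   # 2-D list of pixel values
    children: list = field(default_factory=list)


def compose_rmg(images):
    """Stand-in for RMG composition: pixel-wise median of aligned images.

    The real procedure registers and warps the gels first; that step is
    assumed to have happened already.
    """
    rows, cols = len(images[0]), len(images[0][0])
    return [[median(img[r][c] for img in images) for c in range(cols)]
            for r in range(rows)]


def analyze(name, node):
    """Analyze one ETF bottom-up and return its Gel Assembler.

    `node` is either a 2-D image (a single gel) or a dict mapping
    condition names to sub-nodes (an ETF).
    """
    if isinstance(node, list):                  # leaf: a gel image
        return GelAssembler(name, node)
    children = [analyze(child_name, child) for child_name, child in node.items()]
    rmg = compose_rmg([child.rmg for child in children])
    return GelAssembler(name, rmg, children)


# Two conditions with three one-row gel repeats each (illustrative values).
experiment = {
    "dose_low": {"g1": [[10, 0]], "g2": [[12, 0]], "g3": [[11, 9]]},
    "dose_high": {"g1": [[30, 0]], "g2": [[34, 1]], "g3": [[32, 0]]},
}
top = analyze("experiment", experiment)
```

Each call only ever handles the images of one folder, which is why the memory demand stays bounded regardless of the total number of gels. For brevity the sketch composes a representative image at every level, whereas in the workflow described in the text the top level ETF is where the conditions are compared.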


Creating the RMG involves two basic phases: geometrical warping of the participating gels to the RMG plane [25], and pixel blending to compose the RMG image. The RMG addresses three types of uncontrolled variations: geometrical, global and local. Geometrical deformations can be caused by several factors, including uneven polymerization of the gel, local stretching, and cracks in the gel. These deformations are corrected in the warping phase. Global variances, such as image acquisition noise, and local deformations, such as dirt stains or fluorescent imaging speckles, are suppressed in the blending phase.

The RMG geometrical warping is done in three steps. First, the N participating gels are registered as a group, so that all gels are registered at the pixel level to a reference gel, and thus to each other. The pixel-based registration yields a robust warp, based on the repetitive patterns of the images, and is insensitive to local deformations and artifacts. The second step consists of detecting spots on each of the gels and matching the spots between gels. The matching data are collected into M spot vectors S_m[N]. A score is derived for each vector S_m, based on the internal matching rank and the number of times the spot was expressed in the gel set. The matching data use the images’ features (spots) to refine the warping at a smaller scale (micro registration). An additional registration anchor is calculated from each of the vectors with a sufficient score. In the third step, using the registration anchors, all N gels are mapped to a common new plane, the RMG plane. The gel images are then warped to the RMG plane using four-point resampling to create a primary multilayer master image, MI(x,y,n).

The pixels are blended using robust averaging, as described in the following three steps. First, for each pixel location in MI(x,y,n), a vector of the pixel values in all layers at that location, P_xy[n], is collected. For each P_xy, a primary estimator for the blended pixel, e_xy, is selected by applying a median filter, and the range of source pixels, er_xy, is saved. Second, for each estimator value e_m within a given tolerance, we collect all the corresponding ranges er_m and calculate the variance of the range. Third, for each pixel set P_xy, we derive the acceptable range that satisfies a given variance tolerance of er_m that corresponds to the associated e_m. Pixels in P_xy that fall outside this range are ruled out. The final blended pixel b_xy is computed by applying a median filter again on the remaining pixels of P_xy. The RMG image RI(x,y) is the collection of the blended pixels b_xy (Figs. 4A and 4B).

Since the RMG image is composed from the pixel level, the retained uniqueness of the spot shapes allows further quantitative analysis, as on the original gel images. The data in the gel set are stabilized by the RMG representation, as uncontrolled variations are reduced. The contamination and nonrepetitive patterns are removed, but the spots retain their unique shape and deformations (smears, streaks). Another benefit of the way the RMG is created is a reduction of the image noise, which is a product of any image-acquiring mechanism (Figs. 4C–F). In addition, the RMG has better geometric precision, as the warps of the input gels tend to cancel each other out.
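The idea behind the blending phase (a median estimate, rejection of pixels that fall outside an accepted range, then a second median) can be sketched as follows. This is a deliberately simplified illustration, not the procedure implemented in Z4000: the function name `robust_blend` is hypothetical, the layers are assumed to be already warped to a common plane, and the accepted range is a fixed `tolerance` around the primary median rather than a range derived from the variance of the pixel ranges as the text describes.

```python
from statistics import median


def robust_blend(layers, tolerance):
    """Blend aligned image layers pixel by pixel.

    layers    -- list of 2-D lists of equal shape (the warped gels)
    tolerance -- assumed fixed maximum distance from the primary median;
                 the text derives the accepted range from the variance of
                 the pixel ranges instead
    """
    rows, cols = len(layers[0]), len(layers[0][0])
    blended = []
    for y in range(rows):
        row = []
        for x in range(cols):
            stack = [layer[y][x] for layer in layers]     # P_xy[n]
            estimate = median(stack)                      # primary estimator e_xy
            kept = [p for p in stack if abs(p - estimate) <= tolerance]
            # Fall back to the full stack if every pixel was rejected.
            row.append(median(kept or stack))             # blended pixel b_xy
        blended.append(row)
    return blended


layers = [
    [[100, 20]],
    [[104, 22]],
    [[98, 21]],
    [[101, 250]],   # second pixel carries a speck present on one gel only
]
rmg = robust_blend(layers, tolerance=10)
```

In this toy input the speck value 250 appears on a single layer, so it is rejected before the second median and does not influence the blended pixel, while the pixel that is consistent across layers is left essentially unchanged.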



Figure 4. Screen captures from Z4000. (A) is a set of four gels from which an RMG was composed (B). (C) shows the noise histogram of one gel from (A), where the X axis is the variance of the pixel values in a 5 × 5 square, and the Y axis is the number of times this value occurred in the image. (D) is the same histogram derived from the RMG image. The noise level is reduced in both mean and power. (E) and (F) are zoomed images of a spot in one gel from (A) and the corresponding RMG image, respectively. The image noise is reduced while the spot retains its unique shape and characteristics.

The RMG can also be composed to compare different conditions of the same attribute, for example, proteins extracted from a tissue at different time points. This RMG suppresses differences in the proteome that are unique to the different experimental conditions, and leaves the spots that are repeated regardless of the different conditions. This RMG is used to focus on the proteins that are insensitive to the tested condition; in the above example, these are proteins whose response to a certain treatment is stable over the tested time period.

Spots are detected on the RMG image and matched to the spots on each of the gels in the ETF. This matching links each spot on the RMG to all the spots from which it was composed. These links are the basis of the hierarchical representation of the analyzed data of the experiment, since they connect successive levels of the hierarchical data representation. The Gel Assembler is the basic processing unit used in the hierarchical experiment analysis. It allows a scalable workflow in which the analysis “climbs” up the experiment structure tree. The recursive analysis continues from bottom to top, until the highest dimension of the experiment is reached.

Z4000 manages the analysis of large-scale experiments as described, along with tools for the experimental design phase, data mining, verification, etc. Once the computational analysis of the experiment is complete, the accumulated spot data are displayed in an interactive table window, the Assembly Spot Table (AST), and spots can be tracked up or down the experiment’s dimensions. Changing the hierarchy order of the experiment dimensions allows the user to study the dependencies of different conditions or treatments, and to check the variability or reproducibility of the proteins in question in different contexts. Changing the experiment design after the analysis is done offers the ability to expand the range of a certain condition. For example, one can start with a coarse resolution and, after obtaining preliminary results from the analysis, add more attributes to the condition in the effective range with a finer resolution. In such a case, one might start with a dosage series of 1, 2, 5 and 20 mg of a drug and, after finding that the effective range is between 2 and 5 mg, add 3 and 4 mg to the experiment.

In feasibility tests carried out on this system, we were able to analyze a multidimensional experiment of 1056 gel images of 2.5 MB each, organized in four hierarchy levels, in less than 17 h on a standard single-processor PC (Pentium III at 667 MHz with 512 MB RAM). In theory, there is no limit to the size of an experiment with the new model, as long as each step is feasible in itself. We also completed the analysis of the same experiment with the Z3 software, using similar basic algorithms, to serve as a reference. Analyzing the gel images took more than 37 h, during which the user had to operate the software and manually manage the process. Furthermore, although the gels were organized in a multidimensional hierarchy, there was no direct tool to track similar spots throughout the experiment’s levels and to study their stability, reproducibility or dependence on the experimental conditions.


In this paper we have described a software platform implementing a scalable computational model for analyzing large-scale 2-DE experiments. The multidimensional experiment conditions are transformed into a hierarchical structure and analyzed in a recursive manner. Our new model allows large-scale experiments to be better exploited, and, with a flexible experimental design, enables the condition in question to be chosen through different orientation of comparisons to detect proteins whose expressions show interesting dependence on the controlled experimental parameters.

Received January 29, 2003 Revised June 2, 2003 Accepted June 10, 2003

References

[1] O’Farrell, P. H., J. Biol. Chem. 1975, 250, 4007–4021.
[2] Klose, J., Humangenetik 1975, 26, 231–243.
[3] Rabilloud, T., Proteomics 2002, 2, 3–10.
[4] Hille, J. M., Freed, A. L., Watzig, H., Electrophoresis 2001, 22, 4035–4052.
[5] Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D., Hochstrasser, D. F., Electrophoresis 1997, 18, 2735–2748.
[6] Branca, M. A., Sannes, L. J., Cambridge Healthtech Institute’s Genomic Pathways Reports Series, Report 1, April 1999.
[7] Harry, J. L., Wilkins, M. R., Herbert, B. R., Packer, N. et al., Electrophoresis 2000, 21, 1071–1081.
[8] Lopez, M. F., Electrophoresis 2000, 21, 1082–1093.
[9] Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y., Aebersold, R., Proc. Natl. Acad. Sci. USA 2000, 97, 9390–9395.
[10] Quadroni, M., James, P., Electrophoresis 1999, 20, 664–677.
[11] Link, A. J., Eng, J., Schieltz, D. M., Carmack, E. et al., Nat. Biotechnol. 1999, 17, 676–682.
[12] Han, D. K., Eng, J., Zhou, H., Aebersold, R., Nat. Biotechnol. 2001, 19, 946–951.
[13] Jenkins, R. E., Pennington, S. R., Proteomics 2001, 1, 13–29.
[14] Uetz, P., Curr. Opin. Chem. Biol. 2002, 6, 57–62.
[15] Hoving, S., Gerrits, B., Voshol, H., Muller, D. et al., Proteomics 2002, 2, 127–134.
[16] Wildgruber, R., Harder, A., Obermaier, C., Boguth, G. et al., Electrophoresis 2000, 21, 2610–2616.
[17] Mahon, P., Dupree, P., Electrophoresis 2001, 22, 2075–2085.
[18] Olson, A. D., Miller, M. J., Anal. Biochem. 1988, 169, 49–70.
[19] Lemkin, P. F., Lipkin, L. E., Comput. Biomed. Res. 1981, 14, 272–297.
[20] Anderson, N. L., Taylor, J., Scandora, A. E., Coulter, B. P., Anderson, N. G., Clin. Chem. 1981, 27, 1807–1820.
[21] http://www.biorad.com.
[22] http://www.genebio.com/Melanie.html.
[23] http://www.phoretix.com/.
[24] http://www.nonlinear.com/2d/progenesis/index.htm.
[25] Smilansky, Z., Electrophoresis 2001, 22, 1616–1626.
[26] http://www.2dgels.com.
[27] Anderson, N. G., Matheson, A., Anderson, N. L., Proteomics 2001, 1, 3–12.
