Protein comparison and classification: a differential geometric approach


Proc. Natl. Acad. Sci. USA Vol. 85, pp. 777-781, February 1988

Biophysics

Protein comparison and classification: A differential geometric approach (protein conformation)

S. RACKOVSKY AND D. A. GOLDSTEIN Department of Biophysics, University of Rochester, School of Medicine and Dentistry, Rochester, NY 14642

Communicated by Fred Sherman, October 12, 1987 (received for review July 1, 1987)

ABSTRACT A method is proposed for rapidly, quantitatively comparing protein structures of arbitrary sizes, based on the differential geometric representation. The method is applied to a group of 22 protein x-ray structures, and the resulting network of closest relationships is delineated. Several well-known fold types are automatically detected as groupings of related structures, even when the constituent proteins are of different lengths. A complete gradation of types is shown to be detected, ranging from all-helical to all-β-structure proteins. A relationship among functionally similar proteins is shown in several cases, even where their three-dimensional structures differ. It is suggested that the positions of proteins within the network of relationships correspond with their folding mechanisms.

The comparison and classification of structures are central to much of the recent upwelling of interest in protein structural studies. It has become clear that structural homology is a far more reliable index of relatedness of proteins than sequence homology (1). The proper classification of structures is a necessary prerequisite to exploring the relationship between sequence and structural features (2), and thus to a full understanding of protein folding and design.

It is clear that comparison of structures precedes classification. Comparisons can be carried out on either a qualitative or a quantitative basis. Indeed, some of the most useful schemes for protein taxonomy (2, 3) rely on the visual comparison of more-or-less schematic representations of molecular structure. As a complement to those methods, which have provided great insight into the typology of protein structure, it would be useful to have available a quantitative, objective method for the comparison of protein structures of arbitrarily different sizes.

Two quantitative methods for comparing structures have been developed, that of Rossmann and coworkers (4, 5) and that of Remington and Matthews (6). Both rely on the superposition comparison of proteins or segments thereof. In a joint review (1), Matthews and Rossmann have characterized the two methods as follows. The method of Rossmann and coworkers is able to compare proteins of different sizes and determine the presence of insertions and deletions. It relies, however, on a complicated algorithm and on proper choice of a set of starting parameters. The method of Remington and Matthews is less complicated in execution but less able to deal with major structural differences, particularly insertions and deletions.

In this work, we present an alternative approach to this problem based on the differential geometric representation (7-12). We demonstrate that the method is easily able to compare quantitatively proteins of arbitrarily different sizes and is very rapid and easily implemented. Further, because the basic approach is different from the superposition-based methods, it offers some new insights into protein organization.

METHODS

Previous work (7, 8, 12) demonstrated the utility of the differential geometric method for studying the local conformational similarities and differences between protein segments of equal length. For the present application, we want a method capable of discerning global similarities and differences between segments (in this case, whole proteins) of different length. To this end we represent a protein by its distribution of (κ,τ) values, which leads to a matrix corresponding to each protein under study. As in previous work (10), each matrix is 18 x 18, and the (i,j)th element of the matrix is the number of times the (κ,τ) values for four-Cα units within the protein fall in the interval

\[
-0.9 + 0.1(i - 1) \le \kappa < -0.9 + 0.1i, \qquad -0.9 + 0.1(j - 1) \le \tau < -0.9 + 0.1j.
\]
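The binning just defined can be illustrated with a short Python sketch. It is not the authors' program. It assumes that the (κ,τ) values have already been computed and are expressed on the scale used in refs. 7-12, so that they fall between -0.9 and 0.9. The function name `kt_matrix` and the choice to discard out-of-range values are assumptions made for illustration.

```python
import numpy as np

N_BINS = 18   # the matrix is 18 x 18
LOW = -0.9    # lower edge of the first interval
WIDTH = 0.1   # width of each interval


def kt_matrix(kappa, tau):
    """Count the (kappa, tau) pairs that fall in each of the 18 x 18 intervals.

    kappa, tau: equal-length sequences, one value per four-C-alpha unit.
    Pairs outside [-0.9, 0.9) are dropped (an assumption of this sketch).
    """
    kappa = np.asarray(kappa, dtype=float)
    tau = np.asarray(tau, dtype=float)
    # 0-based bin indices; the 1-based (i, j) of the text is (index + 1).
    i = np.floor((kappa - LOW) / WIDTH).astype(int)
    j = np.floor((tau - LOW) / WIDTH).astype(int)
    inside = (i >= 0) & (i < N_BINS) & (j >= 0) & (j < N_BINS)
    m = np.zeros((N_BINS, N_BINS), dtype=int)
    # np.add.at accumulates correctly when the same bin is hit repeatedly.
    np.add.at(m, (i[inside], j[inside]), 1)
    return m
```

For a protein of R residues, the matrix built this way has a total population of R - 3 when every four-Cα unit falls inside the binned range.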

The problem we wish to solve then reduces to that of defining a distance function between matrices such that the mathematical distance has the properties one would intuitively expect, i.e., the distance between matrices increases with a decrease in similarity of the matrices. Such a distance function has been defined in previous work (10), and its utility has been demonstrated in the course of studies on the conformations arising from the presence of the various amino acids in the protein backbone. The distance θ between matrices M and N is given by

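\[
\theta(M,N) = \cos^{-1}\left[ \frac{\sum_{ij} M_{ij} N_{ij}}{\left(\sum_{ij} M_{ij}^{2}\right)^{1/2} \left(\sum_{ij} N_{ij}^{2}\right)^{1/2}} \right].
\]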

It is readily seen (10) that θ(M,N) is a function of the relative distributions of population in the two matrices, rather than the absolute populations. (The population associated with a matrix M is simply the number of 4-Cα units in the protein, n_M. If R is the number of residues in M, then n_M = R - 3.) It is therefore suited to the comparison of matrices corresponding to different populations, i.e., to proteins of different size. Intuitively, one can think of M and N as vectors in a high-dimensional space. θ(M,N) then corresponds to the angle between unit vectors with the same direction cosines as M and N. Identical (or proportional) matrices give θ = 0, and pairs of matrices that never have non-zero elements in the same location give θ = π/2 (10).
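The limiting cases described above (θ = 0 for proportional matrices, θ = π/2 for matrices with no shared non-zero elements) can be checked numerically. The following Python sketch treats each matrix as a flattened vector and returns the angle between the two vectors. The function name `matrix_angle` is hypothetical, and the sketch assumes that neither matrix is entirely zero.

```python
import numpy as np


def matrix_angle(m, n):
    """Angle (in radians) between two population matrices viewed as vectors."""
    m = np.asarray(m, dtype=float).ravel()
    n = np.asarray(n, dtype=float).ravel()
    cos_theta = np.dot(m, n) / (np.linalg.norm(m) * np.linalg.norm(n))
    # Clip to guard against round-off pushing the cosine slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


# Proportional matrices: same relative distribution, different population.
a = np.array([[1, 2], [0, 3]])
print(matrix_angle(a, 5 * a))        # close to 0

# Matrices with no overlapping non-zero elements.
b = np.array([[1, 0], [0, 0]])
c = np.array([[0, 0], [0, 2]])
print(matrix_angle(b, c))            # pi / 2
```

Because the matrix elements are counts and therefore non-negative, the angle always lies between 0 and π/2.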

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviations: BPTI, bovine pancreatic trypsin inhibitor; HIPIP, oxidized high-potential iron protein.


Biophysics: Rackovsky and Goldstein

Once an index of similarity of matrices is available, it is necessary to calibrate the resulting scale. A large number of matrices M with the same population n_M can be generated, each of which will give a different value of θ(M,N) when compared with a given matrix N. (This could correspond physically to proteins with the same number of residues but different structures.) Inspection of a given value of θ(N,M) does not, in general, reveal whether the two proteins in question are similar or dissimilar relative to the ensemble of proteins of the same sizes.

We address this problem by generating a large number of random distributions of the same sizes as those arising from the two proteins to be compared. These are generated by choosing (κ,τ) pairs from the sample of proteins that we wish to study, so that the random distributions are generated from data that have the same weighting as the proteins under study and are therefore realistic as regards both numerical values and frequency of occurrence. Using an ensemble of 200 pairs of randomly generated distributions, we calculate an average value of θ for random distributions of the specified sizes, ⟨θ(n_N,n_M)⟩, and a corresponding standard deviation, σ(n_N,n_M). We can then characterize the similarity of N and M (i.e., of the structures of the two proteins that give rise to the distributions N and M) by comparing θ(N,M) to ⟨θ(n_N,n_M)⟩ ± σ(n_N,n_M). If, for example, θ(N,M) < ⟨θ(n_N,n_M)⟩ - σ(n_N,n_M), then N and M are very similar. If θ(N,M) > ⟨θ(n_N,n_M)⟩ + σ(n_N,n_M), N and M are very different.
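The calibration procedure can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code. The names `random_baseline` and `classify`, sampling with replacement, the use of the population standard deviation, and the label "comparable to random" for the intermediate case are all assumptions. The helpers `kt_matrix` and `matrix_angle` restate the binning and the distance defined earlier so that the block runs on its own.

```python
import numpy as np


def kt_matrix(pairs, n_bins=18, low=-0.9, width=0.1):
    """Bin an (n, 2) array of (kappa, tau) pairs into an 18 x 18 count matrix."""
    idx = np.floor((np.asarray(pairs, dtype=float) - low) / width).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < n_bins), axis=1)]  # drop out-of-range pairs
    m = np.zeros((n_bins, n_bins))
    np.add.at(m, (idx[:, 0], idx[:, 1]), 1)
    return m


def matrix_angle(m, n):
    """Angle between two count matrices viewed as vectors."""
    c = np.sum(m * n) / np.sqrt(np.sum(m * m) * np.sum(n * n))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))


def random_baseline(pool, n_n, n_m, n_trials=200, seed=0):
    """Mean and standard deviation of theta for random distributions.

    pool: (n, 2) array of (kappa, tau) pairs pooled from the protein sample.
    n_n, n_m: populations (residues - 3) of the two proteins being compared.
    """
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool, dtype=float)
    thetas = np.empty(n_trials)
    for t in range(n_trials):
        # Draw with replacement from the pooled data (an assumption).
        a = pool[rng.integers(0, len(pool), size=n_n)]
        b = pool[rng.integers(0, len(pool), size=n_m)]
        thetas[t] = matrix_angle(kt_matrix(a), kt_matrix(b))
    return thetas.mean(), thetas.std()


def classify(theta, mean, sigma):
    """Compare an observed theta with the random mean plus or minus sigma."""
    if theta < mean - sigma:
        return "very similar"
    if theta > mean + sigma:
        return "very different"
    return "comparable to random"
```

With a pooled array of (κ,τ) pairs in hand, `random_baseline(pool, 51, 330)` would give the baseline for comparing the smallest and largest proteins in the sample (54 and 333 residues).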

The sample of proteins we use is that used in previous (7-12) statistical studies of protein structure. It contains 22 proteins chosen so that the structures available are of good quality and excessive homology is avoided. The proteins range in size from 54 residues (rubredoxin) to 333 residues (glyceraldehyde-3-phosphate dehydrogenase).

RESULTS AND DISCUSSION

Table 1 shows the reduced distance θ′(N,M) between each pair of proteins in the sample (identified by number in the indicated manner), and the reduced standard deviation of the randomly generated distribution of θ(n_N,n_M) associated with each pair, σ′(n_N,n_M). The standard deviations are shown in order to give the reader a feeling for the actual degree of similarity represented by the θ′ values with which they are associated. Thus, for example, θ′(N,M) < 1 - σ′(n_N,n_M) indicates a very high degree of similarity between the distributions N and M, in keeping with the discussion at the end of the preceding section.

Fig. 1 shows the network of closest relationships extracted from the data of Table 1. Two proteins are connected by bonds in this figure if one of the following conditions is fulfilled: (i) the proteins show a very high degree of similarity (in the sense just outlined); (ii) one of the proteins shows greater similarity to the other than to any other member of the data set. Thus, for example, it can be seen from Table 1 that lysozyme and thermolysin satisfy the condition for high similarity. Carbonic anhydrase does not fulfill this condition with any other protein in the data set but is most (and equally) similar to α-chymotrypsin, rubredoxin, BPTI, and ferredoxin.

The first striking point evident from Fig. 1 is the ability of the distance function to accurately identify the previously known fold types among the proteins considered, even when they appear in proteins of widely varying sizes, by grouping

FIG. 1. The network of relationships arising from comparison of (κ,τ) distributions. Only the closest relationships between proteins are shown. Double lines indicate that the proteins connected are much more similar than would be expected of random distributions. Single lines indicate a relationship of more than random similarity. Broken lines indicate the closest relationships entered into by proteins that show less-than-random similarity to all other members of the set. Members of classical fold types are enclosed in the marked boxes.
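The two bonding conditions used to draw the network (very high similarity, or nearest neighbor) can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the procedure actually used to draw Fig. 1. The function name `closest_relationships` is hypothetical, and the inputs are assumed to be symmetric square arrays holding the reduced distances and reduced standard deviations of Table 1. Ties for nearest neighbor are all kept, in line with the carbonic anhydrase example in the text.

```python
import numpy as np


def closest_relationships(theta_red, sigma_red):
    """Return the sorted list of bonded protein pairs (a, b) with a < b.

    theta_red[a, b]: reduced distance between proteins a and b.
    sigma_red[a, b]: reduced standard deviation for the same pair.
    """
    theta_red = np.asarray(theta_red, dtype=float)
    sigma_red = np.asarray(sigma_red, dtype=float)
    n = theta_red.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    edges = set()

    # Condition (i): very high similarity, reduced distance below 1 - sigma.
    strong = (theta_red < 1.0 - sigma_red) & off_diag
    for a, b in zip(*np.nonzero(strong)):
        edges.add((int(min(a, b)), int(max(a, b))))

    # Condition (ii): each protein is bonded to its nearest neighbor(s).
    masked = np.where(off_diag, theta_red, np.inf)
    for a in range(n):
        nearest = np.flatnonzero(np.isclose(masked[a], masked[a].min()))
        for b in nearest:
            edges.add((min(a, int(b)), max(a, int(b))))

    return sorted(edges)
```

The sketch returns only which pairs are bonded. Distinguishing the double, single, and broken lines of the figure would require an additional comparison of each bonded pair's reduced distance with 1.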

[Table 1. Reduced distances between each pair of proteins in the sample, with the reduced standard deviations of the corresponding randomly generated distributions. Table entries not recoverable.]

the appropriate structures. Thus, the nucleotide-binding fold in flavodoxin, subtilisin, and glyceraldehyde-phosphate dehydrogenase is manifested as a grouping of these three proteins, which have 138, 275, and 333 residues, respectively. Similarly, human plasma prealbumin, concanavalin A, α-chymotrypsin, and rubredoxin are grouped together because they are characterized by a β-sheet/barrel structure, although their sizes range from 54 to 237 residues. (The prealbumin/concanavalin A and chymotrypsin/rubredoxin pairs are placed together, even though they are not connected by bonds. Although the members of these pairs are not nearest neighbors, the data of Table 1 indicate that the members of the two pairs show essentially the same degree of similarity to BPTI and that they are second neighbors to one another. It is therefore clear that they should be juxtaposed. BPTI is not included in this group because its degree of similarity to ferredoxin and HIPIP is equal to its similarity to rubredoxin.) The globin fold (myoglobin, sea-lamprey hemoglobin) is also identified.

It is clear, moreover, that these well-characterized folds represent particular regions within a continuum of protein structural distributions, in much the same way that the characteristic protein structure features (α-helix, bend, extended structure) are population maxima in a continuum of structures on the 4-Cα length scale (9). The two globin proteins and carp myogen form a cluster of all-helical proteins (which we will refer to as the A cluster). There is a group of proteins that falls in the gap between these helical proteins and the nucleotide-binding fold, containing thermolysin, lysozyme, triose-phosphate isomerase, carboxypeptidase A, and cytochrome c. Lysozyme and cytochrome c are most similar to the proteins of the A cluster in their (κ,τ) distributions, and thermolysin and carboxypeptidase A less so.
Triose-phosphate isomerase shows a lesser degree of similarity to the other members of this cluster. (Thermolysin actually has two domains, one highly helical and one composed of β-sheet. If considered separately, these would appear at appropriately different locations in the map. This point will be further discussed below.) We will refer to these proteins, together with the three nucleotide-binding proteins, as the Aβ cluster, denoting the relatively high proportion of A [i.e., helix/bend (11)] structure that characterizes their (κ,τ) distributions.

A second group of proteins is found to occur between the cluster we have just discussed and the β-barrel proteins. This consists of seven proteins: staphylococcal nuclease, HIPIP, papain, BPTI, ribonuclease S, ferredoxin, and carbonic anhydrase. These proteins are characterized by (κ,τ) distributions showing greater-than-random proportions of extended structure and generally lower-than-random helix/bend structure. We therefore designate this group as the αB cluster. Papain and staphylococcal nuclease occur at the αB-Aβ interface. The former, like thermolysin, has two domains, one helical and one a β-barrel. The latter domain is proportionately larger than the β-domain in thermolysin, and therefore papain falls in the αB cluster.

The location of staphylococcal nuclease is particularly interesting. The function of this enzyme is to bind and cut nucleic acids. It does not fall into the classic nucleotide-binding fold (which characterizes proteins that perform very different functions), yet its (κ,τ) distribution proves to be similar to those of the nucleotide-binding proteins, whose ligands are similar to its substrate.

Another interesting group occurs in the αB cluster, consisting of BPTI, HIPIP, and ferredoxin. These small proteins have been grouped with some others (2) as "small metal-rich" or "small disulfide-rich" proteins.
Richardson (2) has pointed out that these proteins can be viewed as distorted versions of some standard structural type. In fact, the distance function demonstrates that they exhibit a common characteristic in the similarity of their (κ,τ) distributions, so that their grouping has a quantitative basis. Their positions within the αB group are also consonant with their structures. BPTI and HIPIP, which have high proportions of extended structure, border on the β-sheet group (or B group) and show similarity to rubredoxin. Ferredoxin, which has a more significant helical component, falls on the periphery of the αB group. Ribonuclease S, which has a large β-sheet and several smaller helices, is intermediate between ferredoxin and papain.

The B group contains four proteins of widely differing sizes that show almost exclusively β-sheet/barrel structure: human plasma prealbumin, concanavalin A, α-chymotrypsin, and rubredoxin.

It should be emphasized that we are comparing structures at a low level of organization, since the comparison is between distributions of 4-Cα structures, without regard to their position in the sequence. The observed regularities are therefore all the more striking. For example, most of the proteases in the sample (carboxypeptidase A, thermolysin, subtilisin, papain) together with the saccharidase lysozyme occur in the Aβ group. Richardson (2) has pointed out that the proteases, including those in our sample, fall into a number of structural classes that differ widely in their global architecture. It seems, however, that at the lower organizational level represented by the differential geometric distribution a unifying structural principle may be operating, in that these various proteases represent different ways of assembling a protein out of a fairly narrowly defined distribution of 4-Cα structural elements. One such way is to have separate helical and β-structural domains, as in thermolysin.

Similarly, a slightly different (indeed, partly overlapping) range of 4-Cα structural distributions leads to the nucleotide-binding fold and staphylococcal nuclease. Likewise, the two Fe-S redox proteins ferredoxin and HIPIP occur in a relatively localized region.

The differences between structures that are reflected at this level of organization may also be significant. Staphylococcal nuclease and ribonuclease S are shown to have very different structures, despite their similar functions. α-Chymotrypsin is seen to be very different from the other proteases. A particularly intriguing instance is the very large distance between carbonic anhydrase and carboxypeptidase A. The global structure of these proteins is known to be qualitatively similar (2). Quantitatively, however, carbonic anhydrase is on the border between the αB and B groups, whereas carboxypeptidase A is in the Aβ region.

We suggest that the quantitative classification correlates with differences in nucleation, in keeping with the proposals of Richardson (2). As we move from the A cluster through the Aβ and αB clusters to the B region, the primary folding events become less those associated with helix formation and increasingly those associated with formation of extended and β-barrel structure. In certain cases, this hypothesis suggests different folding mechanisms for proteins of qualitatively similar global structure. Thus, carboxypeptidase A and carbonic anhydrase are both parallel α/β proteins (2). The much higher proportion of extended structure in carbonic anhydrase, which places it in the αB cluster, may reflect a predominance of β-sheet formation in the folding mechanism, while the folding of carboxypeptidase A may be dominated by helix nucleation. Similarly, triose-phosphate isomerase is globally a singly-wound parallel β-barrel. The classification results suggest that helix formation may be the predominant nucleating event and that β-barrel formation is controlled by the interactions that lead to parallel packing of the α-helices.

SUMMARY

The comparison of (κ,τ) distributions is a rapid quantitative method for demonstrating relationships between, and thus classifying, proteins. Several points have been demonstrated. (i) The method is capable of detecting the well-known folds even when they occur in proteins of widely differing sizes. (ii) A complete gradation of (κ,τ) distributions is observed, from all-helical to all-β. (iii) Relationships among functionally similar proteins seem to be manifested in their (κ,τ) distributions, even when their global structures are different. Such relationships are noted among the proteases, the nucleotide-binding proteins and staphylococcal nuclease, and the Fe-S redox proteins. It is suggested that these observations represent an organizing principle, in that certain types of protein are characterized by relatively narrowly defined (κ,τ) distributions, the components of which can be arranged in different ways. (iv) It is suggested also that the positions of proteins within the gradation of distributions correspond to differences in folding mechanisms, in keeping with the proposal of Richardson (2). It follows that, in certain cases, proteins with qualitatively similar structures may fold by different mechanisms.

We thank Prof. J. Ellis Bell for helpful comments. This work was supported in part by a research grant from the National Science Foundation (CHE-8509621). It was also supported in part by Grant

DE-FG02-85ER60281 from the U.S. Department of Energy and has been assigned Report no. DOE/EV/03490-2535.

1. Matthews, B. W. & Rossmann, M. G. (1985) Methods Enzymol. 115, 397-420.
2. Richardson, J. S. (1981) Adv. Protein Chem. 34, 167-339.
3. Levitt, M. & Chothia, C. (1976) Nature (London) 261, 552-558.
4. Rao, S. T. & Rossmann, M. G. (1973) J. Mol. Biol. 76, 241-256.
5. Rossmann, M. G. & Argos, P. (1977) J. Mol. Biol. 109, 99-129.
6. Remington, S. J. & Matthews, B. W. (1980) J. Mol. Biol. 140, 77-99.
7. Rackovsky, S. & Scheraga, H. A. (1978) Macromolecules 11, 1168-1174.
8. Rackovsky, S. & Scheraga, H. A. (1980) Macromolecules 13, 1440-1453.
9. Rackovsky, S. & Scheraga, H. A. (1981) Macromolecules 14, 1259-1269.
10. Rackovsky, S. & Scheraga, H. A. (1982) Macromolecules 15, 1340-1346.
11. Rackovsky, S. & Scheraga, H. A. (1984) Acc. Chem. Res. 17, 209-214.
12. Rackovsky, S. & Scheraga, H. A. (1980) Proc. Natl. Acad. Sci. USA 77, 6965-6967.
