Assigning amino acid sequences to 3-dimensional protein folds

Share Embed


Descripción

REVIEW

Assigning protein DANIEL

amino

acid sequences

to 3-dimensional

folds FISCHER,

DANNY

RICE, JAMES

U. BOWIE,

AND DAVID

EISENBERG’

UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, California, Los Angeles, California 90095-1570, USA

With the advent of genome sequencing projects, the amino acid sequences of thousands of proteins are determined every year. Each of these protein sequences must be identifIed with its function and its 3-dimensional structure for us to gain a full understanding of the molecular biology of organisms. To meet this challenge, new methods are being developed for fold recognition, the computational assignment of newly determined amino acid sequences to 3-dimensional protein structures. These methods start with a library of known 3-dimensional target protein structures. The new probe sequence is then aligned to each target protein structure in the library and the compatibility of the sequence for that structure is scored. If a target structure is found to have a significantly high compatibility score, it is assumed that the probe sequence folds in much the same way as the target structure. The fundamental assumptions of this approach are that many different sequences fold in similar ways and there is a relatively high probability that a new sequence possesses a previously observed fold. We review various approaches to fold recognition and break down the process into its main steps: creation of a library of target folds; representation of the folds; alignment of the probe sequence to a target fold using a sequence-to-structure compatibility scoring function; and assessment of significance of compatibility. We emphasize that even though this new field of fold recognition has made rapid progress, technical problems remain to be solved in most of the steps. Standard benchmarks may help identify the problem steps and find solutions to the problems.-Fischer, D., Rice, D., Bowie, J, U., Eisenberg, D. Assigning amino acid sequences to 3-dimensional protein folds. FASEB J. 10, 126-136 (1996) ABSTRACF

Key Words: fold recognilion

THE

IDEA

OF FOLD

inverted protein folding

RECOGNITION

In protein fold recognition (1-4) we ask: “Is the sequence of a protein of unknown structure ‘compatible’ with the fold of a protein of known structure?” Because the methods have been developed in several laboratories (4-11),

126

University of

this procedure of fold recognition is now called by several terms, which include inverted protein folding, 3-dimensional profiles, threading, and sequence-structure compatibility searches. In precise terminology (Table 1), the term fold recognition is used for the process of scanning a database of proteins of known structure in order to find a protein compatible with a probe sequence of unknown structure (sequence recognizes structure). The term inverted protein folding is reserved for the converse process: given a protein of known structure, search a sequence library to identify those sequences compatible with the structure (structure recognizes sequence). In both fold recognition and inverted protein folding, assessing the compatibility of sequence to structure is the central problem. The difference lies in whether the probe is a 3-dimensional structure and the database consists of sequences or the probe is a sequence of unknown structure and the database consists of proteins of known 3-dimensional structure. In this review we concentrate on protein fold recognition. However, because the principles of fold recognition and inverted folding are identical, our review covers some work originally conceived as inverted protein folding. In protein fold recognition we assess whether a sequence is compatible with one of the structures in a library of known folds. The compatibility measure is obtained by optimally aligning (threading) the new sequence with each structure and selecting the structure whose alignment score is the highest. As X-ray and nuclear magnetic resonance (NMR) methods reveal more of the total biological repertoire of protein folds, the probability of assigning each new sequence to a known fold grows. The goal of fold recognition is to use a known structure to model the fold of a new sequence rather than predicting the new structure from sequence information alone (ab initio folding). In fold recognition, rather than searching the vast conformational space of the new protein, the search space is confined to the conformations of known structures. Although this reduction in search space appears to simplify the problem, fold recognition is not simple. As we discuss below, not only does fold rec-

To whom correspondence and reprint requests should be addressed, at: Box 951570, UCLA, Los Angeles, CA 90095-1570, USA.

0892.6638/96/0010-01

26/$01 .50 © FASEB

REVIEW TABLE

1. Terminology

offold

recognition

Term

Pmcess

Fold recognition

Given

Related

a probe

compatible target

Inverted protein folding

sequence,

search

fold in a libraiy

for a

of known

folds,

concepts

given

sequence

initial

models

Given a fold, search for compatible

Protein

sequences

searched

in a library

of sequences.

informally

without

all sequences

are

fold

protein

(4, 35)

compatibility Structure recognizes sequence Structure-sequence

to both inverted the term

recognition,

folding

terms

Profilescan (54) Threading (6) Sequence-structure compatibility Sequence recognizes structure Profilesearch

given

structure-sequence is central

and

inverted

ognition share some of the difficulties of direct protein folding, but it also involves additional problems. Nevertheless, current methods of fold recognition can assign some sequences to folds in cases for which other approaches to protein folding are ineffective, and there is hope that as problems are solved, fold recognition will become increasingly powerful for learning new structures. Fold recognition shares conceptual similarities and methods with classical sequence-sequence comparison. In classical sequence comparison, a sequence is assigned to a structure by establishing a similarity of the new sequence to some sequence of known structure. The underlying rationale is based on past experience that if the similarity is sufficiently large (>20-30% sequence identity with few gaps in the alignment), the folds are basically similar. At lower levels of sequence identity, however, the inference of similar folds cannot be made, because it is difficult to distinguish similarities occurring by chance from the ones that occur because of similar folds (12). There are many examples of pairs of proteins with insignificant sequence similarity that adopt a similar fold (13-15), so that there are many cases where sequence comparison methods may fail when a similarity in fold exists. Indeed, it has been suggested that there is a 70% probability that each new sequence having less than 25% identity to any other sequence possesses a previously observed fold (16). In these cases the fold recognition methods are apt to be more useful than sequence comparison methods. Sequence-sequence comparison methods use the amino acid sequences only and do not exploit 3-dimensional structural information. In 1991, Bowie et a!. (4) developed an alternative method: instead of scoring the compatibility of a sequence by comparing it to the sequence of a known structure, the sequence is compared with the structural information of a known 3-dimensional structure. This method was termed 3-dimensional profiles (4). Soon a variety of fold recognition methods were published (6-9, 17-20), and several reviews on these methods appeared (21-24). Here, we focus on analyzing the essential components of fold recognition in general and suggest the use of a benchmark to assess their performance.

are searched

of a

for folds.

for ones that fold a

compatibility folding

all conformations

design:

way. Because

FOLD RECOGNITION

Related

Protein folding:

is often used

for both processes.

Given the variety of methods, it is hard to compare them based only on the published papers. Typically, each new method is published accompanied by a limited number of examples. Each method has its strengths, but these can be difficult to decipher because the performance of each method depends on the optimization of many parameters. We emphasize that the use of a benchmark to assess the performance of fold recognition methods can be helpful in evaluating existing methods and developing new ones. In what follows we describe the essential components of fold recognition methods. Subsequently, we describe the benchmark we use to assess performance and present some results of the evaluations carried out using the benchmark. ESSENTIAL COMPONENTS RECOGNITION METHOD

OF A FOLD

Figure 1 depicts the essential components of fold recognition. In fold recognition, a sequence is aligned to each structure in a library of folds. The structures are represented in one of several possible ways. Using a compatibility scoring function, an alignment algorithm searches for the sequence-structure alignment with the highest score. Then, the scores for the alignment of the sequence to each structure in the library of folds are ranked to find the most probable fold for the sequence. Finally, the significance of the scores is assessed. The five main components of protein fold recognition are: 1) the library of known target folds; 2) the representation of the folds of the proteins; 3) the scoring function used to evaluate the compatibility between the probe sequence and a target fold; 4) the algorithm used to search for the optimal alignment of the probe sequence to each target fold; and 5) the ranking of compatibility scores for sequence-fold pairs and the way significance is assessed. Library

of known

folds

Building a library of known ward step of fold recognition.

folds is the most straightforA library containing all the

127

REVIEW highly similar. The current PDB contains the structures of more than 3000 protein chains, of which more than 2000 are highly similar to at least one other chain. Several representative fold libraries have been derived (15, 26-28), using different criteria for the similarity of two folds. Current fold libraries contain between 300 and 600 entries, depending on the similarity threshold applied. A technical difficulty in building a library of known folds is the decision about what should be considered a distinct fold. A PDB entry may contain the structures of two or more chains, and a chain may be composed of two or more domains. A library of known folds built at the level of individual chains and/or domains may be more useful in fold recognition than a more restricted library whose entities are full chains. Representation

of the protein

The representation of the protein structure (Table 2) can be an all-atom structure, a backbone structure, a string of n-carbon atoms, a string of structural environments, a set of inter-residue distances, or in the simplest case, a string of amino acid types (that is, a sequence). Some more complex representations are a combination of these basic representations. The representation of the protein varies according to the type of compatibility function used, as is described in the next subsection. Sequence-to-structure function

YOUR SEQUENCE IS COMPATIBLE TO A GLOBIN FOLD AT A CONFIDENCE

LEVEL

OF 83%

Figure 1. The fold recognition process. A probe sequence is aligned to each structure in a library of target folds. The sequence-structure aligninent is carried out by an alignment algorithm that searches for the alignment which achieves the highest score, using a compatibility function. Then these scores for all sequence-structure alignments are ranked, and a significance is assessed to find the probability that the highest ranking fold is the structure of the sequence. The five essential components of most fold recognition methods are shown: 1) the library of known folds, 2) the representation of the protein, 3) the function used to evaluate the compatibility between the probe sequence and a known fold, 4) the algorithm used to search for the optimal alignment, and 5) the way ranking is computed and the way significance is assessed.

known folds is desirable, and in theory the library could contain all the entries in the Protein Data Bank (PDB)2 (25). But there is a tradeoff between the size of the library and the time needed to scan it: fold recognition generally uses a representative set that covers all the unique folds but excludes those PDB entries that are

2Abbreviations:

PDB, Protein

mensional.

128

Vol. 10

January

1996

Data Bank; 3D-i D, 3-dimensional-i-di-

compatibility

scoring

The compatibility function is used by the alignment algorithm to find the sequence-structure alignment with the highest score (see Fig. 1). The various compatibility functions used in fold recognition can roughly be classified into two types: unipositional and multipositional (Table 3). In general, the compatibility function associates a score to the match of one (or more) amino acid from the probe sequence to one (or more) structural position in the target fold. Each structural position is characterized by some physicochemical characteristics. For example, the properties may include the amino acid type at that position, the local secondary structure of the chain, the solvent accessibility of the position, or the nature and distances of neighboring residues. In general, the values of the compatibility function are statistically derived from str,ucture and sequence data. The principle is to derive a score that reflects the probability of observing each of the 20 amino acids in the position (or positions) of interest, The score is commonly computed as the logarithm of the ratio of the observed over the expected, similar to the way sequence-sequence substitution tables are derived (29). A positive score is a usual occurrence, a negative score is an unusual one. The main difference between uni- and multipositional compatibility functions lies in the way the occurrences are counted. For unipositional functions, the occurrences are counted at each position independently of other posi-

The FASEB Journal

FISCHERFTAL.

REVIEW TABLE

Protein representation in

2.

Authors Sippl

(17,

36)

Bowie et al. (4)

Yi and Lander

(31)

methods

of fold

recognition

Pa

Description

#

Pairs of structural positions (i, j), indicating the 3D distance and sequence separation of the residue in position i with the residue in positionj. Amino acid sequence for the frozen approximation.

Y

String of 18 discrete environments described in terms of area buried,

where the environment of each position polarity, and local secondary structure.

String

of Bowie et al.’s (4) 18 environments]

Y

of environments

[extension

in the fold is

plus amino

acid

sequence.

Zhang

Jones

et al. (32)

Y

t

et al. (6)

String of triplets (ss, b, p) characterizing the environment: ss is secondary structure, b and p are real numbers denoting the area buried and fractional polar area of each residue in the structure [continuous representation of Bowie et al.’s (4) environments].

Similar

(36) [pairs of structural positions (i, f), 3D distance, and sequence andj] plus string of residue accessibilities (the amino acid sequence of

to Sippl

between

i

separation the structure

not used).

Matsuo

and Nishikawa

(11)

Y

String of triplets (solv, hb, bc), where solv is the solvent accessibility, hb is hydrogen-bonded or not, and bc is the local backbone conformation. Also pairs (i, J) of residues in contact, plus the 3D distance and angle between i andj. Amino acid sequence of the structure for the frozen approximation.

Godzik

Ouzonis

Y

(8)

et al. (7)

Y

Pair and triplet residue strucutre for the frozen

String

of structural

structure

Wilmanns

et al.

Y

(37)

and Lawrence

(9)

environments

type and five interface

defined

types:

Pairs

of area buried.

by a contact

helix,

Pairs (i, J) of interacting residues. Structural structure, and 001 numbers) are computed frozen

Bryant

contacts plus string approximation.

beta,

Amino

interface

acid sequence

vector.

Considers

of the

secondary

turn, loop, and water.

property classes (dihedral angles, secondary for each pair. Amino acid of the structure for the

approximation.

(i,j)

of residues

Core segments

in contact,

and allowed

3D

distance

between

i andj.

loop lengths.

* stands for profile-like format. Y indicates the representation is combined with the compatibility function to produce a profile (if pairwise interactions are considered, the frozen approximation is used). #Although the original method does not use a profile, it can he cast in a profile-like format (using the frozen approximation). As a 1-positional function is used in each level of the alignment algorithm, a profile representation could be used, although in this case it may not be convenient (would require n X m profiles for each sequence-structure pair).

tions. For multipositional functions, the occurrences counted at more than one position simultaneously.

Unipositional

compatibility

are

functions

A unipositional compatibility function evaluates compatibility of sequence to fold at each position in the alignment independently of other positions. The function specifies the compatibility value of aligning a given amino acid from the sequence to one position in the fold. For example, a compatibility function can be a 20 by 20 table of scores for matching residue to residue [such as Dayhoffs (29) or Gonnet’s (30) substitution matrices], as in sequence-sequence comparison methods. In this case,

FOLD RECOGNITION

the “structural” property considered is simply the amino acid type at that position. Alternatively, the compatibility function can relate a residue to a structural environment [sometimes called 3dimensional-i-dimensional (3D-iD) scores (4)]. Although a structural environment is defined at each position of the structure, its properties may depend on the nature of neighboring residues. These structural neighbors include residues close in space but not necessarily close in the amino acid chain. In Bowie et a!. (4), the structural environments are classified into 18 discrete classes. This allows one to represent the 3-dimensional protein structure as a string of environments. The compatibility function in this case is an 18 by 20 table of the 3D-iD scores for each of the 20 amino acids in the 18

129

REVIEW TABLE

3. The compatibility

functions

______________ Authors Sippl

in methods

offobd

recognition

Description 2-positional. Pairwise potentials relating two amino acid types to the pair (k, r), where k is a 3D distance. Given an alignment, the compatibility function is evaluated as the of all pairs of aligned positions (i, J) in the structure by taking their sequence separation, r. The frozen approximation converts to a 1-positional function by summing for each i the positionsj in the structure (and using the amino acid at positionj of the structure).

(17, 36)

and r

Bowie et at. (4)

1-positional.

20 x 18 3D-iD

Yi and Lander (31)

1-positional.

20

Zhang et al.

1-positional. Real function relating each amino acid type to the triplet (ss, b, p), where structure type, b is the area buried, and p is the fraction polar (see Table 2).

(32)

x

table

relating

20 sequence-sequence

each amino

table

acid type to each structural

(30) combined

with a 3D-iD

is a sequence separation sum of the contributions k, and their 3D distance, contributions of all other

environment.

profile.

ss is the secondary

2-positional. Pairwise potential similar to Sippl (36). Pairwise interactions beyond 10 A are substituted with a solvation potential. The double dynamic programming algorithm converts the 2-positional compatibility function to a 1-positional one by considering the alignments that match every position i of the sequence with every positionj

Jones et a!. (6)

of the structure.

Matsuo

and Nishikawa

Godzik

(8)

Bryant

1-positional similar to Bowie’s 3D-iD using the frozen approximation.

1-,

Ouzonis et al. Wilmanns

(11)

(7)

and

1-positional.

3-positional

Relates

contact

each amino

potential

Vol. 10

January

converted

acid type to each

(9)

2-positional. randomized

Contact sequences.

potential

relating

discrete environmental classes. Each score specifies the compatibility value of aligning an amino acid to an environment. Other authors use a larger number of environment classes (31) and combine the 3D-iD table with a sequence-sequence table. Instead of using 18 discrete environments, in Zhang and Eisenberg’s work (32) amino acid residue preferences are expressed as a continuous function of environmental variables, avoiding the abrupt change of preference of residues in slightly different environments. Other workers define unipositional compatibility functions based on different structural properties (7, 18). For example, Wilmanns and Eisenberg (18) use a compatibility function based on preferences of a residue for neighbors of specific residue types, main-chain conformations, or secondary structure. In addition, Wilmanns and Eisenberg (i8) consider combined profiles including the original Bowie et al. (4) profiles with three new types. The optimal alignment of a sequence and a structure using a unipositional compatibility function can be computed using dynamic programming algorithms (33, 34) (see below). Any unipositional compatibility function per-

130

combined

with a 2-positional

to 1-positional

structural

2-positional. Three tables for each of the three structural number) relating two amino acid types to a pair of property approximation. Can be combined with a sequence-sequence

et al. (37)

and Lawrence

2-,

scores

1996

two amino

using

environment

function.

Converted

the frozen

(called

to a 1-positional

approximation.

a contact

interface).

properties (dihedral angles, secondary structure, 001 classes. Converted to i-positional using the frozen table and/or with Bowie’s (4) profiles.

acid types

to a 3D distance.

Normalized

with respect

to

mits the combination of the representation of the structure with the compatibility function to form a table called a profile (35). A profile is a table containing n rows and m columns: n is the number of positions in the structure, with each row corresponding to a residue position in the structure; m is normally 20, corresponding to each of the 20 amino acids. The entry (ij) in the profile specifies the value of the compatibility of aligning a residue of type j (from the sequence) to the environment at position i of the structure (j is the index of the 20 amino acids, e.g., i for Ala, 2 for Cys, etc.) (see Fig. 2). The compatibility of the sequence for the 3-dimensional fold is the sum of the individual residue compatibilities in the alignment, corrected for gap penalties (see below). Multipositional

compatibility

functions

We denote the second class of compatibility functions as multipositional, because in such functions the compatibility is computed by considering two or more positions in the alignment at the same time. Such functions attempt to describe in some simplified way the numerous 3-dimensional

The FASEB Journal

FISCHER FT AL.

REVIEW interactions that operate in 3-dimensional protein folds. A common, simplified representation of the protein is a set of pairwise interatomic energy terms, with the structural role of any given residue described in terms of its interactions (22). Both the 3-dimensional distance and the sequence separation between the components of each pair may be included. One form of such a nonlocal function is found in the pairwise potentials developed by Sippl (36). Given two positions in the structure, i andj, at a given sequence separation and a given 3-dimensional distance, the potentials specify the compatibility value of matching two amino acids from the sequence into positions i and j of the structure. Jones et al. (6) use Sippl-like potentials but supplement them with a solvent-accessibility term. Bryant and Lawrence (9) use a 2-positional compatibility function in which pairs are either in contact or not. Godzik et al. (8) use a unary, binary, and tertiary interaction compatibility function to build a “topology fingerprint” that considers the area buried of each residue plus the pairs and triplets of residues that are in contact. In Wilmanns and Eisenberg (37), an iterative alignment procedure is used with a local environmental weighting in which each residue-pair preference value is weighted by the compatibility of the environmental residue with its own environment. The main difficulty of multipositional compatibility functions arises when using them to compute an optimal alignment. Indeed, it has been demonstrated that computing an optimal alignment with them is an NP-complete problem (38), which in practice means that the time required to find the solution grows exponentially as the size of the protein increases. Thus, methods using multipositional compatibility functions simplify the problem by some approximation or another. Some of these methods approximate the multipositional compatibility function as a unipositional one, thus converting the problem to a linear one, which in turn allows them to use the dynamic programming algorithm. Such methods are in principle based on a multipositional compatibility function derived from structural and/or sequence data, but in practice, substitute a “converted” unipositional function to find the optimal alignment. These approximations are described in the next section. Because of the lack of an optimal alignment algorithm for true multipositional compatibility functions, it is not straightforward to determine the strengths and limitations of such compatibility functions. The selectivity and sensitivity of a method using a multipositional function coupled with approximations may be limited due to either the function itself or the approximations applied.

structure described above). Methods based on multipositional compatibility functions either convert the function to a unipositional one, and then use the dynamic programming algorithm, or require a different alignment algorithm. Here we distinguish between two rough classes of alignment algorithms: dynamic programming-based and nondynamic programming-based. Dynamic

programming-based

The dynamic programming algorithm (33, 34) is an efficient and well-known technique for finding an optimal alignment, using a unipositional compatibility function (a unipositional compatibility function evaluates each position of the alignment independently of the other positions in the alignment, i.e. it assigns a compatibility value for the alignment of each residue of the sequence to one structural position of the known fold; see foregoing text). Despite its popularity, the dynamic programming algorithm is not straightforward to apply. First, a set of gap penalties must be specified. When insertions/deletions are allowed in the alignment, there is no analytical method to determine the optimal gap penalties (39). Furthermore, the optimal gap penalties depend on each probe sequence, and small changes in gap penalties may lead to different optimal alignments (40). The gaps in the alignment represent deletions in the structure or sequence that are necessary to achieve best compatibility and which, in terms of the alignment algorithm, are “noncompatible” regions. The score of the resulting alignment is computed as the sum of the compatibility value of each aligned pair minus a penalty for each gap. The compatibility values of the aligned residues are derived from a model (see below); the gap penalties are assigned parameters, not determined from a molecular model. Thus, although the dynamic programming algorithm does compute the optimal alignment, the alignment depends on the arbitrary gap penalty values. Second, there are different variations of the dynamic programming algorithm to chose from; for example, the “global” alignment (34), the “local” alignment (33), or the “global-local” alignment (41). Each algorithm was devised for a particular type of alignment, and each has its pros and cons. A detailed comparison between these algorithms in the context of fold recognition is beyond the scope of this review and will be presented elsewhere. Multipositional

The algorithm alignment

used

to search

As was pointed out earlier, the choice of algorithm to obtain an optimal alignment depends on the type of compatibility function used by the method. Methods based on unipositional compatibility functions [such as Bowie’s (4)] can use a dynamic programming algorithm directly to find the optimal alignment (and use the profile data

FOLD RECOGNITION

to unipositional

conversion

for the optimal Some recent publications are based on multipositional compatibility functions, such as pairwise potentials (17). However, as noted before, because computing the optimal alignment for such functions is an NP-complete problem (38), these methods require approximations [such as the so-called frozen approximation (8, 17)] to make the search feasible. This effectively converts the original multipositional compatibility function to a unipositional

131

REVIEW

3D-iD Determine structural parameters for each residue in the structure

parameters,

postion

parameters,

postion I

parameters,,

postion

i-i

i+1

Amino

Acid

Profile Gap

Type

Penalties A

C

. . .

D

W

Y

OPN

Sequence-to-structure compatibility scores

10

-24 22

-50

101

87 -132

121

-92

-75

100

10

-95

182

167

100

10

100

10

34-11

-5.

54

76

Figure 2. The use of 3D-iD profiles in fold recognition (4). The representation of the protein and a unipositional compatibility function can be combined into a convenient data structure called a profile. Using a profile, the application of a dynamic programming algorithm is straightforward.

one and allows the application of any dynamic programming algorithm. In the frozen approximation of a 2-positional function, the compatibility value of aligning two amino acids from the sequence to two positions in the structure is computed using only one amino acid from the sequence and taking the second one from the structure. The interaction of residue i with other residues j in the structure is approximated by considering the amino acid type that is in position j of the structure. This is an approximation because the probe amino acid that will actually be aligned to positionj is not necessarily the same as the one lying at position j of the structure. Using this approximation, the “pairwise” interactions are converted to a unipositional compatibility function and the dynamic programming algorithm can be used. To overcome some limitations of this frozen approximation, some methods iteratively update the amino acid type at each position of the structure after each alignment is obtained (8, 37). This correction has been found to improve somewhat the correctness of the alignment of the probe sequence to the target in test cases when both the target and probe are known (37). 132

Vol. 10

January 1996

It is still unknown to what extent the limitations of fold recognition methods based on multipositional compatibility functions result from the crudeness of the frozen approximation or the lack of sensitivity or selectivity of the compatibility function itself. Thus, although based on apparently more sophisticated compatibility functions, in practice most of these methods use a unipositional compatibility function and still apply the dynamic programming algorithm to find the optimal alignment. In a different approach, Jones et al. (6) approximate the pairwise interactions by using the dynamic programming algorithm at two levels (42). At each level, an equivalent to a unipositional compatibility function is used to direct the search. Being a heuristic algorithm, it does not ensure that the optimum is found; however, it is reported to work well in practice. Nondynamic

programming-based

The works of Bryant and Lawrence (9) and Lathrop and Smith (43) propose algorithms to find the optimal alignment using a truly 2-positional compatibility function. These methods do not convert the 2-positional function to

The FASEB Journal

FISCHER

EF AL.

REVIEW a unipositional one. However, other approximations are required. One approximation that dramatically reduces the search space (and makes the approach feasible) is that the alignment of the sequence to the structure is carried out only on predefined segments of the structure, requiring each segment to be aligned and allowing gaps only between segments. Another approximation used by Bryant and Lawrence is the application of a Monte Carlo sampling algorithm (although in earlier work the optimum was found by enumeration). In contrast, Lathrop and Smith (43) developed a rigorous branch and bound algorithm. How ranking assessed

is computed

and signilhance

is

Once the scores for the compatibility of the probe sequence to each target fold are computed, those scores must be ranked. The simplest ranking is a sorting in decreasing order of the raw scores (6). But because the raw scores can depend in part on the length and amino acid composition of the sequence, it may help to normalize the scores for these effects (44, 45). The pattern of normalization differs among the methods. Some normalize by the distribution of scores obtained by aligning many sequences of similar length (4, 32), others normalize by shuffling the probe sequence many times (i.e., random sequences of the same length and composition as the probe sequence) and aligning each to the structure (9, 31). Another computes the distribution from aligning the sequence to an artificially constructed polyprotein (17). Whatever the normalization procedure, the final scores are ranked in decreasing order. When comparing a probe sequence against a library of target folds, there will always be a fold with the highest compatibility score. Then the key question is whether this compatibility score is high enough that there is a significant probability that the probe sequence has this fold. This question is usually answered by considering a measure of statistical significance called a z-score. The z-score is the number of standard deviations that a score is above the mean score for all matches. The z-score is not strictly a measure of the statistical significance of the match. The reason is that in order to obtain a probability from a zscore, some assumption about the distribution of the scores must be made. This is not straightforward, as the distribution varies in shape depending on the compatibility function, the optimal alignment, the library of known folds, and the particular probe sequence. Thus, in practice, the z-scores are used simply to compare the score of a match to a background of scores, without necessarilly converting it to a probability. However, tests of probe sequences of known structure permit an estimate of the zscore above which a false-positive is unlikely to occur. Methods that rank the results using a z-score based on distributions as the ones described above automatically attach to the result a reliability measure. From tests on probe sequences of known structure, z-scores above 6-7

FOLD RECOGNITION

often represent reliable assignments. Methods that do not rank the results using a z-score (but, for example, use the raw scores to rank the results) must attach some other measure of significance to the results. Some of these methods assess significance by considering the distribution of the scores of the alignments of the probe sequence with each library fold and compute a z-score based on this distribution. In this case, a z-score above 3 often represents a reliable prediction (6, 41). Each component of fold recognition discussed above involves approximations and parameters. Selecting the best approximations and parameters is crucial to success, but is hindered by the complexity of the entire procedure. Recent improvements in fold recognition have concentrated mostly on devising new, possibly more sensitive, compatibility functions. Less attention has been focused on devising new optimal alignment algorithms or ranking procedures, but improvements in both components are likely to increase the sensitivity and selectivity of current methods.

IS FOLD RECOGNITION SIMPLER INITIO PROTEIN FOLDING?

THAN

AB

Fold recognition has been considered to be simpler than ab initio protein folding. The reason is that in ab initio protein folding only the sequence of the protein is available and a vast conformational space needs to be searched to find the structure. In contrast, in fold recognition the search space is confined to the conformations of a few hundred target structures, and the only question is which, if any, is compatible with the probe sequence. However, fold recognition has serious difficulties also (Table 4), namely: 1) neither unipositional nor multipositional compatibility functions are yet sensitive enough to detect all compatible sequence-structure pairs; 2) although in some applications multipositional compatibility functions appear powerful, computing an optimal alignment without the use of approximations is practically impossible [(it is an NP-complete problem (38)1; and 3) the actual structure of the probe sequence may differ substantially from the target structure (up to 5 A atomic deviations from the known structure). However, the compatibility is evaluated using the target structure itself. Thus, aligning a sequence with a target fold can involve large approximations. In addition, the structural characteristics of the target fold are computed using the full structure, whereas the final alignment may contain gaps in the structure (residues from the structure aligned to gaps in the probe sequence). Although fold recognition methods have had some definite successes (where sequence-sequence comparison fails), the difficulties that remain still make it a substantial challenge. The difficulty of computing the optimal alignment of probe sequence to the 3-dimensional target structure is severe. For multipositional compatibility functions, approximations are required. It is even difficult to determine the component

133

REVIEW TABLE 4. Comoarison of fold recognition

to ab

initio

folding

Ab initio folding Need effective

Difficulties

Must skirt

Fold recognition potential

function.

local minima.

Absence of efficient algorithms.

Need effective

Must screen

search

space

function.

out false positives.

Absence of efficient search algorithms if pairwise interactions without approximations (e.g., the frozen approximation). If dynamic

Search

compatibility

is vast.

Search

programming

space

is used,

is vast if pairwise

gap penalties interactions

need

are used

are used

be specified. without

approximations.

Simplifying

features

Search

constrained

to a few hundred

The fold of a new sequence (70%

Limitations

strengths and weaknesses of a method using such a function. Can they be attributed to the algorithm used in the search or to the compatibility function itself, or to both? For unipositional functions, the optimal alignment is easily computed using a dynamic programming algorithm, provided optimal gap penalty values are given. However, determining the optimal gap penalties for a given compatibility function is an unsolved problem (39). In addition, the type of dynamic programming method used is important: should it search for a local (33), a global (34), or a global-local alignment (41)? What type of score normalization should be used? How is the significance of the score computed? An approach to overcoming these difficulties is discussed in the following. BENCHMARK Because of the multiple component steps of fold recognition, it is not easy to identify the strengths or weaknesses of different methods. A method having a good compatibility function but a poor algorithm to compute the optimal alignment may perform poorly, and vice versa. And in each method, many parameters must be optimized in each of the component steps. To aid in improving and comparing fold recognition methods, we have developed a benchmark that assesses the performance of fold recognition methods in an unbiased and thorough way (41). The benchmark is independent of the representation of the protein, the compatibility function, the search algorithm, and the ranking procedure used in the method being evaluated. That is, it can be applied to any method that has four of the five components described above (the library of known folds is provided by the benchmark). This benchmark offers a standard, similar to benchmarks currently used to assess sequence-sequence alignment (e.g., refs 46, 47) and secondary structure prediction (e.g., ref 48).

134

Vol. 10

January 1996

of the

unique

new

is likely sequences

Cannot

predict

novel folds.

Actual

structure

of probe

sequence

structures

in the

to be similar have

differs

a previously

library.

to one in the library observed

from the target

fold).

structure.

This benchmark may also aid in determining the strengths or weaknesses of different fold recognition methods. The benchmark allows systematic evaluation of the relative merits of each component of a fold recognition method. This can be achieved by varying one coruponent at a time, leaving the others unchanged. The benchmark addresses two main issues: sensitivity and selectivity. Sensitivity is the ability to calculate high-ranking scores for the compatibility of the probe sequence with the correct target fold and selectivity is the ability to calculate low-ranking scores for unrelated folds. In addition, alignment accuracy must also be evaluated, but is not currently included in the benchmark. Other computational aspects that a benchmark can grade are computer time and space requirements, aspects of practical importance. The benchmark (41) consists of a library of about 300 known target structures and a set of 68 probe sequences that cover a wide range of structural classes and folds. The most compatible target fold for each probe sequence is known to the benchmark (and is precomputed by structural comparison; see ref 41 for details). The sequence identity between the probe sequence and the sequence of the target fold is below 30% for all pairs. For each of the 68 probe sequences, the evaluated method scans the library of known folds and produces a ranked list of compatibilities. The benchmark registers the rank that the most compatible fold achieved. An overall sensitivity and selectivity score is computed for all the 68 probe sequences. We have demonstrated (41) the utility of this benchmark by comparing the performance of different compatibility functions, using the global-local alignment algorithm and the raw scores to rank the results. Different compatibility functions were compared using the local alignment algorithm coupled with a statistical measure to rank the results (Elofsson et al., unpublished results).

The FASEB Journal

FISCHER EF AL.

REVIEW From these preliminary results we have found that a compatibility function combining 3D-iD scores with sequence information and pairwise interactions performs better than functions based on the individual components. In addition, we have found that the global-local algorithm is superior to other alignment algorithms, especially when the library of known folds is built at the domain level. Using the global-local algorithm and a combined 3D-iD profile we are able to identify the correct target fold for about two thirds of the probe sequences, including numerous cases where there is no significant sequence identity.

SUMMARY Fold recognition is a computational procedure for assigning a probe amino acid sequence, or part of a sequence, to a known protein fold. The first step in fold recognition is to represent known 3-dimensional protein structures in a form that permits computational comparison with a sequence. Then the probe sequence is aligned to the 3-dimensional structure, associating each residue of the probe sequence with a position in the 3-dimensional fold, and the compatibility of the sequence with the fold is quantified using a compatibility function. The values of the compatibility function are statistically derived from structure and sequence data. In this way, the compatibility of the sequence with every fold of the library is computed, and the compatibility scores are compared to find the fold with which the sequence is most compatible. The goal is to assign the sequence to a fold and to be able to estimate the probability of the correctness of this assignment. One practical procedure for assigning the probe sequence to 3-dimensional folds is the 3-dimensional profile method (4) in which 3-dimensional protein structures are represented as a string of environments, one for each position in the structure. The profile is a table. For the environment of each position, the profile gives the likelihood of finding each of the 20 coded amino acid residues at that position. Given a probe sequence, dynamic programming finds the best alignment of the sequence to the profile. By summing the compatibility scores for each residue in the environment to which it is assigned by the alignment, a score is found for the compatibility of the probe sequence to the profiled structure. Fold recognition has demonstrated early successes. For example, Bowie et al. (4) found that a 3-dimensional profile prepared from the 3-dimensional structure of the target protein actin gives significant compatibility scores for several heat shock proteins. Although the heat shock proteins have no significant sequence similarity to actin, the heat shock protein fold is known from X-ray crystallography to be similar (49). Bowie et al. (4) also found a significant compatibility between the amino acid sequence of the pur repressor and the structure of sugarbinding proteins. Later crystallographic work (50)

FOLD RECOGNITION

confirmed that the pur repressor in fact has a structure similar to the sugar-binding proteins. Related fold recognition methods have shown many other successes. Jones et al. (6) were able to detect several known distant relationships between sequences and structures. For example, they found a high compatibility score for the sequence of C-phycocyanin to structure of myoglobin (51, 52) and for some distantly related TIM barrel and -trefoil folds (22). Similarly, Sippi et al. (53) have reported several successes. For example, they have found the structural similarity of ADP-ribosylation factor and Rasp2l. Other recent partial successes were observed in the Asilomar prediction experiment organized by J. Moult and others. The folds of several sequences (with insignificant sequence similarity to the sequences of any protein of known structure) were correctly identified by various groups [an entire issue of the journal Proteins (volume 23, number 3, 1995) is dedicated to this experiment]. But to date, no fold recognition procedure is able, with a single set of parameters, to find high compatibilities for all known distantly related structure-sequence pairs. In this review we have discussed the reasons for this current state of fold recognition and have suggested possible paths toward improvements. For better success in recognizing distant relationships between amino acid sequences and known structures, improvements are needed in most components of fold recognition, especially in the compatibility functions and in the sequence-structure alignment algorithms. Systematic applications of benchmarks are likely to help. For many contributions at earlier stages of this work drew D. McLachlan, Michael Gribskov, Roland Luthy, manns, and Kam Zhang. For support, we thank the Energy and the Program in Mathematics and Molecular

we thank AnMatthias WitDepartment of Biology.

REFERENCES 1.

2. 3.

4.

5. 6. 7.

Drexler, K. E. (1981) Molecularengineering: an approach to the development of general capabilities for molecular manipulation. Proc. Nat!. Acad. Sci. USA 78, 5275-5278 Pabo, C. (1973) Designing proteins and peptides. Nature (London) 301,200 Ponder, J. W.. and Richards, F. M. (1987) Tertiary templates for proteins: Use of packing criteria in the enumeration of allowed sequences for different structural classes.). Mo!. BioS. 193, 775-791 Bowie, J. U., Luthy, R., and Eisenberg, E. (1991) A method to identify protein sequences that fold into a known 3-dimensional structure. Science 253, 164-170 Luthy, R., McLachlan, A. D., and Eisenberg, D. (1991) Secondary structurebased profiles: Use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10,220-239 Jones, D. 1., Taylor, W. R., and Thornton, J. M. (1992) A new approach to protein fold recognition. Nature (London) 358,86-89 Ouzounis, C., Sander, C., Scharf, M., and Schneider, R. (1993) Prediction of protein quences

structure by evaluation of sequence-structure fitness. Aligning seto contact profiles derived from 3D structures.). Mo!. Blot. 232,

805-825 8. 9.

10.

Godzik, A., Kolinski, A., and Skolnick, approach to the inverse folding problem.). Bryant, S. H., and Lawrence, C. E. (1993)

J. (1992)

Topology

fingerprint

Mo!. BioS. 227, 227-238 An empirical

energy

function

for

threading protein sequence through folding motif. Proteins 16, 92-112 Rooman, M. J.. Kocher, J. P., and Wodak, S. J. (1992) Extracting information of folding from the amino acid sequence: accurate predictions for protein regions with stable conformation in absence of tertiary interactions. Biochemistry 31,

10226-10238

135

HVIW 11.

Matsuo, Y., and Nishikawa, K. (1994) Protein structural dicted by a sequence-structure compatibility method.

similarities preProtein Sci. 3,

34.

12.

Doolittle, R. F. (1986) Of Uris and Orfs: A primer on How to Analyze Derived Amino Acid Sequences, University Science Books, Mill Valley, California Orengo, C. A. Flores, 1. P., Jones, D. T.,Taylor, W. R., and Thornton, J. M. (1993) Recurring structural motifs in proteins with different functions. Curr. BioS. 6, 131-139 Holm, L, and Sander.C. (1994) Searching protein structure databases has come of age. Proteins 19, 165-173 Fischer, D.Tsai, C. J., and Nussinov, R. (1995) A 3-D sequence-independent representation of the Protein Data Bank. Protein Eng. In press Orengo, C. A., Jones. D. T., and Thornton. J. M. (1994) Protein superfamilies and domain superfolds. Nature (London) 372.631-634 Sippl. M. J., and Weitckus, 5. (1992) Detection of native like models for amino acid sequences of unknown three dimensional structure in a data base of known protein conformations. Proteins 13, 258-271

35.

Wilmanna, M., and Eisenberg, D. (1993) 3-dimensional profiles due-pair preferences: Identification of sequences with n/a-barrel Nat!. Acad. Sci. USA 90, 1379-1383 Krogh, A., Brown. M.,Mian, I. S.,Solander. K.. and Haussler, Hidden markov models in computational biology: Applications

41.

2055-2063

13. 14. 15. 16. 17.

18.

19.

20.

21. 22. 23. 24.

25.

26. 27.

28.

29.

485-500 Dayhoff. (Dayhoff.

31. 32.

33.

D. (1994) to protein

modeling.). Mo!. BioS. 235, 1501-1531 Goldstein. R. A., Luthey-Schulten. Z. A., and Wolynes, P.C. (1992) Protein tertiary structure recognition using optimized hamiltonians with local interactions. Proc. Nat!. Acad. Sci. USA 89, 9029-9033 Bowie, J. U., and Eisenberg, D. (1993) Inverted protein structure prediction. Curr. Opin. Struc. Bio!. 3, 437-444 Jones, D., and Thornton, J. (1993) Protein fold recognition.). Coniput. Aided Mo!. Desig. 7.439-456 Wodak, S. J.. and Rooman, M. J. (1993) Generating and testing protein folds. Curr. Opin. Struct. BioS. 3, 247-259 Kocher, J. P. A., Rooman, M. J., and Wodak, S. J. (1994) Factors influencing the ability of knowledge-based potential to identify native sequence-structure matches. J. Mo!. 8w!. 235, 1598-1613 Bernstein. F. C., Koetzle. 1. F., Williams, G. J. B., Meyer, E. F., Brice. M. D.. Rodgers, J. R.. Kennard, 0.. Shimanouchi. 1., and Tasumi, M. (1977) The Protein Data Bank: A cmputer-based archival He for mcromolecular sructures.). Mo!. Biol. 112, 535-542 Hobohm. U., Schai{, M., Schneider, R., and Sander, C. (1992) Selection of representative protein data sets. Protein Sci. 1,409-417 Boberg, J., Salakoski. 1., and Vihinen, M. (1992) Selection of a representative set of structures from Brookhaven Protein Data Bank. Proteins 14, 265-276 Orengo, C. A., Flores. 1. P.. Taylor, W. R., and Thornton, J. M. (1993) Identification and classification of protein fold families. Protein Eng. 6, M. 0..

evolutionary 30.

from resifold. Proc.

Schwartz.

R. M., and

Orcutt,

B. C. (978) A model of Sequence and Structure National Biomedical Research

change in proteins. In Atlas of Protein

M. 0., ed) Vol. 5, Suppl.

3, p. 345,

Foundation, Washington, D.C. Connet, C. H. Cohen, M. A., and Benner. S. A. (1992) Exhaustive matching of the entire protein sequence database. Science 256. 1433-1445 Yi, 1. M., and Lander, E. S. (1994) Recognition ofrelated proteins by iterative template refinement (itr). Protein Sci. 3, 1315-1328 Zhang. Y. J. K., and Eisenberg, D. (1994) The 3-dimensional profile method using residue preference as a continuous function of residue environment. Protein Sci. 3,687-695 Smith,T. F.,and Waterman, M. S. (1981) Identification of common molecular subsequences.). Mo!. Biol. 147, 195-197

136 Vol. 10

January

1996

The FASEB

36.

37.

38. 39. 40.

42. 43.

44.

45. 4.6.

47.

48. 49.

50.

51.

S. B., and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins.). Mo!. Rio!. 48,443-453 Cribskov, M., Luthy. R., and Eisenberg, D. (1990) Profile analysis. Methods Enzymo!. 183, 146-159 Sippl, M. J.. (1990) Calculation of confotmational ensembles from potentials Needleman,

of mean force. An approach to the knowledge based prediction of local structures in globular proteins.). Mo!. BioS. 213, 859-883 Wilmanns, M., and Eisenberg, D. (1995) Inverted protein folding by the residue pair preference profile method. Protein Eng. 8, 627-639

Lathrop, R. (1994) The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng. 7, 1059-1068 Altschul, S. F. (1991) Amino acid substitution matrices from an information theoretic perspective.). Mol. BioS. 219. 555-565 Vingron, M., and Waterman. M. S. (1003) Sequence alignment and penalty choice: review of concepts, case studies and implications.). Mo!. BioS. 235, 1-12 Fischer, D. Elofsson, A., Rice, D. W., and Eisenherg, D. (1996) Assessing the performance of inverted protein folding methods by means of an extensive benchmark. Proc. 1st Pacfic Symp. Biocomputing In press Taylor, W. R., and Orengo, C. A. (1989) Protein structure alignment.J. Mo!. Bio!. 208, 1-22 Lathrop, R., and Smith, T. F. (1994) A branch and bound algorithm for optimal protein threading with pairwise (contact potential) amino acid interactions. In Proceedings of the 27th Hawaii internationa! Conference on System Sciences (Hunter, L., ed) Vol.5, pp. 365-376, IEEE Computer Society Press, Los Alamitos Gribskov. M., McLachlan, A. D., and Eisenberg, D. (1987) Profile analysis: Detection of distantly related proteins. Proc. Nat!. Acad. Sci. USA 84. 4355-4358 Bryant. S. H., and Altschul, S. F. (1995) Statistics of sequence-structure threading. Curr. Opin. Struct. BioS. 5, 236-244 Vogt, C.. Etzold, T.. and Argos. P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. ). Mo!. Biol. 249,816-831 Pearson, W. R.. (1995) Comparison of methods for searching protein sequence databases. Protein Sci. 4. 1145-1160 Rost, B.,and Sander, C. (1993) Prediction of protein secondary structure at better than 70% accuracy.). Mo!. BioS. 232, 584-599 Flaherty, K. M., McKay. D. B.. Kabsch. W.. and Holmes. K. C. (1991) Similarity of the 3-dimensional structures of actin and the atpase fragment of a 70-kda heat shock cognate protein. Proc. Nat!. Acad. Sri. USA 88. 5041-5045 Schumacher, M. A., Choi, K. Y., Zalkin, 11., and Brennan, R. C. (1994) Crystal structure of lad member, purr, bound to dna: minor groove binding by a helices.Science 266, 763-770 Schirmer, 1., Bode, W., Huber, R., Sidler, W., and Zuber, H. J. (1985) X-ray crystallographic

structure

of the light-harvesting

hiliprotein

C-phycocyanin

from the thermophilic

52.

53. 54.

cyanobactenuim .Ilastig!odadus Sanuinosus and its resemblance to globin structure.). Mo!. BioS. 184, 257-277 Pastore, A., and Lesk, A. M. (1990) Comparison of the structures of glohins and phycocyanins: Evidence for evolutionary relationship. Proteins 8. 133-155 Sippl, M. J. (1995) Knowledge-based potentials for proteins. Curr. Opin. Struct. BioS. 5,229-235 Gribskov, M., Homyak, M.. Edenfield. J., and Eisenberg, D. (1988) Profile scanning for3-dimensional structural patterns in protein sequences. CABIOS 4,61-66

Journal

FISCHER

Er AL.

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.