Energy-based prediction of amino acid-nucleotide base recognition

July 9, 2017 | Author: Andrea Mozzarelli | Category: Thermodynamics, Computational Chemistry, Water, Computational Biology, DNA, Proteins, Amino Acids, Hydrogen Bonding, Amino Acid Profile, THEORETICAL AND COMPUTATIONAL CHEMISTRY, Protein Binding, Nucleotides


ANNA MARABOTTI,1,2* FRANCESCA SPYRAKIS,3,4* ANGELO FACCHIANO,1,2 PIETRO COZZINI,4,5 SAVERIO ALBERTI,6 GLEN E. KELLOGG,7 ANDREA MOZZARELLI3,4

1 Laboratory for Bioinformatics and Computational Biology, Institute of Food Science, National Research Council, Avellino, Italy
2 Interdepartmental Research Center for Computational and Biotechnological Sciences (CRISCEB), Second University of Naples, Naples, Italy
3 Department of Biochemistry and Molecular Biology, University of Parma, Parma, Italy
4 National Institute for Biosystems and Biostructures, Rome, Italy
5 Laboratory of Molecular Modelling, Department of General Chemistry, University of Parma, Parma, Italy
6 Unit of Cancer Pathology, CeSI, University of Chieti, Chieti, Italy
7 Department of Medicinal Chemistry and Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Richmond, Virginia

Received 22 October 2007; Revised 23 January 2008; Accepted 24 January 2008
DOI 10.1002/jcc.20954
Published online 25 March 2008 in Wiley InterScience (www.interscience.wiley.com).

Abstract: Despite decades of investigations, it is not yet clear whether there are rules dictating the specificity of the interaction between amino acids and nucleotide bases. This issue was addressed by determining, in a dataset consisting of 100 high-resolution protein-DNA structures, the frequency and energy of interaction between each amino acid and base, and the energetics of water-mediated interactions. The analysis was carried out using HINT, a non-Newtonian force field encoding both enthalpic and entropic contributions, and Rank, a geometry-based tool for evaluating hydrogen bond interactions. A frequency- and energy-based preferential interaction of Arg and Lys with G, Asp and Glu with C, and Asn and Gln with A was found. Not only favorable, but also unfavorable contacts were found to be conserved. Water-mediated interactions strongly increase the probability of Thr-A, Lys-A, and Lys-C contacts. The frequency, interaction energy, and water enhancement factors associated with each amino acid–base pair were used to predict the base triplet recognized by the helix motif in 45 zinc fingers, which represent an ideal case study for the analysis of one-to-one amino acid–base pair contacts. The model correctly predicted 70.4% of 135 amino acid–base pairs, and, by weighting the energetic relevance of each amino acid–base pair to the overall recognition energy, it yielded a prediction rate of 89.7%. © 2008 Wiley Periodicals, Inc.

J Comput Chem 29: 1955–1969, 2008

Key words: protein-DNA; amino acid–base recognition; HINT; zinc finger; energy-based code

Introduction

Within a cell, two different biological macromolecules are constantly interacting with each other: nucleic acids and proteins. Their precise cross-matching and cross-talk is essential for life. For more than 30 years, several authors have searched for correspondence rules between amino acids and DNA bases (AA-B). Seeman et al. proposed that the amino acid–base pairing could be predicted on the basis of hydrogen bond patterns,1 while Pabo and Sauer hypothesized that the recognition of some specific base sequences could be derived largely from hydrogen bonds and van der Waals interactions.2 Further studies, carried out with an increasing number of complexes at high resolution, as more structures became available,

This article contains supplementary material available via the Internet at http://www.interscience.wiley.com/jpages/0192-8651/suppmat.
*These authors contributed equally.
Correspondence to: A. Marabotti; e-mail: [email protected] or A. Mozzarelli; e-mail: [email protected] or G. E. Kellogg; e-mail: [email protected]
Contract/grant sponsor: Italian Ministry of University and Scientific Research, FIRB; contract/grant number: RBNE0157EH
Contract/grant sponsor: U.S. National Institutes of Health; contract/grant number: GM71894
Contract/grant sponsors: Internationalization project, Italian National Council of the Researches (CNR-Bioinformatics Project), Italian National Institute for Biosystems and Biostructures


Marabotti et al. • Vol. 29, No. 12 • Journal of Computational Chemistry


applied a wide variety of methodological approaches,3–15 generally based on geometric (i.e., analysis of spatial distances, atom–atom interaction angles) or statistical/probabilistic considerations. These studies investigated the specific interactions between amino acids and bases in the major or minor grooves, including the effect of water molecules, as Morávek et al. did by analyzing 60 protein-DNA and 14 drug-DNA complexes.16 Despite these efforts, no general rules were found, although a consensus-sequence framework was derived for the zinc finger family and used to predict the specificity for DNA binding sites.17 Even the existence of a code, similar to the genetic code, was questioned.18 Recently, Benos et al., supported by the findings of several groups that have identified defined AA-B preferences,3,10,12,19–21 suggested a "both directions probabilistic code," rather than a "deterministic recognition code".15

Most of these previous studies were limited in scope, because a relatively low number of high resolution structures were analyzed. Additionally, the methodologies used often took into account only geometric parameters or at most enthalpic contributions to the binding free energy and, for the most part, neglected the entropic contributions. The latter can only be calculated using free energy component analysis,22,23 via ab initio calculations,24–31 or included in the conformational energy contribution estimated by knowledge-based methods.11,32–34 However, these approaches are prohibitively expensive for the scale of these investigations.

To overcome these limitations, we have analyzed protein-DNA complexes refined at a resolution equal to or better than 2.5 Å. High resolution is a critical requirement, since energy calculations are strongly affected by the geometry of interacting atoms, and, consequently, they are adversely affected by large uncertainties in poorly resolved structures.
Moreover, we have used HINT (Hydropathic INTeraction),35–37 a non-Newtonian force field based on LogP_o/w, the experimentally determined molecular partition coefficient between 1-octanol and water. Since LogP_o/w is obtained directly from a solvation–desolvation experiment, HINT implicitly includes entropic contributions arising from water molecules displaced during the formation of molecular complexes. Furthermore, both positively and negatively scoring interactions are quantified using the same mathematical protocol, that is, the biomolecular association process is evaluated as a concerted event and not as a net sum of terms from different energetic sources.38 Consequently, the HINT scores are derived only from the different hydropathic properties of the interacting atoms, which are implicitly expressed in the hydrophobic atomic constants a. Single atom–atom hydropathic interaction terms b_ij, that is, the interaction score between atoms i and j, provide a quantitative evaluation of the association process, scored by the following equation:

$$\mathrm{HINT\ score} = \sum_i \sum_j b_{ij} = \sum_i \sum_j \left( a_i S_i \, a_j S_j \, T_{ij} R_{ij} + r_{ij} \right) \qquad (1)$$

where S is the solvent accessible surface area, T_ij is a logic function assuming +1 or −1 values, depending on the nature of interacting atoms, and R_ij and r_ij are functions of the distance between atoms i and j. The key parameters a are calculated by a procedure adapted from the CLOG-P method conceived by Hansch and Leo.39
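
As an illustration of how eq. (1) accumulates atom–atom terms, the following is a minimal Python sketch. The `Atom` class and the `hint_score` function are hypothetical names introduced here, not part of the HINT software. The text does not give the functional forms of T_ij, R_ij, and r_ij, so they are passed in as callables; the exponential decay and zero steric term used in the demo are placeholders only.

```python
import math
from dataclasses import dataclass


@dataclass
class Atom:
    a: float     # hydrophobic atomic constant
    S: float     # solvent accessible surface area
    xyz: tuple   # Cartesian coordinates in angstroms


def hint_score(atoms_i, atoms_j, T, R, r):
    """Double sum of eq. (1): sum over i, j of (a_i S_i a_j S_j T_ij R_ij + r_ij).

    T(atom_i, atom_j) must return +1 or -1; R(d) and r(d) are functions of
    the interatomic distance d. Their actual forms are not given in the text.
    """
    total = 0.0
    for ai in atoms_i:
        for aj in atoms_j:
            d = math.dist(ai.xyz, aj.xyz)
            total += ai.a * ai.S * aj.a * aj.S * T(ai, aj) * R(d) + r(d)
    return total


# Demo with placeholder functions (assumptions, not the HINT definitions).
side_chain = [Atom(a=1.0, S=2.0, xyz=(0.0, 0.0, 0.0))]
base = [Atom(a=1.0, S=2.0, xyz=(1.0, 0.0, 0.0))]
demo = hint_score(
    side_chain,
    base,
    T=lambda i, j: 1,            # favorable pairing assumed
    R=lambda d: math.exp(-d),    # placeholder distance decay
    r=lambda d: 0.0,             # placeholder steric term
)
print(demo)
```

Passing the three functions as arguments keeps the double sum separate from the parameterization, which is the only part of the equation that the text fully specifies.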

A positive HINT b_ij value identifies favorable contacts (hydrogen bonds, acid/base, and hydrophobic/hydrophobic), while negative b_ij values identify unfavorable contacts (acid/acid, base/base, and hydrophobic/polar) between atoms i and j. The LogP_o/w of a biological molecule is given by the sum of all hydrophobic atomic constants; thus, a values are dimensionless thermodynamic parameters related to the free energy of atom transfer between 1-octanol and water. Since each b_ij is related to a partial δg value, that is, the contribution given by a single atom pair to the global ΔG_interaction, the HINT score (Σ_i Σ_j b_ij) is directly proportional to the global ΔG_interaction. The HINT score has been successfully correlated to the experimental free energy of binding for the formation of anthracycline antibiotic-DNA complexes,40,41 ligand-protein complexes42,43 and protein-DNA complexes.44,45 HINT was also applied in evaluating the influence on binding of the ionization state of protein residues and ligand functional groups,46,47 and in estimating the contribution of water molecules to the binding free energy at protein–protein interfaces48 and ligand–protein interfaces.49,50 Recently, the analysis of the energetics of water molecules bound to proteins in the absence and presence of ligands, estimated by HINT, and their hydrogen bond pattern, evaluated by the Rank algorithm,51 has led to the development of predictive tools for characterizing the role of water in ligand-protein complexes52 and, particularly relevant for this study, in protein-DNA complexes.44

By taking advantage of these tools we have addressed three main questions. Which AA-B pairs are more energetically relevant for achieving a specific and strong interaction between a protein and DNA? Is there an energetic propensity for each amino acid to recognize a specific base? Do water molecules, found at protein-DNA interfaces, play a significant role in defining specific AA-B recognition patterns?

The result of our analysis has allowed us to define AA-B recognition preferences, based on four factors specific for each AA-B pair: frequency, energy, sign (positive or negative), and water enhancement energy. These factors were used to predict the association preferences between amino acids and nucleotide bases in a dataset of 45 zinc finger-DNA complexes, identified by phage display selection.53

Materials and Methods

Dataset Creation

The coordinates of protein-DNA complexes were retrieved from the Protein Data Bank (PDB).54 In a few cases, when a complex was deposited in the PDB as the "asymmetric unit" in which the DNA appears as single-stranded for crystallographic convention, the coordinates of the biologically active complex were retrieved from the Nucleic Acid Database.55 The upper limit for resolution in this study was set at 2.5 Å in order to maintain a high quality dataset. Only noncovalent complexes between a DNA-binding protein (either monomeric or multimeric) and a double-stranded DNA were used. The dataset includes only structures in which the interaction between protein and DNA is specific and biologically relevant, thus discarding complexes between receptors bound to a noncognate DNA sequence or

Journal of Computational Chemistry

DOI 10.1002/jcc

Interaction Between Proteins and Nucleic Acids


Table 1. List of PDB Files Included in the Dataset.

1A3Q 1BL0 1F0O 1HLV 1K78 1NKP 1QRV 1UBD 2AOQ 2DNJ
1A73 1BNZ 1F4K 1HWT 1L3L 1NLW 1R0O 1W0U 2BAM 2FL3
1AAY 1BY4 1G2F 1I3J 1L3V 1OZJ 1R7M 1WTE 2BDP 2H27
1AZ0 1CDW 1GD2 1IAW 1LMB 1PDN 1RH6 1XBR 2BNW 2HAP
1AZP 1CKQ 1GU4 1IG7 1LQ1 1PP7 1RM1 1YO5 2BOP 2ITL
1B3T 1D2I 1GXP 1IGN 1M5X 1PT3 1RXW 1YRN 2C6Y 3CRO
1B72 1DC1 1H6F 1J1V 1MNM 1PUE 1SA3 1YTB 2C7A 3HDD
1B8I 1DNK 1H8A 1JE8 1MNN 1PUF 1SKN 1ZS4 2CGP 3HTS
1BC8 1DUX 1HCQ 1JGG 1N3F 1QNE 1TC3 2A07 2D5V 6PAX
1BDT 1EGW 1HCR 1JJ4 1NFK 1QPZ 1TRO 2A66 2DGC 9ANT

Structures with R ≤ 2.0 Å are highlighted in bold.

restriction enzymes bound to a nonconsensus sequence. Redundancy was excluded by choosing only the most representative complex among several redundant ones, that is, the one that best fulfills the previous criteria. If two or more equivalent complexes were present, we included the one with the highest resolution. Often, two identical protein-DNA complexes are present in a single PDB file. These files were included in the dataset, but they were managed in order to avoid over-representation of interactions (see below). The dataset did not include complexes between antibodies and a few DNA bases, or with DNA partially or totally single-stranded (except when the last bases of the double helix were not paired), with covalent bonds between the protein and DNA or within the DNA structure (e.g., thymine dimers), or complexes in which anomalous DNA structures (e.g., a base flipping or a rotation of the purine or pyrimidine ring), nonclassical bases (e.g., uridine or inosine), or classical bases with substituents (e.g., iodine, bromine etc.) are present. The list of the complexes composing the dataset is shown in Table 1.

Model Building

SYBYL (Tripos; www.tripos.com) version 6.91 and HINT (eduSoft, LC; www.edusoft-lc.com) version 3.09S β, incorporated as a module of SYBYL, were used throughout the analysis. All structures were carefully checked before HINT analysis in order to verify that correct atom and bond types were assigned by SYBYL. When some obvious structural artifacts were found, they were corrected with the guideline of preserving as much as possible the crystallographic structure. Hydrogen atoms, not present in the PDB files, were added to the complexes using SYBYL tools. The orientation of hydrogen atoms added to water molecules was set (using the SYBYL tool) to optimize hydrogen bonds. A mild energy minimization, performed with the Tripos force field with a distance-dependent dielectric constant, was applied only to hydrogen atoms with the Powell algorithm until a final gradient of 0.1 kcal mol⁻¹ Å⁻¹, to remove intermolecular or intramolecular steric contacts that are not minimized by the automatic algorithms upon hydrogen atom addition. This procedure does not affect the position of heavy atoms. Furthermore, after minimization, the orientation of hydrogen atoms involved in base pairing in the DNA molecule was checked and, in a few cases, hydrogen atoms were manually rotated to obtain the classic A=T or G≡C H-bond pattern.

Hydropathic Analysis

The physico-chemical meaning of the hydropathic analysis with HINT, as well as settings and conditions, were previously described.37,42,44,46,49,52 In this work, to "partition" both protein and DNA molecules, the "dictionary" option was used, by setting the solvent condition as "neutral," which means that all ionization states are set as would be expected for pH = 7. Throughout this work the "essential" option that treats only polar hydrogen atoms explicitly was used. To take into account only the direct readout, that is, the specific interactions between the atoms of the amino acids and the nucleotide bases occurring at the major and minor groove, phosphate and ribose groups of the DNA backbone were excluded from HINT partition, and, as a consequence, from subsequent calculations. Thus, energetics and statistical calculations were limited to, and focused only on, direct or water-mediated side-chain to base interactions. As previously described, b_ij represents the single atom–atom hydropathic interaction term (i.e., the interaction score between atoms i and j). Only atom-atom interactions within 6 Å and with |b_ij| ≥ 10 were taken into account for this study. The interaction score between a single amino acid AA and a single base B within a complex is given by the HINT score b_AA-B, that is, the sum of all b_ij for the atoms of these two entities.
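
The two cutoffs and the per-pair sum described above can be sketched as follows. The function name `pair_score` and the input layout (one `(distance, b_ij)` tuple per atom pair) are hypothetical. The cutoff values are those stated in the text; treating both comparisons as inclusive is an assumption.

```python
def pair_score(terms, max_dist=6.0, min_abs_b=10.0):
    """Sum b_ij over the atom pairs of one amino acid and one base.

    terms: iterable of (distance_in_angstrom, b_ij) tuples.
    A term is kept only if the atoms are within max_dist and the
    magnitude of b_ij reaches min_abs_b (inclusive bounds assumed).
    """
    return sum(
        b for dist, b in terms
        if dist <= max_dist and abs(b) >= min_abs_b
    )


# Four hypothetical atom-atom terms for a single AA-B contact.
terms = [
    (3.0, 120.0),   # kept: close and strong, favorable
    (5.5, -40.0),   # kept: close and strong, unfavorable
    (7.0, 300.0),   # dropped: beyond 6 angstroms
    (2.0, 4.0),     # dropped: |b_ij| below 10
]
print(pair_score(terms))  # 80.0
```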

Frequency, Interaction Energy, and Preference for AA-B Pair

The identity, frequency, and relative interaction scores of amino acid–nucleotide base pairs within the entire dataset were defined using the following procedure:

1. The HINT scores were "normalized" in order to evaluate the interaction between a single protein and a single DNA, that is, to avoid over-representing multimer complexes in the global dataset (see above). The possible cases in a PDB file are the following:
   i. Complex between a monomeric protein and a double-stranded DNA. In this case, all interactions found with HINT calculations were taken into account.
   ii. Complex between a dimeric (homo- or hetero-) protein and a double-stranded DNA. Although in the dimeric protein the interactions can be identical (repeated) in both subunits, we took into account all of them, since dimerization is the necessary condition for physiological interaction of that protein with DNA. Therefore, also in this case, we included all interactions detected with HINT in calculations.
   iii. Two identical complexes between a monomeric protein and a double-stranded DNA. This is required due to the crystallographic space group. In these cases, a single "meta-HINT score" was obtained by averaging the equivalent contributions from each complex. (Note that although the heavy atoms may be identical in the two complexes, the optimized structures with hydrogens have variance).
   iv. Two identical complexes between a dimeric protein and a double-stranded DNA. As in case (iii), this is a crystallographic "artifact." Also in this case, the equivalent contributions from each complex were averaged, but the contributions from two subunits of the same complex were treated separately.
   v. Particular cases were the PDB files 1H8A and 1K78. The first shows a trimeric complex formed by a dimer and a monomer on a single DNA molecule, whereas the second shows two complexes between a dimeric protein and DNA, but in one of the two complexes there is also an additional protein chain interacting with DNA. In these two cases, the PDB file was divided in two different components. 1H8A was separated into a set containing the homodimeric protein-DNA complex, and into another set containing the monomeric protein-DNA complex. 1K78 was separated into a set containing the two identical dimeric protein-DNA complexes and into another set containing the complex between the additional protein and DNA.
2. The HINT score (b_AA-B) for the interaction between each amino acid and each base for the 100 complexes was determined. The frequency of occurrence (count) and the HINT scores for each AA-B type contact were then summed. The count and the HINT score of AA-B interactions scoring either positive or negative were separately determined and summed to obtain the total positive and negative HINT scores.
3. The resulting data were used to calculate the relative contribution of each pair of amino acids and nucleotide bases on the total HINT score, and to deduce the residue energetic preference for each base, determined as a fraction of the corresponding AA-B (AA-A + AA-C + AA-G + AA-T) HINT score.
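
The averaging of cases (iii) and (iv) and the tallies of step 2 can be sketched in a few lines of Python. The names `meta_hint_score` and `tally`, and the `(amino acid, base, score)` input layout, are hypothetical and only illustrate the bookkeeping; they are not taken from the HINT software.

```python
from collections import defaultdict


def meta_hint_score(equivalent_scores):
    """Average equivalent b_AA-B contributions from identical complexes
    found in the same PDB file (cases iii and iv)."""
    return sum(equivalent_scores) / len(equivalent_scores)


def tally(contacts):
    """Accumulate counts and HINT scores per AA-B pair type (step 2).

    contacts: iterable of (amino_acid, base, b_AA_B) tuples.
    Positive and negative contacts are also tracked separately.
    """
    table = defaultdict(lambda: {
        "count": 0, "total": 0.0,
        "n_positive": 0, "positive": 0.0,
        "n_negative": 0, "negative": 0.0,
    })
    for aa, base, score in contacts:
        row = table[(aa, base)]
        row["count"] += 1
        row["total"] += score
        if score > 0:
            row["n_positive"] += 1
            row["positive"] += score
        elif score < 0:
            row["n_negative"] += 1
            row["negative"] += score
    return dict(table)


# Hypothetical contacts: the first Arg-G score is a meta-HINT average.
contacts = [
    ("Arg", "G", meta_hint_score([310.0, 290.0])),
    ("Arg", "G", -50.0),
    ("Asn", "A", 80.0),
]
for pair, row in tally(contacts).items():
    print(pair, row)
```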

Analysis of Water Molecules at the Protein-DNA Interface

Only water molecules directly involved in protein-DNA interaction, that is, "bridging waters",52 were taken into account for HINT calculations. Crystallographic waters within a contact distance of 4 Å between the protein and the DNA were automatically optimized and ordered using the Rank algorithm, implemented in the 3.09S β HINT version. The Rank algorithm51 is used to determine the potential number of hydrogen bonds formed by each water molecule to non-water (donor and acceptor) atoms within a biomolecular complex. A partial Rank is assigned to each individual protein–water and DNA–water interaction and the sum of these values gives the global Rank characterizing a specific water molecule. Water molecules able to mediate specific recognition in protein-DNA complexes are characterized by non-zero partial Rank values with both protein and DNA, and by global Rank values greater than 3. Non-zero partial Rank means that the water is able to make at least one H-bond with both the protein and the DNA, while a global Rank greater than 3 was typical of waters mediating amino acid–base contacts that, otherwise, would not be able to contribute to the complex formation and stabilization.44 This does not necessarily mean that the water molecule is forming three H-bonds. In fact, a 3 ranked water may have from two geometrically suitable H-bonds to three or even four H-bonds presenting nonoptimal geometrical parameters (distance and angle).51 These water molecules were carefully inspected and only those properly positioned to form hydrogen bonds with both amino acid side-chains and DNA bases were accepted as bridging waters. Water molecules positioned too far from the protein or the DNA, or contributing only marginally to a well-defined amino acid residue–base interaction, were not considered. The energetic contribution of each water-mediated interaction was calculated by HINT for the entire dataset of complexes. As above, if two or more identical protein-DNA complexes were present in the same PDB file, waters mediating the same interaction were counted only once.
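
The automatic part of this selection (before visual inspection) can be sketched as a simple filter. The `Water` class and `candidate_bridging` function are hypothetical; partial Rank values are assumed to be already computed by the Rank algorithm. The text states "greater than 3" but also discusses "a 3 ranked water", so whether a global Rank of exactly 3 qualifies is left as a parameter here.

```python
from dataclasses import dataclass


@dataclass
class Water:
    name: str
    rank_protein: float   # partial Rank toward protein atoms
    rank_dna: float       # partial Rank toward DNA atoms

    @property
    def global_rank(self):
        # Global Rank is the sum of the partial Rank values.
        return self.rank_protein + self.rank_dna


def candidate_bridging(waters, threshold=3.0, inclusive=False):
    """Keep waters with non-zero partial Rank toward both partners and a
    global Rank above the threshold (or equal to it if inclusive=True)."""
    selected = []
    for w in waters:
        if w.rank_protein <= 0 or w.rank_dna <= 0:
            continue
        rank = w.global_rank
        if rank > threshold or (inclusive and rank == threshold):
            selected.append(w)
    return selected


# Hypothetical waters with made-up Rank values.
waters = [
    Water("HOH101", 2.1, 1.8),   # both partners, global 3.9: kept
    Water("HOH102", 3.5, 0.0),   # no DNA contact: rejected
    Water("HOH103", 1.0, 1.2),   # global 2.2: rejected
]
print([w.name for w in candidate_bridging(waters)])  # ['HOH101']
```

Waters passing this filter would still require the manual inspection described in the text.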

Prediction of Amino Acid-Nucleotide Base Contacts in Zinc Finger Protein-DNA Complexes

A set of 45 zinc finger-DNA triplet complexes, identified by phage display selection, was curated by Ghosh et al.53 The total number of experimentally observed AA-B pair contacts is 135. Starting from the recognition amino acid sequence, located in positions −1, 3, and 6 of each selected zinc finger motif, the corresponding 3′, middle, and 5′ nucleotide bases were predicted using frequency-energy-water-based recognition rules (see Results) and success rates were calculated.
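
The mapping from helix positions to triplet positions and the success-rate calculation can be sketched as below. This is not the authors' recognition rule, which combines frequency, energy, and water factors (see Results); as a stand-in assumption, the sketch simply picks, for each residue, the base with the highest normalized HINT score fraction (FHINT, Table 2). The function names are hypothetical.

```python
# FHINT values (percent of the total HINT score) transcribed from Table 2
# for a few residues; used here only as a stand-in scoring table.
FHINT = {
    "Arg": {"A": 3.25, "C": 1.12, "G": 39.41, "T": 10.12},
    "Asn": {"A": 6.49, "C": 3.36, "G": 1.78, "T": -3.27},
    "Asp": {"A": 1.05, "C": 7.27, "G": -0.74, "T": -1.68},
    "Gln": {"A": 3.87, "C": 0.87, "G": 0.77, "T": -2.36},
    "Glu": {"A": 1.33, "C": 9.06, "G": -1.76, "T": -2.05},
    "His": {"A": 0.64, "C": 0.66, "G": 1.22, "T": 0.31},
    "Lys": {"A": 0.85, "C": 0.46, "G": 14.24, "T": 1.33},
}


def best_base(residue, scores=FHINT):
    """Base with the highest score for the given residue."""
    per_base = scores[residue]
    return max(per_base, key=per_base.get)


def predict_triplet(pos_minus1, pos3, pos6, scores=FHINT):
    """Predicted triplet written 5'->3'.

    Helix position 6 pairs with the 5' base, position 3 with the middle
    base, and position -1 with the 3' base.
    """
    return (best_base(pos6, scores)
            + best_base(pos3, scores)
            + best_base(pos_minus1, scores))


def success_rate(predicted, observed):
    """Percentage of positions where predicted and observed bases agree."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)


print(predict_triplet("Arg", "Glu", "Arg"))   # GCG
print(predict_triplet("Gln", "Asn", "Lys"))   # GAA
```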

Results

Interaction Type, Frequency, and Energetics of Amino Acid–Base Interaction

The selected dataset of protein-DNA complexes is functionally heterogeneous, comprising transcription factors and nontranscription factors, such as restriction enzymes and proteins involved in DNA replication. The crystallographic resolution for all of the 100 protein-DNA structures composing the dataset (Table 1) is better than 2.5 Å, with a resolution ≤ 2.0 Å for more than 40% of the complexes. Because of the division of files 1H8A and 1K78 into two parts (see Materials and Methods), 102 unique protein-DNA complexes were analyzed by the HINT force field, with the aim of defining AA-B preferences guiding the direct readout. No estimation of the phosphate and sugar contribution was provided, since the indirect readout through the DNA backbone is not directed by simple AA-B preferences.6 Both major and minor groove AA-B interactions were analyzed. For each pair of interacting atoms, HINT describes the type of interaction and its strength (the HINT score) with a sign indicating whether the interaction is favorable or unfavorable. Within the dataset, the contribution of favorable interactions is more than twice that of unfavorable ones (see Fig. 1). As expected, the contribution of the amino acid main chain atoms to the interaction with bases is negligible compared to that of amino acid side chain atoms (data not shown). The dominant favorable contributions are generally hydrogen bonds and acid–base interactions, whereas the unfavorable contributions are mainly hydrophobic-polar contacts for AA-T interactions, and acid/acid and base/base contacts for the other AA-B interactions. A detailed analysis of the contribution of each interaction type for the 80 pairs of amino acids and bases is reported in Figure 1SM and Table 1SM.

The frequency of specific AA-B interactions and the corresponding HINT scores are reported in Table 2. In the dataset, about 80% of the interactions originate from AA-B in the major groove and about 20% in the minor groove (data not shown). The interaction count fraction, FFreq (Table 2), indicates the relative frequency of a specific AA-B pair over the total number of contacts. It was found (see Fig. 2) that the amino acids that more frequently interact with bases are Arg, Asn, Lys, Gln, Thr, Ser, Asp and, surprisingly, Gly. The sum of their contacts accounts for more than 70% of the total number of contacts, with Arg alone accounting for 23%. In contrast, Cys, Trp, Pro, Met, Leu, Ile, and Phe rarely participate in DNA binding (the sum of their contacts is about 10% of the total number of contacts).
These numbers do not significantly change if they are normalized with respect to the natural distribution of amino acids in proteins or, more specifically, in DNA binding proteins56 (Table 2SM). Furthermore, by analyzing the frequency of a specific base contacting an amino acid, it was found (Fig. 2, inset) that T is the most contacted base (32%), followed by G (26%), and A and C (21% each).

The distribution of HINT score values, that is, how many times an interaction between an amino acid and a DNA base in the dataset falls in a determined HINT score range, is reported in Figure 3. The HINT score values are mainly distributed in the range −100 to +100. The contact score exceeds 500 (which corresponds to about −1 kcal mol⁻¹, ref. 42) only for 82 Arg-G contacts (26.8% of the total number of Arg-G contacts), and also for a few Arg-C, Arg-T, Asp-C, Asp-G, Glu-C, and Lys-G contacts, indicating a significant energetic role for these amino acid–base pairs in the recognition process. It is interesting to note that negative interactions are consistently formed by some AA-B pairs, including Arg with A, C, and G, and might be relevant in modulating recognition (vide infra).

On the basis of the relative HINT score of each amino acid with respect to the four bases (Table 2), an energy-based AA-B preference was determined (Fig. 4, Table 4SM). Only AA-B pairs for which the total HINT scores are positive were considered (Table 2), assuming that a negatively scored AA-B contact should not correspond to a possible predictable choice. Therefore, for Met and Val a base preference cannot be assigned because of negative HINT scores with all four bases (Table 2). Preferences were calculated as a percentage of the corresponding AA-B (AA-A + AA-C + AA-G + AA-T) total HINT score, retaining only positively scored pairs. The base preference (bold labels in Fig. 4) corresponds to AA-B contacts showing a mean HINT score value greater than 50 (Table 2), which is approximately the threshold of HINT score uncertainty.49 Thus, these contacts correspond to relevant energetically driven preferences and are not affected by the "noise" of nonspecific interactions. Arg-G, Asn-A, Asp-C, Gln-A, Glu-C, and Lys-G, with the addition of His-G and Ser-G, appear to be the most relevant contacts. Preferences are evident also for Ala-C, Cys-G, Gly-G, Leu-A, Thr-G, and Trp-C, while no clear base preference is associated with Ile, Phe, and Tyr. In general, hydrogen-bond donor residues like Arg, Cys, His, Lys, Ser, and Thr prefer G, while hydrogen-bond acceptor residues, like Asp and Glu, prefer C. Asn and Gln, possessing both donor and acceptor moieties, prefer A. Since these percentages were calculated for each amino acid from the sum of the AA-B HINT scores for the four bases, and not on the overall total HINT score, the preferences should be evaluated with caution, especially for amino acids with very low HINT scores with respect to the total HINT score. For example, even if Cys interacts with G, as opposed to A, C, or T, in more than 90% of the cases, the Cys-G partial HINT score accounts for only 0.14% of the overall HINT score (FHINT, Table 2). In contrast, Asn interacts with A with 56% preference but the Asn-A pair contributes 6.49% of the overall HINT score. Thus, Cys preference for G has a significantly smaller impact on sequence recognition than Asn-A or other more relevant AA-B interactions (see Table 2).
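
The preference percentages quoted above can be reproduced from the overall HINT scores in Table 2. The short Python sketch below (the function name `base_preference` is hypothetical) keeps only positively scored pairs for a given amino acid and expresses each as a percentage of their sum; the input values are the Asn, Cys, and Met rows of Table 2.

```python
def base_preference(scores):
    """Percentage preference of one amino acid for each base.

    scores: dict mapping base -> overall HINT score for that AA-B pair.
    Only positively scored pairs are retained; an empty dict means that
    no preference can be assigned.
    """
    positive = {base: s for base, s in scores.items() if s > 0}
    total = sum(positive.values())
    return {base: 100.0 * s / total for base, s in positive.items()}


# Overall HINT scores taken from Table 2.
asn = {"A": 12676.29, "C": 6555.54, "G": 3466.98, "T": -6377.91}
cys = {"A": -47.20, "C": -96.22, "G": 278.25, "T": 22.98}
met = {"A": -261.48, "C": -541.15, "G": -357.84, "T": -1415.28}

print(round(base_preference(asn)["A"]))   # 56
print(round(base_preference(cys)["G"]))   # 92
print(base_preference(met))               # {}
```

The Asn and Cys results agree with the 56% and "more than 90%" figures given in the text, and the empty result for Met reflects the statement that no base preference can be assigned to it.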
The dissection of each AA-B HINT score into its positive and negative contributions allows deeper insight into the underlying energetics of recognition, since the final values are not the result of similar positive and negative scores cancelling out. Therefore, we calculated preferences of each residue also on the basis of the sign of contributions (Fig. 2SM, Table 3SM, Table 4SM). The preference based on positive scores confirms the results reported in Figure 4, and the preference based on negative HINT scores allows the identification of the more relevant unfavorable interactions: Asn-T, Asp-G, Gln-T, Glu-G, Ile-T, Leu-T, Met-T, and Val-T. In most of these cases, there are negatively scored contacts between hydrophobic and polar groups, even between the hydrophobic amino acids and T, the most hydrophobic base. However, the formation of base–base repulsive contacts between the polar Asp and polar G (and Glu and G) is interesting and possibly plays a role in the fine tuning of protein-DNA recognition.

To obtain an amino acid–base preference that weights the absolute values of each AA-B pair with respect to the total HINT score, that is, the total energy calculated for the interaction of all AA-B pairs, the HINT score of each pair of amino acids and bases was scaled to the total HINT score. The resulting HINT score fraction FHINT (Table 2) of each amino acid for a specific base is reported in Figure 5. The dissection of the positive and negative HINT score fractions is reported in Table 5SM and in a visual format in Table 6SM. The value of the Arg-G interaction alone is about 2/5 of the overall HINT score, and the two highest contacts (Arg-G and Lys-G) account for about half of the total HINT score. Together, both acidic residues, Asp and Glu, interacting with C contribute another 1/6 of the total HINT score. Other important contributions are from Asn and Gln with A (scoring about 1/10 of the total HINT score), Arg with A, Ser with G, and Asn with C. With respect to DNA (Fig. 5, inset), the analysis clearly shows that the interactions of amino acids with G are the predominant energetic contribution to protein recognition and binding, with a cumulative HINT score of about 3/5 of the total score. Contributions of A and C are almost equivalent and each score 1/5 of the total HINT score, whereas T has a nearly equivalent mixture of favorable and unfavorable interactions with amino acids. The A nucleotide interacts preferably with Asn and Gln, whereas C prefers the acidic residues Asp and Glu. T has the highest frequency of interaction (Fig. 2, inset), but the most significant positive interactions of T with Arg, Lys and Phe are, in total, equivalent to the predominant negative ones with Asn, Gln, Asp, and Glu (Table 2).

Figure 1. The contribution of interaction types to the total HINT scores. The relative HINT score fraction attributed to the interaction of AA with a specific B was analyzed with respect to the total AA-B HINT score.

Figure 2. Interaction count fraction of each AA. Bases are identified by: A (green), C (cyan), G (black), T (red); positive contacts are represented by solid bars while negative contacts are represented by hashed bars. The inset shows the fraction of contacts formed by each B.

Figure 3. Distribution of HINT score values with respect to each AA-B pair interaction. For each range of 100 HINT score, bars refer to the number of AA-B contacts falling within that range. Bases are identified by: A (green), C (cyan), G (black), T (red). (One single Asp-C contact, accounting for more than 1550 HINT score units, has been omitted from the graph for scaling reasons).

Figure 4. Representation of the relative AA-B preferences, calculated as a percentage of the corresponding AA-B (AA-A + AA-C + AA-G + AA-T) HINT score values, retaining only positively scored AA-B pairs. AA-B pairs for which the corresponding mean HINT score > 50 are labeled in bold, while contacts accounting for less than 5% were not reported.

Journal of Computational Chemistry

DOI 10.1002/jcc

Interaction Between Proteins and Nucleic Acids


Table 2. Experimental Data and Derived Coefficients for Each AA-B Pair.

Columns: Count = overall interaction count (a); H(AA-B) = overall HINT score (b); Mean = mean HINT score (c); FFreq = interaction count fraction (d); FHINT = normalized HINT score fraction (e); CWater = water enhancement coefficient (f).

AA   Base  Count    H(AA-B)     Mean   FFreq   FHINT  CWater
Ala  A        21       7.64     0.36    0.46    0.00    1.00
Ala  C        24     324.69    13.53    0.53    0.17    1.00
Ala  G        34    -843.03   -24.80    0.75   -0.43    1.00
Ala  T        62    -380.73    -6.14    1.37   -0.20    1.00
Arg  A       210    6339.45    30.19    4.63    3.25    2.78
Arg  C       252    2181.46     8.66    5.56    1.12    3.00
Arg  G       306   76926.48   251.39    6.75   39.41    1.22
Arg  T       309   19750.84    63.92    6.82   10.12    1.35
Asn  A       162   12676.29    78.25    3.57    6.49    1.67
Asn  C        88    6555.54    74.49    1.94    3.36    1.25
Asn  G       105    3466.98    33.02    2.32    1.78    3.59
Asn  T       189   -6377.91   -33.75    4.17   -3.27    1.25
Asp  A        27    2042.02    75.63    0.60    1.05    3.95
Asp  C        70   14182.72   202.61    1.54    7.27    1.44
Asp  G        63   -1453.10   -23.07    1.39   -0.74    3.36
Asp  T        41   -3287.18   -80.18    0.90   -1.68    1.18
Cys  A         6     -47.20    -7.87    0.13   -0.02    1.00
Cys  C         7     -96.22   -13.75    0.15   -0.05    1.00
Cys  G         7     278.25    39.75    0.15    0.14    1.00
Cys  T         3      22.98     7.66    0.07    0.01    1.00
Gln  A        65    7558.70   116.29    1.43    3.87    1.32
Gln  C        54    1707.82    31.63    1.19    0.87    2.08
Gln  G        66    1499.05    22.71    1.46    0.77    3.26
Gln  T       100   -4612.23   -46.12    2.21   -2.36    1.26
Glu  A        19    2598.89   136.78    0.42    1.33    1.48
Glu  C        62   17688.34   285.30    1.37    9.06    1.24
Glu  G        76   -3439.23   -45.25    1.68   -1.76    2.53
Glu  T        27   -4002.29  -148.23    0.60   -2.05    1.00
Gly  A        48     665.53    13.87    1.06    0.34    1.00
Gly  C        48    -759.79   -15.83    1.06   -0.39    1.00
Gly  G        58    1280.58    22.08    1.28    0.66    1.00
Gly  T        68    -436.04    -6.41    1.50   -0.22    1.00
His  A        21    1252.07    59.62    0.46    0.64    2.13
His  C        28    1288.57    46.02    0.62    0.66    1.97
His  G        38    2383.22    62.72    0.84    1.22    2.41
His  T        49     612.80    12.51    1.08    0.31    1.73
Ile  A        30     441.93    14.73    0.66    0.23    1.00
Ile  C        19    -307.56   -16.19    0.42   -0.16    1.00
Ile  G        19    -414.22   -21.80    0.42   -0.21    1.00
Ile  T        31     382.65    12.34    0.68    0.20    1.00
Leu  A        27     565.24    20.93    0.60    0.29    1.00
Leu  C         7    -380.50   -54.36    0.15   -0.19    1.00
Leu  G        11     -91.29    -8.30    0.24   -0.05    1.00
Leu  T        50   -1565.81   -31.32    1.10   -0.80    1.00
Lys  A        70    1659.96    23.71    1.54    0.85    5.18
Lys  C       110     889.42     8.09    2.43    0.46    4.45
Lys  G       154   27792.37   180.47    3.40   14.24    1.70
Lys  T       109    2592.77    23.79    2.41    1.33    1.60
Met  A        24    -261.48   -10.89    0.53   -0.13    1.00
Met  C        13    -541.15   -41.63    0.29   -0.28    1.00
Met  G        15    -357.84   -23.86    0.33   -0.18    1.00
Met  T        26   -1415.28   -54.43    0.57   -0.73    1.00
Phe  A        17     638.85    37.58    0.38    0.33    1.00
Phe  C        14     143.95    10.28    0.31    0.07    1.00
Phe  G        21    1063.11    50.62    0.46    0.54    1.00
Phe  T        48    1698.32    35.38    1.06    0.87    1.00
(continued)


Marabotti et al. • Vol. 29, No. 12 • Journal of Computational Chemistry


Table 2. (Continued).

AA   Base  Count    H(AA-B)     Mean   FFreq   FHINT  CWater
Pro  A        13     199.28    15.33    0.29    0.10    1.00
Pro  C         4     104.05    26.01    0.09    0.05    1.00
Pro  G         6      64.82    10.80    0.13    0.03    1.00
Pro  T        18    -296.39   -16.47    0.40   -0.15    1.00
Ser  A        47    2031.98    43.23    1.04    1.04    3.24
Ser  C        38     154.33     4.06    0.84    0.08    2.49
Ser  G        75    4828.07    64.37    1.65    2.47    1.36
Ser  T        99   -1189.11   -12.01    2.18   -0.61    1.06
Thr  A        55     167.47     3.04    1.21    0.09   10.46
Thr  C        56    -871.31   -15.56    1.24   -0.45    3.41
Thr  G        62     628.14    10.13    1.37    0.32    1.04
Thr  T       112    -911.87    -8.14    2.47   -0.47    1.16
Trp  A         3      59.39    19.80    0.07    0.03    1.00
Trp  C         9     587.07    65.23    0.20    0.30    1.00
Trp  G         8      27.13     3.39    0.18    0.01    1.00
Trp  T         6     136.44    22.74    0.13    0.07    1.00
Tyr  A        23     775.67    33.72    0.51    0.40    3.41
Tyr  C        21     683.38    32.54    0.46    0.35    3.08
Tyr  G        20     761.17    38.06    0.44    0.39    2.09
Tyr  T        44     308.79     7.02    0.97    0.16    2.38
Val  A        52     -26.62    -0.51    1.15   -0.01    1.00
Val  C        20   -1306.46   -65.32    0.44   -0.67    1.00
Val  G        36    -603.57   -16.77    0.79   -0.31    1.00
Val  T        77   -1204.04   -15.64    1.70   -0.62    1.00
Total       4532  195197.00        -       -       -       -

a Count of AA-B interactions of the specified type.
b Sum of HINT scores of the specified AA-B type.
c Mean HINT score values for each specified AA-B type.
d Fractions of AA-B interaction counts of the specified type with respect to the total number of interactions, normalized to 100.
e Fractions of the HINT score contributions for each specified AA-B interaction with respect to the total HINT score values, normalized to 100.
f Score enhancement due to water contribution, {|HAA-Water-B| + |HAA-B|}/|HAA-B|. See Supplemental Materials for HAA-Water-B. A water enhancement factor of 1 indicates an AA-B with no significant bridging water molecules.
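The derived columns of Table 2 follow from the interaction count and the overall HINT score of each pair. The short Python sketch below recomputes them for two rows, assuming that both fractions are normalized by the values in the Total row (4532 contacts and 195197.00 HINT units); the function name is illustrative and is not part of HINT or Rank.

```python
# Totals taken from the "Total" row of Table 2 (assumed normalization constants).
TOTAL_COUNT = 4532
TOTAL_HINT = 195197.00


def derived_coefficients(count, hint_sum):
    """Return (mean HINT score, FFreq, FHINT) for one AA-B pair.

    count    : overall interaction count (footnote a)
    hint_sum : overall HINT score H(AA-B) (footnote b)
    """
    mean_hint = hint_sum / count                 # footnote c
    f_freq = 100.0 * count / TOTAL_COUNT         # footnote d
    f_hint = 100.0 * hint_sum / TOTAL_HINT       # footnote e
    return round(mean_hint, 2), round(f_freq, 2), round(f_hint, 2)


# Arg-G: a frequent and strongly favorable contact.
print("Arg-G", derived_coefficients(306, 76926.48))
# Asn-T: a frequent but unfavorable (negative score) contact.
print("Asn-T", derived_coefficients(189, -6377.91))
```

Both rows reproduce the tabulated values, which illustrates how a pair can be common (high FFreq) and still carry a negative FHINT.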

Table 3. Comparison of Experimental and Predicted AA-B Matches for Phage Display Zinc Finger Peptide-DNA Oligomer Complexes.53,60,65,66

Sequences of residues at positions -1, 3, and 6 on the recognition helix of the zinc finger domain specifically binding 5'-ANN-3', 5'-CNN-3', and 5'-GNN-3' DNA triplets are indicated. For each cell, the experimental base for the peptide amino acid is indicated on the left; the base predicted using the energetic recognition code in this work is reported on the right of each cell. N identifies cases for which no base preference could be deduced. No experimental peptide sequence was reported for the base triplet sequence in the shaded cells.

Water-Mediated AA-B Interactions

In an earlier report,44 we showed that water molecules contribute to the energetics of the formation of protein-DNA complexes and, thus, they might also mediate specific AA-B recognition. To explore this possibility, the Rank algorithm was applied to identify water molecules bridging amino acids and bases within the dataset. A bridging water molecule is defined as a water molecule that makes more than two hydrogen bonds with both amino acids and nucleotide bases,12 showing: (i) positive Rank with both the amino acid and the base and (ii) an overall Rank higher than 3 (ref. 44). Of the 1997 crystallographic waters located within a range of 4.0 Å at the protein-DNA interface of the 100 analyzed complexes, 47% met these conditions. Upon subsequent visual inspection, 431 (22%) were classified as nonredundant (see Materials and Methods), thus specifically mediating AA-B recognition.

The weighted HINT scores for water-mediated AA-B contacts, reported in Figure 6, show the relative HINT score of the water molecules that are interacting simultaneously with protein and DNA. With few exceptions, the same AA-B pairs that are significant for direct protein-DNA interactions are also significant in water-mediated interactions. In particular, Arg-G accounts for 11.4% of the water-mediated interactions (Table 7SM). The hydrogen bond-donating Arg easily forms strong contacts with carbonyl acceptors in bases without mediation, and the flexibility of its long side chain is an additional asset. Interactions with adenine (A) are notably enhanced relative to the others, especially with H-bond donor residues, because water is capable of reducing the electrostatic repulsion between donor atoms of A and the side chains of the amino acids. Interactions with T are decreased, presumably because the hydrophobic portion of this base cannot make favorable interactions with water molecules.

The interaction enhancement for each AA-B pair due to the energetic contribution of a bridging water molecule is defined as the water enhancement factor CWater, and is reported in Table 2. A water enhancement factor of 1 indicates an AA-B with no significant bridging water molecules. The overall average energetic effect of the water-mediated recognition on the AA-B interaction was calculated to be 1.55. Thus, enhancement factors larger than this value indicate AA-B pairs where water mediation exhibits a particularly significant role, for example, Arg-A (2.78), Arg-C (3.00), Asn-G (3.59), Asp-A (3.95), Asp-G (3.36), Gln-G (3.26), Lys-A (5.18), Lys-C (4.45), Ser-A (3.24), Thr-A (10.46), Thr-C (3.41), Tyr-A (3.41), and Tyr-C (3.09). Interestingly, Asn-B and Asp-B interactions are enhanced more (1.71 and 1.78, respectively) than Gln-B and Glu-B interactions (1.57 and 1.39, respectively), probably because the shorter side chain of the former requires water mediation to "lengthen" the side chain to effectively contact bases. Additionally, interactions with the three hydroxyl residues, Ser-B (1.80), Thr-B (2.49), and Tyr-B (2.80), are also notably enhanced by water.
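The water enhancement factor can be written out directly from its definition in footnote f of Table 2. The following Python sketch is only an illustration: the function names are hypothetical, and the water-mediated score printed for Thr-A is the value implied by the tabulated CWater, not the value reported in the Supplemental Materials.

```python
def water_enhancement(h_direct, h_water_bridged):
    """C_Water = (|H_AA-Water-B| + |H_AA-B|) / |H_AA-B| (Table 2, footnote f)."""
    if h_direct == 0:
        raise ValueError("direct AA-B HINT score must be non-zero")
    return (abs(h_water_bridged) + abs(h_direct)) / abs(h_direct)


def implied_water_score(h_direct, c_water):
    """Invert the definition: |H_AA-Water-B| = (C_Water - 1) * |H_AA-B|."""
    return (c_water - 1.0) * abs(h_direct)


# No bridging water: the factor is exactly 1, as stated in the text.
print(water_enhancement(324.69, 0.0))

# Thr-A (H = 167.47, C_Water = 10.46 in Table 2): the water-mediated score
# implied by the definition is roughly nine times the direct score.
h_thr_a = 167.47
h_water = implied_water_score(h_thr_a, 10.46)
print(round(h_water, 1), round(water_enhancement(h_thr_a, h_water), 2))
```

Because absolute values are used, the factor is never smaller than 1, so it measures how much a bridging water adds to the magnitude of a contact rather than whether the contact is favorable.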

Figure 5. Relative contribution of each AA-B pair HINT score, normalized with respect to the total AA-B HINT score. Columns are color coded as in Figure 2 to identify bases. The inset shows the relative contribution given by each base with respect to the total AA-B HINT score.

Figure 6. Relative HINT score contributions for water-mediated amino acid–base interactions, normalized with respect to the total AA-W-B HINT score. The inset shows the relative contribution given by each base, with respect to the total AA-W-B HINT score. Protein-water contributions are represented by solid bars, while DNA-water contributions are represented by hashed bars.

Recognition Preferences Predicting Zinc Finger Protein-DNA Complexes

Cys2His2 zinc finger proteins are a particularly suitable class of complex for testing the recognition preferences in prediction of the most favorable AA-B contacts. The zinc finger-DNA complexes present a structurally well-characterized and conserved recognition module. The protein recognition α helix fits into the DNA major groove with the formation of essentially one amino acid-to-one nucleotide base specific contacts57–59 between protein residues placed at positions -1, 2, 3, and 6 of the helix and the corresponding bases in the three to four canonical subsites (see Fig. 7). This one-to-one registration between amino acids and bases is a fairly unique feature of zinc finger protein-DNA complexes. Protein residues occupying positions other than -1, 2, 3, and 6 are involved in regulating the fine specificity.60–64 The major recognition contacts are made by residues at positions -1, 3, and 6, contacting, on one strand of the DNA duplex, the 3', middle, and 5' nucleotides of a triplet base-pair sequence, respectively, while the amino acid in position 2 interacts with the previous nucleotide base on the 3' side of the other DNA strand. This interaction pattern is conserved in canonical zinc finger proteins, including Tramtrak finger 2, two Zif268 finger 1 variants, TFIIIA finger 3, Berg's designed protein fingers 1, 2, and 3, and YY1 finger 3. Different interactions can be observed in other zinc finger-DNA complexes, where residues at positions -1 and 2 make alternate contacts compared to this typical recognition pattern.20

Segal and coworkers used phage display selection on fourteen 5'-ANN-3', fifteen 5'-CNN-3', and all sixteen possible 5'-GNN-3' triplets (N = any nucleotide base) to identify the zinc finger-like peptide with highest affinity for these base pair triplets53,60,65,66 from a library of > 10^9 peptides (data for 5'-TNN-3' are not available67). Subsequent mutational optimizations yielded specific binding peptides for the 45 base-pair triplets. These data comprise a test set that can be explored with the AA-B frequency and energy-based preferences set out in Table 2. The experimentally observed combinations of triplet base pairs and amino acid residues placed at positions -1, 3, and 6 of the zinc finger domain recognition α helix are reported in Table 3 for each of the three ANN, CNN, and GNN sequence sets.
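The registration just described (position 6 reads the 5' base, position 3 the middle base, and position -1 the 3' base) can be turned into a small triplet predictor. The Python sketch below assumes that the predicted base for a residue is simply the one with the largest FHINT in Table 2; the three helix sequences used in the example are hypothetical and are not taken from Table 3.

```python
# FHINT values copied from Table 2 for three amino acids (illustration only).
F_HINT = {
    "Arg": {"A": 3.25, "C": 1.12, "G": 39.41, "T": 10.12},
    "Asp": {"A": 1.05, "C": 7.27, "G": -0.74, "T": -1.68},
    "Asn": {"A": 6.49, "C": 3.36, "G": 1.78, "T": -3.27},
}


def best_base(amino_acid):
    """Base with the highest FHINT for the given amino acid (assumed rule)."""
    scores = F_HINT[amino_acid]
    return max(scores, key=scores.get)


def predict_triplet(pos_minus1, pos3, pos6):
    """Predicted 5'->3' triplet for helix residues at positions -1, 3, and 6.

    Position 6 contacts the 5' base, position 3 the middle base,
    and position -1 the 3' base of the triplet.
    """
    return best_base(pos6) + best_base(pos3) + best_base(pos_minus1)


# Hypothetical recognition helices (not experimental sequences).
print(predict_triplet("Arg", "Asp", "Arg"))
print(predict_triplet("Asn", "Asp", "Arg"))
```

This one-residue-one-base lookup deliberately ignores position 2 and any neighbor effects, which is the same simplification the one-to-one model makes.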
Figure 7. Close-up of the crystallographic structure of the complex formed by the Zif268 protein DSNR variant and the GACC site located on the DNA double strand (PDB code 1A1F). The fundamental one-to-one interactions formed by residues -1, 2, 3, and 6 on the protein recognition helix and four bases located on the corresponding DNA major groove subsite are highlighted by yellow dashed lines. Residues and bases are shown in capped stick, while the protein and the DNA are rendered in ribbon/tube and are colored magenta and cyan, respectively.

The prediction of the preferred nucleotide base that is associated with each amino acid in these constructs, using the frequency-based interaction count fraction (FFreq), yields very poor results: correct prediction of only 34/135 (25.2%) of AA-B associations. T has the highest frequency of interactions with three-quarters of the amino acids, and is thus indicated as the preferred base, although, energetically, most amino acid interactions with T are very weak. However, the HINT-based energetic analysis using FHINT is fairly accurate, with 95/135 (70.4%) of the experimentally observed amino acid–nucleotide base pairs correctly predicted. This latter prediction is illustrated in Table 3, where the actual and predicted bases for each peptide residue in the one-to-one contacts described above are listed.

Some AA-B interactions are nonspecific and relatively insignificant in terms of protein-DNA energetics, while others are robust. To ascertain that our method correctly identifies the energetically significant AA-B interactions, and only fails to identify weak or nearly random AA-B associations, each prediction was weighted for its relative energetic relevance with respect to the total protein-DNA recognition energy. We calculated a relevance-weighted success descriptor, SW, as follows:

$$S_W = \frac{\sum F^{o}_{\mathrm{HINT}}}{\sum F^{p}_{\mathrm{HINT}}} \times 100 \qquad (2)$$

where $F^{o}_{\mathrm{HINT}}$ is the FHINT fraction (Table 2) for each actual AA-B interaction (Table 3) and $F^{p}_{\mathrm{HINT}}$ is the FHINT fraction for the "most favorable" B contact for that amino acid. For example, the erroneous prediction of an Ala-A contact involves the substitution of the best fraction of 0.17, related to the preferred Ala-C contact, with the 0.00 value (Table 2), whereas the erroneous prediction of an Arg-A combination with respect to the much more favorable Arg-G contact would have a more significant impact on the success rate, that is, $F^{o}_{\mathrm{HINT}}$ = 3.25 vs. 39.41 (Table 2). The SW was found to be 89.7%. The prediction only slightly increases when the water enhancement factor CWater (Table 2) is taken into account.
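The effect of the relevance weighting in Eq. (2) can be seen on a toy example. The Python sketch below uses FHINT values from Table 2 for three amino acids and two hypothetical sets of observed contacts (they are not the 135 pairs of Table 3); the function names are illustrative.

```python
# FHINT values copied from Table 2 (three amino acids, illustration only).
F_HINT = {
    "Ala": {"A": 0.00, "C": 0.17, "G": -0.43, "T": -0.20},
    "Arg": {"A": 3.25, "C": 1.12, "G": 39.41, "T": 10.12},
    "Asp": {"A": 1.05, "C": 7.27, "G": -0.74, "T": -1.68},
}


def predict_base(amino_acid):
    """Most favorable base: the one with the highest FHINT."""
    scores = F_HINT[amino_acid]
    return max(scores, key=scores.get)


def weighted_success(observed):
    """Eq. (2): 100 * sum(F_o) / sum(F_p) over (amino acid, observed base) pairs."""
    f_obs = sum(F_HINT[aa][base] for aa, base in observed)
    f_pred = sum(F_HINT[aa][predict_base(aa)] for aa, _ in observed)
    return 100.0 * f_obs / f_pred


def plain_success(observed):
    """Unweighted percentage of pairs whose observed base equals the prediction."""
    hits = sum(1 for aa, base in observed if predict_base(aa) == base)
    return 100.0 * hits / len(observed)


# Hypothetical case 1: only the weak Ala contact is mispredicted.
mild = [("Arg", "G"), ("Asp", "C"), ("Ala", "A")]
# Hypothetical case 2: only the strong Arg contact is mispredicted.
severe = [("Arg", "A"), ("Asp", "C"), ("Ala", "C")]

for name, pairs in (("mild", mild), ("severe", severe)):
    print(name, round(plain_success(pairs), 1), round(weighted_success(pairs), 1))
```

Both toy sets have the same unweighted success rate (two of three pairs), but SW stays near 100% when the miss involves Ala and drops sharply when it involves Arg, which is the behavior the Ala-A and Arg-A examples in the text describe.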

Discussion

The Dataset

The composition of the dataset is a key factor for robustness and generality in predicting amino acid–base preferences. The number, type, and crystallographic quality of the protein-DNA complexes in the dataset can greatly affect the results. Early studies2,4–6,10,11 analyzed datasets limited in the number of complexes (from 8 to 50), in type (mainly transcription factors), and in structural resolution (average < 3 Å). More recently, two larger datasets, containing 129 (ref. 12) and 139 (ref. 14) complexes, were analyzed, but the dataset composition was still far from ideal. The cut-off for the resolution was set to 3.0 Å in both cases, with few complexes at resolution better than 2.0 Å. Also, several complexes in these datasets contained anomalous DNA structures (single-stranded DNA, thymine dimers, nonclassical bases, or classical bases with substituents) or complexes between antibodies and short DNA oligomers. In one case,12 a high degree of redundancy was present. We felt that these factors might generate biased results and, therefore, we enhanced the quality of our dataset as much as possible. First, we analyzed only structures with resolution ≤ 2.5 Å. This is a crucial issue because resolution > 3.0 Å does not guarantee that the geometries of the atoms are correctly localized. We also discarded complexes in which anomalous elements (i.e., the presence of an unusual base) or unnatural protein-DNA complexes could perturb the analysis. Finally, our dataset included both transcription factors (67%) and nontranscription factors (33%), such as restriction enzymes, transposases, transferases, and recombinases, so that the generated interaction rules are not biased by the protein function.

Geometry-Based vs. Energy-Based Amino Acid–Base Preference

Several investigations have searched for general rules governing AA-B recognition within protein-DNA complexes.3,5,6,8–10,13,58,68–73 The majority of these studies derived AA-B preferences through qualitative and quantitative approaches based purely on geometric considerations, such as counting the number of hydrogen bonds between amino acids and bases as deduced from the protein-DNA three-dimensional structures. Qualitative AA-B preferences were first proposed by Desjarlais and Berg for Cys2His2 zinc finger proteins.3 Similar results, organized into a zinc finger recognition code, were subsequently given by Choo and Klug,9 by Wolfe et al.,58 and by Pabo et al.59 Resultant AA-B preferences were strictly position-dependent, that is, did not include energy-based parameters. These qualitative rules evolved into semiquantitative predictive models, for example, in calculating chemical and stereochemical merit points as by Suzuki and Yagi, who obtained good specificity indexes (ranging from 92 to 99%) for limited sets of specific protein families.5 AA-B preferences were also proposed by Lustig and Jernigan,70 starting from a specific set of zinc finger proteins74 and logarithms of the AA-B relative frequencies. Mandel-Gutfreund and Margalit10 and Benos et al.13 showed that better predictions could be obtained by enlarging the datasets and including different and nonredundant structures. In the latter case, SAMIE was developed from a set of 876 EGR protein-DNA complexes with a derived score that reflects AA-B compatibility, but does not possess any energetic meaning. However, 80% of the training set DNA targets were correctly ranked in the top 1% of all the possible targets and a correlation factor of 0.80 was obtained when predictions were compared to available literature experimental affinity measurements.65 A notable success rate (68%) was obtained by Jones et al. by using more parameters (accessible surface area, electrostatic potentials, residue interface propensity, hydrophobicity, and conservation).72 A statistical evaluation combining electrostatic potentials and molecular surface shape parameters by Tsuchiya et al. achieved an 86% success in prediction.73

Despite these promising results, there are several limitations to these methods: (i) the uneven availability of experimental data, since predictions are often reliable only within a particular family of proteins; (ii) the sole reliance on hydrogen bond frequencies in most cases, without considering other interactions; (iii) the assumption of additivity, that is, each hydrogen bond contributes equally to the recognition, even though it is well known that the strength of hydrogen bonds varies significantly with distance and interacting partners75; and (iv) the real lack of any energetic meaning to the parameters and predictions.

By analogy, as part of our analysis, we derived FFreq (Table 2 and Fig. 2), a tool for frequency-dependent interaction preferences between amino acids and nucleotide bases. This tool has limited value: (i) only Arg exhibits a frequency of interaction with a specific base higher than 4%, and, in fact, comprises more than 23% of all interactions, with surprisingly little specificity; (ii) Asn and Lys exhibit the next most recurrent interactions, each yielding a total of about 10–12% of all interactions, again with limited specificity; and (iii) few amino acids exhibit clear base preferences: only Ala-T, Gln-T, Leu-T, Phe-T, Thr-T, and Val-T show a frequency approaching double that of the amino acid's second choice. Clearly, FFreq is not an appropriate tool for this purpose. We can only speculate on why frequency-based specificity predictions had earlier success, but one factor may be the lack of depth and diversity of the training sets used in the past. Specific recognition between an amino acid and a base is dictated by a geometric complementarity resulting in a series of favorable (and unfavorable) energetic contributions. Thus, structure/energy-based recognition preferences of amino acid–base associations should provide more stringent and quantitative criteria than those purely based on geometric factors.

Energy-Based Preference Rules

In this work, we are applying the HINT force ﬁeld that extracts geometric information from crystallographic structure and translates it into a quantitative parameter, the HINT score. This score provides an energy-based parameter assessing all possible favorable and unfavorable interactions. The parameter FHINT summarizes the score information with respect to both frequency and energy of AA-B interactions. Examination of Table 2, Figure 4 and especially Figure 5, reveals that about three-quarters of amino acids indicate a FHINT base preference, and that the cases without a strong preference contribute little to the overall protein-DNA interaction ensemble. Interestingly, the interaction preferred by several amino acids, Ala, Gln, Leu, Thr, and Val, with T (Table 2), is not energetically favored, as signaled by their negative FHINT values. Only the preferred interaction of Phe with T is also energetically favorable. The case of Val is peculiar because all FHINT values are negative indicating that this amino acid causes unfavorable interactions with all four bases. In accord with previous reports in analyses based on frequency of contacts,1,6,10,12,70 the most relevant AA-B preferences (Table 2) are Arg-G, Lys-G, Glu-C, Asp-C, Asn-A, Gln-A and, to a lesser extent, Ser-G and His-G. The relatively high incidence of speciﬁc amino acid interactions with G and the associated relevant energetic contribution is due to a high number of potential hydrogen bonding atoms being found on the base edge, thus allowing the formation of multiple bonds.6,12 The relevance of Arg and, in particular, Arg-G contacts can be explained by the length of this residue side-chain and its capacity to interact in different conformations that produce good hydrogen-bonding geometries. As a result, bi-dentate interactions, as in Arg-G, Lys-G, or Asn-A, increase the bond energy for an AA-B contact. 
In contrast, single and complex hydrogen bonds (involving the binding of a single AA to more than one base at the same time), van der Waals interactions, and water-mediated contacts are strongly dependent on the conformation of the interacting macromolecules and usually produce context-dependent speciﬁcities.12
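The contrast between frequency-based and energy-based preferences discussed in this section can be checked directly against Table 2. The Python sketch below compares, for four amino acids, the base ranked first by FFreq with the base ranked first by FHINT; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# (FFreq, FHINT) per base, copied from Table 2 for four amino acids.
TABLE2 = {
    "Gln": {"A": (1.43, 3.87), "C": (1.19, 0.87), "G": (1.46, 0.77), "T": (2.21, -2.36)},
    "Thr": {"A": (1.21, 0.09), "C": (1.24, -0.45), "G": (1.37, 0.32), "T": (2.47, -0.47)},
    "Phe": {"A": (0.38, 0.33), "C": (0.31, 0.07), "G": (0.46, 0.54), "T": (1.06, 0.87)},
    "Lys": {"A": (1.54, 0.85), "C": (2.43, 0.46), "G": (3.40, 14.24), "T": (2.41, 1.33)},
}

COLUMN = {"freq": 0, "hint": 1}


def preferred_base(amino_acid, column):
    """Base with the largest FFreq ('freq') or FHINT ('hint') for an amino acid."""
    idx = COLUMN[column]
    rows = TABLE2[amino_acid]
    return max(rows, key=lambda base: rows[base][idx])


for aa in TABLE2:
    by_freq = preferred_base(aa, "freq")
    by_hint = preferred_base(aa, "hint")
    favorable = TABLE2[aa][by_freq][1] > 0  # is the most frequent contact favorable?
    print(aa, "freq:", by_freq, "hint:", by_hint, "favorable:", favorable)
```

For Gln and Thr the most frequent partner is T even though its FHINT is negative, whereas for Phe the two criteria agree on T and for Lys they agree on G, in line with the observations above.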

As illustrated above, the HINT energy-based analysis more effectively extracts structural information from crystallographic data, and reveals both favorable and unfavorable contacts. The more common statistical potential approaches do not quantify or predict unfavorable interactions between amino acids and bases.76 Some of the negative scoring AA-B contacts in our analysis are surprisingly well conserved. Thymine, usually because of its methyl group, generates numerous negative scoring hydrophobic-polar interactions and is frequently present in unfavorable contacts with polar residues, in particular Asn-T, Gln-T, Asp-T, and Glu-T. Also, Asp and Glu are involved in repulsive interactions with G. Unlike the interactions made by T that seem to be ubiquitous in the dataset, the Asp-G and Glu-G contacts probably serve a more functional purpose in modulating the recognition between amino acid and base sequences for the formation of specific protein-DNA complexes.

The Role of Water Molecules in Protein-DNA Recognition

Several studies have addressed the role of water molecules in protein-DNA complexes (refs. 12,77–82,83 and references therein). As described by Reddy et al.,81 water molecules found in protein-DNA complexes can play several roles: (i) waters contacting both the protein and DNA, thus mediating interaction and specific recognition; (ii) waters contacting polar groups belonging either to the protein or to the DNA, thus acting as solvating agents or buffering electrostatic repulsions between DNA phosphate groups and protein electronegative atoms; (iii) waters contacting hydrophobic groups in either the protein or DNA; and (iv) waters in contact only with other water molecules as part of the hydrogen-bond network. Therefore, only a small number of the crystallographically detected waters in protein-DNA complexes are involved in amino acid–base recognition. Indeed, our analysis indicates that only 22% of water molecules potentially mediating the protein-DNA recognition (located within 4.0 Å of both protein and DNA at the complex interface) can be demonstrated to act as protein-DNA linkers. This set of 431 nonredundant molecules corresponds to only 2.3% of all the crystallographic waters in our dataset. Similarly, Reddy reported that only 2% of 17,963 crystallographic water molecules compensate for the lack of direct hydrogen bonding, while the remainder fill void spaces at the protein-DNA interface.81

Water-mediated contacts (see Fig. 6) between A and positively charged or H-bond donor amino acids like Arg, Lys, Asn, Tyr, His, and Ser likely alleviate electrostatic repulsion between H-bond donor groups. C was more often found in water-mediated contacts with Asp and Glu, with the water likely acting as an extension arm for shorter amino acid side chains, as previously suggested12,81–83 and clearly observed by Otwinowski et al. in their groundbreaking study on the trp repressor/operator complex, where water-mediated H-bonds were shown to be fundamental in the specific operator-trp repressor interaction.84 The energetic contribution played by water molecules in amino acid–base recognition is summarized with the CWater coefficients (Table 2), enhancement factors that indicate the extent by which a defined amino acid increases its interaction energy with a specific base via a bridging water molecule. As reported in Table 2, Thr-A and Lys-A exhibit about tenfold and fivefold increases, respectively, whereas several amino acid–base interactions are not mediated by water and CWater = 1 for these cases.

Prediction Rules Tested in the Zinc Finger-DNA Complexes

Having compiled the frequency and energy specificity rules of Table 2, we were interested in applying our model to the prediction of amino acid–nucleotide base interactions. The rich data set of zinc finger-oligonucleotide interaction preferences obtained through phage display technology53 was used to develop a test set. As indicated in the Results section, using AA-B frequencies (FFreq) as the prediction basis yielded an essentially useless result, that is, no better than random guessing. However, using FHINT allowed us to correctly predict 70.4% of the bases interacting with the key peptide amino acid residues, a success rate somewhat better than those obtained by applying the codes proposed by Choo and Klug,9 Wolfe et al.,58 and Pabo et al.,59 which led to 57.8%, 66.7%, and 69.6%, respectively. None of these previous codes, mostly based on chemical and steric effects guiding the amino acid–base recognition, contain any energetic information, and they tolerate multiple AA-B correspondences, since a given amino acid may recognize more than one base in a specific position of the triplet.8,9,17,58,59,85

Although our results are encouraging, we have to set the context of this particular experimental data. First, the experiment is, in a sense, the opposite of our computational analysis in that we are predicting the base from the amino acid whereas the experiment is "fishing" for a specific peptide with a predefined oligonucleotide as "bait." Since Nature has provided five times more unique amino acids than bases, specificity might not be necessarily commutative. Clearly the phage display experiments are not free of context-dependent effects, as certain peptide sequences would likely adopt conformations inconsistent with a zinc finger due to side chain–side chain interactions. Other factors such as the nature of the nucleotide bases (e.g., pyrimidines are less accessible than purines), the size/length of amino acids, target site overlap, cross-subsite contacts, and sequence-dependent conformational effects in DNA structures may impact the experiments.20,65,66,86–88 Since these effects are not parameterized in computational predictions, they likely account for at least some of the incorrect (40/135) predictions by our model.

Second, 10 AA-B mismatches involve the presence of a thymine in the middle position of the DNA triplet (Table 3). FHINT does not consider any AA-T interaction other than Phe with T to be both preferred and favorable. Thus, the amino acid residue in position 3 interacting with T in the middle of the DNA triplet is probably mostly determined by the nature of the neighboring amino acids located in positions -1 and 6.

Third, the possible presence of a water molecule, able to mediate the specific Thr-T recognition in the zinc finger motif, may guide the nine associations between Thr and T in -1 and 3', respectively. The heterogeneous character of our dataset could explain the failed identification of the Thr-T pair as a potential preference, even when the energetic enhancement due to water molecule contribution is considered (see Table 2). The same, or an additional, water molecule could also stabilize the recognition helix by mediating favorable interaction between Thr in -1 and other polar, H-bond acceptor, residues (Asn, Asp, Glu, or His) often found in position 3. In general, specific recognition of many nucleotides may be dictated by motifs rather than single residues, as suggested by Segal et al. for the combination {Thr(-1), Ser(1), Gly(2)} interacting with T in 3' in four 5'-GNN-3' triplets65 or by Dreier et al. for the cases of {His(-1), Arg(1), Thr(2)} and {His(-1), Lys(1), Asn(2)} that both recognize T in 3' of two 5'-ANN-3' triplets.66

Fourth, substitution of the expected Thr with His or Ser in positions -1 or 3 in three cases may be explainable because the presence of two threonines on the same side of the finger is seemingly disallowed and, in fact, is very seldom seen in zinc finger α-helices, probably due to steric hindrance.

Fifth, no more than one negatively charged residue is present in each zinc finger motif (positions 6, 3, -1). In five cases His, Ser, or Thr were phage displayed in position 3 as the complement to C as the middle DNA base. In isolation, Asp would likely have been selected in the phage, and our model would have predicted C as the middle base. Three similar mismatches were observed in the 5'-CNC-3' triplet, where only C in 5' was hydrogen-bonded to Glu in position 6, while C in 3' interacted with His or Ser.

Sixth, Dreier et al.66 observed eight different residues selected to bind A in 5'-ANN-3' triplets. This lack of specificity clearly impacts our predictions in seven cases.

Lastly, as there are only crystal data on a small handful of structures in the Zif268 motif, some of the substitutions in the dataset may have structurally modified or perturbed the expected one-to-one interaction pattern that is the core assumption of our predictions.

Despite the shortcomings of this test dataset, some of which are structural and others phenomenological, our predictive tool managed to successfully describe 95/135 AA-B interactions based on a simple one-to-one amino acid-to-nucleotide base model. The increased prediction success rate, 89.7%, obtained applying the weighted success descriptor SW [Eq. (2)] suggests that it is important to take into account the relevance of each AA-B pair with respect to the total energy of the recognition AA-B pairs, that is, to weight the relative energetic preferences of an amino acid for a given base with respect to the absolute energetic preference of an AA-B contact over the entire dataset. Further weighting in the prediction, including CWater, the water enhancement factor (Table 2), led to only a small increase of the prediction success rate. For some systems, where bridging water molecules might be more significant than in zinc finger-DNA complexes, the water contribution to the prediction may prove more relevant.

These findings support the principle that specific amino acid–base binding is essentially driven by energy-based preferences, and that recognition codes based only on the frequency of observed contacts cannot completely decipher the relevance of the different amino acid–base interactions. We believe that evaluation of both the enthalpic and entropic contributions of amino acid–base binding represents the first step toward the definition of reliable recognition rules. Nevertheless, it is clear that context-dependent effects significantly affect association preferences/energetics, and further investigations are necessary to better understand the rules guiding amino acid–base interactions and to further improve the reliability of predictions.

Conclusions
The analyses that we have carried out indicate that: (i) the contacts between some amino acids and DNA bases (Arg-G, Lys-G, Glu-C, Asp-C, Asn-A, Gln-A, Ser-G, His-G) are highly specific and conserved among all protein-DNA complexes and contribute about 80% of the total specificity free energy; (ii) these frequent and strong contacts involve G, C, and A, but not T. This finding is intriguing since T is the base that differs between DNA and RNA, and may have an evolution-related interpretation. In fact, the code for interaction between protein and nucleic acid might have been developed during the "RNA world" and only subsequently adapted to recognize DNA. To verify this hypothesis, a similar analysis should be carried out on protein-RNA complexes. It is also interesting to note that amino acid-T interactions are the most frequent (but weak) and account for most of the negative HINT score.

This work represents the first step in developing rules for the prediction of amino acid or base counterparts in known protein-DNA interaction motifs. Several factors should be accounted for when trying to understand and decipher the contribution of each amino acid–base interaction to a protein-DNA binding process: the frequency of occurring contacts, the contribution of bridging water molecules, but most of all the energetics and sign of each interaction. Together, these factors explicitly and, for the most part, unambiguously detail the relative preference of each amino acid–base contact in global protein-DNA binding. On the horizon, decoding context-dependent preferences to understand sequence factors in interaction preferences is a logical next step. We believe our computational tool set and methodologies will make this possible.

Acknowledgments
The authors thank colleagues of the Department of Biochemistry and Molecular Biology, University of Parma, for several illuminating discussions.

References
1. Seeman, N. C.; Rosenberg, J. M.; Rich, A. Proc Natl Acad Sci USA 1976, 73, 804.
2. Pabo, C. O.; Sauer, R. T. Annu Rev Biochem 1984, 53, 293.
3. Desjarlais, J. R.; Berg, J. M. Proc Natl Acad Sci USA 1992, 89, 7345.
4. Suzuki, M. Structure 1994, 2, 317.
5. Suzuki, M.; Yagi, N. Proc Natl Acad Sci USA 1994, 91, 12357.
6. Mandel-Gutfreund, Y.; Schueler, O.; Margalit, H. J Mol Biol 1995, 253, 370.
7. Choo, Y.; Klug, A. Proc Natl Acad Sci USA 1994, 91, 11163.
8. Choo, Y.; Klug, A. Proc Natl Acad Sci USA 1994, 91, 11168.
9. Choo, Y.; Klug, A. Curr Opin Struct Biol 1997, 7, 117.
10. Mandel-Gutfreund, Y.; Margalit, H. Nucleic Acids Res 1998, 26, 2306.
11. Kono, H.; Sarai, A. Proteins 1999, 35, 114.
12. Luscombe, N. M.; Laskowski, R. A.; Thornton, J. M. Nucleic Acids Res 2001, 29, 2860.
13. Benos, P. V.; Lapedes, A. S.; Fields, D. S.; Stormo, G. D. Pac Symp Biocomput 2001, 115.
14. Lejeune, D.; Delsaux, N.; Charloteaux, B.; Thomas, A.; Brasseur, R. Proteins 2005, 61, 258.
15. Benos, P. V.; Lapedes, A. S.; Stormo, G. D. Bioessays 2002, 24, 466.

16. Moravek, Z.; Neidle, S.; Schneider, B. Nucleic Acids Res 2002, 30, 1182.
17. Desjarlais, J. R.; Berg, J. M. Proc Natl Acad Sci USA 1993, 90, 2256.
18. Matthews, B. W. Nature 1988, 335, 294.
19. Nardelli, J.; Gibson, T. J.; Vesque, C.; Charnay, P. Nature 1991, 349, 175.
20. Wolfe, S. A.; Nekludova, L.; Pabo, C. O. Annu Rev Biophys Biomol Struct 2000, 29, 183.
21. Jamieson, A. C.; Wang, H.; Kim, S. H. Proc Natl Acad Sci USA 1996, 93, 12834.
22. Jayaram, B.; McConnell, K. J.; Dixit, S. B.; Beveridge, D. L. J Comput Phys 1999, 151, 333.
23. Jayaram, B.; McConnell, K.; Dixit, S. B.; Das, A.; Beveridge, D. L. J Comput Chem 2002, 23, 1.
24. Pichierri, F.; Aida, M.; Gromiha, M. M.; Sarai, A. J Am Chem Soc 1999, 121, 6152.
25. Yoshida, T.; Nishimura, T.; Aida, M.; Pichierri, F.; Gromiha, M. M.; Sarai, A. Biopolymers 2001, 61, 84.
26. Packer, M. J.; Dauncey, M. P.; Hunter, C. A. J Mol Biol 2000, 295, 85.
27. Sponer, J.; Leszczynski, J.; Hobza, P. J Biomol Struct Dyn 1996, 14, 117.
28. Piacenza, M.; Grimme, S. J Comput Chem 2004, 25, 83.
29. Paillard, G.; Lavery, R. Structure 2004, 12, 113.
30. Beveridge, D. L.; Barreiro, G.; Byun, K. S.; Case, D. A.; Cheatham, T. E., III; Dixit, S. B.; Giudice, E.; Lankas, F.; Lavery, R.; Maddocks, J. H.; Osman, R.; Seibert, E.; Sklenar, H.; Stoll, G.; Thayer, K. M.; Varnai, P.; Young, M. A. Biophys J 2004, 87, 3799.
31. Arauzo-Bravo, M. J.; Fujii, S.; Kono, H.; Ahmad, S.; Sarai, A. J Am Chem Soc 2005, 127, 16074.
32. Olson, W. K.; Gorin, A. A.; Lu, X. J.; Hock, L. M.; Zhurkin, V. B. Proc Natl Acad Sci USA 1998, 95, 11163.
33. Michael Gromiha, M.; Siebers, J. G.; Selvaraj, S.; Kono, H.; Sarai, A. J Mol Biol 2004, 337, 285.
34. Selvaraj, S.; Kono, H.; Sarai, A. J Mol Biol 2002, 322, 907.
35. Kellogg, G. E.; Semus, S. F.; Abraham, D. J. J Comput Aided Mol Des 1991, 5, 545.
36. Abraham, D. J.; Kellogg, G. E. J Comput Aided Mol Des 1994, 8, 41.
37. Kellogg, G. E.; Abraham, D. J. Eur J Med Chem 2000, 35, 651.
38. Dill, K. A. J Biol Chem 1997, 272, 701.
39. Hansch, C.; Leo, A. J. Substituent Constants for Correlation Analysis in Chemistry and Biology; Wiley: New York, 1979.
40. Cashman, D. J.; Scarsdale, J. N.; Kellogg, G. E. Nucleic Acids Res 2003, 31, 4410.
41. Cashman, D. J.; Kellogg, G. E. J Med Chem 2004, 47, 1360.
42. Cozzini, P.; Fornabaio, M.; Marabotti, A.; Abraham, D. J.; Kellogg, G. E.; Mozzarelli, A. J Med Chem 2002, 45, 2469.
43. Kellogg, G. E.; Fornabaio, M.; Spyrakis, F.; Lodola, A.; Cozzini, P.; Mozzarelli, A.; Abraham, D. J. J Mol Graph Model 2004, 22, 479.
44. Spyrakis, F.; Cozzini, P.; Bertoli, C.; Marabotti, A.; Kellogg, G. E.; Mozzarelli, A. BMC Struct Biol 2007, 7, 4.
45. Marabotti, A.; Colonna, G.; Facchiano, A. J Comput Chem 2007, 28, 1031.
46. Fornabaio, M.; Cozzini, P.; Mozzarelli, A.; Abraham, D. J.; Kellogg, G. E. J Med Chem 2003, 46, 4487.
47. Spyrakis, F.; Fornabaio, M.; Cozzini, P.; Mozzarelli, A.; Abraham, D. J.; Kellogg, G. E. J Am Chem Soc 2004, 126, 11764.
48. Burnett, J. C.; Botti, P.; Abraham, D. J.; Kellogg, G. E. Proteins 2001, 42, 355.
49. Fornabaio, M.; Spyrakis, F.; Mozzarelli, A.; Cozzini, P.; Abraham, D. J.; Kellogg, G. E. J Med Chem 2004, 47, 4507.


50. Cozzini, P.; Fornabaio, M.; Marabotti, A.; Abraham, D. J.; Kellogg, G. E.; Mozzarelli, A. Curr Med Chem 2004, 11, 3093.
51. Kellogg, G. E.; Chen, D. L. Chem Biodivers 2004, 1, 98.
52. Amadasi, A.; Spyrakis, F.; Cozzini, P.; Abraham, D. J.; Kellogg, G. E.; Mozzarelli, A. J Mol Biol 2006, 358, 289.
53. Ghosh, I.; Stains, C. I.; Ooi, A. T.; Segal, D. J. Mol Biosyst 2006, 2, 551.
54. Berman, H.; Henrick, K.; Nakamura, H.; Markley, J. L. Nucleic Acids Res 2007, 35, D301–D303 (Database issue).
55. Berman, H. M.; Olson, W. K.; Beveridge, D. L.; Westbrook, J.; Gelbin, A.; Demeny, T.; Hsieh, S. H.; Srinivasan, A. R.; Schneider, B. Biophys J 1992, 63, 751.
56. Katti, M. V.; Sami-Subbu, R.; Ranjekar, P. K.; Gupta, V. S. Protein Sci 2000, 9, 1203.
57. Elrod-Erickson, M.; Benson, T. E.; Pabo, C. O. Structure 1998, 6, 451.
58. Wolfe, S. A.; Greisman, H. A.; Ramm, E. I.; Pabo, C. O. J Mol Biol 1999, 285, 1917.
59. Pabo, C. O.; Peisach, E.; Grant, R. A. Annu Rev Biochem 2001, 70, 313.
60. Dreier, B.; Fuller, R. P.; Segal, D. J.; Lund, C. V.; Blancafort, P.; Huber, A.; Koksch, B.; Barbas, C. F., III. J Biol Chem 2005, 280, 35588.
61. Kang, J. S. Biochem J 2007, 403, 177.
62. Wuttke, D. S.; Foster, M. P.; Case, D. A.; Gottesfeld, J. M.; Wright, P. E. J Mol Biol 1997, 273, 183.
63. Pavletich, N. P.; Pabo, C. O. Science 1991, 252, 809.
64. Wolfe, S. A.; Grant, R. A.; Elrod-Erickson, M.; Pabo, C. O. Structure 2001, 9, 717.
65. Segal, D. J.; Dreier, B.; Beerli, R. R.; Barbas, C. F., III. Proc Natl Acad Sci USA 1999, 96, 2758.
66. Dreier, B.; Beerli, R. R.; Segal, D. J.; Flippin, J. D.; Barbas, C. F., III. J Biol Chem 2001, 276, 29466.


67. Wu, H.; Yang, W. P.; Barbas, C. F., III. Proc Natl Acad Sci USA 1995, 92, 344.
68. Greisman, H. A.; Pabo, C. O. Science 1997, 275, 657.
69. Suzuki, M.; Brenner, S. E.; Gerstein, M.; Yagi, N. Protein Eng 1995, 8, 319.
70. Lustig, B.; Jernigan, R. L. Nucleic Acids Res 1995, 23, 4707.
71. Jones, S.; Barker, J. A.; Nobeli, I.; Thornton, J. M. Nucleic Acids Res 2003, 31, 2811.
72. Jones, S.; Shanahan, H. P.; Berman, H. M.; Thornton, J. M. Nucleic Acids Res 2003, 31, 7189.
73. Tsuchiya, Y.; Kinoshita, K.; Nakamura, H. Proteins 2004, 55, 885.
74. Desjarlais, J. R.; Berg, J. M. Proc Natl Acad Sci USA 1994, 91, 11099.
75. Levitt, M.; Perutz, M. F. J Mol Biol 1988, 201, 751.
76. Sarai, A.; Kono, H. Annu Rev Biophys Biomol Struct 2005, 34, 379.
77. Oda, M.; Nakamura, H. Genes Cells 2000, 5, 319.
78. Parsegian, V. A.; Rand, R. P.; Rau, D. C. Methods Enzymol 1995, 259, 43.
79. Garner, M. M.; Rau, D. C. EMBO J 1995, 14, 1257.
80. Sidorova, N. Y.; Rau, D. C. J Mol Biol 2001, 310, 801.
81. Reddy, C. K.; Das, A.; Jayaram, B. J Mol Biol 2001, 314, 619.
82. Schwabe, J. W. Curr Opin Struct Biol 1997, 7, 126.
83. Jayaram, B.; Jain, T. Annu Rev Biophys Biomol Struct 2004, 33, 343.
84. Otwinowski, Z.; Schevitz, R. W.; Zhang, R. G.; Lawson, C. L.; Joachimiak, A.; Marmorstein, R. Q.; Luisi, B. F.; Sigler, P. B. Nature 1988, 335, 321.
85. Desjarlais, J. R.; Berg, J. M. Proteins 1992, 12, 101.
86. Segal, D. J.; Barbas, C. F., III. Curr Opin Chem Biol 2000, 4, 34.
87. Pabo, C. O.; Nekludova, L. J Mol Biol 2000, 301, 597.
88. Segal, D. J.; Crotty, J. W.; Bhakta, M. S.; Barbas, C. F., III; Horton, N. C. J Mol Biol 2006, 363, 405.
