Forbidden penta-peptides


TAMIR TULLER,1 BENNY CHOR,1 AND NATHAN NELSON2

1 School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
2 The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel

(Received June 12, 2007; Final Revision July 9, 2007; Accepted July 9, 2007)

Abstract

There are 3,200,000 amino acid sequences of length 5 (penta-peptides). Statistically, we expect to see a distribution of penta-peptides that is determined by the frequency of the participating amino acids. We show, however, that not only are there thousands of such penta-peptides that are absent from all known proteomes, but many of them are coded for multiple times in the non-coding genomic regions. This suggests a strong selection process that prevents these peptides from being expressed. We also show that the characteristics of these forbidden penta-peptides vary among different phylogenetic groups (e.g., eukaryotes, prokaryotes, and archaea). Our analysis provides the first steps toward understanding the "grammar" of the forbidden penta-peptides.

Keywords: short peptides; proteomes; evolutionary selection; protein grammar; phylogenetic groups

Supplemental material: see www.proteinscience.org

Proteins, consisting of amino acid (AA) sequences, are the building blocks of all living cells and organisms. A great deal of work has addressed the effect of existing AA sequences on the spatial arrangements and conformation of proteins (Krogh et al. 1994; Abe and Mamitsuka 1997; Durbin et al. 1998; Bystroff et al. 2000; Eddy 2004). But what about nonexisting AA sequences? A simple counting argument shows that, out of the possible AA sequences of length 100 ($20^{100}$ sequences), only a negligible fraction can appear in the proteome. At the other extreme, all 400 possible sequences of two amino acids are present in the proteome. Consider, then, intermediate length sequences, for example, the $20^5 = 3{,}200{,}000$ AA sequences of length 5. A probabilistic argument shows that sequences composed of frequently used AAs are expected to be present in many of the known proteomes. Herein, we show that thousands of penta-peptides do not appear in any known proteome. Moreover, an analysis of several genomes reveals that these penta-peptides are not encoded by

Reprint requests to: Tamir Tuller, School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; e-mail: [email protected]; fax: 03-6409357. Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.073067607.

expressed genes, but DNA sequences encoding the penta-peptides are nevertheless present in the non-coding genomic regions. This observation suggests a strong editing process that prevents these peptides from being expressed. We have termed such AA sequences "forbidden penta-peptides."

Biological processes such as conformation of a protein (Ramachandran and Sasisekharan 1968; Kotelchuck et al. 1969; Chou and Fasman 1974; Chothia et al. 1981; Blaber et al. 1993), signaling within the cell (Prochnicka-Chalufour et al. 1991; Pouysségur 2000; Zhang and Xiong 2001; Neduva et al. 2005), and recognition of materials (Szuromi 2005) are often mediated by relatively short peptide sequences. Do forbidden penta-peptides also have similarly important roles in shaping the conformation of proteins?

We examined 368 organisms from the bacteria, archaea, and eukaryote kingdoms, for which both genome and proteome information is currently available. Our analysis is aimed at understanding this set of forbidden penta-peptides and at defining initial "grammar" rules for it. We also show that the characteristics of these forbidden penta-peptides vary among different phylogenetic groups (e.g., eukaryotes, prokaryotes, and archaea).

Many bioinformatic investigations have explored the sequences of amino acids in proteins (see, for example,

Protein Science (2007), 16:2251–2259. Published by Cold Spring Harbor Laboratory Press. Copyright © 2007 The Protein Society


Gavel and Heijne 1992; Echols et al. 2002; Qi et al. 2004; White and Heijne 2004; Otaki et al. 2005; White and Heijne 2005; Ulitsky et al. 2006; Hampikian and Andersen 2007), or have attempted to model proteins by various probability models (Krogh et al. 1994; Abe and Mamitsuka 1997; Durbin et al. 1998; Bystroff et al. 2000; Eddy 2004). Otaki and colleagues1 (Otaki et al. 2005) have examined the space of "missing" AA sequences and have discovered the missing penta-peptides. Their analysis, however, did not take into account the non-coding parts of the genome. They scored missing sequences solely on the basis of amino acid frequencies in all proteomes, with corresponding effects assigned to the biological significance of the sequences they reported (see Supplemental material). A systematic, exhaustive study of missing DNA k-mers was recently carried out by Hampikian and Andersen (2007).2

Preliminaries and definitions

A "missing" sequence, relative to a specific set of proteomes, is one that does not appear as a subsequence in the set. This condition by itself is not always meaningful, so we apply additional considerations and criteria to determine when a missing AA sequence is significant. It is clear that, to be of interest, the length of the missing sequence, $k$, should be as short as possible, while the total length of the proteomes, $l_p$, should be as large as possible. Hence for $l_p$ we take the combined length of all specified proteomes, while we assign $k = 5$, simply because every possible quartet of amino acids does appear in the proteomes analyzed.

A k-mer may be absent from a set of proteomes either by chance or for biological reasons. We wanted to identify sequences that are missing due to the latter possibility. We used two filters (i.e., corresponding P-values) for identifying such sequences. First, we assume that random changes at the DNA level are generated in both the coding and the non-coding regions by similar processes. Consequently, under similar evolutionary pressures, amino acid sequences appearing frequently in the non-coding regions would be expected to appear with similar frequencies in the coding regions. The encoding of a missing sequence (by the standard genetic code) may actually appear many times in the "non-coding" regions of one or more organisms.
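The definition of a missing sequence can be illustrated with a minimal sketch. The function names and the toy input below are hypothetical and are not taken from the paper; the sketch simply enumerates every k-mer over an alphabet and reports those that never occur as a contiguous substring of any protein in the set.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids


def observed_kmers(proteins, k):
    """Return the set of all length-k substrings seen in any protein."""
    seen = set()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def missing_kmers(proteins, k, alphabet=AMINO_ACIDS):
    """Return every k-mer over `alphabet` that is absent from `proteins`."""
    seen = observed_kmers(proteins, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in seen)


# Toy example (hypothetical data): 3-letter alphabet, k = 2.
toy_proteome = ["ACCA", "CWA"]
print(missing_kmers(toy_proteome, 2, alphabet="ACW"))
```

For k = 5 over the full alphabet the candidate space has 20^5 = 3,200,000 entries, so a set-based scan of this kind remains feasible in memory; the statistical filters described in the text are what separate merely missing k-mers from significant ones.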

Throughout this paper, we will refer to peptides "encoded" by DNA sequences present in non-coding regions of the genome as "virtual peptides" or "v-peptides." The frequent occurrence of such v-peptides in non-coding regions, together with their absence from coding regions, suggests the presence of an editing selection that eliminates them from coding regions. We identified short v-peptides by considering all DNA sequences of 15 nucleotide bases in length (corresponding to five codons) in the non-coding regions (taking all six reading frames into account).

Second, short AA sequences consisting mainly of "rare" amino acids are a priori much more likely to be missing. To prevent bias and to filter out such peptides, we took the cellular abundance of amino acids in the proteome into account: A missing sequence containing frequent AAs is assigned a higher significance (lower P-value) than a missing sequence containing rare AAs.

We associated two P-values with each missing sequence. The first P-value, denoted "non-coding P-value," is determined by the number of occurrences of each v-peptide in the non-coding regions and the density of coding regions in the genome. The second P-value, the "background P-value," is determined by the frequencies of the AAs in the entire proteome (this is also known as the "background frequency") and by the length of the proteome. For a given density of the coding regions, the non-coding P-value is smaller (more significant) when the number of occurrences of the v-peptide is higher.3 For a given proteome length, the background P-value of a given penta-peptide is smaller (more significant) when the frequencies of its amino acids are higher. The exact formulas of each P-value appear in the Materials and Methods section; more details and examples of the P-values appear in the Supplemental Materials and Methods.

We studied all the organisms whose complete genomes and proteomes were known as of May 2006. This data set contains 368 organisms (27 archaea, 28 eukaryotes, and 313 prokaryotes). The total length of these 368 genomes is $2.05 \times 10^{10}$ base pairs (bp), while the total length of their annotated proteomes is $6.39 \times 10^{8}$ amino acids. Our study focuses on missing sequences of length $k = 5$, since there are no missing peptide sequences of 4 amino acids. Details about the specifications of the database appear in the Materials and Methods and in the Supplemental Materials and Methods sections.

1 All penta-peptides that were reported as missing in the work of Otaki and colleagues do appear in our database, which is more up-to-date (and more than four times larger than the data set Otaki and colleagues used) (Otaki et al. 2005).

2 Hampikian and Andersen mainly describe a method for fast calculation of k-mers that do not appear in a set of proteins. They do not deal with the statistical significance of such k-mers nor do they research the features of such sequences (Hampikian and Andersen 2007).


3 The non-coding P-value is calculated under the assumption that the distribution of DNA sequences of 15 nucleotide bases in length is identical in coding and non-coding regions. Of course this is an oversimplification, since coding sequences and non-coding sequences experience very different functional and evolutionary pressures. But this fact does suggest a strong selection process that prevents such forbidden penta-peptides from being expressed.


Given a sequence of AAs that is missing from the proteomes of a set of species, we call it forbidden with respect to these species if both P-values are significant (P-values < 0.05, after FDR correction for multiple testing; Benjamini and Yekutieli 2001). A peptide is called universally forbidden if it is forbidden with respect to all species (whose genome and proteome are known). A specific penta-peptide may be missing from the proteomes of all 368 organisms, yet it may not be forbidden, either because the peptide consists mainly of rare amino acids or because it is encoded infrequently as a v-peptide in the non-coding regions of the 368 genomes. These phenomena will cause one P-value (or both) to be above the 0.05 threshold, so that the peptide is not universally forbidden. Suppose B and C are groups of species and B is a proper subset of C. Due to different background frequencies of AAs in the B and C proteomes, different abundance of v-peptides in the non-coding parts of the B and C genomes, and other related factors such as proteome lengths and density of coding regions, a peptide may be forbidden with respect to B but not with respect to C, or vice versa.

Results

Considering all 368 proteomes in our data set, we identified 8084 missing amino acid sequences of length 5 ("missing penta-peptides"). Out of those, 5000 are classified as universally forbidden. When the total length of all proteomes, $l_p = 6.4 \times 10^{8}$, and the AA abundance are taken into account, the most significant forbidden penta-peptides are actually expected to appear ~19 times in a random proteome of that length. When taking into account the lengths of the proteome, the genome, and the number of times a forbidden v-penta-peptide is encoded in the non-coding regions, the most significant forbidden peptides are expected to appear ~56 times in a random proteome of length $l_p$.4

We analyzed the sets of missing and forbidden penta-peptides in nine major phylogenetic groups. These include all organisms, archaea, bacteria, eukaryotes, plants and fungi, animals, insects, vertebrates, and mammals. For each group, Figure 1A gives the number of missing penta-peptides, the number of forbidden penta-peptides with respect to the group, and the number of forbidden penta-peptides that are present in a sibling phylogenetic group (the sibling groups are described by

4 It is clear that, as more proteomes become available, forbidden penta-peptides may appear. However, we expect that it will be at much lower frequencies than other penta-peptides. Furthermore, if more proteomes become available the statistics of coding and non-coding sequences may change and new missing sequences may become forbidden.

the tree in Fig. 1C). For example, there are 2862 penta-peptides that are forbidden with respect to eukaryotes and are present both in bacteria and in archaea. Note that archaea do not share any forbidden sequences with bacteria and eukaryotes. All the forbidden penta-peptides with respect to archaea are present either in eukaryotes or in bacteria. Likewise, no universally forbidden penta-peptide is also forbidden with respect to archaea (even though it is of course missing in archaea). The main reason is the short proteome of archaea in our data set (few sequenced organisms, with short proteomes) and, to a lesser extent, the different AA frequencies in it, caused presumably by their extreme living conditions. This typically causes the background P-value to be insignificant. For example, the two universally forbidden penta-peptides with the pair of most significant P-values are WCFNL and FFMCT. But with respect to the short proteome and genome of archaea, WCFNL has log background P-value −0.13 (log non-coding P-value −6.4, nine v-peptide occurrences), while FFMCT has log background P-value −0.34 (log non-coding P-value −2.1, three v-peptide occurrences).

The distribution of the two log P-values for the forbidden penta-peptides with respect to each of the phylogenetic groups is described in Figure 1B. Numerous penta-peptides with significant P-values with respect to each phylogenetic group were identified. Table 1A lists the 12 most significant universally forbidden penta-peptides by the two measures. All have log non-coding P-values smaller than or equal to −31, and log background P-values smaller than or equal to −6.7. Their respective v-peptides are encoded more than 1450 times in the non-coding regions. As a specific example, the first sequence, WCFNL, is coded for 3639 times in the non-coding regions of all 368 organisms. We remark that the order and structure of the peptide are important. Consequently, shuffling the AAs of a universally forbidden penta-peptide (e.g., WCFNL or FFMCT) may yield a penta-peptide that is present in one or more species (e.g., LWCFN or FMCTF, respectively).5

These highly significant universally forbidden penta-peptides can often be explained by their biochemical properties. For example, the forbidden penta-peptide WCFNL includes tryptophan, the largest AA, which is very bulky, hydrophobic, and with a large aromatic group; cysteine, which can generate disulfide bonds, and is hydrophobic; phenylalanine, a very large amino acid,

5 Since we used an i.i.d. background model, two penta-peptides with the same composition of amino acids but in different order will have similar background P-value. Using probabilistic models where different sites are dependent may lead to a larger difference between the P-values of two such penta-peptides.


Figure 1. (A) Distributions of missing and forbidden penta-peptides with respect to phylogenetic groups. For each group we present the total number of organisms, the total length of the proteomes, the number of missing penta-peptides, the number of forbidden penta-peptides, and the number of forbidden penta-peptides that are present in a sibling phylogenetic group. (B) Distributions of forbidden penta-peptides for each phylogenetic group, split according to three ranges of log P-values (nine major squares). The partition of each major square into nine subsquares (a subsquare for each of the phylogenetic groups) is according to the legend on the left. (C) A tree that describes the nine phylogenetic groups. Two phylogenetic groups with a common ancestor in the tree are called "sibling groups" (e.g., the Insects and the Vertebrates). The group "non-mammals" includes the organisms in our data set that are vertebrates but not mammals (two fish and a bird), so it is not monophyletic.


Table 1. List of universal forbidden penta-peptides with the most significant P-values and the distribution of amino acids in forbidden penta-peptides

(A) Universal forbidden penta-peptides with two low P-values (non-coding log P-value < −31, and background log P-value < −6.9). Note that all these sequences contain "C" (cysteine).

No.  Sequence  log P-value (non-coding)  log P-value (background)  No. of v-peptides found
1    WCFNL     −75                       −8.5                      3639
2    FFMCT     −60                       −7                        2882
3    RTCMY     −57                       −7.5                      2748
4    QTKCH     −53                       −10.6                     2570
5    HCVNY     −47                       −6.7                      2261
6    KFMCF     −46                       −7.8                      2255
7    WEGPC     −40                       −12                       1963
8    PYLWC     −39                       −7.2                      1885
9    KWCFV     −37                       −7                        1783
10   RNMFC     −34                       −8.7                      1658
11   FNTCM     −31                       −7.7                      1526
12   TYIMC     −31                       −6.9                      1486

(B) The distribution of the amino acids in each of the positions of the universal forbidden penta-peptide, and the distribution of amino acids in our data set (last column).

AA   Pos. 1  Pos. 2  Pos. 3  Pos. 4  Pos. 5  Background
A    0.0072  0.0108  0.0116  0.013   0.0138  0.0857
R    0.0146  0.0110  0.0108  0.0096  0.0126  0.0582
N    0.0418  0.0382  0.0382  0.0392  0.0382  0.0384
D    0.0258  0.0280  0.0238  0.0178  0.0202  0.0521
C    0.1634  0.1598  0.1606  0.1618  0.16    0.0146
Q    0.0480  0.0376  0.0294  0.0366  0.0348  0.0408
E    0.0260  0.0266  0.0224  0.027   0.0286  0.0639
G    0.0096  0.008   0.0098  0.0088  0.0092  0.0711
H    0.0920  0.0752  0.071   0.069   0.0778  0.0231
I    0.026   0.0232  0.0296  0.0252  0.0266  0.0548
L    0.0032  0.0044  0.0046  0.0024  0.0040  0.0993
K    0.0382  0.027   0.023   0.03    0.0274  0.0531
M    0.1228  0.1594  0.1724  0.1762  0.1388  0.0233
F    0.0294  0.0406  0.036   0.0342  0.0390  0.0383
P    0.0268  0.0254  0.0286  0.0274  0.0276  0.0516
S    0.0096  0.007   0.0086  0.0086  0.0104  0.0699
T    0.0238  0.025   0.027   0.0222  0.0258  0.0536
W    0.2112  0.2078  0.209   0.2166  0.2212  0.0124
Y    0.0656  0.069   0.0666  0.0652  0.0712  0.0283
V    0.015   0.016   0.017   0.0092  0.0128  0.0666

hydrophobic, and with a large aromatic group; asparagine, a hydrophilic residue, usually appearing on the surface of the protein; and leucine, a hydrophobic, branched-chain amino acid. Since the amino acids W (position 1), C (position 2), F (position 3), and L (position 5) are hydrophobic, they tend to force this penta-peptide toward the inward side of a protein. However, two of the amino acids, W and F (positions 1 and 3), are bulky, while a third, N, is hydrophilic (position 4). These conflicting forces may disrupt the folding of the protein. Furthermore, the cysteine can generate false S-S disulfide bonds, and they, too, may disrupt the folding of the protein. Some of the forbidden penta-peptides from Table 1A have similar contradictory

properties, and most of them (11 out of 12) include both hydrophobic and hydrophilic amino acids. One possible explanation is that peptides containing AAs with contradictory properties negatively affect folding. To explore this possibility, we replaced random penta-peptides with forbidden ones and analyzed the resulting proteins using the FoldIndex program (Prilusky et al. 2005). This program predicts whether proteins are intrinsically unfolded. Our analysis of several proteins by this program showed that the forbidden sequences extended the folded part of the protein at the expense of the unfolded regions. Replacing penta-peptides by forbidden ones enhanced protein folding irrespective of


whether the replaced AAs resided in the folded or the unfolded region of the protein. Thus, it is possible that those forbidden penta-peptides are selected against in order to prevent premature rigidity in proteins.

We noticed that, in universally forbidden penta-peptides, the amino acids cysteine and tryptophan prevail. This may be due to the reactivity of cysteine residues and the bulkiness of the tryptophan side chains. Surprisingly, sequences of five consecutive cysteines or five consecutive tryptophans are present in several existing proteins. This may be explained by the fact that five hydrophobic amino acids (such as cysteine or tryptophan), without interleaved hydrophilic amino acids, do not have the mutually contradicting forces described in the example above, and can reside in the inward side of a protein. There are only 21 universally forbidden penta-peptides (out of 5000) that contain neither cysteine nor tryptophan (see Supplemental Table S1).

There are no universally missing peptides shorter than 5 amino acids. However, there are motifs of length 3 and 4 that are common to many universally forbidden penta-peptides. Supplemental Tables S2 and S3 list the most frequent motifs of length 4 and 3 that appear in the universal forbidden penta-peptides. For example, the motifs YMWC, WCMI, and MWMC appear 11, 11, and 10 times, respectively, in the universally forbidden penta-peptides, while the motifs CMW, MWC, and MCW appear 97, 90, and 89 times, respectively. These motifs can be used for formulating composite "forbidden grammatical" rules. For example, the motif YMWC determines a universally forbidden penta-peptide if and only if it appears right after the amino acids R, D, Q, L, S, or V or before the amino acids K, P, S, V, or R. The Supplemental material details more such examples.
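The motif counts and the YMWC rule quoted above can be expressed as a short sketch. The function names are hypothetical, and the peptide list is only the 12 sequences of Table 1A rather than the full set of 5000 universally forbidden penta-peptides, so the counts it produces are not those of the Supplemental tables.

```python
from collections import Counter

# The 12 sequences listed in Table 1A (a small subset of the full forbidden set).
TABLE_1A = ["WCFNL", "FFMCT", "RTCMY", "QTKCH", "HCVNY", "KFMCF",
            "WEGPC", "PYLWC", "KWCFV", "RNMFC", "FNTCM", "TYIMC"]


def motif_counts(peptides, m):
    """Count every contiguous motif of length m across the given peptides."""
    counts = Counter()
    for pep in peptides:
        for i in range(len(pep) - m + 1):
            counts[pep[i:i + m]] += 1
    return counts


def ymwc_rule(peptide):
    """Apply the YMWC rule as stated in the text to a penta-peptide."""
    if len(peptide) != 5:
        return False
    if peptide[1:] == "YMWC":       # motif preceded by one residue
        return peptide[0] in "RDQLSV"
    if peptide[:4] == "YMWC":       # motif followed by one residue
        return peptide[4] in "KPSVR"
    return False


print(motif_counts(TABLE_1A, 2)["WC"])
print([p for p in ("RYMWC", "AYMWC", "YMWCK") if ymwc_rule(p)])
```

The rule admits six prefixes and five suffixes, that is, 11 penta-peptides in total, which agrees with the 11 occurrences of YMWC reported in the text.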
Table 1B describes the distribution of the 20 amino acids in each position of the universally forbidden penta-peptides, and in the background distribution of amino acids in all the proteomes. We noted that the distributions at different positions are usually similar, but not identical. However, sometimes the differences between positions are significant. The most striking example is the amino acid alanine (A). The frequency of A at the fourth position of the universally forbidden penta-peptides is 0.0138, more than twice the frequency of this amino acid in the first position, 0.0072 (Student's t-test, P-value 16. This supports the hypothesis that all the distributions are different. (B) The distribution of amino acid in the penta-peptides that are forbidden with respect to bacteria, archaea, and eukaryotes.

Forbidden penta-peptides with respect to different phylogenetic subgroups

The forbidden penta-peptides were also analyzed in the same nine large groups of organisms that appear in Figure 1A. A detailed analysis of each group appears in the Supplemental material. We noted differences in the characteristics of the forbidden penta-peptides for the different groups. Excluding the archaea, the Kullback-Leibler distances between the amino acid distributions in forbidden penta-peptides of recent sibling subgroups, insects and vertebrates, are larger than those between more ancient sibling groups such as bacteria vs. eukarya, or animals vs. plants and fungi (Fig. 3), even though the divergence time between pairs of such subgroups is shorter. This pattern stands in contrast to the background amino acid distribution of the same groups. It may be related to more specific and targeted physiological constraints that shape the characteristics of forbidden penta-peptides in more recent sibling subgroups.


Figure 3. Amino acid distribution in penta-peptides forbidden with respect to large phylogenetic subgroups. The numbers on the solid edges denote the symmetric Kullback-Leibler distances (Cover and Thomas 1991) between the amino acid distributions across the edge; the dashed lines denote the symmetric Kullback-Leibler distances between the background distribution of amino acids. Chi-square test of all the distributions (background and in forbidden penta-peptides) gives very significant results, all log P-values >16. This supports the hypothesis that all the distributions are different. The estimated time from the divergence of each pair of sibling groups appears on the right.
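A minimal sketch of a symmetric Kullback-Leibler distance between two amino acid distributions follows. The source does not spell out its exact definition (sum or average of the two directed divergences, or the logarithm base), so the sum of both directions in base 2 is an assumption here, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Directed divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Assumed symmetric form: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy two-letter distributions (hypothetical values).
p = [0.5, 0.5]
q = [0.25, 0.75]
print(round(symmetric_kl(p, q), 4))
```

For 20-letter amino acid distributions such as the columns of Table 1B, the same functions apply unchanged as long as both lists are given in the same residue order.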

Discussion

The existence of forbidden penta-peptides is a basic biological phenomenon that does not follow from simple probability arguments. Furthermore, the encoding of many of these forbidden penta-peptides appears thousands of times in non-coding regions. Some penta-peptides are "universally" forbidden, while others are "specific" to groups of organisms. They may be shaped by the nature and the living environment of different organisms. Like DNA base composition, codon bias, and abundance of amino acids, the existence of forbidden penta-peptides is a consequence of a long evolutionary process. Therefore, it is unlikely that mutations in the DNA that introduce a forbidden penta-peptide at a neutral site of a given protein will result in a detectable phenotype. Site-directed mutagenesis in numerous proteins has also yielded mutants that are phenotypically silent under laboratory conditions but fail to compete in the natural environment (Nelson 1992). The universal forbidden penta-peptides are likely to be determined by structural constraints in protein folding. There are numerous examples where substitution of a single amino acid causes a large-scale disruption of the protein folding and/or stability (Nelson 1992). We propose that, in the case of specific groups of organisms, specialized physiological constraints govern the phenomenon of forbidden sequences. Forbidden penta-peptides provide possibly the simplest grammatical constraint on proteins and can be used for inferring higher level grammatical constraints. They may also serve as a general genetic engineering tool and in particular as an auxiliary device to deal with difficult-to-crystallize proteins.

This manuscript suggests a few immediate challenges: inferring more compact but still statistically significant grammatical constraints on proteins; comprehending the evolutionary differences in forbidden penta-peptides for different groups of organisms and their origins; and understanding the biochemical nature of the forbidden penta-peptides described here. We believe that the last challenge is not a trivial task for the following reasons. First, there are thousands of forbidden penta-peptides, so a biological verification of them all (by introducing them into the proteins of a model organism) is not trivial. Second, the current software that simulates the folding of proteins does not seem to help, since these programs are designed for finding common protein structures (e.g., α-helices, β-sheets, etc.), sometimes based on the structures found by a homologous sequence search. Even if we introduce forbidden penta-peptides into a protein's sequence, the output of these packages does not change much. Finally, since the forbidden penta-peptides are relatively short, we do not expect that the damaging effect of introducing such penta-peptides into a protein will be discovered easily. However, nature is very sensitive, and these penta-peptides may be filtered out after a few generations by purifying selection.

Materials and Methods

Data sets

Most of the genomes and proteomes we analyzed are either from NCBI6 or the UCSC7 genome browser; some are from the TIGER8 and the FUGU9 genome projects. Sections 1.6 and 1.7 in the Supplemental Materials and Methods include details about the source, proteome length, and genome length of each organism.

The two P-values

Let $\widehat{\Pr}_{\mathrm{missing}(a),h}$ denote the probability that an amino acid sequence of length 5, $a$, is missing in a "proteome" of size $h$. We estimate $\widehat{\Pr}_{\mathrm{missing}(a),h}$ by repeatedly generating at random, according to the background probabilities, "proteomes" of size $h = l_p/1000$. Specifically, we repeated this experiment 10,000 times.

6 NCBI Genome database, ftp://ftp.ncbi.nih.gov/genomes.
7 UCSC Genome database, ftp://hgdownload.cse.ucsc.edu/goldenpath.
8 The TIGER genome annotation database, ftp://ftp.tiger.org/pub/data/.
9 The FUGU genome project, http://www.fugu-sg.org/.


We use the frequency with which these shorter proteomes had $a$ missing, raised to the power $l_p/h$, that is, $\left(\widehat{\Pr}_{\mathrm{missing}(a),h}\right)^{l_p/h}$, as an estimate of the background P-value (i.e., the probability that $a$ is missing in a proteome of size $l_p$). Sections 1.1.1 and 1.1.2 in the Supplemental Materials and Methods include more details and explanations about the background P-value.

Suppose $a$ is coded for $n_a$ times in all six reading frames (and considering all different coding sequences) of a genome of length $l_g$. Suppose the total length of all coded proteins, measured in amino acids, is $l_p$. This corresponds to a length of $3 l_p$ on the genome. The overall fraction of coding regions is, thus, $q = \frac{3 l_p}{2 l_g}$, where the 2 in the denominator accounts for the two strands in the genome. Under our assumption, the probability that a given occurrence does not fall within the coding regions equals $1 - q$. The probability that none of the $n_a$ occurrences falls within the coding regions is denoted by $\Pr_{\text{non-coding}}(a)$ and equals $\Pr_{\text{non-coding}}(a) = (1 - q)^{n_a}$. We call this the "non-coding" P-value of $a$. This probability decreases when $n_a$ is larger and when $q$ is larger. Section 1.1.3 in the Supplemental Materials and Methods includes more details and explanations about the non-coding P-value.
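The two quantities can be sketched numerically as follows. This is an illustration only: the function names are hypothetical, base-10 logarithms are an assumption (the text does not state the base), and the Monte Carlo routine is written for clarity, so realistic sizes such as h = l_p/1000 with 10,000 repetitions would be slow in pure Python.

```python
import math
import random


def estimate_missing_prob(peptide, freqs, h, trials, seed=0):
    """Fraction of random i.i.d. 'proteomes' of size h that lack `peptide`."""
    rng = random.Random(seed)
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    missing = 0
    for _ in range(trials):
        proteome = "".join(rng.choices(letters, weights=weights, k=h))
        if peptide not in proteome:
            missing += 1
    return missing / trials


def log10_background_p(pr_missing_h, l_p, h):
    """log10 of (Pr_missing(a),h)^(l_p/h); requires pr_missing_h > 0."""
    return (l_p / h) * math.log10(pr_missing_h)


def coding_fraction(l_p, l_g):
    """q = 3*l_p / (2*l_g), the coding fraction over both strands."""
    return 3.0 * l_p / (2.0 * l_g)


def log10_noncoding_p(n_a, l_p, l_g):
    """log10 of (1 - q)^n_a, the non-coding P-value."""
    return n_a * math.log10(1.0 - coding_fraction(l_p, l_g))


# Totals quoted in the text: l_p = 6.39e8 amino acids, l_g = 2.05e10 bp,
# and n_a = 3639 non-coding occurrences for WCFNL.
print(round(log10_noncoding_p(3639, 6.39e8, 2.05e10), 1))
```

With the totals quoted in the text this evaluates to about −75.7, in the same range as the magnitude of 75 listed for WCFNL in Table 1A; any remaining difference may come from rounding of the inputs or from the multiple-testing correction, which this sketch does not apply.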

Statistical analysis and distance between probability distributions

We used standard FDR correction for multiple testing (Benjamini and Yekutieli 2001) for correcting the two P-values. Kullback-Leibler distance (Cover and Thomas 1991) and the chi-square test were used to compare the background distribution and the distribution of amino acids in the forbidden sequences of different groups of organisms. Sections 1.2, 1.3, and 1.4 in the Supplemental Materials and Methods include more details about the FDR correction, Kullback-Leibler distance, and the chi-square test, respectively.
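As an illustration of the multiple-testing step, the sketch below computes adjusted P-values with the step-up procedure of Benjamini and Yekutieli (2001) and then applies the 0.05 threshold to both P-values of a peptide. It assumes that this is the variant the authors used; the exact settings are given only in the Supplemental material, and the function names are hypothetical.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted P-values in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_forbidden(adj_noncoding_p, adj_background_p, alpha=0.05):
    """A missing peptide is forbidden if both adjusted P-values are below alpha."""
    return adj_noncoding_p < alpha and adj_background_p < alpha


# Toy P-values (hypothetical).
print(benjamini_yekutieli([0.01, 0.02, 0.03]))
```

In practice each of the two P-value families (non-coding and background) would be adjusted separately over all missing penta-peptides before the threshold is applied.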

Acknowledgments

T.T. was supported by the Edmond J. Safra Bioinformatics program at Tel Aviv University. We thank Dr. Martin Kupiec for critical reading of the manuscript and Dr. Daniel Yekutieli for helpful discussions.

References

Abe, N. and Mamitsuka, H. 1997. Predicting protein secondary structure using stochastic tree grammars. J Mach Learn 29: 275–301.
Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., and Walter, P. 2002. Molecular biology of the cell, 4th ed. Garland, New York.
Benjamini, Y. and Yekutieli, D. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29: 1165–1188.
Blaber, M., Zhang, X.J., and Matthews, B.W. 1993. Structural basis of amino acid alpha helix propensity. Science 260: 1637–1640.
Bystroff, C., Thorsson, V., and Baker, D. 2000. HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol. 301: 173–190.
Chothia, C., Levitt, M., and Richardson, D. 1981. Helix to helix packing in proteins. J. Mol. Biol. 145: 215–250.
Chou, P.Y. and Fasman, G.D. 1974. Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry 13: 211–222.
Cover, T. and Thomas, J. 1991. Elements of information theory. John Wiley, New York.
DeLong, E. 1998. Enhanced: Archaeal means and extremes. Science 280: 542–543.
Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Echols, N., Harrison, P., Balasubramanian, S., Luscombe, N.M., Bertone, P., Zhang, Z., and Gerstein, M. 2002. Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes. Nucleic Acids Res. 30: 2515–2523.
Eddy, S.R. 2004. What is a hidden Markov model? Nat. Biotechnol. 22: 1315–1316.
Gavel, Y. and Heijne, G. 1992. The distribution of charged amino acids in mitochondrial inner-membrane proteins suggests different modes of membrane integration for nuclearly and mitochondrially encoded proteins. Eur. J. Biochem. 205: 1207–1215.
Hampikian, G. and Andersen, T. 2007. Absent sequences: Nullomers and primes. Pac. Symp. Biocomput. 12: 355–366.
Kotelchuck, D., Dygert, M., and Scheraga, H.A. 1969. The influence of short-range interactions on protein conformation, III. Dipeptide distributions in proteins of known sequence and structure. Proc. Natl. Acad. Sci. 63: 615–622.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. 1994. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235: 1501–1531.
Margulis, L., Chapman, M., Guerrero, R., and Hall, J. 2006. The last eukaryotic common ancestor (LECA): Acquisition of cytoskeletal motility from aerotolerant spirochetes in the Proterozoic Eon. Proc. Natl. Acad. Sci. 103: 13080–13085.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J., Serrano, L., and Russell, R.B. 2005. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 3: e405. doi: 10.1371/journal.pbio.0030405.
Nelson, N. 1992. Evolution of organellar proton-ATPases. Biochim. Biophys. Acta 1100: 109–124.
Otaki, J.M., Ienaka, S., Gotoh, T., and Yamamoto, H. 2005. Availability of short amino acid sequences in proteins. Protein Sci. 14: 617–625.
Pouysségur, J. 2000. Perspectives signal transduction: An arresting start for MAPK. Science 290: 1515–1518.
Prilusky, J., Felder, C.E., Zeev-Ben-Mordehai, T., Rydberg, E.H., Man, O., Beckmann, J.S., Silman, I., and Sussman, J.L. 2005. FoldIndex: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21: 3435–3438.
Prochnicka-Chalufour, A., Casanova, J., Avrameas, S., Claverie, J., and Kourilsky, P. 1991. Biased amino acid distributions in regions of the T cell receptors and MHC molecules potentially involved in their association. Int. Immunol. 3: 853–864.
Qi, J., Wang, B., and Hao, B. 2004. Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach. J. Mol. Evol. 58: 1–11.
Ramachandran, G.N. and Sasisekharan, V. 1968. Conformation of polypeptides and proteins. Adv. Protein Chem. 23: 283–438.
Szuromi, P.D. 2005. Peptides seeing polymers. Science 310: 19.
Ulitsky, I., Burstein, D., Tuller, T., and Chor, B. 2006. The average common substring approach to phylogenomics. J. Comput. Biol. 13: 336–350.
White, S.H. and Heijne, G. 2004. The machinery of membrane protein assembly. Curr. Opin. Struct. Biol. 14: 397–404.
White, S.H. and Heijne, G. 2005. Transmembrane helices before, during, and after insertion. Curr. Opin. Struct. Biol. 15: 378–386.
Zhang, Y. and Xiong, Y. 2001. A p53 amino-terminal nuclear export signal inhibited by DNA damage-induced phosphorylation. Science 292: 1910–1915.
