Bayesian semiparametric regression models to characterize molecular evolution

Share Embed


Descripción

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

RESEARCH ARTICLE

Open Access

Bayesian semiparametric regression models to characterize molecular evolution Saheli Datta1* , Abel Rodriguez2 and Raquel Prado2

Abstract Background: Statistical models and methods that associate changes in the physicochemical properties of amino acids with natural selection at the molecular level typically do not take into account the correlations between such properties. We propose a Bayesian hierarchical regression model with a generalization of the Dirichlet process prior on the distribution of the regression coefficients that describes the relationship between the changes in amino acid distances and natural selection in protein-coding DNA sequence alignments. Results: The Bayesian semiparametric approach is illustrated with simulated data and the abalone lysin sperm data. Our method identifies groups of properties which, for this particular dataset, have a similar effect on evolution. The model also provides nonparametric site-specific estimates for the strength of conservation of these properties. Conclusions: The model described here is distinguished by its ability to handle a large number of amino acid properties simultaneously, while taking into account that such data can be correlated. The multi-level clustering ability of the model allows for appealing interpretations of the results in terms of properties that are roughly equivalent from the standpoint of molecular evolution. Background The structural and functional role of a codon in a gene determines its ability to freely change. For example, nonsynonymous (amino acid altering) substitutions may not be tolerated at certain codon sites due to strong negative selection, while at other sites some nonsynonymous substitutions may be allowed if they do not affect key physicochemical properties associated with protein function [1]. Thus, at such preferentially changing sites, more frequent substitutions occur between physicochemically similar amino acids (or codons which lead to those amino acids) than dissimilar ones [2-4]. Methods which use changes in physicochemical amino acid properties have thus been proposed in the study of evolution. For example, [5-7] use distances to calculate deviations from neutrality for a particular amino acid property. Alternative approaches model the evolution of protein coding sequences as continuous-time Markov chains with rate matrices that distinguish between property-altering and property-conserving mutations as in [8] and [9]. More *Correspondence: [email protected] 1 Fred Hutchinson Cancer Research Center, Seattle, WA, USA Full list of author information is available at the end of the article

recently, [10] proposed a Bayesian hierarchical regression model that compares the observed amino acid distances to the expected distances under neutrality for a given set of amino acid properties and incorporates mixture priors for variable selection. The hierarchical mixture priors enable the model in [10] to identify neutral, conserved and radically changing sites, while automatically adjusting for multiple comparisons and borrowing information across properties and sites. A common feature of all the methods listed above is the implicit assumption that properties are independent from each other in terms of their effect on evolution. A review of the amino acid index database (available for example at http://www.genome.jp/dbget/aaindex.html), which lists more than 500 amino acid properties, shows that a large number of them are highly correlated. Although the correlations we observe in the data can be different from those computed from the raw amino acid scores due to the influence of factors such as codon bias, by ignoring these correlations we are also ignoring the fact that correlated properties may affect a particular site in similar ways. Hence, approaches that do not take into account the correlations in the rates of mutations on different

© 2012 Datta et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

codons do not make use of key information about the relative importance of different physicochemical properties on molecular evolution. A natural way to account for correlations in the data is by considering a factor structure, see for example [11]. However, selecting the number and order of the factors can be a difficult task in this type of factor models. In addition, the particular structure of the model in [11] makes it difficult to incorporate the effect of the factors on regions that are very strongly conserved. This paper extends the Bayesian hierarchical regression model in [10] by placing a nonparametric prior on the distribution of the regression coefficients describing the effect of properties on molecular evolution. The prior is an extension of the well known Dirichlet process prior [12,13] to model separately exchangeable arrays [14,15]. As in [10], the main goal of the model described in this paper is to identify sites that are either strongly conserved or radically changing. In order to account for correlations across properties, our model clusters properties with similar effects on evolution, and within each such group, clusters sites with similar regression coefficients and nonparametrically estimates their distribution. In addition to accounting for correlations across properties, this structure allows us to dramatically reduce the number of parameters in the model and generate interpretable insights about molecular evolution at the codon level. Although the clusters of properties can in principle be considered nuisance parameters that are of no direct interest, in practice posterior inference on the clustering structure can provide interesting insights about the molecular evolution process of a given gene. Indeed, as will become clear in the following sections, our approach incorporates the effect of amino acid usage bias. Hence, any significant differences between the cluster structure estimated from the observed protein-coding sequence alignment and the correlation structure derived from the raw distances between the properties in such cluster can be interpreted a signal of extreme amino acid usage bias in that particular region of the genome. The rest of the paper is organized as follows. A brief review of DP mixture models along with the details of our model is provided in the Methods section. This section also includes a review of some of the currently available methods for characterizing molecular evolution that take into account changes amino acid properties. The model is then evaluated via simulation studies and illustrated through a real data example. The simulated and real data analyses, as well as comparisons between the proposed semiparametric regression approach and other methods, are presented in Results and discussion. Finally, the Conclusions section provides our concluding remarks.

Page 2 of 17

Methods Dirichlet process mixture models

The Dirichlet process (DP) was formally introduced by [12] as a prior probability model for random distributions G. A DP(ρ, G0 ) prior for G is characterized by two parameters, a positive scalar parameter ρ, and a parametric base distribution (or centering distribution) G0 . ρ can be interpreted as the precision parameter, with larger values of ρ resulting in realizations of G that are closer to the base distribution G0 . One of the most commonly used definitions of the DP is its constructive definition [13], which characterizes DP realizations as countable mixtures of point masses. Specifically, a random distribution G generated from DP(ρ, G0 ) is almost surely of the form G(·) =

∞ 

wl δφl (·),

l=1

where δφl (·) denotes a point mass at φl . The locations φl are i.i.d. draws from G0 , while the corresponding weights wl are generated using the following “stick-breaking” l−1 mechanism. Let w1 = v1 and define wl = vl r=1 (1 − vr ) for l = 2, 3, . . ., where {vl : l = 1, 2, . . .} are i.i.d. draws from a Beta(1, ρ) distribution. Defining the weights in this way ensures ∞ l=1 wl = 1. Furthermore, the sequences {vl : l = 1, 2, . . .} and {φl : l = 1, 2, . . .} are independent. The DP is most often used to model the distribution of random effects in hierarchical models. In the simplest case where no covariates are present, these models reduce to nonparametric mixture models (e.g., [16-18]). Assume that we have an independent sample of observaind

tions y1 , y2 , . . . , yn such that yi |θi ∼ k(·; θi ), where k(·; θi ) is a parametric density. Then, the DP mixture model places a DP prior on θi as θi |G G|ρ

i.i.d.



G,

i = 1, . . . , n

∼ DP(ρ, G0 )

The almost sure discreteness of realizations of G from the DP prior allows ties in θi , making DP mixture models appealing in applications where clustering is expected. The clustering nature is easier to see from the P´olya urn characterization of the DP [19] which gives the induced joint distribution for the θi s, by marginalizing G over its DP prior. Under that representation, we can write θi = θξ∗i where θ1∗ , θ2∗ , . . . is an independent and identically distributed sample from G0 and the indicators ξ1 , . . . , ξn are discrete indicators sequentially generated with ξ1 = 1 and ⎧ i r ⎪ ⎪ ⎨ k k ≤ maxj≤i {ξi } Pr(ξi+1 = k|ρ, ξi , . . . , ξ1 ) = i + ρ ⎪ ⎪ ρ ⎩ k = maxj≤i {ξi } + 1, i+ρ

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

where rki =

i

j=1 I(ξj

I(ξj = k) =



= k) and

1 ξj = k 0

Page 3 of 17

of codon k at a particular site i in the DNA sequence under study is denoted by Fki . Then, the expected mean distance for a particular site i and a given property j is given by

otherwise.

One advantage of DP mixture models over other approaches to clustering and classification is that they allow us to automatically estimate the number of components in the mixture. Indeed, from the P´olya urn representation of the process it should be clear that, although the number of potential mixture components is infinite, the model implicitly places a prior on the number of components that, for moderate values of ρ, favors the data being generated by an effective number of components K ∗ = maxi≤n {ξi } < n. The model

Our data consist of observed and expected amino acid distances derived from a DNA sequence alignment, a specific phylogeny, a stochastic model of sequence evolution, and a predetermined set of physicochemical amino acid properties. In the analyses presented here, we disregard uncertainty in the alignment/phylogeny/ancestral sequence level since our main focus is the development and implementation of models that allow us to make inferences on the latent effects that several amino acid properties may have on molecular evolution for a given phylogeny and an underlying model of sequence evolution. Extensions of these analyses that take into account these uncertainties are briefly described in Conclusions. For further discussion on this issue, see also [10]. In order to calculate the observed distances, we first infer the ancestral sequences under a specific substitution model and a given phylogeny. In our applications, we use PAML version 3.15 [20] and the codon substitution model of [21], which accounts for the possibility of multiple substitutions at a given site. Nonsynonymous substitutions are then counted by comparing DNA sequences between two neighboring nodes in the phylogeny. The observed mean distance, denoted as yi,j for site i and property j, is obtained as the mean absolute difference in the property scores due to all nonsynonymous substitutions at site i. Only those sites with at least one nonsynonymous change from the ancestral level are retained for further analysis. To compute the expected distances, note that each codon can mutate to one of at most nine alternative codons through a single nucleotide substitution [5], only some of which are nonsynonymous (changes to stop codons are ignored). Let Nk be the number of nonsynonymous mutations possible through a single nucleotide change, corresponding to a particular codon k (k = i,j 1, . . . , 61). Let Dk,l be the absolute difference in property j between nonsynonymous codon pairs at site i differing at one codon position, where l = 1, . . . , Nk . The frequency

61 xi,j ≡

i,j DE

i,j i Nk k=1 Fk l=1 Dk,l . 61 i k=1 Fk Nk

=

We consider a hierarchical regression model that relates xi,j to yi,j and allows us to compare the expected and observed distances at the codon level for several properties simultaneously with the following rationale. If a given site i is neutral with respect to property j, then yi,j ≈ xi,j . If property j is conserved at site i, then yi,j > xi,j . To construct our model, we first standardize the distances xi,j and yi,j by dividing them by the maximum possible distance for each property. This enables us to use priors with the same scale for all the regression coefficients. Our regression model for the standardized distances y∗i,j and x∗i,j , for sites i = 1, . . . , I and properties j = 1, . . . , J, can be written as y∗i,j |βi,j , σi,j2



N(βi,j x∗i,j , σi,j2 )

if βi,j = 0

N(βi,j x∗i,j , σi,j2 /nO i )

if βi,j  = 0,

(1)

where nO i is the observed number of nonsynonymous changes at a particular site i and βi,j and σi,j2 are the regression coefficient and variance parameter associated with site i and property j. The mixture model accounts for the fact that some of the y∗ij s can be equal to zero as some nonsynonymous changes do not change the value of the property being measured (e.g., Aspargine, Aspartic acid, Glutamine, Glutamic acid all have the same hydropathy score). To complete the model, we need to describe a model for the matrix of regression coefficients [ βi,j ]. There are a number of possible models for this type of data which utilize Bayesian nonparametric methods; some recent examples include the infinite relational model (IRM) [22,23], the matrix stick breaking process (MSBP) [24], and the nested infinite relational model (NIRM) [14,15]. In this paper we focus on the NIRM, which is constructed by partitioning the original matrix into groups corresponding to entries with similar behavior. This is done by generating partitions in one of the dimensions of the matrix (say, rows) that are nested within clusters of the other dimension (columns). This structure allows us to identify groups of (typically correlated) properties with similar pattern and then, within each such group, identify clusters of sites with similar values of βi,j (Figure 1 provides a graphical representation of this idea). In our setting, we take [ θ ij ] =[ βi,j , σi,j2 ] and employ a NIRM to generate a prior for [ θ ij ].

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 4 of 17

Figure 1 Stylized representation of our model. Each sub table at the second level of clustering shares a common value for the regression coefficient βi,j . Rows correspond to properties, while columns correspond to sites.

More specifically, we denote by θ j = (θ1,j , . . . , θIj ) the vector of regression coefficients and the associated variances corresponding to property (column) j. To obtain clusters for the properties, we assume that θ j ∼ F, where F=

∞ 

πk δθ ∗k

(2)

k=1

 is a random distribution such that πk = vk s 1 and probability p1 . The lysin data was analyzed with the model in The Model subsection with the 32 amino acid properties listed in Table 4. A few of the properties were chosen because of their functional importance. Some of the other properties have been previously used in analyses by [25]. Only sites which showed at least one nonsynonymous change were retained for the final analysis, which led to a data set with 94 sites. We used K = 25 and L = 35 as truncation levels for this data. The prior distributions with the following hyperparameters were used in the analysis. The DP concentration parameters ρ and γk were assumed to follow a Ga(1,1). λ, the prior probability for φl,k being 0, was assumed to follow a Beta(2, 8) which implied that about 20% of the unique βi,j s were expected to be 0 a priori. aκ 2 when and bκ , the hyperparameters for the prior of ϑl,k φl,k is 0, were chosen as 2 and 100 which implied a prior mean of 0.01. When φl,k is different from zero, a∗σ = 2 2 . V , the scale facand b∗σ = 10 control the prior for ϑl,k 0 2 tor for ϑl,k , was fixed at the ratio of prior means of σ 2 and τi2 (the variance terms in the regression model in [10] for which we had used prior means of 0.1 and 0.01 respectively). Finally, the αk s were assumed to follow a N(1, 0.25) to conform to our prior assumption of neutrality for the properties. Results are based on 20000 iterations, of which the first 10000 were burn-in. Convergence was assessed by visual inspection of trace plots of some of the parameters and there did not seem to be any obvious problems with convergence. Figure 8 shows the marginal posterior probabilities of any two properties being assigned to the same cluster.

Table 2 Sites identified as significant by TreeSAAP for the different properties for the simulation study based on a biological model Property

Radically changing (1.645)

Radically changing (3.695)

Conserved (1.645)

Conserved (3.695)

h

5, 59, 65, 67, 71, 74, 81, 82, 89

74

36, 83

None

p

21, 24, 37, 64, 65, 67, 71, 74, 75, 81, 82, 89

None

7, 18, 36, 49, 55

None

Mv

10, 33, 66

None

5, 18, 36, 49

None

V0

10, 13, 33, 66

None

18, 36

None

pHi

39, 55, 72

None

11, 64, 72

None

Values in parentheses denote the cut-off values for the z-test statistic. Sites marked in bold are in the region of interest.

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 11 of 17

Table 3 Sites that have high posterior probabilities (> 0.95) of belonging to each site class for the different partitions for EvoRadical for the simulated data Property

ω ≤ 1, γ ≤ 1

ω ≤ 1, γ > 1

ω > 1, γ ≤ 1

ω > 1, γ > 1

p

None

None

None

1, 2, 5, 7, 10, 11, 12, 13, 14, 18, 19, 20, 26, 27, 30, 32, 33, 34, 36, 37, 42, 43, 47, 53, 57, 59, 61, 62, 63, 64, 66, 67, 68, 69, 72, 73, 74, 75, 77, 82, 83, 86, 87, 88, 90

Mv

None

None

None

2, 7, 9, 18, 19, 20, 22, 27, 31, 32, 36, 38, 53, 55, 61, 62, 64, 67, 72, 74, 86

Sites marked in bold are in the region of interest.

There seem to be four mostly distinct clusters in the properties in our list. The biggest cluster consists of 20 properties that are related to polarity and hydropathy. All 20 properties are assigned to this cluster with very high probability. The next cluster is comprised of the properties Bl , and c. There is also a fairly big cluster whose members are related to volume (Mv , V 0 , Mw , Cα , μ). pzim , which is correlated with p to some extent, is clustered with pHi with which it shows a large correlation value (about 0.9).

There is some uncertainty regarding the membership of K 0 and Esm , since both of them are assigned to the largest cluster about 50% of the time, while Esm is clustered with properties related to volume to a lesser extent. pK 1 is the only property that is almost never clustered with other properties. Site specific results based on the posterior means (denoted by βˆi,j s), for one representative property each from the four clusters in Figure 8 are shown in Figure 9.

Table 4 List of 32 amino acid properties used in the analysis AAindex accession number (if available)

Property

Symbol

AAindex accession number (if available)

Property

Symbol

KYTJ820101

Hydropathy



Helical contact area

Ca

GRAR740103

Molecular volume

Mv

ZIMJ680104

Isoelectric point

pHi

MANP780101

Surrounding hydrophobicity

Hp

OOBM770103

Long-range non-bonded energy

El

ZIMJ680103

Polarity(Zimmerman)

pzim



Mean r.m.s. fluctuation displacement

F

h

CHOP780201

Alpha-helical tendencies



FASG760101

Molecular weight

Mw

GRAR740102

Polarity(Grantham)

p



Normalized consensus hydrophobicity

Hnc

PONP800108

Average number of surrounding residues

Ns

COHE430101

Partial specific volume

V0



Power to be at the C-terminal

αc

WOEC730101

Polar requirement

Pr

GRAR740101

Composition

c



Power to be at the middle of alpha-helix

αm



Compressibility

K0



Power to be at the Nterminal

αn

FAUJ880113

Equilibrium constant (ionization of COOH)

pK

MCMT640101

Refractive index

μ

CHOP780202

Beta-structure tendencies



OOBM770102

Short and medium range non-bonded energy

Esm

ZIMJ680102

Bulkiness

Bl

PONP800107

Solvent accessible reduction ratio

Ra



Buriedness

Br



Thermodynamic hydrophobicity

Ht



Chromatographic index

RF

OOBM770101

Total non-bonded energy

Et

CHAM830101

Coil tendencies

Pc

CHOP780101

Turn tendencies

P

Properties marked by ∗ are from [36].

transfer

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 12 of 17

pK1 pHi pzim μ

1

Cα Mw V0 Mv c Bl Esm

0.75

K0 Ns αm Ht Hnc αn αc

0.5

El Et h Br Ra Hp

0.25

Pα P F Pc p Pr RF Pβ Pβ RF Pr p Pc F P Pα Hp Ra Br h Et El αc αn Hnc Ht αm Ns K0 Esm Bl c Mv V0 Mw Cα μ pzim pHi pK1

0

Figure 8 Marginal posterior probabilities of any two properties being in the same cluster for the lysin data.

The sites are sorted according to the increasing value of mean βˆi,j for each image. Sites on the far right radically change properties in each group. For example, most of the sites that appear on the far right, like sites 15, 16, 21, 75, 82, 99 and 126, for cluster 1 (represented by h) have βˆi,j values of 1.2-1.4. There seem to be more sites radically changing properties in cluster 1 than in clusters 2 (represented by c) or 3 (represented by Mv ). The first three clusters also have a fairly large number of sites with mean

βˆi,j between 0 and 1. This is different from what we see for cluster 4 (represented by pzim ), which corresponds to properties pzim and pHi . A large number of sites in cluster 4 strongly conserve the properties (e.g., sites 35, 43, 49, 51, 64, 114, 117, 121), as is evident by the very small mean βˆi,j s for sites in the far left, unlike in the other clusters. Figure 10 shows the posterior summaries of βi,j s different from zero for sites 82, 99, 120 and 127 for properties belonging to different clusters. Of these, sites 120 and 127

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 13 of 17

96 68 129 11 121 69 88 131 24 100 29 19 111 35 94 98 105 28 128 43 27 114 72 42 47 110 58 103 115 104 71 66 117 46 86 108 18 97 12 113 22 64 49 107 106 26 25 51 57 81 41 13 30 38 53 40 33 91 17 123 87 10 124 74 120 63 125 67 119 36 44 61 130 101 14 32 109 80 83 127 116 45 73 122 79 37 70 75 21 16 99 126 15 82

h

1.4

1.2

c

111 35 28 128 117 51 22 11 121 68 129 96 100 88 29 131 94 24 47 43 114 108 42 69 103 115 12 19 105 98 71 110 27 72 104 13 66 10 58 57 46 97 18 64 81 113 86 120 49 53 44 38 79 91 36 124 67 130 30 80 107 33 61 106 14 83 26 74 41 15 37 119 101 17 63 109 25 70 21 123 87 125 32 16 40 126 45 73 127 122 99 116 82 75

1.0

0.8

0.6

0.4

Mv

28 128 51 117 22 111 35 108 29 68 94 43 100 12 115 121 47 11 103 129 131 10 88 96 13 42 15 24 114 105 98 110 79 19 37 71 81 57 104 97 58 21 69 44 80 46 64 66 72 83 120 18 130 27 16 67 61 36 49 82 113 126 38 86 109 45 70 14 124 91 17 122 53 33 101 30 99 107 32 63 73 119 106 25 74 125 41 26 123 127 87 116 75 40

0.2

43 64 35 121 49 114 117 51 100 47 69 11 81 58 25 66 111 122 19 17 42 29 88 98 53 110 96 105 101 18 94 129 131 24 73 68 79 104 27 115 71 33 37 63 116 13 107 32 86 36 108 12 124 80 41 126 72 14 15 113 40 103 44 10 83 130 30 74 82 45 123 109 38 21 57 61 128 28 125 22 46 91 16 67 97 70 26 120 99 106 119 75 127 87

pzim

Figure 9 Posterior means βˆi,j s for the four clusters (denoted by representative properties) in Figure 8 for lysin. The sites are sorted according to the increasing value of posterior means.

were found to be under positive selection by PAML, while sites 82, 99 and 127 were identified as radically changing some of the properties by the regression model in [10]. The sites show different behavior for the different properties, for example, site 82 shows radical changes for h, while it conserves Mv . We can also see similarities in the posterior summaries across sites. For example, for property pK 1 sites 82, 120 and 127 have similar values for βi,j . One of the advantages of using the semiparametric approach is that we can identify groups of sites that either conserve or radically change a set of similar amino acid properties. For example, sites 122 and 127 both seem to be altering the amino acid properties in the first large cluster of properties related to p and h. However, sites 122 and 127 have a

very different behavior in cluster 4 related to pzim : site 122 strongly conserves properties in this cluster while site 127 radically changes them. Table 5 lists sites that are highly conserved with posterior mean βˆi,j s less than 0.4 for the different clusters. The largest number of highly conserved sites appears in cluster 4, which includes properties pzim and pHi , in agreement with Figure 9. Some of these sites like 35, 51, 111 and 117 also conserve properties in clusters 2 and 3. A number of them, such as sites 28, 35, 58, 66, 94, 104, 117, and 128 are also identified as sites under negative selection by methods that take into account the relative rate of nonsynonymous to synonymous rate ratio, such as those implemented in PAML [20]. In order to determine which

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 14 of 17

0.5

1.0

1.5

Site 99

0.0

0.0

0.5

1.0

1.5

Site 82

c

1 Mv pzim K0 pK

h

c

1 Mv pzim K0 pK

Properties

Properties

Site 120

Site 127

0.0

0.0

0.5

0.5

1.0

1.0

1.5

1.5

h

h

c

1 Mv pzim K0 pK

h

c

Properties

1 Mv pzim K0 pK Properties

Figure 10 Posterior summaries of βi,j s different from zero for sites 82, 99, 120 and 127 in lysin data. The first 4 properties on the x-axis belong to 4 different clusters and the next 2 do not belong to any specific cluster all the time. The vertical lines are 90% posterior intervals of the βi,j s that are different from 0, the medians (filled circles) and the 25th and 75th percentiles (stars) are highlighted.

sites are under positive and negative selection by PAML, we follow an approach similar to that used by [35] in the analysis of the lysin data. In particular, [35] found that PAML model M8, which supports positive selection, is the model that better fits the lysin data. Therefore, we classified sites as negatively selected if the estimated ω was smaller than 0.3 and if Pr(ω > 1|data) < 0.5 using PAML model M8. Results comparing sites conserving or radically changing a small group of properties with sites inferred to be under positive or negative selection by PAML was also presented in [10]. The results are fairly robust to the choice of different 2 hyperparameter values. Note that the scale factor for ϑl,k ultimately affects the variation in the βi,j values, and it is advisable to choose it so that the prior variance for the unique βi,j s is not too large. Table 5 Strongly conserved sites (βˆij < 0.4) for lysin data for different clusters Cluster

Site number

1

96

2 and 3

22, 28, 35, 51, 111, 117, 128

4

53, 58, 64, 66, 68, 69, 71, 73, 79, 81, 88, 94, 96, 98, 100,

11, 17, 18, 19, 24, 25, 27, 29, 33, 35, 42, 43, 47, 49, 51,

101, 104, 105, 110, 111, 114, 115, 117, 121, 122, 129, 131

Conclusions In this paper, we present a Bayesian hierarchical regression model with a nested infinite relational model on the regression coefficients. The model is capable of identifying sites which show radical or conserved amino acid changes. The (almost sure) discreteness of the DP realizations induces clustering at the level of properties which is analogous to the factor model in [11], with the advantage being that the nonparametric method automatically determines the appropriate number of clusters. The multi-level clustering ability of the NIRM also induces clustering at the level of sites and allows us to capture skewness and heterogeneity in the distribution of the random effects distribution associated with each cluster of properties. The main advantage of the models we have described is their ability to simultaneously handle multiple properties with potentially correlated effects on molecular evolution. Our simulations suggest that our models are flexible but robust, being capable of dealing with a range of situations including those where properties are perfectly correlated, as well as those where all properties are uncorrelated. Our semiparametric regression models also work well, particularly in comparison with the regression model in [10], TreeSAAP and EvoRadical, when applied to DNA sequence data generated from an evolutionary model. In addition, the analysis of the lysin data suggests that the model leads to reasonable results.

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

The NIRM that is the basis of our model defines a separately exchangeable prior on matrices. This means that the prior is invariant to the order in which properties and sites are included. This is due to the fact that the rows as well as the columns of the parameter of interest are independent draws from a DP. From the point of view of modeling multiple properties, this is a highly desirable property. However, assuming that DNA sites are exchangeable can be questionable. Although this is a potential limitation of our model, we should note that the assumption of independence across sites (which is a stronger assumption than exchangeability) underlies all the methods discussed in the Background section. If information about the 3dimensional structure of the encoded protein or other sequence specific information that can guide the construction of the dependence model is available, our model could be easily extended to account for this feature. In the absence of such information, exchangeability across DNA sites seems to be a reasonable prior assumption. Indeed, in contrast to the most common independence assumption, our exchangeability assumption allows us to explain correlations at the level of sites. In our applications, we have used codon substitution models for reconstructing ancestral sequences as we wished to compare our methods to other methods for detecting selective sites that also use codon substitution models, such as those implemented in PAML and EvoRadical. However, it is possible to perform the proposed Bayesian semiparametric analyses using amino acid substitution models instead of codon substitution models. Note that the substitution model is only used in the calculation of the observed distances. First, we infer the ancestral sequences under a specific substitution model and a given phylogeny. We then compute the observed distances for a given property and a given site as the mean absolute difference in property scores due to all nonsynonymous substitutions at that site, where the nonsynonymous substitutions are counted by comparing the DNA sequences between two neighboring nodes in the phylogeny. The reconstructed ancestral sequences, and therefore the observed distances in our model, may differ under different substitution models, but the method can be implemented under any substitution model, including amino acid substitution models. The gain in execution time from using amino acid substitution models instead of codon-based ones could potentially be significant if the uncertainty in the alignment/phylogeny/ancestral level is taken into account. Finally, it is important to note that the “observed” distances are not really directly observed, but are instead constructed from ancestral sequences and, therefore, subject to error. A simple way to account for this additional level of uncertainty is to modify the computation of expected distances by incorporating the ideas of [37].

Page 15 of 17

This approach was previously employed in [10], with little impact on the final results.

Appendix: details about the Gibbs sampler The truncations and the introduction of the configuration variables imply that (2) and (3) can be written as ζj |{πk } ∼

K 

πk δθk∗

ξi,k |{wl,k } ∼

k=1

L 

wl,k δϕl,k

(5)

l=1

with ϕl,k ∼ G0lk and πk and wl,k being the appropriate stick breaking weights. Writing the model as in (5) helps in obtaining the forms of the full conditionals as below. The column indicators ζj for j = 1, . . . , J are sampled from a multinomial distribution with probabilities P(ζj = k| · · · ) = qjk ∝

L 

2 πk N(y∗i,j |φl,k x∗i,j , ϑl,k ),

l=1 {i:ξi,k =l} O 2 is ϑ 2 if φ 2 where ϑl,k l,k = 0 or is ϑl,k /ni if φl,k is difl,k parts: first, by ferent from zero. πk is sampled in two generating vk from a Beta(1 + mk , ρ + K s=k+1 ms ) for k = 1, . . . , K − 1 and vK = 1, where mk is the number of columns assigned to cluster k and then, by constructing  πk = vk k−1 s=1 (1 − vs ). For i = 1, . . . , I and k = 1, . . . , K, the indicators ξi,k are also sampled from a multinomial with probabilities of the form

2 wl,k N(y∗i,j |φl,k x∗i,j , ϑl,k ). P(ξi,k = l| · · · ) = pli,k ∝ {j:ζj =k}

The updated weights wl,k are sampled in a manner similar to the πk , i.e., ul,k are generated from a Beta(1 + nl,k , γk +  L r=l+1 nlr ) for l = 1, . . . , L − 1 and uLk = 1, where nl,k is the number of βi,j s assigned to atom l of cluster k and l−1 (1 − ur,k ). then, by constructing wl,k = ul,k r=1 Following [18], the DP concentration parameters ρ and γk are sampled in two steps by introducing auxiliary variables η1 and η2 . First, sample η1 from p(η1 |ρ, · · · ) = Beta(ρ + 1, J) and then ρ from p(ρ|η1 , · · · ) =

aρ + n∗ζ − 1 aρ + n∗ζ − 1 + J(bρ − log(η1 )) × Ga(aρ + n∗ζ , bρ − log(η1 )) +

J(bρ − log(η1 )) aρ + n∗ζ − 1 + J(bρ − log(η1 ))

× Ga(aρ + n∗ζ − 1, bρ − log(η1 )),

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

Page 16 of 17

where n∗ζ is the number of unique column indicators ζj . Similarly, for each k = 1, . . . , K, p(η2 |γk , · · · ) = Beta(γk + 1, I), p(γk |η2 , · · · ) =

l,k

aγ + m∗ξ ,k − 1

× Ga(aγ + m∗ξ ,k , bγ − log(η2 )) I(bγ − log(η2 )) aγ + m∗ξ ,k − 1 + I(bγ − log(η2 ))

× Ga(aγ + m∗ξ ,k − 1, bγ − log(η2 )),

p(αk | · · · ) ∼ N(m∗α , Cα∗ ) where

is the number of unique row indicators ξi,k , for a specific cluster of columns k. 2 )s given in (4), we To sample the unique ϕl,k = (φl,k , ϑl,k introduce a set of indicator variables ψl,k which take the value 1 when φl,k is different from zero. For l = 1, . . . , L 2 and φ are jointly sampled in and k = 1, . . . , K, ψl,k , ϑl,k l,k the following way - ψl,k is sampled by integrating out φl,k 2 from its full conditional, ϑ 2 is sampled condiand ϑl,k l,k tional on ψl,k and φl,k is sampled conditional on both the 2 , i.e., corresponding ψl,k and ϑl,k 2 2 , φl,k | · · · ) = p(ψl,k | · · · )p(ϑl,k |ψl,k , · · · ) p(ψl,k , ϑl,k 2 ,···) × p(φl,k |ψl,k , ϑl,k

with the individual expressions obtained as follows. i,j First, let l,k = {(i, j) : ξiζj = l, ζj = k}. Then, p(ψl,k | · · · ) ∝ λ





⎢ ∗ 2 ⎥ 2 2 ⎣ N(yi,j |0, ϑl,k )⎦ IG(ϑl,k |aκ , bκ )d(ϑl,k ) i,j

+ (1 − λ)

l,k





∗ ∗ 2 O ⎥ ⎣ N(yi,j |φl,k xi,j , ϑl,k /ni )⎦ i,j

l,k 2 2 2 × N(φl,k |αk , ϑl,k /V0 )IG(ϑl,k |a∗σ , b∗σ )d(φl,k )d(ϑl,k ).

if ψl,k = 0 if ψl,k = 1,



where I ∗ J ∗ = given by σ1,scale

i,j 1{ξiζj =l,ζj =k} and the update terms are  y∗2  nOi y∗2 αk2 V0 i,j i,j = 2 and σ2,scale = 2 + 2 −

i,j l,k  O ∗ ∗ 2 (αk V0 + i,j ni yi,j xi,j )   l,k O ∗2 2(V0 + i,j ni xi,j )  l,k

i,j l,k

.



2 p(φl,k |ψl,k , ϑl,k ,···) =



where mφ =

0

if ψl,k = 0

N(mφ , Cφ )

if ψl,k = 1,

 O ∗ ∗  αk V0 + i,j ni yi,j xi,j   l,k O ∗2 V0 + i,j ni xi,j 

l,k

and Cφ =

1 Cα

and m∗α =

+

{l:ψl,k =1}

 Cα∗

mα Cα





+

V0 2 ϑl,k

 {l:ψl,k =1}

 V0 φl,k 2 ϑl,k

.

Software availability

The R code implementing the models in the paper is freely available at http://www.ams.ucsc.edu/∼raquel/software/.

Additional file Additional file 1: Additional simulations are provided in a separate supplemental file. Competing interests The authors declare that they have no competing interests. Authors’ contributions SD, AR and RP formulated the model. SD performed the analyses and drafted the manuscript. AR and RP revised the manuscript draft. All authors read and approve the final version of the manuscript.



⎧  ∗ ∗ −1   1 I J ⎪ ⎪ , + σ IG + a ⎪ κ 1,scale ⎪ ⎨ 2 bκ 2 p(ϑl,k |ψl,k , · · · ) =   −1  ⎪ ⎪ 1 I∗J∗ ⎪ ⎪ + a∗σ , ∗ + σ2,scale ⎩ IG 2 bσ

1

Cα∗ = 

where m∗ξ ,k



l,k

Finally, for k = 1, . . . , K, the full conditional of αk is given by

aρ + m∗ξ ,k − 1 + I(bγ − log(η2 ))

+

The full conditional of λ is given by   p(λ| · · · ) ∼ Beta(aλ + 1{ψl,k =0} , bλ + 1{ψl,k =1} ).

ϑ2  l,k O ∗2 . V0 + i,j ni xi,j  l,k

Acknowledgements RP and SD were supported by the NIH/NIGMS grant R01GM072003-02. AR was supported by the NIH/NIGMS grant R01GM090201-01. Author details 1 Fred Hutchinson Cancer Research Center, Seattle, WA, USA. 2 Department of Applied Mathematics and Statistics, University of California Santa Cruz, Santa Cruz, CA, USA. Received: 15 December 2011 Accepted: 11 October 2012 Published: 30 October 2012 References 1. Pakula AA, Sauer RT: Genetic analysis of protein stability and function. Annu Rev Genet 1989, 23:289–310. 2. Zuckerkandl E, Pauling L: Evolutionary divergence and convergence in proteins. In Evolving Genes and Proteins. New York: Academic Press; 1965:97–166. 3. Sneath PHA: Relations between chemical structure and biology. J Theor Biol 1966, 12:157–195. 4. Miyata T, Miyazawa S, Yasunaga T: Two types of amino acid substitutions in protein evolution. J Mol Evol 1979, 12(3):219–236. 5. Xia X, Li WH: What amino acid properties affect protein evolution? J Mol Evol 1998, 47:557–564. 6. McClellan DA, McCracken KG: Estimating the influence of selection on the variable amino acid sites of the cytochrome b protein functional domains. Mol Biol Evol 2001, 18:917–925.

Datta et al. BMC Bioinformatics 2012, 13:278 http://www.biomedcentral.com/1471-2105/13/278

7.

McClellan D, Palfreyman E, Smith M, Moss J, Christensen R, Sailsbery J: Physicochemical evolution and molecular adaptation of the cetacean and artiodactyl cytochrome b proteins. Mol Biol Evol 2005, 22:437–455.

8.

Sainudiin R, Wong WSW, Yogeeswaran K, Nasrallah JB, Yang Z, Nielsen R: Detecting site-specific physicochemical selective pressures: applications to the class I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. J Mol Evol 2005, 60:315–326. Wong WSW, Sainudiin R, Nielsen R: Identification of physicochemical selective pressure on protein encoding nucleotide sequences. BMC Bioinf 2006, 7:148–157. Datta S, Prado R, Rodriguez A, Escalante AA: Characterizing molecular evolution: a hierarchical approach to assess selective influence of amino acid properties. Bioinformatics 2010, 26:2818–2825. Datta S, Prado R, Rodriguez A: Bayesian factor models in characterizing molecular adaptation. 2012. Tech. rep., University of California, Santa Cruz. Ferguson T: A Bayesian analysis of some nonparametric problems. Ann Stat 1973, 1:209–230. Sethuraman J: A constructive definition of Dirichlet priors. Statistica Sinica 1994, 4:639–650. Shafto P, Kemp C, Mansinghka V, Gordon M, Tenenbaum JB: Learning cross-cutting systems of categories. In Proceedings of the 28th Annual Conference of the Cognitive Science Society. Erlbaum; 2006:2146–2151. Rodriguez A, Ghosh K: Nested partition models. Tech. rep., University of California, Santa Cruz. 2009. Lo AY: On a class of Bayesian nonparametric estimates: I. density estimates. Ann Stat 1984, 12:351–357. Escobar MD: Estimating normal means with a Dirichlet process prior. J Am Stat Assoc 1994, 89:268–277. Escobar MD, West M: Bayesian density estimation and inference using mixtures. J Am Stat Assoc 1995, 90:577–588. ´ Blackwell D, Macqueen JB: Ferguson distribution via Polya urn schemes. Ann Stat 1973, 1:353–355. Yang Z: Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol 1997, 42:294–307. Nielsen R, Yang Z: Likelihood models for detecting positively selected amino acid sites and applications to the HIV–1 envelope gene. Genetics 1998, 148:929–936. Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N: Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1: AAAI Press; 2006:381–388. Xu Z, Tresp V, Yu K, Kriegel HP: Infinite hidden relational models. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence: AUAI Press; 2006:544–551. Dunson DB, Xue Y, Carin L: The matrix stick-breaking process: flexible Bayes meta-analysis. J Am Stat Assoc 2008, 103:317–327. Woolley S, Johnson J, Smith MJ, Crandall KA, McClellan DA: TreeSAAP: Selection on Amino Acid Properties using phylogenetic trees. Bioinformatics 2003, 19:671–672. MacEachern SN: Estimating normal means with a conjugate style Dirichlet process prior. Commnunications Stat, Part B - Simul Comput 1994, 23:727–741. MacEachern SN, Muller P: Estimating mixture of Dirichlet process models. J Comput Graphical Stat 1998, 7:223–238. Ishwaran H, James LF: Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 2001, 96:161–173. Ishwaran H, Zarepour M: Dirichlet process sieves in finite normal mixtures. Statistica Sinica 2002, 12:941–963. Green PJ, Richardson S: Modelling heterogeneity with and without the Dirichlet process. Scand J Stat 2001, 28:355–375. Jain S, Neal RM: A split-merge Markov Chain Monte Carlo procedure for the Dirichlet process mixture model. J Comput Graphical Stat 2004, 13:158–182. Blei DM, Jordan MI: Variational inference for Dirichlet process mixtures. Bayesian Anal 2006, 1:121–144. Walker SG: Sampling the Dirichlet mixture model with slices. Commun Stat - Simul Comput 2007, 36:45.

9.

10.

11.

12. 13. 14.

15. 16. 17. 18. 19. 20. 21.

22.

23.

24. 25.

26.

27. 28. 29. 30. 31.

32. 33.

Page 17 of 17

34. Rodriguez A, Dunson DB, Gelfand AE: The nested Dirichlet process. J Am Stat Assoc 2008, 103:534–546. 35. Yang Z, Swanson W, Vacquier V: Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineage and sites. Mol Biol Evol 2000, 17:1446–1455. 36. Gromiha MM, Oobatake M, Sarai A: Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys Chem 1999, 82:51–67. 37. Minin VN, Suchard MA: Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol 2008, 56:391–412. doi:10.1186/1471-2105-13-278 Cite this article as: Datta et al.: Bayesian semiparametric regression models to characterize molecular evolution. BMC Bioinformatics 2012 13:278.

Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.