Bayesian predictive model comparison via parallel sampling


Computational Statistics & Data Analysis 48 (2005) 735 – 753 www.elsevier.com/locate/csda

P. Congdon*, Department of Geography, Queen Mary University of London, Mile End Rd, London E1 4NS, UK

Received 18 March 2004; accepted 18 March 2004

Abstract

Methods of model comparison and checking, and associated criteria, are proposed based on parallel sampling of two or more models subsequent to convergence. These complement Bayesian predictive criteria already proposed (e.g. error sum of squares and deviance based) but are on a scale that may be compared across applications. Penalised criteria for model comparison based on the AIC are also investigated, together with AIC model weights and evidence ratios. Parallel sampling enables posterior summaries to be obtained for continuous comparison measures (e.g. likelihood and evidence ratios). A forward selection procedure for regression is suggested as one possible extension, as well as procedures for model averaging and posterior predictive checking. Comparisons with the DIC are made, together with implications of parallel sampling for assessing the density of the DIC. Three worked examples illustrate the working of the procedures in practice.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Predictive model comparison; Model checking; Parallel sampling; Predictor selection; Penalties for complexity

1. Introduction

Formal Bayesian model choice is based on marginal likelihoods of different models and the resulting Bayes factors. However, the methodology of Bayes factors is complicated by computational complexities in obtaining marginal likelihoods (Han and Carlin, 2001; Ntzoufras et al., 2004) and by the sensitivity of the Bayes factor.

Tel.: +0207-882-7760; fax: +0208-981-6276. E-mail address: [email protected] (P. Congdon).

0167-9473/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2004.03.016


The Bayes factor B_21 when model 2 is more complex than model 1 tends to zero as sample size n increases, and if model 2 contains an additional parameter and the diffuseness of the prior on that parameter increases, then B_21 also tends to zero (Bartlett, 1957; Lindley, 1957). Possible solutions are Bayes factors using minimal training samples from y = (y_1, ..., y_n) to provide proper priors, especially the fractional Bayes factor (O'Hagan, 1995) and the intrinsic Bayes factor (Berger and Pericchi, 1996a, b).

A variety of Bayesian model choice and diagnostic procedures shift the focus onto observables (and their prediction) and away from parameters, alleviating the sensitivity to specification of priors. Such procedures are based on predictive sampling of replicate data Z | θ, where the parameter set θ is sampled from its posterior; see, for example, Geisser and Eddy (1979) and Laud and Ibrahim (1995). The DIC criterion (Spiegelhalter et al., 2002) may be justified in predictive terms, as the expected deviance E{D(Z, θ̃) | y} for Z at a parameter estimate θ̃ such as the posterior mean θ̄.

The present paper suggests parallel sampling of models, leading to posterior probabilities that predictive fit under a particular model j ∈ (1, ..., M) improves on that under competing models. Parallel sampling provides an M × (M − 1) array of probabilities that model j provides a better predictive fit than models k ≠ j, together with probabilities that a model provides the maximum likelihood (minimum deviance) among the candidate models. Spiegelhalter et al. (2002) suggest that the predictive loss functions of Gelfand and Ghosh (1998) and Laud and Ibrahim (1995) may not penalise complexity sufficiently. In terms of the parallel model sampling method, this suggests comparing penalised likelihoods or deviances, providing probabilities that a model provides the maximum penalised likelihood or minimum penalised deviance. Continuous measures of relative model performance may be obtained by converting penalised likelihoods or deviances to evidence ratios or model weights (Brooks, 2002; Burnham and Anderson, 2002), for which credible intervals may be obtained from parallel sampling.

Criteria for model comparison, averaging or choice can be combined with comparisons of model adequacy, namely predictive checks across models sampled in parallel. By contrast, Gelman et al. (1995, p. 169) propose predictive checks when models are estimated in isolation. The predictive parallel sampling approach is illustrated with simulated data (where the true model is known), with a linear regression (continuous outcome) where predictor selection is the main question, and with a Poisson regression where representing overdispersion is an issue.

2. Predictive model assessment and checking

Let θ_m be a vector of parameters of dimension d_m for model m (m = 1, ..., M) with prior P(θ_m). Inference for θ_m conditional on observations y = (y_1, ..., y_n)^T is based on the posterior density

    P(θ_m | y) ∝ P(y | θ_m) P(θ_m),

where the likelihood is written equivalently as L(θ_m | y) or P(y | θ_m). To assess the suitability of the model one may sample replicate data Z_m = (Z_1m, ..., Z_nm) from the same


model assumed to have produced y. These are drawn from the posterior predictive density

    P(Z_m | y) = ∫ P(Z_m | θ_m) P(θ_m | y) dθ_m.

Predictions Z_m may be used both in model selection and in predictive checks (Gelman et al., 1995). For model selection, Laud and Ibrahim (1995) and Meyer and Laud (2002) propose minimisation of the criterion C(Z_m, y) = (Z_m − y)'(Z_m − y), with posterior mean (with respect to the predictive distribution)

    C̄_m = E[C(Z_m, y) | y] = Σ_{i=1}^{n} { var(Z_im) + [E(Z_im) − y_i]² }.   (1)

This may be estimated by the average of Σ_i (Z_im^(t) − y_i)² over a large number of iterations t = 1, ..., T. Carlin and Louis (1996) and Sahu and Buck (2000) propose standardised deviance criteria D(Z_m, y) for discrete data; for example, for Poisson data the posterior mean D̄_m = E[D(Z_m, y) | y] is estimated by the average of

    D(Z_m^(t), y) = 2 Σ_{i=1}^{n} [ y_i log(y_i / Z_im^(t)) − (y_i − Z_im^(t)) ].

Spiegelhalter et al. (2002) propose the posterior mean of the (unstandardised) deviance D̄_m (where D_m = −2 log L_m) to measure model adequacy, penalising D̄_m by a count of effective parameters or complexity measure d_em; this leads to the DIC, a Bayesian analogue of the AIC of Akaike (1973), with d_e = D̄ − D(y | θ̄). Brooks (2002) points out that other penalised criteria may be used, such as the posterior mean BIC, E(BIC) = D̄ + d_e log(n).

Dempster (1997) and Aitkin (1991, 1997) consider model assessment based on posterior distributions of the likelihood and of the likelihood ratio, e.g. −2 log Λ_jk = −2 log[L(θ_j | y)/L(θ_k | y)] for models j and k. Let L̄_m = E{L(θ_m | y) | y} be the expectation of the likelihood with respect to P(θ_m | y), estimated by a long-run average of sampled likelihoods L_m^(t). Aitkin (1991) chooses a Monte Carlo summary estimator, the posterior Bayes factor poBF_jk = L̄_j / L̄_k, obtained by estimating each model separately. Dempster (1997) suggests that other criteria might be used, such as the mean of Λ_jk or the posterior probability that Λ_jk > h, where Λ_jk = L_j / L_k and h > 1 is a preset constant chosen to assess how far model j outperforms model k; larger values of h (e.g. h = 20) set a more stringent requirement on model j (see Dempster, 1997, pp. 248–249). The posterior densities of Λ_jk and −2 log Λ_jk, and posterior probabilities such as Pr(Λ_jk > h), can be obtained from parallel sampling.

Laud and Ibrahim (1995) consider minimising A(θ_m | y) = [L(θ_m | y)]^(−1/n), estimated via

    Ā_m = (L̄_m)^(−1/n),   (2)

with the best model being that with lowest Ā_m.
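The criteria C̄_m and Ā_m are simple functionals of MCMC output. As an illustration only (the paper's own computations use WINBUGS), the following Python sketch shows how they might be estimated from stored draws; the array names z_rep (replicate data Z_m^(t)) and loglik (sampled log-likelihoods) are hypothetical placeholders, not part of the original analysis.

```python
import numpy as np

def c_bar(z_rep, y):
    # Eq. (1): posterior predictive mean of C(Z_m, y) = (Z_m - y)'(Z_m - y),
    # estimated by averaging sum_i (Z_im^(t) - y_i)^2 over iterations t.
    # z_rep: (T, n) array of replicate draws; y: (n,) observed data.
    return np.mean(np.sum((z_rep - np.asarray(y)) ** 2, axis=1))

def a_bar(loglik, n):
    # Eq. (2): A_bar_m = (L_bar_m)^(-1/n), where L_bar_m is the posterior
    # mean likelihood; computed on the log scale for numerical stability.
    loglik = np.asarray(loglik)
    shift = loglik.max()
    log_lbar = shift + np.log(np.mean(np.exp(loglik - shift)))
    return np.exp(-log_lbar / n)
```

Applied to each model's stored draws, these give the single-model summaries that the parallel-sampling comparisons of Section 3 then combine.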


Laud and Ibrahim (1995) and Carlin and Louis (1996) show that the predictive criteria C̄_m and D̄_m incorporate a penalty for imprecise predictions resulting from less well identified parameters. Under a formal Bayesian approach the prior itself penalises overparameterisation, and so the Bayes factor will penalise complex models (Ntzoufras et al., 2004); the prior on the model, P(m), may also penalise complexity. However, the impact of a prior-based penalty on model choice may be affected by the sensitivity of the prior predictive density (Sahu, 2003). For model selection based on the posterior predictive density, there may be gains from formally penalising the likelihood or deviance by a function of the parameter total. Predictive criteria have components reflecting imprecision, e.g. Σ_{i=1}^{n} var(Z_im) in (1), but such imprecision may result from sources (e.g. poorly identified parameters) other than the parameter count.

Consider parameter count penalties α d_m on the deviance, where α = 2 leads to the AIC (Akaike, 1973), α = 3/2 to the local Bayes factor (Smith and Spiegelhalter, 1980), and α = log(n) to the BIC. For Normal linear regression one form of the AIC is log(SS_m / n) + 2 d_m / n, with SS_m the error sum of squares. With H(Z_m, y) = exp[ log{C(Z_m, y)/n} + 2 d_m / n ], one may propose the AIC-penalised predictive fit criterion for linear regression:

    C̄^p_m(Z_m, y) = n E{ H(Z_m, y) | y }.   (3)

For both discrete and linear regression, the following two criteria apply:

    D̄^p_m(Z_m, y) = E[D(Z_m, y) | y] + 2 d_m,   (4)

    log(Ā^p_m) = −(1/n)[ log(L̄_m) − d_m ].   (5)
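The penalised variants can be computed from the same stored draws. The following is a hedged sketch (again illustrative; the array names continue the placeholders above, and d_m denotes the parameter count of model m), covering (3) and (5) with a comment indicating how (4) would be handled for discrete data.

```python
import numpy as np

def penalised_criteria(z_rep, y, loglik, d_m):
    # AIC-type penalisation (alpha = 2) of the predictive and likelihood criteria.
    T, n = z_rep.shape
    # Eq. (3): C^p_m = n * E{ exp[ log(C(Z_m, y)/n) + 2 d_m/n ] | y }
    c_t = np.sum((z_rep - np.asarray(y)) ** 2, axis=1)       # C(Z_m^(t), y)
    c_pen = n * np.mean(np.exp(np.log(c_t / n) + 2.0 * d_m / n))
    # Eq. (5): log A^p_m = -(1/n) [ log(L_bar_m) - d_m ]
    shift = loglik.max()
    log_lbar = shift + np.log(np.mean(np.exp(loglik - shift)))
    a_pen = np.exp(-(log_lbar - d_m) / n)
    # Eq. (4), for discrete data, would add 2*d_m to the posterior mean of a
    # standardised deviance D(Z_m, y) computed analogously to c_t above.
    return c_pen, a_pen
```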

3. Predictive measures from parallel sampling

As developments in Bayesian estimation have emphasized, an alternative perspective on parameter uncertainty, model choice and prediction is provided by repeated sampling. Most of the predictive criteria suggested so far are based on single-model estimation, with the possible exception of Dempster (1997), whereas parallel sampling increases the scope for predictive model assessment. The present paper suggests criteria for models run in parallel once convergence of each has been obtained; sampling and specification of priors are independent between models. Let C(Z_j^(t), y), D(Z_j^(t), y) and A(θ_j^(t) | y) be fit criteria based on sampling new data from the parameters θ_j of model j at iterations t = 1, ..., T in a long sampling run. The probability that model j provides better predictions than model k may be estimated by the proportion of iterations where model j is preferred. For example, the Monte Carlo estimate of Pr[C(Z_j, y) < C(Z_k, y)] is

    P̂_C(j, k) = Σ_{t=1}^{T} 1( C(Z_j^(t), y) < C(Z_k^(t), y) ) / T,   (6)

where 1(·) is the indicator function. The corresponding estimates for the A and D criteria are denoted P̂_A(j, k) and P̂_D(j, k), while those based on penalised comparisons are denoted P̂_Cp(j, k), P̂_Ap(j, k) and P̂_Dp(j, k). Similarly,

    B_j = Σ_{t=1}^{T} 1( C(Z_j^(t), y) < C(Z_k^(t), y), all k ≠ j ) / T


estimates the probability Pr{ C(Z_j, y) < C(Z_k, y), all k ≠ j } that model j provides the best predictive fit among the candidate models. So an additional criterion based on parallel model sampling is to select model j if B_j = max(B_1, ..., B_M).

The probability Pr[A(θ_j | y) < A(θ_k | y)] is equivalent to the probability that model j has a higher likelihood L(θ_j | y) than model k, and hence to Pr(Λ_jk > 1), where Λ_jk is the likelihood ratio. For more stringency one may take Pr(Λ_jk > h), with h = 20 or 100, say (Dempster, 1997); this is equivalent to Pr[A(θ_j | y)/A(θ_k | y) < h^(−1/n)].

Continuous measures of relative model performance are provided by appropriate differences such as D(Z_j, y) − D(Z_k, y), or ratios such as D(Z_j, y)/D(Z_k, y). The ratio of penalised likelihoods L̄_j exp(−d_j) and L̄_k exp(−d_k), namely E_jk = (L̄_j / L̄_k) exp(d_k − d_j), is known as the Akaike evidence ratio, and is used to indicate relative model probabilities in the AIC sense (Burnham and Anderson, 2002; Akaike, 1981). This ratio is equivalently [A^p(θ_j | y)/A^p(θ_k | y)]^(−n). From parallel sampling the density of this ratio, or of its log,

    ω_jk = log(E_jk) = −n log[ A^p(θ_j | y)/A^p(θ_k | y) ],   (7)

may be obtained. If, for example, the 95% interval of the log evidence ratio ω_jk entirely exceeds zero, then the evidence ratio favours model j.

Ratios or differences in DIC measures can also be monitored under parallel sampling. When d_j and d_k are known (e.g. in regression models with fixed effect parameters only), the posterior mean of

    Δ_jk^(t) = (D_j^(t) + d_j) − (D_k^(t) + d_k)   (8)

approximates the difference in DICs, ΔDIC_jk = D̄_j − D̄_k + d_ej − d_ek. From parallel sampling one may obtain the density of Δ_jk, for example whether its 95% interval is entirely negative (model j outperforms model k), or the probability that model j has a lower DIC, namely Pr(Δ_jk < 0). In complex (e.g. random effects) models, d_ej and d_ek may be obtained in an initial run, and a subsequent run used to estimate the density of Δ_jk^(t) = (D_j^(t) + d_ej) − (D_k^(t) + d_ek).

An effective parameter estimate (or measure of complexity) d_e is derivable as half the variance of the deviance: C = D(θ | y) − D_min(θ_min | y) is χ²(d_e) with variance 2 d_e (Gelman et al., 2003, p. 182). So a difference C_jk in deviances (relative to their minima) will be distributed as a difference of chi-square variables, with variance a measure of the combined complexity of the models. It follows that

    Var[ΔDIC_jk] = 2(d_ej + d_ek).

Therefore a given difference in DICs (e.g. ΔDIC_jk = 4) is less likely to be significant when the models compared are complex. Since the log of the evidence ratio is on the log-likelihood rather than the deviance scale, the variance of ω_jk = log(E_jk) will be 0.5(d_ej + d_ek) (see Example 1).
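Given criterion values stored per iteration for each model run in parallel, the comparison probabilities (6), the best-model probabilities B_j and the DIC-type differences (8) reduce to simple array operations. The sketch below is illustrative only; the (T, M) array crit and the vector d_e are hypothetical names for stored fit criteria (smaller is better) and effective parameter counts, not quantities produced by the paper's WINBUGS code.

```python
import numpy as np

def parallel_comparisons(crit, d_e=None):
    # crit: (T, M) array, crit[t, m] = e.g. C(Z_m^(t), y) or the deviance D_m^(t).
    crit = np.asarray(crit, dtype=float)
    T, M = crit.shape
    # Eq. (6): P_hat(j, k) = proportion of iterations where model j beats model k
    p_hat = np.array([[np.mean(crit[:, j] < crit[:, k]) for k in range(M)]
                      for j in range(M)])
    # B_j: proportion of iterations where model j is best overall (ties -> argmin)
    best = crit.argmin(axis=1)
    b = np.array([np.mean(best == j) for j in range(M)])
    out = {"P_hat": p_hat, "B": b}
    if d_e is not None:
        # Eq. (8): per-iteration penalised-deviance differences, whose posterior
        # mean approximates the difference in DICs between each pair of models
        pen = crit + np.asarray(d_e)            # penalty broadcast over iterations
        out["Delta"] = pen[:, :, None] - pen[:, None, :]   # (T, M, M)
    return out
```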


Comparative model performance may also be summarised on a continuous scale by model weights (Brooks, 2002; Burnham and Anderson, 2002) that include information on the relativities between, say, L^p(θ_j^(t) | y) and L^p(θ_k^(t) | y), not just whether one exceeds the other. The estimand here is the posterior average of the weights, w̄_k. For criteria on a deviance scale the weights are obtained at each iteration by taking differences from the minimum-deviance model m*, ΔD_k^(t) = D_k^(t) − D_m*^(t), calculating W_k^(t) = exp(−0.5 ΔD_k^(t)) and then scaling the W_k^(t) to sum to one, giving

    w_k^(t) = W_k^(t) / Σ_k W_k^(t).   (9)

These can be obtained for penalised and unpenalised criteria, and with AIC penalties they are called Akaike weights. Under parallel sampling a density for the weights is obtained as well as their average, so allowing for model and parameter uncertainty.

A count measure of performance can also be obtained. Suppose K_jt = k if model j provides a better fit than k other models at iteration t. The average K̄_j of the K_jt will equal or be close to (M − 1)/2 if no model is preferred to any other. Hence another selection criterion is to choose model j if K̄_j = max(K̄_1, ..., K̄_M). Examples below with real data show that the rankings of models produced by the criteria {C̄(Z_j, y), Ā(θ_j | y), w̄_j, K̄_j, B_j} are highly correlated. The best model would be that which simultaneously provides minimum C̄(Z_j, y) and Ā(θ_j | y), maximum K̄_j, w̄_j and B_j, as well as P(j, k) all above 0.5.

4. Predictive measures from parallel sampling: some extensions

This section considers applications of the parallel sampling approach in stepwise selection, model averaging and predictive checks. With regard to regression selection, comparison probabilities, model weights, and additional criteria such as K̄_j and B_j provide stopping rules in forward regression selection (cf. Meyer and Laud, 2002). An algorithm for Normal linear regression with p potential predictors, based on parallel sampling of models, may be stated as follows:

1. For j = 1, ..., M (= p + 1), run the parallel models
   y_i = α_j + β_j x_ij + e_ij,  e_ij ~ N(0, σ_j²),  j = 1, ..., p,
   y_i = α_{p+1} + e_{i,p+1},  e_{i,p+1} ~ N(0, σ²_{p+1}),
   and obtain C̄(Z_j, y), Ā(θ_j | y), K̄_j, w̄_j, B_j (j = 1, ..., p) and P̂_C(p + 1, j), P̂_A(p + 1, j).
2. If P̂_C(p + 1, j) > 0.5 and P̂_A(p + 1, j) > 0.5 for all j < p + 1, stop.
3. Otherwise select the model j < p + 1 (and predictor x_j = x_j1) with maximum K̄_j, w̄_j and B_j and minimum C̄(Z_j, y) and Ā(θ_j | y).
4. Compare j = 1, ..., M (= p) parallel models
   y_i = α_j + γ_j x_{i,j1} + β_j x_ik + e_ij,  e_ij ~ N(0, σ_j²),  j = 1, ..., p − 1,  k ∈ (1, ..., p), k ≠ j1,
   y_i = α_p + γ_p x_{i,j1} + e_ip,  e_ip ~ N(0, σ_p²).
5. If P̂_C(p, j) > 0.5 and P̂_A(p, j) > 0.5 for all j < p, stop.


6. Otherwise select the model j < p (with predictor x_{i,j2}) with maximum K̄_j, w̄_j and B_j and minimum C̄(Z_j, y) and Ā(θ_j | y).
7. Compare j = 1, ..., M (= p − 1) parallel models
   y_i = α_j + γ_1j x_{i,j1} + γ_2j x_{i,j2} + β_j x_ik + e_ij,  e_ij ~ N(0, σ_j²),  j = 1, ..., p − 2,  k ∈ (1, ..., p), k ∉ {j1, j2},
   y_i = α_{p−1} + γ_{1,p−1} x_{i,j1} + γ_{2,p−1} x_{i,j2} + e_{i,p−1},  e_{i,p−1} ~ N(0, σ²_{p−1}),
   and so on.

Stopping (as at steps 2 and 5) is justified because a lower-dimension model is giving better predictions. In case of conflicts between criteria at steps 3 or 6, choose the model satisfying the most fit comparisons. An alternative to unpenalised criteria is to use AIC or other types of penalised criteria.

Model averaging is desirable in not ignoring model uncertainty in inferences about parameters common to all models (Burnham and Anderson, 2002). Consider a quantity φ_k = φ(θ_k, y) depending on both the data and the parameters of each model; then the averaged quantity at iteration t, using the weights, is

    φ(θ^(t), y) = Σ_k w_k^(t) φ(θ_k^(t), y),   (10)

where weights (rather than binary criteria) best reflect small differences in predictive performance. The model-averaged posterior mean φ̄, the unconditional variance var(φ) and the full posterior density of φ are obtained under parallel sampling by analysing φ(θ^(t), y) over a large number T of iterations. Akaike (1981) and Bozdogan (1987) show that choosing the lowest-AIC model amounts to choosing the model with lowest Kullback–Leibler information loss, so Akaike weights w_k are probabilities that model k is best, given the data and set of models.
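As an illustration of Eqs. (9) and (10), the following sketch forms per-iteration weights and model-averaged draws from parallel deviance samples. Array names are hypothetical; Akaike weights correspond to supplying penalty = 2*d (parameter counts) alongside the sampled deviances.

```python
import numpy as np

def iteration_weights(dev, penalty=None):
    # Eq. (9): weights from differences to the minimum-(penalised-)deviance model
    # at each iteration; dev is a (T, M) array of sampled deviances D_k^(t).
    dev = np.asarray(dev, dtype=float)
    if penalty is not None:
        dev = dev + np.asarray(penalty)         # e.g. 2*d_k gives Akaike weights
    delta = dev - dev.min(axis=1, keepdims=True)
    w = np.exp(-0.5 * delta)
    return w / w.sum(axis=1, keepdims=True)     # rows sum to one

def model_average(w, phi):
    # Eq. (10): model-averaged quantity; phi[t, k] = phi(theta_k^(t), y).
    draws = np.sum(np.asarray(w) * np.asarray(phi), axis=1)
    return draws.mean(), draws.std(), draws     # posterior mean, sd, full draws
```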

It remains a possibility that the best model selected still does not adequately describe the data. Consider a statistic U(Z_m, θ) and let

    G_m^(t) = | U(Z_m^(t), θ^(t)) − U(y, θ^(t)) |,  t = 1, ..., T,  m = 1, ..., M.

Then

    P̂r[Ḡ_m < Ḡ_k] = Σ_{t=1}^{T} 1[ G_m^(t) < G_k^(t) ] / T,  (m ≠ k),

estimates the probability that model m's predictions check better against the data. One might also rank models in terms of the mean gaps Ḡ_m.
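A minimal sketch of this predictive check comparison follows, assuming the check statistic U has already been evaluated per iteration for the replicate data and for the observed data under each model (array names hypothetical).

```python
import numpy as np

def check_comparison(u_rep, u_obs):
    # u_rep[t, m] = U(Z_m^(t), theta^(t)); u_obs is the same shape (or a scalar
    # if U(y) does not depend on theta).
    g = np.abs(np.asarray(u_rep) - np.asarray(u_obs))    # discrepancies G_m^(t)
    g_bar = g.mean(axis=0)                               # mean gaps, one per model
    M = g.shape[1]
    # proportion of iterations where model m checks better than model k
    p_check = np.array([[np.mean(g[:, m] < g[:, k]) for k in range(M)]
                        for m in range(M)])
    return g_bar, p_check
```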

5. Worked examples

The following examples illustrate model choice or checking using parallel sampling and predictive or likelihood criteria. The first example uses simulated data where the true model is known, while the other examples use real data. The first real-data application is a linear regression where predictor choice by forward selection is the main question. The second is a Poisson regression mixture, where the issue is checking against overdispersion in the observed data. See http://www.geog.qmw.ac.uk/staff/congdon.html for the WINBUGS programs and data.


Table 1. Criteria for linear regression models (true model is model 1)

Models: (1) 1+X1; (2) 1+X1+X2; (3) 1+X1+X2+X3; (4) 1+X1+X2+X3+X4; (5) 1+X2.

                              Model 1   Model 2   Model 3   Model 4   Model 5
Unpenalised
  P̂_C(1,k)                    —         0.518     0.520     0.520     1
  P̂_A(1,k) (= Pr(Λ_1k > 1))   —         0.599     0.692     0.735     1
  (C̄_m)^1/2                   44.89     44.95     44.98     44.95     110
  Ā_m                         5.873     5.879     5.886     5.891     14.41
  K̄_m                         3.02      2.63      2.30      2.04      0
  B_m                         0.453     0.280     0.160     0.107     0
  w̄_m                         0.372     0.282     0.199     0.147     0
  Pr(Λ_1k > 20)               —         0.08      0.134     0.185     1
AIC penalisation
  P̂_Cp(1,k)                   —         0.535     0.561     0.569     1
  P̂_Ap(1,k)                   —         0.815     0.945     0.980     1
  (C̄^p_m)^1/2                 45.1      45.2      45.5      45.44     110.8
  Ā^p_m                       5.908     5.927     5.945     5.962     14.50
  K̄^p_m                       3.74      2.89      2.04      1.36      0
  B^p_m                       0.796     0.175     0.025     0.003     0
  w̄^p_m                       0.657     0.245     0.076     0.022     0
  ω̄_1k                        —         1.45      2.95      4.32      448
DIC criteria
  DIC                         1773.9    1775.8    1777.9    1779.6    2671
  Pr(Δ_1k < 0)                —         0.713     0.861     0.930     1
  Δ̄_1k                        —         1.9       3.9       5.6       897

5.1. Example 1. Simulation analysis: true model known

A sample of n = 500 points is generated from y_i = −2 + 3 x_i1 + e_i, where e_i ~ N(0, 2) and the x_i1 are N(0, 1). Three other N(0, 1) deviates, x_i2–x_i4, unrelated to y_i, were also generated. The true model (model 1), y_i = β_10 + β_11 x_i1 + e_1i, is then compared to three larger incorrect models that include x_1, namely y_i = β_j0 + Σ_{k=1}^{p_j} β_jk x_ik + e_ji (for j = 2, 3, 4 and p_j = 2, 3, 4). A fifth model does not involve the correct predictor but x_2 instead, namely y_i = β_50 + β_52 x_i2 + e_5i. Priors on the β_j0 are taken as N(0, 10000), on β_j1–β_j4 as N(0, V_b) with V_b set at 10, and a Ga(a, a) prior with a = 0.1 is assumed on the precisions of the e_ji. The five models are run in parallel over two chains of 10,000 iterations (using dispersed starting values), with convergence within 1,000 iterations.
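The simulated data for this example are easy to reproduce; the following sketch generates one realisation along the stated lines (here e_i ~ N(0, 2) is read as a variance of 2, and the seed is arbitrary). Model fitting itself was carried out in WINBUGS, as noted above.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n = 500
x = rng.standard_normal((n, 4))                # x1 relevant; x2-x4 unrelated noise
y = -2.0 + 3.0 * x[:, 0] + rng.normal(0.0, np.sqrt(2.0), size=n)   # e_i ~ N(0, 2)
```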


Fig. 1. Log evidence ratios, model 1 vs. models 2–4.

Table 1 contains unpenalised criteria and comparison probabilities, and their penalised equivalents, including log evidence ratios ω̄_1k for the correct model 1 against the other models. The criteria K̄_j and B_j are based on comparing A(θ_j^(t) | y). Also shown is the in-sample DIC (Spiegelhalter et al., 2002) according to the WINBUGS program, which uses the unstandardised (minus twice log-likelihood) definition of the deviance. An initial run was made to ascertain d_e1, ..., d_e5 (namely 2.98, 4.02, 4.99, 5.99 and 3.00). The densities of Δ_jk = (D_j + d_ej) − (D_k + d_ek), together with the probabilities Pr(Δ_jk < 0) for j = 1 and k = 2, 3, 4, 5, are then obtained in a subsequent run of 10,000 iterations.

It is apparent that all techniques identify the inferior model 5. Considering models 2–4 against model 1, it is also apparent that penalised criteria based on the likelihood prefer model 1; thus P̂_Ap(1, 4) = Pr[A^p(θ_1 | y) < A^p(θ_4 | y)] = 0.98. The standard deviation of A(θ_1 | y) is 0.014, so the means Ā(θ_3 | y) and Ā(θ_4 | y) are separated by over 2 standard deviations from Ā(θ_1 | y). Credible intervals for the log evidence ratios of model 1 vs. models 2–4 (Fig. 1) also show model 1 to be clearly distinguished from model 4. Note that var(ω_12), var(ω_13), var(ω_14) and var(ω_15) are, respectively, 3.47, 3.96, 4.49 and 2.99; multiplying by 2 gives an estimate of the total complexity (total effective parameters) of the models being compared.

Criteria based on replicate data also prefer model 1, though not in terms that would be judged significant. The standard deviation of C(Z_1, y) is 1.4, and so C̄(Z_1, y) is not clearly separated from C̄(Z_3, y) and C̄(Z_4, y). As to the DIC, the differences between models 2–4 and model 1 do not appear to be significant in terms of rules such as Pr(Δ_jk < 0) > 0.95; see also Fig. 2, containing a summary of the densities of Δ_1k for k = 2, 3, 4.


Fig. 2. Difference in DICs, model 1 vs. models 2–4.

The posterior probability that Λ_jk > 1, where Λ_jk = L_j/L_k is the ratio of unpenalised likelihoods, is the same as the posterior probability that A(θ_j | y) < A(θ_k | y). It may be noted that the stringent procedure suggested by Dempster (1997), for example requiring that the posterior probability that Λ_jk > 20 or Λ_jk > 100 exceeds a threshold such as 0.5, is best at distinguishing the true model from model 5, which excludes the correct predictor x_1. The probability that Λ_15 > 20 is 1, while the probability that Λ_14 > 20 is only 0.18. Laud and Ibrahim (1995, p. 254) discuss whether less informative priors provide less protection against overfitting. Accordingly, less informative prior values a = 0.001 and V_b = 1000 were also tried, but the values of the criteria were virtually identical to those in Table 1.

5.2. Example 2. Malnutrition in cystic fibrosis

Everitt and Rabe-Hesketh (2001) consider n = 25 malnutrition measures in cystic fibrosis patients, with nine possible predictors (Table 2). A normal linear model with weight, BMP, FEV and RV (i.e. 1 + X4 + X5 + X6) is selected by classical backward and forward selection. Here noninformative N(0, 1000) priors are adopted on regression coefficients, and continuous predictors are centred.

At step 1, models containing one predictor plus intercept (i.e. 1 + Xj for models j = 1, ..., 9) are compared with each other and with an intercept-only model in a two-chain run of 10,000 iterations. Table 3 shows the last row of the comparison probability matrix (the estimated probabilities P̂_C(M, k) and P̂_A(M, k), where M = 10). B_j and K̄_j (based on likelihoods) and the weights w̄_j are highest, and Ā_j and C̄_j lowest, for model 4.


Table 2. Predicting malnutrition in cystic fibrosis patients (values for the 25 patients are listed in order for each variable)

y:           95 85 100 85 95 80 65 110 70 95 110 90 100 80 134 134 165 120 130 85 85 160 165 95 195
X1, Age:     7 7 8 8 8 9 11 12 12 13 13 14 14 15 16 17 17 17 17 19 19 20 23 23 23
X2, Sex:     0 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0
X3, Height:  109 112 124 125 127 130 139 150 146 155 156 153 160 158 160 153 174 176 171 156 174 178 180 175 179
X4, Weight:  13.1 12.9 14.1 16.2 21.5 17.5 30.7 28.4 25.1 31.5 39.9 42.1 45.6 51.2 35.9 34.8 44.7 60.1 43.6 37.2 54.6 64 73.8 51.1 71.5
X5, BMP:     68 65 64 67 93 68 89 69 67 68 89 90 93 93 66 70 70 92 69 72 86 86 97 71 95
X6, FEV:     32 19 22 41 52 44 28 18 24 23 39 26 45 45 31 29 49 29 38 21 37 34 57 33 52
X7, RV:      258 449 441 234 202 308 305 369 312 413 206 253 174 158 302 204 187 188 172 216 184 225 171 224 225
X8, FRC:     183 245 268 146 131 155 179 198 194 225 142 191 139 124 133 118 104 129 130 119 118 148 108 131 127
X9, TLC:     137 134 147 124 104 118 119 103 128 136 95 121 108 90 101 120 103 130 103 81 101 135 98 112 101

Sex: 0 for males. BMP: basic metabolic panel. FEV: forced expiratory volume. RV: residual volume. FRC: functional residual capacity. TLC: total lung capacity.

Penalised criteria affect the comparison of the intercept model with the single-predictor models: P̂_Cp(M, k) and P̂_Ap(M, k) show that four of the single-Xj models are out-performed by the intercept-only model.

At the second stage (M = 9) there are eight new models 1 + Xj + X4 (j ≠ 4). One model, 1 + X4 + X5, improves on 1 + X4 according to unpenalised criteria, for example P̂_C(M, k) and P̂_A(M, k). AIC-penalised criteria also mostly select 1 + X4 + X5, but model 1 + X4 is in close competition and outperforms the new model on the K̄_j criterion. In the third stage there are seven new models comparing additional predictors to 1 + X4 + X5. The model 1 + X4 + X5 + X6 decisively improves on 1 + X4 + X5 according to both penalised and unpenalised criteria. At the fourth stage penalised criteria show no new model improving over 1 + X4 + X5 + X6. By contrast, unpenalised criteria support a larger model, namely 1 + X4 + X5 + X6 + X7. At the fifth stage, however, selection stops with unpenalised criteria also. Whereas 1 + X4 + X5 + X6 is well identified (95% intervals on β_0, β_4, β_5 and β_6 all either entirely positive or entirely negative), the coefficients on both the intercept (1) and X7 in 1 + X4 + X5 + X6 + X7 straddle zero, so choice using penalised criteria seems preferable.

Table 3. Cystic fibrosis data, forward selection diagnostics. For each stage the candidate models are listed; each criterion row then gives values for the models in the order listed, with model M denoting the current (smallest) model at that stage.

Stage 1. Models k = 1, ..., 10: (1) 1+X1, (2) 1+X2, (3) 1+X3, (4) 1+X4, (5) 1+X5, (6) 1+X6, (7) 1+X7, (8) 1+X8, (9) 1+X9, (10) 1 (intercept only; model M).
Unpenalised:
  (C̄)^1/2:            188.4 228.0 191.2 184.3 232.7 212.7 226.8 217.0 234.5 233.5
  Ā:                   113.7 137.5 115.2 111.1 140.1 128.3 136.6 130.8 141.5 140.9
  B:                   0.263 0 0.140 0.598 0 0 0 0 0 0
  K̄:                   7.84 2.86 7.50 8.36 2.11 5.25 3.13 4.70 1.65 1.62
  w̄:                   0.290 0.003 0.216 0.454 0.002 0.018 0.004 0.011 0.002 0.001
  P̂_C(M,k), k=1..9:    0.143 0.454 0.161 0.124 0.496 0.323 0.439 0.359 0.509
  P̂_A(M,k), k=1..9:    0.004 0.288 0.006 0.002 0.415 0.067 0.256 0.101 0.500
Penalised:
  (C̄^p)^1/2:          212.7 257.6 215.3 208.7 262.1 240.7 254.9 245.2 264.2 253.7
  Ā^p:                 128.1 154.9 129.9 125.2 157.8 144.7 154.0 147.4 159.6 152.7
  B^p:                 0.263 0 0.140 0.598 0 0 0 0 0 0
  K̄^p:                 7.83 2.54 7.50 8.35 1.77 5.16 2.85 4.56 1.35 3.10
  w̄^p:                 0.289 0.003 0.215 0.453 0.002 0.018 0.004 0.011 0.002 0.004
  P̂_Cp(M,k), k=1..9:   0.184 0.535 0.209 0.176 0.570 0.390 0.512 0.439 0.591
  P̂_Ap(M,k), k=1..9:   0.010 0.607 0.018 0.006 0.739 0.158 0.541 0.237 0.805

Stage 2. Models k = 1, ..., 9: (1) 1+X4+X1, (2) 1+X4+X2, (3) 1+X4+X3, (4) 1+X4+X5, (5) 1+X4+X6, (6) 1+X4+X7, (7) 1+X4+X8, (8) 1+X4+X9, (9) 1+X4 (model M).
Unpenalised:
  (C̄)^1/2:            186.8 182.2 187.6 176.3 182.5 187.0 187.5 186.5 185.0
  Ā:                   112.7 110.7 113.3 106.6 109.9 112.6 113.4 112.5 111.2
  B:                   0.031 0.121 0.023 0.516 0.152 0.048 0.023 0.039 0.049
  K̄:                   3.42 4.46 3.17 6.10 4.64 3.57 3.10 3.50 4.06
  w̄:                   0.076 0.120 0.069 0.270 0.140 0.080 0.069 0.082 0.094
  P̂_C(M,k), k=1..8:    0.527 0.495 0.540 0.420 0.480 0.525 0.543 0.522
  P̂_A(M,k), k=1..8:    0.583 0.450 0.617 0.253 0.425 0.562 0.621 0.566
Penalised:
  (C̄^p)^1/2:          219.6 215.5 221.1 207.2 214.6 219 220.8 219.2 207.5
  Ā^p:                 132.3 129.7 133 125.1 129.2 132 133.2 132.1 125.2
  B^p:                 0.020 0.078 0.012 0.404 0.105 0.024 0.010 0.021 0.327
  K̄^p:                 3.20 4.19 2.96 5.87 4.38 3.34 2.89 3.27 5.90
  w̄^p:                 0.068 0.107 0.060 0.238 0.118 0.073 0.058 0.070 0.208
  P̂_Cp(M,k), k=1..8:   0.600 0.573 0.615 0.497 0.557 0.598 0.617 0.597
  P̂_Ap(M,k), k=1..8:   0.803 0.714 0.822 0.471 0.684 0.791 0.826 0.797

Stage 3. Models k = 1, ..., 8: (1) 1+X4+X5+X1, (2) 1+X4+X5+X2, (3) 1+X4+X5+X3, (4) 1+X4+X5+X6, (5) 1+X4+X5+X7, (6) 1+X4+X5+X8, (7) 1+X4+X5+X9, (8) 1+X4+X5 (model M).
Unpenalised:
  (C̄)^1/2:            176.6 175.3 176.3 164.2 181.2 180.9 180.2 176.6
  Ā:                   106.3 105.7 106.1 98.8 109.0 108.9 108.4 106.5
  B:                   0.078 0.097 0.087 0.635 0.022 0.019 0.024 0.038
  K̄:                   3.53 3.74 3.61 5.81 2.63 2.60 2.74 3.35
  w̄:                   0.100 0.113 0.107 0.416 0.058 0.059 0.064 0.084
  P̂_C(M,k), k=1..7:    0.495 0.484 0.494 0.363 0.544 0.546 0.539
  P̂_A(M,k), k=1..7:    0.473 0.444 0.460 0.171 0.609 0.605 0.578
Penalised:
  (C̄^p)^1/2:          215.6 214.3 215.1 200.8 221.0 220.9 219.5 207.1
  Ā^p:                 129.7 129.0 129.5 120.8 132.9 133.0 132.5 124.9
  B^p:                 0.056 0.068 0.062 0.573 0.013 0.012 0.015 0.201
  K̄^p:                 3.31 3.52 3.40 5.67 2.44 2.41 2.55 4.69
  w̄^p:                 0.089 0.100 0.095 0.378 0.052 0.052 0.056 0.178
  P̂_Cp(M,k), k=1..7:   0.574 0.563 0.568 0.438 0.616 0.614 0.605
  P̂_Ap(M,k), k=1..7:   0.689 0.662 0.672 0.312 0.787 0.790 0.780

Stage 4. Models k = 1, ..., 7: (1) 1+X4+X5+X6+X1, (2) 1+X4+X5+X6+X2, (3) 1+X4+X5+X6+X3, (4) 1+X4+X5+X6+X7, (5) 1+X4+X5+X6+X8, (6) 1+X4+X5+X6+X9, (7) 1+X4+X5+X6 (model M).
Unpenalised:
  (C̄)^1/2:            164 167.3 163.9 158.5 161.4 161.8 163.9
  Ā:                   98.6 100.7 98.7 95.7 97.5 97.8 98.8
  B:                   0.125 0.042 0.122 0.307 0.171 0.155 0.077
  K̄:                   2.80 2.57 2.81 3.23 3.00 2.97 3.63
  w̄:                   0.132 0.081 0.129 0.233 0.162 0.151 0.112
  P̂_C(M,k), k=1..6:    0.500 0.532 0.491 0.438 0.472 0.475
  P̂_A(M,k), k=1..6:    0.474 0.598 0.485 0.355 0.438 0.452
Penalised:
  (C̄^p)^1/2:          207.8 213.4 207.8 202.3 205.6 206.9 200.2
  Ā^p:                 125.2 128.3 125.3 121.7 124.0 124.5 120.5
  B^p:                 0.095 0.028 0.094 0.253 0.133 0.120 0.279
  K̄^p:                 2.78 2.03 2.73 3.61 3.04 2.92 3.89
  w̄^p:                 0.117 0.068 0.115 0.204 0.141 0.132 0.224
  P̂_Cp(M,k), k=1..6:   0.569 0.613 0.569 0.515 0.546 0.559
  P̂_Ap(M,k), k=1..6:   0.658 0.761 0.667 0.538 0.623 0.643

Stage 5 (unpenalised criteria only). Models k = 1, ..., 6: (1) 1+X1+X4+X5+X6+X7, (2) 1+X2+X4+X5+X6+X7, (3) 1+X3+X4+X5+X6+X7, (4) 1+X4+X5+X6+X7+X8, (5) 1+X4+X5+X6+X7+X9, (6) 1+X4+X5+X6+X7 (model M).
  (C̄)^1/2:            161.4 162.4 160.9 162.3 161.1 159.1
  Ā:                   97.3 97.7 97.1 97.8 97.1 95.6
  B:                   0.167 0.1325 0.175 0.1361 0.1728 0.2166
  K̄:                   2.38 2.30 2.41 2.30 2.41 3.19
  w̄:                   0.166 0.146 0.170 0.144 0.173 0.202
  P̂_C(M,k), k=1..5:    0.526 0.537 0.523 0.536 0.517
  P̂_A(M,k), k=1..5:    0.551 0.583 0.550 0.587 0.549


For small n and moderately sized p, running models in parallel is relatively inexpensive in computing time. For example, each stage in the above forward selection process (monitoring all the quantities in Table 3) took around 60 s on a laptop computer. For large n or large p, variable selection methods (e.g. Chipman et al., 2001) are likely to be faster. However, the methods presented here have the advantage that an 'ensemble' of comparative and checking criteria is obtained, and there is no problem in extending to more complex regression models with multi-level features or multiple random effects. Variable selection methods search over the model space, whereas the aim of parallel sampling is to consider inter-model criteria (e.g. likelihood or evidence ratios) that are not obtainable via model search methods.

5.3. Lip cancer in GDR

To illustrate model averaging and checking via parallel sampling, consider Poisson discrete mixture regression for cases y_i of male lip cancer in 219 districts of the former GDR over the period 1980–1989. The observations are overdispersed (var(y) = 44.7, ȳ = 10.5). Let y_i ~ Poi(μ_i) and assume a log link, with expected deaths E_i as an offset and predictor x_1 (% of the workforce in agriculture, forestry and fisheries, or AFF). Boehning and Schlattman (1999) compare a three-group (J = 3) regression mixture (model 1), with varying intercepts but a homogeneous effect of x_1, with model 2, in which the impact of x_1 varies by group. In a Bayesian estimation of these models the intercept constraints β_01 < β_02 < β_03 and β_01 < 0 are adopted (the second constraint corresponds to assuming a low-mortality group). A Dirichlet prior with elements 1 is assumed for the prior probabilities π_j = Pr(W_i = j), where W_i is the latent group for district i and j ∈ {1, 2, 3}. Finally, a relatively informative N(0, 1) prior is assumed on the varying regression effects β_1j in model 2, since weaker priors led to convergence problems.

Plotting the crude SMRs suggests several minor modes and a model with more groups (Fig. 3). In a third model, a Dirichlet process prior (Ishwaran and Zarepour, 2000), under which the number of groups is stochastic, was used. To account for heterogeneity in the SMRs via varying intercepts, a Dirichlet concentration parameter of 10 is assumed and a maximum of J_DPP = 20 groups. The number of parameters d_3 at each iteration is calculated using the number of nonempty clusters. The proportionality assumption regarding E_i is also relaxed, while a homogeneous effect of x_1 is assumed. Thus for model 3,

    log(μ_i,W_i) = β_0,W_i + β_1 x_1i + β_2 log(E_i),

where the latent group indicator W_i is determined under the DPP prior.

Table 4 shows model 1 preferred using likelihood criteria such as w̄_k and Ā(θ_k | y), especially when penalised (the K̄ and B statistics are based on comparing likelihoods). By contrast, predictive criteria such as D̄_k show model 3 as competitive with the other models: in model 1 one of the variable intercepts is poorly identified (posterior mean approximately equal to its posterior standard deviation), and in model 2 one of the variable slopes is poorly identified. In fact the penalty for imprecise predictions of Carlin and Louis (2001, Eq. (6.34)) is lower for model 3 (namely 74 vs. 77 and 75.8 for models 1 and 2).


Fig. 3. Histogram of crude SMRs.

Under a parameter count penalty, the extra clusters in model 3 (the average number of nonempty clusters is J̄_DPP = 8.5) adversely affect its relative performance.

To check which model most closely reproduces the observed overdispersion, the ratios U(Z_m) = var(Z_m)/Z̄_m for replicate data are compared to the observed overdispersion ratio U(y) = 4.28. With G_m = |U(Z_m) − U(y)|, the mean discrepancies Ḡ_1, Ḡ_2 and Ḡ_3 are 1.30, 1.44 and 0.46. The replicate data under models 1 and 2 show higher variation than the data, whereas model 3 reproduces the observed overdispersion more accurately; thus the probabilities Pr[Ḡ_3 < Ḡ_2] and Pr[Ḡ_3 < Ḡ_1] are estimated as 0.87 and 0.89, respectively. For these data there is therefore a conflict between model comparison and checking criteria, and further model elaboration is suggested.

To illustrate model averaging using likelihood and Akaike weights, consider two districts, one with a high crude SMR despite a much lower than average AFF, and another with a crude SMR below 1 despite having one of the highest AFF values. Table 5 shows the relevant data plus estimated risks under each model and the averaged relative risks. Under Akaike weights these are dominated by model 1.
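The overdispersion check used above is straightforward to compute from replicate draws; a hedged sketch follows (array names hypothetical, and not part of the original WINBUGS analysis).

```python
import numpy as np

def overdispersion_ratio(z):
    # check statistic U: variance-to-mean ratio of a vector of counts
    z = np.asarray(z, dtype=float)
    return z.var() / z.mean()

def overdispersion_gap(z_rep, y):
    # mean discrepancy G_m between replicate and observed overdispersion;
    # z_rep: (T, n) replicate counts for one model, y: observed counts
    u_obs = overdispersion_ratio(y)            # = 4.28 for the lip cancer data
    u_rep = np.array([overdispersion_ratio(z) for z in z_rep])
    return np.mean(np.abs(u_rep - u_obs))
```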

6. Discussion

This paper has considered model assessment based on parallel sampling of M ≥ 2 models. Model comparisons to detect differences in predictive performance include differences in, or ratios of, likelihoods and deviances, and related criteria such as model weights. Parallel sampling may be used to study sampling densities of differences or ratios of fit criteria (e.g. DIC differences).


Table 4. Lip cancer in GDR

                      Model 1                   Model 2                          Model 3
                      Variable intercept,       Variable intercepts and          Variable intercepts,
                      G = 3                     slopes, G = 3                    DPP mixture
Unpenalised
  D̄                   230.7                     228.2                            220.8
  Ā                   1.033                     1.034                            1.038
  B                   0.52                      0.43                             0.05
  K̄                   1.282                     1.282                            0.436
  w̄                   0.447                     0.378                            0.174
Penalised
  D̄^p                 238.7                     240.2                            242.5
  Ā^p                 1.052                     1.063                            1.088
  B^p                 0.93                      0.07                             0.00
  K̄^p                 1.932                     1.064                            0.003
  w̄^p                 0.834                     0.162                            0.003
Checking criteria
  Q_m                 0.986                     0.989                            0.605
  Ḡ_m                 1.30                      1.44                             0.46

Matrix of P̂_C(j, k) (unpenalised); rows j = 1, 2, 3, columns k = 1, 2, 3:
  j=1:  —       0.4793   0.3728
  j=2:  0.5331  —        0.4083
  j=3:  0.6188  0.5917   —

Matrix of P̂_A(j, k) (unpenalised):
  j=1:  —       0.5606   0.7763
  j=2:  0.4394  —        0.8448
  j=3:  0.2237  0.1552   —

Matrix of P̂_C(j, k) (penalised):
  j=1:  —       0.5285   0.5379
  j=2:  0.4715  —        0.5097
  j=3:  0.4621  0.4903   —

Matrix of P̂_A(j, k) (penalised):
  j=1:  —       0.9439   0.9995
  j=2:  0.0561  —        0.9949
  j=3:  0.0005  0.0051   —


Table 5. Model averages of relative risk

District SEELOW: y = 9, E = 11.39, crude SMR = 0.79, AFF = 0.357
                                        Mean    Std dev
  RR (model 1)                          1.31    0.18
  RR (model 2)                          1.13    0.21
  RR (model 3)                          1.21    0.29
  Average with likelihood weights       1.25    0.15
  Average with Akaike weights           1.29    0.17

District STADTKRS STRALSUND: y = 21, E = 8.43, crude SMR = 2.49, AFF = 0.018
                                        Mean    Std dev
  RR (model 1)                          1.62    0.45
  RR (model 2)                          2.27    0.36
  RR (model 3)                          1.72    0.48
  Average with likelihood weights       1.86    0.36
  Average with Akaike weights           1.72    0.41

Credible intervals for such continuous measures may be used to assess the strength of evidence on models, though whether conventional intervals (e.g. 95%) are appropriate for assessing significance or choosing models is open to research. Criteria such as model weights, obtained at each iteration under parallel sampling, can be used to make model-averaged inferences taking account of both parameter and model uncertainty; delta or bootstrap approximations are avoided in obtaining summaries for model-averaged parameters.

In the worked examples (Examples 1 and 2) a formal penalty for the parameter count was found useful in model discrimination, in addition to the penalty on imprecision illustrated by the decomposition in (1). Imprecise predictions may be generated by poorly identified parameters even in a relatively parsimonious model, and conversely a relatively highly parameterised model may be well identified (as for model 3 in Example 3). There is still much debate about appropriate penalties based on parameter counts. For example, George and Foster (2000) propose a modified version of the risk inflation penalty (the RIC takes α = 2 log d_m), with features of both BIC and RIC penalties, that performed well for true models with both large and small parameter counts.

Model predictions are relevant to other aspects of fit that can be assessed via predictive checks. When predictive checks are inconsistent with other criteria (Example 3), or penalised and unpenalised comparisons are at odds, this suggests further model development, or that model averaging rather than model choice should be the relevant focus.

For large data sets, or for comparing complex models, computing constraints may confine analysis to small M (e.g. M = 2), but for small datasets with metric or discrete regressions a large number of models (10 or more) may be run in parallel. Using the forward selection method (with p regressors) and comparison criteria as above, there are only Sp − S(S − 1)/2 models to be compared (rather than 2^p), where S ≤ p is the final stage.

Because criteria such as P(j, k), B_j and w̄_k are on a (0, 1) scale, indicative values may be obtained as to strength of evidence.


These may be compared across a range of applications. Application to a set of published analyses suggests very approximate guidelines: values of P(j, k) between 0.5 and 0.6 provide moderate evidence favouring model j, values of 0.6–0.8 provide more substantial evidence, values of 0.8–0.9 provide strong support for model j, and values P(j, k) > 0.9 are decisive. The methods outlined here suggest how a unified system of model comparison and checking may be obtained by using model predictions. Predictive criteria based on parallel sampling, and based on Bayesian significance test or model averaging principles, thus provide a way of exploiting the simplicity of the predictive approach to model assessment.

References

Aitkin, M., 1991. Posterior Bayes factors. J. Roy. Statist. Soc. B 53, 111–142.
Aitkin, M., 1997. The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood. Statist. Comput. 7, 253–272.
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B., Csaki, F. (Eds.), Second International Symposium on Information Theory. Akademiai Kiado, Budapest, pp. 267–281.
Akaike, H., 1981. Likelihood of a model and information criteria. J. Econometrics 16, 3–14.
Bartlett, M., 1957. A comment on D.V. Lindley's statistical paradox. Biometrika 44, 533–534.
Berger, J., Pericchi, L., 1996a. The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91, 109–122.
Berger, J., Pericchi, L., 1996b. The intrinsic Bayes factor for linear models. In: Bernardo, J., Berger, J., Dawid, A., Smith, A. (Eds.), Bayesian Statistics 5. Clarendon Press, Oxford, pp. 23–42.
Boehning, D., Schlattman, P., 1999. Disease mapping with hidden structures using mixture models. In: Lawson, A., Biggeri, A., Boehning, D., Lesaffre, E., Viel, J., Bertollini, R. (Eds.), Disease Mapping and Risk Assessment for Public Health. Wiley, Chichester, pp. 49–60.
Bozdogan, H., 1987. Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370.
Brooks, S., 2002. Discussion to Spiegelhalter et al. J. Roy. Statist. Soc. B 64, 616–618.
Burnham, K., Anderson, D., 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York.
Carlin, B., Louis, T., 1996. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London.
Carlin, B., Louis, T., 2001. Bayes and Empirical Bayes Methods for Data Analysis, 2nd Edition. Chapman & Hall/CRC Press, Boca Raton, FL.
Chipman, H., George, E., McCulloch, R., 2001. The practical implementation of Bayesian model selection. In: Lahiri, P. (Ed.), Model Selection. IMS Monograph 38. Institute of Mathematical Statistics, Hayward, CA.
Dempster, A., 1997. The direct use of likelihood for significance testing. Statist. Comput. 7, 247–252.
Everitt, B., Rabe-Hesketh, S., 2001. Analyzing Medical Data using S-PLUS. Springer, New York.
Geisser, S., Eddy, W., 1979. A predictive approach to model selection. J. Amer. Statist. Assoc. 74, 153–160.
Gelfand, A., Ghosh, S., 1998. Model choice: a minimum posterior predictive loss approach. Biometrika 85, 1–11.
Gelman, A., Carlin, J., Stern, H., Rubin, D., 1995. Bayesian Data Analysis, 1st Edition. Chapman & Hall, London.
Gelman, A., Carlin, J., Stern, H., Rubin, D., 2003. Bayesian Data Analysis, 2nd Edition. Chapman & Hall/CRC Press, Boca Raton, FL.
George, E., Foster, D., 2000. Calibration and empirical Bayes variable selection. Biometrika 87, 731–747.
Han, C., Carlin, B., 2001. Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. J. Amer. Statist. Assoc. 96, 1122–1132.


Ishwaran, H., Zarepour, M., 2000. Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika 87, 371–390.
Laud, P., Ibrahim, J., 1995. Predictive model selection. J. Roy. Statist. Soc. B 57, 247–262.
Lindley, D., 1957. A statistical paradox. Biometrika 44, 187–192.
Meyer, M., Laud, P., 2002. Predictive variable selection in generalized linear models. J. Amer. Statist. Assoc. 97, 859–871.
Ntzoufras, I., Dellaportas, P., Forster, J., 2004. On prior distributions for model selection. Technical Report, University of Aegean.
O'Hagan, A., 1995. Fractional Bayes factors for model comparison. J. Roy. Statist. Soc. B 57, 99–138.
Sahu, S., 2003. Applications of formal model choice to archaeological chronology building. Technical Report, Faculty of Mathematical Studies, University of Southampton.
Sahu, S., Buck, C., 2000. Bayesian models for relative archaeological chronology building. Appl. Statist. 49, 423–444.
Smith, A., Spiegelhalter, D., 1980. Bayes factors and choice criteria for linear models. J. Roy. Statist. Soc. B 42, 213–220.
Spiegelhalter, D., Best, N., Carlin, B., van der Linde, A., 2002. Bayesian measures of model complexity and fit. J. Roy. Statist. Soc. B 64, 583–639.
