November 19, 1996

THE CONSISTENCY OF POSTERIOR DISTRIBUTIONS IN NONPARAMETRIC PROBLEMS

By Andrew Barron,¹ Mark J. Schervish and Larry Wasserman²

Yale University and Carnegie Mellon University

¹ Research partially supported by an Office of Naval Research contract N00014-86-K0670 and by a National Science Foundation Postdoctoral Fellowship.
² Research partially supported by NIH grant RO1-CA54852 and NSF grants DMS-9303557 and DMS-9357646.
AMS 1991 subject classifications. 62G20.
Key words and phrases. Exponential families, Hellinger distance, Nonparametric Bayesian inference, Polya trees.

We give conditions that guarantee that the posterior probability of every Hellinger neighborhood of the true density tends to 1 almost surely. The conditions are (i) a smoothness condition on the prior and (ii) a requirement that the prior put positive mass in appropriate neighborhoods of the true density. The results are based on the idea of approximating the set of densities with a finite-dimensional set of densities and then computing the Hellinger bracketing metric entropy of the approximating set. We apply the results to some examples.

1. Introduction. Recent advances in statistical computing have generated renewed interest in nonparametric Bayesian inference. As argued by Diaconis and Freedman (1986), these nonparametric methods are of little value unless they possess reasonable consistency properties. Indeed, Diaconis and Freedman (1986) showed that even if the prior puts positive mass in weak neighborhoods of the true density, it does not follow that the posterior mass of every weak neighborhood of the true density tends to 1. Doob (1949) showed consistency of the posterior under very weak conditions. However, his proof only gives consistency almost surely with respect to the prior. Consistency can fail on a null set, and the theorem gives no guidance on what this null set looks like. For example, if the prior is a point mass at a single density $g$, then Doob's theorem applies, yet consistency fails at all densities except $g$. Schwartz (1965) showed that if the prior puts positive mass in each Kullback-Leibler neighborhood of the true density $f_0$, then asymptotically the posterior does accumulate in weak neighborhoods of $f_0$. However, weak neighborhoods contain many distributions that, in any practical sense, do not resemble $f_0$. Thus it seems useful to seek convergence in some stronger sense.

In an unpublished technical report, Barron (1988) gives two conditions on the prior that guarantee consistency in the total variation norm, i.e., the posterior probability of any total variation neighborhood of the true density tends to 1. His proof relies on the theory of exponentially consistent tests. Shen (1995) gives rate of convergence results for nonparametric problems; his proofs make use of empirical process techniques. The purpose of this paper is to provide a relatively simple, self-contained proof of consistency in Hellinger distance (which is equivalent to consistency in total variation) using only a few conditions like those in Barron (1988).

A brief sketch of the main idea behind our results is as follows. Let $X^n = (X_1, \ldots, X_n)$ be $n$ i.i.d. observations from a distribution $P_0$. (For convenience, we will let $P_0$ also stand for the $n$-fold product measure $P_0^n$ as well as the infinite product measure $P_0^\infty$.) Let $\Pi$ be a prior on a set of distributions (described more formally in the next section) and let $\Pi(A \mid X^n)$ be the posterior probability of $A$ given $X^n$. Let $A_\epsilon = \{Q : d(P_0, Q) \le \epsilon\}$, where $d(\cdot,\cdot)$ is some metric. We say that consistency holds at $P_0$ if, for every $\epsilon > 0$, $\Pi(A_\epsilon \mid X^n) \to 1$ almost surely $[P_0]$. Our strategy is to find a sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ of sets of densities such that the prior probability of $\mathcal{F}_n^c$ is exponentially small. We then show that the posterior probability of $\mathcal{F}_n^c$ is exponentially small. The sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ is essentially a sieve, as in Grenander (1981) and Geman and Hwang (1982). Next, we find a finite set of upper brackets $\{f_i^U : i = 1, \ldots, N\}$ such that each $f \in \mathcal{F}_n$ satisfies $f \le f_i^U$ for some $i$. The likelihood function is then bounded above by the $f_i^U$'s, and we show that the posterior is exponentially small outside $A_\epsilon$ as long as the number of brackets does not grow too quickly. Bracketing methods have been used for many types of consistency results, such as Wong and Shen (1995), Van de Geer (1993) and Pollard (1991).

An outline of our paper is as follows. In Section 2 we present the notation and main results about consistency. In Section 3 we present some specific examples. In Section 4 we discuss an alternative approach to achieving similar results.

2. General Results. Let $\lambda$ be a $\sigma$-finite measure on a measurable space $(\mathcal{X}, \mathcal{B})$. (Without loss of generality, we will assume that $\lambda$ is, in fact, a probability measure.) Let $\mathcal{F}$ be the set of all density functions with respect to $\lambda$. (If we think of $\mathcal{F}$ as equivalence classes of densities that are equal a.s. $[\lambda]$, then $\mathcal{F}$ is equivalent to the set of all probability measures on $(\mathcal{X}, \mathcal{B})$ that are absolutely continuous with respect to $\lambda$. Absolute continuity of all probability measures under consideration with respect to a common $\sigma$-finite measure is a condition of Bayes' theorem.) Let $\mathcal{F}_0$ be the set of all nonnegative functions that are integrable with respect to $\lambda$. Let $\mathcal{X}^n$ denote the product space of $n$ copies of $\mathcal{X}$, and let $\lambda^n$ denote product measure. Let $\mathcal{X}^\infty$ be the product space of countably many copies of $\mathcal{X}$. Let $d(\cdot,\cdot)$ denote the Hellinger metric on $\mathcal{F}_0$,
$$d(f,g) = \left[\int \left(f(x)^{1/2} - g(x)^{1/2}\right)^2 d\lambda(x)\right]^{1/2}.$$
Let $\mathcal{C}$ be the Borel $\sigma$-field of subsets of $\mathcal{F}$ induced by the metric $d$. Suppose that $\{X_n\}_{n=1}^\infty$ is a sequence of i.i.d. random variables with distribution $P_0$ having density $f_0$. Let $E_0$ stand for expectation under the distribution $P_0$. Let $I(\cdot\,;\cdot)$ stand for the Kullback-Leibler information
$$I(f; g) = \int \log\frac{f(x)}{g(x)}\, f(x)\, d\lambda(x).$$
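Both discrepancies are straightforward to approximate numerically. The following sketch (ours, not part of the paper; the example densities are arbitrary) approximates $d(f,g)$ and $I(f;g)$ by Riemann sums on $[0,1]$:

```python
import numpy as np

def hellinger(f, g, x):
    """Approximate the Hellinger distance d(f,g) by a Riemann sum on grid x."""
    dx = x[1] - x[0]
    return np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)

def kl(f, g, x):
    """Approximate the Kullback-Leibler information I(f;g) by a Riemann sum."""
    dx = x[1] - x[0]
    return np.sum(np.log(f / g) * f) * dx

# Example: f is the Beta(2,2) density, g is uniform, on a grid over (0,1).
x = np.linspace(1e-6, 1 - 1e-6, 100_000)
f = 6 * x * (1 - x)          # Beta(2,2) density
g = np.ones_like(x)          # uniform density

print(hellinger(f, g, x))    # d(f,g)
print(kl(f, g, x))           # I(f;g)
```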

For each $\epsilon > 0$, define
$$N_\epsilon = \{g \in \mathcal{F} : I(f_0; g) \le \epsilon\}, \qquad A_\epsilon = \{g \in \mathcal{F} : d(f_0, g) \le \epsilon\}.$$
Let $\Pi$ be a prior distribution on $(\mathcal{F}, \mathcal{C})$. Some other useful notation is given below:
$$X^n = (X_1, \ldots, X_n), \qquad X^\infty = (X_1, X_2, \ldots),$$
$$x^n = (x_1, \ldots, x_n), \qquad x^\infty = (x_1, x_2, \ldots),$$
$$m_n(x^n) = \int \prod_{i=1}^n f(x_i)\, d\Pi(f),$$
$$\Pi(B \mid x^n) = [m_n(x^n)]^{-1} \int_B \prod_{i=1}^n f(x_i)\, d\Pi(f),$$
$$p_n(x^n) = \prod_{i=1}^n f_0(x_i),$$
$$D_n(x^n, f) = \frac{1}{n} \log\frac{p_n(x^n)}{\prod_{i=1}^n f(x_i)}, \quad \text{for } f \in \mathcal{F}_0.$$

First, we state the assumptions under which we prove consistency. The following definition is based on one from Alexander (1984).

Definition 1. For $\delta > 0$ and $C \subseteq \mathcal{F}$, define $\mathcal{H}(C, \delta)$ to be the logarithm of the minimum of the set of all $k$ such that there exist functions $f_1^U, \ldots, f_k^U \in \mathcal{F}_0$ satisfying
(i) $\int f_i^U(x)\, d\lambda(x) \le 1 + \delta$ for all $i$,
(ii) for all $f \in C$ there exists $i$ such that $f \le f_i^U$.
We call $\mathcal{H}(C, \delta)$ the metric entropy with upper bracketing of $C$ with bound $\delta$.

Assumption 1. For every $\epsilon > 0$, $\Pi(N_\epsilon) > 0$.

Assumption 2. For every $\epsilon > 0$, there exist a sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ of subsets of $\mathcal{F}$ and constants $c_1, c_2, \delta > 0$ and $0 < c < \left[\left(\epsilon - \sqrt{\delta}\right)^2 - \delta\right]/2$ such that
1. $\Pi(\mathcal{F}_n^c) < c_1 \exp(-n c_2)$ for all but finitely many $n$,
2. $\mathcal{H}(\mathcal{F}_n, \delta) \le nc$ for all but finitely many $n$.

The purpose of Assumption 1 is to avoid problems like those highlighted by Diaconis and Freedman (1986). The prior used by Diaconis and Freedman put positive probability on weak neighborhoods of the true distribution, but not on sets with finite Kullback-Leibler information. Since the likelihood function at $f_0$ divided by the likelihood at $f$ is $\exp[n D_n(x^n, f)]$ and $D_n(x^n, f) \to I(f_0; f)$ a.s., it seems plausible to expect the posterior distribution to concentrate on the set of densities $f$ for which $I(f_0; f)$ is small, but only if that set has positive prior probability. Assumption 2 is designed to prevent the prior from giving substantial mass to distributions that might accidentally track the data too closely. In Section 3.5, we give a detailed example in which Assumption 1 holds but the prior puts too much mass on distributions with densities that are allowed to jump up wherever a data value appears and jump down again right after. Assumption 2 forces the prior probabilities of such distributions to be small enough that only extremely large samples of highly clustered data will lead to their having large posterior probabilities. This same problem arises in nonparametric maximum likelihood estimation, and it is often addressed in a similar fashion, namely by using sieves that satisfy a condition like part 2 of Assumption 2.

To check part 2 of Assumption 2, it is often easiest to set $\delta = \epsilon^2/16$ and $c = \epsilon^2/5$. One then checks that $\mathcal{H}(\mathcal{F}_n, \epsilon^2/16) \le n\epsilon^2/5$ for all large $n$. (Indeed, with $\delta = \epsilon^2/16$ we have $[(\epsilon - \sqrt{\delta})^2 - \delta]/2 = [(3\epsilon/4)^2 - \epsilon^2/16]/2 = \epsilon^2/4$, and $\epsilon^2/5 < \epsilon^2/4$, so the constraint on $c$ in Assumption 2 is satisfied.)

The main result of this paper is the following consistency result.

Theorem 1. Under Assumptions 1-2, for every $\epsilon > 0$,
$$\lim_{n\to\infty} \Pi(A_\epsilon \mid x^n) = 1, \quad \text{a.s. } [P_0].$$

The proof of Theorem 1 requires some lemmas, but the following simple consequence of Theorem 1 is easy to prove.

Corollary 1. Assume Assumptions 1-2, and define the predictive density
$$\hat{f}_n(\cdot) = \int f(\cdot)\, d\Pi(f \mid x^n).$$
Then $\lim_{n\to\infty} d(f_0, \hat{f}_n) = 0$, a.s. $[P_0]$.

Proof. For each $\epsilon > 0$, we have
(1) $$d(f_0, \hat{f}_n) \le \int d(f_0, f)\, d\Pi(f \mid x^n) = \int_{A_\epsilon} d(f_0, f)\, d\Pi(f \mid x^n) + \int_{A_\epsilon^c} d(f_0, f)\, d\Pi(f \mid x^n), \quad \text{a.s. } [P_0],$$
where the inequality follows from Jensen's inequality and the convexity of $d(f_0, \cdot)$. The first term on the right-hand side of (1) is at most $\epsilon$ by the definition of $A_\epsilon$, and the second term goes to 0 a.s. $[P_0]$ by Theorem 1. Since $\epsilon$ is arbitrary, the result follows. $\Box$

The proof of Theorem 1 will appear after the next few lemmas.
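To make the posterior $\Pi(\cdot \mid x^n)$ and the predictive density $\hat{f}_n$ concrete, here is a toy simulation (ours, assuming SciPy is available; the three Beta components and the sample size are arbitrary choices): the prior sits on finitely many densities, the posterior weights are proportional to prior mass times likelihood, and $\hat{f}_n$ is the posterior mixture.

```python
import numpy as np
from scipy.stats import beta
rng = np.random.default_rng(0)

# A toy prior supported on three Beta(a, b) densities (arbitrary choices).
params = [(1.0, 1.0), (2.0, 2.0), (0.5, 0.5)]
prior = np.array([1/3, 1/3, 1/3])

# Data drawn from f0 = Beta(2, 2), which is in the prior's support.
x = rng.beta(2.0, 2.0, size=500)

# Posterior weights: proportional to prior times likelihood.
loglik = np.array([beta.logpdf(x, a, b).sum() for a, b in params])
w = np.log(prior) + loglik
w = np.exp(w - w.max()); w /= w.sum()

# Predictive density f_hat(t) = sum_j w_j f_j(t), and Hellinger distance to f0.
t = np.linspace(1e-6, 1 - 1e-6, 200_000)
f_hat = sum(wj * beta.pdf(t, a, b) for wj, (a, b) in zip(w, params))
f0 = beta.pdf(t, 2.0, 2.0)
dt = t[1] - t[0]
hell = np.sqrt(np.sum((np.sqrt(f0) - np.sqrt(f_hat)) ** 2) * dt)
print(w, hell)  # posterior mass concentrates on (2,2); the distance is small
```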

Lemma 1. There exists a set $B \subseteq \mathcal{X}^\infty$ such that $P_0(B) = 1$ and such that for every $x^\infty \in B$, there is a set $C(x^\infty)$ such that $\Pi(C(x^\infty)) = 1$ and, for every $f \in C(x^\infty)$, $\lim_{n\to\infty} D_n(x^n, f) = I(f_0; f)$.

Proof. By the strong law of large numbers, $P_0(x^\infty : \lim_{n\to\infty} D_n(x^n, f) = I(f_0; f)) = 1$ for every $f \in \mathcal{F}$. Let $C(x^\infty) = \{f : \lim_{n\to\infty} D_n(x^n, f) = I(f_0; f)\}$. By Tonelli's theorem we have
(2) $$1 = \int_{\mathcal{F}} P_0\left(x^\infty : \lim_{n\to\infty} D_n(x^n, f) = I(f_0; f)\right) d\Pi(f) = \int_{\mathcal{F}} \int_{\mathcal{X}^\infty} I_{C(x^\infty)}(f)\, dP_0(x^\infty)\, d\Pi(f) = \int_{\mathcal{X}^\infty} \int_{\mathcal{F}} I_{C(x^\infty)}(f)\, d\Pi(f)\, dP_0(x^\infty).$$
Let $B$ be the set of all $x^\infty$ such that $\int I_{C(x^\infty)}(f)\, d\Pi(f) = \Pi(C(x^\infty)) = 1$. It follows from (2) that $P_0(B) = 1$. $\Box$

Lemma 2. Under Assumption 1, for every $\beta > 0$,
$$P_0\left(x^\infty : \frac{m_n(x^n)}{p_n(x^n)} \le \exp(-n\beta) \text{ i.o.}\right) = 0.$$

Proof. Let $\beta > 0$ be given. Then
$$\exp(n\beta)\, \frac{m_n(x^n)}{p_n(x^n)} = \int \frac{\prod_{i=1}^n f(x_i)}{p_n(x^n)} \exp(n\beta)\, d\Pi(f) \ge \int_{N_{\beta/2}} \frac{\prod_{i=1}^n f(x_i)}{p_n(x^n)} \exp(n\beta)\, d\Pi(f) = \int_{N_{\beta/2}} \exp\left(n[\beta - D_n(x^n, f)]\right) d\Pi(f).$$
Let $B$ be the set from Lemma 1. For each $x^\infty \in B$, let $C(x^\infty)$ be as in Lemma 1. For each $x^\infty \in B$, $\liminf_{n\to\infty} \exp(n[\beta - D_n(x^n, f)]) = \infty$ for all $f \in N_{\beta/2} \cap C(x^\infty)$. Assumption 1 and Lemma 1 say that $\Pi(N_{\beta/2} \cap C(x^\infty)) > 0$. Fatou's lemma implies that for every $x^\infty \in B$,
$$\liminf_{n\to\infty} \exp(n\beta)\, \frac{m_n(x^n)}{p_n(x^n)} = \infty;$$
hence $m_n(x^n)/p_n(x^n) > \exp(-n\beta)$ all but finitely often. $\Box$
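Lemma 2 is the key "evidence" bound: under Assumption 1, the marginal likelihood $m_n$ cannot be exponentially smaller than the true likelihood $p_n$. A quick Monte Carlo sketch (ours, assuming SciPy; the finite prior is an arbitrary choice whose support contains $f_0$, so Assumption 1 holds trivially) shows $n^{-1}\log(m_n/p_n)$ tending to 0 rather than to a negative constant:

```python
import numpy as np
from scipy.stats import beta
rng = np.random.default_rng(1)

# Finite prior over Beta densities; f0 = Beta(2,2) is in the support,
# so every KL neighborhood of f0 has positive prior mass.
params = [(2.0, 2.0), (1.0, 1.0), (3.0, 1.5)]
prior = np.array([0.2, 0.4, 0.4])

x = rng.beta(2.0, 2.0, size=5000)
for n in [10, 100, 1000, 5000]:
    xn = x[:n]
    log_pn = beta.logpdf(xn, 2.0, 2.0).sum()
    log_terms = [np.log(p) + beta.logpdf(xn, a, b).sum()
                 for p, (a, b) in zip(prior, params)]
    log_mn = np.logaddexp.reduce(log_terms)
    print(n, (log_mn - log_pn) / n)  # stays O(1/n), not bounded away from 0
```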

Lemma 3. Assume Assumption 1. Let $c_1, c_2 > 0$. Suppose that $\{B_n\}_{n=1}^\infty$ is a sequence of subsets of $\mathcal{F}$ such that $\Pi(B_n) < c_1 \exp(-c_2 n)$ for all but finitely many $n$. Then $\lim_{n\to\infty} \Pi(B_n \mid x^n) = 0$, a.s. $[P_0]$.

Proof. It suffices to prove that, for each $\epsilon > 0$,
(3) $$P_0\left(x^\infty : \Pi(B_n \mid x^n) > \epsilon \text{ i.o.}\right) = 0.$$
First, write
$$\Pi(B_n \mid x^n) = \frac{1}{m_n(x^n)} \int_{B_n} \prod_{i=1}^n f(x_i)\, d\Pi(f) = \frac{p_n(x^n)}{m_n(x^n)} \int_{B_n} \frac{\prod_{i=1}^n f(x_i)}{p_n(x^n)}\, d\Pi(f).$$
We see that, for all but finitely many $n$,
$$P_0\left(x^n : \int_{B_n} \frac{\prod_{i=1}^n f(x_i)}{p_n(x^n)}\, d\Pi(f) > \exp\left(-n\frac{c_2}{2}\right)\right) \le \exp\left(n\frac{c_2}{2}\right) \int_{\mathcal{X}^n} \int_{B_n} \prod_{i=1}^n f(x_i)\, d\Pi(f)\, d\lambda^n(x^n) = \exp\left(n\frac{c_2}{2}\right) \int_{B_n} \int_{\mathcal{X}^n} \prod_{i=1}^n f(x_i)\, d\lambda^n(x^n)\, d\Pi(f) = \exp\left(n\frac{c_2}{2}\right) \Pi(B_n) \le c_1 \exp\left(-n\frac{c_2}{2}\right),$$
where the first inequality follows from the Markov inequality, the next equality follows from Tonelli's theorem and the last inequality follows from the hypotheses of the lemma. Since $\sum_{n=1}^\infty \exp(-n c_2/2) < \infty$, the first Borel-Cantelli lemma implies that
(4) $$P_0\left(x^\infty : \int_{B_n} \frac{\prod_{i=1}^n f(x_i)}{p_n(x^n)}\, d\Pi(f) > \exp\left(-n\frac{c_2}{2}\right) \text{ i.o.}\right) = 0.$$
From Lemma 2,
$$P_0\left(x^\infty : \frac{p_n(x^n)}{m_n(x^n)} > \exp\left(n\frac{c_2}{4}\right) \text{ i.o.}\right) = 0.$$
Combining this with (4) yields
$$P_0\left(x^\infty : \Pi(B_n \mid x^n) > \exp\left(-n\frac{c_2}{4}\right) \text{ i.o.}\right) = 0,$$
which implies (3). $\Box$

The following lemma is a modification of Lemma 1 of Wong and Shen (1995).

Lemma 4. Let $g \in \mathcal{F}_0$, $\alpha > 0$, $\beta > 0$, $d(f_0, g)^2 = \alpha$, $\int g(x)\, d\lambda(x) \le 1 + \delta$ and $\delta \le \alpha$. Then

$$P_0\left(x^n : \prod_{i=1}^n \frac{g(x_i)}{f_0(x_i)} \ge \exp[-n\beta]\right) \le \exp\left(-n\, \frac{\alpha - \delta - \beta}{2}\right).$$

Proof.
$$P_0\left(x^n : \prod_{i=1}^n \left(\frac{g(x_i)}{f_0(x_i)}\right)^{1/2} \ge \exp\left[-\frac{n\beta}{2}\right]\right) \le \exp\left(\frac{n\beta}{2}\right) \left[E_0\left(\frac{g(X_1)}{f_0(X_1)}\right)^{1/2}\right]^n \le \exp\left(\frac{n\beta}{2}\right) \left(1 - \frac{\alpha - \delta}{2}\right)^n = \exp\left(\frac{n\beta}{2}\right) \exp\left(n \log\left[1 - \frac{\alpha - \delta}{2}\right]\right) \le \exp\left(-n\, \frac{\alpha - \delta - \beta}{2}\right),$$
where the first inequality follows from the Markov inequality, the second follows from the definition of $d(\cdot,\cdot)$, and the last follows from the facts that $\delta \le \alpha \le 2$ and $\log(1 - x) \le -x$ for $0 < x < 1$. $\Box$

Lemma 5. Let $f \in \mathcal{F}$ and $g \in \mathcal{F}_0$ be such that $f \le g$ and $\int g(x)\, d\lambda(x) \le 1 + \delta$. Then $d(f, g) \le \sqrt{\delta}$.

Proof.
$$d(f,g)^2 = \int \left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 d\lambda(x) \le 2 + \delta - 2 \int \sqrt{f(x) g(x)}\, d\lambda(x).$$
Write
$$\sqrt{f(x) g(x)} = \left[\sqrt{f(x)} + \left(\sqrt{g(x)} - \sqrt{f(x)}\right)\right] \sqrt{f(x)} \ge f(x).$$

It follows that $\int \sqrt{f(x) g(x)}\, d\lambda(x) \ge \int f(x)\, d\lambda(x) = 1$. So $d(f,g)^2 \le \delta$ and $d(f,g) \le \sqrt{\delta}$. $\Box$

Proof of Theorem 1. Let $\epsilon > 0$ be given, and let $\{\mathcal{F}_n\}_{n=1}^\infty$, $c$, $c_1$, $c_2$ and $\delta$ be as guaranteed by Assumption 2. Write
(5) $$\Pi(A_\epsilon^c \mid x^n) = \Pi(A_\epsilon^c \cap \mathcal{F}_n \mid x^n) + \Pi(A_\epsilon^c \cap \mathcal{F}_n^c \mid x^n).$$
Since $\Pi(\mathcal{F}_n^c) \le c_1 \exp(-n c_2)$ for all but finitely many $n$, it follows from Lemma 3 that the second expression on the right-hand side of (5) goes to 0, a.s. $[P_0]$. So it suffices to prove that the first expression on the right-hand side of (5) goes to 0, a.s. $[P_0]$. Let $C_n = A_\epsilon^c \cap \mathcal{F}_n$. Now write
(6) $$\Pi(C_n \mid x^n) = \frac{\int_{C_n} \prod_{i=1}^n f(x_i)\, d\Pi(f)}{\int \prod_{i=1}^n f(x_i)\, d\Pi(f)} = \frac{p_n(x^n)}{m_n(x^n)} \int_{C_n} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)}\, d\Pi(f).$$
Let $m = \exp(\mathcal{H}(\mathcal{F}_n, \delta))$. For $j = 1, \ldots, m$, define
$$E_j = \{f \in \mathcal{F}_n : f \le f_j^U\}.$$
We now write
$$\int_{C_n} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)}\, d\Pi(f) \le \sum_{j=1}^m \int_{E_j \cap C_n} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)}\, d\Pi(f) \le \sum_{j=1}^m \int_{E_j \cap C_n} \prod_{i=1}^n \frac{f_j^U(x_i)}{f_0(x_i)}\, d\Pi(f) = \sum_{j=1}^m \prod_{i=1}^n \frac{f_j^U(x_i)}{f_0(x_i)}\, \Pi(E_j \cap C_n).$$
Let $0 < \beta < \left(\epsilon - \sqrt{\delta}\right)^2 - \delta - 2c$, and define
$$F_{n,j} = \left\{x^n : \prod_{i=1}^n \frac{f_j^U(x_i)}{f_0(x_i)} \ge \exp\left(-\frac{n\beta}{2}\right)\right\}.$$
For every $j$ and every $f \in E_j$, Lemma 5 says that $d(f, f_j^U) \le \sqrt{\delta}$. For every $n$ and every $f \in C_n$, $d(f_0, f) > \epsilon$. By the triangle inequality, $E_j \cap C_n = \emptyset$ whenever $d(f_0, f_j^U) < \epsilon - \sqrt{\delta}$. For every $n$ and every $j$ such that $d(f_0, f_j^U) \ge \epsilon - \sqrt{\delta}$, apply Lemma 4 with $g = f_j^U$, $\alpha = d(f_0, f_j^U)^2 \ge (\epsilon - \sqrt{\delta})^2$ and $\beta/2$ in place of $\beta$ to see that $P_0(F_{n,j}) \le \exp(-nv)$, where $v = (\alpha - \delta - \beta/2)/2 > c$. So
$$P_0\left(x^n : \int_{C_n} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)}\, d\Pi(f) \ge \exp\left(-\frac{n\beta}{2}\right)\right) \le P_0\left(x^n : \sum_{j=1}^m \prod_{i=1}^n \frac{f_j^U(x_i)}{f_0(x_i)}\, \Pi(E_j \cap C_n) \ge \exp\left(-\frac{n\beta}{2}\right)\right) \le \sum_{j=1}^m P_0(F_{n,j}) \le m \exp(-nv) = \exp\left(\mathcal{H}(\mathcal{F}_n, \delta) - nv\right) \le \exp(-n[v - c]),$$
for all but finitely many $n$. The first Borel-Cantelli lemma implies that
$$P_0\left(x^\infty : \int_{C_n} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)}\, d\Pi(f) \ge \exp\left(-\frac{n\beta}{2}\right) \text{ i.o.}\right) = 0.$$
Lemma 2 says that
$$P_0\left(x^\infty : \frac{p_n(x^n)}{m_n(x^n)} \ge \exp\left(\frac{n\beta}{4}\right) \text{ i.o.}\right) = 0.$$
Combining these last two equations with (6) gives
$$P_0\left(x^\infty : \Pi(C_n \mid x^n) \ge \exp\left(-\frac{n\beta}{4}\right) \text{ i.o.}\right) = 0.$$
Hence $\lim_{n\to\infty} \Pi(C_n \mid x^n) = 0$, a.s. $[P_0]$. $\Box$

Verifying part 2 of Assumption 2 can be awkward. The next lemma gives a specific condition that can be checked to verify this assumption. The condition is similar to, but slightly stronger than, Condition (B) of Barron (1988).

Lemma 6. Let $\{T_n\}_{n=1}^\infty$ be a sequence of finite measurable partitions of $\mathcal{X}$ and let $N_n$ be the cardinality of $T_n$. For each $n$, let $d_n > 0$ and suppose that $\lambda(A) = 1/N_n$ for every $A \in T_n$. Define
$$\mathcal{F}_n = \left\{f \in \mathcal{F} : \forall A \in T_n\ \forall x, y \in A,\ |f(x) - f(y)| \le d_n\right\}.$$
Then $\mathcal{H}(\mathcal{F}_n, 2 d_n) \le N_n\left[1 + \log\left(1 + 1/[2 N_n d_n]\right)\right]$.

Proof. For each $n$ and each vector $\vec{\ell} = (\ell_1, \ldots, \ell_{N_n})$ of nonnegative integers such that
(7) $$d_n \sum_{i=1}^{N_n} \ell_i \le N_n \le d_n \sum_{i=1}^{N_n} (\ell_i + 2),$$
define
$$f_{\vec{\ell}}^U(x) = d_n \sum_{i=1}^{N_n} I_{A_i}(x)(\ell_i + 2).$$
It is easy to see that for every $f \in \mathcal{F}_n$, there exists $f_{\vec{\ell}}^U$ such that $f \le f_{\vec{\ell}}^U$. Note that
$$\int f_{\vec{\ell}}^U(x)\, d\lambda(x) = d_n \sum_{i=1}^{N_n} (\ell_i + 2) \frac{1}{N_n} \le 1 + 2 d_n.$$
So the collection of $f_{\vec{\ell}}^U$ functions for all $\vec{\ell}$ that satisfy (7) forms an upper bracketing of $\mathcal{F}_n$ with bound $2 d_n$. The cardinality of this upper bracketing is at most the number of hypercubes with sides of length $2 d_n$ needed to cover the $(N_n - 1)$-dimensional simplex in $\mathbb{R}^{N_n}$. An upper bound on this number is $(2 d_n)^{-N_n}$ times the volume of $C_n = \{x \in \mathbb{R}^{N_n} : \forall i\ x_i \ge 0,\ \sum_{i=1}^{N_n} x_i \le 1 + 2 N_n d_n\}$. It is easy to see that $C_n$ is just $1 + 2 N_n d_n$ times the set where the Dirichlet density with all parameters equal to 1 is positive. Hence the volume of $C_n$ is equal to $(1 + 2 N_n d_n)^{N_n} / N_n!$. It follows that
$$\mathcal{H}(\mathcal{F}_n, 2 d_n) \le N_n \log(1 + 2 N_n d_n) - \log(N_n!) - N_n \log(2 d_n) \le N_n \log(N_n) + N_n \log\left(1 + \frac{1}{2 N_n d_n}\right) - N_n \log(N_n) + N_n = N_n\left[1 + \log\left(1 + \frac{1}{2 N_n d_n}\right)\right],$$
since $\log(x!) \ge x \log(x) - x$ for all $x$. $\Box$

A simple corollary helps to verify part 2 of Assumption 2.

Corollary 2. For each $\epsilon > 0$, let $N_n \le n\epsilon^2/10$, $d_n = \epsilon^2/32$ and $\delta = \epsilon^2/16$ (so that $2 d_n = \delta$). If $\lim_{n\to\infty} N_n = \infty$, then the sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ from Lemma 6 satisfies
$$\mathcal{H}(\mathcal{F}_n, \delta) \le n\epsilon^2/5$$
for all but finitely many $n$.

To verify part 1 of Assumption 2, one must show that $\Pi(\mathcal{F}_n^c)$ is exponentially small.
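The bracket construction in the proof can be exercised directly. The following sketch (ours; the grid resolution and the choice of $f$ are arbitrary) builds the bracket $f_{\vec\ell}^U$ for a density whose oscillation within each cell is at most $d_n$, taking $\ell_i = \lfloor \min_{A_i} f / d_n \rfloor$, and verifies $f \le f_{\vec\ell}^U$ and $\int f_{\vec\ell}^U\, d\lambda \le 1 + 2 d_n$:

```python
import numpy as np

def upper_bracket_heights(f_vals, n_cells, d_n):
    """For f sampled on a fine grid over [0,1], return the step heights
    d_n * (l_i + 2) of the bracket from Lemma 6, with l_i = floor(min f / d_n)
    on cell A_i.  Requires the oscillation of f within each cell <= d_n."""
    cells = np.array_split(f_vals, n_cells)
    ells = np.array([int(np.floor(c.min() / d_n)) for c in cells])
    return d_n * (ells + 2)

N_n, d_n, per_cell = 64, 0.05, 200
x = np.linspace(0.0, 1.0, N_n * per_cell, endpoint=False)
f = 1.0 + 0.02 * np.sin(2 * np.pi * x)      # oscillation per cell << d_n

h = upper_bracket_heights(f, N_n, d_n)
f_U = np.repeat(h, per_cell)
assert np.all(f <= f_U)                     # f lies under its bracket
print(h.mean(), "<=", 1 + 2 * d_n)          # h.mean() equals the bracket's integral
```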

3. Some Prior Distributions. In this section, we present some prior distributions that satisfy Assumptions 1 and 2 when taken together with the sequence of sets in Corollary 2. Let $\mathcal{X} = \mathbb{R}$, and let $\bar{F}$ be the prior cumulative distribution function of $X_1$ with density $\bar{f}$. Let $Y_i = \bar{F}(X_i)$ for all $i$, so that $Y_i \in [0,1]$. We will construct random probability measures over $[0,1]$ for the sequence $\{Y_i\}_{i=1}^\infty$. These have the prior predictive distribution of $Y_i$ being uniform on $[0,1]$, and they induce random probabilities on $\mathbb{R}$ in the obvious way. In Section 3.5, we also give an example to show how failure of Assumption 2 can lead to an inconsistent posterior.

3.1. Histograms. One simple sort of prior distribution for continuous distributions that satisfies the conditions of Theorem 1 is a prior concentrated on a collection of step-function densities. For each $n$, we construct a collection $U_n$ of densities, each of which is constant on each of the finitely many intervals in a partition $T_n$. We use Corollary 2 to ensure that part 2 of Assumption 2 holds. We assign the set $U_n$ prior probability $p_n$, which is chosen so that part 1 of Assumption 2 holds. We distribute the probability $p_n$ over $U_n$ in such a way that we can apply a result of Barron (1988, Lemma 17) to conclude that Assumption 1 holds. The details follow.

Suppose that $\mathcal{F}$ consists of all probability measures on $[0,1]$ that are absolutely continuous with respect to Lebesgue measure. Assume that $I(f_0; 1) < \infty$, where 1 denotes the uniform density. For each integer $n$, let $p_n > 0$ be such that $\sum_{n=1}^\infty p_n = 1$. For each $n$, let $N_n$ be an integer and let $T_n$ be the partition
$$T_n = \left\{\left[0, \frac{1}{N_n}\right), \left[\frac{1}{N_n}, \frac{2}{N_n}\right), \ldots, \left[\frac{N_n - 1}{N_n}, 1\right]\right\}.$$
Let $U_n$ be the collection of all densities that are constant on every interval in $T_n$. Our prior distribution will place probability $p_n$ on the set $U_n$ and distribute the probability as follows. Let $a_n > 0$ and denote a random element of $\mathcal{F}$ as $F$. If $F \in U_n$, we can write $F = (f_1, \ldots, f_{N_n})$, where $\sum_{i=1}^{N_n} f_i \lambda(A_i) = 1$ and $A_i$ is the interval $[(i-1)/N_n, i/N_n)$. This makes $f_i$ equal to the value of the density corresponding to $F$ on the interval $A_i$. Conditional on $F \in U_n$, we assign $F/N_n$ the Dirichlet distribution $\mathrm{Dir}(a_n, \ldots, a_n)$. We now prove that, by careful choice of $N_n$ and $p_n$ as functions of $n$, this prior distribution will satisfy the conditions of Theorem 1. We will let $N_n = 2^{m_n}$ with $\{m_n\}_{n=1}^\infty$ a nondecreasing sequence of positive integers that goes to $\infty$.

The following result is Lemma 17 of Barron (1988). Its proof is deferred until the Appendix.

Lemma 7. Let $(\mathcal{X}, \mathcal{B}, \lambda)$ be a probability space, and let $C$ be a collection of measurable real-valued functions defined on $\mathcal{X}$. For each $b > 0$, define
$$C_b = \{f \in C : \operatorname{ess\,sup} |f| \le b\}, \qquad L_b = \{f : \operatorname{ess\,sup} |f| \le b\},$$
where the essential supremum is relative to $\lambda$. Suppose that there exists $r > 1$ such that $C_{rb}$ is dense (in the sense of $L_1(\lambda)$) in $L_b$ for all large $b$. Let $P \ll \lambda$ be another probability on $(\mathcal{X}, \mathcal{B})$ such that $I(f_0; 1) < \infty$, where $f_0 = dP/d\lambda$. Then for every $\epsilon > 0$ there is a bounded function $g \in C$ such that $I(f_0; p_g) < \epsilon$, where $p_g$ is the density
$$p_g(x) = \frac{\exp(g(x))}{\int \exp(g(y))\, d\lambda(y)}.$$

Let $C$ be the set of all step functions that are constant on all of the intervals in at least one of the $T_n$ partitions, and let $\lambda$ be Lebesgue measure. Since step functions are dense in the collection of bounded measurable functions and $C$ is dense in the collection of step functions, it follows that for each $\epsilon$, there exist $n$ and $f \in U_n$ such that $I(f_0; f) < \epsilon/2$. Since the Dirichlet distribution over $U_n$ assigns positive probability to every open neighborhood of $f$, and $I(f_0; f)$ is continuous as a function of the coordinates of $f$, it follows that Assumption 1 holds.

Next, construct the sets $\{\mathcal{F}_n\}_{n=1}^\infty$ for an arbitrary sequence $\{d_n\}_{n=1}^\infty$. Since each $d_n > 0$ and the functions in $U_n$ are constant on every $A \in T_n$, it follows that $U_n \subseteq \mathcal{F}_n$, no matter how we choose the $d_n$. Also, $U_n \subseteq U_{n+1}$ for all $n$, so $\Pi(\mathcal{F}_n^c) \le \sum_{\ell=n+1}^\infty p_\ell$. Setting $p_n = (1 - a) a^n$ for some $0 < a < 1$ will satisfy part 1 of Assumption 2. Finally, let $N_n \approx n/\log(n)$ (that is, $m_n = \lfloor \log_2(n) - \log_2(\log(n)) \rfloor$) in Corollary 2, so that part 2 of Assumption 2 holds.
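As a concrete illustration (ours; the constants $a = 0.7$, $a_n \equiv 1$ and the cap on $m_n$ are arbitrary choices, not the paper's), the following sketch draws one density from this histogram prior:

```python
import numpy as np
rng = np.random.default_rng(2)

def sample_histogram_density(a=0.7, a_n=1.0, max_level=12):
    """Draw a step-function density on [0,1]: pick the level n with geometric
    probabilities proportional to (1-a) a^n (up to an index shift), set
    N_n = 2^{m_n} with a nondecreasing m_n, then draw the bin masses
    from Dir(a_n, ..., a_n)."""
    n = int(rng.geometric(1 - a))              # P(n = k) = (1-a) a^{k-1}, k >= 1
    m_n = min(n, max_level)                    # a simple nondecreasing choice of m_n
    N_n = 2 ** m_n
    masses = rng.dirichlet(np.full(N_n, a_n))  # this is F / N_n in the text
    return N_n * masses                        # density heights on the N_n bins

heights = sample_histogram_density()
print(len(heights), heights.mean())            # mean 1.0 <=> integrates to 1
```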

3.2. Polya Tree Priors. The class of Polya tree distributions was described by Mauldin, Sudderth and Williams (1992) and Lavine (1992). Polya trees are special cases of tailfree processes (see Freedman, 1963, and Fabius, 1964). Consider the sequence of partitions $\{S_k\}_{k=1}^\infty$ of $[0,1]$ such that $S_1 = \{[0, 1/2], (1/2, 1]\}$ and each $S_k$ contains the left and right halves of all intervals in $S_{k-1}$ for $k > 1$. (For convenience, let $S_0 = \{[0,1]\}$.) For each interval $I \in S_k$ for $k = 0, 1, \ldots$, create a random variable $V_I$ taking values in $[0,1]$ and having mean $1/2$. Make all of the $V_I$ independent of each other. For each $I \in S_k$ with $k \ge 1$, define $p^1(I) \in S_{k-1}$ to be the interval $J$ in $S_{k-1}$ such that $I \subseteq J$, and set
$$W_I = \begin{cases} V_{p^1(I)} & \text{if } I \text{ is the left subinterval of } p^1(I), \\ 1 - V_{p^1(I)} & \text{if } I \text{ is the right subinterval of } p^1(I). \end{cases}$$
For $I \in S_k$ with $k \ge 2$, define $p^2(I) = p^1(p^1(I))$, and similarly define $p^i(I)$ for $i = 1, \ldots, k$. (For convenience, let $p^0(I) = I$ for all $I$.) Let $S = \bigcup_{k=0}^\infty S_k$, for each $I \in S$ define $K(I)$ to be that $k$ such that $I \in S_k$, and set
$$P(I) = \prod_{i=1}^{K(I)} W_{p^{i-1}(I)}.$$
The set function $P$ extends uniquely to the smallest $\sigma$-field containing $S$, which is the Borel $\sigma$-field, and becomes a random probability measure on the unit interval. Kraft (1964) shows that, if the distribution of $V_I$ becomes concentrated around $1/2$ sufficiently rapidly as $I$ shrinks (moves through later partitions $S_k$), then $P$ will have a density with respect to Lebesgue measure with probability 1. (For example, if each $V_I$ with $I \in S_k$ has a $\mathrm{Beta}(a_k, a_k)$ distribution, then the conditions of Kraft (1964) will be met if $\sum_{k=1}^\infty a_k^{-1} < \infty$.) In such a case, let $p$ denote the density of $P$, and let $f(x) = p(\bar{F}(x)) \bar{f}(x)$ for $x \in \mathbb{R}$. (That is, $f$ is the density of the probability on $\mathbb{R}$ induced from $P$ by the transformation $\bar{F}^{-1}$.) Lavine (1994, Theorem 2) and Ghosal, Ghosh and Ramamoorthi (1995, Theorem 3.1) prove results like the following.

Proposition 1. Suppose that for every $k$ and every $I \in S_k$, $V_I$ has a $\mathrm{Beta}(a_k, a_k)$ distribution, and that $I(f_0; \bar{f}) < \infty$. If $\sum_{k=1}^\infty a_k^{-1/2} < \infty$, then $\Pi(N_\epsilon) > 0$ for every $\epsilon > 0$.

In words, Assumption 1 can be satisfied by a Polya tree distribution so long as the prior predictive distribution is not infinitely far away from the true distribution. We show next that Assumption 2 can also be satisfied. We will construct a sequence of sets $\{\mathcal{F}_n\}_{n=1}^\infty$ as in Corollary 2. Let $P$ have a Polya tree prior distribution on $[0,1]$, and let $P$ have a density $p$ with probability 1. For each $y \in [0,1]$ and $k = 0, 1, \ldots$, let $I_k(y)$ be that interval in $S_k$ that contains $y$. Then $p(y) = \lim_{k\to\infty} 2^k \prod_{i=1}^k W_{I_i(y)}$. (See Kraft, 1964.) Let
$$\hat{p}_k(y) = 2^k \prod_{i=1}^k W_{I_i(y)},$$
the approximation to $p$ based on the first $k$ partitions.
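The approximation $\hat{p}_k$ is easy to simulate. The sketch below (ours; the split parameters $a_k = 8^{k\gamma}$ with $\gamma = 1.2$ anticipate the choice justified at the end of this subsection) draws the splitting variables level by level and accumulates the product along each dyadic path:

```python
import numpy as np
rng = np.random.default_rng(3)

def sample_polya_approx(k_max=10, gamma=1.2, grid=2 ** 10):
    """Simulate p_hat_{k_max} on a dyadic grid: at level k draw one
    V ~ Beta(a_k, a_k) per parent interval; the left child gets W = V and
    the right child W = 1 - V; a point's density is prod_k 2 W_{I_k(y)}."""
    dens = np.ones(grid)
    for k in range(1, k_max + 1):
        a_k = 8.0 ** (gamma * k)
        V = rng.beta(a_k, a_k, size=2 ** (k - 1))
        W = np.empty(2 ** k)
        W[0::2], W[1::2] = V, 1.0 - V
        dens *= 2.0 * np.repeat(W, grid // 2 ** k)
    return dens

p_hat = sample_polya_approx()
print(p_hat.mean())   # = 1.0 up to rounding: p_hat_k integrates to 1
```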

Suppose that $W_I \sim \mathrm{Beta}(a_k, a_k)$ for all $I \in S_k$. Let $\{g_k\}_{k=1}^\infty$ be a sequence of numbers such that $\sum_{k=1}^\infty 2^k g_k < \infty$, and let $e_k = \Pr(|2 W_I - 1| > g_k)$ for $I \in S_k$. Let $E_k = \sum_{i=k+1}^\infty 2^i e_i$ and $G_k = \sum_{i=k+1}^\infty 2^{i-1} g_i$. Then
(8) $$\Pi\left(\exists y : |p(y) - \hat{p}_k(y)| > G_k\right) \le E_k,$$
because $2 W_{I_i(y)} < 2$ for all $i$ and $y$, so that $\hat{p}_{i-1}(y) \le 2^{i-1}$ and $|\hat{p}_i(y) - \hat{p}_{i-1}(y)| = \hat{p}_{i-1}(y)\, |2 W_{I_i(y)} - 1|$; a union bound over the $2^i$ intervals at each level $i > k$ gives (8). If $G_k \le \epsilon^2/16$, then the event whose probability is bounded in (8) contains $\mathcal{F}_n^c$. Hence, (8) provides a bound on $\Pi(\mathcal{F}_n^c)$. The partition $S_k$ plays the role of $T_n$ in Lemma 6, and the cardinality of $S_k$ is $N_n = 2^k$. To satisfy the conditions of Corollary 2, we need $2^k \le n\epsilon^2/10$. So, let
$$k_n(\epsilon) = \left\lfloor \log_2\left(\frac{n\epsilon^2}{10}\right) \right\rfloor,$$
and set $T_n = S_{k_n(\epsilon)}$ to guarantee that part 2 of Assumption 2 holds. Choose the $a_k$ large enough that $\sum_{k=1}^\infty a_k^{-1/2} < \infty$ (so Proposition 1 says that Assumption 1 holds) and large enough that $\log E_k \le b_1 - b_2 2^k$ for some constants $b_1, b_2$ with $b_2 > 0$. This makes part 1 of Assumption 2 hold.

As an example of how to set the numbers $a_k$, consider the following. Let $W \sim \mathrm{Beta}(a, a)$, and let $Z = |2W - 1|$. Then, for $g \in (0,1)$,
$$\Pr(Z > g) = 2 \Pr\left(W > \frac{1+g}{2}\right) = 2\, \frac{\Gamma(2a)}{\Gamma(a)^2} \int_{(1+g)/2}^1 w^{a-1} (1-w)^{a-1}\, dw \le 2\, \frac{\Gamma(2a)}{\Gamma(a)^2} \left(\frac{1+g}{2}\right)^{a-1} \left(\frac{1-g}{2}\right)^{a} \le \frac{2\sqrt{a}}{\sqrt{\pi}}\, (1 - g^2)^a < \frac{2\sqrt{a}}{\sqrt{\pi}}\, \exp(-g^2 a),$$
where the second inequality (integrating the bound on the integrand) uses the monotonicity of the Beta density on $[1/2, 1]$, the next follows from the facts that
$$\Gamma(2a) = \frac{1}{\sqrt{\pi}}\, 2^{2a-1}\, \Gamma(a)\, \Gamma\left(a + \frac{1}{2}\right)$$

and $\Gamma(a + \frac{1}{2}) \le \Gamma(a)\sqrt{a}$, and the last follows from the fact that $(1-x)^y < \exp(-xy)$ for $0 < x < 1$ and $y > 0$. So we can let $g_k = 2^{-k\gamma}$ for some $\gamma > 1$ (so that $\sum_k 2^k g_k < \infty$) and let $a_k = 1/g_k^3$. It now follows that
$$e_k \le \frac{2\sqrt{a_k}}{\sqrt{\pi}} \exp(-g_k^2 a_k) = \frac{2}{\sqrt{\pi}}\, 8^{k\gamma/2} \exp\left(-2^{k\gamma}\right),$$
so
$$E_k \le \frac{2}{\sqrt{\pi}} \sum_{i=k+1}^\infty 2^i\, 8^{i\gamma/2} \exp\left(-2^{i\gamma}\right) = \frac{2}{\sqrt{\pi}}\, \left(2 \cdot 8^{\gamma/2}\right)^{k+1} \exp\left(-2^{(k+1)\gamma}\right) \sum_{i=k+1}^\infty \left(2 \cdot 8^{\gamma/2}\right)^{i-(k+1)} \exp\left(-2^{i\gamma} + 2^{(k+1)\gamma}\right).$$
Since the last sum is bounded, it follows that $\log(E_k) \le b_1 - b_2 2^k$. In summary, a Polya tree prior in which every $W_I$ for $I \in S_k$ has a $\mathrm{Beta}(a_k, a_k)$ distribution with $a_k = 8^{k\gamma}$, $\gamma > 1$, will satisfy the assumptions of Theorem 1 with the sequence $\{\mathcal{F}_n\}_{n=1}^\infty$ from Corollary 2, so long as $I(f_0; \bar{f}) < \infty$.

3.3. Infinite-Dimensional Exponential Families. Let $\theta = \{\theta_j\}_{j=1}^\infty$ be a sequence of independent random variables with $\theta_j \sim N(0, \sigma_j^2)$. Let $\{\phi_j(\cdot)\}_{j=1}^\infty$ be a sequence of orthogonal polynomials on $[0,1]$. Define
$$f_\theta(x) = \exp\left(\sum_{j=1}^\infty \theta_j \phi_j(x) - c(\theta)\right),$$
where $c(\theta)$ makes $f_\theta$ a density. This model for infinite-dimensional parameter spaces has been studied by Leonard (1978), Lenk (1988, 1991) and Barron (1988). Let $a_j = \sup_{0\le x\le 1} |\phi_j(x)|$ and $b_j = \sup_{0\le x\le 1} |\phi_j'(x)|$, which are finite since the $\phi_j$ are polynomials. Now choose the $\sigma_j$'s so that $\sum_j a_j \sigma_j < \infty$, to assure that $f_\theta$ is a density with probability 1, and so that $\sum_j b_j \sigma_j < \infty$. Let $A_{n,i} = [(i-1)/N, i/N)$, where $N = \lfloor n/\log(n) \rfloor$, for $i = 1, \ldots, N$. Let $T_n = \{A_{n,1}, \ldots, A_{n,N}\}$. Define
$$Y_{n,i} = \sup_{x,y \in A_{n,i}} |f_\theta(x) - f_\theta(y)|, \qquad X_{n,i} = \sup_{x,y \in A_{n,i}} \log\frac{f_\theta(x)}{f_\theta(y)}.$$
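Here is a sketch (ours) of one draw from this prior, taking the $\phi_j$ to be Legendre polynomials on $[0,1]$ (mapped from $[-1,1]$) and $\sigma_j = 2^{-j}$, an arbitrary choice that keeps $\sum_j a_j \sigma_j$ and $\sum_j b_j \sigma_j$ finite; $c(\theta)$ is computed by numerical normalization after truncating the series:

```python
import numpy as np
from numpy.polynomial import legendre
rng = np.random.default_rng(4)

def sample_exp_family_density(J=12, grid=4096):
    """Draw theta_j ~ N(0, sigma_j^2), j = 1..J, and return f_theta on a grid
    over [0,1], normalizing numerically (this subtracts c(theta))."""
    x = (np.arange(grid) + 0.5) / grid
    u = 2.0 * x - 1.0                            # map [0,1] to [-1,1]
    sigma = 2.0 ** -np.arange(1, J + 1)          # assumed prior scales sigma_j
    theta = rng.normal(0.0, sigma)
    log_f = np.zeros(grid)
    for j, t in enumerate(theta, start=1):
        coeffs = np.zeros(J + 1); coeffs[j] = 1.0
        log_f += t * legendre.legval(u, coeffs)  # phi_j = Legendre of degree j
    log_f -= np.log(np.exp(log_f).mean())        # numerical c(theta)
    return x, np.exp(log_f)

x, f = sample_exp_family_density()
print(f.mean())   # = 1.0 up to rounding: f_theta is a density
```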

Let $\eta > 0$. Since $X_{n,i} \le \eta$ implies $Y_{n,i} \le \exp(\eta) - 1$, we need to show that, with $\eta = d_n = \epsilon^2/32$ as in Corollary 2, there exist $c_1, c_2 > 0$ such that $\Pi(X_{n,j} > \log(1+\eta)) \le c_1 \exp(-c_2 n)$ for all but finitely many $n$. Let $\tau = \log(1+\eta)$. Now, for $x, y \in A_{n,i}$, write
$$\log\frac{f_\theta(x)}{f_\theta(y)} = \sum_{j=1}^\infty \theta_j [\phi_j(x) - \phi_j(y)] = \sum_{j=1}^\infty \theta_j \int_y^x \phi_j'(t)\, dt \le \sum_{j=1}^\infty b_j |\theta_j|\, |x - y| \le \frac{1}{N} \sum_{j=1}^\infty b_j |\theta_j|.$$
Let $Z \sim N(0,1)$. Then
$$\Pi(X_{n,j} > \tau) \le \Pi\left(\sum_{j=1}^\infty b_j |\theta_j| > N\tau\right) = \Pi\left(\exp\left[t \sum_{j=1}^\infty b_j |\theta_j|\right] > \exp[t N \tau]\right) \le \inf_{t\ge 0} \exp(-N\tau t)\, E\exp\left(t \sum_{j=1}^\infty b_j |\theta_j|\right) = \inf_{t\ge 0} \exp(-N\tau t) \prod_{j=1}^\infty E\exp(t b_j \sigma_j |Z|) = \inf_{t\ge 0} \exp(-N\tau t) \prod_{j=1}^\infty \left[2\Phi(b_j \sigma_j t) \exp(t^2 b_j^2 \sigma_j^2/2)\right] \le \exp\left(-N\tau t_0 + t_0^2 \sum_{j=1}^\infty b_j^2 \sigma_j^2/2\right) \prod_{j=1}^\infty \left[2\Phi(b_j \sigma_j t_0)\right],$$
where $\Phi$ is the standard normal distribution function and $t_0 \ge 0$. (Here we used the identity $E\exp(s|Z|) = 2\Phi(s)\exp(s^2/2)$ for $s \ge 0$.) Suppose that we choose the $\sigma_j$ so that $\sum_{j=1}^\infty b_j^2 \sigma_j^2 < \infty$. Let $t_0 = N\tau / \sum_{j=1}^\infty b_j^2 \sigma_j^2$. Then $\prod_{j=1}^\infty [2\Phi(b_j \sigma_j t_0)] = c_1 < \infty$, and
$$\Pi(X_{n,j} > \tau) \le c_1 \exp\left(-\frac{N^2 \tau^2}{2 \sum_{j=1}^\infty b_j^2 \sigma_j^2}\right).$$

Since $N^2 > n$ for all large $n$, this completes the proof that the prior probability of $\mathcal{F}_n^c$ is exponentially small, where $\mathcal{F}_n$ is as in Corollary 2. (A union bound over the $N$ intervals changes only the constant.)

Let 1 stand for the uniform density on $[0,1]$. We want to show that if $I(f_0; 1) < \infty$, then, for every $\epsilon > 0$, $\Pi(N_\epsilon) > 0$. First, suppose that there exist $m$ and $\theta_1^0, \ldots, \theta_m^0$ such that $\log f_0 = \sum_{j=1}^m \theta_j^0 \phi_j - c(\theta^0)$. Let $\rho(f, g) = \sup_{0\le x\le 1} |\log f(x) - \log g(x)|$ and let $B_\epsilon = \{f : \rho(f_0, f) \le \epsilon\}$. A simple calculation shows that $I(f; g) \le \rho(f, g)$. So it suffices to show that $\Pi(B_\epsilon) > 0$. Let $Z(r) = \{\theta = (\theta_1, \theta_2, \ldots) : \sum_{j=r}^\infty a_j |\theta_j| < \eta\}$. Then $\Pi(B_\epsilon) \ge \Pi(B_\epsilon \mid Z(r))\, \Pi(Z(r))$. Recall that the $\sigma_j$'s have been chosen so that $\sum_j a_j |\theta_j|$ is finite with probability 1. It follows that for any $\eta > 0$ there exists $r_0$ such that $\Pi(Z(r_0)) > 0$. Choose $\eta = \epsilon/2$ and choose $r = \max\{r_0, m+1\}$. For $\theta \in Z(r)$, and defining $\theta_j^0 = 0$ for $j > m$, we see that
$$\rho(f_0, f_\theta) = \sup_{0\le x\le 1} \left|\sum_j (\theta_j^0 - \theta_j)\, \phi_j(x)\right| \le \sum_j a_j |\theta_j^0 - \theta_j| \le \sum_{j=1}^r a_j |\theta_j^0 - \theta_j| + \eta.$$
So
$$\Pi(B_\epsilon \mid Z(r)) \ge \Pi\left(\sum_{j=1}^r a_j |\theta_j^0 - \theta_j| + \eta < \epsilon \,\Big|\, Z(r)\right) \ge \Pi\left(\sum_{j=1}^r a_j |\theta_j^0 - \theta_j| < \frac{\epsilon}{2} \,\Big|\, Z(r)\right) \ge \Pi\left(\sum_{j=1}^r a_j |\theta_j^0 - \theta_j| < \frac{\epsilon}{2}\right).$$
Since the marginal distribution of $(\theta_1, \ldots, \theta_r)$ has support over all of $\mathbb{R}^r$, we see that this latter event has positive probability. Thus $\Pi(B_\epsilon \mid Z(r)) > 0$.

Now consider any $f_0$ such that $I(f_0; 1) < \infty$. Lemma 7 says that for any $a > 0$ there exists a density $f^*$ such that $\log f^*$ is a polynomial of finite degree and such that $I(f_0; f^*) < a$. Further,
$$I(f_0; f_\theta) \le I(f_0; f^*) + \sup_{0\le x\le 1} \log\frac{f^*(x)}{f_\theta(x)} \le a + \rho(f^*, f_\theta).$$
Choose $a = \epsilon/2$ and note that $B_{\epsilon/2}^* \subseteq N_\epsilon$, where $B_{\epsilon/2}^* = \{f : \rho(f^*, f) \le \epsilon/2\}$. Since we have already shown that such sets have positive prior probability, the proof is complete.

3.4. Mixtures of Priors. In the examples earlier in this section, the posterior distributions are somewhat sensitive to the choice of the prior marginal distribution $\bar{F}$. In particular, the posterior predictive densities of future observations computed from histogram and Polya tree priors tend to have noticeable jumps at the boundaries of the sets in the partitions $T_n$ and $S_k$. Also, a choice of $\bar{F}$ that is particularly unlike the sample distribution of the data will make the convergence of the posterior very slow. One way to alleviate these problems is to use a mixture of prior distributions.

Suppose that we replace $\bar{F}$ by a family of distributions $\{\bar{F}_\gamma : \gamma \in \Gamma\}$, where $(\Gamma, \mathcal{G})$ is a measurable space and each $\bar{F}_\gamma$ is absolutely continuous. (One typical choice is a location/scale family that one thinks of as a first-order approximation to the distribution of the data.) Let $\mu$ be a prior probability measure on $(\Gamma, \mathcal{G})$. Let $G$ be a random variable such that, conditional on $G = \gamma$, the prior distribution on $(\mathcal{F}, \mathcal{C})$ is $\Pi_\gamma$, where $\Pi_\gamma$ is constructed just like a $\Pi$ in one of the other examples in this section with $\bar{F}_\gamma$ replacing $\bar{F}$. Let $\Pi_\gamma(\cdot \mid x^n)$ denote the conditional posterior distribution on $(\mathcal{F}, \mathcal{C})$ given $G = \gamma$ after observing $X^n = x^n$. Let $\mu(\cdot \mid x^n)$ denote the posterior distribution of $G$ given $X^n = x^n$. Then the posterior on $(\mathcal{F}, \mathcal{C})$ is
$$\Pi(B \mid x^n) = \int \Pi_\gamma(B \mid x^n)\, d\mu(\gamma \mid x^n).$$

To prove consistency of this posterior, we will make some additional assumptions. Let $\bar{f}_\gamma$ be the density of $\bar{F}_\gamma$, and let $f_0'$ be the density of $P_0$ on $\mathcal{X} = \mathbb{R}$ (before transforming the data to lie in $(0,1)$). Assume that $\mu(\{\gamma : I(f_0'; \bar{f}_\gamma) < \infty\}) = 1$. Suppose that $\Pi_\gamma(\cdot \mid x^n)$ is uniformly consistent a.s. $[\mu]$; that is, there is a subset $B$ of $\mathcal{X}^\infty$ with $P_0(B) = 1$ such that for every $x^\infty \in B$ and $\epsilon > 0$, there exist $B_{x^\infty} \subseteq \Gamma$ and $N(x^\infty)$ such that $\mu(B_{x^\infty}) = 1$ and $n \ge N(x^\infty)$ implies $\Pi_\gamma(A_\epsilon \mid x^n) > 1 - \epsilon$ for all $\gamma \in B_{x^\infty}$.

First, note that the conditional distribution of $X^n$ given $G = \gamma$ is absolutely continuous with respect to $\lambda^n$ with density $g(x^n \mid \gamma) = \int_{\mathcal{F}} \prod_{i=1}^n f(x_i)\, d\Pi_\gamma(f)$. By Bayes' theorem (see Schervish, 1995, Theorem 1.31), for each $n$, $\mu(\cdot \mid x^n) \ll \mu$ with probability 1 under $M_n$, the marginal distribution of $X^n$. Next, note that in the earlier examples in this section, we transformed the data to the interval $(0,1)$ using $\bar{F}$. The resulting distribution for the density $f$ of the transformed data gave probability 1 to the set of densities that are strictly positive on all of $(0,1)$. This, together with $I(f_0'; \bar{f}_\gamma) < \infty$, implies that for all $n$, the $n$-fold product of $P_0$ on $\mathcal{X}^n$ is absolutely continuous with respect to $M_n$. Let $C$ be the set of $x^\infty$ such that $\mu(\cdot \mid x^n) \ll \mu$ for all $n$. Since $M_n(C) = 1$, then $P_0(C) = 1$ and $P_0(C \cap B) = 1$. For each $x^\infty \in C \cap B$, $n \ge N(x^\infty)$ implies
$$\Pi(A_\epsilon \mid x^n) = \int \Pi_\gamma(A_\epsilon \mid x^n)\, d\mu(\gamma \mid x^n) \ge \int_{B_{x^\infty}} \Pi_\gamma(A_\epsilon \mid x^n)\, d\mu(\gamma \mid x^n) > (1 - \epsilon)\, \mu(B_{x^\infty} \mid x^n) = 1 - \epsilon.$$
Since $\epsilon$ is arbitrary, this proves that $P_0(\lim_{n\to\infty} \Pi(A_\epsilon \mid x^n) = 1) = 1$.

Uniform consistency is difficult to verify in general, and we do not believe that it is a necessary condition. (For example, Ghosal, Ghosh and Ramamoorthi, 1995, use a continuity condition instead of uniform consistency to prove a weaker form of consistency for location mixtures of symmetric Polya trees.) On the other hand, uniform consistency does hold in the simple case in which $\Gamma$ is a finite set. For example, $\Gamma$ might index a Polya tree, an exponential family, and a histogram. The posterior, like the prior, would then be a mixture of these three types of distributions.

3.5. A Counterexample. In this section, we present an example in which Assumption 1 holds but Assumption 2 fails and the posterior is inconsistent. This example is modeled after an example of Barron (1989, Section 5). The idea of the example is that the prior $\Pi$ is split evenly between disjoint sets of densities $\mathcal{F}^0$ and $\tilde{\mathcal{F}}$. There are densities $f \in \mathcal{F}^0$ such that $I(f_0; f)$ is arbitrarily small, and $\Pi(N_\epsilon) > 0$ for all $\epsilon > 0$. The densities in $\tilde{\mathcal{F}}$, however, are very far from $f_0$ in Hellinger distance, yet they track the data sufficiently well to acquire significant posterior probability infinitely often. For each positive integer $N$, let
$$T_N = \left\{\left[0, \frac{1}{2N^2}\right), \left[\frac{1}{2N^2}, \frac{2}{2N^2}\right), \ldots, \left[\frac{2N^2 - 1}{2N^2}, 1\right]\right\}$$

be a partition of $[0,1]$. Let $\mathcal{F}_N$ be the set of density functions that are constant on every interval in $T_N$ and that assume only the two values 0 and 2. The cardinality of $\mathcal{F}_N$ is $q_N = \binom{2N^2}{N^2}$. Let $a_N = N^{-2}/c_0$, where $c_0 = \sum_{N=1}^\infty 1/N^2$. Our prior distribution will place probability $a_N/[2 q_N]$ on each density in $\mathcal{F}_N$ for all $N$. The other half of the probability is distributed as follows. Let $U_0$ be the set of all normal distributions with variance 1 and mean between 0 and $\sqrt{2}$. Denote the mean as $\sqrt{2\theta}$, where $\theta \in (0,1)$ has prior distribution with density
$$\frac{1}{c_1 \theta} \exp\left(-\frac{1}{\theta}\right) I_{(0,1)}(\theta), \qquad \text{where } c_1 = \int_0^1 \frac{1}{\theta} \exp\left(-\frac{1}{\theta}\right) d\theta.$$
Let $\mathcal{F}^0$ be the collection of all distributions on $[0,1]$ obtained by transforming the distributions in $U_0$ by the standard normal cumulative distribution function $\Phi$. That is, each density $f \in \mathcal{F}^0$ is of the form $f(x \mid \theta) = \exp[-\theta + \sqrt{2\theta}\, \Phi^{-1}(x)]$ for $0 < x < 1$.

Now suppose that $f_0(x) \equiv 1$ is the uniform density on $[0,1]$ and that $\{X_n\}_{n=1}^\infty$ are i.i.d. with this density. We will show that Assumption 1 holds but that the posterior is not consistent in the Hellinger metric. First, note that $I(f_0; f(\cdot \mid \theta)) = \theta$, and the prior on $\theta$ puts positive probability on every neighborhood of 0, which means that $\Pi$ puts positive probability on every set $N_\epsilon$. So Assumption 1 holds. Second, let $\tilde{\mathcal{F}} = \bigcup_{N=1}^\infty \mathcal{F}_N$. Each $f \in \tilde{\mathcal{F}}$ satisfies $d(f_0, f) = \sqrt{2 - \sqrt{2}}$, so $A_\epsilon \cap \tilde{\mathcal{F}} = \emptyset$ for all $\epsilon < \sqrt{2 - \sqrt{2}}$. We will prove that, almost surely, $\Pi(\tilde{\mathcal{F}} \mid x^n) \ge 1/2$ infinitely often, hence the conclusion of Theorem 1 fails.

If $N^2 \ge n$, then there are at least $\binom{2N^2 - n}{N^2 - n}$ densities $f \in \mathcal{F}_N$ such that $\prod_{i=1}^n f(x_i) = 2^n$. So
(9) $$\int_{\tilde{\mathcal{F}}} \prod_{i=1}^n f(x_i)\, d\Pi(f) \ge \sum_{N = \lceil \sqrt{n}\, \rceil}^\infty \frac{a_N}{2 q_N}\, 2^n \binom{2N^2 - n}{N^2 - n} = 2^{n-1} \sum_{N = \lceil \sqrt{n}\, \rceil}^\infty a_N\, \frac{\binom{2N^2 - n}{N^2 - n}}{\binom{2N^2}{N^2}} \ge \sum_{N = \lceil \sqrt{n}\, \rceil}^\infty \frac{a_N}{2} \left(1 - \frac{n}{N^2}\right)^n \ge \frac{a_n}{2} \left(1 - \frac{n}{n^2}\right)^n \ge \frac{\exp(-2)}{2 c_0 n^2},$$
for all but finitely many $n$, simultaneously for all possible data.
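The middle steps of (9) are data-free and can be checked numerically. The sketch below (ours; `N_max` truncates the infinite series) sums $(a_N/2)\, 2^n\, \binom{2N^2-n}{N^2-n}/\binom{2N^2}{N^2}$ over $N^2 \ge n$, computing the binomial ratio as a stable product, and compares the total with $e^{-2}/(2 c_0 n^2)$:

```python
import numpy as np

C0 = np.pi ** 2 / 6                          # c_0 = sum_N 1/N^2

def tracking_mass_bound(n, N_max=400):
    """Lower bound from (9): sum over N with N^2 >= n of
    (a_N/2) * 2^n * C(2N^2-n, N^2-n)/C(2N^2, N^2), where the binomial
    ratio equals prod_{i<n} (N^2-i)/(2N^2-i)."""
    total = 0.0
    for N in range(int(np.ceil(np.sqrt(n))), N_max):
        M = N * N
        log_ratio = np.log([(M - i) / (2 * M - i) for i in range(n)]).sum()
        total += (1.0 / (M * C0) / 2.0) * np.exp(n * np.log(2.0) + log_ratio)
    return total

for n in [2, 5, 10, 20]:
    print(n, tracking_mass_bound(n), np.exp(-2) / (2 * C0 * n * n))
    # the first printed value dominates the second, as (9) asserts
```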

Next, set $r_n = 2 + \log(c_0) + 2\log(n) - \log(c_1)$, and note that $1/\theta + n\theta + \log\theta \ge r_n$ simultaneously for all $\theta \in (0,1)$ if $n$ is large enough. It follows that, for all data and all sufficiently large $n$,
$$\int_{\mathcal{F}^0} \prod_{i=1}^n f(x_i)\, d\Pi(f) = \frac{1}{2 c_1} \int_0^1 \frac{1}{\theta} \exp\left(-\frac{1}{\theta} - n\theta + \sqrt{2\theta} \sum_{i=1}^n \Phi^{-1}(x_i)\right) d\theta \le \frac{1}{2 c_1} \exp\left(-r_n + \sqrt{2} \max\left\{0, \sum_{i=1}^n \Phi^{-1}(x_i)\right\}\right).$$
We know that, with probability 1, $\max\{0, \sum_{i=1}^n \Phi^{-1}(x_i)\} = 0$ infinitely often, according to the law of the iterated logarithm. It follows that, with probability 1,
(10) $$\int_{\mathcal{F}^0} \prod_{i=1}^n f(x_i)\, d\Pi(f) \le \frac{1}{2 c_1} \exp(-r_n) = \frac{\exp(-2)}{2 c_0 n^2} \quad \text{i.o.}$$
Since $\Pi(\mathcal{F}^0 \cup \tilde{\mathcal{F}} \mid x^n) = 1$, it follows from (9) and (10) that
$$P_0\left(\Pi(\tilde{\mathcal{F}} \mid x^n) \ge \frac{1}{2}\ \text{i.o.}\right) = 1.$$

4. An Alternative Approach to Consistency. In an unpublished manuscript, Barron (1988) extended the results in Schwartz (1960, 1965). In this section, we briefly outline Barron's results. Recall that a sequence $a_n$ is exponentially small if there exists $r > 0$ such that $a_n < e^{-nr}$ for all large $n$.

Proposition 2. Suppose that Assumption 1 holds. Then $\Pi(A_\epsilon^c \mid X^n)$ is exponentially small if and only if there exist sequences of measurable sets $B_n, C_n \in \mathcal{C}$ and $S_n \in \mathcal{B}^n$ such that
(i) $A_\epsilon \cup B_n \cup C_n = \mathcal{F}$,
(ii) $\Pi(B_n)$ is exponentially small,
(iii) $P_0((X_1, \ldots, X_n) \in S_n\ \mathrm{i.o.}) = 0$ and
(iv) $\sup_{P \in C_n} P((X_1, \ldots, X_n) \in S_n^c)$ is exponentially small.

Conditions (iii) and (iv) imply that the $S_n$ are the critical sets for a consistent test of $H_0: P = P_0$ versus $H_1: P \in C_n$. A uniformly exponentially consistent (UEC) test for $H_0$ versus $H_1$ (that is, a test with exponentially small type I and type II errors) would automatically satisfy (iii) and (iv). If $W$ is a weak neighborhood of $P_0$, then set $A_n = W$, $B_n = \emptyset$ and $C_n = W^c$. Hoeffding and Wolfowitz (1958) and Barron (1989, p. 108) show that UEC tests exist for this case, so Proposition 2 says that the posterior asymptotically concentrates in weak neighborhoods of $P_0$ under Assumption 1. Indeed, the posterior in the example in Section 3.5 concentrates in weak neighborhoods of the uniform distribution. For example, many of the distributions in $\mathcal{F}_N$ for large $N$ that take the value 2 on intervals that are scattered throughout $[0,1]$ will be close to the uniform distribution in the sense of weak convergence.

Let $\{T_n\}_{n=1}^\infty$ be a sequence of measurable partitions, and let $k_n(\delta)$ be the smallest number of sets in $T_n$ whose union has probability at least $1 - \delta$ under $P_0$. Say that the effective cardinality of $T_n$ (written $\mathrm{card}(T_n)$) is $O(n)$ if, for every $\delta > 0$, $\limsup_{n\to\infty} k_n(\delta)/n < \infty$. Let $\mathcal{D}_n$ be the $\sigma$-field generated by $T_n$. Let $d_n$ be the Hellinger distance restricted to $\mathcal{D}_n$. Given a density $f$ with corresponding probability measure $P$, let $f^n(\cdot) = dP(\cdot \mid \mathcal{D}_n)/d\lambda$. The sequence $T_n$ is rich if $\lim_{n\to\infty} d(f, f^n) = 0$ for all densities $f$ with respect to $\lambda$. Assumption 1 implies that $\Pi(\{f : d_n(f_0, f) > \epsilon\} \mid X^n)$ is exponentially small almost surely as long as $\mathrm{card}(T_n) = O(n)$. Furthermore, if $T_n$ is rich, then $\lim_{n\to\infty} d(f_0, \hat{f}_n) = 0$ almost surely, where $\hat{f}_n$ is the predictive density. Also, Assumption 1 implies Cesaro convergence in Kullback-Leibler distance, i.e., $\lim_{n\to\infty} n^{-1} \sum_{i=1}^n E[I(f_0; \hat{f}_i)] = 0$.

To obtain results like our Theorem 1, Barron (1988) introduces two conditions, called Conditions (A) and (B). Condition (A) is our Assumption 1. Condition (B) holds if there exists a rich sequence $T_n$ with $\mathrm{card}(T_n) = O(n)$ such that, for every $\epsilon > 0$, $\Pi(\{f : d(f, f^n) > \epsilon\})$ is exponentially small. If Conditions (A) and (B) hold, Barron (1988) applies Proposition 2 with $A_\epsilon = \{f : d(f_0, f) \le \epsilon\}$, $B_n = \{f : d(f, f^n) > \epsilon/4\}$ and $C_n = A_\epsilon^c \setminus B_n$. The set $C_n$ is contained in $D_n = \{f : d_n(f_0, f) > \epsilon/2\}$, and Barron (1989) showed that there exists a UEC test for $H_0: P = P_0$ versus $H_1: P \in D_n$. The sets $B_n$ are crucial, since there does not exist a UEC test against $\{f : d(f_0, f) > \epsilon\}$.

5. Discussion. We have shown that, to check for consistency of the posterior, one need only check two simple conditions. Geman and Hwang (1982) and Wong and Shen (1995) give results on consistency of sieve maximum likelihood estimators (MLEs). Some of their conditions are similar to ours. One major difference between proving consistency for MLEs and for posterior distributions using sieves is that, for MLEs, the sieve plays a crucial role in the definition of the MLE. That is, the MLE is the element of $\mathcal{F}_n$ that leads to the largest value of the likelihood function. If one changes to a different sieve, the sequence of MLEs will change. On the other hand, when using sieves to prove consistency of posterior distributions, only the prior distribution and likelihood affect the posterior. The particular sieve used to prove consistency is only a tool for the proof. Of course, some sieves are easier to work with than others, but they do not figure in the computation of posterior probabilities.

We have not discussed rates of convergence in this paper. It is possible to compute rates of convergence by replacing the fixed $\epsilon$ in Theorem 1 with a decreasing sequence $\{\epsilon_n\}_{n=1}^\infty$. However, Lemma 2, which was used to place a lower bound on the denominator of the expression in Bayes' theorem, is not useful in this case. Instead, one might use the law of the iterated logarithm. This is the approach used in Shen (1995), who, furthermore, uses an empirical process technique to bound the numerator of the expression in Bayes' theorem. It remains an open question whether our Theorem 1, in conjunction with a modified Lemma 2, will produce the same rates as those of Shen (1995). In particular, Assumption 1 would need to be modified to put a lower bound on the rate at which $\Pi(N_{\epsilon_n})$ goes to 0. Such an assumption would be much more difficult to check, even in the examples presented in this paper.

APPENDIX

Proof of Lemma 7. Fix $\epsilon \in (0,1)$. Let $h = \log dP/d\lambda$ and let $A_b = \{x : |h| < b\}$, $A_b^+ = \{x : h \ge b\}$ and $A_b^- = \{x : h < -b\}$. Define $\psi(x) = h(x) I_{A_b}(x) + b I_{A_b^+}(x) - b I_{A_b^-}(x)$. Choose $b$ such that $E(|h|\, I_{\{|h| \ge b\}}) < \epsilon$, where the expectation is with respect to $\lambda$. Thus $E(|h - \psi|) < \epsilon$.