
Computational Statistics & Data Analysis 51 (2007) 2559 – 2572 www.elsevier.com/locate/csda

Quantile estimation in two-phase sampling

María del Mar Rueda a,∗, Antonio Arcos a, Juan Francisco Muñoz a, Sarjinder Singh b

a Department of Statistics and O.R., University of Granada, 18071 Granada, Spain
b Department of Statistics, St. Cloud State University, 720 Fourth Avenue South, St. Cloud, MN 56301-4498, USA

Received 12 January 2005; received in revised form 3 January 2006; accepted 3 January 2006 Available online 24 January 2006

Abstract

The estimation of quantiles in two-phase sampling with arbitrary sampling design in each of the two phases is investigated. Several ratio and exponentiation type estimators that provide the optimum estimate of a quantile based on an optimum exponent are proposed. Properties of these estimators are studied under large sample size approximation and the use of double sampling for stratification to estimate quantiles can also be seen. The real performance of these estimators will be evaluated for the three quartiles on the basis of data from two real populations using different sampling designs. The simulation study shows that proposed estimators can be very satisfactory in terms of relative bias and efficiency.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Auxiliary information; Finite population quantiles; Two-phase sampling; Stratified random sampling

1. Introduction

The problem of estimating a population mean in the presence of an auxiliary variable has been widely discussed in the finite population sampling literature. However, for the problem of estimating a population median, the situation is quite different and only recently has this problem been discussed. Rao et al. (1990) proposed ratio and difference estimators for the median using a design-based approach. Kuk and Mak (1989) proposed two estimators that require only the median of the auxiliary variable for the whole population. More recently, Rueda et al. (1998) and Rueda and Arcos (2001) proposed confidence intervals for quantiles based on ratio and difference estimators of the distribution function. In Rueda et al. (2003, 2004) the population information is used through a quantile of the auxiliary variable, of the same or a different order as the quantile of the main variable, by means of difference type estimators.

The above estimators are based on prior knowledge of the median $Q_x(0.5)$ of the auxiliary characteristic. In many cases $Q_x(0.5)$ may not be known, and selecting the sample in two phases is then an attractive solution. Two-phase sampling is a good compromise for surveys in which no prior knowledge is available about the population. A key to successful two-phase sampling is the creation of a highly informative frame for the part of the population from which the subsample is drawn.

The estimation of the median in two-phase sampling is developed by Singh et al. (2001), Singh (2003) and Allen et al. (2002). Swamy et al. (2005) have shown that auxiliary information, without knowing its true functional form, can also be used to reduce the bias when estimating the relation between the federal funds and the Federal Reserve's expectations about future values of certain policy variables. These papers have been developed using simple random sampling. Sampling surveys for economic variables (such as income) that possess highly skewed distributions are almost always complex in structure, and methods such as stratification and probability proportional to size sampling are commonplace. In this article we propose various estimators of a $\beta$-quantile in two-phase sampling with arbitrary sampling designs in each of the two phases.

∗ Corresponding author. Departamento de Estadística e I.O., Facultad de Ciencias, Avda. Fuentenueva, Universidad de Granada, 18071, Granada, Spain. Tel.: +34 958240494; fax: +34 958243267.

0167-9473/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2006.01.002

2. Quantile estimation in two-phase sampling

This study has been carried out under the fixed population approach. Let $U$ be a finite population with $N$ different elements, where $y_1, \ldots, y_N$ are the values of the variable of interest $y$, and

$$F_y(t) = N^{-1} \sum_{i=1}^{N} \Delta(t - y_i), \quad -\infty < t < \infty,$$

is the population distribution function, where $\Delta(a)$ takes the value 1 if $a \ge 0$ and the value 0 otherwise. Let $x$ be an auxiliary variable and $x_i$ $(i = 1, \ldots, N)$ be its value for the $i$th population unit.

The first-phase sample $s'$ of size $n'$ is drawn according to a sampling design $d_1$, such that $p_{d_1}(s')$ is the probability that $s'$ is chosen, and the corresponding first and second order inclusion probabilities are $\pi'_i$ and $\pi'_{ij}$ for $i, j \in U$. For the elements in $s'$, information on the auxiliary variable can be recorded. Given $s'$, the second-phase sample $s$ of size $n$ is drawn according to the design $d_2$, such that $p(s/s')$ is the conditional probability of choosing $s$. The inclusion probabilities under this design are denoted by $\pi_{i/s'}$ and $\pi_{ij/s'}$.

A particular case arises when the variable $x$ is used to stratify $s'$ into $L$ strata denoted by $s'_h$ $(h = 1, \ldots, L)$, with $n'_h$ elements in the $h$th stratum. In this way, a sample $s_h$ of size $n_h$ can be drawn from $s'_h$ according to a design $p_h(\cdot/s')$ independently from each stratum. The final sample is $s = \bigcup_{h=1}^{L} s_h$. This particular design is called two-phase sampling for stratification.

2.1. Direct estimation

Without using auxiliary information, the natural candidate to estimate the $\beta$-quantile $Q_y(\beta)$ is

$$\hat{Q}_y(\beta) = \inf\{t \mid \hat{F}_{HTy}(t) \ge \beta\} = \hat{F}^{-1}_{HTy}(\beta),$$

where $\hat{F}_{HTy}(t) = N^{-1} \sum_{i \in s} \Delta(t - y_i)/\pi_i$ is the Horvitz and Thompson (1952) type estimator of $F_y(t)$ and the inclusion probability of the $i$th element is given by $\pi_i = \sum_{s' \ni i} p_{d_1}(s')\, \pi_{i/s'}$. Consequently, to determine $\pi_i$ we must know the probabilities $\pi_{i/s'}$ for every $s'$, which we ordinarily do not, because $\pi_{i/s'}$ may depend on the outcome of phase one (for example, if the second-phase sample is drawn with probabilities proportional to an auxiliary variable).

Because the Horvitz–Thompson estimator of a mean cannot always be used in practice in two-phase sampling, Särndal et al. (1992) proposed the use of $\pi^*$-estimators. Using this idea, we introduce the quantities

$$\pi'_i = \sum_{s' \ni i} p_{d_1}(s'), \quad \pi'_{ij} = \sum_{s' \ni i,j} p_{d_1}(s'), \quad \pi^*_i = \pi'_i\, \pi_{i/s'} \quad \text{and} \quad \pi^*_{ij} = \pi'_{ij}\, \pi_{ij/s'},$$

to define the $\pi^*$-estimator of the distribution function as

$$\hat{F}^*_{HTy}(t) = \frac{1}{N} \sum_{i \in s} \frac{\Delta(t - y_i)}{\pi^*_i},$$

and thus, we suggest the following direct estimator of the $\beta$-quantile:

$$\hat{Q}^*_y(\beta) = \hat{F}^{*-1}_{HTy}(\beta). \qquad (1)$$

Note that $\hat{Q}^*_y(\beta)$ does not generally agree with the estimator $\hat{Q}_y(\beta)$ except in rare cases, but it makes direct calculation possible for all sampling designs $d_1$ and $d_2$ used in each phase.
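The direct estimator (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports that its computations were done in R); the function names `pi_star_cdf` and `pi_star_quantile` are hypothetical, and the sketch assumes that the products $\pi^*_i = \pi'_i \pi_{i/s'}$ have already been computed for the sampled units.

```python
from bisect import bisect_left
from itertools import accumulate


def pi_star_cdf(t, y, pi_star, N):
    """pi*-estimator of F_y(t): (1/N) * sum over the sample of Delta(t - y_i) / pi*_i."""
    return sum(1.0 / p for yi, p in zip(y, pi_star) if yi <= t) / N


def pi_star_quantile(beta, y, pi_star, N):
    """Direct estimator (1): inf{t : F*_HTy(t) >= beta}.

    The estimated distribution function is a step function that jumps
    only at sampled y values, so the infimum is searched over those values.
    """
    pairs = sorted(zip(y, pi_star))
    # Value of the estimated distribution function at each sorted y value.
    cum = list(accumulate(1.0 / (p * N) for _, p in pairs))
    idx = bisect_left(cum, beta)
    # Assumption of this sketch: if the estimated function never reaches beta
    # (the weights need not sum to N), return the largest sampled value.
    idx = min(idx, len(pairs) - 1)
    return pairs[idx][0]


# Toy data: four sampled units, each with pi*_i = 0.5, population size N = 8.
y_sample = [3.0, 1.0, 2.0, 4.0]
pi_star_sample = [0.5, 0.5, 0.5, 0.5]
median_estimate = pi_star_quantile(0.5, y_sample, pi_star_sample, 8)
```

Searching only over the sampled values is enough because the estimated distribution function is constant between them.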


2.2. Properties of the $\hat{Q}^*_y(\beta)$ estimator

We now study the properties of the $\hat{Q}^*_y(\beta)$ estimator. For this, a linear approximation is needed because $\hat{Q}^*_y(\beta)$ is not a continuous function. The estimator $\hat{Q}^*_y(\beta)$ can be expressed asymptotically as a linear function of the estimated distribution function evaluated at the quantile $Q_y(\beta)$ by the Bahadur representation (see Chambers and Dunstan, 1986):

$$\hat{Q}^*_y(\beta) - Q_y(\beta) = \frac{1}{f_y(Q_y(\beta))} \left( \beta - \hat{F}^*_{HTy}(Q_y(\beta)) \right) + O(n^{-1/2}), \qquad (2)$$

where $f_y(\cdot)$ denotes the derivative of the limiting value of $F_y(\cdot)$ as $N \to \infty$. This linear approximation, previously used by Kuk and Mak (1989) and Chen and Wu (2002), helps to study the asymptotic properties of the estimator.

On the one hand, the estimator $\hat{Q}^*_y(\beta)$ is asymptotically unbiased because $\hat{F}^*_{HTy}(t)$ is an unbiased estimator of $F_y(t)$. In this way, $E\left[\beta - \hat{F}^*_{HTy}(Q_y(\beta))\right] = 0$, and by using (2) it can be seen that $E\left[\hat{Q}^*_y(\beta)\right] = Q_y(\beta) + O(n^{-1/2})$.

On the other hand, from (2) we obtain the asymptotic variance of $\hat{Q}^*_y(\beta)$, to the first degree of approximation, as

$$V\left(\hat{Q}^*_y(\beta)\right) = \frac{1}{N^2 f_y^2(Q_y(\beta))} \left( \sum_{i,j \in U} (\pi'_{ij} - \pi'_i \pi'_j) \frac{\Delta(Q_y(\beta) - y_i)}{\pi'_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi'_j} + E_{d_1}\left[ \sum_{i,j \in s'} (\pi_{ij/s'} - \pi_{i/s'} \pi_{j/s'}) \frac{\Delta(Q_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi^*_j} \right] \right),$$

and one can construct an unbiased estimator of the variance as

$$\hat{V}\left(\hat{Q}^*_y(\beta)\right) = \frac{1}{N^2 f_y^2(Q_y(\beta))} \left( \sum_{i,j \in s} \frac{\pi'_{ij} - \pi'_i \pi'_j}{\pi^*_{ij}} \frac{\Delta(\hat{Q}^*_y(\beta) - y_i)}{\pi'_i} \frac{\Delta(\hat{Q}^*_y(\beta) - y_j)}{\pi'_j} + \sum_{i,j \in s} \frac{\pi_{ij/s'} - \pi_{i/s'} \pi_{j/s'}}{\pi_{ij/s'}} \frac{\Delta(\hat{Q}^*_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(\hat{Q}^*_y(\beta) - y_j)}{\pi^*_j} \right).$$

An approximate value of $f_y(Q_y(\beta))$ can be obtained by applying standard methods such as the kernel or the $k$th nearest neighbour methods (Silverman, 1986). The variance estimator is stated in an explicit form (it does not depend on the expected value over the first phase design), thus making direct calculation possible.

3. Estimation using auxiliary information

In the previous section an estimator is defined without using auxiliary information. We now define a class of estimators that takes the auxiliary variable into account.

Assuming simple random sampling without replacement (SRSWOR) and that the median of the variable $x$ is known, Kuk and Mak (1989) proposed a ratio estimator for the population median as

$$\hat{Q}_{ry}(0.5) = \hat{Q}_y(0.5)\, \frac{Q_x(0.5)}{\hat{Q}_x(0.5)}.$$

Furthermore, Kuk and Mak (1989) proposed other estimators of quantiles under the SRSWOR design, called position and stratification estimators, but extending them to more complex sampling designs is very difficult. Rueda et al. (2003, 2004) proposed, for any sampling design $d$ and for any $\beta$, difference and exponentiation methods to estimate a $\beta$-quantile. Singh et al. (2001) suggested ratio, regression, position and stratification estimators of the median when the sample is drawn in two phases, using SRSWOR in both phases. Under this sampling design, Allen et al. (2002) proposed two classes of estimators for the population median using information on two auxiliary variables $x$ and $z$ in double sampling when the population median of $z$ is known.


3.1. Proposed estimators

Here, we present a class of estimators of finite population quantiles when the sample is drawn using the general two-phase sampling described earlier, as

$$\hat{Q}^H_y(\beta) = H\left(\hat{Q}^*_y(\beta), t^*\right), \qquad (3)$$

with $t^* = \hat{Q}^*_x(\beta)/\hat{Q}'_x(\beta)$, and $\hat{Q}'_x(\beta)$ being the estimator of $Q_x(\beta)$ from the first stage of sampling, i.e. $\hat{Q}'_x(\beta) = \inf\{t \mid \hat{F}'_{HTx}(t) \ge \beta\} = \hat{F}'^{-1}_{HTx}(\beta)$, where $\hat{F}'_{HTx}(t) = N^{-1} \sum_{i \in s'} \Delta(t - x_i)/\pi'_i$. The function $H$ satisfies the following conditions:

(1) It assumes values in a closed convex subset $C \subset \mathbb{R}^2$ which contains the point $(Q_y(\beta), 1)$;
(2) $H$ is a continuous function in $C$ such that $H(Q_y(\beta), 1) = Q_y(\beta)$; and
(3) The first and second order partial derivatives of $H$ exist and are also continuous in $C$, with

$$H_{10}(Q_y(\beta), 1) = \left. \frac{\partial H(q, t^*)}{\partial q} \right|_{(q, t^*) = (Q_y(\beta), 1)} = 1.$$

A particular case within the general class of estimators $H$ is the ratio type estimator

$$\hat{Q}^*_{yr}(\beta) = \hat{Q}^*_y(\beta)\, \frac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)},$$

which corresponds to the choice $H(q, t^*) = q/t^*$. Another estimator of the $\beta$-quantile, called the exponentiation estimator, can be derived from

$$\hat{Q}^*_{ye}(\beta) = \hat{Q}^*_y(\beta) \left( \frac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)} \right)^{\alpha},$$

with $\alpha$ as a fixed constant, which corresponds to the choice $H(q, t^*) = q/(t^*)^{\alpha}$.

Note 1. If $\alpha = 0$ then $\hat{Q}^*_{ye}(\beta) = \hat{Q}^*_y(\beta)$, i.e. $\hat{Q}^*_{ye}(\beta)$ coincides with the $\pi^*$-estimator; if $\alpha = 1$ then $\hat{Q}^*_{ye}(\beta) = \hat{Q}^*_{yr}(\beta)$; and if $\alpha = -1$ then $\hat{Q}^*_{ye}(\beta) = \hat{Q}^*_{yp}(\beta)$, which we can define as a product estimator.

Note 2. If SRSWOR sampling is used in each phase and $\beta = 0.5$, the proposed estimators $\hat{Q}^*_{yr}(\beta)$ and $\hat{Q}^*_{ye}(\beta)$ lead, respectively, to the estimators $\hat{M}_y^{(a)}$ and $\hat{M}_y^{(b)}$ proposed by Singh et al. (2001).

3.2. Properties of the class of estimators

Any estimator in $H$ is asymptotically unbiased for $Q_y(\beta)$. This result can be obtained from the following expressions:

$$\hat{Q}^*_y(\beta) - Q_y(\beta) = \frac{1}{f_y(Q_y(\beta))} \left( \beta - \hat{F}^*_{HTy}(Q_y(\beta)) \right) + O(n^{-1/2}),$$

$$\hat{Q}^*_x(\beta) - Q_x(\beta) = \frac{1}{f_x(Q_x(\beta))} \left( \beta - \hat{F}^*_{HTx}(Q_x(\beta)) \right) + O(n^{-1/2}),$$

$$\hat{Q}'_x(\beta) - Q_x(\beta) = \frac{1}{f_x(Q_x(\beta))} \left( \beta - \hat{F}'_{HTx}(Q_x(\beta)) \right) + O(n'^{-1/2}),$$

and by using the first order Taylor's series expansion for $H$ about the point $(Q_y(\beta), 1)$:

$$\hat{Q}^H_y(\beta) = H(Q_y(\beta), 1) + \left( \hat{Q}^*_y(\beta) - Q_y(\beta) \right) H_{10}(Q_y(\beta), 1) + (t^* - 1)\, H_{01}(Q_y(\beta), 1) + O(n^{-1}), \qquad (4)$$

where $H_{10}$ and $H_{01}$ denote the first order partial derivatives of $H$ with respect to $q$ and $t^*$, respectively.


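When $\hat{F}^*_{HTy}(t)$ and $\hat{F}^*_{HTx}(t)$ are unbiased estimators of $F_y(t)$ and $F_x(t)$, respectively, any estimator in $H$ is asymptotically unbiased for $Q_y(\beta)$.

3.3. Asymptotic expression of variances

Consider the Taylor's series expansion (4) and consequently the expression

$$\hat{Q}^H_y(\beta) - Q_y(\beta) = \hat{Q}^*_y(\beta) - Q_y(\beta) + \left( \frac{\hat{Q}^*_x(\beta)}{\hat{Q}'_x(\beta)} - 1 \right) H_{01}(Q_y(\beta), 1) + O(n^{-1}).$$

Then, we have

$$\hat{Q}^H_y(\beta) - Q_y(\beta) \cong Q_y(\beta)\, e_0 + \frac{e_1 - e_2}{1 + e_2}\, H_{01}(Q_y(\beta), 1)$$
$$\cong Q_y(\beta)\, e_0 + (e_1 - e_2)(1 - e_2)\, H_{01}(Q_y(\beta), 1)$$
$$= Q_y(\beta)\, e_0 + (e_1 - e_2)\, H_{01}(Q_y(\beta), 1) - e_2 (e_1 - e_2)\, H_{01}(Q_y(\beta), 1),$$

where

$$e_0 = \hat{Q}^*_y(\beta)/Q_y(\beta) - 1, \quad e_1 = \hat{Q}^*_x(\beta)/Q_x(\beta) - 1 \quad \text{and} \quad e_2 = \hat{Q}'_x(\beta)/Q_x(\beta) - 1,$$

and we obtain, to the first order of approximation, the variance

$$V\left(\hat{Q}^H_y(\beta)\right) = Q_y(\beta)^2\, V(e_0) + H_{01}(Q_y(\beta), 1)^2\, V(e_1 - e_2) + 2 H_{01}(Q_y(\beta), 1)\, Q_y(\beta)\, \mathrm{Cov}(e_0, e_1 - e_2).$$

On the other hand, in two-phase sampling the decomposition

$$V\left(\hat{Q}^H_y(\beta)\right) = V_{d_1}\left( E\left(\hat{Q}^H_y(\beta)/s'\right) \right) + E_{d_1}\left( V\left(\hat{Q}^H_y(\beta)/s'\right) \right)$$

reflects the variation due to each of the two phases of sampling. Using the known properties of the Horvitz–Thompson estimator and its variance, and denoting $\Delta'_{ij} = \pi'_{ij} - \pi'_i \pi'_j$ and $\Delta^{s'}_{ij} = \pi_{ij/s'} - \pi_{i/s'} \pi_{j/s'}$, we obtain

$$V_{d_1}\left( E\left(\hat{Q}^H_y(\beta)/s'\right) \right) = \frac{1}{N^2 f_y^2(Q_y(\beta))} \sum_{i,j \in U} \Delta'_{ij}\, \frac{\Delta(Q_y(\beta) - y_i)}{\pi'_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi'_j}$$

and

$$E_{d_1}\left( V\left(\hat{Q}^H_y(\beta)/s'\right) \right) = E_{d_1}\Bigg( \frac{1}{N^2 f_y^2(Q_y(\beta))} \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \frac{\Delta(Q_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi^*_j}$$
$$+ \frac{H_{01}^2(Q_y(\beta), 1)}{Q_x^2(\beta)} \frac{1}{N^2 f_x^2(Q_x(\beta))} \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \frac{\Delta(Q_x(\beta) - x_i)}{\pi^*_i} \frac{\Delta(Q_x(\beta) - x_j)}{\pi^*_j}$$
$$+ 2\, \frac{H_{01}(Q_y(\beta), 1)}{Q_x(\beta)} \frac{1}{N^2 f_y(Q_y(\beta))\, f_x(Q_x(\beta))} \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \frac{\Delta(Q_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(Q_x(\beta) - x_j)}{\pi^*_j} \Bigg).$$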

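The last variance is not stated explicitly, but as an expected value over the first phase design. This causes no problem for the variance estimation: the term

$$\sum_{i,j \in U} \Delta'_{ij}\, \frac{\Delta(Q_y(\beta) - y_i)}{\pi'_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi'_j}$$

can be estimated by

$$\sum_{i,j \in s} \frac{\Delta'_{ij}}{\pi^*_{ij}}\, \frac{\Delta(\hat{Q}^*_y(\beta) - y_i)}{\pi'_i} \frac{\Delta(\hat{Q}^*_y(\beta) - y_j)}{\pi'_j},$$

and

$$E_{d_1}\left( \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \frac{\Delta(Q_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(Q_y(\beta) - y_j)}{\pi^*_j} \right)$$

by

$$\sum_{i,j \in s} \frac{\Delta^{s'}_{ij}}{\pi_{ij/s'}}\, \frac{\Delta(\hat{Q}^*_y(\beta) - y_i)}{\pi^*_i} \frac{\Delta(\hat{Q}^*_y(\beta) - y_j)}{\pi^*_j},$$

and $f_x(Q_x(\beta))$ and $f_y(Q_y(\beta))$ by following Silverman (1986).

The asymptotic variances of the ratio, product and exponentiation estimators, corresponding to $H(q, t^*) = q/t^*$, $H(q, t^*) = q t^*$ and $H(q, t^*) = q/(t^*)^{\alpha}$, respectively, can be derived.

3.4. Optimal estimators

In this section we derive the expression of the optimal estimator in the class $\hat{Q}^*_{ye}(\beta)$. Again the optimality is defined in the sense of minimizing the (asymptotic) variance of these estimators. For the exponentiation estimator, $H_{01}(Q_y(\beta), 1) = -\alpha Q_y(\beta)$, so the first-order variance of Section 3.3 is a quadratic function of $\alpha$ that is minimized at $\alpha = \mathrm{Cov}(e_0, e_1 - e_2)/V(e_1 - e_2)$. This leads to the optimal value of $\alpha$ given by

$$\alpha_{opt} = \frac{Q_x(\beta)}{Q_y(\beta)}\, \frac{\mathrm{Cov}\left(\hat{Q}^*_y(\beta), \hat{Q}^*_x(\beta)\right) - \mathrm{Cov}\left(\hat{Q}^*_y(\beta), \hat{Q}'_x(\beta)\right)}{V\left(\hat{Q}^*_x(\beta)\right) + V\left(\hat{Q}'_x(\beta)\right) - 2\, \mathrm{Cov}\left(\hat{Q}^*_x(\beta), \hat{Q}'_x(\beta)\right)}.$$

By using the properties of two-phase sampling, the next expression can be obtained:

$$\alpha_{opt} = \frac{Q_x(\beta)\, f_x(Q_x(\beta))}{Q_y(\beta)\, f_y(Q_y(\beta))}\, \frac{E_{d_1}\left[ \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \dfrac{\Delta(Q_y(\beta) - y_i)}{\pi^*_i} \dfrac{\Delta(Q_x(\beta) - x_j)}{\pi^*_j} \right]}{E_{d_1}\left[ \sum_{i,j \in s'} \Delta^{s'}_{ij}\, \dfrac{\Delta(Q_x(\beta) - x_i)}{\pi^*_i} \dfrac{\Delta(Q_x(\beta) - x_j)}{\pi^*_j} \right]},$$

and then

$$\hat{Q}^{opt}_y(\beta) = \hat{Q}^*_y(\beta) \left( \frac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)} \right)^{\alpha_{opt}}.$$

It can be easily seen that

$$V\left(\hat{Q}^H_y(\beta)\right) \ge V\left(\hat{Q}^{opt}_y(\beta)\right) = V\left(\hat{Q}^*_y(\beta)\right) - K_1 = V\left(\hat{Q}^*_y(\beta)\right) - \frac{\left[ \mathrm{Cov}\left(\hat{Q}^*_y(\beta), \hat{Q}^*_x(\beta)\right) - \mathrm{Cov}\left(\hat{Q}^*_y(\beta), \hat{Q}'_x(\beta)\right) \right]^2}{V\left(\hat{Q}^*_x(\beta)\right) + V\left(\hat{Q}'_x(\beta)\right) - 2\, \mathrm{Cov}\left(\hat{Q}^*_x(\beta), \hat{Q}'_x(\beta)\right)}, \qquad (5)$$

that is, the lower bound of the variance of $\hat{Q}^H_y(\beta)$ is the variance of the exponentiation estimator with $\alpha_{opt}$. Eq. (5) shows that the proposed estimator $\hat{Q}^{opt}_y(\beta)$ always remains more efficient than the simple estimator $\hat{Q}^*_y(\beta)$. Specifically, $K_1$ is the amount by which the variance is reduced when we use the exponentiation estimator with an optimal $\alpha$ instead of the $\hat{Q}^*_y(\beta)$ estimator.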
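As a hedged illustration of the exponentiation family of Section 3.1 and of the optimal exponent of Section 3.4, the following Python sketch evaluates both formulas from already computed inputs. The function names and the toy numbers are hypothetical; the variances and covariances are assumed to be supplied by the user, for instance from the expressions of Section 3.3.

```python
def exponentiation_estimator(q_y_star, q_x_first, q_x_star, alpha):
    """Q*_y(beta) * (Q'_x(beta) / Q*_x(beta)) ** alpha.

    alpha = 0 gives the direct estimator, alpha = 1 the ratio estimator
    and alpha = -1 the product estimator (Note 1).
    """
    return q_y_star * (q_x_first / q_x_star) ** alpha


def optimal_alpha(q_x, q_y, cov_y_xstar, cov_y_xfirst,
                  var_xstar, var_xfirst, cov_xstar_xfirst):
    """alpha_opt = (Q_x / Q_y) * (C1 - C2) / (V1 + V2 - 2 * C3)."""
    numerator = cov_y_xstar - cov_y_xfirst
    denominator = var_xstar + var_xfirst - 2.0 * cov_xstar_xfirst
    return (q_x / q_y) * numerator / denominator


def variance_reduction(cov_y_xstar, cov_y_xfirst,
                       var_xstar, var_xfirst, cov_xstar_xfirst):
    """K_1 in Eq. (5): squared covariance difference over the variance of the difference."""
    numerator = (cov_y_xstar - cov_y_xfirst) ** 2
    denominator = var_xstar + var_xfirst - 2.0 * cov_xstar_xfirst
    return numerator / denominator


# Toy inputs (made up for illustration only).
direct = exponentiation_estimator(10.0, 6.0, 5.0, 0.0)
ratio = exponentiation_estimator(10.0, 6.0, 5.0, 1.0)
product = exponentiation_estimator(10.0, 6.0, 5.0, -1.0)
alpha = optimal_alpha(2.0, 4.0, 3.0, 1.0, 5.0, 2.0, 1.5)
k1 = variance_reduction(3.0, 1.0, 5.0, 2.0, 1.5)
```

Because $K_1$ is a squared quantity divided by the variance of a difference, it cannot be negative, which is the content of the inequality in (5).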


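In practice the optimal value of $\alpha$ is unknown. Nevertheless, the sample data can be used to calculate its estimator. Thus, an estimator of the optimal value of $\alpha$ is given by

$$\hat{\alpha} = \frac{\hat{Q}^*_x(\beta)\, \hat{f}_x(\hat{Q}^*_x(\beta))}{\hat{Q}^*_y(\beta)\, \hat{f}_y(\hat{Q}^*_y(\beta))}\, \frac{\sum_{i,j \in s} \dfrac{\Delta^{s'}_{ij}}{\pi_{ij/s'}}\, \dfrac{\Delta(\hat{Q}^*_y(\beta) - y_i)}{\pi^*_i} \dfrac{\Delta(\hat{Q}^*_x(\beta) - x_j)}{\pi^*_j}}{\sum_{i,j \in s} \dfrac{\Delta^{s'}_{ij}}{\pi_{ij/s'}}\, \dfrac{\Delta(\hat{Q}^*_x(\beta) - x_i)}{\pi^*_i} \dfrac{\Delta(\hat{Q}^*_x(\beta) - x_j)}{\pi^*_j}}. \qquad (6)$$

We can define an optimal estimator of the $\beta$-quantile as

$$\hat{Q}^{\hat{\alpha}}_y(\beta) = \hat{Q}^*_y(\beta) \left( \frac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)} \right)^{\hat{\alpha}}.$$

Following the procedure discussed in Allen et al. (2002) it can be shown that $E\left(\hat{Q}^{\hat{\alpha}}_y(\beta)\right) = Q_y(\beta) + o(n^{-1})$ and, to the first degree of approximation, $V\left(\hat{Q}^{\hat{\alpha}}_y(\beta)\right) = V\left(\hat{Q}^{opt}_y(\beta)\right)$, i.e., the estimators $\hat{Q}^{\hat{\alpha}}_y(\beta)$ and $\hat{Q}^{opt}_y(\beta)$ are asymptotically equivalent.

4. Two-phase sampling for stratification

In Section 2 we showed a particular case of two-phase sampling where the first phase sample is stratified using the auxiliary variable. This sampling design is called two-phase sampling for stratification. We now define an estimator for the quantile $Q_y(\beta)$ under this sampling design and analyze several of its properties.

In the first place, we define the following estimator for the distribution function:

$$\hat{F}^*_{st}(t) = \frac{1}{N} \sum_{h=1}^{L} \sum_{i \in s_h} \frac{\Delta(t - y_i)}{\pi^*_i},$$

and we suggest estimating the quantile $Q_y(\beta)$ by $\hat{Q}^*_{st}(\beta) = \hat{F}^{*-1}_{st}(\beta)$, where the inverse $\hat{F}^{*-1}_{st}$ exists in the same way as $\hat{F}^{-1}_{HTy}$ above.

To study the properties of the $\hat{Q}^*_{st}(\beta)$ estimator, we will first analyze the properties of the $\hat{F}^*_{st}(t)$ estimator. Note that $\hat{F}^*_{st}(t)$ is unbiased and its variance is given by

$$V\left(\hat{F}^*_{st}(t)\right) = \frac{1}{N^2} \left( \sum_{i,j \in U} \Delta'_{ij}\, \frac{\Delta(t - y_i)}{\pi'_i} \frac{\Delta(t - y_j)}{\pi'_j} + E_{d_1}\left[ \sum_{h=1}^{L} \sum_{i,j \in s'_h} \Delta^{s'}_{ij}\, \frac{\Delta(t - y_i)}{\pi^*_i} \frac{\Delta(t - y_j)}{\pi^*_j} \right] \right). \qquad (7)$$

Thus, an unbiased estimator of the variance is given by

$$\hat{V}\left(\hat{F}^*_{st}(t)\right) = \frac{1}{N^2} \left( \sum_{i,j \in s} \frac{\Delta'_{ij}}{\pi^*_{ij}}\, \frac{\Delta(t - y_i)}{\pi'_i} \frac{\Delta(t - y_j)}{\pi'_j} + \sum_{h=1}^{L} \sum_{i,j \in s_h} \frac{\Delta^{s'}_{ij}}{\pi_{ij/s'}}\, \frac{\Delta(t - y_i)}{\pi^*_i} \frac{\Delta(t - y_j)}{\pi^*_j} \right), \qquad (8)$$

because each component of (8) is unbiased for its counterpart in Eq. (7).

Similar to Section 2.2, the $\hat{Q}^*_{st}(\beta)$ estimator can be expressed as a linear function of $\hat{F}^*_{st}(Q_y(\beta))$. In addition, because $\hat{F}^*_{st}(t)$ is unbiased for $F_y(t)$, we deduce that $\hat{Q}^*_{st}(\beta)$ is asymptotically unbiased. An approximately unbiased estimator of the variance is given by

$$\hat{V}\left(\hat{Q}^*_{st}(\beta)\right) = \frac{1}{N^2 f_y^2(Q_y(\beta))} \left( \sum_{i,j \in s} \frac{\Delta'_{ij}}{\pi^*_{ij}}\, \frac{\Delta(\hat{Q}^*_{st}(\beta) - y_i)}{\pi'_i} \frac{\Delta(\hat{Q}^*_{st}(\beta) - y_j)}{\pi'_j} + \sum_{h=1}^{L} \sum_{i,j \in s_h} \frac{\Delta^{s'}_{ij}}{\pi_{ij/s'}}\, \frac{\Delta(\hat{Q}^*_{st}(\beta) - y_i)}{\pi^*_i} \frac{\Delta(\hat{Q}^*_{st}(\beta) - y_j)}{\pi^*_j} \right).$$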


Table 1. Description and references of populations

| Population | Description | Variables | $\rho_{yx}$ | References |
|---|---|---|---|---|
| Fam1500 (N = 1500) | Families of Andalucía (Spain) | y: Feeding expenses | | Fernández and Mayor (1994) |
| | | x1: Family incomes | 0.848 | |
| | | x2: Other expenses | 0.546 | |
| Counties (N = 304) | Counties in Carolina and Georgia | y: Population in 1970 | | Royall and Cumberland (1981); Valliant et al. (2000) |
| | | x1: Population in 1960 | 0.982 | |
| | | x2: Households in 1960 | 0.982 | |

5. Empirical study

The present investigation proposes several estimators for quantiles in two-phase sampling with unequal probabilities. The use of two-phase sampling for stratification has also been considered for estimating quantiles. In this section we carry out a simulation study to reveal the behaviour of these estimators and to point out the most efficient estimator. For this purpose, we examine two natural populations, used previously in finite population sampling. The populations in question are Fam1500 and Counties. A brief description and the references of these populations can be seen in Table 1.

In these populations there are several auxiliary variables having different linear correlation coefficients with the variable of interest $y$. In this study the behaviour of the estimators can therefore be observed when strong and weak relationships between variables are considered.

We have generated 1000 independent samples under different methods in each phase. The first phase sample size, $n'$, is fixed at 150 and the second phase sample size, $n$, is allowed to change from 10 to 100. The methods used are:

(1) (SRSWOR.M) The first phase is SRSWOR of size $n'$. The second phase is carried out using the Midzuno–Sen method (Singh, 2003, p. 390) to extract samples with unequal probabilities:

$$\pi'_i = \frac{n'}{N}, \quad \pi_{i/s'} = \frac{n' - n}{n' - 1}\, \frac{x_i}{\sum_{j \in s'} x_j} + \frac{n - 1}{n' - 1} \;\rightarrow\; \pi^*_i = \pi'_i\, \pi_{i/s'}.$$

(2) (SRSWOR.P) The first phase is SRSWOR of size $n'$. The second phase is carried out by Poisson sampling (Singh, 2003, p. 499) such that the conditional inclusion probability is proportional to $x$:

$$\pi'_i = \frac{n'}{N}, \quad \pi_{i/s'} = n\, \frac{x_i}{\sum_{j \in s'} x_j} \;\rightarrow\; \pi^*_i = \pi'_i\, \pi_{i/s'}.$$

(3) (ST.M) Two-phase sampling for stratification: in the first phase, a sample is drawn according to SRSWOR. For the elements in $s'$, information is recorded that will permit a stratification. From stratum $h$, a sample $s_h$ of size $n_h$ is drawn with unequal probabilities using the Midzuno–Sen method:

$$\pi'_i = \frac{n'}{N}, \quad \pi_{i/s'_h} = \frac{n'_h - n_h}{n'_h - 1}\, \frac{x_i}{\sum_{j \in s'_h} x_j} + \frac{n_h - 1}{n'_h - 1} \;\rightarrow\; \pi^*_i = \pi'_i\, \pi_{i/s'_h} \quad \text{for } i \in s'_h.$$

The performance of the proposed estimators is evaluated for the three quartiles, $\beta = 0.25, 0.50, 0.75$, in terms of relative bias in percent (RB) and relative efficiency (RE), with Monte Carlo approximations derived from the $B = 1000$ independent samples:

$$RB_i = 100 \times \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{Q}^i_y(\beta)_b - Q_y(\beta)}{Q_y(\beta)}, \quad RE_i = \frac{\mathrm{MSE}\left[\hat{Q}^i_y(\beta)\right]}{\mathrm{MSE}\left[\hat{Q}^*_y(\beta)\right]},$$

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RE against the second phase sample size n for Estimator 1, Estimator 2 and Estimator 3.]

(*) x1 is used as an auxiliary variable and x2 is used to assign probabilities. (**) x2 is used as an auxiliary variable and x1 is used to assign probabilities.

Fig. 1. RE for Fam1500 population and under SRSWOR.M sampling design. n' = 150.

where $b$ indexes the $b$th simulation run and $\hat{Q}^i_y(\beta)$ denotes the $i$th proposed estimator, with

• $\hat{Q}^1_y(\beta) = \hat{Q}^*_y(\beta)\, \dfrac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)}$,
• $\hat{Q}^2_y(\beta) = \hat{Q}^*_y(\beta) \left( \dfrac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)} \right)^{\hat{\alpha}}$, where $\hat{\alpha}$ can be seen in (6),
• $\hat{Q}^3_y(\beta) = \hat{Q}^*_y(\beta) \left( \dfrac{\hat{Q}'_x(\beta)}{\hat{Q}^*_x(\beta)} \right)^{\alpha_{opt}}$,
• $\hat{Q}^4_y(\beta) = \hat{Q}^*_{st}(\beta)$.

Here $\mathrm{MSE}\left[\hat{Q}^i_y(\beta)\right] = B^{-1} \sum_{b=1}^{B} \left( \hat{Q}^i_y(\beta)_b - Q_y(\beta) \right)^2$, and $\mathrm{MSE}\left[\hat{Q}^*_y(\beta)\right]$ is similarly defined for $\hat{Q}^*_y(\beta)$, the direct estimator defined in (1), which does not use the auxiliary information. The random generations, calculations and all the estimators were obtained using the R program. Programming details are available from the authors.

Figs. 1–4 represent the RE for the $\hat{Q}^1_y(\beta)$, $\hat{Q}^2_y(\beta)$ and $\hat{Q}^3_y(\beta)$ estimators in the different populations and under the SRSWOR.M and SRSWOR.P designs. These figures show the behaviour of the estimators when the sample size in the second phase increases, while the first phase sample size remains fixed.

If there is a high linear correlation coefficient between $y$ and the auxiliary variable, then all estimators are more efficient than the $\hat{Q}^*_y(\beta)$ estimator (shown with horizontal dotted lines). The gain in relative efficiency decreases if the sample size in the second phase increases. This result is logical because if the sample size in the second phase is small, the sample will have less information on the $y$ variable, and the $\hat{Q}^*_y(\beta)$ estimator will present a larger degree of error, while the ratio estimators are more efficient because more information is used. As $n$ increases, $\hat{Q}^*_y(\beta)$ yields better estimates, which come closer to those of the ratio estimator. Note that for the Fam1500 population and under SRSWOR.P sampling

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RE against the second phase sample size n for Estimator 1, Estimator 2 and Estimator 3.]

(*) x1 is used as an auxiliary variable and x2 is used to assign probabilities. (**) x2 is used as an auxiliary variable and x1 is used to assign probabilities.

Fig. 2. RE for Fam1500 population and under SRSWOR.P sampling design. n' = 150.

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RE against the second phase sample size n for Estimator 1, Estimator 2 and Estimator 3.]

(*) x1 is used as an auxiliary variable and x2 is used to assign probabilities. (**) x2 is used as an auxiliary variable and x1 is used to assign probabilities.

Fig. 3. RE for Counties population and under SRSWOR.M sampling design. n' = 150.

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RE against the second phase sample size n for Estimator 1, Estimator 2 and Estimator 3.]

(*) x1 is used as an auxiliary variable and x2 is used to assign probabilities. (**) x2 is used as an auxiliary variable and x1 is used to assign probabilities.

Fig. 4. RE for Counties population and under SRSWOR.P sampling design. n' = 150.

design with the first phase sample size $n' = 150$, as the second phase sample size $n$ increases from 10 to 100, the RE shows two peaks: at $n = 25$ and $n = 80$ for $\beta = 0.25$; at $n = 55$ and $n = 80$ for $\beta = 0.5$; and at $n = 60$ and $n = 100$ for $\beta = 0.75$. It appears that when a higher quartile is estimated, a larger second phase sample size may be required as far as the efficiency of the proposed estimators is concerned.

$\hat{Q}^3_y(\beta)$ is the most efficient estimator in many cases. This is expected because this estimator is asymptotically optimum in the class (3). Nevertheless, the estimator $\hat{Q}^2_y(\beta)$ has very similar values and does not depend on unknown values. $\hat{Q}^1_y(\beta)$ is usually less efficient than the other proposed estimators. When the linear relation between the variables is weaker, $\hat{Q}^1_y(\beta)$ is even less efficient than the direct estimator, while $\hat{Q}^2_y(\beta)$ and $\hat{Q}^3_y(\beta)$ continue to perform better. In short, the use of the exponentiation estimator improves the estimates, especially if there is a weak relationship between the study and auxiliary variables.

On the other hand, the Poisson method of sampling produces more efficient results in the sense of RE than the Midzuno–Sen method, with regard to $\hat{Q}^*_y(\beta)$, because the direct estimator presents dispersed estimates under the Poisson method, caused by the heterogeneity of the inclusion probabilities. The proposed estimators are almost equivalent in the Counties population because the linear correlation coefficients are larger. In fact, the RE of the proposed estimators in this population is better than in the Fam1500 population.

Bias is another important aspect, particularly for ratio estimators, which can show underestimation or overestimation. The RB values in the Fam1500 population are all within a reasonable range, with $\hat{Q}^*_y(\beta)$ having the largest at 3%, as seen in Fig. 5.
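The conditional inclusion probabilities of the SRSWOR.M and SRSWOR.P designs, whose heterogeneity is discussed above, can be sketched in Python as follows. This is an illustrative reading of the formulas in the list of sampling methods, not the authors' R code; the function names are hypothetical, and the Poisson version does not handle the case where $n x_i / \sum_j x_j$ exceeds 1.

```python
def midzuno_sen_conditional(x_first, n):
    """pi_{i/s'} = (n'-n)/(n'-1) * x_i / sum_j x_j + (n-1)/(n'-1) over the first-phase sample."""
    n_first = len(x_first)
    total = sum(x_first)
    return [(n_first - n) / (n_first - 1) * xi / total + (n - 1) / (n_first - 1)
            for xi in x_first]


def poisson_conditional(x_first, n):
    """pi_{i/s'} = n * x_i / sum_j x_j (values above 1 are not corrected in this sketch)."""
    total = sum(x_first)
    return [n * xi / total for xi in x_first]


def pi_star(pi_conditional, n_first, N):
    """pi*_i = pi'_i * pi_{i/s'} with pi'_i = n'/N for SRSWOR in the first phase."""
    return [(n_first / N) * p for p in pi_conditional]


# Toy first-phase sample of n' = 4 units from N = 20, second-phase size n = 2.
x_first_phase = [1.0, 2.0, 3.0, 4.0]
midzuno = midzuno_sen_conditional(x_first_phase, 2)
poisson = poisson_conditional(x_first_phase, 2)
midzuno_star = pi_star(midzuno, 4, 20)
```

In both designs the conditional probabilities add up to the second phase sample size $n$, but the Poisson probabilities are exactly proportional to $x$ and therefore vary more across units than the Midzuno–Sen ones, which include the constant term $(n - 1)/(n' - 1)$.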
The RB values for the Counties population when x1 is used as an auxiliary variable and x2 is used to assign probabilities are shown in Fig. 6. The $\hat{Q}^*_y(\beta)$ estimator clearly leads to serious overestimation, especially when the sample size is small and under the SRSWOR.P sampling design, whereas the absolute RBs of the proposed estimators are less than 7% for the SRSWOR.M sampling design and less than 13% for the SRSWOR.P sampling design, except on small sample sizes for the $\hat{Q}^2_y(\beta)$ estimator, which has the largest at 25%. In short, the study of the RB reveals that the proposed estimators are better than the direct estimator.

Fig. 7 is an example of two-phase sampling for stratification. The proposed estimator is compared with the direct estimator if the strata are not considered. It can also be observed that the use of stratification is recommended because

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RB against the second phase sample size n for the Direct estimator, Estimator 1, Estimator 2 and Estimator 3.]

(*) SRSWOR.M sampling design. (**) SRSWOR.P sampling design.

Fig. 5. RB in percent for Fam1500 population when x1 is used as an auxiliary variable and x2 is used to assign probabilities. n' = 150.

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RB against the second phase sample size n for the Direct estimator, Estimator 1, Estimator 2 and Estimator 3.]

(*) SRSWOR.M sampling design. (**) SRSWOR.P sampling design.

Fig. 6. RB in percent for Counties population when x1 is used as an auxiliary variable and x2 is used to assign probabilities. The RB values for the direct estimator in (**) are larger than 97.6%, 74.6% and 21.5% for β = 0.25, 0.5 and 0.75, respectively, and are omitted. n' = 150.

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RE against the second phase sample size n for the proposed estimator using variable x1 and the proposed estimator using variable x2.]

(*) Fam1500 population. (**) Counties population.

Fig. 7. RE for Fam1500 and Counties populations and under ST.M sampling design. n' = 150.

[Figure: panels for β = 0.25, β = 0.5 and β = 0.75 in two rows, (*) and (**), plotting RB against the second phase sample size n for the Direct estimator and the Proposed estimator.]

(*) Fam1500 population. (**) Counties population.

Fig. 8. RB in percent for Fam1500 and Counties populations under ST.M sampling design and when the variable x1 is used. n' = 150.

the estimates are more efficient, especially if the sample size in the second phase decreases. In all cases the proposed estimators show improvement over the direct estimator irrespective of the linear relationship between variables, although the gain in RE is better if the correlation coefficient is larger. In reality, the gain in efficiency is guaranteed because the strata are well designed, i.e., the strata are homogeneous inside and heterogeneous among them.
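The Monte Carlo measures RB and RE defined earlier in this section can be computed as in the following Python sketch. The function names are hypothetical, and the inputs are assumed to be lists holding the $B$ simulated values of an estimator and of the direct estimator together with the true quantile $Q_y(\beta)$.

```python
def relative_bias(estimates, q_true):
    """RB in percent: 100 * mean over runs of (estimate - Q_y) / Q_y."""
    return 100.0 * sum((e - q_true) / q_true for e in estimates) / len(estimates)


def mse(estimates, q_true):
    """Monte Carlo mean squared error around the true quantile."""
    return sum((e - q_true) ** 2 for e in estimates) / len(estimates)


def relative_efficiency(estimates, direct_estimates, q_true):
    """RE = MSE of the proposed estimator / MSE of the direct estimator.

    With this definition, values below 1 favour the proposed estimator.
    """
    return mse(estimates, q_true) / mse(direct_estimates, q_true)


# Toy simulation output (made up): two runs, true quantile 10.
proposed_runs = [9.0, 11.0]
direct_runs = [8.0, 12.0]
rb = relative_bias(proposed_runs, 10.0)
re = relative_efficiency(proposed_runs, direct_runs, 10.0)
```

Since the MSE of the direct estimator is in the denominator, the direct estimator itself corresponds to RE = 1.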


As far as the RB is concerned, the proposed estimator, $\hat{Q}^*_{st}(\beta)$, is better than $\hat{Q}^*_y(\beta)$, as can be observed in Fig. 8. The RB values of $\hat{Q}^*_{st}(\beta)$ are less than 10%, whereas the $\hat{Q}^*_y(\beta)$ estimator leads to a weak overestimation for the Counties population. In fact, the Fam1500 population produces better estimates than the Counties population in terms of RB. The estimators show similar behaviour when the variable x2 is used and consequently these figures are not shown.

Acknowledgements

The authors would like to thank the referee and the Associate Editor for their many helpful comments and suggestions. The authors are also thankful to a professional English Editor, Ms. Melissa Lindsey, St. Cloud State University, for editing the manuscript. This work was supported by the Spanish Ministry of Education and Science (Contract no. MTM2004-04038).

References

Allen, J., Singh, H.P., Singh, S., Smarandache, F., 2002. A general class of estimators of population median using two auxiliary variables in double sampling. INTERSTAT.
Chambers, R.L., Dunstan, R., 1986. Estimating distribution functions from survey data. Biometrika 73, 597–604.
Chen, J., Wu, C., 2002. Estimation of distribution function and quantiles using the model-calibrated pseudo-empirical likelihood method. Statist. Sin. 12, 1223–1239.
Fernández, F.R., Mayor, J.A., 1994. Muestreo en Poblaciones Finitas: Curso Básico. P.P.U, Barcelona.
Horvitz, D.G., Thompson, D.J., 1952. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47, 663–685.
Kuk, A., Mak, T.K., 1989. Median estimation in the presence of auxiliary information. J. Roy. Statist. Soc. B 1, 261–269.
Rao, J.N.K., Kovar, J.G., Mantel, H.J., 1990. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77, 365–375.
Royall, R.M., Cumberland, W.G., 1981. An empirical study of the ratio estimator and estimator of its variance. J. Amer. Statist. Assoc. 76, 66–88.
Rueda, M., Arcos, A., 2001. On estimating the median from survey data using multiple auxiliary information. Metrika 54, 59–76.
Rueda, M., Arcos, A., Artés, E., 1998. Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika 47, 203–213.
Rueda, M., Arcos, A., Martínez-Miranda, M.D., 2003. Difference estimators of quantiles in finite populations. Test 12, 481–496.
Rueda, M., Arcos, A., Martínez-Miranda, M.D., Román, Y., 2004. Some improved estimators of finite population quantile using auxiliary information in sample surveys. Comput. Statist. Data Anal. 45, 825–848.
Särndal, C.E., Swensson, B., Wretman, J., 1992. Model Assisted Survey Sampling. Springer, New York.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Singh, S., Joarder, A.H., Tracy, D.S., 2001. Median estimation using double sampling. Austral. New Zealand J. Statist. 43, 33–46.
Singh, S., 2003. Advanced Sampling Theory with Applications: How Michael "Selected" Amy. Kluwer Academic Publisher, The Netherlands.
Swamy, P.A.V.B., Tavlas, G.S., Chang, I.L., 2005. How stable are monetary policy rules: estimating the time-varying coefficients in monetary policy reaction function for the US. Comput. Statist. Data Anal. 49, 575–590.
Valliant, R., Dorfman, A.H., Royall, R.M., 2000. Finite Population Sampling and Inference: A Prediction Approach. Wiley Series in Probability and Statistics, Survey Methodology Section. Wiley, New York.
