Estimating Comparable Scores Using Surrogate Variables

July 20, 2017 | Author: Michelle Liou | Category: Psychology, Comparative Study, Sample Selection Bias, EM algorithm, Data Fitting, Incomplete Data, Generic model


Applied Psychological Measurement http://apm.sagepub.com/

Estimating Comparable Scores Using Surrogate Variables Michelle Liou, Philip E. Cheng and Ming-Yen Li Applied Psychological Measurement 2001 25: 197 DOI: 10.1177/01466210122032000 The online version of this article can be found at: http://apm.sagepub.com/content/25/2/197

Published by: http://www.sagepublications.com


Version of Record - Jun 1, 2001

Downloaded from apm.sagepub.com at ACADEMIA SINICA on October 18, 2014

Estimating Comparable Scores Using Surrogate Variables Michelle Liou and Philip E. Cheng, Academia Sinica Ming-Yen Li, College Entrance Examination Center

The possibility of using surrogate variables (e.g., school grades, other test scores, examinee background information) as replacements for common items predicting sample-selection bias between groups was investigated. The problem was specified as an incomplete data problem of comparability studies and was addressed using nonequivalent groups. A general model for estimating complete data (fitted) distributions through covariates is proposed (including common-item scores and surrogate variables as special cases). Model parameters are estimated using the EM algorithm. Standard errors of comparable scores are derived under the proposed model. Data from an empirical example examined the use of surrogate variables for establishing score comparability.

Index terms: categorical data, data imputation, EM algorithm, equipercentile equating, loglinear smoothing.

Test equating commonly refers to the scaling of two equivalent forms of the same test to achieve score comparability (e.g., new and old versions of the Armed Services Vocational Aptitude Battery; Little & Rubin, 1994). It is also possible to equate scores on target tests with similar content that are not necessarily equivalent (e.g., the American College Testing Assessment and the College Board’s Scholastic Aptitude Test; Marco, Abdel-fattah, & Baron, 1992). The term “comparability study” is used here to describe the equating of similar but not necessarily equivalent tests.

A common-item design, in which a set of common items is administered with target tests to nonequivalent groups, often is used in data collection for comparability studies. Scores on common items serve as a basis for adjusting possible group differences using the chained equipercentile method (CEM) or the frequency estimation method (FEM; Angoff, 1984). Common-item scores do not always correlate highly with scores on target tests. Also, because target tests are frequently administered a few months apart, scores collected at the second testing occasion might be contaminated by nonrandom errors due to test disclosure. Wright & Dorans (1993) suggested using selection (surrogate) variables (e.g., school grades, other test scores) as the anchor to account for group differences. Selection variables, along with examinee background information, might better account for sample-selection bias than do common-item scores alone.

Let Test X and Test Y be target tests between which comparable scores must be determined. The equipercentile method (Angoff, 1984) defines score x on Test X and score ξ on Test Y to be comparable if $\xi = G^{-1}[F(x)]$, where F and G are the distribution functions of x and y scores in the reference population. A frequency corresponding to an integer score x is assumed to be uniformly distributed in the interval [x − .5, x + .5]. Therefore, ξ can be computed by interpolation.
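
As a rough illustration of the equipercentile definition, the following Python sketch computes $\xi = G^{-1}[F(x)]$ from two discrete score distributions, spreading the probability of each integer score uniformly over [x − .5, x + .5]. The function names and the toy distributions are hypothetical. Evaluating F at the midpoint of an integer score's interval (the usual percentile-rank convention) is an assumption made here, not something the article specifies.

```python
import numpy as np

def continuized_cdf(probs, x):
    """CDF at real-valued x when the mass probs[i] at integer score i
    is spread uniformly over [i - .5, i + .5]."""
    probs = np.asarray(probs, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(probs)))  # cum[i] = P(score <= i - 1)
    i = int(np.clip(np.floor(x + 0.5), 0, len(probs) - 1))
    frac = float(np.clip(x - (i - 0.5), 0.0, 1.0))   # position inside the unit interval
    return cum[i] + frac * probs[i]

def inverse_continuized_cdf(probs, p):
    """Score xi such that the continuized CDF equals p (linear interpolation)."""
    probs = np.asarray(probs, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(probs)))
    # first score whose upper cumulative probability reaches p
    j = min(int(np.searchsorted(cum[1:], p, side="left")), len(probs) - 1)
    if probs[j] == 0.0:                              # degenerate case: no mass to interpolate in
        return j - 0.5
    return (j - 0.5) + (p - cum[j]) / probs[j]

def equipercentile(fx, gy, x):
    """Comparable score on Test Y for score x on Test X: xi = G^{-1}[F(x)]."""
    return inverse_continuized_cdf(gy, continuized_cdf(fx, x))

# Hypothetical toy distributions: Test Y is Test X shifted up by one point.
fx = [0.1, 0.2, 0.4, 0.2, 0.1]
gy = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1]
xi = equipercentile(fx, gy, 2)   # close to 3.0 for this shifted example
```

In this toy case every x score maps to x + 1, as expected for a pure one-point shift between the two distributions.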
When target tests are administered to nonequivalent groups on different occasions, F and G estimates might be biased. Comparability studies using nonequivalent groups are considered incomplete-data problems: examinees have observed scores on one test and missing scores on the other (Liou & Cheng, 1995a).

Applied Psychological Measurement, Vol. 25 No. 2, June 2001, 197–207 ©2001 Sage Publications, Inc.


Volume 25 Number 2 June 2001 APPLIED PSYCHOLOGICAL MEASUREMENT

A generalized loglinear model is proposed to estimate score distributions that would be observed if examinees had been administered both tests. As a result, F and G estimates based on scores from both groups should be less biased than those using incomplete data. Fitted (complete data) distributions are estimated through useful covariates, such as common-item scores or surrogates. Model parameters are estimated using an EM algorithm with the missing-at-random assumption (Rubin & Thayer, 1978). This assumption is tenable because examinees are assigned to different groups prior to taking the target tests. Comparable scores [and their standard errors (SEs)] are determined with the equipercentile method when F and G estimates are available for both groups.

Modeling Population Distributions

The common-item design has three variables: X, Y, and common items V. When V is unavailable or contaminated by nonrandom errors, it can be replaced with a surrogate variable (e.g., scores on a test available for both groups). The surrogate variable should correlate moderately well with X and Y. V is used here to denote either the common items or a surrogate variable. Let B be the cross-classification of examinees on all background variables (e.g., gender, school). Let f(i, j, k, l) be the joint distribution between x = i, y = j, v = k, and b = l. Under the saturated model, the fitted distribution (i.e., both groups having x and y scores) is

\[
\log f(i,j,k,l) = \eta + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{V} + \lambda_l^{B} + \lambda_{ij}^{XY} + \lambda_{ik}^{XV} + \lambda_{il}^{XB} + \lambda_{jk}^{YV} + \lambda_{jl}^{YB} + \lambda_{kl}^{VB} + \lambda_{ijk}^{XYV} + \lambda_{ijl}^{XYB} + \lambda_{ikl}^{XVB} + \lambda_{jkl}^{YVB} + \lambda_{ijkl}^{XYVB} , \tag{1}
\]

where η is a normalizing constant. The model comprises each variable’s main effects and higher-order interactions between variables. In a loglinear model, marginal distributions for X, Y, V, and B must be preserved whenever specified in the model. For example, the X marginal probability is constrained to be identical in the fitted and observed distributions if the X main effect is included in the model. When Test X contains 100 items, the i integer scores range from 0 to 100. A complete specification of the 100 parameters for fitting the X marginal probability is impractical. Because X, Y, and V are score distributions, less-stringent constraints can be used. For example,

\[
\lambda_i^{X} = \phi_1^{X} i + \phi_2^{X} i^2 + \cdots + \phi_q^{X} i^q , \tag{2}
\]

where $\phi_h^{X}$ (h = 1, 2, ..., q) are parameters for fitting the X distribution. These q parameters are the same for i = 0, 1, ..., 100. For maximum likelihood estimates (MLEs) of φ, the first q fitted moments equal the corresponding moments in the observed X distribution (Holland & Thayer, 1987; Rosenbaum & Thayer, 1987). A model with q = 3 fits a unimodal distribution well (e.g., Livingston, 1993), and q ≥ 4 is desirable for fitting skewed distributions with heavy tails. The XY marginal probability also can be constrained, so that the cross-product moments (e.g., XY, X^2 Y, XY^2) are identical in the fitted and observed distributions. That is,

\[
\lambda_{ij}^{XY} = \phi_1^{XY} ij + \phi_2^{XY} i^2 j + \phi_3^{XY} i j^2 + \cdots . \tag{3}
\]
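
The moment-matching property of the polynomial parameterization in Equation 2 can be checked numerically. The Python sketch below fits a univariate loglinear model with q polynomial terms by damped Newton steps and returns the fitted probabilities. The function name, the toy frequencies, and the choice of optimizer are assumptions made for illustration. Scores are rescaled to [0, 1] before taking powers, which only improves numerical conditioning and leaves the fitted distribution unchanged.

```python
import numpy as np

def fit_polynomial_loglinear(counts, q=3, n_iter=200, tol=1e-10):
    """Fit log f(i) = eta + phi_1 i + ... + phi_q i^q to score frequencies
    by maximum likelihood; returns the fitted probabilities f(i)."""
    counts = np.asarray(counts, dtype=float)
    scores = np.arange(len(counts))
    z = scores / scores.max()                      # rescaled scores in [0, 1]
    D = np.column_stack([z**h for h in range(1, q + 1)])
    target = D.T @ (counts / counts.sum())         # observed moments S_h / N

    def log_lik(phi):                              # per-examinee log-likelihood, up to a constant
        eta = D @ phi
        return target @ phi - (eta.max() + np.log(np.exp(eta - eta.max()).sum()))

    phi = np.zeros(q)
    for _ in range(n_iter):
        eta = D @ phi
        f = np.exp(eta - eta.max())
        f /= f.sum()
        mean = D.T @ f                             # fitted moments
        grad = target - mean                       # likelihood equations
        if np.abs(grad).max() < tol:
            break
        hess = D.T @ (D * f[:, None]) - np.outer(mean, mean)
        step = np.linalg.solve(hess, grad)
        t = 1.0                                    # halve the step if the likelihood drops
        while log_lik(phi + t * step) < log_lik(phi) - 1e-12 and t > 1e-6:
            t /= 2.0
        phi = phi + t * step
    eta = D @ phi
    f = np.exp(eta - eta.max())
    return f / f.sum()

# Hypothetical frequencies for an 11-point score scale (0 to 10).
counts = np.array([2, 5, 11, 20, 28, 30, 25, 16, 9, 4, 2])
fitted = fit_polynomial_loglinear(counts, q=3)
```

With q = 3, the fitted mean, second moment, and third moment of the score agree with the observed ones, while the fitted frequencies are smoother than the raw counts.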

Two tests measuring similar content are positively correlated with each other; the cross-product XY in Equation 3 preserves a monotone relationship. Higher-order interaction terms can improve model-data fit to distribution tails. The interaction between X and B can be expressed as

\[
\lambda_{il}^{XB} = \phi_{1l}^{XB} i + \phi_{2l}^{XB} i^2 + \cdots + \phi_{ql}^{XB} i^q . \tag{4}
\]

M. LIOU, P. E. CHENG, and M.-Y. LI COMPARABLE SCORES USING SURROGATE VARIABLES


Equation 4 preserves the first q moments of X within the lth category. For example, if male and female students have different mean scores on Test X, then $\phi_{1l}^{XB}$ (B representing gender) improves the model-data fit. Other parameters in Equation 1 can be similarly simplified. The logarithm of the complete-data likelihood can be expressed as

\[
\log L = C + \sum_{i,j,k,l} n(i,j,k,l) \log f(i,j,k,l) , \tag{5}
\]

where C is a constant, and n(i, j, k, l) is the sample size in the (i, j, k, l)th cell. The MLE of the hth parameter can be found by solving for $\phi_h$ in the likelihood equation

\[
\frac{\partial \log L}{\partial \phi_h} = \sum_{i,j,k,l} i^u j^v k^w t^z \left[ n(i,j,k,l)/N - f(i,j,k,l) \right] \equiv S_h/N - \sum_{i,j,k,l} i^u j^v k^w t^z f(i,j,k,l) = 0 , \tag{6}
\]

where t is the score assigned to the lth category (e.g., t = 1 for male, and t = −1 for female); N is the total sample size; u, v, w, and z are integer values specified in the marginal model (e.g., for $\phi_2^{XY}$ in Equation 3, u = 2, v = 1, w = 0, and z = 0); and $S_h$ is the sufficient statistic for estimating $\phi_h$. In the common-item design, examinees who have taken Test X have an observed x score and a missing y score; examinees who have taken Test Y have an observed y score and a missing x score. The incomplete-data log-likelihood can be written as

\[
\log L^{*} = C^{*} + \sum_{i,k,l} n(i,\cdot,k,l) \log \Big[ \sum_{j} f(i,j,k,l) \Big] + \sum_{j,k,l} n(\cdot,j,k,l) \log \Big[ \sum_{i} f(i,j,k,l) \Big] , \tag{7}
\]

where n(i, ·, k, l) is the sample size for Test X, with score i on Test X, k on V, and l on B, and n(·, j, k, l) is the corresponding sample size for Test Y. Little & Rubin (1987) showed that an incomplete-data likelihood is maximized using the EM algorithm under the missing-at-random assumption. Specifically, the pth stage of the EM algorithm consists of two steps.

E-Step.

\[
S_h^{(p)} = \sum_{i,j,k,l} i^u j^v k^w t^z \, n^{(p)}(i,j,k,l) , \tag{8}
\]

where

\[
n^{(p)}(i,j,k,l) = \left[ \hat f^{(p)}(i,j,k,l) / \hat f^{(p)}(i,\cdot,k,l) \right] n(i,\cdot,k,l) + \left[ \hat f^{(p)}(i,j,k,l) / \hat f^{(p)}(\cdot,j,k,l) \right] n(\cdot,j,k,l) , \tag{9}
\]

where $\hat f^{(p)}(i,\cdot,k,l)$ and $\hat f^{(p)}(\cdot,j,k,l)$ are the MLEs of the XVB and YVB marginal probabilities, respectively. The values of $S_h$ are adjusted to the provisional n(i, ·, k, l), n(·, j, k, l), and the current estimates of $\hat f^{(p)}(i,j,k,l)$.

M-Step.

\[
\sum_{i,j,k,l} i^u j^v k^w t^z \, \hat f^{(p+1)}(i,j,k,l) = S_h^{(p)}/N . \tag{10}
\]


The $\hat\phi^{(p+1)}$ estimates are solved using the adjusted values of $S_h$ from the previous E-step. The EM cycle is repeated until the sequence of $\hat\phi^{(p)}$ becomes stable. The MLE of f(i, j, k, l) then can be used to compute the X and Y distributions,

\[
\hat F(x) = \sum_{i \le x,\, j,\, k,\, l} \hat f(i,j,k,l) , \tag{11}
\]

and

\[
\hat G(y) = \sum_{i,\, j \le y,\, k,\, l} \hat f(i,j,k,l) . \tag{12}
\]
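
The E-step and M-step in Equations 8 through 10, followed by the distribution functions in Equations 11 and 12, can be sketched in Python as follows. This is a simplified illustration rather than the authors' program: it drops the background variable B, omits the XY interaction (so X and Y are treated as conditionally independent given V), rescales scores to [0, 1] for numerical stability, and solves the M-step with damped Newton iterations. All function names are hypothetical.

```python
import numpy as np

def design_matrix(shape, exponents):
    """Columns i^u * j^v * k^w over all (i, j, k) cells, scores rescaled to [0, 1]."""
    grids = [g.ravel() / (n - 1) for g, n in zip(np.indices(shape), shape)]
    return np.column_stack([grids[0]**u * grids[1]**v * grids[2]**w
                            for u, v, w in exponents])

def cell_probs(D, phi):
    eta = D @ phi
    f = np.exp(eta - eta.max())
    return f / f.sum()

def e_step(f, n_x, n_y):
    """Equation 9 without B: spread each observed count over the missing score."""
    f_xv = f.sum(axis=1, keepdims=True)            # f(i, ., k)
    f_yv = f.sum(axis=0, keepdims=True)            # f(., j, k)
    return f / f_xv * n_x[:, None, :] + f / f_yv * n_y[None, :, :]

def m_step(D, phi, target, n_iter=50, tol=1e-10):
    """Equation 10: damped Newton steps until the fitted moments equal S_h / N."""
    def q(p):                                      # complete-data objective, up to a constant
        eta = D @ p
        return target @ p - (eta.max() + np.log(np.exp(eta - eta.max()).sum()))
    for _ in range(n_iter):
        f = cell_probs(D, phi)
        mean = D.T @ f
        grad = target - mean
        if np.abs(grad).max() < tol:
            break
        hess = D.T @ (D * f[:, None]) - np.outer(mean, mean)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        while q(phi + t * step) < q(phi) - 1e-12 and t > 1e-6:
            t /= 2.0
        phi = phi + t * step
    return phi

def fit_em(n_x, n_y, exponents, n_em=500, tol=1e-9):
    """n_x[i, k]: Group X counts by (x, v); n_y[j, k]: Group Y counts by (y, v)."""
    shape = (n_x.shape[0], n_y.shape[0], n_x.shape[1])
    D = design_matrix(shape, exponents)
    phi = np.zeros(D.shape[1])
    n_total = n_x.sum() + n_y.sum()
    for _ in range(n_em):
        f = cell_probs(D, phi).reshape(shape)
        n_full = e_step(f, n_x, n_y)               # provisional complete-data table
        target = D.T @ n_full.ravel() / n_total    # S_h / N (Equation 8)
        new_phi = m_step(D, phi, target)
        converged = np.abs(new_phi - phi).max() < tol
        phi = new_phi
        if converged:
            break
    return cell_probs(D, phi).reshape(shape)

def score_cdfs(f):
    """Equations 11 and 12: cumulative X and Y distributions from the fitted table."""
    return f.sum(axis=(1, 2)).cumsum(), f.sum(axis=(0, 2)).cumsum()
```

A model that preserves the first two moments of X, Y, and V plus the XV and YV cross products would be requested with exponents such as [(1,0,0), (2,0,0), (0,1,0), (0,2,0), (0,0,1), (0,0,2), (1,0,1), (0,1,1)], where each triple gives the powers (u, v, w) of i, j, and k.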

$\hat F(x)$ and $\hat G(y)$ can be used to determine the comparable scores between X and Y using the equipercentile method. A model’s goodness of fit can be examined by the likelihood ratio (LR) statistic

\[
LR = -2 \Big\{ \sum_{i,k,l} \left[ n(i,\cdot,k,l)/N_X \right] \log \left[ n(i,\cdot,k,l) / \hat n(i,\cdot,k,l) \right] + \sum_{j,k,l} \left[ n(\cdot,j,k,l)/N_Y \right] \log \left[ n(\cdot,j,k,l) / \hat n(\cdot,j,k,l) \right] \Big\} , \tag{13}
\]

where $N_X$ and $N_Y$ are the sizes of Groups X and Y, respectively. The degrees of freedom for LR are the number of cells in the XVB and YVB tables minus the number of estimable φs. For scored multiway tables, the degrees of freedom are normally larger than desirable. The LR statistic can be used by first fitting a simple marginal model (e.g., a model preserving the first three sample moments of X, Y, and V, along with the main effect of B), and then adding higher-order marginal moments and interactions into the model. A particular parameter should be retained in the model if it significantly reduces the LR statistic (by using a χ² test for the LR difference with 1 degree of freedom).

In the common-item design, none of the examinees have scores on both target tests. Therefore, the XY interaction in the saturated model is not estimable, nor are the XYV, XYB, and XYVB interactions. However, V and B correlate with X and Y. As suggested by Rubin & Thayer (1978), information pertaining to the XY interaction can be estimated indirectly through V and B. When XV, YV, XB, and YB interactions are already in the model, the XY interaction further reduces the size of LR.

The size of LR is not the only criterion for model selection. Other useful criteria include:
1. Examining the plots of observed and fitted distributions; a valid model should yield distributions that at least make intuitive sense.
2. Two models with different LR values can give similar equipercentile functions; a parsimonious model is preferable to more complex models.
3. Complex models can yield reasonable LR values but larger SEs for estimating comparable scores. SE plots across the score range are useful for model selection.

Asymptotic SEs of Comparable Scores

For simplicity, $\hat p \equiv \hat F(x)$. When $\hat F$ and $\hat G$ are available from the EM algorithm, the comparable score on Test Y can be found by computing

\[
\hat\xi = \hat G^{-1}(\hat p) , \tag{14}
\]

where $\hat\xi$ is the comparable score on Test Y corresponding to the integer score x on Test X.


All repetitions of observed integer scores i and j are assumed to be uniformly distributed in the unit-interval range. Consequently, the first derivatives of F and G with respect to x and y exist at all possible scores; that is,

\[
\frac{\partial F(x)}{\partial x} = f(x) = F(i) - F(i-1) > 0 , \tag{15}
\]

and

\[
\frac{\partial G(y)}{\partial y} = g(y) = G(j) - G(j-1) > 0 , \tag{16}
\]

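where i and j satisfy i − .5 < x ≤ i + .5 and j − .5 < y ≤ j + .5. Based on Bahadur’s (1966) theorem, the asymptotic SE of $\hat\xi$ is (Ghosh, 1971)

\[
\mathrm{Se}(\hat\xi) \cong \left\{ \mathrm{Var}[\hat F(x)] + \mathrm{Var}[\hat G(\xi)] - 2\,\mathrm{Cov}[\hat F(x), \hat G(\xi)] \right\}^{1/2} \Big/ g(\xi) \;\Big|_{\xi = \hat\xi} , \tag{17}
\]

where g(ξ) can be estimated by its MLE. $\hat F(x)$ and $\hat G(y)$ are functions of $\hat\phi$. The first-order variances of these functions can be approximated by

\[
\mathrm{Var}[\hat F(x)] \cong \left. \frac{\partial F(x)}{\partial \phi}^{T} \right|_{\phi = \hat\phi} \mathrm{Cov}(\hat\phi) \left. \frac{\partial F(x)}{\partial \phi} \right|_{\phi = \hat\phi} , \tag{18}
\]

\[
\mathrm{Var}[\hat G(\xi)] \cong \left. \frac{\partial G(\xi)}{\partial \phi}^{T} \right|_{\phi = \hat\phi} \mathrm{Cov}(\hat\phi) \left. \frac{\partial G(\xi)}{\partial \phi} \right|_{\phi = \hat\phi} , \tag{19}
\]

and

\[
\mathrm{Cov}[\hat F(x), \hat G(\xi)] \cong \left. \frac{\partial G(\xi)}{\partial \phi}^{T} \right|_{\phi = \hat\phi} \mathrm{Cov}(\hat\phi) \left. \frac{\partial F(x)}{\partial \phi} \right|_{\phi = \hat\phi} , \tag{20}
\]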
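
As a small numerical illustration of Equations 17 through 20, the Python sketch below combines gradient vectors of F(x) and G(ξ) with a parameter covariance matrix to obtain the SE of a comparable score. The function name and the example inputs are hypothetical. How the gradients and the covariance matrix are obtained from a fitted model is outside the scope of this sketch.

```python
import numpy as np

def se_comparable_score(grad_f, grad_g, cov_phi, g_density):
    """Delta-method SE of the comparable score.

    grad_f, grad_g : gradients of F(x) and G(xi) with respect to phi, evaluated at phi-hat
    cov_phi        : estimated variance-covariance matrix of phi-hat
    g_density      : g(xi) evaluated at xi-hat, as in Equation 16
    """
    grad_f = np.asarray(grad_f, dtype=float)
    grad_g = np.asarray(grad_g, dtype=float)
    cov_phi = np.asarray(cov_phi, dtype=float)
    var_f = grad_f @ cov_phi @ grad_f        # Equation 18
    var_g = grad_g @ cov_phi @ grad_g        # Equation 19
    cov_fg = grad_g @ cov_phi @ grad_f       # Equation 20
    return np.sqrt(var_f + var_g - 2.0 * cov_fg) / g_density   # Equation 17

# Hypothetical two-parameter example with an identity covariance matrix.
se = se_comparable_score([1.0, 0.0], [0.0, 1.0], np.eye(2), 0.5)
```

The quantity under the square root equals the quadratic form of the gradient difference, so it is nonnegative whenever the covariance matrix is positive semidefinite, and the SE shrinks as the two gradients become more alike.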

In Equations 18 through 20, T denotes the transpose of a matrix, φ is the parameter vector, and $\mathrm{Cov}(\hat\phi)$ approximates the variance-covariance matrix of the MLEs. The variance-covariance matrix can be estimated by the inverse of the observed information matrix. The observed information matrix at $\phi = \hat\phi$ can be obtained by computing the second derivative of the incomplete-data log-likelihood in Equation 7.

Empirical Example

Method

The data were scores on two forms, Form X and Form Y, of a sociology test designed for 12th-grade high school students. Form X had 45 items, and Form Y had 46 items. They were administered to nonequivalent groups in three schools. Nineteen additional common items were administered to all examinees. The common items were designed to have content and item difficulties similar to the target forms. Two scores were computed for each examinee: number-correct scores on the common items (V) and on the appropriate target test. Also used were examinees’ average school performance (scores ranging from 0–100) in Geography for Grades 10 and 11. Different methods (CEM, FEM, and the imputation approach using either common-item or Geography scores) were used to estimate comparable scores on Test Y for each integer score on Test X.


The Geography score was considered a surrogate for the common-item score. The design, sample sizes, and basic statistics for the empirical data are listed in Table 1. As Table 1 shows, V had higher correlations with target test scores than did the Geography scores. Form X examinees had slightly higher average V scores than did Form Y examinees; however, the average Geography scores for both examinee groups were similar.

Table 1
Research Design, Mean (M) and SD for the Test Scores (X, Y, and V), and Correlations Between Target Test Scores and Common-Item Scores (r_{X|Y,V}) and Target Test Scores and Geography Scores (r_{X|Y,G})

School    N   Form X M (SD)   Form Y M (SD)   V M (SD)       Geography M (SD)   r_{X|Y,V}   r_{X|Y,G}
1       103   23.31 (5.33)    -               11.05 (2.76)   70.88 (9.41)       .70         .59
1       101   -               23.73 (5.22)    10.52 (2.50)   70.42 (7.74)       .49         .55
2        74   16.51 (5.52)    -                8.32 (2.74)   64.70 (5.52)       .63         .63
2        74   -               15.60 (5.69)     7.31 (2.78)   66.12 (6.71)       .75         .71
3       150   27.96 (5.61)    -               13.54 (2.79)   73.63 (7.34)       .69         .65
3       149   -               27.97 (6.10)    12.96 (2.79)   73.32 (6.37)       .74         .55
Total   327   23.90 (7.16)    -               11.57 (3.45)   70.75 (8.45)       .80         .69
Total   324   -               23.74 (7.45)    10.91 (3.50)   70.77 (7.44)       .80         .65

A simple marginal model was fitted by preserving the first three moments for X, Y, and V, and the main effect for B (the cross-classification of examinees according to schools). Higher-order moments and interactions then were added to the model. A parameter was retained (by using a χ² test) if it significantly reduced the LR statistic or deleted if it resulted in unreasonable score distributions (e.g., extremely skewed distributions).

Results

Table 2 lists fit of some of the models to the data and their corresponding LR statistics. For example, Model C1 fit the first three marginal moments for X, Y, and V, all possible two-way interactions, and some three-way interactions. The XB interaction preserved individual mean scores on Test X for each school, and XVB preserved the individual cross products, XV. The LR statistic for Model C2 was significantly smaller than that for Model C1. The other models in Table 2 did not improve the model-data fit.

Table 2
Models Fit to the Data, Number of Parameters, and LR Test Statistics

Model   Parameters   LR         Response Model
C1      27            407.616   (X, X^2, X^3), (Y, Y^2, Y^3), (V, V^2, V^3), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
C2      30            392.542   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
C3      33            388.076   (X, X^2, X^3, X^4, X^5), (Y, Y^2, Y^3, Y^4, Y^5), (V, V^2, V^3, V^4, V^5), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
C4      34            390.780   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV, X^2 V, XV^2), (YV, Y^2 V, YV^2), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
C5      36            389.692   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV), (YV), (XY), (XB, X^2 B), (YB, Y^2 B), (VB, V^2 B), (XYB), (XVB), (YVB)
G1      27           1373.489   (X, X^2, X^3), (Y, Y^2, Y^3), (V, V^2, V^3), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
G2      30           1336.179   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
G3      33           1333.173   (X, X^2, X^3, X^4, X^5), (Y, Y^2, Y^3, Y^4, Y^5), (V, V^2, V^3, V^4, V^5), (B), (XV), (YV), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
G4      34           1326.020   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV, X^2 V, XV^2), (YV, Y^2 V, YV^2), (XY), (XB), (YB), (VB), (XYB), (XVB), (YVB)
G5      36           1327.716   (X, X^2, X^3, X^4), (Y, Y^2, Y^3, Y^4), (V, V^2, V^3, V^4), (B), (XV), (YV), (XY), (XB, X^2 B), (YB, Y^2 B), (VB, V^2 B), (XYB), (XVB), (YVB)

Figures 1a and 1b show incomplete data (observed) distributions for Forms X and Y, respectively, and the complete data (fitted) distributions estimated using Model C2. Model C2 shifted the Form X distribution to the lower tail and the Form Y distribution to the upper tail.

Figure 1. Score Distributions for Forms X and Y for Models C2 and G1. Panels: a. Model C2, Form X; b. Model C2, Form Y; c. Model G1, Form X; d. Model G1, Form Y.

Comparable scores on Form Y also were found for each integer score on Form X using the equipercentile method. The equipercentile functions are plotted, along with the SEs of comparable scores, in Figures 2a and 2b for Models C1–C5. All models resulted in similar equipercentile functions across the x score range, except Model C1, which had the largest LR value. Model C1 had the smallest SEs across the score range, followed by Model C2 and other more complex models, respectively. Model C2 was less contaminated by sampling error and yielded an equipercentile
function similar to that of the more complex models. Thus, Model C2 was the preferred model for these data.

Different models then were fit to the joint distribution between X, Y, B, and Geography scores. Table 2 also lists some of these models and their corresponding LR statistics. (Note that V denotes Geography scores for these analyses.) Figures 1c and 1d give the fitted distributions for Forms X and Y, based on Model G1. Comparison of Figures 1a and 1b with 1c and 1d suggests that the fitted distributions using common-item and Geography scores differed only slightly. Figure 2c shows that equipercentile functions were close to each other for all these models, except for slight differences at the two extremes. SE plots (Figure 2d) show that Model G1 had the smallest SEs across the x score range. Based on the LR statistics, Model G2 is preferable to Model G1. However, Model G1 yielded an equipercentile function similar to those of more complex models, and it was less contaminated by sampling error. Thus, Model G1 was preferred for these data because it had fewer parameters for scaling comparable scores.

Figure 2. Form X–Form Y Equipercentile Functions and SEs of Comparable Scores for Models C1–C5 and G1–G5. Panels: a. Equipercentile Functions, Models C1–C5; b. SEs, Models C1–C5; c. Equipercentile Functions, Models G1–G5; d. SEs, Models G1–G5.

Comparable scores also were computed with the same data using the CEM and FEM based on common-item scores. A comparison of the equipercentile functions based on CEM and FEM is
plotted in Figure 3, along with results for Models C2 and G1. Due to zero frequencies at the lower and upper tails of the number-correct score distributions, comparable scores and SEs could not be estimated well for the CEM and FEM; therefore, Figure 3 shows results only in the score range of (5, 40) for all methods.

Figure 3. Form X–Form Y Equipercentile Functions and SEs of Comparable Scores for Different Methods. Panels: a. Equipercentile Functions; b. SEs.

The imputation method using common-item scores theoretically is closer to the smoothed FEM (Liou & Cheng, 1995a). Figure 3a shows that the equipercentile functions for Model C2 and the FEM were similar to each other, except for larger sampling fluctuations associated with the FEM. Comparable scores based on Model G1 did not differ much from those of the FEM. The equipercentile function plots suggest that the difference between Models C2 and G1 was less significant than that between the CEM and FEM. SEs of comparable scores using the CEM and FEM were computed from equations similar to Equation 17 (Liou & Cheng, 1995b; Liou, Cheng, & Johnson, 1997). SEs are shown in Figure 3b. Both conventional methods resulted in much larger SEs across the x score range.

Conclusions

Common-item scores can have small correlations with target scores (Wright & Dorans, 1993). The proposed methodology allows the inclusion of other covariates, along with common-item
scores, into the model to improve the prediction of group differences. The empirical example suggested that a Geography score worked as well as the common-item score for imputing missing data, even though the two variables had lower correlations with target tests. Sample-selection bias was not as serious as had been expected. It would be useful to determine whether the use of surrogate variables for imputing missing scores is effective when target groups differ in ability to a larger degree.

References

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton NJ: Educational Testing Service. [Reprinted from R. L. Thorndike (Ed.), Educational Measurement (2nd ed.), Washington DC: American Council on Education, 1971, 508–600].
Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.
Ghosh, J. K. (1971). A new proof of the Bahadur representation of quantiles and an application. Annals of Mathematical Statistics, 42, 1957–1961.
Holland, P. W., & Thayer, D. T. (1987). Notes on the use of loglinear models for fitting discrete probability distributions (Technical Report No. 87-79). Princeton NJ: Educational Testing Service.
Liou, M., & Cheng, P. E. (1995a). Equipercentile equating via data-imputation techniques. Psychometrika, 60, 119–136.
Liou, M., & Cheng, P. E. (1995b). Asymptotic standard error of equipercentile equating. Journal of Educational and Behavioral Statistics, 20, 259–286.
Liou, M., Cheng, P. E., & Johnson, E. G. (1997). Standard errors of the kernel equating methods under the common-item design. Applied Psychological Measurement, 21, 349–369.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Little, R. J. A., & Rubin, D. B. (1994). Test equating from biased samples, with applications to the Armed Services Vocational Aptitude Battery. Journal of Educational and Behavioral Statistics, 19, 309–335.
Livingston, S. A. (1993). Small-sample equating with loglinear smoothing. Journal of Educational Measurement, 30, 23–29.
Marco, G. L., Abdel-fattah, A. A., & Baron, P. A. (1992). Methods used to establish score comparability on the enhanced ACT assessment and the SAT (College Board Report No. 92-3). Princeton NJ: Educational Testing Service.
Rosenbaum, P. R., & Thayer, D. (1987). Smoothing the joint and marginal distributions of scored two-way contingency tables in test equating. British Journal of Mathematical and Statistical Psychology, 40, 43–49.
Rubin, D. B., & Thayer, D. (1978). Relating tests given to different samples. Psychometrika, 43, 3–10.
Wright, N. K., & Dorans, N. J. (1993). Using the selection variable for matching or equating (Research Report No. 92-3). Princeton NJ: Educational Testing Service.

Acknowledgments

The authors thank Tzu-Jung Hsiao, Ja-Yi Wu, the editor, and reviewers for helpful comments. They also thank Zheng-Feng Tsai and Chieh-Jung Wu for writing the programs. This research was supported by grant NSC86-2413-H-001-002 from the National Science Council and grant R-84-091 from the College Entrance Examination Center, ROC.

Author’s Address

Send requests for reprints or further information to Michelle Liou, Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, ROC. Email: [email protected].
