Comparability of essay question variants


Assessing Writing 16 (2011) 237–255


Comparability of essay question variants
Brent Bridgeman*, Catherine Trapani, Jennifer Bivens-Tatum
Educational Testing Service, Princeton, NJ 08541, USA


Article history: Received 14 October 2010; Received in revised form 6 June 2011; Accepted 10 June 2011; Available online 5 August 2011.
Keywords: High-stakes essay assessment; Analytical writing; Essay question variants

Abstract

Writing task variants can increase test security in high-stakes essay assessments by substantially increasing the pool of available writing stimuli and by making the specific writing task less predictable. A given prompt (parent) may be used as the basis for one or more different variants. Six variant types based on argument essay prompts from a large-scale, high-stakes North American writing assessment and six based on issue prompts from the same test were created and evaluated in the research section of the test administrations in the winter of 2009. Examinees were asked to volunteer to write an essay on one of the new prompt/variant tasks. Essays were obtained from 7573 examinees for argument prompts and 10,827 examinees for issue prompts. Results indicated that all variant types produced reasonably similar means, standard deviations, and rater reliabilities, suggesting that the variant strategy should be useable for operational administrations in high-stakes essay assessments. Variant type did not interact with gender, ethnicity, or language (self-report that English or another language is the examinee's "best" language).

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

In high-stakes essay assessments, some test takers have subverted the ability of the test to assess writing and thinking skills by memorizing substantial chunks of existing well-written texts and including this memorized material in their essays.

∗ Corresponding author at: Educational Testing Service, 09-R, Princeton, NJ 08541, USA. Tel.: +1 609 734 5767; fax: +1 609 734 1755. E-mail address: [email protected] (B. Bridgeman). 1075-2935/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.asw.2011.06.002


The essay questions are often sufficiently broad that this memorized text can be inserted after only a few transition sentences that link the memorized text to the actual question. In order to discourage this practice and maintain the validity of the inferences that can be made from the test scores, writing test developers can create new essay questions that require responses that are much more closely tied to the specific content of the essay questions. Because writing good essay questions is labor-intensive, an approach that enables many unique questions to be produced from a single prompt has the potential to reduce costs at the same time as it improves test validity by discouraging the use of pre-memorized material.

This approach to question writing starts with a prompt "parent" that can generate several different variants by specifying different writing tasks in response to the same stimulus. Thus, for example, one examinee may be shown a prompt that presents an argument and asks the examinee to explain the unstated assumptions:

The following appeared in a memo to the board of directors of Bargain Brand Cereals.

"One year ago we introduced our first product, Bargain Brand breakfast cereal. Our very low prices quickly drew many customers away from the top-selling cereal companies. Although the companies producing the top brands have since tried to compete with us by lowering their prices and although several plan to introduce their own budget brands, not once have we needed to raise our prices to continue making a profit. Given our success in selling cereal, we recommend that Bargain Brand now expand its business and begin marketing other low-priced food products as quickly as possible."

Write a response in which you examine the unstated assumptions of the argument above. Be sure to explain
• how the argument depends on those assumptions and
• what the implications are if the assumptions prove unwarranted.

Another examinee would see the same memo to the Bargain Brands board of directors, but be asked to evaluate the recommendation:

Write a response in which you discuss what questions would need to be answered in order to decide whether the recommendation and the argument on which it is based are well justified. Be sure to explain how the answers to these questions would help to evaluate the recommendation.

This approach to the use of variants makes the writing task less predictable for the examinee in a way that, it is hoped, will reduce the use of pre-memorized material. Although the use of variants with high-stakes writing prompts has not been previously investigated, there is a long history of exploring different factors that might make prompts differentially difficult. Kroll and Reid (1994) identified six factors that test developers should consider in creating prompts of comparable difficulty: contextual variables, content variables, linguistic variables, task variables, rhetorical variables, and evaluation variables. Huang (2008) identified factors that examinees identified as making some prompts apparently more difficult than others; these factors included knowledge about the topic, interest in the topic, experience, and data availability. Hinkel (2009) provides a good review of the many prompt effects that must be controlled in creating prompts that are accessible to a broad range of L2 writers. Hamp-Lyons and Mathias (1994) found that test developers were generally able to agree on prompt difficulty, but, surprisingly, examinee scores were not closely related to these difficulty ratings and even tended to be in the opposite direction, with higher scores from the more difficult prompts.
The authors suggested that raters were able to adjust for the apparent difficulty in the way that they assigned scores, and that it is possible that "essay readers are consciously or unconsciously compensating in their scoring for relative prompt difficulty based on their own, internalised difficulty estimates" (p. 59). Perhaps partly because of this compensation ability combined with carefully written prompts, no significant differences were found among five prompts on the Canadian Academic English Language Assessment in a sample of 254 ESL university applicants (Jennings, Fox, Graves, & Shohamy, 1999). Similarly, a study of 47 TOEFL CBT writing prompts in a sample of over 600,000 examinees found relatively trivial differences in prompt difficulty (Breland, Lee, Najarian, & Muraki, 2004).


It is important not to confuse statistical significance with practical significance; because of the huge sample size, very small differences among prompts were still statistically significant. This essentially replicated earlier results with the Test of Written English, which also found no practically significant difficulty differences among eight prompts with about 10,000 essays per prompt (Golub-Smith, Reese, & Steinhaus, 1993), and similar results were found for essay prompts on the Graduate Record Examination (Schaeffer, Briel, & Fowles, 2001). Although careful attention to controlling the factors that can affect prompt difficulty appears to be generally successful in a number of high-stakes testing programs, there is certainly no guarantee that prompt effects will always be trivial (Lee & Anderson, 2007). Given careful attention to the factors that can drive prompt difficulty in the initial creation of the prompts, and given the pre-testing of prompts, it still remains unclear whether quite different questions from the same base prompt will yield comparable results.

The idea of generating a variety of different items with comparable difficulty levels from a single parent item has been extensively studied in contexts other than the assessment of writing skills (e.g., Bejar, 1993; Bejar & Braun, 1999; Embretson, 1998; Hively, Patterson, & Page, 1968; Morley, Lawless, & Bridgeman, 2005). Bejar and Braun (1999) showed that an item variant approach could be used to produce items for an architect licensing exam. In this exam, variants were made of complex architectural tasks, such as positioning a building on a lot in such a way that a variety of zoning restrictions could be accommodated. Although the basic task would remain the same from one test to another, the specific zoning restrictions to be accommodated might change. Bridgeman, Bejar, and Friedman (1999) explored whether any of the specific architectural item variants acted differently in different population subgroups. In a study of math items on the Graduate Record Examination (GRE®), item variants were shown to provide a way of producing many comparable items from a single parent item (Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2002). Similarly, in a study of GRE verbal analogy and antonym items, the potential of variants for efficient item generation was demonstrated (Scrams, Mislevy, & Sheehan, 2002). Although these apparent successes with variants are encouraging, they provide little assurance that variants will work effectively for essay questions, in which the task demands may be much more difficult to control than in simple verbal and math items.

In the current study, six argument variant types and six issue variant types from the analytical writing section of a high-stakes university-level examination were evaluated. The test is intended to assess "critical thinking and analytical writing skills, specifically the test taker's ability to articulate complex ideas clearly and effectively" (Educational Testing Service, 2010). Samples of the argument variant types are in Appendix A and samples of the issue variant types are in Appendix B. Because there are no equating procedures for writing prompts and their variants, fairness considerations demand that prompts/variants be of comparable difficulty and yield comparable score distributions.
Although the study was designed primarily to focus on potential differences among the different variant types, the design also allows investigation of differences among variants belonging to the same parent prompt. Thus, the research questions to be addressed in this study are as follows:
1) Are score distributions (means and dispersion) comparable across prompts and variants?
2) Are any variant types differentially difficult for particular gender and ethnic subgroups or for examinees whose best language is not English?
3) Is reader reliability consistent across prompts and variant types?

2. Method

2.1. Data source

Although many prompts and variants on the current target test have been pretested, the use of this existing data was problematic in two respects. First, raters were initially inexperienced with the different task demands of the new prompt types, and the rater trainers were not confident that these ratings appropriately reflected the new scoring rubrics. Rater training and experience have since improved, so this should no longer be a problem. If rater training/experience were the only issue, existing essays could simply have been re-rated.


But the second problem has to do with the essays themselves, and could not be addressed by simple re-rating. The problem with the essays is that examinees who volunteered to take these experimental prompts/variants at the end of a regular administration did not appear to be aware of the changes in the task demands. The screen that introduced the experimental section mentioned "a writing task that the GRE program is considering for possible use in future tests," but did not emphasize the substantial differences in the task demands compared to the existing prompt directions. Raters reported that many examinees responded to the prompts using the old task directions, rather than focusing on the much more specific task directions that were provided. Given these problems with the existing data, a new data collection was conducted in the research section of operational administrations.

2.2. Procedures

Examinees who took the operational test at computer testing centers in the winter of 2009 were invited to participate in the research project immediately following completion of the regular GRE. If the examinee pressed "proceed" following the invitation screen, the following message appeared:

IMPORTANT
The instructions for this task differ from the standard instructions that appeared at the beginning of your GRE test. Please follow the instructions for this task carefully.
I understand that the writing task I am about to see is different from the writing tasks on the current test [click here to begin the new essay section]

About 25% of the examinees pressing the "proceed" button did not make any attempt to answer the essay question; they would either write nothing or would write something like "I decided not to do this." These examinees were removed from all subsequent analyses. We considered an additional motivational screening that would remove all examinees with operational writing scores of 5.0 or higher who had scores of less than 3.0 on the experimental section. But of the 2163 examinees with operational scores of 5.0 or higher, only 157 had experimental scores less than 3.0; as it was possible that some motivated students would simply find the experimental task to be much more difficult than the current task, we decided not to screen out any of these cases. After screening, 18,400 usable responses were obtained. Because more prompt/variant combinations were available for issue essays, sample sizes were somewhat larger for issue than for argument prompts/variants (7573 argument and 10,827 issue). Essays were evaluated on six-point rating scales in an on-line scoring environment by two independent raters who had been trained on the scoring rubrics for the new variant types. The two ratings were averaged, with normal adjudication rules applied such that ratings that differed by more than a point would go to a third reader and the two closest scores would be averaged, as sketched below.
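To make the adjudication rule concrete, here is a minimal sketch of that score-combination logic; the function name, the get_third_rating callback, and the tie-handling are illustrative assumptions, not the operational ETS scoring code.

```python
def combine_ratings(r1: float, r2: float, get_third_rating=None) -> float:
    """Combine two independent ratings on the 1-6 scale.

    If the ratings differ by more than one point, a third rating is obtained
    (via the supplied callback) and the two closest scores are averaged,
    following the adjudication rule described in Section 2.2.
    """
    if abs(r1 - r2) <= 1:
        return (r1 + r2) / 2.0

    # Ratings differ by more than a point: send the essay to a third reader.
    r3 = get_third_rating()
    pairs = [(r1, r2), (r1, r3), (r2, r3)]
    # Average the pair of scores with the smallest absolute difference.
    closest = min(pairs, key=lambda p: abs(p[0] - p[1]))
    return sum(closest) / 2.0


# Example: raters give 3 and 5, the third reader gives 4.
print(combine_ratings(3, 5, get_third_rating=lambda: 4))
# Prints 3.5 here; ties between equally close pairs are resolved arbitrarily
# in this sketch, since the paper does not specify that detail.
```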
2.3. Design

Participants were randomly assigned to one of the prompt/variant combinations. A fully crossed design was not possible because every prompt (parent) cannot generate every possible variant type, but examples for every variant type were generated from at least two different parents. The design for the administration of the prompts/variants, with the number of examinees in each cell, is shown in Table 1 for the argument prompts/variants and in Table 2 for the issue prompts/variants.

2.4. Analysis

Means and score distributions for each variant type were calculated, and analysis of covariance (with operational writing score as the covariate) was used to evaluate potential interactions with gender, ethnicity, and language (English-best vs. English-not-best); a sketch of this kind of model follows below. Ethnicity and language were obtained from a voluntary questionnaire that examinees are asked to complete when they register for the test; the language question asks examinees to indicate whether their "best" language is English or another language. Only U.S. citizens are asked to respond to the ethnicity question, but the best-language question is asked of all examinees.
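The following is a minimal sketch of the kind of ANCOVA described above, using statsmodels; the data file, column names, and model formula are assumptions for illustration rather than the authors' actual analysis scripts.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per examinee, with the experimental essay
# score, the operational writing score (the covariate), and group labels.
df = pd.read_csv("variant_study_scores.csv")  # assumed file layout

# ANCOVA: operational writing score as covariate, crossed fixed factors for
# variant type, gender, and ethnicity (mirroring the 2 x 4 x 6 design).
model = smf.ols(
    "essay_score ~ op_writing + C(variant_type) * C(gender) * C(ethnicity)",
    data=df,
).fit()

# Type II ANOVA table: the variant-type main effect and its interactions
# with gender and ethnicity are the terms of interest.
print(sm.stats.anova_lm(model, typ=2))
```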

Table 1. Number of examinees for each argument prompt/variant.

Variant            Prompt (parent)
                     1     2     3     4     5     6     7     8     9    10    11    12    13    14   Total
Alt explanations   337     0     0     0   323   319     0   263   246     0     0     0     0     0    1488
Assumptions          0     0     0     0   271     0     0   263     0     0     0   211     0   315    1060
Evidence           316     0     0   300     0     0   238   310   319     0     0     0     0     0    1483
Prediction           0     0     0     0     0     0     0     0     0     0     0   269     0   497     766
Rec/result           0     0     0     0     0   243     0   323     0     0     0     0     0     0     566
Eval recommend       0   301   360   367     0   265     0     0     0   351   311     0   255     0    2210
Study total        653   301   360   667   594   827   238  1159   565   351   311   480   255   812    7573

Table 2. Number of examinees for each issue prompt/variant.

Variant           Prompt (parent)
                    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19   Total
Claim/reason        0     0     0     0     0     0   280     0   345     0   301     0   241     0     0     0     0     0   315    1482
Positions         531     0     0   532     0   561     0     0     0     0     0     0     0     0   506     0     0     0     0    2130
Generalization      0     0   325     0   285     0     0   318     0   328     0     0     0     0     0     0     0     0     0    1256
Policy rec        337     0   307     0     0     0     0     0     0     0     0   316     0   331     0   281   380   325     0    2277
Pos/counter       349     0     0     0     0     0   251   268     0     0     0   297     0     0     0     0     0   289     0    1454
Recommend         334   316     0     0     0     0     0     0     0   334     0   260     0   361   273     0     0   350     0    2228
Study total      1551   316   632   532   285   561   531   586   345   662   301   873   241   692   779   281   380   964   315   10827

Means and standard deviations for each unique parent/variant combination were calculated. Statistical tests of every pairwise comparison would be cumbersome and not very informative. Because individual (parent/variant) cell sizes were generally 250 or higher and standard deviations were all close to 0.9, the standard error of the difference between cell means is about 0.08; using a two-standard-error criterion, mean differences less than 0.16 may be considered to be both statistically and practically trivial (the calculation is worked through below). Score distributions (i.e., number of examinees at each score level) were computed for each variant type. Rater reliability (quadratic weighted kappa and % exact agreement) was evaluated separately for each parent/variant combination.
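As a worked version of this criterion, assuming two independent cells of roughly n = 250 with standard deviations of about 0.9 (the values reported above):

$$
SE_{\mathrm{diff}} \;=\; \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
\;\approx\; \sqrt{\frac{0.9^2}{250} + \frac{0.9^2}{250}}
\;\approx\; 0.080,
\qquad
2 \times SE_{\mathrm{diff}} \;\approx\; 0.16 .
$$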
3. Results and discussion

Means for each parent/variant combination for argument prompts are presented in Table 3. Standard deviations for each prompt/variant combination were in the 0.8–0.9 range, with the single exception of the alternate explanations variant of Prompt 8, which had an SD of 0.7. Mean differences between prompts within variant type were generally quite small; an exception was the two prompts in the prediction variant type, with a difference of 0.25. As indicated in the Total column at the far right of Table 3, differences across variant types were also small; the largest difference on the 1–6 scale was 0.20 (or a standardized difference of 0.24). Both prompts in the "evaluation of a recommendation/predicted result" (Rec/Result) variant type had higher means than any of the five prompts in the Evidence variant type. Three of the four Assumptions variants had higher means than any of the Evidence variant types. The grand mean across all argument variant types was 2.95, and the largest discrepancy from the grand mean was 0.13 for the Rec/Result variant type. For comparison, Table 3 also shows the variation across prompts for the regular operational administration of the ten prompts that were used both operationally and in the study. This level of variation is consistent with that noted in the experimental studies prior to the introduction of writing to the GRE (Schaeffer et al., 2001). Note that the operational means were higher because the study data were collected in the winter, when mean scores tend to be lower, and because motivation levels were higher for the operational essays. Nevertheless, the variation across prompts for the study sample and the operational sample was nearly identical (SDs of 0.082 and 0.081, respectively).

Results for the issue prompts are in Table 4. The grand mean for the issue variant types (2.93) was very close to the grand mean for the argument variant types (2.95). The largest discrepancy from the issue grand mean was 0.12 for the relatively easy Pos/Counter variant type. Three issue prompt/variants had means of 2.74, and the highest mean was 3.17; most means fell between 2.80 and 3.04. Standard deviations for 30 of the 32 prompt/variant combinations were 0.8 to 0.9; the two exceptions were both of the "position with counterarguments" (Pos/Counter) variant type, with SDs of 0.7 and 1.0 for Prompts 7 and 12, respectively. At the variant type level, means ranged from 2.83 to 3.05, and standard deviations for all variant types were 0.9.

3.1. ANCOVAs for argument variant types

Means, SDs, and Ns for gender and ethnic subgroups on the argument variant types are in Table 5. The 2 (gender) × 4 (ethnicity [Asian, Black, Latino, White]) × 6 (variant type) fixed-effects ANCOVA (with operational writing score as the covariate) revealed a small but statistically significant variant type effect (F(5, 5738) = 3.60, p = .009), which is consistent with the small mean differences noted in Table 3. There were no significant interactions of variant type with either gender or ethnicity (all ps > .29). Similarly, in the ANCOVA comparing examinees whose best language was English with English-not-best examinees, there was a significant main effect for variant type (F(5, 7059) = 10.72, p < .001), but the interaction of language with variant type was not significant (F(5, 7059) = 1.20, p = .304).

Table 3. Means for each argument prompt/variant.
(The individual prompt-by-variant cell means could not be recovered from this copy; the Total column, giving the mean for each variant type across prompts, was:)

Variant                  Total
Alt explanations          2.89
Assumptions               3.01
Evidence                  2.88
Prediction                2.88
Rec/result                3.08
Eval recommend            2.97
Study total (a)           2.95
Operational parent (b)    3.81

(a) Mean of variant scores within prompt.
(b) Mean prompt score from operational administrations from August 2006 to September 2007.

Table 4. Means for each issue prompt/variant.
(The individual prompt-by-variant cell means could not be recovered from this copy; the Total column, giving the mean for each variant type across prompts, was:)

Variant                  Total
Claim/reason              2.83
Positions                 2.89
Generalization            2.99
Policy rec                2.97
Pos/counter               3.05
Recommend                 2.98
Study total (a)           2.95
Operational parent (b)    3.87

(a) Mean of variant scores within prompt.
(b) Mean prompt score from operational administrations from August 2006 to September 2007.

Table 5. Gender by ethnicity by variant type for argument variants.
(Variant columns: Alt explanations, Assumptions, Evidence, Prediction, Rec/result, Recommendation.)

Group              Stat   Alt expl  Assumpt  Evidence  Predict  Rec/res  Recomm
Female, Af Am      N           101       61        96       63       39     134
                   Mean        2.7      2.8       2.7      2.7      3.0     2.8
                   SD          0.7      0.9       0.6      0.8      0.6     0.8
Female, Asian      N            58       38        54       27       25      91
                   Mean        2.9      3.0       2.9      3.2      3.1     3.0
                   SD          0.8      0.8       0.8      0.9      0.8     0.8
Female, Hispanic   N            63       30        56       28       22      82
                   Mean        2.8      2.9       2.8      3.0      3.0     2.9
                   SD          0.7      0.9       0.8      0.7      0.7     0.8
Female, White      N           510      376       511      274      197     793
                   Mean        3.1      3.2       3.1      3.1      3.4     3.2
                   SD          0.8      0.9       0.8      0.9      0.8     0.8
Male, Af Am        N            32       23        49       16       15      47
                   Mean        2.8      2.8       2.7      2.7      2.5     2.7
                   SD          0.8      0.8       0.8      0.9      0.9     0.8
Male, Asian        N            55       43        59       31       17      52
                   Mean        2.7      2.9       2.6      2.7      2.7     2.6
                   SD          0.9      1.0       0.9      0.9      0.8     0.9
Male, Hispanic     N            33       26        37       13       12      43
                   Mean        2.9      3.0       2.8      2.8      3.0     3.1
                   SD          0.7      0.8       1.0      1.0      0.9     0.9
Male, White        N           281      206       274      145      104     439
                   Mean        3.1      3.3       3.0      3.1      3.2     3.2
                   SD          0.8      0.9       0.8      0.8      0.7     0.8

Table 6. Gender by ethnicity by variant type for issue variants.
(Variant columns: Claim/reason, Competing positions, Generalization, Policy rec, Pos/counter, Recommendation.)

Group              Stat   Claim/r  Comp pos  General  Policy  Pos/c  Recomm
Female, Af Am      N           87       144       75     137     93     145
                   Mean       2.7       2.7      2.9     2.9    2.9     2.8
                   SD         0.8       0.8      0.8     0.9    0.7     0.9
Female, Asian      N           62        81       52      98     48      85
                   Mean       2.8       2.9      2.8     3.0    3.0     3.1
                   SD         0.8       0.8      1.0     0.9    0.8     1.0
Female, Hispanic   N           54        92       36      90     63      95
                   Mean       2.7       2.9      3.0     3.0    3.0     3.0
                   SD         0.7       0.8      0.8     0.8    0.8     0.8
Female, White      N          507       764      440     787    513     758
                   Mean       3.0       3.1      3.2     3.2    3.2     3.2
                   SD         0.9       0.8      0.9     0.8    0.9     0.9
Male, Af Am        N           44        35       22      52     35      64
                   Mean       2.8       2.6      2.9     2.9    2.9     2.7
                   SD         0.9       0.8      0.8     0.9    0.7     0.9
Male, Asian        N           61        59       51      75     44      77
                   Mean       2.7       2.8      2.8     2.7    2.7     2.6
                   SD         0.9       1.0      0.8     1.0    1.0     0.9
Male, Hispanic     N           29        46       32      44     30      43
                   Mean       2.6       3.0      2.6     2.8    3.1     2.8
                   SD         0.9       0.9      0.9     1.0    0.7     1.0
Male, White        N          295       398      229     434    258     440
                   Mean       3.1       3.1      3.3     3.2    3.4     3.2
                   SD         0.9       0.8      0.9     0.8    0.9     0.9

Table 7. Score distributions for argument variant types: % of examinees at each score.

Variant              1      2      3      4      5     6
Alt explanations    3.8   28.8   45.0   18.4    3.5   0.4
Assumptions         3.4   26.2   40.6   23.0    5.9   0.9
Evidence            4.7   27.8   46.9   16.1    3.8   0.8
Prediction          5.0   28.1   44.1   17.9    5.1   0.0
Rec/result          2.7   21.9   42.9   25.4    6.5   0.5
Eval recommend      3.9   25.3   44.7   20.5    5.2   0.4

Table 8. Score distributions for issue variant types: % of examinees at each score.

Variant              1      2      3      4      5     6
Claim/reason        8.0   26.9   42.6   18.9    3.4   0.4
Positions           5.5   26.2   44.0   20.9    3.3   0.1
Generalization      7.6   21.4   41.3   25.0    4.6   0.1
Policy rec          6.2   23.3   41.6   25.1    3.4   0.5
Pos/counter         6.2   18.5   48.0   23.2    3.4   0.7
Recommend           7.0   21.6   42.4   24.2    4.5   0.4

3.2. ANCOVAs for issue variant types

Table 6 shows means, SDs, and Ns for gender and ethnic subgroups for the issue variant types. The 2 (gender) × 4 (ethnicity [Asian, Black, Latino, White]) × 6 (variant type) ANCOVA (with operational writing score as the covariate) once again revealed a small but statistically significant variant type effect (F(5, 8163) = 9.04, p < .001), but there was no significant interaction of variant type with either gender or ethnicity (all ps > .17), suggesting that no variant type favored (or disfavored) any gender or ethnic group. The ANCOVA comparing English-best with English-not-best examinees generally mirrored the results for the argument variant types. Variant type yielded a significant main effect (F(1, 10075) = 8.67, p < .001), but the F for the interaction of variant type with native language was less than 1, suggesting that no variant type was differentially difficult for non-native speakers of English.

3.3. Score distributions

Rating score distributions by variant type were created using scores from only one rater, to simplify the presentation by avoiding half-point increments; adding the half points would not affect any conclusions. Distributions for the argument variant types are in Table 7 and the distributions for the issue variant types are in Table 8. Distributions were comparable across the different variant types. The most notable feature of these distributions is the low frequency of scores of 5, and the extremely low frequency of scores of 6. Several factors may contribute to these low frequencies: (a) very high scores are rare for unmotivated examinees in a no-stakes test, (b) the variant types are new and unfamiliar for examinees who have not had a chance to practice these new variant types, and (c) the variant types are also unfamiliar to the raters who, although trained on the new variants, do not have extensive experience with them. The modal score for every variant type was a 3.
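A minimal sketch of how such percentage distributions could be tabulated from a per-examinee data set; the file and column names are hypothetical, not the authors' code:

```python
import pandas as pd

# Hypothetical columns: one row per essay, with the variant type and the
# integer score assigned by the first rater.
df = pd.read_csv("variant_study_scores.csv")  # assumed file layout

# Percentage of examinees at each score level, by variant type
# (each row sums to roughly 100, as in Tables 7 and 8).
dist = pd.crosstab(df["variant_type"], df["rater1_score"], normalize="index") * 100
print(dist.round(1))
```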

3.4. Rater reliability

Rater reliability (quadratic weighted kappa and % exact agreement) for the argument variants is in Table 9. Although there is some variation among individual variants, there is no variant type that is clearly more or less reliable than any other variant type. The same appears to be true for the issue variants in Table 10.
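For readers unfamiliar with these agreement statistics, the sketch below computes quadratic weighted kappa and exact agreement for two vectors of rater scores using scikit-learn; the ratings shown are made-up examples, not the study data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Made-up ratings from two raters on the 1-6 scale (not the study data).
rater1 = np.array([3, 4, 2, 3, 5, 3, 4, 2, 3, 4])
rater2 = np.array([3, 4, 3, 3, 4, 3, 4, 2, 2, 4])

# Quadratic weighted kappa penalizes large disagreements more than small ones.
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Exact agreement: percentage of essays given identical scores by both raters.
exact = 100 * np.mean(rater1 == rater2)

print(f"quadratic weighted kappa = {kappa:.2f}, exact agreement = {exact:.0f}%")
```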

Table 9. Rater reliability for argument prompts: weighted kappa and (% exact agreement).
(The individual prompt-by-variant cells could not be recovered from this copy; the values preserved here range from kappa = .62 to .84, with exact agreement between 55% and 68%.)

Table 10. Rater reliability for issue prompts: weighted kappa and (% exact agreement).
(The individual prompt-by-variant cells could not be recovered from this copy; the values preserved here range from kappa = .67 to .83, with exact agreement between 56% and 69%.)

4. Conclusion

Differences between variant types in terms of means, distributions, and rater reliabilities were small enough to suggest that the use of variants is a practical approach to the need to produce a large number of comparable essay questions for a high-stakes essay test. No variant type was differentially difficult for any identified subgroup, suggesting that the variant approach should not impact subgroup differences. Although it is sometimes necessary to make a trade-off between cutting costs and enhancing validity and fairness (Quellmalz, 1984), the variants approach appears to be a win–win that potentially both enhances validity (by reducing the use of pre-memorized material) and reduces test creation costs.

The magnitude of the observed differences between variant types is consistent with the level of variation noted within the current issue and argument prompt types. Thus, the introduction of variants should not introduce any more variability than is currently observed. Nevertheless, the variability is sufficient to be of some practical concern, and the particular variant used at a given testing session would not be a matter of total indifference to the examinee. The impact on the total writing score for individual examinees in a high-stakes test could be mitigated by pairing a relatively difficult argument parent/variant with a relatively easy issue prompt/variant. A pairing approach matching relatively easy issue prompts with relatively difficult argument prompts, and vice versa, is currently used in the GRE analytic writing assessment.

Additional research will be needed after the new GRE prompts/variants become operational and data on fully motivated samples are available. Because it may take several months for examinees (and coaching schools) to develop optimal strategies for responding to each variant type, a follow-up study would probably be best after at least six months of regular operational testing. At that point, sample sizes should be sufficient for more fine-grained subgroup analyses, including dividing the English-not-best group into individual language or country groups. The analyses in this paper were narrowly focused on the scores that raters assign and not on how the actual writing products may differ in response to the different variant types. Such analyses, which would require a discourse analysis or a corpus linguistics approach, were well beyond the scope of this study but would be a fruitful field for additional studies and analyses. Additional research might also use cognitive interviews with examinees to determine whether the examinees themselves perceive any differences in their approaches to different variant types.

Appendix A. Sample argument variants

Alternate explanations

"There is now evidence that the relaxed manner of living in small towns promotes better health and greater longevity than does the hectic pace of life in big cities. Businesses in the small town of Leeville report fewer days of sick leave taken by individual workers than do businesses in the nearby large city of Masonton. Furthermore, Leeville has only one physician for its one thousand residents, but in Masonton the proportion of physicians to residents is five times as high. And the average age of Leeville residents is significantly higher than that of Masonton residents.
These findings suggest that people seeking longer and healthier lives should consider moving to small communities."

Write a response in which you:
• discuss one or more alternative explanations that could rival the explanation offered above and
• indicate how your explanation(s) can plausibly account for the facts presented in the argument.

Unstated assumptions

The following appeared in a memo to the board of directors of Bargain Brand Cereals.

"One year ago we introduced our first product, Bargain Brand breakfast cereal. Our very low prices quickly drew many customers away from the top-selling cereal companies.


Although the companies producing the top brands have since tried to compete with us by lowering their prices and although several plan to introduce their own budget brands, not once have we needed to raise our prices to continue making a profit. Given our success in selling cereal, we recommend that Bargain Brand now expand its business and begin marketing other low-priced food products as quickly as possible."

Write a response in which you examine the unstated assumptions of the argument above. Be sure to explain
• how the argument depends on those assumptions and
• what the implications are if the assumptions prove unwarranted.

Specific evidence

The following appeared in a letter from a firm providing investment advice to a client.

"Homes in the northeastern United States, where winters are typically cold, have traditionally used oil as their major fuel for heating. Last year that region experienced 90 days with below-average temperatures, and climate forecasters at Waymarsh University predict that this weather pattern will continue for several more years. Furthermore, many new homes have been built in this region during the past year. Because of these developments, we predict an increased demand for heating oil and recommend investment in Consolidated Industries, one of whose major business operations is the retail sale of home heating oil."

Write a response in which you
• discuss what specific evidence is needed to evaluate the logical soundness of the argument above and
• explain how the evidence would weaken or strengthen the argument.

Evaluate a prediction

Benton City residents have adopted healthier lifestyles. A recent survey of city residents shows that the eating habits of city residents conform more closely to government nutritional recommendations than they did ten years ago. During those ten years, local sales of food products containing kiran, a substance that a scientific study has shown reduces cholesterol, have increased fourfold, while sales of sulia, a food rarely eaten by the healthiest residents, have declined dramatically. Because of these positive changes in the eating habits of Benton City residents, we predict that the obesity rate in the city will soon be well below the national average.

Write a response in which you discuss what questions would need to be answered in order to decide whether the prediction and the argument on which it is based are reasonable. Be sure to explain how the answers to these questions would help to evaluate the prediction.

Evaluate a recommendation/predicted result

The following appeared in a memo from a budget planner for the city of Grandview.

"Our citizens are well aware of the fact that while the Grandview Symphony Orchestra was struggling to succeed, our city government promised annual funding to help support its programs. Last year, however, private contributions to the symphony increased by 200 percent, and attendance at the symphony's concerts-in-the-park series doubled. The symphony has also announced an increase in ticket prices for next year. Such developments indicate that the symphony can now succeed without funding from city government and we can eliminate that expense from next year's budget. Therefore, we recommend that the city of Grandview eliminate its funding for the Grandview Symphony from next year's budget. By doing so, we can prevent a city budget deficit without threatening the success of the symphony."


Write a response in which you discuss what questions would need to be addressed to decide whether the recommendation is likely to have the predicted result. Be sure to explain how the answers to the questions would help to evaluate the recommendation.

Evaluate a recommendation

The following appeared in a memo to the board of directors of Bargain Brand Cereals.

"One year ago we introduced our first product, Bargain Brand breakfast cereal. Our very low prices quickly drew many customers away from the top-selling cereal companies. Although the companies producing the top brands have since tried to compete with us by lowering their prices and although several plan to introduce their own budget brands, not once have we needed to raise our prices to continue making a profit. Given our success in selling cereal, we recommend that Bargain Brand now expand its business and begin marketing other low-priced food products as quickly as possible."

Write a response in which you discuss what questions would need to be answered in order to decide whether the recommendation and the argument on which it is based are well justified. Be sure to explain how the answers to these questions would help to evaluate the recommendation.

Appendix B. Sample issue variants

Claim with reason

Claim: "When planning courses, educators should take into account the interests and suggestions of their students."
Reason: "Students are more motivated to learn when they are interested in what they are studying."

Write a response in which you
• discuss the extent to which you agree or disagree with the claim and
• explain how the given reason would affect your position on the claim.

Two competing positions

"Some people believe that competition for high grades motivates students to excel in the classroom. Others believe that such competition seriously limits the quality of real learning."

Discuss which view more closely aligns with your own position and explain your reasoning for the position you take. In developing and supporting your position, you should explain what principles you used in choosing between the two views.

Generalization

"The best ideas arise from a passionate interest in commonplace things."

Discuss the extent to which you agree or disagree with the statement above, and explain your reasoning for the position you take. In developing and supporting your position, you should consider ways in which the statement might or might not hold true, and explain how those considerations shape your position.

Recommended policy position

"Colleges and universities should require their students to spend at least one semester studying in a foreign country."


Discuss your views on the policy above and explain your reasoning for the position you take. In developing and supporting your position, you should explain the possible consequences of implementing the policy.

Position with counterarguments

"People's behavior is largely determined by forces not of their own making."

Write a response in which you
• discuss the extent to which you agree or disagree with the claim and
• anticipate and address the most compelling reasons or examples that could be used to challenge your position.

Recommendation

"Educators should solicit students' input on the content of academic courses."

Discuss the extent to which you agree or disagree with the recommendation above and explain your reasoning for the position you take. In developing and supporting your position, describe specific circumstances in which adopting the recommendation would or would not be advantageous and explain how those examples shape your position.

References

Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (1st ed., pp. 323–357). Hillsdale, NJ: Lawrence Erlbaum.
Bejar, I. I., & Braun, H. (1999). Architectural simulations: From research to implementation. Final report to the National Council of Architectural Registration Boards (ETS RM-99-02). Princeton, NJ: ETS.
Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2002). A feasibility study of on-the-fly item generation in adaptive testing (GRE Board Professional Report No. 02-23). Princeton, NJ: ETS.
Breland, H., Lee, Y.-W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL CBT writing prompt difficulty and comparability for different gender groups. TOEFL Research Report 76 (ETS RR-04-05). Princeton, NJ: Educational Testing Service.
Bridgeman, B., Bejar, I., & Friedman, D. (1999). Fairness issues in a computer-based architectural licensure examination. Computers in Human Behavior, 15, 419–440.
Educational Testing Service. (2010). GRE program brochure. Princeton, NJ: Author.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tasks: Application to abstract reasoning. Psychological Methods, 3, 380–396.
Golub-Smith, M., Reese, C., & Steinhaus, K. (1993). Topic and topic type comparability on the Test of Written English. TOEFL Research Report 42. Princeton, NJ: Educational Testing Service.
Hamp-Lyons, L., & Mathias, S. P. (1994). Examining expert judgments of task difficulty in essay tests. Journal of Second Language Writing, 3, 49–68.
Hinkel, E. (2009). The effects of essay topics on modal verb uses in L1 and L2 academic writing. Journal of Pragmatics, 41, 667–683.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A universe-defined system of arithmetic achievement tests. Journal of Educational Measurement, 5, 275–290.
Huang, H. J. (2008). Essay topic writability examined through a statistical approach from the college writer's perspective. English Language Teaching, 1, 79–85.
Jennings, M., Fox, J., Graves, B., & Shohamy, E. (1999). The test-takers' choice: An investigation of the effect of topic on language test performance. Language Testing, 16, 426–456.
Kroll, B., & Reid, J. (1994). Guidelines for designing writing prompts: Clarifications, caveats, and cautions. Journal of Second Language Writing, 3, 231–255.
Lee, H.-K., & Anderson, C. (2007). Validity and topic generality of a writing performance test. Language Testing, 24, 307–330.
Morley, M., Lawless, R., & Bridgeman, B. (2005). Transfer between variants of mathematics test questions. In J. Mestre (Ed.), Transfer of learning from a modern multidisciplinary perspective. Greenwich, CT: Information Age Publishing.
Quellmalz, E. S. (1984). Designing writing assessments: Balancing fairness, utility, and cost. Educational Evaluation and Policy Analysis, 6, 63–72.
Schaeffer, G., Briel, J., & Fowles, M. (2001). Psychometric assessment of the new GRE writing prompts (GRE No. 96-11P; ETS RR-01-08). Princeton, NJ: Educational Testing Service.
Scrams, D. K., Mislevy, R. J., & Sheehan, K. M. (2002). An analysis of similarities in item functioning within antonym and analogy variant families (GRE No. 95-17aP; ETS RR-02-13). Princeton, NJ: Educational Testing Service.

Brent Bridgeman is Distinguished Presidential Appointee at Educational Testing Service. He is Director of the Validity Research area.


Catherine Trapani is Principal Research Data Analyst at Educational Testing Service. Her responsibilities include designing and interpreting statistical analyses of scores from writing assessments. Jennifer Bivens-Tatum is Assessment Specialist II at Educational Testing Service. Her responsibilities include overseeing the design of the Analytical Writing section of the Graduate Record Examinations® General Test.
