Recommending a Minimum English Proficiency Standard for Entry-Level Nursing





Journal of Nursing Measurement, Volume 13, Number 2, Fall 2005
© 2005 Springer Publishing Company

Thomas R. O'Neill, PhD
National Council of State Boards of Nursing, Chicago, IL

Richard J. Tannenbaum, PhD
Educational Testing Service, Princeton, NJ

Jennifer Tiffen, RN, MS
National Council of State Boards of Nursing, Chicago, IL

When nurses who are educated internationally immigrate to the United States, they are expected to have English language proficiency in order to function as competent nurses. The purpose of this research was to provide sufficient information to the National Council of State Boards of Nursing (NCSBN) to make a defensible recommended passing standard for English proficiency. This standard was based upon the Test of English as a Foreign Language (TOEFL™). A large panel of nurses and nurse regulators (N = 25) was convened to determine how much English proficiency is required to be minimally competent as an entry-level nurse. Two standard setting procedures, the Simulated Minimally Competent Candidate (SMCC) procedure and the Examinee Paper Selection Method, were combined to produce recommendations for each panelist. In conjunction with collateral information, these recommendations were reviewed by the NCSBN Examination Committee, which decided upon an NCSBN recommended standard, a TOEFL score of 220. Because the adoption of this standard rests entirely with the individual state, NCSBN has little more to do with implementing the standard, other than answering questions and providing documentation about the standard.

Keywords: TOEFL; English proficiency; standard setting; passing standard; nursing; adaptive testing; NCSBN; licensure

State boards of nursing are charged with protecting the public through the regulation of nursing in their jurisdiction. One of the ways in which they perform that function is by setting and enforcing prerequisite conditions for receiving a nursing license. Often these prerequisites are related to education, experience, and demonstrating that candidates are at least minimally competent by passing the National Council Licensure Examination (NCLEX®). However, there are some circumstances, such as being educated internationally, that might warrant the use of additional prerequisites, such as passing an English proficiency examination.

BACKGROUND

In the years 2001, 2002, and 2003, the number of first-time NCLEX-RN® candidates who were not educated in one of the NCSBN (Note 1) member board jurisdictions was 8,613, 12,762, and 16,490, respectively. The numbers for first-time NCLEX-PN® candidates were 1,363, 1,810, and 2,198, respectively. The numbers were even higher when repeat candidates were included. Clearly, a large number of nurse candidates were educated outside of the United States, and the trend seems to be increasing. For many of these candidates, English is not their primary language. This presents an additional challenge to boards of nursing: not only do they need information regarding the clinical competence of these candidates, but they also need to know whether a candidate has adequate language skills to use those clinical skills effectively.

Typically, English language proficiency tests produce a score, not a pass-fail decision. How the score should be interpreted is specific to the purpose. For example, the minimum English proficiency required for an editor or communications director is likely to be quite different from the minimum proficiency required for an accountant or actuary. For these reasons, NCSBN set out to establish a recommended minimum standard of English proficiency specific to entry-level nursing. Making such a legally defensible passing standard available to the member board jurisdictions would be an obvious benefit. Rather than have each jurisdiction perform essentially the same study, it seemed sensible to commit significant resources to conduct a single, well-crafted study that could be used by all jurisdictions, should they so choose. This could provide an additional benefit for internationally educated candidates by making the examination results portable across the jurisdictions that use the standard.

Using an underlying principle of public safety, the minimum standard was intended to reflect the level of English language proficiency believed necessary for entry-level nurses to be able to perform important nursing responsibilities safely and effectively. It is recommended that internationally educated nurse-candidates meet or exceed this standard before they are issued a license. It is important to note that the standard was intended to reflect the minimum level of English proficiency necessary for safe and effective entry-level practice, not the level of proficiency necessary for nurse-candidates to take the NCLEX examination.

Before a minimum English proficiency standard can be set, at least one instrument to measure it must be identified. The Profiles of Member Boards 2002 (National Council of State Boards of Nursing, Inc., 2003) suggests that the English proficiency examination most commonly used by boards of nursing in the United States and its territories is the Test of English as a Foreign Language (TOEFL). Therefore, as a first step in establishing a minimum English proficiency standard, a standard setting study was conducted using the TOEFL (Educational Testing Service, 2000a, 2003) examination. The intended examinee population consists of nurse-candidates who have been educated outside of the United States and in a language other than English. English is probably not the first or primary language of these candidates. This population typically includes both inexperienced practitioners and experienced practitioners; regardless, all are seeking entrance into the nursing profession in the United States.


For this reason, the minimum level of English language proficiency is understood in the context of entry-level (United States-based) nursing practice.

THE STANDARD SETTING PROCESS

Although psychometrics can provide useful information for standard setting, the setting of standards is not really a measurement issue in the traditional sense. To illustrate this point, consider the ruler. The ruler has been around for a long time and is generally regarded as a stable instrument for measuring distance. However, when a child goes to an amusement park and asks why one must be a certain height to ride a particular ride, the explanation about the ruler's stability seems quite irrelevant. Why not an inch lower? Or higher? Of course, there is a safety-based rationale that considers acceptable risks behind the rule, but how safe a ride should be and what constitutes an acceptable risk are really personal judgments made by a person or a group of people. To implement this judgment evenly across all people, it is necessary to develop a policy.

Although psychometrics as a field has been quite successful in devising tests and questionnaires to measure traits, aptitudes, and attitudes that are demonstrably reliable and valid, the selection of cut-off criteria for making classification decisions is not quite as scientific. Its creation is more of a policy decision than a measurement decision. Cizek (2001) expressed this perspective in his book on standard setting: although psychometrics falls more along the lines of science, standard setting falls more into the social realm.

Standard setting is the branch of psychometrics that blends more artistic, political, and cultural ingredients into the mix of its products than any other. (p. 5)

This perspective, however, has not been universally embraced. Some procedures explicitly define the ideal standard as the threshold that optimally classifies candidates with regard to a "known" classification or outcome. For example, the contrasting groups method (Livingston & Zieky, 1982) initially classifies examinees into two groups, qualified and unqualified, on the basis of something other than their test score. Next, a two-way table of percentages, status (qualified or unqualified) by test score, is constructed. The score at which the greatest classification accuracy is achieved is considered optimal. Although this method permits decision-makers to evaluate the number of correct classifications and the number of false positives and false negatives, there remain two issues that cannot be resolved empirically. First, the criteria selected to classify the groups are in fact judgment-based decisions. Second, what constitutes an acceptable percentage of false positives and false negatives also remains a judgment.

A passing standard is a function of informed professional judgment. There is no passing standard that is empirically correct. A passing score reflects the values of those professionals who participate in its definition and adoption, and different professionals may hold different sets of values. Its determination may be informed by empirical information or data, but ultimately, the passing standard is a judgment-based decision.

Regardless of one's theoretical perspective, the standard used to classify examinees must not be made in an arbitrary and capricious manner. Furthermore, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) recommends that the rationale and procedures used to set the standard be clearly


documented. This includes a description of the standard setting procedure, the panelist selection process and the qualifications of panelists selected, as well as a description of the training provided. This article documents these aspects of the standard setting process in relation to the TOEFL examination.
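The contrasting groups method described above reduces to a simple search once the external qualified/unqualified judgments are in hand. The following sketch is ours; the scores and group assignments are invented for illustration and are not data from this study. It finds the cut score that maximizes classification accuracy, while, as noted above, the external classification itself and the acceptable rates of false positives and false negatives remain judgment calls.

```python
def best_cut_score(qualified_scores, unqualified_scores):
    """Contrasting groups method: return the cut score that maximizes
    classification accuracy against an external qualified/unqualified
    judgment (an examinee passes if score >= cut)."""
    candidates = sorted(set(qualified_scores) | set(unqualified_scores))

    def accuracy(cut):
        hits = sum(s >= cut for s in qualified_scores)    # true positives
        hits += sum(s < cut for s in unqualified_scores)  # true negatives
        return hits / (len(qualified_scores) + len(unqualified_scores))

    return max(candidates, key=accuracy)

# Invented TOEFL-like scores, for illustration only.
qualified = [233, 247, 220, 227, 213, 240, 230]
unqualified = [183, 207, 197, 213, 190, 203]
print(best_cut_score(qualified, unqualified))
```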

METHODS

TOEFL Examination

The TOEFL is an examination designed to assess English language ability in examinees for whom English is not their native language. It comprises three components: Listening Comprehension, Structure & Written Expression, and Reading Comprehension. In the Listening Comprehension section, the examinee listens to a conversation and is asked to answer 30 questions about it. These questions are administered adaptively. In the Structure portion of the Structure & Written Expression section, there are two item types: the examinee is presented either with an incomplete sentence and asked to select the response that will correctly complete the sentence, or with a sentence that has several sections underlined and asked to identify which one is an error. There are 20 of these questions and they are administered adaptively. In the Written Expression portion of the Structure & Written Expression section, the examinee is provided with a topic and asked to write a brief essay, which is scored 0–6. In the Reading Comprehension section, the examinee reads a few passages; after each passage, the examinee is asked to answer questions about it. This section has 44 questions that are administered nonadaptively. With the exception of the writing section, all items are dichotomously scored and use a Selected Response (SR)-type item format (essentially a multiple-choice-question format, although there is some variety in how the responses are selected, and multiple responses are required in some instances).

Although there is also a paper format (Note 2), the computer-based format is the most widely used (approximately 80%). Because most examinees take the computer-based TOEFL, the test form used in this study is aligned with the computerized specifications (Note 3). Three section-level subscores (0 to 30 points each) and a total TOEFL score (0 to 300 points) are reported. The total TOEFL score is the average of the three section scores multiplied by 10.
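The published score arithmetic can be stated compactly. The snippet below is a sketch; the function name and the example section scores are ours (the example values happen to match panelist P1's first-round judgments in Table 2). It computes a total computer-based TOEFL score from three section scaled scores and confirms that averaging the sections and multiplying by 10 is equivalent to summing them and multiplying by ten-thirds, the formulation used later in this article.

```python
def toefl_total(listening: float, structure_writing: float, reading: float) -> float:
    """Total computer-based TOEFL score (0-300) from three section scaled
    scores (0-30 each): the average of the three sections multiplied by 10."""
    return (listening + structure_writing + reading) / 3 * 10

sections = (22, 23, 19)            # example section scaled scores
total = toefl_total(*sections)
print(round(total))                # 213
# Equivalent formulation used later in this article: sum * 10/3.
assert abs(total - sum(sections) * 10 / 3) < 1e-9
```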

Standard Setting Procedures

Two standard setting procedures, the Simulated Minimally Competent Candidate (SMCC) method for the selected response format items and the Examinee Paper Selection Method for the essay portion, were combined to produce passing standard recommendations for each panelist. Because some sections of the TOEFL are adaptive and the test uses item response theory (IRT) to equate all examinees' performances to a common scale, it was desirable to use a standard setting procedure that was congruent with adaptive testing and IRT. The SMCC method is such a procedure.

SMCC Method. The SMCC method essentially asks each panelist to respond to a sample of items the way they imagine a minimally competent examinee would. Based upon those responses, a score is computed for the panelist that should represent the panelist's notion of minimal competence. If the items are already calibrated using IRT, the tests can even be given adaptively because the candidate ability estimates produced will automatically be equated to the same scale. Therefore, this method permits each rater to receive either identical sets of items or different sets of items. In this way, items can be


administered to panelists in a manner similar to how an examinee would receive them on an actual test. The SMCC procedure produced a subtest score for the Listening Comprehension and Reading Comprehension subtests. It also provided a partial subtest score for Structure & Written Expression.

The origins of this method are not well documented. In 1995 or 1996, Mary E. Lunz and the Histotechnology Committee of the American Society of Clinical Pathologists' Board of Registry developed the idea. While using the Synthetic Candidate Method (Lunz, 2000; also a 1995 unpublished manuscript cited in Plake & Hambleton, 2001) to set a passing standard for a histotechnology performance assessment, the committee wondered if the procedure could be modified for use with their computerized adaptive test. Lunz indicated that it could be accomplished, but that it would be more logical and efficient if the consensus-building stage were executed after individual raters provided their input. These ideas were considered, but not executed, because a formal standard setting exercise was not conducted for some time. However, Sireci and Clauser (2001) attribute this method to Howard Wainer, citing a personal communication from March 31, 2000. It seems likely that these ideas were an obvious next step given the rise of adaptive testing at that time, so it is not surprising that different people were thinking about them simultaneously. This author credits Lunz and the committee with the idea.

Examinee Paper Selection Method. It seemed impractical to have the panelists attempt to write an essay in the way they imagine a minimally competent candidate would and then have the operational raters score it; the opportunity for the panelists to reconsider their ratings after discussion would be lost. Instead, the Examinee Paper Selection method (Hambleton, Jaeger, Plake, & Mills, 2000) was used, which permitted the panelists to read the rubric descriptions, the elements of what constituted each point (0–6) on the rubric, and then read sample essay responses that corresponded to each point on the rubric. Panelists were asked to pick the response that, in their expert judgment, reflected the response of the single examinee with just enough English language skills to perform the job of an entry-level nurse safely and effectively. Panelists were permitted to use half points if they felt that the minimally competent candidate would perform somewhere between two adjacent exemplars. This is consistent with the actual scoring process because two raters grade each essay and the average rating is used (Educational Testing Service, 2003).

A conversion table was used to combine the selected rating with the partial subtest score for Structure & Written Expression that was generated using the SMCC procedure. The SMCC procedure and the Examinee Paper Selection method each contributed approximately 50% of the Structure/Written Expression subtest. Finally, there were three subtest scores for each panelist that could have ranged from 0 to 30. These subtest scores were combined into a total score (0 to 300) by summing the three section scaled scores and multiplying this sum by ten-thirds, effectively allowing each section scaled score to contribute equally to the total scaled score.
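This article does not spell out the scoring machinery behind the SMCC judgments beyond noting that the simulated responses are scored with the same IRT procedures used for real examinees. The sketch below is a simplified illustration of that idea under a Rasch model; the operational TOEFL scoring also uses item discrimination parameters and a guessing adjustment (see the PowerPrep section below) plus a proprietary conversion to the 0–30 scale, none of which are shown here. The item difficulties and simulated responses are invented for illustration.

```python
import math

def rasch_ability(difficulties, responses, n_iter=20):
    """Maximum-likelihood ability estimate under a Rasch model.

    difficulties : calibrated item difficulties (logits)
    responses    : 0/1 simulated responses (1 = the imagined minimally
                   competent candidate would answer correctly)
    Returns the ability (theta, logits) whose expected score best matches
    the simulated score -- a simplified stand-in for the IRT scoring that
    the TOEFL program would apply to an SMCC response string.
    """
    theta = 0.0
    for _ in range(n_iter):
        probs = [1 / (1 + math.exp(-(theta - b))) for b in difficulties]
        residual = sum(responses) - sum(probs)   # observed minus expected score
        info = sum(p * (1 - p) for p in probs)   # test information at theta
        theta += residual / info                 # Newton-Raphson update
    return theta

# Invented calibrations and simulated SMCC responses, for illustration only.
item_difficulties = [-1.2, -0.5, 0.0, 0.4, 0.9, 1.5]
smcc_responses = [1, 1, 1, 0, 1, 0]
print(round(rasch_ability(item_difficulties, smcc_responses), 2))
```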

Adaptive Testing

Although Reading was administered as a fixed-form test, the Listening Comprehension and Structure sections were administered adaptively. That is, the difficulty level of an item presented to a candidate depends on the candidate's response to the immediately previous item and to the other previous items. A correct response to an item, for example, is followed by an item of greater difficulty; an incorrect response is followed by an item of lesser difficulty. In this way, a candidate receives a set of items maximally tailored to his or her overall ability in each of the two adaptive sections.
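A deliberately simplified sketch of the selection rule just described follows. Real TOEFL item selection also involves content balancing, exposure control, and a multi-parameter IRT model; in this toy version (all names and values are ours) the next item is simply the unused item whose difficulty lies closest to the current ability estimate, which moves up after a correct response and down after an incorrect one.

```python
def choose_next_item(pool, used, theta):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate -- the core idea of adaptive item selection."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: abs(pool[i] - theta))

def simulate_adaptive_section(pool, answer_fn, n_items, step=0.5):
    """Administer n_items adaptively; answer_fn(difficulty) -> True/False."""
    theta, used = 0.0, set()
    for _ in range(n_items):
        i = choose_next_item(pool, used, theta)
        used.add(i)
        correct = answer_fn(pool[i])
        theta += step if correct else -step  # harder after a correct answer,
                                             # easier after an incorrect one
    return theta

# Illustration: a respondent who answers correctly whenever difficulty < 0.8.
pool = [-2.0, -1.0, -0.3, 0.2, 0.8, 1.4, 2.1]
print(simulate_adaptive_section(pool, lambda b: b < 0.8, n_items=5))
```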


Use of PowerPrep®

Each panelist had a laptop computer that was preloaded with TOEFL PowerPrep software. PowerPrep (Educational Testing Service, 2000b) contains two full-length, computer-adaptive editions of the TOEFL, drawing upon a pool of more than 1,200 items. PowerPrep considers the item difficulty and item discrimination parameters of the items presented and the individual panelist's responses to the items (with an adjustment for "guessing") to compute the estimate of the panelist's ability for each of the three sections. The software does not provide a final score for the Structure & Written Expression section; instead it produces the lower bound of the Structure score, which essentially assumes that zero points were earned on the essay. Panelists were to combine this Structure score with their essay score through the use of a conversion table to produce a single score for the Structure & Written Expression section. For each section, the panelist's ability estimate was translated to a scaled score that could range from 0 to 30. Finally, a total score (0 to 300) for the panelist was obtained by summing the three section scaled scores and multiplying this sum by ten-thirds, effectively allowing each section scaled score to contribute equally to the total scaled score.

Selection of Raters

The composition (number, representativeness, and qualifications) of the standard-setting panel was a crucial element in establishing the validity and credibility of the standard. Twenty-five experts served on the standard-setting panel. Several methods were employed in order to recruit a large pool of nurses. Ethnic and minority nursing groups were contacted, as well as state boards of nursing. In addition, a small sample of recent NCLEX passers who had identified themselves as members of an ethnic minority was contacted. Applicants were grouped by their qualifications of: (a) having previously taken the TOEFL exam, (b) working with clients who speak languages other than English, (c) supervising nurses who speak languages other than English, or (d) working as a nursing regulator or educator, or representing a consumer of nursing services. NCSBN further sorted applicants by selecting candidates from each of the most commonly spoken non-English languages in the US (Spanish, Chinese, French, German, and Tagalog) as represented in the 2000 U.S. Census, and selecting representatives from all four NCSBN geographic regions.

These experts, all female, were recruited by NCSBN to represent a range of professional perspectives and experiences. Nine of the panelists self-reported being nurses who were educated internationally and so had also taken the TOEFL. Seven panelists reported being nurses who work with clients who speak languages other than English. Five panelists reported being clinical supervisors of nurses who speak a primary language other than English. Four reported being a nursing regulator, nursing educator, or public member (two of the four reported having taken the TOEFL). As can be seen in Table 1, panelists had a range of experience and clinical/work specialties. Panelists were selected from 17 states and represented 16 languages.

Panelist Orientation and Training

The panelists were first provided with an overview of the goals and purpose of the study. It was explained that a passing score was meant to reflect the level of English language proficiency necessary for entry-level nurses to perform important nursing tasks safely and effectively. It was clarified that the passing score was not the level of English language proficiency necessary to take the NCLEX® examination; the focus of the study was on the job.

TABLE 1. Panel Demographics (Sample Population N = 25)

Have Taken the TOEFL (n = 9)
  Range in years of experience: 3 to 30 years
  Clinical/work specialties: Geriatric, Medical-surgical, Primary care, Obstetrics, Pediatrics, Mental health
  States represented: California, Georgia, Hawaii, Kansas, Minnesota, North Carolina, Ohio, Virginia
  Languages represented: Spanish, Tagalog, Chinese, German, Japanese, Vietnamese, Mandarin, Nigerian, Hmong, Persian, Arabic

Work with Non-English-Speaking Clients (n = 7)
  Range in years of experience: 1 to 20 years
  Clinical/work specialties: Obstetrics, Pediatrics, Home health, LTC-geriatric, Medical-surgical, Critical care, Mental health
  States represented: California, District of Columbia, Florida, Kansas, Massachusetts, Oregon, Texas
  Languages represented: Spanish, Tagalog, Chinese, German, French, Vietnamese, Korean, Russian, African dialects

Supervise Non-English-Speaking Nurses (n = 5)
  Range in years of experience: < 1 to 27 years
  Clinical/work specialties: LTC-geriatric, Critical care, Medical-surgical, Mental health
  States represented: Alaska, California, Florida, Illinois, Texas
  Languages represented: Spanish, Tagalog, Japanese, Korean, Polish

Regulators, Educator, and Public Member (n = 4)
  Range in years of experience: 27 to 47 years
  Clinical/work specialties: Education consultant, Executive director, Nursing instructor, Editor (Spanish publisher)
  States represented: Iowa, Massachusetts, Oregon, Washington
  Languages represented: Spanish, Tagalog, German, French


Second, the panelists were led through an overview of the TOEFL computer-based test, the PowerPrep® software that would be used to administer the test, and the general process that was to be followed in arriving at the recommended passing score. Because it seemed sensible to provide the instructions for each section immediately before the execution of that section, the training for the multiple-choice questions occurred on Day 1 and the instructions for the essay portion of the Structure and Written Expression section occurred on Day 2.

After the orientation, the panel was asked to identify the core tasks that all entry-level nurses needed to perform. It was important to agree on the scope of activity that was being considered before trying to assess how much English one needed to know to perform them. The panel split into two groups and worked independently for half an hour to identify 8 to 10 tasks critical to entry-level nurses. Each group recorded its task list on flipchart paper. The two groups then reassembled, and a whole-group discussion followed whereby the final task list was defined. The list included: taking patient histories, conducting patient assessment, completing documentation, educating or training patients, taking orders, reporting, implementing safety practices, delegating, communicating, providing client service, and prioritizing responsibilities. This list was posted to serve as a frame of reference for the rest of the exercise.

The panelists were then instructed to imagine a nurse candidate who was educated outside the United States and in a language other than English. They were also told to imagine that this person was seeking to become an entry-level nurse in the US and just barely possessed the English proficiency necessary to be safe and effective as a nurse. It was also discussed that while this examinee may be stronger in certain English language skills than in others, overall, this examinee was sufficiently proficient in English to perform the job. Panelists were reminded that the focus was not on the examinee's nursing knowledge or skill, but rather on their English language skills. Panelists were informed that they would each be taking a version of the computer-based TOEFL. They were, however, to respond to each question as if they were the minimally proficient examinee that they had just imagined. For the multiple-choice questions, the panelists were instructed to select the answer that they believed the SMCC would choose. For the writing sample, the panelists were instructed to identify, from a set of exemplar writing samples, the writing sample that reflects what the SMCC would be capable of producing.

Panelist Judgments

Practice Judgments. For each section except the writing portion, panelists were given an opportunity to practice making judgments before making their first-round standard-setting judgments. Using five questions from the 1997–1998 edition of the paper version of the TOEFL, each panelist was asked to note what the correct answer was and whether or not an examinee with just enough English language skills to perform the job of an entry-level nurse would know the correct answer. After each panelist noted their responses, the group discussed their rationales. After the discussion of each item, the correct answer was revealed, as was the proportion of 1997–1998 examinees (Note 4) who chose the correct answer, and whether the item would be classified as easy, of medium difficulty, or hard. (The rule of thumb used in this latter classification was that if 70% or more answered the item correctly, it was easy; if 30% or less answered it correctly, it was hard; and if between 30% and 70% answered it correctly, it was of medium difficulty.) Before each section, panelists were reminded to respond as if they were the examinee with just enough English language skills to perform the job of an entry-level nurse safely and effectively.
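The rule of thumb for labeling item difficulty is easy to restate in code; the thresholds below come from the text, and everything else (the function name, the example proportions) is ours.

```python
def difficulty_label(p_correct: float) -> str:
    """Classify an item by the proportion of examinees answering correctly:
    easy if 70% or more, hard if 30% or less, otherwise medium."""
    if p_correct >= 0.70:
        return "easy"
    if p_correct <= 0.30:
        return "hard"
    return "medium"

print([difficulty_label(p) for p in (0.85, 0.50, 0.25)])  # ['easy', 'medium', 'hard']
```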


Listening Comprehension Judgments. Panelists rendered their first-round standard-setting judgments for the Listening section. Each panelist was asked to respond to two items as if she were a candidate with just enough English language proficiency to perform the job of an entry-level nurse (as exemplified by the list of tasks previously defined) safely and effectively. Once two items were completed, the panelists were asked to quit the test. A whole-group discussion then occurred to get a sense of how difficult the panelists believed the questions were for this examinee. Although panelists likely encountered different items (as the Listening Comprehension section is adaptive), the discussion was helpful to bring to light the relevant perspectives of the panelists. After the discussion, panelists completed all their remaining first-round items for the Listening section of the TOEFL (Note 5).

Reading Comprehension Judgments. Each panelist responded to two items, as the minimally competent examinee would, for the Reading Comprehension section. After the two items were completed, the panelists discussed how difficult the questions were for this examinee. After the discussion, panelists completed all their remaining first-round items for this section.

Structure and Written Expression Judgments. After responding to two items, as the minimally competent candidate would, for the Structure portion of this section, the panelists discussed how difficult the questions were for this examinee. Although panelists likely encountered different items (as this section is adaptive), the discussion helped to reveal the different perspectives of the panelists. After the discussion, panelists completed all their remaining first-round items for this portion of the section.

Later, the panelists provided judgments for the Written Expression portion of this section. Panelists were provided with several sample candidate responses that illustrate each of the rubric score points (0–6); these benchmark performances represent clear examples of each point. The panelists were asked to pick the benchmark performance (rubric score point) that reflects the response of the "sufficiently English language competent" entry-level nurse. Panelists first made independent judgments and then were asked to share their rationales. Subsequently, they had an opportunity to adjust their initial judgment if they were persuaded by some of the rationales that were discussed. The panelists' final judgments were recorded and combined with the Structure score: the Written Expression score and the lower-bound score associated with the Structure portion were combined to form one Structure and Written Expression score using a conversion table.

Final Review of Judgments. After the first-round scaled scores were computed, panelists had the opportunity to review the scaled scores for their idealized minimally competent candidate and adjust them if they felt it was necessary. These adjustments were informed by additional discussion among the panelists regarding why the standard should be higher or lower.

RESULTS

Panel Recommendations

Before the quantitative results are summarized, it is important to note that the social dynamics among the participants and between the staff and participants were collaborative. It appeared that no panelist was reticent to provide an opinion and that no individual or small subgroup dominated the discussions.


It appeared that the results provided by the panel represented the panelists' true opinions regarding the minimum English proficiency required to practice nursing at the entry level. For each panelist, three subtest scores and a total test score were computed for their first-round judgments (Table 2) and their final second-round judgments (Table 3). The panelists tended to indicate that a higher standard was needed when they reconsidered their initial judgments: the mean score increased from 212 in the first round to 221 in the second round. While the mean and median values increased, the variability (standard deviation) of the panelists' judgments tended to decrease, indicating a greater degree of panelist consensus. This was expected, as previous research (Hurtz & Auerbach, 2003) indicates that group discussion of standard-setting judgments can result in reduced variability among panelist judgments and higher mean values.

TABLE 2. First Round Scores for All Panelists

Panelist   Listening (0–30)   Structure Component (0–13)   Writing Component (0–6)   Combined Structure and Writing (0–30)   Reading (0–30)   Total (0–300)
P1    22    12    3.5    23    19    213
P2    22    10    4      23    26    237
P3    18    6     5      22    24    213
P4    24    9     5      25    23    240
P5    25    11    4      24    24    243
P6    24    6     3.5    18    23    217
P7    16    6     4      19    22    190
P8    15    2     4      15    19    163
P9    17    11    5      26    24    223
P10   13    7     3.5    19    22    180
P11   25    12    3      22    20    223
P12   10    6     4      19    24    177
P13   24    13    4.5    26    25    250
P14   23    9     4      22    19    213
P15   23    10    4      23    19    217
P16   20    10    3.5    22    24    220
P17   22    8     4      21    24    223
P18   21    3     4      16    18    183
P19   22    8     3.5    20    19    203
P20   22    9     4      22    21    217
P21   22    5     4      18    16    187
P22   26    13    3.5    24    21    237
P23   24    5     4.5    20    18    207
P24   20    11    3      21    19    200
P25   24    12    4      25    25    247
Mean (truncated)     20     8      3      21    21    212
Median (truncated)   22     9      4      22    22    216
SD                   3.94   3.02   0.53   2.87  2.75  22.77
Minimum              10     2      3      15    16    163
Maximum              26     13     5      26    26    250


In addition to considering the panel as a whole, the scores from participants who had previously taken the TOEFL (Table 4) and those who had not (Table 5) were considered separately. This was done because those panelists who had previously taken the TOEFL examination as part of the emigration or licensing process may have had a different perspective regarding minimum competence. Yet the data did not support this hypothesis: the mean score of the group that had taken the TOEFL (M = 218) and the mean score of the group that had not (M = 223) were not different to a statistically significant extent, t(23) = 0.721, p = .48, two-tailed. Both groups tended to indicate after discussion that a higher standard was required than they had thought in their initial judgment, and both groups tended to show less variability in their postdiscussion judgments. Furthermore, it was interesting to note that across subgroups, the mean scores for each of the three subtests (Listening, Writing-Structure, and Reading) were all very similar.
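The subgroup comparison can be approximated directly from the second-round totals in Tables 4 and 5. The sketch below runs a pooled two-sample t test on those tabled values; because the published tables carry rounded scores, the recomputed statistic will differ slightly from the t(23) = 0.721 reported here, but it supports the same conclusion of no statistically significant difference.

```python
from statistics import mean, variance

# Final (second-round) total scores copied from Tables 4 and 5.
took_toefl = [203, 226, 223, 203, 196, 230, 227, 213, 220, 247, 213]
did_not_take = [227, 257, 230, 217, 230, 200, 203, 233, 217, 233, 240, 200, 200, 237]

n1, n2 = len(took_toefl), len(did_not_take)
# Pooled-variance (equal-variance) two-sample t statistic.
pooled_var = ((n1 - 1) * variance(took_toefl) + (n2 - 1) * variance(did_not_take)) / (n1 + n2 - 2)
t = (mean(did_not_take) - mean(took_toefl)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
print(f"t({n1 + n2 - 2}) = {t:.2f}")   # close to the published value of 0.72
```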

TABLE 3. Second Round Scores for All Panelists

Panelist   Listening (0–30)   Structure Component (0–13)   Writing Component (0–6)   Combined Structure and Writing (0–30)   Reading (0–30)   Total (0–300)
P1    20    8     4      21    20    203
P2    22    10    4      23    23    227
P3    26    9     5      25    26    257
P4    24    9     4      22    22    227
P5    22    10    4.5    24    21    223
P6    24    10    3.5    22    23    230
P7    20    6     4      19    22    203
P8    24    8     4      21    20    217
P9    20    11    4.5    25    24    230
P10   18    8     3.5    20    22    200
P11   20    9     3.5    21    20    203
P12   15    7     4      20    24    197
P13   23    11    4      24    23    233
P14   23    9     4.5    24    22    230
P15   21    10    4.5    24    20    217
P16   20    10    4.5    24    24    227
P17   22    10    4.5    24    24    233
P18   25    9     4      22    25    240
P19   21    8     3.5    20    19    200
P20   22    8     4      21    21    213
P21   22    9     4      22    22    220
P22   26    11    3.5    23    25    247
P23   22    10    3.5    22    20    213
P24   20    8     3.5    20    20    200
P25   22    11    4      24    25    237
Mean (truncated)     21     9      4      22    22    221
Median (truncated)   22     9      4      22    22    223
SD                   2.40   1.29   0.41   1.73  1.95  15.74
Minimum              15     6      3.5    19    19    197
Maximum              26     11     5      25    26    257


TABLE 4. First and Second Round Scores for Panelists Who Have Taken TOEFL (11 Panelists)

Panelist   Listening (0–30)   Structure Component (0–13)   Writing Component (0–6)   Combined Structure and Writing (0–30)   Reading (0–30)   Total (0–300)

Round 1 judgment
P1    22    12    3.5    23    19    213
P4    24    9     5      25    23    240
P5    25    11    4      24    24    243
P7    16    6     4      19    22    190
P12   10    6     4      19    24    177
P14   23    9     4      22    19    213
P16   20    10    3.5    22    24    220
P20   22    9     4      22    21    217
P21   22    5     4      18    16    187
P22   26    13    3.5    24    21    237
P23   24    5     4.5    20    18    207
Mean (truncated)     21     8      4      21    21    213
Median (truncated)   22     9      4      22    21    213
SD                   4.39   2.67   0.43   2.23  2.59  21.01
Minimum              10     5      3.5    18    16    177
Maximum              26     13     5      25    24    243

Round 2 judgment
P1    20    8     4      21    20    203
P4    24    9     4      22    22    226
P5    22    10    4.5    24    21    223
P7    20    6     4      19    22    203
P12   15    7     4      20    24    196
P14   23    9     4.5    24    22    230
P16   20    10    4.5    24    24    227
P20   22    8     4      21    21    213
P21   22    9     4      22    22    220
P22   26    11    3.5    23    25    247
P23   22    10    3.5    22    20    213
Mean (truncated)     21     8      4      22    22    218
Median (truncated)   22     9      4      22    22    220
SD                   2.68   1.40   0.33   1.60  1.56  13.73
Minimum              15     6      3.5    19    20    197
Maximum              26     11     4.5    24    25    247


TABLE 5. First and Second Round Scores for Panelists Who Have Not Taken TOEFL (14 Panelists)

Panelist   Listening (0–30)   Structure Component (0–13)   Writing Component (0–6)   Combined Structure and Writing (0–30)   Reading (0–30)   Total (0–300)

Round 1 judgment
P2    22    10    4      23    26    237
P3    18    6     5      22    24    213
P6    24    6     3.5    18    23    217
P8    15    2     4      15    19    163
P9    17    11    5      26    24    223
P10   13    7     3.5    19    22    180
P11   25    12    3      22    20    223
P13   24    13    4.5    26    25    250
P15   23    10    4      23    19    217
P17   22    8     4      21    24    223
P18   21    3     4      16    18    183
P19   22    8     3.5    20    19    203
P24   20    11    3      21    19    200
P25   24    12    4      25    25    247
Mean (truncated)     20     8      3      21    21    212
Median (truncated)   22     9      4      21    22    216
SD                   3.53   3.27   0.59   3.28  2.71  24.07
Minimum              13     2      3      15    18    163
Maximum              25     13     5      26    26    250

Round 2 judgment
P2    22    10    4      23    23    227
P3    26    9     5      25    26    257
P6    24    10    3.5    22    23    230
P8    24    8     4      21    20    217
P9    20    11    4.5    25    24    230
P10   18    8     3.5    20    22    200
P11   20    9     3.5    21    20    203
P13   23    11    4      24    23    233
P15   21    10    4.5    24    20    217
P17   22    10    4.5    24    24    233
P18   25    9     4      22    25    240
P19   21    8     3.5    20    19    200
P24   20    8     3.5    20    20    200
P25   22    11    4      24    25    237
Mean (truncated)     22     9      4      22    22    223
Median (truncated)   22     9      4      22    23    228
SD                   2.14   1.12   0.46   1.80  2.19  16.88
Minimum              18     8      3.5    20    19    200
Maximum              26     11     5      25    26    257


Despite the general subtest agreement among the panelists, two panelists, #10 and #12, produced listening subtest scores that were noticeably lower than those of their colleagues. Both panelists raised their ratings after the discussion with their colleagues, but their ratings still remained the lowest listening subtest scores.

Examination Committee Deliberation

The NCSBN Board of Directors charged the Examination Committee with developing a recommended minimum passing score for the TOEFL. The Examination Committee reviewed:

1. The panel's recommendations;
2. Existing U.S. visa-screening requirements;
3. State licensing criteria; and
4. Normative TOEFL performance data on people applying for a professional license (Table 6).

The recommended passing standards from individual panelists ranged from 197 to 257, with no drastic outliers. The difference between the mean and the median was so small that it does not seem to reflect much difference in terms of language proficiency. Given that the group of panelists who had previously taken the TOEFL and those who had not produced comparable scores, the committee felt strongly that the recommended standard should consider the opinions of the entire panel, not just a subset of the panel. The mean score for the entire panel was 221, and that was the initial idea for the standard.

Additional discussion led the committee to consider the current U.S. visa-screening requirements for the different professions (Table 6). The current requirement for practical or vocational nurses is a TOEFL score of 197 and, for registered nurses, a score of 207. However, NCSBN staff was unable to uncover any research or documentation to support those standards. Staff did find a standard setting study performed for Occupational Therapists and Physical Therapists; this study recommended a TOEFL score of 220 to be considered minimally competent. The Examination Committee considered the level of communication required for those jobs and concluded that entry-level nurses needed to have comparable communication abilities. This led the committee to revise their recommended standard to 220.

The Examination Committee then reviewed this standard in light of the standards used by some states as a licensing requirement. Many states use the U.S. visa-screening requirements as their criteria, but there are some exceptions. The Profiles of Member Boards 2002 identifies three states with different standards (Kansas 163, North Carolina 213, and Florida 217). The committee did not feel that the standards used by these states were better supported with research or rationale than their proposed standard. However, it was reassuring that two of the three state standards were close to the committee's current thinking, a standard of 220.

The committee also wanted to have a general idea regarding the impact their standard would have, and looked at the 2001–2002 test score information for the computer-based TOEFL (Table 6). TOEFL examinees who reported that they were taking the test to become licensed to practice their chosen profession are likely to be more similar to the pool of internationally educated nurses considered in this study than is the entire pool of TOEFL examinees. The subset of examinees seeking a professional license was further disaggregated by gender. This permitted the test score information to be better aligned with the demographic characteristics of the pool of internationally educated nurses, which is likely to contain more women than men.


TABLE 6. Information Considered by NCSBN's Examination Committee to Recommend a Minimum Standard on the TOEFL Score (Note a)

Score   Source

Panel Recommendations
218   Mean final recommendation of panelists who took the TOEFL.
220   Median final recommendation of panelists who took the TOEFL.
221   Mean final recommendation of all panelists.
223   Median final recommendation of all panelists.
223   Mean final recommendation of panelists who did not take the TOEFL.
228   Median final recommendation of panelists who did not take the TOEFL.

CGFNS' TOEFL Requirements for Visa Screening (Note b)
197   Current standard for Licensed Practical Nurses, Vocational Nurses, Clinical Laboratory Technicians, and Medical Technicians.
207   Current standard for Registered Nurses, Speech Language Pathologists, Audiologists, Clinical Laboratory Scientists, Medical Technologists, and Physician Assistants.
220   Current standard for Physical Therapists and Occupational Therapists.

State TOEFL Standards (Note c)
163   Kansas
213   North Carolina
217   Florida

Population Means and Standard Deviations for Computer-Based TOEFL 2001–2002 Examinees (Note d)
214 (47)   Mean score of all examinees taking the computer-based TOEFL (N = 572,394).
225 (40)   Mean score of female 2001–2002 TOEFL examinees who applied for any type of professional license (N = 21,187).
229 (42)   Mean score of all 2001–2002 TOEFL examinees who applied for any type of professional license (N = 34,721).
235 (44)   Mean score of male 2001–2002 TOEFL examinees who applied for any type of professional license (N = 13,283).

CGFNS Validity Study Sample Mean and Standard Deviation (Note e)
237 (19)   The CGFNS TOEFL sample was based on the written examination, not the CAT examination. The written examination scores were converted to CAT scores via the formula CAT = (Written - 273.9) * 0.769, which was based on a conversion table found on page 13 of the TOEFL 2003–04 Information Bulletin for Computer-based and Paper-based Testing.

Table notes: (a) TOEFL scores can range from 0 to 300. (b) This information comes from the CGFNS website as of January 22, 2004. (c) This information comes from the Profiles of Member Boards 2002 (NCSBN, 2003). (d) Numbers of examinees are based on those who responded to a question about their group membership. (e) Based on the Commission on Graduates of Foreign Nursing Schools' Validity Study, April 1999 through March 2000.


To make a prediction regarding the impact of the proposed TOEFL standard of 220, a few scenarios were modeled. First, it was assumed that the distribution of English language proficiency among internationally educated nurses taking the TOEFL was the same as the distribution for all TOEFL examinees applying for a professional license. Given that the mean and standard deviation for the population applying for a professional license were µ = 229, σ = 42, one would expect 58% (z = -0.214) of these TOEFL examinees to pass. However, if the population were limited to female TOEFL examinees (µ = 225, σ = 40), one would expect 55% (z = -0.125) of this group to pass. On the other hand, using the data reported by the Commission on Graduates of Foreign Nursing Schools (2000) in their validity study, one could get a better idea of the typical distribution of English language proficiency for internationally educated nurses taking the TOEFL. Using only the people reported in that study who were in the 1999 or 2000 TOEFL cohort, an estimate for the population of nurses was derived (µ = 237.5, σ = 19; Note 6). Using this population, one would expect 82% (z = -0.921) of them to pass.

The Examination Committee considered the impact predictions and agreed that a standard of 220 on the TOEFL was appropriate to demonstrate the minimum degree of English proficiency necessary to be a safe and effective entry-level nurse. Correspondingly, a score of 560 on the paper version of the TOEFL would be considered equivalent.
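The impact figures above follow from a normal approximation, and the paper-based equivalence can be checked against the conversion formula given in Table 6 and Note 6. The sketch below reproduces both; the population labels are ours, and small differences from the published percentages are rounding.

```python
from statistics import NormalDist

CUT = 220
populations = {
    "all professional-license examinees":    (229, 42),
    "female professional-license examinees": (225, 40),
    "CGFNS validity-study nurses":           (237.5, 19),
}
for label, (mu, sigma) in populations.items():
    pass_rate = 1 - NormalDist(mu, sigma).cdf(CUT)   # share at or above the cut
    print(f"{label}: {pass_rate:.0%} expected to reach {CUT}")
# Expected output: roughly 58%, 55%, and 82%, as reported above.

# Paper-based equivalence check, using the conversion from Table 6 / Note 6:
paper_score = 560
print(round((paper_score - 273.9) * 0.769))   # about 220
```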

DISCUSSION

The purpose of this study was to arrive at a recommended passing score on the TOEFL that represents the level of English language proficiency believed necessary to perform important entry-level nursing tasks safely and effectively. The Examination Committee was asked to make a policy decision after being informed with the appropriate types of information.

Adjustments

Cizek (1996) discusses three types of post hoc standard setting adjustments: adjustments to participants, adjustments to the data, and adjustments to the final standard. Using Cizek's categories, the adjustments considered and implemented by the committee are discussed here.

Adjustments to the participants may include removing panelists with idiosyncratic ratings from consideration. Similarly, an entire class of panelists could be excluded for philosophical reasons, especially if their inclusion would lead to a noticeably different standard. After reviewing the data and considering the qualifications of the raters, the Examination Committee did not see a need to exclude the rating of any panelist.

Adjustments to the data are typically either statistical or some sort of reconsideration of ratings by the panelists after additional training or discussion. Statistical manipulation of the data is typically performed to minimize the variability in panelists' judgments. This type of manipulation was not considered necessary to interpret these results, and concerns that it might damage the inferences derived from the data ruled out its use. Nevertheless, a reconsideration of the ratings by the panelists who made them was an integral part of the design of the exercise. The revised ratings were, it is hoped, made on the basis of considering more perspectives rather than a desire for social cohesion. Fortunately, the social dynamics observed in the study suggest that the broader-perspectives effect outweighed any social cohesion effects.

The usual rationale for adjustments to the final standard is that multiple standard setting procedures have been used or additional information is being considered. NCSBN


endorses this approach. The Examination Committee regards the recommendations from the panel as one of many pieces of information to consider. Of course, the panel recommendations carried a substantial amount of weight with the committee, but they were not the only piece of information. Also, the panel's recommendation is not regarded as being correct in an empirical sense, only as a thoughtfully considered consensus by people who have both nursing experience and second-language experience. Using NCSBN's framework, the committee was setting the standard, not the panel. Therefore, adopting a standard that deviates from the panel's recommendation is not really considered an adjustment in the same sense.

Limitations

Typically, there are some shortcomings that are inherent in licensure and certification tests. Test developers are often restricted in the types of data that they can collect to verify the standard. In practice, boards of nursing only license or certify people whom they believe to be competent, because it would be unethical to do otherwise. Because these people come only from the upper end of the ability continuum, there are sampling problems related to attempting to establish the predictive validity of the standard. Another approach to examining the passing standard would be to randomly sample, say, 100 candidates and have raters follow them around for a month to assess their performance. Yet this approach would be impractical and expensive because one would need to recruit participants, provide them with a legal mechanism to practice during the observation period, and hire experienced raters to assess their performance. Furthermore, the presence of a rater could be intrusive enough to change the observed behavior.

Future Activities

After the Examination Committee recommended the standard, it was communicated to the NCSBN boards of nursing. The question now is, how many of the boards of nursing will use this standard as a legal requirement for licensure? In a similar vein, the adoption of these standards for U.S. visa screening purposes is also of interest. Because the adoption of this standard rests entirely with the individual state, NCSBN has little more to do with implementing the standard, other than answering questions and providing documentation about the standard.

As an additional service to boards of nursing, NCSBN intends to provide recommended standards for other English proficiency examinations. This will provide boards of nursing and candidates with more choices in test providers. As part of the decision to select a test for standard setting, NCSBN considers the technical quality and psychometric soundness of the test. Tests of technically poor quality are not considered for receiving an NCSBN recommended standard.

NOTES

1. The National Council of State Boards of Nursing, Inc. (NCSBN) is a not-for-profit organization composed of 60 jurisdictional boards of nursing in the United States and U.S. territories whose mission is to provide leadership to advance regulatory excellence for public protection.
2. The paper format does not have a writing sample component that is incorporated into the total score; however, the results report does have a box for the Test of Written English (TWE) results. Also, the paper and computer versions report results on different scales.
3. For the purposes of this study, all subsequent references to TOEFL refer to the computer-based test.


4. This form of the TOEFL had been administered to more than 750,000 examinees.
5. During the discussion of the first two questions in the Listening section, it became apparent that not all panelists assumed the role of the target examinee. All panelists were consequently reminded to "role play" accordingly and asked to redo their Listening judgments.
6. The CGFNS TOEFL sample was based on the written examination, not the CAT examination. The written examination scores were converted to CAT scores via the following formula: CAT = (Written - 273.9) * 0.769. This formula was based on a conversion table found on page 13 of the TOEFL 2003–04 Information Bulletin for Computer-based and Paper-based Testing.

REFERENCES

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Cizek, G. J. (1996). Standard setting guidelines. Educational Measurement: Issues and Practice, 15, 13–21.
Cizek, G. J. (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Commission on Graduates of Foreign Nursing Schools. (2000). Commission on Graduates of Foreign Nursing Schools validity study April 1999 through March 2000. Unpublished statistical report.
Educational Testing Service. (2000a). TOEFL test of English as a foreign language: Computer-based TOEFL score user's guide. Princeton, NJ: Author.
Educational Testing Service. (2000b). TOEFL test PowerPrep® software. Princeton, NJ: Author.
Educational Testing Service. (2003). TOEFL 2003–04 information bulletin for computer-based and paper-based testing. Princeton, NJ: Author.
Hambleton, R. K., Jaeger, R. M., Plake, B. S., & Mills, C. (2000). Setting performance standards on complex educational assessments. Applied Psychological Measurement, 24, 355–366.
Hurtz, G. M., & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63, 584–601.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Lunz, M. E. (2000). Setting standard on performance examinations. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 181–199). Stamford, CT: Ablex.
National Council of State Boards of Nursing, Inc. (2003). Profile of member boards 2003. Chicago: Author.
Plake, B., & Hambleton, R. (2001). The analytic judgment method for setting standards on complex performance assessments. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 283–312). Mahwah, NJ: Lawrence Erlbaum.
Sireci, S. G., & Clauser, B. E. (2001). Practical issues in setting standards on computerized adaptive tests. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 355–369). Mahwah, NJ: Lawrence Erlbaum.

Acknowledgments. This research was funded by the National Council of State Boards of Nursing. We thank Anne Wendt, RN, PhD, for her assistance with creating the recommended specifications for the selection of panelists.

Offprints. Requests for offprints should be directed to Thomas R. O'Neill, PhD, National Council of State Boards of Nursing, 111 E. Wacker Drive, Suite 2900, Chicago, IL 60601. E-mail: [email protected]
