
THE ASSOCIATION OF TEACHING ENGLISH AS A FOREIGN LANGUAGE IN INDONESIA

International Conference 2015 Denpasar, 14th - 16th September 2015

PROCEEDINGS Teaching and Assessing L2 Learners in the 21st Century

ENGLISH DEPARTMENT, FACULTY OF LETTERS AND CULTURE, IN COLLABORATION WITH UDAYANA UNIVERSITY POSTGRADUATE STUDY PROGRAM

USING CORPORA TO DESIGN A RELIABLE TEST INSTRUMENT FOR ENGLISH PROFICIENCY ASSESSMENT

Faisal Mustafa
[email protected]
Syiah Kuala University, Banda Aceh, Aceh, Indonesia

Abstract

Designing a grammar test, among other tests, for classroom use requires considerable effort, and the result needs to be tested for reliability to ensure that it gives teachers the information they need about their students' achievement and that students are fairly scored. However, the data show that most tests given to students to measure their performance are never tested for reliability. It seems that teachers, as well as lecturers, either do not bother with reliability testing or do not have the knowledge or opportunity to conduct it. Therefore, this paper presents a way to design a reliable test without having to test it for reliability, by using corpora such as COCA and the BNC. After designing two sets of TOEFL-like grammar tests, the writer pilot tested them for reliability to show that this way of designing a test is effective. The results were then compared with the reliability of the TOEFL grammar section designed by ETS, the official TOEFL test designer. The analysis showed that the reliability of the designed tests was .85 and .88, very similar to that of the tests designed by ETS, i.e. .86, indicating that the designed tests were reliable and thus did not require a reliability test. Therefore, it is recommended that teachers use a corpus in designing a grammar test when a reliability test is not an option for obtaining a reliable test.

Keywords: English proficiency assessment, grammar test, reliability test, corpus

1 INTRODUCTION

A test is a part of the teaching and learning process at all levels of education. The results of a test are used to decide whether a student should move up to the next level, and they are recorded in score reports or academic transcripts which prospective employers may later use as recruitment documents. Other tests, such as the TOEFL and IELTS, may be used by scholarship providers in deciding to whom scholarships worth thousands of dollars are granted. These standardised language tests are designed by English language testing institutions which follow the test design process described by Douglas (2014), Brown (2004), Shibliyev & Gilanlıoğlu (2009), and Alias (2005), i.e. conducting a needs analysis, deciding the test tasks, deciding the blueprint, developing test questions, reviewing, and conducting pilot testing. Pilot testing is followed by statistical analysis in which each item is analysed for its level of difficulty and, above all, reliability. As a result, the test is fair for everybody.

However, in the English language classroom those steps are rarely applied. Some procedures of test development are left out, including the reliability test, which has been claimed to be one of the most decisive steps in ensuring the fairness of a test. Frisbie (1988, p. 29) reported that the reliabilities of most standardised tests range between 0.85 and 0.95, while the average reliability of teacher-made tests is 0.50, failing to reach the minimum accepted reliability for classroom use of 0.70 proposed by Wells & Wollack (2003, p. 5) and Douglas (2014, p. 107), or 0.80 proposed by Frisbie (1988, p. 29). This low level of reliability suggests that teachers do not consider reliability testing an important requirement in designing tests. In Indonesia, Javanese language tests designed by teachers in Banyumanik, a subdistrict in Central Java, were rejected by the local education department because they were considered unreliable (Mujimin, 2011). Research conducted by Dwipayani (2013) also revealed that the Indonesian language final exams designed by teachers at Senior High School 1 in Bangli, Bali were not at all reliable. Another unreliable test (0.5), for an Arabic language class, was found at Islamic Senior High School Sabdodadi Bantul in Yogyakarta (Aliyah, 2012).

The fact that teachers do not consider reliability testing a crucial step in designing a test, as in the examples above, poses a threat to the fairness of classroom assessment, considering that its results are potentially used to decide students' futures, among the other purposes of a test. In addition, test results influence students' learning experience because teachers teach or review materials based on what they believe about their students' performance, which in turn is based on the test results (Südkamp, Kaiser, & Möller, 2014, p. 5). In more serious cases, if the test results are recorded in an academic transcript, graduates might lose the chance to be admitted to a higher-level educational institution or to get a job. To avoid these disadvantages, teachers need to be given training because, according to Krolak-Schwerdt, Glock, & Böhmer (2014, p. 1), teacher development contributes to the "ability to assess students' achievements adequately". Teachers in Indonesia do have easy access to training; however, most training focuses on selecting instruments for assessment rather than on designing them. Moreover, English teachers are graduates of English language teacher training institutions where assessment is taught, followed by a one-semester internship at schools. In addition, reliability tests take much time, and most teachers and lecturers teach so many classes that little time is left to conduct such a test for every test they make. Furthermore, Brown (2004, p. 55) doubted that trying out a test, which is where the reliability test takes place, is even possible in the everyday classroom because the try-out needs to be given to students other than those who will take the ready-to-use test. Therefore, an alternative way of developing a reliable test is required.

This paper presents one such alternative for designing a reliable test without having to test it for reliability. One of the tools used for language assessment is a corpus (Gabrielatos, 2005, p. 5).
It is a collection of texts in various fields stored electronically, which allows easy online access (Cheng, 2011; C. Jones & Waller, 2015; Kennedy, 1998). Corpora have been used to analyse grammar, and many works on grammar have been written based on corpus analysis (Hunston, 1999; C. Jones & Waller, 2015). A corpus shows how language is actually used in real communication, both spoken and written. This method of designing a language test has been applied for more than a decade, and thus the next section of this paper discusses how to use corpora to design a test. Before proceeding to that step, a review of the procedures in designing a test is given first.


2 STEPS IN TEST DESIGN

A test should measure what it is intended to measure, and the results of the test are expected to represent what students understand about the tested material. In order to meet those purposes, test development should follow the procedures suggested by Brown (2004), Douglas (2014), Fulcher (2013), and Fulcher & Davidson (2007, 2013). In general, the procedures comprise determining the purpose of the test, drafting the test, evaluating the test, revising the test, and determining the scoring system.

Before making a test, teachers should know why they want to test their students because the test content and format are determined by this purpose (Fulcher, 2013, p. 93). After that, the teacher needs to picture what ability the students should demonstrate, known as the construct, so that the test can be outlined. The teacher then makes a draft of the test. The draft should be based on the specifications, which include the test outline, the skill to be tested, and what the questions will look like (Brown, 2004, pp. 30-31). The type of questions is decided by considering both efficiency (Bachman, 1990, p. 46) and purpose (Douglas, 2014, pp. 48-49). Once the specifications have been decided, drafting can begin.

The draft cannot be considered ready to use because it might contain high measurement error resulting from, according to Wells & Wollack (2003, p. 2), test-specific factors such as items which are too difficult, unclear instructions, or more than one correct answer for multiple-choice questions. Although other factors, such as students' condition and scoring, can also contribute to measurement error, test-specific factors have been claimed to be the most dominant cause (Symonds, 1928, pp. 75-77). When a test has high measurement error, a teacher cannot rely on its results to determine students' progress or achievement; the test is therefore said to have low reliability. To find out the level of reliability, the test needs to be tried out, or pilot tested, on students for whom the test is not actually intended (Brown, 2004, p. 55; Fulcher, 2013, pp. 179-180; Read, 2013, p. 307).

If the statistical analysis proves the test unreliable, it needs revision. Frisbie (1988, p. 30), Wells & Wollack (2003, p. 5), and Fulcher (2013, p. 57) suggest lengthening the test because the longer the test, the more reliable it becomes. However, excessive length potentially creates unreliability due to students' condition, such as exhaustion. Another method of improving reliability is revision based on item analysis, which is part of the statistical analysis in pilot testing, i.e. item discrimination (Wells & Wollack, 2003, p. 7) and level of difficulty (Frisbie, 1988, p. 30; Fulcher, 2013, p. 182). "An item is considered to be discriminating if the high-achieving students tend to answer the item correctly while the lower achieving students tend to respond incorrectly" (Wells & Wollack, 2003, p. 7), and each item should not be too difficult or too easy. If the statistical analysis yields a low level of reliability, the test should be revised by consulting the factors influencing reliability proposed by experts in language testing. Symonds (1928, pp. 75-77) and Henning (1987, p. 78) suggest that these factors include, among others, test length, item difficulty, and item discrimination. Although the number of test items correlates with reliability, time constraints need to be considered in classroom assessment. Instead, items which are too difficult or too easy should be revised. In terms of item discrimination, items which can be answered by most low-achieving learners should be reconsidered. For multiple-choice questions, the distractors should be analysed: when no student selects a certain distractor, it is not a good distractor and should be revised or replaced. In theory, the result of the final revision should be given another cycle of test development until the desired reliability is achieved. However, if the test is not high stakes, the next cycle is not strictly necessary.


3 DESIGNING A GRAMMAR TEST BY USING CORPORA

Corpora have been used in language assessment to help classroom teachers and test developers design high-quality tests. Although a corpus can be used to determine what to test, for example by analysing frequently used words (Moder & Halleck, 2013, pp. 144-145), it can also be used to design the test itself. Because a corpus is composed of sentences used by native speakers, the sentences are grammatically acceptable to most speakers of the language; grammatical errors are therefore less likely to occur when corpora are used in designing the test. In fact, corpora have also been used to validate test materials because even native speakers' intuition cannot always be relied on (Barker, 2014, p. 1019). To design a reliable grammar test with corpora, teachers do not have to follow all the steps of test development discussed in the previous section, because they can use a template from a standardised test which has been proven to follow those steps. The steps of designing a test by using corpora are presented below. However, these steps apply only to grammar tests, on which this work is focused.

3.1 Establishing templates for the test

A teacher does not have to design a test from scratch; a standardised test can be used as a template since, according to Brown (2004, pp. 76, 81) and Nissan & Schedl (2013, p. 81), standardised tests have undergone extensive research before and after they are designed. One of the tests in the selected standardised test is taken as the template, and the teacher designs a test by following it. This does not guarantee the same level of reliability, but it should at least yield a test reliable enough to use in the classroom, i.e. 0.70 as suggested by Wells & Wollack (2003, p. 5). In this research, a grammar test was designed based on a template from the Structure and Written Expression section of the TOEFL designed by ETS. It is the second section of the paper-based TOEFL, consisting of 40 multiple-choice questions.

3.2 Fitting items from corpora into the template

In filling in the template, a teacher has to consider the subject matter of the item in the chosen template, while the topic (language feature) tested has already been determined by the template. For example, the Structure and Written Expression section of the TOEFL is made up of test items from a variety of subject matters, i.e. natural and social sciences, arts, literature, geography, economics, law, and history (Hilke & Wadden, 1997, pp. 35-37). When searching for sentences in corpora, the search should be restricted to the subject matter in the template, which is called a category in corpora; most corpora enable filtering by these categories. This step eliminates errors which contribute to low reliability, because one of those errors, according to Jones (2013, p. 352), is test content.

3.3 Handling search results from corpora

A single search in a corpus with the predetermined category restriction might sometimes return hundreds of hits, and other times not even one. When the search results are abundant, the teacher should choose the one closest to the template in terms of subject matter and topic. When the result is zero, which is not uncommon, other corpora should be used, or the subject matter restriction should be dropped to widen the search; the latter, however, sacrifices some reliability. A sketch of this selection and fallback logic is given below.
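To make the procedure concrete, the following is a minimal Python sketch of the selection and fallback logic just described. It assumes the search hits have already been exported from the corpus interfaces (COCA, BNC, GloWbE, COHA) into plain Python data structures; the field names, the keyword-overlap scoring, and both helper functions are illustrative assumptions, not part of any corpus API.

```python
# Hypothetical sketch: choosing a corpus sentence for one template item.
# "corpora_results" maps a corpus name to a list of exported hits; each hit is
# a dict with "category" (subject matter) and "sentence". These structures and
# the overlap-based scoring are assumptions for illustration only.

def pick_sentence(hits, template_keywords):
    """Return the hit whose sentence shares the most words with the template item."""
    if not hits:
        return None
    keywords = {w.lower() for w in template_keywords}
    return max(hits, key=lambda h: len(set(h["sentence"].lower().split()) & keywords))

def search_with_fallback(corpora_results, category, template_keywords):
    """Respect the subject-matter filter first; drop it only when nothing is found."""
    # First pass: keep the category (subject matter) restriction from the template.
    for corpus_name, hits in corpora_results.items():
        chosen = pick_sentence(
            [h for h in hits if h.get("category") == category], template_keywords
        )
        if chosen:
            return corpus_name, chosen
    # Second pass: widen the search by ignoring the category, at some cost to
    # reliability, as noted above.
    for corpus_name, hits in corpora_results.items():
        chosen = pick_sentence(hits, template_keywords)
        if chosen:
            return corpus_name, chosen
    return None, None

# Example with made-up hits:
results = {
    "COCA": [{"category": "science/technology",
              "sentence": "The enzyme accelerates the reaction considerably."}],
    "BNC": [{"category": "academic",
             "sentence": "Few studies have examined this phenomenon in detail."}],
}
print(search_with_fallback(results, "science/technology", ["enzyme", "reaction"]))
```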

3.4 Writing up options (distractors) for multiple-choice questions

Another challenging step is to make sure that the options for multiple-choice questions contain only one correct answer. It is not unlikely that tests designed by classroom teachers have more than one correct answer, while the worst scenario is no correct answer at all.


Since the test is designed by using a template, the options for multiple-choice questions can be taken directly, or if necessary adapted, from the template. However, for error analysis questions, the second part of the Structure and Written Expression section of the TOEFL, deciding which parts of each item are used as distractors is not as easy as it is for multiple-choice questions. For experienced EFL teachers, this task can be simple because they can predict what errors students usually make. However, others, including ESL teachers, are advised to use learner corpora to find out the actual errors made by English learners at a given level (Barker, 2014, pp. 1018-1019).

4 PROVING THE RELIABILITY

In order to prove that a test designed by using a corpus is reliable, two grammar tests were designed by following the steps presented above. In designing these tests, the writer used the Corpus of Contemporary American English (COCA), which is publicly available at http://corpus.byu.edu. The corpus contains 450 million words from 1990 to 2012. When COCA did not give any search result, the writer used other corpora available at the same website, such as Global Web-Based English (GloWbE), consisting of 1.9 billion words from 2012-2013, the Corpus of Historical American English (COHA), consisting of 400 million words from 1810-2009, and the British National Corpus (BNC), consisting of 100 million words from the 1980s to 1993.

Filtering features differ slightly from one corpus to another. In COCA and the BNC, the second most used corpus for this paper, the search can be filtered by several main categories (subject matters), i.e. spoken, fiction, magazine, newspaper, and academic. The academic category was specifically used for designing these tests. Under that category there are subcategories which should be selected to match those in the template, i.e. education, history, geography/social science, law/political science, humanities, philosophy, science/technology, and medicine. These subcategories are only available in COCA, not in the BNC; GloWbE can only be filtered by country and COHA by year. For corpora other than COCA, the subject matters were therefore determined from context because those corpora do not provide such a filtering feature.

The templates used for these tests were official paper-based TOEFL materials administered in August 1996 (hereinafter referred to as Test A) and May 1996 (hereinafter referred to as Test B). Each test consists of 40 items in two parts: the first 15 questions are fill-in-the-blank questions, and the rest are error analysis questions. In choosing distractors for the error analysis questions, the writer used his intuition, based on experience in teaching EFL students, because the tests were intended only for classroom use.

After the tests had been developed, they were pilot tested to find out their levels of reliability. Because each item in the tests assesses a different language ability, the most applicable reliability test is test-retest reliability (Bachman, 1990, p. 174). It ensures that if a test taker takes the test twice at different times, the test taker will get roughly the same score; if the difference between the two scores is within the allowed range, the test is considered reliable. The participants for Test A were 27 students and 5 graduates of Syiah Kuala University in Banda Aceh, Indonesia, majoring in English language teaching. Some students were in their second year and others were in their third and fourth years; the graduates were all experienced English tutors. For Test B, 23 fresh graduates of the same major participated. They had different levels of English proficiency and had spent one year teaching in remote areas across the Indonesian islands, sponsored by the Indonesian Government. The participants took the second test at most a week after the first.


To analyse the level of reliability of these tests, the data from the first and second administrations were analysed using Pearson's product-moment coefficient of correlation (r), as suggested by Best & Kahn (2005, p. 384) and Henning (1987, p. 60):

$$ r_{xy} = \frac{\sum xy}{\sqrt{\left(\sum x^{2}\right)\left(\sum y^{2}\right)}} $$

where:

r_xy = the correlation between the two sets of raw scores
Σxy  = the sum of the cross products of deviation scores, Σ(X − X̄)(Y − Ȳ)
Σx²  = the sum of the squared deviations of each X score from X̄, Σ(X − X̄)²
Σy²  = the sum of the squared deviations of each Y score from Ȳ, Σ(Y − Ȳ)²

The test results of Test A and Test B were tabulated in two separate tables, and the scores were then processed to obtain the quantities required by the above formula. The results of the calculation are summarised in the following table.

|        | ΣX  | ΣY  | Σ(X − X̄) | Σ(Y − Ȳ) | Σ(X − X̄)² | Σ(Y − Ȳ)² | Σ(X − X̄)(Y − Ȳ) |
|--------|-----|-----|----------|----------|-----------|-----------|-----------------|
| Test A | 735 | 757 | 0        | 0        | 1754.96   | 1931.21   | 1620.66         |
| Test B | 427 | 420 | 0        | 0        | 649.65    | 598.43    | 532.61          |

Based on the data presented above, the correlation can be calculated by inserting the data into the Pearson product-moment formula.

Test A:

$$ r_{xy} = \frac{1620.66}{\sqrt{1754.96 \times 1931.21}} = .88 $$

Test B:

$$ r_{xy} = \frac{532.61}{\sqrt{649.65 \times 598.43}} = .85 $$
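For teachers who wish to reproduce the test-retest calculation themselves, the following Python sketch implements the deviation-score form of Pearson's product-moment correlation used above. The score lists in the example are invented for illustration and are not the actual participant data.

```python
from math import sqrt

def pearson_r(x_scores, y_scores):
    """Test-retest reliability: correlation between first- and second-administration scores."""
    n = len(x_scores)
    mean_x = sum(x_scores) / n
    mean_y = sum(y_scores) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_scores, y_scores))
    sum_x2 = sum((x - mean_x) ** 2 for x in x_scores)
    sum_y2 = sum((y - mean_y) ** 2 for y in y_scores)
    return sum_xy / sqrt(sum_x2 * sum_y2)

# Illustrative scores for five test takers (not the study's data):
first_administration = [28, 31, 22, 35, 30]
second_administration = [27, 33, 24, 34, 29]
print(round(pearson_r(first_administration, second_administration), 2))  # 0.94
```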

The results of the calculation, .88 for Test A and .85 for Test B, show that both tests were highly reliable for classroom use according to Wells & Wollack (2003, p. 5), Douglas (2014, p. 107), and Frisbie (1988, p. 29). This level of reliability is very similar to that of the Structure and Written Expression section of the real TOEFL administered between July 1995 and June 1996, which was .86 according to ETS (1997, p. 30). One of the item characteristics that most influences reliability is the difficulty index, which should be between .33 and .67 (Henning, 1987, p. 50); otherwise an item is too difficult or too easy. The difficulty index for each item is calculated by dividing the number of participants answering the item correctly by the total number of participants. In the TOEFL Preparation Kit published by ETS (1995), difficulty is divided into three levels, i.e. easy (.80 - 1.00), medium (.57 - .79), and difficult (.00 - .56). To show the difficulty index for a test designed using the method proposed in this paper, Test A was chosen for the analysis because it was administered to participants with more varied scores than those in Test B. The numbers of items categorised as easy, medium, and difficult, compared with the four tests in the TOEFL Preparation Kit (1995), are presented in the following table.

| Level of Difficulty    | Test A | TOEFL Prep. Kit Test 1 | Test 2 | Test 3 | Test 4 |
|------------------------|--------|------------------------|--------|--------|--------|
| Easy                   | 11     | 13                     | 11     | 10     | 13     |
| Medium                 | 10     | 19                     | 23     | 18     | 17     |
| Difficult              | 19     | 8                      | 6      | 12     | 10     |
| Acceptable (.33 - .67) | 17     | —                      | —      | —      | —      |

In addition to the difficulty index, the items in a test should also be able to discriminate between higher-ability and lower-ability participants (Fulcher & Davidson, 2007, p. 124; Henning, 1987, p. 51). Both Fulcher & Davidson (2007, p. 103) and Henning (1987, p. 52) suggest using the point-biserial method to calculate item discrimination, with the following formula.

$$ r_{pbi} = \frac{\bar{X}_{p} - \bar{X}_{q}}{s_{x}}\sqrt{pq} $$

where:

r_pbi = the point-biserial correlation
X̄_p  = the mean score of participants who answered the item correctly
X̄_q  = the mean score of participants who answered the item incorrectly
s_x  = the standard deviation of the scores
p    = the proportion of participants who answered the item correctly
q    = the proportion of participants who answered the item incorrectly
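The two item statistics used in this section, the difficulty index and the point-biserial discrimination index, can be computed directly from a scored response matrix. The Python sketch below follows the formulas given above; the response data are invented for illustration, and the use of the population standard deviation is an assumption, since that detail is not fixed in the excerpt cited.

```python
from math import sqrt

def difficulty_index(item_responses):
    """Proportion of participants answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

def point_biserial(item_responses, total_scores):
    """Point-biserial correlation between one item and the participants' total scores."""
    n = len(total_scores)
    mean = sum(total_scores) / n
    # Population standard deviation of the total scores (an assumption; see lead-in).
    sd = sqrt(sum((s - mean) ** 2 for s in total_scores) / n)
    correct = [s for r, s in zip(item_responses, total_scores) if r == 1]
    incorrect = [s for r, s in zip(item_responses, total_scores) if r == 0]
    if not correct or not incorrect or sd == 0:
        return 0.0  # everyone answered the same way: the item cannot discriminate
    p = len(correct) / n
    q = 1 - p
    return (sum(correct) / len(correct) - sum(incorrect) / len(incorrect)) / sd * sqrt(p * q)

# Illustrative data: one item answered by six participants, plus their total scores.
item = [1, 1, 0, 1, 0, 1]
totals = [34, 31, 22, 36, 25, 30]
print(round(difficulty_index(item), 2))        # 0.67, within the acceptable .33-.67 range
print(round(point_biserial(item, totals), 2))  # about 0.90, well above the .25 threshold
```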

The results of the discrimination index calculation, also for Test A, show that only 5 items (12.5%) fall below .25, the lowest acceptable point-biserial correlation according to Fulcher (2013, p. 185) and Henning (1987, p. 53). This indicates that most items in the test (87.5%) discriminated between high-achieving and low-achieving participants.


5 CONCLUSIONS AND SUGGESTIONS

Classroom teachers and lecturers rarely, if ever, try out the test materials they design for classroom use. In some cases, teachers do not have adequate time for a try-out, and in others it is simply impossible. Therefore, they cannot guarantee that the tests they use in the classroom, which also shape washback, are reliable. Fortunately, a reliable test which does not need pilot testing can be designed by using a corpus. To design such a test, a template should be taken from a test whose reliability has been proven, such as a standardised test. Sentences are then searched for in corpora by considering the topic and subject matter in the template. The data from the corpus are fitted into the template, and if the test is multiple choice, the options in the template can be used directly, with some modification if necessary. When this is not possible, teachers can use their intuition in writing distractors, or consult learner corpora for higher-stakes tests.

A test designed by using this method has been shown to be reliable. Two tests were designed following this procedure and pilot tested twice (test-retest) to find out their level of reliability. The analysis showed reliability coefficients of .88 and .85, only slightly different from the reliability of the standardised test (TOEFL) from which the templates were taken, confirming that such tests are highly reliable for classroom use. Therefore, this method of designing a test is recommended for teachers so that their students are fairly scored.

One drawback of this method is that the difficulty index does not always match that of the template. This result is not surprising because the templates used for these tests covered many advanced language features which the participants had not learned. In classroom use, teachers should only use templates covering topics which have been taught to the students. This should improve the quality of the test in terms of item difficulty, which positively influences reliability.

However, the results of this research apply only to grammar tests. This does not mean that corpora should not be used in designing tests for other skills; since each skill is tested differently, the procedures for designing those tests are presumably different as well. Therefore, further research presenting procedures for using corpora to design tests for other skills is required, together with reliability testing to prove that the proposed method is applicable.

REFERENCES

Alias, M. (2005). Assessment of learning outcomes: Validity and reliability of classroom tests. World Transactions on Engineering and Technology Education, 4(2), 235-238.
Aliyah, D. (2012). Analisis kualitas soal ujian semester 1 mata pelajaran Bahasa Arab kelas XII Madrasah Aliyah Negeri Sabdodadi Bantul Tahun Ajaran 2011/2012. Undergraduate thesis, Universitas Islam Negeri Sunan Kalijaga. Unpublished.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Barker, F. (2014). Using corpora to design assessment. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 4, pp. 1013-1028). Hoboken: John Wiley & Sons.
Best, J. W., & Kahn, J. V. (2005). Research in education. New York: Pearson Education.
Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York: Pearson Education.
Cheng, W. (2011). Exploring corpus linguistics: Language in action. New York: Routledge.


Douglas, D. (2014). Understanding language testing. London: Routledge.
Dwipayani, A. A. S. (2013). Analisis validitas dan reliabilitas soal ulangan akhir semester Bidang Studi Bahasa Indonesia kelas X.D SMA N 1 terhadap pencapaian kompetensi. Jurnal Jurusan Pendidikan Bahasa dan Sastra Indonesia, 1(5), 1-18.
Educational Testing Service. (1995). TOEFL test preparation kit (Vol. 1). New Jersey: Educational Testing Service.
Educational Testing Service. (1997). TOEFL test and score manual (1997 ed.). New Jersey: Educational Testing Service.
Frisbie, D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7(1), 25-35.
Fulcher, G. (2013). Practical language testing. London: Hodder Education.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Fulcher, G., & Davidson, F. (2013). The Routledge handbook of language testing. New York: Routledge.
Gabrielatos, C. (2005). Corpora and language teaching: Just a fling or wedding bells? The Electronic Journal for English as a Second Language, 8(4), 1-35.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Beijing: Foreign Language Teaching and Research Press.
Hilke, R., & Wadden, P. (1997). The TOEFL and its imitators: Analyzing the TOEFL and evaluating TOEFL-prep texts. RELC Journal, 28(1), 28-53.
Hunston, S. (1999). Pattern grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.
Jones, C., & Waller, D. (2015). Corpus linguistics for grammar: A research guide. New York: Routledge.
Jones, N. (2013). Reliability and dependability. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 350-362). New York: Routledge.
Kennedy, G. D. (1998). An introduction to corpus linguistics. London: Longman.
Krolak-Schwerdt, S., Glock, S., & Böhmer, M. (2014). Teachers' professional development: Assessment, training, and learning. Rotterdam: Sense Publishers.
Moder, C. L., & Halleck, G. B. (2013). Designing language tests for specific social uses. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 137-149). New York: Routledge.
Mujimin. (2011). Kompetensi guru dalam menyusun butir soal pada mata pelajaran Bahasa Jawa di sekolah dasar. Lingua Jurnal Bahasa dan Sastra, 6(2), n.p.
Nissan, S., & Schedl, M. (2013). Prototyping new item types. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 281-294). New York: Routledge.
Read, J. (2013). Piloting vocabulary tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 307-320). New York: Routledge.
Shibliyev, J., & Gilanlıoğlu, İ. (2009). Language testing and assessment: An advanced resource book. ELT Journal, 63(2), 181-183.
Südkamp, A., Kaiser, J., & Möller, J. (2014). Teachers' judgments of students' academic achievement. In S. Krolak-Schwerdt, S. Glock, & M. Böhmer (Eds.), Teachers' professional development: Assessment, training, and learning (pp. 5-25). Rotterdam: Sense Publishers.
Symonds, P. M. (1928). Factors influencing test reliability. Journal of Educational Psychology, 19(2), 73-87.
Wells, C. S., & Wollack, J. A. (2003). An instructor's guide to understanding test reliability. Wisconsin: Testing & Evaluation Services, University of Wisconsin.
