Journal of Linguistics and Language Teaching

Volume 12 (2021) Issue 2, pp. 247-252


Greta Gorsuch & Dale T. Griffee (2018): Second Language Testing for Student Evaluation and Classroom Research. Charlotte, NC: Information Age Publishing Inc. 369 pp. (ISBN 978-1-64113-011-0)


Among the many publications offering both theoretical and practical perspectives on the field of language testing, Gorsuch & Griffee’s book stands out in many respects. It is an introduction to the field, primarily intended for two main groups of readers:

  • classroom teachers of all levels of experience, who wish to develop or improve their own classroom-based tests or to deepen their understanding of standardized tests’ constructs and the interpretation of their scores, and

  • students of linguistics or second language education programs, who do not yet have much training in language testing and will therefore find the book helpful as an introduction to the field.

The book comprises two volumes, the main 369-page one and an accompanying student workbook with supplementary tasks. 

The main volume also contains appendices that provide further explanations and usefully expand on the issues discussed in the main part of the book, a comprehensive glossary of important terminology and, importantly, an extensive list of references, directing readers both towards the seminal theoretical works in the field and towards publications of practical usefulness.

In the introduction (xiii-xix), the authors give a brief overview of the history of language testing, followed by an explanation of why the field currently seems to be at a turning point; here, they put a particular emphasis on the issues of fairness and the social consequences of tests, as first discussed by Messick (1989, cited in Gorsuch & Griffee, 2018) and later by McNamara (2008) (xvi). In doing so, the authors demonstrate how mindful they are of their readers - after all, as teachers, we are particularly aware of the social consequences of language tests, even if we do not always think of them in terms of language testing validity.

Chapter 1 (1-20) focuses on norm-referenced tests (NRTs), not only because of their long history and the fact that most readers are likely to be familiar with this type of test, but also because presenting NRTs allows the authors to introduce basic concepts such as fitness for purpose and the necessity to validate NRTs locally, wherever they are to be used, to ensure their validity and fairness for a particular test population. Readers are also taught (or reminded) that, as instruments, tests are not valid in themselves. Instead, “the question of validity is focused on the interpretation that test consumers make based on the scores” (12). With regard to score interpretation, the emphasis is placed on the fact that scores are subject to the standard error of measurement (SEM), and responsible providers of standardized tests should always supply test users with information about their SEMs. This is something that school administrators and teachers should be aware of when they want to make decisions based on external standardized tests.
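
For readers who want a concrete sense of what the SEM means in practice, the following minimal Python sketch (not taken from the book; all figures are invented) illustrates the standard relationship between a test’s standard deviation, its reliability and the resulting error band around an observed score.

```python
# Illustrative sketch (invented figures): the SEM is commonly estimated as
# SD * sqrt(1 - reliability), and an observed score is then read as a band
# of +/- 1 SEM rather than as a single point.
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Estimate the SEM from the score standard deviation and a reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical figures for a standardized test: SD = 15, reported reliability = 0.91
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
observed_score = 68.0

# A rough 68% confidence band (+/- 1 SEM) around the observed score
low, high = observed_score - sem, observed_score + sem
print(f"SEM = {sem:.1f}; the 'true' score plausibly lies between {low:.1f} and {high:.1f}")
```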

Already in Chapter 1, the authors carefully cross-reference the terms they introduce, both to later chapters and to the appendices. Some more basic terms are introduced first (1-3), followed by those that are conceptually (or mathematically) more difficult - an illustrative example being item facility (IF), followed by item discrimination (ID) (3-5). The presentation of these two terms is cross-referenced, in turn, to Chapter 8 (185), where the concept of reliability is presented, allowing readers to deepen their understanding of the usefulness of IF and ID for deciding whether a given test is reliable. Readers are also referred to practical hands-on tasks in the Student Workbook, where they can apply their new knowledge and calculate IF and ID for themselves, using the data provided for this purpose.
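
As a quick illustration of these two indices (a sketch of one common way of computing them, with invented data, and not necessarily the exact procedure used in the Student Workbook): IF is simply the proportion of test takers answering an item correctly, while ID compares the IF of the highest- and lowest-scoring groups.

```python
# Invented response data: rows = test takers, columns = items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

def item_facility(item: int) -> float:
    """IF: the proportion of all test takers answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(item: int, group_size: int = 2) -> float:
    """ID: IF of the top-scoring group minus IF of the bottom-scoring group."""
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    return (sum(row[item] for row in upper) - sum(row[item] for row in lower)) / group_size

for i in range(len(responses[0])):
    print(f"Item {i + 1}: IF = {item_facility(i):.2f}, ID = {item_discrimination(i):+.2f}")
```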

The chapter finishes with a description of a procedure for creating a standardized speaking placement test (16-20), which could be used by a group of teachers in a school or a university language center and subsequently validated locally. This section of Chapter 1 also serves as a review of the chapter's content.

Beginning with a brief discussion of various definitions of what a test is and an explanation of what selected and constructed response items are, Chapter 2 (21-40) presents various test item formats such as multiple choice, true or false, matching items, fill-in-the-blank items, short-response items and, finally, task items. Each of these sections comprises a brief description of the respective test item category, item writing guidelines and possible problems, and includes some examples of test items. The last section of Chapter 2 discusses the grading of both objectively and subjectively scored items. The latter is particularly interesting since it requires a grading scale, either holistic or analytic; the authors briefly discuss both types and provide some practical examples.

In Chapter 3 (41-71), the authors write about criterion-referenced tests (CRTs), comparing them with the previously discussed norm-referenced tests, providing a clearly formulated definition of a CRT (42) as well as some related concepts such as mastery, criterion and construct (42), and describing both the advantages and disadvantages of these tests. Teachers are less likely to design and use their own NRTs, which require more resources and will typically be produced by professional testing companies, while their scores will be used by teachers or administrators for entrance or placement purposes. CRTs, on the other hand, are the tests which “create a conceptual bridge among the textbooks we [teachers] use, what we do in class, and how we have learners demonstrate their abilities through the tests teachers design” (41). It is therefore not surprising that this chapter is more detailed than the previous one on NRTs, and that it offers even more information about test item analysis, reliability estimates (including the strategies specifically used for CRTs) and basic validation strategies such as designing an intervention study, doing item analysis or defining the construct. Usefully for teachers, Chapter 3 concludes with the 12 steps for test development described in Downing (2006) (62-71).

Chapter 4 (73-93) focuses on the role of theory in second language testing. The authors begin by defining their terms and then explain why teachers often fail to appreciate the importance of theory, pointing to reasons such as the inadequate emphasis placed on theory in teacher training courses or confusion in research terminology itself. Particularly noteworthy in this chapter is the authors’ presentation of their own model of the different levels at which theories might influence teaching and testing practice.

Despite its brevity, this is an important chapter because without a theory of language and a theory of language acquisition, it is not possible to define and validate constructs for language testing. For many practice-oriented teachers, it might also be useful to realize that a lack of understanding of theory can lead to various kinds of erroneous practices. A particularly illustrative example is that of teachers following communicative competence theory while teaching, yet testing their learners in a pre-communicative way only, using discrete grammar or vocabulary test items.

Finally, in this chapter the authors emphasize the necessity of aligning course objectives with tests or sub-tests, which can only be achieved if test constructs are clearly defined and the test items are well described in the test specifications (92). They also advocate that, if the purpose of a given course or its objectives have not been formulated explicitly, teachers should take the initiative, define their course objectives themselves and write their tests accordingly, as this is “an important way for teachers to make headway and meaning in such aimless situations” (92) - in doing so, teachers will feel more empowered, but their efforts must be founded on theory. Throughout this chapter, inquisitive readers will find numerous references to seminal theoretical contributions to the field of language testing, such as Canale & Swain (1980, 1981), Bachman (1990), Bachman & Palmer (1996, 2010) or McNamara (1996).

Chapter 5 (9), which is highly practical, focuses on designing and using performance tests - and, when it comes to communicative teaching (and particularly to testing the productive skills of speaking and writing), performance tests are an essential part of every teacher’s ‘professional toolkit’. First, the authors explain some basic concepts (such as a communicative task) and provide a clear definition of a performance test, presenting a slightly simplified model of how a performance test works - both from the perspective of a candidate taking the test and from the perspective of the teacher designing it (98-104). While describing the latter, the authors highlight the relationship between a real-life task, the curriculum and a test task, and they advocate making performance tests as closely related as possible to the candidates’ real-life needs and curricula. This is particularly difficult to achieve (but important to strive for) in university contexts and in teaching language for specific purposes, where teachers of second or foreign languages are typically not experts in their students’ disciplines. The authors provide an example of how this can be done (Gorsuch 2017) (102) and encourage teachers “to investigate (...) and use tasks for performance tests that are reasonably relevant to their curriculum and to the real-world needs of their students” (102). In this chapter, they also offer a whole section of inspiring examples of performance tests (106-111). After a brief discussion of the advantages and disadvantages of these examples, the authors offer a step-by-step explanation of how teachers might design their own performance tests, and how they might later estimate their reliability and provide validity evidence (115-130). The chapter finishes with ample and highly useful references to the literature on performance testing - including, among other topics, testing language for specific purposes and rater reliability and training (133-135). All in all, without delving into complex theories of performance testing, which may not be very useful for practitioners, this chapter gives a comprehensive overview of the most important aspects of this type of test, as well as useful pointers on where to look for more knowledge and ideas on this topic.

Chapters 6 and 7 focus on basic statistical concepts and procedures indispensable for language testing. In Chapter 6 (137-157), readers will find an introduction to different types of data measurement scales and to basic descriptive statistics (including measures of central tendency, measures of dispersion and the normal distribution). The authors particularly focus on those statistical procedures which can be useful for teachers, who typically have small groups of test takers. Chapter 7 (159-184) is devoted to correlation, beginning with some basic terms and their definitions and also discussing causality. Some of the procedures described can be carried out in common spreadsheet programs, but others (e.g. the Kolmogorov-Smirnov test or the Shapiro-Wilk test) will require access to a statistical program such as SPSS. As in the previous chapters, interested readers will find many references to resources devoted solely to statistics, as well as questions and tasks in the workbook which they can use to verify their understanding of the content of these two chapters.
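
For readers without access to SPSS, such checks can also be run in other environments; the short Python sketch below (with invented scores, offered here as an alternative illustration rather than anything from the book) shows a Shapiro-Wilk normality check and a Pearson correlation of the kind discussed in Chapter 7.

```python
# Invented scores for ten learners on two classroom tests.
from scipy import stats

test_a = [55, 62, 70, 48, 81, 77, 66, 59, 90, 73]
test_b = [58, 60, 75, 50, 85, 72, 68, 61, 88, 70]

# Shapiro-Wilk test of normality for each set of scores
for name, data in (("Test A", test_a), ("Test B", test_b)):
    w, p = stats.shapiro(data)
    print(f"{name}: Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")

# Pearson correlation between the two sets of scores
r, p = stats.pearsonr(test_a, test_b)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```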

Chapter 8 (185-217) is about reliability, a concept mentioned in earlier chapters but certainly meriting more in-depth discussion. The authors particularly focus on test score reliability - and, more precisely, on its traditional version, though they also briefly describe reliability as it is understood within the modern test theory framework (e.g. the Rasch models). Since rater reliability is important for performance tests, a section of the chapter is devoted to it as well (198-201). The presentation of the various reliability coefficients for both norm- and criterion-referenced tests is clear and practical. It is in this chapter that the definition of tests as data-collection instruments shines through, and anyone interested in designing other research instruments, such as questionnaires, will also find useful pointers to further literature on this topic. The authors rightly remind us that, rather than being a property of a given test, reliability is “population- and occasion-specific even on the same test”, and therefore encourage their readers to calculate and report it on every occasion (216). Otherwise, how should we know whether to trust test scores? What many teachers, teacher-researchers or teacher trainees might find particularly useful in this chapter are the clear examples of various methods of increasing test score reliability and of how to calculate various reliability coefficients. Again, this is a chapter that maintains a fine balance between theory and practice - with a somewhat greater focus on practicality.
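
As one concrete illustration of what such a coefficient looks like in practice (the chapter itself presents a range of coefficients for both NRTs and CRTs), the sketch below computes Cronbach’s alpha, a widely used internal-consistency estimate, for a small invented data set; it is a minimal sketch, not the book’s own worked example.

```python
# Invented item-level scores: rows = test takers, columns = items.
import statistics

data = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(item_scores[0])
    item_vars = [statistics.pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")
```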

Chapter 9 (219-246) discusses two hugely important but challenging concepts in language testing: validity and validation. The former concerns what a test measures, while the latter refers to the process of providing evidence for validity. After giving a brief overview of the history of validity, Gorsuch and Griffee focus on two validity models based on Messick (1989, 1995) (222-223) and on their interpretations by Kunnan (1998) and Bachman (1990) (223). The authors observe that in the literature on this topic more emphasis is placed on score interpretations and less on social interpretations, and they suggest redressing the balance, leaving open the question of which validity evidence ought to be given priority and in what order the evidence should be presented. Given that different stakeholders may have different priorities, readers will find in this chapter a useful list of questions that teachers, researchers, educators and evaluators may ask themselves from their specific perspectives. The remaining part of the chapter focuses on validation, using Messick’s model (i.e. considering each quadrant of his model: score interpretation, test usefulness, stakeholder values, and social as well as learning and teaching consequences). Of particular usefulness to teachers could be the presentation of how validity evidence regarding score interpretations can be collected by teachers - usually working without the help of language testing experts. In traditional terms, this means collecting evidence with regard to construct and content validity. Equally important are the suggestions on how to gather evidence for the remaining quadrants - and here, while discussing washback effects (in other words, the social consequences of a test), the authors observe that most studies on washback are done by experts studying the effects of high-stakes tests, and very few published studies are done by teachers using tests developed for their own classrooms. However, it must be emphasized that washback is not a direct and easily observable phenomenon (Tsagari 2011) (238), and therefore, in order "to capture possibly faint and elusive evidence of test washback", the authors propose a causal model (238). Their model, building on the work of Saif (2006), allows for a close alignment between the curriculum, teaching and learning materials, and the test design - in which, to the reviewer's mind, lies its special value. By thinking through this model and discussing it with colleagues, teachers will be able to collect validity evidence concerning the washback effects of their classroom tests (i.e. Messick’s social consequences). The chapter ends with a list of chronological steps for validating a test (243-245). All in all, it provides readers with an introduction to one of the most important theoretical concepts in language testing as well as with practical ideas on how to collect validity evidence for classroom tests.

Chapter 10 (247-265) is about standard setting and cut scores, another essential concern in language testing. Having presented and discussed the problems associated with traditional methods of setting cut scores, which often result in unfair decisions, the authors focus on more recent ones, such as the Angoff method and its variants (the yes-no method and the direct consensus method). The Angoff method requires time and the participation of several raters, which, when teachers work alone on their classroom-based tests, is simply not available. The direct consensus method, on the other hand, allows teachers to work by themselves or with just one colleague (255-257). This again shows how the authors always take individual teaching situations into account and how what they describe is often directly applicable in such contexts, where time or other resources might be in short supply. Next, the authors focus on learner- or group-based methods, such as the contrasting-groups method (257-261). Of particular interest could also be the less popular but nonetheless useful method described towards the end of this chapter: Griffee’s statistical method (261-264). This method, essentially based on the descriptive statistics calculated for a test administered to a given population of test takers, is suitable for situations in which there is no possibility to work with raters, or in which no data exist for any of the previously described group-based methods to be applied. Importantly, apart from the necessary calculations, Griffee’s statistical method also involves human judgement, and this is another take-away for teachers reading this chapter: despite their apparent statistical objectivity, all the methods described involve human judgement - either that of teachers or that of language testing experts participating in standard setting. Overall, these methods are presented in great detail and clearly illustrated with examples, which makes this chapter useful for anyone interested in standard setting.
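
To make the basic logic of the Angoff method tangible, the following sketch uses invented judge estimates and is deliberately simplified compared with the chapter’s own worked examples: each judge estimates, per item, the probability that a minimally competent test taker would answer correctly, and the cut score is the mean of the judges’ summed estimates.

```python
# Invented judge estimates for a five-item test: each value is a judge's estimate
# of the probability that a minimally competent test taker answers the item correctly.
judges = {
    "Judge 1": [0.8, 0.6, 0.4, 0.9, 0.7],
    "Judge 2": [0.7, 0.5, 0.5, 0.8, 0.6],
    "Judge 3": [0.9, 0.6, 0.3, 0.9, 0.8],
}

# Each judge's expected score for the borderline candidate is the sum of their estimates;
# the recommended cut score is the mean of those expected scores across judges.
expected_scores = {name: sum(estimates) for name, estimates in judges.items()}
cut_score = sum(expected_scores.values()) / len(expected_scores)

for name, score in expected_scores.items():
    print(f"{name}: expected borderline score = {score:.1f} / 5")
print(f"Recommended cut score: {cut_score:.1f} points out of 5")
```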

In Chapter 11, Tests and Teaching (267-281), Gorsuch & Griffee discuss what they rightly see as a lack of alignment between tests and teaching. Having analysed the reasons for this lack, the authors offer some solutions. In many institutions, there persist “implicit and undiscussed values that support a focus on grammar and vocabulary for teaching as well as tests” - even though many teachers believe they adhere to communicative language theory (270). Better teacher education and a closer inspection of the values that determine the types of tests and item formats used both by individual teachers and by institutions might therefore constitute a first step towards reducing the disconnect between teaching and testing. Another solution is to make well-informed use of the so-called ‘test effect’, which the authors describe in greater detail and illustrate with examples from several research studies (e.g. Butler & Roediger 2007, Roediger, Agarwal, McDaniel & McDermott 2011, Roediger & Karpicke 2006). Briefly, by taking tests, the learners participating in these studies not only had better immediate recall but also did better on delayed post-tests. In practice, this means that teachers should test frequently and that their tests should be “pedagogically worthwhile” (272-273). In doing so, they should use descriptive feedback to help learners plan their learning and self-regulate more efficiently. Such good practices will help bridge the gap between teaching and testing.

In this chapter, Gorsuch & Griffee also discuss the idea of dynamic assessment (274-276) and develop their own practical classroom model for this type of assessment. As in the previous chapters, their model and the practical examples they offer may serve as an inspiration for teachers. However, two main challenges (also discussed in this chapter) remain: first, it is not clear how to apply the model to large groups of learners, and second, no methodological model for pedagogical interventions has been developed yet, which makes dynamic assessment a potentially interesting area for future research and development (280-281).

Chapter 12 (283-296) deals with tests as data-collection instruments used for classroom research. After a brief description of different types of research, such as confirmatory and descriptive research, the authors discuss course evaluation as a type of research, offering their own model for it. As in the previous chapters, many examples of studies using different types of research are given, and any reader interested in finding out more about these studies and their methodologies will find invaluable information here. The greatest value of the authors’ model for course evaluation lies in how closely it aligns the evaluated course with such essential components as learning outcomes, assessment and curriculum. The authors explain how focusing on these components can lead to a better understanding of what is happening in the second language classroom, i.e. whether learning is taking place as originally planned and, if not, which elements might need to be modified to improve the learning outcomes.

In summary, Second Language Testing for Student Evaluation and Classroom Research deals with complex concepts, explained in an accessible, clear style and illustrated with many helpful examples. The authors’ commitment to the topic and their long-term hands-on experience of teaching linguistics and of developing both second language courses and teacher training courses shine through the pages of this book. The book can be recommended to any teacher and any student of linguistics interested in developing their understanding of language testing.



Reviewer:

Monika Sobejko

Teacher of English as a Second Language 

Language Centre

Jagiellonian University

Krakow

Poland

Email: monika.sobejko@uj.edu.pl