JLLT – Journal of Linguistics and Language Teaching
Edited by Thomas Tinnefeld
Volume 4 (2013), Issue 1
pp. 13–28


The Difficulty With Difficulty:
The Issue of Determining Task Difficulty in
Task-Based Language Assessment

Veronika Timpe (Dortmund, Germany)

Abstract (English)
Within the field of task-based language assessment, one issue in particular has raised a variety of questions: task difficulty. How difficult is a task? Are some tasks more difficult than others? If so, why? What makes one task more difficult than another? Questions such as these have puzzled (applied) linguists and test developers for a number of years, resulting in various research studies. In the form of a review article, this paper takes a closer look at ‘task difficulty’ as an issue that has been raised in nearly every publication on task-based assessment. While focusing on the most important publications in the field, the article discusses some general issues that have been raised in the debate on the controversial nature of ‘task difficulty.’ Moreover, two main approaches that have been taken to operationalize task difficulty in empirical studies will be outlined: a psycholinguistic and a sociolinguistic one. Finally, suggestions will be made for further research in order to unravel the mystery of task difficulty and understand the processes underlying TBLA.
Key words: task-based language assessment, task difficulty, (applied) linguistics, psycholinguistics, sociolinguistics

Abstract (German):
Within research on task-based language assessment (TBLA), one aspect in particular has repeatedly raised questions and puzzled researchers: the difficulty of tasks in foreign language assessment. How difficult is a task? Are some tasks more difficult than others? If so, why? What exactly makes one task more difficult than another? Questions such as these have occupied (applied) linguists and test developers for many years, resulting in a considerable number of studies and research projects. In the form of a review article, this paper systematically examines task difficulty as an issue problematized in every publication on the subject. First, the controversy surrounding task difficulty is analyzed. Two theoretical approaches are then presented, a psycholinguistic and a sociolinguistic one, from which the difficulty of tasks has been investigated empirically. Finally, against the background of the current state of research, suggestions are made as to how work in this area could be continued in order to solve the mystery of task difficulty in foreign language assessment and to further explore the processes underlying TBLA.
Keywords: foreign language assessment, task difficulty, (applied) linguistics, psycholinguistics, sociolinguistics



1. Introduction

Task-based language assessment (TBLA) as a form of educational assessment is not new. Throughout the past three decades, TBLA has become a well-established type of informal and formative evaluation in educational contexts, one that enables a meaningful integration of teaching, learning, and assessment (Wiggins 1998). What is relatively new, however, is the use of TBLA in high-stakes examinations or for purposes of certification and selection (Brindley 2009: 437). The German Department at Georgetown University in the United States and the Adult Migrant English Program (AMEP) in Australia are two examples where TBLA is now used in high-stakes contexts. At Georgetown, the German Department has replaced the former form-focused normative approach underlying the undergraduate program with a “content-oriented collegiate foreign language curriculum” that includes testing in the form of task-based assessment (Byrnes 2002: 419). Within the AMEP in Australia, speaking and writing competencies, for instance, are certified on the basis of listening and writing tasks, i.e. performance on these tasks decides whether or not a test taker is awarded the Certificates in Spoken and Written English (CSWE) (Brindley & Slatyer 2002: 370).

Although the increasing use of TBLA in high-stakes contexts marks progress for alternative assessment as a recognized form of testing alongside traditional discrete-point paper-and-pencil tests, the new role and significance that TBLA acquires in these contexts also raises important issues and concerns with new urgency. The reliability, validity, comparability, and generalizability of the scores achieved by test takers are important in order to draw valid inferences that can serve as a basis for further decisions. Lack of comparability, and thus inconsistency, is a serious threat to validity and reliability and can lead to invalid inferences. These are certainly issues all language tests have to face. However, given that TBLA involves a greater range of language skills than traditional forms of measurement, the assessment of what is referred to as “communicative competence in an L2” includes much more complex constellations of abilities than testing, for example, mere linguistic knowledge. Thus, Norris, Bygate & van den Branden (2009: 431f.) stress that “this complexity also introduces new sources of variability into the testing construct,” which in turn raises questions about reliability, practicality, and validity in TBLA.

To come to terms with what Brindley & Slatyer (2002: 369) call a “particularly thorny problem” that affects the entire network of interdependent aspects such as reliability, generalizability, inferences, and especially comparability, researchers have increasingly concentrated on one issue: task difficulty. How difficult is a task? What makes a task difficult? Are some tasks more difficult than others? If so, why? Are there degrees of difficulty discernible in different tasks? Is difficulty a feature inherent in tasks? Or is it a result of the interaction between a test taker and a certain task? Is it possible to predict difficulty for purposes of task selection? Questions such as these have been discussed and investigated by (applied) linguists and testing experts in a variety of studies that have tried to fathom the “difficulty with difficulty” (Bachman 2002: 462).

This paper takes a closer look at ‘task difficulty’ as an issue that has been raised in nearly every publication on task-based assessment. After a clarification of the basic concepts of ‘task’ and ‘task-based assessment,’ some general issues in the debate on the controversial nature of task difficulty will be discussed. Next, two main approaches that have been taken to operationalize task difficulty in empirical studies will be outlined: a psycholinguistic and a sociolinguistic one. In the context of the psycholinguistic approach, studies conducted by Brindley & Slatyer (2002), Elder, Iwashita & McNamara (2002) as well as Norris, Brown, Hudson & Bonk (2002) will be critically reviewed. It will be shown how their inconsistent results gave rise to the more recent sociolinguistic perspective on evaluating task difficulty. With regard to the sociolinguistic perspective, studies by Fulcher & Márquez Reiter (2003) and Taguchi (2007) will be focused upon. The paper concludes by presenting ideas for further research that might contribute towards an understanding of what constitutes ‘task difficulty.’ Throughout the paper, Bachman (2002) will be drawn upon, as will other supplementary readings that contribute to the main argument, which highlights the necessity of research on the interaction processes underlying TBLA.


2. Tasks and Task-Based Assessment

In the 1970s and 80s, research into how SLA occurs revealed the insufficiencies of traditional teacher-centered, form-focused language teaching (Norris 2002, van den Branden et al. 2009). In response, communicative competence was established as a major objective in foreign language (FL) teaching. Early SLA research and the communicative, more function-oriented shift in language pedagogy (Bachman 2002: 454) gave rise to the concept of ‘task,’ which has gradually become “an important element in syllabus design, classroom teaching and learner assessment” (Nunan 2004: 1).

Ever since the early 1980s, the term ‘task’ has been defined and used in a variety of ways by authors from different disciplines such as SLA, language pedagogy, and language testing and assessment, leaving researchers with a heavily contested field.[1] For the purposes of this paper, the definition introduced by Bachman & Palmer (1996) will be adopted. They understand ‘task’ as a performance or an “activity that involves individuals in using language for the purpose of achieving a particular goal or objective in a particular situation” (Bachman & Palmer 1996: 44). As a goal-oriented notion of task that focuses on the process and meaningful use of language situated in a certain context, the definition is general enough to include not only tasks for language teaching but also tasks for assessment purposes – the context dealt with in this paper (Bachman 2002: 458).

According to Bachman & Palmer’s (1996) definition, assessment based on tasks would generally require test takers to engage in situations in which they are performing tasks using language as a means to achieve a particular goal while the performance, including language use, would need to be rated on the basis of certain criteria. Hence, the following aspects should be constitutive elements in (language) assessment based on tasks: test takers, a situation, task performance, language, a purpose, a goal, and a form of criteria-based rating. Although these elements seem to be fundamental to task-based (language) assessment, they have been given different emphasis by different scholars, resulting in a variety of terminological differences, definitions, and approaches to task-based (language) assessment.

McNamara’s distinction between “strong and weak forms of second language performance assessment” can be used to group the different approaches according to their assessment focus (Wigglesworth 2008: 113; emphasis in the original). Norris, Brown, Hudson & Bonk’s (2002) approach, for instance, represents the strong form of task-based assessment. According to Norris et al. (2002: 395), task-based language assessment (TBLA) is a form of performance-based assessment that

takes the task itself as the fundamental unit of analysis motivating item selection, test instrument construction, and the rating of task performance. Task-based assessment does not simply utilize the real-world task as a means for eliciting particular components of the language system, which are then measured or evaluated; instead, the construct of interest is performance of the task itself.

Hence, Norris et al. (2002) – the so-called “Hawaii Group” (Bachman 2002: 464) – focus their assessment on the whole task performance. They include language as a means necessary to accomplish a task, but do not consider it the predominant feature of TBLA.

In contrast to this strong, holistic approach, a principal focus on language constitutes the core of weak task-based language assessment. The approach of Elder, Iwashita & McNamara (2002), whose main interest lies in the linguistic outcome of a task, constitutes an example of the weak form of task-based language assessment. According to Elder et al.’s definition, “task-based assessment requires the test-taker to engage in the performance of tasks which simulate the language demands of the real world situation with the aim of eliciting an ‘authentic’ sample of language from the candidate” (Elder et al. 2002: 347).

These two distinct perspectives on what is being evaluated in ‘task-centered assessment’ (TCA) (Brindley 2009) represent two key understandings of task-based (language) assessment (TB(L)A)[2]. On the one hand, there is the strong form of language performance assessment. Here, TBLA is regarded as a holistic performance assessment in which the language elicited is supposed to give an insight into how learners would perform in a comparable situation in the “real” world. On the other hand, as in Elder et al. (2002), tasks in task-based assessment are conceptualized as “vehicles for eliciting language samples” (Wigglesworth 2008: 112). Thus, weak language performance assessment concentrates “less on the task and more on the language produced” (Wigglesworth 2008: 113). In short, despite the relatively large number of terminological and, in part, also conceptual differences that can be found in the field, a rough distinction can be made on the basis of what constitutes the focus of assessment.

Despite the division into strong and weak forms of task-based language assessment, all approaches to TB(L)A share two underlying features. First, all scholars agree that task-based language assessment is built upon communicative language testing and thus constitutes a subcategory of integrated, direct performance assessment (Bachman 2002, Norris 2002, Norris et al. 2009, Brindley 2009). Elder, Iwashita & McNamara (2002: 347), for example, investigate what they call “task performance.” In a similar vein, Brindley & Slatyer (2002: 369) claim to present a “study focused on […] learners’ performance.” Fulcher & Márquez Reiter (2003: 323) substantiate the relationship between performance and task-based assessment by using the compound “performance tasks,” while Norris, Brown, Hudson & Bonk (2002: 395) already emphasize the performance-based character of TBLA in the title of their article when they speak of “task-based second language performance assessment.” Ultimately, Bachman (2002) underscores this aspect of task-based language assessment even further by introducing the acronym TBLPA – task-based language performance assessment – thus explicitly highlighting the performance aspect of task-based assessment.

The second issue inherent in virtually all discussions of task-based language assessment is “the underlying premise […] of inferences” (Bachman 2002: 454). As with any form of discrete-point testing or alternative assessment measure, task-based language assessment serves as a means of retrieving information about a test taker’s abilities. Therefore, Bachman (2002) points out that central to task-based language assessment – regardless of strong or weak types of language performance assessment – are “inferences we want to make […] about underlying ‘language ability’ or ‘capacity for language use’ or ‘ability for use’” (ibid.). Such inferences can be regarded as essential for decisions made in high-stakes contexts.

Given the large variety of components that influence inferences – either directly or indirectly – the question arises whether it is at all possible to make valid and comparable inferences about language abilities on the basis of TB(L)A. Some scholars have taken this critical stance as a point of departure for their research in order to throw some light on the “question of justifying inferences from test performance” (McNamara 1996: 17). Fulcher & Márquez Reiter (2003: 323), for instance, point out that variation in participants’ performance due to task characteristics or conditions leads to different scores, which in turn influence the inferences about abilities drawn from those scores. In a similar vein, Brindley & Slatyer (2002: 370) explain that especially in “outcomes-based systems, lack of comparability across tasks is a potentially serious threat to validity and reliability that could lead to invalid inferences concerning learners’ achievement of the targeted outcomes.” One aspect strongly debated and investigated in the recent literature is ‘task difficulty’ as a feature of tasks and task-based performance assessment that might assist in systematic task selection for assessment purposes. Furthermore, knowing how difficult a task is has clear implications for the way in which different performance levels are interpreted and test scores are used. Thus, task difficulty could serve as an initial reference point for a valid comparison of test takers’ language abilities (Bachman 2002: 468).


3. What Is ‘Task Difficulty’?

The question of what makes one task more or less difficult than another has occupied the minds of many scholars in the past decades. From a teaching perspective, for instance, Prabhu points out that difficulty is initially a matter of trial and error, that is, a “rough measure of reasonable challenge for us that at least half the class should be successful with a task” (Prabhu 1987: 277). Prabhu essentially claims here that difficulty is a vague and indefinite aspect of task-based assessment that cannot be taken into account systematically when planning a task. Instead, how difficult a task is going to be for a certain group of students can only be guessed by means of an intuitive estimate of the students’ abilities. Whether a task is in fact too difficult for 50 per cent of the student population can only be determined in retrospect, after the assessment. Since determining and understanding task difficulty could assist in the choice of suitable tasks for assessment purposes as well as contribute to an understanding of “inter-learner variability” (Ellis 2003: 221), scholars have been trying to answer the question of what constitutes task difficulty in a variety of empirical studies and investigations.

Assuming that the notion of task difficulty is inherent in the task itself, researchers have “[tried] to understand, explain, and predict task difficulty” (Bachman 2002: 462) by means of two general approaches. One approach, taken by scholars in the 1990s, was to “identify a number of task characteristics that are considered to be essentially independent of ability and then investigate the relationships between these characteristics and empirical indicators of difficulty” (Bachman 2002: 462). While this first approach is characterized by the belief that difficulty is an aspect exclusively inherent in the task per se and independent of the test taker, the second, more recent approach takes a slightly broader perspective which includes not only task characteristics but also the requirements a task places on the test taker. Accordingly, studies featuring the second approach have attempted to single out “difficulty features,” i.e. combinations of ability requirements and task characteristics that are responsible for the level of difficulty of a task (Bachman 2002: 463). Authors who have taken this second approach in their (empirical) studies include Elder, Iwashita & McNamara (2002), Brindley & Slatyer (2002), and Norris, Brown, Hudson & Bonk (2002).


3.1 The psycholinguistic approach to task difficulty

These three teams of scholars have operationalized task difficulty from a psycholinguistic perspective, drawing upon Skehan’s model, which likewise conceptualizes task difficulty as “the classification of tasks according to an integration of ability requirements and task characteristics” (Norris et al. 1998: 47). Skehan’s model (1996; 1998) consists of a “three-way distinction for the analysis of tasks” (Skehan 1998: 99). The three criteria that Skehan claims affect task difficulty are ‘code complexity,’ ‘cognitive complexity,’ and ‘communicative stress.’ Code complexity refers to the linguistic features of a task, such as linguistic complexity and variety, vocabulary load and variety, as well as redundancy and density. The category ‘cognitive complexity’ subsumes cognitive familiarity (e.g. how familiar a test taker is with a certain topic) and cognitive processing (i.e. information organization, amount of computation, type of information, etc.). The third criterion, ‘communicative stress,’ comprises time limits and pressure, speed of presentation, number of participants, length of text and type of response, as well as the opportunity to control the interaction (Skehan 1998: 99). All of these aspects, Skehan hypothesized, affect performance and ultimately the level of difficulty of a task.
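To make the structure of this three-way distinction more tangible, the following minimal sketch represents a task profile along Skehan’s three criteria. It is purely illustrative: the 0–1 ratings and the unweighted averaging are assumptions introduced here, since Skehan does not quantify the criteria or specify how they combine.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Illustrative profile of a task along Skehan's (1998) three criteria.
    The 0-1 ratings are hypothetical; Skehan's model does not quantify them."""
    code_complexity: float        # linguistic complexity/variety, vocabulary load, redundancy, density
    cognitive_complexity: float   # topic familiarity, information organization, amount of computation
    communicative_stress: float   # time pressure, speed of presentation, participants, response type

    def hypothesized_difficulty(self) -> float:
        # Assumption: an unweighted mean stands in for the unspecified way
        # in which the three criteria might jointly determine difficulty.
        return (self.code_complexity
                + self.cognitive_complexity
                + self.communicative_stress) / 3

# Example: a hypothetical narrative retelling task performed under time pressure
retelling = TaskProfile(code_complexity=0.4, cognitive_complexity=0.7, communicative_stress=0.8)
print(f"Hypothesized difficulty: {retelling.hypothesized_difficulty():.2f}")
```

As the studies reviewed below show, it is precisely this kind of a priori aggregation of task characteristics into a difficulty estimate that failed to predict empirical indicators of difficulty.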

In their respective studies, Elder et al. (2002), Brindley & Slatyer (2002), and Norris et al. (2002) set out to systematically manipulate particular features in Skehan’s model in order to shed light on difficulty as a “major determinant of task performance” (Bachman 2002: 465). In an attempt to answer the question of the extent to which task difficulty is systematically associated with task performance, Elder, Iwashita & McNamara (2002: 347) investigated the “impact of performance conditions on perceptions of task difficulty in a spoken language test,” using Skehan’s cognitive complexity framework as a basis. In their study, candidates had to tell stories under different conditions. Elder et al. (2002) evaluated task complexity in the different storytelling tasks by experimentally manipulating four dimensions in sequence: perspective, immediacy, adequacy, and planning time. Within these four dimensions, task difficulty was operationalized in a dichotomous manner, on the assumption that the more cognitively demanding a task, the more difficult it must be. For example, in the perspective dimension, students were asked to tell a story from their own perspective (less cognitively demanding) and from somebody else’s perspective (more cognitively demanding). However, these two conditions could not be confirmed as being more or less difficult. Overall, Elder et al. reported results which offered little support for Skehan’s model, as they found “no systematic relationship between task difficulty and hypothesized task complexity, on the one hand, and actual test performance, on the other” (Elder et al. 2002: 360). Thereupon, Elder et al. (2002: 362f.; my italics) concluded that “the different task exemplars within each dimension elicited different candidate reactions [which] suggests that there were unanticipated dimensions of complexity/difficulty embedded within each task.”
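The design logic of this dichotomous operationalization can be sketched as follows. Only the two levels of the perspective dimension are described above; the level labels for the other three dimensions are hypothetical placeholders, and the study manipulated the dimensions sequentially rather than as a fully crossed design.

```python
from itertools import product

# The four dimensions manipulated by Elder, Iwashita & McNamara (2002).
# Each pair is ordered (less demanding, more demanding); only the
# 'perspective' labels come from the study as reported above.
dimensions = {
    "perspective":   ("own perspective", "somebody else's perspective"),
    "immediacy":     ("less demanding variant", "more demanding variant"),   # placeholder labels
    "adequacy":      ("less demanding variant", "more demanding variant"),   # placeholder labels
    "planning_time": ("less demanding variant", "more demanding variant"),   # placeholder labels
}

# Crossing the dichotomies yields 2**4 = 16 hypothetical task conditions; the
# working assumption is that each 'more demanding' level adds to difficulty.
for levels in product(*dimensions.values()):
    demand = sum(dimensions[d].index(l) for d, l in zip(dimensions, levels))
    print(f"hypothesized demand = {demand}: {dict(zip(dimensions, levels))}")
```

It was exactly this monotonic mapping from cognitive demand to difficulty that the study’s results failed to confirm.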

Elder, Iwashita & McNamara’s (2002) conclusion points again to the task as the context that accommodates the entity referred to as “task difficulty.” They consider features inherent in the task itself, i.e. task characteristics as well as the requirements a task places on the test taker, to be solely responsible for the degree of difficulty. What is missing in their line of argument is the test taker as a variable in the assessment context that needs to be taken into account as well. As in teaching, a learner brings certain variables to an assessment situation. With regard to Elder et al.’s (2002) storytelling task, for instance, participants will most likely have dealt with the task in what Norris et al. call “idiosyncratic ways” (Norris et al. 2002: 414). Regardless of the task requirements, their performances could have differed in terms of, for example, creativity as well as strategic or (culturally specific) knowledge of how a story can be structured. Especially in the task that required a change of perspective, empathy, in addition to cognitive abilities, has probably played a major role as well. Hence, it might be beneficial to look beyond the task itself and take into account what a test taker brings along to the assessment situation, while also considering the interactive processes between task and test taker that underlie the performance.

A study by Brindley & Slatyer (2002) with a focus on listening skills reported results along the lines of Elder et al.’s research. Their exploratory study was aimed at determining “whether changes in task characteristics and task conditions in competency-based listening tasks would result in differences in test performance” (Brindley & Slatyer 2002: 390). They, too, reported having found no significant effects as a result of manipulating different variables in the listening tasks. Although their results were similar to those reported by Elder, Iwashita & McNamara (2002), Brindley & Slatyer (2002) drew two conclusions that were slightly broader in nature than the inferences made by Elder et al. (2002). First, Brindley & Slatyer (2002: 387) concluded that “there is a complex interaction between the different components of the task.” In order to capture the interaction between the task components for use in future research and assessment, they suggest the development of “models of listening test performance that incorporate a wide range of overlapping difficulty components” (Brindley & Slatyer 2002: 391). Although future test specifications would profit from a detailed model, the idea seems questionable for three reasons. First, even if there are models that represent the entirety of “communicative language ability” (cf. Bachman 1990: 87), a model that describes listening tasks would need to incorporate such a large variety of factors that it appears nearly impossible to outline in detail. Second, even if it were possible to compile and describe all factors, such a model would contradict Brindley & Slatyer’s (2002) earlier assumption that difficulty is inherent in interaction. Third, if the factors/components are, as they say, “overlapping,” how is a clear-cut separation possible that reveals the effects of the individual components on difficulty? The second aspect of Brindley & Slatyer’s (2002) conclusion might be more promising. In addition to the interaction within the task itself, they point out the importance of taking into account the complex network of task characteristics, requirements, and candidate responses (Brindley & Slatyer 2002: 390) – an issue that receives more attention in the conclusions drawn by the Hawaii group.

Norris, Brown, Hudson & Bonk (2002) also aimed at a better understanding of the relationship between task performance outcomes and particular cognitive processing factors, with the purpose of making tasks comparable on the basis of the cognitive processes involved (Norris et al. 2002: 399). They manipulated 13 complex, skills-integrative communication tasks with regard to their cognitive processing demands, assuming that a more complex task would be more difficult than a less complex one. A mixed group of 90 participants performed these tasks: 60 from the University of Hawaii and 30 from Kanda University of International Studies in Japan, with proficiency levels in English ranging from intermediate L2 learners to native speakers. Performances were evaluated by raters and test takers. Despite elaborate statistical analyses, Norris et al. (2002: 414), like the authors of the studies presented above, had to conclude that “the task-dependent criteria identified by expert informants and used to evaluate specific task performances had little to do with the cognitive factors ostensibly involved in performing the tasks.” However, unlike Elder, Iwashita & McNamara (2002) and Brindley & Slatyer (2002), Norris et al. (2002) inferred from the results that different types of knowledge might have impacted performance on a task and that participants “may have been responding to tasks in idiosyncratic ways” (Norris et al. 2002: 414).

This final inference marks a broadening of perspectives, in that Norris, Brown, Hudson & Bonk (2002) consider the test taker as a variable in the assessment process. However, it can be argued that their conclusion still does not reach far enough, especially when taking into account the choice of participants for the study. Norris et al.’s (2002) initial assumption seems to have been that the task itself constitutes the assessment context and that Skehan’s task difficulty features “can affect the difficulty of a given task (ostensibly for all learners, regardless of individual differences)” (Norris et al. 1998: 50; emphasis V.T.). Thus, they had explicitly claimed to aim at “consistency with assessment contexts” (Norris et al. 2002: 400) by operationalizing tasks “with substantial fidelity to target language use situations” (Norris et al. 2002: 404). Given the results of their study, Norris et al. (2002) eventually broadened their view in the sense that they acknowledge the assessment context to comprise more than merely the task itself; they have also come to view participants as a major constituent of the assessment context. However, to assume only that participants responded in “idiosyncratic ways,” which is most certainly the case, still falls short. Given that they recruited participants from universities in Japan and the United States, two countries with very different cultures, it might not only be idiosyncratic ways in which test takers’ responses differed; socio-cultural aspects, knowledge, strategies, norms, and behaviors may have played a role as well. Hence, not only will the interaction between the task and the participants need to be further elucidated, but learner variables including socio-cultural differences will need to be taken into account in future studies, as they might be constitutive aspects of ‘task difficulty.’

Considering all three exemplary studies together reveals a meta-process in their evaluation that has given rise to a new branch of research on ‘task difficulty.’ All of the studies drew upon Skehan’s model. The teams of researchers applied psycholinguistic factors from Skehan’s model, incorporated them into the task structure, and systematically manipulated them, assuming that they would produce differential demands which would ultimately lead to qualitative differences in learners’ performances. Their results have yielded one consistent outcome: there is “virtually no systematic relationship among a priori estimates of difficulty based on difficulty factors and empirical indicators of difficulty” (Bachman 2002: 463; emphasis in the original). This led Norris, Brown, Hudson & Bonk (2002) to the conclusion that task difficulty, and more precisely task performance, might be affected by participant variables and candidates’ interaction with a particular task. Hence, it can be assumed that task “difficulty does not reside in the task alone, but is relative to any given test-taker” (ibid.: 462).

Bachman (2002) strongly supports the idea that task difficulty arises in the interaction between the task and the test taker. He has criticized that “the results of this research [conducted from a psycholinguistic point of view] seem to have brought us no closer to an understanding of this relationship” between task characteristics and task difficulty (Bachman 2002: 463). Instead of asking ‘what makes a task difficult?,’ he stresses the need to ask ‘wherein lies difficulty?’ (Bachman 2002: 467). In line with this latter question, Bachman calls into question the notion of “task difficulty” as a hypothetical entity that may function as a “primary determinant of task performance” (Bachman 2002: 466). He points out that no model of task performance displays ‘difficulty’ as an individual factor among task characteristics. Difficulty, he argues, should rather be conceptualized as a factor that arises in the interactions between all components involved in assessment (cf. ibid.: 467). Thus, he emphasized in 2000, and once again throughout his 2002 article, that

[a]s soon as one considers what makes items difficult, one immediately realizes that difficulty isn’t a reasonable question at all. A given task or item is differentially difficult for different test takers and a given test taker will find different tasks differentially difficult. Ergo, difficulty is not a separate quality at all, but rather a function of the interaction between task characteristics and test taker characteristics. When we design a test, we can specify the task characteristics, and describe the characteristics of the test takers, but getting at the interaction is the rub. (posting to LTEST-L, 19 February 2000; Brindley & Slatyer 2002: 390)

Hence, interaction effects that account for the differential performance of individuals will need to be explored. It is the complex network of interactions between various aspects, including task characteristics, cognitive features, participants, learner variables, raters, scales, language, and performance, which needs to become the focus of attention in further research.
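Bachman’s point can be illustrated with a toy model. The sketch below is not a model proposed in the literature discussed here; it merely expresses, under invented parameter values, the idea that the probability of succeeding on a task depends on a person term, a task term, and a person-by-task interaction term, so that ‘difficulty’ has no existence apart from a particular pairing of task and test taker.

```python
import math

def p_success(ability: float, task_demand: float, interaction: float) -> float:
    """Toy logistic model: success probability as a function of the test
    taker, the task, and their interaction. All values are hypothetical."""
    return 1 / (1 + math.exp(-(ability - task_demand - interaction)))

# The same task (demand = 1.0) paired with two equally able test takers whose
# backgrounds interact differently with the task's characteristics, e.g. an
# unfamiliar vs. a familiar story schema (invented interaction values).
taker_a = p_success(ability=1.5, task_demand=1.0, interaction=0.8)
taker_b = p_success(ability=1.5, task_demand=1.0, interaction=-0.5)
print(f"Same ability, same task: P(success) = {taker_a:.2f} vs. {taker_b:.2f}")
```

In such a model, how ‘difficult’ the task is can only be stated relative to a test taker once the interaction term is known, which mirrors Bachman’s observation that “getting at the interaction is the rub.”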


3.2 The sociolinguistic approach to task difficulty

In order to shed light on the interaction between tasks and test-takers, researchers have only recently begun to “link the concept of task difficulty to the social demands of a task, as well as the nature of the L2 output affected by the demands” (Taguchi 2007: 116). Two of the few studies in this direction have been conducted by Fulcher & Márquez Reiter (2003) and Taguchi (2007). Both approach task difficulty from a sociolinguistic, or rather socio-pragmatic, perspective, taking into account the pragmatic dimension of the task as well as the socio-cultural backgrounds and variables of the test-takers. Both research teams explicitly state that they devised this approach because “replications of the psycholinguistic approach have shown the categories of the Skehan model to be insensitive in a language testing context” (Fulcher & Márquez Reiter 2003: 328). Hence, it can be argued that Fulcher & Márquez Reiter as well as Taguchi have taken up Bachman’s suggestion to focus research more on the social and interactional dimension of tasks. Their studies yield results that not only provide more insights into the interaction processes in TB(L)A, but also offer a point of departure for distinguishing among tasks in terms of difficulty.

In their study, Fulcher & Márquez Reiter (2003: 339) take a novel approach to the investigation of task difficulty in speaking tests, one which takes into consideration some of the performance and cultural issues that affect discourse and task success. They systematically manipulated pragmatic task features, such as directness and politeness, in order to see how Spanish- and English-speaking students would approach the tasks and whether their approaches yielded changes in discourse. To find out about potential culture-specific differences in the strategies applied in task performance, all test-takers had to produce speech acts in situations that involved interlocutors of different social standing, e.g. they had to borrow a book from a university professor or ask a neighbor to help move luggage (Fulcher & Márquez Reiter 2003: 331). In a second step, the students were shown their own interactions on video in order to reflect upon and elaborate on their own approaches. The results revealed key differences in discourse between English- and Spanish-speaking test-takers due to culturally specific conceptualizations of social power and imposition. For instance, while English-speaking students were reticent about asking a professor for the book because of the social distance between them and the professor, the Spanish students regarded a professor as a person who is supposed to help students, and so did not consider borrowing a book an unusual act. Accordingly, the Spanish students used fewer polite conditional constructions than the English speakers (Fulcher & Márquez Reiter 2003: 335). Hence, Fulcher & Márquez Reiter conclude that the strategies students adopt in dealing with a situation are culturally dependent, i.e. “task conditions of social power and imposition are significant, and the L1 background of the test-takers is also significant” (Fulcher & Márquez Reiter 2003: 334). They therefore suggest that social factors such as power, distance, and imposition “could serve as useful factors in creating task conditions and predict task difficulty, consequently affecting output in measurable ways” (Taguchi 2007: 116).
Nevertheless, on grounds of fairness and comparability, social power differences and imposition features should be constructed carefully in tasks intended for assessment purposes, especially in international high-stakes contexts. Moreover, if such tasks are to be used in a high-stakes assessment context, then the cultural baggage that participants import into the situation may need to be defined as part of the construct being tested, because a task might be perceived as more difficult given culturally determined norms of social distance and imposition.

While Fulcher & Márquez Reiter investigated the cultural influence of the L1 in task-based interactions, Taguchi’s (2007) study explored the type of social situation and L2 proficiency in relation to the appropriateness of L2 speech act production, planning speed, and speech rate (Taguchi 2007: 118). Twenty native speakers of English and 59 Japanese learners of English were asked to perform requests and refusals in role play tasks. The tasks were manipulated in terms of power difference, social distance, and degree of imposition, yielding two situation types. The first situation type was characterized by an equal power relationship, small distance, and a small degree of imposition (PDR-low). In the second situation type, “the listener had greater power, the distance was large, and the degree of imposition was also large (PDR-high)” (Taguchi 2007: 113). Taguchi’s study showed that “different types of pragmatic tasks […] created different demands on performance” (Taguchi 2007: 130). Since the PDR-low speech acts were produced more easily and quickly, they appeared to be easier for the L2 learners than the PDR-high situations, which turned out to be more difficult for the Japanese learners of English. Post hoc descriptive analyses of the linguistic expressions in the speech acts, based on coding frameworks from Blum-Kulka et al. (1998), Beebe et al. (1990), and Nelson et al. (2002), revealed different patterns in the choice of linguistic expressions and processing strategies. Hence, Taguchi concludes that sociolinguistic variables can be used as criteria for distinguishing among tasks in terms of difficulty (Taguchi 2007: 131).
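The contrast between Taguchi’s two situation types can be expressed compactly along the three manipulated variables. The encoding below is illustrative only; the boolean representation and field names are introduced here and are not taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SituationType:
    """Illustrative encoding of Taguchi's (2007) situation types along the
    three manipulated sociolinguistic variables."""
    label: str
    listener_has_greater_power: bool
    large_social_distance: bool
    large_imposition: bool

# PDR-low: equal power, small distance, small imposition.
PDR_LOW = SituationType("PDR-low", False, False, False)
# PDR-high: listener more powerful, large distance, large imposition.
PDR_HIGH = SituationType("PDR-high", True, True, True)

for situation in (PDR_LOW, PDR_HIGH):
    print(situation)
```

Because all three variables shift together between the two situation types, the design contrasts two bundles of sociolinguistic demands rather than isolating the effect of any single variable.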

Although both sociolinguistic studies conclude that they found degrees of difficulty due to differences in sociolinguistic demands, one issue still remains unresolved. Fulcher & Márquez Reiter as well as Taguchi investigated speaking tasks, i.e. productive skills, which allowed the linguistic output to be videotaped or recorded and then analyzed post hoc in order to reveal processing strategies (cf. Taguchi 2007: 124). Such an analysis is far more difficult in studies such as those conducted by Norris, Brown, Hudson & Bonk (2002) or Brindley & Slatyer (2002), which investigate listening skills that do not directly yield output that can be analyzed to disclose underlying processes. Nevertheless, scholars have stressed the necessity of gaining a more detailed understanding of the processing strategies underlying task performances that involve receptive skills in order to shed further light on task difficulty as perceived by test-takers. A variety of scholars, among them Bachman (1990; 2002) and Norris et al. (2002), have pointed out the role of test-takers as informants who might contribute to an understanding of participants’ responses, processes, and strategies. Although scholars have suggested reflective feedback as a means to elicit information on underlying processes and perceived task difficulty (e.g. Grotjahn 1986), the “erratic pattern of test-taker perceptions” found in Elder, Iwashita & McNamara’s (2002: 363) study suggests that one should not rely too heavily on feedback given after participants have performed a certain task.

Nevertheless, it can be argued that this introspective approach is a step in the right direction. Instead of trying to collect the data post hoc, researchers could attempt to canvass test-takers’ strategies and cognitive processes by means of concurrent introspection such as think-aloud protocols. Introspective analysis of test performance could yield important information on the cognitive processes and strategies involved and thus provide a valuable supplement to the analysis of item response patterns. Furthermore, think-aloud protocols could be useful in the validation of rating scales and criteria, in order to evaluate to what extent the processing strategies and assumed difficulty described in the scales are in fact deployed by examinees.


4. Conclusion

Given the increasing use of TB(L)A in high-stakes contexts, much attention has been paid to task difficulty as a feature that, once clarified, was expected to facilitate comparability between scores and performances in TB(L)A as well as to validate inferences. To foster an understanding of what constitutes task difficulty, scholars have investigated the issue from both a psycholinguistic and a sociolinguistic perspective. The psycholinguistic approach yielded no results that helped to identify task difficulty as a separate entity in the task itself. Instead, the inconsistent results it produced led researchers to conceptualize task difficulty as a multidimensional phenomenon that results from a series of interactions between participants and tasks.

From a sociolinguistic perspective, investigations of the interaction between the pragmatic dimensions of tasks and the participants’ socio-cultural backgrounds showed that test takers do in fact apply cognitive, culture-dependent processes and strategies in task performance. Although the scholars who conducted these studies claim that sociolinguistic variables can be manipulated in order to achieve different degrees of task difficulty for test-takers from particular cultural backgrounds, these results again underscore the high context-dependency of TB(L)A. Thus, a task needs to be designed carefully, taking the variables of the test-takers into account as well. Given this context- and learner-dependency, it currently seems impossible to establish a hierarchy of task difficulty or to state with reasonable certainty a priori how difficult a task will be for a particular group of learners.

However, instead of asking how task difficulty may impact inferences drawn from task performance, the approach taken by the sociolinguistic studies could be expanded further, and the question could be rephrased as follows: What can task performances tell us about the social and cognitive dimensions of L2 learning and the functional dimensions of L2 use? Collecting further data on learners’ responses, processes, and strategies by means of introspection such as think-aloud protocols might reveal valuable results that consolidate the position of TB(L)A as a reliable and valid form of assessment in high-stakes contexts.


References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.

Bachman, L. F. & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Brindley, G. (2009). Task-centered language assessment in language learning: The promise and the challenge. In K. van den Branden, M. Bygate & J. M. Norris (Eds.), Task-based language teaching: A reader (pp. 435–454). Amsterdam/Philadelphia: John Benjamins Publishing Company.

Brindley, G. & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–394.

Byrnes, H. (2002). The role of task and task-based assessment in a content-oriented collegiate foreign language curriculum. Language Testing, 19(4), 419–437.

Elder, C., Iwashita, N. & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368.

Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press

Fulcher, G. & Márquez Reiter, R. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321–344.

Grotjahn, R. (1986). Test validation and cognitive psychology: Some methodological considerations. Language Testing, 3, 159–185.

McNamara, T. (1996). Measuring second language performance. London: Longman.

Norris, J. M., Bygate, M. & van den Branden, K. (2009). Task-based language assessment. In K. van den Branden, M. Bygate, & J.M. Norris (Eds.), Task-based language teaching. A reader (pp. 431–434). Amsterdam/Philadelphia: John Benjamins Publishing Company.

Norris, J. M. (2009). Task-based teaching and testing. In M. Long and C. Doughty (Eds.), Handbook of language teaching (pp. 578-594). Cambridge: Blackwell.

Norris, J. M. (2002). Interpretations, intended uses and designs in task-based language assessment. Language Testing, 19(4), 337–346.

Norris, J. M., Brown, J. D., Hudson, T. D. & Bonk, W. (2002). Examinee abilities and task difficulty in task-based second language performance assessment. Language Testing, 19(4), 395–418.

Norris, J. M., Brown, J. D., Hudson, T. & Yoshioka, J. (1998). Designing second language performance assessments. Honolulu: University of Hawaii Press.

Nunan, D. (2004). Task-based language teaching. Cambridge: Cambridge University Press.

Prabhu, N. S. (1987). Second language pedagogy. Oxford: Oxford University Press.

Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.

Taguchi, N. (2007). Task difficulty in oral speech act production. Applied Linguistics, 28(1), 113–135.

van den Branden, K., Bygate, M. & Norris, J.M. (2009). Task-based language teaching: Introducing the reader. In K. van den Branden, M. Bygate, & J.M. Norris (Eds.), Task-based language teaching. A reader (pp. 1-13). Amsterdam/Philadelphia: John Benjamins Publishing Company.

Wiggins, G. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass.

Wigglesworth, G. (2008). Task and performance based assessment. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (2nd ed.), Volume 7: Language testing and assessment (pp. 111–122). Springer Science+Business Media.


Author:
Veronika Timpe
TU Dortmund
Institut für Anglistik und Amerikanistik
Emil-Figge-Str. 50, Raum 3.329
D-44227 Dortmund
Tel.: 0231-7558140
E-mail: veronika.timpe@udo.edu 



[1] For a rough overview of the variety of definitions of ‘task’ see Ellis (2003: 4f.).
[2] In the remainder of this paper, the term “task-based (language) assessment” or the acronym TB(L)A will be used – in contrast to Norris et al.’s acronym TBLA – to refer generally to task-based assessment, comprising both perspectives on language performance assessment (weak and strong).