Journal of Linguistics and Language Teaching (JLLT)
Edited by Thomas Tinnefeld
Volume 3 (2012) Issue 2


The Differential Effects of Comprehensive Corrective Feedback on L2 Writing Accuracy

K. James Hartshorn (Provo (Utah), USA) /
Norman W. Evans (Provo (Utah), USA)


Abstract
Although recent studies of focused written corrective feedback (WCF), targeting only one or a few error types, may provide valuable insights for building second language acquisition theory, a growing number of scholars have questioned the ecological validity of these studies for the second language (L2) classroom. While many researchers favor focused WCF to prevent overload for L2 writers, this study examines an alternative instructional strategy, which targets all errors simultaneously. Based on principles derived from skill acquisition theory, this strategy avoids overload by using shorter pieces of writing. Building on earlier research showing that this method improved overall accuracy, this study examines its effects on a variety of discrete linguistic categories. Analyses of pretest and posttest writing in a controlled, 15-week study suggest that the treatment positively influenced L2 writing accuracy in the mechanical, lexical, and some grammatical domains. Theoretical and pedagogical implications are addressed along with limitations.
Key words: written corrective feedback, L2 writing accuracy, second language acquisition


1   Introduction

Despite extensive debate over the efficacy of written corrective feedback (WCF) in second language (L2) writing pedagogy, a great deal of uncertainty remains. Numerous recent studies have advocated limiting WCF to one or a few linguistic features to prevent learner overload. Although these studies of focused feedback have been beneficial to researchers and practitioners, some scholars have questioned their ecological validity for classrooms where more comprehensive feedback may be desired (e.g. Bruton 2009, 2010, Storch 2010, van Beuningen 2010). Moreover, studies utilizing insights from skill acquisition theory have provided evidence that extensive WCF can be both practical and effective in improving accuracy (e.g. Evans et al. 2011, Hartshorn et al. 2010). Nevertheless, these studies have only examined overall accuracy. Therefore, the effects of this type of comprehensive WCF on a broader array of linguistic domains remain unclear. Thus, the intent of this study was to identify the effects of comprehensive WCF on a range of linguistic domains in an ecologically valid classroom context. To do this, we analyzed a corpus of student writing originally elicited to test the general effects of WCF based on principles derived from skill acquisition theory (Evans et al. 2011, Hartshorn et al. 2010).



2   Research Context

At the outset, this study needs to be contextualized. We begin by defining WCF as any feedback targeting grammatical, lexical, or mechanical errors in L2 writing. Further, we define error broadly as a linguistic choice that would “not be produced by the speaker’s native counterparts” given the same context and conditions (Lennon 1991: 182). We use the term linguistic accuracy to refer to the absence of these errors.

This study is situated at the intersection of the L2 writing and the second language acquisition (SLA) research paradigms described by scholars such as Ferris (2010), Manchón (2011), and Ortega (2011).  In the tradition of L2 writing, we want to help our students to become “more successful writers” (Ferris 2010: 188). Nevertheless, we also hope to “facilitate interlanguage development” and to “draw L2 learners’ attention to linguistic forms in their own output” (Sheen 2010a: 175). As scholars such as Ferris (2010) and Ortega (2011) have observed, we believe that both of these perspectives can work synergistically; we therefore see both the L2 writing and SLA paradigms as highly relevant to our students’ learning and to this study.

Nevertheless, our primary interest in this study focuses on interlanguage development as demonstrated through the analysis of new texts rather than text revision. This is because new texts show how or whether WCF has affected the accuracy of the learner’s production. For discussions about text revision studies versus studies using new texts, see authors such as Ferris (2010), Sachs and Polio (2007), Truscott (2010), and Truscott and Hsu (2008).

Though improving linguistic accuracy may not be a priority for all L2 writers, it is crucial for many and therefore deserves our careful attention. While we recognize the obvious impropriety of those from previous eras who attempted to reduce writing to a “reinforcement of grammar teaching” (Watson 1982: 6), we also see compelling evidence from L2 writing classrooms demonstrating marked improvements in rhetorical skills (e.g. effective content, organization, flow of ideas) without improved linguistic accuracy (e.g. Evans et al. 2011, Hartshorn et al. 2010, Hinkel 2004, and Storch 2009). Though accuracy is not more important than rhetorical dimensions of writing, it deserves our attention simply because it is the single greatest struggle that many L2 writers face. Thus, rather than merely assisting learners to produce more accurate writing, our aim is to identify ways to help them become more accurate writers. 


3   Review of Literature

3.1 Description of Terms

With this context in mind, we consider the literature most relevant to this study. Many researchers have examined the general efficacy of WCF. However, the conflicting results and divergent designs of early studies made comparing or synthesizing research findings difficult (e.g. Bitchener 2008, Ferris 2003, 2004, Guénette 2007, and Truscott 2007). Therefore, in an attempt to provide greater focus to WCF research, many scholars have investigated the benefits of specific types of errors or feedback methods. Because of their relevance to this study, we briefly consider three distinctions often used to analyze WCF:

(a) treatable and untreatable errors,
(b) direct and indirect feedback, and
(c) focused and unfocused feedback.


3.1.1 Treatable and Untreatable Errors

Ferris (1999) described treatable errors as those that could be prevented through the application of systematic rules governing the use of linguistic features such as articles, verb tense, verb form, subject-verb agreement, and plurals. On the other hand, it was believed that untreatable errors resulted from ignorance of idiosyncratic language rules that must be acquired over longer periods, such as word choice, word order, and certain sentence structures.

Several recent studies have been designed to target treatable errors. For example, many have examined the effects of various types of WCF on specific article functions. In such studies we observe a fairly consistent pattern in which those who receive the WCF use these article functions more accurately in the writing of new texts than those who do not (e.g. Bitchener 2008, Bitchener & Knoch 2008, 2010a, 2010b, Bitchener, Young & Cameron 2005, Ellis, Sheen, Murakami & Takashima 2008, Sheen 2007, 2010b, Sheen et al. 2009). Though some studies have produced mixed results (e.g. Lu 2010), most research has supported the treatable-untreatable distinction (e.g. Bitchener et al. 2005, Ferris 2006, Ferris & Roberts 2001).


3.1.2 Direct and Indirect Feedback

Another distinction in the literature is between different types of WCF. For example, scholars have differentiated between direct feedback, in which a correction is provided (e.g. written in the margin or between lines), and indirect feedback, where the location of an error is indicated but without any correction. Indirect feedback can be further classified as coded feedback, in which metalinguistic information in the form of symbols is used to indicate the error type, or uncoded feedback, where the error is identified through some form of marking such as underlining or circling (e.g. Ferris & Roberts 2001, Robb, Ross, & Shortreed 1986).

Research on the benefits of direct and indirect feedback seems inconclusive. Some studies have suggested that indirect feedback may be most beneficial, whether uncoded (e.g. Lu 2010) or coded (e.g. Erel & Bulut 2007, Lalande 1982). However, other studies found no differences between various methods of direct and indirect feedback (Robb, Ross, & Shortreed 1986, Semke 1984), and no differences between control groups and groups receiving various combinations of direct and indirect feedback options (Bitchener & Knoch 2009a, Ferris & Roberts 2001). Thus, further study is needed to clarify the effects of these types of feedback.


3.1.3 Focused and Unfocused Feedback

A final distinction emphasized here is between what has been called focused feedback (targeting only one or a few error types) and unfocused feedback (targeting many or all error types). Most current WCF researchers continue to advocate focused feedback over unfocused feedback (e.g. Bitchener 2008, Bitchener & Knoch 2009a, 2009b, Bitchener et al. 2005, Ellis et al. 2008, Ferris 2006, Sheen 2007, Sheen et al. 2009). The primary rationale for restricting feedback is the need to keep the “processing load manageable” to avoid the “risk of overloading the students’ attentional capacity” (Sheen 2009: 559). Such studies have provided valuable evidence of the potential benefits of WCF on one or a limited number of error types (e.g. Bitchener 2008, Bitchener & Knoch 2009a, 2009b, Ellis et al. 2008, Sheen 2007, Sheen et al. 2009).

We are aware of only a few studies specifically designed to test whether focused or unfocused feedback results in greater accuracy. The first, by Ellis et al. (2008), observed no statistical differences between the two types of feedback groups. However, Sheen et al. (2009) suggested that these results may have been affected by limitations in the research design, which they attempted to overcome. They examined four different learner groups. Two groups received direct WCF, including focused feedback for one group (targeting articles) and unfocused feedback for the other group (targeting articles along with the copula to be, the regular past tense, the irregular past tense, and locative prepositions). The remaining groups included one which completed only the writing (without feedback) and a control group (without writing or feedback). They found that while each group made gains in accuracy, the focused WCF group outperformed all others and that the accuracy of the unfocused feedback group was no greater than that of the control group. They concluded that unfocused feedback tends to be “confusing” and “inconsistent” and may “overburden” the learners (Sheen et al. 2009: 567).

Another study by Farrokhi & Sattarpour (2011) used direct feedback to examine the differential effects on accuracy of focused and unfocused WCF at high and low proficiency levels. English articles, the copula to be, regular and irregular past tense, third person possessive, and prepositions were targeted for the unfocused group. The groups receiving focused WCF outperformed both the unfocused and control groups for accuracy of English article use across both proficiency levels. Like Sheen et al. (2009), they concluded that focused WCF “is more effective…than correcting all of the existing errors” (Farrokhi & Sattarpour 2011: 1802).

Despite these conclusions and the valuable insights such studies may provide, they overlook a number of important considerations. One is the desire most learners have for comprehensive WCF. Anderson (2010), for instance, reported that 88% of his learners indicated a preference for comprehensive WCF while only 26% specified a preference for focused feedback (see also Ferris 2006, Leki 1991). Thus, before prematurely discarding comprehensive feedback, some scholars suggest we continue to study its viability (e.g. Ellis et al. 2008, van Beuningen 2010). Another consideration is a lack of systematicity in the feedback process. For example, Sheen et al. (2009) acknowledged that the unfocused feedback in their study was neither systematic nor comprehensive. Though this lack of systematicity may have been intended to reflect feedback commonly observed in practice as “arbitrary” and “inconsistent” (Sheen et al. 2009: 566), such limitations make it difficult, if not impossible, to weigh the actual benefits of focused feedback against a systematic approach to comprehensive feedback. What has yet to be studied is a feedback method that is comprehensive, systematic, and appropriate to learners’ cognitive capacity. The need for such study is especially acute in light of scholars who have called for researchers to investigate a broader range of linguistic features using additional types of feedback (e.g. Bitchener 2009, Ellis 2008).


3.2  Expanding the WCF Research Agenda

Though carefully focused WCF may be essential for certain types of research or theory building, it may be less ideal for the practical realities of some classroom contexts. Not only could highly restricted feedback inadvertently promote avoidance strategies (e.g. Truscott 2004, Xu 2009), but it could also divert learner attention away from a broader view of accuracy (van Beuningen 2010), possibly hindering language development in other linguistic domains.  Although most recent research continues to extol the virtues of focused WCF, some scholars have recognized problems associated with “strict limits on the number of errors” being treated and “narrowly defined error categories” (Ferris 2010: 192). Others have expressed concern over pervasive attempts to generalize about “the efficacy of WCF” when most of the available research is based on “a limited range of structures” (Storch 2010: 41).

Thus, while some researchers have considered “comprehensive [corrective feedback]” as the “most authentic feedback methodology” (van Beuningen 2010: 20), there are sound reasons for including it in SLA research agendas as well. Though we acknowledge reasonable grounds for isolating one linguistic feature at a time, there also is a compelling rationale for examining the effects of a particular treatment on various linguistic domains at the same time within the same learning condition. This is because it is only in such a research design that we can truly observe the differential effects of a specific treatment within a given context.      

Despite the important place in our research which must be reserved for focused feedback, its limitations justify scholars in exploring alternative strategies for using feedback that targets multiple error types without overloading the learner. For example, van Beuningen (2010: 19) declared, “the learning potential of comprehensive [corrective feedback] deserves more attention.” Similarly, Ellis et al. (2008: 367) asserted, “the question of the extent to which [corrective feedback] needs to be focused in order to be effective remains an important one,” and concluded, “if [corrective feedback] is effective when it addresses a number of different errors, it would be advantageous to adopt this approach.”


3.3    Comprehensive Feedback

3.3.1 Dilemmas

Though there are sound reasons for examining comprehensive feedback for research and pedagogy, a few dilemmas need to be reconciled for it to be a reasonable alternative. First, comprehensive WCF is often unmanageable for the teacher as well as the learner. In addition to risking teacher fatigue, students cannot benefit from more feedback than they are capable of processing. Second, not all pedagogical practices involving WCF may be designed well enough to ensure improved accuracy. Consider, for example, the limited efficacy of teacher feedback if the learner must wait for several weeks before receiving feedback or if the learner does not have adequate opportunities to process and practice utilizing the feedback. Such limitations in the use of WCF are inconsistent with the relevant notions from skill acquisition theory we now consider (e.g. DeKeyser 2001, 2007).


3.3.2 Skill Acquisition Theory

Early contributors to skill acquisition theory include Hulstijn & Hulstijn (1983), who distinguished explicit knowledge (verbalizable) from implicit knowledge (non-verbalizable); Anderson (1983), who differentiated between declarative knowledge (what one knows) and procedural knowledge (what one can do); and McLaughlin, Rossman & McLeod (1983), who described controlled processing (placing a heavy strain on learner attention) versus automatic processing (requiring little or no attention). VanPatten & Benati (2010: 33) have described skill acquisition as “a general theory” which claims that “adults [learn] through largely explicit processes” and that, with ongoing “practice and exposure,” these become “implicit processes.” In other words, skill acquisition theory seeks to explain or predict learner progress through the declarative, procedural, and automatic stages of skill use. Thus, the culminating point of skill acquisition is automaticity, which has been described as “the absence of attention control in the execution of a cognitive activity” (Segalowitz & Hulstijn 2005: 371). Moreover, DeKeyser (2001, 2007) has observed that skill acquisition theory predicts that errors will decrease as a function of practice when supported by abundant examples, explicit rule-based instruction, and frequent application.

Taken together, these insights from skill acquisition theory suggest that, in order to facilitate progress toward automaticity, instruction, practice, and feedback should be meaningful, timely, and constant. These learning activities are meaningful when instruction is explicit, when students understand the practice task and its purpose, and when they understand the feedback they receive and what they are to do with it. Instruction, practice, and feedback are timely when instruction addresses the most germane concerns from the learner’s recent writing, practice immediately follows instruction, and feedback is provided promptly after practice. This process is constant when teachers and learners continually engage in this cycle of teaching and feedback-based learning over an extended period.

With these theoretical ideals in mind, instruction, practice, and feedback need to be manageable if these learning activities are to be meaningful, timely, and constant—especially if the teacher intends to provide comprehensive WCF. One solution well suited to these principles is to shorten the length of the writing task. While dramatic limits on the length or volume of texts could undermine the practice and evaluation of rhetorical aspects of writing, we reason that such constraints have little if any adverse effect on identifying patterns of linguistic error production. On the other hand, time limits can preserve manageability for both the teacher and the student and make it possible for instruction, practice, and feedback to continue to be meaningful, timely, and constant.


3.3.3 Dynamic WCF

Given the limitations of focused WCF and the call for examining the effects of addressing multiple errors simultaneously, we have utilized what we have termed dynamic WCF, which is simply our way of operationalizing the principles associated with skill acquisition theory. Thus, dynamic WCF is an instructional strategy designed to help L2 learners improve the accuracy of their writing by ensuring that instruction, practice, and feedback are manageable, meaningful, timely, and constant. Note that our application of dynamic WCF targets only linguistic accuracy—not rhetorical dimensions of writing. It is expected that students will also need practice and feedback on longer pieces of writing across various genres if they are to continue to develop their rhetorical skills.

The effects of dynamic WCF on overall accuracy were analyzed in earlier research. Though differences between a treatment group and a contrast group in an intensive English program (IEP) were not statistically significant for measures of rhetorical competence, fluency, or complexity, results showed a significant difference for accuracy as measured by error-free T-unit[1] ratios (Hartshorn et al. 2010). In additional research conducted with matriculated university students, accuracy was measured by error-free clause ratios (Evans et al. 2011). While the findings from both studies suggest a clear benefit to dynamic WCF on overall accuracy, they do not help us understand whether there was greater improvement for some linguistic domains than for others. This question has important implications for helping us understand the specific effects of those principles which underlie skill acquisition theory. Thus, additional research was needed.

Since utilizing skill acquisition theory to test the efficacy of extensive feedback is a novel line of inquiry, we believed that it was essential to first identify its global effects on a broad array of linguistic domains before examining its effect on highly specific linguistic features. Thus, we organized the most commonly observed errors into error types and error families. The three families included grammatical errors, lexical errors, and mechanical errors. The grammatical error family included sentence structure errors, determiner errors (e.g. articles, possessive nouns and pronouns, numbers, indefinite pronouns, and demonstrative pronouns), verb errors (e.g. subject-verb agreement, verb tense, and other verb form problems), numeric shift errors (e.g. count-non-count, singular-plural), and semantic errors (e.g. awkwardness, insertion / omission, unclear meaning, and word order). The lexical error family included word choice errors, word form errors, and preposition errors. The mechanical error family included errors in capitalization, indentation, non-sentence level punctuation, and spelling. With these error families in mind, we need to emphasize that the intent of this study was to compare the effects of comprehensive WCF with those of a traditional approach to process writing to determine whether the comprehensive feedback would increase the accuracy of L2 writing.
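For reference, this taxonomy can be summarized as a simple mapping. The sketch below uses our own paraphrased labels for the error types, not the coding symbols used in the study:

```python
# Error families and their constituent error types, as described above.
# The labels are paraphrases of the text, not the study's coding symbols.
ERROR_FAMILIES = {
    "grammatical": [
        "sentence structure",
        "determiner",     # articles, possessives, numbers, etc.
        "verb",           # subject-verb agreement, tense, form
        "numeric shift",  # count/non-count, singular/plural
        "semantic",       # awkwardness, insertion/omission, word order
    ],
    "lexical": ["word choice", "word form", "preposition"],
    "mechanical": ["capitalization", "indentation",
                   "non-sentence-level punctuation", "spelling"],
}
```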


3.4 Research Question

We now articulate the seven parts of our research question: Compared to a traditional process writing class, to what extent will the treatment produce greater accuracy for each of the following linguistic domains:

(a) sentence structure accuracy,
(b) determiner accuracy,
(c) verb accuracy,
(d) numeric agreement accuracy,
(e) semantic accuracy,
(f) lexical accuracy, and
(g) mechanical accuracy?



4   Method

This section describes the participants in this study along with the procedures and data analyses which were utilized.


4.1    Participants

4.1.1 Learners

There were 19 students in the contrast group—4 males and 15 females—with a mean age of 25. The contrast group L1s included Spanish (6), Korean (3), Mandarin (3), Portuguese (3), French (1), Mongolian (1), Romanian (1), and Russian (1). The treatment group included 28 students—16 males and 12 females—with a mean age of 24. The treatment group L1s included Spanish (19), Korean (6), Japanese (2), and French (1). Although this study used intact classes, a preliminary comparison of accuracy prior to the treatment, based on error-free T-units of student writing, revealed no statistically significant difference between the groups, t(45) = 0.58, p = 0.56, suggesting their comparability.


4.1.2  Teachers

Five teachers participated in this study. Each held a relevant graduate degree and was considered to be an effective teacher by his or her students and peers. Since some students who were admitted into university programs during the semester did not participate in the posttest, the number of students with complete data sets taught by each teacher was unequal. The first two teachers each taught 10 students in the treatment group. The third and fourth teachers taught 5 and 6 students, respectively, in the contrast group. The fifth teacher taught 8 students in the treatment group and 8 students in the contrast group. After undergoing a period of training, two of these instructors also blindly scored the essays on the various aspects of accuracy outlined above.


4.2  Procedures

Procedures in this study fall into two categories: the daily activities that differentiated the treatment group from the contrast group, and the procedures used for data elicitation.


4.2.1 Treatment and Contrast

Since the principles underlying dynamic WCF might be operationalized in a variety of ways, we briefly describe how we have applied them in the current study. First, to ensure manageability, writing tasks were completed within a 10-minute time limit. Writing prompts were general and were usually minimized to one or two words. Topics included social issues, science, history, and popular culture. While the primary audience was the teacher, learners were free to shape the writing task within the context of a given topic (e.g. “The Economy,” “Friendship,” or “Global Warming”). Genres varied but were largely descriptive, expository, narrative, or persuasive in nature.

In order to ensure that feedback was timely and constant, writing was completed three to four times per week. Teachers marked the short compositions with coded symbols based on the most commonly observed error types (see Appendix A). Teachers also gave each student a holistic score for each piece of writing based on its linguistic accuracy and content. Writing was returned to the student by the next class period. To ensure that tasks and feedback were meaningful, students were taught about the purpose of the course and the writing tasks along with the codes at the beginning of the semester. Since the literature showed no obvious benefit for direct versus indirect feedback (whether coded or uncoded), we used coded symbols to facilitate counts of error type by frequency. These frequency counts functioned as an indication of performance levels over time. Rather than use a predetermined syllabus, teachers used this ongoing flow of information to determine or adjust classroom instruction in a dynamic manner attuned to the changing needs of the learners.

Students also used a number of tools designed to facilitate linguistic awareness. These included error tally sheets (a list of error frequency counts from each piece of writing), error lists (a complete inventory of all errors produced along with the surrounding text), and edit logs (an ongoing record of the number of times the work was resubmitted before it was deemed free of errors). Students edited successive drafts of their writing until all the errors were corrected. If particular errors were not addressed in a subsequent draft, the errors were marked and the writing was returned to the student. This process was repeated until the piece of writing was deemed error-free or until one week transpired from the time the teacher provided the first WCF for a given piece of writing. Thus, students were engaged in editing multiple drafts at the same time. Nevertheless, since the pieces of writing were short, tasks remained manageable. Three or four times during the semester, learners also wrote longer compositions with the expectation that they would appropriately attend to all rhetorical requirements of the writing task.

We now compare the treatment group with the contrast group. Students in both groups were enrolled in the same IEP, participating in a 15-week semester. A battery of in-house language tests placed them into the institution’s highest proficiency level. Monday through Thursday, they attended four 65-minute classes. Thus, both groups participated in 13 classroom hours per week designed to strengthen reading, listening, and speaking skills, and 6 hours per week of related homework activities; 4 hours and 20 minutes per week were devoted to a traditional writing class for the contrast group and a class emphasizing dynamic WCF for the treatment group. Both groups participated in approximately two additional hours of homework each week relating to these experimental courses. Careful reviews of the curriculum led us to believe that nothing outside of the treatment was likely to create a differential effect for writing accuracy.

Students in the treatment group received dynamic WCF as described above. Observation of classes and feedback suggested consistent patterns among the teachers involved in this study.

Students in the contrast group were taught with a traditional approach to process writing. They wrote four five-page papers during the course, each of which required three or four separate drafts for which the teacher provided thorough feedback on content and organization. Students were expected to include effective introductions, thesis statements, topic sentences, and conclusions. Papers were designed to demonstrate their ability to defend an opinion, synthesize, argue, hypothesize, and propose. While most of the class time was devoted to helping students understand and develop relevant rhetorical skills within the context of a particular genre, a limited amount of feedback also targeted linguistic accuracy. During the experimental period, learners from both the treatment and contrast groups wrote three or four 30-minute compositions similar to those used to elicit the pretest and posttest data.


4.2.2 Data Elicitation

Although most of the treatment dealt with short compositions, the intent was to compare the accuracy of new pieces of writing which were more substantial. Thus, we determined to analyze 30-minute essays. This format seemed appropriate because of its pervasiveness in standardized language examinations (Ferris 2009) as well as its prevalence in university courses. Moreover, we believed our window into language development would be more valid if we utilized authentic, timed tests, in which student focus would gravitate toward content rather than language (e.g. Long 2007). In a secure testing environment, students in both groups typed responses to the same pretest and posttest prompts (see Appendix B). Pretest data were obtained at the conclusion of one semester and posttest data were obtained at the conclusion of the following semester. The only word processing tools available to students during these tests were cutting, copying, and pasting text. The software automatically ended the test once the allotted time elapsed.


4.3 Data Analysis

We now describe the various measures of accuracy along with the scoring guidelines used in this study. 


4.3.1 Measures of L2 Writing Accuracy

In order to measure accuracy, many recent studies have used obligatory occasion analysis (e.g. Bitchener et al. 2005, Bitchener 2008, Bitchener & Knoch 2009a, 2009b, 2010, Ellis et al. 2008), which involves identifying all obligatory occasions of a particular linguistic feature in a text and then calculating the ratio of correctly supplied features over the total number of obligatory occasions (Ellis & Barkhuizen 2005).

Although obligatory occasion analysis seems well suited for some contexts, its limitations make it inappropriate for comprehensive feedback. For example, it may not be possible to identify all of the obligatory occasions for every linguistic feature; nor is it appropriate for writing samples that include no obligatory occasions for a particular linguistic feature. Furthermore, the method has no way of accounting for lexical errors. Due to these limitations, we recognized the need for an alternative measure of accuracy.

Subsequently, we adapted an approach from Wolfe-Quintero, Inagaki & Kim (1998), who recommended the ratio of errors to the total number of T-units because of its high correlation with written language development. Traditionally, this ratio has provided one overall measure of error production; in this study, however, we incorporated two small innovations. First, rather than computing one overall measure, we examined performance levels separately for each of the seven error types within the three error families.

Second, rather than focusing on inaccuracy, the formula (1 − Errors / T-units) was used in this study to express the accuracy for each linguistic domain. For example, if a student produced six determiner errors within 30 total T-units, the accuracy score for determiners would be (1 − 6/30), or .80. Thus, this text-centric analysis of accuracy is represented as a proportion of the text’s overall communicative potential for each linguistic domain. Though this method does not distinguish between the accurate use of these linguistic features and their absence, all errors are targeted simultaneously. Thus, this method provides a fairly comprehensive view of accuracy since it produces one measure for each linguistic domain. Addressing all errors simultaneously also helps soften concerns over learner performance strategies such as avoidance (e.g. Truscott 2004, Xu 2009), which could be more problematic in studies which target a single error type.
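The accuracy computation just described can be sketched as follows. This is a minimal illustration of the (1 − Errors / T-units) formula; the domain names match the study's error types, but the error counts other than the worked determiner example are hypothetical.

```python
def accuracy_score(error_count, t_units):
    """Accuracy for one linguistic domain: 1 - (errors / T-units)."""
    if t_units <= 0:
        raise ValueError("an essay must contain at least one T-unit")
    return 1 - error_count / t_units

# Worked example from the text: 6 determiner errors in 30 T-units -> .80
determiner_accuracy = accuracy_score(6, 30)

# One score per linguistic domain (hypothetical counts for one essay)
errors = {"mechanical": 3, "lexical": 5, "determiner": 6,
          "verb": 4, "numeric": 1, "sentence": 2, "semantic": 7}
scores = {domain: accuracy_score(n, 30) for domain, n in errors.items()}
```

Note that, unlike obligatory occasion analysis, nothing here requires identifying occasions where a feature *should* have appeared; only the error count and the T-unit count are needed, which is what makes the measure workable for comprehensive feedback.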


4.3.2 Scoring guidelines

In addition to understanding how accuracy was measured, we also need to explain how the writing was scored. Most errors were assigned a value of one, with no attempt to weight egregiousness. For example, a subject-verb agreement error or a missing determiner was counted as one error each. However, since errors in the semantic error group affected meaning to varying degrees, an attempt was made to account for this variability. One error was counted for every word that was inserted inappropriately or omitted where obligatory. One error was counted for every word-order error in which a single shift (whether of one word or a group of words) could correct the error. In addition, awkwardness was defined as a production error that was obviously distracting even though the meaning of the construction seemed clear to the scorer. Such productions were also counted as one error each.

Perhaps the most complex error type in the semantic error group was unclear meaning. Since scorers had no way of verifying learner intent, a text-based method of analysis was devised to quantify the proportion of the text which was incomprehensible. To qualify as having clear meaning, a particular word needed to exhibit semantic clarity with the words preceding and following it. For example, consider the following construction: “After working all day, the work come bed TV sleep early.” The breakdown in this construction begins with the word “work.” Though “work” is preceded acceptably by the article “the,” the verb “come” that follows violates syntactic expectations. Therefore, the error counting begins with the word “work” and continues through the word “sleep” for a total unclear-meaning error value of five. The word “early” is not counted as an error because its preceding word, “sleep,” concatenates acceptably with it.
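The counting procedure above can be sketched as a simple pass over adjacent word pairs. In the study the pairwise clarity judgments came from human scorers; the `pair_is_clear` predicate below is a stand-in for that judgment, and the set of unacceptable pairs is taken from the example in the text.

```python
def unclear_meaning_errors(words, pair_is_clear):
    """Count words involved in at least one unacceptable adjacent-word pair."""
    bad = set()
    for i in range(len(words) - 1):
        if not pair_is_clear(words[i], words[i + 1]):
            # Both members of an unacceptable pair are implicated.
            bad.add(i)
            bad.add(i + 1)
    return len(bad)

# Example from the text: "After working all day, the work come bed TV sleep early."
words = ["After", "working", "all", "day", "the",
         "work", "come", "bed", "TV", "sleep", "early"]
unacceptable = {("work", "come"), ("come", "bed"),
                ("bed", "TV"), ("TV", "sleep")}
clear = lambda a, b: (a, b) not in unacceptable

count = unclear_meaning_errors(words, clear)  # 5: work, come, bed, TV, sleep
```

“Early” escapes the count because its only flagged neighbor is on the preceding side of an acceptable pair (“sleep early”), exactly as the text describes.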

In addition to these grammar-based errors, the lexical error family included inaccurate word choices, word forms, and prepositions. Each of these inaccuracies was considered as one error. However, only those errors which were correctly spelled were eligible for these categories. Otherwise, such words were considered as spelling errors. In addition to spelling, each problem with punctuation or capitalization was counted as a mechanical error.


4.3.3 Reliability Estimates

Two scorers (S1 and S2) jointly determined the number of T-units in each essay. S1 scored all of the compositions (94) for each of the seven error types, and S2 scored just over half of the essays (48). Those scored by S2 were drawn from stratified random samples of six essay categories based on pretests and posttests within three levels of student proficiency (low, middle, high). Pearson correlation coefficients included mechanical accuracy, r=.98 (p<.001), determiner accuracy, r=.94 (p<.001), semantic accuracy, r=.92 (p<.001), verb accuracy, r=.90 (p<.001), sentence structure accuracy, r=.87 (p<.001), numeric agreement accuracy, r=.83 (p<.001), and lexical accuracy, r=.81 (p<.001). While these results exceeded the 70% consistency with a 30% overlap recommended by Stemler & Tsai (2008), coefficients for some linguistic domains were smaller than expected due to occasional discrepancies in error categorizations. Nevertheless, we believed that these correlations provided enough evidence of reliability to warrant the statistical analyses needed to address our research question.
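For readers unfamiliar with the computation, interrater consistency of the kind reported above can be estimated with a plain Pearson correlation over the two scorers' essay-level accuracy scores. The scorer data below are hypothetical, used only to show the calculation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two scorers' essay-level accuracy scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical accuracy scores from S1 and S2 on the same five essays
s1 = [0.80, 0.93, 0.70, 0.88, 0.95]
s2 = [0.78, 0.95, 0.72, 0.85, 0.96]
r = pearson_r(s1, s2)  # close to 1 for highly consistent scorers
```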


4.3.4  Statistical Analysis

With a significance level of .05, we computed mixed-model, repeated-measures analyses of variance (ANOVA). We also measured effect sizes using partial eta squared (ηp²). Cohen (1988) provided guidelines for interpreting these effect sizes, noting that .01 was small, .06 was moderate, and .14 or greater was large. Moreover, Ferguson (2009: 533) reported .04 as the “recommended minimum effect size representing a ‘practically’ significant effect,” or the point at which a factor may have utility in practice regardless of statistical significance.
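The two interpretive benchmarks just cited can be combined into a single classification rule, sketched below; the function name is ours, but the thresholds are exactly those of Cohen (1988) and Ferguson (2009).

```python
def interpret_partial_eta_squared(eta2p):
    """Classify a partial eta squared value against Cohen's (1988) benchmarks
    and flag Ferguson's (2009) .04 minimum for practical significance."""
    if eta2p >= 0.14:
        size = "large"
    elif eta2p >= 0.06:
        size = "moderate"
    elif eta2p >= 0.01:
        size = "small"
    else:
        size = "negligible"
    practical = eta2p >= 0.04  # Ferguson's recommended minimum
    return size, practical

interpret_partial_eta_squared(0.24)  # ("large", True)
interpret_partial_eta_squared(0.02)  # ("small", False)
```

Note that the two criteria can disagree: an effect of .05, for instance, is "small" on Cohen's scale yet clears Ferguson's practical-significance threshold, which is why the study reports both.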


5   Results

The research question dealt with whether any of the seven mean accuracy scores from the posttest essays would be significantly greater for the treatment group when compared to the contrast group. Descriptive statistics for the mixed-model ANOVA are displayed in Table 1. Results were mixed in that some tests suggested a meaningful difference while others did not. We will limit our presentation of results to the group-by-time interactions, each of which is examined in order of increasing effect size.



                            Pretest Scores      Posttest Scores
Accuracy     Group          M        SD         M        SD

Sentence     Contrast       .961     .046       .965     .035
             Treatment      .958     .052       .979     .031

Numeric      Contrast       .953     .051       .963     .047
             Treatment      .936     .069       .933     .082

Determiner   Contrast       .866     .151       .794     .171
             Treatment      .797     .161       .848     .153

Lexical      Contrast       .812     .100       .768     .115
             Treatment      .717     .168       .793     .152

Verb         Contrast       .743     .148       .693     .254
             Treatment      .726     .186       .794     .196

Semantic     Contrast       .688     .201       .708     .139
             Treatment      .649     .253       .811     .138

Mechanical   Contrast       .150     .400      -.030     .565
             Treatment     -.145     .690       .140     .606

Tab. 1: Descriptive Statistics by Accuracy Type and Group

We begin with the numeric agreement accuracy scores, which included the appropriate use of count and non-count nouns as well as the accurate production of singular and plural constructions. Mean differences were not significant, F(1,45)=.30, p=.59, and the effect size was negligible, ηp²<.01. Similarly, mean differences in the degree to which student writing utilized complete sentences were not significant, as shown by the mean sentence structure accuracy scores, F(1,45)=2.31, p=.31. This analysis also produced an effect size too small to consider meaningful (ηp²=.02). Like the previous analyses, the test examining verb accuracy scores, which included subject-verb agreement and verb tense, was not statistically significant, F(1,45)=3.33, p=.08. Nevertheless, this test produced a moderate effect size (ηp²=.07), suggesting a practical benefit to the treatment group.

The next three tests were statistically significant and produced effect sizes at the border of moderate to large favoring the treatment group. These included determiner accuracy, F(1,45)=6.11, p=.017 (ηp²=.12); semantic accuracy, F(1,45)=6.46, p=.015 (ηp²=.13); and lexical accuracy, F(1,45)=6.48, p=.014 (ηp²=.13). The final analysis was of mechanical accuracy, which included spelling, capitalization, and non-sentence-level punctuation. This test produced a statistically significant interaction effect for group by time, F(1,45)=14.26, p<.001, and resulted in a fairly large effect size favoring the treatment group (ηp²=.24).


6   Discussion

6.1 Analysis of the Findings

One salient finding from this study is that the treatment influenced some linguistic domains more than others. Though the precise reasons for these differential effects are unclear, we note that at least some aspects from each of the three error families were affected positively by the treatment (i.e. effect sizes showed that the treatment benefited mechanical, lexical, and some forms of grammatical accuracy), despite no observable benefits for numeric agreement and sentence structure accuracy.

This lack of improvement for numeric agreement and sentence structure could have been the result of a number of possible factors. For example, while skill acquisition theory (as operationalized here) benefited learners for most linguistic categories, it may be inadequate for effecting greater accuracy for these linguistic domains. Another explanation is the possibility of a ceiling effect. For example, since pretest means for both groups ranged from .93 to .96 (see Table 1), there was very little room for improvement, which might account for these negligible differences.

In contrast, the treatment had the greatest positive impact on the mechanical accuracy family, largely made up of errors in spelling, punctuation, and capitalization. This result is consistent with claims made by Truscott (2007: 258), who suggested that such errors “are among the most correctable.” Nevertheless, viewing grammatical structures as more complex, Truscott (2007: 258) has also maintained that “correction may have value for some non-grammatical errors but not for errors in grammar”.

Thus, the most compelling results may be those associated with the grammatical error family. Although differences between the treatment and contrast groups were not statistically significant for verb accuracy scores, the analysis produced a moderate effect size, suggesting a practical benefit for the treatment group. While these benefits can only be seen as marginal, they seem consistent with focused feedback studies, showing the benefits of WCF on improved accuracy for both past tense (Bitchener et al. 2005, Sheen, 2007, Sheen et al. 2009) and present tense (Lu 2010).

In addition, statistically significant improvements were observed for determiner accuracy as well as semantic accuracy. The effect size for determiner use was at the border between moderate and large. These findings seem consistent with focused feedback studies, which examined the benefits of WCF on article use (e.g. Bitchener et al. 2005, Ellis et al. 2008, Ferris & Roberts 2001, and Sheen 2007).

The final category from the grammatical family examined in this study was semantic accuracy, which produced an effect size at the threshold of moderate to large. Semantic accuracy, as measured here, encompassed the application of a complex body of knowledge, including appropriate word order and collocations which help writers to avoid language that is awkward, unclear, or simply unintelligible. Such findings seem inconsistent with claims from Truscott (2007) that WCF is not beneficial for improving grammatical accuracy.

Moreover, when coupled with the apparent benefits of the treatment on lexical accuracy, these findings seem to suggest that even linguistic features that have been considered “untreatable” may benefit from WCF. For example, while Ferris (2010: 192) observed that most recent studies have been limited to linguistic features that are “relatively easy . . . to define, describe, and teach,” she also mentioned the more troublesome errors that “obscure meaning and interfere with communication” including problems with “word order, sentence boundaries, phrase construction, word choice, or collocations” (Ferris 2010: 193).
Remarkably, these are the very kinds of errors learners must overcome in order to write with the semantic and lexical accuracy examined in this study. Though more research targeting the effects of WCF on semantic and lexical accuracy is needed, such findings should be encouraging for scholars such as Xu, who suggested,

If writing teachers are to feel more assured of error correction practice, we need evidence from some linguistic features which are not so teachable but prove to be treatable. (Xu 2009: 275)

At a time when many researchers continue to advocate focused WCF to prevent learner overload, the findings of this study suggest that focused WCF may not be the only way to ensure manageability. Although additional research and replication are needed in this line of inquiry, this study has demonstrated that the principles associated with skill acquisition theory produced positive results for an array of linguistic domains. However, the fact that improvements were not observed equally for all linguistic domains could suggest limitations in skill acquisition theory or how it was operationalized in this study.


6.2  Limitations and Suggestions for Further Research

While this study provided evidence of the benefits of dynamic WCF on broad linguistic domains, future research should examine specific linguistic features with greater precision. Another suggestion for future research relates to how errors are defined and how accuracy is measured. Though the internal consistency for error identification was 98% between the scorers, the difficulty of categorizing specific error types varied. Better training and protocols may resolve these challenges. For example, rather than simply categorizing a particular lexical item as accurate or inaccurate, it may be more effective to determine the extent to which a specific lexical choice may be perceived as acceptable or unacceptable. Moreover, since there may have been a ceiling effect for some variables, it may be helpful to include less proficient learners in future research. The treatment may only facilitate greater accuracy for certain linguistic domains at a particular range of proficiency. Finally, since this study only examined the effects of the treatment during a 15-week semester, future research should also analyze longitudinal data.


6.3  Pedagogical Implications

Despite their limitations, the findings of this study may have some practical implications for pedagogy. First, results have shown that a systematic application of the principles behind skill acquisition theory may have a positive effect on the accuracy of L2 writing for both non-grammatical and grammatical errors without undermining rhetorical competence. Second, our results underscore the assertion that focused WCF may not be the only appropriate form of feedback for every learning context: practitioners should be encouraged to explore what may be best for their specific learners. 

Nevertheless, potential benefits would need to be carefully considered along with practical constraints for each unique context. For example, some classes may not be able to meet frequently, potentially undermining the expectation for writing tasks and feedback to be timely and constant. Though technological solutions may effectively remedy these challenges in some settings, there may be no simple solutions in others. Nevertheless, if scholars can continue to clarify which factors have the greatest impact on accuracy, practitioners in contexts in which linguistic accuracy is among the priorities will be empowered to make informed decisions about how best to help their students to improve the accuracy of their writing. 


7   Conclusion

Although further research is needed to help us to fully understand the most effective ways to utilize WCF to improve L2 writing accuracy, the findings of this study should be encouraging for practitioners who are seeking insights to guide teaching and learning in their classrooms. We hope that more scholars and practitioners will take interest in this line of inquiry aimed at addressing overall linguistic accuracy in traditional classroom settings. It seems that there may be more than one appropriate way to guide L2 writers along the path of language development. Learners will benefit from researchers and practitioners who utilize pedagogical practices that help students maximize their opportunities to write more accurately within their unique learning contexts.



Appendices


Appendix A


Coded Symbols Used to Mark Student Writing


Appendix B


Writing Tasks Used for Data Elicitation

Students were given 30 minutes to address the following writing tasks:

Pretest writing task administered before Week 1:

Do you agree or disagree with the following statement?  Only people who earn a lot of money are successful.  Use specific reasons and examples to support your answer.

Posttest writing task administered in Week 15:

In your opinion, what is the most important characteristic (for example, honesty, intelligence, a sense of humor) that a person can have to be successful in life? Use specific reasons and examples from your experience to explain your answer.




References

Anderson, T. (2010). The effects of tiered corrective feedback on second language academic writing. Unpublished thesis, University of British Columbia, Vancouver, Canada.

Bitchener, J. (2008). Evidence in support of written corrective feedback. Journal of Second Language Writing, 17, 102–118.

Bitchener, J. (2009). Measuring the effectiveness of written corrective feedback: A response to “Overgeneralization from a narrow focus: A response to Bitchener (2008).” Journal of Second Language Writing, 18(4), 276–279.

Bitchener, J., & Knoch, U. (2008). The value of written corrective feedback for migrant and international students. Language Teaching Research, 12, 409–431.

Bitchener, J., & Knoch, U. (2009a). The relative effectiveness of different types of direct written corrective feedback. System, 37, 322–329.

Bitchener, J., & Knoch, U. (2009b). The value of a focused approach to written corrective feedback. Language Teaching Research, 12, 409–431.

Bitchener, J., & Knoch, U. (2010a). The contribution of written corrective feedback to language development: A ten month investigation. Applied Linguistics, 31(2), 193–214.

Bitchener, J., & Knoch, U. (2010b). Raising the linguistic accuracy level of advanced L2 writers with written corrective feedback. Journal of Second Language Writing, 19, 207–217.

Bitchener, J., Young, S., & Cameron, D. (2005). The effect of different types of corrective feedback on ESL student writing. Journal of Second Language Writing, 14, 191–205.

Bruton, A. (2009). Improving accuracy is not the only reason for writing, and even if it were…. System, 37, 600–613.

Bruton, A. (2010). Another reply to Truscott on error correction: improved situated designs over statistics. System, 38, 491–498.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.

DeKeyser, R. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition and second language instruction (pp. 125–151). Cambridge: Cambridge University Press.

DeKeyser, R. (2007). Skill acquisition theory. In B. Van Patten & J. Williams (Eds.), Theories in Second Language Acquisition (pp. 97–113). Mahwah, NJ: Erlbaum.

Ellis, R., & Barkhuizen, G. (2005). Analyzing learner language. Oxford: Oxford University Press.

Ellis, R., Sheen, Y., Murakami, M., & Takashima, H. (2008). The effects of focused and unfocused written corrective feedback in an English as a foreign language context. System, 36, 353–371.

Erel, S., & Bulut, D. (2007). Error treatment in L2 writing: A comparative study of direct and indirect coded feedback in Turkish EFL context. Journal of Institute of Social Sciences, Erciyes University, 23, 397–415.

Evans, N. et al. (2011). The efficacy of dynamic written corrective feedback for university-matriculated ESL learners. System, 39, 229-239.

Evans, N., Hartshorn, K. J., Allen, E. (2010). Written corrective feedback: Practitioner perspectives. International Journal of English Studies, 10, 47-77.

Evans, N., Hartshorn, K. J., McCollum, R. M., & Wolfersberger, M. (2010). Contextualizing corrective feedback in L2 writing pedagogy. Language Teaching Research14, 445-463.

Farrokhi, F. & Sattarpour, S. (2011). The effects of focused and unfocused written corrective feedback on grammatical accuracy of Iranian EFL learners. Theory and Practice in Language Studies, 1, 1797-1803.

Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40, 532–538.

Ferris, D. (1999). The case for grammar correction in L2 writing classes: A response to Truscott (1996). Journal of Second Language Writing, 8, 1–11.

Ferris, D. (2003). Response to student writing: Implications for second language students. Mahwah, NJ: Erlbaum.

Ferris, D. (2004). The “grammar correction” debate in L2 writing: Where are we, and where do we go from here? (and what do we do in the meantime . . . ?) Journal of Second Language Writing 13, 49–62.

Ferris, D. R. (2006). Does error feedback help student writers? New evidence on the short- and long-term effects of written error correction. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 81–104). Cambridge: Cambridge University Press.

Ferris, D. R. (2009). Teaching college writing to diverse student populations. Ann Arbor:  University of Michigan Press.

Ferris, D. R. (2010). Second language writing research and written corrective feedback in SLA:  Intersections and practical applications.  Studies in Second Language Acquisition, 32, 181–201.

Ferris, D. R., & Roberts, B. (2001). Error feedback in L2 writing classes: How explicit does it need to be? Journal of Second Language Writing, 10, 161–184.

Guénette, D. (2007). Is feedback pedagogically correct? Research design issues in studies of feedback on writing. Journal of Second Language Writing, 16, 40–53.

Hartshorn, K. J. et al. (2010). Effects of dynamic corrective feedback on ESL writing accuracy. TESOL Quarterly, 44, 84-109.

Hinkel, E. (2004). Teaching academic ESL writing: Practical techniques in vocabulary and grammar. Mahwah, NJ: Lawrence Erlbaum.

Hunt, K. W. (1965). Grammatical structures written at three grade levels. Urbana, IL: The National Council of Teachers of English.

Lalande, J. (1982). Reducing composition errors: An experiment. The Modern Language Journal, 66, 140–149.

Leki, I. (1991). The preferences of ESL students for error correction in college-level writing classes. Foreign Language Annals, 24, 203–218.

Lennon, P. (1991). Error: Some problems of definition. Applied Linguistics, 12, 180-196.

Long, M. H. (2007). Problems in SLA. Mahwah, NJ: Lawrence Erlbaum Associates.

Lu, Y. (2010). The value of direct and indirect written corrective feedback for intermediate ESL students. Unpublished master's thesis, Auckland University of Technology, Auckland, New Zealand.

Manchón, R. M. (2011). Situating the learning-to-write and writing-to-learn dimensions of L2 writing. In R. M. Manchón (Ed.), Learning-to-Write and Writing-to-Learn in an Additional Language (pp. 3–14). Philadelphia, PA: John Benjamins.

McLaughlin, B., Rossman, T., & McLeod, B. (1983). Second language learning: an information-processing perspective. Language Learning, 36, 109-123.

Ortega, L. (2011). Reflections on the learning-to-write and writing-to-learn dimensions of second language writing. In R. M. Manchón (Ed.), Learning-to-Write and Writing-to-Learn in an Additional Language (pp. 237–250). Philadelphia, PA: John Benjamins.

Robb, T., Ross, S., & Shortreed, I. (1986). Salience of feedback on error and its effect on EFL writing quality. TESOL Quarterly, 20, 83–91.

Sachs, R., & Polio, C. (2007). Learners’ uses of two types of written feedback on a L2 writing revision task. Studies in Second Language Acquisition, 29, 67–100.

Segalowitz, N., & Hulstijn, J. (2005). Automaticity in bilingualism and second language learning. In J. F. Kroll & A. M. B. De Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 371–388). Oxford, UK: Oxford University Press.

Semke, H. (1984). The effect of the red pen. Foreign Language Annals, 17, 195–202.

Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inferences. New York: Houghton Mifflin.

Sheen, Y. (2007). The effect of focused written corrective feedback and language aptitude on ESL learners’ acquisition of articles. TESOL Quarterly, 41, 255–283.

Sheen, Y. (2010a). The role of corrective feedback in second language acquisition. Studies in Second Language Acquisition, 32, 169–179.

Sheen, Y. (2010b). Differential effects of oral and written corrective feedback in the ESL classroom. Studies in Second Language Acquisition, 32, 203–234.

Sheen, Y., Wright, D., & Moldawa, A. (2009). Differential effects of focused and unfocused written correction on the accurate use of grammatical forms by adult ESL learners. System, 37, 556–569.

Stemler, S. E., & Tsai, J. (2008). Best practices in estimating interrater reliability. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 29–49). Thousand Oaks, CA: Sage Publications.

Sternberg, R. J., & Sternberg, K. (2010). The psychologist’s companion (5th ed.). New York: Cambridge University Press.

Storch, N. (2009). The impact of studying in a second language (L2) medium university on the development of L2 writing. Journal of Second Language Writing, 18, 103–118.

Storch, N. (2010). Critical feedback on written corrective feedback research. International Journal of English Studies, 10, 29–46.

Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46, 327–369.

Truscott, J. (1999). The case for “the case for grammar correction in L2 writing classes”: A response to Ferris. Journal of Second Language Writing, 8, 111–122.

Truscott, J. (2004). Evidence and conjecture on the effects of correction: A response to Chandler. Journal of Second Language Writing, 13, 337–343.

Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately. Journal of Second Language Writing, 16, 255–272.

Truscott, J. (2010). Further thoughts on Anthony Bruton’s critique of the correction debate. System, 38, 626–633. 

van Beuningen, C. G. (2010). Corrective feedback in L2 writing: Theoretical perspectives, empirical insights, and future directions. International Journal of English Studies, 10, 1–27.

Watson, C. B. (1982). The use and abuse of models in the ESL writing class. TESOL Quarterly, 16, 5–14.

Wolfe-Quintero, K., Inagaki, S., & Kim, H. (1998). Second language development in writing: Measures of fluency, accuracy, and complexity. Manoa, HI: University of Hawaii at Manoa.

Xu, C. (2009). Overgeneralization from a narrow focus: A response to Ellis et al. (2008) and Bitchener (2008). Journal of Second Language Writing, 18, 270–275.





Authors:

K. James Hartshorn, PhD
Associate Coordinator
Brigham Young University
English Language Center
162 UPC
Provo, Utah  84602 USA
Tel. (801) 422-4034
Fax. (801) 422-0804
E-mail: james_hartshorn@byu.edu

Norman W. Evans, PhD
Associate Professor
Brigham Young University 
Linguistics & English Language
4050 JFSB
Provo, Utah 84602 USA




[1] The error-free T-unit ratio is calculated as the number of error-free T-units over the total number of T-units (Wolfe-Quintero, Inagaki, & Kim, 1998).