Journal of Linguistics and Language Teaching (JLLT)
Edited by Thomas Tinnefeld
Volume 8 (2017) Issue 1

 
The Development and Validation of a Teacher Assessment Literacy Scale: A Trial Report

Kay Cheng Soh & Limei Zhang (both Singapore)

Abstract
Teachers share a similar responsibility with healthcare professionals in the need to interpret assessment results. Interest in teacher assessment literacy has a short history but has gained momentum in recent years, and there are not many instruments for measuring this important professional capability. The present study presents the results of a trial of the Teacher Assessment Literacy Scale, which covers four essential aspects of educational measurement. Both classical and Rasch analyses were conducted and yielded encouraging psychometric qualities.
Key words: Assessment, Assessment literacy, Measurement, Testing

What most of today's educators know about education assessment would fit comfortably inside a kindergartner's half-filled milk carton. This is astonishing in light of the fact that during the last 25 years, educator competence has been increasingly determined by student performance on various large-scale examinations… A profession's adequacy is being judged on the basis of tools that the profession's members don't understand. This situation is analogous to asking doctors and nurses to do their jobs without knowing how to interpret patient charts…
(Popham, 2006: para. 1; emphases added)

1   Introduction

There is no better way to emphasize the importance of assessment literacy than to quote Popham (2006). About a decade ago, Popham (2006) drew this very apt analogy between the educational and healthcare professions, in both of which the proper use of test information is crucial. Not only do teachers need assessment literacy, but so does everyone else with an interest in education, including school leaders, policy-makers, and parents.

In the past, patients were passive recipients of medical treatments. Present-day patients are involved in the healing process; they are informed and they are engaged. Analogously, in the past, student assessment tools were crafted by test specialists while teachers were passive users. This is true at least in the American context, where standardized tests are a regular fixture of the school. Nowadays, with the emphasis on assessment for learning (or formative assessment) in contrast with assessment of learning (or summative assessment), teachers, in America and elsewhere, are expected to use assessment in a more engaged manner to help students learn. Teachers are therefore expected to use test information not only for assessment of learning but also, perhaps more importantly, for assessment for learning. This shift all the more underlines the importance of teachers' assessment literacy if they are to carry out this crucial aspect of their job with professionalism.

With the shift in emphasis toward formative assessment and its contribution to learning (Fulcher 2012), the notion of assessment literacy has changed accordingly. Traditionally, assessment emphasized objectivity and accuracy (Spolsky 1978, 1995), owing to the influence of the psychometric and positivistic paradigms, and testing activities were normally carried out at the end of learning periods (Gipps 1994, Wolf, Bixby, Glenn & Gardner 1991). In that context, only measurement specialists, not frontline teachers, were expected to have specialized knowledge of test development, score interpretation, and theoretical concepts of measurement. In contrast, assessment is now perceived as an integral part of teaching and learning that provides timely information as feedback to guide further instruction and learning. This development requires teachers to design assessment, make use of test results to promote teaching and learning, and be aware of inherent technical problems as well as the limitations of educational measurement (Fulcher 2012, Malone 2008). Thus, it is important that teachers have sufficient practical skills as well as theoretical understanding.



2 Assessment Literacy Measures 
 
2.1 The Importance of Assessment Literacy 
 
Over the years, efforts have been made to measure teacher assessment literacy. Gotch & French (2014) systematically reviewed teacher assessment literacy measures within the context of contemporary teacher evaluation policy. The authors collected objective tests of assessment knowledge, teacher self-reports, and rubrics to evaluate teachers’ work in assessment literacy studies from 1991 to 2012. They then evaluated the psychometric work from these measures against a set of claims related to score interpretation and use. Across the 36 measures reviewed, they found weak support for these claims. This highlights the need for increased work on assessment literacy measures in the educational measurement field.
 
Later, DeLuca, laPointe-McEwan & Luhange (2016) emphasized that assessment literacy is a core professional requirement across educational systems and that measuring and supporting teachers’ assessment literacy have been a primary focus over the past two decades. At present, according to the authors, there is a multitude of assessment standards across the world and numerous assessment literacy measures representing different conceptions of assessment literacy. The authors then analyzed assessment literacy standards from five English-speaking countries (Australia, Canada, New Zealand, the UK, and the USA) and Europe to understand shifts in the standards developed after 1990. Through a thematic analysis of 15 assessment standards and an examination of eight assessment literacy measures, they noticed shifts in standards over time, though the majority of the measures continue to be based on early conceptions of assessment literacy.
 
Stiggins (1991) first coined the term assessment literacy to refer to teachers’ understanding of the differences between sound and unsound assessment procedures and the use of assessment outcomes. Teachers who are assessment literate should have a clear understanding of the purposes and targets of assessment, the competence to choose appropriate assessment procedures, and the capability to conduct assessment effectively and to avoid pitfalls in assessment practice and in the interpretation of results.
 
This sounds simple but can be a tall order in actuality. For example, the by now classic textbook of educational measurement by Linn & Miller (2005) has altogether 19 chapters in three parts. The five chapters in Part I cover such topics as the role of assessment, the instructional goals of assessment, the concepts of reliability and validity, and issues and trends. These may not be of immediate relevance to classroom teachers’ work but provide the necessary conceptual background for teachers to be informed assessors. Part II has 10 chapters of a technical or procedural nature, which equip teachers with the necessary practical skills in test design using a wide range of item formats. The concluding Part III has four chapters, dealing with selecting and using published tests as well as the interpretation of scores involving basic statistical concepts. The three parts, which make up the essential domains of assessment literacy expected of classroom teachers, are typical of many educational measurement texts that support teacher training programs.
 
According to Popham (2009), increasing numbers of professional development programs have dealt with assessment literacy for teachers and administrators. Popham then asked whether assessment literacy is merely a fashionable focus or whether it should be regarded as a significant area of professional development for years to come. He first divided educators’ measurement-related concerns into either classroom assessments or accountability assessments, and then argued that educators’ inadequate knowledge about either of these can cripple the quality of education. He concluded that assessment literacy is a condicio sine qua non for today’s competent educator and must be a pivotal content area for current and future staff development, and he set forth 13 topics for the design of assessment literacy programs. He proposed a two-pronged approach to solve the problem: until pre-service teacher education programs begin producing assessment-literate teachers, professional developers must continue to rectify this omission in educators’ professional capabilities. In short, Popham sees assessment literacy as a commodity needed by teachers for their own long-term well-being, and for the educational well-being of their students. The topics proposed are as follows:
  1. The fundamental function of educational assessment, namely, the collection of evidence from which inferences can be made about students’ skills, knowledge, and affect;
  2. Reliability of educational assessments, especially the three forms in which consistency evidence is reported for groups of test-takers (stability, alternate-form, and internal consistency) and how to gauge consistency of assessment for individual test-takers;
  3. The prominent role three types of validity evidence should play in the building of arguments to support the accuracy of test-based interpretations about students, namely, content-related, criterion-related, and construct-related evidence;
  4. How to identify and eliminate assessment bias that offends or unfairly penalizes test-takers because of personal characteristics such as race, gender, or socioeconomic status;
  5. Construction and improvement of selected-response and constructed-response test items;
  6. Scoring of students’ responses to constructed-response tests items, especially the distinctive contribution made by well-formed rubrics;
  7. Development and scoring of performance assessments, portfolio assessments, exhibitions, peer assessments, and self-assessments;
  8. Designing and implementing formative assessment procedures consonant with both research evidence and experience-based insights regarding the probable success of such procedures;
  9. How to collect and interpret evidence of students’ attitudes, interests, and values;
  10. Interpreting students’ performances on large-scale, standardized achievement and aptitude assessments;
  11. Assessing English Language learners and students with disabilities;
  12. How to appropriately (and not inappropriately) prepare students for high-stakes tests;
  13. How to determine the appropriateness of an accountability test for use in evaluating the quality of instruction.
The list seems overwhelming and demanding. However, it shows that teachers’ assessment literacy is a complex area of professional development which should be taken seriously if a high standard of professionalism is to be maintained. Nonetheless, Popham has a simple expectation:
When I refer to assessment literacy, I'm not thinking about a large collection of abstruse notions unrelated to the day-to-day decisions that educators face. On the contrary, assessment-literate educators need to understand a relatively small number of commonsense measurement fundamentals, not a stack of psychometric exotica. (Popham 2006: 5)
While Popham’s expectation is more realistic and palatable to busy classroom teachers, there is still a need for the specifics. For these, Plake & Impara (1993) developed the Teacher Assessment Literacy Questionnaire, which was later adapted for the Classroom Assessment Literacy Inventory (Mertler & Campbell 2005). It comprises 35 items measuring teachers’ general concepts about testing and assessment and teachers’ background information. These items were organized under seven scenarios featuring teachers who were facing various assessment-related decisions. The instrument was first trialed on 152 pre-service teachers and obtained an overall reliability of KR20=0.75, with an average item difficulty of F=0.64 and an average item discrimination of D=0.32. Using another sample of teachers, Mertler and Campbell (2005) found reliability of KR20=0.74, an average item difficulty of F=0.68, and an average item discrimination of D=0.31. In short, this instrument shows acceptable reliability, at least for research purposes, and is of moderate difficulty but low discrimination.

In the adaptation of the Classroom Assessment Literacy Inventory, Mertler (2005) followed The Standards for Teacher Competence in the Educational Assessment of Students (AFT, NCME, & NEA 1990). Seven such standards were included, resulting in items measuring the following seven aspects of assessment literacy:
  1. Choosing Appropriate Assessment Methods
  2. Developing Appropriate Assessment Methods
  3. Administering, Scoring, and Interpreting the Results of Assessments
  4. Using Assessment Results to Make Decisions
  5. Developing Valid Grading Procedures
  6. Communicating Assessment Results
  7. Recognizing Unethical or Illegal Practices
It stands to reason that these seven aspects are intimately relevant to teachers’ day-to-day assessment responsibilities and that it is reasonable to expect all teachers to be equipped with the attendant understanding and skills.
  
Later, Fulcher (2012), in England, developed a measure of assessment literacy with 23 closed-ended items. The items cover teachers’ knowledge of test design and development, large-scale standardized testing, classroom testing and its washback, as well as validity and reliability. In addition, it includes constructed-response items eliciting teachers’ feedback on their experience in assessment, and background information. Fulcher’s study involved 278 teachers, 85% of whom held a Master’s degree and 69% of whom were female. Analysis of the quantitative data yielded a Cronbach’s α=0.93 (which is rather high), and a factor analysis with Varimax rotation returned four orthogonal factors accounting for 52.3% of the variance. Although this is rather low, the four factors are interpretable: (1) Test Design and Development (17.9%), (2) Large-scale Standardized Testing (16.5%), (3) Classroom Testing and Washback (12.0%), and (4) Validity and Reliability (6.5%). The four factor scales have respectable Cronbach’s coefficients of α=0.89, α=0.86, α=0.79, and α=0.94, respectively.

2.2 Relevance to the Present Study

The review of the pertinent literature on assessment literacy and its measurement has implications for the present study.
 
First, in recent years, the Singapore Ministry of Education has launched initiatives emphasizing higher-order thinking skills and deep understanding in teaching, such as Teach Less, Learn More (TLLM) and Thinking Schools, Learning Nation (TSLN). Consequently, school teachers are required to make changes to their assessment practice and to equip themselves with sufficient assessment literacy. In spite of the importance of assessment literacy, few studies have been conducted to examine Singapore teachers’ assessment knowledge and skills. Among these few studies, Koh (2011) investigated the effects of professional development on Primary 4 and 5 teachers of English, Science and Mathematics. She found that ongoing professional development in assessment literacy was especially effective in improving teachers’ assessment literacy, compared with workshops that trained teachers to design assessment rubrics. The findings suggest that to develop teachers’ assessment literacy successfully, the training needs to be broad enough in the topics covered and has to extend over a reasonable period of time.
 
In a more recent study, Zhang & Chin (under review) examined the learning needs in language assessment among 103 primary school Chinese Language teachers, using an assessment literacy survey developed by Fulcher (2012). The results provide an understanding of teachers’ interest and knowledge in test design and development, large-scale testing, classroom testing, and test validity and reliability. With a very limited number of studies in the Singapore context, there is obviously a need for more studies to be carried out for a better understanding of Singapore school teachers’ assessment literacy. For carrying out such studies, it is necessary to develop an assessment literacy scale which is broad enough yet concise, so as to measure teachers’ assessment competence properly and accurately.
 
Secondly, in systems like that of the USA where standardized tests are designed by test specialists through a long and arduous process of test development, applying sophisticated psychometric concepts and principles (with regular intermittent revisions), it is reasonable to assume that the resultant assessment tools made available for teachers are of a high psychometric quality. In this case, the most critical aspect of assessment literacy that teachers need is the ability to properly interpret the results they obtain through the tests. Measurement knowledge beyond this level is good to have but not really needed. However, in a system like that of Singapore where standardized tests are not an omnipresent fixture, teacher-made tests are almost the only assessment tool available. This indicates the teachers’ need for assessment literacy of a much broader range, going beyond the interpretation of test results. Therefore, the present study aims to develop an instrument for assessment literacy to measure teachers’ assessment literacy in the Singapore context. 
 
Thirdly, previous studies have provided the framework for researchers to follow in designing an assessment literacy scale. As one of the most influential studies in language assessment literacy, Fulcher (2012) has expanded the definition of assessment literacy. According to him, assessment literacy comprises knowledge on three levels:
  • Level 1 concerns the knowledge, skills, and abilities in the practice of assessment, especially in terms of test design. Specifically, this type of knowledge includes deciding what to test, writing test items and tasks, developing test specifications, and developing rating scales.
  • Level 2 refers to the processes, principles and concepts of assessment, which are more relevant to quality standards and research. This type of knowledge includes validity, reliability, fairness, accommodation, washback / consequences as well as ethics and justice of assessment.
  • Level 3 is about the historical, social, political, philosophical and ethical frameworks of assessment.
Following Fulcher’s (2012) framework, we aim to measure two aspects of teachers’ assessment knowledge: (1) their knowledge, skills, and abilities in assessment practice and (2) their understanding of the fundamental principles and concepts of language assessment. This does not mean that we do not value the third domain (Level 3), but rather that we consider it less urgently needed by teachers in Singapore and less critical to their day-to-day use of assessment in the classroom context.


3   Method

3.1 Design

In the development of the Teacher Assessment Literacy Scale, two classic texts of educational measurement were consulted, namely Educational and Psychological Measurement and Evaluation (Hopkins 1998) and Measurement and Assessment in Teaching (Linn & Miller 2005). The first decision to be made was to identify and delimit the domains to be covered in the scale. After consulting the two texts, it was decided that four key domains needed to be represented in the new measure.
 
Firstly, teachers need to develop an understanding of the nature and functions of assessment if they are to do their work reflectively and in an informed way. Such understanding enables them to know why they are doing what they do, or are expected to do, and to do it with a meaningful purpose.

Secondly, teachers need the practical skills to design and use a variety of item formats to meet the instructional needs which vary with the content and students' abilities. Such skills may not be required when standardized tests are available for summative assessment, but they are critical in the present-day context when teachers are expected to craft their own tests to monitor student learning for formative assessment.

Thirdly, once teachers have obtained test results, they need to be able to properly interpret these to inform further teaching and guide student learning. Obtaining test scores without being able to properly interpret them is analogous to the situation, depicted by Popham (2006), in which health professionals are unable to interpret patient charts.

Finally, teachers need to be able to evaluate the qualities of the test results, and this entails basic knowledge of statistics. This, unfortunately, is what many teachers try to avoid, mainly due to a lack of training. Such knowledge enables teachers to see assessment results in a proper light, knowing their functions as well as their limitations in terms of measurement errors, which are an inherent part of educational measurement. Without an understanding of such concepts as reliability and validity, teachers tend to take assessment results too literally and may make unwarranted decisions (Soh 2011, 2016).

Thus, we intended to construct the scale to measure teachers’ skills and abilities as well as their understanding of principles and concepts in assessment as shown in Figure 1. It is expected that this scale will provide a measure of language teachers’ assessment literacy in Singapore and similar contexts. In the following part of this article, we will describe the processes of the development of the Teacher Assessment Literacy Scale and seek to provide evidence for its validity:
Fig. 1: Modelling Teachers’ Assessment Literacy

Against the background of the above considerations, it was decided that ten items be formulated for each domain as a sample representing the possible items of the domain. Four domains having been delimited, the whole scale thus comprises 40 items. It was further decided that the items should be four-option multiple-choice items, so as to ensure objectivity in scoring and to keep the testing time within a reasonable limit of about 30 minutes.
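To make the scoring procedure concrete, the following is a minimal sketch of how responses to such a 40-item, four-option instrument could be scored for the analyses reported below; the file names, the option coding, and the assumption that items are ordered by domain are illustrative only and are not taken from the study.

```python
import numpy as np

# Hypothetical inputs: each row of responses.csv holds one teacher's 40 chosen
# options ('A'-'D'); answer_key.csv holds the 40 keyed options in item order.
answers = np.loadtxt("responses.csv", dtype=str, delimiter=",")
key = np.loadtxt("answer_key.csv", dtype=str, delimiter=",")

scored = (answers == key).astype(int)                    # 1 = correct, 0 = incorrect
subscale_scores = scored.reshape(-1, 4, 10).sum(axis=2)  # four domains of ten items each
total_scores = scored.sum(axis=1)                        # total out of 40
```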

3.2 Trial Sample

The scale thus crafted was then administered to 323 Chinese Language teachers, 170 from primary schools and 153 from secondary schools and junior colleges. There was a female preponderance of 83%. Of the teachers, 52% had more than ten years of teaching experience. In terms of qualification, 93% held a university degree and 95% had completed professional training. However, only 48% reported that they had elected to study assessment in their pre-service training, and 78% acknowledged that they felt the need for more training in assessment.
 
The teachers were attending various in-service courses at the Singapore Centre for Chinese Language from January to March 2015. Admittedly, they might not form a random sample of Chinese language teachers in Singapore; a convenience sample was used in this study. However, at the time of the study, there were an estimated 3000 Chinese language teachers in Singapore. According to Kenpro (2012), a sample size of 341 is needed to adequately represent a population of 3000. Thus, the 323 teachers of this study come close to the expected sample size (93% of it, in fact). Moreover, the participating teachers can be considered mature in the profession, with more than half of them having 10 or more years of teaching experience. In addition, the female preponderance is typical of the teaching profession in Singapore. Thus, bearing in mind some limitations in these regards, the participating Chinese language teachers can be deemed sufficiently representative of Chinese language teachers in Singapore.
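For reference, the sample size of 341 cited from Kenpro (2012) can be reproduced with the Krejcie and Morgan (1970) formula that underlies such tables; the sketch below assumes the conventional parameter values (a chi-square of 3.841 for 1 df at the .05 level, a population proportion of .50, and a .05 margin of error).

```python
from math import ceil

def krejcie_morgan(N, chi2=3.841, P=0.5, d=0.05):
    """Required sample size for a finite population of size N (Krejcie & Morgan, 1970).
    chi2: chi-square value for 1 df at the .05 confidence level; P: assumed
    population proportion (.50 is the most conservative); d: margin of error."""
    return ceil(chi2 * N * P * (1 - P) / (d ** 2 * (N - 1) + chi2 * P * (1 - P)))

print(krejcie_morgan(3000))  # 341, the figure quoted in the text
```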

3.3 Analysis

Three types of analysis were conducted in the present study: factor analysis, classical item analysis, and Rasch scaling. 
 
Confirmatory factor analysis was performed to examine whether the collected data support the proposed model (Figure 1) of assessment literacy with the four specified dimensions. Next, classical item analysis was performed to obtain item difficulty (p) and item discrimination (r). Item difficulty indicates the proportion of teachers who chose the keyed answer and thus responded correctly to the respective item; it has also been referred to as facility or the F-index, which, somewhat counter-intuitively, indicates the easiness of an item. Item discrimination indicates how well an item differentiates between teachers who have chosen the correct answer and those who have not. Statistically, it is the item-total correlation, which indicates the consistency between overall ability (in terms of total score) and the response to an item. Then, Rasch analysis was conducted to estimate item locations, which indicate item difficulty within the context of the whole set of items analyzed, a positive index indicating an item that is difficult to answer correctly and vice versa; this is just the opposite of the classical F-index. The Rasch item locations were then correlated with the classical F-indices for an indication of the comparability of the two analyses.

4 Results

4.1 Descriptive Statistics

Descriptive statistics were first calculated at the subscale level to show the performance of teachers in the four domains of assessment literacy. As Table 1 shows, the means for the subscales vary between 5.53 and 2.97, out of 10. The highest mean is for the first subscale (nature & function of assessment) while the lowest is for the fourth subscale (concepts of reliability, validity, etc.). Generally, the means show that the teachers were able to answer correctly about half of the 30 questions in subscales 1 to 3, but only about three of the ten questions in subscale 4. If a criterion-referenced approach were adopted, requiring 90 percent of the teachers to answer 90 percent of the questions correctly, the results obtained would be far from satisfactory:

Subscale                                 | Number of items | M (SD)
Nature & function of assessment          | 10              | 5.53 (1.23)
Design & use of test items               | 10              | 4.87 (1.38)
Interpretation of test results           | 10              | 4.50 (1.58)
Concepts of reliability, validity, etc.  | 10              | 2.97 (1.50)

Table 1: Descriptive Statistics and Reliability Estimates at the Subscale Level

4.2 Confirmatory Factor Analysis

Confirmatory factor analysis was run to verify the proposed structure of assessment literacy. The model tested indicated good fit, as shown in Table 2. The incremental fit index CFI (1.00) is greater than .95, while the absolute fit index RMSEA (.001) is less than .06 (Hu & Bentler 1999). The RMSEA 95% confidence interval is narrow, and χ²/df (.57) is less than 3.00 (Kline 2011):

χ²   | df | χ²/df | CFI  | RMSEA | RMSEA 95% CI
1.13 | 2  | .57   | 1.00 | .001  | .000–.093

Table 2: Fit Indices of the CFA Model

Figure 2 presents the tested model with the estimated coefficients. The path coefficients of assessment literacy range from .21 to .49 (p<.05) and average at .39, indicating that the latent variable is well-defined by the four variables. However, of the four path coefficients, those for the first three subscales are sizable, varying from .41 to .49, but that for the fourth subscale, which deals with statistical and measurement concepts, is rather low at .21, indicating that the other three subscales are better measures of assessment literacy:
Fig. 2: The Tested CFA Model of Assessment Literacy
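As an illustration, a one-factor CFA of the kind summarized in Table 2 and Figure 2 could be fitted along the following lines. The sketch assumes the Python package semopy and its lavaan-style model syntax; the data file and column names are hypothetical, and this is not the software actually used in the study.

```python
import pandas as pd
import semopy  # assumption: the semopy SEM package is installed

# Hypothetical file holding each teacher's four subscale scores.
df = pd.read_csv("subscale_scores.csv")  # columns: nature, design, interpret, concepts

desc = "assessment_literacy =~ nature + design + interpret + concepts"
model = semopy.Model(desc)
model.fit(df)

print(model.inspect())           # parameter estimates (factor loadings)
print(semopy.calc_stats(model))  # fit indices such as chi-square, CFI and RMSEA
```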

4.3 Classical Item Analysis

Classical item analysis focuses on the test as a whole and on item indices (facility and discrimination) in a deterministic sense. The item indices thus obtained are sample-specific in that they will take different values when the items are trialed on a different sample. The classical item analysis has three focuses: item difficulty, item discrimination, and score reliability.
 
Item difficulty (p) is the proportion of teachers who chose the keyed (correct) option. In fact, item difficulty has also been referred to as item facility or easiness, since a larger proportion denotes more correct responses. Hence, it is also referred to as the F-index (Facility Index).

Item discrimination (r) is the correlation between the correct response to an item and the total score for the test as a whole. This is, in fact, the item-total correlation, which indicates the extent to which an item is able to differentiate between high- and low-scoring teachers. Hence, it is also referred to as the D-index (Discrimination Index). These two item indices are shown in Tables 3 to 6 for the 40 items of the four subtests, respectively. The question of score reliability is discussed later, in Section 5.1.
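As a concrete illustration, the two indices can be computed from a 0/1 scored matrix as sketched below (the `scored` array from the hypothetical scoring sketch in Section 3.1). The discrimination index is computed here as the corrected item-total correlation, with each item removed from its own total; since the authors do not state whether such a correction was applied, this detail is an assumption.

```python
import numpy as np

def classical_item_analysis(scored):
    """scored: (teachers x items) matrix of 0/1 responses.
    Returns the F-index (facility, p) and D-index (item-total correlation, r)."""
    scored = np.asarray(scored, dtype=float)
    p = scored.mean(axis=0)              # facility: proportion answering correctly
    total = scored.sum(axis=1)
    r = np.empty(scored.shape[1])
    for i in range(scored.shape[1]):
        rest = total - scored[:, i]      # total score with the item itself excluded
        r[i] = np.corrcoef(scored[:, i], rest)[0, 1]
    return p, r

# p, r = classical_item_analysis(scored)
```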


4.3.1 Subtest 1: Nature and Functions of Assessment

Table 3 presents the item indices for Subtest 1, which deals with understanding the functions that assessment has in teaching and learning, and with concepts related to the norm- and criterion-referenced interpretation of test scores. The p’s vary from a very low 0.07 to a very high 0.94, with a mean of 0.55 (median 0.56). In short, the items vary widely in difficulty, although the mean suggests that this subtest is moderately difficult as a whole. At the same time, the r’s vary from 0.13 to 0.33, with a mean of 0.23 (median 0.22). These figures indicate that the items have a low though acceptable discriminatory power.

Items   | p    | r
Item 1  | 0.93 | 0.22
Item 2  | 0.85 | 0.32
Item 3  | 0.10 | 0.17
Item 4  | 0.27 | 0.21
Item 5  | 0.93 | 0.30
Item 6  | 0.94 | 0.33
Item 7  | 0.92 | 0.29
Item 8  | 0.27 | 0.19
Item 9  | 0.26 | 0.14
Item 10 | 0.07 | 0.13
Mean    | 0.55 | 0.23
Median  | 0.56 | 0.22

Table 3: Item Indices for Subtest 1: Nature and Function of Assessment

4.3.2 Subtest 2: Design and Use of Test Items

The items of Subtest 2 deal with understanding the suitability of various item formats and their appropriate uses. The p’s vary from a low 0.13 to a very high 0.96, with a mean of 0.49 (median 0.44). In short, these items vary widely in difficulty, although the mean suggests that this subtest is moderately difficult as a whole. At the same time, the r’s vary from 0.11 to 0.30, with a mean of 0.21 (median 0.22). These results indicate that the items have a low though acceptable discriminatory power:

Items   | p    | r
Item 11 | 0.50 | 0.24
Item 12 | 0.76 | 0.28
Item 13 | 0.29 | 0.11
Item 14 | 0.33 | 0.20
Item 15 | 0.50 | 0.16
Item 16 | 0.13 | 0.19
Item 17 | 0.37 | 0.11
Item 18 | 0.87 | 0.26
Item 19 | 0.96 | 0.30
Item 20 | 0.16 | 0.23
Mean    | 0.49 | 0.21
Median  | 0.44 | 0.22

Table 4: Item Indices for Subtest 2: Design and Use of Test Items

4.3.3 Subtest 3: Interpretation of Test Results

The items of Subtest 3 pertain to knowledge of item indices and the meanings of test scores. The p’s vary from a very low 0.03 to a high 0.78, with a mean of 0.45 (median 0.51). These figures indicate that the items vary widely in difficulty, although the mean suggests that this subtest is of moderate difficulty. At the same time, the r’s vary from 0.05 to 0.47, with a mean of 0.24 (median 0.23). These results indicate that the subtest as a whole has acceptable discrimination:


Items   | p    | r
Item 21 | 0.14 | 0.05
Item 22 | 0.05 | 0.05
Item 23 | 0.39 | 0.23
Item 24 | 0.64 | 0.24
Item 25 | 0.75 | 0.18
Item 26 | 0.69 | 0.47
Item 27 | 0.43 | 0.22
Item 28 | 0.59 | 0.39
Item 29 | 0.03 | 0.10
Item 30 | 0.78 | 0.43
Mean    | 0.45 | 0.24
Median  | 0.51 | 0.23

Table 5: Item Indices for Subtest 3: Interpretation of Test Results

4.3.4 Subtest 4: Concepts of Reliability, Validity and Basic Statistics

Subtest 4 deals with abstract concepts of test score qualities and knowledge of simple statistics essential to understand test results. The p’s vary from a low 0.11 to a high 0.64, with a mean of 0.30 (median 0.25). These figures indicate that the items are difficult ones when compared with those of the other three subtests. The r’s vary from 0.05 to 0.36, with a mean of 0.19 (median 0.17). These results indicate that the subtest as a whole has low discrimination:
Items   | p    | r
Item 31 | 0.13 | 0.11
Item 32 | 0.21 | 0.22
Item 33 | 0.49 | 0.26
Item 34 | 0.15 | 0.13
Item 35 | 0.11 | 0.05
Item 36 | 0.56 | 0.36
Item 37 | 0.29 | 0.13
Item 38 | 0.28 | 0.21
Item 39 | 0.64 | 0.30
Item 40 | 0.12 | 0.10
Mean    | 0.30 | 0.19
Median  | 0.25 | 0.17

Table 6: Item Indices for Subtest 4: Concepts of Reliability, Validity, and Basic Statistics

On average, the 40 items of the scale generally have an acceptable level of facility, which means that they are neither too easy nor too difficult for the teachers involved in this study. However, the items tend to have low discriminatory power. These findings could, at least in part, reflect the deficits in the assessment literacy of the participating teachers, who obtained an overall mean of 18 on the 40 items.

4.4 Rasch Analysis

The Rasch analysis estimates the difficulty of an item in terms of the probability that teachers of given levels of ability will pass (or fail) the item, thus locating the item on a continuum of difficulty, hence the term location. Rasch (1993) defines the estimate for each item as an outcome of the linear probabilistic interaction of a person's ability and the difficulty of a given item. The goodness of fit of an item to the measurement model is evaluated with reference to its Outfit MSQ and Infit MSQ. However, there are no hard-and-fast rules for evaluating the fit statistics. For MCQ items, a range between 0.7 and 1.3 is recommended as indicating good fit (Wright & Linacre 1994). By this criterion, fit statistics lower than 0.7 suggest that items are over-fitting (too predictable), while those greater than 1.3 suggest that items are under-fitting (too unpredictable).
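For illustration, the estimation and the two fit statistics can be sketched as follows. This is a minimal joint maximum-likelihood routine written for exposition, not the software used for the study; it assumes a complete 0/1 response matrix with no perfect or zero scores and omits the bias corrections that dedicated Rasch software applies.

```python
import numpy as np

def rasch_jmle(X, n_iter=200, tol=1e-5):
    """Minimal joint maximum-likelihood estimation of the dichotomous Rasch model.
    X: (teachers x items) 0/1 matrix with no missing data and no perfect or zero scores."""
    X = np.asarray(X, dtype=float)
    theta = np.zeros(X.shape[0])   # person abilities (logits)
    beta = np.zeros(X.shape[1])    # item difficulties / locations (logits)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))  # model probabilities
        info = p * (1.0 - p)
        new_beta = beta + (p - X).sum(axis=0) / info.sum(axis=0)     # Newton step, items
        new_beta -= new_beta.mean()                                  # centre items to fix the scale
        new_theta = theta + (X - p).sum(axis=1) / info.sum(axis=1)   # Newton step, persons
        done = max(np.abs(new_beta - beta).max(), np.abs(new_theta - theta).max()) < tol
        theta, beta = new_theta, new_beta
        if done:
            break
    # Fit statistics: Outfit = unweighted mean squared standardised residual,
    # Infit = information-weighted mean square.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    info = p * (1.0 - p)
    outfit = ((X - p) ** 2 / info).mean(axis=0)
    infit = ((X - p) ** 2).sum(axis=0) / info.sum(axis=0)
    return theta, beta, infit, outfit

# Items with mean squares outside the 0.7-1.3 range would be flagged for inspection:
# theta, beta, infit, outfit = rasch_jmle(scored)
# flagged = np.where((outfit < 0.7) | (outfit > 1.3) | (infit < 0.7) | (infit > 1.3))[0]
```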

Table 7 shows the Outfit and Infit MSQs of the 40 items in descending order of item difficulty. As can be seen there, the item estimates vary from -3.664 to 3.245, with a mean of 0.000 (median 0.260). These figures show that the items cover a wide range of difficulty, and the median indicates that the items as a set are somewhat on the difficult side of the scale.

Table 7 also shows that the Infit MSQs vary between 0.690 and 1.146 (with a mean of 0.965 and a median of 0.999) and that the Outfit MSQs vary between 0.859 and 1.068 (with a mean of 0.970 and a median of 0.978). These figures show that the item fit statistics all fall within the recommended 0.7-1.3 range, and therefore all 40 items of the scale fit the Rasch model well.

In Table 7, the 40 items are classified into three groups of difficulty. The 'difficult' group comprises 11 items. These have facilities (p’s) of less than 0.2, indicating that fewer than 20 percent of the teachers were able to answer them correctly. These items are separated from the rest by a natural gap in Rasch difficulties between 1.417 and 1.125. As the brief content shows, most of these difficult items deal with quantitative aspects of test items (Items 21, 22, 29, 34, 35, and 40). The other items deal with diagnosis, function, assessing written expression, and above-level assessment. Generally, answering these questions correctly requires specific training in assessment, which many of the teachers do not have, especially with regard to those items which are quantitative in nature.

At the other end, the 'easy' group of items has seven items which have facilities (p’s) greater than .80, indicating that 80% or more of the teachers answered them correctly. They are separated by a natural gap in Rasch difficulties, between -1.579 and -2.009. The content shows that these items deal with concepts which can be gained through experience and are commonsensical in nature.

In between the two extreme groups are the 'appropriate' items - 22 items with facilities (p’s) greater than .20 and less than .80. This means that between 20% and 80% of the teachers chose the correct answers for these items. Their Rasch difficulties span from 1.125 to -1.579. In terms of item content, only three are from Subtest 1 (Nature and Function), most of whose remaining items fall into the 'easy' group. There are six items from Subtest 2 (Design and Use of Test Items), seven items from Subtest 3 (Interpretation of Test Results), and six items from Subtest 4 (Reliability, Validity, and Basic Statistics). These clearly show where the deficits in assessment literacy among the teachers are:
Item Number | Brief content                                  | Item difficulty | Outfit MSQ | Infit MSQ | p

Difficult Items
29 | Interpretation of a T-score                   |  3.245 | 0.961 | 0.930 | 0.03
22 | Minimum passing rate for MCQ item             |  2.690 | 1.083 | 0.965 | 0.05
10 | Diagnosing students’ learning difficulty      |  2.366 | 1.042 | 0.947 | 0.07
3  | Educational function of assessment            |  1.968 | 0.908 | 0.949 | 0.10
35 | Relations between reliability and validity    |  1.902 | 1.146 | 0.997 | 0.11
40 | Basic statistical concept: correlation        |  1.809 | 1.003 | 0.985 | 0.12
16 | Assessing written expression                  |  1.693 | 0.906 | 0.954 | 0.13
31 | Above-level assessment                        |  1.693 | 0.987 | 0.993 | 0.13
21 | Good MCQ F-index                              |  1.639 | 1.142 | 1.017 | 0.14
34 | Checking reliability of marking               |  1.561 | 1.034 | 0.983 | 0.15
20 | Disadvantage of essay writing                 |  1.417 | 0.907 | 0.952 | 0.16

Appropriate Items
32 | Below-level assessment                        |  1.125 | 1.025 | 0.960 | 0.21
9  | Criterion-referenced testing                  |  0.825 | 1.131 | 1.015 | 0.26
4  | Direct function of assessment                 |  0.777 | 1.001 | 0.988 | 0.27
8  | Norm-referenced testing                       |  0.777 | 0.984 | 1.002 | 0.27
38 | Basic statistical concept: central tendencies |  0.698 | 0.997 | 0.993 | 0.28
13 | Language ability scrambled sentence           |  0.682 | 1.122 | 1.044 | 0.29
37 | Basic statistical concept: skewed distribution|  0.651 | 1.048 | 1.041 | 0.29
14 | Reading comprehension and validity            |  0.489 | 1.015 | 1.009 | 0.33
17 | Weakness of objective items                   |  0.321 | 1.064 | 1.068 | 0.37
23 | Options of MCQ item                           |  0.199 | 1.007 | 1.008 | 0.39
27 | Concept of measurement error                  |  0.028 | 1.021 | 1.012 | 0.43
33 | Inter-rater reliability                       | -0.203 | 1.005 | 0.998 | 0.49
15 | Use of cloze procedures                       | -0.241 | 1.059 | 1.052 | 0.50
11 | Advantage of MCQ                              | -0.254 | 1.018 | 1.008 | 0.50
36 | Basic statistical concept: mode               | -0.485 | 0.931 | 0.942 | 0.56
28 | Nature of the T-score                         | -0.629 | 0.905 | 0.921 | 0.59
24 | D-index of MCQ item                           | -0.832 | 1.004 | 1.002 | 0.64
39 | Basic statistical concept: standard deviation | -0.832 | 0.969 | 0.970 | 0.64
26 | Interpretation of a test score                | -1.062 | 0.830 | 0.867 | 0.69
25 | Choice of topic to write                      | -1.398 | 1.035 | 1.013 | 0.75
12 | Language ability and MCQ                      | -1.443 | 0.930 | 0.973 | 0.76
30 | Difference between test scores                | -1.579 | 0.801 | 0.875 | 0.78

Easy Items
2  | Test results to help student                  | -2.009 | 0.859 | 0.917 | 0.85
18 | Critical factor in written expression         | -2.251 | 0.879 | 0.939 | 0.87
7  | Most important of class tests                 | -2.741 | 0.801 | 0.921 | 0.92
1  | Most important function of assessment         | -2.876 | 0.869 | 0.952 | 0.93
5  | Use of assessment results                     | -2.924 | 0.748 | 0.903 | 0.93
6  | Best things to help students make progress    | -3.142 | 0.690 | 0.880 | 0.94
19 | Sex, race, and SES biases                     | -3.664 | 0.738 | 0.859 | 0.96

Summary Statistics
Minimum |                                          | -3.664 | 0.690 | 0.859 | -
Maximum |                                          |  3.245 | 1.146 | 1.068 | -
Mean    |                                          |  0.000 | 0.965 | 0.970 | -
Median  |                                          |  0.260 | 0.999 | 0.978 | -

Table 7: Item Estimates and Fit Statistics

4.5 Correlations between Classical Facilities and Rasch Estimates

A question that has often been asked is whether the two approaches to item analysis (the classical and the Rasch approach) yield comparable results and, if they differ, to what extent. It is therefore interesting to note that when the classical p’s and the Rasch estimates of the 40 items were correlated, the result was |r|=0.99. This very high correlation coefficient indicates that the two approaches to item calibration yielded highly similar results, corroborating many recent studies (e.g. Fan 1998, Magno 2009, Prieto, Alonso & Lamarca 2003) which also report high correlations between classical item indices and Rasch estimates. For example, Fan (1998), using data from a large-scale evaluation of high school reading and mathematics in Texas, analyzed 20 samples of 1,000 students each and obtained correlations between the item indices of the two approaches varying from |r|=0.803 to |r|=0.920. These figures suggest that the two sets of item estimates shared between 65 and 85 percent of common variance. Figure 3 is a scatter plot of the two types of indices for the present study, and the curve shows a near-perfect (negative) correlation:
Fig. 3: Scatter Plot for p’s (X-axis) and Locations (Y-axis)
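Given the classical facilities and the Rasch locations (for instance, as returned by the hypothetical sketches in Sections 4.3 and 4.4), the comparison reported in this section reduces to a single correlation coefficient:

```python
import numpy as np

# Using the outputs of the earlier sketches: p from classical_item_analysis()
# and beta from rasch_jmle(), both computed on the same scored matrix.
p, _ = classical_item_analysis(scored)
_, beta, _, _ = rasch_jmle(scored)

print(round(np.corrcoef(p, beta)[0, 1], 2))  # strongly negative: harder items have lower p
```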

5 Discussion

In the present study, the psychometric features and factorial validity of the newly developed Teacher Assessment Literacy Scale have been investigated. In the following sections, we first discuss the issue of reliability and then build a validity argument based on the findings of the analyses.

5.1 Issue of Reliability

The conventional method of assessing score reliability is Cronbach’s alpha coefficient, which indicates the degree of internal consistency among the items, under the assumption that the items are homogeneous. The 40 items of the scale are scored 1 (right) or 0 (wrong), and therefore the Kuder-Richardson Formula 20 (KR20), which is a special case of Cronbach’s alpha for dichotomous items, was calculated.
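For dichotomously scored items, KR20 can be computed directly from the scored matrix; a minimal sketch (using the sample variance of the total scores) is given below.

```python
import numpy as np

def kr20(scored):
    """Kuder-Richardson Formula 20 for a (teachers x items) matrix of 0/1 scores."""
    scored = np.asarray(scored, dtype=float)
    k = scored.shape[1]
    p = scored.mean(axis=0)                     # item facilities
    q = 1.0 - p
    total_var = scored.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)
```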
 
Table 8 below shows the KR20 reliabilities of the four subscales and of the scale as a whole. The reliability coefficients vary from KR20=.18 to .40, with a median of .36. Moreover, for the scale as a whole and the total sample of combined primary and secondary teachers, Cronbach’s internal consistency coefficient is α=0.471. These indices are generally low compared with the conventional expectation of a minimum of 0.7. This inevitably raises the question of trustworthiness.
Measure                                | Primary | Secondary
1. Nature and functions of assessment  | .28     | .37
2. Design and use of test items        | .34     | .38
3. Interpretation of test results      | .36     | .18
4. Reliability, validity, etc.         | .36     | .40
Whole test                             | .34     | .58

Table 8: KR20 Reliabilities

However, there have been criticisms of Cronbach’s alpha as a measure of item homogeneity or unidimensionality (Bademci 2006). One condition which might have led to the low reliabilities shown in Table 8 is the heterogeneous nature of the item content among the 40 items, since they cover many different aspects of educational measurement, some being qualitative and others quantitative in nature, even within a particular subtest. This renders the conventional reliability measures (i.e., Cronbach’s alpha and its equivalent KR20), which assume item homogeneity, unsuitable for the purpose of the present study.
 
Group homogeneity is another factor contributing to low score reliability. Pike & Hudson (1998: 149) discussed the limitations of using Cronbach’s alpha (and its equivalent KR20) to estimate reliability when the sample responds homogeneously on the measured construct, and described the risk of drawing the wrong conclusion that a new instrument has poor reliability. They demonstrated the use of an alternative statistic that may serve as a cushion against such a situation and recommended calculating the Relative Alpha from the standard error of measurement (SEM), which itself involves the reliability, as shown in the formula:

SEM = SD × √(1 – reliability)

Pike & Hudson’s Relative Alpha can take a value between 0.0 and 1.0, indicating the extent to which the scores can be trusted, using an alternative way to evaluate score reliability. Their formula is:
Relative Alpha = 1 – SEM² / (Range/6)²

In this formula, SEM is the usual indicator of the lack of trustworthiness of the scores obtained and, under normal circumstances, the scores for a scale will theoretically span about six standard deviations. The second term on the right is thus an indication of the proportion of test variance that is unreliable, and Relative Alpha indicates the proportion of test variance that remains after offsetting the unreliable portion, i.e., the proportion of test variance that is trustworthy.

In the present study, the maximum possible score is 40, and the theoretically possible standard deviation is 6.67 (= 40/6). However, the actual data yield standard deviations of 4.24 (primary) and 4.66 (secondary) for the scale as a whole, which are 0.64 and 0.70, respectively, of the theoretical standard deviation. In other words, the two groups are found to be more homogeneous than theoretically expected.
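The computation itself is straightforward once the SEM is in hand. The sketch below simply encodes the two formulas above; the input values shown are hypothetical and are not the study's figures, since the exact inputs used for Table 9 (for instance, which range was taken) are not stated.

```python
def relative_alpha(sd, reliability, score_range):
    """Pike & Hudson's Relative Alpha from the observed SD, a conventional
    reliability estimate (e.g. KR20), and the score range of the scale."""
    sem = sd * (1.0 - reliability) ** 0.5            # standard error of measurement
    return 1.0 - sem ** 2 / (score_range / 6.0) ** 2

# Purely hypothetical illustration: a 40-point scale with SD = 4.5 and KR20 = .35
print(round(relative_alpha(4.5, 0.35, 40), 2))
```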
 
Table 9 shows the Relative Alphas for the primary and secondary groups. The statistics suggest that much of the test variance has been captured by the 40-item scale, and the scores can therefore be trusted:
Measure                                         | Primary | Secondary
1. Nature and functions of assessment           | .98     | .97
2. Design and use of test items                 | .95     | .95
3. Interpretation of test results               | .94     | .93
4. Reliability, validity, and basic statistics  | .95     | .95
Whole test                                      | .97     | .97

Table 9: Relative Alpha Coefficients


5.2 Validity Evidence

Regarding content-referenced evidence, the scale was developed based on a model resulting from an analysis of empirical data and a survey of the relevant literature (Fulcher 2012). In addition, a content analysis was conducted on the scale for better content representation, and the Rasch analysis provides further content-referenced evidence. Substantive validity evidence refers to the relationship between the construct and the data observed (Wolfe & Smith 2007a, b). In the current study, the Rasch analysis (Infit and Outfit statistics) as well as the confirmatory factor analysis provide substantive evidence. Also, the alignment between the analysis based on classical test theory and the Rasch analysis further supports the validity argument.

Apart from direct validity evidence, information beyond the test scores is needed to verify score validity. Ideally, the criterion scores for validity come from a test of application of measurement concepts and techniques, but such information is not available within the results obtained, although some of the 40 items of the scale are of this nature (for instance, those items on statistical concepts). However, indirect evidence of score validity is provided by the teachers’ responses to the open-ended question asking for comments and suggestions with regards to educational assessment.

For the open-ended question, the teachers made 36 responses. Most of the responses reflect the teachers’ realization that assessment plays an important role in their teaching, for which specialized knowledge is needed. Examples of such responses are shown below:

(1) What is taught and what is assessed should be consistent.
(2) Teachers need to have knowledge of educational measurement.
(3) Teachers need to popularize knowledge of assessment among the school leaders.
(4) Teachers hope to gain knowledge of educational measurement to assess with in-depth understanding.
(5) Without any knowledge of educational measurement, data analysis of results conducted in the school is superficial.
(6) Teachers think that knowledge of education management is very much needed!
(7) The survey will help to improve teaching.

The second type of responses (i.e., 8, 9, 10, and 11) reflects the difficulty that teachers had in understanding items which involve technical concepts and terminologies. Such responses are expected in view of the lack of more formal and intensive training in educational assessment. Examples of such responses are shown below:

(8) Teachers are not familiar with the technical terms.
(9) Some teachers do not understand technical terms.
(10) Some teachers don’t understand some of the questions.
(11) They do not know many mathematical terms.

The third type of responses (i.e., 12, 13) reflects the teachers' need to be convinced that assessment training is necessary for them to use assessment results properly as part of their instruction. Examples of such responses are shown below:

(12) They want to know if assessment can really raise students’ achievement and attitude and if it will add on the teachers’ work and be helpful to students?
(13) They wonder if data help in formative assessment.

These responses reveal and re-affirm the teachers' test-taking attitudes when responding to the scale. Their seriousness with the scale is clearly evident. The second type of responses (i.e., 8, 9, 10, and 11) corroborates the finding that teachers lack relevant specific training in educational assessment and hence found the technical terms and concepts unfamiliar. These findings truly reflect their situation and lack of knowledge. The third type of responses indicates the reservation and inquisitiveness of some of the teachers; this indirectly reflects that they have to be convinced that they need more training in educational measurement. Thus, when read in context, these responses provide indirect evidence of the validity of the scores.


6 Conclusions

By way of summary, this article presents preliminary evidence of the psychometric quality and the content-referenced and substantive validity of the newly developed scale. As pointed out by Popham (2006), there is a similarity between the healthcare and teaching professions in that practitioners need to be able to properly read information about the people they serve as a prerequisite to what they intend and need to do. Thus, the importance of teachers’ assessment literacy cannot be over-emphasized. There is therefore a need for an instrument that can help gauge this crucial understanding and these skills of teachers. However, interest in this regard has a rather short history, and there are fewer than a handful of such measurement tools at our disposal at the moment.
 
The new scale reported here is an attempt to fill the vacuum. It covers essential conceptual skills of educational measurement which teachers need to know if they are to perform this aspect of their profession adequately. The new scale is found to be on the 'difficult' side, partly due to a lack of relevant training among the teachers who provided the data. However, it is encouraging that its items have also been found to fit the measurement model reasonably well. What needs to be done from here on is to apply the scale to larger and more representative samples of teachers in varied contexts and subjects for its consolidation. In short, the current study is the alpha, far from being the omega.

In addition, it should be pointed out that there are several limitations to this study. Most importantly, the homogeneity of the teachers (i.e. they were all Chinese Language teachers in Singapore) might detract from the validity of the items and, hence, of the scale as a whole. Further research should involve teachers of different subjects and geographical backgrounds. With more studies conducted along these lines, teachers’ assessment literacy can be better measured and investigated. In order to facilitate further research, the scale is available from the authors on request.


References

American Federation of Teachers, National Council on Measurement in Education & National Education Association (1990). The Standards for Teacher Competence in the Educational Assessment of Students. (http://www.unl.edu/buros/article3.html; 22-07-2003).

Bademci, V. (2006). Cronbach’s Alpha is not a Measure of Unidimensionality or Homogeneity. Paper presented at the conference “Paradigm shift: Tests are not reliable" at Gazi University, 28 April 2006.

DeLuca, C., laPointe-McEwan, D. & Luhange, U. (2016). Teacher assessment literacy: a review of international standards and measures. Educational Assessment, Evaluation and Accountability, 28(3), 251-272.

Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58 (3), 357-373.

Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9, 113-132.

Gipps, C. (1994). Beyond Testing: Towards a Theory of Educational Assessment. London: Falmer Press.

Gotch, C. M. & French, B. F. (2014). A systematic review of assessment literacy measures. Educational Measurement: Issues and Practice, 33(2), 14-18.

Hopkins, K. D. (1998). Educational and Psychological Measurement and Evaluation. Needham Heights, MA: Allyn & Bacon.

Kenpro Project Organization (2012). Sample Size Determined Using Krejcie and Morgan Table. (http://www.kenpro.org/sample-size-determination-using-krejcie-and-morgan-table; 07-07-2017).

Linn, R. L. and Miller, M. D. (2005). Measurement and Assessment in Teaching, Ninth Edition. Upper Saddle River, New Jersey: Pearson, Merrill, Prentice Hall.

Magno, C. (2009). Demonstrating the difference between Classical Test Theory and Item Response Theory using derived test data. The International Journal of Educational and Psychological Assessment 1 (1), 1-11.

Malone, M. (2008). Training in language assessment. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language and Education, Vol. 7: Language Testing and Assessment (pp. 273-284). New York, NY: Springer.

Mertler, C. A. (2003). Preservice versus Inservice Teachers’ Assessment Literacy: Does Classroom Experience Make a Difference? Paper presented at the annual meeting of the Mid-Western Educational Research Association, Columbus, OH (Oct. 15-18, 2003).

Mertler, C. A. and Campbell, C. (2005). Measuring Teachers’ Knowledge and Application of Classroom Assessment Concepts: Development of the Assessment Literacy Inventory. Paper presented at the annual meeting of the American Educational Research Association, Montréal, Quebec, Canada April 11–15, 2005.

Pike, C. K. and Hudson, W. W. (1998). Reliability and Measurement Error in the Presence of Homogeneity. Journal of Social Service Research, 24 (1/2), 149-163.

Plake, B. & Impara, J. C. (1993). Assessment competencies of teachers: A national survey. Educational Measurement: Issues and Practice, 12 (4), 10-12. 
 
Popham, W. J. (2006). All about Accountability / Needed: A Dose of Assessment Literacy. Educational Leadership, 63(6), 84-85. (http://www.ascd.org/publications/educational-leadership/mar06/vol63/num06/Needed@-A-Dose-of-Assessment-Literacy.aspx; 07-07-2017).

Popham, W. J. (2009). Assessment literacy for teachers: faddish or fundamental? Theory Into Practice, 48, 4-11. DOI: 10.1080/00405840802577536.

Prieto, L., Alonso, J. & Lamarca, R. (2003). Classical test theory versus Rasch analysis for quality of life questionnaire reduction. Health and Quality of Life Outcomes, 1:27. DOI: 10.1186/1477-7525-1-27.

Rasch, G. (1993). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: MESA Press.

Soh, K. (2016). Understanding Test and Exam Results Statistically: An Essential Guide for Teachers and School Leaders. New York: Springer.

Soh, K. C. (2011). Above-level testing: assumed benefits and consequences. Academy of Singapore Teachers: i.d.e.a2, Issue 2, October 2011, pp. 3-7. (http://www.academyofsingaporeteachers.moe.gov.sg/ast/slot/u2597/images/IDEA2/IDEA2_Issue2.pdf; 07-07-2017).

Spolsky, B. (1978). Introduction: Linguists and language testers. In B. Spolsky (Ed.), Approaches to Language Testing: Advances in Language Testing Series: 2 (pp. V-X). Arlington, VA: Center for Applied Linguistics. 
 
Spolsky, B. (1995). Measured Words: The Development of Objective Language Testing. Oxford: Oxford University Press.

Stiggins, R. J. (1991). Assessment literacy. Phi Delta Kappan, 72(7), 534-539.

Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. Review of Research in Education, 17, 31-125.

Wolfe, E. W., & Smith, E. V. Jr. (2007a). Instrument development tools and activities for measure validation using Rasch models: Part I—Instrument development tools. Journal of Applied Measurement, 8, 97–123.

Wolfe, E. W., & Smith, E. V. Jr. (2007b). Instrument development tools and activities for measure validation using Rasch models: Part II—Validation activities. Journal of Applied Measurement, 8, 204-233.

Wright, B. D. & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370. (http://www.rasch.org/rmt/rmt83b.htm; 07-07-2017).


Authors:

Dr Kay Cheng Soh
Research Consultant
Nanyang Technological University
Singapore Centre for Chinese Language
287 Ghim Moh Road
Singapore 279623
E-mail: sohkc@singnet.com.sg

Dr Limei Zhang
Lecturer
Nanyang Technological University
Singapore Centre for Chinese Language
287 Ghim Moh Road
Singapore 279623
E-mail: limeizh2008@yahoo.com