Volume 8 (2017) Issue 1

The Development and Validation of a Teacher Assessment Literacy Scale: A Trial Report

Kay Cheng Soh & Limei Zhang (both Singapore)
Abstract

Teachers share a responsibility similar to that of healthcare professionals in the need to interpret assessment results. Interest in teacher assessment literacy has a short history but has gained momentum in recent years, and there are not many instruments for measuring this important professional capability. The present study reports the results of trialling the Teacher Assessment Literacy Scale, which covers four essential aspects of educational measurement. Both classical and Rasch analyses were conducted, with encouraging psychometric qualities.
Key words: Assessment, Assessment literacy, Measurement, Testing
What most of today's educators know about education assessment would fit comfortably inside a kindergartner's half-filled milk carton. This is astonishing in light of the fact that during the last 25 years, educator competence has been increasingly determined by student performance on various large-scale examinations… A profession's adequacy is being judged on the basis of tools that the profession's members don't understand. This situation is analogous to asking doctors and nurses to do their jobs without knowing how to interpret patient charts…(Popham, 2006: para. 1; emphases added)
1 Introduction
There is no better way to emphasize the importance of assessment literacy than to quote Popham (2006). About a decade ago, Popham drew this very apt analogy between the educational and healthcare professions, in both of which the proper use of test information is crucial. Assessment literacy is needed not only by teachers but by everyone who has an interest in education, including school leaders, policy-makers, and parents.
In the past, patients were passive recipients of medical treatments. Present-day patients are involved in the healing process; they are informed and they are engaged. Analogously, in the past, student assessment tools were crafted by test specialists while teachers were passive users. This is true at least in the American context, where standardized tests are a regular fixture of the school. Nowadays, with the emphasis on assessment for learning (or formative assessment) in contrast with assessment of learning (or summative assessment), teachers in America and elsewhere are expected to use assessment in a more engaged manner to help students learn. Teachers are therefore expected to use test information not only for assessment of learning but also, perhaps more importantly, for assessment for learning. This shift all the more underlines the importance of teachers' assessment literacy if they are to carry out this crucial aspect of their job with professionalism.
Due to the change in the emphasis on formative assessment and its contribution to learning (Fulcher 2012), the notion of assessment literacy has changed accordingly. Traditionally, assessment emphasized objectivity and accuracy (Spolsky 1978, 1995), due to the influence of the psychometric and positivistic paradigms, and testing activities were normally carried out at the end of learning periods (Gipps 1994, Wolf, Bixby, Glenn & Gardner 1991). In that context, only measurement specialists, not frontline teachers, were expected to have specialized knowledge of test development, score interpretation, and theoretical concepts of measurement. In contrast, assessment is now perceived as an integral part of teaching and learning that provides timely information as feedback to guide further instruction and learning. This development requires teachers to design assessments, make use of test results to promote teaching and learning, and be aware of the inherent technical problems as well as the limitations of educational measurement (Fulcher 2012, Malone 2008). Thus, it is important that teachers have sufficient practical skills as well as theoretical understanding.
2 Assessment Literacy Measures
2.1 The Importance of Assessment Literacy

Over the years, efforts have been made to measure teacher assessment literacy. Gotch & French (2014) systematically reviewed teacher assessment literacy measures within the context of contemporary teacher evaluation policy. The authors collected objective tests of assessment knowledge, teacher self-reports, and rubrics to evaluate teachers' work in assessment literacy studies from 1991 to 2012. They then evaluated the psychometric work from these measures against a set of claims related to score interpretation and use. Across the 36 measures reviewed, they found weak support for these claims. This highlights the need for increased work on assessment literacy measures in the educational measurement field.
Later, DeLuca, LaPointe-McEwan & Luhanga (2016) emphasized that assessment literacy is a core professional requirement across educational systems and that measuring and supporting teachers' assessment literacy has been a primary focus over the past two decades. At present, according to the authors, there is a multitude of assessment standards across the world and numerous assessment literacy measures representing different conceptions of assessment literacy. DeLuca, LaPointe-McEwan & Luhanga then analyzed assessment literacy standards from five English-speaking countries (Australia, Canada, New Zealand, the UK, and the USA) and Europe to understand shifts in the assessment standards developed after 1990. Through a thematic analysis of 15 assessment standards and an examination of eight assessment literacy measures, the authors noticed shifts in standards over time, though the majority of the measures continue to be based on early conceptions of assessment literacy.
Stiggins (1991) first coined the term assessment literacy to refer to teachers' understanding of the differences between sound and unsound assessment procedures and of the use of assessment outcomes. Teachers who are assessment literate should have a clear understanding of the purposes and targets of assessment, the competence to choose appropriate assessment procedures, and the capability to conduct assessment effectively and to avoid pitfalls in the process of assessment practice and the interpretation of results.
This sounds simple but can be a tall order in actuality. For example, the by now classic textbook of educational measurement by Linn & Miller (2005) has altogether 19 chapters in three parts. The five chapters in Part I cover such topics as the role of assessment, instructional goals of assessment, concepts of reliability and validity, and issues and trends. These may not be of immediate relevance to classroom teachers' work but provide the necessary conceptual background for teachers to be informed assessors. Part II has 10 chapters of a technical or procedural nature, which equip teachers with the necessary practical skills in test design using a wide range of item formats. The concluding Part III has four chapters dealing with selecting and using published tests as well as the interpretation of scores involving basic statistical concepts. The three parts that make up the essential domains of assessment literacy expected of classroom teachers are typical of many education measurement texts that support teacher training programs.
According to Popham (2009), increasing numbers of professional development programs have dealt with assessment literacy for teachers and administrators. Popham then asked whether assessment literacy is merely a fashionable focus or whether it should be regarded as a significant area of professional development for years to come. Popham first divided educators' measurement-related concerns into either classroom assessments or accountability assessments, and then argued that educators' inadequate knowledge about either of these can cripple the quality of education. He concluded that assessment literacy is a condicio sine qua non for today's competent educator and must be a pivotal content area for current and future staff development. For this, 13 topics were set forth for the design of assessment literacy programs. He proposed a two-pronged approach to solve the problem: until pre-service teacher education programs begin producing assessment-literate teachers, professional developers must continue to rectify this omission in educators' professional capabilities. In short, Popham sees assessment literacy as a commodity needed by teachers for their own long-term well-being, and for the educational well-being of their students. The topics proposed are as follows:
- The fundamental function of educational assessment, namely, the collection of evidence from which inferences can be made about students’ skills, knowledge, and affect;
- Reliability of educational assessments, especially the three forms in which consistency evidence is reported for groups of test-takers (stability, alternate-form, and internal consistency) and how to gauge consistency of assessment for individual test-takers;
- The prominent role three types of validity evidence should play in the building of arguments to support the accuracy of test-based interpretations about students, namely, content-related, criterion-related, and construct-related evidence;
- How to identify and eliminate assessment bias that offends or unfairly penalizes test-takers because of personal characteristics such as race, gender, or socioeconomic status;
- Construction and improvement of selected-response and constructed-response test items;
- Scoring of students’ responses to constructed-response tests items, especially the distinctive contribution made by well-formed rubrics;
- Development and scoring of performance assessments, portfolio assessments, exhibitions, peer assessments, and self-assessments;
- Designing and implementing formative assessment procedures consonant with both research evidence and experience-based insights regarding the probable success of such procedures;
- How to collect and interpret evidence of students' attitudes, interests, and values;
- Interpreting students' performances on large-scale, standardized achievement and aptitude assessments;
- Assessing English Language learners and students with disabilities;
- How to appropriately (and not inappropriately) prepare students for high-stakes tests;
- How to determine the appropriateness of an accountability test for use in evaluating the quality of instruction.
The list seems overwhelming and demanding. However, it shows that teachers' assessment literacy is a complex area of professional development which should be taken seriously if a high standard of professionalism is to be maintained. Nonetheless, Popham's expectation is in fact a modest one:
When I refer to assessment literacy, I'm not thinking about a large collection of abstruse notions unrelated to the day-to-day decisions that educators face. On the contrary, assessment-literate educators need to understand a relatively small number of commonsense measurement fundamentals, not a stack of psychometric exotica. (Popham 2006: 5)
While Popham's expectation is more realistic and palatable to busy classroom teachers, there is still a need for the specifics. For these, Plake & Impara (1993) developed the Teacher Assessment Literacy Questionnaire, which was later adapted for the Classroom Assessment Literacy Inventory (Mertler & Campbell 2005). It comprises 35 items measuring teachers' general concepts about testing and assessment and teachers' background information. These items were organized under seven scenarios featuring teachers who were facing various assessment-related decisions. The instrument was first trialed on 152 pre-service teachers and obtained an overall reliability of KR20=0.75, with an average item difficulty of F=0.64 and an average item discrimination of D=0.32. Using another sample of teachers, Mertler and Campbell (2005) found reliability of KR20=0.74, an average item difficulty of F=0.68, and an average item discrimination of D=0.31. In short, this instrument shows acceptable reliability, at least for research purposes, and is of moderate difficulty but low discrimination.
In the adaptation of the Classroom Assessment Literacy Inventory, Mertler (2005) followed The Standards for Teacher Competence in the Educational Assessment of Students (AFT, NCME & NEA 1990). Seven such standards were included, resulting in items measuring the following seven aspects of assessment literacy:
- Choosing Appropriate Assessment Methods
- Developing Appropriate Assessment Methods
- Administering, Scoring, and Interpreting the Results of Assessments
- Using Assessment Results to Make Decisions
- Developing Valid Grading Procedures
- Communicating Assessment Results
- Recognizing Unethical or Illegal Practices
It stands to reason that these seven aspects are intimately relevant to teachers' day-to-day assessment responsibilities and that it is reasonable to expect all teachers to be equipped with the attendant understanding and skills.
Later, Fulcher (2012), in England, developed a measure of assessment literacy comprising 23 closed-ended items. The items cover teachers' knowledge of test design and development, large-scale standardized testing, classroom testing and its washback, as well as validity and reliability. In addition, the instrument includes constructed-response items eliciting teachers' feedback on their experience in assessment, and background information. Fulcher's study involved 278 teachers, 85% of whom held a Master's degree and 69% of whom were female. Analysis of the quantitative data yielded a Cronbach's α=0.93 (which is rather high), and a factor analysis with Varimax rotation returned four orthogonal factors accounting for 52.3% of the variance. Although this percentage is rather low, the four factors are interpretable: (1) Test Design and Development (17.9%), (2) Large-scale Standardized Testing (16.5%), (3) Classroom Testing and Washback (12.0%), and (4) Validity and Reliability (6.5%). The four factor scales have respectable Cronbach's coefficients of α=0.89, α=0.86, α=0.79, and α=0.94, respectively.
2.2 Relevance to the Present Study

The review of the pertinent literature on assessment literacy and its measurement has implications for the present study.
First, in recent years, the Singapore Ministry of Education has launched initiatives emphasizing higher-order thinking skills and deep understanding in teaching, such as Teach Less, Learn More (TLLM) and Thinking Schools, Learning Nation (TSLN). Consequently, school teachers are required to make changes to their assessment practice and to equip themselves with sufficient assessment literacy. In spite of the importance of assessment literacy, few studies have been conducted to examine Singapore teachers' assessment knowledge and skills. Among these few studies, Koh (2011) investigated the effects of professional development on Primary 4 and 5 teachers of English, Science and Mathematics. She found that ongoing professional development in assessment literacy was especially effective in improving teachers' assessment literacy, compared with merely having teachers participate in workshops training them to design assessment rubrics. The findings suggest that to successfully develop teachers' assessment literacy, the training needs to be broad enough in the topics covered and has to be extended over a reasonable period of time.
In a more recent study, Zhang & Chin (under review) examined the learning needs in language assessment among 103 primary school Chinese Language teachers using an assessment literacy survey developed by Fulcher (2012). The results provide an understanding of teachers' interest and knowledge in test design and development, large-scale testing, classroom testing, and test validity and reliability. With a very limited number of studies in the Singapore context, there is obviously a need for more studies to be carried out for a better understanding of Singapore school teachers' assessment literacy. For carrying out such studies, it is necessary to develop an assessment literacy scale which is broad enough, and yet concise enough, to measure teachers' assessment competence properly and accurately.
Secondly, in systems like that of the USA, where standardized tests are designed by test specialists through a long and arduous process of test development, applying sophisticated psychometric concepts and principles (with regular intermittent revisions), it is reasonable to assume that the resultant assessment tools made available to teachers are of a high psychometric quality. In this case, the most critical aspect of assessment literacy that teachers need is the ability to properly interpret the results they obtain through the tests. Measurement knowledge beyond this level is good to have but not really needed. However, in a system like that of Singapore, where standardized tests are not an omnipresent fixture, teacher-made tests are almost the only assessment tool available. This indicates the teachers' need for assessment literacy of a much broader range, going beyond the interpretation of test results. Therefore, the present study aims to develop an instrument to measure teachers' assessment literacy in the Singapore context.
Thirdly, previous studies have provided the framework for researchers to follow in designing an assessment literacy scale. In one of the most influential studies in language assessment literacy, Fulcher (2012) expanded the definition of assessment literacy. According to him, assessment literacy comprises knowledge on three levels:
- Level 1 concerns the knowledge, skills, and abilities involved in the practice of assessment, especially in terms of test design. Specifically, this type of knowledge includes how to decide what to test, writing test items and tasks, developing test specifications, and developing rating scales.
- Level 2 refers to the processes, principles and concepts of assessment, which are more relevant to quality standards and research. This type of knowledge includes validity, reliability, fairness, accommodation, washback / consequences as well as ethics and justice of assessment.
- Level 3 is about the historical, social, political, philosophical and ethical frameworks of assessment.
Following Fulcher's (2012) framework, we aim to measure teachers' assessment knowledge in two respects: (1) their knowledge, skills, and abilities in assessment practice and (2) their understanding of the fundamental principles and concepts of language assessment. This does not mean that we do not value the third domain (Level 3), but we consider it less urgently needed by teachers in Singapore and less critical to their day-to-day use of assessment in the classroom context.
3 Method
3.1 Design

In developing the Teacher Assessment Literacy Scale, two classic texts of educational measurement were consulted, namely Educational and Psychological Measurement and Evaluation (Hopkins 1998) and Measurement and Assessment in Teaching (Linn & Miller 2005). The first decision to be made was to identify and delimit the domains to be covered in the scale. After consulting the two classics, it was decided that four key domains needed to be represented in the new measure.
Firstly, teachers need to develop an understanding of the nature and functions of assessment if they are to do their work reflectively and in an informed way. Such understanding enables them to know why they are doing what they do, or are expected to do, and to do it with meaningful purpose.
Secondly, teachers need the practical skills to design and use a variety of item formats to meet the instructional needs which vary with the content and students' abilities. Such skills may not be required when standardized tests are available for summative assessment, but they are critical in the present-day context when teachers are expected to craft their own tests to monitor student learning for formative assessment.
Thirdly, once teachers have obtained test results, they need to be able to properly interpret these to inform further teaching and guide student learning. Obtaining test scores without being able to properly interpret them is analogous to the situation, depicted by Popham (2006), in which health professionals are unable to interpret patient charts.
Finally, teachers need to be able to evaluate the qualities of the test results, and this entails basic knowledge of statistics. This, unfortunately, is what many teachers try to avoid, mainly due to a lack of training. Such knowledge enables teachers to see assessment results in a proper light, knowing their functions as well as their limitations in terms of measurement errors, which are an inherent part of educational measurement. Without an understanding of such concepts as reliability and validity, teachers tend to take assessment results too literally and may make unwarranted decisions (Soh 2011, 2016).
Thus, we intended to construct the scale to measure teachers' skills and abilities as well as their understanding of principles and concepts in assessment, as shown in Figure 1. It is expected that this scale will provide a measure of language teachers' assessment literacy in Singapore and similar contexts. In the following part of this article, we describe the process of developing the Teacher Assessment Literacy Scale and seek to provide evidence for its validity:

Fig. 1: Modelling Teachers' Assessment Literacy
Against the background of the above considerations, it was decided that ten items be formulated for each domain as a sample representing the possible items of that domain. With four domains thus delimited, the whole scale comprises 40 items. It was further decided that the items should be four-option multiple-choice items so as to ensure objectivity in scoring and keep the testing time within a reasonable limit of about 30 minutes.
3.2 Trial Sample

The scale thus crafted was then administered to 323 Chinese Language teachers, 170 from primary schools and 153 from secondary schools and junior colleges. There is a female preponderance of 83%. Of the teachers, 52% had more than ten years of teaching experience. In terms of qualification, 93% held a university degree and 95% had completed professional training. However, only 48% reported that they had elected to study assessment in their pre-service training, and 78% acknowledged that they felt the need for more training in assessment.
The teachers were attending various in-service courses at the Singapore Centre for Chinese Language from January to March 2015. Admittedly, this is a convenience sample, and the teachers might not form a random sample of Chinese Language teachers in Singapore. However, at the time of the study, there were an estimated 3,000 Chinese Language teachers in Singapore, and according to Kenpro (2012), a sample size of 341 is needed to adequately represent a population of 3,000. Thus, the 323 teachers of this study come close to the expected sample size, about 95% of it in fact. Moreover, the participating teachers can be considered mature in the profession, with more than half of them having 10 or more years of teaching experience, and the female preponderance is typical of the teaching profession in Singapore. Thus, bearing in mind some limitations in these regards, the participating Chinese Language teachers can be deemed sufficiently representative of Chinese Language teachers in Singapore.
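As a side note on the sample-size figure, the Kenpro (2012) table is based on the Krejcie & Morgan formula; a minimal sketch (Python) using the usual assumed constants (χ² = 3.841 for 95% confidence, P = 0.5, 5% margin of error, none of which are stated above) reproduces the value of 341 for a population of 3,000:

```python
import math

def krejcie_morgan(population, chi2=3.841, p=0.5, d=0.05):
    """Required sample size for a finite population (Krejcie & Morgan, 1970):
    s = chi2 * N * p * (1 - p) / (d^2 * (N - 1) + chi2 * p * (1 - p))."""
    n = population
    return math.ceil(chi2 * n * p * (1 - p) / (d ** 2 * (n - 1) + chi2 * p * (1 - p)))

# For the estimated 3,000 Chinese Language teachers in Singapore:
print(krejcie_morgan(3000))  # -> 341, the figure cited from Kenpro (2012)
```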
3.3 Analysis

Three types of analysis were conducted in the present study: factor analysis, classical item analysis, and Rasch scaling. Confirmatory factor analysis was performed to examine whether the collected data support the proposed model (Figure 1) of assessment literacy with the four specified dimensions. Next, classical item analysis was performed to obtain item difficulty (p) and item discrimination (r). Item difficulty indicates the proportion of teachers who chose the keyed answer and thus responded correctly to the respective item. This has also been referred to as facility or the F-index which, ironically, indicates the easiness of an item. Item discrimination is an indication of how well an item differentiates between teachers who have chosen the correct answer and those who have not. Statistically, it is the item-total correlation that indicates the consistency between overall ability (in terms of total score) and the response to an item. Then, Rasch analysis was conducted to estimate item locations, which indicate item difficulty within the context of the whole set of items analyzed, with a positive index indicating a more difficult item and a negative index an easier one; this is just the opposite of the classical F-index. The results of the Rasch analysis were then correlated with the classical F-indices for an indication of the agreement between the two analyses.
4 Results
4.1 Descriptive Statistics

Descriptive statistics were first calculated at the subscale level to show the performance of teachers in the four domains of assessment literacy. As Table 1 shows, the means for the subscales vary between 5.53 and 2.97, out of 10. The highest mean is for the first subscale (nature & function of assessment) while the lowest is for the fourth subscale (concepts of reliability, validity, etc.). Generally, the means show that teachers were able to answer correctly about half of the 30 questions in subscales 1 to 3, but that they were able to answer correctly only about three of the ten questions in subscale 4. If a criterion-referenced approach is adopted, requiring 90 percent of the teachers to answer 90 percent of the questions correctly, the results obtained are far from satisfactory:
Subscale | Number of items | M (SD)
---------|-----------------|-------
Nature & function of assessment | 10 | 5.53 (1.23)
Design & use of test items | 10 | 4.87 (1.38)
Interpretation of test results | 10 | 4.50 (1.58)
Concepts of reliability, validity etc. | 10 | 2.97 (1.50)

Table 1: Descriptive Statistics and Reliability Estimates at the Subscale Level
4.2 Confirmatory Factor Analysis

Confirmatory factor analysis was run to verify the proposed structure of assessment literacy. The tested model indicated good fit, as shown in Table 2. The incremental fit index CFI (1.00) is greater than .95, while the absolute fit index RMSEA (.001) is less than .06 (Hu & Bentler 1999). The RMSEA 95% confidence interval is narrow, and χ²/df (.57) is less than 3.00 (Kline 2011):
χ² | df | χ²/df | CFI | RMSEA | RMSEA 95% CI
---|----|-------|-----|-------|-------------
1.13 | 2 | .57 | 1.00 | .001 | .000-.093

Table 2: Fit Indices of the CFA Model
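For readers who wish to see how the fit indices in Table 2 relate to the chi-square statistic, the sketch below (Python) encodes the standard CFI and RMSEA formulas; the baseline-model chi-square used in the illustration is hypothetical, since it is not reported here:

```python
import math

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2, df, chi2_baseline, df_baseline):
    """Comparative Fit Index: 1 - max(chi2 - df, 0) / max(chi2 - df, chi2_b - df_b, 0)."""
    num = max(chi2 - df, 0.0)
    den = max(chi2 - df, chi2_baseline - df_baseline, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den

# Table 2 reports chi2 = 1.13 with df = 2 for N = 323 teachers; because chi2 < df,
# RMSEA is effectively zero, consistent with the reported value of .001.
print(round(rmsea(1.13, 2, 323), 3))                               # 0.0
# The baseline chi-square below (500 on 6 df) is purely illustrative.
print(round(cfi(1.13, 2, chi2_baseline=500.0, df_baseline=6), 2))  # 1.0
```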
Figure 2 presents the tested model with the estimated coefficients. The path coefficients of assessment literacy range from .21 to .49 (p<.05) and average .39, indicating that the latent variable is well-defined by the four variables. However, of the four path coefficients, those for the first three subscales are sizable, varying from .41 to .49, but that for the fourth subscale, which deals with statistical and measurement concepts, is rather low at .21, indicating that the other three subscales are better measures of assessment literacy:

Fig. 2: The Tested CFA Model of Assessment Literacy
4.3 Classical Item Analysis

Classical item analysis focuses on the test as a whole and on item indices (Facility and Discrimination) in a deterministic sense. The item indices thus obtained are sample-specific in that they will take different values when the items are trialed on a different sample. The classical item analysis has three focuses: item difficulty, item discrimination, and score reliability.
Item difficulty (p) is the proportion of teachers who chose the keyed option correctly. In fact, item difficulty has also been referred to as item facility or easiness, since a larger proportion denotes more correct responses. Hence, it is also referred to as the F-index (Facility Index).
Item discrimination (r) is the correlation between the correct response to an item and the total score for the test as a whole. This is, in fact, the item-total correlation, which indicates the extent to which an item is able to differentiate between high- and low-scoring teachers. Hence, it is also referred to as the D-index (Discrimination Index). These two item indices are shown in Tables 3 to 6 for the 40 items and the four subtests, respectively. The question of score reliability is discussed later, in Section 5.1.
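As a concrete illustration of the two classical indices (not the software actually used in the study), the sketch below computes the F-index and the D-index from a persons-by-items matrix of 1/0 scores; the data generated in the example are random and purely illustrative:

```python
import numpy as np

def classical_item_analysis(scores):
    """scores: persons-by-items array of 1/0 responses.
    Returns the F-index (facility, proportion correct) and the D-index
    (item-total correlation) for each item."""
    facility = scores.mean(axis=0)              # p for each item
    total = scores.sum(axis=1)                  # each teacher's total score
    discrimination = np.array([
        np.corrcoef(scores[:, j], total)[0, 1]  # item-total correlation
        for j in range(scores.shape[1])
    ])
    # A stricter 'corrected' variant would correlate each item with the
    # total of the remaining items (total - scores[:, j]) instead.
    return facility, discrimination

# Illustrative use with random data shaped like the trial (323 teachers, 40 items)
rng = np.random.default_rng(0)
demo = (rng.random((323, 40)) < 0.5).astype(int)
p, r = classical_item_analysis(demo)
print(p[:3].round(2), r[:3].round(2))
```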
4.3.1 Subtest 1: Nature and Functions of Assessment

Table 3 presents the item indices for Subtest 1, which deals with understanding the functions that assessment has in teaching and learning, and concepts related to the norm- and criterion-referenced interpretation of test scores. The p's vary from a very low 0.07 to a very high 0.94, with a mean of 0.55 (median 0.56). In short, the items vary widely in difficulty, although the mean suggests that this subtest is moderately difficult as a whole. At the same time, the r's vary from 0.13 to 0.33, with a mean of 0.23 (median 0.22). These figures indicate that the items have a low though acceptable discriminatory power.
Items | p | r
------|---|--
Item 1 | 0.93 | 0.22
Item 2 | 0.85 | 0.32
Item 3 | 0.10 | 0.17
Item 4 | 0.27 | 0.21
Item 5 | 0.93 | 0.30
Item 6 | 0.94 | 0.33
Item 7 | 0.92 | 0.29
Item 8 | 0.27 | 0.19
Item 9 | 0.26 | 0.14
Item 10 | 0.07 | 0.13
Mean | 0.55 | 0.23
Median | 0.56 | 0.22

Table 3: Item Indices for Subtest 1: Nature and Function of Assessment
4.3.2 Subtest 2: Design and Use of Test Items

The items of Subtest 2 deal with the understanding of the suitability of various item formats and their appropriate uses. The p's vary from a low 0.13 to a very high 0.96, with a mean of 0.49 (median 0.44). In short, these items vary widely in difficulty although the mean suggests that this subtest is moderately difficult as a whole. At the same time, the r's vary from 0.11 to 0.30, with a mean of 0.21 (median 0.22). These results indicate that the items have a low though acceptable discriminatory power:
Items | p | r
------|---|--
Item 11 | 0.50 | 0.24
Item 12 | 0.76 | 0.28
Item 13 | 0.29 | 0.11
Item 14 | 0.33 | 0.20
Item 15 | 0.50 | 0.16
Item 16 | 0.13 | 0.19
Item 17 | 0.37 | 0.11
Item 18 | 0.87 | 0.26
Item 19 | 0.96 | 0.30
Item 20 | 0.16 | 0.23
Mean | 0.49 | 0.21
Median | 0.44 | 0.22

Table 4: Item Indices for Subtest 2: Design and Use of Test Items
4.3.3 Subtest 3: Interpretation of Test Results

The items of Subtest 3 pertain to knowledge of item indices and the meanings of test scores. The p's vary from a very low 0.03 to a high 0.78, with a mean of 0.45 (median 0.51). These figures indicate that the items vary widely in difficulty although the mean suggests that this subtest is of moderate difficulty. At the same time, the r's vary from 0.05 to 0.47, with a mean of 0.24 (median 0.23). These results indicate that the subtest as a whole has acceptable discrimination:
Items | p | r
------|---|--
Item 21 | 0.14 | 0.05
Item 22 | 0.05 | 0.05
Item 23 | 0.39 | 0.23
Item 24 | 0.64 | 0.24
Item 25 | 0.75 | 0.18
Item 26 | 0.69 | 0.47
Item 27 | 0.43 | 0.22
Item 28 | 0.59 | 0.39
Item 29 | 0.03 | 0.10
Item 30 | 0.78 | 0.43
Mean | 0.45 | 0.24
Median | 0.51 | 0.23

Table 5: Item Indices for Subtest 3: Interpretation of Test Results
4.3.4 Subtest 4: Concepts of Reliability, Validity and Basic Statistics

Subtest 4 deals with abstract concepts of test score qualities and knowledge of simple statistics essential to understanding test results. The p's vary from a low 0.11 to a high 0.64, with a mean of 0.30 (median 0.25). These figures indicate that the items are difficult ones when compared with those of the other three subtests. The r's vary from 0.05 to 0.36, with a mean of 0.19 (median 0.17). These results indicate that the subtest as a whole has low discrimination:
Items | p | r
------|---|--
Item 31 | 0.13 | 0.11
Item 32 | 0.21 | 0.22
Item 33 | 0.49 | 0.26
Item 34 | 0.15 | 0.13
Item 35 | 0.11 | 0.05
Item 36 | 0.56 | 0.36
Item 37 | 0.29 | 0.13
Item 38 | 0.28 | 0.21
Item 39 | 0.64 | 0.30
Item 40 | 0.12 | 0.10
Mean | 0.30 | 0.19
Median | 0.25 | 0.17

Table 6: Item Indices for Subtest 4: Concepts of Reliability, Validity, and Basic Statistics
On average, the 40 items of the scale generally have an acceptable level of facility, which means that they are neither too easy nor too difficult for the teachers involved in this study. However, the items tend to have low discriminatory power. These findings could, at least in part, be accounted for by the discernible deficits in the assessment literacy of the teachers taking part in this study, who obtained an overall mean of 18 on the 40 items.
4.4 Rasch Analysis

The Rasch analysis estimates the difficulty of an item in terms of the probability that teachers of given levels of ability will pass (or fail) the item, thus locating the item on a continuum of difficulty, hence the term location. Rasch (1993) defines the estimate for each item as an outcome of the linear probabilistic interaction of a person's ability and the difficulty of a given item. The goodness of fit of an item to the measurement model is evaluated with reference to its Outfit MSQ and Infit MSQ. However, there are no hard-and-fast rules for evaluating the fit statistics. A range of between 0.7 and 1.3 is recommended as the optimal goodness of fit for MCQ items (Wright & Linacre 1994). By this criterion, fit statistics lower than 0.7 suggest that the items are over-fitting (too predictable), and those greater than 1.3 that they are under-fitting (too unpredictable).
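To make the fit statistics concrete, the following sketch (illustrative only, not the estimation routine used in the study) encodes the dichotomous Rasch model together with the usual definitions of Outfit MSQ (the unweighted mean of squared standardized residuals) and Infit MSQ (the information-weighted mean square) for a single item; the ability and response values in the example are arbitrary:

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the dichotomous Rasch model: ability theta, item location b (logits)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(x, theta, b):
    """Outfit and Infit mean-squares for one item.
    x: 1/0 responses of all persons; theta: estimated abilities; b: item location."""
    p = rasch_prob(theta, b)
    w = p * (1 - p)                           # model variance (information) per person
    outfit = np.mean((x - p) ** 2 / w)        # unweighted mean square of standardized residuals
    infit = np.sum((x - p) ** 2) / np.sum(w)  # information-weighted mean square
    return outfit, infit

# Arbitrary illustration: six persons answering an item located at the
# median difficulty reported in Table 7 (0.260 logits).
theta = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 2.0])
x = np.array([0, 0, 1, 0, 1, 1])
print(item_fit(x, theta, b=0.260))
```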
Table 7 shows the Outfit and Infit MSQs of the 40 items in descending order of item difficulty. As can be seen there, the item estimates vary from -3.664 to 3.245, with a mean of 0.000 (median 0.260). These show that the items cover a wide range of difficulty, and the median indicates that the items as a set are somewhat on the difficult side of the scale. Table 7 also shows that the Infit MSQs vary between 0.690 and 1.146 (with a mean of 0.965 and a median of 0.999) and that the Outfit MSQs vary between 0.859 and 1.068 (with a mean of 0.970 and a median of 0.978). These show that the item fit statistics all fall within the recommended 0.7-1.3 range, and therefore all 40 items of the scale fit the Rasch model well.
In Table 7, the 40 items are classified into three groups of difficulty. The 'difficult' group comprises 11 items. These have facilities (p's) less than 0.2, indicating that fewer than 20 percent of teachers were able to answer these items correctly. These items are separated from the rest by a natural gap in Rasch difficulties between 1.417 and 1.125. As the brief content shows, most of these difficult items deal with some quantitative aspects of test items (Items 21, 22, 29, 34, 35, and 40). The other items deal with diagnosis, function, assessing written expression, and above-level assessment. Generally, answering these questions correctly requires specific training in assessment, which many of the teachers do not have, especially with regard to those items which are quantitative in nature.
At the other end, the 'easy' group comprises seven items which have facilities (p's) greater than .80, indicating that 80% or more of the teachers answered them correctly. They are separated by a natural gap in Rasch difficulties between -1.579 and -2.009. The content shows that these items deal with concepts which can be gained through experience and are commonsensical in nature.
In between the two extreme groups are the 'appropriate' items: 22 items which have facilities (p's) greater than .20 and less than .80. This means that between 20% and 80% of the teachers chose the correct answers for these items. Their Rasch difficulties span from 1.125 to -1.579. In terms of item content, only three are from Subtest 1 (Nature and Function), most of whose other items fall into the 'easy' group. There are six items from Subtest 2 (Design and Use of Test Items), seven items from Subtest 3 (Interpretation of Test Results), and six items from Subtest 4 (Reliability, Validity, and Basic Statistics). These clearly show where the deficits in assessment literacy are among the teachers:
Item Number | Brief content | Item difficulty | Outfit MSQ | Infit MSQ | p
------------|---------------|-----------------|------------|-----------|---
Difficult Items | | | | |
29 | Interpretation of a T-score | 3.245 | 0.961 | 0.930 | 0.03
22 | Minimum passing rate for MCQ item | 2.690 | 1.083 | 0.965 | 0.05
10 | Diagnosing students' learning difficulty | 2.366 | 1.042 | 0.947 | 0.07
3 | Educational function of assessment | 1.968 | 0.908 | 0.949 | 0.10
35 | Relations between reliability and validity | 1.902 | 1.146 | 0.997 | 0.11
40 | Basic statistical concept: correlation | 1.809 | 1.003 | 0.985 | 0.12
16 | Assessing written expression | 1.693 | 0.906 | 0.954 | 0.13
31 | Above-level assessment | 1.693 | 0.987 | 0.993 | 0.13
21 | Good MCQ F-index | 1.639 | 1.142 | 1.017 | 0.14
34 | Checking reliability of marking | 1.561 | 1.034 | 0.983 | 0.15
20 | Disadvantage of essay writing | 1.417 | 0.907 | 0.952 | 0.16
Appropriate Items | | | | |
32 | Below-level assessment | 1.125 | 1.025 | 0.960 | 0.21
9 | Criterion-referenced testing | 0.825 | 1.131 | 1.015 | 0.26
4 | Direct function of assessment | 0.777 | 1.001 | 0.988 | 0.27
8 | Norm-referenced testing | 0.777 | 0.984 | 1.002 | 0.27
38 | Basic statistical concept: central tendencies | 0.698 | 0.997 | 0.993 | 0.28
13 | Language ability scrambled sentence | 0.682 | 1.122 | 1.044 | 0.29
37 | Basic statistical concept: skewed distribution | 0.651 | 1.048 | 1.041 | 0.29
14 | Reading comprehension and validity | 0.489 | 1.015 | 1.009 | 0.33
17 | Weakness of objective items | 0.321 | 1.064 | 1.068 | 0.37
23 | Options of MCQ item | 0.199 | 1.007 | 1.008 | 0.39
27 | Concept of measurement error | 0.028 | 1.021 | 1.012 | 0.43
33 | Inter-rater reliability | -0.203 | 1.005 | 0.998 | 0.49
15 | Use of cloze procedures | -0.241 | 1.059 | 1.052 | 0.50
11 | Advantage of MCQ | -0.254 | 1.018 | 1.008 | 0.50
36 | Basic statistical concept: mode | -0.485 | 0.931 | 0.942 | 0.56
28 | Nature of the T-score | -0.629 | 0.905 | 0.921 | 0.59
24 | D-index of MCQ item | -0.832 | 1.004 | 1.002 | 0.64
39 | Basic statistical concept: standard deviation | -0.832 | 0.969 | 0.970 | 0.64
26 | Interpretation of a test score | -1.062 | 0.830 | 0.867 | 0.69
25 | Choice of topic to write | -1.398 | 1.035 | 1.013 | 0.75
12 | Language ability and MCQ | -1.443 | 0.930 | 0.973 | 0.76
30 | Difference between test scores | -1.579 | 0.801 | 0.875 | 0.78
Easy Items | | | | |
2 | Test results to help student | -2.009 | 0.859 | 0.917 | 0.85
18 | Critical factor in written expression | -2.251 | 0.879 | 0.939 | 0.87
7 | Most important of class tests | -2.741 | 0.801 | 0.921 | 0.92
1 | Most important function of assessment | -2.876 | 0.869 | 0.952 | 0.93
5 | Use of assessment results | -2.924 | 0.748 | 0.903 | 0.93
6 | Best things to help students make progress | -3.142 | 0.690 | 0.880 | 0.94
19 | Sex, race, and SES biases | -3.664 | 0.738 | 0.859 | 0.96
Summary Statistics | | | | |
Minimum | | -3.664 | 0.690 | 0.859 | -
Maximum | | 3.245 | 1.146 | 1.068 | -
Mean | | 0.000 | 0.965 | 0.970 | -
Median | | 0.260 | 0.999 | 0.978 | -

Table 7: Item Estimates and Fit Statistics
4.5 Correlations between Classical Facilities and Rasch Estimates

A question that has often been asked is whether the two approaches to item analysis (the classical approach and the Rasch approach) yield comparable results and, if they differ, to what extent. It is therefore interesting to note that when the classical p's and the Rasch estimates of the 40 items were correlated, this resulted in a correlation of |r| = 0.99. This very high correlation coefficient indicates that the two approaches to item calibration yielded highly similar results, and this corroborates many recent studies (e.g. Fan 1998, Magno 2009, Prieto, Alonso & Lamarca 2003) which also report high correlations between classical item indices and Rasch estimates. For example, Fan (1998), using data from a large-scale evaluation of high school reading and mathematics in Texas, analyzed 20 samples of 1,000 students each and obtained correlations between the item indices of the two approaches varying from |r| = 0.803 to |r| = 0.920. These suggest that the two sets of item estimates shared between 65 and 85 percent of common variance. Figure 3 is a scatter plot of the two types of indices for the present study, and the curve shows a near-perfect (negative) correlation:
Fig. 3: Scatter Plot for p’s (X-axis) and Locations (Y-axis)
5 Discussion
In the present study, the psychometric features and factorial validity of the newly developed Teacher Assessment Literacy Scale have been investigated. In the following section, we first discuss the issue of reliability and then build a validity argument based on the findings of the analyses.
5.1 Issue of Reliability

The conventional method of assessing score reliability is Cronbach's alpha coefficient, which indicates the degree of internal consistency among the items, on the assumption that the items are homogeneous. The 40 items of the scale are scored 1 (right) or 0 (wrong), and therefore the Kuder-Richardson Formula 20 (KR20), which is a special case of Cronbach's alpha for dichotomous items, was calculated.
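For reference, KR20 can be computed directly from a scored (1/0) response matrix, as in this brief illustrative sketch (the trial data themselves are not reproduced here, and the variable name in the comment is hypothetical):

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson Formula 20 for a persons-by-items matrix of 1/0 scores:
    KR20 = k/(k - 1) * (1 - sum(p*q) / variance of total scores)."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                     # item facilities
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / total_var)

# Example: kr20(primary_scores) would return the whole-test value reported in
# Table 8 if primary_scores held the actual 170-by-40 response matrix.
```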
Table 8 below shows the KR20 reliabilities of the four subscales and the scale as a whole. The reliability coefficients vary from KR20=.18 to .40, with a median of 0.36. Moreover, for the scale as a whole and the total sample of combined primary and secondary teachers, Cronbach's internal consistency coefficient is α=0.471. These indices are generally low, compared with the conventional expectation of a minimum of 0.7. This definitely leads to the question of trustworthiness.
Measure | Primary | Secondary
--------|---------|----------
Nature and functions of assessment | .28 | .37
Design and use of test items | .34 | .38
Interpretation of test results | .36 | .18
Reliability, validity, etc. | .36 | .40
Whole test | .34 | .58

Table 8: KR20 Reliabilities
However, there have been criticisms of Cronbach's alpha as a measure of item homogeneity or unidimensionality (Bademci 2006). One condition which might have led to the low reliabilities shown in Table 8 is the heterogeneous nature of item content among the 40 items, since they cover many different aspects of educational measurement, some being qualitative and others quantitative in nature, even within a particular subtest. This being the case, the conventional reliability measures (i.e. Cronbach's alpha and its equivalent KR20), which assume item homogeneity, are rendered unsuitable for the purpose of the present study.
Group homogeneity is another factor contributing to low score reliability. Pike & Hudson (1998: 149) discussed the limitations of using Cronbach's alpha (and its equivalent KR20) to estimate reliability when using a sample with homogeneous responses in the measured construct and described the risk of drawing the wrong conclusion that a new instrument has poor reliability. They demonstrated the use of an alternative statistic that may serve as a cushion against such a situation and recommended the calculation of a Relative Alpha based on the standard error of measurement (SEM), which itself involves the reliability, as shown in the formula:

SEM = SD * SQRT(1 - reliability)
Pike & Hudson's Relative Alpha can take a value between 0.0 and 1.0, indicating the extent to which the scores can be trusted, thereby offering an alternative way to evaluate score reliability. Their formula is:

Relative Alpha = 1 - SEM^2 / (Range/6)^2
In this formula, SEM is the usual indicator of the lack of trustworthiness of the scores obtained and, under normal circumstances, the scores on a scale will theoretically span six standard deviations. Thus, the second term on the right is an indication of the proportion of test variance that is unreliable, and Relative Alpha indicates the proportion of test variance that remains after offsetting this unreliable portion, i.e., the proportion of test variance that is trustworthy.
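The two formulas translate directly into the short sketch below; the input values in the example are hypothetical, and no attempt is made to reproduce the exact entries of Table 9, since the observed score ranges used there are not reported:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def relative_alpha(sem_value, score_range):
    """Pike & Hudson's Relative Alpha = 1 - SEM^2 / (Range / 6)^2."""
    return 1 - sem_value ** 2 / (score_range / 6) ** 2

# Hypothetical illustration: SD = 5.0, conventional reliability = .50, range = 40.
s = sem(5.0, 0.50)
print(round(s, 2), round(relative_alpha(s, 40), 2))  # 3.54 0.72
```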
In the present study, the maximum possible score is 40, and the theoretically possible standard deviation is 6.67 (= 40/6). However, the actual data yield standard deviations of 4.24 (primary) and 4.66 (secondary) for the scale as a whole, which are 0.64 and 0.70, respectively, of the theoretical standard deviation. In other words, the two groups are found to be more homogeneous than theoretically expected.
Table 9 shows the Relative Alphas for the primary and secondary groups. The statistics suggest that much of the test variance has been captured by the 40-item scale, and the scores can therefore be trusted:
Measure | Primary | Secondary
--------|---------|----------
Nature and functions of assessment | .98 | .97
Design and use of test items | .95 | .95
Interpretation of test results | .94 | .93
Reliability, validity, and basic statistics | .95 | .95
Whole test | .97 | .97

Table 9: Relative Alpha Coefficients
5.2 Validity Evidence

Regarding content-referenced evidence, the scale was developed based on a model resulting from an analysis of empirical data and a survey of the relevant literature (Fulcher 2012). In addition, a content analysis was conducted on the scale for better content representation, and the Rasch analysis provides further content-referenced evidence. Substantive validity evidence refers to the relationship between the construct and the data observed (Wolfe & Smith 2007a, b). In the current study, the Rasch analysis, the Infit and Outfit statistics, as well as the confirmatory factor analysis provide substantive evidence. Also, the alignment between the results of the classical test theory analysis and the Rasch analysis further supports the validity argument.
Apart from direct validity evidence, information beyond the test scores is needed to verify score validity. Ideally, the criterion scores for validity would come from a test of the application of measurement concepts and techniques, but such information is not available within the results obtained, although some of the 40 items of the scale are of this nature (for instance, those items on statistical concepts). However, indirect evidence of score validity is provided by the teachers' responses to the open-ended question asking for comments and suggestions with regard to educational assessment.
For the open-ended question, the teachers made 36 responses. Most of the responses reflect the teachers' realization that assessment plays an important role in their teaching, for which specialized knowledge is needed. Examples of such responses are shown below:
(1) What is taught and what is assessed should be consistent.
(2) Teachers need to have knowledge of educational measurement.
(3) Teachers need to popularize knowledge of assessment among the school leaders.
(4) Teachers hope to gain knowledge of educational measurement to assess with in-depth understanding.
(5) Without any knowledge of educational measurement, data analysis of results conducted in the school is superficial.
(6) Teachers think that knowledge of education management is very much needed!
(7) The survey will help to improve teaching.
The second type of responses (i.e., 8, 9, 10, and 11) reflects the difficulty that teachers had in understanding items which involve technical concepts and terminologies. Such responses are expected in view of the lack of more formal and intensive training in educational assessment. Examples of such responses are shown below:
(8) Teachers are not familiar with the technical terms.
(9) Some teachers do not understand technical terms.
(10) Some teachers don't understand some of the questions.
(11) They do not know many mathematical terms.
The third type of responses (i.e., 12 and 13) reflects the teachers' need to be convinced that assessment training is necessary for them to use assessment results properly as part of their instruction. Examples of such responses are shown below:
(12) They want to know whether assessment can really raise students' achievement and attitude, and whether it will add to teachers' workload and be helpful to students.
(13) They wonder if data help in formative assessment.
These responses reveal and re-affirm the teachers' test-taking attitudes when responding to the scale; their seriousness with the scale is clearly evident. The second type of responses (i.e., 8, 9, 10, and 11) corroborates the finding that teachers lack relevant specific training in educational assessment and hence found the technical terms and concepts unfamiliar. These findings truly reflect their situation and lack of knowledge. The third type of responses indicates the reservation and inquisitiveness of some of the teachers; this indirectly reflects that they have to be convinced that they need more training in educational measurement. Thus, when read in context, these responses provide indirect evidence of the validity of the scores.
6 Conclusions
By way of summary, this article presents preliminary evidence of the psychometric quality and the content-referenced and substantive validity of the newly developed scale. As pointed out by Popham (2006), there is a similarity between the healthcare and teaching professions in that practitioners need to be able to properly read information about the people they serve as a prerequisite to what they intend and need to do. Thus, the importance of teachers' assessment literacy cannot be over-emphasized. There is therefore a need for an instrument that can help gauge this crucial understanding and these skills of teachers. However, interest in this regard has a rather short history, and there are fewer than a handful of such measurement tools at our disposal at the moment.
The new scale reported here is an attempt to fill the vacuum. It covers essential conceptual skills of educational measurement which are a need-to-know for teachers if they are to perform this aspect of their profession adequately. The new scale is found to be on the 'difficult' side, partly due to a lack of relevant training among the teachers who provided the data. However, it is encouraging that its items have also been found to fit the measurement model reasonably well. What needs to be done from here on is to apply the scale to larger and more representative samples of teachers in varied contexts and subjects for its consolidation. In short, the current study is the alpha, far from being the omega.
In addition, it should be pointed out that there are several limitations to this study. Most importantly, the homogeneity of the teachers (i.e. they were all Chinese Language teachers in Singapore) might detract from the validity of the items and, hence, of the scale as a whole. Further research should involve teachers of different subjects and geographical backgrounds. With more studies done with these considerations in mind, teachers' assessment literacy can be better measured and investigated. In order to facilitate further research, the scale is available from the authors on request.
References
American Federation of Teachers, National Council on Measurement in Education & National Education Association (1990). The Standards for Teacher Competence in the Educational Assessment of Students. (http://www.unl.edu/buros/article3.html; 22-07-2003).
Bademci,
V. (2006). Cronbach’s
Alpha is not
a
Measure of Unidimensionality or Homogeneity.
Paper presented
at the conference “Paradigm shift: Tests are not reliable" at
Gazi University, 28 April 2006.
DeLuca, C., LaPointe-McEwan, D. & Luhanga, U. (2016). Teacher assessment literacy: A review of international standards and measures. Educational Assessment, Evaluation and Accountability, 28(3), 251-272.
Fan,
X. (1998). Item response theory and classical test theory: an
empirical comparison of their item/person statistics. Educational
and Psychological Measurement, 58 (3),
357-373.
Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9, 113-132.
Gipps,
C. (1994). Beyond
Testing: Towards a Theory of Educational Assessment.
London: Falmer Press.
Gotch,
C. M. & French, B. F. (September 2014). A systematic review of
assessment literacy measures. Educational Measurement: Issues and
Practice, 33(2), 14-18.
Hopkins,
K. D. (1998).
Educational and Psychological Measurement and Evaluation. Needham
Heights, MA: Allyn & Bacon.
Kenpro
Project Organization (2012). Sample Size Determination Using Krejcie and
Morgan Table.
(http://www.kenpro.org/sample-size-determination-using-krejcie-and-morgan-table;
07-07-2017).
Linn,
R. L. and Miller, M. D. (2005). Measurement
and Assessment in Teaching, Ninth Edition.
Upper Saddle River, New Jersey: Pearson, Merrill, Prentice Hall.
Magno,
C. (2009). Demonstrating
the difference between Classical Test Theory and Item Response Theory
using derived test data. The
International Journal of Educational and Psychological Assessment 1
(1),
1-11.
Malone,
M. (2008). Training in language assessment. In E. Shohamy & N. H.
Hornberger (Eds.), Encyclopedia
of language and education, Vol. 7, Language testing and assessment
(pp. 273-284). New York, NY: Springer.
Mertler,
C. A. (2003). Preservice
versus Inservice Teachers’ Assessment Literacy: Does Classroom
Experience Make a Difference? Paper
presented at the annual meeting of the Mid-Western Educational
Research Association, Columbus, OH (Oct. 15–18, 2003).
Mertler,
C. A. and Campbell, C. (2005). Measuring
Teachers’ Knowledge and Application of Classroom Assessment
Concepts: Development of the Assessment Literacy Inventory. Paper
presented at the annual meeting of the American Educational Research
Association, Montréal, Quebec, Canada April 11–15, 2005.
Pike,
C. K. and Hudson, W. W. (1998). Reliability and Measurement Error in
the Presence of Homogeneity.
Journal
of Social Service Research, 24 (1/2),
149-163.
Plake,
B. & Impara, J. C. (1993). Assessment competencies of teachers: A
national survey. Educational
Measurement: Issues and Practice, 12
(4), 10-12.
Popham,
J. (2006). All about Accountability / Needed: A Dose of Assessment
Literacy. Educational
Leadership, 63(6),
84-85.
(http://www.ascd.org/publications/educational-leadership/mar06/vol63/num06/Needed@-A-Dose-of-Assessment-Literacy.aspx;
07-07-2017).
Popham,
W. J. (2009). Assessment literacy for teachers: faddish or
fundamental? Theory Into Practice, 48, 4-11. DOI:
10.1080/00405840802577536.
Prieto, L., Alonso, J. & Lamarca, R. (2003). Classical test theory versus Rasch analysis for quality of life questionnaire reduction. Health and Quality of Life Outcomes, 1:27. DOI: 10.1186/1477-7525-1-27.
Rasch, G. (1993). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: MESA Press.
Soh,
K. (2016). Understanding Test and Exam Results Statistically: An
Essential Guide for Teachers and School Leaders. New York: Springer.
Soh,
K. C. (2011). Above-level testing: assumed benefits and consequences.
Academy of Singapore Teachers: i.d.e.a2,
Issue 2, October 2011, pp. 3-7.
(http://www.academyofsingaporeteachers.moe.gov.sg/ast/slot/u2597/images/IDEA2/IDEA2_Issue2.pdf;
07-07-2017).
Spolsky,
B. (1978). Introduction: Linguists and language testers. In B.
Spolsky (Ed.), Approaches
to Language Testing: Advances in Language Testing Series: 2
(pp. V-X). Arlington, VA: Center for Applied Linguistics.
Spolsky,
B. (1995). Measured
Words: The Development of Objective Language Testing. Oxford:
Oxford University Press.
Stiggins,
R. J. (1991). Assessment literacy. Phi
Delta Kappan, 72(7),
534-539.
Wolf,
D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their
minds well: Investigating new forms of student assessment. Review
of Research in Education, 17,
31-125.
Wolfe,
E. W., & Smith, E. V. Jr. (2007a). Instrument development tools
and activities for measure validation using Rasch models: Part
I—Instrument development tools. Journal of Applied Measurement, 8,
97–123.
Wolfe,
E. W., & Smith, E. V. Jr. (2007b). Instrument development tools
and activities for measure validation using Rasch models: Part
II—Validation activities. Journal of Applied Measurement, 8,
204–233.
Wright,
B. D. and Linacre, J. M. (1994). Reasonable mean-square fit values.
Rasch
Measurement Transactions, 8 (3),
370. (http://www.rasch.org/rmt/rmt83b.htm;
07-07-2017).
Authors:

Dr Kay Cheng Soh
Research Consultant
Nanyang Technological University
Singapore Centre for Chinese Language
287 Ghim Moh Road
Singapore 279623
E-mail: sohkc@singnet.com.sg

Dr Limei Zhang
Lecturer
Nanyang Technological University
Singapore Centre for Chinese Language
287 Ghim Moh Road
Singapore 279623
E-mail: limeizh2008@yahoo.com