Four Pillars of Assessment: Reliability

Published On: December 3, 2017

By Stuart Kime

This blog post on assessment reliability was first published as a guest post on The Association of School and College Leaders’ (ASCL) website. In previous blogs we looked at fitness for purpose and validity of judgements and conclusions. In this blog, we turn our focus to reliability.

What is a reliable assessment?

Have you ever weighed yourself in the morning, and then again in the afternoon? If you have, you probably got slightly different readings each time. So how much do you weigh? Which is the correct reading (if either of them is indeed ‘correct’)? Most people answer this question with the obvious response (‘the lower one’), but at the heart of the issue is the reliability of the measurement: its accuracy and consistency over time and across contexts.

Reliability in the assessment of student learning is also about accuracy and consistency and, as a rule, the higher the stakes of the decision we want to make based on assessment information, the more accurate and consistent we want the information to be. High-stakes decisions need highly reliable information. As we saw with validity, a determination of how reliable an assessment needs to be is informed by its intended end uses.

How reliable is your assessment?

There are lots of factors which contribute to the reliability of an assessment, but two of the most critical for teachers to acknowledge are:

the precision of the questions and tasks used in prompting students’ responses;
the accuracy and consistency of the interpretations derived from assessment responses.

Designing questions and assessment processes which work in the same way for different students at different points in time is a skill to be honed, but one that can pay repeated dividends to teachers and their students.

No assessment is 100% reliable

An assessment is a means by which we can create a set of circumstances in which a student can represent their knowledge, skill and understanding in an observable form. Because it is a proxy for something unseen, and because interpretation is often part of making sense of the information derived from an assessment, error is always present in some form or other.

Some (of the many) sources of error include:

the assessor’s unfamiliarity with the topic being assessed
the assessor’s unfamiliarity with robust assessment practices
bias (teachers are human, after all!)
the subjectivity of the material to be assessed
the conditions in which students take the assessment

Improving assessment reliability

There are lots of ways in which classroom assessment practices can be improved in order to increase reliability, and one of the most immediate is to improve so-called inter-rater reliability and intra-rater reliability.

Inter-rater reliability: getting people to agree with one another on simple matters can be hard enough, so when it comes to complex judgements (such as whether the grades two teachers award independently for the same writing task are consistent with each other), reliability challenges arise.
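To make the idea of agreement between two teachers concrete, here is a minimal sketch in Python. It uses Cohen's kappa, a widely used statistic for agreement between two raters that corrects for agreement expected by chance. The statistic is an illustrative choice and is not prescribed by this post; the function name and the grades below are made up for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same pieces of work.

    Both arguments are equal-length lists of categorical grades,
    where position i in each list refers to the same piece of work.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must grade the same non-empty set of work")

    n = len(rater_a)

    # Proportion of pieces of work on which the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, based on how often each rater used each grade.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical grade throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical grades awarded independently by two teachers to eight scripts.
teacher_a = ["A", "B", "B", "C", "A", "B", "C", "C"]
teacher_b = ["A", "B", "C", "C", "A", "B", "B", "C"]

kappa = cohens_kappa(teacher_a, teacher_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A value of 1 means the two teachers agreed on every script, and a value near 0 means they agreed no more often than chance would predict. In this made-up example the teachers agree on six of the eight scripts, yet kappa is noticeably lower than 0.75 because some of that agreement would be expected by chance alone.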

Intra-rater reliability: most people acknowledge that it is difficult to achieve high levels of inter-rater reliability, but an often overlooked challenge also comes from the accuracy and consistency of one’s own judgements.

Imagine your responses to a set of different assessment tasks of the same quality, marked at different times during the day, week, month and year. Particularly in areas of subjectivity – where judgement is needed – you can imagine how your decisions, comments and grading of assignments may vary depending on time of day, hunger, how many other tasks you’re juggling in your mind, caffeine ingestion…

Improving rater reliability: improving reliability begins by acknowledging that assessments always have a degree of unreliability inherent in them. Improving reliability will improve the quality of the information derived from the assessment process, thus increasing its potential value to teachers and students. Below are three ways to improve reliability of assessment in school:

Use exemplar student work to clarify what success looks like in specific assignments: be explicit about the success criteria;
Blind-mark assignments: this reduces bias and increases rater reliability;
Blind-moderate samples of students’ work: this increases rater reliability and also offers a good professional development opportunity to share standards.
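As one possible way to set up blind marking, the sketch below assigns each student a random code so that markers see only the code, not the name. It is a minimal illustration under stated assumptions: the function name, the code format and the student names are hypothetical, and in practice the key linking codes to names would be held by someone other than the marker.

```python
import random


def anonymise_scripts(student_names, seed=None):
    """Return a dict mapping each student name to a random script code.

    The codes (S001, S002, ...) are shuffled so that a code gives no
    hint of a student's position in the class list.
    """
    rng = random.Random(seed)
    codes = [f"S{i:03d}" for i in range(1, len(student_names) + 1)]
    rng.shuffle(codes)
    return dict(zip(student_names, codes))


# Hypothetical class list.
names = ["Amira", "Ben", "Chloe", "Dev"]
key = anonymise_scripts(names)

# The marker works from the codes only; the key is used afterwards
# to match marks back to students.
for code in sorted(key.values()):
    print(code)
```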

Given that information from assessments is used to make decisions about the needs and progress of pupils, shouldn’t we be able to answer the question “how reliable is your assessment?” How many of us could?