This blog post on assessment reliability was first published as a guest post on The Association of School and College Leaders' (ASCL) website. In previous blogs we looked at fitness for purpose and validity of judgements and conclusions. In this blog, we turn our focus to reliability.
What is a reliable assessment?
Have you ever weighed yourself in the morning, and then again in the afternoon? If you did, you probably got slightly different readings each time. So how much do you weigh? Which is the correct reading (if either of them is indeed "correct")? Most people answer this question with the obvious response ("the lower one"), but at the heart of the issue is the reliability of the measurement: its accuracy and consistency across time and context.
Reliability in the assessment of student learning is also about accuracy and consistency and, as a rule, the higher the stakes of the decision we want to make based on assessment information, the more accurate and consistent we want the information to be. High-stakes decisions need highly reliable information. As we saw with validity, a determination of how reliable an assessment needs to be is informed by its intended end uses.
How reliable is your assessment?
There are lots of factors which contribute to the reliability of an assessment, but two of the most critical for teachers to acknowledge are:
- the precision of the questions and tasks used in prompting students' responses;
- the accuracy and consistency of the interpretations derived from assessment responses.
Designing questions and assessment processes which work in the same way for different students at different points in time is a skill to be honed, but one that can pay repeated dividends to teachers and their students.
No assessment is 100% reliable
An assessment is a means by which we can create a set of circumstances in which a student can represent their knowledge, skill and understanding in an observable form. Because it is a proxy for something unseen, and because interpretation is often part of making sense of the information derived from an assessment, error is always present in some form or other.
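One way to see why error never fully disappears is with a toy model in the spirit of classical test theory, in which an observed score is treated as a true score plus random error. The sketch below (in Python; the student numbers and spread values are invented purely for illustration) simulates the same 200 students sitting the same test twice, and shows the correlation between the two sittings falling short of 1.0 even though nothing about the students has changed:

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(42)

# Classical test theory: observed score = true score + random error.
# Simulate 200 students taking the "same" test twice; the error term
# changes between sittings even though true ability does not.
true_scores = [random.gauss(50, 10) for _ in range(200)]
sitting_1 = [t + random.gauss(0, 5) for t in true_scores]
sitting_2 = [t + random.gauss(0, 5) for t in true_scores]

# Test-retest reliability: correlation between the two sittings.
reliability = statistics.correlation(sitting_1, sitting_2)
print(f"test-retest reliability = {reliability:.2f}")  # well below 1.0
```

In this model the test-retest correlation can never exceed the ratio of true-score variance to observed-score variance, which is one standard definition of reliability.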
Some (of the many) sources of error include:
- the assessor's unfamiliarity with the topic being assessed
- the assessor's unfamiliarity with robust assessment practices
- bias (teachers are human, after all!)
- the subjectivity of the material to be assessed
- the conditions in which students take the assessment
Improving assessment reliability
There are lots of ways in which classroom assessment practices can be improved to increase reliability, and one of the most immediate is to improve so-called inter-rater reliability and intra-rater reliability.
Inter-rater reliability: getting people to agree with one another on simple matters can be hard enough, so when it comes to complex judgements (such as whether the grades two teachers award independently for the same writing task are consistent with each other), reliability challenges arise.
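Inter-rater agreement can be put on a numerical footing. Here is a minimal sketch in Python: the grades are invented, and the choice of Cohen's kappa (a chance-corrected agreement statistic) is one common option rather than a prescribed method:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same scripts.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters graded at random
    using their own marginal grade frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: proportion of scripts given the same grade.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's grade frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical grades two teachers gave the same ten pieces of writing.
teacher_1 = ["A", "B", "B", "C", "A", "C", "B", "A", "C", "B"]
teacher_2 = ["A", "B", "C", "C", "A", "B", "B", "A", "C", "C"]

print(f"kappa = {cohens_kappa(teacher_1, teacher_2):.2f}")
```

Raw percentage agreement flatters raters, because two people grading at random will sometimes agree by luck; kappa discounts that chance agreement.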
Intra-rater reliability: most people acknowledge that it is difficult to achieve high levels of inter-rater reliability, but an often overlooked challenge lies in the accuracy and consistency of one's own judgements.
Imagine marking a set of different assessment tasks of the same quality, but at different times during the day, week, month and year. Particularly in areas of subjectivity, where judgement is needed, you can imagine how your decisions, comments and grading of assignments may vary depending on the time of day, hunger, how many other tasks you're juggling in your mind, caffeine intake…
Improving rater reliability: improving reliability begins by acknowledging that assessments always have a degree of unreliability inherent in them. Improving reliability will improve the quality of the information derived from the assessment process, thus increasing its potential value to teachers and students. Below are three ways to improve the reliability of assessment in school:
- Use exemplar student work to clarify what success looks like in specific assignments, and be explicit about the criteria;
- Blind-mark assignments: this reduces bias and increases rater reliability;
- Blind-moderate samples of students' work: this increases rater reliability and also offers a good professional development opportunity to share standards.
Given that information from assessments is used to make decisions about the needs and progress of pupils, shouldn't we be able to answer the question "how reliable is your assessment?" And how many of us could?
Whatโs next?
In our next post we will conclude this series with an examination of the fourth pillar of assessment: value.
***
"Understanding Reliability" is one unit of learning from the Assessment Lead Programme, offered by Assessment Academy. The programme is designed to offer a grounding to school teachers (primary and secondary) in assessment theory, design and analysis, along with practical tools, resources and support to help improve the quality and efficiency of assessment in your school.
Within the NMM platform, using Comparative Judgement, there is the facility to measure and see a judge's "infit" score, an excellent indicator of both inter- and intra-rater reliability. We have built use of this platform into our practice for both staff and students, so that everyone develops a greater sense of what constitutes a good piece of work, and can examine their own consistency in making judgements, as measured by the infit score.
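For readers curious about what sits behind an "infit" score, the sketch below assumes the information-weighted mean-square form that is standard in the Rasch and comparative-judgement literature; it is illustrative only, not a description of NMM's internal implementation, and the script qualities and judgements are invented:

```python
import math

def infit(judgements, quality):
    """Information-weighted mean-square fit for one judge.

    judgements: list of (script_left, script_right, chose_left) tuples.
    quality:    dict mapping script id -> estimated quality (in logits),
                assumed to come from a prior Bradley-Terry fit.

    infit = sum((y - p)^2) / sum(p * (1 - p)); values near 1.0 suggest
    the judge's decisions are about as consistent as the model expects,
    while values well above 1.0 suggest erratic or inconsistent judging.
    """
    sq_residual = 0.0
    information = 0.0
    for left, right, chose_left in judgements:
        # Bradley-Terry probability that the left script wins.
        p = 1.0 / (1.0 + math.exp(quality[right] - quality[left]))
        y = 1.0 if chose_left else 0.0
        sq_residual += (y - p) ** 2
        information += p * (1.0 - p)
    return sq_residual / information

# Invented example: script qualities from a previous fit, plus one
# judge's decisions on four pairs.
quality = {"s1": 1.2, "s2": 0.4, "s3": -0.3, "s4": -1.0}
judgements = [
    ("s1", "s3", True),   # picked the stronger script: expected
    ("s2", "s4", True),   # expected
    ("s1", "s2", False),  # picked the weaker script: surprising
    ("s3", "s4", True),   # expected
]

print(f"infit = {infit(judgements, quality):.2f}")
```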