Research and analysis

Classification accuracy in results from KS2 National Curriculum Tests: Summary

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Overview

Overview by: John Winkley, AlphaPlus Consultancy Ltd.

The report describes the results of a seminar and preceding research work investigating the reliability of level classifications in England's key stage 2 national tests (for 11 year olds) in science, mathematics and English taken in 2009 and 2010.

Students receive a level from 2 to 5 for each of these tests, corresponding to progress against National Curriculum objectives. The issue under investigation is whether the levels received by students reflect their actual levels of performance - in other words, are the test results reliable in terms of their classification accuracy?

The different statistical methods used to calculate the accuracy of classification show similar results in that classification accuracy is estimated at around 85% for English, 87% for science and 90% for mathematics, a substantial improvement compared with the tests at their introduction in the late 1990s.

2. Introduction

In 1995, National Curriculum tests in science, mathematics and English were introduced in England for children in the final year of primary school education (year 6, age 11). In mathematics and English, these tests continue to this day (science tests continue for a small sample of 11 year olds). The tests consist of multiple parts - for example, the mathematics test includes a calculator paper (40 marks), a non-calculator paper (40 marks) and a mental arithmetic test (20 marks) - with a candidate's total score being the sum of scores on the various parts.

For each test, children are awarded a level from 2 to 5, with most children expected to achieve a level 4. Children's scores in the test are converted to levels by using level boundaries that are adjusted each year for test difficulty, with the intention of maintaining standards over time. For example, in the mathematics test (marked out of 100), the range for level 4 was 46 to 76 in 2009 and 46 to 78 in 2010. This adjustment process is called 'test equating' and relies on a sample of candidates in 2009 taking both the 2009 live test and the 2010 pre-test.
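The conversion from marks to levels can be illustrated with a small Python sketch. Only the level 4 mark ranges are taken from the text above. The function name `level_from_mark` is hypothetical, and the sketch assumes that any mark above the level 4 range earns level 5 (the top level). Boundaries below level 4 are not given in this summary, so the lower levels are not separated.

```python
# Level 4 mark ranges for the mathematics test (out of 100), as quoted above.
LEVEL4_RANGE = {2009: (46, 76), 2010: (46, 78)}

def level_from_mark(mark, year):
    """Return a coarse level label for a total mark in the given year.

    Assumption: marks above the level 4 range are level 5. Marks below the
    range are reported only as "below 4", because the lower boundaries are
    not stated in the source.
    """
    low, high = LEVEL4_RANGE[year]
    if mark < low:
        return "below 4"
    if mark <= high:
        return "4"
    return "5"

if __name__ == "__main__":
    for year in (2009, 2010):
        print(year, level_from_mark(77, year))
```

Under these assumptions, the same mark of 77 falls in level 5 with the 2009 boundaries and in level 4 with the 2010 boundaries, which is the effect of the annual adjustment.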

In 2001, research demonstrated that unreliability in the tests meant that up to 30% of children taking the tests could be misclassified, i.e. awarded a level that did not correspond to their actual level of achievement. Research for Ofqual's reliability programme in 2010 shows that misclassification estimates have fallen to 10 to 15%, depending on the subject.

This research summary looks at the factors that contribute to examination unreliability and how unreliability and, in particular, misclassification can be measured.

3. What is classification accuracy and how is it calculated?

Classification accuracy is a measure of examination reliability which is used when the results of the examinations are a grade or level (rather than a score). Reliability is a key component in examination fairness - it refers to the extent to which a group of candidates would obtain the same results for an assessment irrespective of who marks their papers, what types of question are used (for example, multiple-choice or essay questions), which topics are set or chosen to be answered on a particular year's paper, or when the examination is taken. While reliability often considers the repeatability of scores, classification consistency looks at whether candidates would get the same grade on different test occasions. The public would expect that a KS2 candidate sitting several years' test papers would receive the same level, whereas they might not be bothered if the candidate's scores varied a little from paper to paper.

There are a number of different statistical methods for calculating classification accuracy, all of which involve making assumptions about the test data so the analysis results are necessarily estimates (some of these assumptions are discussed below). In this project, researchers used six different models and found that the results were very similar (within three or four percentage points of each other on classification accuracy).

All the approaches are based on the same fundamental steps (the principles and outcomes are comprehensible to non-mathematicians even if the actual methods are not):

  • modelling and estimating all candidates' 'true scores' (the true measure of their ability - the average score they would receive if they took many tests of the same subject at the same time - a theoretical concept really)
  • from this, estimating candidates' 'true levels' based on the modelled 'true scores'
  • comparing 'true levels' with the observed levels
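The three steps above can be illustrated with a toy simulation. This is not one of the six models used in the project: those estimate true scores from real test data, whereas this sketch invents them. The true-score distribution (mean 60, standard deviation 18), the error size `sem` and the function names are all assumptions made for illustration. The boundaries 46 and 77 correspond to the 2009 mathematics level 4 range quoted earlier.

```python
import random

def level(score, cuts):
    """Count the level boundaries at or below the score (0 = lowest band)."""
    return sum(score >= c for c in cuts)

def simulate_accuracy(cuts, sem, n=20000, seed=1):
    """Share of simulated candidates whose observed level equals their true level.

    Assumptions: true scores are normal(60, 18) clipped to 0-100, and
    measurement error is normal with the same spread (sem) at every score.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n):
        # Step 1: a modelled 'true score' for the candidate.
        true_score = min(100.0, max(0.0, rng.gauss(60, 18)))
        # The score actually observed on one sitting: true score plus error.
        observed = min(100.0, max(0.0, true_score + rng.gauss(0, sem)))
        # Steps 2 and 3: derive both levels and compare them.
        if level(true_score, cuts) == level(observed, cuts):
            agree += 1
    return agree / n

if __name__ == "__main__":
    for sem in (0, 3, 8):
        print(sem, simulate_accuracy([46, 77], sem))
```

With no measurement error every candidate is classified correctly, and the agreement rate falls as the assumed error grows.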

The different methods used have different benefits: some try to allow for the fact that tests may be more reliable for some scores than others (e.g. the KS2 tests tend to classify candidates more reliably at level 5 than at level 2); others use a simple model of uniform error across the range of scores. One method (called Item Response Theory) seeks to create a scale of difficulty for items and a scale of ability for candidates. This allows statistical comparisons of candidates who took tests in one year with other candidates who took a different test in another year. 'Separating out' a candidate's ability from a question's difficulty is very attractive in looking at how candidates would perform on multiple tests (without them actually having to sit all the tests) but it relies on some big assumptions - for example that the test assesses a single trait (such as 'a candidate's ability in science') and doesn't allow for the fact that some students perform better than their peers in some topics and less well in others.
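The idea of placing candidates and items on a common scale can be sketched with the simplest Item Response Theory model, the one-parameter logistic (Rasch) model. The summary does not say which IRT model the researchers used, so the choice of model, the one-mark items, the function names and the difficulty values below are all assumptions for illustration.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: chance that a candidate answers a one-mark item correctly.

    Ability and difficulty sit on the same scale; only their difference matters.
    """
    return 1 / (1 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    """Expected total mark on a test made up of one-mark items."""
    return sum(p_correct(ability, d) for d in difficulties)

if __name__ == "__main__":
    # Hypothetical item difficulties for two different years' papers.
    paper_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
    paper_b = [-0.5, 0.0, 0.5, 1.0, 1.5]  # every item slightly harder
    ability = 0.8
    print(expected_score(ability, paper_a), expected_score(ability, paper_b))
```

Because a single ability value drives every item, the sketch also shows the single-trait assumption at work: it has no way to represent a candidate who is strong in one topic and weak in another.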

4. What factors affect reliability and classification accuracy?

Many things can affect reliability: marker inconsistency, differences in test papers from year to year in terms of the content coverage, the types of questions used, the difficulty of the questions and where the level boundaries are set.

Classification accuracy is also affected by how many levels there are in the test results, and the consequent mark range for each level. A test may provide people with more information by having more levels (consider for example, the introduction of the A* grade at GCSE and A level) but this will result in a lower level of classification accuracy for all levels.
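A rough calculation shows why extra levels cost accuracy. The sketch assumes, purely for illustration, that true scores are spread evenly across the mark range, that measurement error is normal with the same standard deviation `sem` everywhere, and that boundaries are far enough apart (and far enough from the ends of the range) not to interact. Under those assumptions each boundary misclassifies a share of candidates equal to sem × sqrt(2/π) divided by the mark range. The function name and the value of `sem` are hypothetical.

```python
import math

def approx_accuracy(n_boundaries, sem, mark_range=100):
    """Approximate classification accuracy under the simplifying assumptions.

    Each boundary contributes a misclassification share of
    sem * sqrt(2 / pi) / mark_range, so accuracy falls with every boundary added.
    """
    per_boundary = sem * math.sqrt(2 / math.pi) / mark_range
    return 1 - n_boundaries * per_boundary

if __name__ == "__main__":
    # Three boundaries separate four levels; a fifth level adds a fourth boundary.
    for boundaries in (3, 4):
        print(boundaries, round(approx_accuracy(boundaries, sem=4), 3))
```

Each added boundary creates another region of the mark scale where a small error in the score changes the level awarded.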

5. Classification accuracy results for Key Stage 2 tests in 2009 and 2010

The first analysis undertaken looks at the internal consistency of the test - if the test measures a single trait, then good candidates would be expected to do better than weaker candidates on all items. An item that isn't consistent in this way is testing something other than the single trait, and undermines the internal reliability of the test (and one of the key assumptions for the statistics). KS2 tests in 2009 and 2010 have high levels of internal consistency, particularly when it is considered that the tests are marked by people and the scores therefore include a degree of variance due to marker subjectivity.
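One widely used measure of internal consistency is Cronbach's alpha. The summary does not name the statistic used in the research, so the following is only a sketch of how such a measure can be computed. The function name and the tiny data sets are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of marks.

    item_scores holds one list per candidate, with one mark per item.
    Values near 1 mean the items rank candidates in a consistent way.
    """
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

if __name__ == "__main__":
    consistent = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]    # items agree perfectly
    inconsistent = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]  # items disagree
    print(cronbach_alpha(consistent))
    print(cronbach_alpha([[0, 0, 1], [1, 1, 0], [2, 2, 2], [1, 2, 1]]))
```

The function divides by the variance of the total scores, so it needs data in which candidates' totals differ.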

Classification accuracy for mathematics was around 90%. The classification accuracy for English was around 85%, and science around 87%. The six methods provided classification accuracies mostly within two percentage points of each other.

These classification accuracies are much higher than those reported in the early days of Key Stage 2 national tests. The increase in classification accuracy is likely to relate to improvements in the reliability of the tests and changes in their structure.

It should be noted that classification accuracy is an estimate based on a population as a whole - it cannot be used for an individual candidate. A candidate whose score is close to a boundary mark is more likely to be misclassified than one whose score is in the middle of the mark range for the level.
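The effect of distance from a boundary can be sketched numerically. The example assumes normal measurement error with a hypothetical standard deviation (`sem`) of 4 marks, the same at every score, and uses the 2009 mathematics level 4 range of 46 to 76 quoted earlier. The function names are illustrative, and observed marks are treated as whole numbers by widening the range by half a mark at each end.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def prob_outside_level(true_score, low, high, sem):
    """Chance that the observed whole-number mark falls outside low..high."""
    inside = (normal_cdf(high + 0.5, true_score, sem)
              - normal_cdf(low - 0.5, true_score, sem))
    return 1 - inside

if __name__ == "__main__":
    # A candidate just above the lower boundary versus one in mid-range.
    for true_score in (47, 61):
        print(true_score, prob_outside_level(true_score, 46, 76, sem=4))
```

Under these assumptions a candidate whose true score is 47 has a substantial chance of receiving a different level, while for a candidate at 61 the chance is negligible.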

Finally, it is clear from the work on the Ofqual Reliability Programme that technical terminology presents problems in many areas of assessment reliability. In this context, 'candidate level misclassification' in common meaning suggests a mistake has been made in handling the exams process (for example, a marking error has been made). The statistical term refers to 'measurement error between the observed and notional "true" scores' - i.e. not so much a mistake as a measurement of the variance between a notional score (the average score the candidate would receive if they took many tests of the subject one after another) and the actual score they achieved on the particular paper they sat.