Statistics – Inter Rater Agreement (IRA)

A study that purports to provide an objective assessment of a biological parameter MUST be able to demonstrate that the assessments made by individual investigators are reliable.

This problem is covered in detail in Altman (1991), Chapter 14, which gives a detailed exposition of what are generally referred to as METHOD COMPARISON STUDIES.  The issues surrounding reliable assessment are also covered in Petrie and Sabin (2009), Chapter 39.

In the context of DAE the first problem is to ensure that individual assessors (Dentists, Doctors, Anthropologists, Forensic Scientists) are able to reliably and consistently assess Tooth Development Stages (TDS).

There are two areas of activity where a demonstration of a high level of reliable assessment is needed:

1.  Investigators responsible for assembling a Reference Data Set must be able to demonstrate a high level of agreement when assessing TDS.

2.  Assessors conducting formal DAE on subjects without birth records, although not responsible for developing an RDS, also need to demonstrate a high level of agreement when assessing TDS.  This is because reliable TDS assessments lead to greater accuracy of Age Estimation for individual subjects.

An important study related to this issue was carried out in Montreal, Canada (Levesque et al 1980).  This paper provides valuable detail on errors of assessment using the 8 Stage System (Demirjian et al 1973).  The study shows that stage assessments are accurate 80% of the time, and for the remaining 20% the assessments were out by plus one or minus one stage only.

The interpretation and importance of this is that, for an RDS, the incorrect assessments probably even out, so that the Age of Attainment (AaA) for each TDS is accurate.  This is especially so where there are sufficiently large numbers in the RDS.  The DARLInG research group recommends that N, the total number of cases in an RDS, is set at 1,600 or more.
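This evening-out effect can be illustrated with a minimal simulation (a sketch with hypothetical numbers, assuming the symmetric plus-or-minus-one error pattern reported by Levesque et al 1980 and the recommended RDS size of 1,600):

```python
import random

random.seed(1)

# Hypothetical model: each assessment of a tooth at true stage 5 is correct
# with probability 0.80, and off by plus or minus one stage otherwise.
N = 1600          # recommended minimum number of cases in an RDS
true_stage = 5
observed = [
    true_stage + random.choice([-1, 1]) if random.random() < 0.20 else true_stage
    for _ in range(N)
]
mean_observed = sum(observed) / N
# Because the errors are symmetric, the mean over a large sample stays
# close to the true value.
print(round(mean_observed, 2))
```

With symmetric errors the expected mean equals the true stage, so over 1,600 cases the sample mean lies very close to 5; the same logic is what allows the AaA for each TDS to remain accurate despite individual misassessments.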

For individual DAE, where only up to 16 TMTs may contribute to the Dental Age, one or two incorrect assessments may have a significant effect on the DA provided.

For these reasons it is important that i. assemblers of an RDS and ii. individual assessors carrying out DAE are both trained in the detailed technique of stage assessment and can demonstrate competence by returning satisfactory values for intra-rater and inter-rater agreement.

How is this done? In the context of DAE it is Cohen’s Kappa that is the appropriate statistical test.  This is used at the TDS level.  Investigators are advised to carry out both Intra-Rater and Inter-Rater agreement assessments on approximately 200 developing teeth.  This is easily covered by selecting 15 Dental Panoramic Tomographs (DPT) and assessing the 16 Tooth Morphology Types (TMT) on the Left side.
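As a sketch of how Cohen’s Kappa is computed at the TDS level (the stage data below are hypothetical; pure Python, no external libraries):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two sets of categorical assessments."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of teeth given the same stage.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal stage frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pe = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical stages (A-H, as in the 8 Stage System) for 10 developing
# teeth, assessed by the same rater on two occasions (Intra-Rater Agreement).
first  = ["D", "E", "E", "F", "G", "D", "C", "H", "F", "E"]
second = ["D", "E", "F", "F", "G", "D", "C", "H", "F", "E"]
print(round(cohens_kappa(first, second), 2))  # -> 0.88
```

The same function applies unchanged to Inter-Rater Agreement: pass one list of stages per rater for the same set of teeth.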

Nomenclature – the assessment of reliability appears in the literature under a number of different terms.

Reliability, Repeatability, Reproducibility, Consistency.  All of these touch on the required meaning but are insufficiently precise to convey the detail required.

In this context it is helpful to refer to agreement between a single assessor on two separate occasions as Intra-Rater Agreement (IRA).

The agreement between two different assessors on the same (or near same) occasion is known as Inter-Rater Agreement, sometimes termed Between-Rater Agreement (BRA).

Investigators may carry out a Reproducibility test by going to the page [ Inter Rater Agreement ]

The outcome value is the Kappa Statistic – a number between -1.00 and 1.00, where values at or below 0.00 indicate agreement no better than chance and 1.00 indicates perfect agreement.

Kappa Statistic       Strength of Agreement

    < 0.00            Poor
  0.00 – 0.20         Slight
  0.21 – 0.40         Fair
  0.41 – 0.60         Moderate
  0.61 – 0.80         Substantial
  0.81 – 1.00         Almost Perfect

(see Landis and Koch 1977)
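The Landis and Koch table above can be expressed as a small helper function (an illustrative sketch, not part of any published tool):

```python
def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive label."""
    if kappa < 0.00:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost Perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.00")

print(landis_koch(0.75))  # -> Substantial
```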

Less emphatic descriptions are given in Altman (1991):

Kappa Statistic       Strength of Agreement

  0.00 – 0.20         Poor
  0.21 – 0.40         Fair
  0.41 – 0.60         Moderate
  0.61 – 0.80         Good
  0.81 – 1.00         Very Good

Investigators must be able to demonstrate the ability to return Substantial or Almost Perfect Kappa scores before moving on to establish an RDS.  Equally, investigators must demonstrate the same Kappa scores before carrying out DAE on children or emerging adults.