Academic Pediatrics | Research in Pediatric Education | Volume 22, Issue 2, P313-318, March 2022

The Impact of Behavioral Anchors in the Assessment of Fellowship Applicants: Reducing Rater Biases

Published: December 02, 2021



      No standardized evaluation tool exists for assessing fellowship applicants. Assessment tools are subject to biases and scoring tendencies that can skew scores and affect rankings. We aimed to develop and evaluate an objective assessment tool for fellowship applicants.


      We detected rater effects in our numerically scaled assessment tool (NST), which consisted of 10 domains rated from 0 to 9. We evaluated each domain, consolidated redundant categories, and removed subjective categories. For the 7 remaining domains, we described each quality and developed a question with a behaviorally anchored rating scale (BARS). Applicants were rated by 6 attendings. Ratings from the NST in 2018 were compared with ratings from the BARS in 2020 for distribution of data, skewness, and inter-rater reliability.
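The skewness comparison described above can be illustrated with a minimal sketch. The abstract does not state which skewness estimator or cutoffs were used, so this version assumes the population moment coefficient and a common rule of thumb (absolute skewness above 0.5 is "moderate", above 1 is "high"). The function names and the sample scores are hypothetical and are not study data.

```python
from statistics import fmean


def skewness(scores):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5 (population form)."""
    n = len(scores)
    mean = fmean(scores)
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # third central moment
    if m2 == 0:
        return 0.0  # all scores identical: treat as no skew
    return m3 / m2 ** 1.5


def skew_label(g1, moderate=0.5, high=1.0):
    """Label skewness using ASSUMED rule-of-thumb cutoffs (not from the article)."""
    magnitude = abs(g1)
    if magnitude > high:
        return "high"
    if magnitude > moderate:
        return "moderate"
    return "approximately symmetric"


# Hypothetical ratings on a 0-9 scale, clustered near the ceiling.
ceiling_clustered = [9, 9, 8, 9, 8, 9, 7, 9, 8, 3]
# Hypothetical ratings spread evenly around the middle of the scale.
evenly_spread = [1, 2, 3, 4, 5]

for name, scores in [("ceiling", ceiling_clustered), ("spread", evenly_spread)]:
    g1 = skewness(scores)
    print(name, round(g1, 2), skew_label(g1))
```

Scores that pile up at the top of the scale produce a long left tail and therefore a negative coefficient, which is the pattern a leniency effect would be expected to create.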


      Thirty-four applicants were evaluated with the NST and 38 with the BARS. Demographics were similar between groups. The median score on the NST was 8 out of 9; scores below 5 were used in less than 1% of all evaluations. The distribution of scores improved with the BARS tool. In the NST, scores from 6 of 10 domains demonstrated moderate skewness and 3 demonstrated high skewness. Three of the 7 domains in the BARS showed moderate skewness and none showed high skewness. Two of 10 domains in the NST vs 5 of 7 domains in the BARS achieved good inter-rater reliability.
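The abstract does not name the inter-rater reliability statistic or the cutoff for "good" agreement. As one possible illustration, the sketch below computes a two-way random-effects, absolute-agreement, single-rater intraclass correlation, ICC(2,1), from an applicants-by-raters table. The choice of ICC(2,1), the function name, and the example ratings are assumptions made for illustration only.

```python
def icc2_1(ratings):
    """ICC(2,1) for a complete table: rows = applicants, columns = raters."""
    n = len(ratings)       # number of applicants
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                  # between-applicant mean square
    ms_cols = ss_cols / (k - 1)                  # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings: 5 applicants scored by 3 raters on one domain.
example = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 4, 3],
    [1, 2, 1],
]
print(round(icc2_1(example), 2))
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently stricter or more lenient than the others lowers the value even when all raters rank applicants in the same order.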


      Replacing a standard numeric scale with a BARS normalized the distribution of data, reduced skewness, and enhanced inter-rater reliability in our evaluation tool. This provides some validity evidence for improved applicant assessment and ranking.



