Describes the rationale and method used to create a new tool for detecting response bias in job performance evaluations.
Cite This Article: Williams, K. M. (2023, October 1). Measuring bias in job performance evaluation: Applying workplace contextualization and empirical assessment development criteria. The Score. https://www.apadivisions.org/division-5/publications/score/2023/10/bias-in-job-evaluation
Author Note: The author is affiliated with Educational Testing Service (ETS).
The ubiquity and consequences of job performance evaluations (Campbell & Wiernik, 2015) necessitate accurate responding in both self-report and other-report assessments. However, response distortion or response bias – described as “a systematic tendency to respond to a range of questionnaire items on some basis other than the specific item content” (Paulhus, 1991, p. 17) – represents a significant threat to the veracity of performance evaluations. Examples of response bias that may occur in either self- or other-reports include positive impression management (PIM; i.e., socially desirable responding, positive bias, “faking good”) and negative impression management (NIM; i.e., negative bias or “faking bad”; see reviews by McGrath et al., 2010; Paulhus, 2002). Self-reported bias is often related to pervasive dispositional factors or temporary situational pressures (e.g., job interviews), whereas other-report bias (e.g., evaluations from supervisors or coworkers) may be influenced by ratees’ impression management (IM) tactics (Peck & Levashina, 2017), halo effects (McGill et al., 2011), raters’ self-interest or desire to motivate ratees (Murphy, 2008; Murphy & Cleveland, 1995), or unrepresentative observations (Berry et al., 2012; Murphy, 2008).
To detect response distortion, researchers and practitioners frequently rely on assessments called bias indicators. These indicators are usually developed for general self-report research purposes beyond the workplace (e.g., Crowne & Marlowe, 1960; Paulhus, 1998; see review by McGrath et al., 2010). Although many authors support the use of bias indicators in research and applied workplace settings (e.g., McGrath et al., 2010), meta-analyses and reviews suggest only weak empirical evidence for response distortion’s threat to validity in workplace contexts (Li & Bagger, 2006; Ones et al., 1996). Plausible explanations for these results include the questionable validity of bias indicators themselves as well as criterion coarseness (poor alignment between predictors’ and outcomes’ respective scopes; McGrath et al., 2010). That is, bias indicators may lack appropriate relevance or specificity, given that most bias indicators utilized in workplace research include noncontextualized, generic statements (Blanch et al., 2009; Goffin & Christiansen, 2003; Li & Bagger, 2006; McGrath et al., 2010).
We aimed to address these concerns by developing a response bias assessment contextualized to workplace settings. Converting noncontextualized statements (e.g., “I am a responsible person”) to workplace-contextualized statements (e.g., “I am a responsible employee”) may allow workplace response bias to be detected more accurately. Meta-analyses and reviews suggest that contextualizing assessments increases their predictive validity (e.g., Heggestad & Gordon, 2008; Shaffer & Postlethwaite, 2012). Although instruments such as personality assessments have been contextualized in this manner (see Heggestad & Gordon, 2008; Shaffer & Postlethwaite, 2012), it appears that contextualization has yet to be applied to bias indicators.
Assessment Design and Item Pool Development. We developed a workplace-specific assessment, the Occupational Performance Assessment – Response Distortion (OPerA-RD) scale, designed to measure biased performance ratings. The OPerA-RD includes four key features. First, the instrument exclusively includes items contextualized to the workplace. As a first step, we used tactics common to other impression management measures, such as writing statements that deny common flaws or claim uncommon virtues (e.g., “I never swear”; “I always know why I like things”; Paulhus, 1991; see also McGrath et al., 2010). We then integrated workplace-relevant terminology, resulting in items such as “never misses deadlines” and “is the perfect employee”. Items therefore represent unrealistically high or low standards of performance. Second, the instrument is amenable to both self- and other-reporting: items describe behaviors or other easily evaluated characteristics rather than more covert attributes (e.g., thoughts, attitudes). Third, the assessment is designed to be responsive to either over- or underreporting of workplace performance (i.e., PIM or NIM, respectively). Finally, items were selected based on their demonstrated ability to detect self- and other-reported over- and underreporting of the ratee’s performance. This use of empirical data in item selection distinguishes the OPerA-RD from other IM assessments, which tend to rely on nonempirical (e.g., conceptual) criteria. Applying these criteria yielded an initial pool of 56 items designed to detect over- or underreporting of job performance from either a self- or other-report perspective. Respondents rate their level of agreement with each item on a Likert-type scale (1 = strongly disagree to 5 = strongly agree). We then collected data from two independent samples for item selection (Sample 1) and initial validation (Sample 2).
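To make the scoring logic concrete, the following is a minimal, hypothetical sketch of how Likert-type responses to such a contextualized item pool might be aggregated. The item texts, keying, and reverse-scoring rule shown here are illustrative assumptions, not the published OPerA-RD scoring key (see Williams, 2022 for the actual instrument).

```python
# Hypothetical illustration of Likert-type scoring for a contextualized bias item pool.
# Item wording, keying, and the scoring rule are assumptions, not the published OPerA-RD key.

ITEM_POOL = {
    "never_misses_deadlines": "+",   # unrealistically high standard (agreement suggests overreporting)
    "is_the_perfect_employee": "+",
    "always_arrives_late": "-",      # hypothetical unrealistically low standard (agreement suggests underreporting)
}

def score_responses(responses: dict[str, int]) -> int:
    """Sum 1-5 Likert responses, reverse-keying negatively keyed items,
    so higher totals suggest overreporting and lower totals suggest underreporting."""
    total = 0
    for item, rating in responses.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"Likert responses must be 1-5, got {rating} for {item}")
        total += rating if ITEM_POOL[item] == "+" else 6 - rating
    return total

# Example: strong agreement with unrealistically positive items and strong disagreement
# with the unrealistically negative item yields the maximum total (suggesting PIM).
print(score_responses({"never_misses_deadlines": 5,
                       "is_the_perfect_employee": 5,
                       "always_arrives_late": 1}))  # 15
```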
Item Selection. The item selection sample comprised 600 currently employed individuals recruited online through Amazon Mechanical Turk (MTurk). Participants were randomly divided into six approximately equal-sized groups in a 2 (self-report vs. other-report) x 3 (fake good, fake bad, control) between-subjects design. Individuals in the other-report conditions were screened to ensure management experience, experience formally rating coworkers’ job performance, or both. Participants were instructed to respond to the OPerA-RD items as part of a simulated job performance evaluation exercise, with more specific directions provided for each group (Williams, 2022). Detailed faking instructions simulated realistic workplace scenarios in which an employee might rate themselves or a coworker more positively or negatively than usual. Specifically, self-report fake good instructions asked participants to rate themselves more positively than usual so they could receive a raise, whereas self-report fake bad instructions asked participants to rate themselves more negatively than usual so they would receive an all-expenses-paid training opportunity (see Williams, 2022). In the other-report conditions, participants were given motivation to fake good or bad so that the ratee would be more likely to gain a desired outcome, paralleling the self-report conditions. Control conditions instructed participants to rate themselves (self-report) or a coworker (other-report) accurately. All items were analyzed, and the 20 best-performing items from this first phase were tested again in a new sample.
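The article does not detail the statistical criteria used to identify the 20 best-performing items; Williams (2022) reports those specifics. As one plausible illustration of an empirical selection step, the sketch below ranks items by how strongly they separate control from faking conditions (an independent-groups standardized mean difference per item) and retains the top 20; the simulated data and the |d|-based ranking rule are assumptions for demonstration only.

```python
# Hedged sketch of one plausible empirical item-selection step: rank items by the absolute
# standardized mean difference (Cohen's d) between faking and control conditions, keep the top 20.
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Independent-groups Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def select_items(control: np.ndarray, faking: np.ndarray, k: int = 20) -> np.ndarray:
    """control, faking: respondents x items matrices of 1-5 ratings from the two conditions.
    Returns the column indices of the k items with the largest |d| between conditions."""
    d_per_item = np.array([abs(cohens_d(faking[:, j], control[:, j]))
                           for j in range(control.shape[1])])
    return np.argsort(d_per_item)[::-1][:k]

# Simulated data standing in for the 56-item pool (100 respondents per condition).
rng = np.random.default_rng(0)
control = rng.integers(1, 6, size=(100, 56)).astype(float)
faking = np.clip(control + rng.normal(0.8, 0.5, size=control.shape), 1, 5)
print(select_items(control, faking))
```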
Sample 2 included 1,318 currently employed participants recruited through MTurk. Participants were randomly assigned to one of six groups in a design incorporating both within-subjects and between-subjects features. Two of the groups responded to the final 20-item OPerA-RD self-report scale, once under control instructions and once under either fake good (Group 1) or fake bad (Group 2) instructions similar to those in Sample 1 (see Williams, 2022). The remaining four groups were asked to assume the perspective of a supervisor and use the final 20-item OPerA-RD other-report scale to rate a coworker (subordinate) with whom they had worked for at least one month and whose performance they knew well. These participants provided coworker ratings once under control instructions and once under either fake good (Groups 3 and 4) or fake bad (Groups 5 and 6) instructions, again similar to Sample 1. Twenty standard job performance items were embedded among the OPerA-RD items to help disguise them.
Within each group, the difference between mean OPerA-RD total scores under control and faking conditions was used to evaluate the OPerA-RD’s validity as an instrument for detecting over- or underreporting of self- or other-reported job performance. Results supported the OPerA-RD’s ability to detect faking in all conditions, with at least a large effect size (|d| ≥ 0.80; Cohen, 1988) in the expected direction for each comparison. Absolute d values ranged from 1.34 to 2.28, and none of the 95% confidence intervals included values below 0.80.
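For readers who want to see the shape of this comparison, the sketch below computes a repeated-measures standardized difference between one group’s totals under control versus faking instructions, with a percentile-bootstrap 95% confidence interval. The specific d formula and CI method used in the study are reported in Williams (2022); the within-person d, the bootstrap approach, and the simulated totals here are assumptions for illustration.

```python
# Minimal sketch of a control-vs-faking comparison for one group: repeated-measures Cohen's d
# (mean difference divided by the SD of the differences) plus a percentile bootstrap 95% CI.
import numpy as np

def paired_d(control: np.ndarray, faking: np.ndarray) -> float:
    """Standardized mean difference for repeated measures."""
    diff = faking - control
    return diff.mean() / diff.std(ddof=1)

def bootstrap_ci(control: np.ndarray, faking: np.ndarray,
                 n_boot: int = 5000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the paired d, resampling respondents with replacement."""
    rng = np.random.default_rng(seed)
    n = len(control)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        boots.append(paired_d(control[idx], faking[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return float(lo), float(hi)

# Simulated totals for one fake-good group (20 items, 1-5 scale, so totals range 20-100).
rng = np.random.default_rng(1)
control_totals = rng.normal(60, 10, size=200)
faking_totals = control_totals + rng.normal(15, 8, size=200)
print(paired_d(control_totals, faking_totals), bootstrap_ci(control_totals, faking_totals))
```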
Discussion. Given the debatable empirical support for the threat of response distortion in workplace performance assessment, many researchers and practitioners may conclude that these types of screening procedures should be abandoned. Nevertheless, most researchers continue to support the use of bias detection assessments in workplace contexts such as hiring, performance evaluation, and research (Goffin & Christiansen, 2003). Ideally, such an assessment should detect over- or underreporting of job performance by oneself or others, meaning items should reference behaviors or attributes that are observable or otherwise amenable to external evaluation. To this end, the OPerA-RD – developed using a distinctively contextualized and empirical methodology – may advance both science and practice. More generally, our results may support the use of these assessment design and item selection processes in future scale development. Additionally, this general methodology could be evaluated using alternative statistical criteria for selecting items (e.g., item response theory).
In practical applications, as in any high-stakes setting, basing decisions on the results of a single assessment is not recommended, as legitimate but transient factors (e.g., applicant mood, stress, distractions) may contribute to construct-irrelevant variance. Instead, extreme OPerA-RD scores may warrant further scrutiny of an applicant (e.g., through interviews or other additional assessments) to corroborate initial results. Additional details regarding this study are provided in Williams (2022). The development of this assessment may provide a useful tool for research and applied settings alike.
References
Berry, C.M., Carpenter, N.C., & Barratt, C.L. (2012). Do other-reports of counterproductive work behavior provide an incremental contribution over self-reports? A meta-analytic comparison. Journal of Applied Psychology, 97, 613-636.
Blanch, A., Aluja, A., Gallart, S., & Dolcet, J.-M. (2009). A review on the use of NEO-PI-R validity scales in normative, job selection, and clinical samples. European Journal of Psychiatry, 23, 121-129.
Campbell, J.P., & Wiernik, B.M. (2015). The modeling and assessment of work performance. Annual Review of Organizational Psychology and Organizational Behavior, 2, 47-74.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Crowne, D.P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349-354.
Goffin, R.D., & Christiansen, N.D. (2003). Correcting personality tests for faking: A review of popular personality tests and an initial survey of researchers. International Journal of Selection and Assessment, 11, 340-344.
Heggestad, E. D., & Gordon, H. L. (2008). An argument for context-specific personality assessments. Industrial and Organizational Psychology, 1, 320-322.
Li, A., & Bagger, J. (2006). Using the BIDR to distinguish the effects of impression management and self-deception on the criterion validity of personality measures: A meta-analysis. International Journal of Selection and Assessment, 14, 131-141.
McGill, D.A., van der Vleuten, C.P.M., & Clarke, M.J. (2011). Supervisor assessment of clinical and professional competence of medical trainees: A reliability study using workplace data and a focused analytical literature review. Advances in Health Sciences Education, 16, 405-425.
McGrath, R.E., Mitchell, M., Kim, B.H., & Hough, L. (2010). Evidence for response bias as a source of error variance in applied assessment. Psychological Bulletin, 136, 450-470.
Murphy, K.R. (2008). Explaining the weak relationship between job performance and ratings of job performance. Industrial and Organizational Psychology, 1, 148-160.
Murphy, K.R., & Cleveland, J.N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage.
Ones, D.S., Viswesvaran, C., & Reiss, A.D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660-679.
Paulhus, D.L. (1991). Measurement and control of response bias. In J.P. Robinson, P. Shaver, & L.S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). San Diego, CA: Academic Press.
Paulhus, D.L. (1998). Manual for the Balanced Inventory of Desirable Responding (BIDR-7). Toronto, Canada: MHS.
Paulhus, D.L. (2002). Socially desirable responding: The evolution of a construct. In H.I. Braun, D.N. Jackson, & D.E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 49-69). Mahwah, NJ: Erlbaum.
Peck, J.A., & Levashina, J. (2017). Impression management and interview and job performance ratings: A meta-analysis of research design with tactics in mind. Frontiers in Psychology, 8, 201.
Shaffer, J.A., & Postlethwaite, B.E. (2012). A matter of context: A meta-analytic investigation of the relative validity of contextualized and noncontextualized personality measures. Personnel Psychology, 65, 445-494.
Williams, K.M. (2022). The Occupational Performance Assessment – Response Distortion (OPerA-RD) scale: Measuring response bias in self- or other-reported job performance. Journal of Personnel Psychology, 21, 185-196.