A crucial step in developing the PIP© was assembling a group of experts in psychological assessment who could provide accurate ratings of the psychological characteristics identified in Part 1 of our series. Job descriptions for this role were developed, and applicants were assessed on their academic credentials and experience with psychological assessment. This group of raters has a combined 190 years of experience in psychological assessment and is diverse in race, age, and gender. This diversity in rater demographics further helps to mitigate bias in the AI algorithms.
A proprietary application was developed to make the rating process effective and efficient. This application, named the Rating Application for Traits and Emotions (RATE)©, contains the measures and presents the speech files to the raters. The cohort of raters then rated each speaker in a speech file. Because this process is highly automated, a large volume of speech files can be rated quickly. The combination of transparent, standardized measures and expert raters also produced high-quality labels and ratings of the psychological characteristics of each speaker.
Each rater received initial training, grounded in an extensive review of the literature, on the RATE application and on best practices for perceived (versus self-report) psychological assessment. Rater productivity and performance were monitored using a standardized report, with performance indicators comprising a series of reliability, agreement, and rating-error metrics. Assessors received ongoing feedback, training, and oversight.
When using multiple raters to rate an individual on a psychological characteristic, such as a person's degree of extraversion, it is essential to assess the level of agreement among those raters. In developing the PIP©, the examination of rater agreement served two purposes. First, the results from rater agreement analyses revealed information regarding the quality of the new measures (described in Part 1).
Second, speech files with high rater agreement can be considered high-quality and used to train the AI algorithms. Two different metrics of rater agreement were used depending on the psychological characteristic in question.
For emotions, which are rated on a binary "present or not present" basis, agreement was defined as a substantial majority of expert raters endorsing that a specific emotion was present in a given 5-second segment of speech; a sketch of this segment-level check follows.
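As an illustration, the check below flags a 5-second segment as carrying a usable emotion label only when the share of endorsing raters clears a majority threshold. The paper does not publish the exact cut-off, so the `threshold` value and all names here are hypothetical; this is a minimal sketch, not the production logic.

```python
from typing import Sequence

def emotion_agreed(ratings: Sequence[int], threshold: float = 0.8) -> bool:
    """Return True if a 'substantial majority' of binary ratings endorse the emotion.

    ratings   -- one 0/1 vote per expert rater for a single 5-second segment
    threshold -- hypothetical cut-off; the source does not state the exact value
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    endorsement_rate = sum(ratings) / len(ratings)
    return endorsement_rate >= threshold

# Example: 5 of 6 raters heard the emotion in this segment.
print(emotion_agreed([1, 1, 1, 0, 1, 1]))  # True at a 0.8 threshold
```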
For personality traits, which have a continuous rather than binary distribution, the Intraclass Correlation Coefficient (ICC) was used to assess rater agreement. The ICC expresses the degree to which raters agree on a scale from zero to one, with higher scores representing higher agreement. There are different versions of the ICC, and researchers should select the one best suited to their specific use case. For these analyses, ICC3k was used, which gives the level of agreement among a fixed set of raters when the mean of the raters' ratings is the value of interest; a computational sketch follows.
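For concreteness, the following sketch computes ICC(3,k) from a targets-by-raters matrix using the standard Shrout and Fleiss two-way mean-squares formula, ICC(3,k) = (MS_targets − MS_error) / MS_targets. The function name and the toy data are ours, not VoiceSignals'; in practice an established routine such as pingouin.intraclass_corr would typically be used instead.

```python
import numpy as np

def icc3k(ratings: np.ndarray) -> float:
    """ICC(3,k): consistency of the mean rating from a fixed set of k raters.

    ratings -- shape (n_targets, k_raters); each row is one speaker,
               each column one expert rater.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    # Two-way ANOVA sums of squares (no replication).
    ss_targets = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_targets - ss_raters  # residual
    ms_targets = ss_targets / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_targets - ms_error) / ms_targets

# Toy example: 4 speakers rated on extraversion by 3 raters (1-5 scale).
scores = np.array([[4, 4, 5],
                   [2, 1, 2],
                   [3, 3, 4],
                   [5, 4, 5]], dtype=float)
print(round(icc3k(scores), 3))  # about 0.98 => very strong agreement
```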
Best practice guidelines recommend the following interpretation of ICC values[8]: values below 0.50 indicate poor reliability, values between 0.50 and 0.75 indicate moderate reliability, values between 0.75 and 0.90 indicate good reliability, and values above 0.90 indicate excellent reliability.
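Continuing the sketch above, a helper like the hypothetical interpret_icc below maps an ICC value onto the Koo and Li[8] bands; it is only a convenience for reading agreement reports, not part of the PIP© itself.

```python
def interpret_icc(icc: float) -> str:
    """Label an ICC value using the Koo & Li (2016) guideline bands."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(interpret_icc(0.98))  # "excellent"
```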
The average ICC3k value for the personality traits showed a high degree of agreement.
The ICC metric of rater agreement was also used for the Mindset psychological characteristics, since they too have a continuous rather than binary distribution of ratings. The average ICC3k value for the Mindset characteristics indicates excellent rater agreement.
Summary
This paper details the development and validation of VoiceSignals' PIP©. The PIP© was designed and developed by experts in AI engineering and psychological assessment and validated according to best practices in psychological science. The validation evidence presented in this paper shows that the algorithms embedded in the PIP© are valid predictors of important psychological characteristics and can be applied in real-world settings. Because the PIP© is designed to keep improving its predictions as it learns from more data, keeping the validation evidence current becomes increasingly important over time. Accordingly, the validity evidence contained in this paper will be updated periodically.
[1] Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1), 1-26.
[2] Karlan, D., Mullainathan, S., & Robles, O. (2012). Measuring personality traits and predicting loan default with experiments and surveys. Banking the World: Empirical Foundations of Financial Inclusion, 393-410.
[3] Kassarjian, H. H. (1971). Personality and consumer behavior: A review. Journal of Marketing Research, 8(4), 409-418.
[4] American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA.
[5] Costa Jr., P. T., & McCrae, R. R. (1995). Domains and facets: Hierarchical personality assessment using the Revised NEO Personality Inventory. Journal of Personality Assessment, 64(1), 21-50.
[6] Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267.
[7] Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344-350.
[8] Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163.
[9] Khalil, R. A., Jones, E., Babar, M. I., Jan, T., Zafar, M. H., & Alhussain, T. (2019). Speech emotion recognition using deep learning techniques: A review. IEEE Access, 7, 117327-117345.
[10] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). New York, NY: Academic Press.
James is an Industrial/Organizational Psychologist and a demonstrated industry leader in the applied use of Artificial Intelligence (AI) and Machine Learning to enhance psychological assessment. He has a track record of developing innovative AI-based psychological assessments and his work has frequently been presented at industry conferences. James is a true scientist-practitioner and leverages his broad experience as both a research scientist and consultant to provide world-class solutions for customer needs.