Date of Award

5-2025

Document Type

Thesis

Degree Name

Master of Arts (MA)

College/School

College of Humanities and Social Sciences

Department/Program

Psychology

Thesis Sponsor/Dissertation Chair/Project Chair

Michael Bixter

Committee Member

Keven Askew

Committee Member

Manuel Gonzalez

Abstract

This thesis asks whether traditional regression models and more complex machine learning models can predict rare, infrequent events in the social and psychological sciences. It also compares the performance of regression models with that of more complex models to determine whether the more complex models, which are harder to interpret and configure, are necessary at all. Two studies address these questions. Study 1 examined workplace misconduct: 363 participants in the United States were surveyed about their workplace experiences and behaviors, including acts of misconduct they had personally committed (a frequency of approximately 4%). Study 2 used found data from a major news outlet’s database of civilian fatalities from police use-of-force incidents involving firearms from 2015 to 2023, paired with publicly available survey data, collected by the federal government, on the organizational practices of local and state police agencies. In Study 2, models were built to predict which agencies showed a high risk of shooting unarmed civilians on the basis of their organizational practices and attributes (an occurrence of approximately 1.5%). In both studies, several models, including logistic regression, random forest, XGBoost, and TabNet, were run in different configurations on the binary prediction problems (workplace misconduct in Study 1 and high-risk police agencies in Study 2) to identify, and then compare, the models that accurately identified these rare events. In both studies, less sophisticated models tended to outperform more complex models.
However, no single model performed well in both training and final validation, which raises the question of whether a model can be relied upon when it performs well only repeatedly during training or only on unseen data, but not both. The study highlights the inherent difficulty of predicting rare events in the social sciences, where rare events seldom share a unique, strong signal (in terms of correlational strength) that is common to all of them. The dynamic nature of these events, together with the difficulty of applying machine learning to extremely imbalanced data, makes complete success in this area of study hard to achieve.

File Format

PDF
