Date of Award

8-2025

Document Type

Thesis

Degree Name

Master of Science (MS)

College/School

College of Science and Mathematics

Department/Program

Applied Mathematics and Statistics

Thesis Sponsor/Dissertation Chair/Project Chair

Christopher Leberknight

Committee Member

Haiyan Su

Committee Member

Aparna Varde

Abstract

In today’s digitally driven music landscape, understanding what drives a song’s popularity requires insight not only into its acoustic and lyrical content, but also into patterns of listener engagement across platforms. This thesis explores the predictive and descriptive dimensions of song popularity by applying supervised and unsupervised machine learning models to a multi-source dataset integrating audio features, sentiment analysis, and temporal consumption behavior. Drawing on a novel, multi-platform dataset that includes Billboard Hot 100 rankings, Spotify acoustic features and popularity scores, streaming, airplay, and sales metrics as reported on Luminate’s Music Connect, and lyrics from AZLyrics, the study investigates the relationships between musical structure, listener behavior, and popularity outcomes. Our research proceeds using predictive methods. Supervised learning models, including Random Forest classifiers and regressors, are employed to determine which user engagement features contribute the most to classifying core genre (as defined by Billboard), to predict user engagement, and to determine which features contribute the most to predicting the artists generating the highest volume of digital song sales. This research also employs XGBoost and feature construction to determine how user engagement influences Spotify’s proprietary popularity score. Unsupervised learning techniques such as Principal Component Analysis and K-Means clustering are used to identify latent groupings of songs based on audio attributes. A methodological innovation of this work is the application of the TSFresh package to extract hundreds of time-series features from weekly streaming data, enabling a detailed examination of how popularity evolves over time. The results of this analysis show that temporal consumption behavior has significant predictive power. Specifically, on-demand audio streaming contributes significantly to predicting popularity.
This is a novel finding not reported in prior studies. In contrast, streamed video adds little value to popularity prediction, so it can be omitted, which improves the efficiency of the classification model by removing the overhead of collecting this additional feature. Our results indicate that a combined dataset of user engagement and audio features performs best. Using the combined features, our model achieved a prediction accuracy of 0.84. However, we found that genre is not a useful indicator for predicting user engagement. The research also identifies the features that contribute the most to digital sales and supports the impact of user engagement on song popularity. These features include aggregated prior sales at the artist level, on-demand audio streaming, airplay, programmed audio streaming, popularity score, and genre. Our results also reveal that time-series dynamics, particularly streaming volatility and structured growth patterns, offer strong predictive value. While audio characteristics like energy, tempo, and valence help explain clustering patterns, popularity is shown to be more closely tied to the temporal structure of listener engagement. Sentiment analysis on a subset of lyrics provides additional context but is limited by data sparsity. This study contributes to the growing field of music analytics by presenting a hybrid framework that combines content-based features with behavioral consumption data. It offers practical implications for music recommendation systems, marketing strategies, and artist development, while also advancing academic understanding of how popularity emerges and is sustained in the modern music economy.

File Format

PDF
