Project: Predicting Song Popularity Using Data Analysis and Machine Learning
Vision:
To harness the power of data science to quantify creativity and predict song popularity, bridging the gap between emerging talent and audience discovery. This project aims to use data-driven insights to foster a supportive musical ecosystem by identifying tracks with the potential for success.
Objective:
To analyze song characteristics, address multicollinearity in features, and develop a machine learning model capable of predicting a song's popularity based on measurable metrics.
Methodology:
Data and Feature Analysis:
Used data from Statso to analyze key features such as duration, energy, loudness, and danceability.
Investigated correlations to detect multicollinearity and biases in the dataset.
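A correlation check of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample values are made up, the helper names pearson and correlated_pairs are hypothetical, and the 0.7 cutoff is an assumed rule of thumb rather than a threshold stated in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(features, threshold=0.7):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Made-up illustrative values, NOT taken from the project's dataset.
songs = {
    "energy":       [0.42, 0.55, 0.63, 0.71, 0.80, 0.91],
    "loudness":     [-11.2, -9.8, -8.1, -6.9, -5.5, -4.0],
    "danceability": [0.70, 0.48, 0.66, 0.52, 0.74, 0.58],
}

# Assumed cutoff: pairs at or above 0.7 in absolute value are flagged.
flagged = correlated_pairs(songs, threshold=0.7)
print(flagged)
```

In this toy data only the energy and loudness pair is flagged, which is the kind of pair the feature engineering step would then need to handle.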
Feature Engineering:
Applied Principal Component Analysis (PCA) to resolve multicollinearity, creating composite features like energy_loudness_pca.
Selected features with significant correlations to the target variable (popularity) while excluding irrelevant ones.
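To illustrate how a composite such as energy_loudness_pca can be built, here is a small sketch for the two-feature case. It assumes both features are standardized first and uses the closed-form result for a 2x2 correlation matrix. The data values and the helper names standardize and first_component_2d are hypothetical, and in practice a library implementation of PCA would more likely be used.

```python
from math import sqrt

def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def first_component_2d(x, y):
    """First principal component of two standardized features.

    With correlation r, the covariance matrix of the standardized pair is
    [[1, r], [r, 1]]. Its leading eigenvector is (1, sign(r)) / sqrt(2) with
    eigenvalue 1 + |r|, so the component carries (1 + |r|) / 2 of the total
    variance. Constant features (zero variance) are not handled here.
    """
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
    sign = 1.0 if r >= 0 else -1.0
    scores = [(a + sign * b) / sqrt(2) for a, b in zip(zx, zy)]
    explained = (1 + abs(r)) / 2
    return scores, explained

# Made-up illustrative values, NOT taken from the project's dataset.
energy = [0.42, 0.55, 0.63, 0.71, 0.80, 0.91]
loudness = [-11.2, -9.8, -8.1, -6.9, -5.5, -4.0]

energy_loudness_pca, explained_ratio = first_component_2d(energy, loudness)
print([round(s, 3) for s in energy_loudness_pca])
print(round(explained_ratio, 3))
```

The single composite column replaces the two correlated inputs, so the model no longer sees the same information twice.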
Model Training and Evaluation:
Trained three models: Linear Regression, Random Forest, and XGBoost.
Evaluated models using error metrics and residual analysis, identifying Random Forest as the best-performing of the three.
Conducted hyperparameter tuning for improved performance.
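The error metrics used for this comparison can be computed as in the sketch below. It assumes the metrics are RMSE and R-squared, as reported in the results, and uses made-up popularity scores. The names baseline and candidate are placeholders for two sets of model predictions, not outputs from the project's models.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def residuals(y_true, y_pred):
    """Per-song errors (actual minus predicted), the input to residual analysis."""
    return [t - p for t, p in zip(y_true, y_pred)]

# Made-up illustrative popularity scores, NOT the project's data.
actual = [40, 55, 62, 70, 48]
baseline = [55, 55, 55, 55, 55]    # always predicts the mean
candidate = [43, 53, 60, 66, 50]   # hypothetical better model

for name, preds in (("baseline", baseline), ("candidate", candidate)):
    print(name, round(rmse(actual, preds), 3), round(r_squared(actual, preds), 3))
print(residuals(actual, candidate))
```

A model that always predicts the mean scores an R-squared of zero, which is a useful reference point when reading a low baseline value.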
Model Application:
Developed a demo app that classifies songs into categories (Not Popular, Moderately Popular, Highly Popular) based on song attributes.
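The final step of such an app, mapping a predicted popularity score to one of the three categories, might look like the sketch below. The cutoffs of 40 and 70 are assumptions chosen for illustration, since the thresholds used in the demo app are not stated here. The function name popularity_category is likewise hypothetical.

```python
def popularity_category(score, moderate_min=40.0, high_min=70.0):
    """Map a predicted popularity score to a category label.

    The default cutoffs are illustrative assumptions, not the project's values.
    """
    if score >= high_min:
        return "Highly Popular"
    if score >= moderate_min:
        return "Moderately Popular"
    return "Not Popular"

# Hypothetical predicted scores for three songs.
for predicted in (25.0, 55.5, 82.0):
    print(predicted, popularity_category(predicted))
```

Keeping the cutoffs as parameters makes it easy to adjust them once the distribution of predicted scores is known.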
Results:
The Random Forest model outperformed the baseline Linear Regression model by a wide margin. Specifically:
Root Mean Squared Error (RMSE): Reduced from 7.97 to 6.01, an improvement of approximately 24.6%, indicating better predictive accuracy.
R-squared (Explained Variance): Increased from 0.045 to 0.457, roughly a tenfold gain (about 916% in relative terms). The Random Forest explains close to half of the variance in song popularity, whereas the linear baseline explains almost none.
Despite promising results, biases in the dataset and the presence of outliers limited the model’s predictive accuracy.
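The percentage figures reported for RMSE and R-squared are relative changes against the Linear Regression baseline. The short check below reproduces them from the four reported values. The helper name relative_change is introduced here for illustration.

```python
def relative_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100

# Values as reported in the results.
rmse_reduction = -relative_change(7.97, 6.01)   # a drop, so negate the change
r2_increase = relative_change(0.045, 0.457)

print(f"RMSE reduction: {rmse_reduction:.1f}%")     # 24.6%
print(f"R-squared increase: {r2_increase:.1f}%")    # 915.6%
```

Because the baseline R-squared is so close to zero, the relative increase looks very large. The absolute value of 0.457 is the more informative number.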
Conclusion:
This case study demonstrates the potential of machine learning in music analytics, enabling stakeholders such as labels, independent artists, and streaming platforms to make data-informed decisions. With access to larger and less biased data, the model could better predict popularity trends and support the discovery of emerging talent, aligning data-driven insights with creative expression.



