Spotify Wonderland
This is a story all about how I went on a grand adventure of music exploration (one that took way more time than I expected) with my Spotify user data.
Tableau Visualization
<div class='tableauPlaceholder' id='viz1736711076395' style='position: relative'>
<noscript>
<a href='#'>
<img alt='Spotify Listening History (an analysis of the majority of my Spotify listening history from 01/2014 to 06/2024)'
src='https://public.tableau.com/static/images/Sp/SpotifyListeningHistory-EmbeddedforMediumArticle/Spotify-iPadPortrait/1_rss.png' style='border: none' />
</a>
</noscript>
<object class='tableauViz' style='display:none;'>
<param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' />
<param name='embed_code_version' value='3' />
<param name='site_root' value='' />
<param name='name' value='SpotifyListeningHistory-EmbeddedforMediumArticle/Spotify-iPadPortrait' />
<param name='tabs' value='no' />
<param name='toolbar' value='yes' />
<param name='static_image' value='https://public.tableau.com/static/images/Sp/SpotifyListeningHistory-EmbeddedforMediumArticle/Spotify-iPadPortrait/1.png' />
<param name='animate_transition' value='yes' />
<param name='display_static_image' value='yes' />
<param name='display_spinner' value='yes' />
<param name='display_overlay' value='yes' />
<param name='display_count' value='yes' />
<param name='language' value='en-US' />
</object>
</div>
<script type='text/javascript'>
var divElement = document.getElementById('viz1736711076395');
var vizElement = divElement.getElementsByTagName('object')[0];
vizElement.style.width='100%'; // Set width to 100%
vizElement.style.height='800px'; // Set height to a fixed value or remove this line for automatic height
var scriptElement = document.createElement('script');
scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';
vizElement.parentNode.insertBefore(scriptElement, vizElement);
</script>
Navigating the Spotify API: A Data Scientist’s Guide to Genre Analysis
My journey into analyzing my Spotify listening history began with a quest to extract valuable genre information from the Spotify API. This involved overcoming challenges in authentication, API requests, and data manipulation within a cloud environment.
Setting Up the Spotify API
* Authentication: I employed the Spotipy library to manage the Spotify API authentication flow. Working in the PythonAnywhere cloud environment required a manual authorization process, which involved copying the authorization URL, logging in to Spotify, and pasting the code back into the console. To streamline this, I cached the access token for subsequent use.
* API Requests: The requests library was instrumental in interacting with Spotify’s /tracks and /artists endpoints. I implemented batch requests to efficiently fetch track details and used a cache to store artist genre information, preventing redundant API calls. Additionally, I incorporated error handling and exponential backoff to address potential API errors and rate limits.
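The batching, caching, and backoff described above can be sketched roughly as follows. This is a minimal illustration, not my exact script: the helper names (`chunked`, `get_with_backoff`, `fetch_artist_genres`) are invented for this sketch, and the bearer token is assumed to come from Spotipy's `SpotifyOAuth` flow (with a `cache_path` so the manual authorization dance only happens once).

```python
import time
import requests

API = "https://api.spotify.com/v1"

def chunked(items, size=50):
    # The /tracks and /artists endpoints accept up to 50 IDs per request
    return [items[i:i + size] for i in range(0, len(items), size)]

def get_with_backoff(url, headers, params, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params)
        if resp.status_code == 429:
            # Rate limited: honor Retry-After if present, else back off exponentially
            time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("gave up after repeated rate-limit responses")

genre_cache = {}  # artist_id -> list of genres, so repeat artists cost no extra calls

def fetch_artist_genres(artist_ids, token):
    headers = {"Authorization": f"Bearer {token}"}
    missing = [a for a in dict.fromkeys(artist_ids) if a not in genre_cache]
    for batch in chunked(missing):
        data = get_with_backoff(API + "/artists", headers, {"ids": ",".join(batch)})
        for artist in data["artists"]:
            genre_cache[artist["id"]] = artist["genres"]
    return {a: genre_cache.get(a, []) for a in artist_ids}
```

The cache dictionary is what keeps the call count manageable: in a listening history, the same artists appear thousands of times, but each one only costs one API lookup.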
Data Preparation and Transformation
* Loading and Cleaning: My Spotify listening history, stored in cleaned JSON files, was loaded and prepared for analysis.
* Extracting Track URIs: I extracted the spotify_track_uri field, filtering out any invalid or unavailable tracks.
* Fetching Genre Information: Using the Spotify API, I retrieved genre information for each track and stored the results in a dictionary.
* Creating DataFrames: The dictionary was then transformed into a Pandas DataFrame, with a separate DataFrame to log any unprocessed tracks.
* Exporting to CSV: Finally, I saved the DataFrames as CSV files, ready for further analysis and visualization.
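The pipeline above, from URIs to DataFrames, looks roughly like this. It is a sketch under assumptions: `get_track_genres` stands in for the real API lookup, the `spotify_track_uri` field name matches Spotify's data export, and the rest of the structure (de-duplication, a failure log, the CSV export) mirrors the steps described above.

```python
import pandas as pd

# Stand-in for the real Spotify API lookup; history_records would come from
# json.load() over the cleaned listening-history files.
def get_track_genres(uri):
    return ["indie pop"]  # placeholder result

def build_genre_frames(history_records):
    # Extract track URIs, filtering out entries with no playable track
    uris = [r["spotify_track_uri"] for r in history_records
            if r.get("spotify_track_uri")]
    genres, failures = {}, []
    for uri in dict.fromkeys(uris):  # de-duplicate while preserving order
        try:
            genres[uri] = get_track_genres(uri)
        except Exception as exc:  # log unprocessed tracks instead of crashing
            failures.append({"uri": uri, "error": str(exc)})
    genre_df = pd.DataFrame(
        [{"uri": u, "genres": "; ".join(g)} for u, g in genres.items()]
    )
    failed_df = pd.DataFrame(failures, columns=["uri", "error"])
    return genre_df, failed_df

# genre_df.to_csv("track_genres.csv", index=False) completes the export step
```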
Overcoming Challenges
* Cloud Authentication: The cloud environment (PythonAnywhere) presented challenges for authentication, which I addressed using a manual authorization process.
* Rate Limits: I tackled Spotify’s API rate limits by implementing exponential backoff and batching techniques.
* Unavailable Tracks: I systematically identified and logged unavailable tracks for further investigation.
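The backoff pattern generalizes beyond Spotify. As one way to package it (a generic sketch, not the exact code I used), a decorator can wrap any flaky API call:

```python
import functools
import time

def with_backoff(max_retries=5, base_delay=1.0, retry_on=(RuntimeError,)):
    """Retry the wrapped call, sleeping base_delay, 2*base_delay, 4*base_delay, ..."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```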
This exploration of the Spotify API and data preparation has equipped me with the knowledge and tools to analyze my listening habits, uncover patterns, and gain insights into my musical preferences.
Song Prediction with Spotify Data: A Machine Learning Approach
This script delves into the world of music and data science, aiming to predict the popularity of songs, specifically songs I may like, using machine learning techniques applied to Spotify data.
Data: The Heart of the Matter
The journey begins with data, the lifeblood of any machine learning project. In this case, the data comprises various Spotify track metrics, including audio features like danceability, energy, and tempo, as well as other relevant information such as play counts and potentially artist popularity. This data can be sourced from BigQuery, Google Cloud’s powerful data warehousing solution, or from files stored on Google Cloud Storage (GCS).
Preprocessing: Preparing the Data for Action
Raw data is rarely ready for direct consumption by machine learning algorithms. It often requires preprocessing to ensure its quality and compatibility. This script performs several crucial preprocessing steps:
- Aggregation: The data is grouped by track name, aggregating information such as total play time, average play time, and genres associated with each track.
- Feature Engineering: New features are created from existing ones to potentially improve model performance. For example, the total play time is converted to session length in hours, and a flag is added to indicate whether a listening session was longer than two hours.
- Feature Scaling: Numerical features are scaled using a technique called standardization. This ensures that all features have a similar range of values, preventing features with larger values from dominating the model’s learning process.
- Data Splitting: The dataset is divided into training and test sets. The training set is used to train the machine learning model, while the test set is held back to evaluate the model’s performance on unseen data.
- Label Encoding: The artist names, which serve as the target variable (what we’re trying to predict), are encoded into numerical labels using LabelEncoder. This is necessary because many machine learning algorithms require numerical input.
- Handling Unsupported Data Types: Any features with unsupported data types (e.g., text or categorical data) are converted into numerical representations using Label Encoding.
- Missing Value Imputation: Missing values in the dataset are handled using a technique called imputation, where they are replaced with estimated values (e.g., the mean of the column).
- Data Type Conversion: All features are converted to a consistent data type (float64) to ensure compatibility with the machine learning algorithm.
- Class Balancing: If there’s an imbalance in the dataset (e.g., more non-hit songs than hit songs), techniques like SMOTE-Tomek are applied to balance the classes and prevent the model from being biased towards the majority class.
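Most of the preprocessing steps above can be condensed into a single scikit-learn sketch. The column names (`ms_played`, `artist_name`, `danceability`) are assumptions for illustration, the aggregation step is omitted, and SMOTE-Tomek (from the imbalanced-learn package) would be applied to the training split after this function returns:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(df, target_col="artist_name"):
    df = df.copy()
    # Feature engineering: milliseconds played -> session hours, plus a long-session flag
    df["session_hours"] = df["ms_played"] / 3_600_000
    df["long_session"] = (df["session_hours"] > 2).astype(int)
    # Encode the target (artist names) into numerical labels
    y = LabelEncoder().fit_transform(df[target_col])
    X = df.drop(columns=[target_col])
    # Convert any remaining text/categorical columns to numeric codes
    for col in X.select_dtypes(include="object").columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))
    # Impute missing values with the column mean, then cast to a uniform float64
    X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                     columns=X.columns).astype("float64")
    # Standardize so no feature dominates by scale alone
    X[:] = StandardScaler().fit_transform(X)
    # Hold out a test set; the resampling step would then balance the training split
    return train_test_split(X, y, test_size=0.2, random_state=42)
```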
Model Training: The Learning Process
With the data preprocessed and ready, the script moves on to model training. An XGBoost classifier is chosen for this task. XGBoost is a powerful gradient boosting algorithm known for its high accuracy and efficiency in classification problems. The script uses RandomizedSearchCV to find the best hyperparameters for the XGBoost model. This involves exploring different combinations of hyperparameter values and evaluating their performance using cross-validation. The model is trained on the resampled training data, and its performance is evaluated using metrics like accuracy and F1-score.
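The tuning loop looks roughly like this. For a self-contained sketch I use scikit-learn's GradientBoostingClassifier and synthetic data; in the real script, xgboost's XGBClassifier and the resampled training set drop into the same estimator slot, and the hyperparameter ranges shown are illustrative rather than the ones I actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the resampled training data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=5,        # sample 5 hyperparameter combinations
    cv=3,            # score each with 3-fold cross-validation
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)
test_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))
print("best params:", search.best_params_)
print("test F1:", round(test_f1, 3))
```

Randomized search trades exhaustiveness for speed: instead of trying every combination like a grid search, it samples `n_iter` of them, which is usually enough to land near the best region of the search space.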
Evaluation and Saving: Assessing and Preserving the Model
Once the model is trained, its performance is evaluated on the held-out test set. Metrics like accuracy and a classification report provide insights into how well the model generalizes to unseen data. The best-performing model, along with its predictions and relevant metrics, is saved for future use or deployment. The script also identifies and displays the top-performing hyperparameter sets, which can be useful for further analysis or optimization.
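The evaluate-then-persist pattern can be shown in miniature. Here a simple LogisticRegression on synthetic data stands in for the tuned model; the point is the `classification_report` readout and the joblib save/load round trip, which works identically for the real best estimator:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Stand-in model; the tuned best_estimator_ would be saved the same way
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

print("accuracy:", round(accuracy_score(y_test, preds), 3))
print(classification_report(y_test, preds))

# Persist the fitted model, then confirm the reload gives identical predictions
model_path = os.path.join(tempfile.gettempdir(), "best_model.joblib")
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
assert (reloaded.predict(X_test) == preds).all()
```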
In Essence
This script demonstrates a typical machine-learning workflow, highlighting the importance of data preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation. By applying these techniques to Spotify data, it aims to predict the elusive quality of a “hit” song, offering a glimpse into the potential of data science in understanding and predicting musical trends.