Music Recommendation System

Vineet Mukesh Haswani
Jan 1, 2021

This blog walks through the implementation of a music recommendation system built on the dataset of KKBox's Music Recommendation Challenge, achieving a Kaggle score of 0.65 (highest = 0.74).

Table of Contents:

  1. Introduction
  2. Dataset
  3. Exploratory Data Analysis(EDA)
  4. Handling Missing Values and Outliers
  5. Feature Engineering
  6. Modeling
  7. Results and Deployment
  8. Future work
  9. Profile
  10. References

1. Introduction :

A music recommendation system recommends songs that a user is likely to enjoy based on their previous activity, such as searches and previously played songs. There are several methods for building a recommendation system, such as Collaborative Filtering, Content-Based Filtering, and Hybrid methods (combining Content-Based and Collaborative Filtering).

In this challenge, we have to build a recommendation system that can predict whether a user will listen to a song again within one month after the user’s very first observable listening event in the KKBox application. If the user did not listen to the song again within one month, the target variable will be 0, and 1 otherwise.

Business Problem: Solving this helps KKBox recommend songs to users, assign ratings to songs, and determine a user's taste in music.

ML Formulation: Build a recommendation system using collaborative-filtering-based techniques such as matrix factorization and learned embeddings.

Performance Metric: The performance metric for the challenge is the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score. As both classes are balanced in the dataset, we choose ROC AUC rather than the F1-score. We could also choose accuracy over ROC AUC since the classes are balanced, but the advantage of ROC AUC is that it lets us choose an appropriate decision threshold when using linear models such as Logistic Regression.
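As a quick illustration of the metric, ROC AUC is computed from predicted probabilities rather than hard labels; a minimal sketch with scikit-learn (the arrays are toy placeholders, not from the dataset):

from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of the positive class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.7, 0.6, 0.4, 0.9]

print("ROC AUC:", roc_auc_score(y_true, y_prob))  # 1.0 here: every positive outranks every negative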

2. Dataset:

KKBox provides a training dataset consisting of information about the first observable listening event for each unique user-song pair within a specific time frame. Metadata for each unique user and song pair is also provided. The use of public data to increase the accuracy of the predictions is encouraged.

In this challenge, four CSV files are provided: song information, user information, song extra information, and a file of user-song interactions together with the application information. We need to merge the interaction file with the song information file on the song id (song_id), and with the user information file on the user id (msno).

# left-join user (member) metadata and song metadata onto the interaction data
df = df.merge(members, on='msno', how='left')
df = df.merge(songs, on='song_id', how='left')

By merging all of the CSV files in this way, we create our final dataset.

3. Exploratory Data Analysis (EDA) :

In this section, we will try to answer questions like

i. What does each feature of the dataset mean?

ii. What is the relation between the features and the class label?

iii. What are the relationships among the features?

iv. Which features are the most important for classification?

The dataset contains mostly categorical features with only a few continuous ones, so most of the plots below are bar plots.
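Since most features are categorical, a typical EDA plot here is a count plot of a feature split by the target. A minimal sketch with seaborn, assuming the merged dataframe df and the KKBox column names city and target:

import seaborn as sns
import matplotlib.pyplot as plt

# bar plot of the number of listening events per city, split by the target label
sns.countplot(x='city', hue='target', data=df)
plt.title('City vs target')
plt.show()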

Categorical Features :

City:

Univariate and Multi-variate Plots for City Feature

The aim of these plots is to explore the city feature and determine its importance in modeling.

From the above plots, we can conclude that:

  1. The majority of KKBox users are from cities 1 and 13.
  2. Cities 1 and 13 are likely high-population cities, since they have the most users.
  3. Except for city 1, all cities mostly prefer registration method 13; only city 1 prefers registration method 7.
  4. The two most popular languages are 3.0 and 59.0.

Artist Name:

Univariate and Multi-variate Plot for Artist Name Feature

The aim of these plots is to explore the artist name feature and determine its importance in modeling.

From the above plots, we can conclude that:

  1. The two most popular artists are Jay Chou and Mayday.
  2. 10% of artist names are missing.
  3. 4.1% of artists are unknown, listed under the name "Various Artists".
  4. Jay Chou sings only in genre id 458 with language 3.0.
  5. Mayday sings only in genres 458 and 465, only with language 3.0.
  6. The Chainsmokers is the only artist with genre id 1609 and language 52.0.
  7. Alan Walker is the only artist with genre ids 1616 and 1609 and language 52.0.
  8. Jay Chou is more popular among males than females.
  9. Bigbang is popular only among males, not females.
  10. Sodagreen is popular only among females, not males.
  11. Jay Chou and Mayday are the two most popular artists in cities 1 and 13.
  12. All the artist names in cities 4, 6, 15, and 22 are missing.
  13. Mayday and JJ Lin are the most popular artists in genre id 465.
  14. Jay Chou and Mayday are the most popular artists in genre id 458.
  15. The Chainsmokers is the only popular artist in genre id 1609.
  16. Alan Walker is popular in genre ids 1616 and 1609.

Application Information :

Univariate and Multi-variate Plots for application information

The aim of these plots is to explore the application-information features (source system tab, source screen name, and source type) and determine their importance in modeling.

From the above plots, we can conclude that:

  1. The majority of users play songs from local storage, i.e. My Library.
  2. Most people discover songs through Discover and Search, and then use My Library to play them.
  3. 30.9% of users listen to songs repeatedly from My Library.
  4. 17.3% of users who discovered songs did not listen to them more than once.
  5. 43.8% of users use local playlists for playing songs.
  6. Most users appear to be specific about their songs, as they rarely even use the My Library search.
  7. 27.9% of users listen to songs repeatedly from a local playlist.
  8. 5.0% of users who listen to songs from the radio did not listen to them more than once.
  9. 72% of users use the app through the local library, online playlists, and local playlists.
  10. 19.4% of users listen to songs repeatedly from the local library.
  11. 5.1% of users who listen to songs from the radio did not listen to them more than once.

Other Features :

Univariate and Missing value plots

The aim of these plots is to explore features like gender, genre, language, and registration method, along with their missing values, to determine the importance of these features in modeling.

From the above plots, we can conclude that:

  1. Gender, composer, and lyricist are highly sparse, i.e., they have a high missing-value rate, so we need to handle them carefully.
  2. The dataset is balanced with respect to the target variable.
  3. Nothing can be judged from the gender plot due to the high missing-value rate.
  4. 54.8% of users listen to songs in language 3.0.
  5. It can be concluded that language 3.0 is the local language of the region/country where the app is used.
  6. 51.7% of users listen to songs with genre id 465.
  7. 72.8% of users have registered via registration methods 7 and 9.

Continuous Features :

Length and Age:

PDF, CDF, and Boxplot for song length and age

The aim of these plots is to explore the age and song length features and determine their importance in modeling.

From the above plots, we can conclude that:

  1. The age and song length features have outliers.
  2. The majority of age values lie between 0 and 100.
  3. Most song lengths lie between 191,518 and 395,947 ms.
  4. Applying a log transformation to song length makes sense for reducing the effect of outliers.

4. Handling Missing Values and Outliers :

Categorical Features :

Most of the missing values are in gender, composer, genre id, and source screen name. Since composer and lyricist have both a large number of categories and a large fraction of missing values, we drop both features. For source screen name, genre id, and gender, we treat the missing values as a new category.
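A minimal sketch of this handling with pandas, assuming the merged dataframe df and the KKBox column names:

# composer and lyricist have too many categories and too many missing values, so drop them
df = df.drop(columns=['composer', 'lyricist'])

# treat missing values in the remaining sparse categorical features as a new category
for col in ['source_screen_name', 'genre_ids', 'gender']:
    df[col] = df[col].fillna('missing')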

Continuous Features :

The continuous features song length and age contain outliers. We handle the song length outliers by applying a log transformation to the feature. Since age should vary from 0 to 100, all values outside this range are clipped to the nearest boundary.

Missing values of song length are filled with the median song length computed on the training set.

Transformation of song length feature
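The transformation can be sketched as follows, assuming the merged dataframe df; in the KKBox files, song length is song_length (in ms) and age is bd:

import numpy as np

# fill missing song lengths with the training-set median, then log-transform to tame outliers
median_length = df['song_length'].median()  # compute this on the training split only
df['song_length'] = np.log1p(df['song_length'].fillna(median_length))

# clip age to the plausible 0-100 range
df['bd'] = df['bd'].clip(lower=0, upper=100)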

5. Feature Engineering :

Feature engineering is one of the most important parts of handling data, so we will take each feature and apply feature engineering to it.

Registration and Expiration Date :

The given input format for the dates is YYYY-DD-MM, so we convert the registration date feature into three features: registration year, month, and day. We apply the same to the expiration date feature.
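A minimal sketch of this decomposition with pandas; the column names registration_init_time and expiration_date follow the KKBox members file, and the format string is an assumption to be adjusted to the raw data:

import pandas as pd

# split each date column into year / month / day features
for col, prefix in [('registration_init_time', 'registration'), ('expiration_date', 'expiration')]:
    dates = pd.to_datetime(df[col], format='%Y-%d-%m')  # adjust format= to match the raw dates
    df[prefix + '_year'] = dates.dt.year
    df[prefix + '_month'] = dates.dt.month
    df[prefix + '_day'] = dates.dt.day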

User Id and Song Id :

We will create a 30-dimensional embedding for each user id and song id using the target variable. To create these embeddings, we use a Collaborative Filtering technique.

The Collaborative Filtering used here is matrix factorization: we initialize a (number of users x 30) user-embedding matrix and a (number of songs x 30) song-embedding matrix, and build a sparse interaction matrix whose rows correspond to users, whose columns correspond to songs, and whose entries are the target values of the respective user-song pairs. The embeddings are then learned by minimizing the squared reconstruction error of this matrix using gradient descent with momentum and L2 regularization, as in the code below.

import numpy as np
from scipy.sparse import csc_matrix

# row and cols are the user/song index arrays for the observed pairs, and map_msno /
# map_song_id map the raw ids to those indices; they are built earlier in the notebook
lmbda = 0.0002  # L2 regularization strength

def predict(df, emb_user, emb_song):
    # score of each observed pair = dot product of its user and song embeddings
    df['prediction'] = np.sum(np.multiply(
        emb_song[df['song_id'].apply(lambda x: map_song_id[x])],
        emb_user[df['msno'].apply(lambda x: map_msno[x])]), axis=1)
    return df

def cost(df, Y, emb_user, emb_song):
    # mean squared reconstruction error over the observed entries
    predicted = predict(df, emb_user, emb_song)
    predicted = csc_matrix((df.prediction.values, (row, cols)), dtype=np.int8)
    return np.sum((Y - predicted).power(2)) / df.shape[0]

def create_embeddings(n, K):
    return np.random.random((n, K))

def gradient(df, Y, emb_user, emb_song):
    predicted = predict(df, emb_user, emb_song)
    predicted = csc_matrix((df.prediction.values, (row, cols)), dtype=np.int8)
    delta = (Y - predicted)
    grad_user = (-2 / df.shape[0]) * (delta * emb_song) + 2 * lmbda * emb_user
    grad_song = (-2 / df.shape[0]) * (delta.T * emb_user) + 2 * lmbda * emb_song
    return grad_user, grad_song

Y = csc_matrix((df.target.values, (row, cols)), dtype=np.int8)
emb_user = create_embeddings(30755, 30)   # number of unique users x 30
emb_song = create_embeddings(359966, 30)  # number of unique songs x 30

# gradient descent with momentum
beta = 0.9
grad_user, grad_song = gradient(df, Y, emb_user, emb_song)
v_user = grad_user
v_song = grad_song
for i in range(500):
    grad_user, grad_song = gradient(df, Y, emb_user, emb_song)
    v_user = beta * v_user + (1 - beta) * grad_user
    v_song = beta * v_song + (1 - beta) * grad_song
    emb_user = emb_user - 1 * v_user
    emb_song = emb_song - 1 * v_song
    print("\niteration", i + 1, ":")
    print("train mse:", cost(df, Y, emb_user, emb_song))

Genre Id :

Genre id has a large number of categories, and most of them account for only a small percentage of the rows. We therefore combine all categories that contribute less than 1% of the feature's values into a single category.
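A minimal sketch of grouping the rare categories, assuming the merged dataframe df and the KKBox column name genre_ids:

# compute each genre's share of rows and merge categories below 1% into a single 'other' bucket
genre_share = df['genre_ids'].value_counts(normalize=True)
rare_genres = genre_share[genre_share < 0.01].index
df['genre_ids'] = df['genre_ids'].where(~df['genre_ids'].isin(rare_genres), 'other')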

Source System Tab, Source Screen Name, Source Type, City, Registered via, Language and Gender :

We apply one-hot encoding to all of these features, treating missing values as a new category.
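A minimal sketch with pandas, where dummy_na=True creates the extra indicator column for missing values (the column list follows the KKBox files):

import pandas as pd

# one-hot encode the categorical features, with a separate column for missing values
cat_cols = ['source_system_tab', 'source_screen_name', 'source_type',
            'city', 'registered_via', 'language', 'gender']
df = pd.get_dummies(df, columns=cat_cols, dummy_na=True)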

6. Modeling :

For modeling, we tried different models like

i. Logistic Regression with calibrated classifier,

ii. SGDClassifier with calibrated classifier,

iii. Decision Tree with calibrated classifier,

iv. Random Forest Classifier,

v. AdaBoost Classifier,

vi. XGBoost Classifier,

vii. custom model (code is mentioned below)

For each of these models, we apply hyperparameter tuning and choose the best hyperparameters for the respective model. We were unable to apply a calibrated classifier to the remaining models because of the high computation required.
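As an illustration, here is a minimal sketch of tuning and calibrating one of the simpler models (Logistic Regression); the parameter grid is illustrative, and x_train and y_train are assumed to come from the merged dataset. The custom model, a stacking ensemble of decision-tree base models with a logistic-regression meta model, follows after it.

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

# tune the regularization strength with cross-validated ROC AUC
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    scoring='roc_auc', cv=3)
grid.fit(x_train, y_train)

# wrap the best model in a calibrated classifier to get better-behaved probabilities
calibrated = CalibratedClassifierCV(grid.best_estimator_, method='sigmoid', cv=3)
calibrated.fit(x_train, y_train)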

import numpy as np
from tqdm import tqdm
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def custom_model(x_train, y_train, x_test, y_test, n_estimators=10, alpha=100, max_depth=40):
    # split the training data: D1 trains the base models, D2 trains the meta model
    d1_x_train, d2_x_train, d1_y_train, d2_y_train = train_test_split(
        x_train, y_train, test_size=0.5, random_state=0, stratify=y_train)
    predictions = []
    base_models = []
    for i in tqdm(range(n_estimators)):
        # generating_samples is a helper defined in the notebook that draws a bootstrap sample from D1
        x, y, rows = generating_samples(d1_x_train, d1_y_train)
        # train a base decision tree and record its predicted probabilities on D2
        base_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
        base_model.fit(x, y)
        pred = base_model.predict_proba(d2_x_train)[:, 1].reshape(-1, 1)
        predictions.append(pred)
        base_models.append(base_model)
    predictions = np.array(predictions).T
    predictions = predictions.reshape(-1, n_estimators)
    # train the meta model (logistic regression) on the base models' predictions for D2
    meta_model = LogisticRegression(penalty='l2', C=alpha, class_weight='balanced')
    meta_model.fit(predictions, d2_y_train)
    y_pred = meta_model.predict_proba(predictions)
    score = roc_auc_score(d2_y_train, y_pred[:, 1])
    print('AUC Score of Model on train set is :', score)
    # calculate the AUC ROC score on the test set
    predictions = []
    for base_model in base_models:
        pred = base_model.predict_proba(x_test)[:, 1].reshape(-1, 1)
        predictions.append(pred)
    predictions = np.array(predictions).T
    predictions = predictions.reshape(-1, n_estimators)
    y_pred = meta_model.predict_proba(predictions)
    score = roc_auc_score(y_test, y_pred[:, 1])
    print('AUC Score of Model on test set is :', score)
    return base_models, meta_model, y_pred

base_models, meta_model, y_pred = custom_model(
    x_train, y_train, x_test, y_test, n_estimators=100, alpha=100, max_depth=10)

7. Results and Deployment:

After obtaining the best hyperparameters, we trained all the models with them and obtained results on the test set.

Results of the models on the test data

After submitting on Kaggle, we obtained a score of 0.65, where the highest score is 0.74.

Submission score on Kaggle: 0.653

Deployed Link on AWS :

8. Future Work :

We got the best results with the Random Forest Classifier with n_estimators equal to 100. Results could be improved further by applying CalibratedClassifierCV to the Random Forest Classifier, but this requires more than 35 GB of RAM for computation and we were limited by hardware.

Another direction is implementing user-user similarity for better song recommendations.

9. Profile :

For the complete code, visit the ipynb notebook :

Stay connected with me on LinkedIn

10. References :

  1. https://www.researchgate.net/publication/328838360_KKbox's_Music_Recommendation_Challenge_Solution_with_Feature_engineering_11th_ACM_International_Conference_on_Web_Search_and_Data_Mining_WSDM_2018_February_5–9_2018_Los_Angeles_California_USA_WSDM_Cup
  2. https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html
  3. https://medium.com/towards-artificial-intelligence/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444
  4. https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/
  5. http://cs229.stanford.edu/proj2019spr/report/4.pdf
  6. AppliedAICourse.com

I hope this blog is helpful.

