Music Recommendation System

Vineet Mukesh Haswani
Jan 1, 2021

This blog walks through the implementation of a music recommendation system built on the dataset of KKBox's Music Recommendation Challenge, achieving a Kaggle score of 0.65 (highest = 0.74).

Table of Contents:

  1. Introduction
  2. Dataset
  3. Exploratory Data Analysis(EDA)
  4. Handling Missing Values and Outliers
  5. Feature Engineering
  6. Modeling
  7. Results and Deployment
  8. Future work
  9. Profile
  10. References

1. Introduction :

A music recommendation system recommends songs that a user is likely to enjoy based on their previous activity, such as searches and previously played songs. There are several methods for building a recommendation system, such as Collaborative Filtering, Content-Based Filtering, and Hybrid methods (combining Content-Based and Collaborative Filtering).

In this challenge, we have to build a recommendation system that can predict whether a user will listen to a song again within one month after the user’s very first observable listening event in the KKBox application. If the user did not listen to the song again within one month, the target variable will be 0, and 1 otherwise.

Business Problem: Solving this helps KKBox recommend songs to users, assign ratings to songs, and determine a user's taste in music.

ML Formulation: Build a recommendation system using collaborative-filtering-based techniques such as matrix factorization and learned embeddings.

Performance Metric: The performance metric for the challenge is the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score. As both classes are balanced in the dataset, we choose ROC AUC rather than the F1-score. We could also choose accuracy over ROC AUC since the classes are balanced, but the advantage of ROC AUC is that it lets us choose an appropriate decision threshold when using linear models such as Logistic Regression.
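As a quick illustration of the metric, ROC AUC is computed from predicted probabilities rather than hard labels; a minimal sketch with scikit-learn (the arrays are toy placeholders, not from the dataset):

from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of the positive class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.7, 0.6, 0.4, 0.9]

print("ROC AUC:", roc_auc_score(y_true, y_prob))  # 1.0 here: every positive outranks every negative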

2. Dataset:

KKBox provides a training dataset consisting of information about the first observable listening event for each unique user-song pair within a specific time frame. Metadata for each unique user and song pair is also provided. The use of public data to increase the accuracy of the predictions is encouraged.

In this challenge, four CSV files are provided: song information, user information, song extra information, and a file of user-song interactions together with the application information. We need to merge the interaction file with the song information file on the song id (song_id), and with the user information file on the user id (msno).

# left-join user (member) metadata and song metadata onto the interaction data
df = df.merge(members, on='msno', how='left')
df = df.merge(songs, on='song_id', how='left')

By merging all of the CSV files in this way, we create our final dataset.

3. Exploratory Data Analysis (EDA) :

In this section, we will try to answer questions like

i. What does each feature of the dataset mean?

ii. What is the relation between the features and the class label?

iii. What are the relationships among the features?

iv. Which features are the most important for classification?

The dataset contains mostly categorical features with only a few continuous ones, so most of the plots below are bar plots.
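Since most features are categorical, a typical EDA plot here is a count plot of a feature split by the target. A minimal sketch with seaborn, assuming the merged dataframe df and the KKBox column names city and target:

import seaborn as sns
import matplotlib.pyplot as plt

# bar plot of the number of listening events per city, split by the target label
sns.countplot(x='city', hue='target', data=df)
plt.title('City vs target')
plt.show()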

Categorical Features :

City:

Univariate and Multi-variate Plots for City Feature

The aim of these plots is to explore the city feature and determine its importance in modeling.

From the above plots, we can conclude that:

  1. The majority of KKBox users are from cities 1 and 13.
  2. Cities 1 and 13 are likely high-population cities, since they have the most users.
  3. Except for city 1, all cities mostly prefer registration method 13; only city 1 prefers registration method 7.
  4. The two most popular languages are 3.0 and 59.0.

Artist Name:

Univariate and Multi-variate Plot for Artist Name Feature

The aim of these plots is to explore the artist name feature and determine its importance in modeling.

From the above plots, we can conclude that:

  1. The two most popular artists are Jay Chou and Mayday.
  2. 10% of artist names are missing.
  3. 4.1% of artists are unknown, listed under the name "Various Artists".
  4. Jay Chou sings only in genre id 458 with language 3.0.
  5. Mayday sings only in genres 458 and 465, only with language 3.0.
  6. The Chainsmokers is the only artist with genre id 1609 and language 52.0.
  7. Alan Walker is the only artist with genre ids 1616 and 1609 and language 52.0.
  8. Jay Chou is more popular among males than females.
  9. Bigbang is popular only among males, not females.
  10. Sodagreen is popular only among females, not males.
  11. Jay Chou and Mayday are the two most popular artists in cities 1 and 13.
  12. All the artist names in cities 4, 6, 15, and 22 are missing.
  13. Mayday and JJ Lin are the most popular artists in genre id 465.
  14. Jay Chou and Mayday are the most popular artists in genre id 458.
  15. The Chainsmokers is the only popular artist in genre id 1609.
  16. Alan Walker is popular in genre ids 1616 and 1609.

Application Information :

Univariate and Multi-variate Plots for application information

The aim of these plots is to explore the application-information features (source system tab, source screen name, and source type) and determine their importance in modeling.

From the above plots, we can conclude that:

  1. The majority of users play songs from local storage, i.e. My Library.
  2. Most people discover songs through Discover and Search, and then use My Library to play them.
  3. 30.9% of users listen to songs repeatedly from My Library.
  4. 17.3% of users who discovered songs did not listen to them more than once.
  5. 43.8% of users use local playlists for playing songs.
  6. Most users appear to be specific about their songs, as they rarely even use the My Library search.
  7. 27.9% of users listen to songs repeatedly from a local playlist.
  8. 5.0% of users who listen to songs from the radio did not listen to them more than once.
  9. 72% of users use the app through the local library, online playlists, and local playlists.
  10. 19.4% of users listen to songs repeatedly from the local library.
  11. 5.1% of users who listen to songs from the radio did not listen to them more than once.

Other Features :

Univariate and Missing value plots

The aim of these plots is to explore features like gender, genre, language, and registration method, along with their missing values, to determine the importance of these features in modeling.

From the above plots, we can conclude that:

  1. Gender, composer, and lyricist are highly sparse, i.e., they have a high missing-value rate, so we need to handle them carefully.
  2. The dataset is balanced with respect to the target variable.
  3. Nothing can be judged from the gender plot due to the high missing-value rate.
  4. 54.8% of users listen to songs in language 3.0.
  5. It can be concluded that language 3.0 is the local language of the region/country where the app is used.
  6. 51.7% of users listen to songs with genre id 465.
  7. 72.8% of users have registered via registration methods 7 and 9.

Continuous Features :

Length and Age:

PDF, CDF, and Boxplot for song length and age

The aim of these plots is to explore the age and song length features and determine their importance in modeling.

From the above plots, we can conclude that:

  1. The age and song length features have outliers.
  2. The majority of age values lie between 0 and 100.
  3. Most song lengths lie between 191,518 and 395,947 ms.
  4. Applying a log transformation to song length makes sense for reducing the effect of outliers.

4. Handling Missing Values and Outliers :

Categorical Features :

Most of the missing values are in gender, composer, genre id, and source screen name. Since composer and lyricist have both a large number of categories and a large fraction of missing values, we drop both features. For source screen name, genre id, and gender, we treat the missing values as a new category.
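A minimal sketch of this handling with pandas, assuming the merged dataframe df and the KKBox column names:

# composer and lyricist have too many categories and too many missing values, so drop them
df = df.drop(columns=['composer', 'lyricist'])

# treat missing values in the remaining sparse categorical features as a new category
for col in ['source_screen_name', 'genre_ids', 'gender']:
    df[col] = df[col].fillna('missing')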

Continuous Features :

The continuous features song length and age contain outliers. We handle the song length outliers by applying a log transformation to the feature. Since age should vary from 0 to 100, all values outside this range are clipped to the nearest boundary.

Missing values of song length are filled with the median song length computed on the training set.

Transformation of song length feature
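The transformation can be sketched as follows, assuming the merged dataframe df; in the KKBox files, song length is song_length (in ms) and age is bd:

import numpy as np

# fill missing song lengths with the training-set median, then log-transform to tame outliers
median_length = df['song_length'].median()  # compute this on the training split only
df['song_length'] = np.log1p(df['song_length'].fillna(median_length))

# clip age to the plausible 0-100 range
df['bd'] = df['bd'].clip(lower=0, upper=100)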

5. Feature Engineering :

Feature engineering is one of the most important parts of handling data, so we will take each feature and apply feature engineering to it.

Registration and Expiration Date :

The given input format for the dates is YYYY-DD-MM, so we convert the registration date feature into three features: registration year, month, and day. We apply the same to the expiration date feature.
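A minimal sketch of this decomposition with pandas; the column names registration_init_time and expiration_date follow the KKBox members file, and the format string is an assumption to be adjusted to the raw data:

import pandas as pd

# split each date column into year / month / day features
for col, prefix in [('registration_init_time', 'registration'), ('expiration_date', 'expiration')]:
    dates = pd.to_datetime(df[col], format='%Y-%d-%m')  # adjust format= to match the raw dates
    df[prefix + '_year'] = dates.dt.year
    df[prefix + '_month'] = dates.dt.month
    df[prefix + '_day'] = dates.dt.day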

User Id and Song Id :

We will create a 30-dimensional embedding for each user id and song id using the target variable. To create these embeddings, we use a Collaborative Filtering technique.

The Collaborative Filtering used here is matrix factorization: we initialize a (number of users x 30) user-embedding matrix and a (number of songs x 30) song-embedding matrix, and build a sparse interaction matrix whose rows correspond to users, whose columns correspond to songs, and whose entries are the target values of the respective user-song pairs. The embeddings are then learned by minimizing the squared reconstruction error of this matrix using gradient descent with momentum and L2 regularization, as in the code below.

import numpy as np
from scipy.sparse import csc_matrix

# row and cols are the user/song index arrays for the observed pairs, and map_msno /
# map_song_id map the raw ids to those indices; they are built earlier in the notebook
lmbda = 0.0002  # L2 regularization strength

def predict(df, emb_user, emb_song):
    # score of each observed pair = dot product of its user and song embeddings
    df['prediction'] = np.sum(np.multiply(
        emb_song[df['song_id'].apply(lambda x: map_song_id[x])],
        emb_user[df['msno'].apply(lambda x: map_msno[x])]), axis=1)
    return df

def cost(df, Y, emb_user, emb_song):
    # mean squared reconstruction error over the observed entries
    predicted = predict(df, emb_user, emb_song)
    predicted = csc_matrix((df.prediction.values, (row, cols)), dtype=np.int8)
    return np.sum((Y - predicted).power(2)) / df.shape[0]

def create_embeddings(n, K):
    return np.random.random((n, K))

def gradient(df, Y, emb_user, emb_song):
    predicted = predict(df, emb_user, emb_song)
    predicted = csc_matrix((df.prediction.values, (row, cols)), dtype=np.int8)
    delta = (Y - predicted)
    grad_user = (-2 / df.shape[0]) * (delta * emb_song) + 2 * lmbda * emb_user
    grad_song = (-2 / df.shape[0]) * (delta.T * emb_user) + 2 * lmbda * emb_song
    return grad_user, grad_song

Y = csc_matrix((df.target.values, (row, cols)), dtype=np.int8)
emb_user = create_embeddings(30755, 30)   # number of unique users x 30
emb_song = create_embeddings(359966, 30)  # number of unique songs x 30

# gradient descent with momentum
beta = 0.9
grad_user, grad_song = gradient(df, Y, emb_user, emb_song)
v_user = grad_user
v_song = grad_song
for i in range(500):
    grad_user, grad_song = gradient(df, Y, emb_user, emb_song)
    v_user = beta * v_user + (1 - beta) * grad_user
    v_song = beta * v_song + (1 - beta) * grad_song
    emb_user = emb_user - 1 * v_user
    emb_song = emb_song - 1 * v_song
    print("\niteration", i + 1, ":")
    print("train mse:", cost(df, Y, emb_user, emb_song))

Genre Id :

Genre id has a large number of categories, and most of them account for only a small percentage of the rows. We therefore combine all categories that contribute less than 1% of the feature's values into a single category.
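A minimal sketch of grouping the rare categories, assuming the merged dataframe df and the KKBox column name genre_ids:

# compute each genre's share of rows and merge categories below 1% into a single 'other' bucket
genre_share = df['genre_ids'].value_counts(normalize=True)
rare_genres = genre_share[genre_share < 0.01].index
df['genre_ids'] = df['genre_ids'].where(~df['genre_ids'].isin(rare_genres), 'other')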

Source System Tab, Source Screen Name, Source Type, City, Registered via, Language and Gender :

We apply one-hot encoding to all of these features, treating missing values as a new category.
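A minimal sketch with pandas, where dummy_na=True creates the extra indicator column for missing values (the column list follows the KKBox files):

import pandas as pd

# one-hot encode the categorical features, with a separate column for missing values
cat_cols = ['source_system_tab', 'source_screen_name', 'source_type',
            'city', 'registered_via', 'language', 'gender']
df = pd.get_dummies(df, columns=cat_cols, dummy_na=True)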

6. Modeling :

For modeling, we tried different models like

i. Logistic Regression with calibrated classifier,

ii. SGDClassifier with calibrated classifier,

iii. Decision Tree with calibrated classifier,

iv. Random Forest Classifier,

v. AdaBoost Classifier,

vi. XGBoost Classifier,

vii. custom model (code is mentioned below)

For each of these models, we apply hyperparameter tuning and choose the best hyperparameters for the respective model. We were unable to apply a calibrated classifier to the remaining models because of the high computation required.
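As an illustration, here is a minimal sketch of tuning and calibrating one of the simpler models (Logistic Regression); the parameter grid is illustrative, and x_train and y_train are assumed to come from the merged dataset. The custom model, a stacking ensemble of decision-tree base models with a logistic-regression meta model, follows after it.

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

# tune the regularization strength with cross-validated ROC AUC
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    scoring='roc_auc', cv=3)
grid.fit(x_train, y_train)

# wrap the best model in a calibrated classifier to get better-behaved probabilities
calibrated = CalibratedClassifierCV(grid.best_estimator_, method='sigmoid', cv=3)
calibrated.fit(x_train, y_train)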

import numpy as np
from tqdm import tqdm
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def custom_model(x_train, y_train, x_test, y_test, n_estimators=10, alpha=100, max_depth=40):
    # split the training data: D1 trains the base models, D2 trains the meta model
    d1_x_train, d2_x_train, d1_y_train, d2_y_train = train_test_split(
        x_train, y_train, test_size=0.5, random_state=0, stratify=y_train)
    predictions = []
    base_models = []
    for i in tqdm(range(n_estimators)):
        # generating_samples is a helper defined in the notebook that draws a bootstrap sample from D1
        x, y, rows = generating_samples(d1_x_train, d1_y_train)
        # train a base decision tree and record its predicted probabilities on D2
        base_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
        base_model.fit(x, y)
        pred = base_model.predict_proba(d2_x_train)[:, 1].reshape(-1, 1)
        predictions.append(pred)
        base_models.append(base_model)
    predictions = np.array(predictions).T
    predictions = predictions.reshape(-1, n_estimators)
    # train the meta model (logistic regression) on the base models' predictions for D2
    meta_model = LogisticRegression(penalty='l2', C=alpha, class_weight='balanced')
    meta_model.fit(predictions, d2_y_train)
    y_pred = meta_model.predict_proba(predictions)
    score = roc_auc_score(d2_y_train, y_pred[:, 1])
    print('AUC Score of Model on train set is :', score)
    # calculate the AUC ROC score on the test set
    predictions = []
    for base_model in base_models:
        pred = base_model.predict_proba(x_test)[:, 1].reshape(-1, 1)
        predictions.append(pred)
    predictions = np.array(predictions).T
    predictions = predictions.reshape(-1, n_estimators)
    y_pred = meta_model.predict_proba(predictions)
    score = roc_auc_score(y_test, y_pred[:, 1])
    print('AUC Score of Model on test set is :', score)
    return base_models, meta_model, y_pred

base_models, meta_model, y_pred = custom_model(
    x_train, y_train, x_test, y_test, n_estimators=100, alpha=100, max_depth=10)

7. Results and Deployment:

After obtaining the best hyperparameters, we trained all the models with them and obtained results on the test set.

Results of the models on the test data

After submitting on Kaggle, we obtained a score of 0.65, where the highest score is 0.74.

Submission score on Kaggle: 0.653

Deployed Link on AWS :

8. Future Work :

We got the best results with the Random Forest Classifier with n_estimators equal to 100. Results could be improved further by applying CalibratedClassifierCV to the Random Forest Classifier, but this requires more than 35 GB of RAM for computation and we were limited by hardware.

Another direction is implementing user-user similarity for better song recommendations.

9. Profile :

For the complete code, visit the ipynb notebook :

Stay connected with me on LinkedIn

10. References :

  1. https://www.researchgate.net/publication/328838360_KKbox's_Music_Recommendation_Challenge_Solution_with_Feature_engineering_11th_ACM_International_Conference_on_Web_Search_and_Data_Mining_WSDM_2018_February_5–9_2018_Los_Angeles_California_USA_WSDM_Cup
  2. https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html
  3. https://medium.com/towards-artificial-intelligence/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444
  4. https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/
  5. http://cs229.stanford.edu/proj2019spr/report/4.pdf
  6. AppliedAICourse.com

I hope this blog is helpful.

