====== Yahoo! KDD Cup - Final ======

====== Nick Woolsey ======

====== Saravanan Raj ======

====== Sultan Bugshan ======

====== Introduction ======

The project analyzes user ratings of music from the Yahoo! Music dataset (KDD Cup) to predict future preferences. There are more than 62 million user ratings covering tracks, albums, artists, and genres. The data is split into two tracks: Track 1 provides the information needed to predict the scores that users gave to various items, while Track 2 requires separating loved songs from other songs. The Track 1 data is organized hierarchically: each track belongs to an album, albums belong to artists, and both are tagged with genres. Track 2 is where the classification of loved songs is required.
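The hierarchy can be pictured as two lookup maps. The Python sketch below uses invented IDs purely for illustration (the real dataset has its own numeric identifiers):

```python
# Hypothetical IDs illustrating the Track 1 hierarchy: tracks belong to
# albums, and albums belong to artists. All values are invented.
track_to_album = {101: 11, 102: 11, 103: 12}
album_to_artist = {11: 1, 12: 2}

def artist_of_track(track_id):
    """Walk up the hierarchy from a track to its artist."""
    return album_to_artist[track_to_album[track_id]]
```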

Our proposal satisfies the course requirements well because it involves the entire cycle of Knowledge Discovery in Data through data mining. Several tables have to be connected to get the data into a format that is useful for analysis. The project aims to predict the missing ratings and to reduce the error in the classification matrix by optimizing the model we choose. This requires trial and error, trying not only various models but also different applicable parameters.

Such complexity in the objective gives us more room for learning throughout this process, ranging from data cleaning and joining different tables in a meaningful way to classification, clustering, and finally neural network and recommender-system approaches to obtain the predictions. Through literature research, we found several concepts and models, such as collaborative filtering, SVD, and neural networks, that could be applied to predict the ratings, and we tried applying various models to our prototype. This required an understanding of music choices as well as an effective way to deal with missing information (sparse data), which is a common problem in recommender systems.

====== Literature Review ======

The online movie industry has had well-documented studies and contests on implementing and improving recommendation systems using data mining. When creating a similar model for online music in the KDD Cup, we reviewed the following methods in hopes of gaining insight into accurately predicting user preference.

**“A movie e-shop recommendation model based on Web usage and ontological data.” Arest, Andres, et al.**

Arest takes a holistic approach to customer classification and presents a step-by-step method for gaining better accuracy in predicting movie preference. In his data analysis, he presents solutions to the following problems in predicting customer preference: user identification, session identification, path completion, and transaction identification. Using a Naive Bayes classifier, Arest develops customer classifications, which he groups into new and returning users for analysis. For our purposes, however, we believe that a blended approach based on multiple factors would yield higher accuracy in providing music recommendations.

**“The BellKor 2008 Solution to the Netflix Prize.” Bell, Robert, et al.**

BellKor was the winning team of the highly publicized 2008 Netflix Prize, a contest to develop a model that improved upon the Netflix movie recommendation system. The key to their success was their ability to blend hundreds of models to maximize the accuracy of the result. Our team intends to learn from this method and use several models to analyze our results, but due to a lack of resources and expertise, we do not plan on using as many models as the BellKor team did.

**“Feature Weighted Linear Stacking.” Sill, Joseph, et al.**

This paper was published by members of “The Ensemble,” who placed second in the 2008 Netflix Prize and ultimately collaborated with BellKor to achieve even higher accuracy. Like BellKor, they used a blended approach to take advantage of multiple models. In this case, they found that stacking linear functions on top of each other created a second-level learning algorithm. In addition, they used meta-features, “additional inputs describing each sample in a dataset,” to increase accuracy. Using SPSS Modeler, our team also intends to use multiple models, but further research is needed to determine whether we are capable of implementing linear stacking.

**“Predicting Missing Ratings in Recommender Systems: Adapted Factorization Approach.” Julià, Carme, et al.**

Recommender systems store large amounts of data in matrix format, with users and their specific ratings of items. Since most users rate only a subset of items, many fields are empty, so dealing with sparse data is important. Collaborative filtering, a technique widely used in recommender systems to address this, computes a neighborhood of similar users whose similarity is found through the correlation between their ratings. This method has a particularly high cost when the data is sparse, as it is difficult to find overlap among the ratings. Singular Value Decomposition (SVD) reduces the data representation and produces predicted ratings, but missing ratings must be filled in before SVD can be used. The adapted factorization approach achieves better accuracy by using not only information from correlated users but also information from users whose ratings are not correlated. Although collaborative filtering helps recommender systems, it lacks the human-communication aspect, and as a result conversational recommender systems are emerging.
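As a rough illustration of the SVD route described above, the sketch below fills the missing entries of a toy rating matrix with per-item means (one of several possible imputation choices) and then truncates the decomposition to two latent factors. All ratings are invented:

```python
import numpy as np

# Toy user-by-item rating matrix on the 0-100 Yahoo! scale; np.nan marks
# missing ratings. All values are invented for illustration.
R = np.array([
    [90.0, 80.0, np.nan],
    [70.0, np.nan, 30.0],
    [np.nan, 60.0, 20.0],
])

# Missing ratings must be filled before SVD can run; here we use the
# per-item (column) mean as a simple placeholder.
item_means = np.nanmean(R, axis=0)
filled = np.where(np.isnan(R), item_means, R)

# Truncated SVD keeps only the k strongest latent factors; R_hat[i, j]
# is then the predicted rating of user i for item j.
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```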

====== Procedure ======

The aim of this study is to classify whether a rater will like a new track and to predict its rating value, using the current database of ratings from Yahoo! Music.

**__Dataset Description__**


**TrackID** : Each song is tagged with a numeric value.

**ArtistID** : Each artist in the Yahoo! Music database is tagged with a numeric value.

**AlbumID** : Each album is tagged with a numeric value.

**Rating** : A value ranging from 0 to 100. We have treated it as nominal.

**RaterID** : A numeric value identifying the people who listen to Yahoo! Music and rate items, which may be albums, artists, tracks, or genres.

**TrackClass** : A flag denoting whether the track was liked (1) or not liked/rated (0); this is the response variable.

**AlbClass** : A flag denoting whether the album was liked (1) or not liked/rated (0).

**ArtClass** : A flag denoting whether the artist was liked (1) or not liked/rated (0).

**GenClass** : A flag denoting whether the genre was liked (1) or not liked/rated (0).
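A minimal sketch of how such a flag might be derived from a raw rating; the cut-off of 80 is an assumption for illustration, not the dataset's actual rule:

```python
# Hypothetical derivation of a class flag from a raw 0-100 rating.
# The threshold of 80 is an assumed cut-off; None stands for "not rated".
def liked_flag(rating, threshold=80):
    """Return 1 if the item counts as liked, else 0 (not liked or not rated)."""
    if rating is None:
        return 0
    return 1 if rating >= threshold else 0
```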

All of these have many-to-many relationships, so we need to use a database to manage the relations and finally arrive at a single data table with all the variables.
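A minimal sketch of that consolidation step using an in-memory SQLite database; the table layout and every ID are invented for illustration:

```python
import sqlite3

# Flatten the relations into a single analysis table. Table names,
# column names, and all IDs are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE track(TrackID INTEGER, AlbumID INTEGER, ArtistID INTEGER);
    CREATE TABLE rating(RaterID INTEGER, TrackID INTEGER, Rating INTEGER);
    INSERT INTO track VALUES (101, 11, 1), (102, 12, 2);
    INSERT INTO rating VALUES (7, 101, 90), (7, 102, 40);
""")
# Join each rating row to the track hierarchy to get one flat table.
rows = con.execute("""
    SELECT r.RaterID, r.TrackID, t.AlbumID, t.ArtistID, r.Rating
    FROM rating r JOIN track t ON r.TrackID = t.TrackID
    ORDER BY r.TrackID
""").fetchall()
```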



The above is the complete dataset from the Yahoo! database.

Since we are constrained by machine resources, we are carrying out the analysis on a reduced dataset. So far we have tried a sample of 7,474 ratings with RaterID <= 30, and we will test training the model on more samples. The actual number of raters we analyzed was 1,500, with about half a million ratings.

The workflow for this project is shown below.



**__Data Preparation__**

There is a lot of data preparation to be done. The biggest hurdle, we feel, is handling the null values in all the variable fields. Some of the approaches we have identified are:

a. Replace all null values with a fixed ‘0’.

b. Replace a null value with its related ID from the relationship map. For example, if a particular TrackID has been rated, use the TrackID-to-AlbumID, GenreID, and ArtistID relation maps to fill in those values.

c. Completely remove the null rows.
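The three options can be sketched on a toy record set as follows; field names mirror the dataset description, while the relation map and all values are invented:

```python
# Toy records with a null (None) AlbumID; values are invented.
records = [
    {"TrackID": 101, "AlbumID": None, "Rating": 90},
    {"TrackID": 102, "AlbumID": 12, "Rating": 40},
]
track_to_album = {101: 11, 102: 12}  # assumed relationship map

# (a) replace nulls with a fixed 0
fixed = [
    {**r, "AlbumID": r["AlbumID"] if r["AlbumID"] is not None else 0}
    for r in records
]

# (b) fill nulls from the relationship map
mapped = [
    {**r, "AlbumID": r["AlbumID"] if r["AlbumID"] is not None
     else track_to_album[r["TrackID"]]}
    for r in records
]

# (c) drop rows that contain nulls
dropped = [r for r in records if r["AlbumID"] is not None]
```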

**__Modeling and Analysis__**

Since the objective is to classify and predict the ratings, we will try a range of classification algorithms and compare the accuracy of the resulting models. The following are the algorithms we will use for this project.



Based on our understanding so far in the course, we feel that neural networks are best suited to this problem. A combination of models, i.e., an ensemble, could also provide greater accuracy. We will also analyze the problem with split models on the raters, i.e., predictions based on a separate model generated for every rater, and explore clustering the data into subsets before running the prediction algorithms.
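The split-model idea, one model per rater, can be sketched as below. For brevity the per-rater “model” here is just a majority-class predictor rather than a neural network, and all rows are invented:

```python
from collections import Counter, defaultdict

# Invented training rows: (RaterID, AlbClass, ArtClass, GenClass, TrackClass).
rows = [
    (118, 1, 0, 1, 1), (118, 1, 1, 0, 1), (118, 0, 0, 0, 0),
    (109, 0, 1, 1, 0), (109, 0, 1, 0, 0),
]

# Split the training data by rater, as the split-model approach requires.
by_rater = defaultdict(list)
for rater, *features, label in rows:
    by_rater[rater].append(label)

# One trivially simple "model" per rater: predict that rater's majority
# TrackClass. A real split model would fit a neural network per group.
models = {rater: Counter(labels).most_common(1)[0][0]
          for rater, labels in by_rater.items()}
```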

**__Summary__**

The takeaway from this project for our team is learning how to develop a ‘collaborative filtering’ style of data mining, how useful it is for influencing a person’s choice of music, and how it provides more opportunities for people to rate different kinds of music tracks, albums, or genres. The neural network model with split is explained below.

The results from the model are explained below for a couple of Rater IDs.



**Neural Network Model Output - With Split**

Rater 118: For this Rater ID, the most important predictor is the Album Class. This Rater ID has about 87.9% accuracy in determining the right class.



**Rater ID 118 - Predictor Importance**

The classification matrix shows about 13.6% misclassification, where a Track Class of "0" was wrongly predicted as "1".



**Rater ID 118 - Classification**
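To make the accuracy and misclassification figures concrete, the sketch below reads both quantities off a hypothetical 2x2 classification matrix; the counts are invented, not the actual Rater 118 output:

```python
# Hypothetical 2x2 classification matrix for TrackClass, keyed by
# (actual, predicted). All counts are invented for illustration.
conf = {("0", "0"): 38, ("0", "1"): 6,   # actual class 0
        ("1", "0"): 2,  ("1", "1"): 54}  # actual class 1

total = sum(conf.values())
# Overall accuracy: share of rows on the matrix diagonal.
accuracy = (conf[("0", "0")] + conf[("1", "1")]) / total
# Misclassification of actual-0 rows wrongly predicted as "1".
false_pos_rate = conf[("0", "1")] / (conf[("0", "0")] + conf[("0", "1")])
```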

The neural network shown below has a single hidden layer, with Album Class as the most important categorical predictor for the correct prediction of Track Class.



**Rater ID 118 - Neural Network**

**Rater 109 - Predictor Importance**



**Rater 109 - Classification**



**Rater ID 109 - Neural Network**

**__Graphs__**

The web graph shows the links between the Artist Class, Album Class, Genre Class, and Track Class. It reveals a strong connection between the artist and album classes (when both equal 1), with the strongest connection between the Genre Class and the Album Class.



__**Web Graph - Neural Networks with Split**__

**Problems**

The first major challenge was understanding the dataset itself. The information was scattered across different places, with multiple relationships in multiple data files: tracks to albums, tracks to genres, tracks to raters and their ratings, artists to raters and ratings, albums to raters and ratings, albums to genres, and so on. We had to spend quite a lot of time on data preparation to achieve our objective with this music data. The second challenge was scoping the volume of data to analyze. The entire database is so huge that our hardware could not run Modeler on it, so we had to limit ourselves to 1,500 raters out of the 600K-rater database, which amounts to about 0.5 million of the 62 million ratings, a very small fraction of the whole. We would also have understood the results better if we had the actual track, artist, album, and genre names instead of just numeric IDs and ratings.

**Key Learnings**

The basic objective was to understand raters’ rating patterns and music preferences. We did this by classifying all tracks into two classes: most loved, or not liked/rated. We found interesting patterns in the analysis: for most raters who rated albums, the album rating influenced their track ratings, while for some, the artist rating mattered more than any other criterion. Other patterns also emerged, such as whether a track was liked depending only on the artist or album to which it belonged. These learnings will help us identify what kind of music a particular rater or music lover likes.

**Conclusion**

From a business perspective, Yahoo! can use this approach to identify and recommend to music lovers the kind of music they would love to listen to, and thus drive traffic to its music site. This is similar to the Netflix model of collaborative filtering, where movies are recommended based on user preferences and subscriptions are maintained by keeping customers interested. Yahoo! can also use these patterns to identify the classes of artists, albums, and tracks that are popular with most users, share them with the music community, and satisfy every music lover by constantly providing high-quality music entertainment.