Team Yahoo! Music

The project analyzes user ratings of music from the Yahoo! Music dataset (KDD Cup) in order to predict future preferences. The dataset contains more than 62 million user ratings of tracks, albums, artists, and genres. The competition is split into two tracks: Track 1 provides the information needed to predict the scores that users gave to various items, while Track 2 requires separating loved songs from other songs. The Track 1 data is organized as a hierarchy in which each track belongs to an album, albums belong to artists, and all of these items are tagged with genres. Track 2 is where the classification of loved songs is required.

Our proposal satisfies the course requirements well because it involves the entire cycle of knowledge discovery in data through data mining. Several tables have to be joined to get the data into a format suitable for analysis. The project intends to predict the missing ratings and to reduce the error in the classification matrix by optimizing the model we choose. This requires trial and error, trying not only various models but also different applicable parameters.

This complexity in the objective gives us more room for learning throughout the process, which ranges from data cleaning and joining different tables in a meaningful way, to classifying and finding clusters, and finally to using neural networks and recommender-system approaches to obtain the prediction. Through the literature research, we found several concepts and models, such as collaborative filtering, SVD, and neural networks, that could be applied to predicting the ratings, and we tried applying several of them to our prototype. This required an understanding of music choices as well as an effective way to deal with missing information (sparse data), which is a common problem in recommender systems.

The online movie industry has produced well-documented studies and contests on implementing and improving recommendation systems using data mining. In creating a similar model for online music for the KDD Cup, we reviewed the following three methods in the hope of gaining insight into accurately predicting user preference.

Arest takes a holistic approach in his article on customer classification and presents a step-by-step method for gaining better accuracy in predicting movie preference. In his data analysis, he presents solutions to the following problems in predicting customer preference: user identification, session identification, path completion, and transaction identification. Using a Naive Bayes classifier, Arest develops classifications for customers, whom he groups into new and returning users for analysis. For our purposes, however, we believe that a blended approach would yield higher accuracy in providing music recommendations based on multiple factors.

BellKor was the winning team of the highly publicized 2008 Netflix Prize, a contest to develop a model that improved upon the Netflix movie recommendation system. The key to their success was their ability to blend hundreds of models in order to maximize the accuracy of the result. Our team intends to learn from this method and use several models to analyze our results, but due to a lack of resources and expertise, we do not plan on using as many models as the BellKor team.

This paper was published by members of “the Ensemble”, who placed second in the 2008 Netflix Prize and ultimately collaborated with BellKor to achieve even higher accuracy. Like BellKor, they used a blended approach to take advantage of multiple models. In this case, they found that stacking linear functions on top of each other worked to create a second-level learning algorithm. In addition, they use meta-features, which are “additional inputs describing each sample in a dataset”, to increase accuracy. Using SPSS Modeler, our team also intends to use multiple models, but further research is needed to determine whether we are capable of implementing linear stacking.
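
To make the idea of linear stacking with meta-features more concrete, here is a minimal sketch in Python rather than SPSS Modeler. It assumes two first-level models whose predictions are blended with weights that depend linearly on one meta-feature. The sample values, the weights, and all function names are hypothetical, and the fit is plain unregularized least squares, so this should be read as an illustration of the general technique and not as the method published by the Netflix Prize teams.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def stack_features(model_preds, meta):
    """One input per (meta-feature, model) pair: meta-feature value times model prediction."""
    return [f * g for f in meta for g in model_preds]


def fit_blend(rows, targets):
    """Least-squares blend weights from the normal equations (X^T X) v = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def blend(weights, model_preds, meta):
    """Second-level prediction: weighted sum of the stacked inputs."""
    return sum(w * x for w, x in zip(weights, stack_features(model_preds, meta)))


# Hypothetical toy data: ([model 1 prediction, model 2 prediction], [constant 1.0, meta-feature]).
samples = [
    ([60, 80], [1.0, 0.0]),
    ([70, 50], [1.0, 0.0]),
    ([40, 90], [1.0, 1.0]),
    ([85, 65], [1.0, 1.0]),
    ([55, 75], [1.0, 0.5]),
    ([30, 45], [1.0, 0.5]),
]
# Hypothetical "true" blend used only to generate targets for this illustration.
true_weights = [0.5, 0.5, 0.3, -0.3]
targets = [blend(true_weights, preds, meta) for preds, meta in samples]

rows = [stack_features(preds, meta) for preds, meta in samples]
weights = fit_blend(rows, targets)
```

Because the constant 1.0 is included as a meta-feature, an ordinary fixed-weight blend is the special case in which the remaining meta-feature weights are zero.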

Recommender systems store large amounts of data in matrix format, with users on one axis and their ratings of specific items on the other. Since most users rate only a subset of the items, many fields are empty, so dealing with sparse data is important. A technique called collaborative filtering is widely used in recommender systems to address this problem. It computes a neighborhood of similar users, where similarity is measured through the correlation between their ratings. This method has a particularly high cost when the data is sparse, as it is difficult to find overlap among the ratings. Singular Value Decomposition (SVD) reduces the data representation and gives predicted ratings using linear regression, but missing ratings must be filled in before SVD can be used. This method achieves better accuracy by using information not only from correlated customers but also from users whose ratings are not correlated. Although collaborative filtering helps in recommender systems, it lacks human communication aspects, and as a result conversational recommender systems are emerging.
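
The neighborhood idea behind collaborative filtering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the rater and track identifiers and the ratings are made up, similarity is the Pearson correlation over co-rated items, and the prediction formula (the rater's mean plus similarity-weighted deviations of the neighbors) is one common textbook variant, not necessarily the one our final model will use.

```python
from math import sqrt

# Hypothetical toy ratings: {rater: {item: rating}}; unrated items are simply absent.
ratings = {
    "u1": {"t1": 90, "t2": 70, "t3": 50},
    "u2": {"t1": 80, "t2": 60, "t3": 40, "t4": 70},
    "u3": {"t1": 30, "t2": 50, "t3": 90, "t4": 20},
}


def pearson(a, b):
    """Pearson correlation over the items both raters have rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0  # too little overlap, the typical situation with sparse data
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(all_ratings, rater, item):
    """Rater's mean rating plus the similarity-weighted deviations of the other raters."""
    own = all_ratings[rater]
    own_mean = sum(own.values()) / len(own)
    num = den = 0.0
    for other, other_ratings in all_ratings.items():
        if other == rater or item not in other_ratings:
            continue
        sim = pearson(own, other_ratings)
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        num += sim * (other_ratings[item] - other_mean)
        den += abs(sim)
    if den == 0:
        return own_mean  # no usable neighbors: fall back to the rater's own mean
    return own_mean + num / den


predicted = predict(ratings, "u1", "t4")
```

In this toy data, u2 rates like u1 and liked t4, while u3 rates almost opposite to u1 and disliked t4, so both neighbors push the prediction for u1 above u1's own mean of 70.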

The aim of this study is to classify whether a rater is going to rate a new track and to predict its rating value, using the current database of ratings from Yahoo! Music.

All of these have many-to-many relationships. Hence we need to use a database to manage these relations and ultimately arrive at a single data table containing all the variables.
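
As a rough sketch of how a database can flatten these relations into one table, the following example uses Python's built-in `sqlite3` module with an in-memory database. The table names (`Rating`, `Track`, `TrackGenre`), the `Score` column, and the sample rows are assumptions made for illustration; only the ID field names follow the ones used elsewhere in this proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rating (RaterID INTEGER, TrackID INTEGER, Score INTEGER);
    CREATE TABLE Track (TrackID INTEGER PRIMARY KEY, AlbumID INTEGER, ArtistID INTEGER);
    CREATE TABLE TrackGenre (TrackID INTEGER, GenreID INTEGER);
""")

# Hypothetical sample rows; track 11 has no artist and no genre on record.
conn.executemany("INSERT INTO Rating VALUES (?, ?, ?)",
                 [(1, 10, 90), (2, 10, 50), (2, 11, 70)])
conn.executemany("INSERT INTO Track VALUES (?, ?, ?)",
                 [(10, 100, 1000), (11, 101, None)])
conn.executemany("INSERT INTO TrackGenre VALUES (?, ?)",
                 [(10, 5), (10, 6)])

# LEFT JOINs keep every rating even when related information is missing.
rows = conn.execute("""
    SELECT r.RaterID, r.TrackID, t.AlbumID, t.ArtistID, g.GenreID, r.Score
    FROM Rating AS r
    LEFT JOIN Track AS t ON t.TrackID = r.TrackID
    LEFT JOIN TrackGenre AS g ON g.TrackID = r.TrackID
    ORDER BY r.RaterID, r.TrackID, g.GenreID
""").fetchall()
```

Note that a track tagged with several genres yields one output row per genre, so a rating can appear more than once in the flattened table; whether to keep those duplicates or pivot the genres into separate columns is a modeling decision.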



Since we are constrained by machine resources, we are carrying out the analysis on a reduced dataset. So far we have tried a sample dataset of 7474 ratings, selected with RaterID <= 30. We will test further by training the model on more samples.
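
The sample selection described above amounts to a simple filter on RaterID. A minimal Python sketch follows, assuming each rating is held as a dictionary; the record layout and the `Score` field are hypothetical, while the threshold of 30 comes from the text.

```python
MAX_RATER_ID = 30  # threshold used for the current sample


def sample_by_rater(records, max_rater_id=MAX_RATER_ID):
    """Keep only the ratings whose RaterID is at or below the threshold."""
    return [rec for rec in records if rec["RaterID"] <= max_rater_id]


# Hypothetical records for illustration only.
records = [
    {"RaterID": 5, "TrackID": 10, "Score": 90},
    {"RaterID": 30, "TrackID": 11, "Score": 70},
    {"RaterID": 31, "TrackID": 10, "Score": 50},
]
sample = sample_by_rater(records)
```

Raising `max_rater_id` is then the only change needed to train on a larger sample as resources allow.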

A lot of data preparation needs to be addressed. The biggest hurdle, we feel, is handling the null values that occur in all the variable fields. Some of the ways we have identified are:

b. Replace a ‘null’ value with its related ID from the relationship map. For example, if a particular TrackID has been rated, use the TrackID to AlbumID, GenreID, and ArtistID relation map to fill in these values.
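
Filling nulls from the relationship map, as in option b, could look like the following Python sketch. The map contents, the record layout, and the function name are hypothetical; a null is represented here as `None`, and fields that the map cannot resolve are left as they are.

```python
# Hypothetical relation map: TrackID -> related IDs.
track_map = {
    10: {"AlbumID": 100, "ArtistID": 1000, "GenreID": 5},
    11: {"AlbumID": 101, "ArtistID": None, "GenreID": None},
}

RELATED_FIELDS = ("AlbumID", "ArtistID", "GenreID")


def fill_from_map(record, relation_map):
    """Return a copy of the record with null related IDs filled in from the map."""
    filled = dict(record)
    related = relation_map.get(record.get("TrackID"), {})
    for field in RELATED_FIELDS:
        if filled.get(field) is None and related.get(field) is not None:
            filled[field] = related[field]
    return filled


rated = {"RaterID": 1, "TrackID": 10, "AlbumID": None, "ArtistID": None, "GenreID": None}
filled = fill_from_map(rated, track_map)
```

Values already present in a record are never overwritten, so the function can be applied to every row of the flattened table without a prior null check.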

Since the objective is to classify and predict the ratings, we will run all the classification algorithms available to us and compare the accuracy of the resulting models. The following are the algorithms we will use for this project.



Based on our understanding so far in the course, we feel that neural networks are best suited to this problem. A combination of models, i.e., an ensemble, could also provide greater accuracy. We will also analyze the problem using split models on the raters, i.e., a separate model would be generated for each rater and used for that rater's predictions. Finally, we will explore the possibility of clustering the data into subsets and then running the prediction algorithms.

A brief analysis of the sample of 7474 ratings, with NULL values replaced by 0, is shown in Exhibits 1 to 3. The model used is a neural network.

The takeaway from this project for us as a team would be learning how to develop a ‘collaborative filtering’ style of data mining approach, how useful it is in influencing a person’s choice of music, and how it can provide more opportunities for people to rate different kinds of music tracks, albums, or genres.



__Exhibit 2 – Data Reduction__
__Exhibit 3 – Summary of Sample Model__