moviegroupReport

= Executive Summary =
The team acted as a group of consultants working with a DVD producer. The team’s goal was to examine relevant data on box office sales, theatrical releases, and DVD sales trends. We collected data from an online source for the years 2006-2008. The data included variables such as domestic ticket sales, international ticket sales, release date, and gross DVD sales. The team then cleaned and analyzed the data with a variety of models, using SPSS’s PASW software and Microsoft Excel. The team chose four models for this analysis: General Linear, CR Tree, Linear Regression, and Neural Network. After building these models, the team found that the General Linear model had the most appropriate fit for the purposes of the analysis, partly because this model was able to transform the data itself. Through this model, the team determined that total US Gross ticket sales, total International Gross, and the DVD_ReleaseDate_Year_2007 indicator were the leading predictors of gross DVD sales. Other factors such as genre showed surprisingly little correlation with DVD sales, contrary to the team’s original hypothesis. The research also showed that DVDs are generally released three months after the theatrical debut of the movie. Given this information and the results from the model, the team is able to approximate the number of DVDs that will be sold. The error is still fairly high, but the team expects the projection to improve considerably over time as new data is collected.

= Analysis of Business Situation =

Business Challenge
The team outlined our challenge using the following lens and data set:
 * We are a DVD producer looking to better understand DVD sales trends. With that knowledge, we hope to improve our sales forecasts and identify which types of DVDs will be most profitable.
 * Our data set includes top-grossing films from 2006 to 2008 along with descriptive variables such as genre, release date, rating, and budget. We will use this data to predict the units of a DVD demanded upon release.

Business Goal
 * Forecast total demand more accurately so that manufacturers neither under-produce nor over-produce, thus lowering total costs and increasing profit margin.

Data Mining Goal
 * Identify characteristics of movies in the box office that are most likely to spur DVD sales.
 * Score and rank movies by probability of unit sales.

Hypothesis
The team hypothesizes that higher-grossing movies in popular genres will sell more DVDs. According to economic studies, “The growth in DVD spending was propelled by the plethora of box office titles that became available in 2006, including 15 films that generated more than $100 million each at the box office.”[1] Moreover, when the box office is doing well, DVD sales often slump, which suggests that box office hits and DVD sales are seasonal and move inversely to each other at any given time.[2]

= Data Mining Process =
The team set out to follow this process:
 * 1) Preprocess the data to reflect industry understanding (including setting dummy variables and transforming skewed ranges).
 * 2) Partition the dataset into training and validation sets at 40% and 60% respectively.
 * 3) Test different models evaluating the effectiveness of each model in how well it can forecast unit sales.
 * 4) Using the top models, experiment with advanced features and output configurations to maximize probability of model success.
 * 5) Evaluate the top model by estimating its effectiveness and the cost savings it offers through proper demand forecasting.
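Steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the team's PASW stream: the helper names (`z_scores`, `dummy_columns`, `partition`), the random seed, and the five sample records are all hypothetical. The 40%/60% split and the `DVD_ReleaseDate_Year` prefix follow the text.

```python
import random
import statistics

def z_scores(values):
    """Standardize a numeric column: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

def dummy_columns(values, prefix):
    """Build one 0/1 indicator column per distinct category."""
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in sorted(set(values))
    }

def partition(n_records, train_share=0.40, seed=1):
    """Randomly split record indices into training and validation sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = round(n_records * train_share)
    return indices[:cut], indices[cut:]

# Hypothetical records for illustration only (not from the team's data set).
us_gross = [120e6, 45e6, 300e6, 80e6, 15e6]
dvd_release_year = [2006, 2007, 2007, 2008, 2006]

us_gross_z = z_scores(us_gross)
year_dummies = dummy_columns(dvd_release_year, "DVD_ReleaseDate_Year")
train_idx, valid_idx = partition(len(us_gross))
```

With a fixed seed the partition is reproducible, which makes it easier to compare models trained on the same 40% of the records.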

Data Set
 * The following variables will be used for analysis.
 * ‘Units Sold’ will be our dependent variable.
Variables
 * 1) DVD Full Name
 * 2) DVD Release Date
 * 3) **Units Sold**
 * 4) Total US Box Office Gross
 * 5) Total International Box Office Gross
 * 6) Theater Release Date
 * 7) DVD Release Date
 * 8) Budget
 * 9) MPAA Rating
 * 10) Running Time (min)
 * 11) Genre
 * 12) The-Numbers.com rating
 * 13) Rotten Tomatoes Rating

= MODEL DEVELOPMENT =
After preprocessing the data, the team analyzed it with predictive models and certain classification models. The model families used were regression trees, linear models, and neural networks. After initial modeling with simple model parameters, four models showed the best predictive ability: C/R Tree, Linear Regression (range values only), General Linear Model, and Predictive Neural Network. The team then examined and refined these four models in depth to produce models with both the highest correlation and the lowest error. After tuning the four models, the team used three common metrics to analyze model performance: RMS Error, Linear Correlation, and Lift (through the modeled lift charts). The four models perform similarly, differing slightly in variable importance and RMS Error. It is also important to note that the Regression and the well-trained Neural Network models do not use the categorical variables for date and other movie characteristics. The following table displays the team’s model analysis metrics.

= MODEL CHARACTERISTICS =
|| || ** C/R Tree ** || ** Linear Regression ** || ** General Linear (not transformed) ** || ** Neural Network ** ||
|| ** Top Variable ** || // “US Gross” // Importance: 44.5% || // “US Gross” // Importance: 52.5% || // “US Gross” // Importance: 30.0% || // “US Gross” // Importance: 16.1% ||
|| ** 2nd Variable ** || // “Global Gross” // Importance: 17.6% || // “Intl. Gross” // Importance: 42.8% || // “DVD rel. 2007” // Importance: 8.8% || // “Global Gross” // Importance: 13.8% ||
|| ** RMS Error ** || 1,112,606 || 1,282,384 || 1,089,690 || 1,332,186 ||
|| ** Linear Corr. ** || 0.810 || 0.779 || 0.797 || 0.764 ||
|| ** Lift Chart ** || [[image:lift-crt.jpg]] || [[image:lift-linear.jpg]] || [[image:lift-gen.jpg]] || [[image:lift-neu.jpg]] ||
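For readers who want to reproduce the RMS Error and Linear Corr. rows outside PASW, the sketch below shows one common way to compute them. It assumes that the linear correlation is the Pearson correlation between actual and predicted units sold; the function names and the three sample values are hypothetical.

```python
import math

def rms_error(actual, predicted):
    """Root mean squared difference between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def linear_correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical units-sold figures for illustration only.
actual_units = [1_000_000, 2_000_000, 3_000_000]
predicted_units = [1_500_000, 2_000_000, 2_500_000]

example_rmse = rms_error(actual_units, predicted_units)
example_corr = linear_correlation(actual_units, predicted_units)
```

The two metrics answer different questions: in this toy case the predictions are perfectly correlated with the actual values, yet the RMS error is still far from zero because the predictions are compressed toward the middle.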

CRT
The CRT model uses all forms of data, from range to categorical, to classify the records. With a strong lift chart, the highest linear correlation (0.810), and the second-lowest RMS error, the C/R Tree is a good classifier of the dataset. Its main drawback is that it is simply a classification tool: it does not directly predict the total units sold of a DVD but assigns each record to one of ten sales categories. The following list displays the classification path for an average-selling DVD.

Parameter Estimates
 * Total_US_Gross_transformed > 1.7
 * Intl_Gross_transformed > 2.5
 * The-Numbers_Rating_transformed > 0.262
 * Result = 8,190,945

Regression
The regression model is the second weakest of the four: it uses only range values, and it has a high RMS error. The following parameter estimates show the model’s estimate of how units sold relate to both US box office sales and international box office sales.

Parameter Estimates
 * Total_US_Gross_transformed * 3,703,118.9
 * Intl_Gross_transformed * 3,327,196.6

General Linear (not transformed)
The general linear model is a strong, well-rounded model. It has the lowest RMS error of the four models, the best lift chart, and the second-highest linear correlation (0.797, just behind the C/R Tree’s 0.810). It is also strong because it can be used to predict sales rather than simply classify them into bins. The main reason for this model’s success is that the data is transformed within the model rather than converted to z-scores before modeling. The following parameter estimates show the model’s estimate of how units sold relate to both US box office sales and international box office sales.

Parameter Estimates
 * Intercept: -453783
 * Total_US_Gross_transformed: .0026
 * Intl_Gross_transformed: 0.004
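As a rough illustration of how these estimates turn into a forecast, the sketch below combines the intercept and the two coefficients into a linear prediction. It rests on two assumptions that the report does not confirm: that the coefficients apply to gross figures in raw dollars, and that only the terms listed above are used (the team's full model includes other predictors, such as the 2007 DVD release indicator). The function name and the sample grosses are hypothetical.

```python
# Parameter estimates as listed in the report.
INTERCEPT = -453783
COEF_US_GROSS = 0.0026
COEF_INTL_GROSS = 0.004

def predict_units_sold(total_us_gross, intl_gross):
    """Linear prediction from the listed terms only.

    Assumption: both inputs are gross ticket sales in raw dollars.
    """
    return (INTERCEPT
            + COEF_US_GROSS * total_us_gross
            + COEF_INTL_GROSS * intl_gross)

# Hypothetical film: $200M US gross and $150M international gross.
example_units = predict_units_sold(200_000_000, 150_000_000)
```

Under these assumptions the hypothetical film comes out at roughly 666,000 units (520,000 from US gross plus 600,000 from international gross, less the intercept).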

Neural Net
When exhaustively pruned, the neural network model can function with only one variable: total US box office sales. In its simplest form, the model uses all variables in near-equal fashion. The model is unreliable when overtrained, and it has the lowest linear correlation (0.764), the highest RMS error, and the worst lift chart of the four models.

= Findings =
The generalized linear model is extremely flexible. It allows the dependent variable to have a non-normal distribution (though it does not require one) and covers widely used statistical models, including logistic models for binary data, linear regression for normally distributed responses, log-linear models, and many others. In our analysis, we used a normal distribution, since our data had already been transformed toward normality, together with the identity link function (//f//(//x//) = //x//), which can be used with any distribution.

The team found that the General Linear model had the best fit, relative to the other models, for predicting the number of DVDs that can be sold from the predictor variables used. In the Model Development section we noted that the General Linear model gave us the best lift chart [lift = (hits in increment / records in increment) / (total number of hits / total number of records)] as well as the lowest RMS error of the models observed. This model also allowed us to predict sales rather than classify them into bins, as the CRT model did, and it gave us a greater number of predictor variables, whereas the Neural Net model was limited to one when exhaustively pruned.

Across every model, the team found that, contrary to its hypothesis, the seasonality of movie releases does not considerably affect units sold, especially when other variables such as box office sales are included in the model. [See Appendix for modeling parameters]
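The lift equation quoted in the Findings can be sketched as follows. The report does not say how a "hit" was defined, so this example simply takes a 0/1 flag per record as given (for instance, whether actual units sold exceeded some threshold). Records are ranked by predicted value and cut into equal increments; the function name and the ten sample records are hypothetical.

```python
def lift_by_increment(predicted, is_hit, n_increments=10):
    """Lift per increment, with records ranked from highest to lowest prediction.

    lift = (hits in increment / records in increment)
           / (total number of hits / total number of records)
    Assumes at least one hit overall, otherwise the base rate is zero.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    total_records = len(order)
    overall_rate = sum(is_hit) / total_records
    lifts = []
    for k in range(n_increments):
        start = k * total_records // n_increments
        end = (k + 1) * total_records // n_increments
        chunk = order[start:end]
        if not chunk:
            continue  # more increments than records
        hit_rate = sum(is_hit[i] for i in chunk) / len(chunk)
        lifts.append(hit_rate / overall_rate)
    return lifts

# Hypothetical scores and hit flags for ten records, for illustration only.
scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
hits = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
example_lifts = lift_by_increment(scores, hits, n_increments=5)
```

A lift above 1 in the first increments means the model concentrates hits among its highest-ranked records, which is the behavior a good lift chart shows.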

= Conclusion =
After testing the four models described above, the team found that the Generalized Linear model appears to have the best fit, partly because it is able to transform the data on its own as a component of the analysis. Our model allows us to predict, with reasonable accuracy, the number of DVDs that can be sold based on certain parameters. The Gen Lin model was chosen over the linear regression because we needed to minimize the difference between the actual and predicted number of units sold, and for this the team required as many relevant variables as possible. Moreover, the team abandoned the CR Tree model because it simply classifies the data and is not a predictive tool.

Total US Gross ticket sales, total International Gross, and the DVD_ReleaseDate_Year_2007 indicator seem to be the leading indicators. This is not surprising, since successful movies would be expected to lead to larger sales of their corresponding DVDs. Other factors, such as a holiday-season or summer release, were thought to be leading indicators but proved less indicative than US and international gross. An interesting side note is that DVD sales were not affected by the genre of the movie or the period of release as much as one might have expected. Genre specifically explained less than 5% of the variation in the data. Therefore any assumption that action or family movies spawn more DVD sales seems to be unfounded.

Since DVDs usually come out three months after the theatrical release, as indicated by the trends in the data collected, the team will have the requisite box office information in time to reasonably predict the approximate number of DVDs that will sell. The current model has a significant error percentage, which results from large differences between the predicted and actual number of units sold. Over time, with more data and refinement of our model, the team expects to achieve much greater accuracy.

[See Appendix for PASW stream and supplementary graphs]

[1] [] [2] []