MLB Attendance - DYSS




**Presented By:**
**Casey Morgan**
**Joe Parisi**

**Background:**

The goal of this project is to examine attendance records at MLB ballparks and identify variables that correlate with game-to-game fluctuations in attendance. By looking at both internal factors, such as team performance statistics, and external factors, such as weather and time of year, we hope to find what team owners, general managers, and event coordinators can do to affect attendance at their ballparks. We want not only to examine what efforts have worked in the past, but also to offer insight into which efforts could increase attendance in the future. To keep our data set a manageable size, we selected six teams with varying reputations and markets to represent the league as a whole. The following teams were selected based on their market size, tradition, and success.
 * Atlanta Braves (Large Market, Historic, Winners)
 * Los Angeles Dodgers (Large Market, Historic, Losers)
 * New York Yankees (Large Market, Historic, Winners)
 * Tampa Bay Rays (Small Market, New, Winners)
 * Seattle Mariners (Small Market, Mid, Mid)
 * Kansas City Royals (Small Market, Historic, Losers)

In looking at the home attendance records of these teams, we included the following variables:
 * Wins, Streak, Runs, Runs Against, Division Rank, Games Behind
 * Day of Week, Time of Day, Weather, Innings, Duration
 * Attendance, 30,000+, Opponent's Strength, Give-away, Bobble-head

These variables cover on-field performance, outside promotion, and natural, uncontrolled factors. They also include both pre-game data and in-game data. The variables were chosen primarily as measures of performance and promotion.

**Procedure:**

**Data Collection:**

The data set was assembled from several sources on the Internet. The baseball stats were all pulled from baseballreference.com, the weather data was pulled from wunderground.com, and information regarding ballpark giveaways was pulled from bobblesgalore.com. The variables listed above were collected from these sources.

**Data Prep:**

Most of the data was prepped within Excel for ease of use within Modeler. However, after the data was put into Modeler, the following steps were taken:
 * Outliers and extremes were discarded. There were only a few, and removing them left us sufficient data for analysis.
 * Most of the variables followed a normal distribution; however, because we were comparing variables on very different scales, such as rank (1-5) and attendance (10,000-50,000), we chose to normalize only the attendance variable. (Models were first run with raw data, then with standardized and normalized data; the best results were achieved when only attendance was normalized.)
 * We also thought it was important to be able to read the data when we were done, which is another reason the other variables weren't normalized or standardized. However, knowing that normalizing the attendance made the model better, we placed a derive node on the model to transform the attendance back to readable values.
 * Flags were created for days of the week and team.
 * Attendance was then set as the target variable.
 * During the prep, a PCA node was added; only one component showed a high correlation between two variables, team rank and games behind. Looking at the data set, we thought it was important to keep these separate because they convey two different pieces of information that we would like to see separately at the end of the model.
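As a rough illustration of the attendance normalization, the back-transformation to readable values, and the day-of-week flags described in the data prep, here is a minimal Python sketch. It is not the Modeler stream itself: the sample rows, the function names, and the choice of min-max scaling are all assumptions made for illustration, since the exact normalization used in Modeler is not stated.

```python
# Hypothetical sample rows; the real data set is not reproduced here.
games = [
    {"day": "Sat", "attendance": 41000},
    {"day": "Tue", "attendance": 18000},
    {"day": "Fri", "attendance": 30000},
]

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def fit_min_max(values):
    """Return (lo, hi) so values can be scaled to [0, 1] and back."""
    return min(values), max(values)


def normalize(value, lo, hi):
    """Min-max scaling (an assumed choice of normalization)."""
    return (value - lo) / (hi - lo)


def denormalize(scaled, lo, hi):
    """Inverse transform, playing the role of the derive node."""
    return scaled * (hi - lo) + lo


def day_flags(day):
    """One 0/1 flag per day of the week."""
    return {f"is_{d.lower()}": int(d == day) for d in DAYS}


lo, hi = fit_min_max([g["attendance"] for g in games])

prepared = []
for g in games:
    row = {"attendance_norm": normalize(g["attendance"], lo, hi)}
    row.update(day_flags(g["day"]))
    prepared.append(row)

# Transform the normalized target back to readable attendance figures.
restored = [round(denormalize(r["attendance_norm"], lo, hi)) for r in prepared]
```

Keeping the `(lo, hi)` pair from the fitting step is what makes the back-transformation possible, so predictions made on the normalized scale can still be reported as fan counts.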


**Stream**


The model was run several times with different results. First, all of the variables were run through the model, with mediocre results. Even so, the models that showed the best results were the neural network, the regression ensemble, and the CHAID tree model. The lift charts for the regression analysis look pretty good at first, but after adjusting to a user-defined hit, the results don't seem as good.



Two lift charts: one with normal settings (left) and one with a user-defined hit on Attendance (right).
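For readers unfamiliar with lift, the sketch below shows one common way to compute it for a user-defined hit: rank the games by predicted attendance, take the top fraction, and compare the hit rate there with the hit rate over all games. The data values, the function name `lift_at`, and the hit definition (attendance of at least 30,000) are assumptions for illustration; the hit condition actually used in the charts is not stated.

```python
def lift_at(scores, actuals, hit, fraction=0.1):
    """Lift of the top `fraction` of records ranked by predicted score.

    hit: function mapping an actual value to True/False (the user-defined hit).
    """
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_hits = sum(hit(actual) for _, actual in ranked[:top_n])
    all_hits = sum(hit(actual) for _, actual in ranked)
    if all_hits == 0:
        raise ValueError("no hits in the data; lift is undefined")
    top_rate = top_hits / top_n          # hit rate among the top-ranked games
    base_rate = all_hits / len(ranked)   # hit rate over all games
    return top_rate / base_rate


# Hypothetical predicted and actual attendance values.
predicted = [42000, 39000, 35000, 31000, 28000, 25000, 22000, 20000, 18000, 15000]
actual = [40000, 41000, 29000, 33000, 30500, 21000, 24000, 19000, 17000, 16000]

# Assumed hit definition: the game actually drew at least 30,000 fans.
lift_top20 = lift_at(predicted, actual, hit=lambda a: a >= 30000, fraction=0.2)
```

A lift above 1 means the model's top-ranked games contain hits more often than the data set as a whole. Changing the hit definition changes both rates, which is why a chart can look strong under default settings and weaker under a stricter user-defined hit.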


After looking further at the data, we stripped out the team names, as well as all of the variables that wouldn't be known before the game began: runs, runs against, time of game, and number of innings. We did this because we only want to use the information a customer would have when deciding whether or not to buy a ticket. Along with this, bagging was added to the neural network model to try to improve the results. This time, the results were greatly improved.
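To illustrate what bagging does, the following Python sketch fits the same simple model on several bootstrap resamples of the data and averages the predictions. This is only a sketch under stated assumptions: the base learner here is a one-variable least-squares line rather than the neural network used in the project, and the data values and function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; falls back to the mean if x is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b


def bagged_fit(xs, ys, n_models=25, seed=0):
    """Fit one base model per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all base models."""
    return sum(a + b * x for a, b in models) / len(models)


# Hypothetical pre-game feature (games behind) and attendance.
games_behind = [0, 1, 2, 4, 6, 8, 10, 12]
attendance = [42000, 40000, 38500, 35000, 31000, 28000, 24500, 21000]

models = bagged_fit(games_behind, attendance)
close_race = bagged_predict(models, 1)   # team 1 game out of first
far_back = bagged_predict(models, 10)    # team 10 games out of first
```

Averaging over resamples tends to reduce the variance of an unstable learner, which is one plausible reason the bagged neural network produced steadier results across seeds.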


We now have a model with 97.2% accuracy from the reference model and a mean accuracy of 80.7%. Running the model again with a different seed generated similar results, with 95.1% accuracy and a mean accuracy of 83.0%.

Using the regression ensemble, however, produced worse results. The relative error went from about 30% to between 70% and 80%.



Analyzing these models, two predictors stand out as by far the most important: team rank and games back (GB) from first place. This is very evident in the neural network model as well as in the CHAID model, which splits first on team rank. There are three splits: first place or below, between first and second, and below second place. There are then a couple more splits on the day the game falls on and the number of wins the team has, but rank is the most important predictor.



After running all of these models, it is very evident that the number one predictor of baseball attendance is the rank of the team. Essentially, “if the team is in first place, more people will go to the game.”

After thinking about this and looking at the data set, it appears that every team was in first place at one time or another. So what happens when a team isn't that good? Is there no hope of drawing fans in? Since there is ample data in this set, all of the games played while a team was in first place were stripped out. This left the data set with 433 rows to analyze. The same stream was then recreated with a new neural network and CHAID model.



With this model, the results were very different. The two models seem fairly accurate for prediction, but the predictors have changed. The model isn't as accurate as before, with an accuracy rating of 80.6% and a mean accuracy rating of 76.4%.



The most important predictors are now games on Saturday, bobbleheads, and giveaways. This is fairly interesting because the data set contains only 75 rows with giveaways and 16 rows with bobbleheads, which suggests a spike in attendance during these games. The CHAID model shows similar, though slightly different, results: its most important predictors are game-day temperature, games on Saturday, giveaways, and then bobbleheads. The tree also splits on Saturday, showing that games on Saturday have higher attendance.



After running this data, it is pretty evident that when a baseball team isn't in first place, people go to the ballpark for an experience and not just to watch the team play. This experience is further enhanced by the possibility of receiving a giveaway.

We then created one last model. In baseball, a club that can average a per-game attendance of 30,000 fans is usually profitable. Considering this, we created a dummy variable in the data set flagging the games that drew 30,000 or more fans: a 1 indicates a game with at least 30,000 fans, and a 0 indicates fewer than 30,000.



The results of this model are similar to those of the first model. This time, the model had an accuracy of 96.7% on the reference model, 96.45% on the ensemble, and 55.5% accuracy against the naïve rule. The most important predictors in this model remain team rank and games behind. The model also has a misclassification rate of 16.4%. So, while not great, it seems to do a pretty good job of predicting whether at least 30,000 fans will attend a game.
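The 30,000+ dummy variable and the two yardsticks mentioned here, the naïve rule and the misclassification rate, can be sketched in Python as follows. The attendance figures and model predictions are hypothetical, and the function names are ours; the naïve rule is taken in its usual sense of always predicting the most common class.

```python
def big_crowd_flag(attendance, threshold=30000):
    """1 if the game drew at least `threshold` fans, else 0."""
    return int(attendance >= threshold)


def naive_rule_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    ones = sum(labels)
    return max(ones, len(labels) - ones) / len(labels)


def misclassification_rate(labels, predictions):
    """Share of games where the predicted flag differs from the actual flag."""
    wrong = sum(actual != pred for actual, pred in zip(labels, predictions))
    return wrong / len(labels)


# Hypothetical attendance figures and model predictions.
attendance = [41000, 18000, 30000, 29500, 35500, 22000, 31000, 27000]
labels = [big_crowd_flag(a) for a in attendance]
predictions = [1, 0, 1, 1, 1, 0, 0, 0]

baseline = naive_rule_accuracy(labels)
error = misclassification_rate(labels, predictions)
model_accuracy = 1 - error
```

Comparing `model_accuracy` with `baseline` shows how much the model adds over simply guessing the majority class for every game.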





Also, looking at the ensemble for our third model, we found something very telling in the CHAID tree. If a team was ranked third or worse in its division, temperature was the next deciding factor; then, in cooler weather, attendance depended on whether or not it was a bobblehead giveaway night. This leads us again to the assumption that if a team is not performing on the field, the stadium experience is what matters. Although a team cannot control the weather, it may be wise to offer giveaways on days with bad weather in the forecast. Basically, if you own a team that isn't in first or second place and the average temperature is below 87, the way to draw fans to the stadium is by giving away bobbleheads to the first 30,000 fans.



**Conclusions:**

The conclusions after creating and running these models are the following:
 * Game attendance is largely based on what place the team is in, and especially on whether or not the team is in first place.
 * It is also largely based on how many games out of first place the team is.
 * When a team is not in first place, fans are more interested in a ball game as an outing than as a sporting event.
 * While a giveaway may spike attendance for a couple of games, it will be hard to average 30,000 fans per game (a benchmark within baseball for a profitable ball club versus an unprofitable one) on promotions alone.
 * Fans will come to games on Saturdays solely because of the day of the week. To shift some attendance numbers, it may be a good idea to hold giveaways and bobblehead nights on game days during the week.


**Next Steps:**

If we were working for a baseball organization and this were a real project, there is more data that we would want. More information on ticket sales would be especially useful. We would like to know how ticket sales are distributed with respect to things like weather, giveaways, and bobblehead days. Because people want a game to be an outing and not just a ball game, how far in advance are they buying tickets? When a promotion is announced, do people buy more tickets? When a game falls on a day with pleasant weather, are more tickets sold on the day of the game? These could be important things for a marketer to know when creating promotions. It may be a good idea to run promotions early in order to better forecast game attendance, or attendance may simply depend on the weather that day, with many people walking up for tickets. Answers to these questions would help marketing run better promotions and help the front office better forecast game attendance.

Also, using the above data, it may be possible to compile a customer database based on ticket purchases. The categories could be something like season ticket holder, occasional fan, avid fan, etc. By using a k-means or two-step model, we would then be able to group customers into clusters based on their ticket purchasing habits. This would give us a better idea of whom to target with giveaway promotions.
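As a sketch of how the proposed k-means clustering might work, the Python below groups customers by two hypothetical purchasing features: games attended per season and the average number of days tickets were bought in advance. The customer data, the feature choice, and the starting centroids are all assumptions for illustration; no such customer database exists in this project, and a real analysis would likely scale the features first.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignment, centroids


# Hypothetical customers: (games attended per season, avg days bought in advance).
customers = [(81, 120), (78, 110), (20, 14), (25, 10), (3, 1), (2, 0), (4, 2)]

# Assumed starting centroids: one customer from each suspected group.
start = [customers[0], customers[2], customers[4]]

assignment, centroids = kmeans(customers, start)
```

With these made-up numbers the three clusters line up with the season ticket holder, avid fan, and occasional fan categories suggested above, and each cluster could then be targeted with a different giveaway promotion.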