Horse Racing Project Proposal

Team members: Aaron Warren, Brandon Alcott, Abdullah Aljuffali

Purpose:
The purpose of our project is to predict the winner of horse races with accuracy better than picking the betting favorite.

Variable Control within the Purpose:
It is important to remember that horse racing is a form of gambling, and accurately predicting outcomes in gaming is extremely difficult. Horse racing, however, differs from some forms of gaming in that it is not purely a game of chance. It involves live animals that behave in different ways, go in and out of condition, and respond to their environment with varying degrees of consistency. The player's advantage comes from analyzing the past performances of every horse in a given race and comparing those performances with the current race scenario. This past performance data is found in the Daily Racing Form, and we will use every variable it includes in order to create as accurate a model as possible. The Daily Racing Form includes data on past performances (race details, track, surface, race time, order of finish, weight carried, post position) as well as information pertaining to the current race (jockey stats, trainer stats, post position, medications, morning line odds).

Hypothesis:
Our hypothesis is that we can create a model that can predict the winner with accuracy better than 33 percent, the accuracy of picking the betting favorite.

Data Collection and Procedure:
At the Daily Racing Form, data is collected after each race on the participating horses. These statistics include horse-specific figures such as how fast the horse ran, what drugs it is on, its weight, etc. Data is also collected on the jockey, the track, and the previous races the horse ran in. This data is organized by horse and by race. It can be purchased before the race day, and additional data is available directly before the start of the race.

Since we are attempting to select the winning horse, we will most likely use one of the classification models. Because there will be a much higher percentage of non-winning horses, our data will be unbalanced; a "balance" node will be used to fix this issue. In an attempt to find the most accurate model, we will use partitioned data and run one set of models on data that has been z-scored and another on raw data. Because there is so much data, we run into the curse of dimensionality, so we will attempt to use the PCA/Factor or Feature Select nodes to find the most pertinent data and reduce the number of predictors used in the model. Another way to reduce the data would be to compare the important predictors in each of the selected models; if certain factors are consistently important, we can attempt to train a model on just those factors. The models we will try include neural networks, trees, and other classification models. We will run them all in order to find the most accurate and most reliable model.

We also conducted a literature review into what others have done. A large amount of research has been done in the past on horse race prediction. Although the predictors used vary, a number of the sources we found use neural networks to predict the Win, Place, or Show horse (first, second, and third places): NeuNet Software, Race Predictor, Freakonomics, Race Profit Generator (RPG), and the list goes on.
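The balancing and z-scoring steps described in this section can be sketched in a few lines of Python. This is only an illustration of the general idea, not the SPSS nodes themselves: the function names, the `won` and `weight` fields, and the toy rows are all made up for the example, and random oversampling of winners is just one assumed way a balancing step might work.

```python
import random
import statistics

def z_score(values):
    """Rescale a numeric column to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mean) / spread for v in values]

def oversample_winners(rows, label_key="won", seed=0):
    """Duplicate randomly chosen winner rows until both classes are the same size."""
    rng = random.Random(seed)
    winners = [r for r in rows if r[label_key]]
    losers = [r for r in rows if not r[label_key]]
    if not winners or len(winners) >= len(losers):
        return list(rows)  # nothing to balance
    extra = [rng.choice(winners) for _ in range(len(losers) - len(winners))]
    return losers + winners + extra

# Hypothetical toy data: one winner and three non-winners from a single race.
rows = [
    {"horse": "A", "weight": 118.0, "won": True},
    {"horse": "B", "weight": 120.0, "won": False},
    {"horse": "C", "weight": 122.0, "won": False},
    {"horse": "D", "weight": 124.0, "won": False},
]

balanced = oversample_winners(rows)
scaled_weights = z_score([r["weight"] for r in rows])
```

In a sketch like this, balancing would normally be applied to the training partition only, so that the test partition keeps the real proportion of winners.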

Pre-Process:
We used data from three different tracks, including Portland Meadows and Emerald Downs. Because of the large number of predictors in our data sets, and because large portions of certain predictors were missing, preparing the data was the most time-consuming part of our modeling. After getting the data into a format usable with SPSS, we filtered out a number of unnecessary fields that were either lacking data or contained unusable data such as comments. Next, we appended all of our data together into one large data source. From there, we used the Auto Data Prep node to help standardize the data and further remove missing data. After the Auto Data Prep we still had around 1,200 predictors; to lower this number, we used a Feature Select node to keep only the most relevant predictors, which cut the number to 149. We then used a number of different classification models (listed in the next section) to find the most accurate one.

Models to use:
Auto Classify, SVM, Bayes Net, Neural Net, CHAID, QUEST, C&R Tree, C5.0

Developing the model:
After checking all of these models, we found that the Neural Network was the most accurate and that it also produced probabilities, which allow us to bet on the order of finish of the horses. Through multiple revisions and an additional Feature Select node, we arrived at a fairly accurate model that can make a good return using the Exacta bet described below.

Data Interpretation:
In creating our model, we grouped the data from every race together, so the predicted winners were derived from propensity scores computed across the entire data set. Knowing that this would cause the model to predict multiple winners for some races and no winners for others, we opted to use the raw propensity scores to manually create a predicted order of finish for each race (the horse with the highest score being the predicted winner). Based on these scores, we can use our domain knowledge and the odds of each horse to make betting decisions. For instance, if our model shows that two horses have a nearly identical chance of winning, the better value is to bet on the horse with longer odds (better payout), or to use other "at the track" information to pick a winner (if a horse looks overly excited directly before the race, you might change your betting decision).
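The manual ranking step described above amounts to grouping the scored horses by race and sorting each group by propensity. The following is a minimal sketch of that idea; the function name, the tuple layout, and the scores themselves are hypothetical and are not taken from our data.

```python
from collections import defaultdict

def predicted_order(scores):
    """Group (race_id, horse, propensity) rows by race and rank each race by score, highest first."""
    by_race = defaultdict(list)
    for race_id, horse, propensity in scores:
        by_race[race_id].append((horse, propensity))
    return {
        race_id: [horse for horse, _ in sorted(entries, key=lambda e: e[1], reverse=True)]
        for race_id, entries in by_race.items()
    }

# Hypothetical scores. In race 2 no horse scores highly, so a pooled
# classifier might flag no winner at all, but ranking still yields a first pick.
scores = [
    (1, "Horse A", 0.41), (1, "Horse B", 0.63), (1, "Horse C", 0.12),
    (2, "Horse D", 0.08), (2, "Horse E", 0.05), (2, "Horse F", 0.07),
]

order = predicted_order(scores)
```

Ranking within each race guarantees exactly one first pick per race, regardless of how the scores compare across races.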

In order to test our model, we bought the Daily Racing Form for May 2nd at Emerald Downs (Auburn, WA) to see how the model might work in a real-world scenario. The racing form is available two days before the race day, so there is plenty of time to run the data through a model. Once the race results were available, we had the following outcome:

Of 9 races:
First pick won 2 times
First or Second pick won 6 times
First, Second, or Third pick won 7 times
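Expressed as hit rates, these counts can be computed as below. This is a small sketch that uses only the counts reported above; the variable names are our own.

```python
races = 9
# Number of races in which the winner was among the model's top k picks.
hits_within_top_k = {1: 2, 2: 6, 3: 7}

hit_rate = {k: hits / races for k, hits in hits_within_top_k.items()}

for k, rate in hit_rate.items():
    print(f"winner within top {k} pick(s): {rate:.1%}")
```

This works out to roughly 22.2%, 66.7%, and 77.8% for the top one, two, and three picks respectively.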

In order to better evaluate this output, we must look at the payout under some typical betting strategies.

The Win Bet:
One of the simplest and most common bets is the win bet. This requires you to bet on a single horse, and this horse must win the race. Our model correctly predicted only 2 of 9 winners (22%); however, the favorite also won only 2 times. On this particular race day, our model predicted the winner exactly as well as betting on the favorite. If we were to bet $20 per race, we would spend a total of $180. Considering the payout of the two winners, we would get $146 back. This is a loss of $34, or a Return on Investment of -18.9%.
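The Return on Investment figure for the win bet can be checked with a short calculation. This is a sketch using the amounts stated above; the helper name `roi` is our own.

```python
def roi(total_staked, total_returned):
    """Return on investment, expressed as a fraction of the amount staked."""
    return (total_returned - total_staked) / total_staked

stake_per_race = 20
races = 9
total_staked = stake_per_race * races   # $180 across the race day
total_returned = 146                    # payout from the two winning tickets

win_bet_roi = roi(total_staked, total_returned)
print(f"{win_bet_roi:.1%}")
```

A negative value means the bets returned less than was staked.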

The Exacta Box:
Exactas are a very common “exotic” bet, in which you must pick the top two finishers in order. An exacta “box” is actually multiple exacta bets that capture your choices in every possible combination. For example, if I buy an exacta box with horse #1 and horse #2, I am actually purchasing two exacta bets, one in the combination 1-2, the other 2-1. As long as the number 1 and 2 horses finish first and second (in either order), I win the bet. You can also put more than two horses in an exacta box. A three-horse exacta box is actually 6 separate exacta bets, capturing all exacta combinations. The bet would look like this:
First Place: 1,2,3
Second Place: 1,2,3

Looking at our model, we know that it is good at predicting the winner out of its first or second choices. Knowing this, while betting at the track, I might avoid win bets in favor of an exacta box. This bet doesn’t force me to pick one horse to win. In fact, I might do a 3 horse exacta box to further improve my chances. If we were to bet $18 per race on an exacta box with our top 3 choices, we would cash tickets on 4 of the races. We would spend a total of $162, and would get back $211.50. This is a profit of $49.50, providing a ROI of 30.6%.
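The six combinations in a three-horse exacta box, and the cost and return figures above, can be reproduced with a short sketch. The function name is our own, and the $3 stake per combination is not stated directly; it is inferred from the $18 per-race cost divided by six combinations.

```python
from itertools import permutations

def exacta_box(horses):
    """Every ordered (first, second) pair that can be formed from the boxed horses."""
    return list(permutations(horses, 2))

combos = exacta_box([1, 2, 3])
# (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)

cost_per_race = 18
unit_stake = cost_per_race / len(combos)   # implied stake per combination
total_staked = cost_per_race * 9           # nine races on the card
total_returned = 211.50

exacta_roi = (total_returned - total_staked) / total_staked
print(len(combos), unit_stake, total_staked, f"{exacta_roi:.1%}")
```

Because the box covers both orders of every pair, the number of combinations grows quickly: n boxed horses cost n × (n − 1) unit stakes.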

The Trifecta:
The trifecta is another exotic bet, but it requires you to pick the top 3 finishers in order. Trifectas can be difficult to pick, because even the “worst” horse in the race can find its way into third place, ruining your chance of cashing a ticket. However, you can create a trifecta bet in a number of ways. One common strategy is to create a ticket with your top two choices in the 1st spot, top three choices in the 2nd spot, and top five choices in the 3rd spot. The bet might look like this:
First Place: 1,2
Second Place: 1,2,3
Third Place: 1,2,3,4,5

As you can see, this bet could be perfect for our model, as either one of our top two choices can win the race.

If we were to spend $24 on this trifecta bet in each race, we would spend $208 (one race had only 4 horses, so we spent less on that race). We would cash 3 tickets, returning a total of $275.20. This is a profit of $67.20, and a ROI of 32.3%.
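The ticket cost can be checked by enumerating the combinations the trifecta ticket covers. This is a sketch under two assumptions that are not stated outright: a $2 stake per combination (inferred from the $24 cost of a full ticket) and the same ticket structure, limited to the available horses, in the four-horse race. Both are consistent with the $208 total.

```python
def trifecta_ticket(first, second, third):
    """Every (1st, 2nd, 3rd) combination covered by the ticket, with no horse used twice."""
    return [
        (a, b, c)
        for a in first
        for b in second if b != a
        for c in third if c not in (a, b)
    ]

full_field = trifecta_ticket([1, 2], [1, 2, 3], [1, 2, 3, 4, 5])
four_horse = trifecta_ticket([1, 2], [1, 2, 3], [1, 2, 3, 4])

unit_stake = 24 / len(full_field)                        # implied stake per combination
total_staked = 8 * 24 + unit_stake * len(four_horse)     # 8 full tickets plus the short field
total_returned = 275.20

trifecta_roi = (total_returned - total_staked) / total_staked
print(len(full_field), len(four_horse), total_staked, f"{trifecta_roi:.1%}")
```

The full ticket covers 12 combinations and the four-horse ticket covers 8, which accounts for the $16 spent on the short field.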

Conclusion:
Our model did not predict the winner with higher accuracy than picking the betting favorite (it performed equally well), but it performed very well at picking the winner out of its top two choices. Using some domain knowledge, it is clear that, on this race day, the above-mentioned trifecta betting strategy made the best use of the model. It is impossible to say that this model coupled with a trifecta betting strategy would continue to offer similar returns, but it seems plausible that with further refinement the model could continue to provide value to betting decisions. This model was created using only 11 days of racing data (approximately 90 races), and incorporating more data would very likely aid in its refinement. It would also be beneficial to include additional data entered by the user. For instance, if you could input information on the current condition of each horse in the minutes before a race, it could add value (a calm horse might perform better than an overly excited horse).