
= __**Future Status of Oil Prices**__ =

Prepared by: Kendra Kennedy, Minh Nguyen and Akkadet Udomsirithamrong



Data Set for range model, Data Set for binary model

 * __Raw Data__
 * __Streams__
 * __Presentation__


**__Executive Summary__**

This report explores models designed to predict the future status of oil prices, where "status" is defined as whether the price went up or down from the previous period. The data were collected from government agencies, Google Finance, and Yahoo Finance, and consist of the monthly average crude oil price, monthly averages of various economic indicators, and monthly averages of related stock prices from 1980 through 2009. Using a CHAID model and a classification tree, we created two models that predict the future status of oil prices with about 60 percent accuracy. One model uses range data and is aimed at economists and financial experts who have access to accurate forecasts of the economic indicator variables; the other uses binary data and is aimed at people who lack such forecasts but have some idea of whether each independent variable will go up or down. The analysis indicated that the most important variable in both models was FedEx's stock price; investors should keep an eye on this stock when choosing long or short positions in futures contracts.


**__Introduction and Purpose__**

Trading futures contracts can yield high returns to investors; however, wherever there are high returns, there are also high risks. An incorrect prediction of the direction and magnitude of a change in a commodity’s price can lead to major losses, so an investor needs to be confident in the estimate. One of the most commonly traded commodities is crude oil. Our team aims to determine whether the oil price will go up or down based on a set of economic and related indicators. If successful, this model will add value for investors by helping them predict the future status of the oil price for futures contracts.

To the team's knowledge, there has not yet been a study quite like ours. Our literature review indicated that most economic research and modeling has been targeted at predicting the price of oil as a function of the situation in the Middle East (see project proposal). These models primarily use regression techniques to predict the exact price of oil; they do not predict, as a binary variable, whether the price will go up or down. Though different from our proposed model, these studies helped us determine which variables to include. Two significant variables identified in the literature review were the CPI and the federal funds rate, both of which the team had initially overlooked. Another insight from the literature review is that lower-frequency data (monthly, quarterly, or yearly) works better for volatile markets such as the market for crude oil. The team decided to use monthly data from 1980 through 2009, for a total dataset of 360 records.


**__Variables__**

Our model tries to predict the status of the crude oil price, where “status” refers to whether the price goes up or down from period to period. We derived this variable by taking the real monthly price of crude oil, lagging it by one period, and using if statements to turn the two columns of range values (current and lagged price) into a binary variable. If the price of oil went up relative to the previous period, the binary variable has a value of one; otherwise the value is zero.
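
The derivation of the status variable can be sketched in a few lines of Python. This is only an illustration of the lag-and-compare rule described above, not the team's actual spreadsheet logic; the function name `price_status` and the sample prices are assumptions made for the example.

```python
def price_status(prices):
    """Return 1 where the price rose versus the previous period, else 0.

    The first period has no predecessor, so the result is one element
    shorter than the input.
    """
    status = []
    # Pair each price with the one before it (the "lagged" column).
    for previous, current in zip(prices, prices[1:]):
        status.append(1 if current > previous else 0)
    return status


# Illustrative prices only, not the project's data.
monthly_prices = [30.0, 32.5, 32.5, 31.0, 35.2]
print(price_status(monthly_prices))  # [1, 0, 0, 1]
```

Note that an unchanged price is coded as zero, which matches the "otherwise the value is zero" rule in the text.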

Our first attempt at the model included the consumer price index, the consumer confidence index, the federal funds rate, the seasonally adjusted unemployment rate, the domestic population, and the S&P 500 index. Although these economic indicator variables certainly play a role in the price of oil, we also wanted a few variables that would be more highly correlated with the demand for oil. We decided to add FedEx’s stock price and Southwest Airlines’ stock price to the model since these two companies rely on oil for their business operations.

The data mining team collected many economic indicator variables, but filtered most of them out to avoid multicollinearity. For example, we decided not to use the real inflation rate because the model already included the consumer price index. We also decided not to use real GDP because the S&P 500 index captures the same idea. We chose the federal funds rate because it is a good indicator of interest rates in general, and interest rates affect the overall economy. Instead of the regular unemployment rate, we chose the seasonally adjusted unemployment rate because it is not distorted by recurring seasonal swings. One of the most important influences on the price of oil is the situation in the Middle East. To capture this effect, we included the consumer confidence index, reasoning that it would be a good proxy for how people feel about the global economy.

Additionally, we included the dollar index to explore whether the price of oil is affected by the strength of the US dollar relative to other currencies. Indeed, in the correlation matrix, the dollar index and the oil price were strongly inversely correlated. Investors should watch the dollar index when deciding whether to take a short or long position.

We included an indicator variable for each month of the year to see whether the price of oil depends on the month in which it is sold. Perhaps the price increases in the summer months, or there is some other seasonal trend.
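
As a rough illustration of what "an indicator variable for each month" means, the sketch below turns a month number into twelve 0/1 flags. The names `MONTHS` and `month_indicators` are hypothetical and chosen only for this example; the report does not say how the team built these columns.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_indicators(month_number):
    """Return a dict of twelve 0/1 flags for a month number from 1 to 12."""
    if not 1 <= month_number <= 12:
        raise ValueError("month_number must be between 1 and 12")
    # Exactly one flag is 1: the one whose position matches the month.
    return {name: int(position == month_number)
            for position, name in enumerate(MONTHS, start=1)}


flags = month_indicators(5)
print(flags["May"], sum(flags.values()))  # 1 1
```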

Finally, we included the price of corn because corn has recently been used as an alternative to gasoline as a fuel source for cars. We expected these variables to have an inverse relationship because they are substitutes; this was later confirmed in a correlation matrix. Although this is consistent with our instincts, the corn price variable might skew the analysis somewhat because corn has only recently been thought of as a substitute for crude oil as a fuel source. This is something to keep in mind when reading the conclusions.

One thing to keep in mind in this study is the direction of cause and effect. While general economic conditions certainly affect the price of oil, the price of oil affects economic indicators as well. Many articles in our literature review argued that fluctuations in the oil price were the cause of recessions, meaning that the oil price could be a leading indicator of an economic downturn. Since we make the opposite argument (that the oil price is a function of what is going on in the economy), we need to keep in mind that some of the correlation between the oil price and the economic indicators and company-specific stock prices may reflect a causal relationship opposite to the one we are arguing.

|| Variable || Type || Direction ||
|| Month || Set || Input ||
|| Change in Crude Oil Price || Flag || Output ||
|| Monthly Inflation Rate || Range || Input ||
|| Monthly S&P 500 Index || Range || Input ||
|| Consumer Confidence || Range || Input ||
|| Consumer Price Index || Range || Input ||
|| Diesel Price || Range || Input ||
|| Population || Range || Input ||
|| Monthly GDP || Range || Input ||
|| Monthly Federal Funds Rate || Range || Input ||
|| Monthly Unemployment Rate || Range || Input ||
|| Monthly Corn Price || Range || Input ||
|| Monthly Dollar Index || Range || Input ||
|| FedEx's Stock Price || Range || Input ||
|| Southwest Airlines' Stock Price || Range || Input ||

**__Completed Data Mining Steps__**


 * Data collection. For sources of data, please refer to the project proposal.
 * Change the crude oil price into a binary variable
 * Gain an understanding of the oil industry through literature reviews
 * Visualize the data by computing statistics and viewing histograms of each variable
 * Identify extreme values and replace them with the three-sigma control limits
 * Data pre-processing: Transform the variables if needed. Remove any missing values in the data set.
 * Partition the data into 60% training and 40% testing.
 * Classify the data under a variety of classification models (K-NN, Naïve Bayes, Logistic Regression, Classification Trees, Neural Nets and Discriminant Analysis).
 * Evaluate the models and select the one with the highest accuracy
 * Comment on any interesting findings relating to variable importance
 * Present findings to other data miners
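
Two of the steps above, capping extreme values at the three-sigma limits and partitioning the data 60/40, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the limits are taken as the sample mean plus or minus three sample standard deviations, and the split is a seeded random shuffle. The report does not say whether the team split randomly or chronologically.

```python
import random
import statistics


def cap_three_sigma(values):
    """Clamp each value to the range mean +/- 3 sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - 3 * sd, mean + 3 * sd
    return [min(max(v, lower), upper) for v in values]


def partition(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 360 monthly records, as in the report; the record contents are placeholders.
train, test = partition(range(360))
print(len(train), len(test))  # 216 144
```

For time-ordered data such as monthly prices, a chronological split (training on earlier months, testing on later ones) is a common alternative to a random shuffle.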


**__Models__**

We began the modeling process by doing a data audit of the variables. The output variable, status, was almost perfectly evenly distributed, so there was no need to balance it. Most of the variables were skewed and had drastically differing ranges. Because of this, the team determined it was necessary for the model to include an auto data prep node to normalize the data.
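
The report does not say which rescaling the auto data prep node applied. As one common possibility, and purely as an assumption, the sketch below standardizes a variable to zero mean and unit standard deviation (a z-score), which puts variables with very different ranges on a comparable scale.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Illustrative numbers only, not the project's data.
scaled = standardize([2.0, 4.0, 6.0])
print([round(v, 3) for v in scaled])  # [-1.225, 0.0, 1.225]
```

Standardizing changes the scale but not the shape of a distribution, so a skewed variable stays skewed; correcting skew would need a separate transformation.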



The team started with a model that used the status variable as the response and the predictor variables in range form. An auto classifier analysis ranked the classification tree as the best model and the neural net as second best. We performed additional analysis on these models; the highest accuracy was 59.21 percent, obtained by the classification tree on the validation data.

In both of the best range-data models, FedEx’s stock price was the most important variable. Investors should keep an eye on this stock’s performance when choosing futures positions. Other important variables included the consumer confidence index, the consumer price index, the federal funds rate, and the S&P 500 index. Investors should try to get accurate forecasts of these variables to maximize the payoff of their oil futures.

Sometimes accurate economic forecasts will not be available, but there will at least be a forecast of whether a macroeconomic indicator will go up or down. We created a second model around this idea: instead of using the range variables, we transformed all the independent variables into binary variables indicating whether the variable went up or down between periods, just like the status variable. This model lets investors plug in the status of the independent variables to find the status of the price of oil. We filtered out the population variable in this model because the population almost always increases from period to period, so we did not think it would add any insight for investors. We performed an auto classifier analysis to identify the best models for further analysis.
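
The transformation for the binary model can be sketched as below. The team removed the population variable by judgment; the automatic rule used here (drop any column whose up/down flags never vary) is an assumption added for illustration, as are the function and column names.

```python
def to_direction(values):
    """Return 1 where a value rose versus the previous period, else 0."""
    return [1 if current > previous else 0
            for previous, current in zip(values, values[1:])]


def binary_predictors(columns):
    """Convert each column to up/down flags and drop columns that never vary."""
    result = {}
    for name, values in columns.items():
        flags = to_direction(values)
        if len(set(flags)) > 1:  # a flag that is always 1 (or 0) adds nothing
            result[name] = flags
    return result


# Illustrative numbers only, not the project's data.
columns = {
    "population": [100, 101, 102, 103],
    "fedex_price": [20.0, 22.0, 21.0, 25.0],
}
print(binary_predictors(columns))  # {'fedex_price': [1, 0, 1]}
```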

The highest accuracy was 60 percent, obtained with the CHAID (Chi-squared Automatic Interaction Detector) model, which resembles a classification tree and uses a multi-way tree algorithm to explore the data. The team was pleased to see that the binary model had about the same level of accuracy as the range model. In this model, the most important variable by far was FedEx’s stock price, and the second most important was the month of May variable. We found it interesting that the May variable was so significant; we went back to the correlation matrix and found that this variable was positively correlated with the price of oil. This suggests that if it’s May, the price of oil is likely to increase. Investors should keep this in mind when choosing short or long positions.
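
CHAID chooses its splits using chi-squared tests of independence between each predictor and the target. The sketch below computes only the Pearson chi-squared statistic for one binary predictor (such as the May flag) against the binary status variable; a full CHAID implementation also merges categories and adjusts p-values, which is not shown. The function name and the toy data are assumptions for illustration, not the project's results.

```python
def chi_square_2x2(flags, outcomes):
    """Pearson chi-squared statistic for two equal-length lists of 0/1 values.

    Assumes every row and column total is nonzero (both values of each
    variable occur at least once); otherwise an expected count would be zero.
    """
    n = len(flags)
    observed = [[0, 0], [0, 0]]  # observed[flag][outcome]
    for flag, outcome in zip(flags, outcomes):
        observed[flag][outcome] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Toy data, invented for the example.
is_may = [1, 1, 0, 0, 0, 0, 0, 0]
went_up = [1, 1, 0, 0, 1, 0, 0, 1]
print(round(chi_square_2x2(is_may, went_up), 3))  # 2.667
```

A larger statistic indicates a stronger association between the predictor and the target, which is why a predictor like the May flag can rank highly even when the reason behind the association is unclear.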




**__Deployment__**

The range model is aimed at futures brokers, economists, and others who are privy to specific forecasts of the variables. The binary model could be used by the general public because people can generally gather from the news whether a variable will go up or down, even if they do not know by how much. Although neither model has particularly high accuracy, this is to be expected when predicting commodity prices; if models could be created with 80 percent accuracy or more, futures would not be such a risky investment and therefore would not offer as high a payoff.


**__Limitations__**

Although we are pleased with the results of our analysis, our models have several limitations. First, there is no perfect variable to capture the situation in the Middle East, which is arguably one of the most important factors in the price of oil. Second, oil is a globally traded commodity, but our model contains no global economic indicator variables. American macroeconomic variables such as unemployment and inflation are a reasonable but imperfect proxy for global economic conditions. Third, the model might be more useful to investors if the data were in twice-monthly or even weekly form; however, such data is not available for the majority of variables in our model. Fourth, our model is based on the demand side of the price of oil and ignores aspects of the supply side, such as technological breakthroughs, that would play into the commodity’s price. We did this because it is much easier to find data on the demand side than on the supply side. Finally, many aspects of the price of oil seem random, such as the month of May being such an important variable. There are surely other variables that we did not think to include in the model, and there is certainly a large component of the price of oil that cannot be explained by anything but randomness.