Predicting the Stock Market Movement

The purpose of this project is to address the age-old question, “Can the stock market be predicted?”, and in doing so to create tremendous wealth for those capable of predicting the unknown. Every January, hordes of highly paid experts attempt to predict what the economy and world markets will do in the coming year. Later in that year, nearly all of the forecasts turn out to be wrong. Analysts typically base their forecasts on inflation and interest rates, earnings, oil and energy prices, and political instability or unrest. Even with millions of lines of information available to be analyzed, nobody has been able to put a finger on why the stock market acts the way it does. As a group, we are determined to test commonly known economic indicators such as oil price, agricultural output, and gas supply and demand for their ability to predict the stock market accurately. Credible sources like The Wall Street Journal 1 and financial guru J. Welles Wilder 2 suggest that an investor study sentiment or “mood” surveys as a market signal, or use a momentum oscillator that measures the speed of change in price movements. While there are a multitude of theories on stock market prediction, there is really only one thing we can be sure of when making stock predictions: nothing is certain.
 * Introduction **

Stock market indicators can be grouped into four major market forces: government, international transactions, speculation and expectation, and supply and demand. 3

As a group, we will analyze a data set containing stock market indicators drawn from the above four categories to test which indicators are the most accurate and effective predictors. In our analysis we determined that logistic regression and neural nets may be the most effective models to work with. We focused on neural nets because of their ability to implicitly detect complex nonlinear relationships between independent and dependent variables, as well as interactions between predictor variables. Logistic regression suits our data set and project because the final goal of many regression analyses is to produce a mathematical function, and our research suggests that many who claim they can “predict the stock market” base that claim on an algorithm or equation they created. For example, Renaissance Technologies was a hedge fund that became famous for using statistical models and predictors to determine which funds and stocks to invest in. The fund did fairly well for a number of years, but it discovered that even with PhDs in statistics it could not calculate the full risk of the market, and it ended up failing. Therefore, any model should be taken with a grain of salt, and a qualitative understanding of current markets should be used in conjunction with the statistical model.

It has become evident to us that an inordinate amount of time is spent trying to “game” the stock market and predict the future in hopes of getting rich. When it comes down to it, the only reason people enter and invest in the stock market is to make money. Period. Spending some time and energy on gaining a greater understanding of the stock market therefore has the potential to yield incredible results and is well worth the blood, sweat and tears. Based on multiple articles and research studies, the group will test each hypothesis on the accuracy of its predictions to determine which theory holds true as the best predictor of the market.
 * Purpose of Model**


 * Literature Review**

According to research that Sincere conducted by interviewing multiple stock traders and analyzing their strategies, he uses two broad sets of data: sentiment surveys and trading volume. Sentiment surveys include the Investors Intelligence Sentiment Survey and the Consumer Confidence Index. Trading volume includes the Arms Index, which tracks overbought or oversold stocks and can indicate when potential bubbles or troughs occur in the market, and the VIX, which measures volatility in the options market. Sincere assumes that investor psychology, along with trends in volatility and trading volumes, indicates market movements, and that macroeconomic factors do not necessarily do so.
 * 1.)** //Use These Market Indicators to Predict Stock Moves// by Michael Sincere 4

According to Ken Little, seven factors affect the stock market: inflation, interest rates, company earnings, oil and energy prices, war and terrorism, crime and fraud, and serious domestic political unrest. These include both short-term and long-term factors. Little also assumes that markets are efficient and that investor psychology does not play as big a role as Sincere seems to believe. In Little's view, uncertainty is the key driver of price drops.
 * 2.)** //Uncertainty Makes the Stock Market Crazy// by Ken Little 5

Morgan Stanley uses macroeconomics to determine conditions that will affect the market in upcoming months. These macroeconomic factors include real GDP, the Consumer Price Index, the unemployment rate, treasury yields, gold, oil, yen and euro exchange rates, and the VIX volatility index. These indices, according to Morgan Stanley, are accurate in analyzing the current economic health of the US. The report states that current market conditions are out of line with the bull market currently in effect; therefore it expects a drop in the stock market due to the mismatch between the data and current prices.
 * 3.)** //Investment Perspectives March 29, 2012 -// A Morgan Stanley Research Report 6

George Kester, a professor of finance at Martel, tests the age-old theory that the outcome of the Super Bowl is a predictor of the stock market. This theory states that if the winning team originates from the National Football League, the market will increase; if the winning team originates from the old American Football League, the market will decline. The psychology is that investors see a win by a team from the old American Football League as a sign that something is amiss in America and tend to make bearish trades. Kester has continued to research this subject (which began in the late 1960s) and shows a 91% success rate.
 * 4.)** //Super Bowl Stock Market Predictor Still a Winner// by George Kester 7

The group’s hypothesis is that each of the first three articles captures part of the best model for predicting market changes. Taken together, the articles touch on each of the four market forces mentioned in the introduction; therefore a mixture of predictors from the four categories of government, international transactions, speculation and expectation, and supply and demand will yield the most accurate model. While the research on Super Bowl winners as a predictor is compelling, it is the group’s prediction that this model is irrational and will not be sustained into the future. Also, the article mentions that the evolution of the National Football League (NFL) is making it harder to determine which historical league each team comes from, so analysts suspect that this theory will eventually break down.
 * Hypothesis**

Data was collected from Yahoo! Finance for index prices of the S&P 500, the NASDAQ Composite Index, and the Dow Jones Industrial Average. Each of these indices holds a diversified collection of stocks from multiple industries, which we assume gives a generalized representation of the overall US market. Yahoo! Finance was also used to pull data for the VIX index. All economic indicators are pulled from the Federal Reserve Bank of St. Louis’s “FRED Economic Data.” These data sets are broken down into 7 categories: money, banking, and finance; national accounts; population, employment and labor markets; production and business activity; prices; international data; and U.S. regional data. All data for the Morgan Stanley research article can be located on this site, as well as information on oil and energy prices and multiple interest rates. Inflation rate data is extracted from www.inflationdata.com, which gives monthly US inflation data from 1914 to 2012. The consumer confidence index is extracted from the “Understanding Dairy Markets” web site, which gives monthly data from 1977 to 2012. The Super Bowl outcome data can be retrieved from www.superbowlhistory.net. Other non-financial data, such as crime, terrorism, and fraud, will be drawn from the FBI crime report and the US census web sites.
 * Data Collection**

Data preparation will consist of discarding outliers and blank data rows, normalizing the data by transforming them into z-scores, and partitioning the data into training and testing data sets. In addition, each predictor needs to be lagged in one-month increments for 4 months, giving a total of 5 series per predictor (the unlagged values plus four lags). For the Super Bowl model, dummy variables will need to be prepared for the win/loss outcomes of the game.
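The preparation steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual workflow: the function names, the sample values, the chronological split, and the 60% training fraction used as a default are all assumptions made for the example.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values]


def add_lags(series, max_lag=4):
    """Build rows of (x_t, x_{t-1}, ..., x_{t-max_lag}).

    The first max_lag periods are dropped because they lack a full history.
    """
    return [
        tuple(series[t - k] for k in range(max_lag + 1))
        for t in range(max_lag, len(series))
    ]


def split_train_test(rows, train_fraction=0.6):
    """Assumed chronological split: earliest rows train, latest rows test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical monthly values for a single predictor (not project data).
monthly = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lagged = add_lags(z_scores(monthly))
train, test = split_train_test(lagged)
print(len(lagged), len(lagged[0]), len(train), len(test))
```

Each row of `lagged` holds the current value and its four lags, which corresponds to the five series per predictor described above. In practice one might compute the mean and standard deviation on the training partition only, so that the test partition does not influence the normalization.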
 * Data Preparation**


 * Procedure**

//Models 1-3// Three different targets will be selected for each model: the S&P 500, the Dow Jones Industrial Average, and the NASDAQ Composite Index, and each predictor will be tested against each target. Three models will be created: linear regression, neural net, and CHAID. In addition, in order to avoid multicollinearity, a PCA factor analysis will be done to attempt to extract the primary drivers of the stock market. Multiple models for each theory will be developed to attempt to improve accuracy and stability.
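Before running a PCA, it can help to see which predictors are nearly redundant with one another. The sketch below is not the PCA itself; it is a simpler pairwise Pearson correlation screen, offered as a rough diagnostic for multicollinearity. The 0.9 threshold, the function names, and the sample values are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def collinear_pairs(predictors, threshold=0.9):
    """List predictor pairs whose absolute correlation meets the threshold."""
    names = sorted(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical values; the series names echo the predictors listed in this report.
predictors = {
    "TB3MS": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TB6MS": [1.2, 2.1, 3.3, 4.2, 5.1],
    "VIX": [30.0, 12.0, 25.0, 14.0, 22.0],
}
flagged = collinear_pairs(predictors)
print(flagged)
```

In this made-up sample only the two treasury series are flagged, which is the kind of overlap a factor analysis would be expected to collapse into a single component.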

//Model 4// For this model, the group is primarily concerned with whether a win or a loss in the Super Bowl will lead to a bull or bear market outcome for the year. Therefore, each year of data will be assigned either a 1 for bull or a 0 for bear, and a 1 for win or a 0 for loss. Next, the group will partition the data and run Bayes Net, Logistic Regression, and decision tree models.
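The dummy coding for Model 4 might look like the following sketch. It assumes that a "win" means a team from the original NFL won and that a "bull" year is one with a non-negative annual return. Both definitions, the function names, and the sample years are assumptions for illustration and are not taken from the project data.

```python
from collections import Counter


def encode_year(nfl_team_won, annual_return):
    """Return (win, bull) dummies for one year.

    win  = 1 if an original-NFL team won the Super Bowl, else 0 (assumed coding)
    bull = 1 if the annual market return is non-negative, else 0 (assumed coding)
    """
    return (1 if nfl_team_won else 0, 1 if annual_return >= 0 else 0)


def cross_tab(pairs):
    """Count how often each (win, bull) combination occurs."""
    return Counter(pairs)


def hit_rate(pairs):
    """Share of years in which the Super Bowl dummy matches the market dummy."""
    return sum(1 for win, bull in pairs if win == bull) / len(pairs)


# Hypothetical years: (original-NFL team won?, annual market return).
years = [(True, 0.12), (False, -0.08), (True, -0.03), (True, 0.20), (False, 0.05)]
pairs = [encode_year(won, ret) for won, ret in years]
print(cross_tab(pairs))
print(hit_rate(pairs))
```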


 * Model Data Specifications**

//Model 1// Data was averaged on a monthly basis from 2003-2012. The predictor descriptions are as follows:
 * 1) __Investor Confidence Index__8: Data collected from State Street Global Markets that surveys investors on their prediction of how well the market will do in future months
 * 2) __National Financial Conditions Index__ (NFCI)9: A weekly index of US financial conditions, including money markets, debt and equity markets, and banking transactions
 * 3) __Adjusted National Financial Conditions Index__ (ANFCI)9: Similar to the NFCI; however, it only includes data sets that are uncorrelated with the market
 * 4) __Consumer Confidence Index__ (CCI)10: The CCI measures the degree of saving and spending by consumers in the US market
 * 5) __Volume of the Dow Jones Industrial Average__ (DIA)11: Volume of trades for the ETF index DIA

//Model 2:// Data was averaged on a quarterly basis from 1991-2012. The predictor descriptions are as follows:
 * 1) __Gas__10: Average United States gas price index
 * 2) __Terrorism__12: This is a sum of the number of people killed and injured per quarter
 * 3) __Crude Oil__12: The average price of crude oil per barrel
 * 4) __Inflation__13: Historic inflation rates by month
 * 5) __Electric and Gas Utilities__ (IPUTIL)10: National output of the US electric and gas industry
 * 6) __3-Month Treasury Bonds__ (TB3MS)10: Average monthly US treasury yields in percentage
 * 7) __6-Month Treasury Bonds__ (TB6MS)10: Average monthly US treasury yields in percentage
 * 8) __Federal Funds__ (FEDFUNDS)10: The average effective rate at which the US Federal Reserve lends money on a short-term basis
 * 9) __S&P Dividends__11: The total amount of dividends paid per stock of the SPY ETF
 * 10) __Crime Rate__13: The total amount of crime committed in the US per quarter as reported by the FBI

//Model 3:// Data was averaged on a quarterly basis from 1999-2010. The predictor descriptions are as follows:
 * 1) __Volatility Index__ (VIX)11: The S&P 500 measure of volatility in the market
 * 2) __Euro/US Exchange Rate__ (EXUSEU)10: The average currency exchange between the US dollar and the European Euro
 * 3) __Japanese Interest Rates__ (JPNINT)10: An average composite of Japanese government bond rates
 * 4) __Gross Domestic Product__ (GDPC1-M)10: A quarterly measure of the output for the entire US economy
 * 5) __Consumer Price Index__ (CPI)10: An index of average consumer prices from multiple US industries
 * 6) __Unemployment__ (IC4WSA)10: The US national average unemployment rate
 * 7) __High Yield Bond Rates__ (WSLB20)10: A composite of US high risk corporate debt yields
 * 8) __3-Month Treasury Bonds__ (TB3MS)10: Average monthly US treasury yields in percentage
 * 9) __Gold__13: The average spot rate for US gold futures
 * 10) __Crude Oil__12: The average price of crude oil per barrel

//Model 4:// Daily S&P market prices were annualized from 1967 to 2011. The wins and losses for NFL Super Bowls from 1967 to 2011 were classified in dummy variables.


 * Model Building**

//Model 3: Actual// The data was prepared by normalizing it and then partitioning it into 60% training and 40% testing sets. Three models were built: Neural Net, Linear Regression, and CHAID Classification Tree. They were then analyzed using the “Analysis” node and a Lift chart.

//Model 1: Actual// The same process was used as for model 3; however, no CHAID node was built due to the poor performance of that model.

//Model 2: Actual// The same process was used as for model 3; however, due to missing data in the “Terrorism” data set, a filler node was used to replace missing values with zero.

//Model 1: Gains/Losses// In order to test whether the predictors in each model were strong predictors of market gains and losses, the S&P index was converted into dummy variables. A value of 1 was given to quarters that yielded positive or zero average returns, and a value of 0 was given to quarters that yielded average losses. The data was not normalized, as normalization was not needed for this type of analysis. Two models were built: C5.0 Classification Tree and Logistic Regression. They were then evaluated using the “Analysis” node, a Lift chart, and a Confusion Matrix.

//Model 2: Gains/Losses// The same process was used as for model 1; however, only a C5.0 Classification Tree was built, because this classification model produced the more accurate results of the two models.

//Model 3: Gains/Losses// The same process was used as for model 1; however, only a C5.0 Classification Tree was built, because it produced the more accurate results of the two models.
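The gains/losses conversion described for the classification models can be sketched as follows. The sketch assumes that a period's return is the percentage change in the average index level from the previous period. That definition, the function name, and the sample levels are assumptions rather than the project's documented method.

```python
def gain_loss_labels(levels):
    """Label each period 1 if its return is positive or zero, else 0.

    The return is assumed to be the change from the previous period's level,
    so the first period has no label.
    """
    labels = []
    for previous, current in zip(levels, levels[1:]):
        period_return = (current - previous) / previous
        labels.append(1 if period_return >= 0 else 0)
    return labels


# Hypothetical quarterly average index levels (not project data).
levels = [1000.0, 1050.0, 1050.0, 990.0, 1010.0]
labels = gain_loss_labels(levels)
print(labels)
```

A flat quarter is labeled 1, matching the rule that positive or zero returns count as gains.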
 * Model Analysis Results**

//Model 3: Actual// Of the three models, the Neural Net produced the greatest accuracy and lift and the smallest amount of error. The model was set to boosting to improve accuracy. Bagging was tried as well; however, it affected neither the error nor the lift. The model produced 99.8% accuracy for the ensemble and 99.2% for the reference model. The top five predictors in order of importance were GDP, high yield bonds, Japanese interest rates, consumer price index, and crude oil. A stable lift of 2 was produced, showing fairly accurate predictive power. The mean error for the testing partition is 39.613, with a mean absolute error of 64.126 and a standard deviation of 73.667.

//Model 2: Actual// As in model 3, the Neural Net node with boosting was used. The model produced 99.6% accuracy for the ensemble and 99.0% for the reference model. The top five predictors in order of importance were crime, S&P dividends, electric and gas utilities, gas, and crude oil. A stable lift of 1.4 was produced, showing less predictive power than model 3. The mean error for the testing partition is 9.873, with a mean absolute error of 54.605 and a standard deviation of 67.394. Model 2 performs better than model 3 with respect to error and volatility.

//Model 1: Actual// As in model 3, the Neural Net node with boosting was used. The model produced 90.4% accuracy for the ensemble and 79.0% for the reference model. The top five predictors in order of importance were consumer confidence, NFCI, DIA volume, ANFCI, and investor confidence. An unstable lift of 1.6 was produced, showing less predictive power than models 2 and 3. The mean error for the testing partition is 12.249, with a mean absolute error of 104.043 and a standard deviation of 119.7. Model 1 performs poorly in accuracy, error, and volatility.
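The three error figures quoted for each of the "Actual" models (mean error, mean absolute error, and standard deviation) can be computed as in the sketch below. The sign convention (predicted minus actual), the use of the sample standard deviation, and the sample values are assumptions; the modeling tool used in the project may define them differently.

```python
from statistics import mean, stdev


def error_summary(actual, predicted):
    """Summarize prediction errors on a partition.

    Errors are taken as predicted minus actual (an assumed sign convention).
    """
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "mean_error": mean(errors),                       # signed bias
        "mean_abs_error": mean([abs(e) for e in errors]),  # typical miss size
        "std_dev": stdev(errors),                          # spread of the errors
    }


# Hypothetical index levels and predictions (not project data).
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 125.0, 129.0]
summary = error_summary(actual, predicted)
print(summary)
```

A small mean error alongside a large mean absolute error, as reported for Model 1, indicates that large over- and under-predictions are cancelling out rather than that the predictions are close.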

//Model 3: Gains/Losses// Of the two models, the C5.0 tree produced the greatest accuracy and lift and the smallest amount of error. The model was set to boosting to improve accuracy. Bagging was tried as well; however, it affected neither the error nor the lift. For rule #1 (87.5% accuracy), the model split the data on Japanese interest rates, the 3-month treasury bill, and the US/Euro exchange rate. These predictors are to be expected, because all predictors that are not interest rates or exchange rates follow an upward linear trend and are therefore not useful for future analysis. The results indicate that any time Japanese interest rates are greater than 5.12%, the market sees losses, based on three historical events. In addition, when Japanese interest rates are equal to or above 5.12%, 3-month treasury bills are equal to or below 5.043%, and the US/Euro exchange rate is above .877, there is an 84% chance the market will see gains. In other words, when international and US debt rates are low but the Euro is depreciating, the market does well. In normal markets, low interest rates mean that cash is cheap and foreign goods are cheap, which entices companies to spend rather than save. This is reflected positively in the market. Rule #8 (96.62% accuracy) shows that when 3-month bills drop below 5.043% while Japanese interest rates are equal to or above 5.12% in the second lag quarter, and Japanese interest rates then drop below 1.18% in the next quarter, the market will see gains. This is most likely due to stimulus from the US and Japanese central banks enticing spending from investors. An unstable lift of 2 was produced, showing good but volatile predictions. The accuracy on the testing partition was only 52.63%. The confusion matrix shows an accuracy of around 80%; however, this includes both the training and testing partitions.

//Model 2: Gains/Losses// Since this model contained many more interest rate predictors, it achieved an accuracy of 95.45% in rule #1. An interesting rule produced by this model was that when 3-month treasury bill rates are above 4.54% and 6-month treasury bill rates are equal to or below 5.69%, there is a 93.75% chance the market will see gains in the next quarter. This is most likely because a low, flat yield curve shows stability and less risk in the short term, enticing investors to invest in the stock market. Likewise, when the 3-month and 6-month treasury bill rates diverged, the market saw losses, reflecting more uncertainty in the market. An unstable lift of 2 was produced in the 40th percentile, showing poor predictive power. The accuracy on the testing partition was 67.65%, which is better than model 3 but still not as high as we had hoped. The confusion matrix shows an accuracy of around 86%; however, this includes both the training and testing partitions.
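Because the confusion-matrix accuracy pools the training and testing partitions, it can look better than the out-of-sample accuracy alone. The sketch below shows how the two might be separated; the function names and the labels are hypothetical and chosen only to illustrate the effect.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 gain-loss labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of correctly classified periods."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels: the model fits the training data perfectly
# but is right only half the time on the testing data.
train_actual, train_pred = [1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0]
test_actual, test_pred = [1, 0, 1, 0], [1, 1, 0, 0]

train_acc = accuracy(confusion_matrix(train_actual, train_pred))
test_acc = accuracy(confusion_matrix(test_actual, test_pred))
pooled_acc = accuracy(
    confusion_matrix(train_actual + test_actual, train_pred + test_pred)
)
print(train_acc, test_acc, pooled_acc)
```

With these made-up labels the pooled figure sits well above the testing figure, which is why the testing-partition accuracy is the more honest measure of predictive power.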

//Model 1: Gains/Losses// Due to its poor performance and lack of strong predictors, the model was unable to run successfully and was dismissed.

//Model 4: Super Bowl Predictor// Due to the simplicity of the model, a statistical program was not needed to analyze its accuracy. Instead, three different strategies were simulated to see which would do best. The benchmark strategy is to go long the S&P 500 from 1967 to 2011, which gives a total annualized return of 542.76%. In other words, this is an investment in the S&P 500 in which the initial investment is never touched. The second strategy follows the Super Bowl predictor: if it predicts a gain, we invest in the S&P; if it predicts a loss, we short the S&P, betting against it and profiting from any subsequent losses. This strategy underperformed the market with a return of 424.04%. Finally, the third strategy is to buy S&P stock when the predictor predicts gains, and to sell the stock and hold the proceeds in a cash account when it predicts losses. This strategy outperformed the market with a return of 803.02%, roughly 1.5 times the benchmark.
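The three strategies can be compared with a short simulation such as the sketch below. It assumes that annual returns are compounded, that a short position earns the negative of the market return, and that cash earns no interest. The report does not state these details, and the sample returns and signals are hypothetical.

```python
def compound(returns):
    """Total return from compounding a sequence of period returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def buy_and_hold(market):
    """Benchmark: stay long the index every year."""
    return compound(market)


def long_short(market, signals):
    """Long when the predictor signals a gain (1), short when it signals a loss (0)."""
    return compound(r if s == 1 else -r for r, s in zip(market, signals))


def long_cash(market, signals, cash_rate=0.0):
    """Long on a gain signal, otherwise hold cash at an assumed cash_rate."""
    return compound(r if s == 1 else cash_rate for r, s in zip(market, signals))


# Hypothetical annual returns and Super Bowl signals (not project data).
market = [0.10, -0.20, 0.05]
signals = [1, 0, 1]
print(buy_and_hold(market))
print(long_short(market, signals))
print(long_cash(market, signals))
```

When a signal is wrong, the long/short strategy loses what the market gained, while the long/cash strategy only misses the gain. That asymmetry is one possible reason the cash variant came out ahead in the results above.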


 * Project Conclusions**

For the models built with neural net analysis, accuracy was surprisingly high; however, the error and volatility are too great for them to be good predictors of stock market prices. In stock trading, accuracy and timing are essential, and an error of $20 can produce huge losses in a portfolio. With the classification tree, interest rates and exchange rates were decent predictors of stock market movements and fell in line with historical economic logic. In addition, the strategy from the best model showed a higher probability of success than of failure. However, the classification tree could not calculate the magnitude of the losses and gains; therefore it is a very risky strategy. An investor could see gains more often than losses, yet the magnitude of the losses could still produce negative total returns. In conclusion, these models are not sufficient by themselves to predict the stock market. However, they would be useful tools in conjunction with other qualitative and quantitative measures. A new model was also built from the best predictors: all predictors from model 3 as well as the crime index from model 2. However, it yielded lower accuracy and higher volatility, so we did not include it in this report.


 * Resources**

1.) http://articles.marketwatch.com/2011-02-21/investing/30765000_1_sentiment-surveys-stock-market-traders
2.) http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi
3.) http://stocks.about.com/od/whatmovesthemarket/a/Whatmovesmarket.htm
4.) []
5.) []
6.) //Investment Perspectives//, Morgan Stanley & Co LLC, March 29, 2012
7.) []
8.) []
9.) []
10.) []
11.) []
12.) []
13.) http:/www.gold.org/investment/statistics/gold_price_chart/