Bear Dancing

=DATA MINING PROJECT PROPOSAL= ---Beau Bailey, Bryan Davis, Kyle Hill, Sajal Maheshwari, Tyler Robinson

INTRODUCTION
 * The intended project audience includes members from both sides of housing sales: prospective homebuyers, realtors, and mortgage lenders. Homeowners may use the results of our study to judge whether their house has been taxed at a fair rate; conversely, appraisers could use this data as a benchmark when preparing an appraisal. City planners and public finance officials may also be interested in the study as a way to gauge how policy may affect tax revenues.
 * The purpose of our study is to examine housing prices in Polk County, OR. Specifically, we will apply regression, neural network, and clustering models to determine how the attributes of a house affect its price, with the aim of predicting the price of a house from its attributes.

The project will be presented on May 13.

We will use housing data from Polk County, with variables including sales price, property type, square footage, house amenities (number of bedrooms and bathrooms), year built, sales area, zoning, and price per square foot. We will use this data to build regression, neural network, and clustering models, specifically linear regression, neural networks, a priori clustering, and k-means. Each model will be tuned on our data, and the models will then be compared against each other using results from data that was not used in training (the validation and test sets).
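The evaluation plan described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual workflow (the models themselves will be built in the tools listed under Equipment). The records are synthetic, and the 60/20/20 split, the helper names (`split_records`, `fit_simple_ols`, `rmse`), and the single-predictor model are all assumptions made for the example.

```python
import math
import random


def split_records(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and cut them into training, validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic (square footage, sales price) pairs, invented purely for illustration.
rng = random.Random(42)
records = []
for _ in range(500):
    sqft = rng.uniform(800, 3500)
    records.append((sqft, 40000 + 95 * sqft + rng.gauss(0, 15000)))

train, valid, test = split_records(records)

# Fit on the training set only.
intercept, slope = fit_simple_ols([r[0] for r in train], [r[1] for r in train])
mean_train_price = sum(r[1] for r in train) / len(train)  # naive baseline

# Score the fitted model and the baseline on data the model never saw.
results = {}
for name, part in (("validation", valid), ("test", test)):
    actual = [r[1] for r in part]
    model_error = rmse(actual, [intercept + slope * r[0] for r in part])
    baseline_error = rmse(actual, [mean_train_price] * len(actual))
    results[name] = (model_error, baseline_error)
    print(f"{name}: model RMSE {model_error:,.0f} vs. baseline RMSE {baseline_error:,.0f}")
```

Scoring every candidate model on the same held-out sets is what makes the comparison between models fair; the test set should be consulted only once, after tuning against the validation set is finished.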

LITERATURE REVIEW

 * Previous research has consisted of hedonic regression to predict real estate prices from variables such as zoning, population density, and wetland areas (Adams, Mahan, and Polasky, 2000). In addition, neural network models have been developed (Gan, Lee, and Limsombunchai, 2004). Further, more effective clustering techniques and better regression models are needed to reduce the effects of multicollinearity (Gilley and Pace, 2001), which has been an issue in past analyses of this type.

The average person can go to websites such as portlandmaps.com and zillow.com and obtain real estate prices based primarily on sales area and comparisons to houses with similar characteristics. Real estate professionals generally have access to more extensive data on a particular property, which they use to prepare a professional appraisal. For the average user, obtaining such an appraisal is both expensive and time consuming. Comparing our model with these websites will give us a clearer idea of its effectiveness and usability. Today, regression models tend to be the state of the art, while neural network modeling may be the next phase in predictive modeling for this type of analysis. Our approach is to build on the techniques that have already been attempted and to apply some new ones. We will depart from the traditional techniques by using regression with binned variables and clustering on both geographic and regular (non-geographic) data.
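As a rough illustration of what regression with binned variables could look like, the sketch below converts square footage into indicator (dummy) variables that a regression model could take as inputs. The bin edges and the helper names (`SQFT_EDGES`, `bin_index`, `one_hot`, `binned_sqft_features`) are hypothetical; the actual bins would be chosen after inspecting the cleaned data.

```python
from bisect import bisect_right

# Hypothetical bin edges in square feet; real edges would come from the data.
SQFT_EDGES = [1000, 1500, 2000, 3000]


def bin_index(value, edges):
    """Return the bin number: 0 below edges[0], len(edges) at or above edges[-1]."""
    return bisect_right(edges, value)


def one_hot(index, n_bins, drop_first=True):
    """Encode a bin number as 0/1 indicators, optionally dropping the first bin."""
    indicators = [1 if index == i else 0 for i in range(n_bins)]
    return indicators[1:] if drop_first else indicators


def binned_sqft_features(sqft):
    """Indicator variables for the square-footage bin of one house."""
    n_bins = len(SQFT_EDGES) + 1
    return one_hot(bin_index(sqft, SQFT_EDGES), n_bins)


for sqft in (850, 1750, 3000):
    print(sqft, binned_sqft_features(sqft))
```

The first bin is dropped so that the indicators do not sum to a constant, which would make them perfectly collinear with the regression intercept. Houses in the smallest bin are then represented by all zeros and serve as the reference group.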

A complete bibliography of all sources will be provided with the final report.

HYPOTHESIS
We hypothesize that in Polk County, home value will be most affected by variables such as square footage, acreage, and property type. We assume that a bigger house will be more expensive than a smaller one and that a house in an area zoned for farming will be less expensive per acre than a house in an area zoned residential.

PROCEDURE

 * Tasks (following the CRISP data mining process from the class slides)
   * 1) Gather the data
     * The data was gathered for 2001-2005 from the Polk County Clerk's office. -- Completed
     * Data for the recent housing bust is not readily available, and houses may be priced under distress, so predicting housing prices now would be difficult. Some factors, such as fear of price deflation and the effect of foreclosures, are hard to model.
   * 2) Clean the data
     * Handle incomplete records
     * Interpret the different variables
     * Remove variables that will not be used
     * Normalize or otherwise transform the data
   * 3) Consult experts
     * Mike Hand
     * James Frew
   * 4) Run the models
     * Linear regression
     * Neural networks
     * Clustering
   * 5) Analyze the results on the training and testing datasets
     * Lift charts
     * Decile charts
     * Confusion matrix
   * 6) Report the findings
 * Equipment
   * 1) Clementine
   * 2) XL Miner
 * Data
   * 1) 5 years of sales data for Polk County (10,214 records)
   * 2) We will be modeling the sales price (adjusted for inflation)
   * 3) Our independent variables are: square footage, number of bedrooms, number of bathrooms, year built, sales area, number of half bathrooms, property type, and zoning.
   * 4) This information should be presented in a table along with summary statistics for each attribute from the cleaned data (min, q1, median, q3, max, mean, stdev)
   * 5) Ask Mike Hand for his perspective and work done to date
   * 6) Ask James Frew, an economics professor who has studied home prices, for feedback and other suggestions.
 * Previous Work
   * 1) Hedonic regression models, neural networks, multiple regression, and quantile regression; more information is in the literature review.
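The summary statistics called for in the Data section (min, q1, median, q3, max, mean, stdev) could be computed along the following lines. This is a minimal sketch using Python's standard `statistics` module: the sample values are invented, the `summarize` helper is hypothetical, and the "inclusive" quartile method is an assumption, so Clementine or XL Miner may report slightly different quartiles.

```python
import statistics


def summarize(values):
    """Return min, quartiles, max, mean, and sample standard deviation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


# Invented square-footage values standing in for one cleaned attribute.
sample_sqft = [1000, 1500, 2000, 2500, 3000]
summary = summarize(sample_sqft)

# One row of the summary table for this attribute.
for statistic, value in summary.items():
    print(f"{statistic:>6}: {value:,.2f}")
```

Running `summarize` once per numeric attribute and stacking the results would give one table row per attribute, which matches the layout described in the Data section.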

 * April 17: Literature Review, done by everyone (each member was responsible for identifying 2 outside sources)
 * April 17-21: Research Proposal, done by everyone
 * April 22: Data Cleaning & Filtering, done by Beau, Bryan, and Kyle
 * April 27-30: Consult Experts, done by Sajal and Tyler
 * May 3-4: Run the Models, done by everyone
 * May 6-10: Model & Data Analysis, done by everyone (each team member takes the results from one model and brings his/her analysis back to the team)
 * May 10: Draft final report, done by Bryan, Beau, and Sajal
 * May 12: Revise final report, done by Kyle and Tyler
 * May 13: Presentation, done by Bryan, Kyle, and Sajal

We do not expect to incur any costs during the course of this project.