Bear+Dancing+Report


=COMPARING MODELING TECHNIQUES USING HOME SALE PRICES=

**EXECUTIVE SUMMARY**
The following report details a statistical modeling process for predicting property values. Using a substantial amount of data provided by the Polk County, Oregon Tax Assessment Office, the project team began by establishing a business case to ensure the methodology could be meaningfully employed by a variety of users. Initial models were developed from a pared-down dataset in an effort to concentrate on the most influential variables for determining the value of residential properties only. Based on results from statistical modeling software, the analysis revealed a selection of effective models for predicting property values. RMS error was used as the primary measure of model accuracy. In examining the results, there were some surprises in terms of variable importance and the effect of coefficients in the regression model, but otherwise the models produced close to expected results. A brief discussion of known and potential errors and possible project extensions concludes the analysis.

BUSINESS PROBLEM - PREDICTING HOME SALES

In the project proposal, the team set out to examine housing prices in Polk County, OR and attempt to develop a model that would assist members from both sides of housing sales (prospective homebuyers, realtors and mortgage lenders) in predicting the sales price of a home. Homeowners may use the results of our study to determine if their house has been taxed at a fair rate, and conversely, appraisers could use this data as a benchmark when they prepare their appraisal. Additionally, city planners and public finance officials may be interested in the study to determine how policy may affect tax revenues.

We hypothesize that in Polk County, home prices will be most affected by variables such as square footage, acreage, and property type. We assume that a bigger house will be more expensive than a smaller one and that a house in an area zoned for farming will be less expensive per acre than a house zoned in a residential area.

The team's previous experience with home sales models focused exclusively on regression models. In reviewing and further exploring the literature surrounding this topic (for more information on the literature review and references, please refer to the project proposal), the team discovered that other modeling techniques, specifically neural networks, have been applied to the area of home sales, and that these models may offer superior performance compared to regression models. Using our Polk County data (which includes sales price, property type, square footage, number of bedrooms, number of bathrooms, year built, sales area, zoning, and price per square foot), the team set out to test the performance of various models. Models under consideration were regression and neural networks, in addition to generalized linear regression, KNN, CHAID, CART, and SVM. Though regression and neural networks were the focus of the project, these additional models were included to give a broader view of how different techniques performed in modeling the data.

DATA PREPROCESSING
The dataset collected for Polk County needed to be extensively cleaned before the initial model building could begin. The information included transaction data for several different types of zones: residential, commercial, industrial, rural, farm, timber, etc. As the focus of the project was on residential prices, the other zones needed to be removed from the dataset. Unfortunately, the lines were not as clearly drawn as one could hope. In addition to various multi-zoned properties, some zones were most likely what one would consider residential but were not classified as such. This latter grouping was primarily composed of rural and farm areas, which are zoned differently. To compound matters, Polk County uses a mixture of zoning systems, with some towns and municipalities using their own coding system. This required careful examination of the data to sort through the 100+ zones and pull the relevant records. For reference, attached at the end of this report are documents that outline descriptions and possible values for Planning Zones, RMV Class, Stat Class, and Study Areas.

MODEL DEVELOPMENT
After the raw data was preprocessed and cleaned, the number of records available for modeling purposes decreased from approximately 10,000 to approximately 7,000. However, all original variable groups from the raw data were retained in the cleaned data. Descriptions of the variable groups for the cleaned data are as follows:
 * **MA:** Metropolitan Area, e.g. Dallas, Willamina, Monmouth
 * **SA:** Study Area represents neighborhood types, e.g., more desirable neighborhoods, neighborhoods with low cost housing, neighborhoods in unincorporated areas of Polk County
 * **NH:** New Home, Yes or No indicated by a binary variable
 * **RMV Class**: Represents property class codes for the State of Oregon, e.g., residential condominium, industrial vacant, multi-family potential development improved
 * **Map #**: This is assumed to be some type of locator number corresponding to a map used for tax assessment purposes in Polk County
 * **Book**: Assumed that this number represents the year in which the property was sold
 * **Page**: Assumed that this refers to a page number of a volume on file with Polk County in which the specific data on the property is found
 * **Situs Address**: Address of the property
 * **Land Size**: Size of land in acres
 * **Stat Class**: Refers to the structure type of the property; e.g. residential two-story, multi-family duplex, manufactured structure single-wide
 * **Year Built**
 * **Effective Year Built**
 * **Living Area**: Square footage of living area
 * **Bed**: Number of bedrooms
 * **Full Bath**: Number of full bathrooms
 * **Half Bath**: Number of half bathrooms
 * **Adjusted Sale Price**: Final sale price of property
 * **$ per Square ft**
 * **Sales Date**
 * **Zone**: Property zoning type; e.g. industrial general, central business district, residential medium density

**Initial Filtering**
From the above variables, the following were filtered __out__ of the cleaned dataset, for the reasons noted, based upon the team's understanding of the different factors that should logically affect the price of a home:
 * **MA:** Zoning and RMV variables are considered more specific predictors for location
 * **NH:** Was not considered important
 * **Map #:** Does not have any effect on property value
 * **Book:** Data in the effective date sold variable lists the year the property was sold
 * **Page:** Does not have any effect on property value
 * **Address:** Due to the number of records in the data, it was beyond the ability of the group to properly integrate this variable into the model in an effective way
 * **Year Built**: Effective year built was used instead
 * **$ per Square ft**: Adjusted sale price was a more favorable response variable for indicating property value
 * **Sales Date**: The group is relying solely on effective year built as a predictor variable representing the age of the property

**Secondary Filtering**
After the team manually filtered out some of the variables in the data, all of the remaining numeric variables were set to continuous/range variables and the 4 remaining categorical variables were set as set variables. The categorical variables were then converted to flag variables, coded in binary form. A filter was then used to remove the original text-based categorical variables, as they were now represented in binary form. The data set grew after converting the 4 categorical variables into binary flags. The number of flags for each category is listed below. Counting the numeric range variables, the data set included a total of 184 predictors.

__Remaining binary variables__
 * **Zone:** 92 variables
 * **Stat class:** 41 variables
 * **SA:** 25 variables
 * **RMV class:** 20 variables

__Remaining range/continuous variables__
 * **Land size**
 * **Effective year built**
 * **Living area**
 * **Bedrooms**
 * **Full bathrooms**
 * **Half bathrooms**
 * **Adjusted sales price (target variable)**
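The flag conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the PASW node the team used; the function name `to_flags`, the field names, and the zone code `FARM` are assumptions made up for the example (only `RS` appears in the report).

```python
def to_flags(records, field):
    """Replace one categorical field with a 0/1 flag per distinct value."""
    values = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        # Keep every other field, drop the original text-based field.
        row = {k: v for k, v in r.items() if k != field}
        for v in values:
            row[f"{field}_{v}"] = 1 if r[field] == v else 0
        encoded.append(row)
    return encoded


# Hypothetical records; the real data had 92 zones, 41 stat classes, etc.
homes = [
    {"living_area": 1500, "zone": "RS"},
    {"living_area": 2200, "zone": "FARM"},
    {"living_area": 1800, "zone": "RS"},
]
flagged = to_flags(homes, "zone")
print(flagged[0])
```

Each distinct category value becomes its own column, which is why four categorical variables expanded into 178 flags in the team's data set.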

**Feature Select**
To pare down the number of predictors in the data set, a feature select node was employed in an effort to determine the relative importance of each variable in predicting adjusted sale price. The feature select node yielded fourteen variables that have a statistically significant influence on the adjusted sales price of Polk County properties. Based on the probability measures generated from the feature select node (this numeric measure is based on the Pearson p-value, where statistical importance is greater than or equal to 95%), the following fourteen variables were included in the different models considered.
 * **Land Size**
 * **Effective year built**
 * **Living area**
 * **Bed**
 * **Full Bath**
 * **Half Bath**
 * **Zone RS (single family residential)**
 * **SA 2 (core city neighborhoods)**
 * **SA 3 (transitional city neighborhoods)**
 * **SA 16 (properties built after 1990)**
 * **RMV Class 101 (use residential improved)**
 * **Stat Class 131 (pertains to one-story residences in a specific neighborhood)**
 * **Stat Class 141 (pertains to one-story residences in a specific neighborhood)**
 * **Stat Class 143 (pertains to two-story residences in a specific neighborhood)**

An initial interpretation of the results from the feature select node would be that the continuous variables (land size through half bath) are universally applicable when predicting home values, i.e., these could be thought of as standard measures of a property's value regardless of location. The remaining categorical variables (Zone RS through Stat Class 143), however, can be assumed to have some sort of effect specific to the region.
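As a rough illustration of screening predictors by their Pearson relationship to the target, the sketch below ranks columns by the absolute value of their correlation with price. It is a stand-in, not the feature select node itself: for a fixed number of records the Pearson p-value shrinks as the absolute correlation grows, so the ordering is the same, but the 95% cutoff is not reproduced here. The function names and the sample numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant column (sd of zero) would raise ZeroDivisionError.
    return cov / (sd_x * sd_y)


def rank_predictors(columns, target):
    """Return predictor names ordered from strongest to weakest |r|."""
    scores = {name: abs(pearson_r(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up values, price in thousands of dollars.
columns = {
    "living_area": [1000, 1500, 2000, 2500],
    "bedrooms": [3, 2, 4, 3],
}
price = [100, 150, 200, 250]
ranking = rank_predictors(columns, price)
print(ranking)
```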

Finally, as the last step prior to developing each model, the data was partitioned into training (60%) and testing (40%) partitions. For purposes of comparing the performance of the models, the Auto Numeric node was used in PASW to develop all of the models concurrently, using the same variables and model settings. Using the Auto Numeric node was also immensely helpful in that the creation of the 7 different models was almost completely automated. As different models were identified as useful tools to predict adjusted sales price, those models (regression, KNN, and neural networks) were built independently to analyze them in greater detail.
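A 60/40 split like the one described above can be sketched as follows. This is an assumption-laden illustration rather than the PASW partition node: it shuffles the records once with a fixed seed and cuts at exactly 60%, whereas a partition node may assign each record at random and produce only approximately those proportions. The name `partition` and the seed value are hypothetical.

```python
import random


def partition(records, train_fraction=0.6, seed=42):
    """Shuffle a copy of the records and split into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Stand-in record ids; the cleaned data set had roughly 7,000 records.
train, test = partition(range(10))
print(len(train), len(test))
```

Holding out the testing partition matters because every model is scored on records it did not see during training.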

RESULTS
A common metric was needed to analyze the performance of each model. Model performance was evaluated by comparing the predicted adjusted sales prices against the known adjusted sales prices in the data set. The team calculated the root mean squared error (RMS) for each model, using Derive and Aggregate nodes in PASW. RMS is calculated by squaring the difference of each pair of predicted and actual adjusted home sale prices (the error of the predictions), summing all of the squared errors, dividing by the number of adjusted home sale prices, and then taking the square root of the quotient. Written out, with n as the number of adjusted home sale prices:

RMS = sqrt( (1/n) x sum over i of (predicted_i - actual_i)^2 )

In general, comparing model performance based upon RMS error boils down to choosing the model which has the lowest RMS, since that model has the lowest overall error in predicting adjusted home sale prices compared to the actual adjusted home sale prices in the data set. The performance of each model in terms of RMS is shown in the table below:

The models are ranked according to RMS error, from the lowest to the highest. As shown in the table above, the KNN model was the highest-performing (having the lowest RMS error), and the SVM model was the worst-performing model, having the highest RMS error. The regression model, the long-hailed choice of professionals, finished a close second to the KNN model.
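The RMS calculation described in this section can be written as a short function. This is a minimal sketch of the same arithmetic the Derive and Aggregate nodes would perform; the function name and the three sample prices are made up for illustration and are not taken from the Polk County data.

```python
import math


def rms_error(predicted, actual):
    """Root mean squared error between predicted and actual sale prices."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions and actual adjusted sale prices, in dollars.
predicted = [210_000, 180_000, 305_000]
actual = [200_000, 190_000, 300_000]
print(round(rms_error(predicted, actual), 2))
```

Because the errors are squared before averaging, a few large misses raise the RMS more than many small ones, which is worth keeping in mind when comparing models on data with extreme outliers.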

DISCUSSION OF RESULTS
In general, our hypothesis was correct in that home prices in Polk County are most affected by variables such as square footage, acreage, and property type, and that a bigger house will be more expensive than a smaller one and that a house in an area zoned for farming will be less expensive per acre than a house zoned in a residential area. In terms of the models used, however, the team did uncover some unexpected results in terms of which model performed best and in how the different variables used to build the models affected home prices.

The graphs shown below depict the actual adjusted home sale prices on the x-axis and the predicted adjusted home sale prices on the y-axis for the top 4 best performing models. The results for Regression and Generalized Linear were identical (using main effects and two-way interaction made the model worse so only main effects was used), so really, only the top 3 best performing models are shown. For reference, each graph has a straight line that represents a perfect prediction (actual and predicted are the same, hence the x,y pair would form a 45-degree line). As can be seen in the graphs below, the more closely grouped the data points to the reference line, the better-performing the model.

The team's original goal was to examine the performance of several modeling techniques in modeling adjusted home sale prices. The team's a priori expectations, based upon experience using regression models and information discovered on neural networks during the literature review, were that a regression model and a neural network would be best able to model and predict adjusted home sale prices. In expanding our project to include similar models, the team discovered that both of these models were outperformed by a k-nearest neighbor model, as measured by lowest RMS. The differences between the respective RMS values for each model are slight, yet still, the KNN model performed best.

In thinking about how the KNN model works, it is perhaps not that surprising that this model performed as well as it did. The KNN model groups homes of similar characteristics together; as new homes are introduced to the model, the closest existing homes determine the price of the new home. When we examine how realtors and appraisers determine a home's value, the method employed by real estate professionals is very similar to the KNN approach. As a home comes up for sale, the final selling price is determined by the market, which can be approximated by looking at the selling prices of nearby homes of similar characteristics.
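The "closest existing homes determine the price" idea can be sketched as a tiny k-nearest-neighbor regressor. This is a simplified illustration under stated assumptions, not the PASW KNN node: it uses only two features (living area and land size), scales each feature by its range, measures Euclidean distance, and averages the prices of the k nearest homes. The function name and all sample homes are hypothetical.

```python
import math


def knn_predict(train, query, k=3):
    """Predict a price as the mean price of the k most similar homes.

    train: list of (features, price) pairs; query: a feature tuple.
    """
    n_features = len(query)
    lows = [min(f[i] for f, _ in train) for i in range(n_features)]
    highs = [max(f[i] for f, _ in train) for i in range(n_features)]
    # Scale by range so square feet do not swamp acres; avoid dividing by 0.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]

    def distance(a, b):
        return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, spans)))

    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    return sum(price for _, price in nearest) / len(nearest)


# Made-up homes: (living area in sq ft, land size in acres) -> sale price.
train_homes = [
    ((1200, 0.15), 150_000),
    ((1500, 0.20), 185_000),
    ((1600, 0.25), 195_000),
    ((2800, 1.00), 340_000),
    ((3000, 2.00), 390_000),
]
estimate = knn_predict(train_homes, (1550, 0.22), k=3)
print(round(estimate, 2))
```

The query home is priced from its three closest comparables, much as an appraiser would pull comparable sales.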

The regression model came up short in our assessment compared to the KNN model, in terms of minimizing RMS error. However, while the KNN model outperforms the regression model, it is not as accessible and not as readily understood nor accepted as the regression model. Further, the KNN model is not as easy to deploy and make predictions with as the regression model. For these reasons, though the KNN model has the greatest predictive accuracy, the team's determination is that the regression model remains the choice of professionals (accept no substitutes) in predicting home sale prices. The RMS error has already been reported above, but to use another measure of accuracy, the R-squared value for the regression model approaches 50%.
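For readers who want to reproduce the second accuracy measure, R-squared can be computed from the same predicted and actual prices used for RMS. The sketch below uses the standard definition (one minus the ratio of residual to total sum of squares); whether PASW reports exactly this figure, and on which partition, is an assumption. The sample numbers are made up.

```python
def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical prices in thousands of dollars.
actual_prices = [100, 200, 300]
predicted_prices = [150, 200, 250]
print(r_squared(predicted_prices, actual_prices))
```

A value near 0.5 means the model accounts for about half of the variation in sale prices around their mean.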

**Relationship Identification, Model, and Focus**
Turning to the team's model of choice, the regression model reveals that fourteen variables are the main drivers of the sales price, as seen below in the regression coefficients and in the graph of variable importance. Three of the variables are statistical classes that Polk County uses to classify homes by type and, the group assumes, by neighborhood or region. Three more variables are study areas that classify residential property by relative location within or near cities. The study areas deemed significant cover houses built towards the city center and houses built after 1990. This latter grouping has some overlap with the effective year built, but the study area mainly refers to areas where all of the houses are relatively new.

The variables deemed most significant were expected based upon the team's mental model of the drivers of home prices: living area and land size. These variables are generally major drivers of housing values, even in the current market. Other aspects of the model were surprising, both in variable importance and in the regression model coefficients. One interesting note is that there is a negative association between the price of a house and the number of bedrooms that it has, such that the larger the number of bedrooms, the lower the price of the house. This seems counterintuitive at first, but taking the other variables in the model into account, the team has theorized that this negative relationship may arise because additional bedrooms take away from the rest of the available living space. To illustrate, picture two houses with identical total living area. If one house has 5 bedrooms and the other has 3, the house with more bedrooms most likely has smaller, more cramped bedrooms, and its other rooms are likely smaller as well. The house with more bedrooms would therefore be less attractive to buyers and fetch a lower price.
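The two-house scenario can be worked through with a toy regression equation. The coefficients below are invented purely to show how a negative bedroom coefficient behaves when living area is held constant; they are not the coefficients estimated by the team's model.

```python
def predict_price(living_area, bedrooms,
                  intercept=50_000.0, per_sqft=80.0, per_bedroom=-5_000.0):
    """Toy main-effects regression with hypothetical coefficients."""
    return intercept + per_sqft * living_area + per_bedroom * bedrooms


# Same living area, different bedroom counts.
three_bed = predict_price(2000, 3)
five_bed = predict_price(2000, 5)
print(three_bed, five_bed, five_bed - three_bed)
```

With living area fixed, the only difference between the two predictions is the bedroom term, so the five-bedroom house comes out lower by two times the bedroom coefficient.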

When we more closely examine the different attributes and their relative importance as predictors in the regression model, shown below, we see that living area and land size are at the top, with the number of bedrooms at the very bottom. This agrees with the team's earlier assertion that the number of bedrooms may be less relevant than the quality of the bedrooms, and the quality of the bedrooms may at least in part be a function of how spacious the bedrooms are, i.e., how much living area is available in the house.

Another surprising note is that the number of half baths is relatively more important than the number of full baths, and looking at the regression equation, we see that half baths contribute more to the price than full baths. The team has theorized that the model may be valuing half baths highly because many of the houses in the data set contain at least one bathroom, and the price difference between houses with 2 bathrooms and houses with 3 bathrooms may not be that great. However, more expensive homes which may have half bathrooms primarily for use by guests could be causing the model to value half baths so highly, since half baths are found primarily in more expensive homes.





The mental model of predicting property values suggests a moderately similar pattern to that of the regression model. For example, the average person concerned with determining the value of a property would be able to predict, within a certain degree of accuracy, the value of a property, e.g. a single-family home, using variables such as square footage, number of bedrooms/bathrooms and lot size. When using region-specific variables, such as the Stat class variables and the RMV class variables, however, the average person may have a more difficult task predicting property value, especially if the person is not from the immediate area and is not reasonably familiar with a factor such as neighborhood quality. Real estate professionals would be better equipped to make an assessment of the sales price of a home, having information about neighborhoods and schools. Essentially, the regression model follows a similar approach to that of a real estate professional, as it attempts to incorporate the professional's specific knowledge of the housing market and of the attributes of the house and neighborhood to determine an expected home sale price.

Based on the results shown by our model, we can say that our hypothesis was correct in expecting home prices to be most affected by square footage (Liv Area) and acreage (Land Size). We were surprised to see half bath as being an important variable for house prices. We also saw certain property types as being important variables, such as property near downtown. Our hypothesis that bigger houses would be more expensive than smaller ones also proved to be correct. Contrary to our expectations, zoning was not as important a variable in predicting house prices. While we expected residential zoning to have a positive impact on house prices, residential zoning has a negative coefficient in our regression model. Thus, holding the other variables constant, the model predicts a lower price for a house in a residential zone.

**Relationship Examples**
To evaluate the effectiveness of our 3 best-performing models, we found data for 2 houses in West Salem and predicted their value using Regression, KNN, and Neural Networks. The first house we picked had the following characteristics and results for each model:




While a KNN model did not allow us to predict a value, we chose the 2 most important variables, Living Area and Land Size, to predict the value of this house, and then tried to find a similar house in our KNN model, as shown below.

The second house we picked had the following characteristics and results for each model:

Using these 2 homes as examples indicates that our regression and KNN models are much better able to predict home prices in Polk County, while neural networks were less accurate. This example is in agreement with the RMS error comparison discussed above, which showed the neural network model as having a higher RMS error than the regression and KNN models.

Errors
Above is the lift chart for the KNN, Regression, and Neural Network models, represented by the light blue, red, and dark blue lines respectively. Looking at the very high level of reported accuracy for the training set in the lift chart, the team has concluded that the models are overtraining themselves to the data. When examining the lift chart for the testing set, distortions in the chart make interpreting the lift results difficult, and thus using the testing lift chart as a resource for interpretation is questionable.

Every model the team developed had some amount of prediction error. These prediction errors fell into two main categories: underpriced and overpriced. The overpriced predictions are not that big of a concern, as it is possible to adjust the initial price downwards after gauging market interest. Adjusting the price downward would result in the house spending additional time on the market, which is most likely detrimental to the seller who favors a quick sale. Relatively speaking, however, the issue of overpricing is not as great a concern as the issue of underpricing. An underpriced property will be purchased before the price can be adjusted, and the result will be a loss in potential sales dollars that the seller could have received for the house. It is very likely that a seller using an underpriced valuation of the property would not have a satisfactory sale. On average, our top three models underpriced 36% of the houses in our dataset by at least 5%. While the team acknowledges that there is major room for improvement in the models to account for underpricing, the team also hypothesizes that some of these properties may have been sold way above real value in the actual transaction, which took place between 2001 and 2005, during the time leading up to the bursting of the bubble in the housing market.
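The share of underpriced predictions can be computed directly from predicted and actual prices. The sketch below counts a prediction as underpriced when it falls at least 5% below the actual sale price; that reading of "underpriced by at least 5%" is an assumption, as are the function name and the sample prices.

```python
def underpriced_share(predicted, actual, threshold=0.05):
    """Fraction of homes whose prediction is at least `threshold` below actual."""
    under = sum(
        1 for p, a in zip(predicted, actual)
        if (a - p) / a >= threshold
    )
    return under / len(actual)


# Hypothetical prices in dollars.
actual_sales = [200_000, 200_000, 200_000, 200_000]
model_prices = [185_000, 195_000, 210_000, 188_000]
print(underpriced_share(model_prices, actual_sales))
```

The same function with the inequality reversed would give the overpriced share, which makes it easy to compare the two error categories for each model.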

Project Extensions
The analysis begun with this project could be expanded to include industrial and commercial transactions within Polk County. Depending on transferability and availability of data, the model could be transported to different counties, but would most likely need to be built using past data from each county due to local events and price history. In addition to expanding the model to cover additional types of zoning and other counties, the current model could be refined by incorporating the appraised values of the properties, thus allowing further study of the extreme outliers. It may also be possible to come up with an adjustment for the inflated prices during the housing bubble to determine if the model's accuracy improves. Even if the model does not improve, being able to adjust for inflated prices will help determine if the model has any design or implementation errors. The appraised value may be helpful in adjusting for inflated prices, and thus incorporating appraised value into the models would likely increase their accuracy and their usefulness as predictive tools. We would also like to explore options to make KNN more user-friendly so that it can be used by appraisers. Since appraisers are already looking for options that allow them to compare house prices, this would be a valuable resource for them.

**Model String**

//Common tools for predicting property values//
 * www.zillow.com
 * www.portlandmaps.com


//Academic sources//
Adams, Richard M., Polasky, Stephen, Mahan, Brent L. (200) "Valuing Urban Wetlands: A Property Price Approach", //Land Economics//

Giley, Otis W. & Pace, Kelly R. (1990) "A Hybrid Cost and Market-Based Estimator for Appraisal", //The Journal of Real Estate Research//

Limsombunchai, Visit, Lee, Minsoo, Gan, Christopher (2004) "House Price Prediction: Hedonic Price Model vs. Artificial Neural Network", //American Journal of Applied Sciences//

Netusil, Noelwah R. (2005) "The Effects of Zoning and Amenities on Property Values: Portland, Oregon", //Land Economics//

Pardoe, Iain (2008) "Modeling Home Prices Using Realtor Data", //Journal of Statistics Education//

Robinson, Linda M., Lin, Ta-Win, Rabiega, William A. (1984) "The Property Value Impacts of Public Housing Projects in Low and Moderate Density Residential Neighborhoods", //Land Economics//

Sirmans, G. Stacy, Zietz, Joachim, Zietz, Emily N. (2007) "Determinants of Houses Prices: A Quantile Regression Approach", //Department of Finance and Economics Working Paper Series//, Middle Tennessee State University


Grateful dead graphic used without permission from: []

House Example #1 []

House Example #2 []