
=** Project Title: ** **Don't Get Kicked!** =

**Team Members**
 * Hussain Al Haddad
 * Karina Javier
 * Megha Lingajappa Satish Kumar
 * Nathaniel Giscombe

**Introduction**

The automobile industry generates billions of dollars in revenue annually, and to do so it relies heavily on car dealerships to sell cars. Dealerships, in turn, generate revenue from the sale of both new and used cars. New car sales are generally more profitable, but used cars also generate a substantial share of dealership income. Most dealerships acquire used automobiles through either trade-ins or auction purchases.

Purchasing used cars at auto auctions is risky. It is one of the biggest challenges a dealership can face, because a vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases "kicks". Kicks are costly to dealerships whether they fix the issues or sell the vehicle as-is.

A dealership's goal is to offer its customers the best possible inventory selection; otherwise it loses reputation, customers, and profitability. A model that helps identify which cars have a higher risk of being kicks can provide enormous value to dealerships. The impact on the bottom line and performance of dealerships, the industry, and in turn the economy shows the value of this project.

The core problem in this project is to predict whether a car purchased at auction is a kick (bad buy). To solve this problem we will use a set of variables that describe the car and its history. The project will be conducted using IBM SPSS Modeler 14.2.

**Research Question**

The goal is to predict whether a car purchased at auction is a kick (bad buy). In other words, for every car considered for purchase we are trying to answer the question: is this car a kick?

**Literature Review (Business Understanding)**

The used automobile market represents nearly half of the retail automobile sales market in the US. This industry generates about $370 billion in annual sales, making it one of the largest retail segments in the US economy.

For automobile dealers, used automobiles represent a large portion of revenue, and since margins in this industry are not very high (about 20% on average), it is important for dealers to sell large quantities. For example, in 2011, 44 million used cars were sold in the US alone, compared to 17 million new cars. When dealers acquire used cars, it is important that they not spend an exorbitant amount on getting each car ready for sale, as this cuts into their profit margins.

Dealership costs associated with getting a car ready for sale include cleaning, marketing, and certification fees. Given these initial expenses, it is important for dealers to identify which types of automobiles have a higher probability of being a “kick”, because these cars cost significantly more to sell. By reducing the probability of purchasing a kick, dealers would increase their margins and therefore their profit.

Kicks often result from tampered odometers, mechanical issues the dealer cannot address, problems obtaining the vehicle title from the seller, or other unforeseen problems. Associated costs include transportation, throw-away repair work, and market losses in reselling the vehicle.

**Project Plan (Intent)**

Stated below are our initial plans for this project. Depending on findings and limitations, we may not use all of the models and steps mentioned. Besides data understanding and summarization, the major part of this plan is the descriptive analysis. The analysis is to be finalized by Saturday, April 28, 2012, to allow enough time for revision and reporting before the submission due date of May 3, 2012.


 * **Data Understanding:** discussing each variable and the nature of the data (including data types, possible use of some data for splits, frequency of responses under each variable, extremes, outliers, missing data, data distribution, and possible maintenance needs for some models). Use the following tools:
   * Basic statistics of important variables
   * Scatter plots
   * Correlations
   * Cross-tabulations
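
As a rough illustration of the cross-tabulation tool listed above, the following Python sketch counts records per category and computes the share of kicks within each category. It is not part of the SPSS Modeler workflow; the toy rows, the helper names `crosstab` and `kick_rate_by`, and the choice of WheelType as the example variable are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy rows; the real file has many more columns and records.
rows = [
    {"WheelType": "Alloy", "IsBadBuy": 0},
    {"WheelType": "Alloy", "IsBadBuy": 0},
    {"WheelType": "Alloy", "IsBadBuy": 1},
    {"WheelType": "Covers", "IsBadBuy": 0},
    {"WheelType": "Covers", "IsBadBuy": 0},
    {"WheelType": None, "IsBadBuy": 1},   # missing wheel type
    {"WheelType": None, "IsBadBuy": 1},
    {"WheelType": None, "IsBadBuy": 0},
]

def crosstab(rows, row_key, col_key):
    """Count records for every (row_key value, col_key value) pair."""
    return Counter((r[row_key], r[col_key]) for r in rows)

def kick_rate_by(rows, key, target="IsBadBuy"):
    """Share of records with target == 1 within each category of `key`."""
    table = crosstab(rows, key, target)
    rates = {}
    for category in {r[key] for r in rows}:
        kicks = table[(category, 1)]
        total = kicks + table[(category, 0)]
        rates[category] = kicks / total
    return rates

rates = kick_rate_by(rows, "WheelType")
for category, rate in sorted(rates.items(), key=lambda kv: str(kv[0])):
    print(f"{category!s:>8}: kick rate {rate:.2f}")
```

Treating missing values as their own category, as done here, is one way to see whether missingness itself is related to the target.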


 * **Clustering:** the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
   * Kohonen
   * K-Means
   * 2-Step
   * Anomaly
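
To make the clustering idea more concrete, here is a minimal K-Means sketch in plain Python (Lloyd's algorithm). It is only an illustration, not the SPSS Modeler node: the six points, the choice of VehicleAge and VehOdo as features, the rescaling of VehOdo to units of 10,000 miles, and the starting centroids are all assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its points, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (VehicleAge in years, VehOdo in units of 10,000 miles).
points = [(2, 4.1), (3, 5.0), (2, 4.6), (7, 8.9), (8, 9.4), (6, 8.2)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Because K-Means relies on distances, features on very different scales would normally be standardized first; the rescaled odometer values here stand in for that step.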


 * **Relationship mining:** dependency modeling, which searches for relationships between variables.
   * Association rule: Carma and Apriori
   * Sequential pattern: Sequence
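
Association rules are usually judged by their support and confidence. The sketch below computes both for a single rule on a handful of made-up records. The `field=value` item encoding, the helper name `rule_metrics`, and the data are assumptions. Support is defined here as the fraction of records containing both sides of the rule; individual tools may define it differently (for example, on the antecedent only).

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.
    Each transaction is a set of 'field=value' items."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical records encoded as item sets.
transactions = [
    {"AUCGUART=RED", "IsBadBuy=1"},
    {"AUCGUART=RED", "IsBadBuy=1"},
    {"AUCGUART=RED", "IsBadBuy=0"},
    {"AUCGUART=GREEN", "IsBadBuy=0"},
    {"AUCGUART=GREEN", "IsBadBuy=0"},
]

support, confidence = rule_metrics(transactions, {"AUCGUART=RED"}, {"IsBadBuy=1"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```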


 * **Predictive Analysis:**
   * Density estimation
     * Naïve Bayes
     * KNN
   * Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
     * C5.0 and/or C&R Tree and/or CHAID
     * Discriminant Analysis 2
     * Neural Net
     * Bayesian Network
     * SVM (Support Vector Machines)
   * Regression: attempts to find a function that models the data with the least error.
     * Linear Regression (Linear)
     * Multiple Regression (Regression)
     * Cox
     * Logistic Regression
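
As one example of what a predictive model from this list does, the following is a minimal Naïve Bayes sketch for categorical inputs with add-one (Laplace) smoothing. It is a simplified stand-in, not the SPSS Modeler implementation; the class name `CategoricalNaiveBayes`, the two chosen features (AUCGUART, WheelType), and the eight training rows are assumptions.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.class_counts = Counter(y)
        # value_counts[(feature index, class)][value] -> number of occurrences
        self.value_counts = defaultdict(Counter)
        self.values = [set() for _ in X[0]]  # distinct values seen per feature
        for row, label in zip(X, y):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(row):
                seen = self.value_counts[(i, label)][value]
                score += math.log((seen + 1) / (count + len(self.values[i])))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)

# Hypothetical training rows: (AUCGUART, WheelType) -> IsBadBuy
X = [("GREEN", "Alloy"), ("GREEN", "Covers"), ("GREEN", "Alloy"), ("RED", "Covers"),
     ("RED", "Alloy"), ("RED", "Covers"), ("GREEN", "Covers"), ("RED", "Alloy")]
y = [0, 0, 0, 1, 1, 0, 0, 1]

model = CategoricalNaiveBayes().fit(X, y)
print(model.predict(("RED", "Alloy")))
print(model.predict(("GREEN", "Covers")))
```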

 * **Summarization:** providing a more compact representation of the data set, including visualization and report generation.
 * Did the implementation of the study fulfill the intentions of the research design?

**Steps to follow within each model:**
 * 1) **Data Preparation:** Outliers and missing values maintenance, Standardization, Normalization, PCA Factor, Feature selection.
 * 2) **Modeling**
 * 3) **Evaluation:** by testing with the testing/validation partition
 * 4) **Deployment:** discussing the possible uses and the value added from the findings

**Things to always pay attention to:**
 * Overfitting.
 * Readability.
 * Visual presentation: charts and diagrams.
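
The evaluation step and the overfitting concern both depend on keeping a testing partition separate from the training data. The sketch below shows one way to split records and score predictions on the held-out part. The 70/30 split, the fixed seed, the helper names `partition` and `evaluate`, and the example labels are assumptions, not settings taken from the project.

```python
import random

def partition(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def evaluate(actual, predicted):
    """Confusion-matrix counts plus accuracy, precision and recall for class 1 (kick)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical record ids, split 70/30.
records = list(range(10))
train, test = partition(records)

# Hypothetical true labels and model predictions on a testing partition.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0]
metrics = evaluate(actual, predicted)
print(len(train), len(test), metrics)
```

A large gap between the same metrics computed on the training and testing partitions is a common sign of overfitting.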

**Data Description**

The data set to be analyzed contains 73,014 rows. Our target variable is IsBadBuy and there are 32 independent variables. Below is a more specific definition of each variable in the data file:
 * **RefID:** Unique number assigned to vehicles.
 * **IsBadBuy:** Identifies if the kicked vehicle was an avoidable purchase.
 * **PurchDate:** The date the vehicle was purchased at Auction.
 * **Auction:** Auction provider at which the vehicle was purchased.
 * **VehYear:** The manufacturer's year of the vehicle.
 * **VehicleAge:** The years elapsed since the manufacturer's year.
 * **Make:** Vehicle Manufacturer.
 * **Model:** Vehicle Model.
 * **Trim:** Vehicle Trim Level.
 * **SubModel:** Vehicle sub model.
 * **Color:** Vehicle Color.
 * **Transmission:** The vehicle's transmission type (Automatic, Manual).
 * **WheelTypeID:** The type id of the vehicle wheel.
 * **WheelType:** The vehicle wheel type description (Alloy, Covers).
 * **VehOdo:** The vehicle's odometer reading.
 * **Nationality:** The Manufacturer's country.
 * **Size:** The size category of the vehicle (Compact, SUV, etc.).
 * **TopThreeAmericanName:** Identifies if the manufacturer is one of the top three American manufacturers.
 * **MMRAcquisitionAuctionAveragePrice:** Acquisition price for this vehicle in average condition at time of purchase.
 * **MMRAcquisitionAuctionCleanPrice:** Acquisition price for this vehicle in above average condition at time of purchase.
 * **MMRAcquisitionRetailAveragePrice:** Acquisition price for this vehicle in the retail market in average condition at time of purchase.
 * **MMRAcquisitonRetailCleanPrice:** Acquisition price for this vehicle in the retail market in above average condition at time of purchase.
 * **MMRCurrentAuctionAveragePrice:** Acquisition price for this vehicle in average condition as of current day.
 * **MMRCurrentAuctionCleanPrice:** Acquisition price for this vehicle in above average condition as of current day.
 * **MMRCurrentRetailAveragePrice:** Acquisition price for this vehicle in the retail market in average condition as of current day.
 * **MMRCurrentRetailCleanPrice:** Acquisition price for this vehicle in the retail market in above average condition as of current day.
 * **PRIMEUNIT:** Identifies if the vehicle would have a higher demand than a standard purchase.
 * **AcquisitionType:** Identifies how the vehicle was acquired (Auction buy, trade in, etc.).
 * **AUCGUART:** The level of guarantee provided by the auction for the vehicle (Green light - guaranteed/arbitrable, Yellow light - caution/issue, Red light - sold as is).
 * **KickDate:** Date the vehicle was kicked back to the auction.
 * **BYRNO:** Unique number assigned to the buyer that purchased the vehicle.
 * **VNZIP:** Zip code where the car was purchased.
 * **VNST:** State where the car was purchased.
 * **VehBCost:** Acquisition cost paid for the vehicle at time of purchase.
 * **IsOnlineSale:** Identifies if the vehicle was originally purchased online.
 * **WarrantyCost:** Warranty price

**Data Collection**

The data was obtained from http://www.kaggle.com/c/DontGetKicked and is available in MS Excel format. The variables appear appropriate and relevant, with few ambiguous fields.

While modeling, it will be interesting to observe how much importance WheelType, WheelTypeID, VehOdo, the different price variables, AUCGUART, VehicleAge, and PRIMEUNIT have in predicting the target IsBadBuy. VehYear and VehicleAge carry almost the same meaning, so we will have to filter one of the two out while modeling. The same applies to VNZIP and VNST. The variable Color might be interesting to an auto dealer, since customer demand tends to be higher for certain car colors.
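
One way to confirm that two variables such as VehYear and VehicleAge are redundant is to measure their correlation and to check whether one can be derived from the other. The sketch below does both on a few made-up records. The values are hypothetical, and the assumption that VehicleAge equals the purchase year (taken from PurchDate) minus VehYear would need to be verified against the actual file.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical records: (purchase year from PurchDate, VehYear, VehicleAge)
records = [
    (2009, 2006, 3),
    (2009, 2004, 5),
    (2010, 2005, 5),
    (2010, 2008, 2),
    (2009, 2007, 2),
]

veh_year = [year for _, year, _ in records]
veh_age = [age for _, _, age in records]

r = pearson(veh_year, veh_age)
# Assumed relationship: VehicleAge == purchase year - VehYear
derivable = all(purchase - year == age for purchase, year, age in records)

print(f"correlation(VehYear, VehicleAge) = {r:.3f}")
print(f"VehicleAge derivable from PurchDate and VehYear: {derivable}")
```

A correlation close to -1 or +1, or an exact derivation, suggests that keeping both fields adds little information to the model.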

Some variables that could help build a better model are unfortunately not available in our data set; miles driven, engine type (petrol, diesel), and the number of accidents the car has been in are good examples. Although these factors, which could considerably improve the reliability and accuracy of the model for decision making, are absent, the available data is relevant and hopefully good enough to assist auto dealers' decisions when purchasing inventory from auctions.

**Next Step: Execution**

Based on this plan, we will conduct the study. The final report should be published on this Wiki Group by Thursday May 3, 2012.