Caravan Insurance Policy

This is the project page for the Caravan (Camper) Insurance Project. The links below will take you to the project or the presentation. Note that the text after the links is the project proposal.

=Project = =Presentation=

=Caravan Insurance Project Proposal=

=Introduction=

Cross-selling, or "selling additional products or services among your established clients" [1], has been one of the most successful marketing techniques of recent years. Online retailers like Amazon and the DVD rental firm Netflix are classic examples of cross-selling in practice. The technique, however, does not guarantee the same success across all industries. For example, the success of cross-selling in the insurance industry has been very limited over the years. One of the reasons is the huge challenge of "creating a 360-degree profile of a customer" [2].

In 2000, an insurance company in Europe faced this daunting task. This well-known, established company had been offering a variety of insurance services, such as life, auto, property, and boat insurance, to its huge customer base. Sales of its newest service, the "Caravan insurance policy", however, had been rather disappointing. The company's marketing department knew that if it could take advantage of the existing customer base, the service would pick up market share and become an instant hit. The million-dollar question, of course, was whom to target among these million customers.

With the help of Maarten van Someren and Peter van der Putten, two machine learning scientists from the University of Amsterdam, the company launched an online data mining competition to help solve the puzzle. (The official site of the competition can be found here: http://hcs.science.uva.nl/benelearn99/comppage.html) The theme of the competition was to answer the question “Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?” To answer this question, the contestants were asked to complete two tasks:
 * 1) Predict which customers are potentially interested in a caravan insurance policy from the provided datasets.
 * 2) Describe the potential customers, and possibly explain why these customers buy a caravan policy (profiling).

Nine years later, our team will take on this challenging task with newer, more sophisticated approaches. Although the competition has closed, our results could be used as a tool to help insurance companies in general "profile" their customers and put their marketing dollars to the best use.

Literature Review
This problem was originally posed as a challenge in a contest called CoIL Challenge 2000. The original contest contained 43 entries that tackled this challenge in a variety of ways. Our group will use some of these original contest entries as research.

The data contains 86 variables, ranging from demographic information to zip codes. With such a wide range of information, it might be necessary to weed out a large number of variables. From our research, the winning technique for predicting customers who would buy the "Caravan insurance policy" was the Bayesian network, but our group will likely use multiple techniques to examine the data.

As already mentioned, the task was originally pursued in 2000 as a data mining challenge. Nine years have passed since the challenge was completed, and technology and techniques have advanced since then. We will approach the task in different ways, using more modern tools such as SPSS Clementine, XLMiner, and RapidMiner, along with data visualization tools such as Trendalyzer and ManyEyes.

As previously mentioned, this dataset was originally a challenge with over 40 entrants. We can learn from many of these entries, both successful and unsuccessful. We plan to use a report by Charles Elkan of the University of California, San Diego, whose entry won the contest. Our group will also use reports written by other contestants to determine the most and least effective techniques.

Project Summary
 * Team Name: 3MPS
 * Current Date: 4/19/09
 * Title: Cross-Selling Solution for an Insurance Company
 * Project Start: 4/13/2009
 * Project Finish: 5/12/2009

Materials Needed:
 * Computers
 * Data visualization software: Excel, Minitab, SPSS Clementine, Trendalyzer, ManyEyes, RapidMiner
 * Data mining software: SPSS Clementine, XLMiner, RapidMiner

The estimated cost for this project is 150 hours (30 hours x 5 team members) at $15 per hour, resulting in a total cost of $2,250.

Project Schedule
Task List
 * Identify business case and classify data mining problem type
 * Identify sources of data
 * Data understanding
 * Clean, validate, and visualize the data
 * Develop predicting models with alternative techniques
 * Interpret results and derive final predicting models
 * Finish final report
 * Make Final Presentation


__1. Data Understanding__
Recall that we are trying to predict which customers are more likely to purchase Caravan insurance. The dataset that we obtained consists of 86 variables, including product usage data and socio-demographic data derived from zip codes. The breakdown of the dataset is as follows:
 * The first 43 variables are demographic and social data. Some of these variables look redundant. For example, we have a variable called “Roman Catholic” and another called “Protestant”, and then two more variables called “Other Religion” and “Non Religion”. Another example is that we have an “Average Income” variable and then 5 other “Income range” variables. Therefore, as a safe practice, we should reduce the number of predictors before applying our predictive models.
 * The remaining 43 variables are insurance-related variables covering products such as auto, life, trailer, boat, and property insurance. Again, redundancy seems to be high among these variables and needs to be addressed.
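
One possible way to check the suspected redundancy is to compute pairwise correlations between predictors and flag pairs that are highly correlated. The following is a minimal sketch in plain Python. The column names, the toy values, and the 0.9 threshold are assumptions made for illustration and do not come from the actual dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return pairs of column names whose absolute correlation meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical toy columns (not real data).
toy = {
    "average_income": [1, 2, 3, 4, 5],
    "high_income_share": [2, 4, 6, 8, 10],  # moves in lockstep with average_income
    "household_size": [3, 1, 4, 1, 5],
}
flagged = redundant_pairs(toy)
```

A flagged pair is only a candidate for removal; which variable of the pair to keep would still be a judgment call.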

__2. Data Visualization__
Besides conventional data visualization techniques such as tables, pie charts, bar charts, and correlation matrices, we will also use modern tools such as Trendalyzer from Google and ManyEyes from IBM to visualize the dataset. The purpose of data visualization is to answer several initial questions that we could not answer in the previous step. Those questions include:
 * a. Are there groups of predictors that convey the same information, and how important is that information for predicting Caravan customers?
 * b. Can we make an educated guess about the profile of a Caravan customer just by looking at the dataset?
 * c. Are there any anomalies or outliers in the dataset?

__3. Hypothesis__
From the last step, we can make several educated guesses about the characteristics of our Caravan customers. These educated guesses include:
 * Caravan customers are most likely to be in a household with children
 * Caravan customers are most likely to be in an upper social class (A, B1, B2)
 * Caravan customers are most likely to have purchased boat, bicycle, or surfboard insurance policies

__4. Data Pre-process__
 * Partitioning: dividing the dataset into 60% training and 40% validation
 * Transformation: converting skewed variables into approximately normally distributed variables
 * Noise removal: removing outliers and missing values from the dataset
 * Feature extraction: pulling out variables that are significant in predicting Caravan customers, using correlation analysis and regression trees
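
The partitioning step could be sketched as follows. This is only an illustration in plain Python: the function name `partition`, the fixed seed, and the stand-in records are assumptions, and the data mining tools listed above provide their own partitioning features.

```python
import random


def partition(records, train_fraction=0.6, seed=42):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for customer rows
training, validation = partition(records)
```

Shuffling before the cut avoids a split that simply follows the order of the source file.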

__5. Data Analysis with Various Predicting Techniques__
The section below describes the data mining methods we will use to analyze the data and predict our final outcome. It can be expanded as we progress.

a. Naïve Rule: Under this approach, we assume that all 2000 customers from our Test dataset will be classified as members of the majority class in the Training dataset. In other words, if the majority of customers in the Training dataset buy Caravan insurance, we will immediately assume that all the customers in the Test dataset will buy Caravan insurance. Even though this method does not promise high predictive power, it can be used as a baseline for evaluating the performance of more complicated methods.
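
A minimal sketch of the naïve rule as a baseline, using made-up labels (1 for a Caravan customer, 0 otherwise). The label lists below are illustrative assumptions, not values from the dataset.

```python
from collections import Counter


def naive_rule(train_labels):
    """Return the majority class of the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train_labels = [0, 0, 0, 1, 0, 0, 1, 0]  # hypothetical training labels
test_labels = [0, 1, 0, 0]               # hypothetical test labels
majority = naive_rule(train_labels)
predictions = [majority] * len(test_labels)  # every test record gets the majority class
baseline_accuracy = accuracy(predictions, test_labels)
```

Any more complicated model would then be expected to beat `baseline_accuracy` to justify its added complexity.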

b. Naïve Bayes: Using this method, a naïve Bayes classifier combines the prior class probabilities with the conditional probabilities of important predictors to estimate the posterior probability that a customer buys Caravan insurance. Since we have a fairly large dataset, this method might be useful in classifying Caravan customers.
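
The idea can be sketched for categorical predictors as below. This is an illustrative implementation with add-one (Laplace) smoothing, which is an assumption added here to avoid zero probabilities. The two toy predictors (has a boat policy, has children) and their values are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value -> count
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, row):
    """Return normalized P(class | row), assuming predictors are independent given the class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            # add-one smoothing on the conditional probability
            score *= (value_counts[(i, label)][value] + 1) / (count + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical rows: (has_boat_policy, has_children); label 1 = Caravan customer.
rows = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 1)]
labels = [1, 1, 0, 0, 0, 0]
model = train_naive_bayes(rows, labels)
post = posterior(model, (1, 1))
```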

c. K-Nearest Neighbors: This method is used to identify k observations in the training dataset that are similar to the test records that we wish to classify. We then use these neighboring records to classify the test records as Caravan customers or not.
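
A minimal sketch of k-nearest neighbors with Euclidean distance and majority voting. The two numeric predictors and the choice of k = 3 are assumptions for illustration; in practice the predictors would usually be rescaled first so that no single variable dominates the distance.

```python
from collections import Counter
from math import dist


def knn_classify(train_rows, train_labels, query, k=3):
    """Classify the query by majority vote among its k nearest training rows."""
    ranked = sorted(zip(train_rows, train_labels), key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training records with two numeric predictors.
train_rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = [0, 0, 0, 1, 1, 1]
near_customers = knn_classify(train_rows, train_labels, (8.5, 8.5))
near_non_customers = knn_classify(train_rows, train_labels, (1.5, 1.5))
```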

d. Classification Trees (CART, C4.5): This technique recursively divides the records and derives a set of rules for prediction. The terminal nodes are marked 1 or 0, with 1 corresponding to Caravan customers and 0 to non-customers.
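
To illustrate how a tree chooses where to divide the records, the sketch below evaluates a single split using Gini impurity, the criterion commonly associated with CART (C4.5 uses an information-based criterion instead). It covers only one split on one numeric predictor, not a full tree, and the toy values are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try each distinct value as a threshold and return the one with the lowest impurity."""
    return min(sorted(set(values)), key=lambda t: split_impurity(values, labels, t))


# Hypothetical predictor values and class labels.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(values, labels)
```

A full tree would repeat this search over all predictors and then recurse into each resulting branch.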

e. Logistic Regression: This model involves two steps. The first step yields estimates of the probabilities of records belonging to class 1 (Caravan customers). In the next step, we determine a cutoff value on these probabilities. (This cutoff value can be obtained from the Naïve rule.)
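
The two steps can be sketched as follows: estimate a probability, then apply a cutoff. The fitting routine below is a simple batch gradient descent written for illustration. The learning rate, epoch count, default cutoff of 0.5, and toy data are all assumptions, and a statistics package would normally be used for the fit.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            features = [1.0] + list(row)  # leading 1.0 multiplies the bias
            error = sigmoid(sum(w * x for w, x in zip(weights, features))) - y
            for j, x in enumerate(features):
                gradient[j] += error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


def predict_proba(weights, row):
    """Step 1: estimated probability that the record belongs to class 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, [1.0] + list(row))))


def classify(weights, row, cutoff=0.5):
    """Step 2: assign class 1 when the probability reaches the cutoff."""
    return 1 if predict_proba(weights, row) >= cutoff else 0


# Hypothetical single-predictor data.
rows = [(0,), (1,), (2,), (3,), (4,), (5,)]
labels = [0, 0, 0, 1, 1, 1]
weights = fit_logistic(rows, labels)
```

Lowering the cutoff flags more customers as likely buyers at the cost of more false positives, which may be a reasonable trade when buyers are rare.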

f. Neural Nets: Using this model, records are fed through the layers of a neural network. The model then corrects its own errors using the backpropagation method.

g. Discriminant Analysis: This method will be particularly useful in distinguishing the characteristics of Caravan customers. However, due to the size of the dataset that we have, it will be somewhat challenging to separate records into classes.

h. Association Rules: In essence, our business challenge is a cross-selling problem. Therefore, we can experiment with Association Rules on this dataset. This last method allows us to test our initial hypothesis: “If a customer purchases boat, bicycle or surfboard insurance, will he also purchase Caravan insurance?”
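
A rule such as "boat insurance implies Caravan insurance" is usually judged by its support, confidence, and lift. The sketch below computes these three measures for one rule. The policy names and the customer baskets are hypothetical and only illustrate the calculation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    n_a = sum(antecedent <= t for t in transactions)  # baskets containing the antecedent
    n_c = sum(consequent <= t for t in transactions)  # baskets containing the consequent
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical sets of policies held by eight customers.
transactions = [
    {"boat", "caravan"}, {"boat", "caravan", "car"}, {"car"}, {"car", "fire"},
    {"boat"}, {"car", "caravan"}, {"fire"}, {"car"},
]
support, confidence, lift = rule_metrics(transactions, {"boat"}, {"caravan"})
```

A lift above 1 suggests that customers holding the antecedent policy buy Caravan insurance more often than customers in general.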

__Group members__: David Mackley, Connor Malone, Chad Meynders, Howie Pham, Nick Stuart


[1] [] [2] [] 