=Caravan Insurance Project Report=

**INTRODUCTION**
Cross-selling, or "selling additional products or services among your established clients" [1], has been one of the most successful marketing techniques of recent years. Online retailers such as Amazon and the DVD rental firm Netflix are classic examples of cross-selling in practice. The technique, however, does not guarantee the same success in every industry. In the insurance industry, for example, cross-selling has had very limited success over the years. One reason is the huge challenge of "creating a 360-degree profile of a customer" [2].

In 2000, an insurance company in Europe faced this daunting task. This well-known, established company offered a variety of insurance products, such as life, auto, property, and boat insurance, to its huge customer base. Sales of its newest product, a caravan insurance policy, had however been rather disappointing. The company's marketing department knew that if it could take advantage of the existing customer base, the product would pick up market share and become an instant hit. The million-dollar question, of course, was whom to target among these million customers.

With help from Maarten van Someren and Peter van der Putten, two machine learning scientists from the University of Amsterdam, the company launched an online data mining competition to help solve the puzzle. (The official site of the competition can be found here: http://hcs.science.uva.nl/benelearn99/comppage.html) The theme of the competition was the question “Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?” To answer it, contestants were asked to complete two tasks:
 * 1) Predict which customers are potentially interested in a caravan insurance policy from the provided datasets.
 * 2) Describe the potential customers, and possibly explain why these customers buy a caravan policy (profiling).

Nine years later, our team takes on this challenging task with newer, more sophisticated approaches. Although the competition has closed, our results could serve as a tool to help insurance companies in general "profile" their customers and put their marketing dollars to the best use.


**DATA / BUSINESS UNDERSTANDING**

__1. Business / Customer Understanding__
Caravans (known as travel trailers or mobile homes in North America) are becoming increasingly popular in Europe. The UK caravan market is estimated to be worth around $2 billion. Along with this growth in popularity come increasing problems with theft, fire, and natural hazard damage. For example, an estimated 1,600 caravans are stolen each year in England. Insurance companies now require caravan owners to install additional security features before applying for coverage. [3]

For marketers, one way to understand their target customers is to study how those customers interact with the media: which sites they tend to visit, which TV channels they watch, and which magazines they read. [4] In learning about caravan customers, a particularly useful tool was KartOO. KartOO (at kartoo.com) is a specialized search engine that gives the user a quick visual summary of the most popular sites for a search term. For example, the picture above shows the 12 most popular sites in the UK for the search term "caravan". KartOO also shows the "links", or relationships, among these sites. In the example above, with the keyword "caravan", these relationships range from "article", "discover", "guide", and "tour" to "sell" and "sale".

A deeper look at the visual summary from KartOO tells a very interesting story about our target caravan customers. First, it's easy to see that the top left sites are mostly for people who want to get away. These sites are:
 * www.lets-getaway.com
 * www.joalleisure.com
 * www.discover.co.uk
 * www.myholidaycaravan.com

That's not all, however. The visual summary also shows that our caravan customers are careful shoppers who try hard to find bargains. They do their own research at Edmunds and look for "deals" on eBay. They also read an online magazine and participate in an online community for caravan owners at caravanmagazine.co.uk.

Last but not least, these caravan customers also visit Noblemarine.co.uk, a popular online insurance company for boat owners in the UK. At this point, we can make an educated guess that caravan owners most likely also purchase boats (and boat insurance). The validity of this guess, of course, can only be confirmed after careful data mining steps.

__2. Data Understanding__
The dataset that we obtained consists of 86 variables, including insurance-product usage data and socio-demographic data derived from zip area codes. The breakdown of the dataset is as follows:
 * The first 43 variables are demographic and social data. Some of these variables look redundant. For example, we have a variable called “Roman Catholic” and another called “Protestant”, and then two more variables called “Other Religion” and “No Religion”. Similarly, we have an “Average Income” variable and then 5 other “Income range” variables. Therefore, as a safe practice, we have to reduce the number of predictors before applying our predictive models.
 * The remaining 43 variables are insurance-related variables covering auto, life, trailer, boat, property, and other insurance products. Again, redundancy seems to be high in these variables and needs to be addressed.

Furthermore, the caravan data set is organized not by individual but by group. Each row (group) represents a large sample of customers. The insurance company created these groups to make use of the mass of data it holds while keeping the data set manageable. By way of example, religion is represented by variables such as No Religion, Protestant, and Roman Catholic. These variables range in value from zero to nine. Each value (0-9) represents a percentage (participation level) of the group, ranging from 0% to 100%, respectively. When the values are added together, they represent a total of 100%. This style of data representation applies to several variables in the data set.
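The 0-9 level encoding described in this section can be made concrete with a small lookup table. The sketch below is illustrative only: the band boundaries are an assumption taken from the data dictionary commonly distributed with the CoIL Challenge 2000 data and should be checked against the dictionary that ships with the data set, and the variable names in `example_group` are hypothetical.

```python
# Percentage bands for the 0-9 levels used by the group-level variables.
# ASSUMPTION: these boundaries follow the data dictionary commonly
# distributed with the challenge data; verify before relying on them.
PERCENT_BANDS = {
    0: (0, 0),
    1: (1, 10),
    2: (11, 23),
    3: (24, 36),
    4: (37, 49),
    5: (50, 62),
    6: (63, 75),
    7: (76, 88),
    8: (89, 99),
    9: (100, 100),
}


def band(level):
    """Return the (low, high) percentage band for a 0-9 level."""
    if level not in PERCENT_BANDS:
        raise ValueError(f"level must be an integer from 0 to 9, got {level!r}")
    return PERCENT_BANDS[level]


def band_midpoint(level):
    """Return the midpoint of the band as a rough point estimate."""
    low, high = band(level)
    return (low + high) / 2


# Hypothetical row: religion levels for one group (names are illustrative).
example_group = {
    "roman_catholic": 2,
    "protestant": 5,
    "other_religion": 1,
    "no_religion": 3,
}

# Decode each level into its percentage band.
decoded = {name: band(level) for name, level in example_group.items()}
```

Because each level stands for a band rather than an exact percentage, decoded midpoints for a set of related variables will only approximately add up to 100%.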


**DATA VISUALIZATION**

//"Data visualizations encompass a wide and growing range of projects, reflecting creative ways of representing all sorts of data visually, with virtually no limit to what kind of information can be translated into an image"// [5]

In this phase of the project, we try to develop a basic understanding of the data set that we have. Given the size of the data set, it is simply not efficient to look at tables of raw data. Since "a picture is worth a thousand numbers", we rely on traditional and modern visualization tools to help us understand our data better. We tried more than a dozen visualization tools, including Many Eyes from IBM, Gapminder from Google, 3D Miner, and the Visualization Toolkit (VTK). However, due to the nature of our data, these tools didn't provide more insight than traditional approaches. Therefore, we decided to proceed with the visualization features of SPSS PASW Modeler and Excel.

Recall that we need to develop a customer profile for potential caravan insurance customers. Out of more than 5,000 records in the training data set, only 348 are caravan customers. The most logical first step is, of course, to study these customers. What do they do? Are they married or not? How big are their households? Which other insurance products have they purchased? Below is a collection of graphical presentations that we used to shed some light on the characteristics of the caravan customer groups. On the first graph, the spike in the 3D surface chart shows that the majority of the caravan customers in the data set have an average age at level 2 (30-40 years old) and an average household size of 4 members. Next, it appears that our caravan customers have roughly a 50/50 chance of being married and of having children: the majority of data points are in the center, with the "Married" and "Kids" categories both rated at level 5.

Every customer of this insurance company is "classified" into one of 41 sub-categories, ranging from "High Status Seniors" to "Student in Apartment". Below is the distribution of these sub-categories for the 348 caravan customers. It's easy to notice the unusual spikes at categories 8 (Middle class families) and 33 (Lower class large family). From this evidence, we can hypothesize that caravan customers could be either lower class families using a caravan as cheap accommodation or middle class families using a caravan for recreational activities. Again, since caravan customers make up only 6% of the training data set, these kinds of hypotheses or "educated guesses" have to be confirmed with appropriate oversampling and data mining steps. Last but not least, the bubble chart below shows that an abnormally high number of these caravan customers also purchased car policies, fire policies, and third-party insurance. Boat insurance doesn't come out on top as we expected. However, our initial hypothesis might still be correct: people who buy a boat and boat insurance may be more likely to buy a caravan and caravan insurance, but not the other way around.


**HYPOTHESIS**

Our initial hypothesis that boat insurance purchasers are also caravan insurance purchasers was again buttressed by Google Trends. The following graph shows a near perfect correlation between searches for boat insurance and caravan insurance. Note that there is an unusual spike toward the end of 2006. There is no data explaining this variation. Also consider the peak of insurance searches during each year's summer months.

At this point of the project, we have come to a few hypotheses about our final results. These hypotheses are:
 * 1) Caravan customers might be divided into two groups: "want-to-get-away" middle class families and "I want a bargain" lower class large families.
 * 2) Targeted caravan customers might be around 30-40 years old and living in a household of 4.
 * 3) Customers who buy boat insurance most likely also buy caravan insurance.
 * 4) Other products that caravan customers buy are car insurance and fire insurance.

With these hypotheses in mind, we move on to the next phases of our project.
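The strength of the relationship between the two search-interest series discussed in this section can be quantified with a Pearson correlation coefficient. The sketch below is a minimal illustration, not the calculation behind the graph: the two series are hypothetical placeholder values, since the underlying search data is not part of this report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL search-interest index values with a summer peak
# (placeholders for illustration, not real Google Trends data).
boat_insurance = [20, 25, 40, 70, 90, 60, 30, 22]
caravan_insurance = [18, 24, 42, 68, 95, 58, 28, 20]

r = pearson(boat_insurance, caravan_insurance)
```

A value of `r` close to 1 would support the visual impression of a near perfect correlation. Two seasonal series can correlate strongly simply because both peak in summer, so a high value alone does not show that the same customers buy both products.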


**DATA PRE-PROCESSING**

Preparation of the data started with changing the cryptic column names of the original data set to more descriptive titles. We then ran a data audit on all fields and discovered that no values were missing. This removed the need to impute missing values.

The value ranges of the variables were all quite similar. This removed the need to transform the data through log10, natural log, z-score, or other transformations. Caravan insurance purchasers were few in the data set, so we oversampled the data to overcome this limitation. Also, all of the insurance policy variables are binary. We assume this means that all persons in a group either have the particular policy being discussed or do not. All variables should be understandable from this written explanation together with the data explanation given in the data set itself.
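The kind of data audit described in this section (counting missing values and checking value ranges per field) can be sketched in a few lines. This is a simplified illustration rather than the audit tool we used; the `audit` helper and the column names in the sample rows are hypothetical.

```python
def audit(rows):
    """Summarize missing values and numeric range for every column.

    rows: list of dicts mapping column name -> value, where None or an
    empty string counts as missing.
    """
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(
                column, {"missing": 0, "min": None, "max": None}
            )
            if value is None or value == "":
                stats["missing"] += 1
                continue
            number = float(value)
            # Track the smallest and largest value seen so far.
            if stats["min"] is None or number < stats["min"]:
                stats["min"] = number
            if stats["max"] is None or number > stats["max"]:
                stats["max"] = number
    return report


# HYPOTHETICAL sample rows; the column names are illustrative only.
sample_rows = [
    {"avg_age_level": 2, "household_size": 4, "has_car_policy": 1},
    {"avg_age_level": 3, "household_size": 2, "has_car_policy": 0},
    {"avg_age_level": 2, "household_size": "", "has_car_policy": 1},
]

report = audit(sample_rows)
columns_with_gaps = [name for name, s in report.items() if s["missing"] > 0]
```

If `columns_with_gaps` comes back empty, no imputation step is needed, and comparing the `min` and `max` entries across columns gives a quick check on whether rescaling is worth considering.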


**MODEL DEVELOPMENT AND INTERPRETATION**

Using the cleaned Excel data set, we sought to produce the most accurate predictor and classifier models possible. To do this, we tried several different Clementine processing techniques that eventually led to the following best case.
 * 1) A data audit of the cleaned Excel data set found that the data was already in binary form and therefore had no skewness. This means that our model was more likely to be accurate and required limited manual normalization.
 * 2) We used the Type node to identify all variables as inputs and the number of caravan policies as the output. After realizing that the number of caravan policies was always either 1 or 0, we decided to treat it as a flag variable. This would later enable us to use predictive modeling.
 * 3) We then explored the data further by splitting it (using Select nodes) into non-caravan insurance records and caravan insurance records. Using the Statistics node, we identified a 5474 to 348 split. This shows that caravan insurance holders are clearly under-represented.
 * 4) We then used the Balance node to oversample the data and get a more balanced data set. We used a multiplier of 16.73, based on the counts from the previous step.
 * 5) Next, we partitioned the data into 60% training and 40% testing.
 * 6) Afterwards, we ran a Feature Selection node set at a 95% importance rating. Variables that did not meet this level of importance were eliminated at this point, resulting in slightly increased accuracy for our predictor models.
 * 7) From here we tested many different analysis models. We validated each model using the available options in order to find the most accurate one. We measured accuracy using the Analysis node and lift charts.
 * 8) We concluded that the following models were the most useful for our analysis.
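Steps 3 to 5 above (class counts, oversampling, and the 60/40 partition) can be approximated outside Clementine. The sketch below is a rough stand-in written in plain Python, not the Balance or Partition node itself: the helper names and the `caravan` field name are hypothetical, and the replication rule is an assumption about how a fractional factor could be applied.

```python
import random

N_NON_CARAVAN = 5474  # non-caravan records reported in step 3
N_CARAVAN = 348       # caravan records reported in step 3

# Two ratios that can be derived from the step 3 counts.
ratio_to_majority = N_NON_CARAVAN / N_CARAVAN            # about 15.73
ratio_to_total = (N_NON_CARAVAN + N_CARAVAN) / N_CARAVAN  # about 16.73


def oversample_minority(records, label_key, factor, seed=0):
    """Replicate records whose flag is 1 about `factor` times.

    The whole part of the factor gives guaranteed copies; the fractional
    part is the probability of one extra copy (an assumed rule).
    """
    rng = random.Random(seed)
    whole = int(factor)
    fraction = factor - whole
    balanced = []
    for record in records:
        if record[label_key] == 1:
            copies = whole + (1 if rng.random() < fraction else 0)
            balanced.extend([record] * copies)
        else:
            balanced.append(record)
    return balanced


def partition(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into train and test."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records that carry only the flag field.
records = [{"caravan": 0}] * N_NON_CARAVAN + [{"caravan": 1}] * N_CARAVAN
balanced = oversample_minority(records, "caravan", factor=16.73)
train, test = partition(balanced, train_fraction=0.6)
```

The multiplier of 16.73 matches the ratio of all records to caravan records, while a factor of about 15.73 would make the two classes exactly equal in size. When duplicates are created before the split, copies of the same record can land in both partitions, so partitioning first and oversampling only the training portion is a common alternative worth considering.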

C5.0 Model
This model indicates the following with 94.63% accuracy. The most influential predictor of whether a set of customers will have caravan insurance is whether or not they have car insurance. Groups that have car insurance are 68% likely to have caravan insurance.

The graph below indicates the predictive importance of each variable. In summary, it shows that, in addition to having car insurance, the important predictors are whether the group of customers is single, has income below 30K (the lowest income level), is not religious, and belongs to social class D (the lowest social class).

Interpreting this model shows that all of the predictive variables following car insurance are indicative of a lower socio-economic class. This is very interesting, as it allows us to identify the type of customer that has caravan insurance and therefore the type of customer that would want caravan insurance in the future.

In addition, we can use the predictive formula produced by the C5.0 model to identify specifically whether a group of customers will or will not have caravan insurance. This formula can be applied to each record’s variable information, and a yes/no answer for caravan insurance can be determined with 94.54% accuracy. This is useful for cost-effective targeted advertising.
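When reading accuracy figures on a data set where one class is rare, it can help to compare them against a majority-class baseline and to look at precision, recall, and lift as well. The sketch below is a minimal illustration under stated assumptions: the class counts are the 5474 and 348 records reported during model development, while the confusion-matrix counts passed to `classification_summary` are hypothetical and are not results from our models.

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy of always predicting the larger class."""
    total = n_negative + n_positive
    return max(n_negative, n_positive) / total


def classification_summary(tp, fp, tn, fn):
    """Accuracy, precision, recall, and lift from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Share of actual positives in the evaluated data.
    base_rate = (tp + fn) / total
    # Lift: how much better the targeted group is than random targeting.
    lift = precision / base_rate if base_rate else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "lift": lift,
    }


# Baseline on the original, unbalanced class counts.
baseline = majority_baseline_accuracy(5474, 348)

# HYPOTHETICAL confusion-matrix counts, for illustration only.
summary = classification_summary(tp=60, fp=140, tn=2050, fn=79)
```

On the original class counts the baseline is about 94%, so accuracy measured on unbalanced data would need to be read against that figure. Accuracy measured on an oversampled (balanced) sample is not directly comparable to it, which is one reason lift charts are a useful complement for targeted advertising decisions.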

Neural Net and Logistic Regression Models
Two alternative models support our conclusion and recommendation: the Neural Net model and the Logistic Regression model. Both showed accuracy rates around 73%, and both produced unexpected results during analysis. Compared with our most relevant model, C5.0, these models did not show the same consistency in the data (compare the variable importance in table 1 and table 2). As we analyzed these two models and compared them with our most accurate model and the others, we came to understand that the caravan insurance company has a very distinct target market as well as a less obvious secondary market.

This secondary market is characterized by the following variables, pulled from variable importance table 2: income 75-125k, number of boat policies, and social class A. Seeing these variables, we assume that the secondary market is made up of higher income people who own caravans for recreation. They possibly use them alongside their boats and other recreational items. Notice that this finding is fully consistent with the hypotheses that we made even before mining the data.



K-Means Model
Finally, we ran a K-Means model. As seen in the accompanying results, this model had poor cluster quality.

The main limitation of our conclusions is that we have only a superficial view of the secondary market. We have not been able to pull out its most important factors. Some variables that matter for recreational users could also matter for the primary target market, and we cannot tell whether they are important to the primary or the secondary market. Take the variable rent, for example. We assume that it is important for our primary target, low-income customers who can only afford to rent, but it could also reflect customers who rent a place seasonally or all year for recreational activities. To resolve this, we could split the data to focus on what we assume is important for each target and gain a more specific understanding, but we would then run the risk of manipulating the data to tell us what we want it to tell us.




**PROJECT CONCLUSION**

The group 3MPS searched for a data set in response to the challenge of applying all of the key learnings from the spring 2009 semester at Atkinson Graduate School of Management. We found a data mining challenge, concluded in 2000, that dealt with cross-selling caravan insurance. We evaluated, understood, and cleaned the data set, formed a hypothesis, visualized the data, and set forth a project plan. With a sound understanding of the data, we used data mining software to analyze it with various techniques and processes. The models helped us conclude that there is a primary target market consisting of people who purchase caravan insurance because the caravan is possibly their home. Additionally, we found a secondary target market consisting of people who purchase caravan insurance to insure their recreational caravan.

3MPS consists of 5 data mining students, David Mackley, Connor Malone, Chad Meynders, Howie Pham, and Nick Stuart, who participated in the Data Mining course GSM 672 held at Atkinson Graduate School of Management, a part of Willamette University in Salem, Oregon, U.S.A. As a final project to combine the skills learned throughout the course, each group of students, 3MPS being only one of many, searched its networks and the internet for data sets to which the new skills could be applied.

3MPS concluded its search after finding a data mining challenge held by an insurance company in Europe. The company was having difficulty cross-selling caravan insurance and wanted more information on the subject. In 2000, it put on a challenge called the CoIL Challenge 2000. The challenge was originally tackled by 43 entrants, who approached it in a variety of ways. Our goal was to build on the information gained by the original entrants and use more advanced techniques to reach a high accuracy rate that the European insurance company could use in its efforts to cross-sell caravan insurance.

3MPS evaluated the data set and determined that it was adequate for the analysis requirements set by the course. We began with a superficial analysis in an attempt to understand the intricacies of the data set. Once we realized that the data set actually represented hundreds of thousands of insurance policy owners, the data structure became clear. We used standard cleaning techniques and found that most of the data was usable and that there were no missing values. Reflecting on our knowledge at that point, we formed a hypothesis that boat insurance would be highly relevant in predicting caravan insurance. To get a more in-depth understanding of what we were looking at, we then used visualization tools such as ManyEyes, Google Trends, and others. This allowed us to see superficial patterns and helped us validate our hypothesis. We concluded this stage of our analysis by creating a plan for taking an in-depth view of the data using data mining tools. We planned to use many different models, ranging from regression models to neural network models. Following the modeling period, we analyzed our findings to discover whether our initial hypothesis proved correct.

We used data mining software to analyze the data with various techniques and processes. The C5.0 model proved the most accurate, with more than 90% accuracy. This model showed that individuals with low social class, low income, and zero or one car purchase caravan insurance. Through comparison with our other models, we also found that this group could be assumed to live in a caravan. We also found a small group who use caravans for recreation. These people have high income and high social class. Our conclusion is that there is a primary target market consisting of people who purchase caravan insurance because the caravan is possibly their home, and a secondary target market consisting of people who purchase caravan insurance to insure their recreational caravan.

Our actionable management recommendation can be summed up as follows:
 * First, identify customers that have car insurance, as this is by far the most important predictor.
 * Then, identify all of those customers that fit into either the lower class with a big family or the high economic class with recreational interests.
 * Next, based on which group they fit into, target those customers with specific custom advertising.
 * In addition, management can use the predictive formula of the C5.0 model produced by SPSS Clementine to identify specifically whether a group of customers will or will not have caravan insurance. This formula can be applied to each record's variable information, and a yes/no answer for caravan insurance can be determined with up to 94.54% accuracy. This is useful for cost-effective targeted advertising.

**WORK CITATION**

[1] "Cross-selling." __Wikipedia: The Free Encyclopedia__. May 13 2009.
[2] Anthony O'Donnell, "The Elusive Prize: Effective Cross-selling," //Insurance & Technology//, Sep. 16, 2005.
[3] "Travel Trailer." __Wikipedia: The Free Encyclopedia__. May 13 2009.
[4] Leslie Hamp, "Five Secrets in Reaching Target Customer Through Media," //Article Base//, Oct. 10, 2007 (http://www.articlesbase.com/marketing-articles/5-secrets-to-reaching-target-customers-through-the-media-231723.html).
[5] Seven things you should know about Data Visualization.


**REFERENCE MATERIAL**

 * Charles Ling and Chenghui Li. Data Mining for Direct Marketing: Problem and Solution.
 * Charles Elkan. Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000. http://wwwcse.ucsd.edu/users/elkan/kddcoil.pdf
 * Frank Brown. "Lean Innovation." IndustryWeek.com, May 13 2009.
 * Glenn J. Myatt, //Making Sense of Data - A Practical Guide to Exploratory Data Analysis and Data Mining// (New Jersey: John Wiley & Sons, Inc.), 2007.
 * Seth Grimes. "Data Mining for the Masses." Intelligent Enterprise, June 12 2004.
 * Tamraparni Dasu and Theodore Johnson, //Exploratory Data Mining and Data Cleaning// (New Jersey: John Wiley & Sons, Inc.), 2003.
 * Ai Cheo Yeo and Kate Smith, //Implementing Data Mining Solution for an Automobile Insurance Company// (Idea Group Publishing), 2002.
 * Galit Shmueli, Nitin R. Patel and Peter C. Bruce, //Data Mining for Business Intelligence// (New Jersey: John Wiley & Sons, Inc.), 2007.