Oregon Healthcare Insurance Predictive Model Final

 Individual Health Insurance Coverage Analysis within the State of Oregon



 Prepared By: Gibran Braithwaite, Adelaida Patrasc, Wenxian Wu  May 3rd, 2011 AGSM 672: Data Mining

//**Executive Summary**//

This project’s purpose is to develop a model that can be used to predict whether an individual in the State of Oregon possesses private health insurance. This comes as a response to the health care system’s need to increase its percentage of privately insured patients, a need we identified through primary and secondary research. The following pages provide a more detailed introduction to the context of the health insurance system, especially the types of insurance that exist in the market and the importance of identifying privately insured patients. They also present the research steps that were undertaken to develop a model that predicts whether an Oregonian patient is privately insured. These steps include hypothesising, data collection, business understanding, data understanding, data preparation, modelling, model evaluation, deployment and the conclusion. In order to test the hypothesis that a certain set of demographic characteristics can predict whether a person would or would not have private health insurance in the future, a secondary source of data was identified: the 2009 American Community Survey 1-year PUMS Population File. The size of this data set caused very low speeds when running streams in SPSS Modeler, which is why data preparation focused not only on managing outliers and extreme observations but also on ways to reduce the data set. For this reason, two separate stream files were created, the Feature Select node was employed to reduce the number of variables, and the original Excel file was replaced in the second stream by a smaller version of the data set. Ultimately, the key criterion for selecting the models was their capability to employ weighted data, a critical factor in making the results obtained by mining the data set relevant for describing the general population.
Due to this criterion, the modelling process was based on the CHAID tree model, the C&R Tree (CART) model, and the C5 algorithm. The Merge and Ensemble nodes were then used to resolve cases where the models predicted conflicting results. This allows decision makers in the health system to benefit from the combined predictive power of the three models, which is higher than that of any single model alone. Furthermore, a cluster analysis was performed based on the key predictors identified in the predictive model. This revealed four clusters: a high income segment with high occupation levels, a mid income segment with high occupation levels, a low occupation segment with widely spread income levels, and a low income segment with mid to high occupation levels. When connected with secondary data such as maps of the spread of income levels within the city of Portland, output from the model can be used to strategically plan future hospital locations. For instance, the western side of Portland would be an ideal place to locate a hospital that will be surrounded by a high number of privately insured patients. Another example of deployment is helping decision makers choose the locations where outdoor advertising could have the highest audience reach for the privately insured segment. To conclude, there is a variety of ways in which the predictive and classification models developed for this project can ultimately improve decision-making and increase revenues in healthcare systems.

 //**Introduction**//

The objectives of this project are to classify whether an individual carries private health insurance and to identify demographic characteristics that could be used for patient segmentation and targeting purposes. This would help decision makers in the health care system increase the percentage of privately insured patients in their customer base, leading to an increase in revenues that can compensate for the negative impact of lower reimbursement rates from uninsured and publicly insured patients. Hospitals typically have three types of patients: 1) Privately Insured Patients, 2) Uninsured Patients and 3) Medicare/Medicaid Patients and other Publicly Insured Patients. These three patient categories carry different payment levels per service performed (reimbursement rates). Medicare/Medicaid Patient reimbursement rates are typically insufficient to cover the cost of the services being performed. According to the CNNMoney.com article ‘Doctor’s Threaten Medicare Backlash’, one surgeon had a 20% Medicare Patient load that accounted for only 5% of his total income. Uninsured Patients typically cannot afford to pay their medical bills. Privately Insured Patients are the only group of patients that provide high enough reimbursement rates to cover the cost of their care. Hospital Systems use their volume of Privately Insured Patients to compensate for the low reimbursement rates of their Publicly Insured and Uninsured Patients. This project will also allow us to address the objectives of this course, specifically by attaining business and data understanding and employing core methodologies of data mining for 1) data cleaning, 2) finding reliable measurement indicators and key factors, 3) detecting patterns and outliers or anomalies, 4) classifying, segmenting and finding clusters, and 5) forecasting and predicting.

//**Hypothesis**//

H1: There is a certain set of demographic characteristics that can predict whether a person would or would not have private health insurance in the future.

The report ‘Income, Poverty, and Health Insurance Coverage in the United States: 2009’ presents absolute values and percentages of the US population for each of the three types of insurance plans and by demographic characteristics. “Between 2008 and 2009, the percentage of people covered by private health insurance decreased from 66.7 percent to 63.9 percent [...]. The percentage of people covered by employment-based health insurance decreased to 55.8 percent in 2009, from 58.5 percent in 2008.” This makes it even more important for hospitals to identify and attract a larger percentage of patients that are privately insured.

In their study representing all civilian noninstitutionalized nonelderly families in the U.S. in 2006, Bernard & Bathin (2009) found that “total expenditures on health care services were highest among families with public coverage and lowest among uninsured families. Mean total expenditures were $8,831 among families with public insurance, $6,785 among families with private insurance, and $1,425 among uninsured families.” However, reimbursement rates do not have the same structure as the expenditures by type of insurance plan.
The same study also indicated that families with private coverage have higher out-of-pocket health care services expenditures than those with public insurance or those uninsured, which might support the view that overall revenues at the hospital level are higher for privately insured patients than for publicly insured or uninsured ones.

//**Data Collection Procedure**//

1. The 2009 American Community Survey 1-year PUMS Population File was downloaded from the www.data.gov website. The folder contains two .csv files and PDF instructions.
2. Reviewed the instructions and located the PUMS Data Dictionary at http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict09.pdf
3. Found the predictor describing the State of Oregon in the ST variable. The State of Oregon is represented by the number 41.
4. Created a new Excel file from the filtered data set.
5. Ran a Feature Select Node on the Oregon Filtered Data Set.
6. Created a new Excel file from the results of the Feature Select.

//Note: the default settings were used when running the Feature Select Node.//

//**Business Understanding**//

Hospital systems group their patients into three broad categories:
1. Medicare/Medicaid Patients
2. Privately Insured Patients
3. Uninsured Patients

Traditionally, Medicare/Medicaid reimbursement rates have been lower than the cost of the services being performed. Additionally, uninsured patients have not historically paid for their medical expenses. As a result, hospital systems have become very sensitive to their mix of Medicare/Medicaid, privately insured and uninsured patients. It is hoped that a data mining model can help hospital systems better focus their marketing efforts on individuals who are most likely to have private health insurance. If this is successfully executed, hospital systems will have more insight into monitoring their patient mix and may even gain further insight into which areas of health care they would like to specialize in.
//**Data Understanding**//

Due to the large number of predictors in the data set (approximately 270 initially), it was impractical to develop a detailed understanding of all of them. Therefore the Feature Select Node was executed in SPSS Modeler to remove predictors uncorrelated with the PRIVCOV target variable. This process reduced the number of predictors to 222. The variable meanings were then looked up in the data dictionary to ensure that an adequate understanding had been obtained.
//**Data Preparation (with Variable Control within the Purpose)**//

Variable control was mainly used for data reduction purposes. Our original data set was the Public Use Microdata Sample (PUMS), which is a sample of the American Community Survey (ACS) and contains data describing approximately one percent of the US population. “The PUMS dataset includes variables for nearly every question on the survey, as well as many new variables that were derived after the fact from multiple survey responses (such as poverty status).”

Considering the large number of variables and units from which data was collected (both person and household level) that were included in the original data set, we first used the State Code variable (ST) to reduce the data set to one that includes only information on individuals from the State of Oregon. To do so we used a Select Node with the value 41, which is presented in the PUMS Data Dictionary as representing Oregon.
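The Select Node step can be sketched outside SPSS Modeler as follows. This is only a minimal Python illustration, not the stream we ran: the function name `select_state` and the three-column sample are hypothetical, and only the ST variable and the Oregon code 41 come from the data dictionary.

```python
import csv
import io

OREGON_ST_CODE = 41  # ST value for Oregon in the PUMS Data Dictionary


def select_state(csv_text, state_code=OREGON_ST_CODE):
    """Return the rows (as dicts) whose ST column equals state_code.

    Hypothetical helper standing in for the SPSS Modeler Select Node.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["ST"]) == state_code]


# Tiny made-up extract; the real PUMS file has far more rows and columns.
sample = "SERIALNO,ST,PRIVCOV\n1,41,1\n2,06,2\n3,41,2\n"
oregon_rows = select_state(sample)
```

Comparing on the integer value rather than the raw text means a zero-padded code such as `06` is still read as state 6, so only the two rows coded 41 are kept.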
While this reduced the size of our data set, the number of variables was still too large to be easily managed, especially in terms of the speed of running a stream that makes use of the data in SPSS Modeler. We then used a Feature Select Node to compile a list of variables that were highly correlated with our variable of interest.

In order to increase the speed of running the stream, we decided to create a separate data set derived from the original data set (the one used in Healthinsurance.str). The reduced data set consisted of the observations that pertained to the State of Oregon and the predictor variables that were highly correlated with the target variable.

The dependent (target) variable is PRIVCOV (Private health insurance coverage recode), coded as 1 if a person has private health insurance coverage and as 2 if a person does not have any private health insurance coverage.

The independent (predictor) variables all represent demographic characteristics that were included in the American Community Survey (ACS). The initial number of predictor variables was 276. After conducting data reduction, the number of variables decreased to 230. Further variable reduction was carried out during the Business Understanding stages of the CRISP-DM process outlined below.

Once an understanding of the reduced data had been attained, redundant variables were filtered out of the data set. Next, appropriate measurement types (continuous, flags etc.) were selected. Outliers were then cut from the data set and null values were replaced with zeroes. The data was then checked against the 10-to-1 observation-to-predictor ratio (at least 10 observations per predictor) before proceeding to analyze the data.

The data set was then partitioned into Training (50%), Testing (30%) and Validation (20%) sets in preparation for analysis.
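The 50/30/20 partition can be illustrated with a short Python sketch. It is an assumption-laden stand-in for the Partition node: the function name `partition` and the fixed seed are hypothetical, and the split here uses exact sizes after shuffling, which may differ from how SPSS Modeler assigns records.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle records and split them into training, testing and validation lists.

    Hypothetical helper; the fractions mirror the 50/30/20 split in the text.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_test = round(n * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# 1,000 stand-in record ids instead of real survey rows.
train, test, validation = partition(range(1000))
```

Every record lands in exactly one of the three sets, which is the property that matters when the validation accuracy is later used to judge the models.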
A new, smaller stream was created in the same file in order to prepare the data once again for increased speed and accuracy of the models to be used next. This was accomplished by using the Feature Select node, which led to the creation of a new, smaller Excel file, ‘Oregon Final Prepared Dataset’. This new data set was then used in the stream file ‘Healthinsurance (Selected)’.

In this second stream file the data was balanced so that the models built on the training set are representative of the entire sample population under analysis. The census data was provided together with weights that allow us to generalize the results from the sample in which demographic data was collected to the entire US population. This is why using models that employ weighted measurements (the CHAID tree model, the C&R Tree (CART) model and the C5 algorithm) was critical in making the results of this study representative. Even though the model is supposed to make predictions for the Oregon population, using the national-level weight would still improve the accuracy of the results.
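To show why the weights matter, the sketch below compares a weighted and an unweighted share of privately insured people. It is illustrative only: the records are invented, the function name `weighted_share` is hypothetical, and we assume the weight column is PWGTP (the ACS PUMS person weight), which this report does not name explicitly.

```python
def weighted_share(records, weight_key="PWGTP", target_key="PRIVCOV", target_value=1):
    """Share of the weighted population whose target_key equals target_value.

    Hypothetical helper; PWGTP as the weight field is an assumption.
    """
    total = sum(r[weight_key] for r in records)
    covered = sum(r[weight_key] for r in records if r[target_key] == target_value)
    return covered / total


# Invented sample: each respondent stands for PWGTP people in the population.
sample = [
    {"PRIVCOV": 1, "PWGTP": 100},
    {"PRIVCOV": 2, "PWGTP": 50},
    {"PRIVCOV": 1, "PWGTP": 50},
]
share = weighted_share(sample)
unweighted = sum(r["PRIVCOV"] == 1 for r in sample) / len(sample)
```

In this toy sample two of three respondents are privately insured, but they represent 150 of the 200 weighted people, so the two shares differ. A model that ignores the weight field would be fitted to the first figure rather than the second.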
//**Modeling**//

__Select Classifier Model__

The following steps were taken to find the best possible model(s) to classify private insurance carriers.

1) We tried to use the Auto Classifier Node to run all 10 classifier models together, but the system kept returning an error. We assumed the reasons could be that the system was not able to handle such a large dataset, or that some of the models were not applicable to this dataset.
2) We tried to run the first five models within the Auto Classifier Node, and the results of three models were generated, including C5 Tree, Logistic Regression, and Discriminant Analysis. When we ran the last five models together, there was still an error.

3) We therefore ran each model individually. One of the most important model selection criteria was whether the model had a “weight field,” because we assumed that without the “weight field,” the model could not serve our purpose of analyzing survey data.
4) We selected three possible models for classification purposes, CHAID, C&R Tree, and C5 Tree, based on the models’ compatibility with a “weight field”.
5) To compensate for losing the benefit of the Auto Classifier, we merged the three selected models and used an “Ensemble” node to combine them for prediction.
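The idea behind combining the three models in step 5 can be sketched as a simple majority vote. This is an assumption for illustration: the report does not state which combining rule the Ensemble node was configured with, and the function name `majority_vote` and the predictions below are made up.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one record's model predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-record predictions (1 = private coverage, 2 = none),
# listed in the order CHAID, C&R Tree, C5.
per_record = [(1, 1, 2), (2, 2, 2), (2, 1, 1)]
combined = [majority_vote(p) for p in per_record]
```

With three models and two classes a tie cannot occur, so every conflicting record is resolved in favour of the label two models agree on.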

__Optimize Selected Model__

1) To reduce the number of branches, we reduced the tree levels to 3 and set the minimum records to 1,000 in the child branch and 1,001 in the parent branch.
2) We also factored in the differences in misclassification costs, as it would cost more if actual uninsured patients were misclassified as insured than if actual insured patients were misclassified as uninsured. By changing the input in the upper-right field of the misclassification costs table from 1.0 to 2.0, the model accuracy was improved.
//**Evaluation**//

All the selected models had an overall accuracy above 80% on the validation data. The lift charts suggested that the model performance was good.
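For readers unfamiliar with the two measures, the sketch below computes overall accuracy and the lift in the top-scored fraction of records. The labels and scores are invented for illustration and are not our validation results; the function names `accuracy` and `lift_at` are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted label matches the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def lift_at(actual, scores, fraction, positive=1):
    """Hit rate among the top-scored fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(a == positive for _, a in ranked[:top_n]) / top_n
    base_rate = sum(a == positive for a in actual) / len(actual)
    return top_rate / base_rate


# Invented data: 1 = private coverage, 2 = none; scores are model confidences for 1.
actual = [1, 1, 2, 1, 2, 2, 1, 2, 2, 2]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
predicted = [1 if s >= 0.55 else 2 for s in scores]

acc = accuracy(actual, predicted)
lift_top_20 = lift_at(actual, scores, 0.2)
```

A lift above 1 in the top segment means the model concentrates privately insured individuals among its highest-scored records, which is what a lift chart displays across all segments.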

__Results of the Classifier Model__

Based on the predictor importance ranking generated by the three selected models, the most important predictors were: POVPIP (Person Poverty Status Recode), WAGP (Wages or salary income past 12 months), and OCCP (Occupation).

__Selected Cluster Model__

We chose the TwoStep model because it had the smallest number of clusters; 4 clusters were enough to serve our clustering purpose.

__Cluster Results__

The TwoStep model generated four clusters, with fair cluster quality.

//**Deployment**//

The model will be used to predict whether or not individuals within Portland carry private health insurance. Additionally, the cluster analysis will be used to group individuals into four segments. It is hoped that, with additional research, hospital systems can better tailor their services to meet the needs of these four distinct types of customers. Insight derived from the model shows that individuals' poverty level, occupation and wages during the past year are the strongest predictors.

Given this information, output from the model can then be used to strategically plan future hospital locations within the city of Portland. The map below shows that there are clusters of low poverty levels along the entire west side of Portland. The west side would therefore be an ideal place to locate a hospital, since it would be surrounded by a high number of privately insured patients.

Further insights into the demographic makeup of individuals who have health insurance can be obtained by performing a cluster analysis. The privately insured individuals were split into four clusters. A three-dimensional plot of the four clusters is shown below. Note that the WAGP variable was transformed during the data preparation step in order to standardize the data. Transforming the data back into its original format shows that the mean salary is $65,500, with a maximum of $131,000. The additional graphs below were used to analyze the four clusters.
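The report does not state which transformation was applied to WAGP. Purely as an illustration of transforming values back to dollars, the sketch below assumes a min-max rescaling to the range 0 to 1, with an assumed minimum of $0 and the reported maximum of $131,000. The function names and constants are hypothetical.

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Rescale a value from the range [lo, hi] to [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


def min_max_unscale(scaled: float, lo: float, hi: float) -> float:
    """Invert min_max_scale: map a value in [0, 1] back to the original units."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return scaled * (hi - lo) + lo


WAGP_MIN = 0.0        # assumption: lowest wage value in the prepared data
WAGP_MAX = 131_000.0  # maximum salary reported in the text

# Under these assumptions, a scaled value of 0.5 corresponds to $65,500.
print(min_max_unscale(0.5, WAGP_MIN, WAGP_MAX))
```

If a different standardization was used (for example, z-scores), the inverse would instead multiply by the standard deviation and add the mean. The example only shows the general idea of reading cluster plots in original units, and does not imply that the mean of the real data sits at the midpoint of its range.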


**Cluster 1**

This cluster contains individuals who have low Occupation Codes and earn $65,500 or more annually. The lower Occupation Codes refer to management positions. The higher salaries of individuals in this cluster likely mean that they have insurance through their employer as well as high levels of discretionary income with which to pay their medical bills. As a result, we call this cluster **Discretionary Income Managers**.

**Cluster 2**

This cluster contains individuals who earn approximately $65,500 annually or less, who are in management or otherwise perform some type of office work, and who fall in the middle range of the poverty status values. Individuals in this cluster likely have insurance through their employers but may not have the high level of discretionary income of those in cluster 1 to pay an excessive medical bill. We have named the individuals in cluster 2 **Mid-class Managers and Office Professionals**.

**Cluster 3**

This cluster contains individuals who make roughly the same amount of money as those in cluster 2. However, the Occupation Codes show individuals who work in construction, agriculture and other labor-intensive industries. Due to the physical nature of these jobs, individuals in this cluster may be at a higher risk of job-related injuries. Individuals in this cluster are called **Experienced Laborers**.

**Cluster 4**

This cluster contains individuals who have an income below $65,500 per year and fall very low on the poverty scale. These individuals will likely opt out of insurance coverage through their employer because it may be too expensive. However, they may have some other type of low-cost private insurance plan. This cluster is called **Bargain Insured Patients**.
While many insights can be gained from the current model, additional research into locating pockets of individuals belonging to these clusters can help identify locations in Portland where hospitals will be most successful in attracting high volumes of commercially insured patients. In addition, demographic information can be gathered and run through the predictive model to determine neighborhoods in Portland where there are clusters of individuals who carry private insurance.
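One way the neighborhood-level step described above could be approached is sketched below: once each record has a predicted insurance flag, the survey-weighted share of predicted privately insured individuals is computed per neighborhood and the neighborhoods are ranked. The record layout, the neighborhood labels and the weights are hypothetical; the actual project was built in SPSS Modeler.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (neighborhood, survey weight, predicted flag: 1 = privately insured, 0 = not)
Record = Tuple[str, float, int]


def weighted_private_share(records: Iterable[Record]) -> Dict[str, float]:
    """Survey-weighted share of predicted privately insured individuals per neighborhood."""
    total = defaultdict(float)
    insured = defaultdict(float)
    for area, weight, flag in records:
        if weight < 0:
            raise ValueError("survey weights must be non-negative")
        total[area] += weight
        insured[area] += weight * flag
    return {area: insured[area] / total[area] for area in total if total[area] > 0}


def rank_areas(shares: Dict[str, float]) -> List[str]:
    """Neighborhoods ordered from highest to lowest predicted private-insurance share."""
    return sorted(shares, key=shares.get, reverse=True)


# Hypothetical scored records; the weights stand in for survey person weights.
records = [
    ("Area A", 20.0, 1),
    ("Area A", 10.0, 1),
    ("Area A", 10.0, 0),
    ("Area B", 30.0, 0),
    ("Area B", 10.0, 1),
]

shares = weighted_private_share(records)
print(shares)
print(rank_areas(shares))
```

Weighting each record matters here for the same reason it mattered in model selection: unweighted counts describe the sample, while weighted shares describe the general population.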
//**Conclusion**//

While the deployment section of this project has focused on using the classification model to determine geographic areas in Portland with a high concentration of privately insured individuals, the project can be used in two main ways, corresponding to the two types of models: predictive and classification.

The predictive model can mainly be used to determine whether individual patients are privately insured, based on their demographics.

The classification model and the clusters identified through it can then add considerable value to decision making by suggesting locations for new hospital units, identifying locations where outdoor advertising could have the highest audience reach within the privately insured segment, and helping to identify companies that target the same segments, with which hospitals could potentially partner for promotional or sales purposes.

//**References**//


 * 1) Bernard D., Banthin J. Family Level Expenditures on Health Care and Insurance Premiums among the Nonelderly Population, 2006. Research Findings No. 29. March 2009. Agency for Healthcare Research and Quality, Rockville, MD. http://www.meps.ahrq.gov/mepsweb/data_files/publications/rf29/rf29.pdf
 * 2) Kavilanz P. Doctors threaten Medicare backlash, 2010. http://money.cnn.com/2010/02/24/news/economy/doctors_ditching_medicare_patients/index.htm
 * 3) http://www.census.gov/prod/2010pubs/p60-238.pdf

// **Files** //