Oregon Healthcare Insurance Predictive Model

Individual Health Insurance Coverage Analysis within the State of Oregon



Prepared By: Gibran Braithwaite, Adelaida Patrasc, Wenxian Wu
April 12th, 2011
AGSM 672: Data Mining

Introduction
The purposes of this project are to classify whether an individual carries private health insurance and to identify demographic characteristics that could be used for patient segmentation and targeting. This would help decision makers in the health care system increase the percentage of privately insured patients in their customer base, raising revenues to offset the lower reimbursement rates from uninsured and publicly insured patients.

Hospitals typically have three types of patients: 1) Privately Insured Patients, 2) Uninsured Patients, and 3) Medicare/Medicaid Patients and other Publicly Insured Patients. These three categories carry different payment levels per service performed (reimbursement rates). Medicare/Medicaid reimbursement rates are typically insufficient to cover the cost of the services being performed. According to the CNNMoney.com article ‘Doctors Threaten Medicare Backlash’, one surgeon had a 20% Medicare patient load that accounted for only 5% of his total income. Uninsured Patients typically cannot afford to pay their medical bills. Privately Insured Patients are the only group whose reimbursement rates are high enough to cover the cost of their care. Hospital systems use their volume of Privately Insured Patients to compensate for the low reimbursement rates of their Publicly Insured and Uninsured Patients.

This project will also allow us to address the objectives of this course, specifically by attaining business and data understanding and employing core methodologies of data mining for 1) data cleaning, 2) finding reliable measurement indicators and key factors, 3) detecting patterns and outliers or anomalies, 4) classifying, segmenting and finding clusters, and 5) forecasting and predicting.

Variable Control within the Purpose
Variable control was mainly used for data reduction purposes. Our original data set was the Public Use Microdata Sample (PUMS), which is a sample of the American Community Survey (ACS) and contains data describing approximately one percent of the US population. “The PUMS dataset includes variables for nearly every question on the survey, as well as many new variables that were derived after the fact from multiple survey responses (such as poverty status).”

Because the original data set included a large number of variables and units of observation (both person and household level), we first used the State Code variable (ST) to reduce the data set to individuals from the State of Oregon. To do so we used a Select Node for the value 41, which the PUMS Data Dictionary lists as representing Oregon.
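
The state filter described above can be sketched outside SPSS Modeler as follows. This is a minimal illustration using only the Python standard library, not the Select Node itself; the sample rows and the helper name `filter_by_state` are made up for the example, and only the ST column and the value 41 come from the report.

```python
import csv
import io

OREGON_ST_CODE = 41  # value the PUMS Data Dictionary lists for Oregon


def filter_by_state(rows, state_code=OREGON_ST_CODE):
    """Yield only the records whose ST field equals the given state code."""
    for row in rows:
        value = row["ST"].strip()
        # Compare as integers so "41" and "041" are treated alike;
        # blank ST values are skipped.
        if value and int(value) == state_code:
            yield row


# Tiny made-up sample standing in for the real PUMS person file.
sample_csv = io.StringIO(
    "SERIALNO,ST,PRIVCOV\n"
    "1,41,1\n"
    "2,06,2\n"
    "3,41,2\n"
)
oregon_rows = list(filter_by_state(csv.DictReader(sample_csv)))
print(len(oregon_rows), "Oregon records kept")
```

Reading the file row by row keeps memory use low, which matters for a file the size of the national PUMS sample.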

While this reduced the size of our data set, the number of variables was still too large to manage easily, especially in terms of the speed of running a stream on the data in SPSS Modeler. We therefore used a Feature Select Node to compile a list of variables that were highly correlated with our variable of interest. To increase the speed of running the stream used in our final project, we decided to create a separate data set derived from the original data set (the one used in Healthinsurance.str). The reduced data set consisted of the observations that pertained only to the State of Oregon and the predictor variables that were highly correlated with the target variable.

The target (dependent) variable is PRIVCOV (Private health insurance coverage recode). It is coded as 1 if a person has private health insurance coverage and as 2 if a person does not have any private health insurance coverage.
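
For modelling outside SPSS Modeler, the 1/2 coding is often easier to handle as a true/false flag. The sketch below shows one way to do that recode; the function name `has_private_coverage` is hypothetical, and only the two codes described above are assumed to be valid.

```python
def has_private_coverage(privcov):
    """Map the PRIVCOV code (1 = with coverage, 2 = without) to a boolean."""
    code = int(privcov)
    if code == 1:
        return True
    if code == 2:
        return False
    # Any other value is not described in the report, so fail loudly.
    raise ValueError(f"unexpected PRIVCOV code: {privcov!r}")


# Made-up example values as they might be read from a CSV file.
flags = [has_private_coverage(code) for code in ["1", "2", "2", "1"]]
print(flags)
```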

The predictor (independent) variables all represent demographic characteristics that were included in the American Community Survey (ACS). The initial number of predictor variables was 276. After conducting data reduction, the number of variables decreased to 230. Further variable reduction was carried out during the Business Understanding stages of the CRISP-DM process outlined below.

Hypothesis
H1: There is a certain set of demographic characteristics that can predict whether a person would or would not have private health insurance in the future.

The report ‘Income, Poverty, and Health Insurance Coverage in the United States: 2009’ presents absolute values and percentages of the US population for each of the three types of insurance plans, also broken down by demographic characteristics. “Between 2008 and 2009, the percentage of people covered by private health insurance decreased from 66.7 percent to 63.9 percent [...]. The percentage of people covered by employment-based health insurance decreased to 55.8 percent in 2009, from 58.5 percent in 2008.” This will make it even more important for hospitals to identify and attract a larger percentage of patients that are privately insured.

In their study representing all civilian noninstitutionalized nonelderly families in the U.S. in 2006, Bernard & Bathin (2009) found that “total expenditures on health care services were highest among families with public coverage and lowest among uninsured families. Mean total expenditures were $8,831 among families with public insurance, $6,785 among families with private insurance, and $1,425 among uninsured families.” However, reimbursement rates do not follow the same structure as expenditures by type of insurance plan. The same study also indicated that families with private coverage have higher out-of-pocket health care expenditures than those with public insurance or no insurance, which might support the view that overall hospital revenues are higher for privately insured patients than for publicly insured or uninsured ones.

**Data Collection Procedure**

1. The 2009 American Community Survey 1-year PUMS Population File was downloaded from the www.data.gov website. The folder contains two .csv files and PDF instructions.

2. Reviewed the instructions and located the PUMS Data Dictionary.

3. Found the predictor describing the State of Oregon in the ST variable. The State of Oregon is represented by the number 41.

4. Created a new Excel file from the filtered data set.

5. Ran a Feature Select Node on the Oregon Filtered Data Set.

6. Created a new Excel file from the results of the Feature Select.

//Note: the default settings were used when running the Feature Select Node.//

**Data Interpretation**

In order to understand the data being presented within the data set and develop an accurate predictive model, the CRISP-DM process was employed.

**Business Understanding**

Hospital systems group their patients into three broad categories:

1. Medicare/Medicaid Patients
2. Privately Insured Patients
3. Uninsured Patients

Traditionally, Medicare/Medicaid reimbursement rates have been lower than the cost of the services being performed. Additionally, uninsured patients have not historically paid for their medical expenses. As a result, hospital systems have become very sensitive to their mix of Medicare/Medicaid, privately insured and uninsured patients. It is hoped that a data mining model can help hospital systems better focus their marketing efforts on individuals who are most likely to have private health insurance. If this is successfully executed, hospital systems will have more insight into monitoring their patient mix and may even gain further insight into which areas of health care they would like to specialize in.

Data Understanding
Due to the large number of predictors in the data set (approximately 270 initially), it was impractical to develop a detailed understanding of all of them. Therefore the Feature Select Node was executed in SPSS to remove predictors that were not correlated with the PRIVCOV target variable. This process reduced the number of predictors to 222. The meaning of each remaining variable was then looked up in the data dictionary to ensure that an adequate understanding had been obtained.
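
The idea of screening predictors against the target can be illustrated with a simple correlation filter. This is a simplified stand-in and not the algorithm used by the SPSS Feature Select Node; the threshold of 0.5, the predictor names, and the sample values are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def screen_predictors(columns, target, min_abs_corr=0.5):
    """Keep predictor names whose |correlation| with the target is high enough."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]


# Made-up data: target is 1 for private coverage, 0 otherwise.
target = [1, 1, 0, 0, 1, 0]
columns = {
    "PRED_A": [5, 6, 1, 2, 7, 1],  # tracks the target closely
    "PRED_B": [3, 3, 3, 3, 3, 3],  # constant, so uninformative
    "PRED_C": [1, 2, 1, 2, 1, 2],  # only weakly related
}
kept = screen_predictors(columns, target)
print(kept)
```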

Once an understanding of the reduced data had been attained, redundant variables were filtered out of the data set. Next, appropriate measurement types (continuous, flags, etc.) were selected. Outliers were then cut from the data set and null values were replaced with zeroes. The data set was then checked against the rule of thumb of at least 10 observations per predictor before proceeding to analysis.
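
Two of the cleaning steps above, replacing nulls with zeroes and checking the 10-to-1 ratio, can be sketched as follows. The function names, the treatment of blank strings as nulls, and the sample values are assumptions for illustration; the report does not state how outliers were identified, so that step is left out.

```python
def replace_nulls_with_zero(rows):
    """Return copies of the rows with None or blank values set to 0."""
    return [
        {key: (0 if value is None or value == "" else value)
         for key, value in row.items()}
        for row in rows
    ]


def meets_observation_ratio(n_observations, n_predictors, min_ratio=10):
    """True when there are at least min_ratio observations per predictor."""
    return n_observations >= min_ratio * n_predictors


# Made-up rows with hypothetical column names.
raw_rows = [{"X1": 4, "X2": None}, {"X1": "", "X2": 7}]
clean_rows = replace_nulls_with_zero(raw_rows)
print(clean_rows)
print(meets_observation_ratio(n_observations=500, n_predictors=40))
```
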
**Data Preparation**

The data set was then partitioned into Training (50%), Testing (30%) and Validation (20%) sets in preparation for analysis.
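
A 50/30/20 partition like the one described can be sketched with a seeded shuffle. This is an illustration, not the SPSS Modeler Partition node; the seed value and the function name `partition` are arbitrary choices for the example.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=42):
    """Shuffle the records and split them into train, test and validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# Record IDs 0..99 stand in for the real observations.
train, test, validation = partition(range(100))
print(len(train), len(test), len(validation))
```

Shuffling before slicing avoids any ordering in the source file leaking into one partition.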

**Modelling**

The Auto Classifier Node will be used to run all the models that are compatible with the data set. The Auto Cluster Model will also be used to develop additional insight into the demographic composition of individuals with health insurance.

Evaluation
The top performing models will be evaluated and tweaked to improve the accuracy of their predictions without overfitting.
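
One simple way to watch for overfitting is to compare accuracy on the training partition with accuracy on the testing partition: a large gap suggests the model has memorised the training data. The sketch below assumes that approach; the labels are made up and use the PRIVCOV coding of 1 and 2.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels and predictions for each partition.
train_actual = [1, 2, 1, 1, 2, 2, 1, 2]
train_predicted = [1, 2, 1, 1, 2, 2, 1, 1]
test_actual = [1, 2, 2, 1, 2]
test_predicted = [1, 1, 2, 2, 2]

train_accuracy = accuracy(train_actual, train_predicted)
test_accuracy = accuracy(test_actual, test_predicted)
gap = train_accuracy - test_accuracy  # a large positive gap hints at overfitting
print(train_accuracy, test_accuracy, gap)
```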

Deployment
The model will be used to predict whether or not an individual carries private insurance.

Data Interpretation Plan
The significance of each predictor will also be analyzed so that additional insight can be gained. Graphical summaries can then be used to highlight any particular class that has a significant impact on whether an individual has private health insurance. The results can be cross-referenced with other demographic data to map where clusters of insured individuals reside within the State of Oregon. This can provide valuable insight into where to locate future hospitals and how to monitor patient mix.