Team+1+2014

//Team 1 Final Project Report//

//Kathleen French and Zining Zhang// =Business Understanding= toc

**Objective 1**

 * Business Question:** What are the traits of uninsured patients that will cost the hospital more than an insurance policy and how can we identify them before costs build?
 * Specific Data Mining Goal**: For patients who had multiple encounters with the hospital which 50 – 100 patients would it have been most beneficial to purchase commercial health insurance for? How can we identify them?

Objective 2

 * Business Question**: Who are our Heart failure patients? What are the characteristics of HF patients? What are the utilization patterns?
 * Specific D****ata Mining Goal**: Find the correlations to Heart Failure Diagnosis. If possible to build model to predict Heart Failure

=Data Understanding=

After becoming familiar with the databases, exploring the meaning of the variables contained, and we began to explore which would be most suitable to explore and which would need modification to become useful (filtering/transformation/standardization/etc).

Objective 1
dbo.Encounters contains a flag HiLoss that indicates the encounter resulted in (negative) income in the bottom 10% of all patients. We separated out the encounters of uninsured patients and looked at the occurrence of various factors among them and if this could be related to predicting high loss.(image below)

Objective 2
Concerning our second objective we converted the heart failure count into a flag and looked at the demographics of these patients before beginning further preparation (table below).
 * < **Patient Categories ** ||< **N ** ||< **Percent ** ||< **HF % ** ||
 * <  ||< **Gender ** ||< **0 ** ||< 85960 ||< 43.6% ||< 52.02 ||
 * ^  ||^   ||< **1 ** ||< 111291 ||< 56.4% ||< 47.98 ||
 * ^  ||^   ||< **2 ** ||< 2 ||< .0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.0 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">Total ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">197253 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">100.0% ||<   ||
 * ^  ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">Ethnicity ** ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">0 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">6279 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">3.2% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;"> 0.48 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">400 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">70186 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">35.6% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">33.22 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">800 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1215 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">.6% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.19 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">1200 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">25520 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">12.9% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">3.33 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">1600 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">3021 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1.5% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.33 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">2000 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1978 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1.0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.38 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">2400 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">4587 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">2.3% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.9 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">2800 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1229 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">.6% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.29 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">3200 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">83164 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">42.2% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">60.88 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">3600 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">74 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">.0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">Total ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">197253 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">100.0% ||<   ||
 * ^  ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">HighCost ** ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">0 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">195281 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">99.0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;"> 93.72 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">1 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1972 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1.0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">6.28 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">Total ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">197253 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">100.0% ||<   ||
 * ^  ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">PayerGroupCount ** ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">1 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">182065 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">92.3% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">88.77 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">2 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">14318 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">7.3% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">10.14 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">3 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">863 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">.4% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">1.09 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">4 ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">7 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">.0% ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">0.0 ||
 * ^  ||^   ||< **<span style="font-family: 'Times New Roman','serif'; font-size: 12.6667px;">Total ** ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">197253 ||< <span style="display: block; font-family: 'Times New Roman','serif'; font-size: 12.6667px; text-align: right;">100.0% ||<   ||

=**Data Preparation**=

We transformed categorical data into flags, partitioned the data, and balanced the training set based on the flag we were targeting (HiLoss and HF flag respectively). For the HiLoss patients we had previously selected only the records of self-insured patients so we then removed unneeded variables such as Payor_group and variables that repeated each other. Finally we used the auto-data-prep node to transform the data into standardized z-scores for more accurate modelling and to address outliers and missing values.

For the Heart Failure exploration there were additional steps (preformed after unsatisfactory modelling) taken to obtain more accurate models. We preformed a Principal Components Analysis to reduce the number of predictors by identifying underlying factors that load on multiple variables ( one example would be that cancer count, oncology service line, and ICD9-2 all are based on a cancer diagnosis) and removed specialty patients such as those with birth abnormalities.the auto-data-prep was split into two branches, one that standardized to Z scores and one that was put on a min/max scale from 0-1 to make it more suitable for neural network development.

=**Data Modelling/Evaluation**=

**Objective 1**
We ran an auto-classifier which recommend rule trees at the best type of model. A map of the CHAID tree developed is below. The primary factors determining if a self-payor is high cost is if they stay longer than average (essentially if they spend nights in the hospital) with an increase in stay length increasing the likely hood of a high loss. Emergency admission and certain service lines were also reflected here.

We also ran clustering models, the most striking of which being a k-means clustering that separated the patients by encounter type 100% of the time, but also showing influence of discharge status and ED admit.

**Objective 2**
We ran the auto-classifier which combined three models for 92% accuracy. After viewing this we ran a solo discriminant analysis which was reported by the classifier to be 92% accurate for the testing set but resulted in 8.5% type 1 error and 23.3% type II error in the training set. We suspect the high level of type II errors (false negatives) could be related to the oversampling in the balancing as the testing set had higher accuracy.



Performed separately with the specially prepared data, a boosted neural network gave us 92% accuracy overall but this model requires significant computing power and the process is not human readable. However neural networks do learn so perhaps running such a model on a server as business continued could eventually result in highly accurate predictions.

=**Business Conclusion**=

Objective 1
We were unable to find a way to identify high loss patients before they had incurred a high cost. We recommend further mining into Encounter type 4 visits to determine who uses these and if there is a way to predict patients that will have a type four encounter in the future. As expected, inpatient stays were the most financially unfeasible for self-payors so reducing these through appealing low-priced preventative care would be wise

Objective 2
The heart failure exploration lead to more useful insight into the characteristics of these patients. They tend to have multiple diagnoses and have had numerous encounters of varying types with the hospital suggesting that a overall unhealthiness is effecting heart health. The presence of cancer and diabetes is also a predictor of heart failure and this link should be further investigated. =Slides=