Application of Data Mining at Salem Health


**By: Bishrut Thapa, Brian Neely, Craig Lobo & L.A Walker**

As part of our experimental class GSM 6272 Data Mining for Business Intelligence in the MBA program at Willamette University, our client Salem Hospital gave us two main objectives. The first was to determine whether it is financially viable for Salem Hospital to purchase insurance for patients who are currently uninsured. The second was to identify the characteristics of heart failure patients and build a model to predict heart failure.
**Background**

**Analysis of objective 1: Finding common characteristics among the defaulters:** To approach this problem, we used SQL to merge dbo.Encounter and dbo.DiagnosisRelatedGroup into a single data set. We then eliminated unnecessary variables, such as foreign keys and variables that are only generated after a patient is admitted to the hospital. We were left with the following variables:
 * Encounter Rows (Number of times the service was required)
 * ID (Unique patient ID)
 * Consumer_Gender (Gender of the patient)
 * Consmer_Ethnicity (Ethnicity of the patient)
 * ED_Admit (Emergency Department admission)
 * Payor Group (Type of insurance)
 * Encounter Type (Encoded encounter type)
 * Direct Income (Revenue from the patient minus the cost incurred by the hospital)
 * Principal Diagnosis (Standard principal diagnosis code)
 * HCFA_Diagonosis_Related_Group (Standard diagnosis group code)
 * Principal_Procedure_ICD9 (Standard procedure code)
 * Principal_Procedure_Ordering_Practitioner (Unknown physician encoding)
 * Attending (Unknown physician encoding)
 * Discharge status (Unknown discharge code)
 * Patient Visit Index (Numbers each patient's visits in the order they occurred)

After converting the data into flags, we extracted and examined the subset that contained only the defaulters. To do this, we first pulled the data for the self-insured patients and then kept only those patients who had actually defaulted. We then ran an Apriori model for association rules to see whether the defaulters exhibited any interesting patterns. The model gave us the following result:

**__Findings__:** The model showed that default was common among patients with ethnicity codes 1200 and 3200 who were admitted through the Emergency Department. These subsets also shared encounter type_4 and discharge status_6. This information can be used to profile a potential defaulter.

**__Recommendation__:** Based on how common the demographics described above are in the Salem metropolitan area, Salem Hospital can estimate the likelihood of encountering this type of patient and then evaluate whether purchasing insurance for them is worthwhile.
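
The Apriori step used for objective 1 can be illustrated with a minimal sketch. It is not the stream we ran in Modeler: the flag names, the toy rows, and the thresholds below are made up for illustration, and the functions `support`, `frequent_itemsets`, and `rules` are hypothetical helpers written for this example. The sketch grows frequent itemsets level by level, prunes any candidate that has an infrequent subset, and then derives rules that meet a minimum confidence.

```python
from itertools import combinations

# Hypothetical flag data: each set lists the flags that are true for one
# defaulting encounter. Flag names and rows are made up for illustration.
transactions = [
    {"ED_Admit", "Encounter_Type_4", "Discharge_Status_6"},
    {"ED_Admit", "Encounter_Type_4"},
    {"ED_Admit", "Encounter_Type_4", "Discharge_Status_6"},
    {"Encounter_Type_4"},
    {"ED_Admit", "Discharge_Status_6"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in the itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def frequent_itemsets(rows, min_support):
    """Level-wise Apriori search; returns {itemset: support}."""
    items = sorted({item for row in rows for item in row})
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), rows) >= min_support]
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, rows)
        size += 1
        # Candidates are unions of two frequent sets that have exactly `size` items.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        # Apriori pruning: every (size - 1)-subset must already be frequent.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, size - 1))
                   and support(c, rows) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, sup, confidence))
    return found


# Thresholds are assumptions chosen for the toy data, not the values we used.
found_itemsets = frequent_itemsets(transactions, min_support=0.5)
for antecedent, consequent, sup, conf in rules(found_itemsets, min_confidence=0.9):
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={sup:.2f} confidence={conf:.2f}")
```

Raising `min_support` shrinks the search quickly because of the pruning step, which is why Apriori remains practical on wide flag tables like the one described above.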

**Analysis of objective 2: Predicting Heart Failure Patients** As we moved forward with predicting heart failure, we weighed the pros and cons of various models. As part of this goal, we also worked to identify useful demographics of patients who have a higher risk of heart failure. To achieve this objective, we followed this process:
 * Converted the SAV file to Excel, constructed a column indicating whether a patient visited the hospital multiple times, and deleted the patients who did not have multiple encounters
 * Constructed a multiple-encounter data sheet in Excel
 * Used the Derive node with the time categories in the data, which allowed us to derive the actual date in seconds in Modeler
 * Used the visit order to calculate the time within each time category per encounter to create a specific date
 * Sorted all the dates in the order in which they occurred
 * Used the Time Interval node to create specific time intervals
 * Used the Time Series node to create a model that helps predict cardiovascular encounters from the patient's previous encounters
 * Conducted transformations on the results to obtain patient numbers
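
The data preparation steps above can be sketched outside of Modeler as follows. This is only an illustration under stated assumptions: the record layout `(patient_id, time_category, visit_index)`, the sample rows, the length of a time category, and the one-day spacing between visits are all hypothetical, and `pseudo_timestamp` is a helper invented for this example rather than anything taken from our stream.

```python
from collections import defaultdict

# Hypothetical encounter records: (patient_id, time_category, visit_index).
# The layout and values are made up for illustration only.
encounters = [
    ("P1", 2, 2),
    ("P1", 1, 1),
    ("P2", 1, 1),
    ("P3", 3, 2),
    ("P3", 1, 1),
    ("P3", 3, 3),
]

# Assumption: each time category spans a fixed number of seconds (90 days here).
SECONDS_PER_CATEGORY = 90 * 24 * 60 * 60
# Assumption: visits are spaced one day apart according to the visit order.
SECONDS_PER_VISIT = 24 * 60 * 60


def pseudo_timestamp(time_category, visit_index):
    """Derive a stand-in date in seconds from the category and visit order."""
    return time_category * SECONDS_PER_CATEGORY + visit_index * SECONDS_PER_VISIT


# Group the derived timestamps by patient.
by_patient = defaultdict(list)
for patient_id, time_category, visit_index in encounters:
    by_patient[patient_id].append(pseudo_timestamp(time_category, visit_index))

# Keep only patients with multiple encounters and sort each timeline by date.
timelines = {pid: sorted(stamps)
             for pid, stamps in by_patient.items() if len(stamps) > 1}

# Time between consecutive encounters, in seconds, per patient.
intervals = {pid: [later - earlier for earlier, later in zip(stamps, stamps[1:])]
             for pid, stamps in timelines.items()}

for pid in sorted(timelines):
    print(pid, timelines[pid], intervals[pid])
```

Because the derived dates depend entirely on the assumed spacing, any downstream time series model inherits that approximation.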



We were then able to build an ARIMA model that forecasts the likelihood of cardiovascular risk. The targets were made continuous so that the Time Series node could be used. The results were cleaned to include only patients, predictions, and actual occurrences. The results were then transformed so that all of the encounters were reduced to the highest prediction per patient and service line. Following this, if the patient had service line 1, that record was kept along with the highest prediction values. A cutoff of 0.5 was applied to the prediction, so that any prediction above 0.5 was considered a positive prediction.
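
The post-processing described above can be sketched as follows. The rows, the field layout, and the dictionary keys are hypothetical; the sketch also assumes that service line 1 is the cardiovascular service line, which is how we read our own results but is not encoded anywhere in this example. It keeps, for each patient, whether service line 1 ever occurred and the highest prediction, then applies the 0.5 cutoff.

```python
# Hypothetical scored encounters: (patient_id, service_line, prediction).
# Values are made up for illustration only.
scored = [
    ("P1", 1, 0.20),
    ("P1", 1, 0.65),
    ("P2", 3, 0.40),
    ("P2", 3, 0.10),
    ("P3", 2, 0.90),
]

CUTOFF = 0.5  # predictions strictly above this value count as positive

summary = {}
for patient_id, service_line, prediction in scored:
    entry = summary.setdefault(
        patient_id, {"had_line_1": False, "max_prediction": float("-inf")}
    )
    # Assumption: service line 1 marks a cardiovascular encounter.
    entry["had_line_1"] = entry["had_line_1"] or service_line == 1
    # Keep only the highest prediction seen for this patient.
    entry["max_prediction"] = max(entry["max_prediction"], prediction)

for entry in summary.values():
    entry["predicted_positive"] = entry["max_prediction"] > CUTOFF

for patient_id in sorted(summary):
    print(patient_id, summary[patient_id])
```

Patients such as the hypothetical `P3`, who have no service line 1 record but a prediction above the cutoff, correspond to the group predicted to have the service line in the future.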

**__Findings__:** The model produced 8559 predictions: 5130 for patients who already had the cardiovascular service line and 3429 for patients who did not. 182 patients were missed, and the 3429 patients without the service line were predicted to have it in the future.

**COMPLETE STREAM**

**__Assumption__:** The time series model splits its forecasting by the record ID "patient".
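
The counts reported in the findings can be checked with a few lines of arithmetic. The share computed below rests on an assumption about how to read the numbers: that the 182 missed patients are patients with the cardiovascular service line whom the model did not flag. The variable names are ours and are not fields from the data set.

```python
# Counts as reported in the findings.
predicted_with_line = 5130      # predicted, already had the cardiovascular service line
predicted_without_line = 3429   # predicted, did not yet have the service line
missed = 182                    # patients the model missed

total_predictions = predicted_with_line + predicted_without_line

# Assumption: "missed" patients had the service line but were not flagged, so
# this is the share of existing cardiovascular patients the model captured.
share_captured = predicted_with_line / (predicted_with_line + missed)

print(total_predictions)
print(round(share_captured, 3))
```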

**__Recommendation__:** Salem Hospital should evaluate its patient data and continue to collect it. The hospital should be able to use this model to help predict future heart failure patients. In addition, the time categories should be replaced with the real encounter date and time.