Team 3

=Cost and Profit Improvements=

A Statistical Study of Salem Health

**Teruhisa Homma, Sophia Maletz, Ariella Odierna, and Peter Olander**

----

=Executive Summary=

Salem Health partnered with the Data Mining course at Willamette University's Atkinson Graduate School of Management to explore data from patient encounters at Salem Hospital. The goal of this partnership was to provide a meaningful, experiential project for students and a better understanding of patient and operations data for the Hospital. An underlying objective was to explore the data to identify ways that Salem Hospital could reduce costs to better serve patients. To help develop and achieve these objectives, the Data team chose Direct Income as the target variable (henceforth referred to as DirectIncome_Tile10) and segregated the data into 10 evenly distributed categories, called Tiles, each containing an equal percentage of the total income obtained from each patient. In this way, patient encounters that provided the most and least income to Salem Health were identifiable. To examine the relationships between patient encounters and the qualities of DirectIncome_Tile10, two predictive models were used: Neural Networks and Logistic Regression. These models were applied in three different approaches: prediction, commonalities, and anticipatory factors.

The first approach asked: Is it possible to predict into which Tile a patient encounter will fall? The most accurate model for this was the Neural Network, with 57.8% accuracy. The probability of correctly predicting the least profitable group was 54.4%. The most significant predictors were: Factor 2 (containing Payor_Group_3, Sub_Service_Line_General Medicine, Service_Line_Cardiovascular, Discharge_Status_18.0, and Discharge_Status_36.0), Length of Stay, and Factor 3 (Encounter_Type_2.0, Sub_Service_Line_Evaluation and Management).

The second approach asked: Are there commonalities between patient encounters that fall into the most costly Tile? Both the Neural Network and the Logistic Regression models produced results with high accuracy; however, almost all variables had very low importance. The only variable that stood out was Payor_Group_2.0. In terms of a common characteristic seen in patient encounters in Tile 1, Payor_Group_2.0 is the most prevalent.

The third approach asked: Could we predict into which Tile a patient encounter would fall using only variables known at the point of check-in? This looked predominantly at variables such as demographics, Payor Group, Encounter Type, and ED admits. The models produced had around 90% accuracy. For the Neural Network model, Payor_Group_2.0, ED_Admit, and Payor_Group_4.0 were the three most significant predictors in determining whether a patient encounter would fall into Tile 1. For the Logistic Regression model, Payor_Group_2.0, Encounter_Type_3.0, and Encounter_Type_2.0 were all significant predictors.

The next step will require analysis from someone with a deeper understanding of the Hospital's business, to look further into the variables listed above that frequently emerged as significant predictors of into which Tile a patient encounter would fall. This is only the first step toward fully exploring and understanding the data.

=Business Understanding=

Salem Health is a comprehensive health services provider serving Marion, Polk, Benton, Lincoln, and Yamhill counties in Northwest Oregon. Salem Health's principal facility is the nonprofit, 454 acute-care bed hospital located in Salem. Employing over 4,000 individuals and caring for over 150,000 patients each year, the hospital is a significant organization in Oregon. However, health care costs have increased over the past few decades, and every forecast projects this trend to continue at significant growth rates. These costs are driven primarily by increases in technology use, prescription drugs, care required for chronic diseases, and administrative costs. This last cost is by far the easiest to contain; however, every organization struggles with it. Salem Health is particularly concerned by the amount of uncollected revenue from patients, which accounted for almost $13 million last year.

Salem Health partnered with the Data Mining course at Willamette University's Atkinson Graduate School of Management. The goal of this partnership is to provide an analysis of Salem Health's patient and operations data to find opportunities to reduce costs and improve profit margins. This team of students examined the data to identify areas that could be particularly valuable in providing a predictive model to show cost and profit improvements. We primarily used IBM's SPSS Modeler program (SPSS), a statistical prediction package able to process large datasets and quickly create intuitive models. This program allows us to create easily readable processes that can be replicated in a variety of other statistical analysis programs simply by following a similar data preparation and modeling strategy. The process embarked upon involved first gaining a full understanding of the data, then identifying key predictive variables, and finally creating adequate models to deliver actionable results.

=Macro Data Understanding=

The Finance Department at Salem Health provided records from over 700,000 patient encounters, organized into 59 variable sets, to assist in this analysis. We received data on many aspects of a patient visit to the hospital. Variables contained information such as dates, demographics, diagnoses, doctors, procedures, laboratory tests, encounter types and counts, costs, revenues, and length of stay. For a full list and description of variables, see Appendix 1. This data was delivered as a relatively raw statistics file that was readable by SPSS. Due to the size and diversity of the data, four data sets were provided: SHDRG, SHPatients, SHPatientsDiagnoses, and SHEncounters. SHEncounters is the primary dataset, containing all the provided patients and the majority of the variables. The other three datasets were merged with this primary dataset into individual data streams in order to examine the data further.
To help develop and achieve our goal, we chose Direct Income as our target variable and segregated the data into 10 evenly distributed categories, called tiles, each containing an equal percentage of the total incomes obtained from each patient. In this way we were able to identify the individuals who provided the most and least income amounts to Salem Health.
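The tiling step described above can be sketched in code. A minimal example, assuming a pandas DataFrame with a Direct_Income column; the column name and values here are illustrative, not the actual Salem Health schema:

```python
# A minimal sketch of the decile "tiling" step, assuming a pandas DataFrame
# with a Direct_Income column. The values below are illustrative only.
import pandas as pd

encounters = pd.DataFrame({
    "Direct_Income": [-500, -120, 30, 75, 210, 480, 950, 1800, 3600, 7200]
})

# qcut splits the records into 10 equally populated bins (deciles),
# labeled 1-10 so that Tile 1 holds the least profitable encounters.
encounters["DirectIncome_Tile10"] = pd.qcut(
    encounters["Direct_Income"], q=10, labels=list(range(1, 11))
)

print(encounters)
```

Because `qcut` bins by rank rather than by value, each tile contains (as closely as possible) the same number of encounters regardless of how skewed Direct Income is.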

The data is organized as nominal, flag, continuous, ordinal, or categorical. This proves difficult, since some of the variables are text descriptions of a procedure or diagnosis. Because some of these descriptions are too long, SPSS does not recognize the data and therefore classifies the variable as typeless, which makes it unusable for analysis. However, for most of these variables, an associated code was provided that was usable in modeling. Unfortunately, these are listed in unfamiliar codes that have little significance to people who have not worked in health care, and they are therefore difficult to interpret. Many of the variables are also coded in unknown units, making model interpretation difficult. We are able to identify some correlations and results, but a full analysis can be better completed by Salem Health. In general, the data provided is adequate for some simple analysis. The majority of the data came as complete datasets with proportionately few outliers or extremes. The low-quality data fields were eliminated or modified so that we would be able to utilize them in the analysis.

=Specific Data Understanding & Preparation=

The target chosen for this analysis was Direct Income. The data for Direct Income was organized into tiles, with each tile representing roughly 10% of the data. This is done by assigning each patient a value of 1 – 10. It is assumed that Tile_1 represents those patients who generated the least amount of revenue for the hospital. When the target DirectIncome_Tile10 is used, it includes the entire dataset. When the target DirectIncome_Tile10_1 is used, it includes only patient encounters that are in the least profitable tile (Tile 1).

Histogram of original Direct Income data and subsequent Tiling



To examine this relationship and the qualities of DirectIncome_Tile10, we used two predictive models: Neural Networks and Logistic Regression. We applied these models in three different areas: prediction, commonalities, and anticipatory factors. A Neural Network is a flexible, data-driven method that creates a model for classification or prediction. Since our dataset primarily contains two types of input fields, flag (binary) and continuous (numeric) variables, one advantage of a Neural Network is that it can approximate a wide range of predictive models with minimal demands on model structure and assumptions. One of its disadvantages is that, as a trade-off for this flexibility, a neural network is not easily interpretable. Logistic Regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression, but supports a categorical target field like //Direct_Income_TILE10//. Logistic Regression models are relatively accurate and can handle categorical and numeric input fields. Furthermore, the models yield a set of equations that indicate how strongly each input field is associated with each output category. Since Logistic Regression is a popular and powerful classification method, its models can also serve as a baseline against other modeling techniques. Compared to a Neural Network, a Logistic Regression is more useful for analyzing and interpreting relationships between input and output fields. This matters because identifying and interpreting important associations between highly important predictors and the target, with respect to cost reduction and improved profit margins, is a primary purpose of this project.
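The two model families can be reproduced outside SPSS Modeler. The following scikit-learn sketch fits both on synthetic 10-class data; the dataset, layer size, and split are illustrative stand-ins, not the Salem Health data or settings:

```python
# Illustrative scikit-learn sketch of the two model families on synthetic
# 10-class data; an assumed stand-in for the SPSS Modeler workflow.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=10, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression: interpretable coefficients, useful as a baseline.
logit = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

# Neural network: more flexible, but the fitted weights are hard to read.
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

print("logistic accuracy:", logit.score(X_va, y_va))
print("neural net accuracy:", mlp.score(X_va, y_va))
```

Holding out 20% of the data for validation mirrors the partition strategy used in the report's models.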

1. Prediction: DirectIncome_Tile10
Is there a way to predict which patient encounters will be categorized as the most costly? The first approach was to build a model that could predict into which tile a patient encounter would fall. The primary idea behind this was to identify with some confidence the members of Tile 1 so that this tile could be further examined. This approach also looked at significant variables for Tile prediction.

2. Commonalities: DirectIncome_Tile10_1
Are there common characteristics between patient encounters categorized as the most costly? This approach looked for significant characteristics in the data in Tile 1 (the most costly tile) to see if there were areas where the hospital could improve cost savings. The group was interested in an analysis that isolated the bottom 10% of patients (the most costly patients), so it focused on Tile 1.

3. Anticipatory Factors: DirectIncome_Tile10_1
This analysis only included variables from SHEncounters that were assumed to be known at check-in, along with variables from SHPatientDiagnoses. This would have practical business application for Salem Hospital: at the point of check-in, depending on the encounter and diagnosis data, patient encounters with predictors similar to those in DirectIncome_Tile_1 could be identified. This could lead to better profit collections, since the hospital could flag those patients and know the likelihood of collection.

=Modeling and Evaluation=

Neural Network
Although interpretation of the results is expected to be more difficult than with Logistic Regression, we expect a Neural Network to produce a model with greater accuracy; therefore, our focus in this model is accuracy rather than interpretability. Since model interpretability is not important in this particular section, we used principal components analysis (PCA), which compresses a large set of correlated predictors into a smaller set of uncorrelated components. As a result of PCA, the accuracy of the resulting model increased from 31.1% to 57.8%. The following explains the more accurate model with PCA. The overall accuracy of this model is 57.8%, and the probability of correctly predicting the least profitable group is 54.4%. For further analysis of model accuracy, see Appendix 3.
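The PCA-then-network step can be sketched as a pipeline. The synthetic data, component count, and layer size below are assumptions for illustration, not the report's actual configuration:

```python
# Sketch of the PCA-then-neural-network step: correlated predictors are
# compressed into a few uncorrelated components before the network is fit.
# Synthetic data and settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# 30 inputs, many of them redundant (highly correlated) -- the situation
# where PCA tends to help a neural network.
X, y = make_classification(n_samples=1500, n_features=30, n_informative=6,
                           n_redundant=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

model = make_pipeline(
    PCA(n_components=6),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=600, random_state=1),
)
model.fit(X_tr, y_tr)
print("validation accuracy:", model.score(X_va, y_va))
```

Fitting PCA inside the pipeline ensures the components are learned only from the training split, avoiding leakage into the validation score.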





//Factor-2//, //LengthOfStay_transformed_Inverse//, and //Factor-3// are the three most important predictors, respectively. It is difficult for us to interpret the direct relationship between //Direct_Income// and //Factor-2// and //Factor-3// because the factors are composed of multiple different variables. Furthermore, as mentioned at the beginning of this section, the result of a neural network is not easily interpretable, but it is useful due to its greater accuracy. The diversity and size of the dataset make this interpretation difficult.

 * **//__Factor-1__//** || **//__Factor-2__//** || **//__Factor-3__//** ||
 * ED_Admit, Encounter_Type_1.0, Encounter_Type_3.0, Sub_Service_Line_Infusion/Transfusion || Payor_Group_3, Sub_Service_Line_General Medicine, Service_Line_Cardiovascular, Discharge_Status_18.0, Discharge_Status_36.0 || Encounter_Type_2.0, Sub_Service_Line_Evaluation and Management, Service_Line_Evaluation and Management ||

Logistic Regression
The accuracy of this logistic regression model is 22.87%, based on the validation set, which is used only at the end of the model building and selection process to assess how well the final model performs on held-out data (20% of the data in this case). This means the model predicts with roughly 23% accuracy the tile into which a patient encounter will fall. While this may seem low, it does provide some hints at potential improvements for future models. For further technical analysis of the accuracy of this model, see Appendix 4.

//Service_Line_Lab//, //Payor_Group_1.0//, and //ED_Admit// are the three most important predictors, respectively. Unfortunately, since we have limited experience with health care, we do not know what these three input fields mean, which prevents our team from delivering useful information on this particular result. However, knowing that these variables stand out in the analysis can help to further refine the model.

General Characteristics of Tile_1
DirectIncome_Tile was converted to 10 binary variables so that the value “1” would signify patients with lower income levels. Tile_1 was then isolated by filtering out all other tiles as well as the original nominal variable. An analysis of all the variables showed that no variable correlated highly with Tile_1.
 * **DirectIncome_Tile1 General Characteristics** ||  ||
 * Total Count || 18518 ||
 * % of Total || 9.58% ||
 * Max || -69.25 ||
 * Min || -3927.01 ||
 * Mean || -254.81 ||
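The flag conversion described above can be sketched with pandas; the column names and values are illustrative:

```python
# Minimal sketch of the flag conversion: the 10-level tile variable becomes
# one binary column per tile, and Tile 1 is then isolated.
import pandas as pd

df = pd.DataFrame({"DirectIncome_Tile10": [1, 4, 10, 1, 7]})

# One flag column per tile present in the data; DirectIncome_Tile_1 == 1
# marks the least profitable encounters.
flags = pd.get_dummies(df["DirectIncome_Tile10"],
                       prefix="DirectIncome_Tile").astype(int)
df = pd.concat([df, flags], axis=1)

tile_1 = df[df["DirectIncome_Tile_1"] == 1]
print(len(tile_1), "encounters in Tile 1")
```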

Neural Network
Here we have the Tile_1 analysis with all variables from SHPatientsDiagnoses and SHEncounters. The first model attempted was a Neural Network with 90.4% accuracy. The Neural Net lists LengthOfStay, Sub_Service_Line Cardiac, GERD Testing, and Other Cosmetic as the most important predictors. All predictors seem to have nearly equal, albeit very low, importance. This may mean that for this model to be successful it needs a large dataset, since so many variables each contribute a small amount to the result.



Logistic Regression
The most important predictor is Payor_Group_2.0. Intuitively, this seems unbalanced considering the low values of the other predictors. Though this model had 90.71% accuracy on the validation set, the R² values are very low, indicating that the model explains very little of the variation in the target.





 * **Predictor** || **Coefficient** ||
 * Payor_Group_2.0 || -3.17 ||
 * Payor_Group_3.0 || -0.741 ||
 * Payor_Group_4.0 || -1.261 ||
 * Service_Line_Lab = 0 || 1.537 ||
 * Service_Line_Radiology = 0 || 1.106 ||

3. Anticipatory Factors: DirectIncome_Tile10_1
This step of the analysis looks for discriminating characteristics: what do we know about patients from Tile_1 upon check-in? The variables assumed to be known are Payor Group, Encounter Type, and ED Admit. We believe this includes some of the information the hospital would be able to collect before a patient costs the hospital more significant amounts of money.

Neural Network
This model produced the following results with 90.4% accuracy.

The most important predictors are Payor_Group_2.0, ED_Admit, and Payor_Group_4.0. Payor_Group_2.0 is the most important, as was the case in the previous section. This agreement between the models gives us additional confidence in the result.

Logistic Regression
This model had the following results, with 90.4% accuracy. Once more, the most important predictors are Payor_Group_2.0, Encounter_Type_3.0, and Encounter_Type_2.0. However, the R² values are again too low to be overly confident in the model. Payor_Group_2.0 has consistently been recognized as the most important predictor for Tile_1. It is assumed that Payor_Group refers to the general method of payment the patient is using to cover the expenses of the visit. Based on prior research into medical documentation, Encounter_Type most likely refers to the method by which the patient arrived at the hospital (ambulance, doctor referral, ER, etc.).








 * **Predictor** || **Coefficient** ||
 * Payor_Group_2.0 || -2.650 ||
 * Encounter_Type_3.0 || 0.385 ||
 * Encounter_Type_2.0 || -0.256 ||
 * Payor_Group_3.0 || -0.638 ||
 * Payor_Group_4.0 || -1.057 ||
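Assuming the coefficients above are on the usual log-odds scale, each can be read as an odds ratio by exponentiating it; values below 1 shrink the odds of the target outcome when the flag is set, and values above 1 raise them. (How the sign maps onto Tile 1 membership depends on SPSS's reference-category convention, which we treat as an assumption here.)

```python
# Translate the reported logistic coefficients into odds ratios,
# assuming they are on the log-odds scale.
import math

coefficients = {
    "Payor_Group_2.0": -2.650,
    "Encounter_Type_3.0": 0.385,
    "Encounter_Type_2.0": -0.256,
    "Payor_Group_3.0": -0.638,
    "Payor_Group_4.0": -1.057,
}

# exp(coef) is the multiplicative change in the odds of the target when
# the flag switches from 0 to 1, all else held constant.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: odds x {ratio:.3f}")
```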

=Deployment=

How does this analysis benefit Salem Health?

Predicting: Direct_Income_Tile_10
The most accurate model for predicting into which Tile a patient encounter would fall was the Neural Network, with 57.8% accuracy. The probability of correctly predicting the least profitable group was 54.4%. The most significant predictors were: Factor 2 (containing Payor_Group_3, Sub_Service_Line_General Medicine, Service_Line_Cardiovascular, Discharge_Status_18.0, and Discharge_Status_36.0), Length of Stay, and Factor 3 (Encounter_Type_2.0, Sub_Service_Line_Evaluation and Management). With additional knowledge of the health care business and the language required to operate a hospital, someone would be able to more fully understand the variables in the factors from the PCA and identify usable correlations. The Logistic Regression model had a much lower accuracy, 22.87%, in predicting into which Tile a patient encounter belonged. However, it did show that the most important predictors are: Service_Line_Lab, Payor_Group_1.0, and ED_Admit. Again, a better understanding of the business of health care could add a lot of value to this analysis.

Commonalities of Tile 1 – the most costly tile:
Both the Neural Network and the Logistic Regression models produced results with high accuracy; however, almost all variables had very low importance. The only variable that stood out was Payor_Group_2.0. In terms of a common characteristic seen in patient encounters in Tile 1, Payor_Group_2.0 is the most prevalent. A further exploration of who qualifies as a member of this group could add much-needed insight.

Anticipatory Factors: DirectIncome_Tile10_1
This looked predominantly at variables that would be known at the point of check-in, such as demographics, Payor Group, Encounter Type, and ED admits. The models produced had around 90% accuracy. For the Neural Network, Payor_Group_2.0, ED_Admit, and Payor_Group_4.0 were the three most significant predictors in determining whether a patient encounter would fall into Tile 1. For the Logistic Regression model, Payor_Group_2.0, Encounter_Type_3.0, and Encounter_Type_2.0 were all significant predictors.

Next Steps
 * 1) The model could be improved if the data were prepared differently so that SPSS could interpret it accurately. This could then be used to develop a similar process.
 * 2) Better overall data understanding would aid model interpretation and guide which model to use and which questions to pursue further.

=Appendices=
Appendix 1: Variable Descriptions and Source Nodes

 * **Name of Variables** || **Description** || **SHEncounters** || **SHPatient Diagnosis** || **SHPatients** || **SHDRG** ||
 * Apr2010Sep2010 || Flag variables that give a sense of time period || X || X ||  ||   ||
 * Apr2011Sep2011 || Flag variables that give a sense of time period || X || X ||  ||   ||
 * Attending || Unknown attending physician encoding || X ||  ||   ||   ||
 * Bene_Plan || Unknown benefit plan provider encoding || X ||  ||   ||   ||
 * Consumer_Ethnicity || Unknown ethnicity encoding || X ||  ||   ||   ||
 * Consumer_Gender || Unknown gender encoding || X ||  ||   ||   ||
 * DIAG || Standard diagnosis code ||  || X ||   ||   ||
 * DIAG NAME || Standard diagnosis description ||  || X ||   ||   ||
 * Diagnosis Count || Diagnosis count ||  ||   || X ||   ||
 * Diagnosis_Code || Unknown diagnosis encoding ||  || X ||   ||   ||
 * DiagnosisID || Unknown diagnosis ID || X || X ||  ||   ||
 * Discharge_Status || Unknown discharge status code || X ||  ||   ||   ||
 * DRGID || Determines bill for Medicare patients (taken from ICD9 codes) || X ||  ||   || X ||
 * ED_Admit || Emergency department admission || X ||  ||   ||   ||
 * Encounter_ID || Unique hospital visit identifier || X || X ||  ||   ||
 * Encounter_Level_Actual_Payment || Amount received for this encounter || X ||  ||   ||   ||
 * Encounter_Level_Direct_Costs || Cost of this encounter || X ||  ||   ||   ||
 * Encounter_Level_Gross_Revenue || Amount billed for this encounter || X ||  ||   ||   ||
 * Encounter_Type || Unknown encounter type encoding || X ||  ||   ||   ||
 * EncounterCount || Number of encounters for this patient in data ||  ||   || X ||   ||
 * EncounterRows || Unknown ||  ||   ||   || X ||
 * Ethnicity || Unknown ethnicity encoding ||  ||   || X ||   ||
 * Gender || Unknown gender encoding ||  ||   || X ||   ||
 * HCFA Code || Unknown coding ||  ||   ||   || X ||
 * HCFA_Diagnosis_Related_Group || Standard diagnosis group code || X ||  ||   ||   ||
 * HCFA_Diagnosis_Related_Group_Name || Standard diagnosis group description || X ||  ||   ||   ||
 * LengthOfStay || Length of Stay (unknown units) || X ||  ||   ||   ||
 * MSCode || Unknown coding ||  ||   ||   || X ||
 * Name || Unknown naming ||  ||   ||   || X ||
 * Oct2009Mar2010 || Flag variables that give a sense of time period || X || X ||  ||   ||
 * Oct2010Mar2011 || Flag variables that give a sense of time period || X || X ||  ||   ||
 * Order Index || An encounter can result in multiple diagnoses, each gets an index number ||  || X ||   ||   ||
 * Patient || Unique patient identifier || X ||  || X ||   ||
 * PatientID || Unique patient identifier || X ||  || X ||   ||
 * Payor_Group || Unknown categorization of payor type || X ||  ||   ||   ||
 * PCPATADMITCODE || Unknown primary care physician encoding at point of admission || X ||  ||   ||   ||
 * Primary_Procedure_Description || Service performed || X ||  ||   ||   ||
 * PrimaryProcCount || Unknown procedure count encoding ||  ||   || X ||   ||
 * Principal_Diagnosis || Standard principal diagnosis code || X ||  ||   ||   ||
 * Principal_Diagnosis_Name || Standard principal diagnosis description || X ||  ||   ||   ||
 * Principal_Procedure_ICD9 || Standard procedure code || X ||  ||   ||   ||
 * Principal_Procedure_ICD9_NAME || Standard procedure description || X ||  ||   ||   ||
 * Principal_Procedure_Ordering_Practitioner || Unknown physician encoding || X ||  ||   ||   ||
 * PrincipalProcCount || Unknown procedure count encoding ||  ||   || X ||   ||
 * SalemAvgLOS || Salem Hospital average length of stay ||  ||   ||   || X ||
 * Service_Line || Servicing department || X ||  ||   ||   ||
 * StandardLOS || National average length of stay ||  ||   ||   || X ||
 * Sub_Service_Line || Service performed || X ||  ||   ||   ||
 * Total_Actual_Payments || Total payments for this patient ||  ||   || X ||   ||
 * Total_Direct_Costs || Total costs incurred by this patient ||  ||   || X ||   ||
 * Total_Direct_Income || Total payments minus total cost for this patient ||  ||   || X ||   ||
 * Total_Gross_Revenue || Total revenue from this patient ||  ||   || X ||   ||
 * Visit_Order || Unknown visit tracking || X ||  ||   ||   ||
 * Zip_Code_Area || Unknown zip code encoding || X ||  ||   ||   ||
 * ZipCodeCount || Unknown zip code encoding ||  ||   || X ||   ||

Appendix 3: Logistic Regression – Accuracy
Based on the lift chart, the logistic regression model performs well, since the lift value is higher than 1.40 at the 40th percentile and higher than 1.0 at the 60th percentile. Pseudo R-square provides three measures of model fit, each ranging from 0 (lowest) to 1.0 (highest). R-square is the coefficient of determination, primarily used to predict future outcomes or test hypotheses; it indicates how well the observed outcomes are replicated by the model. In this case, since the R-square of 0.567 is considered high, the resulting model seems to replicate the observed outcomes well and would thus be worth further analysis.
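The lift measure referenced here can be computed directly: it is the response rate among the top-scored k% of records divided by the overall response rate. This sketch uses synthetic scores and outcomes as stand-ins for the model's output:

```python
# Illustrative computation of lift at a percentile, using synthetic data.
import numpy as np

def lift_at(scores, y_true, pct):
    order = np.argsort(scores)[::-1]      # highest predicted scores first
    k = max(1, int(len(scores) * pct))
    top_rate = y_true[order[:k]].mean()   # response rate in the top pct
    return top_rate / y_true.mean()       # ratio to the overall rate

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)              # 0/1 outcomes
scores = y + rng.normal(0, 1, 1000)       # noisy but informative scores

print("lift at the 40th percentile:", lift_at(scores, y, 0.40))
```

A lift above 1.0 at a given percentile means the model concentrates true responders in its top-scored records better than random selection would.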

Appendix 4: Logistic Regression – Accuracy & Further Analysis


=References=
 * IBM SPSS Modeler, []
 * Kaiser Edu, //U.S. Health Care Costs//, []
 * Salem Health, March 2013