Team+2

Dallin Calaway, Marta Tarantsey, Ben Platt, Matt Stephens
 * **Patients and Payors: A Statistical Study of Salem Health **


 * Introduction / Business Challenge **

Hospitals can lose a large amount of money each year by treating patients who are unable to pay for the services they receive. Hospital administrators are therefore very interested in any information that could reduce the hospital’s losses. Our goal for this data mining project is to sift through the Salem Health datasets and determine which characteristics or events occur together with losses among the most expensive 1% of all patients, who account for the majority of the hospital’s losses. To arrive at these insights we selected three algorithms: Logistic Regression, Kohonen, and Carma.
 * Dataset **

We were provided a large dataset from Salem Health on March 18th. Due to security concerns, the original dataset contained only a few hundred records from the hospital. However, after we signed confidentiality agreements, over 700,000 records were released for analysis. Each observation in the dataset represents a single hospital visit; in total, the data represent about 200,000 individual patients. One difficulty in working with this data is that key characteristics and variables were prepared so that no single patient could be identified by a student or anyone else looking at the data: not only were patient identification numbers anonymized, but variables such as “Discharge Status_6.0” were provided without any explanation of what their values mean. The team set out to find information that would help interpret these variables, but in many cases nothing was found. Even so, the models still run on the anonymized fields, and the output showed evidence of some interesting associations. Below is a list of definitions for the data fields in the Salem Health dataset. These fields include items such as zip code, length of stay, and discharge status.

[[image:gsm672/data fields.png caption="data fields.png"]]

 * Data Mining Goals **

** Logistic Regression: ** to find the predictors that matter most for an encounter whose Total Direct Income falls in the lowest percentile.

** Kohonen: ** to see what the clusters characterized by the “profitability” of the patients have in common, and what they tell us about the interaction between a patient’s Payor Group, the way in which they stay at the hospital, and how they are discharged.

** Carma: ** to see “what goes with what”, that is, to understand which combinations of predictors occur together.


 * Logistic Regression **

The plan was to find the variables that identify patients in DIPercentile_1.0. After running the logistic regression on the data, we found several important predictor variables, the most important being Payor_Group_3.0, Payor_Group_1.0 and Encounter_Type_3.0. We identified these by looking at the odds ratio, or Exp(B), in the output below.





|| Logistic Regression Data Stream ||
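To make the Exp(B) reading concrete, here is a minimal Python sketch of how an odds ratio follows from a logistic regression coefficient. The coefficient values and the `Example_Weak_Predictor` name are hypothetical and are not taken from the Salem Health output; ranking by the absolute value of B is also an assumption that is only reasonable when the predictors are on a comparable scale, such as 0/1 dummy variables.

```python
import math

# Hypothetical coefficients (B) for illustration only; these are NOT the
# values from the Salem Health model output.
coefficients = {
    "Payor_Group_3.0": 1.2,
    "Payor_Group_1.0": -0.9,
    "Encounter_Type_3.0": 0.7,
    "Example_Weak_Predictor": 0.05,
}


def odds_ratio(b):
    """Exp(B): factor by which the odds are multiplied when a dummy is 1 instead of 0."""
    return math.exp(b)


def rank_by_effect(coefs):
    """Order predictors by |B|, i.e. by how far Exp(B) lies from 1 in either direction."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


for name in rank_by_effect(coefficients):
    b = coefficients[name]
    print(f"{name}: B = {b:+.2f}, Exp(B) = {odds_ratio(b):.2f}")
```

An Exp(B) above 1 means the predictor raises the odds of the outcome, a value below 1 means it lowers them, and a value near 1 means it has little effect.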


 * Carma **

What attracted our group to the Carma model was the idea of finding association rules; in other words, we wanted to use the Carma algorithm to find relationships between events. The general question of the study is “What is it about the 1% expensive patients that makes them more expensive than the rest of the patients seen?”, and the Carma algorithm could provide some insight into this question.

Before the data could be run through the Carma algorithm, it needed to be reduced to a smaller set with less noise. We began by removing the outliers and extremes, then ran the data through a Filter node that removed fields we did not want to take into consideration, such as DRGID, Gender, Ethnicity, etc. After filtering those fields, we ran another outlier and extreme step, followed by a Transform node to normalize the data. We then made dummy variables of the nominal fields and filtered out the parent fields of the dummies so that there would be no redundancies. To understand which of these data fields Modeler considers most pertinent, we ran a Feature Select node, and Modeler returned 23 fields, all marked as “Important.” Finally, we configured a Carma node to take only these 23 fields into consideration.
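The preparation steps above were carried out with Modeler nodes; the following Python sketch only mirrors the idea on toy data. The records, the `Length_Of_Stay` and `Payor_Group` field names, the standard-deviation cutoff, and the use of z-score standardization for the "normalize" step are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Toy records with hypothetical values; the field names are assumptions.
records = [
    {"Length_Of_Stay": 1.0, "Payor_Group": "1.0"},
    {"Length_Of_Stay": 2.0, "Payor_Group": "3.0"},
    {"Length_Of_Stay": 1.0, "Payor_Group": "1.0"},
    {"Length_Of_Stay": 3.0, "Payor_Group": "2.0"},
    {"Length_Of_Stay": 2.0, "Payor_Group": "3.0"},
    {"Length_Of_Stay": 1.0, "Payor_Group": "1.0"},
    {"Length_Of_Stay": 2.0, "Payor_Group": "2.0"},
    {"Length_Of_Stay": 40.0, "Payor_Group": "3.0"},  # an extreme stay
]


def drop_outliers(rows, field, max_sd):
    """Keep rows whose value lies within max_sd standard deviations of the mean."""
    values = [r[field] for r in rows]
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return list(rows)
    return [r for r in rows if abs(r[field] - mu) <= max_sd * sd]


def standardize(rows, field):
    """Replace a numeric field with its z-score (mean 0, standard deviation 1)."""
    values = [r[field] for r in rows]
    mu, sd = mean(values), pstdev(values)
    return [{**r, field: (r[field] - mu) / sd if sd else 0.0} for r in rows]


def make_dummies(rows, field):
    """Add one 0/1 field per level of a nominal field and drop the parent field."""
    levels = sorted({r[field] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != field}
        for level in levels:
            new[f"{field}_{level}"] = 1 if r[field] == level else 0
        out.append(new)
    return out


# A cutoff of 2 standard deviations is used only because the toy sample is tiny.
cleaned = drop_outliers(records, "Length_Of_Stay", max_sd=2.0)
scaled = standardize(cleaned, "Length_Of_Stay")
prepared = make_dummies(scaled, "Payor_Group")
print(prepared[0])
```

Dropping the parent field after creating the dummies is what prevents the same information from entering the association rules twice.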

Initially, the model was run without removing any outliers and extremes, and none of the data was transformed (Direct Income being a prime candidate for transformation). The model output lists the associations it identifies within the dataset and assigns each a confidence and a support level. For example, a confidence of 100% means that, in the data, every patient from ZipCodeCount_1.0 who had some lab work done (Service_Line_Lab = 1.0) ended up in the category Encounter_Type_1.0; a support of 27% means that this pattern occurs within 27% of the dataset. The first output yielded about five associations like this one, all at 100% confidence. Since that level doesn’t necessarily reflect reality, the adjustments mentioned in the previous section were made, in addition to running a feature selection step that identifies the most significant predictors.
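As a rough illustration of how support and confidence are computed for a single rule, here is a small Python sketch on invented encounters. The data are hypothetical. The sketch reports both antecedent support and rule support, since the report does not say which of the two the 27% figure refers to.

```python
# Toy encounters (hypothetical); each set holds the flags that are true for a visit.
encounters = [
    {"ZipCodeCount_1.0", "Service_Line_Lab", "Encounter_Type_1.0"},
    {"ZipCodeCount_1.0", "Service_Line_Lab", "Encounter_Type_1.0"},
    {"ZipCodeCount_1.0", "Service_Line_Lab", "Encounter_Type_1.0"},
    {"ZipCodeCount_1.0", "Service_Line_Lab", "Encounter_Type_2.0"},
    {"Service_Line_Lab", "Encounter_Type_1.0"},
    {"Service_Line_Lab", "Encounter_Type_1.0"},
    {"ZipCodeCount_1.0", "Encounter_Type_2.0"},
    {"ZipCodeCount_1.0", "Encounter_Type_2.0"},
    {"Encounter_Type_3.0"},
    {"Encounter_Type_3.0"},
]


def rule_metrics(transactions, antecedent, consequent):
    """Return (antecedent support, rule support, confidence) for antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent in t]
    antecedent_support = len(with_antecedent) / n
    rule_support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return antecedent_support, rule_support, confidence


a_sup, r_sup, conf = rule_metrics(
    encounters,
    antecedent={"ZipCodeCount_1.0", "Service_Line_Lab"},
    consequent="Encounter_Type_1.0",
)
print(f"antecedent support = {a_sup:.0%}, rule support = {r_sup:.0%}, confidence = {conf:.0%}")
```

In this toy sample the rule holds for three of the four encounters that match the antecedent, so the confidence falls below 100% as soon as one counterexample exists.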

With these variables taken into consideration, a different set of associations was produced, all under 100% confidence. One thing that happens quite frequently with this algorithm is that a set of antecedents and consequents appears again elsewhere in the table with the roles reversed (with different confidence and support levels, naturally). When using the model, it is also important to consider another measure in the output, the lift ratio. Its interpretation revolves around 1.0: a lift at or below 1.0 means the antecedent does not make the consequent any more likely than its baseline rate, so the association isn’t very strong, while a lift above 1.0 indicates a stronger association. Below is a screen shot of Carma’s output.



As mentioned before, the association among ZipCodeCount_1.0, Service_Line_Lab and Encounter_Type_1.0 recurs in several forms. The team sees this set of associations as the most important in each of its forms because it explains more about the patient than some of the others in the model. For example, the above table is sorted by lift ratio and displays strong associations (>2.0) between ED_Admit and Encounter_Type_3.0, but this set was viewed as reflecting more of an internal process, such as “when a person is admitted, they are characterized by the hospital as having had encounter type 3.0”.
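The lift ratio can be worked out directly from counts. The sketch below uses invented counts for ED_Admit and Encounter_Type_3.0 (they are not the Salem Health figures) and also shows why a rule and its reversed form can differ in confidence yet share the same lift.

```python
import math


def confidence(n_antecedent, n_both):
    """Share of encounters with the antecedent that also have the consequent."""
    return n_both / n_antecedent


def lift(n_total, n_antecedent, n_consequent, n_both):
    """Confidence divided by the baseline rate of the consequent."""
    return confidence(n_antecedent, n_both) / (n_consequent / n_total)


# Hypothetical counts, for illustration only.
n_total = 1000        # all encounters
n_ed_admit = 200      # encounters with ED_Admit
n_type_3 = 250        # encounters with Encounter_Type_3.0
n_both = 150          # encounters with both

forward = lift(n_total, n_ed_admit, n_type_3, n_both)   # ED_Admit -> Encounter_Type_3.0
reverse = lift(n_total, n_type_3, n_ed_admit, n_both)   # Encounter_Type_3.0 -> ED_Admit

print(f"confidence forward = {confidence(n_ed_admit, n_both):.2f}, lift = {forward:.2f}")
print(f"confidence reverse = {confidence(n_type_3, n_both):.2f}, lift = {reverse:.2f}")
```

Because lift reduces to `n_both * n_total / (n_antecedent * n_consequent)`, swapping antecedent and consequent leaves it unchanged, which is one reason reversed rules tend to sit next to each other when the table is sorted by lift.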


 * Kohonen **

With the key question in mind, we reevaluated our assumptions and expectations for the project. We aimed to generate descriptive characteristics and patterns shared by the patients who end up in the lowest Total Direct Income percentile.

The first theory was that most patients enter Salem Hospital for a particular reason (which later turned out to be “Encounter Type 1”) and get some lab work done. They are Payor type 1 and do not differ by gender or ethnicity. Zip code turned out to be a descriptive variable, not a predictive one. Our working assumption was that the patients who follow a slightly different path from “business as usual” are in fact the ones who generate the least revenue or result in a loss to Salem Health.

We focused on the lowest percentile and created a new variable, “Loss”, a binary flag set to 1 when the total direct income was less than 0. Our second theory was that these patients have some characteristics in common. An article[i] reviewing the pricing strategies hospitals can use to increase the return on their costs recommends reviewing the “charge-masters”, the pricing lists that give a comprehensive breakdown of charges per procedure.

The five sub-service lines most frequently connected to the “Loss” flag were “Evaluation and Management”, “Cardiology”, “Medical Cardiology”, “Microbiology”, and “Chemistry”. It would therefore make sense to visit each of these departments and look at the collection efforts undertaken, particularly toward patients who are in Payor Group 3 and might be there for a longer stay (>1 day). The pricing should be reviewed for consistency across the sub-service lines; if it is not consistent and there are extra fixed costs associated with equipment in the Chemistry lab, for example, then those extra costs are being passed on to customers who are not able to pay.

The graph below is one of several that we produced. We scaled the frequency thresholds for predictor combinations up and down to get the clearest connections. This is what we started with, but after looking at predictor importance for an encounter resulting in a “Loss”, we raised several thresholds and produced a clearer graph. We excluded total revenue, costs, and direct income because we wanted to focus on the patients that were resulting in a loss. Below is a clearer web graph that shows more insightful connections.
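The “Loss” flag and the count of sub-service lines connected to it can be sketched in a few lines of Python. The encounters, field names (`sub_service_line`, `total_direct_income`) and dollar amounts below are invented for illustration and do not come from the Salem Health data.

```python
from collections import Counter

# Hypothetical encounters; field names and amounts are assumptions.
encounters = [
    {"sub_service_line": "Chemistry", "total_direct_income": -120.0},
    {"sub_service_line": "Chemistry", "total_direct_income": -40.0},
    {"sub_service_line": "Cardiology", "total_direct_income": -300.0},
    {"sub_service_line": "Cardiology", "total_direct_income": 500.0},
    {"sub_service_line": "Microbiology", "total_direct_income": 0.0},
    {"sub_service_line": "Chemistry", "total_direct_income": 80.0},
]


def add_loss_flag(rows):
    """Flag an encounter with Loss = 1 when its total direct income is below zero."""
    return [{**r, "Loss": 1 if r["total_direct_income"] < 0 else 0} for r in rows]


def loss_counts_by_line(rows):
    """Count loss-making encounters per sub-service line."""
    return Counter(r["sub_service_line"] for r in rows if r["Loss"] == 1)


flagged = add_loss_flag(encounters)
counts = loss_counts_by_line(flagged)
for line, n in counts.most_common():
    print(f"{line}: {n} loss encounter(s)")
```

An encounter that breaks even (income of exactly 0) is not flagged, which follows the "less than 0" definition used for the variable.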

Another cluster, far smaller in size and shown above, was derived after additional model runs with different settings. It identified patients who stayed less than a day but had discharge type 0.0 and whose Total Direct Income averaged -524 dollars, quite a large loss compared to the others. This combination is very particular and very rare, but it is something Salem Health could keep on its radar: the next time it happens, staff would know that it has happened before, in 2012-2013.
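For readers unfamiliar with how a Kohonen network arrives at clusters like these, the following is a deliberately tiny Python sketch of a self-organizing map. It is a simplification under stated assumptions: a one-dimensional map with only two units, fixed starting weights, and two invented features scaled to the range 0 to 1 (length of stay and direct income). It is not the configuration used in Modeler.

```python
import math


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to the sample x."""
    def squared_distance(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda j: squared_distance(weights[j]))


def train_som(samples, weights, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a 1-D Kohonen map in place: pull the winner and its neighbours toward each sample."""
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                # learning rate decays over time
        sigma = sigma0 * (1.0 - frac) + 0.01   # neighbourhood shrinks over time
        for x in samples:
            winner = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Units close to the winner on the map are updated more strongly.
                h = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical scaled features: (length of stay, direct income).
typical = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]    # short stay, positive income
loss_like = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15)]  # long stay, low income

# Interleave the two groups so neither dominates the early updates.
samples = [p for pair in zip(typical, loss_like) for p in pair]
weights = train_som(samples, [[0.4, 0.6], [0.6, 0.4]])

typical_units = [best_matching_unit(weights, x) for x in typical]
loss_units = [best_matching_unit(weights, x) for x in loss_like]
print("typical ->", typical_units, "loss-like ->", loss_units)
```

Each encounter is assigned to the unit it lands on, and the profile of the encounters on a unit (for example their average direct income) is what gets interpreted as a cluster.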


 * Conclusion **

Our analysis produced a few key findings. The factors that we found to be most frequent among the most costly one percent of patients are:
 * Being in Payor Group 3
 * Encounter Types 1 and 2
 * Having a longer hospital stay

When front-desk staff recognize a patient who is financially unable to pay but who is likely to end up with a longer hospital stay or a larger bill, new policies could be developed for how to deal with this specific patient type (a specific payment plan, or a referral to a not-for-profit that assists with medical expenses). Furthermore, certain sub-service lines are more likely to be connected to money-losing patients, and the screening process can take place during the "referral moment", when a billing or front-desk staff member enters the data from the patient's initial form into the billing records. If patients are accumulating charges in more than one of the sub-service lines indicated above, there may be a need to devote more staff time or resources to following through with charge collections.