Predicting A Biological Response Rate - YWF Consulting


**YWF Consulting**
**Thomas Leveleux**
**Eliana Penzner**
**Chris Riley**

__Predicting A Biological Response__

YWF Consulting is looking to predict the biological responses of molecules from their chemical properties. Using molecular characteristics and their respective biological response rates, we plan to relate molecular information to biological response. A leader in the pharmaceutical industry, Boehringer Ingelheim is looking to YWF Consulting and Kaggle.com to “drive innovative solutions to this scientific challenge.” Boehringer Ingelheim is a family-owned business, 125 years old, that focuses on human pharmaceuticals and animal health. Inherent in its success is the emphasis Boehringer Ingelheim places on research and development: in 2010, it conducted or funded over 1,320 clinical studies. YWF Consulting plans to support Boehringer Ingelheim’s efforts by building a model to fit the given data. YWF Consulting will present its findings and submit its final report to Kaggle.com on May 3, 2012.
**INTRODUCTION:**

The field of predicting biological responses from molecular information has been around for 45 years. Specifically, this activity is classified as quantitative structure-activity relationship (QSAR) modeling and is used heavily in agrochemistry, pharmaceutical chemistry, and toxicology. The first QSAR formulation was published in 1962 by Hansch and Muir. In their study they described the “structure-activity relationships of plant growth regulators and their dependency on Hammett constants and hydrophobicity.” More recent developments in structure-activity relationship (SAR) work attempt to derive structures from the function of certain receptors. Currently, the major advances in this field are related to the development of 3-D models coupled with X-ray crystallography to better derive and understand molecules and their reactions. Others in the biomedical field are certainly at work on problems similar to ours; however, our dataset is unique and should prove interesting to study. We plan to conduct more research into previously developed algorithms in order to guide and enhance our modeling process.
**BACKGROUND:**

To understand the data set provided, YWF Consulting researched how these predictors are calculated. The Milano Chemometrics and QSAR Research Group provided a wealth of information on this (the following text was retrieved from http://www.moleculardescriptors.eu/):
**VARIABLE DATA SET:**

In particular, the basic properties a molecular descriptor MUST HAVE are:

1. invariance with respect to labeling and numbering of the molecule atoms
2. invariance with respect to the molecule roto-translation
3. an unambiguous algorithmically computable definition
4. values in a suitable numerical range for the set of molecules where it is applicable to

Point 1 requires that the value of a molecular descriptor does not depend on (is invariant to) how the molecule atoms are labelled or numbered. Descriptors which make use of the atom numbering in their definition have to use some canonical unique numbering based on unequivocal rules.

Point 2 requires that the value of a molecular descriptor does not depend on the absolute values of the numerical coordinates defining the atom positions with respect to some arbitrary origin. For example, a descriptor cannot have different values depending on the position of the molecule with respect to some fixed reference axis.

According to point 3, a molecular descriptor must be defined by a computable mathematical expression whose terms are unambiguous and clearly available from the molecular structure.

According to point 4, the values of a molecular descriptor must be in an acceptable numerical range, avoiding singular points and values such as 10^12 or 10^-8. For example, descriptors defined on the product of some atomic property quickly reach large numerical values for big molecules. The mathematical rules given above MUST hold for all molecular descriptors.
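As a rough illustration of points 1 and 2, the sketch below checks that a toy descriptor keeps the same value when the atoms are relabeled and when the whole molecule is rotated and translated. The descriptor (the sum of all interatomic distances), the helper names and the coordinates are all assumptions made for illustration; none of them comes from the contest data or from the quoted source.

```python
import math
import random

def sum_pairwise_distances(coords):
    """Toy 3-D descriptor: sum of all interatomic distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += math.dist(coords[i], coords[j])
    return total

def rotate_z(coords, angle):
    """Rotate every atom about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def translate(coords, shift):
    """Shift every atom by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]

# Hypothetical coordinates for a four-atom fragment (invented for this sketch).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0), (0.0, 1.2, 0.9)]
base = sum_pairwise_distances(atoms)

# Point 1: relabel (reorder) the atoms.
relabeled = atoms[:]
random.Random(0).shuffle(relabeled)

# Point 2: rotate and translate the whole molecule.
moved = translate(rotate_z(atoms, 0.7), (10.0, -3.0, 2.5))

print(math.isclose(base, sum_pairwise_distances(relabeled)))
print(math.isclose(base, sum_pairwise_distances(moved)))
```

A descriptor that used the raw x coordinate of the first atom, by contrast, would fail both checks.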

Moreover, good molecular descriptors SHOULD HAVE other important characteristics:
a. a structural interpretation
b. a good correlation with at least one property
c. no trivial correlation with other molecular descriptors
d. gradual change in its values with gradual changes in the molecular structure
e. not including in the definition experimental properties
f. not restricted to a too small class of molecules
g. preferably, some discrimination power among isomers
h. preferably, not trivially including in the definition other molecular descriptors
i. preferably, allowing reversible decoding (back from the descriptor value to the structure)

Some examples of molecular descriptors include:
1) boiling point
2) melting point
3) heat capacity at T constant
4) heat capacity at P constant
5) entropy
6) density
7) total surface area
8) molar volume

The data will be analyzed using IBM SPSS Modeler v14.2. We will start by using a PCA node and a feature select node in SPSS Modeler to reduce the variables, keeping only the most important ones for further, more detailed analysis. This process will allow us to run the analysis on a single computer without using a server. At each reduction step we will keep only the most relevant data, shrinking the data set without significantly reducing the accuracy of the analysis.
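The idea of keeping only the most important variables can be sketched outside of Modeler. The example below ranks predictors by the absolute value of their Pearson correlation with a 0/1 target and keeps the top few. This is a deliberately simple stand-in and is not the algorithm used by the SPSS feature select node or the PCA node; the function names and the toy data are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_features(columns, target, keep):
    """Return the `keep` column names most correlated (in absolute value) with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# Toy data, invented for illustration only (not the contest data).
activity = [1, 0, 1, 1, 0, 0]
columns = {
    "D1": [0.9, 0.1, 0.8, 0.7, 0.2, 0.1],  # tracks the target closely
    "D2": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant, carries no information
    "D3": [0.2, 0.3, 0.1, 0.4, 0.3, 0.2],  # weak relationship
}

print(select_features(columns, activity, keep=2))
```

Screening each variable on its own like this is cheap enough to run on a laptop, which is the constraint described above, but it ignores interactions between predictors.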
**PROCESS:**

Materials: The best way to analyze the data and have Modeler run smoothly would have been to use a server. None is available for our project, so we will use only laptops with Modeler.

The data was taken from kaggle.com. This website organizes data mining competitions, and the dataset we are using is the one provided for the “Predicting a Biological Response” contest. Several CSV files are provided to competitors to carry out their projects. No complementary data are needed for this project.

Concerning the time frame of this project: it could vary depending on how long the models take to run, because the original dataset is very large. Reducing the data to keep only the most relevant variables should take a few hours. After that, we plan on at least half an hour to run each model we want to apply to the reduced dataset.

We will each run different models on the data in order to see which one gives the best analysis. A combination of the best models might work too.

We did not plan on any costs for this project (apart from time spent) because we already have computers and IBM SPSS Modeler.



**ANALYSIS**

The data set originally contained 1776 predictor variables, 3752 rows of data and one target variable, activity, which was set as a flag. The predictor variables were numerically named to protect Boehringer Ingelheim’s proprietary information. We first performed a 50-30-20 partition, followed by auto data prep. The data used in our project had already been prepared and normalized, which removed a step from the auto data prep. Before determining which models to run on the data, we ran a feature select node. Feature select scaled down the data set by determining which predictor variables were the most important. We found this necessary because, given our limited computing power, the full data set did not allow us to run any models. We attempted to gain access to server time through Willamette University; however, that option was not available. Next, we generated a filter node from feature select that kept only the important variables, decreasing the predictor variable count to 1170. We then ran auto classifier to help us determine which models would produce the best results. Auto classifier was never able to complete its run, again because of the large set of predictors we were using. However, when we aborted the run, it had produced partial results that we were able to apply. Auto classifier suggested the use of Logistic Regression, CHAID and C5.0. In addition, we chose to run a Two Step model to see whether clustering on the predictor variables produced useful results for the target.
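For readers without Modeler, a 50-30-20 partition of the 3752 rows can be sketched as a shuffled index split. The function name, the fixed seed and the assignment of the three shares to training, testing and validation (in that order) are assumptions made for this illustration; the report only states the 50-30-20 proportions.

```python
import random

def partition(n_rows, fractions=(0.5, 0.3, 0.2), seed=42):
    """Shuffle row indices and split them into training, testing and validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_rows)
    n_test = int(fractions[1] * n_rows)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # whatever rows remain
    return train, test, validation

train, test, validation = partition(3752)
print(len(train), len(test), len(validation))
```

Because the last group takes whatever remains after rounding down the first two, every row lands in exactly one group.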

**Feature Select Node:**


**LOGISTIC REGRESSION**

Our logistic model returned 84.14% prediction accuracy, the best result of any model we ran. However, the information returned by the model would probably have less real-world value than that of our decision tree models. Our confusion matrix results show 595 misclassifications.
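The reported accuracy and the misclassification count can be cross-checked with a line of arithmetic. The sketch below assumes the 595 misclassifications were counted over all 3752 rows of the data set, which the report does not state explicitly.

```python
total_rows = 3752      # rows in the data set, as given in the analysis
misclassified = 595    # misclassifications read from the confusion matrix

# Accuracy is the share of rows that were classified correctly.
accuracy = (total_rows - misclassified) / total_rows
print(f"{accuracy:.2%}")
```

Under that assumption the two figures agree, since 3157 correct predictions out of 3752 rows rounds to 84.14%.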

Looking at the first 15 variables, we see that the highest predictor importance value hovers around 2%. To try for a better model, we next turned to a Two-Step model.

Since descriptor variables may have similar qualities, we next tried a cluster model in the form of Two-Step. We tried to find records with similar properties and form them into groups. The results were poor, averaging a 50.2% misclassification rate. If we had more information about what the variables meant, we might have been able to put this model to better use. As with the logistic model, we did not find a strong correlation between the predictors and the target variable. Oddly, the number one predictor was the same in both clusters, which led us to believe that this was not an appropriate model to use.
**TWO STEP**


**CHAID**

We turned next to our first tree model, the CHAID model. The CHAID model gave us 78.5% accuracy and developed a useful set of rules for finding molecules that are likely to show a response. In the resulting tree, D27 (a flag variable) produces our first decision point, followed by D78 if D27 is true or D129 if D27 is false. We feel that this would be very useful for scientists in designing molecules that are predicted to show a response.
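To give a feel for how a CHAID-style tree chooses its first decision point, the sketch below computes a Pearson chi-square statistic between each flag predictor and the flag target, then picks the predictor with the largest statistic. This is a simplification: the full CHAID procedure also merges categories and works with adjusted p-values. The column names and data are hypothetical and invented for illustration; they are not the contest descriptors.

```python
def chi_square_2x2(predictor, target):
    """Pearson chi-square statistic for two 0/1 columns of equal length."""
    n = len(predictor)
    # Observed counts for each (predictor value, target value) cell.
    observed = {(p, t): 0 for p in (0, 1) for t in (0, 1)}
    for p, t in zip(predictor, target):
        observed[(p, t)] += 1
    stat = 0.0
    for p in (0, 1):
        for t in (0, 1):
            row = observed[(p, 0)] + observed[(p, 1)]
            col = observed[(0, t)] + observed[(1, t)]
            expected = row * col / n
            if expected > 0:
                stat += (observed[(p, t)] - expected) ** 2 / expected
    return stat

def best_split(flags, target):
    """Name of the flag column with the largest chi-square against target."""
    return max(flags, key=lambda name: chi_square_2x2(flags[name], target))

# Toy flag data, invented for illustration only.
activity = [1, 1, 1, 0, 0, 0, 1, 0]
flags = {
    "flag_a": [1, 1, 1, 0, 0, 0, 1, 1],
    "flag_b": [1, 0, 1, 0, 1, 0, 1, 0],
    "flag_c": [0, 1, 0, 1, 0, 1, 0, 1],
}

print(best_split(flags, activity))
```

The same test is then repeated within each branch, which is how a second-level split can differ depending on whether the first flag is true or false.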

In order to validate our CHAID results, we turned to our other tree model, C5.0. It actually outperformed our CHAID model, giving an 80.8% accuracy rate. Of note, our top predictor for eliciting a response is still D27, by far. While C5.0 did not outperform the logistic model on misclassification rate, we feel that it would be best suited for finding molecules that will show a biological response. Using the rule set that was developed, we feel it would be fairly easy to find molecules that would show a response with an 80% accuracy rate.
**C5.0**

**CONCLUSION**

Given our blindness regarding the actual meaning of our molecular descriptors, we feel that the rule set developed by our C5.0 model would be quite applicable for Boehringer Ingelheim to put to use. We would be very interested to learn what the predictors actually mean, as it would enable us to better understand how certain descriptor fields are related, perhaps allowing us to take more advantage of cluster modeling techniques. With more server power we could also skip feature select and perhaps develop an even more accurate model. However, we feel that the 80% accuracy rate achieved with our C5.0 model is quite deployable. Another interesting result is that the descriptor D27 has such a large effect on predicting a response. It would be interesting to see whether that descriptor has been used in building molecules in the past, or whether it might be an important discovery to be used in the future.

Bibliography:
Abraham, Donald J. //Burger’s Medicinal Chemistry and Drug Discovery, Sixth Edition, Volume 1: Drug Discovery.// 2003.
Boehringer Ingelheim Annual Report 2010, Pg. 64.
http://econsultancy.com/us/blog/9376-boehringer-ingelheim-partners-with-kaggle-to-crowdsource-scientific-problem
www.kaggle.com
http://www.moleculardescriptors.eu/dataset/dataset.htm

http://econsultancy.com/us/blog/9376-boehringer-ingelheim-partners-with-kaggle-to-crowdsource-scientific-problem
Boehringer Ingelheim Annual Report 2010, Pg. 64.
Abraham, Donald J. //Burger’s Medicinal Chemistry and Drug Discovery, Sixth Edition, Volume 1: Drug Discovery.// 2003. Page 2
Abraham, Donald J. //Burger’s Medicinal Chemistry and Drug Discovery, Sixth Edition, Volume 1: Drug Discovery.// 2003. Page 3
Abraham, Donald J. //Burger’s Medicinal Chemistry and Drug Discovery, Sixth Edition, Volume 1: Drug Discovery.// 2003. Page 4
Abraham, Donald J. //Burger’s Medicinal Chemistry and Drug Discovery, Sixth Edition, Volume 1: Drug Discovery.// 2003. Page 25