Prediction of Breast Cancer


**by Deepa Rao and Sujuan Zhao**


**Project Background**

In the healthcare industry, it is vital to understand how a patient's condition is developing. A model that predicts accurately will help doctors diagnose whether a tumor is benign or malignant, which saves physicians time and improves their efficiency. This project identifies whether a breast cancer condition is benign or malignant. The prediction is based on the attributes given in the data set, which has 11 attributes. These data will also help physicians decide which attributes are important, so the accuracy of the modeling is very important.


The term "benign" refers to a tumor, condition, or growth that is not cancerous. This means it is localized and has not spread (aka metastasized) to other parts of the body or invaded and destroyed nearby tissue. In general, a benign tumor or condition is not harmful, and benign tumors usually grow slowly. They can usually be removed and in most cases they never come back. However, if a benign tumor is big enough, its size and weight can press on nearby organs, blood vessels, and nerves and thus cause problems.

The opposite of a benign tumor is a malignant tumor. Malignant tumors are cancer: the cancer cells can invade and damage tissues and organs near the tumor. Cancer cells can also break away from a malignant tumor and enter the lymphatic system or the bloodstream. This is how cancer spreads from the original tumor to form new tumors in other parts of the body (aka metastasizes).

Although benign tumors are mostly harmless, they cause more than 13,000 deaths annually in the USA, compared with more than 500,000 annual deaths from cancer (malignant tumors).

**Research Question**

Creating a model to predict the breast cancer type, benign or malignant, based on a set of ten attributes.

**Intent**

The analysis is of the classification type, and the models used would be:
 * K-Nearest Neighbor
 * Logistic regression
 * Classification trees
 * Neural Nets
 * Discriminant Analysis

Finally, the model chosen for our prediction will be the one with the highest accuracy among the above models.

**Data description**

|| No || Attributes || Value ||
|| 1. || Sample code number || ID number ||
|| 2. || Clump Thickness || 1 - 10 ||
|| 3. || Uniformity of Cell Size || 1 - 10 ||
|| 4. || Uniformity of Cell Shape || 1 - 10 ||
|| 5. || Marginal Adhesion || 1 - 10 ||
|| 6. || Single Epithelial Cell Size || 1 - 10 ||
|| 7. || Bare Nuclei || 1 - 10 ||
|| 8. || Bland Chromatin || 1 - 10 ||
|| 9. || Normal Nucleoli || 1 - 10 ||
|| 10. || Mitoses || 1 - 10 ||
|| 11. || Class || 2 for benign, 4 for malignant ||

The values 1 to 10 represent the layers of penetration of the cancer into these cells.

Number of Instances: 699

**Class distribution:**

Benign: 458 (65.5%)
Malignant: 241 (34.5%)
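The class counts above also fix a useful reference point: a classifier that always predicts the majority class would already be right about two thirds of the time. The following is a small sketch of that arithmetic, using only the counts reported here; the variable names are illustrative.

```python
# Class counts as reported in the data description.
counts = {"benign": 458, "malignant": 241}
total = sum(counts.values())  # 699 instances

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# A classifier that always predicts the majority class sets the accuracy floor.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(shares)
print(f"majority baseline: {majority_label}, accuracy {baseline_accuracy:.3f}")
```

Any model accuracy reported later is best read against this baseline rather than against zero.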


**Attributes Description**

**Clump thickness:** Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.

**Uniformity of cell size/shape:** Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.

**Marginal adhesion:** Normal cells tend to stick together. Cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.

**Single epithelial cell size:** This is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.

**Bare nuclei:** This term is used for nuclei that are not surrounded by cytoplasm (the rest of the cell). They are typically seen in benign tumors.

**Bland Chromatin:** Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be coarser.

**Normal nucleoli:** Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.

**Data Understanding:**

For the initial data understanding process we performed the following:
 * Created the correlation matrix


||  || Clump thickness || Uniformity of Cell Size || Uniformity of Cell Shape || Marginal Adhesion || Single Epithelial Cell Size || Bare Nuclei || Bland Chromatin || Normal Nucleoli || Mitoses || Class_in_Number ||
|| Clump thickness || 1 ||  ||  ||  ||  ||  ||  ||  ||  ||  ||
|| Uniformity of Cell Size || 0.6 || 1.0 ||  ||  ||  ||  ||  ||  ||  ||  ||
|| Uniformity of Cell Shape || 0.7 || 0.9 || 1.0 ||  ||  ||  ||  ||  ||  ||  ||
|| Marginal Adhesion || 0.5 || 0.7 || 0.7 || 1.0 ||  ||  ||  ||  ||  ||  ||
|| Single Epithelial Cell Size || 0.5 || 0.8 || 0.7 || 0.6 || 1.0 ||  ||  ||  ||  ||  ||
|| Bare Nuclei || 0.6 || 0.7 || 0.7 || 0.7 || 0.6 || 1.0 ||  ||  ||  ||  ||
|| Bland Chromatin || 0.6 || 0.8 || 0.7 || 0.7 || 0.6 || 0.7 || 1.0 ||  ||  ||  ||
|| Normal Nucleoli || 0.5 || 0.7 || 0.7 || 0.6 || 0.6 || 0.6 || 0.7 || 1.0 ||  ||  ||
|| Mitoses || 0.4 || 0.5 || 0.4 || 0.4 || 0.5 || 0.3 || 0.3 || 0.4 || 1.0 ||  ||
|| Class_in_Number || 0.7 || 0.8 || 0.8 || 0.7 || 0.7 || 0.8 || 0.8 || 0.7 || **0.4** || 1.0 ||

The distinct finding here is that Mitoses does not appear to be a significant predictor, since its correlation with the class is only 0.4. There are also strong correlations among the predictors themselves (for example, 0.9 between Uniformity of Cell Size and Uniformity of Cell Shape). We have to look into this issue as the analysis moves ahead, because it could reduce the prediction accuracy.
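As a rough illustration of how such a correlation check can be automated, the sketch below computes Pearson correlations in plain Python and flags predictor pairs above a chosen threshold. The toy scores, the column names and the 0.8 threshold are assumptions made for the example; they are not the project data, and the original analysis was done in SPSS rather than with this code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def flag_collinear(columns, threshold=0.8):
    """Return predictor pairs whose absolute correlation is at or above threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs

# Hypothetical toy scores on the 1-10 scale, not taken from the project data.
toy = {
    "cell_size":  [1, 1, 2, 8, 10, 7, 1, 3],
    "cell_shape": [1, 2, 2, 7, 10, 8, 1, 3],
    "mitoses":    [1, 1, 1, 1, 3, 1, 2, 1],
}
print(flag_collinear(toy))
```

Pairs flagged this way are candidates for dropping or combining before models that are sensitive to correlated inputs, such as logistic regression or discriminant analysis, are fitted.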
 * Using SPSS we performed a Data Audit, removed the missing values and coerced the outliers. [[image:Breast_cancer_Outlier.png]]



 * To determine the importance of all the variables we used feature selection. The results showed that all the predictors are important for the prediction. This finding contradicts the correlation output above for “Mitoses”, so we would have to verify it further during analysis or model creation.


**Data collection.**

Source: The Trusted Leader in Technical Computing

According to the following, the Uniformity of cell shape and Cell nuclei are important:

Bare nuclei is the most important predictor for the Neural network.
 * Perform the data audit to view the quality of the data.
 * Remove outliers and extremes based on the data audit results.
 * Convert the attributes into nominal variables, as all the attributes have possible values of 1 to 10.
 * Run the Feature Selection model to find the significance of the predictors.
 * Partition the data into training (60%) and validation (40%).
 * Boost the data to obtain evenly distributed data quantities.
 * Apply different classification models: K-NN, Logistic Regression, Neural Network, Discriminant Analysis and Classification Trees.
 * Choose the best model based on the highest accuracy, the lowest misclassification rate and the lift chart.
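The partition and boost steps in the list above can be sketched in plain Python as follows. This is a minimal illustration, not the SPSS workflow used in the project: it assumes that "boost" means duplicating minority-class rows until the classes are even, it balances only the training partition (a design choice made here so the validation set keeps the original class mix), and the rows are placeholders that carry just the class label.

```python
import random
from collections import Counter

def stratified_split(rows, label_of, train_frac=0.6, seed=0):
    """Split rows into training and validation sets, keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, valid = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        cut = round(train_frac * len(members))
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

def oversample(rows, label_of, seed=0):
    """Duplicate random minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

def class_label(row):
    """Class column of a placeholder row: 2 = benign, 4 = malignant."""
    return row[1]

# Placeholder rows: only the class label is real; the index stands in for
# the nine predictor scores. Counts are those reported before any cleaning.
rows = [(i, 2) for i in range(458)] + [(i, 4) for i in range(458, 699)]

train, valid = stratified_split(rows, class_label)
train_balanced = oversample(train, class_label)

print(Counter(map(class_label, train)), Counter(map(class_label, valid)))
print(Counter(map(class_label, train_balanced)))
```

Splitting before balancing keeps duplicated rows out of the validation set, which would otherwise make the validation accuracy look better than it is.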

Model 1: Constructed for Breast Data (some of the predictors are Clump_thickness, Uniformity_Cell_Size)

The basic model analysis was run with the values defined as continuous and again as nominal. We found that the model has better accuracy when the values are defined as nominal.


Comparison of different Classification models:

The analysis chart above shows that the C5 classification tree model has the highest accuracy in classifying the breast cancer as either benign or malignant, at nearly 96.1% for validation data and 96.4% for training data.
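For reference, accuracy and misclassification rate of the kind quoted above can be derived from a confusion matrix as in the sketch below. The label lists are hypothetical and serve only to show the calculation; the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive=4):
    """Count true/false positives and negatives, treating `positive` as malignant."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    """Fraction of cases where the predicted class equals the actual class."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels (2 = benign, 4 = malignant), for illustration only.
actual    = [2, 2, 2, 4, 4, 2, 4, 2, 4, 2]
predicted = [2, 2, 4, 4, 4, 2, 2, 2, 4, 2]

acc = accuracy(actual, predicted)
print(f"accuracy={acc:.2f}, misclassification rate={1 - acc:.2f}")
print("false negatives (missed malignant):", confusion_counts(actual, predicted)[3])
```

Looking at false negatives separately matters in this setting, because a malignant tumor classified as benign is the more costly error.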

Comparison of different models using lift chart
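A lift chart plots how much better a model concentrates malignant cases at the top of its ranking than random selection would. The sketch below computes that ratio for a single cutoff; the scores and labels are hypothetical, and the function is an illustration of the idea rather than the procedure the modeling tool uses.

```python
def cumulative_lift(scores, actual, positive=4, fraction=0.2):
    """Lift in the top `fraction` of cases ranked by predicted malignancy score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(fraction * len(ranked)))
    hits_top = sum(label == positive for _, label in ranked[:top_n])
    base_rate = sum(label == positive for label in actual) / len(actual)
    return (hits_top / top_n) / base_rate

# Hypothetical model scores (higher = more likely malignant) and true classes.
scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]
actual = [4, 4, 2, 4, 2, 4, 2, 2, 2, 2]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, round(cumulative_lift(scores, actual, fraction=fraction), 2))
```

A lift of 1.0 at the full data set is expected by construction; models are compared by how far above 1.0 they stay at the smaller fractions.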

C5 Tree

From the above C5 tree, it is evident that the two most important predictors used to classify benign and malignant are Uniformity_Cell_Size and Bare_nuclei, so a scatter plot has been constructed between these two predictors.

Model 2:

Upon further research we also found that measurements of the cell nuclei would also be sufficient to predict benign and malignant.

The predictors are as follows:

|| Radius || Texture || Perimeter || Area || Smoothness || Compactness || Concavity || Concave points || Symmetry || Fractal Dimension ||

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
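To make the geometric definitions above concrete, here is a minimal sketch that derives radius, perimeter, area and compactness from a list of boundary points. It assumes the contour is given as ordered (x, y) points and uses the centroid of those points as the center; how the original data set located the center and traced the contour is not stated here, so treat these as illustrative choices.

```python
import math

def nucleus_shape_features(contour):
    """Compute radius, perimeter, area and compactness for a closed contour.

    `contour` is a list of (x, y) boundary points in order around the nucleus.
    """
    n = len(contour)
    cx = sum(x for x, _ in contour) / n   # centroid of the boundary points,
    cy = sum(y for _, y in contour) / n   # used here as the "center"
    radius = sum(math.hypot(x - cx, y - cy) for x, y in contour) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = contour[i]
        x2, y2 = contour[(i + 1) % n]      # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1    # shoelace formula
    area = abs(twice_area) / 2.0

    compactness = perimeter ** 2 / area - 1.0
    return {"radius": radius, "perimeter": perimeter,
            "area": area, "compactness": compactness}

# Hypothetical contour: 360 points on a circle of radius 5.
circle = [(5 * math.cos(math.radians(t)), 5 * math.sin(math.radians(t)))
          for t in range(360)]
features = nucleus_shape_features(circle)
print(features)
```

A circle gives the smallest possible compactness, close to 4π - 1, so larger values indicate a more irregular outline.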

The data has no missing values. The predictors computed for each nucleus help to predict the target, benign or malignant. It is important to understand the correlation or importance of each attribute with respect to the target. We therefore correlated the attributes with the target and found that all cell nucleus attributes are important for predicting the target variable except the fractal dimension.



Comparison of different Classification models

The analysis chart above shows that the C5 classification tree model has the highest accuracy in classifying the breast cancer as either benign or malignant, at nearly 97.64% for validation data and 97.24% for training data.

Classification of different models using lift chart



C5 Tree model

From the above C5 tree, it is evident that the two most important predictors used to classify benign and malignant are concave points and texture, so a scatter plot has been constructed between these two predictors.



Recurrent and Non-recurrent Model



Comparison of different Classification models



Analysis and Recommendation:

Although the first two models had good accuracy levels on the training data, their validation accuracy was not consistent with it; only the C5 model had the same accuracy level, about 96%, on both the validation and training data. Based on the tree analysis, uniformity of cell size and bare nuclei are the two most important predictors of the output.

If the uniformity of cell size is equal to 1, the model is 97% sure that the tumor is benign. For values of 2 and above, the bare nuclei measurement has to be taken into consideration. If:
 * Uniformity of cell size = 2 and bare nuclei is 1, 2, 3 or 4, the model predicts benign; if bare nuclei is more than 6, the model predicts malignant.
 * Uniformity of cell size = 3 and bare nuclei is 1 or 2, the model predicts benign; if bare nuclei is 3 or above, the model predicts malignant.
 * Uniformity of cell size is 4 or above, the model predicts malignant.

Model 2: The analysis of Model 2 also gave C5 the highest accuracy compared with the other models, at 97% for both training and validation.
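The Model 1 rules on uniformity of cell size and bare nuclei can be written down as a small function, which also makes their gaps visible. This is a sketch of the rules as stated in this write-up, not an export of the actual C5 tree; the function name is illustrative, and cases the text does not cover return `None` rather than a guessed class.

```python
def predict_from_tree_rules(cell_size, bare_nuclei):
    """Apply the reported Model 1 rules; inputs are scores from 1 to 10.

    Returns "benign", "malignant", or None where the write-up states no rule.
    """
    if cell_size == 1:
        return "benign"            # reported as about 97% confident
    if cell_size == 2:
        if bare_nuclei <= 4:
            return "benign"
        if bare_nuclei > 6:
            return "malignant"
        return None                # bare nuclei 5 or 6: not covered in the text
    if cell_size == 3:
        return "benign" if bare_nuclei <= 2 else "malignant"
    return "malignant"             # cell size 4 and above

for size, nuclei in [(1, 10), (2, 3), (2, 5), (2, 7), (3, 2), (3, 3), (6, 1)]:
    print(size, nuclei, predict_from_tree_rules(size, nuclei))
```

Writing the rules this way shows that a cell size of 2 with bare nuclei of 5 or 6 is left undecided by the description, which would need checking against the tree itself.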

We also found that the same set of predictors used for Model 2 can be used to predict recurring and non-recurring status. However, the model accuracies were very low, with a maximum of 80% using k-NN. Hence we recommend that a different set of predictors be used to obtain a better model for predicting recurring and non-recurring status.