by Deepa Rao and Sujuan Zhao

Project Background

In the healthcare industry, it is vital to understand how a patient's condition is developing. A model that predicts accurately helps doctors diagnose whether a tumor is benign or malignant, which saves physicians time and improves their efficiency. This project identifies whether a breast cancer condition is benign or malignant. The prediction is based on the attributes given in the data set, which has 11 attributes. The results also help physicians decide which attributes are important, so the accuracy of the modeling is very important.

The term "benign" refers to a tumor, condition, or growth that is not cancerous. This means it is localized and has not spread (metastasized) to other parts of the body or invaded and destroyed nearby tissue. In general, a benign tumor or condition is not harmful, and benign tumors usually grow slowly. They can usually be removed, and in most cases they never come back. However, if a benign tumor is big enough, its size and weight can press on nearby organs, blood vessels, and nerves and thus cause problems.

The opposite of a benign tumor is a malignant tumor. Malignant tumors are cancer: the cancer cells can invade and damage tissues and organs near the tumor. Cancer cells can also break away from a malignant tumor and enter the lymphatic system or the bloodstream. This is how cancer spreads (metastasizes) from the original tumor to form new tumors in other parts of the body.

However, although benign tumors are mostly harmless, they cause more than 13,000 annual deaths in the USA, which can be compared to more than 500,000 annual deaths from cancer (malignant tumors).

Research Question
Create a model that predicts the breast cancer type, benign or malignant, based on a set of ten attributes.

Intent
The analysis is a classification problem, and the models used are:
  • K-Nearest Neighbor
  • Logistic regression
  • Classification trees
  • Neural Nets
  • Discriminant Analysis

Finally, the model chosen for our prediction will be the one with the highest accuracy among the models above.

Data description
No   Attribute                     Value
---  ----------------------------  -----------------------------
1.   Sample code number            ID number
2.   Clump Thickness               1 - 10
3.   Uniformity of Cell Size       1 - 10
4.   Uniformity of Cell Shape      1 - 10
5.   Marginal Adhesion             1 - 10
6.   Single Epithelial Cell Size   1 - 10
7.   Bare Nuclei                   1 - 10
8.   Bland Chromatin               1 - 10
9.   Normal Nucleoli               1 - 10
10.  Mitoses                       1 - 10
11.  Class                         2 for benign, 4 for malignant

Each value from 1 to 10 represents the layers of penetration of the cancer into these cells.

Number of Instances: 699

Class distribution:
Benign: 458 (65.5%)
Malignant: 241 (34.5%)
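
The percentages above follow directly from the class counts. A minimal sketch of that arithmetic in Python (the function name `class_distribution` is ours and is not part of the SPSS workflow used in this project):

```python
def class_distribution(counts):
    """Return each class's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Class counts taken from the data description above.
counts = {"Benign": 458, "Malignant": 241}
shares = class_distribution(counts)
print(shares)
```

One consequence worth keeping in mind: a model that always answers "benign" would already be right for about 65.5% of the instances, so the accuracies reported later should be read against that baseline.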

Attributes Description

Clump thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
Uniformity of cell size/shape: Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.
Marginal adhesion: Normal cells tend to stick together. Cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
Single epithelial cell size: This is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
Bare nuclei: This is a term used for nuclei that are not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumors.
Bland chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be coarser.
Normal nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.

Data Understanding:
For the initial data understanding process we performed the following:
  • Created the correlation matrix


                                  (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)   (10)
--------------------------------  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
(1)  Clump thickness              1.0
(2)  Uniformity of Cell Size      0.6   1.0
(3)  Uniformity of Cell Shape     0.7   0.9   1.0
(4)  Marginal Adhesion            0.5   0.7   0.7   1.0
(5)  Single Epithelial Cell Size  0.5   0.8   0.7   0.6   1.0
(6)  Bare Nuclei                  0.6   0.7   0.7   0.7   0.6   1.0
(7)  Bland Chromatin              0.6   0.8   0.7   0.7   0.6   0.7   1.0
(8)  Normal Nucleoli              0.5   0.7   0.7   0.6   0.6   0.6   0.7   1.0
(9)  Mitoses                      0.4   0.5   0.4   0.4   0.5   0.3   0.3   0.4   1.0
(10) Class_in_Number              0.7   0.8   0.8   0.7   0.7   0.8   0.8   0.7   0.4   1.0

The distinct finding here is that Mitoses does not appear to be a significant predictor, since its correlation with the class is only 0.4. There are also strong correlations among the predictors themselves (for example, 0.9 between Uniformity of Cell Size and Uniformity of Cell Shape). As the analysis moves ahead we have to look into this issue, as it could reduce the prediction accuracy.
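
For readers who want to reproduce this kind of matrix outside SPSS, the sketch below computes pairwise correlations in plain Python. It assumes the Pearson coefficient, which the report does not state explicitly, and the column values are made-up toy numbers for illustration, not rows from the data set. The names `pearson` and `correlation_matrix` are ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy, hypothetical values (NOT taken from the breast cancer data).
toy = {
    "Uniformity of Cell Size": [1, 1, 2, 8, 10, 7],
    "Mitoses": [1, 1, 1, 2, 1, 3],
    "Class_in_Number": [2, 2, 2, 4, 4, 4],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A predictor whose correlation with the class is low, as Mitoses is in the table above, is a candidate for closer inspection rather than automatic removal, since correlation only captures linear association.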
  • Using SPSS we performed a Data Audit, removed the missing values and coerced the outliers.
Breast_cancer_Outlier.png

Breast_cancer_Null.png


  • To determine the importance of all the variables we used feature selection. The results showed that all the predictors are important for the prediction. This finding contradicts the correlation output above for “Mitoses”. We will have to verify this further during analysis or model creation.
Feature_select_Cancer.png

Data collection.

Source: The Trusted Leader in Technical Computing
http://www.sgi.com/tech/mlc/db/breast.all
  • Perform a data audit to view the quality of the data.
  • Remove outliers and extremes based on the data audit results.
  • Convert the attributes into nominal variables, as all the attributes take values from 1 to 10.
  • Run the Feature Selection model to find the significance of the predictors.
  • Partition the data into training (60%) and validation (40%).
  • Boost the data so that the classes are evenly distributed.
  • Apply different classification models: K-NN, Logistic Regression, Neural Network, Discriminant Analysis and Classification Trees.
  • Choose the best model based on the highest accuracy, the lowest misclassification rate and the lift chart.
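
The partitioning and balancing steps in the list above were done in SPSS; the following is only a rough Python sketch of the same idea. It makes two assumptions that the report does not confirm: that "boosting" means duplicating randomly chosen rows of the smaller class until the classes are equal in size, and that balancing is applied to the training partition only. The function names and the placeholder rows are hypothetical.

```python
import random
from collections import Counter


def partition(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def balance_by_oversampling(rows, label_of, seed=42):
    """Duplicate random rows of the smaller classes until every class has the same count."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows (row_id, class) using the class counts from the data description.
rows = [(i, 2) for i in range(458)] + [(458 + i, 4) for i in range(241)]

train, validation = partition(rows)
balanced_train = balance_by_oversampling(train, label_of=lambda row: row[1])

print(len(train), len(validation))
print(Counter(row[1] for row in balanced_train))
```

Keeping the validation partition untouched means the reported validation accuracy still reflects the original class mix.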
According to the following, Uniformity of cell shape and cell nuclei are important:
Bare nuclei is the most important predictor for the neural network.

Model 1: Constructed for Breast Data (Some of the predictors are Clump_thickness, Uniformity_Cell_Size)


The basic model analysis was run using both continuous and nominal values. We found that the model has better accuracy when the values are defined as nominal.

Breast_Cancer_Model1.jpg


Comparison of different Classification models:
Breast_Cancer_Model1_Analysis.jpg

The analysis chart above shows that the C5 classification tree model has the highest accuracy in classifying the breast cancer as either benign or malignant, with an accuracy of nearly 96.1% for the validation data and 96.4% for the training data.
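
The accuracy figures in this comparison come from SPSS. As a reminder of what the number means, here is a small Python sketch that computes accuracy, misclassification rate and the confusion counts from actual and predicted class labels. The label lists are invented toy values, not model output from this project, and the function names are ours.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the predicted class equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_counts(actual, predicted, positive=4):
    """Count true/false positives and negatives, treating `positive` (4 = malignant) as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Toy labels for illustration only: 2 = benign, 4 = malignant.
actual = [2, 2, 2, 4, 4, 2, 4, 2, 2, 4]
predicted = [2, 2, 2, 4, 2, 2, 4, 2, 4, 4]

acc = accuracy(actual, predicted)
print("accuracy:", acc)
print("misclassification rate:", round(1 - acc, 4))
print(confusion_counts(actual, predicted))
```

In a diagnostic setting the false negatives (malignant cases predicted as benign) are usually the costlier error, so it can be worth looking at the confusion counts alongside the overall accuracy.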

Comparison of different models using lift chart


C5 Tree
Breast_Cancer_Model1_C5tree.jpg

From the C5 tree above, it is evident that the two most important predictors used to classify benign and malignant are Uniformity_Cell_Size and Bare_nuclei, so a scatter plot has been constructed between these two predictors.

Breast_Cancer_Model1_ScatterPlot.jpg
Model 2:

Upon further research we also found that measurements of the cell nuclei would also be sufficient to predict benign and malignant.

The predictors are as follows:

Ten real-valued features are computed for each cell nucleus:
 
    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" -
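
Of the features listed above, compactness is the one defined by an explicit formula, perimeter^2 / area - 1.0. The short Python sketch below evaluates it for two simple shapes to give a feel for the scale; the function name `compactness` and the example shapes are ours and are not taken from the data set.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle of radius 3: perimeter = 2*pi*r, area = pi*r^2.
circle = compactness(2 * math.pi * 3, math.pi * 3 ** 2)

# A square with side 2: perimeter = 8, area = 4.
square = compactness(8.0, 4.0)

print(round(circle, 3), square)
```

For a circle the ratio perimeter^2 / area is always 4*pi regardless of radius, which is the smallest value any shape can reach, so more irregular outlines give larger compactness values.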


The data has no missing values. The predictors measured for each nucleus help to find the target, benign or malignant. It is important to understand the correlation or importance of each attribute with respect to the target. We correlated the attributes with the target and found that all cell nucleus attributes are important for predicting the target variable except fractal dimension.

Breast_Cancer_Model2.jpg

Comparison of different Classification models
Breast_Cancer_Model2_Analysis.jpg



The analysis chart above shows that the C5 classification tree model has the highest accuracy in classifying the breast cancer as either benign or malignant, with an accuracy of nearly 97.64% for the validation data and 97.24% for the training data.

Classification of different models using lift chart

Breast_Cancer_Model2_liftChart.jpg

C5 Tree model
Breast_Cancer_Model2_C5tree.jpg
From the C5 tree above, it is evident that the two most important predictors used to classify benign and malignant are concave points and texture, so a scatter plot has been constructed between these two predictors.

Breast_Cancer_Model2_ScatterPlot.jpg


Recurrent and Non-recurrent Model

Breast_Cancer_Model4.jpg


Comparison of different Classification models

Breast_Cancer_Model4_Analysis.jpg

Analysis and Recommendation:

Although the first 2 models had good accuracy levels for the training data, their validation data predictions were not consistent with them; only the C5 model had the same accuracy level, 96%, for both validation and training data. Based on the tree analysis, uniformity of cell size and bare nuclei are the two most important predictors of the output.

If the uniformity of cell size is equal to 1, the model is 97% sure that the tumor is benign. For 2 and above, the bare nuclei measurements have to be taken into consideration:

  • If uniformity of cell size = 2 and bare nuclei is 1, 2, 3 or 4, the model predicts benign; if bare nuclei is more than 6, the model predicts malignant.
  • If uniformity of cell size = 3 and bare nuclei is 1 or 2, the model predicts benign; if bare nuclei is 3 or above, the model predicts malignant.
  • If uniformity of cell size is 4 or above, the model predicts malignant.
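
The rules above can be written down as a small function. This is a transcription of the rules as stated in the text, not an export of the actual C5 tree, and the function name is ours. The text does not say what the tree predicts when uniformity of cell size is 2 and bare nuclei is 5 or 6, so the sketch returns `None` for that case instead of guessing.

```python
def predict_from_tree_rules(uniformity_cell_size, bare_nuclei):
    """Apply the rules quoted in the text; both inputs are integers from 1 to 10."""
    for value in (uniformity_cell_size, bare_nuclei):
        if not 1 <= value <= 10:
            raise ValueError("attribute values must be between 1 and 10")

    if uniformity_cell_size == 1:
        return "benign"
    if uniformity_cell_size == 2:
        if bare_nuclei <= 4:
            return "benign"
        if bare_nuclei > 6:
            return "malignant"
        return None  # bare nuclei of 5 or 6: not covered by the rules in the text
    if uniformity_cell_size == 3:
        return "benign" if bare_nuclei <= 2 else "malignant"
    return "malignant"  # uniformity of cell size is 4 or above


print(predict_from_tree_rules(1, 10))  # benign
print(predict_from_tree_rules(2, 7))   # malignant
print(predict_from_tree_rules(3, 2))   # benign
print(predict_from_tree_rules(5, 1))   # malignant
```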
Model 2:
The analysis of Model 2 also gave C5 the highest accuracy compared to the other models, with 97% accuracy for both training and validation.


We also investigated whether the same set of predictors used for Model 2 can be used to predict the recurring and non-recurring status. However, the model accuracies were very low, with a maximum of 80% using k-NN. Hence we recommend that a different set of predictors be used to build a better model for predicting the recurring and non-recurring status.