

** Mammogram Interpretation **

** Tuan Doan, Devin Rottiers, Andrea Tang **

**__Project Background__** A mammogram is an x-ray image of the breast used to check women (or men) for breast cancer, either periodically as screening in the absence of symptoms or as a diagnostic tool when symptoms are present. The x-ray images make it possible to detect tumors that cannot be seen or felt. It is recommended that most women and men perform regular breast self-exams (BSE) to find any changes in the size, shape, or feel of their breasts. Women over age 40 are advised to undergo yearly mammograms to check for changes. The exhibit below shows normal, benign cyst, and cancerous mammograms.

http://www.cancer.gov/cancertopics/screening/understanding-breast-changes/page6

Mammograms are currently the most effective method of breast cancer screening. However, as the above exhibit shows, a benign cyst can look similar to a cancerous tumor on a mammogram and may present with similar symptoms, so it is easy to see how these results could be misinterpreted. Annual screening mammograms miss up to 20% of breast cancers present at the time of screening. False negatives occur mainly because of breast density: the tissue that makes up the breast can have a density similar to that of tumors, making them harder to detect. Seventy percent of biopsies completed after an abnormal mammogram are said to turn out to be false positives. These occur mainly in younger women and in women with a personal or family history of breast cancer or breast cancer symptoms. A false positive mammogram can cause anxiety and other forms of psychological distress for the patient. Although it is important that healthcare professionals rule out potential cancer, it is extremely difficult for the patient to rely solely on her mammogram results and family history in the preliminary diagnosis of breast cancer.

**__ Business challenge __** A large share of the follow-up examinations and biopsies prompted by abnormal mammograms end in benign outcomes: it is stated that 70% of biopsies have benign outcomes and are therefore deemed unnecessary. Because of this high rate, several computer-aided diagnosis systems have been proposed. These systems aim to assist physicians in deciding whether to perform a breast biopsy or to schedule a short-term follow-up examination instead.

** __Business goal__ ** Our goal is to create a model that predicts the likelihood that a mass seen in a mammogram is benign or malignant, based on the characteristics of sample masses. The proposed model would supplement the current method of breast cancer screening and improve the accuracy of follow-up decisions on abnormal mammograms.

**__Procedure__** **Data Interpretation** The data set includes 961 instances of women who received biopsies after an abnormal mammogram. It includes 516 benign instances, and 445 malignant instances. There are 6 attributes (1 goal field, 1 non-predictive, 4 predictive attributes) which are listed as follows:

1) BI-RADS assessment: 1 to 5 (ordinal)
2) Age: patient's age in years (integer)
3) Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4) Mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5) Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6) Severity: benign=0 or malignant=1 (binominal)

// BI-RADS (Breast Imaging-Reporting and Data System): // BI-RADS refers to a number, based on presumed malignancy, that a radiologist assigns after interpreting a mammogram. The categories are listed below; the assessment field in the dataset uses values 1 to 5:

1) Negative
2) Benign Finding(s)
3) Probably Benign
4) Suspicious Abnormality
5) Highly Suggestive of Malignancy
6) Known Malignancy

// SHAPE //


** Round **

http://www.ispub.com/journal/the_internet_journal_of_surgery/volume_8_number_2/article_printable/palpable_breast_lesion_as_initial_manifestation_of_disseminated_renal_cell_carcinoma.html


** Oval **



http://img.medscape.com/fullsize/migrated/443/381/wh3026.fig9.jpg

http://www.scielo.br/img/fbpe/spmj/v118n2/n2a03f02.gif
** Lobular **


** Irregular **



http://img.medscape.com/pi/emed/ckb/radiology/336139-34

// MASS MARGIN //
Circumscribed: Well-defined
Microlobulated: Having many small lobes
Obscured: Partially hidden by tissue
Ill-defined: Blurry
Spiculated: Having small ‘needle-like’ sections

// DENSITY // Density refers to the density of the mass and the amount of fatty elements present. A mass (cyst or tumor) becomes more suspicious for breast cancer when it appears denser, because higher density suggests that the mass is composed of malignant cancer cells.

Some attribute values are missing, and we will discard the records that contain them. Because the dataset is large, this should not compromise the results of the model. The number of missing values per attribute is:


|| BI-RADS assessment || ** 2 ** ||
|| Age || ** 5 ** ||
|| Shape || ** 31 ** ||
|| Margin || ** 48 ** ||
|| Density || ** 76 ** ||
|| Severity || ** 0 ** ||
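Discarding incomplete records can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the sample rows are hypothetical, and both the comma-separated layout and the use of "?" as the missing-value marker are assumptions about the file format.

```python
# Hypothetical sample rows in the order: BI-RADS, Age, Shape, Margin, Density, Severity.
# The "?" missing-value marker is an assumption about the file format.
RAW_ROWS = [
    "5,67,3,5,3,1",
    "4,43,1,1,?,1",
    "5,58,4,5,3,1",
    "4,28,1,1,3,0",
    "5,74,1,5,?,1",
    "4,65,1,?,3,0",
]

FIELDS = ["bi_rads", "age", "shape", "margin", "density", "severity"]


def parse_row(line):
    """Return a dict of integer fields, or None if any value is missing."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS) or "?" in values:
        return None
    return dict(zip(FIELDS, (int(v) for v in values)))


def drop_incomplete(lines):
    """Keep only the rows in which every attribute is present."""
    parsed = (parse_row(line) for line in lines)
    return [row for row in parsed if row is not None]


complete = drop_incomplete(RAW_ROWS)
print(len(RAW_ROWS), "rows read,", len(complete), "rows kept")
```

Dropping whole records is the simplest policy; it is reasonable only while the discarded rows remain a small share of the data.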

** Data Preparation ** The data must be cleaned so that the models we test can reach higher predictive accuracy. Data preparation requires the following steps:

1. Noise removal: Discard outliers and handle missing data
2. Transform: Examine the distribution of the data and, if necessary, rescale skewed variables to standardized values (z-scores)
3. Partition: Divide the dataset into training and validation sets
4. Feature selection: Identify variables that are correlated with, and significant in predicting, cancer severity
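The transform and partition steps can be sketched as below. This is only an illustration under stated assumptions: the records are hypothetical (age, severity) pairs, and the 60/40 split and the fixed random seed are arbitrary choices, not values taken from the project.

```python
import random
import statistics


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def partition(records, train_fraction=0.6, seed=42):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (age, severity) pairs, used only to exercise the helpers.
records = [(67, 1), (58, 1), (28, 0), (57, 1), (76, 1),
           (42, 0), (36, 0), (60, 1), (54, 0), (52, 0)]

ages_z = z_scores([age for age, _ in records])
train, valid = partition(records)
print(len(train), "training records,", len(valid), "validation records")
```

Note that z-scores put variables on a common scale but do not by themselves remove skew; a strongly skewed variable may need a separate transformation first.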

** Modeling ** Our goal is to predict breast cancer severity as benign or malignant; therefore, we will build classification models and select those with the highest predictive accuracy. The candidate models are:
1. K-Nearest Neighbor
2. Logistic regression
3. Classification trees
4. Neural Nets
5. Discriminant Analysis

** Analyzing Results ** The section below describes the data mining methods we will use to analyze the data and predict the final outcome.

1. K-Nearest Neighbors: In this model, the training dataset is used to identify the k observations that are most similar to each test record. These neighboring records are used to classify the test record as benign or malignant. Classification errors will then be measured to determine the classification accuracy of the model.
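The idea can be illustrated with a small sketch. The feature values are hypothetical standardized (age, margin) pairs, and the choices of Euclidean distance and k = 3 are assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def error_rate(train, test, k=3):
    """Fraction of test records whose predicted class differs from the true one."""
    wrong = sum(1 for feats, label in test if knn_classify(train, feats, k) != label)
    return wrong / len(test)


# Hypothetical standardized (age, margin) features; 0 = benign, 1 = malignant.
train = [
    ((-1.2, -0.9), 0), ((-0.8, -1.1), 0), ((-0.5, -0.7), 0),
    ((0.9, 1.0), 1), ((1.1, 0.8), 1), ((0.6, 1.2), 1),
]
test = [((-1.0, -1.0), 0), ((1.0, 1.0), 1)]

print("validation error rate:", error_rate(train, test))
```

An odd k avoids tied votes in a two-class problem, and standardizing the predictors first keeps a wide-ranging variable such as age from dominating the distance.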

2. Logistic Regression: This classification method has two phases. First, it yields estimates of the probability that each record belongs to class 1 (corresponding to malignant). Next, we determine a cutoff value on these probabilities; records at or above the cutoff are classified as malignant. This model suits the problem because the outcome is binary.
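The two phases can be sketched as follows. The coefficients and intercept are hypothetical values chosen for illustration; they are not fitted to the mammography data.

```python
import math


def malignancy_probability(features, coefficients, intercept):
    """Phase 1: logistic estimate of P(class = 1 | features)."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, cutoff=0.5):
    """Phase 2: label the record malignant (1) when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients for standardized (age, margin); not fitted to real data.
COEFFICIENTS = (0.8, 1.1)
INTERCEPT = -0.2

p = malignancy_probability((0.5, 1.0), COEFFICIENTS, INTERCEPT)
print(round(p, 3), classify(p), classify(p, cutoff=0.9))
```

The cutoff is a design choice: lowering it flags more masses as malignant, which reduces false negatives at the cost of more false positives.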

3. Classification Trees (CART, C4.5, CHAID): This technique derives a set of rules for prediction. The trees separate observations into subgroups by repeatedly splitting on predictor values. The terminal nodes are marked 0 or 1, with 0 corresponding to benign and 1 to malignant. These splits create logical rules that are transparent and easy to understand.
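A single split of such a tree can be sketched as below. The example uses Gini impurity, the criterion used by CART, on one predictor; the data are hypothetical, and treating the margin codes as ordered values is a simplifying assumption made for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one predictor that minimizes weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    # The largest value is skipped because it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical margin codes (1 = circumscribed ... 5 = spiculated) and outcomes.
margins = [1, 1, 1, 2, 4, 4, 5, 5]
severity = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(margins, severity)
print(f"rule: if margin <= {threshold} then benign (0), else malignant (1)")
```

A full tree applies the same search recursively to each resulting subgroup, across all predictors, until a stopping rule is met.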

4. Neural Nets: In this model, the input data passes through layers of interconnected nodes in a neural network. The model then evaluates its own errors and adjusts its weights using the back-propagation method. This model is capable of “self-learning” and can achieve high predictive accuracy.
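Back propagation can be illustrated with a very small network. This is a sketch under stated assumptions, not the configuration the project would use: one hidden layer of two sigmoid units, a squared-error loss, a learning rate of 0.5, and hypothetical standardized (age, margin) inputs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_network(data, hidden=2, epochs=500, rate=0.5, seed=0):
    """Train a one-hidden-layer network with per-record back propagation.

    Returns (predict, history); history holds the mean squared error per epoch.
    """
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Each hidden unit has n_in weights plus a bias (stored last).
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    # The output unit has one weight per hidden unit plus a bias (stored last).
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1]) for ws in w_hidden]
        y = sigmoid(sum(w * v for w, v in zip(w_out[:-1], h)) + w_out[-1])
        return h, y

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            h, y = forward(x)
            error = y - target
            total += error ** 2
            # Propagate the error backwards: output delta first, then hidden deltas.
            delta_out = error * y * (1 - y)
            delta_hidden = [delta_out * w_out[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_out[j] -= rate * delta_out * h[j]
                for i in range(n_in):
                    w_hidden[j][i] -= rate * delta_hidden[j] * x[i]
                w_hidden[j][-1] -= rate * delta_hidden[j]
            w_out[-1] -= rate * delta_out
        history.append(total / len(data))

    return (lambda x: forward(x)[1]), history


# Hypothetical standardized (age, margin) inputs; 0 = benign, 1 = malignant.
data = [
    ((-1.2, -0.9), 0), ((-0.8, -1.1), 0), ((-0.5, -0.7), 0),
    ((0.9, 1.0), 1), ((1.1, 0.8), 1), ((0.6, 1.2), 1),
]

predict, history = train_network(data)
print("first-epoch error:", round(history[0], 3), "last-epoch error:", round(history[-1], 3))
```

The per-epoch error in `history` shows the network correcting itself over time; on real data it should be tracked on the validation set as well, to detect overfitting.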

5. Discriminant Analysis: The main purpose of this model is to classify masses as benign or malignant based on the statistical distance of an observation from each class average. A discriminant analysis produces classification scores that can be translated into class assignments or probabilities of class membership.
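The distance-based rule can be sketched for two predictors. The example assumes equal class priors and a covariance matrix pooled across both classes, and it measures statistical distance as the squared Mahalanobis distance; the (age, margin) records are hypothetical.

```python
def mean_vector(rows):
    """Per-column mean of a list of equal-length tuples."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]


def pooled_covariance(groups):
    """Pooled within-class covariance matrix for two predictors (2 x 2)."""
    total = sum(len(rows) for rows in groups)
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += d[i] * d[j]
    dof = total - len(groups)
    return [[c / dof for c in row] for row in cov]


def statistical_distance(x, mean, cov):
    """Squared Mahalanobis distance of x from a class mean."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mean[0], x[1] - mean[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))


def classify(x, class_rows):
    """Assign x to the class whose mean is closest in statistical distance."""
    cov = pooled_covariance(list(class_rows.values()))
    distances = {label: statistical_distance(x, mean_vector(rows), cov)
                 for label, rows in class_rows.items()}
    return min(distances, key=distances.get), distances


# Hypothetical (age, margin) records per class; 0 = benign, 1 = malignant.
classes = {
    0: [(35, 1), (42, 1), (50, 2), (38, 1), (45, 3)],
    1: [(62, 4), (70, 5), (58, 5), (66, 4), (75, 3)],
}

label, distances = classify((68, 5), classes)
print("predicted class:", label)
```

Unlike plain Euclidean distance, the statistical distance accounts for the differing scales of age and margin and for any correlation between them.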

**Data Mining Goals**
· Assist in the prediction of malignant tumors in abnormal mammograms
· Identify the characteristics that are the best predictors of the prevalence of breast cancer
· Achieve a high R-squared as an indication of the accuracy of our model

** Hypothesis ** The team’s hypothesis is that age and mass margin are key predictors in determining whether a mass is malignant or benign.