We will consider the problem of early breast cancer detection from X-ray images. Specifically, given a candidate region of interest (ROI) from an X-ray image of a patient's breast, our goal is to predict whether the region corresponds to a malignant tumor (label 1) or is normal (label 0).
Each row of our data set corresponds to an ROI in a patient's X-ray, with columns 1-117 containing features computed using standard image processing algorithms. The last column contains the class label, based on either a radiologist's opinion or a biopsy. The data come from the KDD Cup 2008 challenge.
The data set contains 69,098 candidate ROIs, of which 409 are malignant.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV, LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
%matplotlib inline
df = pd.read_csv('bc_dataset.csv', header = None)
# Split roughly 75/25 into train and test with a reproducible random mask
np.random.seed(1)
msk = np.random.rand(len(df)) < 0.75
df_train = df[msk]
df_test = df[~msk]
df_train.head()
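Since malignant cases make up well under 1% of the data, it is worth confirming that both splits actually contain some before modeling:
print('train:\n', df_train.iloc[:, -1].value_counts())
print('\ntest:\n', df_test.iloc[:, -1].value_counts())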
X_train = df_train.iloc[:, :-1]
X_test = df_test.iloc[:, :-1]
y_train = df_train.iloc[:, -1]
y_test = df_test.iloc[:, -1]
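As an aside, a stratified split would guarantee that the rare positive class is represented proportionally in both sets. A minimal sketch using sklearn's train_test_split (not used in the rest of this analysis):
from sklearn.model_selection import train_test_split
# stratify on the label so both splits keep the same ~0.6% malignant fraction
X_tr_alt, X_te_alt, y_tr_alt, y_te_alt = train_test_split(
    df.iloc[:, :-1], df.iloc[:, -1],
    test_size = 0.25, random_state = 1, stratify = df.iloc[:, -1])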
logreg = LogisticRegressionCV(random_state=123).fit(X_train, y_train)
print('Train Accuracy:', accuracy_score(y_train, logreg.predict(X_train)))
print('Test Accuracy:', accuracy_score(y_test, logreg.predict(X_test)))
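A note before comparing against a baseline: with roughly 0.6% positives, a classifier can post a very high accuracy while missing most tumors. One standard mitigation (a sketch only, not used in the comparisons below) is to reweight the classes when fitting:
# class_weight='balanced' upweights the rare positive class during fitting;
# this typically trades a little accuracy for a higher true positive rate
logreg_bal = LogisticRegressionCV(random_state = 123, class_weight = 'balanced')
# logreg_bal.fit(X_train, y_train)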
dummy = DummyClassifier(strategy = 'constant', constant = 0).fit(X_train, y_train)
print('Train Accuracy:', accuracy_score(y_train, dummy.predict(X_train)))
print('Test Accuracy:', accuracy_score(y_test, dummy.predict(X_test)))
Let's first look at the absolute difference in correct classifications:
logit_acc = accuracy_score(y_test, logreg.predict(X_test))
dummy_acc = accuracy_score(y_test, dummy.predict(X_test))
print('logistic: ', logit_acc * len(df_test), '\ndummy: ', dummy_acc * len(df_test))
print('\ndifference: ', (logit_acc - dummy_acc) * len(df_test))
# Count the logit's false positives on the test set. The dummy classifier
# predicts 0 for everyone, so it gets every negative right; therefore
# (logit correct - dummy correct) = (logit TP - logit FP), and
# logit TP = difference + false positives, where the difference above is ~14.
false_pos = 0
for true, pred in zip(y_test, logreg.predict(X_test)):
    if true == 0 and pred == 1:
        false_pos += 1
true_pos = false_pos + 14
print("true positive: ", true_pos)
409 * .25
This means that our logistic regression correctly classified 20 people whom the dummy classifier did not. Since the dummy classifier only ever predicts label 0, these 20 people were correctly classified with label 1, meaning they have a malignant tumor.
Since the test set contains roughly 25% of the data, we would expect it to hold about 409 * .25 = 102.25 malignant tumors.
So our logistic regression underperformed by a large margin on the positive class, BUT it did correctly classify 20 people, who would now presumably get the care they need. On that basis alone we can declare it the better classifier.
def make_confusion_matrix(pred, y_value):
    # Returns a 2x2 array in the (nonstandard) layout
    # [[false_pos, true_pos],
    #  [true_neg, false_neg]]
    # i.e. columns are the true class (0, 1) and rows the predicted class (1, 0)
    false_neg, false_pos = 0, 0
    true_neg, true_pos = 0, 0
    for i, _ in enumerate(pred):
        if y_value[i] == 0:
            if pred[i] == 0:
                true_neg += 1
            else:
                false_pos += 1
        else:
            if pred[i] == 1:
                true_pos += 1
            else:
                false_neg += 1
    return np.array([[false_pos, true_pos], [true_neg, false_neg]])
dummy_confusion_matrix = make_confusion_matrix(list(dummy.predict(X_test)), y_test.tolist())
logreg_confusion_matrix = make_confusion_matrix(list(logreg.predict(X_test)), y_test.tolist())
print('Confusion matrices\n')
print('Dummy:\n', dummy_confusion_matrix)
print('\nLogistic Regression:\n', logreg_confusion_matrix)
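As a sanity check, sklearn's own confusion_matrix should agree with these counts, bearing in mind that it uses the standard [[TN, FP], [FN, TP]] layout rather than the custom layout above:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, logreg.predict(X_test)))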
logreg_true_pos = logreg_confusion_matrix[0][1]/(logreg_confusion_matrix[0][1] + logreg_confusion_matrix[1][1])
logreg_true_neg = logreg_confusion_matrix[1][0]/(logreg_confusion_matrix[1][0] + logreg_confusion_matrix[0][0])
dummy_true_pos = dummy_confusion_matrix[0][1]/(dummy_confusion_matrix[0][1] + dummy_confusion_matrix[1][1])
dummy_true_neg = dummy_confusion_matrix[1][0]/(dummy_confusion_matrix[1][0] + dummy_confusion_matrix[0][0])
print("logistic regression\n\ntrue positive: ", logreg_true_pos, '\ntrue negative: ',logreg_true_neg)
print("\ndummy classification\n\ntrue positive: ", dummy_true_pos, '\ntrue negative: ', dummy_true_neg)
In these confusion matrices the columns give the true class (0 on the left, 1 on the right) and the rows give the predicted class (1 on top, 0 on the bottom).
Although the true negative rate is better for the dummy classifier, this is due to the class imbalance in the data set, and in this case it does not outweigh the benefit of being able to correctly classify true positives. The logistic regression is the better model.
logreg_false_pos = logreg_confusion_matrix[0][0]/(logreg_confusion_matrix[1][0] + logreg_confusion_matrix[0][0])
dummy_false_pos = dummy_confusion_matrix[0][0]/(dummy_confusion_matrix[1][0] + dummy_confusion_matrix[0][0])
logreg_false_pos, dummy_false_pos
A higher false positive rate means that patients who do not have a malignant tumor will be told there is a possibility that they do; they will then either spend time and money on more invasive tests, or undergo treatment for a cancer they do not have.
roc_curve_pts = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
roc_curve_dummy = metrics.roc_curve(y_test, dummy.predict(X_test))
plt.plot(roc_curve_pts[0], roc_curve_pts[1], label = "logit")
plt.plot(roc_curve_dummy[0], roc_curve_dummy[1], label = 'dummy')
plt.title("ROC Curve\n")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
axes = plt.gca()
axes.spines['top'].set_visible(False)
axes.spines['right'].set_visible(False)
vals = [0, .1, .5, .9]
logit_fpr, logit_tpr, logit_thresh = roc_curve_pts
dummy_fpr, dummy_tpr, dummy_thresh = roc_curve_dummy
# For each target FPR, find the first point on the logit's ROC curve at or
# above it, then report the corresponding TPR and decision threshold
fpr_tup_list = []
for v in vals:
    fpr_tup = min([(i, fpr) for i, fpr in enumerate(logit_fpr) if v <= fpr])
    fpr_tup_list.append(fpr_tup)
print(fpr_tup_list, '\n')
for ind, _ in fpr_tup_list:
    print(logit_tpr[ind], logit_thresh[ind])
We want the operating point where the FPR is twice the FNR, i.e. FPR = 2*FNR, which (since FNR = 1 - TPR) implies FPR = 2(1 - TPR).
print('(FPR, TPR):', max([(fpr,logit_tpr[i]) for i, fpr in enumerate(logit_fpr) if 2*(1-logit_tpr[i])>=fpr]))
print('logit AUC:', metrics.roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))
print('dummy AUC:', metrics.roc_auc_score(y_test, dummy.predict(X_test)))
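The dummy classifier's AUC is 0.5, the chance diagonal. As an aside, one common heuristic for picking a single operating threshold from the ROC curve is to maximize Youden's J statistic (TPR - FPR); a minimal sketch using the arrays computed above:
# Youden's J picks the point on the ROC curve farthest above the diagonal
j = logit_tpr - logit_fpr
best = np.argmax(j)
print('threshold:', logit_thresh[best], 'TPR:', logit_tpr[best], 'FPR:', logit_fpr[best])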
df_missing = pd.read_csv('bc_dataset_missing.csv', index_col = 0)
df_missing.shape
df_dropped = df_missing.dropna(how = 'any')
df_dropped.shape
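Complete-case analysis can throw away a large share of the data, so the fraction of rows lost is worth checking:
print('fraction of rows dropped:', 1 - len(df_dropped) / len(df_missing))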
np.random.seed(9001)
msk = np.random.rand(len(df_dropped)) < 0.75
df_train_dropped = df_dropped[msk]
df_test_dropped = df_dropped[~msk]
X_train_dropped = df_train_dropped.drop('type', axis =1)
X_test_dropped = df_test_dropped.drop('type', axis = 1)
y_train_dropped = df_train_dropped['type']
y_test_dropped = df_test_dropped['type']
logreg_dropped = LogisticRegressionCV().fit(X_train_dropped, y_train_dropped)
print('Training: ', logreg_dropped.score(X_train_dropped, y_train_dropped))
print('Test: ', logreg_dropped.score(X_test_dropped, y_test_dropped))
dropped_conf_matrix = make_confusion_matrix(logreg_dropped.predict(X_test_dropped), y_test_dropped.reset_index(drop = True))
tp_dropped_rate = dropped_conf_matrix[0][1]/(dropped_conf_matrix[0][1] + dropped_conf_matrix[1][1])
tp_dropped_rate
df_fill_mean = df_missing.fillna(df_missing.mean())
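Note that filling with the full data set's column means leaks test-set statistics into training. sklearn's SimpleImputer does the same column-mean imputation and makes it easy to avoid the leak; a sketch (not used below, where we follow the simpler fillna approach):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
# proper usage: fit on the training features only, then transform both splits,
# e.g. imputer.fit(X_train) / imputer.transform(X_test)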
np.random.seed(9001)
msk = np.random.rand(len(df_fill_mean)) < 0.75
df_train_fill_mean = df_fill_mean[msk]
df_test_fill_mean = df_fill_mean[~msk]
X_train_fill_mean = df_train_fill_mean.drop('type', axis =1)
X_test_fill_mean = df_test_fill_mean.drop('type', axis = 1)
y_train_fill_mean = df_train_fill_mean['type']
y_test_fill_mean = df_test_fill_mean['type']
logreg_filled_mean = LogisticRegressionCV().fit(X_train_fill_mean,y_train_fill_mean)
print('train:', logreg_filled_mean.score(X_train_fill_mean, y_train_fill_mean))
print('test:', logreg_filled_mean.score(X_test_fill_mean, y_test_fill_mean))
filled_mean_conf_matrix = make_confusion_matrix(logreg_filled_mean.predict(X_test_fill_mean), y_test_fill_mean.reset_index(drop = True))
tp_filled_mean_rate = filled_mean_conf_matrix[0][1]/(filled_mean_conf_matrix[0][1] + filled_mean_conf_matrix[1][1])
tp_filled_mean_rate
# Separate the fully observed feature columns from those with missing values
full_columns = []
columns_missing_vals = []
for col in df_missing.iloc[:, :-1]:
    if df_missing[col].isna().any():
        columns_missing_vals.append(col)
    else:
        full_columns.append(col)
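As a quick check on how much of the feature matrix is available to drive the regressions:
print(len(full_columns), 'fully observed columns;', len(columns_missing_vals), 'columns with missing values')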
X_impute = df_missing[full_columns]
df_regress_filled = df_missing.copy()
for col in columns_missing_vals:
    missing = df_missing[col].isna()
    # Regress the column being imputed on the fully observed columns,
    # fitting on the rows where it is observed
    pred_x = X_impute[missing]
    X_regress = X_impute[~missing]
    y_regress = df_missing.loc[~missing, col]
    regress = LinearRegression().fit(X_regress, y_regress)
    y_fit = regress.predict(X_regress)
    y_hat = regress.predict(pred_x)
    # Add Gaussian noise scaled by the residual RMSE so the imputed values
    # are not artificially concentrated on the regression surface
    noise = np.random.normal(loc=0, scale=np.sqrt(metrics.mean_squared_error(y_regress, y_fit)), size=y_hat.shape[0])
    # Fill the missing entries with the noisy predictions
    df_regress_filled.loc[missing, col] = y_hat + noise
df_regress_filled.head()
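For reference, sklearn ships a similar round-robin regression imputer that could replace the manual loop above; its API is still marked experimental, hence the enabling import (a sketch only):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# sample_posterior=True also adds noise to the imputed values, much like
# the manual RMSE-scaled noise above
it_imputer = IterativeImputer(random_state = 0)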
# Use the same seed as the other two imputation experiments so all three
# methods are evaluated on the same train/test split
np.random.seed(9001)
msk = np.random.rand(len(df_regress_filled)) < 0.75
df_train_fill_reg = df_regress_filled[msk]
df_test_fill_reg = df_regress_filled[~msk]
X_train_fill_reg = df_train_fill_reg.drop('type', axis =1)
X_test_fill_reg = df_test_fill_reg.drop('type', axis = 1)
y_train_fill_reg = df_train_fill_reg['type']
y_test_fill_reg = df_test_fill_reg['type']
logreg_filled_reg = LogisticRegressionCV().fit(X_train_fill_reg,y_train_fill_reg)
print('train:', logreg_filled_reg.score(X_train_fill_reg, y_train_fill_reg))
print('test:', logreg_filled_reg.score(X_test_fill_reg, y_test_fill_reg))
reg_conf_matrix = make_confusion_matrix(logreg_filled_reg.predict(X_test_fill_reg), y_test_fill_reg.reset_index(drop = True))
tp_filled_reg_rate = reg_conf_matrix[0][1]/(reg_conf_matrix[0][1] + reg_conf_matrix[1][1])
tp_filled_reg_rate
Dropping all rows with missing entries yields the lowest TPR, mean imputation the second best, and imputation via linear regression the best.
We might expect the higher TPR from regression-based imputation, since it produces a more sophisticated estimate of each missing value: it can exploit correlations among the features rather than collapsing every missing entry to a single column mean. The drawback of imputing with linear regression is its higher computational cost. For this particular data set that is not an issue, but we could conceive of big-data settings where imputing via linear regression would not be practical.
Again, TPR is a better measure than accuracy alone when considering models for this problem, since we want to maximize the number of patients with breast cancer that our model detects as having breast cancer.