We will consider the problem of early breast cancer detection from X-ray images. Specifically, given a candidate region of interest (ROI) from an X-ray image of a patient's breast, our goal is to predict whether the region corresponds to a malignant tumor (label 1) or is normal (label 0).
Each row of our data set corresponds to an ROI in a patient's X-ray, with columns 1-117 containing features computed using standard image processing algorithms. The last column contains the class label, based on either a radiologist's opinion or a biopsy. The data come from the KDD Cup 2008 challenge.
The data set contains 69,098 candidate ROIs, of which 409 are malignant.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV, LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
%matplotlib inline
df = pd.read_csv('bc_dataset.csv', header = None)
# Split roughly 75/25 into train and test with a reproducible random mask
np.random.seed(1)
msk = np.random.rand(len(df)) < 0.75
df_train = df[msk]
df_test = df[~msk]
df_train.head()
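Since malignant cases make up well under 1% of the data, it is worth confirming that both splits actually contain some before modeling:
print('train:\n', df_train.iloc[:, -1].value_counts())
print('\ntest:\n', df_test.iloc[:, -1].value_counts())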
X_train = df_train.iloc[:, :-1]
X_test = df_test.iloc[:, :-1]
y_train = df_train.iloc[:, -1]
y_test = df_test.iloc[:, -1]
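As an aside, a stratified split would guarantee that the rare positive class is represented proportionally in both sets. A minimal sketch using sklearn's train_test_split (not used in the rest of this analysis):
from sklearn.model_selection import train_test_split
# stratify on the label so both splits keep the same ~0.6% malignant fraction
X_tr_alt, X_te_alt, y_tr_alt, y_te_alt = train_test_split(
    df.iloc[:, :-1], df.iloc[:, -1],
    test_size = 0.25, random_state = 1, stratify = df.iloc[:, -1])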
logreg = LogisticRegressionCV(random_state=123).fit(X_train, y_train)
print('Train Accuracy:', accuracy_score(y_train, logreg.predict(X_train)))
print('Test Accuracy:', accuracy_score(y_test, logreg.predict(X_test)))
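A note before comparing against a baseline: with roughly 0.6% positives, a classifier can post a very high accuracy while missing most tumors. One standard mitigation (a sketch only, not used in the comparisons below) is to reweight the classes when fitting:
# class_weight='balanced' upweights the rare positive class during fitting;
# this typically trades a little accuracy for a higher true positive rate
logreg_bal = LogisticRegressionCV(random_state = 123, class_weight = 'balanced')
# logreg_bal.fit(X_train, y_train)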
dummy = DummyClassifier(strategy = 'constant', constant = 0).fit(X_train, y_train)
print('Train Accuracy:', accuracy_score(y_train, dummy.predict(X_train)))
print('Test Accuracy:', accuracy_score(y_test, dummy.predict(X_test)))
Let's first look at the absolute difference in correct classifications:
logit_acc = accuracy_score(y_test, logreg.predict(X_test))
dummy_acc = accuracy_score(y_test, dummy.predict(X_test))
print('logistic: ', logit_acc * len(df_test), '\ndummy: ', dummy_acc * len(df_test))
print('\ndifference: ', (logit_acc - dummy_acc) * len(df_test))
# Count the logit's false positives on the test set. The dummy classifier
# predicts 0 for everyone, so it gets every negative right; therefore
# (logit correct - dummy correct) = (logit TP - logit FP), and
# logit TP = difference + false positives, where the difference above is ~14.
false_pos = 0
for true, pred in zip(y_test, logreg.predict(X_test)):
    if true == 0 and pred == 1:
        false_pos += 1
true_pos = false_pos + 14
print("true positive: ", true_pos)
409 * .25
This means that our logistic regression correctly classified 20 people whom the dummy classifier did not. Since the dummy classifier only ever predicts label 0, these 20 people were correctly classified with label 1, meaning they have a malignant tumor.
Since the test set contains roughly 25% of the data, we would expect it to hold about 409 * .25 = 102.25 malignant tumors.
So our logistic regression underperformed by a large margin on the positive class, BUT it did correctly classify 20 people, who would now presumably get the care they need. On that basis alone we can declare it the better classifier.
def make_confusion_matrix(pred, y_value):
    # Returns a 2x2 array in the (nonstandard) layout
    # [[false_pos, true_pos],
    #  [true_neg, false_neg]]
    # i.e. columns are the true class (0, 1) and rows the predicted class (1, 0)
    false_neg, false_pos = 0, 0
    true_neg, true_pos = 0, 0
    for i, _ in enumerate(pred):
        if y_value[i] == 0:
            if pred[i] == 0:
                true_neg += 1
            else:
                false_pos += 1
        else:
            if pred[i] == 1:
                true_pos += 1
            else:
                false_neg += 1
    return np.array([[false_pos, true_pos], [true_neg, false_neg]])
dummy_confusion_matrix = make_confusion_matrix(list(dummy.predict(X_test)), y_test.tolist())
logreg_confusion_matrix = make_confusion_matrix(list(logreg.predict(X_test)), y_test.tolist())
print('Confusion matrices\n')
print('Dummy:\n', dummy_confusion_matrix)
print('\nLogistic Regression:\n', logreg_confusion_matrix)
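As a sanity check, sklearn's own confusion_matrix should agree with these counts, bearing in mind that it uses the standard [[TN, FP], [FN, TP]] layout rather than the custom layout above:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, logreg.predict(X_test)))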
logreg_true_pos = logreg_confusion_matrix[0][1]/(logreg_confusion_matrix[0][1] + logreg_confusion_matrix[1][1])
logreg_true_neg = logreg_confusion_matrix[1][0]/(logreg_confusion_matrix[1][0] + logreg_confusion_matrix[0][0])
dummy_true_pos = dummy_confusion_matrix[0][1]/(dummy_confusion_matrix[0][1] + dummy_confusion_matrix[1][1])
dummy_true_neg = dummy_confusion_matrix[1][0]/(dummy_confusion_matrix[1][0] + dummy_confusion_matrix[0][0])
print("logistic regression\n\ntrue positive: ", logreg_true_pos, '\ntrue negative: ',logreg_true_neg)
print("\ndummy classification\n\ntrue positive: ", dummy_true_pos, '\ntrue negative: ', dummy_true_neg)
In these confusion matrices the columns give the true class (0 on the left, 1 on the right) and the rows give the predicted class (1 on top, 0 on the bottom).
Although the true negative rate is better for the dummy classifier, this is due to the class imbalance in the data set, and in this case it does not outweigh the benefit of being able to correctly classify true positives. The logistic regression is the better model.
logreg_false_pos = logreg_confusion_matrix[0][0]/(logreg_confusion_matrix[1][0] + logreg_confusion_matrix[0][0])
dummy_false_pos = dummy_confusion_matrix[0][0]/(dummy_confusion_matrix[1][0] + dummy_confusion_matrix[0][0])
logreg_false_pos, dummy_false_pos
A higher false positive rate means that patients who do not have a malignant tumor will be told there is a possibility that they do; they will then either spend time and money on more invasive tests, or undergo treatment for a cancer they do not have.
roc_curve_pts = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
roc_curve_dummy = metrics.roc_curve(y_test, dummy.predict(X_test))
plt.plot(roc_curve_pts[0], roc_curve_pts[1], label = "logit")
plt.plot(roc_curve_dummy[0], roc_curve_dummy[1], label = 'dummy')
plt.title("ROC Curve\n")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
axes = plt.gca()
axes.spines['top'].set_visible(False)
axes.spines['right'].set_visible(False)
vals = [0, .1, .5, .9]
logit_fpr, logit_tpr, logit_thresh = roc_curve_pts
dummy_fpr, dummy_tpr, dummy_thresh = roc_curve_dummy
# For each target FPR, find the first point on the logit's ROC curve at or
# above it, then report the corresponding TPR and decision threshold
fpr_tup_list = []
for v in vals:
    fpr_tup = min([(i, fpr) for i, fpr in enumerate(logit_fpr) if v <= fpr])
    fpr_tup_list.append(fpr_tup)
print(fpr_tup_list, '\n')
for ind, _ in fpr_tup_list:
    print(logit_tpr[ind], logit_thresh[ind])
We want the operating point where the FPR is twice the FNR, i.e. FPR = 2*FNR, which (since FNR = 1 - TPR) implies FPR = 2(1 - TPR).
print('(FPR, TPR):', max([(fpr,logit_tpr[i]) for i, fpr in enumerate(logit_fpr) if 2*(1-logit_tpr[i])>=fpr]))
print('logit AUC:', metrics.roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))
print('dummy AUC:', metrics.roc_auc_score(y_test, dummy.predict(X_test)))
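The dummy classifier's AUC is 0.5, the chance diagonal. As an aside, one common heuristic for picking a single operating threshold from the ROC curve is to maximize Youden's J statistic (TPR - FPR); a minimal sketch using the arrays computed above:
# Youden's J picks the point on the ROC curve farthest above the diagonal
j = logit_tpr - logit_fpr
best = np.argmax(j)
print('threshold:', logit_thresh[best], 'TPR:', logit_tpr[best], 'FPR:', logit_fpr[best])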
df_missing = pd.read_csv('bc_dataset_missing.csv', index_col = 0)
df_missing.shape
df_dropped = df_missing.dropna(how = 'any')
df_dropped.shape
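Complete-case analysis can throw away a large share of the data, so the fraction of rows lost is worth checking:
print('fraction of rows dropped:', 1 - len(df_dropped) / len(df_missing))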
np.random.seed(9001)
msk = np.random.rand(len(df_dropped)) < 0.75
df_train_dropped = df_dropped[msk]
df_test_dropped = df_dropped[~msk]
X_train_dropped = df_train_dropped.drop('type', axis =1)
X_test_dropped = df_test_dropped.drop('type', axis = 1)
y_train_dropped = df_train_dropped['type']
y_test_dropped = df_test_dropped['type']
logreg_dropped = LogisticRegressionCV().fit(X_train_dropped, y_train_dropped)
print('Training: ', logreg_dropped.score(X_train_dropped, y_train_dropped))
print('Test: ', logreg_dropped.score(X_test_dropped, y_test_dropped))
dropped_conf_matrix = make_confusion_matrix(logreg_dropped.predict(X_test_dropped), y_test_dropped.reset_index(drop = True))
tp_dropped_rate = dropped_conf_matrix[0][1]/(dropped_conf_matrix[0][1] + dropped_conf_matrix[1][1])
tp_dropped_rate
df_fill_mean = df_missing.fillna(df_missing.mean())
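Note that filling with the full data set's column means leaks test-set statistics into training. sklearn's SimpleImputer does the same column-mean imputation and makes it easy to avoid the leak; a sketch (not used below, where we follow the simpler fillna approach):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
# proper usage: fit on the training features only, then transform both splits,
# e.g. imputer.fit(X_train) / imputer.transform(X_test)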
np.random.seed(9001)
msk = np.random.rand(len(df_fill_mean)) < 0.75
df_train_fill_mean = df_fill_mean[msk]
df_test_fill_mean = df_fill_mean[~msk]
X_train_fill_mean = df_train_fill_mean.drop('type', axis =1)
X_test_fill_mean = df_test_fill_mean.drop('type', axis = 1)
y_train_fill_mean = df_train_fill_mean['type']
y_test_fill_mean = df_test_fill_mean['type']
logreg_filled_mean = LogisticRegressionCV().fit(X_train_fill_mean,y_train_fill_mean)
print('train:', logreg_filled_mean.score(X_train_fill_mean, y_train_fill_mean))
print('test:', logreg_filled_mean.score(X_test_fill_mean, y_test_fill_mean))
filled_mean_conf_matrix = make_confusion_matrix(logreg_filled_mean.predict(X_test_fill_mean), y_test_fill_mean.reset_index(drop = True))
tp_filled_mean_rate = filled_mean_conf_matrix[0][1]/(filled_mean_conf_matrix[0][1] + filled_mean_conf_matrix[1][1])
tp_filled_mean_rate
# Separate the fully observed feature columns from those with missing values
full_columns = []
columns_missing_vals = []
for col in df_missing.iloc[:, :-1]:
    if df_missing[col].isna().any():
        columns_missing_vals.append(col)
    else:
        full_columns.append(col)
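As a quick check on how much of the feature matrix is available to drive the regressions:
print(len(full_columns), 'fully observed columns;', len(columns_missing_vals), 'columns with missing values')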
X_impute = df_missing[full_columns]
df_regress_filled = df_missing.copy()
for col in columns_missing_vals:
    missing = df_missing[col].isna()
    # Regress the column being imputed on the fully observed columns,
    # fitting on the rows where it is observed
    pred_x = X_impute[missing]
    X_regress = X_impute[~missing]
    y_regress = df_missing.loc[~missing, col]
    regress = LinearRegression().fit(X_regress, y_regress)
    y_fit = regress.predict(X_regress)
    y_hat = regress.predict(pred_x)
    # Add Gaussian noise scaled by the residual RMSE so the imputed values
    # are not artificially concentrated on the regression surface
    noise = np.random.normal(loc=0, scale=np.sqrt(metrics.mean_squared_error(y_regress, y_fit)), size=y_hat.shape[0])
    # Fill the missing entries with the noisy predictions
    df_regress_filled.loc[missing, col] = y_hat + noise
df_regress_filled.head()
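For reference, sklearn ships a similar round-robin regression imputer that could replace the manual loop above; its API is still marked experimental, hence the enabling import (a sketch only):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# sample_posterior=True also adds noise to the imputed values, much like
# the manual RMSE-scaled noise above
it_imputer = IterativeImputer(random_state = 0)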
# Use the same seed as the other two imputation experiments so all three
# methods are evaluated on the same train/test split
np.random.seed(9001)
msk = np.random.rand(len(df_regress_filled)) < 0.75
df_train_fill_reg = df_regress_filled[msk]
df_test_fill_reg = df_regress_filled[~msk]
X_train_fill_reg = df_train_fill_reg.drop('type', axis =1)
X_test_fill_reg = df_test_fill_reg.drop('type', axis = 1)
y_train_fill_reg = df_train_fill_reg['type']
y_test_fill_reg = df_test_fill_reg['type']
logreg_filled_reg = LogisticRegressionCV().fit(X_train_fill_reg,y_train_fill_reg)
print('train:', logreg_filled_reg.score(X_train_fill_reg, y_train_fill_reg))
print('test:', logreg_filled_reg.score(X_test_fill_reg, y_test_fill_reg))
reg_conf_matrix = make_confusion_matrix(logreg_filled_reg.predict(X_test_fill_reg), y_test_fill_reg.reset_index(drop = True))
tp_filled_reg_rate = reg_conf_matrix[0][1]/(reg_conf_matrix[0][1] + reg_conf_matrix[1][1])
tp_filled_reg_rate
Dropping all rows with missing entries yields the lowest TPR, mean imputation the second best, and imputation via linear regression the best.
We might expect the higher TPR from regression-based imputation, since it produces a more sophisticated estimate of each missing value: it can exploit correlations among the features rather than collapsing every missing entry to a single column mean. The drawback of imputing with linear regression is its higher computational cost. For this particular data set that is not an issue, but we could conceive of big-data settings where imputing via linear regression would not be practical.
Again, TPR is a better measure than accuracy alone when considering models for this problem, since we want to maximize the number of patients with breast cancer that our model detects as having breast cancer.