Cancer Classification Using Gene Expression

Introduction

We will build a classification model to distinguish between two types of cancer, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), using gene expression measurements. In the data, the response is coded so that 0 indicates the ALL class and 1 indicates the AML class.

We will use logistic regression to build a classification model for this data set, starting with the naive approach of a linear regression classifier to show why logistic regression is preferable.

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import seaborn as sns
from scipy import stats
# Compatibility shim: scipy removed stats.chisqprob, which older statsmodels versions still call
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
%matplotlib inline

Data Exploration

In [3]:
np.random.seed(9001)
df = pd.read_csv('dataset_gene.csv')
# Randomly assign each row to the training set (~50%) or the test set
msk = np.random.rand(len(df)) < 0.5
data_train = df[msk]
data_test = df[~msk]


data_train.head()
Out[3]:
Cancer_type AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at ... U48730_at U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at
0 0 -214 -153 -58 88 -295 -558 199 -176 252 ... 185 511 -125 389 -37 793 329 36 191 -37
5 0 -67 -93 84 25 -179 -323 -135 -127 -2 ... 48 224 60 194 -10 291 41 8 -2 -80
9 0 -476 -213 -18 301 -403 -394 -42 -144 98 ... 241 1214 127 255 50 1701 1108 61 525 -83
12 0 17 -229 79 218 -262 -404 325 -201 6 ... 225 1020 -109 209 -51 1434 255 53 545 -16
13 0 -144 -199 -157 132 -151 -347 -118 -24 126 ... 103 595 -12 36 26 208 113 -8 22 -22

5 rows × 7130 columns

In [4]:
df_train = data_train.copy()
df_train.head()
Out[4]:
Cancer_type AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at ... U48730_at U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at
0 0 -214 -153 -58 88 -295 -558 199 -176 252 ... 185 511 -125 389 -37 793 329 36 191 -37
5 0 -67 -93 84 25 -179 -323 -135 -127 -2 ... 48 224 60 194 -10 291 41 8 -2 -80
9 0 -476 -213 -18 301 -403 -394 -42 -144 98 ... 241 1214 127 255 50 1701 1108 61 525 -83
12 0 17 -229 79 218 -262 -404 325 -201 6 ... 225 1020 -109 209 -51 1434 255 53 545 -16
13 0 -144 -199 -157 132 -151 -347 -118 -24 126 ... 103 595 -12 36 26 208 113 -8 22 -22

5 rows × 7130 columns

In [5]:
df_test = data_test.copy()
df_test.head()

In [6]:
# Use Cancer_type as the index so the class labels travel with the rows
df_train = df_train.set_index('Cancer_type')
df_test = df_test.set_index('Cancer_type')


# Min-max scale each gene's expression values to [0, 1] within each split
df_train = (df_train - df_train.min(axis=0)) / (df_train.max(axis=0) - df_train.min(axis=0))
df_test = (df_test - df_test.min(axis=0)) / (df_test.max(axis=0) - df_test.min(axis=0))

df_train = df_train.sort_index()
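
Note that each split above is rescaled using its own column minima and maxima. An alternative that avoids using test-set statistics is to learn the scaling from the training split only and apply it to both splits; a minimal sketch with scikit-learn's MinMaxScaler (not used elsewhere in this notebook), starting from the raw splits, with the variable names here chosen just for illustration:

from sklearn.preprocessing import MinMaxScaler

# Learn each gene's min/max from the training split only
train_raw = data_train.set_index('Cancer_type')
test_raw = data_test.set_index('Cancer_type')

scaler = MinMaxScaler().fit(train_raw)
train_scaled = pd.DataFrame(scaler.transform(train_raw),
                            index=train_raw.index, columns=train_raw.columns)
test_scaled = pd.DataFrame(scaler.transform(test_raw),
                           index=test_raw.index, columns=test_raw.columns)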
In [7]:
for name, group in df_train[['D49818_at', 'M23161_at', 'hum_alu_at', 'AFFX-PheX-5_at', 'M15990_at']].groupby(by = df_train.index):
    plt.figure(figsize = (10, 6))
    sns.heatmap(group, cmap ="Blues")
    plt.show()
  

Of these five genes, M23161_at shows the clearest contrast in expression between the two classes, suggesting it may be the most useful of the five for classifying cancer type.

In [13]:
X_train = df_train
X_test = df_test
y_train = df_train.index
y_test = df_test.index

pca = PCA(n_components = 2)
X_train_pca = pca.fit_transform(X_train)


# Put the first two principal components in a DataFrame alongside the class labels
X_train_pca_df = pd.DataFrame(X_train_pca)
X_train_pca_df["Cancer_type"] = y_train.values

# Color points by class: blue for ALL (0), orange for AML (1)
X_train_pca_df["Cancer_type"] = X_train_pca_df["Cancer_type"].apply(lambda x: '#6da0f2' if x == 0 else '#f28a60')
plt.scatter(X_train_pca_df.iloc[:, 0], X_train_pca_df.iloc[:, 1], c=X_train_pca_df["Cancer_type"])

leg = plt.legend((0, 1))

# Recolor the legend markers to match the class colors
leg_colors = ['#6da0f2', '#f28a60']
for i, marker in enumerate(leg.legendHandles):
    marker.set_color(leg_colors[i])

A scatter plot of the first two principal components does not reveal a clear decision boundary between the two classes.

Linear Regression vs. Logistic Regression

The problem with interpreting the scores predicted by the linear regression model as the probability that a patient has the AML type cancer (the class coded as 1) is that linear regression can produce predictions outside the interval [0, 1], and values outside this range cannot be probabilities.
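
To make this concrete, here is a minimal sketch on synthetic data (not the gene expression set) contrasting an unbounded linear fit with the logistic (sigmoid) transform, which maps any real-valued score into (0, 1):

import numpy as np

# Synthetic 1-D example: an OLS fit to 0/1 labels yields fitted values outside [0, 1]
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50)
y = (x + 0.5 * rng.randn(50) > 0).astype(int)    # binary labels

slope, intercept = np.polyfit(x, y, 1)           # ordinary least squares line
linear_scores = intercept + slope * x
print(linear_scores.min(), linear_scores.max())  # extends below 0 and above 1

# The logistic function squashes any real score into (0, 1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(linear_scores).min(), sigmoid(linear_scores).max())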

In [14]:
X_train_constant = sm.add_constant(X_train['D29963_at'])
X_test_constant = sm.add_constant(X_test['D29963_at'])

linreg = OLS(y_train, X_train_constant).fit()

linreg.summary()
Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.100
Model: OLS Adj. R-squared: 0.070
Method: Least Squares F-statistic: 3.346
Date: Mon, 03 Sep 2018 Prob (F-statistic): 0.0773
Time: 00:14:14 Log-Likelihood: -20.501
No. Observations: 32 AIC: 45.00
Df Residuals: 30 BIC: 47.93
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.1313 0.157 0.834 0.411 -0.190 0.453
D29963_at 0.7509 0.410 1.829 0.077 -0.087 1.589
Omnibus: 14.047 Durbin-Watson: 0.287
Prob(Omnibus): 0.001 Jarque-Bera (JB): 3.932
Skew: 0.498 Prob(JB): 0.140
Kurtosis: 1.601 Cond. No. 5.43


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [15]:
# Threshold the OLS fitted values at 0.5 to obtain class predictions
lin_class_train_predictions = linreg.predict(X_train_constant).apply(lambda x: 1 if x > .5 else 0)
lin_class_test_predictions = linreg.predict(X_test_constant).apply(lambda x: 1 if x > .5 else 0)


def get_classification_rate(lin_class_predictions):
    # The true labels live in the Series index (Cancer_type), so compare
    # each prediction against its index value
    misclassifications = 0
    for true_label, prediction in zip(lin_class_predictions.index, lin_class_predictions.values):
        if true_label != prediction:
            misclassifications += 1
    class_rate = 1 - misclassifications / lin_class_predictions.shape[0]
    return class_rate
     
print("OLS binary classification\n")
print('training classification rate: ', get_classification_rate(lin_class_train_predictions))
print('testing classification rate: ', get_classification_rate(lin_class_test_predictions))
OLS binary classification

training classification rate:  0.71875
testing classification rate:  0.8536585365853658
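
Since the true labels are stored in the prediction Series' index here, the same rates could equivalently be computed with scikit-learn's accuracy_score (already imported above); a minimal sketch:

# Equivalent computation: the index holds the true Cancer_type labels
print(accuracy_score(lin_class_train_predictions.index, lin_class_train_predictions.values))
print(accuracy_score(lin_class_test_predictions.index, lin_class_test_predictions.values))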
In [16]:
# A very large C effectively disables regularization; fit_intercept=False because a
# constant column was already added to the design matrix
logreg = LogisticRegression(C = 100000, fit_intercept= False).fit(X_train_constant, y_train)

print('Logistic regression\n')
print('training score: ', logreg.score(X_train_constant, y_train))
print('test score: ',logreg.score(X_test_constant, y_test))
Logistic regression

training score:  0.71875
test score:  0.8292682926829268
In [17]:
plt.scatter(X_train['D29963_at'], linreg.predict(X_train_constant), label = 'ols')
plt.scatter(X_train['D29963_at'], logreg.predict_proba(X_train_constant)[:, 1], label = 'logreg')
plt.scatter(X_train['D29963_at'], y_train, label = 'cancer type')
plt.legend()
Out[17]:
<matplotlib.legend.Legend at 0x11e7e6f60>

Logistic Regression

In [18]:
X_train_const_mult = sm.add_constant(X_train)
X_test_const_mult = sm.add_constant(X_test)

logreg_model = LogisticRegression(C = 100000).fit(X_train_const_mult, y_train)
print("Training score: ", logreg_model.score(X_train_const_mult, y_train))
print("Test score: ", logreg_model.score(X_test_const_mult, y_test))
Training score:  1.0
Test score:  0.926829268292683
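
With 7,129 gene predictors and only a few dozen training observations, this near-unregularized fit separates the training data perfectly. LogisticRegressionCV (imported above but not otherwise used) could instead choose the regularization strength C by cross-validation; a minimal sketch, assuming the same X_train_const_mult / X_test_const_mult design matrices:

# Select C over a default grid of 10 values using 5-fold cross-validation
logreg_cv = LogisticRegressionCV(Cs=10, cv=5).fit(X_train_const_mult, y_train)
print("Chosen C:", logreg_cv.C_[0])
print("Training score:", logreg_cv.score(X_train_const_mult, y_train))
print("Test score:", logreg_cv.score(X_test_const_mult, y_test))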
In [19]:
def visualize_prob(model, x, y, ax):

    # Use the model to predict class probabilities for each point
    y_pred = model.predict_proba(x)
    
    # Separate the predictions on the label 1 and label 0 points
    ypos = y_pred[y==1]
    yneg = y_pred[y==0]
    
    # Count the number of label 1 and label 0 points
    npos = ypos.shape[0]
    nneg = yneg.shape[0]
    # Plot the probabilities on a vertical line at x = 0, 
    # with the positive points in blue and negative points in red
    pos_handle = ax.plot(np.zeros((npos,1)), ypos[:,1], 'bx', label = 'AML')
    neg_handle = ax.plot(np.zeros((nneg,1)), yneg[:,1], 'rx', label = 'ALL')
    
    # Line to mark prob 0.5
    ax.axhline(y = 0.5, color = 'k', linestyle = '--')
    
    # Add y-label and legend, do not display x-axis, set y-axis limit
    ax.set_ylabel('Probability of AML class')
    ax.legend(loc = 'best')
    ax.get_xaxis().set_visible(False)
    ax.set_ylim([0,1])
In [20]:
f, ax = plt.subplots(2, 1, figsize = (5, 10))

visualize_prob(model = logreg_model, x = X_train_const_mult,y =  y_train, ax = ax[0])
ax[0].set_title("Training")
visualize_prob(logreg_model, X_test_const_mult, y_test, ax[1])
ax[1].set_title("Test")
Out[20]:
Text(0.5,1,'Test')

For the points with predicted probabilities close to 0.5, the model assigns only a slightly higher probability to one class than to the other, so these are the classifications we are least confident about.

Significance of Coefficients

Let's see how many of the coefficients estimated by the multiple logistic regression are significantly different from zero at the 95% confidence level, using bootstrapped confidence intervals.

In [21]:
# Bootstrap the coefficient estimates: refit the (near-unregularized) logistic
# regression on 100 resampled versions of the training data
coef_boot = np.zeros((X_train_const_mult.shape[1], 100))

for i in range(100):
    boot_sample = np.random.choice(range(X_train_const_mult.shape[0]), size=X_train_const_mult.shape[0])
    X_train_boot = X_train_const_mult.values[boot_sample]
    y_train_boot = y_train.values[boot_sample]

    logreg_boot = LogisticRegression(C=100000, fit_intercept=False)
    logreg_boot.fit(X_train_boot, y_train_boot)
    coef_boot[:, i] = logreg_boot.coef_

#Check distribution of coef values 
#for row in coef_boot.T:
   # plt.hist(row)
    #plt.show()

    

def get_confidence_interval(coef_matrix):
    # Normal-approximation 95% interval for each coefficient:
    # bootstrap mean +/- 1.96 bootstrap standard deviations
    confidence_intervals = []
    for row in coef_matrix:
        upper = np.mean(row) + 1.96*np.std(row)
        lower = np.mean(row) - 1.96*np.std(row)
        confidence_intervals.append((lower, upper))
    return confidence_intervals


# Count how many coefficient intervals exclude zero
sig_coef = 0
for lower, upper in get_confidence_interval(coef_boot):
    if lower > 0 or upper < 0:
        sig_coef += 1
print(sig_coef)
1798
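
The intervals above use a normal approximation (bootstrap mean ± 1.96 bootstrap standard deviations). A percentile bootstrap interval is a common alternative that does not assume the bootstrap distribution is normal; a minimal sketch using the same coef_boot matrix:

# Percentile bootstrap: empirical 2.5% and 97.5% quantiles of each coefficient
lower = np.percentile(coef_boot, 2.5, axis=1)
upper = np.percentile(coef_boot, 97.5, axis=1)
print(np.sum((lower > 0) | (upper < 0)))   # coefficients whose interval excludes zero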

Dimensionality Reduction using PCA

PCA is a method for reducing the dimensionality of the data: it projects the predictors onto orthogonal components ordered by how much variance they explain. We will fit a logistic regression model on the smallest set of leading components that together account for 90% of the variance in the predictors.

In [22]:
#the number of components that explain 90% of variance 
n = 1
exp_var = 0 
while exp_var <.9:
    n+=1
    pca = PCA(n_components= n).fit(X_train, y_train)
    exp_var = np.sum(pca.explained_variance_ratio_)
print(n, "components")
24 components
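
As an aside, scikit-learn can select this number directly: passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch (the loop above already does the equivalent search by hand):

# Let PCA pick the number of components explaining at least 90% of the variance
pca_90 = PCA(n_components=0.9, svd_solver='full').fit(X_train)
print(pca_90.n_components_, "components")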
In [23]:
#transform data based on PCA for logit
X_train_pca = pca.transform(X_train)[:, :n] 
X_test_pca = pca.transform(X_test)[:, :n] 
In [24]:
# Fit a logistic regression model on the PCA-transformed features
logreg_pca = LogisticRegression(C=100000).fit(X_train_pca, y_train)
pca_xtrain_pred, pca_xtest_pred = logreg_pca.predict(X_train_pca), logreg_pca.predict(X_test_pca)

accuracy_score(y_train, pca_xtrain_pred), accuracy_score(y_test, pca_xtest_pred)
Out[24]:
(1.0, 0.926829268292683)
In [25]:
test_accuracies = np.zeros((30, 5))
training_accuracies = np.zeros((30, 5))

X_values = df.drop('Cancer_type', axis = 1)
y_values = df["Cancer_type"]



# 5-fold CV: within each fold, fit PCA on the fold's training portion and score
# logistic regressions built on the first n components, for n = 2..29
fold = 0
for x_ind, v_ind in KFold(n_splits = 5, shuffle = True, random_state = 9001).split(X_train.index):

    X_train_fd = X_values.iloc[x_ind,:]
    X_test_fd = X_values.iloc[v_ind,:]
    y_train_fd = y_values[x_ind]
    y_test_fd = y_values[v_ind]
    
    pca = PCA().fit(X_train_fd)
    X_train_fd_pca = pca.transform(X_train_fd)
    X_test_fd_pca = pca.transform(X_test_fd)
    for n in range(2, 30):
        logit_pca_cv = LogisticRegression(C = 100000).fit(X_train_fd_pca[:, :n], y_train_fd)
        training_accuracies[n, fold] = accuracy_score(y_train_fd, logit_pca_cv.predict(X_train_fd_pca[:, :n]))
        test_accuracies[n, fold] = accuracy_score(y_test_fd, logit_pca_cv.predict(X_test_fd_pca[:, :n]))
    fold +=1   
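
As a point of comparison, the same PCA-then-logistic-regression cross-validation can be written more compactly with scikit-learn's Pipeline and cross_val_score, which refit the PCA inside each training fold automatically. This is only a sketch (cv_accuracy is a hypothetical helper, not used elsewhere), operating on the X_values and y_values defined above:

from sklearn.pipeline import Pipeline

# Mean 5-fold CV accuracy for a given number of principal components
def cv_accuracy(n_components, X, y):
    pipe = Pipeline([
        ('pca', PCA(n_components=n_components)),
        ('logreg', LogisticRegression(C=100000)),
    ])
    return cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

# Example: score a few candidate component counts
# scores = {n: cv_accuracy(n, X_values, y_values) for n in range(2, 30)}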
In [26]:
mean_training_ac, mean_test_ac = np.mean(training_accuracies, axis = 1), np.mean(test_accuracies, axis =1)
In [27]:
# Select the smallest number of components with perfect mean training accuracy;
# the mean test accuracies (mean_test_ac) were computed above for comparison
for n in range(len(mean_training_ac)):
    if mean_training_ac[n] == 1:
        n_components = n
        break
n_components
Out[27]:
8
In [28]:
pca = PCA(n_components).fit(X_train, y_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

logit_pca_cv_fin = LogisticRegression(C = 100000).fit(X_train_pca[:, :n_components], y_train)
train_accuracy = accuracy_score(y_train, logit_pca_cv_fin.predict(X_train_pca[:, :n_components]))
# Only the cell's last expression is displayed: the test accuracy
accuracy_score(y_test, logit_pca_cv_fin.predict(X_test_pca[:, :n_components]))
Out[28]:
0.8292682926829268
In [29]:
f, axes = plt.subplots(1, 1, figsize = (5, 5))

visualize_prob(model = logit_pca_cv_fin, x = X_test_pca[:, :n_components], y =  y_test, ax = axes)

Summary:

With the logit model that uses all of the predictors, we risk finding spurious relationships between the predictor variables and our target, since there are far more predictors than observations.

Using PCA, we looked for a lower-dimensional representation of the predictors that retains predictive capability while being less vulnerable to overfitting.

We have also shown why logistic regression is better suited than linear regression to this kind of classification problem: its predictions lie between 0 and 1 and can be read as rough class probabilities.