Cancer Classification Using Gene Expression

Introduction

We will build a classification model to distinguish between two types of cancer, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), using gene expression measurements. In the data, the response is coded so that 0 indicates the ALL class and 1 indicates the AML class.

We will use logistic regression to build a classification model for this data set, starting with the naive approach of a linear regression classifier to show why logistic regression is preferable.

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import seaborn as sns
from scipy import stats
# Compatibility shim: scipy removed stats.chisqprob, which older statsmodels versions still call
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
%matplotlib inline

Data Exploration

In [3]:
np.random.seed(9001)
df = pd.read_csv('dataset_gene.csv')
# Randomly assign each row to the training set (~50%) or the test set
msk = np.random.rand(len(df)) < 0.5
data_train = df[msk]
data_test = df[~msk]


data_train.head()
Out[3]:
Cancer_type AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at ... U48730_at U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at
0 0 -214 -153 -58 88 -295 -558 199 -176 252 ... 185 511 -125 389 -37 793 329 36 191 -37
5 0 -67 -93 84 25 -179 -323 -135 -127 -2 ... 48 224 60 194 -10 291 41 8 -2 -80
9 0 -476 -213 -18 301 -403 -394 -42 -144 98 ... 241 1214 127 255 50 1701 1108 61 525 -83
12 0 17 -229 79 218 -262 -404 325 -201 6 ... 225 1020 -109 209 -51 1434 255 53 545 -16
13 0 -144 -199 -157 132 -151 -347 -118 -24 126 ... 103 595 -12 36 26 208 113 -8 22 -22

5 rows × 7130 columns

In [4]:
df_train = data_train.copy()
df_train.head()
Out[4]:
Cancer_type AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at ... U48730_at U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at
0 0 -214 -153 -58 88 -295 -558 199 -176 252 ... 185 511 -125 389 -37 793 329 36 191 -37
5 0 -67 -93 84 25 -179 -323 -135 -127 -2 ... 48 224 60 194 -10 291 41 8 -2 -80
9 0 -476 -213 -18 301 -403 -394 -42 -144 98 ... 241 1214 127 255 50 1701 1108 61 525 -83
12 0 17 -229 79 218 -262 -404 325 -201 6 ... 225 1020 -109 209 -51 1434 255 53 545 -16
13 0 -144 -199 -157 132 -151 -347 -118 -24 126 ... 103 595 -12 36 26 208 113 -8 22 -22

5 rows × 7130 columns

In [5]:
df_test = data_test.copy()
df_test.head()

In [6]:
# Use Cancer_type as the index so the class labels travel with the rows
df_train = df_train.set_index('Cancer_type')
df_test = df_test.set_index('Cancer_type')


# Min-max scale each gene's expression values to [0, 1] within each split
df_train = (df_train - df_train.min(axis=0)) / (df_train.max(axis=0) - df_train.min(axis=0))
df_test = (df_test - df_test.min(axis=0)) / (df_test.max(axis=0) - df_test.min(axis=0))

df_train = df_train.sort_index()
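
Note that each split above is rescaled using its own column minima and maxima. An alternative that avoids using test-set statistics is to learn the scaling from the training split only and apply it to both splits; a minimal sketch with scikit-learn's MinMaxScaler (not used elsewhere in this notebook), starting from the raw splits, with the variable names here chosen just for illustration:

from sklearn.preprocessing import MinMaxScaler

# Learn each gene's min/max from the training split only
train_raw = data_train.set_index('Cancer_type')
test_raw = data_test.set_index('Cancer_type')

scaler = MinMaxScaler().fit(train_raw)
train_scaled = pd.DataFrame(scaler.transform(train_raw),
                            index=train_raw.index, columns=train_raw.columns)
test_scaled = pd.DataFrame(scaler.transform(test_raw),
                           index=test_raw.index, columns=test_raw.columns)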
In [7]:
for name, group in df_train[['D49818_at', 'M23161_at', 'hum_alu_at', 'AFFX-PheX-5_at', 'M15990_at']].groupby(by = df_train.index):
    plt.figure(figsize = (10, 6))
    sns.heatmap(group, cmap ="Blues")
    plt.show()
  

Of these five genes, M23161_at shows the clearest contrast in expression between the two classes, suggesting it may be the most useful of the five for classifying cancer type.

In [13]:
X_train = df_train
X_test = df_test
y_train = df_train.index
y_test = df_test.index

pca = PCA(n_components = 2)
X_train_pca = pca.fit_transform(X_train)


# Put the first two principal components in a DataFrame alongside the class labels
X_train_pca_df = pd.DataFrame(X_train_pca)
X_train_pca_df["Cancer_type"] = y_train.values

# Color points by class: blue for ALL (0), orange for AML (1)
X_train_pca_df["Cancer_type"] = X_train_pca_df["Cancer_type"].apply(lambda x: '#6da0f2' if x == 0 else '#f28a60')
plt.scatter(X_train_pca_df.iloc[:, 0], X_train_pca_df.iloc[:, 1], c=X_train_pca_df["Cancer_type"])

leg = plt.legend((0, 1))

# Recolor the legend markers to match the class colors
leg_colors = ['#6da0f2', '#f28a60']
for i, marker in enumerate(leg.legendHandles):
    marker.set_color(leg_colors[i])

A scatter plot of the first two principal components does not reveal a clear decision boundary between the two classes.

Linear Regression vs. Logistic Regression

The problem with interpreting the scores predicted by the linear regression model as the probability that a patient has the AML type cancer (the class coded as 1) is that linear regression can produce predictions outside the interval [0, 1], and values outside this range cannot be probabilities.
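
To make this concrete, here is a minimal sketch on synthetic data (not the gene expression set) contrasting an unbounded linear fit with the logistic (sigmoid) transform, which maps any real-valued score into (0, 1):

import numpy as np

# Synthetic 1-D example: an OLS fit to 0/1 labels yields fitted values outside [0, 1]
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50)
y = (x + 0.5 * rng.randn(50) > 0).astype(int)    # binary labels

slope, intercept = np.polyfit(x, y, 1)           # ordinary least squares line
linear_scores = intercept + slope * x
print(linear_scores.min(), linear_scores.max())  # extends below 0 and above 1

# The logistic function squashes any real score into (0, 1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(linear_scores).min(), sigmoid(linear_scores).max())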

In [14]:
X_train_constant = sm.add_constant(X_train['D29963_at'])
X_test_constant = sm.add_constant(X_test['D29963_at'])

linreg = OLS(y_train, X_train_constant).fit()

linreg.summary()
Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.100
Model: OLS Adj. R-squared: 0.070
Method: Least Squares F-statistic: 3.346
Date: Mon, 03 Sep 2018 Prob (F-statistic): 0.0773
Time: 00:14:14 Log-Likelihood: -20.501
No. Observations: 32 AIC: 45.00
Df Residuals: 30 BIC: 47.93
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.1313 0.157 0.834 0.411 -0.190 0.453
D29963_at 0.7509 0.410 1.829 0.077 -0.087 1.589
Omnibus: 14.047 Durbin-Watson: 0.287
Prob(Omnibus): 0.001 Jarque-Bera (JB): 3.932
Skew: 0.498 Prob(JB): 0.140
Kurtosis: 1.601 Cond. No. 5.43


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [15]:
# Threshold the OLS fitted values at 0.5 to obtain class predictions
lin_class_train_predictions = linreg.predict(X_train_constant).apply(lambda x: 1 if x > .5 else 0)
lin_class_test_predictions = linreg.predict(X_test_constant).apply(lambda x: 1 if x > .5 else 0)


def get_classification_rate(lin_class_predictions):
    # The true labels live in the Series index (Cancer_type), so compare
    # each prediction against its index value
    misclassifications = 0
    for true_label, prediction in zip(lin_class_predictions.index, lin_class_predictions.values):
        if true_label != prediction:
            misclassifications += 1
    class_rate = 1 - misclassifications / lin_class_predictions.shape[0]
    return class_rate
     
print("OLS binary classification\n")
print('training classification rate: ', get_classification_rate(lin_class_train_predictions))
print('testing classification rate: ', get_classification_rate(lin_class_test_predictions))
OLS binary classification

training classification rate:  0.71875
testing classification rate:  0.8536585365853658
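
Since the true labels are stored in the prediction Series' index here, the same rates could equivalently be computed with scikit-learn's accuracy_score (already imported above); a minimal sketch:

# Equivalent computation: the index holds the true Cancer_type labels
print(accuracy_score(lin_class_train_predictions.index, lin_class_train_predictions.values))
print(accuracy_score(lin_class_test_predictions.index, lin_class_test_predictions.values))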
In [16]:
# A very large C effectively disables regularization; fit_intercept=False because a
# constant column was already added to the design matrix
logreg = LogisticRegression(C = 100000, fit_intercept= False).fit(X_train_constant, y_train)

print('Logistic regression\n')
print('training score: ', logreg.score(X_train_constant, y_train))
print('test score: ',logreg.score(X_test_constant, y_test))
Logistic regression

training score:  0.71875
test score:  0.8292682926829268
In [17]:
plt.scatter(X_train['D29963_at'], linreg.predict(X_train_constant), label = 'ols')
plt.scatter(X_train['D29963_at'], logreg.predict_proba(X_train_constant)[:, 1], label = 'logreg')
plt.scatter(X_train['D29963_at'], y_train, label = 'cancer type')
plt.legend()
Out[17]:
<matplotlib.legend.Legend at 0x11e7e6f60>

Logistic Regression

In [18]:
X_train_const_mult = sm.add_constant(X_train)
X_test_const_mult = sm.add_constant(X_test)

logreg_model = LogisticRegression(C = 100000).fit(X_train_const_mult, y_train)
print("Training score: ", logreg_model.score(X_train_const_mult, y_train))
print("Test score: ", logreg_model.score(X_test_const_mult, y_test))
Training score:  1.0
Test score:  0.926829268292683
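
With 7,129 gene predictors and only a few dozen training observations, this near-unregularized fit separates the training data perfectly. LogisticRegressionCV (imported above but not otherwise used) could instead choose the regularization strength C by cross-validation; a minimal sketch, assuming the same X_train_const_mult / X_test_const_mult design matrices:

# Select C over a default grid of 10 values using 5-fold cross-validation
logreg_cv = LogisticRegressionCV(Cs=10, cv=5).fit(X_train_const_mult, y_train)
print("Chosen C:", logreg_cv.C_[0])
print("Training score:", logreg_cv.score(X_train_const_mult, y_train))
print("Test score:", logreg_cv.score(X_test_const_mult, y_test))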
In [19]:
def visualize_prob(model, x, y, ax):

    # Use the model to predict class probabilities for each point
    y_pred = model.predict_proba(x)
    
    # Separate the predictions on the label 1 and label 0 points
    ypos = y_pred[y==1]
    yneg = y_pred[y==0]
    
    # Count the number of label 1 and label 0 points
    npos = ypos.shape[0]
    nneg = yneg.shape[0]
    # Plot the probabilities on a vertical line at x = 0, 
    # with the positive points in blue and negative points in red
    pos_handle = ax.plot(np.zeros((npos,1)), ypos[:,1], 'bx', label = 'AML')
    neg_handle = ax.plot(np.zeros((nneg,1)), yneg[:,1], 'rx', label = 'ALL')
    
    # Line to mark prob 0.5
    ax.axhline(y = 0.5, color = 'k', linestyle = '--')
    
    # Add y-label and legend, do not display x-axis, set y-axis limit
    ax.set_ylabel('Probability of AML class')
    ax.legend(loc = 'best')
    ax.get_xaxis().set_visible(False)
    ax.set_ylim([0,1])
In [20]:
f, ax = plt.subplots(2, 1, figsize = (5, 10))

visualize_prob(model = logreg_model, x = X_train_const_mult,y =  y_train, ax = ax[0])
ax[0].set_title("Training")
visualize_prob(logreg_model, X_test_const_mult, y_test, ax[1])
ax[1].set_title("Test")
Out[20]:
Text(0.5,1,'Test')

For the points with predicted probabilities close to 0.5, the model assigns only a slightly higher probability to one class than to the other, so these are the classifications we are least confident about.

Significance of Coefficients

Let's see how many of the coefficients estimated by the multiple logistic regression are significantly different from zero at the 95% confidence level, using bootstrapped confidence intervals.

In [21]:
# Bootstrap the coefficient estimates: refit the (near-unregularized) logistic
# regression on 100 resampled versions of the training data
coef_boot = np.zeros((X_train_const_mult.shape[1], 100))

for i in range(100):
    boot_sample = np.random.choice(range(X_train_const_mult.shape[0]), size=X_train_const_mult.shape[0])
    X_train_boot = X_train_const_mult.values[boot_sample]
    y_train_boot = y_train.values[boot_sample]

    logreg_boot = LogisticRegression(C=100000, fit_intercept=False)
    logreg_boot.fit(X_train_boot, y_train_boot)
    coef_boot[:, i] = logreg_boot.coef_

#Check distribution of coef values 
#for row in coef_boot.T:
   # plt.hist(row)
    #plt.show()

    

def get_confidence_interval(coef_matrix):
    # Normal-approximation 95% interval for each coefficient:
    # bootstrap mean +/- 1.96 bootstrap standard deviations
    confidence_intervals = []
    for row in coef_matrix:
        upper = np.mean(row) + 1.96*np.std(row)
        lower = np.mean(row) - 1.96*np.std(row)
        confidence_intervals.append((lower, upper))
    return confidence_intervals


# Count how many coefficient intervals exclude zero
sig_coef = 0
for lower, upper in get_confidence_interval(coef_boot):
    if lower > 0 or upper < 0:
        sig_coef += 1
print(sig_coef)
1798
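
The intervals above use a normal approximation (bootstrap mean ± 1.96 bootstrap standard deviations). A percentile bootstrap interval is a common alternative that does not assume the bootstrap distribution is normal; a minimal sketch using the same coef_boot matrix:

# Percentile bootstrap: empirical 2.5% and 97.5% quantiles of each coefficient
lower = np.percentile(coef_boot, 2.5, axis=1)
upper = np.percentile(coef_boot, 97.5, axis=1)
print(np.sum((lower > 0) | (upper < 0)))   # coefficients whose interval excludes zero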

Dimensionality Reduction using PCA

PCA is a method for reducing the dimensionality of the data: it projects the predictors onto orthogonal components ordered by how much variance they explain. We will fit a logistic regression model on the smallest set of leading components that together account for 90% of the variance in the predictors.

In [22]:
#the number of components that explain 90% of variance 
n = 1
exp_var = 0 
while exp_var <.9:
    n+=1
    pca = PCA(n_components= n).fit(X_train, y_train)
    exp_var = np.sum(pca.explained_variance_ratio_)
print(n, "components")
24 components
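
As an aside, scikit-learn can select this number directly: passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch (the loop above already does the equivalent search by hand):

# Let PCA pick the number of components explaining at least 90% of the variance
pca_90 = PCA(n_components=0.9, svd_solver='full').fit(X_train)
print(pca_90.n_components_, "components")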
In [23]:
#transform data based on PCA for logit
X_train_pca = pca.transform(X_train)[:, :n] 
X_test_pca = pca.transform(X_test)[:, :n] 
In [24]:
# Fit a logistic regression model on the PCA-transformed features
logreg_pca = LogisticRegression(C=100000).fit(X_train_pca, y_train)
pca_xtrain_pred, pca_xtest_pred = logreg_pca.predict(X_train_pca), logreg_pca.predict(X_test_pca)

accuracy_score(y_train, pca_xtrain_pred), accuracy_score(y_test, pca_xtest_pred)
Out[24]:
(1.0, 0.926829268292683)
In [25]:
test_accuracies = np.zeros((30, 5))
training_accuracies = np.zeros((30, 5))

X_values = df.drop('Cancer_type', axis = 1)
y_values = df["Cancer_type"]



# 5-fold CV: within each fold, fit PCA on the fold's training portion and score
# logistic regressions built on the first n components, for n = 2..29
fold = 0
for x_ind, v_ind in KFold(n_splits = 5, shuffle = True, random_state = 9001).split(X_train.index):

    X_train_fd = X_values.iloc[x_ind,:]
    X_test_fd = X_values.iloc[v_ind,:]
    y_train_fd = y_values[x_ind]
    y_test_fd = y_values[v_ind]
    
    pca = PCA().fit(X_train_fd)
    X_train_fd_pca = pca.transform(X_train_fd)
    X_test_fd_pca = pca.transform(X_test_fd)
    for n in range(2, 30):
        logit_pca_cv = LogisticRegression(C = 100000).fit(X_train_fd_pca[:, :n], y_train_fd)
        training_accuracies[n, fold] = accuracy_score(y_train_fd, logit_pca_cv.predict(X_train_fd_pca[:, :n]))
        test_accuracies[n, fold] = accuracy_score(y_test_fd, logit_pca_cv.predict(X_test_fd_pca[:, :n]))
    fold +=1   
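
As a point of comparison, the same PCA-then-logistic-regression cross-validation can be written more compactly with scikit-learn's Pipeline and cross_val_score, which refit the PCA inside each training fold automatically. This is only a sketch (cv_accuracy is a hypothetical helper, not used elsewhere), operating on the X_values and y_values defined above:

from sklearn.pipeline import Pipeline

# Mean 5-fold CV accuracy for a given number of principal components
def cv_accuracy(n_components, X, y):
    pipe = Pipeline([
        ('pca', PCA(n_components=n_components)),
        ('logreg', LogisticRegression(C=100000)),
    ])
    return cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

# Example: score a few candidate component counts
# scores = {n: cv_accuracy(n, X_values, y_values) for n in range(2, 30)}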
In [26]:
mean_training_ac, mean_test_ac = np.mean(training_accuracies, axis = 1), np.mean(test_accuracies, axis =1)
In [27]:
# Select the smallest number of components with perfect mean training accuracy;
# the mean test accuracies (mean_test_ac) were computed above for comparison
for n in range(len(mean_training_ac)):
    if mean_training_ac[n] == 1:
        n_components = n
        break
n_components
Out[27]:
8
In [28]:
pca = PCA(n_components).fit(X_train, y_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

logit_pca_cv_fin = LogisticRegression(C = 100000).fit(X_train_pca[:, :n_components], y_train)
train_accuracy = accuracy_score(y_train, logit_pca_cv_fin.predict(X_train_pca[:, :n_components]))
# Only the cell's last expression is displayed: the test accuracy
accuracy_score(y_test, logit_pca_cv_fin.predict(X_test_pca[:, :n_components]))
Out[28]:
0.8292682926829268
In [29]:
f, axes = plt.subplots(1, 1, figsize = (5, 5))

visualize_prob(model = logit_pca_cv_fin, x = X_test_pca[:, :n_components], y =  y_test, ax = axes)

Summary:

With the logit model that uses all of the predictors, we risk finding spurious relationships between the predictor variables and our target, since there are far more predictors than observations.

Using PCA, we looked for a lower-dimensional representation of the predictors that retains predictive capability while being less vulnerable to overfitting.

We have also shown why logistic regression is better suited than linear regression to this kind of classification problem: its predictions lie between 0 and 1 and can be read as rough class probabilities.