Predicting Loan Applicant's Creditworthiness

Problem Summary

We will classify loan applications as creditworthy or non-creditworthy using data on past applications and their assigned statuses. This training data is contained in the credit-data-training dataset and includes the features associated with each application that we will use to train the models. We will then use the trained models to predict the status of applications whose outcome has not yet been determined; these applications are found in the customers-to-score dataset.

In [33]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, r2_score, confusion_matrix, accuracy_score
from scipy import stats
In [3]:
df = pd.read_csv('credit-data-training.csv')
df_to_score = pd.read_csv('customers-to-score.csv')
df.columns = df.columns.str.lower().str.replace('-', '_')
df_to_score.columns = df_to_score.columns.str.lower().str.replace('-', '_')
df_to_score.columns
Out[3]:
Index(['account_balance', 'duration_of_credit_month',
       'payment_status_of_previous_credit', 'purpose', 'credit_amount',
       'value_savings_stocks', 'length_of_current_employment',
       'instalment_per_cent', 'guarantors', 'duration_in_current_address',
       'most_valuable_available_asset', 'age_years', 'concurrent_credits',
       'type_of_apartment', 'no_of_credits_at_this_bank', 'occupation',
       'no_of_dependents', 'telephone', 'foreign_worker'],
      dtype='object')

The output above lists the features in the scoring dataset; the training dataset contains the same features plus the credit_application_result label.

Data Exploration and Feature Selection

In [4]:
df.isna().describe().T
Out[4]:
                                   count unique    top freq
credit_application_result            500      1  False  500
account_balance                      500      1  False  500
duration_of_credit_month             500      1  False  500
payment_status_of_previous_credit    500      1  False  500
purpose                              500      1  False  500
credit_amount                        500      1  False  500
value_savings_stocks                 500      1  False  500
length_of_current_employment         500      1  False  500
instalment_per_cent                  500      1  False  500
guarantors                           500      1  False  500
duration_in_current_address          500      2   True  344
most_valuable_available_asset        500      1  False  500
age_years                            500      2  False  488
concurrent_credits                   500      1  False  500
type_of_apartment                    500      1  False  500
no_of_credits_at_this_bank           500      1  False  500
occupation                           500      1  False  500
no_of_dependents                     500      1  False  500
telephone                            500      1  False  500
foreign_worker                       500      1  False  500
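For each column, top is the most frequent value of isna() and freq is how often it occurs: duration_in_current_address is missing 344 of its 500 values, and age_years is missing 12 (500 - 488). As a cross-check, we can also read the raw missing counts directly; a minimal sketch:

In [ ]:
# count missing values per column and show only the columns that have any
missing = df.isna().sum()
print(missing[missing > 0])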

Feature Preprocessing

We will drop features with low variability, i.e. features where nearly every application has the same value. We will also drop features with a high proportion of missing values, and impute the remaining missing values using the median, which is robust to outliers.
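As a complement to the bar charts below, we can also compute the share of observations taken up by each column's single most common value; the following is a minimal sketch (the 0.9 cut-off is illustrative, not a rule used elsewhere in this analysis):

In [ ]:
# share of rows accounted for by each column's most common value;
# columns dominated by one value carry little information for the model
dominant_share = df.drop('credit_application_result', axis = 1).apply(
    lambda col: col.value_counts(normalize = True).iloc[0])
print(dominant_share[dominant_share > 0.9].sort_values(ascending = False))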

In [5]:
f, axs = plt.subplots(5, 4, figsize = (15, 15))
plt.subplots_adjust(hspace = .5)

# bar chart of the value counts for every column, laid out on a 5x4 grid
i, j = 0, 0
for col in df.columns:
    s = df[col].value_counts()
    axs[i, j].bar(list(s.index.values), s.values)
    axs[i, j].set_title(col)
    if j == 3 and i == 0:
        axs[i, j].tick_params(rotation = -10)
    if j == 0 and i == 1:
        axs[i, j].tick_params(rotation = -12)
    j += 1
    if j % 4 == 0:
        i += 1
        j = 0

The bar charts above show that the following features have low variability (a single value dominates), so they were removed as candidate features for the model:

guarantors, no_of_dependents, occupation, concurrent_credits, and foreign_worker

The duration_in_current_address column is missing 344 of its 500 values, leaving only 156 observations with a value. Due to this high proportion of missing values we drop this feature as well.

The age_years column also has missing values. Because the number of missing values is small (12), we can impute them with the median. We use the median rather than the mean because the mean is sensitive to outliers and the age distribution is skewed, so imputing with the mean would bias our results.

Finally, we also remove telephone, since it is unlikely to be a good predictor of creditworthiness.

In summary, we have removed the following columns:

guarantors, no_of_dependents, occupation, concurrent_credits, foreign_worker, telephone and duration_in_current_address

In [6]:
# drop the same columns from both the training and scoring datasets so their features stay aligned
cols_to_drop = ['duration_in_current_address', 'occupation', 'telephone', 'concurrent_credits',
                'guarantors', 'foreign_worker', 'no_of_dependents']
df = df.drop(cols_to_drop, axis = 1)
df_to_score = df_to_score.drop(cols_to_drop, axis = 1)

# impute missing ages with the median
df.age_years = df.age_years.fillna(df.age_years.median())
df.columns
df.columns
Out[6]:
Index(['credit_application_result', 'account_balance',
       'duration_of_credit_month', 'payment_status_of_previous_credit',
       'purpose', 'credit_amount', 'value_savings_stocks',
       'length_of_current_employment', 'instalment_per_cent',
       'most_valuable_available_asset', 'age_years', 'type_of_apartment',
       'no_of_credits_at_this_bank'],
      dtype='object')

Correlation

We next check our continuous features to ensure that we do not have high levels of multicollinearity. Pairwise correlations below 0.7 are considered acceptable.

In [7]:
plt.figure(figsize = (10, 8))
corr = df.corr()
sns.heatmap(corr)
[Figure: correlation heatmap of the numeric features]

The correlation matrix above shows that none of our numeric variables has a pairwise correlation greater than 0.7 with any other, so multicollinearity is not a concern and we do not need to remove any further variables.
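The same check can also be done programmatically, which is useful as the feature set grows; a minimal sketch using the corr matrix computed above (each flagged pair would appear twice, once in each order):

In [ ]:
# list any pairs of numeric features whose absolute correlation exceeds 0.7
corr_pairs = corr.abs().stack()
high = corr_pairs[(corr_pairs > 0.7) & (corr_pairs < 1.0)]
print(high if len(high) else 'No pairs above the 0.7 threshold')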

Functions

In [ ]:
def plot_confusion(cm):
    # plot a confusion matrix as an annotated heatmap
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, fmt='d', ax = ax)

    # labels, title and ticks
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('Actual labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['Non-Creditworthy', 'Creditworthy'])
    ax.yaxis.set_ticklabels(['Non-Creditworthy', 'Creditworthy'])

Cross Validation

In [8]:
X = df.drop('credit_application_result', axis =1)
y = df.credit_application_result.apply(lambda x: 1 if x == 'Creditworthy' else 0)

X = pd.get_dummies(X, columns = [ 
       'payment_status_of_previous_credit',
       'purpose', 'value_savings_stocks',
       'length_of_current_employment', 'instalment_per_cent',
       'most_valuable_available_asset', 'account_balance',
       'no_of_credits_at_this_bank',  'type_of_apartment'], drop_first=True)

# min-max scale all features to the [0, 1] range
X = (X - X.min()) / (X.max() - X.min())
In [9]:
# hold out 30% of the data for testing; the split is not seeded, so accuracy
# figures will vary slightly from run to run
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .3)
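Although this section is titled Cross Validation, the models below are compared on a single held-out test split. For reference, a k-fold score could be computed as in the following minimal sketch (illustrative only; it is not used in the comparison below):

In [ ]:
# 5-fold cross-validated accuracy for one candidate model (illustrative)
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(max_depth = 6), X, y, cv = 5)
print('5-fold CV accuracy: ', round(cv_scores.mean(), 3))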

LOGIT

In [31]:
# note: no intercept is added (sm.add_constant), so the model is fit without a constant term
log = sm.Logit(ytrain, Xtrain).fit()

y_logit_true = ytest.sort_index()
y_logit_predict = log.predict(Xtest).sort_index().apply(lambda x: 0 if x < .5 else 1)

print('Accuracy: ', accuracy_score(y_logit_true, y_logit_predict))


fpr_logit, tpr_logit, _ = roc_curve(y_logit_true, log.predict(Xtest).sort_index())

#plot confusion matrix
cm = confusion_matrix(y_logit_true, y_logit_predict)
plot_confusion(cm)

log.summary()
Optimization terminated successfully.
         Current function value: 0.466982
         Iterations 6
Accuracy:  0.72
Out[31]:
Logit Regression Results
Dep. Variable: credit_application_result No. Observations: 350
Model: Logit Df Residuals: 328
Method: MLE Df Model: 21
Date: Sun, 02 Sep 2018 Pseudo R-squ.: 0.2124
Time: 21:59:58 Log-Likelihood: -163.44
converged: True LL-Null: -207.53
LLR p-value: 3.319e-10
coef std err z P>|z| [0.025 0.975]
duration_of_credit_month -0.7165 0.815 -0.879 0.379 -2.313 0.880
credit_amount -0.2910 1.302 -0.224 0.823 -2.842 2.260
age_years 0.8644 0.824 1.049 0.294 -0.751 2.480
payment_status_of_previous_credit_Paid Up 0.0445 0.345 0.129 0.897 -0.632 0.721
payment_status_of_previous_credit_Some Problems -1.3643 0.533 -2.561 0.010 -2.408 -0.320
purpose_New car 1.1960 0.593 2.016 0.044 0.033 2.359
purpose_Other 0.9594 0.914 1.050 0.294 -0.831 2.750
purpose_Used car 0.5541 0.384 1.444 0.149 -0.198 1.306
value_savings_stocks_None -0.0754 0.438 -0.172 0.863 -0.933 0.782
value_savings_stocks_£100-£1000 0.8573 0.526 1.631 0.103 -0.173 1.888
length_of_current_employment_4-7 yrs -0.3760 0.461 -0.815 0.415 -1.280 0.529
length_of_current_employment_< 1yr -0.6103 0.364 -1.675 0.094 -1.325 0.104
instalment_per_cent_2 0.3374 0.477 0.707 0.480 -0.598 1.273
instalment_per_cent_3 0.0527 0.502 0.105 0.916 -0.931 1.037
instalment_per_cent_4 -0.3307 0.421 -0.786 0.432 -1.155 0.494
most_valuable_available_asset_2 0.1722 0.427 0.404 0.687 -0.664 1.009
most_valuable_available_asset_3 0.1654 0.363 0.455 0.649 -0.546 0.877
most_valuable_available_asset_4 -0.5793 0.673 -0.861 0.389 -1.898 0.740
account_balance_Some Balance 1.4826 0.318 4.669 0.000 0.860 2.105
no_of_credits_at_this_bank_More than 1 0.5429 0.364 1.492 0.136 -0.170 1.256
type_of_apartment_2 0.6106 0.347 1.760 0.078 -0.069 1.290
type_of_apartment_3 0.3955 0.793 0.499 0.618 -1.159 1.950
  • NOTE: We really should be doing some form of feature selection here. We might use forward or backward selection, or start from the most important features identified by the decision tree or random forest models below. Ideally, every feature kept in the model would be statistically significant (as indicated by its p-value); a backward-elimination sketch follows.
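As a rough illustration of the backward-selection idea from the note above, the sketch below repeatedly drops the least significant feature until everything remaining clears a chosen p-value threshold. The helper name backward_select and the 0.05 threshold are illustrative, not part of the original analysis.

In [ ]:
# backward elimination on p-values: refit the logit, drop the least
# significant remaining feature, and repeat until all p-values <= alpha
def backward_select(X, y, alpha = 0.05):
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, X[cols]).fit(disp = 0)
        worst = model.pvalues.idxmax()
        if model.pvalues[worst] <= alpha:
            break
        cols.remove(worst)
    return cols

selected = backward_select(Xtrain, ytrain)
print(selected)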

DECISION TREE

In [26]:
tree = DecisionTreeClassifier(max_depth=6).fit(Xtrain, ytrain)

print('Accuracy: ',tree.score(Xtest, ytest))

y_tree_predict, y_tree_true = tree.predict(Xtest), np.array(ytest.tolist())


#plot feature importances
s_dt = pd.Series(data = tree.feature_importances_, index = X.columns).sort_values( ascending = False)
s_dt.plot(kind = 'bar', colormap= 'Blues_r')
plt.xticks(rotation = 90)
plt.title('Feature Importances')
plt.show()

#plot roc curve
fpr_dt, tpr_dt, _ = roc_curve(y_tree_true, tree.predict_proba(Xtest)[:,1])
plt.plot(fpr_dt, tpr_dt)
plt.title('ROC Decision Tree')
plt.show()

#plot confusion matrix
cm = confusion_matrix(y_tree_true, y_tree_predict)
plot_confusion(cm)
Accuracy:  0.7

RANDOM FOREST

In [28]:
forest = RandomForestClassifier(max_depth = 6).fit(Xtrain, ytrain)

print('Accuracy: ',forest.score(Xtest, ytest))



y_forest_predict, y_forest_true = forest.predict(Xtest), np.array(ytest.tolist())


#plot feature importances
s_rf = pd.Series(data = forest.feature_importances_, index = X.columns).sort_values( ascending = False)
s_rf.plot(kind = 'bar', colormap= 'Blues_r')
plt.title('Feature Importance')
plt.show()

#plot roc curve
fpr_rf, tpr_rf, _ = roc_curve(y_forest_true, forest.predict_proba(Xtest)[:,1])

plt.plot(fpr_rf, tpr_rf)
plt.title("ROC Random Forest")
plt.show()

#plot confusion matrix
cm = confusion_matrix(y_forest_true, y_forest_predict)
plot_confusion(cm)
Accuracy:  0.7666666666666667

BOOSTED

In [30]:
boost = GradientBoostingClassifier(max_depth=5).fit(Xtrain, ytrain)

print('Accuracy: ',boost.score(Xtest, ytest))



#plot feature importances
s_boost = pd.Series(data = boost.feature_importances_, index = X.columns).sort_values( ascending = False)
s_boost.plot(kind = 'bar', colormap= 'Blues_r')
plt.title('Feature Importance')
plt.show()


y_boost_predict, y_boost_true = boost.predict(Xtest), np.array(ytest.tolist())
fpr_boost, tpr_boost, _ = roc_curve(y_boost_true, boost.predict_proba(Xtest)[:,1])

plt.plot(fpr_boost, tpr_boost)
plt.title('ROC Curve')
plt.show()

cm = confusion_matrix(y_boost_true, y_boost_predict)

plot_confusion(cm)
Accuracy:  0.7733333333333333
In [32]:
plt.plot(fpr_logit, tpr_logit)
plt.plot(fpr_dt, tpr_dt)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_boost, tpr_boost)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curves')
plt.legend(['Logit', 'Decision Tree', 'Random Forest', 'Boosted'])
[Figure: ROC curves for all four models]
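To quantify the comparison, the area under each curve can be computed from the fpr/tpr arrays already stored above; a minimal sketch:

In [ ]:
# AUC for each model, computed from the ROC points calculated earlier
for name, (fpr, tpr) in {'Logit': (fpr_logit, tpr_logit),
                         'Decision Tree': (fpr_dt, tpr_dt),
                         'Random Forest': (fpr_rf, tpr_rf),
                         'Boosted': (fpr_boost, tpr_boost)}.items():
    print(name, 'AUC: ', round(auc(fpr, tpr), 3))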

Conclusions

An initial run-through of the candidate models suggests that the random forest model is a good one to explore further.

In the run summarized here, this model had the highest overall accuracy (roughly 0.8), correctly predicted the most Creditworthy applicants (106), and correctly predicted 14 applicants as non-creditworthy. (Exact figures vary slightly between runs because the train/test split is not seeded.)

Its ROC curve also has the largest area under the curve (AUC) of the four models, as shown above.

Additionally, the false negative rate is low while the false positive rate is high. This could work well if we have loan specialists who can manually review every application the model marks as creditworthy while automatically rejecting those marked non-creditworthy, with the expectation that relatively few genuinely creditworthy applicants will be turned away.

On the other hand, if we do not have the resources to review these applications and we are sensitive to the cost of approving bad loans, this model would be risky to deploy, because we would effectively be approving many loans that should be marked non-creditworthy.
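One way to manage this trade-off is to require a higher predicted probability before marking an application creditworthy, rather than the default 0.5 cut-off; a minimal sketch using the fitted random forest (the 0.7 threshold is illustrative only):

In [ ]:
# raise the decision threshold to reduce false positives (approving bad loans)
# at the cost of more false negatives; the 0.7 cut-off is illustrative only
threshold = 0.7
proba = forest.predict_proba(Xtest)[:, 1]        # predicted P(Creditworthy)
y_strict_predict = (proba >= threshold).astype(int)

print('Accuracy at threshold', threshold, ': ', accuracy_score(ytest, y_strict_predict))
plot_confusion(confusion_matrix(ytest, y_strict_predict))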

Ideas for future model improvements:

  • bagging (bootstrap aggregation); a sketch follows this list

  • using forward/backward selection to choose features for the logit model (see the sketch in the LOGIT section above)
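A minimal sketch of the bagging idea, assuming sklearn's BaggingClassifier over decision-tree base estimators (the parameters shown are illustrative):

In [ ]:
# bootstrap-aggregated decision trees; parameters are illustrative
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(DecisionTreeClassifier(max_depth = 6),  # same depth as the single tree above
                        n_estimators = 100).fit(Xtrain, ytrain)

print('Bagging accuracy: ', bag.score(Xtest, ytest))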