Predicting Loan Applicant's Creditworthiness

Problem Summary

We will classify loan applications as creditworthy or non-creditworthy using data on past applications and their assigned statuses. This training data is contained in the credit-data-training dataset and includes the features associated with each application that we will use to train the models. We will then use the trained models to predict the status of applications whose outcome has not yet been determined; these applications are found in the customers-to-score dataset.

In [33]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, r2_score, confusion_matrix, accuracy_score
from scipy import stats
In [3]:
df = pd.read_csv('credit-data-training.csv')
df_to_score = pd.read_csv('customers-to-score.csv')
df.columns = df.columns.str.lower().str.replace('-', '_')
df_to_score.columns = df_to_score.columns.str.lower().str.replace('-', '_')
df_to_score.columns
Out[3]:
Index(['account_balance', 'duration_of_credit_month',
       'payment_status_of_previous_credit', 'purpose', 'credit_amount',
       'value_savings_stocks', 'length_of_current_employment',
       'instalment_per_cent', 'guarantors', 'duration_in_current_address',
       'most_valuable_available_asset', 'age_years', 'concurrent_credits',
       'type_of_apartment', 'no_of_credits_at_this_bank', 'occupation',
       'no_of_dependents', 'telephone', 'foreign_worker'],
      dtype='object')

The output above lists the features in the scoring dataset; the training dataset contains the same features plus the credit_application_result label.

Data Exploration and Feature Selection

In [4]:
df.isna().describe().T
Out[4]:
                                   count unique    top freq
credit_application_result            500      1  False  500
account_balance                      500      1  False  500
duration_of_credit_month             500      1  False  500
payment_status_of_previous_credit    500      1  False  500
purpose                              500      1  False  500
credit_amount                        500      1  False  500
value_savings_stocks                 500      1  False  500
length_of_current_employment         500      1  False  500
instalment_per_cent                  500      1  False  500
guarantors                           500      1  False  500
duration_in_current_address          500      2   True  344
most_valuable_available_asset        500      1  False  500
age_years                            500      2  False  488
concurrent_credits                   500      1  False  500
type_of_apartment                    500      1  False  500
no_of_credits_at_this_bank           500      1  False  500
occupation                           500      1  False  500
no_of_dependents                     500      1  False  500
telephone                            500      1  False  500
foreign_worker                       500      1  False  500
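For each column, top is the most frequent value of isna() and freq is how often it occurs: duration_in_current_address is missing 344 of its 500 values, and age_years is missing 12 (500 - 488). As a cross-check, we can also read the raw missing counts directly; a minimal sketch:

In [ ]:
# count missing values per column and show only the columns that have any
missing = df.isna().sum()
print(missing[missing > 0])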

Feature Preprocessing

We will drop features with low variability, i.e. features where nearly every application has the same value. We will also drop features with a high proportion of missing values, and impute the remaining missing values using the median, which is robust to outliers.
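As a complement to the bar charts below, we can also compute the share of observations taken up by each column's single most common value; the following is a minimal sketch (the 0.9 cut-off is illustrative, not a rule used elsewhere in this analysis):

In [ ]:
# share of rows accounted for by each column's most common value;
# columns dominated by one value carry little information for the model
dominant_share = df.drop('credit_application_result', axis = 1).apply(
    lambda col: col.value_counts(normalize = True).iloc[0])
print(dominant_share[dominant_share > 0.9].sort_values(ascending = False))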

In [5]:
f, axs = plt.subplots(5, 4, figsize = (15, 15))
plt.subplots_adjust(hspace = .5)

# bar chart of the value counts for every column, laid out on a 5x4 grid
i, j = 0, 0
for col in df.columns:
    s = df[col].value_counts()
    axs[i, j].bar(list(s.index.values), s.values)
    axs[i, j].set_title(col)
    if j == 3 and i == 0:
        axs[i, j].tick_params(rotation = -10)
    if j == 0 and i == 1:
        axs[i, j].tick_params(rotation = -12)
    j += 1
    if j % 4 == 0:
        i += 1
        j = 0

The bar charts above show that the following features have low variability (a single value dominates), so they were removed as candidate features for the model:

guarantors, no_of_dependents, occupation, concurrent_credits, and foreign_worker

The duration_in_current_address column is missing 344 of its 500 values, leaving only 156 observations with a value. Due to this high proportion of missing values we drop this feature as well.

The age_years column also has missing values. Because the number of missing values is small (12), we can impute them with the median. We use the median rather than the mean because the mean is sensitive to outliers and the age distribution is skewed, so imputing with the mean would bias our results.

Finally, we also remove telephone, since it is unlikely to be a good predictor of creditworthiness.

In summary, we have removed the following columns:

guarantors, no_of_dependents, occupation, concurrent_credits, foreign_worker, telephone and duration_in_current_address

In [6]:
# drop the same columns from both the training and scoring datasets so their features stay aligned
cols_to_drop = ['duration_in_current_address', 'occupation', 'telephone', 'concurrent_credits',
                'guarantors', 'foreign_worker', 'no_of_dependents']
df = df.drop(cols_to_drop, axis = 1)
df_to_score = df_to_score.drop(cols_to_drop, axis = 1)

# impute missing ages with the median
df.age_years = df.age_years.fillna(df.age_years.median())
df.columns
df.columns
Out[6]:
Index(['credit_application_result', 'account_balance',
       'duration_of_credit_month', 'payment_status_of_previous_credit',
       'purpose', 'credit_amount', 'value_savings_stocks',
       'length_of_current_employment', 'instalment_per_cent',
       'most_valuable_available_asset', 'age_years', 'type_of_apartment',
       'no_of_credits_at_this_bank'],
      dtype='object')

Correlation

We next check our continuous features to ensure that we do not have high levels of multicollinearity. Pairwise correlations below 0.7 are considered acceptable.

In [7]:
plt.figure(figsize = (10, 8))
corr = df.corr()
sns.heatmap(corr)
[Figure: correlation heatmap of the numeric features]

The correlation matrix above shows that none of our numeric variables has a pairwise correlation greater than 0.7 with any other, so multicollinearity is not a concern and we do not need to remove any further variables.
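The same check can also be done programmatically, which is useful as the feature set grows; a minimal sketch using the corr matrix computed above (each flagged pair would appear twice, once in each order):

In [ ]:
# list any pairs of numeric features whose absolute correlation exceeds 0.7
corr_pairs = corr.abs().stack()
high = corr_pairs[(corr_pairs > 0.7) & (corr_pairs < 1.0)]
print(high if len(high) else 'No pairs above the 0.7 threshold')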

Functions

In [ ]:
def plot_confusion(cm):
    # plot a confusion matrix as an annotated heatmap
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, fmt='d', ax = ax)

    # labels, title and ticks
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('Actual labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['Non-Creditworthy', 'Creditworthy'])
    ax.yaxis.set_ticklabels(['Non-Creditworthy', 'Creditworthy'])

Cross Validation

In [8]:
X = df.drop('credit_application_result', axis =1)
y = df.credit_application_result.apply(lambda x: 1 if x == 'Creditworthy' else 0)

X = pd.get_dummies(X, columns = [ 
       'payment_status_of_previous_credit',
       'purpose', 'value_savings_stocks',
       'length_of_current_employment', 'instalment_per_cent',
       'most_valuable_available_asset', 'account_balance',
       'no_of_credits_at_this_bank',  'type_of_apartment'], drop_first=True)

# min-max scale all features to the [0, 1] range
X = (X - X.min()) / (X.max() - X.min())
In [9]:
# hold out 30% of the data for testing; the split is not seeded, so accuracy
# figures will vary slightly from run to run
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .3)
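Although this section is titled Cross Validation, the models below are compared on a single held-out test split. For reference, a k-fold score could be computed as in the following minimal sketch (illustrative only; it is not used in the comparison below):

In [ ]:
# 5-fold cross-validated accuracy for one candidate model (illustrative)
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(max_depth = 6), X, y, cv = 5)
print('5-fold CV accuracy: ', round(cv_scores.mean(), 3))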

LOGIT

In [31]:
# note: no intercept is added (sm.add_constant), so the model is fit without a constant term
log = sm.Logit(ytrain, Xtrain).fit()

y_logit_true = ytest.sort_index()
y_logit_predict = log.predict(Xtest).sort_index().apply(lambda x: 0 if x < .5 else 1)

print('Accuracy: ', accuracy_score(y_logit_true, y_logit_predict))


fpr_logit, tpr_logit, _ = roc_curve(y_logit_true, log.predict(Xtest).sort_index())

#plot confusion matrix
cm = confusion_matrix(y_logit_true, y_logit_predict)
plot_confusion(cm)

log.summary()
Optimization terminated successfully.
         Current function value: 0.466982
         Iterations 6
Accuracy:  0.72
Out[31]:
Logit Regression Results
Dep. Variable: credit_application_result No. Observations: 350
Model: Logit Df Residuals: 328
Method: MLE Df Model: 21
Date: Sun, 02 Sep 2018 Pseudo R-squ.: 0.2124
Time: 21:59:58 Log-Likelihood: -163.44
converged: True LL-Null: -207.53
LLR p-value: 3.319e-10
coef std err z P>|z| [0.025 0.975]
duration_of_credit_month -0.7165 0.815 -0.879 0.379 -2.313 0.880
credit_amount -0.2910 1.302 -0.224 0.823 -2.842 2.260
age_years 0.8644 0.824 1.049 0.294 -0.751 2.480
payment_status_of_previous_credit_Paid Up 0.0445 0.345 0.129 0.897 -0.632 0.721
payment_status_of_previous_credit_Some Problems -1.3643 0.533 -2.561 0.010 -2.408 -0.320
purpose_New car 1.1960 0.593 2.016 0.044 0.033 2.359
purpose_Other 0.9594 0.914 1.050 0.294 -0.831 2.750
purpose_Used car 0.5541 0.384 1.444 0.149 -0.198 1.306
value_savings_stocks_None -0.0754 0.438 -0.172 0.863 -0.933 0.782
value_savings_stocks_£100-£1000 0.8573 0.526 1.631 0.103 -0.173 1.888
length_of_current_employment_4-7 yrs -0.3760 0.461 -0.815 0.415 -1.280 0.529
length_of_current_employment_< 1yr -0.6103 0.364 -1.675 0.094 -1.325 0.104
instalment_per_cent_2 0.3374 0.477 0.707 0.480 -0.598 1.273
instalment_per_cent_3 0.0527 0.502 0.105 0.916 -0.931 1.037
instalment_per_cent_4 -0.3307 0.421 -0.786 0.432 -1.155 0.494
most_valuable_available_asset_2 0.1722 0.427 0.404 0.687 -0.664 1.009
most_valuable_available_asset_3 0.1654 0.363 0.455 0.649 -0.546 0.877
most_valuable_available_asset_4 -0.5793 0.673 -0.861 0.389 -1.898 0.740
account_balance_Some Balance 1.4826 0.318 4.669 0.000 0.860 2.105
no_of_credits_at_this_bank_More than 1 0.5429 0.364 1.492 0.136 -0.170 1.256
type_of_apartment_2 0.6106 0.347 1.760 0.078 -0.069 1.290
type_of_apartment_3 0.3955 0.793 0.499 0.618 -1.159 1.950
  • NOTE: We really should be doing some form of feature selection here. We might use forward or backward selection, or start from the most important features identified by the decision tree or random forest models below. Ideally, every feature kept in the model would be statistically significant (as indicated by its p-value); a backward-elimination sketch follows.
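As a rough illustration of the backward-selection idea from the note above, the sketch below repeatedly drops the least significant feature until everything remaining clears a chosen p-value threshold. The helper name backward_select and the 0.05 threshold are illustrative, not part of the original analysis.

In [ ]:
# backward elimination on p-values: refit the logit, drop the least
# significant remaining feature, and repeat until all p-values <= alpha
def backward_select(X, y, alpha = 0.05):
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, X[cols]).fit(disp = 0)
        worst = model.pvalues.idxmax()
        if model.pvalues[worst] <= alpha:
            break
        cols.remove(worst)
    return cols

selected = backward_select(Xtrain, ytrain)
print(selected)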

DECISION TREE

In [26]:
tree = DecisionTreeClassifier(max_depth=6).fit(Xtrain, ytrain)

print('Accuracy: ',tree.score(Xtest, ytest))

y_tree_predict, y_tree_true = tree.predict(Xtest), np.array(ytest.tolist())


#plot feature importances
s_dt = pd.Series(data = tree.feature_importances_, index = X.columns).sort_values( ascending = False)
s_dt.plot(kind = 'bar', colormap= 'Blues_r')
plt.xticks(rotation = 90)
plt.title('Feature Importances')
plt.show()

#plot roc curve
fpr_dt, tpr_dt, _ = roc_curve(y_tree_true, tree.predict_proba(Xtest)[:,1])
plt.plot(fpr_dt, tpr_dt)
plt.title('ROC Decision Tree')
plt.show()

#plot confusion matrix
cm = confusion_matrix(y_tree_true, y_tree_predict)
plot_confusion(cm)
Accuracy:  0.7

RANDOM FOREST

In [28]:
forest = RandomForestClassifier(max_depth = 6).fit(Xtrain, ytrain)

print('Accuracy: ',forest.score(Xtest, ytest))



y_forest_predict, y_forest_true = forest.predict(Xtest), np.array(ytest.tolist())


#plot feature importances
s_rf = pd.Series(data = forest.feature_importances_, index = X.columns).sort_values( ascending = False)
s_rf.plot(kind = 'bar', colormap= 'Blues_r')
plt.title('Feature Importance')
plt.show()

#plot roc curve
fpr_rf, tpr_rf, _ = roc_curve(y_forest_true, forest.predict_proba(Xtest)[:,1])

plt.plot(fpr_rf, tpr_rf)
plt.title("ROC Random Forest")
plt.show()

#plot confusion matrix
cm = confusion_matrix(y_forest_true, y_forest_predict)
plot_confusion(cm)
Accuracy:  0.7666666666666667

BOOSTED

In [30]:
boost = GradientBoostingClassifier(max_depth=5).fit(Xtrain, ytrain)

print('Accuracy: ',boost.score(Xtest, ytest))



#plot feature importances
s_boost = pd.Series(data = boost.feature_importances_, index = X.columns).sort_values( ascending = False)
s_boost.plot(kind = 'bar', colormap= 'Blues_r')
plt.title('Feature Importance')
plt.show()


y_boost_predict, y_boost_true = boost.predict(Xtest), np.array(ytest.tolist())
fpr_boost, tpr_boost, _ = roc_curve(y_boost_true, boost.predict_proba(Xtest)[:,1])

plt.plot(fpr_boost, tpr_boost)
plt.title('ROC Curve')
plt.show()

cm = confusion_matrix(y_boost_true, y_boost_predict)

plot_confusion(cm)
Accuracy:  0.7733333333333333
In [32]:
plt.plot(fpr_logit, tpr_logit)
plt.plot(fpr_dt, tpr_dt)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_boost, tpr_boost)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curves')
plt.legend(['Logit', 'Decision Tree', 'Random Forest', 'Boosted'])
[Figure: ROC curves for all four models]
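To quantify the comparison, the area under each curve can be computed from the fpr/tpr arrays already stored above; a minimal sketch:

In [ ]:
# AUC for each model, computed from the ROC points calculated earlier
for name, (fpr, tpr) in {'Logit': (fpr_logit, tpr_logit),
                         'Decision Tree': (fpr_dt, tpr_dt),
                         'Random Forest': (fpr_rf, tpr_rf),
                         'Boosted': (fpr_boost, tpr_boost)}.items():
    print(name, 'AUC: ', round(auc(fpr, tpr), 3))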

Conclusions

An initial run-through of the candidate models suggests that the random forest model is a good one to explore further.

In the run summarized here, this model had the highest overall accuracy (roughly 0.8), correctly predicted the most Creditworthy applicants (106), and correctly predicted 14 applicants as non-creditworthy. (Exact figures vary slightly between runs because the train/test split is not seeded.)

Its ROC curve also has the largest area under the curve (AUC) of the four models, as shown above.

Additionally, the false negative rate is low while the false positive rate is high. This could work well if we have loan specialists who can manually review every application the model marks as creditworthy while automatically rejecting those marked non-creditworthy, with the expectation that relatively few genuinely creditworthy applicants will be turned away.

On the other hand, if we do not have the resources to review these applications and we are sensitive to the cost of approving bad loans, this model would be risky to deploy, because we would effectively be approving many loans that should be marked non-creditworthy.
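One way to manage this trade-off is to require a higher predicted probability before marking an application creditworthy, rather than the default 0.5 cut-off; a minimal sketch using the fitted random forest (the 0.7 threshold is illustrative only):

In [ ]:
# raise the decision threshold to reduce false positives (approving bad loans)
# at the cost of more false negatives; the 0.7 cut-off is illustrative only
threshold = 0.7
proba = forest.predict_proba(Xtest)[:, 1]        # predicted P(Creditworthy)
y_strict_predict = (proba >= threshold).astype(int)

print('Accuracy at threshold', threshold, ': ', accuracy_score(ytest, y_strict_predict))
plot_confusion(confusion_matrix(ytest, y_strict_predict))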

Ideas for future model improvements:

  • bagging (bootstrap aggregation); a sketch follows this list

  • using forward/backward selection to choose features for the logit model (see the sketch in the LOGIT section above)
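A minimal sketch of the bagging idea, assuming sklearn's BaggingClassifier over decision-tree base estimators (the parameters shown are illustrative):

In [ ]:
# bootstrap-aggregated decision trees; parameters are illustrative
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(DecisionTreeClassifier(max_depth = 6),  # same depth as the single tree above
                        n_estimators = 100).fit(Xtrain, ytrain)

print('Bagging accuracy: ', bag.score(Xtest, ytest))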