We will be classifying loan statuses using data on past applications and their assigned statuses (creditworthy or non-creditworthy). This data is contained in the credit-data-training dataset, whose features for each loan application will be used to train the models. Finally, we will make predictions for loan applications with undetermined statuses; these applications can be found in the customers-to-score dataset.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, r2_score, confusion_matrix, accuracy_score
from scipy import stats
df = pd.read_csv('credit-data-training.csv')
df_to_score = pd.read_csv('customers-to-score.csv')
df.columns = df.columns.str.lower().str.replace('-', '_')
df_to_score.columns = df_to_score.columns.str.lower().str.replace('-', '_')
df_to_score.columns
The above lists the features in the scoring dataset.
df.isna().describe()
We will drop features with low variability, since they carry little information for distinguishing applicants. We will also drop features with a high proportion of missing values, and impute the remaining missing values using the median, which is robust to outliers.
f, axs = plt.subplots(5, 4, figsize = (15, 15))
plt.subplots_adjust(hspace = .5)
i, j = 0, 0
for col in df.columns:
    # bar chart of value counts for each feature
    s = df[col].value_counts()
    axs[i, j].bar(list(s.index.values), s.values)
    axs[i, j].set_title(col)
    if j == 3 and i == 0:
        axs[i, j].tick_params(rotation = -10)
    if j == 0 and i == 1:
        axs[i, j].tick_params(rotation = -12)
    j += 1
    if j % 4 == 0:
        i += 1
        j = 0
The following features show low variability, as demonstrated by the bar charts above and confirmed numerically below, so they were removed as candidate features for the model:
guarantors, no_of_dependents, occupation, concurrent_credits, and foreign_worker
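As a quick numerical check of that claim, the sketch below (assuming the cleaned column names shown earlier) prints how dominant the most common value is in each of these columns:
# Share of rows taken by the most frequent value in each low-variability candidate.
for col in ['guarantors', 'no_of_dependents', 'occupation', 'concurrent_credits', 'foreign_worker']:
    top_share = df[col].value_counts(normalize=True).iloc[0]
    print(col, round(top_share, 3))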
The duration_in_current_address column has 344 non-missing values; the remaining 256 values are missing. Due to this high number of missing values we drop this feature as well.
We also have missing values in the age_years column. Because the number of missing values is low (18), we can impute them with the median. We use the median rather than the mean because the mean is sensitive to outliers, and the distribution of this feature is skewed rather than normal, so imputing with the mean would bias our results.
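A quick check of the missing-value counts cited above, using the cleaned column names duration_in_current_address and age_years:
# Number of missing values in the two columns discussed above.
df[['duration_in_current_address', 'age_years']].isna().sum()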
Finally, we also remove telephone, since it is unlikely to be a good predictor of creditworthiness.
In summary, we have removed the following columns:
guarantors, no_of_dependents, occupation, concurrent_credits, foreign_worker, telephone, and duration_in_current_address
df = df.drop(['duration_in_current_address', 'occupation', 'telephone', 'concurrent_credits',
'guarantors', 'foreign_worker', 'no_of_dependents'], axis = 1)
# drop the same columns from the scoring data so it matches the training features
df_to_score = df_to_score.drop(['duration_in_current_address', 'occupation', 'telephone', 'concurrent_credits',
                                'guarantors', 'foreign_worker', 'no_of_dependents'], axis = 1)
df.age_years = df.age_years.fillna(df.age_years.median())
df.columns
We next check our continuous features to ensure that we do not have high levels of multicollinearity. Pairwise correlations under .7 are acceptable.
plt.figure(figsize = (10, 8))
corr = df.select_dtypes('number').corr()  # correlations over numeric features only
sns.heatmap(corr)
The above correlation matrix shows that none of our numeric variables has a pairwise correlation greater than .7 with another numeric variable, so multicollinearity is not a concern for the models used here and we do not need to remove any further variables.
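As a programmatic version of that visual check, here is a small sketch that reuses the corr matrix computed above and lists any pairs exceeding the .7 threshold:
# Keep only the upper triangle of the absolute correlation matrix, then flag pairs above .7.
pairs = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs[pairs > .7])  # expected to be empty, matching the heatmap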
def plot_confusion(matrix):
    cm = matrix
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax)
    # labels, title and ticks
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('Actual labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['Denied', 'Approved'])
    ax.yaxis.set_ticklabels(['Denied', 'Approved'])
X = df.drop('credit_application_result', axis =1)
y = df.credit_application_result.apply(lambda x: 1 if x == 'Creditworthy' else 0)
X = pd.get_dummies(X, columns = [
'payment_status_of_previous_credit',
'purpose', 'value_savings_stocks',
'length_of_current_employment', 'instalment_per_cent',
'most_valuable_available_asset', 'account_balance',
'no_of_credits_at_this_bank', 'type_of_apartment'], drop_first=True)
X.sort_index(inplace = True)
X = (X-X.min())/(X.max()-X.min())
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .3)
log = sm.Logit(ytrain, Xtrain).fit()
log.summary()
y_logit_true = ytest.sort_index()
y_logit_predict = log.predict(Xtest).sort_index().apply(lambda x: 0 if x < .5 else 1)  # classify at the 0.5 probability threshold
print('Accuracy: ', accuracy_score(y_logit_true, y_logit_predict))
fpr_logit, tpr_logit, _ = roc_curve(y_logit_true, log.predict(Xtest).sort_index())
#plot confusion matrix
cm = confusion_matrix(y_logit_true, y_logit_predict)
plot_confusion(cm)
tree = DecisionTreeClassifier(max_depth=6).fit(Xtrain, ytrain)
print('Accuracy: ',tree.score(Xtest, ytest))
y_tree_predict, y_tree_true = tree.predict(Xtest), np.array(ytest.tolist())
#plot feature importances
s_dt = pd.Series(data = tree.feature_importances_, index = X.columns).sort_values( ascending = False)
s_dt.plot(kind = 'bar', colormap= 'Blues_r')
plt.xticks(rotation = 90)
plt.title('Feature Importances')
plt.show()
#plot roc curve
fpr_dt, tpr_dt, _ = roc_curve(y_tree_true, tree.predict_proba(Xtest)[:,1])
plt.plot(fpr_dt, tpr_dt)
plt.show()
#plot confusion matrix
cm = confusion_matrix(y_tree_true, y_tree_predict)
plot_confusion(cm)
forest = RandomForestClassifier(max_depth = 6).fit(Xtrain, ytrain)
print('Accuracy: ',forest.score(Xtest, ytest))
y_forest_predict, y_forest_true = forest.predict(Xtest), np.array(ytest.tolist())
#plot feature importances
s_rf = pd.Series(data = forest.feature_importances_, index = X.columns).sort_values(ascending = False)
s_rf.plot(kind = 'bar', colormap = 'Blues_r')
plt.title('Feature Importance')
plt.show()
#plot roc curve
fpr_rf, tpr_rf, _ = roc_curve(y_forest_true, forest.predict_proba(Xtest)[:,1])
plt.plot(fpr_rf, tpr_rf)
plt.title("ROC Random Forest")
plt.show()
#plot confusion matrix
cm = confusion_matrix(y_forest_true, y_forest_predict)
plot_confusion(cm)
boost = GradientBoostingClassifier(max_depth=5).fit(Xtrain, ytrain)
print('Accuracy: ',boost.score(Xtest, ytest))
s_boost = pd.Series(data = boost.feature_importances_, index = X.columns).sort_values(ascending = False)
s_boost.plot(kind = 'bar', colormap = 'Blues_r')
plt.title('Feature Importance')
plt.show()
y_boost_predict, y_boost_true = boost.predict(Xtest), np.array(ytest.tolist())
fpr_boost, tpr_boost, _ = roc_curve(y_boost_true, boost.predict_proba(Xtest)[:,1])
plt.plot(fpr_boost, tpr_boost)
plt.title('ROC Curve')
plt.show()
cm = confusion_matrix(y_boost_true, y_boost_predict)
plot_confusion(cm)
#plot all ROC curves together for comparison
plt.plot(fpr_logit, tpr_logit)
plt.plot(fpr_dt, tpr_dt)
plt.plot(fpr_rf, tpr_rf)
plt.plot(fpr_boost, tpr_boost)
plt.legend(['Logit', 'Decision Tree', 'Random Forest', 'Boosted'])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curves - All Models')
plt.show()
An initial run-through of the candidate models suggests that the random forest is the most promising model to explore further.
This model has the highest overall accuracy at .8, the highest number of creditworthy applicants correctly predicted (106), and 14 applicants correctly predicted to be non-creditworthy.
It also has the largest area under the ROC curve, as shown above.
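To quantify that comparison, the sketch below reuses the fpr/tpr arrays computed for each model and the auc helper imported at the top:
# Area under each ROC curve, computed from the rates gathered above.
for name, fpr, tpr in [('Logit', fpr_logit, tpr_logit), ('Decision Tree', fpr_dt, tpr_dt),
                       ('Random Forest', fpr_rf, tpr_rf), ('Boosted', fpr_boost, tpr_boost)]:
    print(name, round(auc(fpr, tpr), 3))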
Additionally, the false negative rate is low while the false positive rate is high. This might be acceptable if we have loan specialists who can review every application marked 'Yes' by hand while automatically rejecting any marked 'No', with the expectation that the number of false negatives will be relatively low.
On the other hand, if we lack the resources to review these applications and are sensitive to the cost of approving a loan that should have been rejected, this model would be risky to deploy, because we would effectively be approving many loans that should be marked as non-creditworthy.
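As a concrete check on those rates, here is a sketch that unpacks the random forest confusion matrix from the test-set predictions above (positive class = Creditworthy):
# rows = actual, columns = predicted; ravel() gives tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = confusion_matrix(y_forest_true, y_forest_predict).ravel()
print('False positive rate:', fp / (fp + tn))  # non-creditworthy applicants approved
print('False negative rate:', fn / (fn + tp))  # creditworthy applicants denied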
Ideas for future model improvements:
bagging (bootstrap aggregation), sketched below
using forward/backward selection for logit feature selection
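As a starting point for the bagging idea, here is a minimal sketch using sklearn's BaggingClassifier; the settings (100 estimators, the default decision-tree base estimator) are placeholder choices rather than tuned values:
from sklearn.ensemble import BaggingClassifier
# Bootstrap-aggregated ensemble of decision trees, fit on the same train/test split used above.
bag = BaggingClassifier(n_estimators=100, random_state=0).fit(Xtrain, ytrain)
print('Accuracy: ', bag.score(Xtest, ytest))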