Predicting Loan Approvals

Analytics Vidhya Loan Prediction III

Summary

We want to classify loans into two categories; approved and not approved. We will use past loan applications and their respective statuses to train our classifier. The data used for this project comes from Analytics Vidhya. Typically in a classification problem we would examine precision and recall. However since this is a competiton, we will looking to maximize accuracy.

Final Submission

After using nested cross validation on logistic regression, random forest and gradient boosted random forest models, we see that the random forest model results in the highest ROC AUC score on average with k=5 folds.

Our final r2 score after submission is 0.7847. This is a decent start. As we mentioned in our analysis below we could improve this by imputing our missing values by using the other features to predict the missing values (using regression, knn, or other classifiers). We could also try creating different ensembles, or bagging. We will return to this in a future post!

Loan Applicant Data

Features:

  • Loan_ID
  • Gender
  • Married
  • Dependents: 1, 2, 2+
  • Education: Graduate, non Graduate
  • Self_Employed
  • ApplicantIncome
  • CoapplicantIncome
  • LoanAmount
  • Loan_Amount_Term
  • Credit_History: 0, 1
  • Property_Area: Urban, Suburban, Rural

Target:

  • Loan_Status
In [4]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from statsmodels.regression.linear_model import OLS
from sklearn.metrics import accuracy_score, roc_auc_score
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

df_test_data = pd.read_csv('loan_test.csv')
df_train_data = pd.read_csv('loan_train.csv');

EDA

In [5]:
df_train_data.head()
Out[5]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N
2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0 360.0 1.0 Urban Y
3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.0 360.0 1.0 Urban Y
4 LP001008 Male No 0 Graduate No 6000 0.0 141.0 360.0 1.0 Urban Y
In [6]:
df_train_data.describe()
Out[6]:
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
count 614.000000 614.000000 592.000000 600.00000 564.000000
mean 5403.459283 1621.245798 146.412162 342.00000 0.842199
std 6109.041673 2926.248369 85.587325 65.12041 0.364878
min 150.000000 0.000000 9.000000 12.00000 0.000000
25% 2877.500000 0.000000 100.000000 360.00000 1.000000
50% 3812.500000 1188.500000 128.000000 360.00000 1.000000
75% 5795.000000 2297.250000 168.000000 360.00000 1.000000
max 81000.000000 41667.000000 700.000000 480.00000 1.000000
In [7]:
i, j = 0, 0
f, axes = plt.subplots(6, 2, figsize = (11, 20))
plt.subplots_adjust(hspace = .45)

for col in df_train_data.columns[1:]:
    if col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
        axes[i,j].hist(df_train_data[col].dropna())
    else:
        bar_data = df_train_data[col].value_counts()
        sns.barplot(bar_data.index, bar_data.values, ax = axes[i,j])
    axes[i, j].set_title(col)
    axes[i, j].spines['top'].set_visible(False)
    axes[i, j].spines['right'].set_visible(False)
    j += 1 
    if j ==2:
        i += 1
        j = 0

Here we see we have a little less than half of our loans classified as not accepted, we should therefore perform stratefied cv.

In [8]:
for col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
    sns.boxplot(df_train_data[col].dropna(), orient = 'v')
    plt.title(col)
    plt.show()
In [9]:
df_train_data[df_train_data['ApplicantIncome'] > 20000].sort_values(by = 'ApplicantIncome')
Out[9]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
130 LP001469 Male No 0 Graduate Yes 20166 0.0 650.0 480.0 NaN Urban Y
308 LP001996 Male No 0 Graduate No 20233 0.0 480.0 360.0 1.0 Rural N
284 LP001922 Male Yes 0 Graduate No 20667 0.0 NaN 360.0 1.0 Rural N
506 LP002624 Male Yes 0 Graduate No 20833 6667.0 480.0 360.0 NaN Urban Y
126 LP001448 NaN Yes 3+ Graduate No 23803 0.0 370.0 360.0 1.0 Rural Y
183 LP001637 Male Yes 1 Graduate No 33846 0.0 260.0 360.0 1.0 Semiurban N
443 LP002422 Male No 1 Graduate No 37719 0.0 152.0 360.0 1.0 Semiurban Y
185 LP001640 Male Yes 0 Graduate Yes 39147 4750.0 120.0 360.0 1.0 Semiurban Y
155 LP001536 Male Yes 3+ Graduate No 39999 0.0 600.0 180.0 0.0 Semiurban Y
171 LP001585 NaN Yes 3+ Graduate No 51763 0.0 700.0 300.0 1.0 Urban Y
333 LP002101 Male Yes 0 Graduate NaN 63337 0.0 490.0 180.0 1.0 Urban Y
409 LP002317 Male Yes 3+ Graduate No 81000 0.0 360.0 360.0 0.0 Rural N

All applicants with an income greater than 20,000 have higher level of education, it seems reasonable that these applicants would have higher income. However, since most of the dataset is comprised of applicants with higher education this alone would not explain the difference. For most of these points we lack evidence indicating we should remove them.

If we look at row 409, we see that this applicant's income is the largest in our dataset, and suspiciously ends in three 0's. Futhermore, the property area is rural, the credit history is marked 0, and the loan status is marked as having been declined. Given this information it is most likely that the applicant income was entered incorrectly. We should drop this point

In [10]:
df_train_data = df_train_data.drop(409)
In [11]:
df_train_data[df_train_data['LoanAmount'] > 400 ].sort_values(by = 'ApplicantIncome')
Out[11]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
177 LP001610 Male Yes 3+ Graduate No 5516 11300.0 495.0 360.0 0.0 Semiurban N
523 LP002693 Male Yes 2 Graduate Yes 7948 7166.0 480.0 360.0 1.0 Rural Y
604 LP002959 Female Yes 1 Graduate No 12000 0.0 496.0 360.0 1.0 Semiurban Y
432 LP002386 Male No 0 Graduate NaN 12876 0.0 405.0 360.0 1.0 Semiurban Y
278 LP001907 Male Yes 0 Graduate No 14583 0.0 436.0 360.0 1.0 Semiurban Y
487 LP002547 Male Yes 1 Graduate No 18333 0.0 500.0 360.0 1.0 Urban N
561 LP002813 Female Yes 1 Graduate Yes 19484 0.0 600.0 360.0 1.0 Semiurban Y
369 LP002191 Male Yes 0 Graduate No 19730 5266.0 570.0 360.0 1.0 Rural N
130 LP001469 Male No 0 Graduate Yes 20166 0.0 650.0 480.0 NaN Urban Y
308 LP001996 Male No 0 Graduate No 20233 0.0 480.0 360.0 1.0 Rural N
506 LP002624 Male Yes 0 Graduate No 20833 6667.0 480.0 360.0 NaN Urban Y
155 LP001536 Male Yes 3+ Graduate No 39999 0.0 600.0 180.0 0.0 Semiurban Y
171 LP001585 NaN Yes 3+ Graduate No 51763 0.0 700.0 300.0 1.0 Urban Y
333 LP002101 Male Yes 0 Graduate NaN 63337 0.0 490.0 180.0 1.0 Urban Y

Only 4 out of 15 of the loans were denied, but since many of the incomes are fairly high this doesn't seem completely unreasonable.

While some of these points could be questioned, we lack significant evidence that any of these points should be removed.

Handling Missing Values

Below we have imputed categorical features with the mode, and numerical features with the median of the data. An alternative would be to try to use other features to predict the missing feature values, however this does not seem necessary for most of categorical features since they are likely not critical to the model. However predicting missing values in say the 'LoanAmount' or 'CreditHistory' attributes might produce a better result.

In [12]:
#report number of missing values for each feature
for col in df_train_data.columns:
    missing_series = df_train_data[col][df_train_data[col].isna() == True]
    if missing_series.size > 0:
        print(col, missing_series.size)
    plt.show()
Gender 13
Married 3
Dependents 15
Self_Employed 32
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
In [13]:
#fill missing categorical values with mode

categ_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 
              'Credit_History']
for col in categ_cols:
    df_train_data[col] = df_train_data[col].fillna(df_train_data[col].mode()[0])
    df_test_data[col] = df_test_data[col].fillna(df_test_data[col].mode()[0])
In [14]:
#create dummy variables for categorical features
dummy_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 
              'Credit_History', 'Property_Area']
train_dummies = pd.get_dummies(df_train_data[dummy_cols], drop_first = True)
test_dummies = pd.get_dummies(df_test_data[dummy_cols], drop_first = True)
In [15]:
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

#feature normalization 
df_train_num = (df_train_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()
df_test_num = (df_test_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()
In [16]:
#set loan status to 1 if approved, else 0 
loan_status = df_train_data.Loan_Status.apply(lambda x: 0 if x == 'N' else 1)

df_train = pd.concat([df_train_num, train_dummies, loan_status], axis =1)
df_test = pd.concat([df_test_num, test_dummies], axis =1)
In [17]:
#create dataframes with numerical missing values dropped, and missing values w/ median imputed 
df_train_dropped = df_train.dropna()
df_train_fill_median = df_train.fillna(df_train.median())

df_test_dropped = df_test.dropna()
df_test_fill_median = df_test.fillna(df_test.median())
df_train_fill_median.head()
Out[17]:
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Gender_Male Married_Yes Dependents_1 Dependents_2 Dependents_3+ Education_Not Graduate Self_Employed_Yes Property_Area_Semiurban Property_Area_Urban Loan_Status
0 0.107451 -0.554626 -0.211847 0.273248 1.0 1 0 0 0 0 0 0 0 1 1
1 -0.131680 -0.039581 -0.211847 0.273248 1.0 1 1 1 0 0 0 0 0 0 0
2 -0.430689 -0.554626 -0.939491 0.273248 1.0 1 1 0 0 0 0 1 0 1 1
3 -0.509455 0.250729 -0.305737 0.273248 1.0 1 1 0 0 0 1 0 0 1 1
4 0.135973 -0.554626 -0.059277 0.273248 1.0 1 0 0 0 0 0 0 0 1 1
In [18]:
pd.plotting.scatter_matrix(df_train_num, figsize = (16, 10))
plt.show()

Feature Engineering

In [19]:
df_train_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term
df_test_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term

df_train_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term
df_test_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term
In [20]:
for col in ['LoanAmount_per_term', 'ratio_income_per_term']:
    plt.hist(df_train_fill_median[col])
    plt.show()
In [21]:
f = plt.subplots(1, 1, figsize = (14,10))
corr = df_train_fill_median.corr()
sns.heatmap(corr)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x11de667f0>

Here we see that ApplicantIncome and LoanAmount have high colinearity. Also credit history seems to be highly correlated with loan status, this points to it being the most important feature for our model.

Model Parameter Tuning and Selection

Train Test Split

In [22]:
columns = df_train_fill_median.drop('Loan_Status',axis =1).columns
Xtrain, Xtest, y_train, y_test = train_test_split(df_train_fill_median[columns], df_train_fill_median.Loan_Status, test_size = .2)

K Folds

In [23]:
seed =1 

inner_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
outer_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)

Nested K-Fold Cross Validation

In [24]:
def cumlsum_thresh(tup, threshold, sort_vect = False, reverse = True):
    """
    inputs
    __________
    tup: tuple of labels and corresponding values, eg. (label, value)
    threshold: cumulative sum threshold
    sort_vector: should the vector be sorted, default False
    reverse: accending or descending, default True, ie descending
    
    output
    __________
    list of labels where first i values that contribute to the cumulative sum up to threshold
    """
    label, vector = zip(*tup)
    if sort_vect == True:
        vector = sorted(list(vector), reverse=reverse)
    else:
        cuml_sum = 0
        for i, value in enumerate(vector):
            cuml_sum +=value
            if cuml_sum > threshold:
                break
    return list(label[:i])
In [25]:
logit = LogisticRegression()
forest = RandomForestClassifier()
boost = GradientBoostingClassifier()
features = Xtrain.columns

nested_forest_score = np

forest_params = {'max_depth': range(1, 8), 'max_leaf_nodes': range(2, 6)}
boost_params = {'learning_rate': np.arange(.001, 1, .07), 'max_depth': range(1, 8),
                'max_leaf_nodes': range(2, 6)}

clf_forest = GridSearchCV(forest, param_grid=forest_params, scoring='roc_auc', cv=inner_kfold)
clf_boost = GridSearchCV(boost, param_grid=boost_params, scoring='roc_auc', cv=inner_kfold)

#logit feature selection
feat_importance = clf_forest.fit(Xtrain, y_train).best_estimator_.feature_importances_
feat_tup = zip(features, feat_importance)
features_by_importance = sorted(feat_tup, key=lambda x: x[0])
logit_features = cumlsum_thresh(features_by_importance, threshold=.8, reverse=True)
     
nested_forest_score = cross_val_score(estimator=clf_forest, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_boost_score = cross_val_score(estimator=clf_boost, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_logit_score = cross_val_score(estimator=logit, X=Xtrain[logit_features],y=y_train, cv=outer_kfold, 
                                     scoring='roc_auc')
   
print("Random Forest", np.mean(nested_forest_score))
print("Gradient Boosted", np.mean(nested_boost_score))
print("Logit", np.mean(nested_logit_score))   
        
Random Forest 0.7367159312500612
Gradient Boosted 0.7105250316018461
Logit 0.7483750771672988
In [26]:
k_fold = range(1,6)
plt.plot(k_fold, nested_boost_score)
plt.plot(k_fold, nested_forest_score)
plt.plot(k_fold, nested_logit_score)
plt.ylim(ymin=0)
plt.xlabel('K Folds')
plt.ylabel('AUC Score')
plt.legend(['boost', 'forest', 'logit'])
Out[26]:
<matplotlib.legend.Legend at 0x11da5a0f0>

Training the Model

In [27]:
rf = RandomForestClassifier(max_depth = 4, max_leaf_nodes=4).fit(Xtrain, y_train)

accuracy_score(y_train, rf.predict(Xtrain)), accuracy_score(y_test, rf.predict(Xtest))
Out[27]:
(0.8224489795918367, 0.7886178861788617)
In [28]:
roc_auc_score(y_test, rf.predict(Xtest))
Out[28]:
0.6962081128747795

Model Submission

In [248]:
X_predict = df_test_fill_median

pd_results = pd.DataFrame(rf.predict(X_predict), index = df_test_data.Loan_ID, columns = ['Loan_Status'])
pd_results.Loan_Status = pd_results.Loan_Status.apply(lambda x: 'Y' if x ==1 else 'N')
pd_results.to_csv('results.csv')