Table of Contents¶

Introduction

Summary
Final Submission
Training Data

Loan Classification

Exploratory Data Analysis
Handeling Missing Values
Feature Engineering
Parameter Tuning and Model Selection

Training the Model

Submission Code

Summary¶

We want to classify loans into two categories; approved and not approved. We will use past loan applications and their respective statuses to train our classifier. The data used for this project comes from Analytics Vidhya. Typically in a classification problem we would examine precision and recall. However since this is a competiton, we will looking to maximize accuracy.

Final Submission¶

After using nested cross validation on logistic regression, random forest and gradient boosted random forest models, we see that the random forest model results in the highest ROC AUC score on average with k=5 folds.

Our final r2 score after submission is 0.7847. This is a decent start. As we mentioned in our analysis below we could improve this by imputing our missing values by using the other features to predict the missing values (using regression, knn, or other classifiers). We could also try creating different ensembles, or bagging. We will return to this in a future post!

Loan Applicant Data¶

Features:

Loan_ID
Gender
Married
Dependents: 1, 2, 2+
Education: Graduate, non Graduate
Self_Employed
ApplicantIncome
CoapplicantIncome
LoanAmount
Loan_Amount_Term
Credit_History: 0, 1
Property_Area: Urban, Suburban, Rural

Target:

Loan_Status

In [4]:

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from statsmodels.regression.linear_model import OLS
from sklearn.metrics import accuracy_score, roc_auc_score
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

df_test_data = pd.read_csv('loan_test.csv')
df_train_data = pd.read_csv('loan_train.csv');

EDA¶

In [5]:

df_train_data.head()

Out[5]:

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	LP001002	Male	No	0	Graduate	No	5849	0.0	NaN	360.0	1.0	Urban	Y
1	LP001003	Male	Yes	1	Graduate	No	4583	1508.0	128.0	360.0	1.0	Rural	N
2	LP001005	Male	Yes	0	Graduate	Yes	3000	0.0	66.0	360.0	1.0	Urban	Y
3	LP001006	Male	Yes	0	Not Graduate	No	2583	2358.0	120.0	360.0	1.0	Urban	Y
4	LP001008	Male	No	0	Graduate	No	6000	0.0	141.0	360.0	1.0	Urban	Y

In [6]:

df_train_data.describe()

Out[6]:

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	614.000000	614.000000	592.000000	600.00000	564.000000
mean	5403.459283	1621.245798	146.412162	342.00000	0.842199
std	6109.041673	2926.248369	85.587325	65.12041	0.364878
min	150.000000	0.000000	9.000000	12.00000	0.000000
25%	2877.500000	0.000000	100.000000	360.00000	1.000000
50%	3812.500000	1188.500000	128.000000	360.00000	1.000000
75%	5795.000000	2297.250000	168.000000	360.00000	1.000000
max	81000.000000	41667.000000	700.000000	480.00000	1.000000

In [7]:

i, j = 0, 0
f, axes = plt.subplots(6, 2, figsize = (11, 20))
plt.subplots_adjust(hspace = .45)

for col in df_train_data.columns[1:]:
    if col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
        axes[i,j].hist(df_train_data[col].dropna())
    else:
        bar_data = df_train_data[col].value_counts()
        sns.barplot(bar_data.index, bar_data.values, ax = axes[i,j])
    axes[i, j].set_title(col)
    axes[i, j].spines['top'].set_visible(False)
    axes[i, j].spines['right'].set_visible(False)
    j += 1 
    if j ==2:
        i += 1
        j = 0

Here we see we have a little less than half of our loans classified as not accepted, we should therefore perform stratefied cv.

In [8]:

for col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
    sns.boxplot(df_train_data[col].dropna(), orient = 'v')
    plt.title(col)
    plt.show()

In [9]:

df_train_data[df_train_data['ApplicantIncome'] > 20000].sort_values(by = 'ApplicantIncome')

Out[9]:

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
130	LP001469	Male	No	0	Graduate	Yes	20166	0.0	650.0	480.0	NaN	Urban	Y
308	LP001996	Male	No	0	Graduate	No	20233	0.0	480.0	360.0	1.0	Rural	N
284	LP001922	Male	Yes	0	Graduate	No	20667	0.0	NaN	360.0	1.0	Rural	N
506	LP002624	Male	Yes	0	Graduate	No	20833	6667.0	480.0	360.0	NaN	Urban	Y
126	LP001448	NaN	Yes	3+	Graduate	No	23803	0.0	370.0	360.0	1.0	Rural	Y
183	LP001637	Male	Yes	1	Graduate	No	33846	0.0	260.0	360.0	1.0	Semiurban	N
443	LP002422	Male	No	1	Graduate	No	37719	0.0	152.0	360.0	1.0	Semiurban	Y
185	LP001640	Male	Yes	0	Graduate	Yes	39147	4750.0	120.0	360.0	1.0	Semiurban	Y
155	LP001536	Male	Yes	3+	Graduate	No	39999	0.0	600.0	180.0	0.0	Semiurban	Y
171	LP001585	NaN	Yes	3+	Graduate	No	51763	0.0	700.0	300.0	1.0	Urban	Y
333	LP002101	Male	Yes	0	Graduate	NaN	63337	0.0	490.0	180.0	1.0	Urban	Y
409	LP002317	Male	Yes	3+	Graduate	No	81000	0.0	360.0	360.0	0.0	Rural	N

All applicants with an income greater than 20,000 have higher level of education, it seems reasonable that these applicants would have higher income. However, since most of the dataset is comprised of applicants with higher education this alone would not explain the difference. For most of these points we lack evidence indicating we should remove them.

If we look at row 409, we see that this applicant's income is the largest in our dataset, and suspiciously ends in three 0's. Futhermore, the property area is rural, the credit history is marked 0, and the loan status is marked as having been declined. Given this information it is most likely that the applicant income was entered incorrectly. We should drop this point

In [10]:

df_train_data = df_train_data.drop(409)

In [11]:

df_train_data[df_train_data['LoanAmount'] > 400 ].sort_values(by = 'ApplicantIncome')

Out[11]:

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
177	LP001610	Male	Yes	3+	Graduate	No	5516	11300.0	495.0	360.0	0.0	Semiurban	N
523	LP002693	Male	Yes	2	Graduate	Yes	7948	7166.0	480.0	360.0	1.0	Rural	Y
604	LP002959	Female	Yes	1	Graduate	No	12000	0.0	496.0	360.0	1.0	Semiurban	Y
432	LP002386	Male	No	0	Graduate	NaN	12876	0.0	405.0	360.0	1.0	Semiurban	Y
278	LP001907	Male	Yes	0	Graduate	No	14583	0.0	436.0	360.0	1.0	Semiurban	Y
487	LP002547	Male	Yes	1	Graduate	No	18333	0.0	500.0	360.0	1.0	Urban	N
561	LP002813	Female	Yes	1	Graduate	Yes	19484	0.0	600.0	360.0	1.0	Semiurban	Y
369	LP002191	Male	Yes	0	Graduate	No	19730	5266.0	570.0	360.0	1.0	Rural	N
130	LP001469	Male	No	0	Graduate	Yes	20166	0.0	650.0	480.0	NaN	Urban	Y
308	LP001996	Male	No	0	Graduate	No	20233	0.0	480.0	360.0	1.0	Rural	N
506	LP002624	Male	Yes	0	Graduate	No	20833	6667.0	480.0	360.0	NaN	Urban	Y
155	LP001536	Male	Yes	3+	Graduate	No	39999	0.0	600.0	180.0	0.0	Semiurban	Y
171	LP001585	NaN	Yes	3+	Graduate	No	51763	0.0	700.0	300.0	1.0	Urban	Y
333	LP002101	Male	Yes	0	Graduate	NaN	63337	0.0	490.0	180.0	1.0	Urban	Y

Only 4 out of 15 of the loans were denied, but since many of the incomes are fairly high this doesn't seem completely unreasonable.

While some of these points could be questioned, we lack significant evidence that any of these points should be removed.

Handling Missing Values¶

Below we have imputed categorical features with the mode, and numerical features with the median of the data. An alternative would be to try to use other features to predict the missing feature values, however this does not seem necessary for most of categorical features since they are likely not critical to the model. However predicting missing values in say the 'LoanAmount' or 'CreditHistory' attributes might produce a better result.

In [12]:

#report number of missing values for each feature
for col in df_train_data.columns:
    missing_series = df_train_data[col][df_train_data[col].isna() == True]
    if missing_series.size > 0:
        print(col, missing_series.size)
    plt.show()

Gender 13
Married 3
Dependents 15
Self_Employed 32
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50

In [13]:

#fill missing categorical values with mode

categ_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 
              'Credit_History']
for col in categ_cols:
    df_train_data[col] = df_train_data[col].fillna(df_train_data[col].mode()[0])
    df_test_data[col] = df_test_data[col].fillna(df_test_data[col].mode()[0])

In [14]:

#create dummy variables for categorical features
dummy_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 
              'Credit_History', 'Property_Area']
train_dummies = pd.get_dummies(df_train_data[dummy_cols], drop_first = True)
test_dummies = pd.get_dummies(df_test_data[dummy_cols], drop_first = True)

In [15]:

num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

#feature normalization 
df_train_num = (df_train_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()
df_test_num = (df_test_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()

In [16]:

#set loan status to 1 if approved, else 0 
loan_status = df_train_data.Loan_Status.apply(lambda x: 0 if x == 'N' else 1)

df_train = pd.concat([df_train_num, train_dummies, loan_status], axis =1)
df_test = pd.concat([df_test_num, test_dummies], axis =1)

In [17]:

#create dataframes with numerical missing values dropped, and missing values w/ median imputed 
df_train_dropped = df_train.dropna()
df_train_fill_median = df_train.fillna(df_train.median())

df_test_dropped = df_test.dropna()
df_test_fill_median = df_test.fillna(df_test.median())
df_train_fill_median.head()

Out[17]:

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Gender_Male	Married_Yes	Dependents_1	Education_Not Graduate	Self_Employed_Yes	Property_Area_Urban	Loan_Status
0	0.107451	-0.554626	-0.211847	0.273248	1.0	1	0	0	0	0	1	1
1	-0.131680	-0.039581	-0.211847	0.273248	1.0	1	1	1	0	0	0	0
2	-0.430689	-0.554626	-0.939491	0.273248	1.0	1	1	0	0	1	1	1
3	-0.509455	0.250729	-0.305737	0.273248	1.0	1	1	0	1	0	1	1
4	0.135973	-0.554626	-0.059277	0.273248	1.0	1	0	0	0	0	1	1

In [18]:

pd.plotting.scatter_matrix(df_train_num, figsize = (16, 10))
plt.show()

Feature Engineering¶

In [19]:

df_train_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term
df_test_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term

df_train_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term
df_test_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term

In [20]:

for col in ['LoanAmount_per_term', 'ratio_income_per_term']:
    plt.hist(df_train_fill_median[col])
    plt.show()

In [21]:

f = plt.subplots(1, 1, figsize = (14,10))
corr = df_train_fill_median.corr()
sns.heatmap(corr)

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x11de667f0>

Here we see that ApplicantIncome and LoanAmount have high colinearity. Also credit history seems to be highly correlated with loan status, this points to it being the most important feature for our model.

Model Parameter Tuning and Selection¶

Train Test Split¶

In [22]:

columns = df_train_fill_median.drop('Loan_Status',axis =1).columns
Xtrain, Xtest, y_train, y_test = train_test_split(df_train_fill_median[columns], df_train_fill_median.Loan_Status, test_size = .2)

K Folds¶

In [23]:

seed =1 

inner_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
outer_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)

Nested K-Fold Cross Validation¶

In [24]:

def cumlsum_thresh(tup, threshold, sort_vect = False, reverse = True):
    """
    inputs
    __________
    tup: tuple of labels and corresponding values, eg. (label, value)
    threshold: cumulative sum threshold
    sort_vector: should the vector be sorted, default False
    reverse: accending or descending, default True, ie descending
    
    output
    __________
    list of labels where first i values that contribute to the cumulative sum up to threshold
    """
    label, vector = zip(*tup)
    if sort_vect == True:
        vector = sorted(list(vector), reverse=reverse)
    else:
        cuml_sum = 0
        for i, value in enumerate(vector):
            cuml_sum +=value
            if cuml_sum > threshold:
                break
    return list(label[:i])

In [25]:

logit = LogisticRegression()
forest = RandomForestClassifier()
boost = GradientBoostingClassifier()
features = Xtrain.columns

nested_forest_score = np

forest_params = {'max_depth': range(1, 8), 'max_leaf_nodes': range(2, 6)}
boost_params = {'learning_rate': np.arange(.001, 1, .07), 'max_depth': range(1, 8),
                'max_leaf_nodes': range(2, 6)}

clf_forest = GridSearchCV(forest, param_grid=forest_params, scoring='roc_auc', cv=inner_kfold)
clf_boost = GridSearchCV(boost, param_grid=boost_params, scoring='roc_auc', cv=inner_kfold)

#logit feature selection
feat_importance = clf_forest.fit(Xtrain, y_train).best_estimator_.feature_importances_
feat_tup = zip(features, feat_importance)
features_by_importance = sorted(feat_tup, key=lambda x: x[0])
logit_features = cumlsum_thresh(features_by_importance, threshold=.8, reverse=True)
     
nested_forest_score = cross_val_score(estimator=clf_forest, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_boost_score = cross_val_score(estimator=clf_boost, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_logit_score = cross_val_score(estimator=logit, X=Xtrain[logit_features],y=y_train, cv=outer_kfold, 
                                     scoring='roc_auc')
   
print("Random Forest", np.mean(nested_forest_score))
print("Gradient Boosted", np.mean(nested_boost_score))
print("Logit", np.mean(nested_logit_score))

Random Forest 0.7367159312500612
Gradient Boosted 0.7105250316018461
Logit 0.7483750771672988

In [26]:

k_fold = range(1,6)
plt.plot(k_fold, nested_boost_score)
plt.plot(k_fold, nested_forest_score)
plt.plot(k_fold, nested_logit_score)
plt.ylim(ymin=0)
plt.xlabel('K Folds')
plt.ylabel('AUC Score')
plt.legend(['boost', 'forest', 'logit'])

Out[26]:

<matplotlib.legend.Legend at 0x11da5a0f0>

Training the Model¶

In [27]:

rf = RandomForestClassifier(max_depth = 4, max_leaf_nodes=4).fit(Xtrain, y_train)

accuracy_score(y_train, rf.predict(Xtrain)), accuracy_score(y_test, rf.predict(Xtest))

Out[27]:

(0.8224489795918367, 0.7886178861788617)

In [28]:

roc_auc_score(y_test, rf.predict(Xtest))

Out[28]:

0.6962081128747795

Model Submission¶

In [248]:

X_predict = df_test_fill_median

pd_results = pd.DataFrame(rf.predict(X_predict), index = df_test_data.Loan_ID, columns = ['Loan_Status'])
pd_results.Loan_Status = pd_results.Loan_Status.apply(lambda x: 'Y' if x ==1 else 'N')
pd_results.to_csv('results.csv')

Data for this project can be found at:

https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

Predicting Loan Approvals

Analytics Vidhya Loan Prediction III