We want to classify loans into two categories; approved and not approved. We will use past loan applications and their respective statuses to train our classifier. The data used for this project comes from Analytics Vidhya. Typically in a classification problem we would examine precision and recall. However since this is a competiton, we will looking to maximize accuracy.
After using nested cross validation on logistic regression, random forest and gradient boosted random forest models, we see that the random forest model results in the highest ROC AUC score on average with k=5 folds.
Our final r2 score after submission is 0.7847. This is a decent start. As we mentioned in our analysis below we could improve this by imputing our missing values by using the other features to predict the missing values (using regression, knn, or other classifiers). We could also try creating different ensembles, or bagging. We will return to this in a future post!
Features:
Target:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from statsmodels.regression.linear_model import OLS
from sklearn.metrics import accuracy_score, roc_auc_score
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
df_test_data = pd.read_csv('loan_test.csv')
df_train_data = pd.read_csv('loan_train.csv');
df_train_data.head()
df_train_data.describe()
i, j = 0, 0
f, axes = plt.subplots(6, 2, figsize = (11, 20))
plt.subplots_adjust(hspace = .45)
for col in df_train_data.columns[1:]:
if col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
axes[i,j].hist(df_train_data[col].dropna())
else:
bar_data = df_train_data[col].value_counts()
sns.barplot(bar_data.index, bar_data.values, ax = axes[i,j])
axes[i, j].set_title(col)
axes[i, j].spines['top'].set_visible(False)
axes[i, j].spines['right'].set_visible(False)
j += 1
if j ==2:
i += 1
j = 0
Here we see we have a little less than half of our loans classified as not accepted, we should therefore perform stratefied cv.
for col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
sns.boxplot(df_train_data[col].dropna(), orient = 'v')
plt.title(col)
plt.show()
df_train_data[df_train_data['ApplicantIncome'] > 20000].sort_values(by = 'ApplicantIncome')
All applicants with an income greater than 20,000 have higher level of education, it seems reasonable that these applicants would have higher income. However, since most of the dataset is comprised of applicants with higher education this alone would not explain the difference. For most of these points we lack evidence indicating we should remove them.
If we look at row 409, we see that this applicant's income is the largest in our dataset, and suspiciously ends in three 0's. Futhermore, the property area is rural, the credit history is marked 0, and the loan status is marked as having been declined. Given this information it is most likely that the applicant income was entered incorrectly. We should drop this point
df_train_data = df_train_data.drop(409)
df_train_data[df_train_data['LoanAmount'] > 400 ].sort_values(by = 'ApplicantIncome')
Only 4 out of 15 of the loans were denied, but since many of the incomes are fairly high this doesn't seem completely unreasonable.
While some of these points could be questioned, we lack significant evidence that any of these points should be removed.
Below we have imputed categorical features with the mode, and numerical features with the median of the data. An alternative would be to try to use other features to predict the missing feature values, however this does not seem necessary for most of categorical features since they are likely not critical to the model. However predicting missing values in say the 'LoanAmount' or 'CreditHistory' attributes might produce a better result.
#report number of missing values for each feature
for col in df_train_data.columns:
missing_series = df_train_data[col][df_train_data[col].isna() == True]
if missing_series.size > 0:
print(col, missing_series.size)
plt.show()
#fill missing categorical values with mode
categ_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term',
'Credit_History']
for col in categ_cols:
df_train_data[col] = df_train_data[col].fillna(df_train_data[col].mode()[0])
df_test_data[col] = df_test_data[col].fillna(df_test_data[col].mode()[0])
#create dummy variables for categorical features
dummy_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
'Credit_History', 'Property_Area']
train_dummies = pd.get_dummies(df_train_data[dummy_cols], drop_first = True)
test_dummies = pd.get_dummies(df_test_data[dummy_cols], drop_first = True)
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
#feature normalization
df_train_num = (df_train_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()
df_test_num = (df_test_data[num_cols] - df_train_data[num_cols].mean()) / df_train_data[num_cols].std()
#set loan status to 1 if approved, else 0
loan_status = df_train_data.Loan_Status.apply(lambda x: 0 if x == 'N' else 1)
df_train = pd.concat([df_train_num, train_dummies, loan_status], axis =1)
df_test = pd.concat([df_test_num, test_dummies], axis =1)
#create dataframes with numerical missing values dropped, and missing values w/ median imputed
df_train_dropped = df_train.dropna()
df_train_fill_median = df_train.fillna(df_train.median())
df_test_dropped = df_test.dropna()
df_test_fill_median = df_test.fillna(df_test.median())
df_train_fill_median.head()
pd.plotting.scatter_matrix(df_train_num, figsize = (16, 10))
plt.show()
df_train_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term
df_test_fill_median['LoanAmount_per_term'] = df_train_fill_median.LoanAmount/df_train_fill_median.Loan_Amount_Term
df_train_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term
df_test_fill_median['ratio_income_per_term'] = df_train_fill_median.ApplicantIncome/df_train_fill_median.LoanAmount_per_term
for col in ['LoanAmount_per_term', 'ratio_income_per_term']:
plt.hist(df_train_fill_median[col])
plt.show()
f = plt.subplots(1, 1, figsize = (14,10))
corr = df_train_fill_median.corr()
sns.heatmap(corr)
Here we see that ApplicantIncome and LoanAmount have high colinearity. Also credit history seems to be highly correlated with loan status, this points to it being the most important feature for our model.
columns = df_train_fill_median.drop('Loan_Status',axis =1).columns
Xtrain, Xtest, y_train, y_test = train_test_split(df_train_fill_median[columns], df_train_fill_median.Loan_Status, test_size = .2)
seed =1
inner_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
outer_kfold = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)
def cumlsum_thresh(tup, threshold, sort_vect = False, reverse = True):
"""
inputs
__________
tup: tuple of labels and corresponding values, eg. (label, value)
threshold: cumulative sum threshold
sort_vector: should the vector be sorted, default False
reverse: accending or descending, default True, ie descending
output
__________
list of labels where first i values that contribute to the cumulative sum up to threshold
"""
label, vector = zip(*tup)
if sort_vect == True:
vector = sorted(list(vector), reverse=reverse)
else:
cuml_sum = 0
for i, value in enumerate(vector):
cuml_sum +=value
if cuml_sum > threshold:
break
return list(label[:i])
logit = LogisticRegression()
forest = RandomForestClassifier()
boost = GradientBoostingClassifier()
features = Xtrain.columns
nested_forest_score = np
forest_params = {'max_depth': range(1, 8), 'max_leaf_nodes': range(2, 6)}
boost_params = {'learning_rate': np.arange(.001, 1, .07), 'max_depth': range(1, 8),
'max_leaf_nodes': range(2, 6)}
clf_forest = GridSearchCV(forest, param_grid=forest_params, scoring='roc_auc', cv=inner_kfold)
clf_boost = GridSearchCV(boost, param_grid=boost_params, scoring='roc_auc', cv=inner_kfold)
#logit feature selection
feat_importance = clf_forest.fit(Xtrain, y_train).best_estimator_.feature_importances_
feat_tup = zip(features, feat_importance)
features_by_importance = sorted(feat_tup, key=lambda x: x[0])
logit_features = cumlsum_thresh(features_by_importance, threshold=.8, reverse=True)
nested_forest_score = cross_val_score(estimator=clf_forest, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_boost_score = cross_val_score(estimator=clf_boost, X=Xtrain, y=y_train, cv=outer_kfold, scoring='roc_auc')
nested_logit_score = cross_val_score(estimator=logit, X=Xtrain[logit_features],y=y_train, cv=outer_kfold,
scoring='roc_auc')
print("Random Forest", np.mean(nested_forest_score))
print("Gradient Boosted", np.mean(nested_boost_score))
print("Logit", np.mean(nested_logit_score))
k_fold = range(1,6)
plt.plot(k_fold, nested_boost_score)
plt.plot(k_fold, nested_forest_score)
plt.plot(k_fold, nested_logit_score)
plt.ylim(ymin=0)
plt.xlabel('K Folds')
plt.ylabel('AUC Score')
plt.legend(['boost', 'forest', 'logit'])
rf = RandomForestClassifier(max_depth = 4, max_leaf_nodes=4).fit(Xtrain, y_train)
accuracy_score(y_train, rf.predict(Xtrain)), accuracy_score(y_test, rf.predict(Xtest))
roc_auc_score(y_test, rf.predict(Xtest))
X_predict = df_test_fill_median
pd_results = pd.DataFrame(rf.predict(X_predict), index = df_test_data.Loan_ID, columns = ['Loan_Status'])
pd_results.Loan_Status = pd_results.Loan_Status.apply(lambda x: 'Y' if x ==1 else 'N')
pd_results.to_csv('results.csv')
Data for this project can be found at:
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/