We continue with our bike share prediction problem, focusing on multiple linear regression, regularization, dealing with high dimensionality, and PCA, and keep building regression models for the Capital Bikeshare program in Washington, D.C.
To recap, we are looking to predict the number of bikes used per day based on the features listed in the dataframe below.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
%matplotlib inline
df_train = pd.read_csv('bike_train_df.csv')
df_test = pd.read_csv('bike_test_df.csv')
columns = list(df_train.columns)
xtrain = df_train.drop(['count', 'Unnamed: 0'], axis = 1)
xtest = df_test.drop(['count', 'Unnamed: 0'], axis = 1)
ytrain = df_train['count'].values
ytest = df_test['count'].values
sample_sizes = np.arange(100, 450, 50)
xtrain.head()
# standardize the response using the training set mean and standard deviation
ytrain_reg = (ytrain-np.mean(ytrain))/np.std(ytrain)
ytest_reg = (ytest-np.mean(ytrain))/np.std(ytrain)
print(xtrain.columns,"\n")
ols = LinearRegression(fit_intercept= True)
ols.fit(xtrain, ytrain_reg)
print('OLS coefficients:\n', ols.coef_)
alphas_list = [10**i for i in range(-6, 5)]
ridge = RidgeCV(alphas = alphas_list, fit_intercept=False, normalize = True)
ridge.fit(xtrain, ytrain_reg)
print('\nRidge alpha:\n', ridge.alpha_)
print('Ridge coefficients:\n',ridge.coef_)
lasso = LassoCV(alphas = alphas_list, normalize = True)
lasso.fit(xtrain, ytrain_reg)
print('\nLasso alpha:\n',lasso.alpha_)
print('Lasso coefficients:\n',lasso.coef_, '\n')
print(list(filter(lambda x: abs(x[1])< 1e-10, zip(xtrain.columns, ridge.coef_))))
print(list(filter(lambda x: abs(x[1])< 1e-10, zip(xtrain.columns, lasso.coef_))))
Since ridge and lasso regression introduce a bias that limits the size of the coefficients, both models tend to assign smaller coefficient values than OLS.
Ridge regression does not drive coefficients to zero: it constrains the coefficients to keep them from growing too large but never forces them all the way to zero, whereas lasso regression can shrink coefficients exactly to zero.
From the lasso regression we see that the coefficients set to zero belong to predictors that were not found to be statistically significant in our previous post.
Ridge and lasso both assign atemp and temp similar coefficient values. atemp and temp are highly correlated and are likely collinear; by introducing bias into the regression estimate we reduce the variance of the coefficient estimates, which helps keep these coefficients from being inflated.
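One way to see the shrinkage directly is to line the three sets of coefficients up side by side and count how many the lasso has driven exactly to zero. A minimal sketch, reusing the ols, ridge, and lasso models fit above:
# compare the coefficient estimates from the three fitted models
coef_compare = pd.DataFrame({'ols': ols.coef_,
                             'ridge': ridge.coef_,
                             'lasso': lasso.coef_},
                            index=xtrain.columns)
print(coef_compare)
print('lasso coefficients set exactly to zero:', (coef_compare['lasso'] == 0).sum())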
We next analyze the performance of the two shrinkage methods for different training sample sizes:
We will be generating random samples of sizes 100, 150, ..., 400 from the training set.
def sample(x, y, k):
"""
inputs
_________
x (n x d array of predictors in training data)
y (n x 1 array of response variable vals in training data)
k (size of sample)
output
________
random sample of size k
"""
n = x.shape[0] # No. of training points
# Choose random indices of size 'k'
subset_ind = np.random.choice(np.arange(n), k)
    # Get predictors and responses with the indices
x_subset = x[subset_ind, :]
y_subset = y[subset_ind]
return (x_subset, y_subset)
from sklearn.model_selection import GridSearchCV
alphas_list = [10**i for i in range(-6, 5)]
n_folds = 10
def optimize_ridge(n_folds, alphas_list, xtrain, ytrain, xtest, ytest):
opt_ridge = RidgeCV(alphas = alphas_list, cv = n_folds).fit(xtrain, ytrain)
return r2_score(ytrain, opt_ridge.predict(xtrain)), r2_score(ytest, opt_ridge.predict(xtest))
def optimize_lasso(n_folds, alphas_list, xtrain, ytrain, xtest, ytest):
opt_lasso = LassoCV(alphas = alphas_list, cv = n_folds).fit(xtrain, ytrain)
return r2_score(ytrain, opt_lasso.predict(xtrain)), r2_score(ytest, opt_lasso.predict(xtest))
def lin_reg(xtrain, ytrain, xtest, ytest):
linreg = LinearRegression(fit_intercept=True)
linreg.fit(xtrain, ytrain)
return linreg.score(xtrain,ytrain), linreg.score(xtest, ytest)
alphas = [10**i for i in range(-6, 5)]
#create df to store results from each sample size. Each row contains a distinct sample size results.
df_r2 = pd.DataFrame( columns = ['sample','r2_train_ridge', 'std_train_ridge', 'r2_test_ridge',
'std_test_ridge','r2_train_lasso','std_train_lasso', 'r2_test_lasso',
'std_test_lasso','r2_train_lin','std_train_lin', 'r2_test_lin', 'std_test_lin' ])
print(df_r2.head())
for n, size in enumerate(sample_sizes):
ridge_r2, lasso_r2, linreg_r2 = [], [], []
for i in range(10):
Xtrain, Ytrain = sample(np.array(xtrain), np.array(ytrain.reshape(-1, 1)), size)
ridge_r2.append(optimize_ridge(4, alphas, Xtrain, Ytrain, xtest, ytest))
lasso_r2.append(optimize_lasso(4, alphas, Xtrain, Ytrain, xtest, ytest))
linreg_r2.append(lin_reg(Xtrain, Ytrain, xtest, ytest))
ridge_train, ridge_test = zip(*ridge_r2)
lasso_train, lasso_test = zip(*lasso_r2)
lin_train, lin_test = zip(*linreg_r2)
df_r2.loc[n] = [size, np.average(ridge_train),np.std(ridge_train), np.average(ridge_test), np.std(ridge_test),
np.average(lasso_train), np.std(lasso_train), np.average(lasso_test), np.std(lasso_test),
np.average(lin_train), np.std(lin_train), np.average(lin_test), np.std(lin_test)]
df_r2 = df_r2.set_index('sample');
def get_confidence_int(mean_series, std_series):
    # interval of mean +/- one standard deviation across the 10 random samples
    lower, upper = mean_series - std_series, mean_series + std_series
confidence_int = pd.Series(list(zip(lower, upper)), mean_series.index, name = 'confidence interval')
return confidence_int
df_r2[['r2_train_ridge', 'r2_train_lasso', 'r2_train_lin']].plot()
plt.legend(bbox_to_anchor=(1, 1))
print(get_confidence_int(df_r2.r2_train_ridge, df_r2.std_train_ridge))
print(get_confidence_int(df_r2.r2_test_ridge, df_r2.std_test_ridge))
print(get_confidence_int(df_r2.r2_train_lasso, df_r2.std_train_lasso))
print(get_confidence_int(df_r2.r2_test_lasso, df_r2.std_test_lasso))
print(get_confidence_int(df_r2.r2_train_lin, df_r2.std_train_lin))
print(get_confidence_int(df_r2.r2_test_lin, df_r2.std_test_lin))
df_r2[['r2_test_ridge', 'r2_test_lasso', 'r2_test_lin']].plot()
plt.legend(bbox_to_anchor=(1, 1))
f, ax = plt.subplots(1, 2, figsize = (18, 5))
ax[0].errorbar(sample_sizes, df_r2.r2_train_lin, yerr = df_r2.std_train_lin)
ax[1].errorbar(sample_sizes, df_r2.r2_test_lin, yerr = df_r2.std_test_lin)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")
axes = f.add_subplot(111, frameon = False)
axes.axis('off')
for i in [0,1]:
ax[i].spines['top'].set_visible(False)
ax[i].spines['right'].set_visible(False)
plt.xlabel("sample size")
plt.title("MSE R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')
f, ax = plt.subplots(1, 2, figsize = (18, 5))
ax[0].errorbar(sample_sizes, df_r2.r2_train_ridge, yerr = df_r2.std_train_ridge)
ax[1].errorbar(sample_sizes, df_r2.r2_test_ridge, yerr = df_r2.std_test_ridge)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")
axes = f.add_subplot(111, frameon = False)
axes.axis('off')
for i in [0,1]:
ax[i].spines['top'].set_visible(False)
ax[i].spines['right'].set_visible(False)
plt.xlabel("sample size")
plt.title("Ridge R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')
f, ax = plt.subplots(1, 2, figsize = (18, 5), sharey = False, sharex= False, frameon = False)
ax[0].errorbar(sample_sizes, df_r2.r2_train_lasso, yerr = df_r2.std_train_lasso)
ax[1].errorbar(sample_sizes, df_r2.r2_test_lasso, yerr = df_r2.std_test_lasso)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")
axes = f.add_subplot(111, frameon = False)
axes.axis('off')
for i in [0,1]:
ax[i].spines['top'].set_visible(False)
ax[i].spines['right'].set_visible(False)
plt.xlabel("sample size")
plt.title("Lasso R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')
We will now try to improve the performance of the regression model by including higher-order polynomial and interaction terms.
# generate polynomial terms up to degree 4 for each continuous feature
poly = PolynomialFeatures(4)
df_train = df_train.drop(['count', 'Unnamed: 0'], axis = 1)
df_test = df_test.drop(['count', 'Unnamed: 0'], axis = 1)
xtrain.head()
continuous_features = ['temp', 'atemp', 'humidity', 'windspeed']
df_train_fin = xtrain.copy()
df_test_fin = xtest.copy()
for feature in continuous_features:
df1 = pd.DataFrame(poly.fit_transform(xtrain[feature].values.reshape(-1,1)))
df2 = pd.DataFrame(poly.fit_transform(xtest[feature].values.reshape(-1,1)))
df1 = df1.drop(0, axis = 1)
df2 = df2.drop(0, axis = 1)
df1.columns = [feature + str(i) for i in range(1, 5)]
df2.columns = [feature + str(i) for i in range(1, 5)]
df_train_fin = pd.concat([df_train_fin, df1], axis = 1)
df_test_fin = pd.concat([df_test_fin, df2], axis = 1)
df_train_fin = df_train_fin.drop(continuous_features, axis = 1)
df_test_fin = df_test_fin.drop(continuous_features, axis = 1)
df_train_fin.head()
df_test_fin.head()
linreg = OLS(ytrain,df_train_fin).fit()
ytrain_pred = linreg.predict(df_train_fin)
ytest_pred = linreg.predict(df_test_fin)
print(linreg.summary())
print('\nstatistically significant coefficients:\n')
for column, pvalue in list(filter(lambda x: x[1] < .05, zip(df_train_fin.columns, linreg.pvalues))):
print(column +":", pvalue)
print("r2 train score:", r2_score(ytrain, ytrain_pred ),"\nr2 test score:", r2_score(ytest, ytest_pred), '\n')
plt.scatter(df_train.temp, ytrain_pred, label = 'predicted y-value')
plt.scatter(df_train.temp, ytrain, label = 'y-value')
plt.legend()
# save df to use below
df_train_poly = df_train_fin.copy()
df_test_poly = df_test_fin.copy()
df_train_fin['month1_temp'] = df_train_fin['month_1.0'] * df_train_fin['temp1']
df_train_fin['weather1_workingday'] = df_train_fin['weather_1.0'] * df_train_fin['workingday']
df_test_fin['month1_temp'] = df_test_fin['month_1.0'] * df_test_fin['temp1']
df_test_fin['weather1_workingday'] = df_test_fin['weather_1.0'] * df_test_fin['workingday']
linreg2 = OLS(ytrain, df_train_fin).fit()
ytest_pred = linreg2.predict(df_test_fin)
ytrain_pred = linreg2.predict(df_train_fin)
print(linreg2.summary(), '\n\n')
print('statistically significant coefficients:\n')
for column, pvalue in list(filter(lambda x: x[1] < .05, zip(df_train_fin.columns, linreg2.pvalues))):
print(column +":", pvalue)
print("\n Training R2 Score:", r2_score(ytrain, ytrain_pred))
print("\n Test R2 Score:", r2_score(ytest, ytest_pred))
Neither of the added interaction terms is statistically significant.
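As a quick check we can pull out the p-values for just the two interaction terms. A minimal sketch, reusing linreg2 from above (statsmodels returns pvalues as a pandas Series indexed by the design-matrix columns when the exog is a DataFrame):
# p-values for the two interaction terms added above
interaction_terms = ['month1_temp', 'weather1_workingday']
print(linreg2.pvalues[interaction_terms])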
We would now like to fit a model that includes all main effects, polynomial terms up to the $4^{th}$ order, and all pairwise interactions between the predictors. We will then keep the principal components that account for 80% of the total variance; this is our explained-variance target.
poly_features = [ 'temp2', 'temp3', 'temp4', 'atemp2', 'atemp3', 'atemp4', 'humidity2',
'humidity3', 'humidity4', 'windspeed2', 'windspeed3', 'windspeed4']
df_train_pca = df_train.copy()
df_test_pca = df_test.copy()
# pairwise interaction terms between the predictors (the original `columns` list still
# contains 'count' and 'Unnamed: 0', which were dropped above, so use the current columns)
interaction_cols = list(df_train_pca.columns)
poly_col = interaction_cols
for col_1 in interaction_cols:
    poly_col = poly_col[1:]
    for col_2 in poly_col:
        df_train_pca[str(col_1) + ' ' + str(col_2)] = df_train_pca[str(col_1)] * df_train_pca[str(col_2)]
        df_test_pca[str(col_1) + ' ' + str(col_2)] = df_test_pca[str(col_1)] * df_test_pca[str(col_2)]
#combine interaction and higher order data frames
df_train_pca = pd.concat([df_train_pca, df_train_poly[poly_features]], axis =1 )
df_test_pca = pd.concat([df_test_pca, df_test_poly[poly_features]], axis =1 )
df_train_pca.head()
# scale once outside the loop: fit the scaler on the training set and reuse it on the test set
std_scale = MinMaxScaler().fit(df_train_pca)
pca_train_std = scale(std_scale.transform(df_train_pca))
pca_test_std = scale(std_scale.transform(df_test_pca))
for component in range(1, 15):
    # fit PCA and the regression on the training set only,
    # then project the test set onto the same components
    pca_model = PCA(n_components=component)
    scaled_train = pca_model.fit_transform(pca_train_std)
    scaled_test = pca_model.transform(pca_test_std)
    print("Cumulative explained variance:", np.cumsum(np.round(pca_model.explained_variance_ratio_, decimals=5)*100))
    ols.fit(scaled_train, ytrain)
    print("R2 Training: ", ols.score(scaled_train, ytrain))
    print("R2 Test: ", ols.score(scaled_test, ytest))
    print("\n")
Using principal component analysis we've managed to create the model that so far generalizes best to our bike share data. While we may not have matched the best known solutions, we've been able to explore regularization and to build, and cope with, high-dimensional feature sets for linear models. By introducing higher-order and interaction terms we can often extract more information from our data that is useful for our models. In this case, PCA helped address the overfitting that the standard regularization techniques (the L1 and L2 penalties of lasso and ridge) did not.