We will continue to from our bike share prediction problem focusing on multiple linear regression, regularization, dealing with high dimensionality, and PCA. We will continue to build regression models for the Capital Bikeshare program in Washington D.C.

To recap, we are looking to predict the number of bikes used per day based off the features listed in the dataframe below.

Regularization, High Dimensionality, PCA¶

In [5]:

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
%matplotlib inline

Regularization¶

In [24]:

df_train = pd.read_csv('bike_train_df.csv')
df_test = pd.read_csv('bike_test_df.csv')

columns = list(df_train.columns)

xtrain = df_train.drop(['count', 'Unnamed: 0'], axis = 1)
xtest = df_test.drop(['count', 'Unnamed: 0'], axis = 1)

ytrain = df_train['count'].values
ytest = df_test['count'].values

sample_sizes = np.arange(100, 450, 50)

xtrain.head()

Out[24]:

	workingday	temp	atemp	humidity	windspeed	weather_1.0	weather_2.0	season_2.0	...	month_5.0	month_6.0	month_9.0
0	1.0	0.623798	0.650106	0.920664	-0.928758	0	1	1	...	1	0	0
1	1.0	-0.180310	-0.054759	0.696852	-0.213502	1	0	0	...	0	0	0
2	1.0	0.802489	0.851495	-0.448383	0.803926	1	0	1	...	0	1	0
3	0.0	-1.520492	-1.565182	-0.332113	-0.269099	1	0	0	...	0	0	0
4	1.0	0.534453	0.348021	1.975789	-1.199027	0	0	0	...	0	0	1

5 rows × 28 columns

In [56]:

#normalize
ytrain_reg = (ytrain-np.mean(ytrain))/np.std(ytrain)
ytest_reg = (ytest-np.mean(ytrain))/np.std(ytrain)

In [57]:

print(xtrain.columns,"\n")

ols = LinearRegression(fit_intercept= True)
ols.fit(xtrain, ytrain_reg)
print('OLS coefficents:\n', ols.coef_)


alphas_list = [10**i for i in range(-6, 5)]

ridge = RidgeCV(alphas = alphas_list, fit_intercept=False, normalize = True)
ridge.fit(xtrain, ytrain_reg)
print('\nRidge alpha:\n', ridge.alpha_)
print('Ridge coefficents:\n',ridge.coef_)

lasso = LassoCV(alphas = alphas_list, normalize = True)
lasso.fit(xtrain, ytrain_reg)
print('\nLasso alpha:\n',lasso.alpha_)
print('Lasso coefficents:\n',lasso.coef_, '\n')


print(list(filter(lambda x: abs(x[1])< 1e-10, zip(xtrain.columns, ridge.coef_))))
print(list(filter(lambda x: abs(x[1])< 1e-10, zip(xtrain.columns, lasso.coef_))))

Index(['holiday', 'workingday', 'temp', 'atemp', 'humidity', 'windspeed',
       'weather_1.0', 'weather_2.0', 'season_1.0', 'season_2.0', 'season_3.0',
       'day_of_week_0.0', 'day_of_week_1.0', 'day_of_week_2.0',
       'day_of_week_3.0', 'day_of_week_4.0', 'day_of_week_5.0', 'month_1.0',
       'month_2.0', 'month_3.0', 'month_4.0', 'month_5.0', 'month_6.0',
       'month_7.0', 'month_8.0', 'month_9.0', 'month_10.0', 'month_11.0'],
      dtype='object') 

OLS coefficents:
 [-0.31908752 -0.01246811  0.47906069  0.16168241 -0.2838412  -0.13202413
  0.81866252  0.81008943 -0.63454282 -0.16940518 -0.10003395 -0.24070924
 -0.13281458 -0.16983315  0.01946434 -0.03707451 -0.01129774  0.06149669
  0.10752256  0.18527111  0.23400248  0.02743919 -0.34849373 -0.60088749
 -0.34032386  0.27115613  0.31312808  0.11980864]

Ridge alpha:
 10.0
Ridge coefficents:
 [-0.12816281 -0.00168024  0.30931428  0.26988423 -0.29764016 -0.13827485
  0.1872353   0.1872878  -0.40214081 -0.05258219 -0.13472559 -0.16088292
 -0.11324902 -0.09053026  0.02606265  0.00785153  0.04002204 -0.14484142
 -0.0656061   0.01658966  0.11150721  0.0247596  -0.19121704 -0.30556904
 -0.08492651  0.3222249   0.29526271  0.07237838]

Lasso alpha:
 0.001
Lasso coefficents:
 [-0.14399826  0.          0.26606627  0.25138658 -0.27207847 -0.11449332
  0.19729063  0.16024818 -0.49173865  0.         -0.01757944 -0.1329803
 -0.07463666 -0.04962681  0.          0.          0.02004495 -0.10747593
 -0.          0.          0.0456092   0.         -0.18506434 -0.35180385
 -0.07153066  0.35173043  0.35964927  0.02160554] 

[]
[('workingday', 0.0), ('season_2.0', 0.0), ('day_of_week_3.0', 0.0), ('day_of_week_4.0', 0.0), ('month_2.0', -0.0), ('month_3.0', 0.0), ('month_5.0', 0.0)]

/usr/local/lib/python3.7/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

Since the ridge and lasso regressions introduce a bias, which limits the size of the coefficients, both models tend to assign smaller values to the coefficients.

The ridge regression does not assign coefficients close to zero. The ridge regression tries to put constraints on the coefficients to prevent them from getting too large, without being allowed to be zero, while the lasso regression allows coefficients to be forced to zero.

From the lasso regression we see that the coefficients that are assigned zero are coeffients that were not found to be statistically significant from our previous post.

The ridge and lasso both assign atemp and temp similar values for coefficient. Atemp and temp are highly correlated and could be collinear. By introducing bias into the regression estimate, we reduce the variance of the standard errors and hopefully this helps give a less inflated coefficient assignment.

We next analyze the performance of the two shrinkage methods for different training sample sizes:

We will be generating random samples of sizes 100, 150, ..., 400 from the training set.

In [58]:

def sample(x, y, k):
    """  
    inputs
    _________
    x (n x d array of predictors in training data)
    y (n x 1 array of response variable vals in training data)
    k (size of sample) 
    output
    ________
    random sample of size k
    """
    n = x.shape[0] # No. of training points
    
    # Choose random indices of size 'k'
    subset_ind = np.random.choice(np.arange(n), k)
    
    # Get predictors and reponses with the indices
    x_subset = x[subset_ind, :]
    y_subset = y[subset_ind]
    
    return (x_subset, y_subset)

In [59]:

from sklearn.model_selection import GridSearchCV
alphas_list = [10**i for i in range(-6, 5)]
n_folds = 10

def optimize_ridge(n_folds, alphas_list, xtrain, ytrain, xtest, ytest):
    opt_ridge = RidgeCV(alphas = alphas_list, cv = n_folds).fit(xtrain, ytrain)
    return r2_score(ytrain, opt_ridge.predict(xtrain)), r2_score(ytest, opt_ridge.predict(xtest))

def optimize_lasso(n_folds, alphas_list, xtrain, ytrain, xtest, ytest):
    opt_lasso = LassoCV(alphas = alphas_list, cv = n_folds).fit(xtrain, ytrain)
    return r2_score(ytrain, opt_lasso.predict(xtrain)), r2_score(ytest, opt_lasso.predict(xtest))

def lin_reg(xtrain, ytrain, xtest, ytest):
    linreg = LinearRegression(fit_intercept=True)
    linreg.fit(xtrain, ytrain)
    return linreg.score(xtrain,ytrain), linreg.score(xtest, ytest)

In [ ]:

alphas = [10**i for i in range(-6, 5)]

#create df to store results from each sample size. Each row contains a distinct sample size results. 
df_r2 = pd.DataFrame( columns = ['sample','r2_train_ridge', 'std_train_ridge', 'r2_test_ridge', 
                                 'std_test_ridge','r2_train_lasso','std_train_lasso',  'r2_test_lasso', 
                                 'std_test_lasso','r2_train_lin','std_train_lin',  'r2_test_lin', 'std_test_lin' ])
print(df_r2.head())
for n, size in enumerate(sample_sizes):
    ridge_r2, lasso_r2, linreg_r2 = [], [], []
    for i in range(10):
       
        Xtrain, Ytrain = sample(np.array(xtrain), np.array(ytrain.reshape(-1, 1)), size)
        ridge_r2.append(optimize_ridge(4, alphas, Xtrain, Ytrain, xtest, ytest))
        lasso_r2.append(optimize_lasso(4, alphas, Xtrain, Ytrain, xtest, ytest))
        linreg_r2.append(lin_reg(Xtrain, Ytrain, xtest, ytest))
    ridge_train, ridge_test = zip(*ridge_r2) 
    lasso_train, lasso_test = zip(*lasso_r2)
    lin_train, lin_test = zip(*linreg_r2)
    df_r2.loc[n] = [size, np.average(ridge_train),np.std(ridge_train), np.average(ridge_test), np.std(ridge_test),
                    np.average(lasso_train),  np.std(lasso_train), np.average(lasso_test), np.std(lasso_test),
                   np.average(lin_train), np.std(lin_train), np.average(lin_test), np.std(lin_test)]

df_r2 = df_r2.set_index('sample');

In [13]:

def get_confidence_int(mean_series, std_series):
    lower, upper = mean_series - std_series, mean_series + std_series
    confidence_int = pd.Series(list(zip(lower, upper)), mean_series.index, name = 'confidence interval')
    return confidence_int

df_r2[['r2_train_ridge', 'r2_train_lasso', 'r2_train_lin']].plot()
plt.legend(bbox_to_anchor=(1, 1))

print(get_confidence_int(df_r2.r2_train_ridge, df_r2.std_train_ridge))
print(get_confidence_int(df_r2.r2_test_ridge, df_r2.std_test_ridge))
print(get_confidence_int(df_r2.r2_train_lasso, df_r2.std_train_ridge))
print(get_confidence_int(df_r2.r2_test_lasso, df_r2.std_test_ridge))
print(get_confidence_int(df_r2.r2_train_lin, df_r2.std_train_lin))
print(get_confidence_int(df_r2.r2_test_lin, df_r2.std_test_lin))


df_r2[['r2_test_ridge', 'r2_test_lasso', 'r2_test_lin']].plot()
plt.legend(bbox_to_anchor=(1, 1))

sample
100.0    (0.5576926990287421, 0.6843650420788236)
150.0    (0.5196887946312793, 0.6769591516540446)
200.0    (0.5587148446954685, 0.6763709904059517)
250.0    (0.5787030441560373, 0.6470266834198951)
300.0     (0.563347204464423, 0.6247109066728123)
350.0    (0.5629705435757407, 0.6189221982898468)
400.0    (0.5864967138753433, 0.6405255343023081)
Name: confidence interval, dtype: object
sample
100.0    (0.17689876768558815, 0.25776393871821873)
150.0     (0.1924700046069408, 0.25984911092847734)
200.0    (0.12064442608863479, 0.24591148110770644)
250.0    (0.21384809184382803, 0.24513586769485984)
300.0      (0.21323543916205856, 0.251880152635204)
350.0    (0.22181939172224222, 0.25813840153422324)
400.0    (0.18745859562415632, 0.25378662534823876)
Name: confidence interval, dtype: object
sample
100.0     (0.5584522441653824, 0.685124587215464)
150.0    (0.5324909268655075, 0.6897612838882728)
200.0     (0.5662687005590518, 0.683924846269535)
250.0    (0.5868727990356841, 0.6551964382995419)
300.0     (0.5696226198670528, 0.630986322075442)
350.0    (0.5662232600371984, 0.6221749147513045)
400.0    (0.5880547082197587, 0.6420835286467235)
Name: confidence interval, dtype: object
sample
100.0    (0.16096030852751386, 0.24182547956014444)
150.0    (0.17252861218766424, 0.23990771850920078)
200.0    (0.11199357582695549, 0.23726063084602714)
250.0    (0.20648109832760086, 0.23776887417863268)
300.0    (0.21081651132624427, 0.24946122479938973)
350.0      (0.21546443253002096, 0.251783442342002)
400.0     (0.19163717915509695, 0.2579652088791794)
Name: confidence interval, dtype: object
sample
100.0      (0.64917221005341, 0.7243259175623934)
150.0    (0.5928513217756791, 0.7109938121189759)
200.0     (0.592027259882326, 0.6900521132224495)
250.0    (0.6078258794480841, 0.6640476907397433)
300.0    (0.5844578518700634, 0.6422319300445511)
350.0     (0.5825673805909394, 0.629585908899413)
400.0    (0.6011667534673796, 0.6460563938850507)
Name: confidence interval, dtype: object
sample
100.0    (0.011456083263311184, 0.1538277691017851)
150.0     (0.08906962937958993, 0.1838320895870358)
200.0     (0.08327003569626587, 0.1733169273844545)
250.0    (0.15562208806453087, 0.22275030674764706)
300.0      (0.181348228715907, 0.23095485341110572)
350.0    (0.19109417896836495, 0.24676699929604445)
400.0    (0.16896630460629317, 0.23258446739316227)
Name: confidence interval, dtype: object

Out[13]:

<matplotlib.legend.Legend at 0x12082c7b8>

In [14]:

f, ax = plt.subplots(1, 2, figsize = (18, 5))
ax[0].errorbar(sample_sizes, df_r2.r2_train_lin, yerr = df_r2.std_train_lin)
ax[1].errorbar(sample_sizes, df_r2.r2_test_lin, yerr = df_r2.std_test_lin)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")

axes = f.add_subplot(111, frameon = False)
axes.axis('off')

for i in [0,1]:
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
    
plt.xlabel("sample size")
plt.title("MSE R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')

Out[14]:

Text(0.5,1,'MSE R2 Confidence Intervals\n ')

In [15]:

f, ax = plt.subplots(1, 2, figsize = (18, 5))
ax[0].errorbar(sample_sizes, df_r2.r2_train_ridge, yerr = df_r2.std_train_ridge)
ax[1].errorbar(sample_sizes, df_r2.r2_test_ridge, yerr = df_r2.std_test_ridge)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")

axes = f.add_subplot(111, frameon = False)
axes.axis('off')

for i in [0,1]:
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
plt.xlabel("sample size")
plt.title("Ridge R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')

Out[15]:

Text(0.5,1,'Ridge R2 Confidence Intervals\n ')

In [16]:

f, ax = plt.subplots(1, 2, figsize = (18, 5), sharey = False, sharex= False, frameon = False)


ax[0].errorbar(sample_sizes, df_r2.r2_train_lasso, yerr = df_r2.std_train_lasso)
ax[1].errorbar(sample_sizes, df_r2.r2_test_lasso, yerr = df_r2.std_test_lasso)
ax[0].set_ylabel("train r2")
ax[1].set_ylabel("test r2")

axes = f.add_subplot(111, frameon = False)
axes.axis('off')

for i in [0,1]:
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
plt.xlabel("sample size")
plt.title("Lasso R2 Confidence Intervals\n ", fontsize = 16, loc = 'center')

Out[16]:

Text(0.5,1,'Lasso R2 Confidence Intervals\n ')

Polynomial and Interaction Terms¶

We will now try to improve the performance of the regression model by including higher-order polynomial and interaction terms.

In [17]:

poly = PolynomialFeatures(4)

df_train = df_train.drop(['count', 'Unnamed: 0'], axis = 1)
df_test = df_test.drop(['count', 'Unnamed: 0'], axis = 1)
xtrain.head()

Out[17]:

	workingday	temp	atemp	humidity	windspeed	weather_1.0	weather_2.0	season_2.0	...	month_5.0	month_6.0	month_9.0
0	1.0	0.623798	0.650106	0.920664	-0.928758	0	1	1	...	1	0	0
1	1.0	-0.180310	-0.054759	0.696852	-0.213502	1	0	0	...	0	0	0
2	1.0	0.802489	0.851495	-0.448383	0.803926	1	0	1	...	0	1	0
3	0.0	-1.520492	-1.565182	-0.332113	-0.269099	1	0	0	...	0	0	0
4	1.0	0.534453	0.348021	1.975789	-1.199027	0	0	0	...	0	0	1

5 rows × 28 columns

In [35]:

continuous_features = ['temp', 'atemp', 'humidity', 'windspeed']
df_train_fin = xtrain.copy()
df_test_fin = xtest.copy()

for feature in continuous_features:
    df1 = pd.DataFrame(poly.fit_transform(xtrain[feature].values.reshape(-1,1)))
    df2 = pd.DataFrame(poly.fit_transform(xtest[feature].values.reshape(-1,1)))
    df1 = df1.drop(0, axis = 1)
    df2 = df2.drop(0, axis = 1)
    df1.columns = [feature + str(i) for i in range(1, 5)] 
    df2.columns = [feature + str(i) for i in range(1, 5)]
    df_train_fin = pd.concat([df_train_fin, df1], axis = 1)
    df_test_fin = pd.concat([df_test_fin, df2], axis = 1)

df_train_fin = df_train_fin.drop(continuous_features, axis = 1)
df_test_fin = df_test_fin.drop(continuous_features, axis = 1)
df_train_fin.head()

Out[35]:

	workingday	weather_1.0	weather_2.0	season_2.0	season_3.0	day_of_week_0.0	day_of_week_2.0	...	atemp3	atemp4	humidity1	humidity2	humidity3	humidity4	windspeed1	windspeed2	windspeed3	windspeed4
0	1.0	0	1	1	0	0	1	...	0.274759	0.178622	0.920664	0.847622	0.780375	0.718464	-0.928758	0.862591	-0.801139	0.744064
1	1.0	1	0	0	0	0	1	...	-0.000164	0.000009	0.696852	0.485603	0.338393	0.235810	-0.213502	0.045583	-0.009732	0.002078
2	1.0	1	0	1	0	0	0	...	0.617372	0.525689	-0.448383	0.201047	-0.090146	0.040420	0.803926	0.646297	0.519575	0.417699
3	0.0	1	0	0	0	1	0	...	-3.834373	6.001490	-0.332113	0.110299	-0.036632	0.012166	-0.269099	0.072414	-0.019487	0.005244
4	1.0	0	0	0	1	0	0	...	0.042152	0.014670	1.975789	3.903744	7.712976	15.239217	-1.199027	1.437666	-1.723801	2.066885

5 rows × 40 columns

In [36]:

df_test_fin.head()

Out[36]:

	workingday	weather_1.0	weather_2.0	season_1.0	season_2.0	day_of_week_0.0	day_of_week_1.0	...	atemp3	atemp4	humidity1	humidity2	humidity3	humidity4	windspeed1	windspeed2	windspeed3	windspeed4
0	1.0	1	0	1	0	0	0	...	-2.536556	3.459334	-0.500703	0.250704	-0.125528	0.062852	0.040945	0.001676	0.000069	0.000003
1	1.0	0	1	1	0	0	0	...	-4.623049	7.701430	0.132958	0.017678	0.002350	0.000313	2.036025	4.145397	8.440130	17.184314
2	1.0	0	1	0	1	0	0	...	5.430888	9.546139	-0.457103	0.208943	-0.095509	0.043657	-0.523392	0.273940	-0.143378	0.075043
3	1.0	1	0	1	0	0	1	...	-0.438323	0.332960	-0.997746	0.995497	-0.993253	0.991014	0.986696	0.973568	0.960616	0.947835
4	0.0	1	0	0	1	1	0	...	0.863319	0.822044	0.441062	0.194535	0.085802	0.037844	0.311061	0.096759	0.030098	0.009362

5 rows × 40 columns

In [37]:

linreg = OLS(ytrain,df_train_fin).fit()

ytrain_pred = linreg.predict(df_train_fin)
ytest_pred = linreg.predict(df_test_fin)

print(linreg.summary())

print('\nstatisticaly significant coefficients:\n')
for column, pvalue  in list(filter(lambda x: x[1] < .05, zip(columns, linreg.pvalues))):
    print(column +":", pvalue)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.940
Model:                            OLS   Adj. R-squared:                  0.933
Method:                 Least Squares   F-statistic:                     118.3
Date:                Sun, 02 Sep 2018   Prob (F-statistic):          5.19e-156
Time:                        23:09:54   Log-Likelihood:                -2821.1
No. Observations:                 331   AIC:                             5720.
Df Residuals:                     292   BIC:                             5868.
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
holiday          -198.8353    405.001     -0.491      0.624    -995.926     598.255
workingday        282.6116    169.681      1.666      0.097     -51.342     616.565
weather_1.0      4329.2460    367.698     11.774      0.000    3605.573    5052.919
weather_2.0      3972.8819    358.993     11.067      0.000    3266.341    4679.423
season_1.0       -647.7808    495.784     -1.307      0.192   -1623.544     327.982
season_2.0         11.0676    576.718      0.019      0.985   -1123.984    1146.119
season_3.0        561.4693    453.577      1.238      0.217    -331.226    1454.165
day_of_week_0.0  -134.4444    265.316     -0.507      0.613    -656.619     387.730
day_of_week_1.0  -190.1348    172.034     -1.105      0.270    -528.719     148.450
day_of_week_2.0  -185.2944    203.620     -0.910      0.364    -586.044     215.455
day_of_week_3.0   237.6566    213.010      1.116      0.265    -181.573     656.886
day_of_week_4.0    87.4325    205.273      0.426      0.670    -316.569     491.434
day_of_week_5.0   134.1165    201.279      0.666      0.506    -262.024     530.257
month_1.0         750.4693    523.741      1.433      0.153    -280.316    1781.254
month_2.0         513.7789    521.832      0.985      0.326    -513.249    1540.807
month_3.0         692.6118    512.814      1.351      0.178    -316.669    1701.893
month_4.0        1150.3478    645.863      1.781      0.076    -120.789    2421.485
month_5.0         635.5150    689.609      0.922      0.358    -721.720    1992.749
month_6.0         532.6848    679.106      0.784      0.433    -803.878    1869.248
month_7.0         847.8176    671.700      1.262      0.208    -474.170    2169.805
month_8.0         475.7832    665.182      0.715      0.475    -833.376    1784.943
month_9.0         862.2601    541.449      1.593      0.112    -203.377    1927.897
month_10.0       1259.8345    396.564      3.177      0.002     479.348    2040.321
month_11.0       1140.9279    363.674      3.137      0.002     425.172    1856.683
temp1            -367.1554    815.192     -0.450      0.653   -1971.553    1237.242
temp2           -1970.8729    893.173     -2.207      0.028   -3728.745    -213.000
temp3             235.5108    299.814      0.786      0.433    -354.559     825.581
temp4               9.7170    187.318      0.052      0.959    -358.949     378.383
atemp1           1477.3777    775.567      1.905      0.058     -49.032    3003.787
atemp2           1548.3650    861.137      1.798      0.073    -146.457    3243.186
atemp3           -465.9985    268.157     -1.738      0.083    -993.764      61.767
atemp4           -104.8450    160.987     -0.651      0.515    -421.688     211.997
humidity1        -549.4877    171.250     -3.209      0.001    -886.529    -212.446
humidity2         117.3974    168.168      0.698      0.486    -213.579     448.373
humidity3          79.5427     47.151      1.687      0.093     -13.256     172.341
humidity4         -13.2129     34.391     -0.384      0.701     -80.898      54.472
windspeed1       -406.2235    162.784     -2.495      0.013    -726.601     -85.846
windspeed2         74.3488    137.976      0.539      0.590    -197.206     345.903
windspeed3         47.8207     71.592      0.668      0.505     -93.081     188.723
windspeed4        -34.2834     33.107     -1.036      0.301     -99.442      30.875
==============================================================================
Omnibus:                       14.666   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                6.708
Skew:                          -0.062   Prob(JB):                       0.0349
Kurtosis:                       2.314   Cond. No.                     3.84e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.6e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

statisticaly significant coefficients:

workingday: 1.902306662445104e-26
temp: 5.2675155009381036e-24
month_4.0: 0.0016482186111859399
month_5.0: 0.0018797330943107796
month_7.0: 0.02812037326044718

In [38]:

print("r2 train score:", r2_score(ytrain, ytrain_pred ),"\nr2 test score:", r2_score(ytest, ytest_pred), '\n')


plt.scatter(df_train.temp, ytrain_pred, label = 'predicted y-value')
plt.scatter(df_train.temp, ytrain, label = 'y-value')
plt.legend()

r2 train score: 0.6034813159535967 
r2 test score: 0.1853015563908097

Out[38]:

<matplotlib.legend.Legend at 0x1146b5940>

In [47]:

# save df to use below
df_train_poly = df_train_fin.copy()
df_test_poly = df_test_fin.copy()

df_train_fin['month1_temp'] = df_train_fin['month_1.0'] * df_train_fin['temp1']
df_train_fin['weather1_workingday'] = df_train_fin['weather_1.0'] * df_train_fin['workingday']

df_test_fin['month1_temp'] = df_test_fin['month_1.0'] * df_test_fin['temp1']
df_test_fin['weather1_workingday'] = df_test_fin['weather_1.0'] * df_test_fin['workingday']

 
linreg2 = OLS(ytrain, df_train_fin).fit()


ytest_pred = linreg2.predict(df_test_fin)
ytrain_pred = linreg2.predict(df_train_fin)


print(linreg2.summary(), '\n\n')
print('statistically significant coefficients:\n')
for column, pvalue  in list(filter(lambda x: x[1] < .05, zip(columns, linreg2.pvalues))):
    print(column +":", pvalue)

print("\n Training R2 Score:", r2_score(ytrain, ytrain_pred))
print("\n Test R2 Score:", r2_score(ytest, ytest_pred))

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.932
Method:                 Least Squares   F-statistic:                     112.5
Date:                Sun, 02 Sep 2018   Prob (F-statistic):          2.60e-154
Time:                        23:17:50   Log-Likelihood:                -2820.1
No. Observations:                 331   AIC:                             5722.
Df Residuals:                     290   BIC:                             5878.
Df Model:                          41                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
holiday              -252.6033    409.464     -0.617      0.538   -1058.501     553.294
workingday            476.2002    251.390      1.894      0.059     -18.581     970.981
weather_1.0          4457.9540    394.333     11.305      0.000    3681.836    5234.072
weather_2.0          3840.0289    374.377     10.257      0.000    3103.187    4576.870
season_1.0           -762.5162    503.836     -1.513      0.131   -1754.156     229.123
season_2.0           -111.3314    584.850     -0.190      0.849   -1262.420    1039.757
season_3.0            494.3041    456.636      1.082      0.280    -404.436    1393.044
day_of_week_0.0      -179.4585    267.707     -0.670      0.503    -706.353     347.436
day_of_week_1.0      -161.5069    173.658     -0.930      0.353    -503.297     180.283
day_of_week_2.0      -160.7042    204.853     -0.784      0.433    -563.891     242.483
day_of_week_3.0       264.9275    214.443      1.235      0.218    -157.134     686.989
day_of_week_4.0       114.0535    207.330      0.550      0.583    -294.008     522.115
day_of_week_5.0       166.8271    203.828      0.818      0.414    -234.342     567.996
month_1.0            1509.4565   1207.733      1.250      0.212    -867.576    3886.489
month_2.0             587.4550    525.442      1.118      0.264    -446.709    1621.619
month_3.0             811.8583    523.214      1.552      0.122    -217.920    1841.636
month_4.0            1283.4365    657.105      1.953      0.052      -9.862    2576.735
month_5.0             703.1168    696.728      1.009      0.314    -668.169    2074.402
month_6.0             628.3724    684.767      0.918      0.360    -719.371    1976.115
month_7.0             923.2854    674.886      1.368      0.172    -405.011    2251.582
month_8.0             539.3965    667.848      0.808      0.420    -775.047    1853.840
month_9.0             881.3291    543.091      1.623      0.106    -187.570    1950.228
month_10.0           1219.5430    405.756      3.006      0.003     420.943    2018.142
month_11.0           1075.1640    374.513      2.871      0.004     338.056    1812.272
temp1                -326.2725    818.680     -0.399      0.691   -1937.580    1285.035
temp2               -2058.6638    900.924     -2.285      0.023   -3831.842    -285.486
temp3                 196.4916    306.492      0.641      0.522    -406.738     799.722
temp4                  41.7678    194.692      0.215      0.830    -341.421     424.957
atemp1               1476.0714    776.081      1.902      0.058     -51.394    3003.537
atemp2               1625.7186    868.679      1.871      0.062     -83.997    3335.434
atemp3               -460.6486    268.562     -1.715      0.087    -989.226      67.929
atemp4               -123.7665    164.150     -0.754      0.451    -446.843     199.310
humidity1            -545.2832    172.072     -3.169      0.002    -883.952    -206.615
humidity2             115.5490    168.286      0.687      0.493    -215.667     446.765
humidity3              76.1689     47.707      1.597      0.111     -17.726     170.064
humidity4             -14.8538     34.522     -0.430      0.667     -82.799      53.091
windspeed1           -409.1374    162.872     -2.512      0.013    -729.698     -88.577
windspeed2             80.7210    138.133      0.584      0.559    -191.150     352.592
windspeed3             50.6671     71.690      0.707      0.480     -90.431     191.765
windspeed4            -36.9065     33.191     -1.112      0.267    -102.232      28.419
month1_temp           557.9191    839.078      0.665      0.507   -1093.537    2209.375
weather1_workingday  -369.5725    342.332     -1.080      0.281   -1043.342     304.197
==============================================================================
Omnibus:                       13.886   Durbin-Watson:                   1.944
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                6.448
Skew:                          -0.052   Prob(JB):                       0.0398
Kurtosis:                       2.324   Cond. No.                     4.29e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.07e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular. 


statistically significant coefficients:

workingday: 8.530090721164868e-25
temp: 2.8953124920538826e-21
month_4.0: 0.0028822450702410437
month_5.0: 0.004395452902218839
month_7.0: 0.023031308725633175

 Training R2 Score: 0.605846696694117

 Test R2 Score: 0.1751436795230208

The added interaction terms are not statistically significant

PCA¶

We would like to fit a model to include all main effects, polynomial terms up to the $4^{th}$ order, and all interactions between all possible predictors and polynomial terms. We will include the terms that account for 80% of total variance, this will be our explained variance.

In [40]:

poly_features = [ 'temp2', 'temp3', 'temp4', 'atemp2', 'atemp3', 'atemp4', 'humidity2',
                 'humidity3', 'humidity4', 'windspeed2', 'windspeed3', 'windspeed4']


df_train_pca = df_train.copy()
df_test_pca = df_test.copy()

poly_col = columns
for col_1 in columns:
    poly_col = poly_col[1:]
    for col_2 in poly_col:
        df_train_pca[str(col_1) + ' ' + str(col_2)] = df_train_pca[str(col_1)] * df_train_pca[str(col_2)]
        df_test_pca[str(col_1) + ' ' + str(col_2)] = df_test_pca[str(col_1)] * df_test_pca[str(col_2)]
        
#combine interaction and higher order data frames 
df_train_pca = pd.concat([df_train_pca, df_train_poly[poly_features]], axis =1 )
df_test_pca = pd.concat([df_test_pca, df_test_poly[poly_features]], axis =1 )

df_train_pca.head()

Out[40]:

	Unnamed: 0	workingday	temp	atemp	humidity	windspeed	count	weather_1.0	weather_2.0	...	temp4	atemp2	atemp3	atemp4	humidity2	humidity3	humidity4	windspeed2	windspeed3	windspeed4
0	0	1.0	0.623798	0.650106	0.920664	-0.928758	6073.0	0	1	...	0.151418	0.422637	0.274759	0.178622	0.847622	0.780375	0.718464	0.862591	-0.801139	0.744064
1	1	1.0	-0.180310	-0.054759	0.696852	-0.213502	6606.0	1	0	...	0.001057	0.002998	-0.000164	0.000009	0.485603	0.338393	0.235810	0.045583	-0.009732	0.002078
2	2	1.0	0.802489	0.851495	-0.448383	0.803926	7363.0	1	0	...	0.414722	0.725044	0.617372	0.525689	0.201047	-0.090146	0.040420	0.646297	0.519575	0.417699
3	3	0.0	-1.520492	-1.565182	-0.332113	-0.269099	2431.0	1	0	...	5.344859	2.449794	-3.834373	6.001490	0.110299	-0.036632	0.012166	0.072414	-0.019487	0.005244
4	4	1.0	0.534453	0.348021	1.975789	-1.199027	1996.0	0	0	...	0.081590	0.121119	0.042152	0.014670	3.903744	7.712976	15.239217	1.437666	-1.723801	2.066885

5 rows × 477 columns

In [41]:

for component in range(1, 15):
    
    std_scale = MinMaxScaler().fit(df_train_pca)
    pca_train_std = std_scale.transform(df_train_pca)
    pca_test_std = std_scale.transform(df_test_pca)
    
    pca_model = PCA(n_components=component)
    
    
    scaled_train = pca_model.fit_transform(scale(pca_train_std))
    scaled_test = pca_model.fit_transform(scale(pca_test_std))
    
    print("Cumulative explained variance:", np.cumsum(np.round(pca_model.explained_variance_ratio_, decimals=5)*100))
    print("R2 Training: ", ols.fit(scaled_train, ytrain).score(scaled_train, ytrain))
    print("R2 Test: ", ols.fit(scaled_test, ytest).score(scaled_test, ytest))
    print("\n")

Cumulative explained variance: [7.892]
R2 Training:  0.39677258363956835
R2 Test:  0.23424358570529077


Cumulative explained variance: [ 7.892 12.567]
R2 Training:  0.4068652884136146
R2 Test:  0.23718544933922603


Cumulative explained variance: [ 7.892 12.567 16.592]
R2 Training:  0.43577304302959163
R2 Test:  0.2583314265820338


Cumulative explained variance: [ 7.892 12.567 16.592 19.512]
R2 Training:  0.47414165616497295
R2 Test:  0.26021485412666745


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283]
R2 Training:  0.47438424235223825
R2 Test:  0.2602255456189443


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956]
R2 Training:  0.4944167544695589
R2 Test:  0.3087573494441138


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541]
R2 Training:  0.5172468255877385
R2 Test:  0.30977224547826865


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062]
R2 Training:  0.5177805447890964
R2 Test:  0.31356193226607143


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423]
R2 Training:  0.5188643112488446
R2 Test:  0.3221939205705174


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423 34.72 ]
R2 Training:  0.5222131839409367
R2 Test:  0.3297796101500052


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423 34.72
 36.955]
R2 Training:  0.522214891417694
R2 Test:  0.34802155296087545


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423 34.72
 36.955 39.104]
R2 Training:  0.5269769797140098
R2 Test:  0.3510538623282665


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423 34.72
 36.955 39.104 41.136]
R2 Training:  0.5291994661367495
R2 Test:  0.35810658670212764


Cumulative explained variance: [ 7.892 12.567 16.592 19.512 22.283 24.956 27.541 30.062 32.423 34.72
 36.955 39.104 41.136 43.052]
R2 Training:  0.532610423859645
R2 Test:  0.3581975102560113

Conclusion¶

Using principal component analysis we've managed to create a model that so far generalizes the best to our bike share data. While we might not have matched best known solutions, we've been able to explore regularization and add/deal with high dimensional features for linear models. By introducing high dimensional features and interaction terms we can often extract more information from our data that will be useful for our models. In this case, using PCA helped fix our issue with overfitting our model when standard regularization techniques (lasso and ridge l1 and l2 regularization) did not.

Forecasting Bike Sharing Usage: Part 2

Regularization, High Dimensionality, PCA¶

Regularization¶

Polynomial and Interaction Terms¶

PCA¶

Conclusion¶