import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
%matplotlib inline
Each row of the data describes one day through calendar and weather attributes, and the last column, 'count', contains the response variable: the total number of bike rentals on that day.
As a first step, we identify important characteristics of the data using suitable visualizations where necessary. Some of the questions we may ask include (but are not limited to): How do rentals differ between weekdays and weekends? Do holidays, seasons, or weather conditions matter? How do rentals vary with temperature, humidity, and windspeed?
df_train = pd.read_csv('data/Bikeshare_train.csv', index_col = 0)
df_test = pd.read_csv('data/Bikeshare_test.csv', index_col = 0)
df_train.head()
df_train.describe()
f, axes = plt.subplots(6, 2, figsize = (10, 20))
plt.subplots_adjust(hspace = .3)
i, j = 0, 0
for col in df_train.columns:
    axes[i, j].hist(df_train[col])
    axes[i, j].set_title(col)
    axes[i, j].spines['top'].set_visible(False)
    axes[i, j].spines['right'].set_visible(False)
    j += 1
    if j == 2:
        j = 0
        i += 1
# hide the unused final subplot
axes[i, j].spines['top'].set_visible(False)
axes[i, j].spines['right'].set_visible(False)
axes[i, j].spines['left'].set_visible(False)
axes[i, j].spines['bottom'].set_visible(False)
axes[i, j].tick_params(axis = 'both', which = 'both', bottom = False, left = False,
                       labelbottom = False, labelleft = False)
count_means_by_day = df_train.groupby('day_of_week')['count'].mean().tolist()
# days 0 and 6 are the weekend; days 1-5 are weekdays
weekend_count_mean = (count_means_by_day[0] + count_means_by_day[-1]) / 2
weekday_count_mean = np.mean(count_means_by_day[1:6])
weekend_count_mean, weekday_count_mean
count_means_by_holiday = df_train.groupby('holiday')['count'].mean()
count_means_by_season = df_train.groupby('season')['count'].mean()
count_means_by_weather = df_train.groupby('weather')['count'].mean()
count_means_by_temp = df_train.groupby('temp')['count'].mean()
windspeed_means = df_train.groupby('windspeed')['count'].mean()
plt.scatter(windspeed_means.index, windspeed_means.values)
plt.xlabel('windspeed')
plt.ylabel('mean count')
plt.show()
df = df_train.copy()
# flag weekend days (day_of_week 0 or 6)
df['is_weekend'] = df.day_of_week.isin([0, 6])
colors = ['#f2c18c', '#c65503']
for g, group in df.groupby('is_weekend'):
    # alpha so the overlapping bars stay visible
    plt.hist(group['count'], color = colors.pop(), alpha = 0.7)
plt.legend(['Weekday', 'Weekend'])
plt.show()
We cannot use numerically coded categorical attributes directly as predictors, because the numbers do not represent any meaningful 'distance' between categories. Instead, we treat each category as a separate binary case by creating dummy variables.
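As a minimal illustration (toy data, not from the bikeshare set), pd.get_dummies turns one coded column into a set of 0/1 indicator columns:
# toy example: a categorical code of 1, 2, or 3 becomes three indicator columns
toy = pd.DataFrame({'weather': [1, 2, 3, 1]})
print(pd.get_dummies(toy, columns = ['weather']))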
df_train = pd.get_dummies(df_train, columns = ['weather', 'season', 'day_of_week', 'month'])
df_test = pd.get_dummies(df_test, columns = ['weather', 'season', 'day_of_week', 'month'])
# drop one level of each categorical variable to serve as the baseline
# and avoid perfect collinearity with the constant term
df_train = df_train.drop(['weather_3.0', 'month_12.0', 'season_4.0', 'day_of_week_6.0'], axis = 1)
df_test = df_test.drop(['weather_3.0', 'month_12.0', 'season_4.0', 'day_of_week_6.0'], axis = 1)
cont_cols = ['windspeed', 'atemp', 'temp', 'humidity']
df_train_cont = df_train[cont_cols]
# standardize both sets with the *training* mean and std, so no test
# information leaks into the preprocessing
df_train[cont_cols] = (df_train_cont - df_train_cont.mean()) / df_train_cont.std()
df_test[cont_cols] = (df_test[cont_cols] - df_train_cont.mean()) / df_train_cont.std()
df_train.to_csv('bike_train_df.csv')
df_test.to_csv('bike_test_df.csv')
df_train.head()
# pandas Series have no .reshape; go through .values first
Xtrain, ytrain = df_train.drop('count', axis = 1), df_train['count'].values.reshape(-1, 1)
Xtest, ytest = df_test.drop('count', axis = 1), df_test['count'].values.reshape(-1, 1)
mult_reg = sm.OLS(ytrain, sm.add_constant(Xtrain)).fit()
ytest_pred = mult_reg.predict(sm.add_constant(Xtest))
ytrain_pred = mult_reg.predict(sm.add_constant(Xtrain))
print("Train R2 score:", r2_score(ytrain, mult_reg.predict(sm.add_constant(Xtrain))))
print("Test R2 score:", r2_score(ytest, ytest_pred))
print(mult_reg.summary())
From our analysis there is not enough evidence to reject the null hypothesis that day of the week does not predict the number of bike rentals; we therefore have no evidence of a relationship between day of the week and rentals.
The same is true for month.
There is also no evidence that holiday status affects the number of bike rentals.
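To see where these conclusions come from, we can pull the relevant p-values out of the fitted model (a quick check using the dummy-column names created above):
# inspect p-values for the day_of_week, month, and holiday terms
for prefix in ['day_of_week', 'month', 'holiday']:
    terms = [c for c in mult_reg.pvalues.index if c.startswith(prefix)]
    print(mult_reg.pvalues[terms])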
The temp variable is assigned a larger coefficient than atemp. Since temp and atemp take similar values, this suggests that, within the regression model, temp contributes more to the predicted outcome. Looking at the p-values, there is sufficient evidence of a relationship between temp and rental counts, but not of one between atemp and rental counts.
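As a quick check on how entangled the two temperature measures are, we can look at their correlation directly; a value close to 1 means the model has trouble attributing the effect to one or the other:
# correlation between the two (standardized) temperature measures
print(df_train[['temp', 'atemp']].corr())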
# residuals vs. fitted values, then the residual distribution
residuals = ytrain.ravel() - ytrain_pred.values
plt.scatter(ytrain_pred.values, residuals)
plt.axhline(y = 0, color = 'red')
plt.show()
plt.hist(residuals)
plt.show()
pd.plotting.scatter_matrix(df_train_cont, figsize = (10,10))
plt.show()
corr = df_train.drop('count', axis = 1).corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr)
We need a way of selecting only the significant features to use in our model. Two useful methods for doing this are forward and backward selection. These are not implemented in scikit-learn, so we can write a quick function for each algorithm.
def forward_selection(df_train, ytrain):
    S = []                      # predictors selected so far
    results = {}                # best BIC at each size -> predictor set
    for i in range(len(df_train.columns) - 1):
        model_results = {}
        # try adding each remaining predictor to the current set
        for column in df_train.drop(S + ['count'], axis = 1).columns:
            test_set = S + [str(column)]
            model = sm.OLS(ytrain, sm.add_constant(df_train[test_set])).fit()
            model_results[model.bic] = test_set
        # keep the addition that gives the lowest BIC at this step
        minimum = min(model_results.keys())
        S = model_results[minimum]
        results[minimum] = S
    # return the predictor set with the lowest BIC over all sizes
    key = min(results.keys())
    return results[key], key
def backward_selection(df_train, ytrain):
    dropped = ['count']         # columns removed so far (the response is never a predictor)
    results = {}                # best BIC at each size -> predictor set
    for i in range(len(df_train.columns) - 2):
        model_results = {}
        # try removing each remaining predictor from the current set
        for column in df_train.drop(dropped, axis = 1).columns:
            test_set = [x for x in df_train.columns if x not in dropped + [str(column)]]
            model = sm.OLS(ytrain, sm.add_constant(df_train[test_set])).fit()
            model_results[model.bic] = test_set
        # keep the removal that gives the lowest BIC at this step
        minimum = min(model_results.keys())
        minimum_set = model_results[minimum]
        dropped = [x for x in df_train.columns if x not in minimum_set]
        results[minimum] = minimum_set
    # return the predictor set with the lowest BIC over all sizes
    key = min(results.keys())
    return results[key], key
forward_set, backward_set = forward_selection(df_train, ytrain)[0], backward_selection(df_train, ytrain)[0]
forward_params, backward_params = df_train[forward_set], df_train[backward_set]
forward_df, backward_df = df_train[forward_set + ['count']], df_train[backward_set + ['count']]
forward_test_params, backward_test_params = df_test[forward_set], df_test[backward_set]
# note: the test frames must come from df_test, not df_train
forward_test_df, backward_test_df = df_test[forward_set + ['count']], df_test[backward_set + ['count']]
forward_model = sm.OLS(ytrain, sm.add_constant(forward_params)).fit()
forward_pred = forward_model.predict(sm.add_constant(forward_test_params))
print('Forward Selection')
print('Train R2:', r2_score(ytrain, forward_model.predict(sm.add_constant(forward_params))))
print('Test R2:', r2_score(ytest, forward_pred))
print('\nParameters:', forward_set)
backward_model = sm.OLS(ytrain, sm.add_constant(backward_params)).fit()
backward_pred = backward_model.predict(sm.add_constant(backward_test_params))
print('\nBackward Selection')
print('Train R2:',r2_score(ytrain, backward_model.predict(sm.add_constant(backward_params))))
print('Test R2:', r2_score(ytest, backward_pred))
print('\nParameters:', backward_set)
Next we will use k-fold cross-validation to validate our models, averaging each fitted model's R2 score across folds.
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in newer versions

def fit_model_r2(X, ytrain, Xtest, ytest):
    model = sm.OLS(ytrain, sm.add_constant(X)).fit()
    results = model.predict(sm.add_constant(Xtest))
    return r2_score(ytest, results)

def cross_fold_10(model_df):
    r2 = []
    for train, test in KFold(n_splits = 10).split(model_df):
        Xytrain = model_df.iloc[train]
        Xytest = model_df.iloc[test]
        r2.append(fit_model_r2(Xytrain.drop('count', axis = 1), Xytrain['count'],
                               Xytest.drop('count', axis = 1), Xytest['count']))
    return np.mean(r2)
print('R2 Scores\n\nForward Cross Fold:', cross_fold_10(forward_df),
'\nLinear Cross Fold:', cross_fold_10(df_train),
'\nBackward Cross Fold:', cross_fold_10(backward_df))
The issue with doing this is that we used the training data both to choose our predictors and to validate the resulting models, so the data has been used twice and the cross-validation scores are optimistically biased. We can still use the held-out test data to evaluate how the models generalize to new data.
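One way to avoid this double use of the data is to rerun the selection inside each training fold, so the held-out fold never influences which predictors are chosen. A sketch, reusing the functions defined above:
# sketch: forward selection repeated inside every fold
def cross_fold_10_with_selection(model_df):
    r2 = []
    for train, test in KFold(n_splits = 10).split(model_df):
        fold_train, fold_test = model_df.iloc[train], model_df.iloc[test]
        fold_set, _ = forward_selection(fold_train, fold_train['count'])
        r2.append(fit_model_r2(fold_train[fold_set], fold_train['count'],
                               fold_test[fold_set], fold_test['count']))
    return np.mean(r2)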
In the next post we will try to improve our models' performance using regularization, which can help with the high variance we are seeing due to multicollinearity.
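As a rough preview (a minimal sketch using the RidgeCV imported above, with a placeholder alpha grid; the details are for the next post), cross-validated ridge regression in scikit-learn looks like this:
# minimal sketch: ridge regression with the penalty strength chosen by cross-validation
ridge = RidgeCV(alphas = [0.01, 0.1, 1.0, 10.0]).fit(Xtrain, ytrain.ravel())
print('Chosen alpha:', ridge.alpha_)
print('Test R2:', ridge.score(Xtest, ytest.ravel()))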