Forecasting Bike Sharing Usage

Multiple Linear Regression, Subset Selection, Cross Validation

In [45]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
%matplotlib inline

Forecasting Bike Sharing Usage

Each row of the dataset describes one day and contains the following attributes:

  • season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
  • month (1 through 12, with 1 denoting Jan)
  • holiday (1 = the day is a holiday, 0 = otherwise)
  • day_of_week (0 through 6, with 0 denoting Sunday)
  • workingday (1 = the day is neither a holiday nor a weekend, 0 = otherwise)
  • weather
    • 1: Clear, Few clouds, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp (temperature in Celsius)
  • atemp (apparent temperature, or relative outdoor temperature, in Celsius)
  • humidity (relative humidity)
  • windspeed (wind speed)

The last column, 'count', contains the response variable: the total number of bike rentals on the day.

Data Exploration & Preprocessing

As a first step, we identify important characteristics of the data, using visualizations where helpful. Questions worth asking include (but are not limited to):

  • How does the number of bike rentals vary between weekdays and weekends?
  • How about bike rentals on holidays?
  • What effect does the season have on the bike rentals on a given day?
  • Is the number of bike rentals lower than average when there is rain or snow?
  • How does temperature affect bike rentals?
  • Do any of the numeric attributes have a clear non-linear dependence with the number of bike rentals?
In [46]:
df_train = pd.read_csv('data/Bikeshare_train.csv', index_col = 0)
df_test = pd.read_csv('data/Bikeshare_test.csv', index_col = 0)


df_train.head()
Out[46]:
season month holiday day_of_week workingday weather temp atemp humidity windspeed count
0 2.0 5.0 0.0 2.0 1.0 2.0 24.0 26.0 76.5833 0.118167 6073.0
1 4.0 12.0 0.0 2.0 1.0 1.0 15.0 19.0 73.3750 0.174129 6606.0
2 2.0 6.0 0.0 4.0 1.0 1.0 26.0 28.0 56.9583 0.253733 7363.0
3 4.0 12.0 0.0 0.0 0.0 1.0 0.0 4.0 58.6250 0.169779 2431.0
4 3.0 9.0 0.0 3.0 1.0 3.0 23.0 23.0 91.7083 0.097021 1996.0
In [47]:
df_train.describe()
Out[47]:
season month holiday day_of_week workingday weather temp atemp humidity windspeed count
count 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000 331.000000
mean 2.561934 6.640483 0.033233 2.854985 0.670695 1.389728 17.018127 19.543807 63.385776 0.190833 4598.447130
std 1.094726 3.353974 0.179515 2.048680 0.470672 0.546962 11.192515 9.930991 14.334789 0.078240 1935.319338
min 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 -11.000000 -6.000000 25.416700 0.022392 431.000000
25% 2.000000 4.000000 0.000000 1.000000 0.000000 1.000000 7.500000 11.000000 52.702900 0.133083 3370.000000
50% 3.000000 7.000000 0.000000 3.000000 1.000000 1.000000 18.000000 21.000000 63.291700 0.178479 4648.000000
75% 4.000000 9.500000 0.000000 5.000000 1.000000 2.000000 26.000000 27.000000 73.500000 0.235380 5981.000000
max 4.000000 12.000000 1.000000 6.000000 1.000000 3.000000 38.000000 39.000000 97.250000 0.421642 8714.000000
In [48]:
f, axes = plt.subplots(6, 2, figsize = (10, 20))
plt.subplots_adjust(hspace = .3)
i, j = 0, 0
# plot a histogram of every column, two panels per row
for col in df_train.columns:
    axes[i, j].hist(df_train[col])
    axes[i, j].set_title(col)
    axes[i, j].spines['top'].set_visible(False)
    axes[i, j].spines['right'].set_visible(False)
    j +=1
    if j == 2:
        j = 0 
        i +=1
        
# hide the unused twelfth subplot (there are only eleven columns)
axes[i, j].spines['top'].set_visible(False)
axes[i, j].spines['right'].set_visible(False)
axes[i, j].spines['left'].set_visible(False)
axes[i, j].spines['bottom'].set_visible(False) 
axes[i, j].tick_params(axis = 'both', which = 'both', bottom = False, left = False,
                      labelbottom = False, labelleft = False)
In [49]:
count_means_by_day = df_train.groupby('day_of_week')['count'].mean().tolist()
weekend_count_mean = (count_means_by_day[0] + count_means_by_day[-1])/2  # Sunday (0) and Saturday (6)
weekday_count_mean = np.mean(count_means_by_day[1:6])
weekend_count_mean, weekday_count_mean
count_means_by_holiday = df_train.groupby('holiday')['count'].mean()
count_means_by_season = df_train.groupby('season')['count'].mean()
count_means_by_weather = df_train.groupby('weather')['count'].mean()
count_means_by_temp = df_train.groupby('temp')['count'].mean()
count_means_by_windspeed = df_train.groupby('windspeed')['count'].mean().sort_values()
In [50]:
plt.scatter(count_means_by_windspeed.index, count_means_by_windspeed.values)
plt.show()


df = df_train.copy()
# True for weekends (Sunday = 0, Saturday = 6), False otherwise
df['is_weekend'] = df.day_of_week.isin([0, 6])


colors = ['#f2c18c', '#c65503']
for g, group in df.groupby('is_weekend'):  # False (weekday) first, then True (weekend)
    plt.hist(group['count'], color = colors.pop())
plt.legend(['Weekday', 'Weekend'])
Out[50]:
<matplotlib.legend.Legend at 0x111d38438>
  • Wednesdays have the highest average rentals
  • Saturdays have the lowest average rentals
  • Weekends have lower average rentals than weekdays
  • Holidays have lower average rentals than non-holidays
  • Spring has the lowest average number of rentals (2539.7) and fall has the highest (5680.73)
  • Light snow/rain heavily depresses rentals: such days see less than half (approximately 43%) as many rentals as days with other weather conditions
  • Low temperatures depress rentals more than high temperatures; hot days still see healthy rental levels
  • Humidity is not linearly related to rental count
  • Windspeed is not linearly related to rental count (a quick numeric check of these last two claims follows below)
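
Pearson correlation only captures linear association, so near-zero values for humidity and windspeed would be consistent with the non-linear patterns in the plots above. A minimal sketch:

# correlation of each continuous attribute with the rental count
print(df_train[['temp', 'atemp', 'humidity', 'windspeed']].corrwith(df_train['count']))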

Treating Categorical Features


We cannot use the categorically coded attributes directly as predictors, because their numeric codes do not represent any meaningful 'distance' between categories. Instead, we encode each category as its own binary (dummy) variable. We also drop one dummy level per categorical variable to serve as the reference category; otherwise the dummies for a variable would sum to one and be perfectly collinear with the intercept.

In [51]:
df_train = pd.get_dummies(df_train, columns = ['weather', 'season', 'day_of_week', 'month'])
df_test = pd.get_dummies(df_test, columns = ['weather', 'season', 'day_of_week', 'month'])
# drop one level of each categorical as the reference category
df_train = df_train.drop(['weather_3.0', 'month_12.0', 'season_4.0', 'day_of_week_6.0'], axis = 1)
df_test = df_test.drop(['weather_3.0', 'month_12.0', 'season_4.0', 'day_of_week_6.0'], axis = 1)
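
As an aside, pd.get_dummies can choose the reference levels for you: passing drop_first=True drops the first level of each encoded variable. A minimal sketch on a fresh copy of the raw data (note it picks different reference levels than the explicit drops above):

# equivalent shortcut: pandas drops the first level of each categorical itself
df_demo = pd.read_csv('data/Bikeshare_train.csv', index_col = 0)
df_demo = pd.get_dummies(df_demo, columns = ['weather', 'season', 'day_of_week', 'month'],
                         drop_first = True)
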
In [52]:
df_train_cont = df_train[['windspeed', 'atemp', 'temp', 'humidity']]

# standardize the continuous predictors to zero mean and unit variance
df_train[['windspeed', 'atemp', 'temp', 'humidity']] = (df_train_cont - df_train_cont.mean())/df_train_cont.std()

# the test set is standardized with the *training* means and standard deviations,
# since the test data must not inform any preprocessing
df_test[['windspeed', 'atemp', 'temp', 'humidity']] = (df_test[['windspeed', 'atemp', 'temp', 'humidity']] - df_train_cont.mean())/df_train_cont.std()
In [53]:
df_train.to_csv('bike_train_df.csv')
df_test.to_csv('bike_test_df.csv')


df_train.head()
Out[53]:
holiday workingday temp atemp humidity windspeed count weather_1.0 weather_2.0 season_1.0 ... month_2.0 month_3.0 month_4.0 month_5.0 month_6.0 month_7.0 month_8.0 month_9.0 month_10.0 month_11.0
0 0.0 1.0 0.623798 0.650106 0.920664 -0.928758 6073.0 0 1 0 ... 0 0 0 1 0 0 0 0 0 0
1 0.0 1.0 -0.180310 -0.054759 0.696852 -0.213502 6606.0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0.0 1.0 0.802489 0.851495 -0.448383 0.803926 7363.0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
3 0.0 0.0 -1.520492 -1.565182 -0.332113 -0.269099 2431.0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.0 1.0 0.534453 0.348021 1.975789 -1.199027 1996.0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 29 columns

Multiple Linear Regression

In [54]:
Xtrain, ytrain = df_train.drop('count', axis = 1), df_train['count'].values.reshape(-1, 1)
Xtest, ytest = df_test.drop('count', axis = 1), df_test['count'].values.reshape(-1, 1)



mult_reg = sm.OLS(ytrain, sm.add_constant(Xtrain)).fit()
ytest_pred = mult_reg.predict(sm.add_constant(Xtest))
ytrain_pred = mult_reg.predict(sm.add_constant(Xtrain))


print("Train R2 score:", r2_score(ytrain, mult_reg.predict(sm.add_constant(Xtrain))))
print("Test R2 score:", r2_score(ytest, ytest_pred))

print(mult_reg.summary())
Train R2 score: 0.5761281783129482
Test R2 score: 0.2493421114652754
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.576
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     15.25
Date:                Thu, 16 Aug 2018   Prob (F-statistic):           6.56e-42
Time:                        14:41:53   Log-Likelihood:                -2832.1
No. Observations:                 331   AIC:                             5720.
Df Residuals:                     303   BIC:                             5827.
Df Model:                          27                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            3672.2940    664.433      5.527      0.000    2364.807    4979.781
holiday          -616.6027    405.068     -1.522      0.129   -1413.706     180.500
workingday        -24.0933    174.143     -0.138      0.890    -366.777     318.590
temp              925.7338    474.536      1.951      0.052      -8.070    1859.538
atemp             312.4341    429.987      0.727      0.468    -533.705    1158.573
humidity         -548.4929    113.200     -4.845      0.000    -771.251    -325.735
windspeed        -255.1226     80.766     -3.159      0.002    -414.057     -96.189
weather_1.0      1581.9783    529.223      2.989      0.003     540.560    2623.396
weather_2.0      1565.4117    478.500      3.271      0.001     623.807    2507.016
season_1.0      -1226.1865    506.763     -2.420      0.016   -2223.407    -228.966
season_2.0       -327.3575    573.373     -0.571      0.568   -1455.654     800.939
season_3.0       -193.3050    449.171     -0.430      0.667   -1077.194     690.584
day_of_week_0.0  -465.1450    269.154     -1.728      0.085    -994.794      64.504
day_of_week_1.0  -256.6501    172.675     -1.486      0.138    -596.445      83.145
day_of_week_2.0  -328.1845    204.810     -1.602      0.110    -731.214      74.845
day_of_week_3.0    37.6128    214.916      0.175      0.861    -385.304     460.530
day_of_week_4.0   -71.6425    208.337     -0.344      0.731    -481.612     338.327
day_of_week_5.0   -21.8317    200.742     -0.109      0.913    -416.858     373.194
month_1.0         118.8358    505.353      0.235      0.814    -875.611    1113.282
month_2.0         207.7759    516.216      0.402      0.688    -808.047    1223.599
month_3.0         358.0167    511.391      0.700      0.484    -648.310    1364.344
month_4.0         452.1849    657.792      0.687      0.492    -842.234    1746.604
month_5.0          53.0233    700.991      0.076      0.940   -1326.403    1432.450
month_6.0        -673.4271    696.142     -0.967      0.334   -2043.313     696.458
month_7.0       -1161.1512    701.261     -1.656      0.099   -2541.109     218.806
month_8.0        -657.6397    684.628     -0.961      0.338   -2004.868     689.588
month_9.0         523.9804    548.284      0.956      0.340    -554.945    1602.906
month_10.0        605.0867    439.844      1.376      0.170    -260.449    1470.623
month_11.0        231.5175    413.966      0.559      0.576    -583.094    1046.129
==============================================================================
Omnibus:                       28.947   Durbin-Watson:                   1.912
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                9.753
Skew:                           0.054   Prob(JB):                      0.00762
Kurtosis:                       2.166   Cond. No.                     2.08e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.89e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
  • From the regression output there is not enough evidence to reject the null hypothesis that the day-of-week coefficients are all zero; we therefore have no evidence of a relationship between day of the week and bike rentals (see the joint F-test sketch below)

  • The same holds for month

  • There is no evidence that holiday affects the number of bike rentals

  • The temp variable is assigned a larger coefficient than atemp. Since temp and atemp take similar values, this suggests that temp contributes more to the predicted outcome in this model. The p-values tell the same story: there is sufficient evidence of a relationship between temp and rental counts, but not between atemp and rental counts.
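
We can back up the day-of-week claim with a joint F-test that all six day_of_week coefficients are simultaneously zero. A minimal sketch, reusing the fitted mult_reg from above (the restriction matrix R selects one row per day-of-week dummy):

# joint F-test: H0 is that every day_of_week coefficient equals zero
exog_names = mult_reg.model.exog_names  # includes 'const'
dow_idx = [i for i, name in enumerate(exog_names) if name.startswith('day_of_week')]

R = np.zeros((len(dow_idx), len(exog_names)))
for row, col in enumerate(dow_idx):
    R[row, col] = 1.0   # one restriction per day-of-week dummy

print(mult_reg.f_test(R))  # a large p-value means we cannot reject H0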

In [55]:
# residuals vs. fitted values; points should scatter evenly around zero
plt.scatter(ytrain_pred, ytrain.ravel() - ytrain_pred)
plt.axhline(y = 0, color = 'red')
Out[55]:
<matplotlib.lines.Line2D at 0x112806390>
In [56]:
# histogram of the residuals; roughly symmetric around zero
plt.hist(ytrain_pred - ytrain.ravel())
Out[56]:
(array([ 4., 21., 40., 57., 35., 45., 69., 40., 14.,  6.]),
 array([-3076.47515852, -2464.02663933, -1851.57812015, -1239.12960096,
         -626.68108178,   -14.23256259,   598.21595659,  1210.66447578,
         1823.11299496,  2435.56151415,  3048.01003333]),
 <a list of 10 Patch objects>)

Checking Collinearity

In [57]:
pd.plotting.scatter_matrix(df_train_cont, figsize = (10,10))
plt.show()
corr = df_train.drop('count', axis = 1).corr()

plt.figure(figsize=(10,10))
sns.heatmap(corr)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x113bebfd0>
  • temp and atemp exhibit collinearity
  • season, month, and temp are all collinear to some degree
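
A more quantitative check is the variance inflation factor (VIF), which measures how much each coefficient's variance is inflated by that predictor's correlation with the others. A minimal sketch over the continuous predictors (values above roughly 5-10 are commonly read as problematic):

# VIF for each continuous predictor (the constant is excluded from the report)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(df_train_cont)
for i, name in enumerate(X_vif.columns):
    if name != 'const':
        print(name, variance_inflation_factor(X_vif.values, i))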

Subset Selection

We need a way of selecting only the significant features to use in our model. Two useful methods are forward and backward stepwise selection, scored here with the Bayesian information criterion (BIC): at each step we add (or remove) the predictor whose model has the lowest BIC. These are not implemented in scikit-learn, so we write a quick function for each algorithm.

In [61]:
def forward_selection(df_train, ytrain):
    S = []           # predictors chosen so far
    results = {}     # best BIC found at each subset size -> predictor set
    for i in range(len(df_train.columns) - 1):
        model_results = {}
        # try adding each remaining predictor to the current set
        for column in df_train.drop(S + ['count'], axis = 1).columns:
            test_set = S + [str(column)]
            model = sm.OLS(ytrain, sm.add_constant(df_train[test_set])).fit()
            model_results[model.bic] = test_set
        minimum = min(model_results.keys())
        S = model_results[minimum]       # keep the best set of this size
        results[minimum] = S
    key = min(results.keys())            # best BIC across all subset sizes
    return results[key], key



def backward_selection(df_train, ytrain):
    dropped = ['count']    # columns excluded so far ('count' is the response)
    results = {}           # best BIC found at each subset size -> predictor set
    for i in range(len(df_train.columns) - 2):
        model_results = {}
        # try removing each remaining predictor from the current set
        for column in df_train.drop(dropped, axis = 1).columns:
            test_set = [x for x in df_train.columns if x not in dropped + [str(column)]]
            model = sm.OLS(ytrain, sm.add_constant(df_train[test_set])).fit()
            model_results[model.bic] = test_set
        minimum = min(model_results.keys())
        minimum_set = model_results[minimum]
        dropped = [x for x in df_train.columns if x not in minimum_set]
        results[minimum] = minimum_set
    key = min(results.keys())            # best BIC across all subset sizes
    return results[key], key


forward_set, backward_set = forward_selection(df_train, ytrain)[0], backward_selection(df_train, ytrain)[0]
forward_params, backward_params = df_train[forward_set], df_train[backward_set]
forward_df, backward_df = df_train[forward_set + ['count']], df_train[backward_set + ['count']]


forward_test_params, backward_test_params = df_test[forward_set], df_test[backward_set]
forward_test_df, backward_test_df = df_test[forward_set + ['count']], df_test[backward_set + ['count']]
forward_model = sm.OLS(ytrain, sm.add_constant(forward_params)).fit()
forward_pred = forward_model.predict(sm.add_constant(forward_test_params))
forward_model = sm.OLS(ytrain, sm.add_constant(forward_params)).fit()
forward_pred = forward_model.predict(sm.add_constant(forward_test_params))



print('Forward Selection')
print('Train R2:', r2_score(ytrain, forward_model.predict(sm.add_constant(forward_params))))
print('Test R2:', r2_score(ytest, forward_pred))
print('\nParameters:', forward_set)

backward_model = sm.OLS(ytrain, sm.add_constant(backward_params)).fit()
backward_pred = backward_model.predict(sm.add_constant(backward_test_params))

print('\nBackward Selection')
print('Train R2:',r2_score(ytrain, backward_model.predict(sm.add_constant(backward_params))))
print('Test R2:', r2_score(ytest, backward_pred))
print('\nParameters:', backward_set)
Forward Selection
Train R2: 0.5343288672357578
Test R2: 0.23390728606336675

Parameters: ['windspeed', 'season_1.0', 'month_10.0', 'month_9.0', 'atemp', 'humidity', 'month_7.0']

Backward Selection
Train R2: 0.5352873526529435
Test R2: 0.23580675962233155

Parameters: ['temp', 'humidity', 'windspeed', 'season_1.0', 'month_6.0', 'month_7.0', 'month_8.0']

Cross Validation

Next we use 10-fold cross-validation to validate our models, averaging the R2 score over the ten held-out folds.

In [62]:
from sklearn.model_selection import KFold


def fit_model_r2(X, ytrain, Xtest, ytest):
    model = sm.OLS(ytrain, sm.add_constant(X)).fit()
    results = model.predict(sm.add_constant(Xtest))
    return r2_score(ytest, results)

def cross_fold_10(model_df):
    r2 = []
    # ten train/test splits over the rows of model_df
    for train, test in KFold(n_splits = 10).split(model_df):
        Xytrain = model_df.iloc[train]
        Xytest = model_df.iloc[test]
        r2.append(fit_model_r2(Xytrain.drop('count', axis = 1), Xytrain['count'],
                               Xytest.drop('count', axis = 1), Xytest['count']))
    return np.mean(r2)
    
print('R2 Scores\n\nForward Cross Fold:', cross_fold_10(forward_df),
      '\nLinear Cross Fold:', cross_fold_10(df_train),
      '\nBackward Cross Fold:', cross_fold_10(backward_df))
R2 Scores

Forward Cross Fold: 0.4403650515142502 
Linear Cross Fold: 0.34257855093958434 
Backward Cross Fold: 0.44883775190033914

The caveat is that we used the training data twice: once to choose the predictors and again to fit and score the models, which biases these cross-validation scores upward. The held-out test set still gives an honest estimate of how the models generalize to new data. A cleaner alternative is to run the selection step inside each fold, sketched below.
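
A minimal sketch of that idea, reusing forward_selection and fit_model_r2 from above (slow, since the selection re-runs once per fold):

# nested selection: choose predictors on each fold's training rows only,
# so the fold's held-out rows never influence which columns are selected
def nested_cross_fold_10(model_df):
    r2 = []
    for train, test in KFold(n_splits = 10).split(model_df):
        fold_train, fold_test = model_df.iloc[train], model_df.iloc[test]
        cols, _ = forward_selection(fold_train, fold_train['count'])
        r2.append(fit_model_r2(fold_train[cols], fold_train['count'],
                               fold_test[cols], fold_test['count']))
    return np.mean(r2)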

What's Next?

In the next post we will try to improve the model's performance using regularization, which can help control the high variance we are seeing due to multicollinearity.