Predicting Taxi Pickups in NYC

Exploring k-NN, Linear, and Polynomial Regression

Problem Summary

For this project we will build regression models that predict the number of taxi pickups in New York City at any given time of day. These models could be useful, for example, for monitoring traffic in the city.

The data set for this problem is given in the files dataset_1_train.txt and dataset_1_test.txt as separate training and test sets. The first column in each file contains the time of day in minutes, and the second column contains the number of pickups observed at that time. The data set covers taxi pickups recorded on different days in January 2015 (randomly sampled across days and times of day).

In [12]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures

%matplotlib inline
In [13]:
dftrain = pd.read_table('dataset_1_train.txt', sep=',')
dftest = pd.read_table('dataset_1_test.txt', sep=',')
In [14]:
dftrain.head()
Out[14]:
   TimeMin  PickupCount
0    860.0         33.0
1     17.0         75.0
2    486.0         13.0
3    300.0          5.0
4    385.0         10.0
In [15]:
dftrain.describe()
Out[15]:
           TimeMin  PickupCount
count   250.000000   250.000000
mean    701.416000    32.780000
std     409.247928    18.910368
min       4.000000     1.000000
25%     381.500000    18.000000
50%     686.000000    32.000000
75%    1032.750000    44.000000
max    1438.000000    95.000000
In [16]:
# normalize the time of day from minutes to a fraction of the day (divide by 1440)

dftrain.TimeMin = dftrain.TimeMin / 1440
dftest.TimeMin = dftest.TimeMin / 1440

plt.scatter(dftrain.TimeMin, dftrain.PickupCount)
plt.xlabel('Time of day (fraction of day)')
plt.ylabel('Taxi pickup count')
Out[16]:
Text(0,0.5,'Taxi pickup count')

k-Nearest Neighbors

We will begin by fitting k-NN models for several values of k. For each value of k we will plot the model's predictions on the test set alongside the observed test values, and record the training and test R^2 scores.

In [17]:
Xtrain, ytrain = dftrain.TimeMin.values.reshape(-1, 1), dftrain.PickupCount.values
Xtest, ytest = dftest.TimeMin.values.reshape(-1, 1), dftest.PickupCount.values

training_R2 = []
test_R2 = []

# values of k (number of neighbors) to try
n_list = [1, 2, 10, 25, 50, 200]
for n in n_list:
    knn = KNeighborsRegressor(n_neighbors=n)
    knn.fit(Xtrain, ytrain)

    # plot predicted vs. observed pickup counts on the test set
    plt.figure(figsize=(8, 5))
    plt.scatter(Xtest, knn.predict(Xtest), label='predicted')
    plt.scatter(Xtest, ytest, color='grey', label='observed')
    plt.title('n = ' + str(n))
    plt.legend()
    plt.show()

    # record R^2 on both the training and test sets
    training_R2.append(knn.score(Xtrain, ytrain))
    test_R2.append(knn.score(Xtest, ytest))

    print('n = ' + str(n) + ' R^2 (train, test): ' + str(knn.score(Xtrain, ytrain)) + ', ' +
          str(knn.score(Xtest, ytest)))
n = 1 R^2 (train, test): 0.8108889086047287, -0.04560301563038216
n = 2 R^2 (train, test): 0.6454040692744734, 0.23298509885774188
n = 10 R^2 (train, test): 0.45770521849580365, 0.41724791407053397
n = 25 R^2 (train, test): 0.4194670658749883, 0.41365598420422983
n = 50 R^2 (train, test): 0.35325229524195634, 0.38047568461148934
n = 200 R^2 (train, test): 0.10863100988399976, 0.10953014692642915
In [18]:
plt.plot(n_list, training_R2)
plt.plot(n_list, test_R2, color='red')
plt.legend(["training", "test"])
plt.title('k-NN Regressor $R^2$ Scores')
plt.xlabel('n (number of neighbors)')
plt.show()

Simple Linear Regression

We next consider a simple linear regression model. We will fit the model on the training set and evaluate the training and test R^2 scores. We will then further analyze the model by computing confidence intervals for its coefficients.

In [26]:
Xtrain_col = sm.add_constant(Xtrain)
model = sm.OLS(ytrain, Xtrain_col)
results = model.fit()
train_predictions = results.predict(Xtrain_col)
test_predictions = results.predict(sm.add_constant(Xtest))

print(results.summary())
print(results.params)

print('\nR^2 scores (train, test):\n\n', r2_score(ytrain, train_predictions))
print(r2_score(ytest, test_predictions))

print('\nConfidence intervals (95%, then 99%):\n\n', results.conf_int(.05))
print(results.conf_int(.01))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.207
Model:                            OLS   Adj. R-squared:                  0.204
Method:                 Least Squares   F-statistic:                     64.82
Date:                Thu, 16 Aug 2018   Prob (F-statistic):           3.43e-14
Time:                        03:35:25   Log-Likelihood:                -1060.1
No. Observations:                 250   AIC:                             2124.
Df Residuals:                     248   BIC:                             2131.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         18.0264      2.121      8.501      0.000      13.850      22.203
x1            30.2890      3.762      8.051      0.000      22.879      37.699
==============================================================================
Omnibus:                       56.951   Durbin-Watson:                   1.960
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              101.977
Skew:                           1.202   Prob(JB):                     7.18e-23
Kurtosis:                       5.002   Cond. No.                         4.42
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[18.02638518 30.28902299]

R^2 scores (train, test):

 0.20721375209894033
0.24771232994848624

Confidence intervals (95%, then 99%):

 [[13.84986472 22.20290563]
 [22.879319   37.69872697]]
[[12.52194788 23.53082247]
 [20.52341753 40.05462845]]

The test R^2 score (about 0.25) is lower than the best R^2 obtained with k-NN regression, indicating that the linear model does worse at predicting pickup counts. The fitted model is roughly PickupCount ≈ 18.03 + 30.29 · TimeMin (with TimeMin expressed as a fraction of the day); the sign of the slope tells us whether pickups are positively or negatively associated with time of day, and here the association is positive. We would also expect a 99% confidence interval to be wider than a 95% interval: to contain the true slope and intercept with 99% confidence, the interval must cover a larger range of values, which is exactly what the output above shows.

Additionally, the assumption of linearity does not appear to hold for this data: the residuals are not randomly scattered around zero across the x-axis, but instead follow a clear pattern, as shown below.

In [28]:
plt.scatter(Xtest, ytest - test_predictions)
plt.xlabel('Time of day (fraction of day)')
plt.ylabel('Residual (observed - predicted)')
plt.axhline(y=0, color='red')
Out[28]:
<matplotlib.lines.Line2D at 0x10920b240>

Polynomial Regression

We proceed to higher-order polynomial models for regression. Because we have already normalized the time values to lie in [0, 1], we do not need to worry about numerical problems from raising large raw-minute values to high powers; the quick check below illustrates why.
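
As a quick illustration (not part of the original analysis), raising a raw time in minutes to a high power produces astronomically large feature values, while the normalized value stays well-behaved:

# quick check: magnitude of a degree-25 feature before and after normalization
raw_minutes = 860.0              # a time of day in raw minutes
normalized = raw_minutes / 1440  # the same time as a fraction of the day

print(raw_minutes ** 25)         # ~2.3e73 -- enormous, numerically unstable
print(normalized ** 25)          # ~2.5e-6 -- stays in a reasonable range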

In [30]:
fig, ax1 = plt.subplots(1, 1, figsize=(14, 10))
degrees = [2, 3, 10, 25, 50]
for n in degrees:
    # build polynomial features of degree n for both the train and test sets
    poly = PolynomialFeatures(degree=n)
    x_features = poly.fit_transform(Xtrain)
    x_test_features = poly.transform(Xtest)
    model = sm.GLS(ytrain, x_features)
    results = model.fit()
    pred = results.predict(x_features)
    pred_test = results.predict(x_test_features)
    print('degree = ' + str(n) + ' R^2 (train, test): ' +
          str(r2_score(ytrain, pred)) + ', ' + str(r2_score(ytest, pred_test)))
    ax1.scatter(Xtrain, pred, label='n = ' + str(n))

ax1.scatter(Xtrain, ytrain, label='observed pickups')
ax1.legend()
degree = 2 R^2 (train, test): 0.2324332710285808, 0.25572414216191597
degree = 3 R^2 (train, test): 0.3748362391177078, 0.3785843622481676
degree = 10 R^2 (train, test): 0.42827706862793247, 0.4020078006777249
degree = 25 R^2 (train, test): 0.46327387074517823, 0.38367365491434813
degree = 50 R^2 (train, test): 0.48916957393037275, 0.3332818519362445
Out[30]:
<matplotlib.legend.Legend at 0x108ea9b38>

Looking at the R^2 scores and the visualization above for different degrees n, we see that n = 10 appears to fit the data best without overfitting: it has the highest test R^2 (about 0.40). As n gets larger, the polynomial increasingly fits noise in the training data, so the training R^2 keeps rising while the test R^2 falls, and the model does not generalize well. A quick sketch of this trade-off is shown below.
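
As a side note (not run in the original notebook), the train/test R^2 comparison we plotted for the k-NN models could be repeated across polynomial degrees. A minimal sketch, assuming Xtrain, ytrain, Xtest, and ytest are defined as above:

# sketch: train vs. test R^2 as a function of polynomial degree
poly_train_R2, poly_test_R2 = [], []
degree_range = range(1, 26)
for d in degree_range:
    poly = PolynomialFeatures(degree=d)
    Xtr = poly.fit_transform(Xtrain)
    Xte = poly.transform(Xtest)
    res = sm.OLS(ytrain, Xtr).fit()
    poly_train_R2.append(r2_score(ytrain, res.predict(Xtr)))
    poly_test_R2.append(r2_score(ytest, res.predict(Xte)))

plt.plot(degree_range, poly_train_R2, label='training')
plt.plot(degree_range, poly_test_R2, color='red', label='test')
plt.xlabel('polynomial degree')
plt.ylabel('$R^2$')
plt.legend()
plt.show()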

Summarizing Results:

From the above, so far our best model is the k-nn model with n = 10. This model is simple to impliment and gives us the best prediction accuracy. This makes intuitive sense beacause the number of taxi pickups at a certain time of day is likely to be similar to the number of pickups in some time interval before and after the pickup time.
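
For reference, here is a minimal sketch (not part of the original notebook) of refitting the chosen model and predicting the pickup count at a given time of day, assuming the normalized Xtrain and ytrain defined above. For example, 6:00 PM corresponds to minute 1080:

# refit the chosen model (k-NN with 10 neighbors) on the full training set
best_knn = KNeighborsRegressor(n_neighbors=10)
best_knn.fit(Xtrain, ytrain)

# predict the pickup count at 6:00 PM (minute 1080), normalized the same way as the training data
time_of_day = np.array([[1080 / 1440]])
print(best_knn.predict(time_of_day))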

We will continue model selection in a future post!