For this project we will build regression models that predict the number of taxi pickups in New York City at any given time of day. Such models could be useful, for example, for monitoring traffic in the city.
The data set for this problem is given in the files dataset_1_train.txt
and dataset_1_test.txt
as separate training and test sets. The first column in each file contains the time of day in minutes, and the second column contains the number of pickups observed at that time. The data set covers taxi pickups recorded on different days in January 2015 (randomly sampled across days and times of day).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
%matplotlib inline
dftrain = pd.read_csv('dataset_1_train.txt')
dftest = pd.read_csv('dataset_1_test.txt')
dftrain.head()
dftrain.describe()
# Normalize time of day to [0, 1] (there are 1440 minutes in a day)
dftrain.TimeMin = dftrain.TimeMin / 1440
dftest.TimeMin = dftest.TimeMin / 1440
plt.scatter(dftrain.TimeMin, dftrain.PickupCount)
plt.xlabel('Time of day (fraction of a day)')
plt.ylabel('Taxi pickup count')
We will begin by fitting a k-NN model for several values of k. For each value of k, we will plot the model's predictions on the test set against the observed pickup counts.
Xtrain, ytrain = dftrain.TimeMin.values.reshape(-1, 1), dftrain.PickupCount.values
Xtest, ytest = dftest.TimeMin.values.reshape(-1, 1), dftest.PickupCount.values
training_R2 = []
test_R2 = []
k_list = [1, 2, 10, 25, 50, 200]
for k in k_list:
    # Fit a k-NN regressor and plot its test-set predictions against the observations
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(Xtrain, ytrain)
    plt.figure(figsize=(8, 5))
    plt.scatter(Xtest, knn.predict(Xtest), label='predicted')
    plt.scatter(Xtest, ytest, color='grey', label='observed')
    plt.title('k = ' + str(k))
    plt.legend()
    plt.show()
    training_R2.append(knn.score(Xtrain, ytrain))
    test_R2.append(knn.score(Xtest, ytest))
    print('R^2 scores (train, test): ' + str(knn.score(Xtrain, ytrain)) + ', ' +
          str(knn.score(Xtest, ytest)))
plt.plot(k_list, training_R2)
plt.plot(k_list, test_R2, color='red')
plt.legend(['training', 'test'])
plt.title('k-NN Regressor $R^2$ Scores')
plt.xlabel('k (number of neighbors)')
plt.show()
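Since the loop above stored the scores, a small follow-up sketch can read off the best k on the test set directly (this reuses the k_list and test_R2 variables defined above):
# Pick the k with the highest test-set R^2 from the stored scores
best_k = k_list[int(np.argmax(test_R2))]
print('Best k on the test set:', best_k)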
We next consider a simple linear regression model. We will fit the model on the training set, evaluate the training and test R2 scores, and then further analyze the fit by computing confidence intervals for the slope and intercept.
# Fit ordinary least squares with an intercept term
Xtrain_col = sm.add_constant(Xtrain)
model = sm.OLS(ytrain, Xtrain_col)
results = model.fit()
train_predictions = results.predict(Xtrain_col)
test_predictions = results.predict(sm.add_constant(Xtest))
print(results.summary())
print(results.params)
print('\nR2 scores (train, test):\n\n', r2_score(ytrain, train_predictions))
print(r2_score(ytest, test_predictions))
print('\n95% and 99% confidence intervals:\n\n', results.conf_int(alpha=0.05))
print(results.conf_int(alpha=0.01))
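To check the claim below about interval widths directly, a small sketch (reusing the results object just fitted) compares the two:
# The 99% interval should be wider than the 95% interval for each coefficient
ci95 = results.conf_int(alpha=0.05)  # rows: intercept, slope
ci99 = results.conf_int(alpha=0.01)
print('95% CI widths:', ci95[:, 1] - ci95[:, 0])
print('99% CI widths:', ci99[:, 1] - ci99[:, 0])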
The R2 score is lower than the value obtained with k-NN regression, indicating that the linear model does worse at predicting the outcomes. The sign of the slope indicates whether the data is positively or negatively correlated with time of day; in this case it is positively correlated. We would expect a 99% confidence interval to be wider than a 95% interval, because it must cover a larger range of plausible values to be 99% confident of capturing the true slope and intercept.
Additionally, the assumption of linearity does not appear to hold for this data: the residuals are not randomly scattered around the x-axis and instead follow a clear pattern, as shown below.
plt.scatter(Xtest, ytest - test_predictions)
plt.axhline(y=0, color='red')
plt.xlabel('Time of day (fraction of a day)')
plt.ylabel('Residual')
We proceed to higher-order polynomial regression models. Because we have already normalized the time values to [0, 1], we do not need to worry about numerical problems from raising large raw minute values to high powers, as the quick illustration below shows.
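As a quick illustration of why the normalization matters, compare the feature values produced by a polynomial transform of a raw minute value versus its normalized counterpart:
# A raw minute value explodes under high powers, while the normalized value stays in [0, 1]
demo = PolynomialFeatures(degree=5)
print(demo.fit_transform([[720]]))         # raw minutes: terms grow up to 720**5
print(demo.fit_transform([[720 / 1440]]))  # normalized: all terms stay bounded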
fig, ax1 = plt.subplots(1, 1, figsize=(14, 10))
degrees = [2, 3, 10, 25, 50]
for n in degrees:
    # Expand the time feature into polynomial terms of the given degree
    poly = PolynomialFeatures(degree=n)
    x_features = poly.fit_transform(Xtrain)
    x_test_features = poly.transform(Xtest)
    # GLS with no covariance argument reduces to ordinary least squares
    model = sm.GLS(ytrain, x_features)
    results = model.fit()
    pred = results.predict(x_features)
    pred_test = results.predict(x_test_features)
    print('n =', n, 'R^2 (train, test):', r2_score(ytrain, pred), r2_score(ytest, pred_test))
    ax1.scatter(Xtrain, pred, label='n = ' + str(n))
ax1.scatter(Xtrain, ytrain, label='observed pickups')
ax1.legend()
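To back the visual comparison with a number, the short sketch below repeats the fits and reports the degree with the highest test R2 (it reuses the degrees list, the data arrays, and the same GLS fit from the loop above):
# Re-fit each degree, record the test-set R^2, and report the best
poly_test_R2 = []
for n in degrees:
    poly = PolynomialFeatures(degree=n)
    res = sm.GLS(ytrain, poly.fit_transform(Xtrain)).fit()
    poly_test_R2.append(r2_score(ytest, res.predict(poly.transform(Xtest))))
print('Best polynomial degree on the test set:', degrees[int(np.argmax(poly_test_R2))])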
Looking at the R2 scores and the visualization above for different values of n, we see that n = 10 appears to fit the data best without overfitting. As n grows larger, the polynomial is increasingly affected by noise in the training data and thus does not generalize well.
From the above, our best model so far is the k-NN model with k = 10. This model is simple to implement and gives us the best prediction accuracy. This makes intuitive sense because the number of taxi pickups at a given time of day is likely to be similar to the number of pickups shortly before and after that time.
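As a closing sketch, the chosen model can be refit on its own so it is easy to reuse later (this simply repeats the k-NN fit from above with k = 10):
# Refit the selected model: k-NN regression with 10 neighbors
final_model = KNeighborsRegressor(n_neighbors=10)
final_model.fit(Xtrain, ytrain)
print('Final model test R^2:', final_model.score(Xtest, ytest))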
We will continue model selection in a future post!