Extract from Kaggle: Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Goal: predict the total number of bikes rented in each hour from the 20th day of each month to the end of the month, given data from the preceding 19 days.
We begin by importing the necessary modules and reading the data.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import datetime
import calendar
import math
sns.set_style('darkgrid')
df = pd.read_csv("train.csv")
df.head()
df.dtypes
Notice how datetime is an object. Let's first convert it to a datetime64 data type so we can split it into weekday, hour, month and year.
df['datetime'] = pd.to_datetime(df['datetime'])
df.dtypes
df_time = pd.DataFrame({'year': df['datetime'].dt.year,
                        'month': df['datetime'].dt.month,
                        'day': df['datetime'].dt.weekday,
                        'hour': df['datetime'].dt.hour})
# 0 - Monday, 6 - Sunday
df = pd.concat([df['datetime'],df_time,df[['season','holiday','workingday','weather','temp','atemp','humidity','windspeed','casual','registered','count']]], axis=1)
df.head()
We will also map some of the categorical features, which helps in visualisation.
df['season'] = df.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df['day'] = df.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
Next, we categorise the variables.
categorisedVars = ['day','hour','month','year','season','holiday','workingday','weather']
for var in categorisedVars:
    df[var] = df[var].astype('category')
Now we look at a summary of each feature.
df.describe()
Since the count row in the summary is the same for every feature, we can conclude there are no missing values. The missingno module gives a visual confirmation of this.
msno.matrix(df,figsize=(12,5))
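As a quick cross-check (a minimal sketch using only pandas built-ins on the df loaded above), we can also count null entries directly:
# Number of missing values per column; all zeros confirms a complete dataset
df.isnull().sum()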
Since we are predicting a continuous variable, it is useful for us to check for outliers.
fig,axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12,10)
# Plots
sns.boxplot(data = df, y = 'count', orient='v', ax = axes[0][0])
sns.boxplot(data = df, y = 'count', x = 'hour', ax = axes[0][1])
sns.boxplot(data = df, y = 'count', x = 'workingday', ax = axes[1][0])
sns.boxplot(data = df, y = 'count', x = 'season', ax = axes[1][1])
# Labelling
axes[0][0].set(ylabel='Count',title='Overall')
axes[0][1].set(xlabel='Hour', ylabel='Count',title='Across Hours')
axes[1][0].set(xlabel='Working Day', ylabel='Count',title='Across Working Days')
axes[1][1].set(xlabel='Season', ylabel='Count',title='Across Seasons')
We make a number of observations: the count varies strongly with the hour of the day, working days and weekends show different patterns, and demand differs across seasons. All of this helps to explain the high number of outliers above the upper whisker (beyond Q3 + 1.5 × IQR) in the overall count.
Let's take a closer look at the overall count.
sns.distplot(df['count'])
We see that the distribution is heavily right-skewed (a long tail of high counts). Let's apply a log transformation to bring it closer to a normal (bell-shaped) distribution.
sns.distplot(df['count'].apply(np.log))
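To quantify the improvement rather than judge it by eye, we can compare the skewness before and after the transform (a small sketch; values near 0 indicate symmetry, positive values a right tail):
# Skewness of the raw counts vs the log-transformed counts
print(df['count'].skew())
print(df['count'].apply(np.log).skew())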
Following the usual three-sigma rule of thumb, let us also remove entries that lie more than 3 standard deviations from the mean.
df_withoutoutliers = df[np.abs(df["count"]-df["count"].mean())<=(3*df["count"].std())]
df.shape
df_withoutoutliers.shape
sns.distplot(df_withoutoutliers['count'].apply(np.log))
colormap = plt.cm.RdBu
fig,ax = plt.subplots()
fig.set_size_inches(18,10)
sns.heatmap(df.corr(),cmap=colormap, annot = True)
plt.show()
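To read the same information off numerically, we can rank the numeric features by their correlation with count (a minimal sketch; on newer pandas versions pass numeric_only=True to corr):
# Pearson correlation of each numeric feature with count, strongest first
df.corr()['count'].sort_values(ascending=False)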
We observe that, among the numeric features, temp and humidity show the clearest relationships with count (positive and negative respectively), windspeed is only weakly correlated, and atemp is almost perfectly correlated with temp. We will look into these features in greater detail.
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3)
fig.set_size_inches(12,5)
sns.regplot(x = 'temp', y = 'count', data = df, ax = ax1)
sns.regplot(x = 'humidity', y = 'count', data = df, ax = ax2)
sns.regplot(x = 'windspeed', y = 'count', data = df, ax = ax3)
We will also compare average count across months:
fig,ax = plt.subplots()
fig.set_size_inches(12,10)
byMonth = df.groupby('month')['count'].mean().reset_index()
sns.barplot(data=byMonth, x='month', y='count')
ax.set(xlabel='Month', ylabel='Average Count', title="Average Count By Month")
Next, we compare non-registered and registered rentals.
df1 = df.groupby('weather')['casual'].sum()
df1 = df1.reset_index()
df2 = df.groupby('weather')['registered'].sum()
df2 = df2.reset_index()
# Stacked bar charts are easier to build in matplotlib than in seaborn (albeit less aesthetic)
width = 0.4
p1 = plt.bar(df1['weather'],df1['casual'],width,color='r')
p2 = plt.bar(df2['weather'],df2['registered'],width,bottom=df1['casual'])
plt.xticks(df1['weather'],['Clear','Mist','Light','Heavy'])
plt.xlabel('Weather')
plt.ylabel('Count')
plt.legend(['Casual','Registered'])
plt.show()
There are more registered users than casual users. Most of the bikes were rented in Clear weather, and hardly any bikes were rented during Heavy weather.
fig,(ax1,ax2) = plt.subplots(nrows=2)
fig.set_size_inches(12,20)
dayLabels = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
byDays = df.groupby(['hour','day'], sort=True)['count'].mean().reset_index()
sns.pointplot(data=byDays, x='hour', y='count', hue='day', hue_order=dayLabels, ax=ax1)
ax1.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Days")
bySeason = df.groupby(['hour','season'], sort=True)['count'].mean().reset_index()
sns.pointplot(data=bySeason, x='hour', y='count', hue='season', ax=ax2)
ax2.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Seasons")
We observe the usual commuting pattern: on weekdays the average count peaks around the morning and evening rush hours, while on weekends demand is spread across the afternoon; the seasonal curves share the same shape but differ in level. Based on the above, we will remove the following features: datetime (already split into its components), atemp (nearly collinear with temp), windspeed (only weakly related to count), and casual and registered (they sum to count and are not available in the test data).
df = df.drop(['datetime','atemp','windspeed','casual','registered'],axis=1)
We will try a few different methods and assess them by their root mean squared logarithmic error (RMSLE), the competition metric (lower is better).
We will need to create dummy variables for our categorical variables. These are:
dummies = ['day','hour','month','year','season']
final_df = pd.get_dummies(df, columns = dummies)
final_df.columns
We repeat on the test data the preprocessing steps we applied to the training data.
df_test = pd.read_csv("test.csv")
df_test['datetime'] = pd.to_datetime(df_test['datetime'])
df_time = pd.DataFrame({'year':df_test['datetime'].dt.year,'month':df_test['datetime'].dt.month,'day':df_test['datetime'].dt.weekday,'hour':df_test['datetime'].dt.hour})
df_test = pd.concat([df_time,df_test[['season','holiday','workingday','weather','temp','humidity']]], axis=1)
df_test['season'] = df_test.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df_test['day'] = df_test.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
df_test = pd.get_dummies(df_test, columns = dummies)
df_test.head()
df_test.columns
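One caveat worth guarding against: if the test set lacked a category level seen in training, get_dummies would produce fewer columns and the two feature matrices would misalign. A defensive sketch (assuming final_df and df_test as built above):
# Reindex the test columns against the training design matrix, filling any
# dummy column absent from the test set with zeros
train_cols = final_df.drop(['count'], axis=1).columns
df_test = df_test.reindex(columns=train_cols, fill_value=0)
This also enforces the same column order, which matters because scikit-learn matches features by position, not by name.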
from sklearn.linear_model import LinearRegression
lmodel = LinearRegression()
lmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
To evaluate our models we use the root mean squared logarithmic error, RMSLE = sqrt((1/n) * sum_i (log(p_i + 1) - log(a_i + 1))^2), where p_i is the predicted count and a_i the actual count. Note that we evaluate on the training set here, so these values will be optimistic compared with the leaderboard scores.
def rmsle(y1, y2):
    log1 = np.array([np.log(v + 1) for v in y1])
    log2 = np.array([np.log(v + 1) for v in y2])
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))
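A quick sanity check of the metric (identical predictions should score exactly zero):
print(rmsle([1, 10, 100], [1, 10, 100]))  # 0.0
print(rmsle([1, 10, 100], [2, 20, 200]))  # > 0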
preds = lmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Linear Regression: ",rmsle(final_df['count'],np.exp(preds)))
Predicting for test data:
preds_sub = lmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_lr.csv',index = False)
The submission scored 0.66405.
from sklearn.linear_model import Ridge
for alpha in [0.001, 0.01, 0.1, 1.0, 10]:
    rmodel = Ridge(alpha)
    rmodel.fit(final_df.drop(['count'],axis=1), final_df['count'].apply(np.log))
    preds = rmodel.predict(final_df.drop(['count'],axis=1))
    print("RMSLE Value For Ridge Regression for alpha {:f}: {:f}".format(alpha, rmsle(final_df['count'], np.exp(preds))))
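Since the loop above scores on the training data, the alpha it favours may not generalise. A cross-validated alternative (a sketch using scikit-learn's RidgeCV, not part of the original pipeline):
from sklearn.linear_model import RidgeCV
# RidgeCV selects alpha by cross-validation on the training set
rcv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10])
rcv.fit(final_df.drop(['count'], axis=1), final_df['count'].apply(np.log))
print(rcv.alpha_)
Below we keep alpha = 1 for the submission.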
rmodel = Ridge(alpha = 1)
rmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = rmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_rr.csv',index = False)
This submission scored 0.62598.
from sklearn.linear_model import Lasso
for alpha in [0.001, 0.01, 0.1, 1.0, 10]:
    lasmodel = Lasso(alpha=alpha)  # use the loop's alpha, not a fixed value
    lasmodel.fit(final_df.drop(['count'],axis=1), final_df['count'].apply(np.log))
    preds = lasmodel.predict(final_df.drop(['count'],axis=1))
    print("RMSLE Value For Lasso Regression for alpha {:f}: {:f}".format(alpha, rmsle(final_df['count'], np.exp(preds))))
lasmodel = Lasso(alpha = 1)
lasmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = lasmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_lasr.csv',index = False)
This submission scored 0.63225.
from sklearn.ensemble import RandomForestRegressor
rfmodel = RandomForestRegressor(n_estimators=100)
rfmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds = rfmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Random Forest: ",rmsle(final_df['count'],np.exp(preds)))
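A random forest can fit its training data almost perfectly, so the training RMSLE above understates the true error (compare it with the leaderboard score below). A hold-out split gives a more honest estimate (a sketch, not part of the original pipeline; a time-aware split would be even better for this data):
from sklearn.model_selection import train_test_split
X = final_df.drop(['count'], axis=1)
y = final_df['count'].apply(np.log)
# Hold out 20% of the rows and score the forest on data it has not seen
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
rf_val = RandomForestRegressor(n_estimators=100).fit(X_tr, y_tr)
print(rmsle(np.exp(y_val), np.exp(rf_val.predict(X_val))))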
feature_importances = pd.DataFrame(rfmodel.feature_importances_,
index = final_df.drop(['count'],axis=1).columns,
columns=['importance']).sort_values('importance',ascending=False)
feature_importances
Predicting for test data:
preds_sub = rfmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_rf.csv',index = False)
This submission scored 0.41089.
# XGBoost (at least in older versions) does not accept the pandas category
# dtype, so replace each remaining category column with its integer codes
cat_columns = final_df.select_dtypes(['category']).columns
final_df[cat_columns] = final_df[cat_columns].apply(lambda x: x.cat.codes)
Note one subtlety: this maps weather's values 1-4 to codes 0-3, while the test set still uses 1-4, so strictly the test column should be shifted down by one to match.
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective='reg:linear',  # older alias of 'reg:squarederror'
                          colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=100)
xg_reg.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds = xg_reg.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Boosting: ",rmsle(final_df['count'],np.exp(preds)))
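As with the random forest, the training RMSLE is optimistic. A cross-validated estimate for the boosted model (a sketch; this computes RMSE on the log target, which is close to, though not identical to, RMSLE):
from sklearn.model_selection import cross_val_score
X = final_df.drop(['count'], axis=1)
y = final_df['count'].apply(np.log)
scores = cross_val_score(xg_reg, X, y, scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-scores.mean()))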
Predicting for test data:
preds_sub = xg_reg.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_boost.csv',index = False)