Table of Contents

  • Introduction
  • Reading and preparing the data
    • Description of features
    • Feature Engineering
  • Exploratory Data Analysis
    • Outlier Analysis
    • Correlation Matrix
    • Univariate Plots
    • Bivariate Plots
    • Removing Features
  • Making Predictions
    • Preparing train and test data
    • Various Models
  • Conclusion

Introduction

Extract from Kaggle: Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

Goal: Predict the total number of bikes rented per each hour on the 20th day, given the data from the preceeding days.

Reading and preparing the data

We begin by importing the necessary modules and reading the data.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import datetime
import calendar
import math
sns.set_style('darkgrid')
In [2]:
df = pd.read_csv("train.csv")
df.head()
Out[2]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1

Description of each feature (12 in total)

  • datetime - hourly date + timestamp
  • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather -
  • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
  • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - "feels like" temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals

Feature Engineering

In [3]:
df.dtypes
Out[3]:
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object

Notice how datetime is an object. Let's first convert it to a datetime64 data type so we can split it into weekday, hour, month and year.

In [4]:
df['datetime'] = pd.to_datetime(df['datetime'])
df.dtypes
Out[4]:
datetime      datetime64[ns]
season                 int64
holiday                int64
workingday             int64
weather                int64
temp                 float64
atemp                float64
humidity               int64
windspeed            float64
casual                 int64
registered             int64
count                  int64
dtype: object
In [5]:
df_time = pd.DataFrame({'year':df['datetime'].dt.year,'month':df['datetime'].dt.month,'day':df['datetime'].dt.weekday,'hour':df['datetime'].dt.hour})
# 0 - Monday, 6 - Sunday
df = pd.concat([df['datetime'],df_time,df[['season','holiday','workingday','weather','temp','atemp','humidity','windspeed','casual','registered','count']]], axis=1)
df.head()
Out[5]:
datetime year month day hour season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 2011 1 5 0 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 2011 1 5 1 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 2011 1 5 2 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 2011 1 5 3 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 2011 1 5 4 1 0 0 1 9.84 14.395 75 0.0 0 1 1

We will also map some of the categorical features, which helps in visualisation.

In [6]:
df['season'] = df.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df['day'] = df.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})

Next, we categorise the variables.

In [7]:
categorisedVars = ['day','hour','month','year','season','holiday','workingday','weather']
for var in categorisedVars:
    df[var] = df[var].astype('category')

Now we look at a summary of each features.

In [8]:
df.describe()
Out[8]:
temp atemp humidity windspeed casual registered count
count 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000

Since the counts for each feature are equal, we can conclude there are no missing values. We can also use a useful module called missingno to verify this.

In [9]:
msno.matrix(df,figsize=(12,5))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c3399c588>

Exploratory Data Analysis

Outlier Analysis

Since we are predicting a continuous variable, it is useful for us to check for outliers.

In [10]:
fig,axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12,10)
# Plots
sns.boxplot(data = df, y = 'count', orient='v', ax = axes[0][0])
sns.boxplot(data = df, y = 'count', x = 'hour', ax = axes[0][1])
sns.boxplot(data = df, y = 'count', x = 'workingday', ax = axes[1][0])
sns.boxplot(data = df, y = 'count', x = 'season', ax = axes[1][1])
# Labelling
axes[0][0].set(ylabel='Count',title='Overall')
axes[0][1].set(xlabel='Hour', ylabel='Count',title='Across Hours')
axes[1][0].set(xlabel='Working Day', ylabel='Count',title='Across Working Days')
axes[1][1].set(xlabel='Season', ylabel='Count',title='Across Seasons')
Out[10]:
[Text(0, 0.5, 'Count'),
 Text(0.5, 0, 'Season'),
 Text(0.5, 1.0, 'Across Seasons')]

We make a number of observations:

  • The median count across hours is much higher for the 6-7th hour (6-7AM) and 17-19th hour (5-7PM). This could be attributed to students and employees travelling to-and-fro from school/work
  • Most of the outliers came from working days
  • The count in spring is noticeably lower than other seasons

All of the above helps to explain the the high number of outliers above the outer quartile limit in the overall count.

Let's take a closer look at the overall count.

In [11]:
sns.distplot(df['count'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c33f55ef0>

We see that the distribution is skewed heavily to the left. Let's try applying a transformation to make it more in-line with a normal distribution (bell-shaped).

In [12]:
sns.distplot(df['count'].apply(np.log))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c34622978>

By Chebyshev's Rule, let us also remove entries that lie outside of 3 standard deviations from the mean.

In [13]:
df_withoutoutliers = df[np.abs(df["count"]-df["count"].mean())<=(3*df["count"].std())]
df.shape
Out[13]:
(10886, 16)
In [14]:
df_withoutoutliers.shape
Out[14]:
(10739, 16)
In [15]:
sns.distplot(df_withoutoutliers['count'].apply(np.log))
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c34700ef0>

Correlation Matrix

In [16]:
colormap = plt.cm.RdBu
fig,ax = plt.subplots()
fig.set_size_inches(18,10)
sns.heatmap(df.corr(),cmap=colormap, annot = True)
plt.show()

We observe the following:

  • temp and atemp have strong positive correlation (almost 1). We can consider including just one of them in our predictive model
  • windspeed has low positive correlation with count, we can consider removing it from the regression model
  • While casual and registered have strong positive correlation with count, they are leakage variables which will not be included in our regression model

We will look into these features in greater detail.

Univariate Plots

In [17]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3)
fig.set_size_inches(12,5)
sns.regplot(x = 'temp', y = 'count', data = df, ax = ax1)
sns.regplot(x = 'humidity', y = 'count', data = df, ax = ax2)
sns.regplot(x = 'windspeed', y = 'count', data = df, ax = ax3)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c36ad0518>

We will also compare average count across months:

In [18]:
fig,ax = plt.subplots()
fig.set_size_inches(12,10)

# monthLabels = ['January','Feburary','March','April','May','June','July','August','September','October','November','December']
byMonth = pd.DataFrame(df.groupby('month')['count'].mean()).reset_index()
sns.barplot(data=byMonth,x='month',y='count')
ax.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")
Out[18]:
[Text(0, 0.5, 'Average Count'),
 Text(0.5, 0, 'Month'),
 Text(0.5, 1.0, 'Average Count By Month')]

Bivariate Plots

Next, we compare non-registered and registered rentals.

In [19]:
df1 = df.groupby('weather')['casual'].sum()
df1 = df1.reset_index()
In [20]:
df2 = df.groupby('weather')['registered'].sum()
df2 = df2.reset_index()
In [21]:
# Easier to do stacked graphs in matlibplot than seaborn (albeit less aesthetic)
width = 0.4
p1 = plt.bar(df1['weather'],df1['casual'],width,color='r')
p2 = plt.bar(df2['weather'],df2['registered'],width,bottom=df1['casual'])
plt.xticks(df1['weather'],['Clear','Mist','Light','Heavy'])
plt.xlabel('Weather')
plt.ylabel('Count')
plt.legend(['Casual','Registered'])
plt.show()

There are more registered users than casual users. Most of the bikes were rented in Clear weather, and hardly any bikes were rented during Heavy weather.

In [22]:
fig,(ax1,ax2) = plt.subplots(nrows=2)
fig.set_size_inches(12,20)

dayLabels = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
byDays = pd.DataFrame(df.groupby(['hour','day'],sort=True)['count'].mean()).reset_index()
sns.pointplot(data=byDays, x='hour',y='count',hue=byDays['day'], hue_order = dayLabels, ax=ax1)
ax1.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Days")

bySeason = pd.DataFrame(df.groupby(['hour','season'],sort=True)['count'].mean()).reset_index()
sns.pointplot(data=bySeason,x='hour', y='count', hue=bySeason['season'],ax=ax2)
ax2.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Seasons")
Out[22]:
[Text(0, 0.5, 'Average Count'),
 Text(0.5, 0, 'Hour'),
 Text(0.5, 1.0, 'Average Count By Hour Across Seasons')]

We observe the following:

  • Bike rentals follow two main trends, one for weekdays (mondays to fridays) and one for weekends (saturday and sundays). Among weekdays
  • Average count by hour among seasons are similar from summer to winter but much lower in spring

Removing Features

Based on the above, we will be removing these features:

  • datetime (we already split it into its individual components)
  • atemp
  • windspeed
  • casual
  • registered
In [23]:
df = df.drop(['datetime','atemp','windspeed','casual','registered'],axis=1)

Making Predictions

We will try a few different methods and assess them based on their accuracy on the RMSLE (lower is better).

Preparing train and test data

We will need to create dummy variables for our categorical variables. These are:

  • day
  • hour
  • month
  • year
  • season
In [24]:
dummies = ['day','hour','month','year','season']
final_df = pd.get_dummies(df, columns = dummies)
final_df.columns
Out[24]:
Index(['holiday', 'workingday', 'weather', 'temp', 'humidity', 'count',
       'day_Friday', 'day_Monday', 'day_Saturday', 'day_Sunday',
       'day_Thursday', 'day_Tuesday', 'day_Wednesday', 'hour_0', 'hour_1',
       'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6', 'hour_7', 'hour_8',
       'hour_9', 'hour_10', 'hour_11', 'hour_12', 'hour_13', 'hour_14',
       'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19', 'hour_20',
       'hour_21', 'hour_22', 'hour_23', 'month_1', 'month_2', 'month_3',
       'month_4', 'month_5', 'month_6', 'month_7', 'month_8', 'month_9',
       'month_10', 'month_11', 'month_12', 'year_2011', 'year_2012',
       'season_Fall', 'season_Spring', 'season_Summer', 'season_Winter'],
      dtype='object')

We repeat the earlier steps we took to prepare the train data for the test data.

In [25]:
df_test = pd.read_csv("test.csv")
df_test['datetime'] = pd.to_datetime(df_test['datetime'])
df_time = pd.DataFrame({'year':df_test['datetime'].dt.year,'month':df_test['datetime'].dt.month,'day':df_test['datetime'].dt.weekday,'hour':df_test['datetime'].dt.hour})
df_test = pd.concat([df_time,df_test[['season','holiday','workingday','weather','temp','humidity']]], axis=1)

df_test['season'] = df_test.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df_test['day'] = df_test.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})

df_test = pd.get_dummies(df_test, columns = dummies)
df_test.head()
Out[25]:
holiday workingday weather temp humidity day_Friday day_Monday day_Saturday day_Sunday day_Thursday ... month_9 month_10 month_11 month_12 year_2011 year_2012 season_Fall season_Spring season_Summer season_Winter
0 0 1 1 10.66 56 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
1 0 1 1 10.66 56 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
2 0 1 1 10.66 56 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
3 0 1 1 10.66 56 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
4 0 1 1 10.66 56 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 54 columns

In [26]:
df_test.columns
Out[26]:
Index(['holiday', 'workingday', 'weather', 'temp', 'humidity', 'day_Friday',
       'day_Monday', 'day_Saturday', 'day_Sunday', 'day_Thursday',
       'day_Tuesday', 'day_Wednesday', 'hour_0', 'hour_1', 'hour_2', 'hour_3',
       'hour_4', 'hour_5', 'hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10',
       'hour_11', 'hour_12', 'hour_13', 'hour_14', 'hour_15', 'hour_16',
       'hour_17', 'hour_18', 'hour_19', 'hour_20', 'hour_21', 'hour_22',
       'hour_23', 'month_1', 'month_2', 'month_3', 'month_4', 'month_5',
       'month_6', 'month_7', 'month_8', 'month_9', 'month_10', 'month_11',
       'month_12', 'year_2011', 'year_2012', 'season_Fall', 'season_Spring',
       'season_Summer', 'season_Winter'],
      dtype='object')

Various Models

Linear Regression Model

In [27]:
from sklearn.linear_model import LinearRegression
lmodel = LinearRegression()
lmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
Out[27]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

To evaluate our model, we will use root mean squared logarithmic error.

In [28]:
def rmsle(y1, y2):
    log1 = np.array([np.log(v + 1) for v in y1])
    log2 = np.array([np.log(v + 1) for v in y2])
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))
In [29]:
preds = lmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Linear Regression: ",rmsle(final_df['count'],np.exp(preds)))
RMSLE Value For Linear Regression:  0.5888216413863685

Predicting for test data:

In [30]:
preds_sub = lmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
Out[30]:
array([ 17.46660072,   8.84192307,   3.81146144, ..., 111.80953854,
        88.31376603,  54.31562896])

For submission:

In [31]:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
In [32]:
df_sub.to_csv('submission_lr.csv',index = False)

The submission scored 0.66405.

Ridge Regression

In [33]:
from sklearn.linear_model import Ridge
for alpha in [0.001,0.01,0.1,1.0,10]:
    rmodel = Ridge(alpha)
    rmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
    preds = rmodel.predict(final_df.drop(['count'],axis=1))
    print ("RMSLE Value For Ridge Regression for alpha {:f}: {:f}".format(alpha,rmsle(final_df['count'],np.exp(preds))))
RMSLE Value For Ridge Regression for alpha 0.001000: 0.584994
RMSLE Value For Ridge Regression for alpha 0.010000: 0.584994
RMSLE Value For Ridge Regression for alpha 0.100000: 0.584989
RMSLE Value For Ridge Regression for alpha 1.000000: 0.584950
RMSLE Value For Ridge Regression for alpha 10.000000: 0.585195
In [34]:
rmodel = Ridge(alpha = 1)
rmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = rmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
Out[34]:
array([ 16.14050617,   8.42520557,   4.84714386, ..., 111.68234695,
        88.28298581,  57.88092268])

For submission:

In [35]:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
In [36]:
df_sub.to_csv('submission_rr.csv',index = False)

This submission scored 0.62598.

Lasso Regression

In [40]:
from sklearn.linear_model import Lasso
for alpha in [0.001,0.01,0.1,1.0,10]:
    lasmodel = Lasso(alpha=0.1)
    lasmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
    preds = lasmodel.predict(final_df.drop(['count'],axis=1))
    print ("RMSLE Value For Lasso Regression for alpha {:f}: {:f}".format(alpha,rmsle(final_df['count'],np.exp(preds))))
RMSLE Value For Lasso Regression for alpha 0.001000: 1.230722
RMSLE Value For Lasso Regression for alpha 0.010000: 1.230722
RMSLE Value For Lasso Regression for alpha 0.100000: 1.230722
RMSLE Value For Lasso Regression for alpha 1.000000: 1.230722
RMSLE Value For Lasso Regression for alpha 10.000000: 1.230722
In [42]:
lasmodel = Lasso(alpha = 1)
lasmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = lasmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
Out[42]:
array([65.66462023, 65.66462023, 65.66462023, ..., 60.29639713,
       65.66462023, 54.19899953])

For submission:

In [43]:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
In [44]:
df_sub.to_csv('submission_lasr.csv',index = False)

This submission scored 0.63225.

Random Forest Regression

In [45]:
from sklearn.ensemble import RandomForestRegressor
rfmodel = RandomForestRegressor(n_estimators=100)
rfmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds = rfmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Random Forest: ",rmsle(final_df['count'],np.exp(preds)))
RMSLE Value For Random Forest:  0.11869442929650452
In [46]:
feature_importances = pd.DataFrame(rfmodel.feature_importances_,
                                   index = final_df.drop(['count'],axis=1).columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances
Out[46]:
importance
hour_4 0.165723
hour_3 0.139258
hour_2 0.104959
hour_5 0.097221
temp 0.079194
hour_1 0.077804
workingday 0.073979
hour_0 0.037235
humidity 0.026703
hour_6 0.026014
season_Spring 0.016669
hour_23 0.016473
year_2011 0.015190
year_2012 0.014501
hour_7 0.012265
weather 0.011047
hour_8 0.009869
season_Winter 0.008401
hour_17 0.007883
hour_18 0.006866
hour_22 0.004939
day_Friday 0.003985
day_Sunday 0.003426
hour_19 0.003099
month_4 0.002977
hour_9 0.002319
hour_21 0.002224
hour_20 0.002075
day_Monday 0.002009
month_1 0.001965
day_Saturday 0.001925
hour_10 0.001873
hour_16 0.001780
day_Wednesday 0.001640
day_Thursday 0.001347
holiday 0.001338
day_Tuesday 0.001287
month_5 0.001260
month_3 0.001173
month_2 0.001006
season_Summer 0.000996
month_9 0.000910
hour_11 0.000875
month_12 0.000868
month_10 0.000862
month_11 0.000647
season_Fall 0.000607
hour_12 0.000588
month_7 0.000517
month_6 0.000515
hour_14 0.000462
hour_13 0.000434
month_8 0.000430
hour_15 0.000388

Predicting for test data:

In [47]:
preds_sub = rfmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
Out[47]:
array([ 10.30306038,   3.57506469,   2.6612907 , ..., 111.61289065,
        94.81654297,  53.22024613])

For submission:

In [48]:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
In [49]:
df_sub.to_csv('submission_rf.csv',index = False)

This submission scored 0.41089.

Boosting

In [50]:
cat_columns = final_df.select_dtypes(['category']).columns
final_df[cat_columns] = final_df[cat_columns].apply(lambda x: x.cat.codes)
In [51]:
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)
xg_reg.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
c:\anaconda3\envs\venv\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
Out[51]:
XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.3, gamma=0, importance_type='gain',
       learning_rate=0.1, max_delta_step=0, max_depth=5,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1)
In [52]:
preds = xg_reg.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Boosting: ",rmsle(final_df['count'],np.exp(preds)))
RMSLE Value For Boosting:  0.3957384116281124

Predicting for test data:

In [53]:
preds_sub = xg_reg.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
Out[53]:
array([ 11.648093 ,   4.5127115,   3.4661877, ..., 114.09533  ,
        96.781136 ,  50.1189   ], dtype=float32)

For submission:

In [54]:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
In [55]:
df_sub.to_csv('submission_boost.csv',index = False)

Conclusion

In [ ]: