Extract from Kaggle: Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Goal: predict the total number of bikes rented in each hour from the 20th day of each month to the end of the month, given data from the preceding 19 days.
We begin by importing the necessary modules and reading the data.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import datetime
import calendar
import math
sns.set_style('darkgrid')
df = pd.read_csv("train.csv")
df.head()
df.dtypes
Notice how datetime is an object. Let's first convert it to a datetime64 data type so we can split it into weekday, hour, month and year.
df['datetime'] = pd.to_datetime(df['datetime'])
df.dtypes
df_time = pd.DataFrame({'year': df['datetime'].dt.year,
                        'month': df['datetime'].dt.month,
                        'day': df['datetime'].dt.weekday,
                        'hour': df['datetime'].dt.hour})
# 0 - Monday, 6 - Sunday
df = pd.concat([df['datetime'],df_time,df[['season','holiday','workingday','weather','temp','atemp','humidity','windspeed','casual','registered','count']]], axis=1)
df.head()
We will also map some of the categorical features, which helps in visualisation.
df['season'] = df.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df['day'] = df.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
Next, we categorise the variables.
categorisedVars = ['day','hour','month','year','season','holiday','workingday','weather']
for var in categorisedVars:
    df[var] = df[var].astype('category')
Now we look at a summary of each feature.
df.describe()
Since the count row in the summary is the same for every feature, we can conclude there are no missing values. The missingno module gives a visual confirmation of this.
msno.matrix(df,figsize=(12,5))
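As a quick cross-check (a minimal sketch using only pandas built-ins on the df loaded above), we can also count null entries directly:
# Number of missing values per column; all zeros confirms a complete dataset
df.isnull().sum()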
Since we are predicting a continuous variable, it is useful for us to check for outliers.
fig,axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12,10)
# Plots
sns.boxplot(data = df, y = 'count', orient='v', ax = axes[0][0])
sns.boxplot(data = df, y = 'count', x = 'hour', ax = axes[0][1])
sns.boxplot(data = df, y = 'count', x = 'workingday', ax = axes[1][0])
sns.boxplot(data = df, y = 'count', x = 'season', ax = axes[1][1])
# Labelling
axes[0][0].set(ylabel='Count',title='Overall')
axes[0][1].set(xlabel='Hour', ylabel='Count',title='Across Hours')
axes[1][0].set(xlabel='Working Day', ylabel='Count',title='Across Working Days')
axes[1][1].set(xlabel='Season', ylabel='Count',title='Across Seasons')
We make a number of observations: the count varies strongly with the hour of the day, working days and weekends show different patterns, and demand differs across seasons. All of this helps to explain the high number of outliers above the upper whisker (beyond Q3 + 1.5 × IQR) in the overall count.
Let's take a closer look at the overall count.
sns.distplot(df['count'])
We see that the distribution is heavily right-skewed (a long tail of high counts). Let's apply a log transformation to bring it closer to a normal (bell-shaped) distribution.
sns.distplot(df['count'].apply(np.log))
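To quantify the improvement rather than judge it by eye, we can compare the skewness before and after the transform (a small sketch; values near 0 indicate symmetry, positive values a right tail):
# Skewness of the raw counts vs the log-transformed counts
print(df['count'].skew())
print(df['count'].apply(np.log).skew())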
Following the usual three-sigma rule of thumb, let us also remove entries that lie more than 3 standard deviations from the mean.
df_withoutoutliers = df[np.abs(df["count"]-df["count"].mean())<=(3*df["count"].std())]
df.shape
df_withoutoutliers.shape
sns.distplot(df_withoutoutliers['count'].apply(np.log))
colormap = plt.cm.RdBu
fig,ax = plt.subplots()
fig.set_size_inches(18,10)
sns.heatmap(df.corr(),cmap=colormap, annot = True)
plt.show()
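To read the same information off numerically, we can rank the numeric features by their correlation with count (a minimal sketch; on newer pandas versions pass numeric_only=True to corr):
# Pearson correlation of each numeric feature with count, strongest first
df.corr()['count'].sort_values(ascending=False)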
We observe that, among the numeric features, temp and humidity show the clearest relationships with count (positive and negative respectively), windspeed is only weakly correlated, and atemp is almost perfectly correlated with temp. We will look into these features in greater detail.
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3)
fig.set_size_inches(12,5)
sns.regplot(x = 'temp', y = 'count', data = df, ax = ax1)
sns.regplot(x = 'humidity', y = 'count', data = df, ax = ax2)
sns.regplot(x = 'windspeed', y = 'count', data = df, ax = ax3)
We will also compare average count across months:
fig,ax = plt.subplots()
fig.set_size_inches(12,10)
byMonth = df.groupby('month')['count'].mean().reset_index()
sns.barplot(data=byMonth, x='month', y='count')
ax.set(xlabel='Month', ylabel='Average Count', title="Average Count By Month")
Next, we compare non-registered and registered rentals.
df1 = df.groupby('weather')['casual'].sum()
df1 = df1.reset_index()
df2 = df.groupby('weather')['registered'].sum()
df2 = df2.reset_index()
# Stacked bar charts are easier to build in matplotlib than in seaborn (albeit less aesthetic)
width = 0.4
p1 = plt.bar(df1['weather'],df1['casual'],width,color='r')
p2 = plt.bar(df2['weather'],df2['registered'],width,bottom=df1['casual'])
plt.xticks(df1['weather'],['Clear','Mist','Light','Heavy'])
plt.xlabel('Weather')
plt.ylabel('Count')
plt.legend(['Casual','Registered'])
plt.show()
There are more registered users than casual users. Most of the bikes were rented in Clear weather, and hardly any bikes were rented during Heavy weather.
fig,(ax1,ax2) = plt.subplots(nrows=2)
fig.set_size_inches(12,20)
dayLabels = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
byDays = df.groupby(['hour','day'], sort=True)['count'].mean().reset_index()
sns.pointplot(data=byDays, x='hour', y='count', hue='day', hue_order=dayLabels, ax=ax1)
ax1.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Days")
bySeason = df.groupby(['hour','season'], sort=True)['count'].mean().reset_index()
sns.pointplot(data=bySeason, x='hour', y='count', hue='season', ax=ax2)
ax2.set(xlabel='Hour', ylabel='Average Count', title="Average Count By Hour Across Seasons")
We observe the usual commuting pattern: on weekdays the average count peaks around the morning and evening rush hours, while on weekends demand is spread across the afternoon; the seasonal curves share the same shape but differ in level. Based on the above, we will remove the following features: datetime (already split into its components), atemp (nearly collinear with temp), windspeed (only weakly related to count), and casual and registered (they sum to count and are not available in the test data).
df = df.drop(['datetime','atemp','windspeed','casual','registered'],axis=1)
We will try a few different methods and assess them by their root mean squared logarithmic error (RMSLE), the competition metric (lower is better).
We will need to create dummy variables for our categorical variables. These are:
dummies = ['day','hour','month','year','season']
final_df = pd.get_dummies(df, columns = dummies)
final_df.columns
We repeat on the test data the preprocessing steps we applied to the training data.
df_test = pd.read_csv("test.csv")
df_test['datetime'] = pd.to_datetime(df_test['datetime'])
df_time = pd.DataFrame({'year':df_test['datetime'].dt.year,'month':df_test['datetime'].dt.month,'day':df_test['datetime'].dt.weekday,'hour':df_test['datetime'].dt.hour})
df_test = pd.concat([df_time,df_test[['season','holiday','workingday','weather','temp','humidity']]], axis=1)
df_test['season'] = df_test.season.map({1: 'Spring', 2 : 'Summer', 3 : 'Fall', 4 :'Winter' })
df_test['day'] = df_test.day.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
df_test = pd.get_dummies(df_test, columns = dummies)
df_test.head()
df_test.columns
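One caveat worth guarding against: if the test set lacked a category level seen in training, get_dummies would produce fewer columns and the two feature matrices would misalign. A defensive sketch (assuming final_df and df_test as built above):
# Reindex the test columns against the training design matrix, filling any
# dummy column absent from the test set with zeros
train_cols = final_df.drop(['count'], axis=1).columns
df_test = df_test.reindex(columns=train_cols, fill_value=0)
This also enforces the same column order, which matters because scikit-learn matches features by position, not by name.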
from sklearn.linear_model import LinearRegression
lmodel = LinearRegression()
lmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
To evaluate our models we use the root mean squared logarithmic error, RMSLE = sqrt((1/n) * sum_i (log(p_i + 1) - log(a_i + 1))^2), where p_i is the predicted count and a_i the actual count. Note that we evaluate on the training set here, so these values will be optimistic compared with the leaderboard scores.
def rmsle(y1, y2):
    log1 = np.array([np.log(v + 1) for v in y1])
    log2 = np.array([np.log(v + 1) for v in y2])
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))
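A quick sanity check of the metric (identical predictions should score exactly zero):
print(rmsle([1, 10, 100], [1, 10, 100]))  # 0.0
print(rmsle([1, 10, 100], [2, 20, 200]))  # > 0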
preds = lmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Linear Regression: ",rmsle(final_df['count'],np.exp(preds)))
Predicting for test data:
preds_sub = lmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_lr.csv',index = False)
The submission scored 0.66405.
from sklearn.linear_model import Ridge
for alpha in [0.001, 0.01, 0.1, 1.0, 10]:
    rmodel = Ridge(alpha)
    rmodel.fit(final_df.drop(['count'],axis=1), final_df['count'].apply(np.log))
    preds = rmodel.predict(final_df.drop(['count'],axis=1))
    print("RMSLE Value For Ridge Regression for alpha {:f}: {:f}".format(alpha, rmsle(final_df['count'], np.exp(preds))))
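Since the loop above scores on the training data, the alpha it favours may not generalise. A cross-validated alternative (a sketch using scikit-learn's RidgeCV, not part of the original pipeline):
from sklearn.linear_model import RidgeCV
# RidgeCV selects alpha by cross-validation on the training set
rcv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10])
rcv.fit(final_df.drop(['count'], axis=1), final_df['count'].apply(np.log))
print(rcv.alpha_)
Below we keep alpha = 1 for the submission.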
rmodel = Ridge(alpha = 1)
rmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = rmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_rr.csv',index = False)
This submission scored 0.62598.
from sklearn.linear_model import Lasso
for alpha in [0.001, 0.01, 0.1, 1.0, 10]:
    lasmodel = Lasso(alpha=alpha)  # use the loop's alpha, not a fixed value
    lasmodel.fit(final_df.drop(['count'],axis=1), final_df['count'].apply(np.log))
    preds = lasmodel.predict(final_df.drop(['count'],axis=1))
    print("RMSLE Value For Lasso Regression for alpha {:f}: {:f}".format(alpha, rmsle(final_df['count'], np.exp(preds))))
lasmodel = Lasso(alpha = 1)
lasmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds_sub = lasmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_lasr.csv',index = False)
This submission scored 0.63225.
from sklearn.ensemble import RandomForestRegressor
rfmodel = RandomForestRegressor(n_estimators=100)
rfmodel.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds = rfmodel.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Random Forest: ",rmsle(final_df['count'],np.exp(preds)))
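A random forest can fit its training data almost perfectly, so the training RMSLE above understates the true error (compare it with the leaderboard score below). A hold-out split gives a more honest estimate (a sketch, not part of the original pipeline; a time-aware split would be even better for this data):
from sklearn.model_selection import train_test_split
X = final_df.drop(['count'], axis=1)
y = final_df['count'].apply(np.log)
# Hold out 20% of the rows and score the forest on data it has not seen
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
rf_val = RandomForestRegressor(n_estimators=100).fit(X_tr, y_tr)
print(rmsle(np.exp(y_val), np.exp(rf_val.predict(X_val))))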
feature_importances = pd.DataFrame(rfmodel.feature_importances_,
index = final_df.drop(['count'],axis=1).columns,
columns=['importance']).sort_values('importance',ascending=False)
feature_importances
Predicting for test data:
preds_sub = rfmodel.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_rf.csv',index = False)
This submission scored 0.41089.
# XGBoost (at least in older versions) does not accept the pandas category
# dtype, so replace each remaining category column with its integer codes
cat_columns = final_df.select_dtypes(['category']).columns
final_df[cat_columns] = final_df[cat_columns].apply(lambda x: x.cat.codes)
Note one subtlety: this maps weather's values 1-4 to codes 0-3, while the test set still uses 1-4, so strictly the test column should be shifted down by one to match.
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective='reg:linear',  # older alias of 'reg:squarederror'
                          colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=100)
xg_reg.fit(final_df.drop(['count'],axis=1),final_df['count'].apply(np.log))
preds = xg_reg.predict(final_df.drop(['count'],axis=1))
print ("RMSLE Value For Boosting: ",rmsle(final_df['count'],np.exp(preds)))
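As with the random forest, the training RMSLE is optimistic. A cross-validated estimate for the boosted model (a sketch; this computes RMSE on the log target, which is close to, though not identical to, RMSLE):
from sklearn.model_selection import cross_val_score
X = final_df.drop(['count'], axis=1)
y = final_df['count'].apply(np.log)
scores = cross_val_score(xg_reg, X, y, scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-scores.mean()))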
Predicting for test data:
preds_sub = xg_reg.predict(df_test)
preds_sub = np.exp(preds_sub)
preds_sub
For submission:
preds_sub = pd.Series(preds_sub)
df_test2 = pd.read_csv('test.csv')
df_sub = df_test2['datetime']
df_sub = df_sub.to_frame().join(preds_sub.to_frame())
df_sub.columns = ['datetime','count']
df_sub.to_csv('submission_boost.csv',index = False)