Data & Analytics Project

Ipsa · Yogeek Inc. · Jan 11, 2023

7 Steps to Decode the Hourly, Daily, and Seasonal Impact on Rental Bike Demand


Bike rental services are a common part of the urban mobility space. Three factors make them popular:

  1. Easy access (no license required)
  2. Cheaper than automobile services (low maintenance and insurance costs)
  3. A faster way to commute in cities with heavy traffic

Why is it critical for both business owners and users to understand the patterns in bike rental requests?

Business Owner’s perspective

  1. Increase revenue and customer satisfaction — by identifying the expected bike demand in a specific area within a specific time frame
  2. Reduce operational costs — by optimising bike relocation

User’s perspective

  1. Ensure bike availability in the shortest wait time

Now that we have established the value of analysing bike rental data for both business owners and users, let's dive in!

Step 1 — Making sense of Data

1.0 — Fundamental understanding of the data and distribution of various features

1.1 — Import the relevant Python libraries and the data itself

# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# load the hourly data (hour.csv from the UCI Bike Sharing dataset)
hourly_data = pd.read_csv('hour.csv')
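
In case hour.csv is not already on disk: it ships inside the UCI Bike Sharing Dataset archive. Below is a minimal download sketch, assuming the long-standing UCI archive URL is still live:

# one-off download of hour.csv (URL is an assumption; verify it is still live)
import io
import urllib.request
import zipfile

URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00275/Bike-Sharing-Dataset.zip')
with urllib.request.urlopen(URL) as resp:
    with zipfile.ZipFile(io.BytesIO(resp.read())) as archive:
        archive.extract('hour.csv')  # writes hour.csv to the working directory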

1.2 — Getting an idea of the size of the data we are loading and the total number of missing values

# print some generic statistics about the data
print(f"Shape of data: {hourly_data.shape}")
print(f"Number of missing values in the data: "
      f"{hourly_data.isnull().sum().sum()}")

Output —

Shape of data: (17379, 17)
Number of missing values in the data: 0
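
The snippet above prints only the grand total. If you also want the per-column breakdown, it is one call away:

# missing values per column (all zeros for this dataset)
print(hourly_data.isnull().sum())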

1.3 — Getting general statistics about the numerical columns

# get statistics on the numerical columns
hourly_data.describe().T

Output —

The above columns can be split into three main categories (we capture them as Python lists right after this list):

  1. Temporal features: information about the time at which a record was registered. This group contains the dteday, season, yr, mnth, hr, holiday, weekday, and workingday columns.
  2. Weather-related features: information about the weather conditions. The weathersit, temp, atemp, hum, and windspeed columns are included in this group.
  3. Record-related features: information about the number of rides for the specific hour and date. This group includes the casual, registered, and cnt columns.
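
For convenience in later steps, the groups can be written down as plain Python lists; the list names below are our own, not part of the dataset:

# column groups, named by us for convenience
temporal_features = ['dteday', 'season', 'yr', 'mnth', 'hr',
                     'holiday', 'weekday', 'workingday']
weather_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']
record_features = ['casual', 'registered', 'cnt']

# example: statistics restricted to the weather-related columns
hourly_data[weather_features].describe().T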

Step 2 — Data Preprocessing/Transformation

A few columns are hard to interpret from a human perspective. Hence, we perform some basic transformations on them, which will make the analysis easier to follow at a later stage.

2.0 — Create a copy of the original dataset. This is to protect our original dataset from any transformation.

# create a copy of the original data
preprocessed_data = hourly_data.copy()

2.1 — Transforming the season variable from a numerical code into a readable categorical one. We create a Python dictionary and apply it with apply and a lambda function.

# transform seasons
seasons_mapping = {1: 'winter', 2: 'spring',
                   3: 'summer', 4: 'fall'}
preprocessed_data['season'] = preprocessed_data['season'].apply(
    lambda x: seasons_mapping[x])

2.2 — Transforming yr column

# transform yr
yr_mapping = {0: 2011, 1: 2012}
preprocessed_data['yr'] = preprocessed_data['yr'].apply(
    lambda x: yr_mapping[x])

2.3 — Transforming weekday column

# transform weekday
weekday_mapping = {0: 'Sunday', 1: 'Monday', 2: 'Tuesday',
                   3: 'Wednesday', 4: 'Thursday', 5: 'Friday',
                   6: 'Saturday'}
preprocessed_data['weekday'] = preprocessed_data['weekday'].apply(
    lambda x: weekday_mapping[x])

2.4 — Transforming weathersit column

# transform weathersit
weather_mapping = {1: 'clear', 2: 'cloudy',
                   3: 'light_rain_snow', 4: 'heavy_rain_snow'}
preprocessed_data['weathersit'] = preprocessed_data['weathersit'].apply(
    lambda x: weather_mapping[x])
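
As a side note, each of these dictionary lookups can also be written with pandas' Series.map, which is equivalent here and a bit more concise. For example, instead of the apply/lambda version above:

# equivalent alternative: run this instead of (not after) the apply version
preprocessed_data['weathersit'] = preprocessed_data['weathersit'].map(weather_mapping)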

2.5 — Rescaling the hum and windspeed columns. In the raw data both are normalised: humidity was divided by 100 and windspeed by its maximum value of 67, so we multiply to restore the original scales.

# restore hum and windspeed to their original scales
preprocessed_data['hum'] = preprocessed_data['hum'] * 100
preprocessed_data['windspeed'] = preprocessed_data['windspeed'] * 67
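
A quick sanity check that the rescaling did what we expect: humidity should now span roughly 0 to 100, and windspeed should top out around 67, the maximum stated in the dataset documentation:

# verify the rescaled ranges
print(preprocessed_data[['hum', 'windspeed']].agg(['min', 'max']))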

2.6 — Now we inspect the results of our transformations by calling the sample() method on the newly created dataset

# inspect a random sample of the preprocessed columns
cols = ['season', 'yr', 'weekday',
        'weathersit', 'hum', 'windspeed']
preprocessed_data[cols].sample(10, random_state=123)

Output —

Step 3 — Data Analysis: Registered versus Casual Use Analysis

We analyse the number of rides performed by registered users versus the number performed by non-registered (or casual) ones. These numbers are stored in the registered and casual columns, respectively, with the cnt column representing the sum of the registered and casual rides.

# plot distributions of registered vs casual rides
# (sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True is the closest modern equivalent)
sns.histplot(preprocessed_data['registered'], kde=True,
             stat='density', label='registered')
sns.histplot(preprocessed_data['casual'], kde=True,
             stat='density', label='casual')
plt.legend()
plt.xlabel('rides')
plt.title("Rides distributions")
plt.savefig('figs/rides_distributions.png', format='png')  # assumes figs/ exists

Output —

Observations:

  1. Registered users perform far more rides than casual ones.
  2. Both distributions are skewed to the right, meaning that, for most of the entries in the data, zero or only a small number of rides were recorded (for example, overnight hours). A numeric check follows this list.
  3. Only a few entries have a very large number of rides (higher than 800).
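
The right skew in observation 2 can be verified numerically with pandas' built-in skew() method; positive values confirm the long right tails:

# skewness of the two ride-count distributions (positive = right-skewed)
print(preprocessed_data[['registered', 'casual']].skew())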

Step 4 — Data Analysis: Study the evolution of the number of rides per day over time

# plot evolution of rides over time
plot_data = preprocessed_data[['registered', 'casual', 'dteday']]
ax = plot_data.groupby('dteday').sum().plot(figsize=(10,6))
ax.set_xlabel("time");
ax.set_ylabel("number of rides per day");
plt.savefig('figs/rides_daily.png', format='png')

Output —

Observations:

  1. The number of registered rides is always above, and significantly higher than, the number of casual rides per day.
  2. During winter the overall number of rides decreases, which is in line with our expectations, as bad weather and low temperatures have a negative impact on bike sharing services.
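
One implementation note on the snippet above: dteday is read in as plain strings, and the grouping only comes out in chronological order because ISO-formatted dates happen to sort lexically. Converting to datetime first gives proper date handling on the x axis; a minimal sketch:

# same plot, but with a true datetime index
plot_data = preprocessed_data[['registered', 'casual', 'dteday']].copy()
plot_data['dteday'] = pd.to_datetime(plot_data['dteday'])
ax = plot_data.groupby('dteday').sum().plot(figsize=(10, 6))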

Step 5 — Data Analysis: Visualising not only the average number of rides over a window, but also the expected deviation from the mean

Note that there is quite a lot of variance in the time series of the rides. One way to smooth out the curves is to take the rolling mean and standard deviation of the two time series and plot those instead.

# create a new dataframe with only the columns needed for plotting,
# and obtain the number of rides per day by grouping over each day
plot_data = preprocessed_data[['registered', 'casual', 'dteday']]
plot_data = plot_data.groupby('dteday').sum()

# define the window for computing the rolling mean and standard deviation
window = 7
rolling_means = plot_data.rolling(window).mean()
rolling_deviations = plot_data.rolling(window).std()

# plot the series of rolling means, then shade the zone between
# the rolling means +/- 2 rolling standard deviations
ax = rolling_means.plot(figsize=(10, 6))
ax.fill_between(rolling_means.index,
                rolling_means['registered'] + 2 * rolling_deviations['registered'],
                rolling_means['registered'] - 2 * rolling_deviations['registered'],
                alpha=0.2)
ax.fill_between(rolling_means.index,
                rolling_means['casual'] + 2 * rolling_deviations['casual'],
                rolling_means['casual'] - 2 * rolling_deviations['casual'],
                alpha=0.2)
ax.set_xlabel("time")
ax.set_ylabel("number of rides per day")
plt.savefig('figs/rides_aggregated.png', format='png')

Output —
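
A detail worth knowing about the rolling computation: with window = 7, the first six days have no complete window, so rolling() yields NaN there and the curves simply start a week late. If you prefer to fill that left edge, pandas' min_periods option computes the statistics over however many points are available so far:

# rolling mean that starts from day one instead of day seven
rolling_means = plot_data.rolling(window, min_periods=1).mean()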

Step 6 — Data Analysis: Distributions of the requests over separate hours and days of the week

We would expect certain time patterns to arise, as bike requests should be
more frequent during certain hours of the day, depending on the day of the week. This analysis can be easily done by leveraging various functions from the seaborn package, as shown in the following code snippet:

# select relevant columns
plot_data = preprocessed_data[['hr', 'weekday', 'registered', 'casual']]

# unpivot the data into long format: one row per original record,
# with a 'type' column (registered or casual) and the ride count in 'count'
plot_data = plot_data.melt(id_vars=['hr', 'weekday'], var_name='type',
                           value_name='count')

# create a FacetGrid object, in which a grid plot is produced:
# rows hold the days of the week,
# columns hold the two types (registered and casual)
grid = sns.FacetGrid(plot_data, row='weekday', col='type',
                     height=2.5, aspect=2.5,
                     row_order=['Monday', 'Tuesday', 'Wednesday',
                                'Thursday', 'Friday', 'Saturday', 'Sunday'])

# populate the FacetGrid; barplot draws the mean rides per hour
grid.map(sns.barplot, 'hr', 'count', alpha=0.5)
grid.savefig('figs/weekday_hour_distributions.png', format='png')

Output —
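
If melt is unfamiliar, here is what it does in this step, shown on a tiny made-up frame: the two wide columns registered and casual are folded into a single type/count pair, one row per original record. Note that sns.barplot then draws the mean of count per hour, not a raw tally:

# tiny demo of the melt step (hypothetical mini-frame)
demo = pd.DataFrame({'hr': [0, 1],
                     'weekday': ['Monday', 'Monday'],
                     'registered': [10, 5],
                     'casual': [2, 1]})
print(demo.melt(id_vars=['hr', 'weekday'], var_name='type', value_name='count'))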

Step 7 — Analysing Seasonal Impact on Rides

7.0 — Plotting the seasonal distribution of the number of rides over hours

# select a subset of the data
plot_data = preprocessed_data[['hr', 'season', 'registered', 'casual']]

# unpivot the data from wide to long format
plot_data = plot_data.melt(id_vars=['hr', 'season'], var_name='type',
                           value_name='count')

# define the FacetGrid: one row per season, one column per type
grid = sns.FacetGrid(plot_data, row='season', col='type',
                     height=2.5, aspect=2.5,
                     row_order=['winter', 'spring', 'summer', 'fall'])

# apply the plotting function to each element of the grid
grid.map(sns.barplot, 'hr', 'count', alpha=0.5)

# save figure
grid.savefig('figs/exercise_1_02_a.png', format='png')

Output —

7.1 — Plotting the seasonal distribution of the number of rides over weekdays

# select and unpivot the relevant columns
plot_data = preprocessed_data[['weekday', 'season', 'registered', 'casual']]
plot_data = plot_data.melt(id_vars=['weekday', 'season'], var_name='type',
                           value_name='count')

# one row per season, one column per type, weekdays in calendar order
grid = sns.FacetGrid(plot_data, row='season', col='type',
                     height=2.5, aspect=2.5,
                     row_order=['winter', 'spring', 'summer', 'fall'])
grid.map(sns.barplot, 'weekday', 'count', alpha=0.5,
         order=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday'])

# save figure
grid.savefig('figs/exercise_1_02_b.png', format='png')

Output —
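
Since Steps 6 and 7 repeat the same melt-plus-FacetGrid pattern, it can be wrapped in a small helper; this is our own convenience function, not something from the referenced book:

# reusable sketch of the melt + FacetGrid pattern used above
def plot_ride_distribution(data, x, row, row_order, x_order=None):
    long_df = data[[x, row, 'registered', 'casual']].melt(
        id_vars=[x, row], var_name='type', value_name='count')
    grid = sns.FacetGrid(long_df, row=row, col='type',
                         height=2.5, aspect=2.5, row_order=row_order)
    grid.map(sns.barplot, x, 'count', alpha=0.5, order=x_order)
    return grid

# reproduces the plot from 7.1
plot_ride_distribution(preprocessed_data, x='weekday', row='season',
                       row_order=['winter', 'spring', 'summer', 'fall'],
                       x_order=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                'Friday', 'Saturday', 'Sunday'])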

Reference:

Gururajan Govindan and Shubhangi Hora. The Data Analysis Workshop. Packt Publishing Ltd., 2020.
