1. Introduction
Bike share systems have emerged as a pivotal component of urban
mobility, offering a sustainable and convenient transportation
alternative to city dwellers and visitors. These systems function
through a network of strategically placed docking stations where
bicycles can be picked up and returned. However, the effectiveness of
bike sharing hinges significantly on the availability of bicycles and
open docks at these stations. Due to varying daily patterns, some
stations may end up with too many bikes while others may have none,
leading to inconvenience for users seeking to either pick up or drop off
bicycles.
To ensure that bikes are available where and when they are needed,
bike share systems must engage in active re-balancing. This process
involves redistributing bicycles across the network to maintain an
equilibrium between bike and dock availability. One effective strategy
for re-balancing involves utilizing a fleet of specially equipped
vehicles that can transport multiple bikes at once. This method allows
for rapid and efficient movement of bicycles from stations with a
surplus to those with a deficit. Fleet-based re-balancing relies on
predictive analytics to forecast the demand for bicycles at different
stations throughout the day. Predictive methods incorporate time lag
features, which allow the model to use data from past intervals to
forecast future demands. By incorporating these time lags, the
predictive system can accurately anticipate demand patterns, enabling
the fleet to be deployed proactively. This strategic deployment ensures
that each station maintains an optimal number of bikes and open docks,
thereby maximizing the system’s overall utility and user
satisfaction.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(knitr)
library(purrr)
# Load the readr package at the beginning of your script
library(readr)
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
2. Data Wrangling
Citi Bike, powered by Lyft, is a prominent bike-sharing service in
New York City, designed to offer an eco-friendly, efficient, and
convenient mode of transportation for both locals and tourists. To
enhance user understanding and encourage community engagement, Citi Bike
makes available comprehensive trip data, which can be downloaded for
further exploration. This dataset includes key details such as ride IDs,
bike types, trip start and end times, station names, and station IDs,
along with the geographical coordinates of each station. Additionally,
the data distinguishes between member and casual riders, providing
insights into different user patterns and behaviors. The dissemination
of this data empowers stakeholders to perform detailed analyses, helping
to address operational challenges such as the distribution and
availability of bikes to ensure that the bike-sharing system meets its
users’ needs effectively.
In this section, we will import data from the Citi Bike sharing
program, along with New York City demographic statistics and weather
records from JFK Airport. By integrating demographic and weather
information, we aim to enhance the accuracy of our predictions for
future trends in the NYC Citi Bike sharing usage.
Data Source: https://citibikenyc.com/system-data
data <- read_csv("C:/Users/25077/Desktop/MUSA 508_PPA/Assignment 5/data/202401-citibike-tripdata.csv/202401-citibike-tripdata_1.csv")
set.seed(123) # Setting a seed for reproducibility
data_nyc <- data %>%
sample_n(size = 400000)
dat_interval <- data_nyc %>%
mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
NYCCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "NY",
geometry = TRUE,
county=c("New York", "Kings", "Queens", "Richmond", "Bronx"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
# Extract geography factors
NYCTracts <-
NYCCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
# join census tract information to "dat2" data frame
dat_census <- st_join(dat_interval %>%
filter(is.na(start_lng) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lng) == FALSE &
is.na(end_lat) == FALSE) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
NYCTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(from_longitude = unlist(map(geometry, 1)),
from_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
st_join(., NYCTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
weather_Panel <-
riem_measures(station = "JFK", date_start = "2024-01-01", date_end = "2024-01-31") %>%
select(valid, tmpf, p01i, sknt) %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
Present the hourly variations in weather conditions, including
precipitation levels, wind speed, and temperature fluctuations, for New
York City.
grid.arrange(
ggplot(weather_Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather_Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather_Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - NYC JFK - Jan, 2024")

3. Explore the Data
1) Bike Share Trends by Dates.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. NYC, Jan, 2024",
x="Date",
y="Number of trips")+
plotTheme

2) Bike Share Trends by Time Periods (Rush hours,
AM/PM).
dat_census <- dat_census %>%
na.omit()
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station_name, time_of_day) %>%
tally()%>%
group_by(start_station_name, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. NYC, Jan, 2024",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme

The chart indicates that both Mid-Day and PM Rush periods experience
the highest frequency of bike trips, with PM Rush exhibiting not only
the greatest overall trip frequency but also the most frequent
occurrence of second trips per station. This suggests that during the PM
Rush, many stations consistently see at least two trips starting per
hour on average, which may reflect a robust pattern of usage as people
conclude their workday or engage in evening activities.
Overnight periods display a surprisingly elevated frequency of trips,
which, while not as high as the daytime peaks, is significant when
compared to the AM Rush period, which has the lowest frequency of trips.
This could be due to a variety of factors, such as night-shift workers,
late-night entertainment activities, or even early-morning fitness
routines.
3) Bike Share Trends by station.
ggplot(dat_census %>%
group_by(interval60, start_station_name) %>%
tally())+
geom_histogram(aes(n), binwidth = 5)+
labs(title="Bike share trips per hr by station. NYC, Jan. 2024",
x="Number of Stations",
y="Trip Counts")+
plotTheme

4) Bike Share Trends by day of the week.
# by week days
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in NYC, by day of the week, Jan. 2024",
x="Hour",
y="Trip Counts")+
plotTheme

# by workdays or weekends
ggplot(dat_census %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in NYC - weekend vs weekday, Jan. 2024",
x="Hour",
y="Trip Counts")+
plotTheme

The lines representing weekdays generally follow a similar pattern,
with pronounced peaks, suggesting consistent commuting behavior on these
days. In contrast, Saturday and Sunday show different patterns, with a
more gradual increase in trips throughout the day and not as sharp peaks
during traditional rush hours, which is typical for weekends when people
are not bound to standard work or school schedules.
ggplot() +
geom_sf(data = NYCTracts %>%
st_transform(crs = 4326), fill = "grey70", color = "grey80") +
geom_point(data = dat_census %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
tally(),
aes(x = from_longitude, y = from_latitude, color = log1p(n)),
fill = "transparent", size = 0.08) +
scale_colour_gradient(low = "cornsilk", high = "indianred2") +
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title = "Bike share trips per hr by station. NYC, Jan.2024") +
mapTheme +
facet_grid(weekend ~ time_of_day) +
theme(legend.position = "right")

During weekdays, a high concentration of bike trips occurs within key
business districts in Manhattan, indicating a prevalent commuter pattern
associated with work-related travel via bike sharing. However, this
pattern dissipates over the weekend; trip distribution becomes more
dispersed across various locations and the frequency diminishes,
reflecting a shift to more leisurely or diverse travel purposes that are
not centered around commuting.
4. Model Development
1) Create Space-Time Panel
# Calculate the number of rows in the resulting panel dataset
# This line calculates the number of rows in the resulting panel dataset by multiplying the number of unique values in the 'interval60' column with the number of unique values in the 'start_station_id' column in the 'dat_census' dataset.
length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))
study.panel <- expand.grid(interval60 = unique(dat_census$interval60),
start_station_id = unique(dat_census$start_station_id)) %>%
left_join(dat_census %>%
select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%
distinct() %>%
group_by(start_station_id) %>%
slice(1),
by = "start_station_id")
2. Create Panel
# Create a panel dataset for bike ride data
ride.panel <-
dat_census %>% # Start with the dat_census dataset
mutate(Trip_Counter = 1) %>% # Add a Trip_Counter variable, indicating one trip for each row
right_join(study.panel) %>% # Right join with the study.panel dataset to ensure all spatial-time combinations are included
group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>% # Group by spatial-time and station variables
summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>% # Summarize the Trip_Counter variable to calculate total trip count for each group
left_join(weather_Panel) %>% # Left join with the weather.Panel dataset to include weather information
ungroup() %>% # Remove grouping
filter(!is.na(start_station_id)) %>% # Filter out rows with missing start_station_id values
mutate(week = week(interval60), # Create a week variable indicating the week number of the year
dotw = wday(interval60, label = TRUE)) %>% # Create a dotw variable indicating the day of the week (label format)
filter(!is.na(Origin.Tract)) # Filter out rows with missing Origin.Tract values
ride.panel <-
left_join(ride.panel, NYCCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
3) Create time lags
# Arrange the ride.panel dataset based on 'start_station_id' and 'interval60'
# This step ensures that the dataset is ordered properly for subsequent calculations
ride.panel <-
ride.panel %>%
arrange(start_station_id, interval60) %>%
# Compute lag variables for the 'Trip_Count' variable at different time intervals
mutate(
lagHour = dplyr::lag(Trip_Count, 1), # Lag of 1 hour
lag2Hours = dplyr::lag(Trip_Count, 2), # Lag of 2 hours
lag3Hours = dplyr::lag(Trip_Count, 3), # Lag of 3 hours
lag4Hours = dplyr::lag(Trip_Count, 4), # Lag of 4 hours
lag12Hours = dplyr::lag(Trip_Count, 12), # Lag of 12 hours
lag1day = dplyr::lag(Trip_Count, 24) # Lag of 1 day (24 hours)
) %>%
# Identify holidays and create a binary variable indicating whether it's a holiday
mutate(
holiday = ifelse(yday(interval60) == 148, 1, 0) # Check if it's a holiday (148th day of the year)
) %>%
# Create a variable representing the day of the year
mutate(
day = yday(interval60) # Extract the day of the year from 'interval60'
) %>%
# Create additional variables related to holiday lag
mutate(
holidayLag = case_when(
dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay", # If the previous day was a holiday
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays", # If two days ago was a holiday
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays", # If three days ago was a holiday
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay", # If the next day will be a holiday
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays", # If two days ahead will be a holiday
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays" # If three days ahead will be a holiday
),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag) # Convert NA to 0 for non-holiday cases
)
# Assuming that ride.panel is your data frame and it has been processed as per the R code you provided
# Now, we will summarize the correlations and then create a kable
ride.panel %>%
as.data.frame() %>%
group_by(interval60) %>%
summarize_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour", "lag2Hours", "lag3Hours", "lag4Hours", "lag12Hours", "lag1day"))) %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count), 2)) %>%
# Now we will use kable to create a table
kable(format = "html", caption = "Correlation between lags and Trip Count") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
column_spec(1, bold = T) # Make the first column bold
Correlation between lags and Trip Count
|
Variable
|
correlation
|
|
lagHour
|
0.88
|
|
lag2Hours
|
0.67
|
|
lag3Hours
|
0.48
|
|
lag4Hours
|
0.31
|
|
lag12Hours
|
-0.45
|
|
lag1day
|
0.72
|
lagHour (0.83): Indicates a very strong positive correlation with the
number of trips from the previous hour. This suggests that bike usage in
the previous hour is a strong predictor of bike usage in the current
hour. lag2Hours (0.63): Also a positive correlation, though not as
strong as the 1-hour lag, showing that bike usage two hours prior still
has a significant impact on current bike usage. lag3Hours (0.44) and
lag4Hours (0.28): These values represent positive correlations but are
weaker, indicating that as the time lag increases, the predictive power
of previous bike usage diminishes. lag12Hours (-0.39): This negative
correlation suggests an inverse relationship, indicating that the number
of trips is generally opposite in trend 12 hours prior. This could
reflect a difference in bike usage patterns between morning and evening.
lag1day (0.66): A relatively strong positive correlation indicates that
the number of trips is similar from one day to the next, possibly
highlighting consistent daily patterns of bike usage.
4. Run Models
# Filter the ride.panel data to create subsets for training and testing
ride.panel.train.subset <- ride.panel %>% filter(week >= 3)
ride.panel.test.subset <- ride.panel %>% filter(week < 3)
# Randomly select 40,000 rows from the training subset
ride.Train <- ride.panel.train.subset %>% sample_n(40000)
# Now, select up to 40,000 rows for the test set. If there are less than 40,000, select all.
n_test <- min(40000, nrow(ride.panel.test.subset))
ride.Test <- ride.panel.test.subset %>% sample_n(n_test)
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station_name + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)
4. Model Evaluation
1) Prediction
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
kable_output <- week_predictions %>%
kable(format = "html", caption = "Weekly Predictions and Error Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE)
2) Examine Error Metrics for Accuracy
palette5 <- c("cornsilk", "#f8e0c8", "#f1c1b1", "#ea929a", "indianred3")
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme

In the provided chart, it appears that Model A, which incorporates
time of day and temperature variables, results in the highest Mean
Absolute Errors (MAE), suggesting that it may not be capturing enough of
the factors that influence trip counts. Models D and E, which are more
complex and include time lags and holiday effects in addition to spatial
and weather variables, show relatively lower MAEs. This indicates that
the integration of additional time-related factors generally contributes
to a decrease in prediction error, underscoring the importance of
considering both immediate and historical temporal patterns in trip
count prediction.
Interestingly, Model C stands out with an abnormally higher error
than might be expected given its combination of time, spatial, and
weather variables. This could imply a need to re-evaluate the specific
variables included or the way they interact within the model.
Notably, Models D and E perform similarly across the weeks,
suggesting that the inclusion of holiday lags does not significantly
enhance model performance beyond the time lags already accounted for in
Model D. This could be a reflection of the non-linear nature of holiday
impacts or an indication of overfitting, where the model complexity
begins to capture noise rather than the true underlying
relationships.
From these observations, it becomes clear that while adding more
variables can be beneficial, there is a nuanced balance between model
complexity and predictive performance. Future model improvements might
involve refining the selection of lag variables, re-assessing the
inclusion of holiday effects, or exploring non-linear modeling
techniques that might better capture the complexities of trip count
variations.
3) Observed and Prediction
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id)) %>%
dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour = Variable)) +
geom_line(size = 1.1) +
scale_color_manual(values = c("Observed" = "#f1c1b1", "Prediction" = "indianred3")) + # Specify your colors here
facet_wrap(~Regression, ncol = 1) +
labs(title = "Predicted/Observed bike share time series",
subtitle = "Chicago; A test set of 2 weeks",
x = "Hour",
y = "Station Trips") +
plotTheme

The time series plots for different regression models over a two-week
period illustrate varying degrees of predictive accuracy in estimating
bike share usage. Model A, focusing solely on temporal and temperature
variables, demonstrates moderate predictive capability but shows notable
errors, particularly at peak usage times. Model B, which incorporates
spatial factors, captures some patterns more accurately than Model A,
yet still exhibits deviations due to the absence of temporal details.
Model C, which combines both temporal and spatial variables, along with
weather conditions, closely follows the observed data trends, although
it occasionally misses peak values. Model D adds temporal lags and
presents a better fit, closely mirroring the observed trips, signifying
the relevance of historical data in forecasting. Lastly, Model E, which
extends Model D with holiday lags, offers no significant improvement,
indicating that additional complexity through holiday lags may not
contribute meaningfully to the model’s performance in this context.
Overall, while the inclusion of time lags enhances model accuracy, the
benefit of integrating holiday-related variables appears limited.
4) Errors by Station
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
group_by(start_station_id, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = NYCCensus, color = "grey80", fill = "grey70")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size=0.3)+
scale_colour_gradient(low = "cornsilk", high = "indianred2")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme

Based on the observations, it is evident that the East Village area
exhibits the highest prediction errors, indicating that the model’s
forecasting is less accurate in this locale. Furthermore, the areas
encompassing Greenwich Village and the vicinities bordering Central Park
also demonstrate relatively higher MAE values. These discrepancies
suggest that the model may not fully account for the unique factors or
behaviors influencing bike share usage patterns in these specific urban
regions, which are possibly characterized by their own distinct
demographic, cultural, or infrastructural attributes.
5) Space-Time Error Evaluation
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "indianred2")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme

During weekdays, there seems to be a pattern of overprediction during
the AM rush and underprediction during the PM rush. Midday and overnight
predictions appear to be more evenly distributed around the diagonal,
suggesting better predictive accuracy during these times. On weekends,
the scatter plots show a wide dispersion of points, indicating a greater
degree of variance in the predictive accuracy. This could be due to more
variable trip patterns during weekends that are not as well-captured by
the model.
5) Space-Time Error Evaluation by Station
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.) +
geom_sf(data = NYCCensus, color = "grey70", fill = "grey80") +
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
size = 0.5) +
scale_colour_gradient(low = "cornsilk", high = "indianred2") +
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude)) +
facet_grid(weekend ~ time_of_day) +
labs(title="Mean Absolute Errors, Test Set") +
mapTheme

5) Socio-Economic Factors
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = NYCCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "indianred")+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme

Median Income (Med_Inc) vs. MAE: The scatter plot does not
show a clear correlation between median income and MAE. However, there’s
a slight concentration of data points at the lower end of the median
income scale with lower MAE values, which might indicate that the model
predicts trips more accurately in areas with lower median incomes.
Percentage Taking Public Transportation vs. MAE: This plot
appears to show a slight upward trend, where areas with a higher
percentage of people using public transport tend to have higher MAE
values in trip prediction. This could suggest that the model struggles
to predict trips accurately in areas with higher public transportation
usage.
Percentage White vs. MAE: The scatter plot for the
percentage of the white population against MAE seems to show no strong
correlation. There’s a wide dispersion of MAE values across all
percentages, indicating that the racial composition, represented by the
percentage of white residents, does not have a clear impact on the
predictive error of the model.
5. Interpretation
The algorithm’s efficacy, particularly the models utilizing various
time lags, underscores its potential for optimizing bike fleet
transportation for re-balancing efforts. The inclusion of time lags
allows the algorithm to account for patterns and trends over time, which
is crucial for anticipating future demand and planning logistics.
For instance, the D and E models, which include time lag variables,
have shown a relatively lower Mean Absolute Error (MAE) in predictions.
This suggests they are more adept at capturing the temporal dynamics of
bike usage. For re-balancing operations, this means the algorithm can
predict not just the immediate demand for bikes but also how demand
evolves throughout the day or week. With this information, a
bike-sharing service can proactively move bikes from low-demand to
high-demand areas in anticipation of usage spikes, rather than reacting
to shortages or surpluses as they happen.
The model’s capability to reflect the influence of time on usage
patterns enables a more nuanced approach to fleet management. For
example, during weekday AM rush hours, where overprediction is common,
the model’s insights could prevent the unnecessary allocation of too
many bikes to certain areas, thereby optimizing the use of
transportation resources. Conversely, during PM rush hours, where
underprediction occurs, the model can signal the need to allocate
additional bikes ahead of time to meet the surge in demand.
By employing a model with time lags, the transportation fleet used
for re-balancing can operate more efficiently. It allows for the
scheduling of fleet movements during off-peak hours, which can reduce
operational costs and minimize the impact on traffic. It also ensures
that the re-balancing efforts are less disruptive to the service and
more aligned with the predicted user demand, thus enhancing the overall
user experience.
In conclusion, the time lag models demonstrate considerable potential
for improving bike fleet re-balancing operations. The algorithm can
provide predictive insights that allow for a more strategic deployment
of the re-balancing fleet, potentially leading to cost savings, better
resource allocation, and an overall increase in the effectiveness of the
bike-sharing system.
---
title: "Space-Time Prediction of Bike Share Demand in NYC"
author: "Ziyi Guo"
date: "Apr 28, 2024"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: "hide"
    code_download: true
---

## 1. Introduction

Bike share systems have emerged as a pivotal component of urban mobility, offering a sustainable and convenient transportation alternative to city dwellers and visitors. These systems function through a network of strategically placed docking stations where bicycles can be picked up and returned. However, the effectiveness of bike sharing hinges significantly on the availability of bicycles and open docks at these stations. Due to varying daily patterns, some stations may end up with too many bikes while others may have none, leading to inconvenience for users seeking to either pick up or drop off bicycles.

To ensure that bikes are available where and when they are needed, bike share systems must engage in active re-balancing. This process involves redistributing bicycles across the network to maintain an equilibrium between bike and dock availability. One effective strategy for re-balancing involves utilizing a fleet of specially equipped vehicles that can transport multiple bikes at once. This method allows for rapid and efficient movement of bicycles from stations with a surplus to those with a deficit. Fleet-based re-balancing relies on predictive analytics to forecast the demand for bicycles at different stations throughout the day. Predictive methods incorporate time lag features, which allow the model to use data from past intervals to forecast future demands. By incorporating these time lags, the predictive system can accurately anticipate demand patterns, enabling the fleet to be deployed proactively. This strategic deployment ensures that each station maintains an optimal number of bikes and open docks, thereby maximizing the system's overall utility and user satisfaction.

```{r setup_13, warning = FALSE, message = FALSE, results='hide'}

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(knitr)
library(purrr)

# Load the readr package at the beginning of your script
library(readr)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
```


## 2. Data Wrangling

Citi Bike, powered by Lyft, is a prominent bike-sharing service in New York City, designed to offer an eco-friendly, efficient, and convenient mode of transportation for both locals and tourists. To enhance user understanding and encourage community engagement, Citi Bike makes available comprehensive trip data, which can be downloaded for further exploration. This dataset includes key details such as ride IDs, bike types, trip start and end times, station names, and station IDs, along with the geographical coordinates of each station. Additionally, the data distinguishes between member and casual riders, providing insights into different user patterns and behaviors. The dissemination of this data empowers stakeholders to perform detailed analyses, helping to address operational challenges such as the distribution and availability of bikes to ensure that the bike-sharing system meets its users' needs effectively.

In this section, we will import data from the Citi Bike sharing program, along with New York City demographic statistics and weather records from JFK Airport. By integrating demographic and weather information, we aim to enhance the accuracy of our predictions for future trends in the NYC Citi Bike sharing usage.

Data Source:
https://citibikenyc.com/system-data

```{r dat_prep, warning = FALSE, message = FALSE, results='hide' }

data <- read_csv("C:/Users/25077/Desktop/MUSA 508_PPA/Assignment 5/data/202401-citibike-tripdata.csv/202401-citibike-tripdata_1.csv") 

set.seed(123) # Setting a seed for reproducibility
data_nyc <- data %>% 
  sample_n(size = 400000)


dat_interval <- data_nyc %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))


NYCCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "NY", 
          geometry = TRUE, 
          county=c("New York", "Kings", "Queens", "Richmond", "Bronx"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

# Extract geography factors
NYCTracts <- 
  NYCCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf


# join census tract information to "dat2" data frame
dat_census <- st_join(dat_interval %>% 
          filter(is.na(start_lng) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lng) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        NYCTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., NYCTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)


weather_Panel <- 
  riem_measures(station = "JFK", date_start = "2024-01-01", date_end = "2024-01-31") %>%
  select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

```

Present the hourly variations in weather conditions, including precipitation levels, wind speed, and temperature fluctuations, for New York City.

```{r plot_weather, warning = FALSE, message = FALSE, results='hide'}
grid.arrange(
  ggplot(weather_Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - NYC JFK - Jan, 2024")
```

## 3. Explore the Data
 
*1) Bike Share Trends by Dates. *

```{r trip_timeseries, warning = FALSE, message = FALSE, results='hide' }
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. NYC, Jan, 2024",
       x="Date", 
       y="Number of trips")+
  plotTheme
```

*2) Bike Share Trends by Time Periods (Rush hours, AM/PM).*

```{r mean_trips_hist, warning = FALSE, message = FALSE, results='hide'}

dat_census <- dat_census %>%
  na.omit() 

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. NYC, Jan, 2024",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme
```

The chart indicates that both Mid-Day and PM Rush periods experience the highest frequency of bike trips, with PM Rush exhibiting not only the greatest overall trip frequency but also the most frequent occurrence of second trips per station. This suggests that during the PM Rush, many stations consistently see at least two trips starting per hour on average, which may reflect a robust pattern of usage as people conclude their workday or engage in evening activities.

Overnight periods display a surprisingly elevated frequency of trips, which, while not as high as the daytime peaks, is significant when compared to the AM Rush period, which has the lowest frequency of trips. This could be due to a variety of factors, such as night-shift workers, late-night entertainment activities, or even early-morning fitness routines.


*3) Bike Share Trends by station.*

```{r trips_station_dotw , warning = FALSE, message = FALSE, results='hide'}
ggplot(dat_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. NYC, Jan. 2024",
       x="Number of Stations", 
       y="Trip Counts")+
  plotTheme
```

*4) Bike Share Trends by day of the week.*

```{r trips_hour_dotw, warning = FALSE, message = FALSE, results='hide' }

# by week days
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in NYC, by day of the week, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

# by workdays or weekends
ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in NYC - weekend vs weekday, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme
```

The lines representing weekdays generally follow a similar pattern, with pronounced peaks, suggesting consistent commuting behavior on these days. In contrast, Saturday and Sunday show different patterns, with a more gradual increase in trips throughout the day and not as sharp peaks during traditional rush hours, which is typical for weekends when people are not bound to standard work or school schedules.


```{r origin_map,fig.width=10, fig.height=12, warning = FALSE, message = FALSE, results='hide' }

ggplot() +
  geom_sf(data = NYCTracts %>%
            st_transform(crs = 4326), fill = "grey70", color = "grey80") +
  geom_point(data = dat_census %>%
                   mutate(hour = hour(started_at),
                          weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                          time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                                  hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                                  hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                                  hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
                   group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
                   tally(),
             aes(x = from_longitude, y = from_latitude, color = log1p(n)),
             fill = "transparent", size = 0.08) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title = "Bike share trips per hr by station. NYC, Jan.2024") +
  mapTheme +
  facet_grid(weekend ~ time_of_day) +
  theme(legend.position = "right")

```

During weekdays, a high concentration of bike trips occurs within key business districts in Manhattan, indicating a prevalent commuter pattern associated with work-related travel via bike sharing. However, this pattern dissipates over the weekend; trip distribution becomes more dispersed across various locations and the frequency diminishes, reflecting a shift to more leisurely or diverse travel purposes that are not centered around commuting.

## 4. Model Development

*1) Create Space-Time Panel*

```{r panel_length_check, warning = FALSE, message = FALSE, results='hide'}

# Calculate the number of rows in the resulting panel dataset
# This line calculates the number of rows in the resulting panel dataset by multiplying the number of unique values in the 'interval60' column with the number of unique values in the 'start_station_id' column in the 'dat_census' dataset.

length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))


study.panel <- expand.grid(interval60 = unique(dat_census$interval60),
                           start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(dat_census %>% 
            select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>% 
            distinct() %>% 
            group_by(start_station_id) %>% 
            slice(1),
            by = "start_station_id")

```

*2. Create Panel*

```{r create_panel, warning = FALSE, message = FALSE, results='hide'}
# Create a panel dataset for bike ride data
ride.panel <- 
  dat_census %>%  # Start with the dat_census dataset
  mutate(Trip_Counter = 1) %>%  # Add a Trip_Counter variable, indicating one trip for each row
  right_join(study.panel) %>%  # Right join with the study.panel dataset to ensure all spatial-time combinations are included
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%  # Group by spatial-time and station variables
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>%  # Summarize the Trip_Counter variable to calculate total trip count for each group
  left_join(weather_Panel) %>%  # Left join with the weather.Panel dataset to include weather information
  ungroup() %>%  # Remove grouping
  filter(!is.na(start_station_id)) %>%  # Filter out rows with missing start_station_id values
  mutate(week = week(interval60),  # Create a week variable indicating the week number of the year
         dotw = wday(interval60, label = TRUE)) %>%  # Create a dotw variable indicating the day of the week (label format)
  filter(!is.na(Origin.Tract))  # Filter out rows with missing Origin.Tract values


ride.panel <- 
  left_join(ride.panel, NYCCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
```

*3) Create time lags*


```{r time_lags, warning = FALSE, message = FALSE, results='hide'}

# Arrange the ride.panel dataset based on 'start_station_id' and 'interval60'
# This step ensures that the dataset is ordered properly for subsequent calculations
ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>%

# Compute lag variables for the 'Trip_Count' variable at different time intervals
  mutate(
    lagHour = dplyr::lag(Trip_Count, 1),  # Lag of 1 hour
    lag2Hours = dplyr::lag(Trip_Count, 2),  # Lag of 2 hours
    lag3Hours = dplyr::lag(Trip_Count, 3),  # Lag of 3 hours
    lag4Hours = dplyr::lag(Trip_Count, 4),  # Lag of 4 hours
    lag12Hours = dplyr::lag(Trip_Count, 12),  # Lag of 12 hours
    lag1day = dplyr::lag(Trip_Count, 24)  # Lag of 1 day (24 hours)
  ) %>%

# Identify holidays and create a binary variable indicating whether it's a holiday
  mutate(
    holiday = ifelse(yday(interval60) == 148, 1, 0)  # Check if it's a holiday (148th day of the year)
  ) %>%

# Create a variable representing the day of the year
  mutate(
    day = yday(interval60)  # Extract the day of the year from 'interval60'
  ) %>%

# Create additional variables related to holiday lag
  mutate(
    holidayLag = case_when(
      dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",  # If the previous day was a holiday
      dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",  # If two days ago was a holiday
      dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",  # If three days ago was a holiday
      dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",  # If the next day will be a holiday
      dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",  # If two days ahead will be a holiday
      dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"  # If three days ahead will be a holiday
    ),
    holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag)  # Convert NA to 0 for non-holiday cases
  )


```

```{r evaluate_lags, warning = FALSE, message = FALSE}

# Assuming that ride.panel is your data frame and it has been processed as per the R code you provided
# Now, we will summarize the correlations and then create a kable

ride.panel %>%
  as.data.frame() %>%
  group_by(interval60) %>%
  summarize_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour", "lag2Hours", "lag3Hours", "lag4Hours", "lag12Hours", "lag1day"))) %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count), 2)) %>%
  # Now we will use kable to create a table
  kable(format = "html", caption = "Correlation between lags and Trip Count") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  column_spec(1, bold = T) # Make the first column bold

```

lagHour (0.83): Indicates a very strong positive correlation with the number of trips from the previous hour. This suggests that bike usage in the previous hour is a strong predictor of bike usage in the current hour.
lag2Hours (0.63): Also a positive correlation, though not as strong as the 1-hour lag, showing that bike usage two hours prior still has a significant impact on current bike usage.
lag3Hours (0.44) and lag4Hours (0.28): These values represent positive correlations but are weaker, indicating that as the time lag increases, the predictive power of previous bike usage diminishes.
lag12Hours (-0.39): This negative correlation suggests an inverse relationship, indicating that the number of trips is generally opposite in trend 12 hours prior. This could reflect a difference in bike usage patterns between morning and evening.
lag1day (0.66): A relatively strong positive correlation indicates that the number of trips is similar from one day to the next, possibly highlighting consistent daily patterns of bike usage.


*4. Run Models*


```{r train_test, warning = FALSE, message = FALSE, results='hide' }


# Filter the ride.panel data to create subsets for training and testing
ride.panel.train.subset <- ride.panel %>% filter(week >= 3)
ride.panel.test.subset <- ride.panel %>% filter(week < 3)

# Randomly select 40,000 rows from the training subset
ride.Train <- ride.panel.train.subset %>% sample_n(40000)

# Now, select up to 40,000 rows for the test set. If there are less than 40,000, select all.
n_test <- min(40000, nrow(ride.panel.test.subset))
ride.Test <- ride.panel.test.subset %>% sample_n(n_test)



reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)
```

## 4. Model Evaluation

*1) Prediction*

```{r nest_data, warning = FALSE, message = FALSE}
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

kable_output <- week_predictions %>%
  kable(format = "html", caption = "Weekly Predictions and Error Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
```

*2) Examine Error Metrics for Accuracy*


```{r plot_errors_by_model, warning = FALSE, message = FALSE, results='hide' }

palette5 <- c("cornsilk", "#f8e0c8", "#f1c1b1", "#ea929a", "indianred3")

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme


```

In the provided chart, it appears that Model A, which incorporates time of day and temperature variables, results in the highest Mean Absolute Errors (MAE), suggesting that it may not be capturing enough of the factors that influence trip counts. Models D and E, which are more complex and include time lags and holiday effects in addition to spatial and weather variables, show relatively lower MAEs. This indicates that the integration of additional time-related factors generally contributes to a decrease in prediction error, underscoring the importance of considering both immediate and historical temporal patterns in trip count prediction.

Interestingly, Model C stands out with an abnormally higher error than might be expected given its combination of time, spatial, and weather variables. This could imply a need to re-evaluate the specific variables included or the way they interact within the model.

Notably, Models D and E perform similarly across the weeks, suggesting that the inclusion of holiday lags does not significantly enhance model performance beyond the time lags already accounted for in Model D. This could be a reflection of the non-linear nature of holiday impacts or an indication of overfitting, where the model complexity begins to capture noise rather than the true underlying relationships.

From these observations, it becomes clear that while adding more variables can be beneficial, there is a nuanced balance between model complexity and predictive performance. Future model improvements might involve refining the selection of lag variables, re-assessing the inclusion of holiday effects, or exploring non-linear modeling techniques that might better capture the complexities of trip count variations.

*3) Observed and Prediction*

```{r error_vs_actual_timeseries, warning = FALSE, message = FALSE, results='hide'}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour = Variable)) +
  geom_line(size = 1.1) +
  scale_color_manual(values = c("Observed" = "#f1c1b1", "Prediction" = "indianred3")) + # Specify your colors here
  facet_wrap(~Regression, ncol = 1) +
  labs(title = "Predicted/Observed bike share time series", 
       subtitle = "Chicago; A test set of 2 weeks", 
       x = "Hour", 
       y = "Station Trips") +
  plotTheme

```

The time series plots for different regression models over a two-week period illustrate varying degrees of predictive accuracy in estimating bike share usage. Model A, focusing solely on temporal and temperature variables, demonstrates moderate predictive capability but shows notable errors, particularly at peak usage times. Model B, which incorporates spatial factors, captures some patterns more accurately than Model A, yet still exhibits deviations due to the absence of temporal details. Model C, which combines both temporal and spatial variables, along with weather conditions, closely follows the observed data trends, although it occasionally misses peak values. Model D adds temporal lags and presents a better fit, closely mirroring the observed trips, signifying the relevance of historical data in forecasting. Lastly, Model E, which extends Model D with holiday lags, offers no significant improvement, indicating that additional complexity through holiday lags may not contribute meaningfully to the model's performance in this context. Overall, while the inclusion of time lags enhances model accuracy, the benefit of integrating holiday-related variables appears limited.

*4) Errors by Station*

```{r errors_by_station, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = NYCCensus, color = "grey80", fill = "grey70")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size=0.3)+
  scale_colour_gradient(low = "cornsilk", high = "indianred2")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme
```

Based on the observations, it is evident that the East Village area exhibits the highest prediction errors, indicating that the model's forecasting is less accurate in this locale. Furthermore, the areas encompassing Greenwich Village and the vicinities bordering Central Park also demonstrate relatively higher MAE values. These discrepancies suggest that the model may not fully account for the unique factors or behaviors influencing bike share usage patterns in these specific urban regions, which are possibly characterized by their own distinct demographic, cultural, or infrastructural attributes.

*5) Space-Time Error Evaluation*


```{r obs_pred_all, warning = FALSE, message = FALSE, results='hide'}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "indianred2")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme
```

During weekdays, there seems to be a pattern of overprediction during the AM rush and underprediction during the PM rush. Midday and overnight predictions appear to be more evenly distributed around the diagonal, suggesting better predictive accuracy during these times.
On weekends, the scatter plots show a wide dispersion of points, indicating a greater degree of variance in the predictive accuracy. This could be due to more variable trip patterns during weekends that are not as well-captured by the model.

*5) Space-Time Error Evaluation by Station*

```{r station_summary,fig.width=10, fig.height=12, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.) +
  geom_sf(data = NYCCensus, color = "grey70", fill = "grey80") +
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             size = 0.5) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude)) +
  facet_grid(weekend ~ time_of_day) +
  labs(title="Mean Absolute Errors, Test Set") +
  mapTheme

  
```

*5) Socio-Economic Factors*

```{r station_summary2, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = NYCCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "indianred")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme
  
```

*Median Income (Med_Inc) vs. MAE:*
The scatter plot does not show a clear correlation between median income and MAE. However, there's a slight concentration of data points at the lower end of the median income scale with lower MAE values, which might indicate that the model predicts trips more accurately in areas with lower median incomes.

*Percentage Taking Public Transportation vs. MAE:*
This plot appears to show a slight upward trend, where areas with a higher percentage of people using public transport tend to have higher MAE values in trip prediction. This could suggest that the model struggles to predict trips accurately in areas with higher public transportation usage.

*Percentage White vs. MAE:*
The scatter plot for the percentage of the white population against MAE seems to show no strong correlation. There's a wide dispersion of MAE values across all percentages, indicating that the racial composition, represented by the percentage of white residents, does not have a clear impact on the predictive error of the model.


## 5. Interpretation

The algorithm's efficacy, particularly the models utilizing various time lags, underscores its potential for optimizing bike fleet transportation for re-balancing efforts. The inclusion of time lags allows the algorithm to account for patterns and trends over time, which is crucial for anticipating future demand and planning logistics.

For instance, the D and E models, which include time lag variables, have shown a relatively lower Mean Absolute Error (MAE) in predictions. This suggests they are more adept at capturing the temporal dynamics of bike usage. For re-balancing operations, this means the algorithm can predict not just the immediate demand for bikes but also how demand evolves throughout the day or week. With this information, a bike-sharing service can proactively move bikes from low-demand to high-demand areas in anticipation of usage spikes, rather than reacting to shortages or surpluses as they happen.

The model's capability to reflect the influence of time on usage patterns enables a more nuanced approach to fleet management. For example, during weekday AM rush hours, where overprediction is common, the model's insights could prevent the unnecessary allocation of too many bikes to certain areas, thereby optimizing the use of transportation resources. Conversely, during PM rush hours, where underprediction occurs, the model can signal the need to allocate additional bikes ahead of time to meet the surge in demand.

By employing a model with time lags, the transportation fleet used for re-balancing can operate more efficiently. It allows for the scheduling of fleet movements during off-peak hours, which can reduce operational costs and minimize the impact on traffic. It also ensures that the re-balancing efforts are less disruptive to the service and more aligned with the predicted user demand, thus enhancing the overall user experience.

In conclusion, the time lag models demonstrate considerable potential for improving bike fleet re-balancing operations. The algorithm can provide predictive insights that allow for a more strategic deployment of the re-balancing fleet, potentially leading to cost savings, better resource allocation, and an overall increase in the effectiveness of the bike-sharing system.

