1. Introduction

Bike share systems have emerged as a pivotal component of urban mobility, offering a sustainable and convenient transportation alternative to city dwellers and visitors. These systems function through a network of strategically placed docking stations where bicycles can be picked up and returned. However, the effectiveness of bike sharing hinges significantly on the availability of bicycles and open docks at these stations. Due to varying daily patterns, some stations may end up with too many bikes while others may have none, leading to inconvenience for users seeking to either pick up or drop off bicycles.

To ensure that bikes are available where and when they are needed, bike share systems must engage in active re-balancing. This process involves redistributing bicycles across the network to maintain an equilibrium between bike and dock availability. One effective strategy for re-balancing involves utilizing a fleet of specially equipped vehicles that can transport multiple bikes at once. This method allows for rapid and efficient movement of bicycles from stations with a surplus to those with a deficit. Fleet-based re-balancing relies on predictive analytics to forecast the demand for bicycles at different stations throughout the day. Predictive methods incorporate time lag features, which allow the model to use data from past intervals to forecast future demands. By incorporating these time lags, the predictive system can accurately anticipate demand patterns, enabling the fleet to be deployed proactively. This strategic deployment ensures that each station maintains an optimal number of bikes and open docks, thereby maximizing the system’s overall utility and user satisfaction.

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(knitr)
library(purrr)

# Load the readr package at the beginning of your script
library(readr)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)

2. Data Wrangling

Citi Bike, powered by Lyft, is a prominent bike-sharing service in New York City, designed to offer an eco-friendly, efficient, and convenient mode of transportation for both locals and tourists. To enhance user understanding and encourage community engagement, Citi Bike makes available comprehensive trip data, which can be downloaded for further exploration. This dataset includes key details such as ride IDs, bike types, trip start and end times, station names, and station IDs, along with the geographical coordinates of each station. Additionally, the data distinguishes between member and casual riders, providing insights into different user patterns and behaviors. The dissemination of this data empowers stakeholders to perform detailed analyses, helping to address operational challenges such as the distribution and availability of bikes to ensure that the bike-sharing system meets its users’ needs effectively.

In this section, we will import data from the Citi Bike sharing program, along with New York City demographic statistics and weather records from JFK Airport. By integrating demographic and weather information, we aim to enhance the accuracy of our predictions for future trends in the NYC Citi Bike sharing usage.

Data Source: https://citibikenyc.com/system-data

data <- read_csv("C:/Users/25077/Desktop/MUSA 508_PPA/Assignment 5/data/202401-citibike-tripdata.csv/202401-citibike-tripdata_1.csv") 

set.seed(123) # Setting a seed for reproducibility
data_nyc <- data %>% 
  sample_n(size = 400000)


dat_interval <- data_nyc %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))


NYCCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "NY", 
          geometry = TRUE, 
          county=c("New York", "Kings", "Queens", "Richmond", "Bronx"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

# Extract geography factors
NYCTracts <- 
  NYCCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf


# join census tract information to "dat2" data frame
dat_census <- st_join(dat_interval %>% 
          filter(is.na(start_lng) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lng) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        NYCTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., NYCTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)


weather_Panel <- 
  riem_measures(station = "JFK", date_start = "2024-01-01", date_end = "2024-01-31") %>%
  select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

Present the hourly variations in weather conditions, including precipitation levels, wind speed, and temperature fluctuations, for New York City.

grid.arrange(
  ggplot(weather_Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - NYC JFK - Jan, 2024")

3. Explore the Data

1) Bike Share Trends by Dates.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. NYC, Jan, 2024",
       x="Date", 
       y="Number of trips")+
  plotTheme

2) Bike Share Trends by Time Periods (Rush hours, AM/PM).

dat_census <- dat_census %>%
  na.omit() 

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. NYC, Jan, 2024",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

The chart indicates that both Mid-Day and PM Rush periods experience the highest frequency of bike trips, with PM Rush exhibiting not only the greatest overall trip frequency but also the most frequent occurrence of second trips per station. This suggests that during the PM Rush, many stations consistently see at least two trips starting per hour on average, which may reflect a robust pattern of usage as people conclude their workday or engage in evening activities.

Overnight periods display a surprisingly elevated frequency of trips, which, while not as high as the daytime peaks, is significant when compared to the AM Rush period, which has the lowest frequency of trips. This could be due to a variety of factors, such as night-shift workers, late-night entertainment activities, or even early-morning fitness routines.

3) Bike Share Trends by station.

ggplot(dat_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. NYC, Jan. 2024",
       x="Number of Stations", 
       y="Trip Counts")+
  plotTheme

4) Bike Share Trends by day of the week.

# by week days
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in NYC, by day of the week, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

# by workdays or weekends
ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in NYC - weekend vs weekday, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

The lines representing weekdays generally follow a similar pattern, with pronounced peaks, suggesting consistent commuting behavior on these days. In contrast, Saturday and Sunday show different patterns, with a more gradual increase in trips throughout the day and not as sharp peaks during traditional rush hours, which is typical for weekends when people are not bound to standard work or school schedules.

ggplot() +
  geom_sf(data = NYCTracts %>%
            st_transform(crs = 4326), fill = "grey70", color = "grey80") +
  geom_point(data = dat_census %>%
                   mutate(hour = hour(started_at),
                          weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                          time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                                  hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                                  hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                                  hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
                   group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
                   tally(),
             aes(x = from_longitude, y = from_latitude, color = log1p(n)),
             fill = "transparent", size = 0.08) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title = "Bike share trips per hr by station. NYC, Jan.2024") +
  mapTheme +
  facet_grid(weekend ~ time_of_day) +
  theme(legend.position = "right")

During weekdays, a high concentration of bike trips occurs within key business districts in Manhattan, indicating a prevalent commuter pattern associated with work-related travel via bike sharing. However, this pattern dissipates over the weekend; trip distribution becomes more dispersed across various locations and the frequency diminishes, reflecting a shift to more leisurely or diverse travel purposes that are not centered around commuting.

4. Model Development

1) Create Space-Time Panel

# Calculate the number of rows in the resulting panel dataset
# This line calculates the number of rows in the resulting panel dataset by multiplying the number of unique values in the 'interval60' column with the number of unique values in the 'start_station_id' column in the 'dat_census' dataset.

length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))


study.panel <- expand.grid(interval60 = unique(dat_census$interval60),
                           start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(dat_census %>% 
            select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>% 
            distinct() %>% 
            group_by(start_station_id) %>% 
            slice(1),
            by = "start_station_id")

2. Create Panel

# Create a panel dataset for bike ride data
ride.panel <- 
  dat_census %>%  # Start with the dat_census dataset
  mutate(Trip_Counter = 1) %>%  # Add a Trip_Counter variable, indicating one trip for each row
  right_join(study.panel) %>%  # Right join with the study.panel dataset to ensure all spatial-time combinations are included
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%  # Group by spatial-time and station variables
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>%  # Summarize the Trip_Counter variable to calculate total trip count for each group
  left_join(weather_Panel) %>%  # Left join with the weather.Panel dataset to include weather information
  ungroup() %>%  # Remove grouping
  filter(!is.na(start_station_id)) %>%  # Filter out rows with missing start_station_id values
  mutate(week = week(interval60),  # Create a week variable indicating the week number of the year
         dotw = wday(interval60, label = TRUE)) %>%  # Create a dotw variable indicating the day of the week (label format)
  filter(!is.na(Origin.Tract))  # Filter out rows with missing Origin.Tract values


ride.panel <- 
  left_join(ride.panel, NYCCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3) Create time lags

# Arrange the ride.panel dataset based on 'start_station_id' and 'interval60'
# This step ensures that the dataset is ordered properly for subsequent calculations
ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>%

# Compute lag variables for the 'Trip_Count' variable at different time intervals
  mutate(
    lagHour = dplyr::lag(Trip_Count, 1),  # Lag of 1 hour
    lag2Hours = dplyr::lag(Trip_Count, 2),  # Lag of 2 hours
    lag3Hours = dplyr::lag(Trip_Count, 3),  # Lag of 3 hours
    lag4Hours = dplyr::lag(Trip_Count, 4),  # Lag of 4 hours
    lag12Hours = dplyr::lag(Trip_Count, 12),  # Lag of 12 hours
    lag1day = dplyr::lag(Trip_Count, 24)  # Lag of 1 day (24 hours)
  ) %>%

# Identify holidays and create a binary variable indicating whether it's a holiday
  mutate(
    holiday = ifelse(yday(interval60) == 148, 1, 0)  # Check if it's a holiday (148th day of the year)
  ) %>%

# Create a variable representing the day of the year
  mutate(
    day = yday(interval60)  # Extract the day of the year from 'interval60'
  ) %>%

# Create additional variables related to holiday lag
  mutate(
    holidayLag = case_when(
      dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",  # If the previous day was a holiday
      dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",  # If two days ago was a holiday
      dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",  # If three days ago was a holiday
      dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",  # If the next day will be a holiday
      dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",  # If two days ahead will be a holiday
      dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"  # If three days ahead will be a holiday
    ),
    holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag)  # Convert NA to 0 for non-holiday cases
  )
# Assuming that ride.panel is your data frame and it has been processed as per the R code you provided
# Now, we will summarize the correlations and then create a kable

ride.panel %>%
  as.data.frame() %>%
  group_by(interval60) %>%
  summarize_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour", "lag2Hours", "lag3Hours", "lag4Hours", "lag12Hours", "lag1day"))) %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count), 2)) %>%
  # Now we will use kable to create a table
  kable(format = "html", caption = "Correlation between lags and Trip Count") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  column_spec(1, bold = T) # Make the first column bold
Correlation between lags and Trip Count
Variable correlation
lagHour 0.88
lag2Hours 0.67
lag3Hours 0.48
lag4Hours 0.31
lag12Hours -0.45
lag1day 0.72

lagHour (0.83): Indicates a very strong positive correlation with the number of trips from the previous hour. This suggests that bike usage in the previous hour is a strong predictor of bike usage in the current hour. lag2Hours (0.63): Also a positive correlation, though not as strong as the 1-hour lag, showing that bike usage two hours prior still has a significant impact on current bike usage. lag3Hours (0.44) and lag4Hours (0.28): These values represent positive correlations but are weaker, indicating that as the time lag increases, the predictive power of previous bike usage diminishes. lag12Hours (-0.39): This negative correlation suggests an inverse relationship, indicating that the number of trips is generally opposite in trend 12 hours prior. This could reflect a difference in bike usage patterns between morning and evening. lag1day (0.66): A relatively strong positive correlation indicates that the number of trips is similar from one day to the next, possibly highlighting consistent daily patterns of bike usage.

4. Run Models

# Filter the ride.panel data to create subsets for training and testing
ride.panel.train.subset <- ride.panel %>% filter(week >= 3)
ride.panel.test.subset <- ride.panel %>% filter(week < 3)

# Randomly select 40,000 rows from the training subset
ride.Train <- ride.panel.train.subset %>% sample_n(40000)

# Now, select up to 40,000 rows for the test set. If there are less than 40,000, select all.
n_test <- min(40000, nrow(ride.panel.test.subset))
ride.Test <- ride.panel.test.subset %>% sample_n(n_test)



reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

4. Model Evaluation

1) Prediction

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

kable_output <- week_predictions %>%
  kable(format = "html", caption = "Weekly Predictions and Error Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)

2) Examine Error Metrics for Accuracy

palette5 <- c("cornsilk", "#f8e0c8", "#f1c1b1", "#ea929a", "indianred3")

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

In the provided chart, it appears that Model A, which incorporates time of day and temperature variables, results in the highest Mean Absolute Errors (MAE), suggesting that it may not be capturing enough of the factors that influence trip counts. Models D and E, which are more complex and include time lags and holiday effects in addition to spatial and weather variables, show relatively lower MAEs. This indicates that the integration of additional time-related factors generally contributes to a decrease in prediction error, underscoring the importance of considering both immediate and historical temporal patterns in trip count prediction.

Interestingly, Model C stands out with an abnormally higher error than might be expected given its combination of time, spatial, and weather variables. This could imply a need to re-evaluate the specific variables included or the way they interact within the model.

Notably, Models D and E perform similarly across the weeks, suggesting that the inclusion of holiday lags does not significantly enhance model performance beyond the time lags already accounted for in Model D. This could be a reflection of the non-linear nature of holiday impacts or an indication of overfitting, where the model complexity begins to capture noise rather than the true underlying relationships.

From these observations, it becomes clear that while adding more variables can be beneficial, there is a nuanced balance between model complexity and predictive performance. Future model improvements might involve refining the selection of lag variables, re-assessing the inclusion of holiday effects, or exploring non-linear modeling techniques that might better capture the complexities of trip count variations.

3) Observed and Prediction

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour = Variable)) +
  geom_line(size = 1.1) +
  scale_color_manual(values = c("Observed" = "#f1c1b1", "Prediction" = "indianred3")) + # Specify your colors here
  facet_wrap(~Regression, ncol = 1) +
  labs(title = "Predicted/Observed bike share time series", 
       subtitle = "Chicago; A test set of 2 weeks", 
       x = "Hour", 
       y = "Station Trips") +
  plotTheme

The time series plots for different regression models over a two-week period illustrate varying degrees of predictive accuracy in estimating bike share usage. Model A, focusing solely on temporal and temperature variables, demonstrates moderate predictive capability but shows notable errors, particularly at peak usage times. Model B, which incorporates spatial factors, captures some patterns more accurately than Model A, yet still exhibits deviations due to the absence of temporal details. Model C, which combines both temporal and spatial variables, along with weather conditions, closely follows the observed data trends, although it occasionally misses peak values. Model D adds temporal lags and presents a better fit, closely mirroring the observed trips, signifying the relevance of historical data in forecasting. Lastly, Model E, which extends Model D with holiday lags, offers no significant improvement, indicating that additional complexity through holiday lags may not contribute meaningfully to the model’s performance in this context. Overall, while the inclusion of time lags enhances model accuracy, the benefit of integrating holiday-related variables appears limited.

4) Errors by Station

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = NYCCensus, color = "grey80", fill = "grey70")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size=0.3)+
  scale_colour_gradient(low = "cornsilk", high = "indianred2")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme

Based on the observations, it is evident that the East Village area exhibits the highest prediction errors, indicating that the model’s forecasting is less accurate in this locale. Furthermore, the areas encompassing Greenwich Village and the vicinities bordering Central Park also demonstrate relatively higher MAE values. These discrepancies suggest that the model may not fully account for the unique factors or behaviors influencing bike share usage patterns in these specific urban regions, which are possibly characterized by their own distinct demographic, cultural, or infrastructural attributes.

5) Space-Time Error Evaluation

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "indianred2")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

During weekdays, there seems to be a pattern of overprediction during the AM rush and underprediction during the PM rush. Midday and overnight predictions appear to be more evenly distributed around the diagonal, suggesting better predictive accuracy during these times. On weekends, the scatter plots show a wide dispersion of points, indicating a greater degree of variance in the predictive accuracy. This could be due to more variable trip patterns during weekends that are not as well-captured by the model.

5) Space-Time Error Evaluation by Station

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.) +
  geom_sf(data = NYCCensus, color = "grey70", fill = "grey80") +
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             size = 0.5) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude)) +
  facet_grid(weekend ~ time_of_day) +
  labs(title="Mean Absolute Errors, Test Set") +
  mapTheme

5) Socio-Economic Factors

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = NYCCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "indianred")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Median Income (Med_Inc) vs. MAE: The scatter plot does not show a clear correlation between median income and MAE. However, there’s a slight concentration of data points at the lower end of the median income scale with lower MAE values, which might indicate that the model predicts trips more accurately in areas with lower median incomes.

Percentage Taking Public Transportation vs. MAE: This plot appears to show a slight upward trend, where areas with a higher percentage of people using public transport tend to have higher MAE values in trip prediction. This could suggest that the model struggles to predict trips accurately in areas with higher public transportation usage.

Percentage White vs. MAE: The scatter plot for the percentage of the white population against MAE seems to show no strong correlation. There’s a wide dispersion of MAE values across all percentages, indicating that the racial composition, represented by the percentage of white residents, does not have a clear impact on the predictive error of the model.

5. Interpretation

The algorithm’s efficacy, particularly the models utilizing various time lags, underscores its potential for optimizing bike fleet transportation for re-balancing efforts. The inclusion of time lags allows the algorithm to account for patterns and trends over time, which is crucial for anticipating future demand and planning logistics.

For instance, the D and E models, which include time lag variables, have shown a relatively lower Mean Absolute Error (MAE) in predictions. This suggests they are more adept at capturing the temporal dynamics of bike usage. For re-balancing operations, this means the algorithm can predict not just the immediate demand for bikes but also how demand evolves throughout the day or week. With this information, a bike-sharing service can proactively move bikes from low-demand to high-demand areas in anticipation of usage spikes, rather than reacting to shortages or surpluses as they happen.

The model’s capability to reflect the influence of time on usage patterns enables a more nuanced approach to fleet management. For example, during weekday AM rush hours, where overprediction is common, the model’s insights could prevent the unnecessary allocation of too many bikes to certain areas, thereby optimizing the use of transportation resources. Conversely, during PM rush hours, where underprediction occurs, the model can signal the need to allocate additional bikes ahead of time to meet the surge in demand.

By employing a model with time lags, the transportation fleet used for re-balancing can operate more efficiently. It allows for the scheduling of fleet movements during off-peak hours, which can reduce operational costs and minimize the impact on traffic. It also ensures that the re-balancing efforts are less disruptive to the service and more aligned with the predicted user demand, thus enhancing the overall user experience.

In conclusion, the time lag models demonstrate considerable potential for improving bike fleet re-balancing operations. The algorithm can provide predictive insights that allow for a more strategic deployment of the re-balancing fleet, potentially leading to cost savings, better resource allocation, and an overall increase in the effectiveness of the bike-sharing system.

---
title: "Space-Time Prediction of Bike Share Demand in NYC"
author: "Ziyi Guo"
date: "Apr 28, 2024"
output: 
  html_document:
    toc: true
    toc_float: true
    code_folding: "hide"
    code_download: true
---

## 1. Introduction

Bike share systems have emerged as a pivotal component of urban mobility, offering a sustainable and convenient transportation alternative to city dwellers and visitors. These systems function through a network of strategically placed docking stations where bicycles can be picked up and returned. However, the effectiveness of bike sharing hinges significantly on the availability of bicycles and open docks at these stations. Due to varying daily patterns, some stations may end up with too many bikes while others may have none, leading to inconvenience for users seeking to either pick up or drop off bicycles.

To ensure that bikes are available where and when they are needed, bike share systems must engage in active re-balancing. This process involves redistributing bicycles across the network to maintain an equilibrium between bike and dock availability. One effective strategy for re-balancing involves utilizing a fleet of specially equipped vehicles that can transport multiple bikes at once. This method allows for rapid and efficient movement of bicycles from stations with a surplus to those with a deficit. Fleet-based re-balancing relies on predictive analytics to forecast the demand for bicycles at different stations throughout the day. Predictive methods incorporate time lag features, which allow the model to use data from past intervals to forecast future demands. By incorporating these time lags, the predictive system can accurately anticipate demand patterns, enabling the fleet to be deployed proactively. This strategic deployment ensures that each station maintains an optimal number of bikes and open docks, thereby maximizing the system's overall utility and user satisfaction.

```{r setup_13, warning = FALSE, message = FALSE, results='hide'}

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(knitr)
library(purrr)

# Load the readr package at the beginning of your script
library(readr)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
```


## 2. Data Wrangling

Citi Bike, powered by Lyft, is a prominent bike-sharing service in New York City, designed to offer an eco-friendly, efficient, and convenient mode of transportation for both locals and tourists. To enhance user understanding and encourage community engagement, Citi Bike makes available comprehensive trip data, which can be downloaded for further exploration. This dataset includes key details such as ride IDs, bike types, trip start and end times, station names, and station IDs, along with the geographical coordinates of each station. Additionally, the data distinguishes between member and casual riders, providing insights into different user patterns and behaviors. The dissemination of this data empowers stakeholders to perform detailed analyses, helping to address operational challenges such as the distribution and availability of bikes to ensure that the bike-sharing system meets its users' needs effectively.

In this section, we will import data from the Citi Bike sharing program, along with New York City demographic statistics and weather records from JFK Airport. By integrating demographic and weather information, we aim to enhance the accuracy of our predictions for future trends in the NYC Citi Bike sharing usage.

Data Source:
https://citibikenyc.com/system-data

```{r dat_prep, warning = FALSE, message = FALSE, results='hide' }

data <- read_csv("C:/Users/25077/Desktop/MUSA 508_PPA/Assignment 5/data/202401-citibike-tripdata.csv/202401-citibike-tripdata_1.csv") 

set.seed(123) # Setting a seed for reproducibility
data_nyc <- data %>% 
  sample_n(size = 400000)


dat_interval <- data_nyc %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))


NYCCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "NY", 
          geometry = TRUE, 
          county=c("New York", "Kings", "Queens", "Richmond", "Bronx"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

# Extract geography factors
NYCTracts <- 
  NYCCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf


# join census tract information to "dat2" data frame
dat_census <- st_join(dat_interval %>% 
          filter(is.na(start_lng) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lng) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        NYCTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., NYCTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)


weather_Panel <- 
  riem_measures(station = "JFK", date_start = "2024-01-01", date_end = "2024-01-31") %>%
  select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

```

Present the hourly variations in weather conditions, including precipitation levels, wind speed, and temperature fluctuations, for New York City.

```{r plot_weather, warning = FALSE, message = FALSE, results='hide'}
grid.arrange(
  ggplot(weather_Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather_Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - NYC JFK - Jan, 2024")
```

## 3. Explore the Data
 
*1) Bike Share Trends by Dates. *

```{r trip_timeseries, warning = FALSE, message = FALSE, results='hide' }
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. NYC, Jan, 2024",
       x="Date", 
       y="Number of trips")+
  plotTheme
```

*2) Bike Share Trends by Time Periods (Rush hours, AM/PM).*

```{r mean_trips_hist, warning = FALSE, message = FALSE, results='hide'}

dat_census <- dat_census %>%
  na.omit() 

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. NYC, Jan, 2024",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme
```

The chart indicates that both Mid-Day and PM Rush periods experience the highest frequency of bike trips, with PM Rush exhibiting not only the greatest overall trip frequency but also the most frequent occurrence of second trips per station. This suggests that during the PM Rush, many stations consistently see at least two trips starting per hour on average, which may reflect a robust pattern of usage as people conclude their workday or engage in evening activities.

Overnight periods display a surprisingly elevated frequency of trips, which, while not as high as the daytime peaks, is significant when compared to the AM Rush period, which has the lowest frequency of trips. This could be due to a variety of factors, such as night-shift workers, late-night entertainment activities, or even early-morning fitness routines.


*3) Bike Share Trends by station.*

```{r trips_station_dotw , warning = FALSE, message = FALSE, results='hide'}
ggplot(dat_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station. NYC, Jan. 2024",
       x="Number of Stations", 
       y="Trip Counts")+
  plotTheme
```

*4) Bike Share Trends by day of the week.*

```{r trips_hour_dotw, warning = FALSE, message = FALSE, results='hide' }

# by week days
ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in NYC, by day of the week, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

# by workdays or weekends
ggplot(dat_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in NYC - weekend vs weekday, Jan. 2024",
       x="Hour", 
       y="Trip Counts")+
     plotTheme
```

The lines representing weekdays generally follow a similar pattern, with pronounced peaks, suggesting consistent commuting behavior on these days. In contrast, Saturday and Sunday show different patterns, with a more gradual increase in trips throughout the day and not as sharp peaks during traditional rush hours, which is typical for weekends when people are not bound to standard work or school schedules.


```{r origin_map,fig.width=10, fig.height=12, warning = FALSE, message = FALSE, results='hide' }

ggplot() +
  geom_sf(data = NYCTracts %>%
            st_transform(crs = 4326), fill = "grey70", color = "grey80") +
  geom_point(data = dat_census %>%
                   mutate(hour = hour(started_at),
                          weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                          time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                                  hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                                  hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                                  hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
                   group_by(start_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
                   tally(),
             aes(x = from_longitude, y = from_latitude, color = log1p(n)),
             fill = "transparent", size = 0.08) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title = "Bike share trips per hr by station. NYC, Jan.2024") +
  mapTheme +
  facet_grid(weekend ~ time_of_day) +
  theme(legend.position = "right")

```

During weekdays, a high concentration of bike trips occurs within key business districts in Manhattan, indicating a prevalent commuter pattern associated with work-related travel via bike sharing. However, this pattern dissipates over the weekend; trip distribution becomes more dispersed across various locations and the frequency diminishes, reflecting a shift to more leisurely or diverse travel purposes that are not centered around commuting.

## 4. Model Development

*1) Create Space-Time Panel*

```{r panel_length_check, warning = FALSE, message = FALSE, results='hide'}

# Calculate the number of rows in the resulting panel dataset
# This line calculates the number of rows in the resulting panel dataset by multiplying the number of unique values in the 'interval60' column with the number of unique values in the 'start_station_id' column in the 'dat_census' dataset.

length(unique(dat_census$interval60)) * length(unique(dat_census$start_station_id))


study.panel <- expand.grid(interval60 = unique(dat_census$interval60),
                           start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(dat_census %>% 
            select(start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>% 
            distinct() %>% 
            group_by(start_station_id) %>% 
            slice(1),
            by = "start_station_id")

```

*2. Create Panel*

```{r create_panel, warning = FALSE, message = FALSE, results='hide'}
# Create a panel dataset for bike ride data
ride.panel <- 
  dat_census %>%  # Start with the dat_census dataset
  mutate(Trip_Counter = 1) %>%  # Add a Trip_Counter variable, indicating one trip for each row
  right_join(study.panel) %>%  # Right join with the study.panel dataset to ensure all spatial-time combinations are included
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, from_longitude, from_latitude) %>%  # Group by spatial-time and station variables
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>%  # Summarize the Trip_Counter variable to calculate total trip count for each group
  left_join(weather_Panel) %>%  # Left join with the weather.Panel dataset to include weather information
  ungroup() %>%  # Remove grouping
  filter(!is.na(start_station_id)) %>%  # Filter out rows with missing start_station_id values
  mutate(week = week(interval60),  # Create a week variable indicating the week number of the year
         dotw = wday(interval60, label = TRUE)) %>%  # Create a dotw variable indicating the day of the week (label format)
  filter(!is.na(Origin.Tract))  # Filter out rows with missing Origin.Tract values


ride.panel <- 
  left_join(ride.panel, NYCCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
```

*3) Create time lags*


```{r time_lags, warning = FALSE, message = FALSE, results='hide'}

# Arrange the ride.panel dataset based on 'start_station_id' and 'interval60'
# This step ensures that the dataset is ordered properly for subsequent calculations
ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>%

# Compute lag variables for the 'Trip_Count' variable at different time intervals
  mutate(
    lagHour = dplyr::lag(Trip_Count, 1),  # Lag of 1 hour
    lag2Hours = dplyr::lag(Trip_Count, 2),  # Lag of 2 hours
    lag3Hours = dplyr::lag(Trip_Count, 3),  # Lag of 3 hours
    lag4Hours = dplyr::lag(Trip_Count, 4),  # Lag of 4 hours
    lag12Hours = dplyr::lag(Trip_Count, 12),  # Lag of 12 hours
    lag1day = dplyr::lag(Trip_Count, 24)  # Lag of 1 day (24 hours)
  ) %>%

# Identify holidays and create a binary variable indicating whether it's a holiday
  mutate(
    holiday = ifelse(yday(interval60) == 148, 1, 0)  # Check if it's a holiday (148th day of the year)
  ) %>%

# Create a variable representing the day of the year
  mutate(
    day = yday(interval60)  # Extract the day of the year from 'interval60'
  ) %>%

# Create additional variables related to holiday lag
  mutate(
    holidayLag = case_when(
      dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",  # If the previous day was a holiday
      dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",  # If two days ago was a holiday
      dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",  # If three days ago was a holiday
      dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",  # If the next day will be a holiday
      dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",  # If two days ahead will be a holiday
      dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"  # If three days ahead will be a holiday
    ),
    holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag)  # Convert NA to 0 for non-holiday cases
  )


```

```{r evaluate_lags, warning = FALSE, message = FALSE}

# Assuming that ride.panel is your data frame and it has been processed as per the R code you provided
# Now, we will summarize the correlations and then create a kable

ride.panel %>%
  as.data.frame() %>%
  group_by(interval60) %>%
  summarize_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour", "lag2Hours", "lag3Hours", "lag4Hours", "lag12Hours", "lag1day"))) %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count), 2)) %>%
  # Now we will use kable to create a table
  kable(format = "html", caption = "Correlation between lags and Trip Count") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  column_spec(1, bold = T) # Make the first column bold

```

lagHour (0.83): Indicates a very strong positive correlation with the number of trips from the previous hour. This suggests that bike usage in the previous hour is a strong predictor of bike usage in the current hour.
lag2Hours (0.63): Also a positive correlation, though not as strong as the 1-hour lag, showing that bike usage two hours prior still has a significant impact on current bike usage.
lag3Hours (0.44) and lag4Hours (0.28): These values represent positive correlations but are weaker, indicating that as the time lag increases, the predictive power of previous bike usage diminishes.
lag12Hours (-0.39): This negative correlation suggests an inverse relationship, indicating that the number of trips is generally opposite in trend 12 hours prior. This could reflect a difference in bike usage patterns between morning and evening.
lag1day (0.66): A relatively strong positive correlation indicates that the number of trips is similar from one day to the next, possibly highlighting consistent daily patterns of bike usage.


*4. Run Models*


```{r train_test, warning = FALSE, message = FALSE, results='hide' }


# Filter the ride.panel data to create subsets for training and testing
ride.panel.train.subset <- ride.panel %>% filter(week >= 3)
ride.panel.test.subset <- ride.panel %>% filter(week < 3)

# Randomly select 40,000 rows from the training subset
ride.Train <- ride.panel.train.subset %>% sample_n(40000)

# Now, select up to 40,000 rows for the test set. If there are less than 40,000, select all.
n_test <- min(40000, nrow(ride.panel.test.subset))
ride.Test <- ride.panel.test.subset %>% sample_n(n_test)



reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)
```

## 4. Model Evaluation

*1) Prediction*

```{r nest_data, warning = FALSE, message = FALSE}
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

kable_output <- week_predictions %>%
  kable(format = "html", caption = "Weekly Predictions and Error Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
```

*2) Examine Error Metrics for Accuracy*


```{r plot_errors_by_model, warning = FALSE, message = FALSE, results='hide' }

palette5 <- c("cornsilk", "#f8e0c8", "#f1c1b1", "#ea929a", "indianred3")

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme


```

In the provided chart, it appears that Model A, which incorporates time of day and temperature variables, results in the highest Mean Absolute Errors (MAE), suggesting that it may not be capturing enough of the factors that influence trip counts. Models D and E, which are more complex and include time lags and holiday effects in addition to spatial and weather variables, show relatively lower MAEs. This indicates that the integration of additional time-related factors generally contributes to a decrease in prediction error, underscoring the importance of considering both immediate and historical temporal patterns in trip count prediction.

Interestingly, Model C stands out with an abnormally higher error than might be expected given its combination of time, spatial, and weather variables. This could imply a need to re-evaluate the specific variables included or the way they interact within the model.

Notably, Models D and E perform similarly across the weeks, suggesting that the inclusion of holiday lags does not significantly enhance model performance beyond the time lags already accounted for in Model D. This could be a reflection of the non-linear nature of holiday impacts or an indication of overfitting, where the model complexity begins to capture noise rather than the true underlying relationships.

From these observations, it becomes clear that while adding more variables can be beneficial, there is a nuanced balance between model complexity and predictive performance. Future model improvements might involve refining the selection of lag variables, re-assessing the inclusion of holiday effects, or exploring non-linear modeling techniques that might better capture the complexities of trip count variations.

*3) Observed and Prediction*

```{r error_vs_actual_timeseries, warning = FALSE, message = FALSE, results='hide'}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour = Variable)) +
  geom_line(size = 1.1) +
  scale_color_manual(values = c("Observed" = "#f1c1b1", "Prediction" = "indianred3")) + # Specify your colors here
  facet_wrap(~Regression, ncol = 1) +
  labs(title = "Predicted/Observed bike share time series", 
       subtitle = "Chicago; A test set of 2 weeks", 
       x = "Hour", 
       y = "Station Trips") +
  plotTheme

```

The time series plots for different regression models over a two-week period illustrate varying degrees of predictive accuracy in estimating bike share usage. Model A, focusing solely on temporal and temperature variables, demonstrates moderate predictive capability but shows notable errors, particularly at peak usage times. Model B, which incorporates spatial factors, captures some patterns more accurately than Model A, yet still exhibits deviations due to the absence of temporal details. Model C, which combines both temporal and spatial variables, along with weather conditions, closely follows the observed data trends, although it occasionally misses peak values. Model D adds temporal lags and presents a better fit, closely mirroring the observed trips, signifying the relevance of historical data in forecasting. Lastly, Model E, which extends Model D with holiday lags, offers no significant improvement, indicating that additional complexity through holiday lags may not contribute meaningfully to the model's performance in this context. Overall, while the inclusion of time lags enhances model accuracy, the benefit of integrating holiday-related variables appears limited.

*4) Errors by Station*

```{r errors_by_station, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, start_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = NYCCensus, color = "grey80", fill = "grey70")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size=0.3)+
  scale_colour_gradient(low = "cornsilk", high = "indianred2")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme
```

Based on the observations, it is evident that the East Village area exhibits the highest prediction errors, indicating that the model's forecasting is less accurate in this locale. Furthermore, the areas encompassing Greenwich Village and the vicinities bordering Central Park also demonstrate relatively higher MAE values. These discrepancies suggest that the model may not fully account for the unique factors or behaviors influencing bike share usage patterns in these specific urban regions, which are possibly characterized by their own distinct demographic, cultural, or infrastructural attributes.

*5) Space-Time Error Evaluation*


```{r obs_pred_all, warning = FALSE, message = FALSE, results='hide'}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "indianred2")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme
```

During weekdays, there seems to be a pattern of overprediction during the AM rush and underprediction during the PM rush. Midday and overnight predictions appear to be more evenly distributed around the diagonal, suggesting better predictive accuracy during these times.
On weekends, the scatter plots show a wide dispersion of points, indicating a greater degree of variance in the predictive accuracy. This could be due to more variable trip patterns during weekends that are not as well-captured by the model.

*5) Space-Time Error Evaluation by Station*

```{r station_summary,fig.width=10, fig.height=12, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.) +
  geom_sf(data = NYCCensus, color = "grey70", fill = "grey80") +
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             size = 0.5) +
  scale_colour_gradient(low = "cornsilk", high = "indianred2") +
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude)) +
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude)) +
  facet_grid(weekend ~ time_of_day) +
  labs(title="Mean Absolute Errors, Test Set") +
  mapTheme

  
```

*5) Socio-Economic Factors*

```{r station_summary2, warning = FALSE, message = FALSE, results='hide' }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = NYCCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "indianred")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme
  
```

*Median Income (Med_Inc) vs. MAE:*
The scatter plot does not show a clear correlation between median income and MAE. However, there's a slight concentration of data points at the lower end of the median income scale with lower MAE values, which might indicate that the model predicts trips more accurately in areas with lower median incomes.

*Percentage Taking Public Transportation vs. MAE:*
This plot appears to show a slight upward trend, where areas with a higher percentage of people using public transport tend to have higher MAE values in trip prediction. This could suggest that the model struggles to predict trips accurately in areas with higher public transportation usage.

*Percentage White vs. MAE:*
The scatter plot for the percentage of the white population against MAE seems to show no strong correlation. There's a wide dispersion of MAE values across all percentages, indicating that the racial composition, represented by the percentage of white residents, does not have a clear impact on the predictive error of the model.


## 5. Interpretation

The algorithm's efficacy, particularly the models utilizing various time lags, underscores its potential for optimizing bike fleet transportation for re-balancing efforts. The inclusion of time lags allows the algorithm to account for patterns and trends over time, which is crucial for anticipating future demand and planning logistics.

For instance, the D and E models, which include time lag variables, have shown a relatively lower Mean Absolute Error (MAE) in predictions. This suggests they are more adept at capturing the temporal dynamics of bike usage. For re-balancing operations, this means the algorithm can predict not just the immediate demand for bikes but also how demand evolves throughout the day or week. With this information, a bike-sharing service can proactively move bikes from low-demand to high-demand areas in anticipation of usage spikes, rather than reacting to shortages or surpluses as they happen.

The model's capability to reflect the influence of time on usage patterns enables a more nuanced approach to fleet management. For example, during weekday AM rush hours, where overprediction is common, the model's insights could prevent the unnecessary allocation of too many bikes to certain areas, thereby optimizing the use of transportation resources. Conversely, during PM rush hours, where underprediction occurs, the model can signal the need to allocate additional bikes ahead of time to meet the surge in demand.

By employing a model with time lags, the transportation fleet used for re-balancing can operate more efficiently. It allows for the scheduling of fleet movements during off-peak hours, which can reduce operational costs and minimize the impact on traffic. It also ensures that the re-balancing efforts are less disruptive to the service and more aligned with the predicted user demand, thus enhancing the overall user experience.

In conclusion, the time lag models demonstrate considerable potential for improving bike fleet re-balancing operations. The algorithm can provide predictive insights that allow for a more strategic deployment of the re-balancing fleet, potentially leading to cost savings, better resource allocation, and an overall increase in the effectiveness of the bike-sharing system.

