1. Introduction

Urban bike sharing is a service system that provides bicycles on the public street for individuals to use on a short term basis, usually through purchasing a membership plan. Many bike sharing systems, including Philadelphia’s Indego bike share ^1, allow users to unlock a bike from a dock and return it at another dock. This form of transportation is often seen as a mean to promote health and sustainability while reducing automobile congestion ^2.

Despite the benefits, this kind of system is hard to operate well, as there are limited number of docks at a dock station. A user cannot return a bike when the station is full and cannot get a bike when the station is empty. To prevent users from suffering from such problems and stop using the service, operators need to manually distribute the optimal number of bikes at the stations to ensure that a bike or a dock is available when needed. One way of doing this is to dispatch bikes when problems occur, but this will be very inefficient. By the time the operators arrive with an available bike or make room in the station, the user may be impatiently waiting for a long time or has decided not to use the service again. The operators ought to take a more proactive approach. This is known as the “re-balancing” problem.

The key to re-balancing is predicting the bike share demand for all stations at all times as accurate as possible, and the operators can move the bikes around ahead of time according to past experiences. In this way, user satisfaction will be maximized. As many top mathematicians and computer scientists have been addressing the challenge ^3, this analysis attempts to predict space/time demand for bike share pickups in the city of Philadelphia. It considers the bike share pickup pattern at the station level by time of the pickup and locations of the station. With 5-minute interval data accounting for a 5-week period (July 8 - August 11, 2019), it hopes to inform the usefulness of such algorithms and provide insights for Indego’s re-balancing plan at any given time.

2. Prediction Model

2.1 Prepare data

2.1.1 Bike Share Trip Data 2019 Q3

The 5-week bike share data from July 8 to August 11 is read in. The time period was intentionally selected to examine if the model predicts well across time that does not have national holidays. For the purpose of this analysis, it only considers the trips generated from the starting stations. It does want to predict the trip demand for very high resolution space/time intervals for accurate re-balancing plans, so the intervals are for every 5 minutes for every station.

dat <- read.csv("indego-trips-2019-q3.csv")

dat_indego <- dat %>%
  mutate(interval5 = floor_date(mdy_hm(start_time), unit = "5 mins"),
         interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
         week = isoweek(interval5),
         dotw = wday(interval5, label=TRUE)) %>%
  filter(week>=28 & week <= 32) %>% # selecting 5 weeks 7/8 - 8/11 2019
  select(-duration, -bike_id, -plan_duration, -trip_route_category, -passholder_type, -bike_type)

2.1.2 Census data

Census geography and data by tracts are also read in. One might reasonably assume that features such as age, race, usual commute time, and habit using public transit have influences on their willingness of using bike share. These information are joined to the station if the station spatially intersects the tract.

PhillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
PhillyTracts <- 
  PhillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf
dat_census <- st_join(dat_indego %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lon) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        PhillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., PhillyTracts %>%
            st_transform(crs=4326), #4326
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

2.1.3 Weather data

It is also a reasonable assumption that inclement weather would incentivize bike share. Features including temperature, precipitation, and wind speed are considered.

weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2019-07-08", date_end = "2019-08-11") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval5 = ymd_hm(substr(valid,1,16))) %>%
    mutate(week = week(interval5),
           dotw = wday(interval5, label=TRUE)) %>%
    group_by(interval5) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.1.4 Amenity features

Distances to amenities might also affect bike share demand. When a bike station is close to bus stations, people might take a bus instead. When a bike station is close to landmarks, tourists might take a ride. When a bike station is close to picnic sites, families might take a pleasure ride. Therefore, these features are also considered in the analysis. The average nearest neighbor distance to features from each station has been calculated, while the number of neighbors is 5.

# Distance to bus stations
bus_stations <- st_read("https://opendata.arcgis.com/datasets/e09e9f98bdf04eada214d2217f3adbf1_0.geojson") %>%
  dplyr::select(Y = Longitude, X = Latitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4236, agr = "constant")

# Distance to city landmarks
landmarks <- st_read("http://data-phl.opendata.arcgis.com/datasets/5146960d4d014f2396cb82f31cd82dfe_0.geojson") %>%
  st_centroid() %>%
  dplyr::mutate(lat = sf::st_coordinates(.)[,1],
                lon = sf::st_coordinates(.)[,2]) %>%
  dplyr::select(Y = lat, X = lon) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

# Distance to picnic sites
picnic <- st_read("ppr_picnic_sites.geojson") %>%
  st_centroid() %>%
  dplyr::mutate(lat = sf::st_coordinates(.)[,1],
                lon = sf::st_coordinates(.)[,2]) %>%
  dplyr::select(Y = lat, X = lon) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

# Nearest neighbor distance 
dat_indego.sf <- dat_indego %>%
  na.omit() %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326, agr = "constant")

st_c <- st_coordinates
nn_var <-
  dat_indego.sf %>%
    mutate(
      bus.nn =
        nn_function(st_c(dat_indego.sf), st_c(bus_stations),5),
      landmarks.nn =
        nn_function(st_c(dat_indego.sf), st_c(landmarks),5),
      picnic.nn =
        nn_function(st_c(dat_indego.sf), st_c(picnic),5))

2.2 Explore Space/time

2.2.1 Create Space-Time Panel

The data set for this analysis must be a complete panel with an observation for every possible space/time combination. A new data frame is created to ensure that all unique space/time intervals are captured.

study.panel <- 
  expand.grid(interval5=unique(dat_census$interval5), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

nrow(study.panel) == length(unique(dat_census$interval5)) * length(unique(dat_census$start_station))
## [1] TRUE

Other features are joined to the new data frame.

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval5, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  left_join(nn_var) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval5),
         dotw = wday(interval5, label = TRUE),
         hour = hour(interval5)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, PhillyCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

2.2.2 Create Time Lag

Just like how home prices are related to each other when they are located close to each other, bike share demand might be affected by time order. The time lag features are created to represent the serial (temporal) relationships.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval5) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval5)) %>%
  dplyr::select(-end_time, -end_station, -end_lat, -end_lon, -Total_Pop, -Med_Inc, -White_Pop, -Travel_Time, -Means_of_Transport, -Total_Public_Trans)

2.3 Run Models

2.3.1 Model A: A 3 Weet Training Set and 2 Weet Test Set

rideA.Train <- filter(ride.panel, week %in% c("28","29","30"))
rideA.Test <- filter(ride.panel, week %in% c("31","32"))

3. Exploratory Analysis

Bike share data is explored for time, space, weather, and demographic relationships.

3.1 Space/time dependencies in the data

3.1.1 Serial (temporal) autocorrelation

Figure 1 visualizes time series of bike share usage from the first day of the first week to the last day of the fifth week. The black vertical lines are signifying Mondays. Most weeks exhibit similar patterns with consistent peaks and troughs, suggesting serial correlation. However, there are variations within, as the patterns are not identical.

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval5) == 1,
                         interval5, 0)) %>%
  filter(monday != 0) 

rbind(
  mutate(rideA.Train, Legend = "Training"), 
  mutate(rideA.Test, Legend = "Testing")) %>%
    group_by(Legend, interval5) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval5, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        geom_vline(data = mondays, aes(xintercept = monday)) +
        labs(title="Rideshare trips by week: July 8 - August 11, 2019",
             subtitle="Philadelphia", 
             x="Day", y="Trip Count",
             caption = "Figure 1") +
        plotTheme() + theme(panel.grid.major = element_blank())  

The time lag features are tested for pearson correlation with the number of trips from the start stations using data for all five weeks. The results, as shown in Table 1, indicate strong correlations, especially for the 1 hour lag. The correlation decreases as the lag hour increases. Note that the correlation is not very strong for the 1 day lag, signifying that the same time on different days might experience pretty different bike share trips.

lag_cor <- 
  as.data.frame(ride.panel) %>%
    group_by(interval5) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval5, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
  kable(caption = "Table 1. Correlation by time lag") %>%
        kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#ebedea") %>%
  row_spec(4, color = "black", background = "#ebedea") %>%
  row_spec(6, color = "black", background = "#ebedea")

lag_cor
Table 1. Correlation by time lag
Variable correlation
lagHour 0.92
lag2Hours 0.84
lag3Hours 0.80
lag4Hours 0.77
lag12Hours 0.62
lag1day 0.41

3.1.2 Spatial autocorrelation

Besides temporal autocorrelation, the bike share trips also exhibit spatial autocorrelation. Figure 2 shows the total bike share trips of every station by week. Note that the stations do not cover the entire city, and the map is zoomed to show all stations rather than the whole city. It is obvious that the bike usage varies across stations but is unevenly distributed across space. The higher usage stations are clustered at the center of the map, while the lower usage stations tend to be at the edge. This shows spatial autocorrelation and also shows that the spatial pattern seem consistent across different weeks.

ggplot()+
  geom_sf(data = PhillyTracts %>%
          st_transform(crs=4269))+
  geom_point(data = dat_census %>%
              group_by(start_station, start_lat, start_lon, week) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 1, size = 1)+
  scale_colour_viridis(direction = 1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(~week)+
  labs(title="Bike share trips per station by week",
       subtitle = "Philadelphia, July 8 - August 11, 2019",
       caption = "Figure 2")+
  mapTheme()

3.1.3 Space/time correlation

Bike share in Philadelphia for the selected weeks shows strong space and time dependencies. Figure 3 brings space and time together by different times and days of the week to help inform the re-balancing plan.

Looking at the 5 weeks trips together to gain a general sense, while the demand is rather consistent throughout the day in the weekends, it is obvious that there is generally more demand for bike share trips for PM rush in weekdays, which is also confirmed by Figure 4. The stations that have the most demands are also somewhat different across AM Rush, Mid-Day, Overnight, and PM Rush, probably due to the land use near the stations. As this is a high-resolution analysis, Indego should re-balance the bikes for different stations accordingly but focus the most on the “hotspot” stations during different time periods. These stations are the ones being most used for starting bike share trips. With more users, these stations are more likely to have fluctuate numbers of bikes, so they may have the most need for re-balancing. Referring to Figure 4, Indego should also have different re-balancing startegies for weekdays and the weekend, especially for rush hours.

ggplot()+
  geom_sf(data = PhillyTracts %>%
          st_transform(crs=4269))+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(interval5),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 1, size = 1)+
  scale_colour_viridis(direction = 1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips by station.",
       subtitle="Philadelphia, July 8 - August 11, 2019",
       caption = "Figure 3")+
  mapTheme()

ggplot(dat_census %>% mutate(hour = hour(interval5)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips by day of the week",
       subtitle="Philadelphia, July 8 - August 11, 2019",
       x="Hour", 
       y="Trip Counts",
       caption = "Figure 4")+
     plotTheme()

So far, the space/time analysis looks at the five-week worth of trips in one lens. To take a closer look, one random day in a random week is selected for mapping the bike share trips over space and time. Since Indego is not widely adopted in Philadelphia, the 5-minute high resolution interval does not show indicative patterns. Therefore, the map is created for a 60-minute lower resolution interval. Despite smaller bike station coverage and lower number of trips compared to other cities like Chicago, the space/time pattern indicates higher pickups outside Center City in the morning and heavy usage within and around Center City in the evening.

week32 <-
  filter(dat_indego, week == 32 & dotw == "Mon")

week32.panel <-
  expand.grid(
    interval60 = unique(week32$interval60),
    start_station = unique(dat_indego$start_station)
  )

sta <- st_read("indego-stations.json") %>%
  mutate(start_station = id,
         sta_name = name,
         sta_lat = latitude,
         sta_lon = longitude,
         sta_cor = coordinates
         ) %>%
  select(start_station, sta_name, sta_lat, sta_lon, sta_cor, totalDocks)
## Reading layer `indego-stations' from data source `C:\Users\m1861\OneDrive - PennO365\MUSA 508\Assignments\Assignment5 Rideshare Prediction\indego-stations.json' using driver `GeoJSON'
## Simple feature collection with 143 features and 33 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -75.22399 ymin: 39.88994 xmax: -75.12994 ymax: 39.99179
## geographic CRS: WGS 84
ride.animation.data <-
  mutate(week32, Trip_Counter = 1) %>%
    right_join(week32.panel) %>% 
    group_by(interval60, start_station) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(sta, by=c("start_station"="start_station")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","11+ trips"))

rideshare_animation <-
  ggplot() +
    geom_sf(data=PhillyTracts) +
    geom_point(data = ride.animation.data, aes(x=sta_lon, y = sta_lat, color = Trips, size=0.5)) +
    scale_color_manual(values = palette5A) +
    labs(title = "Bike share pickups for one Monday in August 2019",
         subtitle = "60 minute intervals: {current_frame}") +
    transition_manual(interval60) +
    ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
    xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
    mapTheme()

animate(rideshare_animation, duration=10, renderer = gifski_renderer())

4. Demand Prediction and Validation for Accuracy and Generalizability

Four linear regression models are built on the training set (bike share data for the first three weeks). The models have different features: 1. reg 1 focuses on space and weather 2. reg 2 focuses on space, weather, and demographic features 3. reg 3 focuses on space, weather, and time 4. reg 4 focuses on space, weather, time, and demographic features The model that is identified as the best is reg 4, with a 0.489 adjusted r-square, as shown in Table 2.

Table 2. Regression Model 4 Result
Dependent variable:
Trip_Count
start_station -0.0001***
(0.00004)
bus.nn 9.671**
(4.629)
landmarks.nn 7.083
(4.986)
picnic.nn 1.495*
(0.802)
Temperature 0.0003
(0.003)
Precipitation -0.114
(0.096)
Wind_Speed 0.0005
(0.001)
dotw.L -0.013**
(0.006)
dotw.Q 0.077***
(0.006)
dotw.C 0.016***
(0.006)
dotw4 0.028***
(0.005)
dotw5 -0.013**
(0.005)
dotw6 0.004
(0.006)
hour 0.003***
(0.0005)
lagHour 0.463***
(0.003)
lag2Hours 0.045***
(0.004)
lag3Hours 0.014***
(0.004)
lag12Hours 0.016***
(0.003)
lag1day 0.0004
(0.004)
Med_Age 0.00000
(0.0003)
Percent_White 0.042***
(0.014)
Mean_Commute_Time 0.0002***
(0.0001)
Percent_Taking_Public_Trans -0.011
(0.043)
Constant 1.266***
(0.162)
Observations 56,478
R2 0.489
Adjusted R2 0.489
Residual Std. Error 0.496 (df = 56454)
F Statistic 2,349.652*** (df = 23; 56454)
Note: p<0.1; p<0.05; p<0.01
reg1 <- lm(Trip_Count ~ start_station + bus.nn + landmarks.nn + picnic.nn # space
           + Temperature + Precipitation + Wind_Speed, # weather
          data=rideA.Train)

reg2 <- lm(Trip_Count ~ start_station + bus.nn + landmarks.nn + picnic.nn # space
           + Temperature + Precipitation + Wind_Speed # weather
           + Med_Age + Percent_White + Mean_Commute_Time + Percent_Taking_Public_Trans, # demo by space
          data=rideA.Train)

reg3 <- lm(Trip_Count ~ start_station + bus.nn + landmarks.nn + picnic.nn # space
           + Temperature + Precipitation + Wind_Speed # weather
           + dotw + hour, # time
          data=rideA.Train)

reg4 <- lm(Trip_Count ~ start_station + bus.nn + landmarks.nn + picnic.nn # space
           + Temperature + Precipitation + Wind_Speed # weather
           + dotw + hour + lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day # time lag
           + Med_Age + Percent_White + Mean_Commute_Time + Percent_Taking_Public_Trans, # demo by space
          data=rideA.Train)

stargazer(reg4, type="html", title="Table 2. Regression Model 4 Result")

4.1 purr on the two week test set

The Mean Absolute Error (MAE) is calculated on the test set (bike share data for the last two weeks) for each model by week. From Figure 5, it is obvious that model 4 that accounts for every kinds of effect results in the least MAE. The model also generalize pretty well across the two weeks selected.

ride.Test.weekNest <- 
  as.data.frame(rideA.Test) %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ASpace = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_Demo = map(.x = data, fit = reg2, .f = model_pred),
           CSpace_Time = map(.x = data, fit = reg3, .f = model_pred),
           DSpace_TimeLag_demo = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette4) +
    labs(title = "Mean Absolute Errors by model specification and week",
         caption = "Figure 5") +
  scale_x_continuous(breaks = seq(31, 32, by = 1)) +
  plotTheme()

To examine if the models generalize well across more specific time periods, Figure 6 compares the prediction and observed trip counts for each model. Again, model 4 which accounts for space, time lag, demographic information, and weather, shows both accuracy and generalizability fairly well. However, it has a general tendency to underpredict, especially for peak values.

week_predictions %>% 
    mutate(interval5 = map(data, pull, interval5),
           from_station_id = map(data, pull, start_station)) %>%
    dplyr::select(interval5, from_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    na.omit() %>%
    gather(Variable, Value, -Regression, -interval5, -from_station_id) %>%
    group_by(Regression, Variable, interval5) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval5, Value, colour=Variable)) + 
      geom_line(size = 1) + 
      scale_colour_manual(values = palette2) +
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks",  
           caption = "Figure 6",
           x = "Hour", y= "Station Trips") +
      plotTheme()

To understand more about model 4’s result, Figure 7 with the blue line denoting a perfect prediction and the red line denoting the prediction trend confirms Figure 6 that the model underpredicts for every time period.

week_predictions %>% 
    mutate(interval5 = map(data, pull, interval5),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval5, start_station, start_lon, start_lat, Observed, Prediction, Regression, dotw) %>%
    unnest() %>%
  filter(Regression == "DSpace_TimeLag_demo") %>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval5) < 7 | hour(interval5) > 18 ~ "Overnight",
                                 hour(interval5) >= 7 & hour(interval5) < 10 ~ "AM Rush",
                                 hour(interval5) >= 10 & hour(interval5) < 15 ~ "Mid-Day",
                                 hour(interval5) >= 15 & hour(interval5) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#b75564")+
    geom_abline(slope = 1, intercept = 0, color = "#437d95")+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips",
       caption ="Figure 7")+
  plotTheme()

In addition, Figure 8 illustrates that the prediction for the weekends is more erroneous than the weekdays.

week_predictions %>% 
    mutate(interval5 = map(data, pull, interval5),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval5, start_station, start_lon, start_lat, Observed, Prediction, Regression, dotw) %>%
    unnest() %>%
  filter(Regression == "DSpace_TimeLag_demo") %>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval5) < 7 | hour(interval5) > 18 ~ "Overnight",
                                 hour(interval5) >= 7 & hour(interval5) < 10 ~ "AM Rush",
                                 hour(interval5) >= 10 & hour(interval5) < 15 ~ "Mid-Day",
                                 hour(interval5) >= 15 & hour(interval5) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = PhillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 2, alpha = 0.9)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
       caption = "Figure 8")+
  mapTheme()

Figure 9 exhibits the MAE of model 4, showing the cross validation result by space. It shows that the errors are higher for stations in the middle of the map where more trips occur, as identified in the analysis earlier.

week_predictions %>% 
    mutate(interval5 = map(data, pull, interval5),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval5, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DSpace_TimeLag_demo") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = PhillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.9, size=2)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error across space, Model 4",
       subtitle = "Philadelphia; A test set of 2 weeks",
       caption = "Figure 9")+
  mapTheme()

In summary for the model training with 3-week data and testing with 2-week data, the takeaways for re-balancing are: * Time: + While the model generalizes fairly well across the two test weeks, Indego has to keep in mind that the two weeks are rather similar but not all weeks are similar, especially the ones with holidays. Occasions such as holidays, students’ vacations, and sports or entertainment events will alternate people’s travel behavior. + The model has a tendency to underpredict, especially for peak values. This makes sense as there are more non-peak value data than peak-value data in the training set. + The prediction for the weekends is more erroneous than that for the weekdays. This also makes sense as there are more weekdays than weekends. * Space: + Prediction errors are higher for stations in the middle of the map where more trips occur. This can also be explained by the number of data available.

4.2 Random k-fold (leave-one-station-out) on the 5 week panel

The time effect is pretty clear, so this section explores more about space. With the 5 week data, a leave-one-station-out spatial cross validation is performed, with the same variables from the selected model above but with Poisson regression method. In addition, instead of looking at Mean Absolute Error, Figure 10 and 11 show the Mean Error to inform underprediction and overprediction across stations.

As the figure illustrate, the errors are pretty small, and they distribute fairly even across space, suggesting good generalizability and accuracy. However, the stations that were underpredicted and overpredicted seem to form small clusters in some areas. For example, the stations in blue on the northwest of the map, are near the Paine’s Park area, suggesting that this area is mainly recreational. The negative Mean Error suggests that the predicted value is lower than the observed trip counts, meaning that the model tend to underpredict for this area. Errors like this signifies that there are some features to be added to improve the model for better re-balancing.

Figure 11 also shows the percentage of white people by census tract. While the relationship between prediction errors and percentage white is not obvious by station, it is obvious that the bike stations are located in tracts with extremely high share of white population, as compared with the rest of the tracts. This questions the fairness of Indego’s service model, as it does not serve the minority community (in terms of race) well. Since Indego provides dock bikes, the location of the stations hugely affects who can use the bike. Residents of the minority tracts cannot even use the bike share service even if they want to because they cannot return the bike to stations close to their homes. This observation also means that if Indego decides to expand their station coverage, the existing model with existing data may not generate well to the new areas due to race and income disparities. This might not be as big of a problem for private businesses, however, as Indego is an initiative of the city, this equity issue should raise a red flag for the city.

trip_var <- c("start_station", "bus.nn", "landmarks.nn", "picnic.nn",
              "Temperature", "Precipitation", "Wind_Speed",
              "dotw", "hour", "lagHour", "lag2Hours", "lag3Hours", "lag12Hours", "lag1day",
              "Med_Age", "Percent_White", "Mean_Commute_Time", "Percent_Taking_Public_Trans")

reg.cv <- crossValidate(
  dataset = ride.panel,
  id = "start_station",
  dependentVariable = "Trip_Count",
  indVariables = trip_var) %>%
    dplyr::select(cvID = start_station, Trip_Count, Prediction, geometry)


reg.summary <- 
  mutate(reg.cv,    Error = Prediction - Trip_Count,
                    Regression = "Random k-fold CV: With Time Lag Factors") %>%
  st_sf() 

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - Trip_Count, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
error_by_reg_and_fold %>%
  ggplot(aes(Mean_Error)) + 
    geom_histogram(bins = 30, colour="black", fill = "#ebedea") +
    geom_vline(xintercept = 0) + 
    labs(title="Distribution of ME", 
         subtitle = "leave-one-station-out cross validation",
         x="Mean Error", y="Count",
         caption = "Figure 10") 

ggplot(error_by_reg_and_fold) +
  geom_sf(data = PhillyCensus, color = "grey", aes(fill = Percent_White))+
  geom_sf(aes(color = Mean_Error),size=2) +
  scale_fill_gradient(low = "gray40",
                      high = "gray5") +
  scale_colour_gradient2(low = "#437d95",
                       high = "#b75564",
                       midpoint = 0,
                       na.value = "transparent") +
  labs(title = "Bike share demand Mean Error in Philadelphia, 2019",
  subtitle = "by leave-one-station-out spatial cross validation with 5-week data",
        caption = "Figure 11") +
  mapTheme() + 
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))

5. Conclusion

All in all, the bike share demand prediction model for Indego service predicts fairly accurately and is generalizable across space and time, as the features, especially the time lag features, are strong predictors. I would recommend the algorithm to be used for the bike re-balancing plan, with perhaps more features added to predict peak values and the weekends better. Underpredicting will cause not enough bikes at the station, while overpredicting will dispatch too many bikes at the station and may stop people from successfully returning the bikes.

More importantly, the operator should use this algorithm to inform re-balancing plan. This include not only allocating different amount of bikes for different stations, but also creating different dispatch time intervals for different stations during different time periods. For example, the stations in the center of the map tend to be used the most in PM rush, calling for more manual dispatch to prevent problems/dissatisfaction. For these station, a high time resolution would be ideal, like dispatching every five minutes. For other stations, the operator can dispatch less frequently. However, the operator needs to consider other things such as the surrounding traffic, dispatch capacity, and human resource. In cases when dispatch cannot catch up with demand, the operator should consider other methods such as increasing the number of bikes and docks at these stations.

As Philadelphia has been trying to and should address equity, and given the fact that the current Indego stations are mainly in the majority-white areas, establishing bike share stations to more areas in the city may help those who need affordable transportation. Of course, more research should be done to evaluate whether bike share would be effective or not.

Sources:

  1. https://www.rideindego.com/about/

  2. https://foreignpolicy.com/2018/12/31/a-billion-bicyclists-can-be-wrong-china-business-bikeshare/

  3. https://www.bloomberg.com/news/articles/2014-08-27/balancing-bike-share-stations-has-become-a-serious-scientific-endeavor