Bus Ridership Project

By Qi Si, Jiamin Tan, Xuezhu (Gillian) Zhao

Final Project - MUSA 550 Geospatial Data Science in Python

Introduction

The CTA has 1,864 buses that make about 19,237 trips a day and serve 10,768 bus stops, according to the latest official tally in 2017. However, little information is publicly available on how the bus service is used. Which areas have high ridership? Are there any interesting spatial patterns? Are there areas with potential ridership demand that lack access to a bus stop?

Through our project, we hope to gain a better understanding of bus service and usage and apply our insights to improve the planning of bus stations. We will examine boarding and alighting patterns across geographies and sociodemographics to select indicators of ridership demand. From there, we will build a model to predict ridership demand across the city. Finally, we will build a dashboard to help city officials identify areas in need of better bus service.

Preparation

Library import

Data preparation - Chicago

Ridership data

Socioeconomic data

Data from Census
Stop-level Ridership data to Block-group level data
Data cleaning
Feature Engineering
Save chicago_data

Chicago_data contains block-group-level census data and ridership data.

Chicago_data_filter contains the same block-group-level census and ridership data, without the geometry.

Data preparation - Philadelphia

Bus stops data

Socioeconomic data

Data from census
Data Cleaning
Feature engineering
Save philadelphia data

Data Exploration

Distribution

Since linear regression assumes a normal distribution of the outcome for each value of the explanatory variable, and heavily skewed variables can undermine that assumption, we first created a dashboard panel to visualize the distributions of our plausible explanatory variables. As the histograms below show, the distributions of bachelor_degree, master_degree, professional_degree, doctoral_degree, white, and african_american are highly right-skewed, while tenure_owener and with_child are slightly right-skewed. We will later apply a log transformation to address the skew. In addition, median_income peaks at 8.294e+4. A possible reason is that, in the previous step, we replaced the unknown values (-666666666.0) with the mean, which falls in this bin.
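
The two cleaning steps mentioned here, replacing the Census sentinel with the mean and log-transforming skewed counts, can be sketched as follows. This is a minimal illustration on made-up values, not the project's actual code: the table contents are hypothetical, and it assumes the mean is computed from the observed values only and that `log1p` (rather than a plain log) is used so that zero counts stay defined.

```python
import numpy as np
import pandas as pd

# Hypothetical block-group table; all values are made up for illustration.
df = pd.DataFrame({
    "median_income": [52000.0, -666666666.0, 88000.0, 109000.0],
    "bachelor_degree": [12, 0, 340, 95],
})

# Treat the Census sentinel as missing, then fill it with the mean of the
# observed values (assumption: the sentinel itself is excluded from the mean).
SENTINEL = -666666666.0
income = df["median_income"].replace(SENTINEL, np.nan)
df["median_income"] = income.fillna(income.mean())

# log1p computes log(1 + x), which compresses the long right tail and
# remains defined for block groups with a count of zero.
df["log_bachelor_degree"] = np.log1p(df["bachelor_degree"])
```

Filling many missing values with a single number creates exactly the kind of spike in the histogram described above, because every filled block group lands in the same bin.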

In addition, we produced a heatmap to get a sense of the density of bus stop locations in the city. Chicago is often regarded as having one of the best bus service coverages in the United States, and the heatmap supports this. At the initial zoom level, the heat is spread evenly, meaning that bus stops are distributed fairly evenly across the city. After zooming in, it becomes clear that major corridors have higher bus stop densities. One example is Milwaukee Avenue, which is easy to identify when zooming in because the heatmap shows strong density along a diagonal. Since Milwaukee Avenue is a major street with many Transit Oriented Developments, this suggests a relationship between bus stop allocation and residential buildings.

K-means clustering

According to the scikit-learn documentation, the widely used KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares.

We conducted a K-means clustering analysis to understand the distribution of each predictor and its association with boardings. Although 5 clusters is sometimes treated as a default choice, in this case the most suitable number of clusters for most variables appeared to be 3 or 4. To determine the optimal number of clusters, we created an elbow plot based on the scaled Chicago socioeconomic variables and boardings. As the elbow plot shows, the inflection point lies between 3 and 4. We used the Kneedle algorithm to find the "knee", the point beyond which adding more clusters yields little additional reduction in inertia. For the Chicago dataset, the optimal number of clusters is 3.

Example: Number of clusters to try out for the scaled Chicago dataset
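
The elbow search described in this section can be sketched as below. This is a rough illustration on synthetic data, not the project's code. The knee finder is a simplified stand-in (the largest gap between the normalized inertia curve and the straight line joining its endpoints), not the Kneedle algorithm itself; the third-party `kneed` package offers a Kneedle implementation. The data, the range of k, and the helper name `knee_by_chord_gap` are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled predictors: three separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]], dtype=float)
X = np.vstack([c + rng.normal(size=(100, 4)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Inertia (within-cluster sum of squares) for each candidate k.
ks = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in ks
]

def knee_by_chord_gap(x, y):
    """Hypothetical helper: knee of a decreasing, convex curve.

    Both axes are rescaled to [0, 1]; the knee is the point lying
    furthest below the chord from the first to the last point.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    gap = (1.0 - xn) - yn  # chord is y = 1 - x after rescaling
    return x[int(np.argmax(gap))]

best_k = int(knee_by_chord_gap(ks, inertias))
```

Plotting `ks` against `inertias` gives the elbow plot; the knee finder only automates what the eye does when picking the bend.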

Clustering Analysis on the Map

From above, we get a sense that 3 is a good value for n. Next, we show the clusters on the map. Moving the slider changes the number of clusters displayed. The bar chart on the left shows the mean of each feature by cluster, with the feature chosen from the drop-down menu.

From the map, we can see where the clusters are located when there are 3 of them. The cluster with the highest boardings (cluster number 1) also has the highest population, the highest median rent, the highest median income, the lowest median age, the highest percentage of college degrees, the lowest percentage of minorities, the highest percentage of public transit users, the shortest distance to residential buildings, and the shortest distance to office buildings. This shows a possible association between these socioeconomic and spatial features and ridership demand.
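
The per-cluster comparison behind the bar chart amounts to labeling each block group with its cluster and averaging each feature by label. A minimal sketch on synthetic data follows; the two columns and their values are made up, and only the choice of 3 clusters comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic block groups in three loose tiers (values are illustrative only).
rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "boardings": np.concatenate([
        rng.normal(50, 5, n), rng.normal(300, 20, n), rng.normal(1500, 100, n)
    ]),
    "median_income": np.concatenate([
        rng.normal(40000, 2000, n), rng.normal(70000, 3000, n), rng.normal(110000, 5000, n)
    ]),
})
features = ["boardings", "median_income"]

# Scale first so that no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Mean of every feature within each cluster, in the original units.
cluster_means = df.groupby("cluster")[features].mean()
top_cluster = cluster_means["boardings"].idxmax()
```

Cluster labels from K-means are arbitrary integers, so the high-boarding cluster is best identified from `cluster_means` rather than assumed to carry a particular number.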

Correlation

To avoid multicollinearity in the model-building process, we ran a correlation test on all the selected predictors. As the Correlation Among Selected Predictors graphic shows, median_income and pct_college_degree are highly correlated (0.74). Therefore, we will avoid using both predictors in the same model.
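
A correlation screen of this kind can be sketched with pandas as shown below. The data are synthetic, the third column is a made-up placeholder, and the 0.7 cutoff is an assumption chosen for illustration; the text only reports that a correlation of 0.74 was considered high.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: two deliberately related columns and one unrelated.
rng = np.random.default_rng(2)
n = 500
income = rng.normal(size=n)
df = pd.DataFrame({
    "median_income": income,
    "pct_college_degree": 0.9 * income + 0.3 * rng.normal(size=n),
    "median_age": rng.normal(size=n),
})

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair is listed once and the
# diagonal (always 1.0) is ignored.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

THRESHOLD = 0.7  # assumed cutoff, not taken from the project
high_pairs = pairs[pairs.abs() > THRESHOLD]
```

Each entry of `high_pairs` is a pair of predictors that should probably not appear together in the same linear model.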

OD Employment Statistics Analysis

NOTE: The size of this dataset exceeds the GitHub limit; please download it through this link: https://lehd.ces.census.gov/data/lodes/LODES7/il/od/il_od_main_JT00_2019.csv.gz

We also explored the employment origin and destination (OD) patterns in Chicago to better understand whether the variables we chose are meaningful for ridership prediction. The OD data we used come from the LEHD Origin-Destination Employment Statistics (LODES) published by the US Census Bureau. We used the 2019 data for Illinois. Since there is no API for importing this data, and it is too large to be uploaded to GitHub, please download it from the link in the note above. We then used Dask to import the data.

The original 2019 Illinois OD data contain 5,254,115 rows. Each row shows the total number of jobs (S000) for one OD pair at the block level. Since all the other data we processed are at the block-group level, we aggregated the job counts to block groups and extracted only those within the City of Chicago.
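
The block-to-block-group aggregation can be sketched as follows. This is an illustration with plain pandas on a tiny made-up sample, whereas the project loaded the full file with Dask. It uses the LODES OD column names `w_geocode` (workplace block), `h_geocode` (home block) and `S000` (total jobs), and relies on the fact that the first 12 digits of a 15-digit census block GEOID identify its block group. The set of Chicago block-group IDs is hypothetical, as is the choice to keep only pairs with both ends inside the city.

```python
import pandas as pd

# Tiny made-up sample. Geocodes are kept as strings so they can be sliced
# and so that states with leading zeros are not corrupted.
od = pd.DataFrame({
    "w_geocode": ["170318391001000", "170318391001001",
                  "170318391001000", "170310101001000"],
    "h_geocode": ["170310101001000", "170310101001002",
                  "170438400001000", "170310101002000"],
    "S000": [3, 2, 5, 1],
})

# Block GEOID = state(2) + county(3) + tract(6) + block(4);
# the block group is the first 12 digits.
od["w_bg"] = od["w_geocode"].str[:12]
od["h_bg"] = od["h_geocode"].str[:12]

# Hypothetical list of block groups inside the city; in practice this would
# come from the block-group geometries used elsewhere in the project.
chicago_bgs = {"170318391001", "170310101001", "170310101002"}

# Assumption: keep only OD pairs whose home and workplace are both in the city.
in_city = od["w_bg"].isin(chicago_bgs) & od["h_bg"].isin(chicago_bgs)
od_bg = od[in_city].groupby(["h_bg", "w_bg"], as_index=False)["S000"].sum()
```

Summing `S000` by `w_bg` alone or by `h_bg` alone then gives the destination and origin totals for each block group.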

We then plotted two maps: one showing the number of jobs located in each destination block group, and one showing the number of jobs held by residents of each origin block group.

The first map above shows that an extremely large number of jobs is concentrated in one block group in the Chicago CBD. The second map shows that, compared with the destinations, the origins of job trips are spread all around the city. Based on these observations, we believe it is reasonable to use the distance to the CBD for model building and prediction.
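
One way to compute a distance-to-CBD feature is the haversine great-circle formula applied to block-group centroids. The sketch below is an assumption about how such a feature could be built, not the project's method (a projected coordinate system in GeoPandas would work equally well). The CBD coordinates are an approximate location for the Loop chosen for illustration, and the centroids are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Assumed approximate location of the Chicago Loop (illustrative only).
CBD_LAT, CBD_LON = 41.8781, -87.6298

# Made-up centroids: one at the CBD, one 0.1 degrees of latitude to the north.
centroid_lats = np.array([41.8781, 41.9781])
centroid_lons = np.array([-87.6298, -87.6298])

dist_to_cbd_km = haversine_km(centroid_lats, centroid_lons, CBD_LAT, CBD_LON)
```

Because the function is vectorized with NumPy, it can be applied to whole columns of centroid coordinates at once.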

In our proposal, we said we wanted to explore a scenario in which the city opens a BRT line along the shortest route between the block groups with the highest demand, and to estimate what that line's ridership would be. We tried to proceed in that direction and plotted the block-group OD pairs with the highest demand.

We plotted the 10 OD block-group pairs with the highest demand. All of the top 10 pairs share the same destination block group (marked in red), which is in the CBD. Although the top 10 pairs have different origin block groups, the origins are all geographically very close to the destination. A BRT line running between any of these block groups would therefore serve little purpose, and the scenario we imagined is not realistic.

Regression and Prediction

Training and testing on Chicago

Linear Regression

Random Forest

Model Performance - Test on Chicago

The first step of the regression process was to test whether there is a linear relationship between the number of boardings and the socioeconomic predictors. The linear regression turned out to be relatively weak: the score on the training set is about 0.15, and the score on the test set is around 0.13. We therefore concluded that the relationship between boardings and our selected predictors is not simply linear.

To predict bus ridership, we used another supervised learning algorithm, random forest regression, which combines the predictions of many decision trees and can be more accurate than a single model. The random forest's score is around 0.46 on the training set and about 0.18 on the test set, an improvement over linear regression, although the gap between the training and test scores suggests that the model overfits.

To understand the spatial pattern of the error between predicted and actual boardings, we calculated the average percent error for each block group and used hvplot to visualize the results. As the Average Percent Error at Block Group Level plot shows, the large errors (in blue) occur on the edge of the city. Hovering over these block groups shows that none of them has any boarding records, yet our model predicted some boardings for them. Also highlighted in red is a block group near the city's center. Our data show that this block group has no boardings, but our random forest model assigned it 182 boardings. A possible explanation is that this block group is so small that no bus stop is located within it, so its recorded boardings are 0, while its socioeconomic predictors group it with other high-boarding block groups, which leads the model to predict high ridership.
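
The model comparison and the percent-error calculation can be sketched as below. This is a minimal illustration on synthetic, deliberately non-linear data, not the project's pipeline: the features, the target, the split ratio and the number of trees are all assumptions. The `percent_error` helper is hypothetical, and returning NaN where the actual value is 0 is one possible convention; the text does not say how zero-boarding block groups were handled, and dividing by zero is otherwise undefined, which is one reason such block groups can show extreme errors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and a non-linear target (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(600, 3))
y = 200 * np.sin(6 * X[:, 0]) ** 2 + 100 * (X[:, 1] > 0.5) + rng.normal(0, 10, 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# For scikit-learn regressors, .score() returns R^2.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

def percent_error(actual, predicted):
    """Hypothetical helper: signed percent error, NaN where actual is 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    out = np.full(actual.shape, np.nan)
    nonzero = actual != 0
    out[nonzero] = (predicted[nonzero] - actual[nonzero]) / actual[nonzero] * 100
    return out

test_pct_error = percent_error(y_test, models["random_forest"].predict(X_test))
example_pct_error = percent_error([0, 100], [5, 150])
```

Comparing the training and test entries of `scores` for each model is a quick check for overfitting: a large gap means the model has memorized patterns that do not generalize.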

Predict on Philadelphia

Use Case Scenario

As students in Philadelphia, we were curious about the city's bus ridership patterns. However, after searching online, we realized that SEPTA does not publish any bus ridership data. To better understand bus ridership demand and to support the further development of bus services in underserved areas of Philadelphia, we applied the model built on the Chicago bus ridership dataset to Philadelphia, since the two cities are relatively comparable in size and socioeconomic conditions. The map above shows our predicted bus ridership for each block group in Philadelphia. Block groups in Center City and University City tend to have higher predicted ridership. Both areas have high concentrations of students and employees, who generate higher bus demand. Such predictions can later help SEPTA decide which block groups need more bus service and where to plan new bus routes.

Conclusion

Through this project, we aimed to use the existing Chicago bus ridership data, the O-D employment statistics, socioeconomic data from the Census Bureau's 2015-2019 American Community Survey, and spatial features from OpenStreetMap to build a model for predicting bus ridership in cities without open bus ridership data. We chose Philadelphia as the scenario city. After collecting all the necessary datasets, we created several dashboards to visualize the distributions and scatter plots of the selected socioeconomic predictors, and a heatmap dashboard to visualize the bus ridership hotspots in Chicago.

We tried both linear and random forest regression on the Chicago dataset to identify the best model. Our test results show that random forest outperforms linear regression. As a result, we chose the best random forest model to forecast bus ridership in Philadelphia. We also made an error plot for Chicago to visualize the prediction errors geographically. Our dashboard indicates that block groups on the outskirts have higher error rates.

We can't evaluate the accuracy of our predictions for Philadelphia because SEPTA's bus ridership data is not available. However, based on socioeconomic characteristics such as population, the dashboard suggests that the model's predictions are plausible. For example, as students in Philadelphia, we know that the Center City area should have relatively high ridership.

Currently, our research focuses primarily on socioeconomic and spatial variables. However, predicting bus ridership is a very complicated problem. To improve our model further, we should include time-varying indicators such as weather and temperature. Furthermore, a decision on a new bus route cannot be based solely on the model. Other factors to consider include topography, existing road systems, funding, and equity.