Exploratory Data Visualization on Traffic Crashes

Part I: Selecting a dataset

Part II: Exploring and visualizing the data

General trend of crashes

Motivation

I was interested in how traffic crashes vary over time (by year) and by location on the road (at an intersection or not). I chose a stacked bar plot because it shows how the number of crashes changes over time and how many of them happen at intersections. With stacking, readers can see the total number of crashes per year as well as how the number of intersection crashes fluctuates. I picked a diverging color scheme to highlight the contrast between the two intersection groups.

Specifically, I used matplotlib rather than seaborn or Altair for this plot. In matplotlib a stacked bar has to be configured in two steps (the bottom and the top of the stack), but seaborn requires a similar process, so it offers no shortcut here. Seaborn (which builds on matplotlib) and Altair generally need less code, but the data for this plot is simple: it has few observations, so it is easy to loop over it to retrieve the arrays for "year", "yes", and "no" while checking that they are accurate. Matplotlib is therefore good enough for what I want to visualize.
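A minimal sketch of the two-step stacking described above, using made-up records rather than the Philadelphia data. The record layout, the helper names `stack_arrays` and `plot_stacked`, and the colors are assumptions for illustration, not the original notebook's code.

```python
from collections import Counter

# Hypothetical records: one (year, at_intersection) pair per crash.
records = [
    (2011, "yes"), (2011, "no"), (2011, "yes"),
    (2012, "no"), (2012, "yes"),
    (2013, "no"), (2013, "no"), (2013, "yes"),
]


def stack_arrays(rows):
    """Return aligned lists (years, yes_counts, no_counts), sorted by year."""
    counts = Counter(rows)
    years = sorted({year for year, _ in rows})
    yes = [counts[(year, "yes")] for year in years]
    no = [counts[(year, "no")] for year in years]
    return years, yes, no


def plot_stacked(years, yes, no):
    """Draw the stack in two steps: the bottom layer, then the top layer."""
    import matplotlib.pyplot as plt  # imported here so the counting runs without it

    fig, ax = plt.subplots()
    ax.bar(years, yes, color="tab:red", label="At intersection")
    # `bottom=yes` lifts the second layer so it sits on top of the first.
    ax.bar(years, no, bottom=yes, color="tab:blue", label="Not at intersection")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of crashes")
    ax.legend()
    return fig


years, yes, no = stack_arrays(records)
```

Calling `plot_stacked(years, yes, no)` draws the chart. Deriving all three lists from the same sorted `years` list is what keeps the bars aligned with their labels.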

Discussion

Despite efforts to improve road safety, the number of crashes in Philadelphia has not been decreasing. The numbers of intersection and non-intersection crashes are roughly equal, and this pattern is consistent across years. This is interesting because the media has framed intersection accidents as more frequent than non-intersection accidents, but the graph shows that the difference is small and that neither number has fallen.

Crashes by the time of the day

Motivation

Time (hour of day and weekday versus weekend) might affect the occurrence of traffic crashes. I picked a faceted line plot because hour of day is continuous and the number of crashes changes continuously with it, whereas weekday/weekend is categorical and suits separate facets. I was also interested in whether time affects the location of crashes (at an intersection or not), and this kind of plot can show several groups together.

Specifically, I used seaborn rather than matplotlib for this plot because seaborn's syntax is more declarative and it is well suited to visualizing relationships between several variables. With seaborn, I no longer have to extract lists of values from the dataset. Because the variables have more levels here (hour_of_day has 24 values) than in the previous plot (year has only 7), I did not want to risk scrambling their order while extracting them from the dataframe into lists, which could lead to telling the wrong story. I also tried doing the transformation (grouping and counting) within seaborn, but the dataset was too large. I therefore grouped the data beforehand to make the dataset smaller before passing it to seaborn.
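The pre-aggregation step might look roughly like the sketch below. It assumes a pandas DataFrame with one row per crash; the column names `day_type`, `intersection`, and `crash_count`, the sample rows, and the helper `plot_hourly` are hypothetical and not taken from the original notebook.

```python
import pandas as pd

# Hypothetical raw data: one row per crash.
crashes = pd.DataFrame({
    "hour_of_day": [8, 8, 8, 17, 17, 8],
    "day_type": ["weekday", "weekday", "weekend", "weekday", "weekend", "weekday"],
    "intersection": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Collapse to one row per (hour, day type, intersection) group before plotting.
summary = (
    crashes
    .groupby(["hour_of_day", "day_type", "intersection"])
    .size()
    .reset_index(name="crash_count")
)


def plot_hourly(data):
    """Faceted line plot: one panel per day type, one line per intersection group."""
    import seaborn as sns  # imported here so the aggregation runs without it

    return sns.relplot(
        data=data, kind="line",
        x="hour_of_day", y="crash_count",
        hue="intersection", col="day_type",
    )
```

Because `groupby` sorts its keys by default, the hours stay in order without any manual list handling, and seaborn only receives the small summary table.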

Discussion

The plot shows that the number of crashes is related to the hour of the day and to whether it is a weekday or a weekend. Generally, crashes at intersections become more frequent after 7 am and less frequent after 10 pm; however, this trend varies slightly between weekdays and weekends, as the weekend curve is flatter. This is interesting because one would expect more pedestrians on the street and more traffic during weekday peak hours, which might explain the higher number of intersection accidents around 8 am and 4 pm. On the weekend, traffic activity is steadier throughout the day. Note that the number of crashes presented is an overall total rather than a daily average, so we cannot conclude that fewer crashes happen on a weekend day than on a weekday.
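One way to make the weekday and weekend curves comparable would be to divide each total by the number of days it covers. The sketch below uses invented totals and assumes every week contributes five weekdays and two weekend days; the names are hypothetical, and this normalization was not part of the original analysis.

```python
# Hypothetical totals for a single hour of the day, summed over the whole period.
total_crashes = {"weekday": 700, "weekend": 300}

# Assumption: every week contributes 5 weekdays and 2 weekend days.
days_per_week = {"weekday": 5, "weekend": 2}


def per_day_rate(totals, days):
    """Divide each total by the number of days of that type in a week."""
    return {kind: totals[kind] / days[kind] for kind in totals}


rates = per_day_rate(total_crashes, days_per_week)
```

With these made-up numbers the weekend total is lower, yet the weekend rate per day is higher, which illustrates why the raw totals cannot be compared directly.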

More on the crashes

Motivation

Knowing that at-intersection crashes are more frequent between 7 am and 10 pm, I wondered whether this is related to pedestrian activity on the street. I also wondered whether this trend is consistent across years. I therefore built a dashboard that combines a bar chart of the distribution of at-intersection and non-intersection crashes for each year with a line plot of the average number of pedestrians involved in crashes for each hour of the day.

Compared to matplotlib and seaborn, Altair is the most suitable for displaying complicated relationships and data because it requires the least code. For example, I calculated the daily value from the yearly value within the Altair syntax in one line (byDay='( datum.ped_count / 365)'). It also supports interactive features: here, it lets readers examine the hourly trend of pedestrian-involved crashes for each year.
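The one-line calculation can be sketched with Altair's `transform_calculate`, shown next to a plain-Python equivalent of the same arithmetic. The function names, the chart layout, and the assumption that `df` is a DataFrame with `hour_of_day` and `ped_count` columns are illustrative; the original dashboard may be structured differently. Dividing by 365 also slightly overstates the daily value for leap years such as 2012 and 2016.

```python
DAYS_PER_YEAR = 365  # same divisor as in the text; leap years are ignored


def daily_average(yearly_count):
    """Plain-Python equivalent of the expression datum.ped_count / 365."""
    return yearly_count / DAYS_PER_YEAR


def build_chart(df):
    """Line chart of the per-day value, computed inside the Altair spec."""
    import altair as alt  # imported here so the arithmetic runs without it

    return (
        alt.Chart(df)
        .transform_calculate(byDay="datum.ped_count / 365")
        .mark_line()
        .encode(x="hour_of_day:O", y="byDay:Q")
    )


# Hypothetical yearly pedestrian count for one hour of the day.
example = daily_average(730)
```

Keeping the division inside the chart specification means the source table stays untouched and the derived field `byDay` exists only for plotting.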

Discussion

Like the total number of crashes, the average number of pedestrians involved in crashes has not decreased over the years. Crashes that involve pedestrians happen more often at intersections than elsewhere, especially during the morning and evening peak times. This pattern is generally consistent from 2011 to 2017.

Motivation

From the previous charts, we learned that crashes are more likely to happen during the day and at intersections, where pedestrians are likely to be involved. Could crashes be related to the visibility of the surroundings, which is influenced by weather? To find out, I picked a heat map to illustrate the relationship between weather, month, and the number of crashes.

Discussion

We can see that although rainy and snowy days in the colder months are associated with a higher number of crashes, most crashes happen when there are no adverse conditions. This suggests that crashes generally occur under normal weather, probably for reasons such as poor visibility of pedestrians or careless driving. Note that a number of crashes are recorded under "snow" during the warmer months, which suggests that there might be some errors in the data.
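A quick check for the suspicious "snow" records could look like the sketch below. The record layout, the sample data, the helper name `suspicious_snow`, and the choice of June through September as warm months are all assumptions for illustration.

```python
# Hypothetical (month, weather) pairs, one per crash; month is 1-12.
records = [(1, "snow"), (2, "snow"), (7, "snow"), (7, "clear"), (8, "rain"), (6, "snow")]

# Assumption: June through September are months where snow is implausible.
WARM_MONTHS = {6, 7, 8, 9}


def suspicious_snow(rows, warm_months=WARM_MONTHS):
    """Return the records labeled "snow" that fall in a warm month."""
    return [(month, weather) for month, weather in rows
            if weather == "snow" and month in warm_months]


flagged = suspicious_snow(records)
```

Listing the flagged records makes it possible to inspect them individually before deciding whether they are coding errors.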

Motivation

Traffic control devices are common measures to help direct traffic and ensure safety. Different kinds are used on different types of street. To investigate their relationship with crashes, I chose a stacked bar chart to show whether there is a pattern.

Discussion

As the chart shows, most intersections employ a control device, and many crashes happened at intersections with a traffic signal. However, because we do not have data on overall traffic, we lack a comparison group of cases in which traffic control devices were not associated with crashes. Still, looking at the data across time, we can see that the share of crashes with control devices has not decreased over the years. Although the relationship cannot be deemed causal, this suggests that the number of crashes at signalized intersections has not been improving and that the measures in place to mitigate such crashes have not been effective. Interestingly, the share of crashes related to traffic signals at non-intersections decreased from 2011 to 2017. These could be crashes at railroad crossings, on- and off-ramps, or roundabouts, which are not classified as intersections. However, we cannot draw definite conclusions because of the limitations of the data.