Introduction

The public sector has grown reliance on machine learning algorithms in forecasting where crimes will happen and thereby allocating police officers strategically in the right time and space to prevent crime. Though the intentions of these “predictive policing” tools are to reduce cost and protect public safety, pathologies in data collection and machine learning models can lead to very biased outcomes. This analysis will examine the accuracy and generalizability of our predictive policing model on Cannabis Possession crime in Chicago.

Unlawful narcotics (cannabis/marijuana) possession records from Chicago in 2017

Police routinely arrest millions of people for having marijuana every year but the arrests disproportionately target Black and Latinx people, according to an American Civil Liberties Union (ACLU) article^1. A black person is 3.73 times more likely to be arrested than a white person. The media criticized such policing techniques to be over-policing, wasting time and money, staggering racial bias, and driving mass criminalization^2 that were never meant to increase public safety in the first place. As the marijuana legalization campaigns become more and more heated, now is the time to examine the predictive policing algorithms.

The key to such algorithm is the Modus Operandi (MO) of the convicts which is the particular manner in which a crime is committed. Cannabis-related crimes could be soliciting, possessing, selling, and many more, each has different MO. To find the direct relationship between the MO and the crime, we narrowed cannabis-related crimes to cannabis possession, so that the features used for predicting risks will purposefully lead the police to find those or prevent those from possessing cannabis.

We also chose cannabis possession because the conviction record is likely to suffer from selection bias of the police. To arrest or find a person with cannabis possession, the police will have to see the person having physical control of the drugs - the person will have to have the drugs in their hands or on them. As one would imagine that those having drugs would not expose themselves in front of the police, the police must have some “intuition” to approach certain people to search their body and then find their drugs possession. Those “intuition” might contribute to the selection bias when the police are deciding which areas to patrol or who to search. Connecting this to the ACLU articles, the areas are possibly the lower-income, high crime, and majority non-white neighborhoods.

From the Chicago crime records in 2017 data, we collected data points for the records of “possessing Cannabis” regardless of the mass possessed.

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
## Reading layer `OGRGeoJSON' from data source `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON' using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## geographic CRS: WGS 84
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)
## Reading layer `OGRGeoJSON' from data source `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON' using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## geographic CRS: WGS 84
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

# Primary Type is NARCOTICS and particularly the illegal possession Cannabis (aka Marijuana, weed).
crime_cannabis <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "NARCOTICS" & str_detect(Description, "POSS: CANNABIS") == T) %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 
## Reading layer `chicagoBoundary' from data source `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## geographic CRS: WGS 84

Figure 1 is a map that shows the locations of unlawful possession of Cannabis in Chicago 2017 that are recorded and published by the police department. It is observed from the map that there are some clear cannabis possession hotspots in Chicago, as the points are heavily clustered at the west side and middle-south side of the city. This indicates possible selection bias, that the police might be targeting the hotspots to search for potential convicts and therefore resulted in higher number of records.

# 1. A map of your outcome of interest in point form
ggplot(crime_cannabis[chicagoBoundary,]) +
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  geom_sf(data = crime_cannabis, 
          show.legend = "point", size = .75) +
  labs(title="Location of Unlawful Cannabis Possession, Chicago 2017",
       caption="Figure 1. Map of Unlawful Cannabis Possession in Chicago, 2017") +
  mapTheme()

Narcotics (Cannabis) possession conviction distribution in fishnet grid

When considering crime risk, it is most ideal to set the scale as smooth and even as possible to avoid scale bias. Finding a high risk of crime at a neighborhood level is not as helpful as finding a high risk of crime at a block level. Setting the area cells to be too small is also not helpful and less likely to be accurate, as it undermines the range of normal human activity. Therefore, we set the fishnet grid cell to be 500ft by 500 ft and aggregated the count of cannabis possession by grid cell. Figure 2 illustrates the cannabis possession distribution across Chicago by fishnet grid, with bright yellow color highlighting the grids that have the highest number of unlawful cannabis possession crime records. This is consistent with the point density map (Figure 1) that the west side and middle-south side of the city are the hotspots for the criminal activities that had been discovered by the police.

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(crime_cannabis) %>% 
  mutate(countCrime = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countCrime = replace_na(countCrime, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE)) #for cross validation

# 2. A map of your outcome joined to the fishnet
ggplot() +
  geom_sf(data = crime_net, aes(fill = countCrime), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Unlawful Cannabis Possession Convictions for the fishnet",
       caption="Figure 2. Map of Aggregated Count of Unlawful Cannabis Possession Convictions by Fishnet grid, Chicago 2017") +
  mapTheme()

Risk factors

Features that would be helpful in predicting the location of illegal cannabis possessors include 311 reports of abandoned cars, street lights out, graffiti remediation, sanitation complaints, and abandon buildings, so long with the location of retail stores that sell liquor to go, location of police stations, location of affordable rental housings, and location of public schools (with 9-12 grades).

  • The hypothesis for the factor(s) is that:
    • Having abandoned cars, street lights out, graffiti, sanitation issues, and abandoned buildings are signs of disrepair, thus may contribute to a higher number of police patrol and/or crime count. However, the data are from 311 complaints, meaning that the people in that area care about these signs of disrepair.
    • Being near liquor stores indicates a higher number of police patrol and/or crime count, since alcohol and drugs are often consumed together.
    • Being near police stations indicates a lower number of crime count, as criminals would be cautious not to be caught by police.
    • Being near affordable rental housing indicates a higher number of police patrol and/or crime count, as there is a normative belief of the positive relationship between low-income minorities and drug consumption.
    • Being near local public high schools suggests a higher number of police patrol and/or crime count, as there is a normative belief that public high school students sell and deal drugs that are in particular, cannabis.

Note that all data collected are recorded for the year of 2017 except for the liquor license data which is for all years. Even though the dataset is described to have data from 2006 until present, all start dates available are after 2017, probably because the licenses are being renewed. It is assumed in this model that the stores locations are consistent over time.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

policeStation <-
  read.socrata("https://data.cityofchicago.org/resource/z8bn-74gv.json") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Police_Station") #2016

#The rental housing developments listed below are among the thousands of affordable units that are supported by City of Chicago programs to maintain affordability in local neighborhoods. No specific dates in dataset.
affordableRental <-
  read.socrata("https://data.cityofchicago.org/resource/s6ha-ppgi.json") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Affordable_Rental_Housing") 

# Public Schools in Cook County, SY2017-18
publicSchools <-
  read.socrata("https://data.cityofchicago.org/resource/d2h8-2upd.json") %>%
    filter(str_detect(grades,"12")==T) %>%
    dplyr::select(Y = lat, X = long) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Public_Schools") 


## Neighborhoods to use in LOGOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
## Reading layer `chicago' from data source `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## geographic CRS: WGS 84

The risk factors have been wrangled to the fishnet that each fishnet grid cell has the aggregated cannabis possession count.

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, policeStation, affordableRental, publicSchools) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

#do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet", bottom = "Figure 3"))

Although the counts by grid cell is valuable, its spatial scale of exposure is relatively rigid. To gain a smoother exposure relationship across space, the average nearest neighbor distance to the features from each cell centroid has been calculated. We have iteratively tried different numbers (k) of nearest neighbors and settled with the number that has resulted in the least prediction errors and the most generalizability, while the number is 1.

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),1),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),1),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),1),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),1),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),1),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),1),
      PoliceStation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(policeStation),1),
      Affordable_Rental.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(affordableRental),1),
      PublicSchools.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(publicSchools),1))

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

#do.call(grid.arrange,c(mapList, ncol = 4, top = "Nearest Neighbor risk Factors by Fishnet", bottom = "Figure 4"))

loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

After multiple rounds of testing of the correlation between features and crime count, nine risk factors, including either the count or the nearest neighbor distance by grid cell, are selected to be put in the prediction models. Their spatial distribution by fishnet is shown in Figure 3. We can observe that while abandoned buildings are clustered at the east and middle-south sides of the city as displayed in bright yellow color - where the cannabis convictions hotspots are - other features such as liquor retail and graffiti complaints are not clustered at the cannabis possession hotspots. In addition, other features including the nearest distance to police stations and affordable rental housings are rather evenly distributed across the city, but there are areas that are significantly closer (displayed in dark blue color) or farther away (displayed in bright yellow color) from the nearest selected features. Their effects in predicting cannabis possession risk will be demonstrated through a correlation plot (Figure 7) below.

# 3. A small multiple map of your risk factors in the fishnet (engineered)
final_vars_net.long <- 
  dplyr::select(vars_net, Abandoned_Buildings, Abandoned_Cars.nn, Graffiti.nn, 
              Liquor_Retail.nn, Street_Lights_Out, Sanitation, PoliceStation.nn, Affordable_Rental.nn, PublicSchools.nn) %>%
    gather(Variable, value, -geometry)

vars <- unique(final_vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Risk Factors by Fishnet", bottom = "Figure 3"))

Spatial process of cannabis convicts locations

Neighborhood and police districts are spatially joined to grid cells using grid cell centroids.

# Join risk factors
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Join neighborhoods and police districts
final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

To account for spatial process of cannabis possession count at more local scales, a Local Moran’s I is introduced to signify whether the cannabis possession count at a given location is randomly distributed relative to its immediate neighbors. When the p-value is smaller than or equal to 0.05, it means that the grid cell’s cannabis possession count is relative to its immediate neighbors. As displayed in the P_Value map in Figure 4, the smaller the p-value, the stronger the spatial correlation and the more significant the hotspots. The hospots are also displayed in bright yellow color in the Significant_Hotspots map in Figure 4. The ideal model must predict equally well for the hotspots and coldspots.

# 4. Local Moran's I-related small multiple map of your outcome
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countCrime, final_net.weights)),
    as.data.frame(final_net)) %>% 
    st_sf() %>%
      dplyr::select(Cannabis_Count = countCrime, 
                    Local_Morans_I = Ii, 
                    P_Value = `Pr(z > 0)`) %>%
      mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
      gather(Variable, Value, -geometry)
  
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Cannabis", bottom = "Figure 4"))

It is reasonable for police to patrol areas with or near high number of cannabis possession crime more often. Here, we get whether the grid cell is in the hotspot (cannabis.isSig), and derive the average nearest neighbor distance from each cell centroild to its nearest cannabis possession cluster (cannabis.isSig.dist) to represent the local spatial process in our model. The distribution is shown in Figure 5 below.

final_net <-
  final_net %>% 
  mutate(cannabis.isSig = 
           ifelse(localmoran(final_net$countCrime, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(cannabis.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, cannabis.isSig == 1))), 1))

ggplot()+
  geom_sf(data=final_net, aes(fill = cannabis.isSig.dist), colour=NA) +
      scale_fill_viridis(name="")+
  labs(title="Distance to highly significant cannabis convict locations hotsopt", 
       caption="Figure 5. Map of distance to nearest highly significant cannabis convict locations")+
  mapTheme()

Correlation tests

Correlation tests results, as shown in Figure 6, shows the linear relationship between selected risk factors and cannabis possession count, providing contexts to the prediction. To avoid colinearity, either the count or nearest neighbor feature with the higher adjusted r-square was selected for each risk factor.

# 5. A small multiple scatterplot with correlations
plot_vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "PoliceStation.nn", "Affordable_Rental.nn", "PublicSchools.nn", "cannabis.isSig")

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District) %>%
    gather(Variable, Value, -countCrime) %>%
    filter(Variable %in% plot_vars)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countCrime, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countCrime)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Cannabis conviction count as a function of risk factors",
       caption = "Figure 6") +
  plotTheme()

The variables are put into two groups. reg.vars has risk factors while reg.ss.vars includes the Local Moran’s I spatial process features.

reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "PoliceStation.nn", "Affordable_Rental.nn", "PublicSchools.nn", "loopDistance")

reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "PoliceStation.nn", "Affordable_Rental.nn", "PublicSchools.nn", "loopDistance", "cannabis.isSig")

Histogram of cannabis convicts per fishnet grid

The distribution of cannabis possession by grid cell is shown in Figure 7. Most grid cells contain no cannabis possession events and for the grid cells that contain cannabis possession events, the number of grid cells decreases as the contained crime events increases. Such distribution is identical to the Poisson distribution. Thus, a Poison Regression is used to predict cannabis possession risk.

# 6. A histogram of your dependent variable
final_net %>%
  ggplot(aes(countCrime)) + 
    geom_histogram(bins = 30, colour="black") +
  scale_x_continuous(breaks = seq(0, 20, by = 1)) + 
    labs(title="Cannabis convict distribution",
         x="Number of cannabis possession convict in 1 fishnet grid", y="Count of grid",
         caption = "Figure 7. Poisson distribution of cannabis possession per grid") +
  plotTheme()

Cross Validation

The codes below are for the cross validate function which cross validates the prediction result of Poission regression model.

crossValidate <- function(dataset, id, dependentVariable, indVariables) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  
  regression <-
    glm(countCrime ~ ., family = "poisson", 
      data = fold.train %>% 
      dplyr::select(-geometry, -id))
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

The codes below cross validates the prediction. The k-fold method uses the cvID to swap out different grid cells. It trains the model with the not-chosen grid cells and uses the trained model to predict the results for the chosen cell.

The LOGO-CV (Leave-One-Group-Out cross validation) method uses the neighborhood name to swap out grid cells in different neighborhoods. It trains the model with the not-chosen neighborhoods and uses the trained model to predict the results for the chosen neighborhood.

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countCrime",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countCrime, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countCrime",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countCrime, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countCrime",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countCrime, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countCrime",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countCrime, Prediction, geometry)

The codes below summarize the cross validations results by calculating the errors between observed and predicted counts for each grid cell and for each regression model.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countCrime,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countCrime,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countCrime,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countCrime,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 

The Mean Average Error (MAE) for each fold across each regression model is visualized in Figure 8. The mean MAE and standard deviation of MAE by regression is summarized in Table 1. Generally, the distributions illustrate that the majority of the MAEs are smaller than 1. The table conveys that the mean errors are between 0.20 and 0.45. To make sense of the severity of the errors, recall that the count of cannabis possession is in the range of 0-20 per grid cell, according to Figure 2. This comparison suggests that the models are on the accurate side as the errors do not seem too severe overall.

As expected, adding the spatial process feature reduces errors, as there are less count of higher MAE for the models that include the spatial process feature (cannabis.isSig) for both cross validation methods. Thus for the hotspot and risk prediction comparison, we included the spatial process feature in our model. Comparing the two cross validation methods, predictions are more accurate across grid cells (k-fold) than across neighborhoods (LOGO-CV). This makes sense as the neighborhoods boundaries are arbitrary, but the neighborhoods cross validation might deliver more pertinent information to local residents and local experts.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countCrime, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", 
         subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count",
         caption = "Figure 8") 

# 8. A table of MAE and standard deviation MAE by regression
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "Table 1. MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#8FA108") %>%
    row_spec(4, color = "black", background = "#8FA108") 
Table 1. MAE by regression
Regression Mean_MAE SD_MAE
Random k-fold CV: Just Risk Factors 0.25 0.21
Random k-fold CV: Spatial Process 0.20 0.19
Spatial LOGO-CV: Just Risk Factors 0.45 0.48
Spatial LOGO-CV: Spatial Process 0.29 0.24

Despite the interpretation that the errors are not severe, any number differences between predicted counts and observed counts means a chance of inaccurate allocation of police resources. To understand the predictions more, Figure 9A and 9B visualizes the LOGO-CV and k-fold errors spatially.

Referred to Figure 9A, the two maps on the left visualize where the higher errors occur across neighborhoods in bright yellow color, and the two maps on the right visualize where the higher errors occur across grid cells also in bright yellow color. The LOGO-CV maps show that the largest errors are in the hotspot locations, while the pattern for the largest errors in the k-fold maps are not obvious. The problem with Figure 9A is that some of the maps are plotted using different scales for MAE. Figure 9B makes up for the problem and shows the scales consistently across maps, which is better for comparisons. However, the k-fold maps show limited MAE variations. For these reasons, we kept both of the figures here. LOGO-CV cross validation provides more indicative information for analyzing the model and is thus used for the next section.

# 7. A small multiple map of model errors by k-fold and spatial cross validation
which <- c("Spatial LOGO-CV: Just Risk Factors", "Spatial LOGO-CV: Spatial Process", "Random k-fold CV: Just Risk Factors", "Random k-fold CV: Spatial Process")
which_list <- list()

for(i in which){
  which_list[[i]] <- 
    ggplot() +
      geom_sf(data = filter(error_by_reg_and_fold, Regression == i), 
              aes(fill = MAE), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + 
      theme(legend.position="bottom",
            plot.title = element_text(size = 7))}

do.call(grid.arrange,c(which_list, ncol = 4, top = "Cannabis Possession Risk MAE by LOGO-CV and k-fold spatial cross validation", bottom = "Figure 9A"))

error_by_reg_and_fold %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Cannabis Possession Risk MAE by LOGO-CV and k-fold spatial cross validation",
         caption = "Figure 9B") +
    mapTheme() + theme(plot.title=element_text(hjust=0.5),legend.position="bottom")

To provide more evidence that the spatial process feature helped account for spatial variation in cannabis possession convicts, a new neighborhood.weights spatial weights matrix is calculated. The Morans I using the new neighborhood level weights proves that including the spatial feature improves the prediction by accounting for spatial variation, though some still exists.

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]])
## # A tibble: 2 x 3
##   Regression                         Morans_I p_value
##   <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors   0.348    0.001
## 2 Spatial LOGO-CV: Spatial Process     0.0201   0.292

To understand how severe the errors are across areas with high cannabis possession crime rates (hotspots) and areas with low cannabis possession crime rates (coldspots), the mean predicted counts (latent risks) are compared to the observed counts by cannabis possession crime rates in decile. Figure 10 illustrates clearly that the models tend to over-predict cannabis possession crime for the coldspots and areas in general; and severely under-predict for the hotspots. This reflects how policing will be allocated inefficiently using the model - while there will be not enough police powers in the hotspots, there will be excessive police powers in the general areas.

st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
    mutate(burglary_Decile = ntile(countCrime, 10)) %>%
  group_by(Regression, burglary_Decile) %>%
    summarize(meanObserved = mean(countCrime, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -burglary_Decile) %>%          
    ggplot(aes(burglary_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = burglary_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Predicted and observed cannabis convict by observed cannabis convict decile",
           caption = "Figure 10")

The reason we picked cannabis-related crime was because of the infamous racial selection bias as explained previously. To examine if our model generalize across different race context, we collected census tract level racial profile data and calculate the errors by neighborhoods’ racial context. We also want to see how well the predictions generalize across time, so we use 2018 census data to represent the racial profile. There is no evidence that the crime patterns have changed drastically between 2017 and 2018, so it is safe to assume that the crime pattern is consistent.

If the prediction model confirms with the media’s critiques, the models will over-predict in Minority areas and under-predicts in majority white areas. In fact, all models under-predict in minority areas and over-predict in majority white areas, as shown in Table 2.

# 9. A table of raw errors by race context for a random k-fold and spatial cross validation
tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

reg.summary %>% 
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Table 2. Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#8FA108") %>%
    row_spec(4, color = "black", background = "#8FA108") 

Model comparison: risk vs. hotspot

Hotspot policing was commonly used by police departments as they can use it to identify areas where the crimes are most concentrated. With this information, they can easily deploy more police forces to those areas. If our risk prediction model is more accurate than the hotspot / Kernel Density model, the risk prediction model would be more helpful for allocating police resources.

To know which model is better, we want to see which model’s 2017 cannabis possession risk categories overlay more accurately with the 2018 cannabis possession data points. Figure 11 visualizes the comparison by showing the Kernel Density and Risk Predictions (with LOGO-CV spatial feature)’s risk categories side by side, both overlaid by the 2032 observed cannabis possession points in the year of 2018. The better model should show a more accurate alignment between the risk category and observed crimes. For example, for yellow colored, 90%-100% high risks areas, there should ideally be dense, black points of observed 2018 cannabis possession crimes on top, and vise versa.

It is clear that the risk predictions’ risk categories, as represented by five different colors, are smoother, curvier, and more aligned with the crime data points, while the Kernel Density’s risk categories, as represented by the same five colors, are chunkier and less aligned with the crime data points, especially for the hotspots. However, we need to look more closely to reach a conclusion.

# 10. The map comparing kernel density to risk predictions for the 2018 crime
# setting up dataset for crime_cannabis in 2018
cannabis18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "NARCOTICS" & str_detect(Description, "POSS: CANNABIS") == T) %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct() %>%
  .[fishnet,]

# calculating kernel density for crime_cannabis in 2017 and adds count of cannabis2018
cann_ppp <- as.ppp(st_coordinates(crime_cannabis), W = st_bbox(final_net))
cann_KD <- spatstat::density.ppp(cann_ppp, 1000)

cann_KDE_sf <- as.data.frame(cann_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(cannabis18) %>% mutate(cannCount = 1), ., sum) %>%
    mutate(cannCount = replace_na(cannCount, 0))) %>%
  dplyr::select(label, Risk_Category, cannCount)

#repeat for risk prediction
cann_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(cannabis18) %>% mutate(cannCount = 1), ., sum) %>%
      mutate(cannCount = replace_na(cannCount, 0))) %>%
  dplyr::select(label,Risk_Category, cannCount)

# map comparing kernel density to risk predictions for the 2018 crime
rbind(cann_KDE_sf, cann_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(cannabis18, 2032), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 Unlawful Cannabis Possession risk predictions; 2018 Unlawful Cannabis Possession",
         caption = "Figure 11") +
    mapTheme()

Since it might be impossible to interpret the accuracy with confident from Figure 11, a well-fit model (Figure 12) shows how well both models predict cannabis possession risk per fishnet grid cell by comparing their predictions side by side, with the 2018 rate of cannabis possession being the independent variable. It shows that the risk prediction model (in yellow) captures a greater share of 2018 cannabis possession in the 50%-69% risk category and a smaller share in the 70%-89% risk category, while the predictions for other risk categories are the same for both models. There is little evidence of one model out-performing another. From the information we have derived (Figure 11 and Figure 12), it is hard to conclude which model is better.

# 11. The bar plot making these comparisons
rbind(cann_KDE_sf, cann_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countCann = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countCann / sum(countCann)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2018 Unlawful Cannabis Possession",
           caption = "Figure 12") +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Conclusion

To predict the latent risk of cannabis possession and allocate police power efficiently based on the prediction, we have developed versions of risk prediction model for cannabis possession from 2017 Chicago data. We have compared the predictions to recorded 2018 Chicago cannabis possession data. Even though the model has accounted for spatial process as well as balanced accuracy and generalizability across different areas and contexts, the results indicate that the model is not predicting equally well across hotspots and racial contexts. Even though it is not over-predicting the crime risk for the hotpots (Figure 10) or the minority areas (Table 2), which are the common critiques for predictive policing algorithms, any biased prediction is unwelcome since it would leave unwanted consequences in the real world.

In addition, we cannot guarantee that the crime data we used to build and test our models have not suffered from selection bias of police records. If law enforcement has systematically over-police certain communities and thus recorded more criminal activities for those areas, our prediction models would produce biased predictions which represent the recorded cannabis possession risk rather than the true cannabis possession risk. Therefore, we do not recommend our algorithm to be put into production in the real world yet, as our risk factors are still oversimplified and the generalizability is not ideal.

Moreover, predictive policing might be a good way to allocate police resources, but it is not the right approach to address social disorder, as there is no evidence that policing prevents crimes. We should keep looking for other more effective approaches that are nonpunitive and rehabilitative to curb crimes and aid communities^3.

Sources: 1. https://www.aclu.org/issues/smart-justice/sentencing-reform/war-marijuana-black-and-white

  1. https://www.aclu.org/blog/criminal-law-reform/drug-law-reform/marijuana-legalization-racial-justice-issue

  2. Green, B., 2019. The Just city: machine learning’s social and political foundations, in: The Smart Enough City: Putting Technology in Its Place to Reclaim Our Urban Future. The MIT Press, Cambridge, Massachusetts.