Introduction

Zillow’s housing price prediction tool helps its users see the estimated value of their home with ease, based on algorithms using available determinants of sale prices such as the number of beds, number of bathrooms, and square footage. With the availability of huge volume of data, such algorithms aim to provide stakeholders such as house buyers, real estate agents, and developers with useful information to aid their decision-making process. While its predictions are helpful, Zillow always strive to improve and update their predictive models for higher accuracy and generalizability. Inaccurate predictions will impact the reputation of the company and financial consequences. Hence, our team has developed a predictive model that factors in available local intelligence in hopes of delivering a better model for home prices prediction.

After brainstorming and researching the factors that influence home prices, our team gathered publicly-available data on the homes in Miami and Miami Beach which are two cities in Florida. For simplicity, we refer them collectively as Miami. We used linear (OLS) regression as the basis for our model, factoring in variables of internal characteristics of the house, local amenities, and spatial structure of the location. Our modeling strategy is to include the variables that explain the most for the prediction and represent local intelligence. However developing an all-round omnibus statistical model that take into consideration of all these aspects while balancing the criteria of accuracy and generalizability proved to be a challenging task, in part because of the overfitting of variables. Therefore, as our results suggest, we would not recommend it as-is to Zillow since there are high variances and distortions of the true value sale price due to overfitting or misspecification of variables including local amenities, zoning and neighborhoods, which led to sale price predictions that generalize well with high-income neighborhoods but not others (which have high errors or inaccuracy). For these reasons, even if the adjusted R-square were very high (in our case 88.4% for the trained model), it will not be helpful when predicting new unseen data. Alternative solutions to re-specify and improve this model has been proposed and discussed in detail.

Data and Methods

Initial Data

Knowing what information could influence home prices is the first step for our project. We have researched and briefly interviewed domain experts to draft a list of potential factors as our basis to look for data. Then, we retrieved available data from Miami Open Data Portal, Miami-Dade County Open Data Hub, ArcGIS Hub, Open Street Map, and Census Bureau. Variables of Interests that could be used to predict sale prices (SalePrice) include:

  • Internal characteristics/ Customer Preferences:
    • Number of Bedroom
    • Number of Stories
    • Actual Square Footage of Property
    • Age of building
    • Having luxury pool
    • Having elevator
    • Having Tennis Court
  • Amenities/Public Services
    • Distance to Transit
    • Distance to Public Schools
    • Distance to Parks
    • Distance to graveyard
    • Distance to hospital
    • Distance to nursing home
    • Distance to resort beach
    • Distance to Crime
    • Distance to Water
  • Spatial Structure
    • Zoning
    • Neighborhood
    • Median income of census tract
    • Price lag

Some variables such as Distance to Work were dropped due to unavailability of data or redundancy (when they explain the same variations observed in sale prices)

# Codes for data gathering
# directories for both of our computers
#directory <- "C:\\Users\\m1861\\OneDrive - PennO365\\MUSA 508\\Assignments\\Assignment 2 Housing Price Prediction\\MiamiHomePricePredict"
directory <- "C:\\Users\\David\\Desktop\\508_assign2\\MiamiHomePricePredict\\"

# base map; filtered Miami and Miami beach
MunicipalBoundaries <- st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson")%>% 
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
  st_union() %>%
  st_as_sf()

MiamiBeach <- st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson")%>% 
  filter(NAME == "MIAMI BEACH") %>%
  st_transform('ESRI:102258') %>%
  mutate(LABEL=NAME)

# Internal characteristics
# home price and property-based data
MiamiProperty       <- st_read(file.path(directory,"studentsData.geojson")) %>%
  st_transform('ESRI:102258') # use the state plane of Florida, which is NAD 1983 HARN StatePlane Florida East FIPS 0901
# Separating data for project and competition
Miami.training <- MiamiProperty %>%
  filter(toPredict == 0) %>%
  select(-toPredict)

# Amenities
# transform municipal shape for arcgis opendata
MunicipalBoundaries_gisdata <- MunicipalBoundaries %>%
  st_transform('ESRI:102258')

# metrorail stations - county -> city;
MRStations <- st_read("https://opendata.arcgis.com/datasets/ee3e2c45427e4c85b751d8ad57dd7b16_0.geojson")%>%
  st_transform('ESRI:102258') %>%
  st_as_sf() %>%
   .[MunicipalBoundaries_gisdata,] %>%
  dplyr::select(LAT, LON) %>%
  st_as_sf(coords = c("Lon", "Lat"), crs = 4326, agr = "constant") %>%
  distinct()

# Public schools - private schools tested to be insignificant
PublicSchools <- st_read("https://opendata.arcgis.com/datasets/d3db0fce650d4e40a5949b0acae6fe3a_0.geojson")%>%
  st_transform('ESRI:102258') %>%
  st_as_sf() %>%
   .[MunicipalBoundaries_gisdata,] %>%
  dplyr::select(LAT, LON) %>%
  st_as_sf(coords = c("Lon", "Lat"), crs = 4326, agr = "constant") %>%
  distinct()

# parks - county -> city
Parks <- st_read("https://opendata.arcgis.com/datasets/8c9528d3e1824db3b14ed53188a46291_0.geojson") %>%
  st_transform('ESRI:102258') %>%
  st_as_sf() %>%
   .[MunicipalBoundaries_gisdata,] %>%
  dplyr::select(LAT, LON) %>%
  st_as_sf(coords = c("Lon", "Lat"), crs = 4326, agr = "constant") %>%
  distinct()

# data from Open Street Map
# set up bounding box
st_bbox(MunicipalBoundaries)
xmin = as.numeric(st_bbox(MunicipalBoundaries)[[1]])
ymin = as.numeric(st_bbox(MunicipalBoundaries)[[2]])
xmax = as.numeric(st_bbox(MunicipalBoundaries)[[3]])
ymax = as.numeric(st_bbox(MunicipalBoundaries)[[4]])

# Examine amenities variables - variable names set to be mostly consistent with OSM names
graveyard <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'amenity', value = "grave_yard") %>%
    osmdata_sf()
graveyard <- graveyard$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

hospital <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'amenity', value = "hospital") %>%
    osmdata_sf()
hospital <- hospital$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

nursing_home <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'amenity', value = "nursing_home") %>%
    osmdata_sf()
nursing_home <- nursing_home$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

#** Water, beach tested to be insignificant, even after being classified.
water <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'natural', value = "water") %>%
    osmdata_sf()
water <- water$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

beach <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'natural', value = "beach") %>%
    osmdata_sf()
beach <- beach$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

beach_resort <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
    add_osm_feature(key = 'leisure', value = "beach_resort") %>%
    osmdata_sf()
beach_resort <- beach_resort$osm_points %>%
  .[MunicipalBoundaries,] %>%
  dplyr::select(geometry) %>%
  st_as_sf(crs = 4326, agr = "constant") %>%
  distinct() %>%
  st_transform('ESRI:102258')

# Spatial Structure
# neighborhood polygons of the City of Miami
neighborhoods <- 
  st_read("https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson") %>%
  st_transform('ESRI:102258')

# Joined Miami Beach as a neighborhood
neighborhoods <- rbind(
  dplyr::select(neighborhoods, LABEL),
  dplyr::select(MiamiBeach, LABEL)) 

# collect median income, median household income, and housing cost by tracts

miami_census <-  
  get_acs(geography = "tract", 
          variables = c("B06011_001E", #median inc
                        "B19001_001E", #HH inc
                        "B19013_001E", #median HH inc
                        "B25104_001E", #monthly housing cost
                        "B25105_001E"), #median monthly housing cost
          year=2018, 
          state=12, 
          county=086, survey = "acs5",
          geometry=T) %>%
  st_transform('ESRI:102258') %>%
  dplyr::select( -NAME, -moe) %>%
  spread(variable, estimate) %>% # transpose 
  rename(MedianInc = B06011_001,
         HHInc = B19001_001,
         MedHHInc = B19013_001,
         MonthlyHCost = B25104_001,
         MedMonthlyHCost = B25105_001)

Methods and Selecting Data

Our overall approach aims to improve accuracy while keeping generalizability in mind. We first separated our dataset into two with random seed: 60% for training and 40% for testing. Using the training dataset, we tested different combinations of variables or categorized variables to select the ones with the highest influence to prices. Then, we analyzed the result for both training set and testing set. The code is condensed from numerous testing with the training set.

set.seed(31357)
inTrain <- caret::createDataPartition(
              y = Miami.training$SalePrice, 
              p = .60, list = FALSE)
# split data into training and test
Miami.training_set <- Miami.training[inTrain,]
Miami.testing_set  <- Miami.training[-inTrain,] 

We will use the training data to predict the dependent variable, sale price, as accurately as possible. The sale price distribution in the training dataset is mapped in Figure 1. below. Each point represents the location of a property, with its sales price displayed in quintile breaks. The brighter the orange, the higher the price; whereas the brighter the green, the lower the price. We can see a general trend that the price increases as the distance to ocean decreases, which is why we looked into houses’ distances to water features and eventually included their distances to resort beach in our prediction model.

# Take a look at the properties locations and prices
ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = Miami.training_set, aes(colour = q5(na.omit(SalePrice))), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr2(Miami.training_set,"SalePrice"),
                   name="Quintile\nBreaks") +
  labs(title="Sales Price by Property, Miami",
       caption="Figure 1. Sales Price by Property, Miami") +
  mapTheme()

Internal Characteristics

For our analysis, we made raw data gathered from open data portals through various methods. For factors categorized in internal characteristics, we first selected key variables from all available characteristics for the property. After an initial regression analysis with the training set, we found the building age, number of beds, number of stories, square footage of the property, and several additional features to have significant influence on sales price. It makes sense that these features have effects on home-buying decisions. However, these characteristics have covariance and we can only select the best of each to avoid collinearity which contributes redundant information and distorts the coefficients and p-value estimates, and thus the predicted value of the sale price. We re-classified the number of bedrooms into less than 4, 4, and more than 4, as could be observed from the scatterplot (Figure 2.) that the relationship between price and bedrooms are very different depending on the category. We also observed that there are various additional features for the properties, and we extracted the ones with the highest influence. For these additional features, we represent them categorically. When a house possesses a feature, regardless of how many of that feature it has, the value of that variable will be 1, whereas conversely, the value will be 0.

  • After testing a couple of different combinations of variables, we settled with these as the internal characteristics:
    • Number of Bedroom (Bed.cat)
    • Number of Stories (Stories)
    • Actual Square Footage of Property (ActualSqFt)
    • Age of building (EffectiveAge)
    • Having luxury pool (LPool)
    • Having elevator (Elevator)
    • Having Tennis Court (TennisCourt)
# Selecting key independent variables for internal characteristics (exclude zoning, which is spatial properties)
# Selected zoning here to keep the data for spatial structure
MiamiInternal <- Miami.training_set %>%
  select(SalePrice, Bed, AdjustedSqFt, Stories, Zoning, YearBuilt, EffectiveYearBuilt, Units, LotSize, LivingSqFt, ActualSqFt, XF1, XF2, XF3) %>%
  filter(Bed > 0,
         Stories > 0) %>%
  mutate(EffectiveAge = 2020 - EffectiveYearBuilt,
         BedSqFt = ActualSqFt/Bed,
         StoriesSqFt = ActualSqFt/Stories,
         LPool = ifelse(str_detect(XF1,"Luxury Pool") | str_detect(XF2,"Luxury Pool") | str_detect(XF3,"Luxury Pool"), 1, 0),
         Elevator = ifelse(str_detect(XF1,"Elevator") | str_detect(XF2,"Elevator") | str_detect(XF3,"Elevator"), 1, 0),
         Gazebo = ifelse(str_detect(XF1,"Gazebo") | str_detect(XF2,"Gazebo") | str_detect(XF3,"Gazebo"), 1, 0), 
         TennisCourt = ifelse(str_detect(XF1,"Tennis Court") | str_detect(XF2,"Tennis Court") | str_detect(XF3,"Tennis Court"), 1, 0), 
         Patio = ifelse(str_detect(XF1,"Patio") | str_detect(XF2,"Patio") | str_detect(XF3,"Patio"), 1, 0)) %>%
  select(-YearBuilt, -XF1, -XF2, -XF3) %>%
  dplyr::select(SalePrice, Bed, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt, Zoning) %>%
  mutate(Bed.cat = case_when(
    Bed >= 0 & Bed < 4 ~ "Up to 4",
    Bed == 4 ~ "4",
    Bed >= 5 ~ "4+"
  )) %>%
  select(-Bed)

# plot bed because we categorized it
st_drop_geometry(MiamiInternal) %>% 
  dplyr::select(SalePrice, Bed.cat) %>% #4
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = 1) + 
  labs(title = "Price as a function of number of beds in property",
       caption = "Figure 2. Price distribution by category") +
  plotTheme()

Amenities/Public Services

Next, we made raw data for amenities meaningful for our analysis. Distance to amenities is one of the key factors influencing people’s preference for homes. With this assumption, we performed nearest neighbor function on amenities to get the average distance from each property to its k nearest neighbor amenities. For example, to know the average distance between a property and its three nearby public schools with our training set, we set k as 3 in the function.

From initial regression analysis, we were able to select the variables and the number of neighbors that have the most significant to housing sales price, with distance to graveyard, Metro station, parks, and public schools shown in Figure 3. Besides distance to parks, the sale price all increases as distance to the amenities increases. Private transportation and modern socioeconomic contexts, which we did not account for, probably played a part in shaping consumer preference and thus sale price. We also categorized distance to resort beach based on our observation from the data that the relationship between distance to resort beach and home prices matters the most when the distance is less than 1500 feet.

  • After testing a couple of different combinations of variables, we settled with these as the amenities variables:
    • Distance to Transit (MRStations_nn5)
    • Distance to Public Schools (PublicSchools_nn3)
    • Distance to Parks (Parks_nn5)
    • Distance to graveyard (graveyard_nn1)
    • Distance to hospital (hospital_nn1)
    • Distance to nursing home (nursing_home_nn1)
    • Distance to resort beach (beach_resort.cat)
# Nearest Neighbor Feature
st_c <- st_coordinates #is a function

# Name the training set miami.sf
miami.sf    <- MiamiInternal %>% 
  st_centroid(MiamiProperty) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102258')

miami.sf <-
  miami.sf %>% 
    mutate( 
      MRStations_nn5 = nn_function(st_c(miami.sf), st_c(MRStations), 5),
      PublicSchools_nn3 = nn_function(st_c(miami.sf), st_c(PublicSchools), 3),
      Parks_nn5 = nn_function(st_c(miami.sf), st_c(Parks), 5),
      graveyard_nn1 = nn_function(st_c(miami.sf), st_c(graveyard), 1),
      hospital_nn1 = nn_function(st_c(miami.sf), st_c(hospital), 1),
      nursing_home_nn1 = nn_function(st_c(miami.sf), st_c(nursing_home), 1),
      water_nn1 = nn_function(st_c(miami.sf), st_c(water), 1),
      water_nn2 = nn_function(st_c(miami.sf), st_c(water), 2),
      water_nn3 = nn_function(st_c(miami.sf), st_c(water), 3),
      water_nn4 = nn_function(st_c(miami.sf), st_c(water), 4),
      water_nn5 = nn_function(st_c(miami.sf), st_c(water), 5),
      beach_nn1 = nn_function(st_c(miami.sf), st_c(beach), 1),
      beach_nn2 = nn_function(st_c(miami.sf), st_c(beach), 2),
      beach_nn3 = nn_function(st_c(miami.sf), st_c(beach), 3),
      beach_nn4 = nn_function(st_c(miami.sf), st_c(beach), 4),
      beach_nn5 = nn_function(st_c(miami.sf), st_c(beach), 5),
      beach_resort_nn1 = nn_function(st_c(miami.sf), st_c(beach_resort), 1),
      beach_resort_nn2 = nn_function(st_c(miami.sf), st_c(beach_resort), 2),
      beach_resort_nn3 = nn_function(st_c(miami.sf), st_c(beach_resort), 3),
      beach_resort_nn4 = nn_function(st_c(miami.sf), st_c(beach_resort), 4),
      beach_resort_nn5 = nn_function(st_c(miami.sf), st_c(beach_resort), 5)
      )

#categorized water related features, saved the one with most rsquare, and let go of other water related variables
miami.sf <- miami.sf %>%
  mutate(water.cat = case_when(
    water_nn3 > 0 & water_nn3 < 1500 ~ "0-1500",
    water_nn3 >= 1500 ~ ">1500"
  ),
  beach.cat = case_when(
    beach_nn4 > 0 & water_nn3 < 1500 ~ "0-1500",
    beach_nn4 >= 1500 ~ ">1500"
  ),
  beach_resort.cat = case_when(
    beach_resort_nn1 > 0 & beach_resort_nn1 < 1500 ~ "0-1500",
    beach_resort_nn1 >= 1500 ~ ">1500"
  )) %>%
  select(-water_nn1, -water_nn2, -water_nn3, -water_nn4, -water_nn5, -beach_nn1, -beach_nn2, -beach_nn3, -beach_nn4, -beach_nn5, -beach_resort_nn1, -beach_resort_nn2, -beach_resort_nn3, -beach_resort_nn4, -beach_resort_nn5, -beach.cat, -water.cat)
# Regression plot used for visualizing relationship between variables and price
#** Chose 4 from all open data variables
st_drop_geometry(miami.sf) %>% 
  dplyr::select(SalePrice, PublicSchools_nn3, graveyard_nn1, Parks_nn5,MRStations_nn5) %>% #4
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + 
  ylim(0,5000000)+
  geom_smooth(method = "lm", se=T, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of continuous variables representing distance to amenities",
       caption = "Figure 3. Four home price correlation scatterplots") +
  plotTheme()

Spatial Structure

To account for local intelligence, we considered neighborhood and zoning in which the property is located, the economic indicators of the residents of the tracts in which the property is located, and nearby housing prices as well. We performed spatial join for these features to the property, that is, joining the features if their polygons intersects with the geography location of the property points. Note that due to unavailability of neighborhoods data for Miami Beach, we considered Miami Beach as a neighborhood.

  • After testing a couple of different combinations of variables with our training set, we settled with these as the amenities variables:
    • Zoning (Zoning)
    • Neighborhood (Neighborhood)
    • Median income of census tract (MedianInc)
    • Price lag (lagPrice)
# join neighborhood(name is "LABEL") to property points
miami.neigh <- st_join(miami.sf, neighborhoods, join = st_intersects) %>%
  mutate(Neighborhood = LABEL) %>%
  select(-LABEL)

# Joining census data to training data, tracts to points
miami.sf.spat <- st_join(miami.neigh, miami_census, join = st_intersects)

# Accounting for price lag
k_nearest_neighbors = 5
coords <- st_coordinates(miami.sf.spat) #prices
neighborList <- knn2nb(knearneigh(coords, k_nearest_neighbors))
spatialWeights <- nb2listw(neighborList, style="W")
miami.sf.spat$lagPrice <- lag.listw(spatialWeights, miami.sf.spat$SalePrice)

Statistical Summary of Final Data, Variables

After the applying the methods of testing described above, we now have the final set of variables for our model shown in Table 1. There are 1575 observations. The mean and standard deviations of binary variables such as LPool, Elevator and TennisCourt can be ignored as they are categorical variables that take only the value 0 or 1 to indicate whether the house possesses the feature(s). Lag price here is the average price of 5 nearest neighbors (homes) of a given home.

miami.data <- st_drop_geometry(miami.sf.spat) %>%
  select(Bed.cat, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt, MRStations_nn5, PublicSchools_nn3, Parks_nn5, graveyard_nn1, hospital_nn1, nursing_home_nn1, beach_resort.cat, Zoning, Neighborhood, MedianInc, lagPrice)

stargazer(miami.data,
          type="html",
          title="Table 1. Summary Statistics of Variables")
Table 1. Summary Statistics of Variables
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
Stories 1,575 1.196 0.428 1 1 1 4
ActualSqFt 1,575 2,355.571 1,924.460 388 1,386.5 2,558 20,192
EffectiveAge 1,575 47.044 26.485 1 24 70 115
LPool 1,575 0.014 0.117 0 0 0 1
Elevator 1,575 0.017 0.127 0 0 0 1
TennisCourt 1,575 0.001 0.036 0 0 0 1
MRStations_nn5 1,575 4,882.939 2,300.224 1,045.526 3,060.729 6,260.651 12,213.010
PublicSchools_nn3 1,575 1,253.809 688.308 222.409 825.025 1,425.857 4,323.684
Parks_nn5 1,575 1,011.403 364.428 288.705 723.604 1,276.840 2,327.688
graveyard_nn1 1,575 3,294.287 2,229.624 102.002 1,552.367 4,547.781 10,739.620
hospital_nn1 1,575 2,207.346 1,250.296 62.345 1,249.407 3,020.248 6,153.574
nursing_home_nn1 1,575 4,198.133 2,392.807 85.743 2,378.120 5,620.675 11,846.810
MedianInc 1,575 32,423.830 23,177.360 12,284 17,707 43,989 91,949
lagPrice 1,575 991,797.800 1,603,768.000 97,000 313,040 1,020,700 15,724,000

Correlation Matrix

We examined the correlation of variables to make sure that the variables do not have strong correlation among themselves, as multicollinearity would lead to distorted estimation of coefficients, and hence sale price. Figure 4 illustrates the Correlation Matrix. The more intense the color, the higher the correlation. Note here that even though distance to transit and distance to nursing home has relatively high correlation, we do not believe that they are redundant as they represent two separate amenities. Generally, it is recommended that correlation > 0.8 or < -0.8 be removed (Devore & Jay, 2008).

# Correlation between variables
final_var <- miami.sf.spat %>%
  select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt, MRStations_nn5, PublicSchools_nn3, Parks_nn5, graveyard_nn1, hospital_nn1, nursing_home_nn1, beach_resort.cat, Zoning, Neighborhood, MedianInc, lagPrice) %>% na.omit()

numericVars <- 
  select_if(st_drop_geometry(final_var), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation across numeric variables",
       caption = "Figure 4. Correlation matrix for numeric variables")

Maps of 3 interesting independent variables

Figure 5.1 through 5.3 show 3 maps for three independent variables that are interesting. From Figure 5.1, we can see that a few of the houses are close to resort beach, displayed in light blue color. Looking at Figure 5.1 and Figure 1 (map of sales price by property) together, it is apparent that those houses are located at where the priciest houses are. Similarly, from Figure 5.2, we can see that a few of the houses have luxury pool, displayed in light blue color. They are also at the northeast of the map. From Figure 5.3, we can see that houses have different distances to graveyards but their distribution are somewhat clustered, with the ones closer to graveyard clustered at the southwest part of the city and the ones farther from graveyard clustered at the northeast part of the city. Therefore, the features represented by these variables might be influential to the sales price - houses that are closer to the resort beach(es), houses that have luxury pool(s), and houses that are far away from the graveyard would increase the values of the houses.

ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = miami.sf.spat, aes(colour = beach_resort.cat), 
          show.legend = "point", size = .75) +
  geom_sf(data = neighborhoods, fill="transparent", color="dark gray")+
  labs(title="Distance to Resort Beach by Property, Miami",
       caption="Figure 5.1. Distance to Resort Beach by Property, Miami") +
  mapTheme()

ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = miami.sf.spat, aes(colour = as.factor(LPool)), 
          show.legend = "point", size = .75) +
  geom_sf(data = neighborhoods, fill="transparent", color="dark gray")+
  labs(title="Houses with or without Luxury Pool by Property, Miami",
       caption="Figure 5.2. Houses with or without Luxury Pool by Property, Miami") +
  mapTheme()

ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = miami.sf.spat, aes(colour = q5(graveyard_nn1)), 
          show.legend = "point", size = .75) +
  geom_sf(data = neighborhoods, fill="transparent", color="dark gray")+
  scale_colour_manual(values = palette5_rev,
                   labels=qBr2(miami.sf.spat,"graveyard_nn1"),
                   name="Quintile\nBreaks") +
  labs(title="Distance to Nearest Graveyard by Property, Miami",
       caption="Figure 5.3. Distance to Nearest Graveyard by Property, Miami") +
  mapTheme()

Ordinary Least Square (OLS) Regression for Training Data

By performing the Ordinary Least Square regression with different categories of variables, we were able to compare and contrast their fit by looking at their F-statistic (greater than the critical F-value) and the adjusted R square. We have observed that distances to selected amenities have limited influence on housing price, while the spatial structure can increase the adjusted R-square by around 5.5 percentage points, meaning that 5.5% more of the sale price could be explained by the spatial structure we selected. This is a meager increase considering the number of additional variables included. Note that even though adjusted R-square (fit) has increased (where standard residuals or errors decreased), bias and redundancy still affects the model (see Discussions and Conclusions sections).

# Regression analysis for selected internal characteristics variables
reg_phy <- lm(SalePrice ~., data = st_drop_geometry(MiamiInternal) %>%
                select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt))
summary(reg_phy)
## 
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(MiamiInternal) %>% 
##     select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, 
##         LPool, Elevator, TennisCourt))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8564529  -315637    29337   271120 10714239 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)    -1398430.13   135581.56 -10.314 < 0.0000000000000002 ***
## Bed.cat4+        140032.30   117463.98   1.192               0.2334    
## Bed.catUp to 4   440646.49    74640.15   5.904        0.00000000435 ***
## Stories         -149441.00    77022.42  -1.940               0.0525 .  
## ActualSqFt          903.05       24.13  37.421 < 0.0000000000000002 ***
## EffectiveAge        797.03      984.71   0.809               0.4184    
## LPool           4587861.81   247356.87  18.548 < 0.0000000000000002 ***
## Elevator        1168305.41   250071.38   4.672        0.00000323999 ***
## TennisCourt     7925666.80   691792.59  11.457 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 960100 on 1566 degrees of freedom
## Multiple R-squared:  0.8297, Adjusted R-squared:  0.8288 
## F-statistic: 953.7 on 8 and 1566 DF,  p-value: < 0.00000000000000022
# Regression analysis for adding amenities
reg_2 <- lm(SalePrice ~., data = st_drop_geometry(miami.sf) %>%
              select(-Zoning))
summary(reg_2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(miami.sf) %>% 
##     select(-Zoning))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8595379  -311679    29058   270334 10730918 
## 
## Coefficients:
##                           Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)            -1515799.73   180393.27  -8.403 < 0.0000000000000002 ***
## Stories                 -144939.33    78769.88  -1.840                0.066 .  
## ActualSqFt                  884.65       25.49  34.701 < 0.0000000000000002 ***
## EffectiveAge                830.84      989.06   0.840                0.401    
## LPool                   4595067.16   248780.61  18.470 < 0.0000000000000002 ***
## Elevator                1227956.64   252878.19   4.856        0.00000131857 ***
## TennisCourt             7949311.51   692912.29  11.472 < 0.0000000000000002 ***
## Bed.cat4+                133768.95   117950.59   1.134                0.257    
## Bed.catUp to 4           450682.73    74967.59   6.012        0.00000000228 ***
## MRStations_nn5              -21.46       24.67  -0.870                0.385    
## PublicSchools_nn3            91.58       63.42   1.444                0.149    
## Parks_nn5                    88.16      101.90   0.865                0.387    
## graveyard_nn1                20.07       20.07   1.000                0.318    
## hospital_nn1                -31.61       24.57  -1.286                0.198    
## nursing_home_nn1             10.94       29.32   0.373                0.709    
## beach_resort.cat0-1500    92575.11   143325.32   0.646                0.518    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 959700 on 1559 degrees of freedom
## Multiple R-squared:  0.8306, Adjusted R-squared:  0.829 
## F-statistic: 509.6 on 15 and 1559 DF,  p-value: < 0.00000000000000022
# after adding spatial structure
reg_final <- lm(SalePrice ~., data = st_drop_geometry(miami.sf.spat) %>%
                dplyr::select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt, MRStations_nn5, PublicSchools_nn3, Parks_nn5, graveyard_nn1, hospital_nn1, nursing_home_nn1, beach_resort.cat, Zoning, Neighborhood, MedianInc, lagPrice))
summary(reg_final)
## 
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(miami.sf.spat) %>% 
##     dplyr::select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, 
##         LPool, Elevator, TennisCourt, MRStations_nn5, PublicSchools_nn3, 
##         Parks_nn5, graveyard_nn1, hospital_nn1, nursing_home_nn1, 
##         beach_resort.cat, Zoning, Neighborhood, MedianInc, lagPrice))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5438542  -203411    29610   231527 10474480 
## 
## Coefficients: (1 not defined because of singularities)
##                                                  Estimate     Std. Error
## (Intercept)                                 -540350.38689  1170178.00290
## Bed.cat4+                                    -13134.01476   100479.94772
## Bed.catUp to 4                               390713.97551    64344.31033
## Stories                                       90722.01295    70285.77871
## ActualSqFt                                      697.51170       28.31375
## EffectiveAge                                     59.04316      866.11860
## LPool                                       4209253.20035   220811.09333
## Elevator                                     472738.70317   224818.59484
## TennisCourt                                 6890239.12954   584878.71285
## MRStations_nn5                                 -438.04988      106.86296
## PublicSchools_nn3                                18.00640       83.19828
## Parks_nn5                                       392.22961      134.61377
## graveyard_nn1                                  -259.10339       50.87482
## hospital_nn1                                     87.68651       50.69232
## nursing_home_nn1                                421.98995      106.66459
## beach_resort.cat0-1500                      -660336.86076   159380.05792
## Zoning0104 - SINGLE FAM - ANCILIARY UNIT      46191.50744   122666.49666
## Zoning0800 - SGL FAMILY - 1701-1900 SQ      1518155.06890   138063.04495
## Zoning2100 - ESTATES - 15000 SQFT LOT       5397688.74308   336892.72821
## Zoning2200 - ESTATES - 25000 SQFT LOT       4710189.32296   368123.94581
## Zoning2800 - TOWNHOUSE                       516027.31900   567878.41526
## Zoning3900 - MULTI-FAMILY - 38-62 U/A        -39853.27350   180689.62384
## Zoning3901 - GENERAL URBAN 36 U/A LIMITED     73128.69614   233822.95512
## Zoning4600 - MULTI-FAMILY - 5 STORY &        285746.34430   351836.95865
## Zoning4601 - MULTI-FAMILY - 8 STORY &       -649043.63192  1392492.04117
## Zoning4801 - RESIDENTIAL-LIMITED RETAI        69390.27775   601868.36061
## Zoning5700 - DUPLEXES - GENERAL               39734.96365    86507.57400
## Zoning6100 - COMMERCIAL - NEIGHBORHOOD       -14981.91637   830811.37669
## Zoning6101 - CEN-PEDESTRIAN ORIENTATIO       334217.20464   466700.14442
## Zoning6106 - RESIDENTIAL-LIBERAL RETAI      -202036.59501  1391849.05383
## Zoning6107 - RESIDENTIAL-MEDIUM RETAIL       247392.59306   289383.49611
## Zoning6110 - COMM/RESIDENTIAL-DESIGN D      -215142.82766   859992.51597
## Zoning7000 - INDUSTRIAL - GENERAL            114826.81626   864786.51903
## Zoning7700 - INDUSTRIAL - RESTRICTED         104118.16064  1280288.11395
## NeighborhoodAllapattah Industrial District             NA             NA
## NeighborhoodAuburndale                        48134.17850  1200412.46085
## NeighborhoodBay Heights                    -1534134.03614  1196330.63304
## NeighborhoodBaypoint                          62740.29552  1169138.02397
## NeighborhoodBayside                          380556.68341  1161167.08959
## NeighborhoodBelle Island                     652771.24323  1218016.01121
## NeighborhoodBelle Meade                      386392.59691  1158226.42509
## NeighborhoodBelle Meade West                 484235.47365  1172412.20633
## NeighborhoodBird Grove East                 -668322.94154  1228412.75385
## NeighborhoodBird Grove West                 -805572.64832  1205757.25179
## NeighborhoodBiscayne Island                 2164441.97286  1296770.32016
## NeighborhoodBiscayne Plaza                   294187.29914  1396021.42518
## NeighborhoodBrentwood                       -341601.03032  1280541.21582
## NeighborhoodBuena Vista Heights                1995.40090  1185119.72112
## NeighborhoodBuena Vista West                -117689.71305  1159417.47077
## NeighborhoodCitrus Grove                    -313492.39096  1166056.50672
## NeighborhoodCivic Center                     -16167.29910  1132396.38710
## NeighborhoodCoral Gate                      -803364.59526  1186570.51070
## NeighborhoodCurtis Park                      -54527.92562  1194480.53655
## NeighborhoodDouglas Park                    -867522.42347  1167077.07177
## NeighborhoodEast Grove                     -1179005.69071  1174861.46899
## NeighborhoodEast Little Havana               -32736.09449  1196602.75312
## NeighborhoodEdgewater                       -371092.22207  1547282.42435
## NeighborhoodEdison                           216663.16077  1151756.23624
## NeighborhoodFair Isle                       -668650.78997  1194790.50281
## NeighborhoodFlagami                          218435.61630  1197741.87029
## NeighborhoodFlora Park                       -83971.47739  1154845.82153
## NeighborhoodGrove Center                   -1380586.20961  1220542.68181
## NeighborhoodHadley Park                      -44552.68673  1150731.06383
## NeighborhoodHighland Park                    519616.77990  1130072.12812
## NeighborhoodHistoric Buena Vista East       -228656.67007  1176669.86870
## NeighborhoodKing Heights                     322331.30632  1164306.73789
## NeighborhoodLa Pastorita                    -313238.71352  1271773.52307
## NeighborhoodLatin Quarter                   -258425.96172  1256253.45256
## NeighborhoodLe Jeune Gardens                 126443.28727  1473068.62996
## NeighborhoodLemon City/Little Haiti          141755.32954  1171420.03984
## NeighborhoodLiberty Square                   440166.47815  1155948.41614
## NeighborhoodLittle River Central             300343.61146  1193416.49449
## NeighborhoodMelrose                         -255401.14857  1121765.77445
## NeighborhoodMiami Avenue                    -299247.19646  1189825.94133
## NeighborhoodMIAMI BEACH                     1248318.61374  1159732.84138
## NeighborhoodMorningside                       79038.88903  1161668.41229
## NeighborhoodNorth Grapeland Heights           82456.33174  1172991.01619
## NeighborhoodNorth Grove                     -711172.08313  1175911.68798
## NeighborhoodNorth Sewell Park               -336426.09466  1177550.48805
## NeighborhoodNortheast Overtown              -746660.40379  1288321.83635
## NeighborhoodNorthwestern Estates             359724.63138  1161839.30598
## NeighborhoodOakland Grove                     28884.90875  1393376.56260
## NeighborhoodOld San Juan                     -31743.96232  1170110.05865
## NeighborhoodOrange Bowl                     -571853.35698  1425157.44740
## NeighborhoodOrchard Villa                    -48095.61327  1161907.19725
## NeighborhoodPalm Grove                        48087.79182  1395230.78382
## NeighborhoodParkdale North                  -466972.89236  1186459.63471
## NeighborhoodParkdale South                  -851178.26994  1186425.92213
## NeighborhoodRoads                           -425235.94829  1155528.23609
## NeighborhoodSan Marco Island                3047596.96543  1311227.62607
## NeighborhoodSanta Clara                     -565839.10424  1159651.42857
## NeighborhoodShenandoah North                -323605.21665  1154857.98570
## NeighborhoodShenandoah South                -647479.80555  1154244.08340
## NeighborhoodShorecrest                       471262.80902  1153680.28181
## NeighborhoodSilver Bluff                    -970205.08761  1159994.00793
## NeighborhoodSouth Grapeland Heights          308837.08597  1202519.58495
## NeighborhoodSouth Grove                    -1740681.83955  1226849.16460
## NeighborhoodSouth Grove Bayside            -1609202.18700  1230072.74790
## NeighborhoodSouth Sewell Park               -130983.91667  1173039.70184
## NeighborhoodSpring Garden                   -657593.73306  1284001.47022
## NeighborhoodWest Grapeland Heights           470737.54632  1214752.06156
## NeighborhoodWest Grove                     -1246457.79299  1204726.67578
## MedianInc                                         1.52207        2.85288
## lagPrice                                         -0.04771        0.02403
##                                            t value             Pr(>|t|)    
## (Intercept)                                 -0.462              0.64432    
## Bed.cat4+                                   -0.131              0.89602    
## Bed.catUp to 4                               6.072         0.0000000016 ***
## Stories                                      1.291              0.19699    
## ActualSqFt                                  24.635 < 0.0000000000000002 ***
## EffectiveAge                                 0.068              0.94566    
## LPool                                       19.063 < 0.0000000000000002 ***
## Elevator                                     2.103              0.03566 *  
## TennisCourt                                 11.781 < 0.0000000000000002 ***
## MRStations_nn5                              -4.099         0.0000437262 ***
## PublicSchools_nn3                            0.216              0.82868    
## Parks_nn5                                    2.914              0.00363 ** 
## graveyard_nn1                               -5.093         0.0000003981 ***
## hospital_nn1                                 1.730              0.08388 .  
## nursing_home_nn1                             3.956         0.0000797644 ***
## beach_resort.cat0-1500                      -4.143         0.0000362047 ***
## Zoning0104 - SINGLE FAM - ANCILIARY UNIT     0.377              0.70655    
## Zoning0800 - SGL FAMILY - 1701-1900 SQ      10.996 < 0.0000000000000002 ***
## Zoning2100 - ESTATES - 15000 SQFT LOT       16.022 < 0.0000000000000002 ***
## Zoning2200 - ESTATES - 25000 SQFT LOT       12.795 < 0.0000000000000002 ***
## Zoning2800 - TOWNHOUSE                       0.909              0.36366    
## Zoning3900 - MULTI-FAMILY - 38-62 U/A       -0.221              0.82546    
## Zoning3901 - GENERAL URBAN 36 U/A LIMITED    0.313              0.75451    
## Zoning4600 - MULTI-FAMILY - 5 STORY &        0.812              0.41683    
## Zoning4601 - MULTI-FAMILY - 8 STORY &       -0.466              0.64121    
## Zoning4801 - RESIDENTIAL-LIMITED RETAI       0.115              0.90823    
## Zoning5700 - DUPLEXES - GENERAL              0.459              0.64607    
## Zoning6100 - COMMERCIAL - NEIGHBORHOOD      -0.018              0.98562    
## Zoning6101 - CEN-PEDESTRIAN ORIENTATIO       0.716              0.47403    
## Zoning6106 - RESIDENTIAL-LIBERAL RETAI      -0.145              0.88461    
## Zoning6107 - RESIDENTIAL-MEDIUM RETAIL       0.855              0.39275    
## Zoning6110 - COMM/RESIDENTIAL-DESIGN D      -0.250              0.80249    
## Zoning7000 - INDUSTRIAL - GENERAL            0.133              0.89439    
## Zoning7700 - INDUSTRIAL - RESTRICTED         0.081              0.93520    
## NeighborhoodAllapattah Industrial District      NA                   NA    
## NeighborhoodAuburndale                       0.040              0.96802    
## NeighborhoodBay Heights                     -1.282              0.19992    
## NeighborhoodBaypoint                         0.054              0.95721    
## NeighborhoodBayside                          0.328              0.74316    
## NeighborhoodBelle Island                     0.536              0.59209    
## NeighborhoodBelle Meade                      0.334              0.73872    
## NeighborhoodBelle Meade West                 0.413              0.67965    
## NeighborhoodBird Grove East                 -0.544              0.58649    
## NeighborhoodBird Grove West                 -0.668              0.50417    
## NeighborhoodBiscayne Island                  1.669              0.09531 .  
## NeighborhoodBiscayne Plaza                   0.211              0.83313    
## NeighborhoodBrentwood                       -0.267              0.78969    
## NeighborhoodBuena Vista Heights              0.002              0.99866    
## NeighborhoodBuena Vista West                -0.102              0.91916    
## NeighborhoodCitrus Grove                    -0.269              0.78808    
## NeighborhoodCivic Center                    -0.014              0.98861    
## NeighborhoodCoral Gate                      -0.677              0.49848    
## NeighborhoodCurtis Park                     -0.046              0.96360    
## NeighborhoodDouglas Park                    -0.743              0.45740    
## NeighborhoodEast Grove                      -1.004              0.31577    
## NeighborhoodEast Little Havana              -0.027              0.97818    
## NeighborhoodEdgewater                       -0.240              0.81049    
## NeighborhoodEdison                           0.188              0.85081    
## NeighborhoodFair Isle                       -0.560              0.57581    
## NeighborhoodFlagami                          0.182              0.85532    
## NeighborhoodFlora Park                      -0.073              0.94204    
## NeighborhoodGrove Center                    -1.131              0.25819    
## NeighborhoodHadley Park                     -0.039              0.96912    
## NeighborhoodHighland Park                    0.460              0.64572    
## NeighborhoodHistoric Buena Vista East       -0.194              0.84595    
## NeighborhoodKing Heights                     0.277              0.78194    
## NeighborhoodLa Pastorita                    -0.246              0.80548    
## NeighborhoodLatin Quarter                   -0.206              0.83704    
## NeighborhoodLe Jeune Gardens                 0.086              0.93161    
## NeighborhoodLemon City/Little Haiti          0.121              0.90370    
## NeighborhoodLiberty Square                   0.381              0.70342    
## NeighborhoodLittle River Central             0.252              0.80133    
## NeighborhoodMelrose                         -0.228              0.81993    
## NeighborhoodMiami Avenue                    -0.252              0.80146    
## NeighborhoodMIAMI BEACH                      1.076              0.28193    
## NeighborhoodMorningside                      0.068              0.94576    
## NeighborhoodNorth Grapeland Heights          0.070              0.94397    
## NeighborhoodNorth Grove                     -0.605              0.54542    
## NeighborhoodNorth Sewell Park               -0.286              0.77515    
## NeighborhoodNortheast Overtown              -0.580              0.56230    
## NeighborhoodNorthwestern Estates             0.310              0.75690    
## NeighborhoodOakland Grove                    0.021              0.98346    
## NeighborhoodOld San Juan                    -0.027              0.97836    
## NeighborhoodOrange Bowl                     -0.401              0.68829    
## NeighborhoodOrchard Villa                   -0.041              0.96699    
## NeighborhoodPalm Grove                       0.034              0.97251    
## NeighborhoodParkdale North                  -0.394              0.69394    
## NeighborhoodParkdale South                  -0.717              0.47322    
## NeighborhoodRoads                           -0.368              0.71293    
## NeighborhoodSan Marco Island                 2.324              0.02025 *  
## NeighborhoodSanta Clara                     -0.488              0.62567    
## NeighborhoodShenandoah North                -0.280              0.77935    
## NeighborhoodShenandoah South                -0.561              0.57491    
## NeighborhoodShorecrest                       0.408              0.68298    
## NeighborhoodSilver Bluff                    -0.836              0.40307    
## NeighborhoodSouth Grapeland Heights          0.257              0.79735    
## NeighborhoodSouth Grove                     -1.419              0.15616    
## NeighborhoodSouth Grove Bayside             -1.308              0.19100    
## NeighborhoodSouth Sewell Park               -0.112              0.91111    
## NeighborhoodSpring Garden                   -0.512              0.60863    
## NeighborhoodWest Grapeland Heights           0.388              0.69843    
## NeighborhoodWest Grove                      -1.035              0.30101    
## MedianInc                                    0.534              0.59375    
## lagPrice                                    -1.985              0.04730 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 789600 on 1472 degrees of freedom
## Multiple R-squared:  0.8917, Adjusted R-squared:  0.8842 
## F-statistic: 118.9 on 102 and 1472 DF,  p-value: < 0.00000000000000022

Results

Regression Test for Training Set, Summary of Results

Table 2 includes the OLS regression results for our training set, showing the coefficients of all variables as well as the adjusted R2 for our model. For predicting sales price, we can observe from Table 2 that there are a few features that are statistically significant, including having less than 4 bedrooms, actual square footage of the property, having elevator(s), having luxury pool(s), having tennis court(s), distance to nearest 5 metro stations, distance to nearest 5 parks, distance to the nearest graveyard, distance to the nearest hospital, distance to the nearest nursing home, distance to the nearest resort beach, prices of the 5 nearby homes, and a few zoning and neighborhood boundaries.

The adjusted R square of 0.884 means that 88.4% of the variance observed in sale price has been explained by the predictors. Among the predictors, as the Actual Square feet variable (ActualSqFt) is below the 0.05 significance level with p-value approaching zero, it means that this variable is very statistically significant. It also has a t-value above 24.635, which is significantly greater than 0, which suggest greater evidence that there is significance for Actual Square feet in sale price. The sale price increases with the presence of tennis court(s) by 6,890,239 dollars, whereas being farther than 1500 feet from a resort beach, the sale price decreases by 660,336 dollars, holding other variables constant. What is interesting from all the coefficients is that as distance to park increases, sales price also increases by $392. We should not rule out the possibility that redundant variables could have distorted the coefficients. Note that many variables are not significant, particularly categorical ones, which will be addressed in the Discussions and Conclusions sections.

Table 2. Training Set Summary Results
Dependent variable:
SalePrice
Bed.cat4+ -13,134.010
(100,479.900)
Bed.catUp to 4 390,714.000***
(64,344.310)
Stories 90,722.010
(70,285.780)
ActualSqFt 697.512***
(28.314)
EffectiveAge 59.043
(866.119)
LPool 4,209,253.000***
(220,811.100)
Elevator 472,738.700**
(224,818.600)
TennisCourt 6,890,239.000***
(584,878.700)
MRStations_nn5 -438.050***
(106.863)
PublicSchools_nn3 18.006
(83.198)
Parks_nn5 392.230***
(134.614)
graveyard_nn1 -259.103***
(50.875)
hospital_nn1 87.687*
(50.692)
nursing_home_nn1 421.990***
(106.665)
beach_resort.cat0-1500 -660,336.900***
(159,380.100)
Zoning0104 - SINGLE FAM - ANCILIARY UNIT 46,191.510
(122,666.500)
Zoning0800 - SGL FAMILY - 1701-1900 SQ 1,518,155.000***
(138,063.000)
Zoning2100 - ESTATES - 15000 SQFT LOT 5,397,689.000***
(336,892.700)
Zoning2200 - ESTATES - 25000 SQFT LOT 4,710,189.000***
(368,123.900)
Zoning2800 - TOWNHOUSE 516,027.300
(567,878.400)
Zoning3900 - MULTI-FAMILY - 38-62 U/A -39,853.270
(180,689.600)
Zoning3901 - GENERAL URBAN 36 U/A LIMITED 73,128.700
(233,823.000)
Zoning4600 - MULTI-FAMILY - 5 STORY
(351,837.000)
Zoning4601 - MULTI-FAMILY - 8 STORY
(1,392,492.000)
Zoning4801 - RESIDENTIAL-LIMITED RETAI 69,390.280
(601,868.400)
Zoning5700 - DUPLEXES - GENERAL 39,734.960
(86,507.570)
Zoning6100 - COMMERCIAL - NEIGHBORHOOD -14,981.920
(830,811.400)
Zoning6101 - CEN-PEDESTRIAN ORIENTATIO 334,217.200
(466,700.100)
Zoning6106 - RESIDENTIAL-LIBERAL RETAI -202,036.600
(1,391,849.000)
Zoning6107 - RESIDENTIAL-MEDIUM RETAIL 247,392.600
(289,383.500)
Zoning6110 - COMM/RESIDENTIAL-DESIGN D -215,142.800
(859,992.500)
Zoning7000 - INDUSTRIAL - GENERAL 114,826.800
(864,786.500)
Zoning7700 - INDUSTRIAL - RESTRICTED 104,118.200
(1,280,288.000)
NeighborhoodAllapattah Industrial District
NeighborhoodAuburndale 48,134.180
(1,200,412.000)
NeighborhoodBay Heights -1,534,134.000
(1,196,331.000)
NeighborhoodBaypoint 62,740.300
(1,169,138.000)
NeighborhoodBayside 380,556.700
(1,161,167.000)
NeighborhoodBelle Island 652,771.200
(1,218,016.000)
NeighborhoodBelle Meade 386,392.600
(1,158,226.000)
NeighborhoodBelle Meade West 484,235.500
(1,172,412.000)
NeighborhoodBird Grove East -668,322.900
(1,228,413.000)
NeighborhoodBird Grove West -805,572.600
(1,205,757.000)
NeighborhoodBiscayne Island 2,164,442.000*
(1,296,770.000)
NeighborhoodBiscayne Plaza 294,187.300
(1,396,021.000)
NeighborhoodBrentwood -341,601.000
(1,280,541.000)
NeighborhoodBuena Vista Heights 1,995.401
(1,185,120.000)
NeighborhoodBuena Vista West -117,689.700
(1,159,417.000)
NeighborhoodCitrus Grove -313,492.400
(1,166,057.000)
NeighborhoodCivic Center -16,167.300
(1,132,396.000)
NeighborhoodCoral Gate -803,364.600
(1,186,571.000)
NeighborhoodCurtis Park -54,527.930
(1,194,481.000)
NeighborhoodDouglas Park -867,522.400
(1,167,077.000)
NeighborhoodEast Grove -1,179,006.000
(1,174,861.000)
NeighborhoodEast Little Havana -32,736.090
(1,196,603.000)
NeighborhoodEdgewater -371,092.200
(1,547,282.000)
NeighborhoodEdison 216,663.200
(1,151,756.000)
NeighborhoodFair Isle -668,650.800
(1,194,791.000)
NeighborhoodFlagami 218,435.600
(1,197,742.000)
NeighborhoodFlora Park -83,971.480
(1,154,846.000)
NeighborhoodGrove Center -1,380,586.000
(1,220,543.000)
NeighborhoodHadley Park -44,552.690
(1,150,731.000)
NeighborhoodHighland Park 519,616.800
(1,130,072.000)
NeighborhoodHistoric Buena Vista East -228,656.700
(1,176,670.000)
NeighborhoodKing Heights 322,331.300
(1,164,307.000)
NeighborhoodLa Pastorita -313,238.700
(1,271,774.000)
NeighborhoodLatin Quarter -258,426.000
(1,256,253.000)
NeighborhoodLe Jeune Gardens 126,443.300
(1,473,069.000)
NeighborhoodLemon City/Little Haiti 141,755.300
(1,171,420.000)
NeighborhoodLiberty Square 440,166.500
(1,155,948.000)
NeighborhoodLittle River Central 300,343.600
(1,193,416.000)
NeighborhoodMelrose -255,401.100
(1,121,766.000)
NeighborhoodMiami Avenue -299,247.200
(1,189,826.000)
NeighborhoodMIAMI BEACH 1,248,319.000
(1,159,733.000)
NeighborhoodMorningside 79,038.890
(1,161,668.000)
NeighborhoodNorth Grapeland Heights 82,456.330
(1,172,991.000)
NeighborhoodNorth Grove -711,172.100
(1,175,912.000)
NeighborhoodNorth Sewell Park -336,426.100
(1,177,550.000)
NeighborhoodNortheast Overtown -746,660.400
(1,288,322.000)
NeighborhoodNorthwestern Estates 359,724.600
(1,161,839.000)
NeighborhoodOakland Grove 28,884.910
(1,393,377.000)
NeighborhoodOld San Juan -31,743.960
(1,170,110.000)
NeighborhoodOrange Bowl -571,853.400
(1,425,157.000)
NeighborhoodOrchard Villa -48,095.610
(1,161,907.000)
NeighborhoodPalm Grove 48,087.790
(1,395,231.000)
NeighborhoodParkdale North -466,972.900
(1,186,460.000)
NeighborhoodParkdale South -851,178.300
(1,186,426.000)
NeighborhoodRoads -425,235.900
(1,155,528.000)
NeighborhoodSan Marco Island 3,047,597.000**
(1,311,228.000)
NeighborhoodSanta Clara -565,839.100
(1,159,651.000)
NeighborhoodShenandoah North -323,605.200
(1,154,858.000)
NeighborhoodShenandoah South -647,479.800
(1,154,244.000)
NeighborhoodShorecrest 471,262.800
(1,153,680.000)
NeighborhoodSilver Bluff -970,205.100
(1,159,994.000)
NeighborhoodSouth Grapeland Heights 308,837.100
(1,202,520.000)
NeighborhoodSouth Grove -1,740,682.000
(1,226,849.000)
NeighborhoodSouth Grove Bayside -1,609,202.000
(1,230,073.000)
NeighborhoodSouth Sewell Park -130,983.900
(1,173,040.000)
NeighborhoodSpring Garden -657,593.700
(1,284,001.000)
NeighborhoodWest Grapeland Heights 470,737.500
(1,214,752.000)
NeighborhoodWest Grove -1,246,458.000
(1,204,727.000)
MedianInc 1.522
(2.853)
lagPrice -0.048**
(0.024)
Constant -540,350.400
(1,170,178.000)
Observations 1,575
R2 0.892
Adjusted R2 0.884
Residual Std. Error 789,644.200 (df = 1472)
F Statistic 118.854*** (df = 102; 1472)
Note: p<0.1; p<0.05; p<0.01

Predictions on Test Set and Errors

We reproduced the variables for the testing dataset.

Table of mean absolute error (AbsError) and mean absolute percentage error (MAPE) for a single test set

Table 3 shows the mean absolute error(ABE) and mean absolute percentage error(MAPE) for the sales price in the single test set, with and without accounting for neighborhood. According to Table 3, the neighborhood effects, when accounted, increased the errors to an extent, the reasons of which will be discussed in detail in the Discussion and Conclusions sections. Also note that some predicted prices may be negative because the trained model y-intercept is negative (see Table 2), which could be the case when the coefficients for certain observations are very small, and the model’s variance is high. Therefore they are omitted in future plots for interpretability.

# predict sale price from trained model

final_test.sf <-
  final_test.sf %>%
  mutate(Regression = "New Data Prediction - With Neighborhoods",
         SalePrice.Predict = predict(reg_train, final_test.sf),
         SalePrice.Error = SalePrice.Predict - SalePrice,
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = (abs(SalePrice.Predict - SalePrice)) / SalePrice.Predict)%>%
  filter(SalePrice < 5000000) %>% filter(SalePrice.Predict >= 0)

# errors
coords.test <-  st_coordinates(final_test.sf) 
neighborList.test <- knn2nb(knearneigh(coords.test, k_nearest_neighbors))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
final_test.sf$lagPriceError <- lag.listw(spatialWeights.test, final_test.sf$SalePrice.AbsError)

final_test_base.sf <-
  final_test_base.sf %>%
  mutate(Regression = "New Data Prediction - Without Neighborhoods",
         SalePrice.Predict = predict(reg_train_base, final_test_base.sf),
         SalePrice.Error = SalePrice.Predict - SalePrice,
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = (abs(SalePrice.Predict - SalePrice)) / SalePrice.Predict)%>%
  filter(SalePrice < 5000000) %>% filter(SalePrice.Predict >= 0)

# errors
coords.test2 <-  st_coordinates(final_test_base.sf) 
neighborList.test2 <- knn2nb(knearneigh(coords.test2, k_nearest_neighbors))
spatialWeights.test2 <- nb2listw(neighborList.test2, style="W")
final_test_base.sf$lagPriceError <- lag.listw(spatialWeights.test2, final_test_base.sf$SalePrice.AbsError)

compare_base_and_unified <- 
  rbind(
    dplyr::select(final_test_base.sf, starts_with("SalePrice"), Regression, Neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test2, SalePrice.Error)),
    dplyr::select(final_test.sf, starts_with("SalePrice"), Regression, Neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))    

st_drop_geometry(compare_base_and_unified) %>%
  gather(Variable, Value, -Regression, -Neighborhood) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable() %>%
      kable_styling("striped", full_width = F) %>%
      row_spec(1, color = "black", background = "#25CB10") %>%
      row_spec(2, color = "black", background = "#FA7800") %>%
      footnote(general_title = "\n",
           general = "Table 3: Table of mean absolute error and MAPE for a single test set")
Regression SalePrice.AbsError SalePrice.APE
New Data Prediction - With Neighborhoods 348157.7 5.981075
New Data Prediction - Without Neighborhoods 331956.7 2.315482

Table 3: Table of mean absolute error and MAPE for a single test set

Regression Result for Testing Set, extract residuals

F-statistic explains the overall significance of fit for the testing set. In summary, the F-statistic is greater than the F critical value of 0.2064024, therefore the model is explaining the variance in sale price. Actual Square feet is below the 0.05 significance level with p-value approaching zero which means that this variable very statistically significant, and has a t-value above 16, which is significantly greater than 0 and means that there is greater evidence that there is a significant difference. The adjusted R square of 80.7% means that 80.7% of the variance observed in sale price has been explained by the predictors. Median Income, Age and Distance to Transportation are all significant. Note that many variables are not significant, similar to the training result, and the result will be explained and discussed in further details with additional plots below.

Cross-Validation Test Results, including mean and standard deviation Mean Absolute Error (MAE)

Cross-validation MAE Histogram

Cross-validation on the training data help us identify the optimal model parameters to assess the model on unseen test data. From cross-validation, MAE has a mean of 212570.9, and a standard deviation of 82834.11 dollars. There is significant variance across folds, as suggested from MAE_SD and RMSE, particularly due to possible outliers or redundant variables that contribute redundant coefficients that distort the true predicted values. This is also mirrored in the histogram where the distribution of errors are not very tightly clustered together as there are gaps and multimodal patterns. The mean R-square is 0.8353 which means 83.53% of the variance has been explained in the cross-validated sets. Having said that the issues of overfitting and redundancy questions the actual generalizability of the trained model on new unseen data, notwithstanding a high R-square (see Discussions and Conclusions).

fitControl <- trainControl(method = "cv", 
                           number = 100, # 100 folds
                           savePredictions = TRUE)

reg.cv <- 
  train(SalePrice ~ .,  data = st_drop_geometry(final_test.sf) %>%  dplyr::select(SalePrice, Bed.cat, Stories, ActualSqFt, EffectiveAge, LPool, Elevator, TennisCourt, MRStations_nn5, PublicSchools_nn3, Parks_nn5, graveyard_nn1, hospital_nn1, nursing_home_nn1, beach_resort.cat, Zoning, Neighborhood, MedianInc, lagPrice) %>% na.omit(),
     method = "lm", 
     trControl = fitControl, 
     na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 894 samples
##  18 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 886, 885, 885, 885, 885, 885, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   314067.2  0.83516   212702.9
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = value, color = name)) +
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(
    legend.position = "none"
  )

MAE_mean<- mean(reg.cv$resample[,3])
MAE_SD<- sd(reg.cv$resample[,3])

compare<- data.frame(MAE_mean=MAE_mean,
                     MAE_SD=MAE_SD)
compare
##   MAE_mean   MAE_SD
## 1 212702.9 84956.57

Predicted Prices vs observed prices

Figure 6 shows that observed vs predicted prices follow the linear function, as a good amount of points reside on or are close to the best-fit line. However the residuals or error term are generally more spread out as prices increase, denoting high variance in the model.

predicted <- reg.cv[["pred"]][["pred"]] %>% as.numeric()
observed <- reg.cv[["pred"]][["obs"]] %>% as.numeric()

plot(predicted,observed, main="Predicted Prices as a function of \n Observed Prices (cross-validation)", cex.main=0.75, sub= "Figure 6.1", abline(lm(observed~predicted), col="#FA7800"))

# for checking
plot(final_test.sf$SalePrice.Predict,final_test.sf$SalePrice, main="Predicted Prices as a function of \n Observed Prices (capped at 5,000,000 to avoid extreme scaling)", cex.main=0.75, sub= "Figure 6.2", abline(lm(observed~predicted), col="#FA7800"))

Map of Residuals for the Test Set

Test set residuals in Figure 7 are generally randomly distributed in space, but some degrees of clustering in Miami Beach and the southern areas are likely not the result of chance due to signs of positive spatial autocorrelation, denoting that there may be some unobserved relationships. This is confirmed in Moran’s I test shown in Figure 8.

ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = final_test.sf, aes(colour = q5(Residuals_test)), 
          show.legend = "point", size = 1) +
  geom_sf(data = neighborhoods, fill="transparent", color="dark gray")+
  scale_colour_manual(values = palette5_rev,
                   labels=qBr2(final_test.sf,"Residuals_test"),
                   name="Quintile\nBreaks") +
  labs(title="Map of Residuals for the Test Set",
       caption="Figure 7.") +
  mapTheme()

Moran’s I test

The Moran’s I is an indicator of spatial correlation (SA) which can be observed through clustering (positive SA) or no dispersion of residuals (negative SA - checkerboard pattern). Here we are concerned with positive SA as negative is very rare in real life. A value near 1 indicates strong clustering of similar values, around 0 means no clustering, and -1 means dispersion (strong negative autocorrelation). In this case, the Moran’s I value is about 0.25, represented by the orange line, which means there is a light or mild clustering of residuals, and therefore some positive spatial autocorrelation in the variable “SalePrice”. This is why including neighborhood variable would make sense to take into account the neighborhood effects. The characteristics of a neighborhood are often factored in the assessment of a property. Residential property is often assessed for higher value near higher valued, more respectable neighborhoods of higher median income, and vice-versa (Bucholz, 2004). Hence, we would expect homes with a high assessed value tend to be next to other homes with a high assessed value and vice-versa.

moranTest <- moran.mc(final_test.sf$SalePrice.AbsError, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count",
       caption="Figure 8") +
  plotTheme()

Plot of Spatial Lag in Errors

From this Figure 9, as price increases, the prices of nearby homes do not necessarily increase. However, when extreme outliers are included the line is positive, denoting somewhat of a positive clustering of sale prices. Note that the errors of Lag price means the average price of 5 nearest neighbors (home) of a given home (same location as sale price).

final_test.sf <- mutate(final_test.sf, lagPriceError = final_test.sf$lagPrice - final_test.sf$SalePrice)

ggplot(final_test.sf, aes(x=lagPriceError, y=SalePrice)) +
    geom_point(colour = "#FA7800") +
    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
    labs(title = "Error as a function of the spatial lag of price",
         caption = "Figure 9",
         x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
         y = "Sale Price") +
    plotTheme()

Map of Predicted Values for where ‘toPredict’ is both 0 and 1

From Figure 10, predicted values of homes are generally higher in the Miami Beach area and other areas of high income (also see Figure 13). The values of homes in low-income neighborhoods are somewhat lower, which agrees to a certain extent with the premises about neighborhood effects mentioned above (Bucholz, 2004).

final_train.sf <- final_train.sf %>% mutate(Regression = "Existing Data Prediction")

ggplot() +
  geom_sf(data = MunicipalBoundaries, fill = "grey40") +
  geom_sf(data = final_train.sf, aes(colour = q5(SalePrice.Predict)), 
          show.legend = "point", size = 1) +
  geom_sf(data = final_test.sf, aes(colour = q5(SalePrice.Predict)), 
          show.legend = "point", size = 1) +
  geom_sf(data = neighborhoods, fill="transparent", color="dark gray")+
  scale_colour_manual(values = palette5_rev,
                   labels=qBr2(final_test.sf,"SalePrice.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Map of Predicted Values - both new and existing",
       caption="Figure 10.") +
  mapTheme()

Map of Mean absolute percentage error (MAPE) by neighborhood based on Test Set predictions

MAPE is very high in most neighborhoods and low in some neighborhoods, ranging from 0.086 to 1.38. The neighborhoods in brighter orange have higher MAPE, yet there are no clear patterns observed other than that they are mostly lower-income neighborhoods of the Miami, when compared to Figure 13.

nhood_summary <- final_test.sf %>% 
    group_by(Neighborhood) %>%
    summarize(meanPrice = mean(SalePrice, na.rm = T),
              meanPrediction = mean(SalePrice.Predict, na.rm = T),
              meanMAE = mean(SalePrice.AbsError, na.rm = T),
              MAPE=mean(SalePrice.APE, na.rm = T),
              Income=mean(MedianInc, na.rm = T))

nhood_summary %>% 
  st_drop_geometry %>%
  arrange(desc(meanMAE)) %>% 
      knitr::kable() %>% kable_styling()
Neighborhood meanPrice meanPrediction meanMAE MAPE Income
San Marco Island 1850000.0 8954848.045 7104848.05 0.7934080 53145.00
Fair Isle 1913750.0 2994454.552 1302366.12 0.3923635 74966.00
Baypoint 2560100.0 2767382.769 1158032.58 0.4360026 26709.00
Grove Center 1807500.0 747248.702 1060251.30 1.9400865 39877.50
MIAMI BEACH 1669220.3 2004854.124 743701.85 0.5637570 49712.17
South Grove Bayside 1973000.0 2210355.692 602214.13 0.2323260 91949.00
Curtis Park 323214.3 868633.484 545419.20 0.5965194 14949.00
East Grove 1531285.7 1230446.942 535536.48 1.0629968 74966.00
Belle Island 1700000.0 2227530.995 527530.99 0.2368232 43989.00
Miami Avenue 1188333.3 1629015.947 461813.08 0.3019707 65369.50
South Grove 1344502.3 1426918.818 460558.73 0.5145296 91949.00
Bay Heights 1413916.7 1018421.008 437757.06 0.5221637 74966.00
Bird Grove West 663333.3 876900.425 428686.66 0.4233185 27708.00
Morningside 957962.1 1111636.796 361730.71 0.3030786 26709.00
South Grapeland Heights 307555.6 668434.013 360878.46 0.5055572 16160.00
Shorecrest 696255.0 845685.220 344984.58 1.5054851 20224.00
Roads 670726.7 753585.812 329592.01 1.9610766 34295.17
Allapattah Industrial District 197500.0 507212.115 309712.12 0.5146439 18918.00
West Grove 620464.3 664240.059 300199.57 2.3674817 23004.57
Palm Grove 770466.7 715466.330 296126.13 0.4967123 26709.00
Highland Park 647500.0 570795.973 295532.31 0.5328359 21332.00
North Grove 960422.2 1058284.673 282530.42 0.3657736 60278.00
Flora Park 207992.3 400687.056 280841.07 1.3776670 16624.00
Shenandoah South 477511.8 576192.767 272481.19 0.8565733 23355.82
Belle Meade 788361.1 616649.106 271174.02 1.1874080 43989.00
North Grapeland Heights 320000.0 440363.936 250833.58 0.6480496 16315.00
Bird Grove East 608000.0 831391.313 244190.97 0.2706971 52333.20
Parkdale North 530333.3 720723.936 235254.14 0.2280676 16643.00
Citrus Grove 347200.0 401598.911 230039.98 255.5385401 15562.53
Melrose 456300.0 352077.874 227611.49 0.5949806 17406.80
Douglas Park 363833.3 416928.194 225162.41 0.7770209 23098.13
Shenandoah North 497986.8 607207.847 224064.87 1.4054856 18805.68
Auburndale 331857.1 398274.284 216494.27 1.1344137 15979.00
Orchard Villa 250722.2 298202.046 215148.52 2.3448909 17847.78
Orange Bowl 205000.0 3660.821 201339.18 54.9983744 16601.00
Flagami 345048.1 391039.406 196328.17 1.4792314 18987.81
Santa Clara 252037.5 332079.292 192553.79 0.7528348 16612.50
Silver Bluff 492806.1 420493.286 190616.63 0.9824792 24278.00
Parkdale South 397600.0 215953.342 181646.66 1.2621524 22488.00
Lemon City/Little Haiti 333000.0 352184.832 176598.56 0.5539028 22532.80
South Sewell Park 436750.0 313777.255 169317.96 3.9829074 12831.50
Bayside 571800.0 409803.353 161996.65 0.5069415 43989.00
Historic Buena Vista East 475000.0 315306.196 159693.80 0.5064721 25142.00
Edison 225328.6 295199.864 159539.18 2.5652265 16275.10
Hadley Park 213061.4 294397.299 150174.04 0.8499052 17756.43
Buena Vista West 311615.4 290212.752 145475.38 8.2345557 17707.00
Little River Central 170400.0 307665.043 145088.51 0.3712461 17182.40
Buena Vista Heights 376090.9 466695.876 145011.82 0.3026604 17707.00
Old San Juan 372683.3 472996.876 144772.73 0.2838775 24894.00
Liberty Square 166242.9 248661.889 142540.10 0.8135218 12667.00
Civic Center 245000.0 107906.262 137093.74 1.2704892 16135.00
East Little Havana 260000.0 266453.201 132162.59 0.6581857 16808.80
Coral Gate 573857.1 554727.417 124570.41 0.2605708 22488.00
West Grapeland Heights 352500.0 242649.110 109850.89 0.5859292 20761.00
North Sewell Park 839000.0 945527.063 106527.06 0.1126642 12284.00
Belle Meade West 387800.0 455497.104 97588.24 0.2076508 43989.00
Northwestern Estates 179384.6 213619.901 94233.90 0.6107202 18111.00
King Heights 207437.5 192150.514 88550.95 0.7367846 18934.50
La Pastorita 345000.0 320037.643 87919.65 0.3555449 16643.00
Northeast Overtown 370000.0 428089.466 58089.47 0.1356947 15208.00
Brentwood 332500.0 317396.166 15103.83 0.0475867 25142.00
map_MAPE <- final_test.sf %>% 
  group_by(Neighborhood) %>% 
  summarise(MAPE = mean(SalePrice.APE, na.rm = T))

neighborhoods <- neighborhoods %>% mutate(Neighborhood=LABEL)

  ggplot() +
    geom_sf(data = MunicipalBoundaries, fill = "white") +
    geom_sf(data = neighborhoods %>% 
              left_join(st_drop_geometry(map_MAPE), by = "Neighborhood"),
            aes(fill = q5(MAPE))) +
      scale_fill_manual(values = palette5,
                     labels=qBr2(nhood_summary,"MAPE"),
                     name="Quintile\nBreaks") +
    mapTheme() +
    labs(title="Mean absolute percentage error (MAPE) by neighborhood", caption="Figure 11.")

Scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood

MAPE and mean price in the neighborhoods are not closely related, as there are no obvious pattern shown in Figure 12 between MAPE and mean price of the neighborhoods. Figure 12 does not include outliers in our model to deliver a readable plot in a reasonable scale. However, when those outliers are included, we observed that the mean prices in the extremes tend to have higher MAPE.

plot(nhood_summary$meanPrice, nhood_summary$MAPE, main="MAPE by neighborhood as a function of mean price by neighbourhood", cex.main=0.75, ylim=range(0:1), sub= "Figure 12")

Test of Generalizability (split city into two groups by race or income)

From Figure 13 we can see that majority white neighborhoods tend to be wealthier neighborhoods, and vice versa for non-white majority. However, the model generalizes poorly by race and especially low-income group, as the MAPE is very high and biased, meaning that the difference between our predicted mean and the true mean of sale price is high. Reasons for this is discussed in detail in the Discussions and Conclusions sections. Note that the plots below uses census tracts that intersect with neighborhoods. Having said that, the model does generalize well on high-income neighborhoods as seen from Figure 10 and 11 where errors are lower.

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          year = 2018, state=12, county=086, geometry=T, output = "wide", survey= "acs5") %>%
  st_transform('ESRI:102258')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

tracts18 <- st_join(tracts18, final_test.sf) %>% filter(!is.na(Neighborhood)) # filter out the city

MunicipalBoundaries<- MunicipalBoundaries %>% st_set_crs(4326) %>% st_transform('ESRI:102258')

grid.arrange(ncol = 2, # may need to clear output and re-start R session for this code block to work
  ggplot() +
    geom_sf(data = MunicipalBoundaries, fill = "grey40") +
    geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context", caption = "Figure 13") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + 
    geom_sf(data = MunicipalBoundaries, fill = "grey40") +
    geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

context_table1 <- tracts18 %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context")

context_table1
Test set MAPE by neighborhood racial context
Majority Non-White Majority White
175% 703%
context_table2 <- tracts18 %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")

context_table2
Test set MAPE by neighborhood income context
High Income Low Income
58% 795%

Discussion

In summary, this is not an effective model due to overfitting that led to high variance and distortions in the coefficient estimates which led to many errors in predicting actual sale prices in the testing set, except for high-income neighborhoods (as appeared when comparing Figure 11, Figure 13, and the Test set MAPE by neighborhood income context tables). This means that this model is biased and not generalizable. In our training stages, we saw that R2 increased only by 5% (from the baseline regression) when accounting for the neighborhoods and other local intelligence variables (zoning, median income, price lag). The meagre increase in R square as we seen in the OLS of traning data by including a large number of variables only made it better in predicting or “fitting” the existing data but this sacrifices the generalizability of the model to predict new unseen data. Furthermore, because neighborhoods and zoning are distinct and categorical, each of them is treated as single variables. The substantial increase in the number of variables, notwithstanding the increases in adjusted R-square (which went up to 88.4%), will still decrease the model’s effectiveness in predicting new unseen data. This is accordingly reflected in the higher MAE errors in Table 3 above as well as in Figures 13 and where it did poorly in predicting across multiple scales except for high-income neighborhoods where the model actually predicted fairly well.

Errors may also be contributed by outliers and nonlinear relationships that we didn’t caught in the data. We could have used log transformation of the variables to help fixing non-linearity, outliers and heteroscedasticity, however it will also make interpretation of the coefficients more challenging, and depending on the distribution of each predictor, they may not even be suitable for log transformation (e.g. zero-inflated). Heteroscedasticity is important because it means there is a non-constant variance in the residuals, and hence there may be systematic patterns (e.g. parabolic) and that relationship is unexplained by the model.

Certainly a high degree of redundancy is contributed by the issue of multicollinearity. For instance, it is often the case that the greater the actual area square footage, the higher the number of bedrooms, which means the two variables are explaining the same thing. From the model, we believe that a share of the errors were the result of some variables of distance to amenities. We do not have enough information to determine why, but as an example from common sense, distance to highway and distance to retail have internal dependence because many retail developments are often located close to highway for the ease of goods movement. Even though these are not used in our model, our model did include Distance to Nursing home and Metro Station which showed a high collinearity, suggesting that the two are somehow explaining the same variations in sale price (see Figure 4). This is important because a strong presence of multicollinearity will not only be unhelpful (redundant) to the model but it also contributes redundant information, distorting coefficient estimates and p-values, and consequently the final predicted sale price (which is calculated from the coefficient estimates).

Furthermore, the nearest neighbor functions we used have a defined weight matrix. To improve the result of the model we should gain more knowledge of the model by testing different applicable weight matrices to make sure they are consistent since different types of weight matrix (e.g. inverse distance, simple distance, nn with no distance, contiguity-based, etc) would lead to different value of Moran’s I, hence different understanding of the magnitude of spatial auto-correlation in the variables. Therefore the Figure 8 showing 0.25 of mild positive spatial autocorrelation (which is observed in the clustering of residuals in Figure 7) should only be used as a reference not a standard. Also, Moran’s I’s effectiveness lies in continuous data. Here we have many binary and categorical data. This is important as significant presence of spatial autocorrelation (clustering of residuals) implies that there are variables not being accounted for. More precisely, the variance or clustering of residuals observed is likely not by chance but due to some uniformly distributed unobserved characteristics (Bucholz, 2004).

Additionally, one reason why the model was able to generalize better on richer neighborhoods than poorer neighborhoods could be the fact that the features we selected, such as the ownership of a luxury pool and tennis court are generally the customer preference applicable to the high-income individuals from whom there is a higher market demand.

Overall, in light of the presence of multicollinearity and spatial autocorrelation, we should look for any missing key variables or misspecifications, and remove redundant variables that distort coefficients (i.e. contributing redundant information) which led to disortions in sale prices.

Conclusions and Limitations

Following what we’ve discussed above, we would not recommend this model to Zillow as-is because its accuracy and generalizability are in question, due to the high variance and distortions introduced by overfitting or misspecification of variables, many of which are not significant and likely redundant, including many of the zoning types and neighborhoods. A common issue with OLS models is overfitting (as in our case), as we will always have some non-zero estimates for the coefficients of every predictors even though some may not be related to sale price, therefore distorting the true predicted value by contributing redundant information. Again, the meagre increase in R-square (as we seen in the OLS of training data section) by including a large number of variables only made it better in predicting existing data but sacrifices the generalizability of the model to predict new unseen data.

To improve this model, we would have to either reduce the number of redundant and/or insignificant variables (e.g. certain zoning types, distance to nursing homes), or opt for ridge or lasso regression, where a large number of predictors and the presence of multicollinearity wouldn’t affect the variances of the coefficient estimates as in OLS (where large variances resulting from multicollinearity lead to high error or distance from the true predicted value), which is likely one of the reasons why most zoning and neighborhood variables are insignificant. Having said that, if we were to use lasso or ridge, we’ll have to compromise interpretability, since the coefficients cannot be interpreted the same way as an OLS model, but rather only in the sense of their magnitude, and we’ll have to adjust for bias and variance at the optimal lamda value (since they are inversely related) to ensure it will do better than the OLS model in predicting sale price.

Since this is a model where there are many spatial elements and data relationships likely change over space, we may need to check additional diagnostics and use other regression approaches in improving the model. We suggest to first check the Koenker (BP) Statistic to see if relationships vary over the study area. If the Statistic is significant then the relationships are inconsistent, meaning that they are strong in some places and weak in others because different conditions throughout the study area may affect the relationship. In this case, GWR (Geographically Weighted Regression) would yield a more accurate result because it uses one regression equation to each feature instead of to all, as in the case of OLS. It uses nearby features to calibrate the coefficients between the dependent (sale price) and explanatory variables (the predictors), and by accounting for spatial variations, it usually leads to a more accurate model with a lower AIC and higher adjusted R-square. Futhermore, to improve this model we will check additional diagnostics to find out the redundant variables due to multicollinearity, using for example, the VIF (variable inflation factor) which can detect if more than one variable is explaining the same thing (e.g. size, density), and check if the model is biased using the Jarque-Bero statistic. A biased model does not replicate expected results in another study area, impacting generalizability, and denotes spatial autocorrelation, misspecification, and missing key explanatory variables or predictors.

Lastly, we must acknowledge the fact that there is no one single correct model, and even if were able to identify all the optimal variables (as well as their data sources), removed the all the redundant ones, and kept bias and variance to a minimum, there will still be severe limitations to this ideal model, as the exact price of a residential property is also a function of the reputation/quality of developers, promotional and marketing strategies, supply and demand at the time, the costs of development, as well as other customer preferences, the data of which are either abstract or more difficult to obtain. Hence, the most accurate way of determining the unknown price of any given property is to assess on a case-by-case basis, similar to how commercial property are often determined through a case-by-case detailed cash flow (DCF) analysis, Net Operating Income (NOI) and the capitalization rate which is a function of a discount rate based on investor perception of risks in the asset class at a given time, the market conditions of the time, the projected NOI growth rate for the specific property as well as other factors that are not generalizable in a statistical model (Linneman & Kirsch, 2018). Having said that, this is not an option here as there are thousands of observations to be predicted, therefore the best we can do is to generalize with a selected small set of features/variables on unknown data at the maximum accuracy across as many scales as possible (neighborhoods,income and race in our case).

References

Bucholz, S. (2004) A Brief Introduction to Spatial Econometrics

Devore, Jay L. “Probability and Statistics” (2008).

Linneman, P., & Kirsch, B. (2018). Real Estate Finance & Investments: Risks and Opportunities. Philadelphia: Linneman Associates.

Data Sources

https://gis-mdc.opendata.arcgis.com/datasets/park-facility https://gis-mdc.opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0?page=6 https://gis-mdc.opendata.arcgis.com/datasets/metrorail-station https://datahub-miamigis.opendata.arcgis.com/datasets/landmarks https://hub.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0?geometry=-80.351%2C25.754%2C-80.079%2C25.808 https://data.census.gov/cedsci/ https://cran.r-project.org/web/packages/osmdata/vignettes/osmdata.html