Introduction

To provide effective and efficient services, government agencies should use the least amount of resources to produce as much public good as possible. Beyond operating programs, agencies typically spend a large amount of resources, including staff time, funding, and attention, on connecting the right people to the right programs, and every failed connection wastes those resources. In the case of this home repair tax credit program, the Department of Housing and Community Development (HCD) proactively reaches out to eligible homeowners at random, but typically only 11% of those contacted take the credit. Although this approach is inclusive, it is neither effective nor efficient: reaching out to eligible homeowners who will not take the credit wastes limited outreach resources. With past record data, it is possible to identify the eligible homeowners who are more likely to take the credit. By prioritizing outreach to them, HCD can save effort while achieving a more successful outcome. Therefore, grounded in a cost-benefit analysis, this project develops a prediction model to make the community outreach process more effective and efficient.

Looking at the features

The past record contains 19 variables describing features of the eligible homeowners as well as whether they took the credit. Seven of the features are numeric (Figure 1A), ranging from age to how much the homeowner has spent on annual home repairs. According to Figure 1A, only the inflation rate (inflation_rate), the number of contacts before this campaign for the individual (previous), and the unemployment rate at the time of the campaign (unemploy_rate) show clear variation between takers and non-takers; the rest show no obvious differences. The features are therefore visualized again in Figure 1B, which plots their full distributions by take / no take outcome. The interpretation is that certain intervals of the continuous values could be meaningful in predicting whether a homeowner will take the credit. For example, it is obvious that homeowners who spent more than $5165 are much less likely to take the credit (Figure 1B, spent_on_repairs). Since the relationships between the dependent variable (take / no take outcome) and the independent variables (numeric continuous features) are not linear, it makes more sense to treat them as categorical. In addition, because the negative unemployment rate values are not explained in the record, that feature is not included in the engineered model.

dat %>%
  dplyr::select(y, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="take", y="Value", 
           title = "Feature associations with the likelihood of take",
           subtitle = "(continuous outcomes for numeric variables)",
           caption = "Figure 1A") +
      theme(legend.position = "none")

dat %>%
    dplyr::select(y, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions take vs. no take",
         subtitle = "(continuous outcomes for numeric variables)",
         caption = "Figure 1B") +
    theme(legend.position = "right")

There are also 12 categorical features (Figures 2A, 2B, 2C), ranging from job to the outcome of the previous marketing campaign. While there are more no's than yes's overall, the likelihood of taking the credit varies across groups, suggesting these might be meaningful features for prediction. For example (Figure 2C, poutcome), the likelihood of taking the credit is higher for homeowners whose previous outcome was a success than for those with failure or nonexistent outcomes. The plots also show that the feature counting the days since the individual was last contacted by a previous program (Figure 2C, pdays) needs feature engineering to become meaningful.

dat %>%
    dplyr::select(y, job, marital, education, taxLien) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="category", y="Count",
             title = "Feature associations with the likelihood of take",
             subtitle = "Categorical features",
             caption="Figure 2A") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

dat %>%
    dplyr::select(y, mortgage, taxbill_in_phl, contact, month) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="category", y="Count",
             title = "Feature associations with the likelihood of take",
             subtitle = "Categorical features",
             caption = "Figure 2B") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

dat %>%
    dplyr::select(y, day_of_week, campaign, pdays, poutcome) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="category", y="Count",
             title = "Feature associations with the likelihood of take",
             subtitle = "Categorical features",
             caption = "Figure 2C") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Feature engineering

The code below shows the feature engineering process. First, all of the initially numeric continuous variables are converted into new categorical variables based on the intervals in Figure 1B that show clear differences in the likelihood of taking the credit. The number of contacts before this campaign (previous) is recoded into a new feature (previous.cat) on the common-sense grounds that only the distinction between zero and non-zero contacts (contacted versus not contacted) matters. A new feature (avgTake_pdays) is also created to capture the average take rate for each value of the days-since-last-contact variable (pdays), because the raw number of days does not convey as much information. The same argument applies to other features, including job, marital, education, and so on; I engineered new features from those variables too, but they did not strengthen the prediction, so they are not included here.

# 4 The Sensitivity (True Positive Rate) for a model with all the features is very low. Engineer new features that significantly increase the Sensitivity
dat_eng <- 
  dat %>%
  mutate(ageGroup = case_when(age < 28 ~ "age0_28",
                              age >= 28 & age < 52 ~ "age28_52",
                              age >= 52 & age < 56 ~ "age52_56",
                              age >= 56 ~ "age56+"),
         cons.conf.cat = case_when(
           cons.conf.idx < -47.5 ~ "conf-47.5-",
           cons.conf.idx >= -47.5 & cons.conf.idx < -45 ~ "conf-47.5_-45",
           cons.conf.idx >= -45 & cons.conf.idx < -43.75 ~ "conf-45_-43.75",
           cons.conf.idx >= -43.75 & cons.conf.idx < -41.25 ~ "conf-43.75_-41.25",
           cons.conf.idx >= -41.25 & cons.conf.idx < -37.5 ~ "conf-41.25_-37.5",
           cons.conf.idx >= -37.5 & cons.conf.idx < -35 ~ "conf-37.5_-35",
           cons.conf.idx >= -35 ~ "conf-35+"),
         cons.price.cat = case_when(
           cons.price.idx < 92.8 ~ "price92.8-",
           cons.price.idx >= 92.8 & cons.price.idx < 93 ~ "price92.8_93",
           cons.price.idx >= 93 & cons.price.idx < 93.1 ~ "price93_93.1",
           cons.price.idx >= 93.1 & cons.price.idx < 93.6 ~ "price93.1_93.6",
           cons.price.idx >= 93.6 & cons.price.idx < 93.8 ~ "price93.6_93.8",
           cons.price.idx >= 93.8 & cons.price.idx < 94.1 ~ "price93.8_94.1",
           cons.price.idx >= 94.1 & cons.price.idx < 94.3 ~ "price94.1_94.3",
           cons.price.idx >= 94.3 & cons.price.idx < 94.6 ~ "price94.3_94.6",
           cons.price.idx >= 94.6 ~ "price94.6+"),
         inflation.cat = case_when(
           inflation_rate < 2.5 ~ "inflation2.5-",
           inflation_rate >= 2.5 & inflation_rate < 3.5 ~ "inflation2.5_3.5",
           inflation_rate >= 3.5 ~ "inflation3.5+"),
         previous.cat = case_when(
           previous == 0 ~ "previous0",
           previous > 0 ~ "previousNOT0"), #theoretically only 0 and not 0 makes a difference
         repair_spent.cat = case_when(
           spent_on_repairs < 5087.5 ~ "repair5087.5-",
           spent_on_repairs >= 5087.5 & spent_on_repairs < 5112.5 ~ "repair5087.5_5112.5",
           spent_on_repairs >= 5112.5 & spent_on_repairs < 5165 ~ "repair5112.5_5165",
           spent_on_repairs >= 5165 ~ "repair5165+")
         )

dat_eng <-
  dat_eng %>%
  group_by(pdays) %>% # compute the average take rate for each pdays value
  summarize(totalTake = sum(y_numeric),
            n = n(),
            avgTake_pdays = totalTake/n*100) %>%
  dplyr::select(-n, -totalTake) %>%
  right_join(dat_eng, .)

New features

Figure 3A shows the mean likelihood of taking the credit by category for the previously continuous variables. It is now clear that:

* Homeowners aged 56 or older are more likely to take the credit.
* Homeowners reached when the Consumer Confidence Index was below -47.5 are more likely to take the credit.
* Homeowners reached when the Consumer Price Index was at least 93.6 but below 93.8 are more likely to take the credit.
* Homeowners reached when the US inflation rate was below 2.5 are more likely to take the credit.
* Homeowners who had been contacted before this campaign are more likely to take the credit.
* Homeowners who have spent less than $5087.5 on annual home repairs are more likely to take the credit.

Additionally, Figure 3B shows the general pattern between avgTake_pdays and the likelihood of taking the credit: homeowners are slightly more willing to take the credit as the average take rate for their pdays value increases.

dat_eng %>%
  dplyr::select(y_numeric, ageGroup, previous.cat, cons.price.cat, cons.conf.cat, inflation.cat, repair_spent.cat) %>%
  gather(Variable, value, -y_numeric) %>%
    ggplot(aes(value, y_numeric)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="value", y="likelihood of take", 
           title = "Feature associations with the likelihood of take",
           subtitle = "(new categorical variables)",
           caption = "Figure 3A")

dat_eng %>%
  dplyr::select(y_numeric, avgTake_pdays) %>%
  gather(Variable, value, -y_numeric) %>%
    ggplot(aes(value, y_numeric)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="value", y="likelihood of take", 
           title = "Feature associations with the likelihood of take",
           subtitle = "(new feature: avgTake_pdays)",
           caption = "Figure 3B") +
      theme(legend.position = "none")

Split data into a 65/35 training/test set

Because the dataset contains far more "no take" than "take" outcomes, homeowners who take the credit are rare, and rare events are hard for machine learning models to predict. To find the commonalities among credit takers, I create a second, more balanced dataset by undersampling the "no" observations so that the numbers of yes's and no's are similar; the original dataset is also kept for reference in further analysis. Both the original, imbalanced dataset and the new, balanced dataset are then split into a training set for the regression summaries and a test set for further analysis.

# Creating new dataset that has even yes and no in y

# Get random sample that has ~500 no's
dat_no <- dat_eng %>%
  filter(y_numeric == 0)
dat_yes <- dat_eng %>%
  filter(y_numeric == 1)

set.seed(3456)
noIndex <- createDataPartition(dat_no$y, p = .135,
                                  list = FALSE,
                                  times = 1)
dat_no <- dat_no[noIndex,]
new_data_all <- rbind(dat_no, dat_yes)

# For kitchen sink
new_data <- new_data_all %>%
  select(age, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, y, y_numeric, X1)

# For engineered
new_data_eng <- new_data_all %>%
  select(ageGroup, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, avgTake_pdays, previous.cat, poutcome, cons.price.cat, cons.conf.cat, inflation.cat,repair_spent.cat, y, y_numeric, X1)

# Splitting
# Old dataset
set.seed(123456)
trainIndex <- createDataPartition(dat_eng$y, p = .65,
                                  list = FALSE,
                                  times = 1)
datTrain <- dat_eng[ trainIndex,] %>%
  select(age, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, y, y_numeric, X1)
datTest  <- dat_eng[-trainIndex,] %>%
  select(age, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, y, y_numeric, X1)

dat_engTrain <- dat_eng[ trainIndex,] %>%
   select(ageGroup, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, avgTake_pdays, previous.cat, poutcome, cons.price.cat, cons.conf.cat, inflation.cat,repair_spent.cat, y, y_numeric, X1)
dat_engTest  <- dat_eng[-trainIndex,] %>%
   select(ageGroup, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, avgTake_pdays, previous.cat, poutcome, cons.price.cat, cons.conf.cat, inflation.cat,repair_spent.cat, y, y_numeric, X1)

# New dataset
set.seed(123456)
trainIndex <- createDataPartition(new_data$y, p = .65,
                                  list = FALSE,
                                  times = 1)
new_datTrain <- new_data[ trainIndex,] %>%
    select(age, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, y, y_numeric, X1)
new_datTest  <- new_data[-trainIndex,] %>%
    select(age, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays, previous, poutcome, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, y, y_numeric, X1)

new_dat_engTrain <- new_data_eng[ trainIndex,] %>%
     select(ageGroup, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, avgTake_pdays, previous.cat, poutcome, cons.price.cat, cons.conf.cat, inflation.cat,repair_spent.cat, y, y_numeric, X1)
new_dat_engTest  <- new_data_eng[-trainIndex,] %>%
     select(ageGroup, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, avgTake_pdays, previous.cat, poutcome, cons.price.cat, cons.conf.cat, inflation.cat,repair_spent.cat, y, y_numeric, X1)

Regression summary for both the kitchen sink and engineered features

The regression summaries are shown below. Comparing the McFadden pseudo-R-squared values and AICs across the four models, the engineered model is slightly better than the kitchen sink model on both the old and new datasets: the McFadden pseudo-R-squared increases and the AIC decreases. The models trained on the new, balanced dataset are in turn better than those trained on the original dataset, again with a higher McFadden pseudo-R-squared and a lower AIC. This makes sense: when the yes's are no longer rare, the model can fit both credit takers and non-takers better. The summaries below cover the old kitchen sink, old engineered, new kitchen sink, and new engineered models.
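The comparisons rely on McFadden's pseudo-R-squared, as reported by pR2() from the pscl package, and on AIC. For reference, McFadden's measure is 1 minus the ratio of the fitted model's log-likelihood to the null (intercept-only) model's log-likelihood; the minimal sketch below, which is not part of the original analysis, shows how it can be computed directly from any fitted glm object (such as the models that follow):

```r
# McFadden pseudo-R-squared: 1 - logLik(fitted) / logLik(intercept-only)
mcfadden <- function(model) {
  null_model <- update(model, . ~ 1)  # refit on the same data with intercept only
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
}

# Hypothetical usage once a model is fitted; the value should match pR2()'s McFadden:
# mcfadden(datModel_KS)
# AIC(datModel_KS)  # lower is better when comparing models on the same data
```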

Old Kitchen Sink

# Old data, kitchen sink
datModel_KS <- glm(y_numeric ~ .,
                  data=datTrain %>% 
                    dplyr::select(-y, -X1),
                  family = binomial(link = "logit"))
summary(datModel_KS)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = datTrain %>% dplyr::select(-y, -X1))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9506  -0.3923  -0.3123  -0.2442   3.0634  
## 
## Coefficients:
##                                 Estimate  Std. Error z value Pr(>|z|)    
## (Intercept)                  -54.7458240 135.7536931  -0.403  0.68675    
## age                            0.0167965   0.0083904   2.002  0.04530 *  
## jobblue-collar                -0.2842282   0.2771254  -1.026  0.30507    
## jobentrepreneur               -0.7948698   0.5520893  -1.440  0.14994    
## jobhousemaid                  -0.3482317   0.5242273  -0.664  0.50651    
## jobmanagement                 -0.4400748   0.3172575  -1.387  0.16540    
## jobretired                    -0.2088053   0.3716366  -0.562  0.57422    
## jobself-employed              -0.6603178   0.4527951  -1.458  0.14475    
## jobservices                   -0.2993078   0.2970601  -1.008  0.31366    
## jobstudent                    -0.4219840   0.4717018  -0.895  0.37100    
## jobtechnician                 -0.1052882   0.2420020  -0.435  0.66351    
## jobunemployed                  0.1819740   0.4091968   0.445  0.65653    
## jobunknown                    -0.1173864   0.7323976  -0.160  0.87266    
## maritalmarried                 0.1127649   0.2417897   0.466  0.64095    
## maritalsingle                  0.1884295   0.2794178   0.674  0.50008    
## maritalunknown                 1.4227535   1.2451067   1.143  0.25317    
## educationbasic.6y              0.3790826   0.4071251   0.931  0.35179    
## educationbasic.9y              0.2349503   0.3299690   0.712  0.47644    
## educationhigh.school           0.1522794   0.3205775   0.475  0.63478    
## educationilliterate          -12.3085042 535.4114641  -0.023  0.98166    
## educationprofessional.course   0.3800762   0.3511686   1.082  0.27911    
## educationuniversity.degree     0.1237838   0.3300073   0.375  0.70759    
## educationunknown               0.4484901   0.4260660   1.053  0.29251    
## taxLienunknown                -0.0134188   0.2189180  -0.061  0.95112    
## taxLienyes                   -10.2243073 535.4114572  -0.019  0.98476    
## mortgageunknown               -0.6415508   0.5648581  -1.136  0.25605    
## mortgageyes                   -0.3218310   0.1479016  -2.176  0.02956 *  
## taxbill_in_phlyes              0.0010929   0.1961573   0.006  0.99555    
## contacttelephone              -1.2082941   0.3106159  -3.890  0.00010 ***
## monthaug                      -0.5318931   0.4613765  -1.153  0.24898    
## monthdec                       0.6646712   0.7657120   0.868  0.38537    
## monthjul                      -0.1846513   0.3762232  -0.491  0.62357    
## monthjun                       0.2245640   0.4618395   0.486  0.62680    
## monthmar                       1.7411304   0.6210737   2.803  0.00506 ** 
## monthmay                      -0.3093757   0.3145059  -0.984  0.32527    
## monthnov                      -0.5251596   0.4383972  -1.198  0.23095    
## monthoct                      -0.8337962   0.5845609  -1.426  0.15376    
## monthsep                      -0.5321598   0.6532525  -0.815  0.41528    
## day_of_weekmon                -0.0823495   0.2264891  -0.364  0.71616    
## day_of_weekthu                -0.1006872   0.2265655  -0.444  0.65675    
## day_of_weektue                -0.2232386   0.2373104  -0.941  0.34686    
## day_of_weekwed                 0.1171600   0.2369428   0.494  0.62098    
## campaign                      -0.0824037   0.0429460  -1.919  0.05501 .  
## pdays                         -0.0005157   0.0008926  -0.578  0.56343    
## previous                       0.1004598   0.2108717   0.476  0.63379    
## poutcomenonexistent            0.7666182   0.3608626   2.124  0.03364 *  
## poutcomesuccess                1.2530653   0.8806718   1.423  0.15478    
## unemploy_rate                 -0.3340069   0.5181717  -0.645  0.51919    
## cons.price.idx                 0.8002426   0.8998448   0.889  0.37384    
## cons.conf.idx                  0.0778920   0.0301135   2.587  0.00969 ** 
## inflation_rate                -0.0955743   0.4674072  -0.204  0.83798    
## spent_on_repairs              -0.0036923   0.0109038  -0.339  0.73489    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1413.0  on 2627  degrees of freedom
## AIC: 1517
## 
## Number of Fisher Scoring iterations: 12
pR2(datModel_KS)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -706.5142187 -926.8710680  440.7136986    0.2377427    0.1516880    0.3037389

Old engineered

# Old data, engineered
datModel_eng <- glm(y_numeric ~ .,
                  data=dat_engTrain %>% 
                    dplyr::select(-y, -X1),
                  family = binomial(link = "logit"))
summary(datModel_eng)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = dat_engTrain %>% dplyr::select(-y, -X1))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4753  -0.3883  -0.3086  -0.2426   3.0690  
## 
## Coefficients: (3 not defined because of singularities)
##                                       Estimate Std. Error z value   Pr(>|z|)
## (Intercept)                          10.479066 622.379775   0.017   0.986567
## ageGroupage28_52                      0.428671   0.309732   1.384   0.166357
## ageGroupage52_56                      0.667863   0.404044   1.653   0.098342
## ageGroupage56+                        0.634126   0.419057   1.513   0.130224
## jobblue-collar                       -0.257237   0.278041  -0.925   0.354874
## jobentrepreneur                      -0.794909   0.554991  -1.432   0.152060
## jobhousemaid                         -0.296946   0.523728  -0.567   0.570724
## jobmanagement                        -0.507604   0.324244  -1.566   0.117465
## jobretired                           -0.195477   0.399747  -0.489   0.624840
## jobself-employed                     -0.673331   0.452017  -1.490   0.136325
## jobservices                          -0.320902   0.303058  -1.059   0.289654
## jobstudent                           -0.721202   0.499032  -1.445   0.148401
## jobtechnician                        -0.158241   0.247528  -0.639   0.522637
## jobunemployed                         0.158578   0.415048   0.382   0.702409
## jobunknown                           -0.154141   0.750278  -0.205   0.837224
## maritalmarried                        0.043752   0.243258   0.180   0.857265
## maritalsingle                         0.116564   0.272511   0.428   0.668840
## maritalunknown                        1.319232   1.257074   1.049   0.293973
## educationbasic.6y                     0.239815   0.411861   0.582   0.560384
## educationbasic.9y                     0.135789   0.327409   0.415   0.678335
## educationhigh.school                  0.002392   0.320305   0.007   0.994041
## educationilliterate                 -13.944754 882.743583  -0.016   0.987396
## educationprofessional.course          0.250974   0.354565   0.708   0.479047
## educationuniversity.degree            0.009120   0.330084   0.028   0.977957
## educationunknown                      0.391816   0.424593   0.923   0.356110
## taxLienunknown                        0.012654   0.220675   0.057   0.954274
## taxLienyes                          -11.447154 882.743553  -0.013   0.989654
## mortgageunknown                      -0.670463   0.567880  -1.181   0.237745
## mortgageyes                          -0.332098   0.150075  -2.213   0.026906
## taxbill_in_phlyes                     0.034095   0.200553   0.170   0.865008
## contacttelephone                     -1.239099   0.339988  -3.645   0.000268
## monthaug                            -11.619967 622.379431  -0.019   0.985104
## monthdec                            -10.176982 622.379769  -0.016   0.986954
## monthjul                            -11.574310 622.379788  -0.019   0.985163
## monthjun                            -10.644509 622.379702  -0.017   0.986355
## monthmar                            -10.182710 622.379690  -0.016   0.986946
## monthmay                            -10.931490 622.379840  -0.018   0.985987
## monthnov                            -11.273319 622.379472  -0.018   0.985549
## monthoct                            -11.275974 622.379311  -0.018   0.985545
## monthsep                            -11.587997 622.379682  -0.019   0.985145
## day_of_weekmon                       -0.073506   0.229641  -0.320   0.748901
## day_of_weekthu                       -0.140099   0.231849  -0.604   0.545666
## day_of_weektue                       -0.213192   0.241240  -0.884   0.376840
## day_of_weekwed                        0.145685   0.239940   0.607   0.543736
## campaign                             -0.073946   0.042551  -1.738   0.082247
## avgTake_pdays                         0.045387   0.011834   3.835   0.000125
## previous.catpreviousNOT0             -0.650605   0.241001  -2.700   0.006943
## poutcomenonexistent                         NA         NA      NA         NA
## poutcomesuccess                      -0.669814   0.705621  -0.949   0.342491
## cons.price.catprice92.8_93           -0.865340   0.848631  -1.020   0.307875
## cons.price.catprice93.1_93.6          0.148428   0.685646   0.216   0.828615
## cons.price.catprice93.6_93.8        -10.672808 622.378975  -0.017   0.986318
## cons.price.catprice93.8_94.1          0.587211   0.508107   1.156   0.247811
## cons.price.catprice93_93.1          -11.234446 622.380427  -0.018   0.985598
## cons.price.catprice94.1_94.3          0.625198   0.537087   1.164   0.244402
## cons.price.catprice94.3_94.6          0.450641   1.092208   0.413   0.679902
## cons.price.catprice94.6+             -1.351747   1.330591  -1.016   0.309677
## cons.conf.catconf-47.5-               0.684016   1.258394   0.544   0.586742
## cons.conf.catconf-47.5_-45           -1.264130   0.852118  -1.484   0.137938
## inflation.catinflation3.5+           -2.531283   0.563944  -4.489 0.00000717
## repair_spent.catrepair5087.5_5112.5         NA         NA      NA         NA
## repair_spent.catrepair5165+                 NA         NA      NA         NA
##                                        
## (Intercept)                            
## ageGroupage28_52                       
## ageGroupage52_56                    .  
## ageGroupage56+                         
## jobblue-collar                         
## jobentrepreneur                        
## jobhousemaid                           
## jobmanagement                          
## jobretired                             
## jobself-employed                       
## jobservices                            
## jobstudent                             
## jobtechnician                          
## jobunemployed                          
## jobunknown                             
## maritalmarried                         
## maritalsingle                          
## maritalunknown                         
## educationbasic.6y                      
## educationbasic.9y                      
## educationhigh.school                   
## educationilliterate                    
## educationprofessional.course           
## educationuniversity.degree             
## educationunknown                       
## taxLienunknown                         
## taxLienyes                             
## mortgageunknown                        
## mortgageyes                         *  
## taxbill_in_phlyes                      
## contacttelephone                    ***
## monthaug                               
## monthdec                               
## monthjul                               
## monthjun                               
## monthmar                               
## monthmay                               
## monthnov                               
## monthoct                               
## monthsep                               
## day_of_weekmon                         
## day_of_weekthu                         
## day_of_weektue                         
## day_of_weekwed                         
## campaign                            .  
## avgTake_pdays                       ***
## previous.catpreviousNOT0            ** 
## poutcomenonexistent                    
## poutcomesuccess                        
## cons.price.catprice92.8_93             
## cons.price.catprice93.1_93.6           
## cons.price.catprice93.6_93.8           
## cons.price.catprice93.8_94.1           
## cons.price.catprice93_93.1             
## cons.price.catprice94.1_94.3           
## cons.price.catprice94.3_94.6           
## cons.price.catprice94.6+               
## cons.conf.catconf-47.5-                
## cons.conf.catconf-47.5_-45             
## inflation.catinflation3.5+          ***
## repair_spent.catrepair5087.5_5112.5    
## repair_spent.catrepair5165+            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1385.3  on 2620  degrees of freedom
## AIC: 1503.3
## 
## Number of Fisher Scoring iterations: 13
pR2(datModel_eng)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -692.6454443 -926.8710680  468.4512474    0.2527057    0.1604259    0.3212356

New kitchen sink

# New, even-outcome data, kitchen sink
new_datModel_KS <- glm(y_numeric ~ .,
                  data=new_datTrain %>% 
                    dplyr::select(-y, -X1),
                  family = binomial(link = "logit"))
summary(new_datModel_KS)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = new_datTrain %>% dplyr::select(-y, -X1))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9525  -0.8106  -0.4056   0.7950   2.2584  
## 
## Coefficients:
##                                  Estimate   Std. Error z value Pr(>|z|)   
## (Intercept)                  -117.7994166  227.8453100  -0.517  0.60515   
## age                             0.0413686    0.0127370   3.248  0.00116 **
## jobblue-collar                 -0.0265943    0.3981749  -0.067  0.94675   
## jobentrepreneur                -0.4632098    0.6939070  -0.668  0.50443   
## jobhousemaid                   -0.1427838    0.7206181  -0.198  0.84293   
## jobmanagement                  -0.2113601    0.4725174  -0.447  0.65465   
## jobretired                     -0.3892447    0.5970159  -0.652  0.51441   
## jobself-employed               -0.6069398    0.5544948  -1.095  0.27370   
## jobservices                     0.0584659    0.4261835   0.137  0.89088   
## jobstudent                      0.4294287    0.7327984   0.586  0.55787   
## jobtechnician                   0.2487181    0.3463309   0.718  0.47266   
## jobunemployed                  -1.0295445    0.6700171  -1.537  0.12439   
## jobunknown                      0.5770308    1.1704016   0.493  0.62200   
## maritalmarried                  0.4167315    0.3315647   1.257  0.20880   
## maritalsingle                   0.4659052    0.3797283   1.227  0.21984   
## maritalunknown                  1.2495253    1.5082690   0.828  0.40742   
## educationbasic.6y               0.8679012    0.5684905   1.527  0.12684   
## educationbasic.9y              -0.0008492    0.4574190  -0.002  0.99852   
## educationhigh.school            0.1886204    0.4460652   0.423  0.67240   
## educationprofessional.course    0.3074157    0.5058214   0.608  0.54335   
## educationuniversity.degree      0.0126907    0.4631024   0.027  0.97814   
## educationunknown               -0.7221196    0.6110208  -1.182  0.23728   
## taxLienunknown                 -0.0193064    0.2806462  -0.069  0.94515   
## mortgageunknown                -0.3424998    0.7202104  -0.476  0.63439   
## mortgageyes                     0.0150266    0.2127651   0.071  0.94370   
## taxbill_in_phlyes               0.2172057    0.2748417   0.790  0.42936   
## contacttelephone               -1.4087590    0.4666483  -3.019  0.00254 **
## monthaug                        0.6306784    0.7251314   0.870  0.38444   
## monthdec                        2.2952568    1.3717141   1.673  0.09427 . 
## monthjul                        0.3876301    0.5688612   0.681  0.49561   
## monthjun                       -0.4386627    0.6946952  -0.631  0.52775   
## monthmar                        2.7019270    0.9961193   2.712  0.00668 **
## monthmay                       -0.0612274    0.4870458  -0.126  0.89996   
## monthnov                        0.5257517    0.6660425   0.789  0.42990   
## monthoct                        0.9908054    0.8292764   1.195  0.23217   
## monthsep                        0.2344104    0.9852809   0.238  0.81195   
## day_of_weekmon                 -0.3112423    0.3409982  -0.913  0.36138   
## day_of_weekthu                 -0.2652596    0.3408926  -0.778  0.43649   
## day_of_weektue                 -0.1316540    0.3465472  -0.380  0.70402   
## day_of_weekwed                  0.4817326    0.3445170   1.398  0.16203   
## campaign                       -0.0983451    0.0529554  -1.857  0.06329 . 
## pdays                          -0.2922039    0.1261394  -2.317  0.02053 * 
## previous                       -0.3190574    0.3822761  -0.835  0.40393   
## poutcomenonexistent             0.2565057    0.5652093   0.454  0.64996   
## poutcomesuccess              -287.9962842  125.1804739  -2.301  0.02141 * 
## unemploy_rate                  -1.2879412    0.7252791  -1.776  0.07577 . 
## cons.price.idx                  3.1449143    1.2930168   2.432  0.01501 * 
## cons.conf.idx                   0.1070785    0.0497066   2.154  0.03122 * 
## inflation_rate                 -0.8179719    0.6686069  -1.223  0.22118   
## spent_on_repairs                0.0232786    0.0152813   1.523  0.12768   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.98  on 616  degrees of freedom
## Residual deviance: 611.93  on 567  degrees of freedom
## AIC: 711.93
## 
## Number of Fisher Scoring iterations: 10
pR2(new_datModel_KS)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -305.9644105 -426.9900358  242.0512505    0.2834390    0.3245009    0.4329872

New engineered

# New, even-outcome data, engineered
new_datModel_eng <- glm(y_numeric ~ .,
                  data=new_dat_engTrain %>% 
                    dplyr::select(-y, -X1),
                  family="binomial" (link="logit"))
summary(new_datModel_eng)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = new_dat_engTrain %>% dplyr::select(-y, -X1))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7727  -0.8055  -0.3991   0.7322   2.2069  
## 
## Coefficients: (3 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                           -3.55788 1649.52219  -0.002  0.99828   
## ageGroupage28_52                       1.18285    0.49615   2.384  0.01712 * 
## ageGroupage52_56                       0.97916    0.63382   1.545  0.12238   
## ageGroupage56+                         1.44280    0.62907   2.294  0.02182 * 
## jobblue-collar                        -0.16811    0.40206  -0.418  0.67585   
## jobentrepreneur                       -0.33765    0.72030  -0.469  0.63924   
## jobhousemaid                          -0.01022    0.71881  -0.014  0.98866   
## jobmanagement                         -0.19126    0.48233  -0.397  0.69172   
## jobretired                             0.13667    0.62186   0.220  0.82604   
## jobself-employed                      -0.47473    0.56786  -0.836  0.40315   
## jobservices                            0.01284    0.43264   0.030  0.97632   
## jobstudent                             0.38001    0.77548   0.490  0.62411   
## jobtechnician                          0.09707    0.35370   0.274  0.78375   
## jobunemployed                         -1.22669    0.70541  -1.739  0.08204 . 
## jobunknown                             0.61540    1.19617   0.514  0.60692   
## maritalmarried                         0.35072    0.33228   1.055  0.29120   
## maritalsingle                          0.25182    0.36991   0.681  0.49602   
## maritalunknown                         0.95756    1.49695   0.640  0.52239   
## educationbasic.6y                      0.55269    0.56682   0.975  0.32953   
## educationbasic.9y                     -0.15411    0.45483  -0.339  0.73474   
## educationhigh.school                  -0.03969    0.44501  -0.089  0.92893   
## educationprofessional.course          -0.06671    0.51719  -0.129  0.89737   
## educationuniversity.degree            -0.32235    0.46554  -0.692  0.48867   
## educationunknown                      -0.68081    0.63998  -1.064  0.28742   
## taxLienunknown                         0.01959    0.28211   0.069  0.94464   
## mortgageunknown                       -0.41583    0.79748  -0.521  0.60206   
## mortgageyes                            0.06528    0.21570   0.303  0.76218   
## taxbill_in_phlyes                      0.17741    0.28050   0.632  0.52706   
## contacttelephone                      -1.43576    0.50083  -2.867  0.00415 **
## monthaug                               3.40579 1649.52204   0.002  0.99835   
## monthdec                               4.68252 1649.52251   0.003  0.99774   
## monthjul                               1.92966 1649.52247   0.001  0.99907   
## monthjun                               3.18956 1649.52247   0.002  0.99846   
## monthmar                              18.71662 1785.23781   0.010  0.99164   
## monthmay                               1.79525 1649.52256   0.001  0.99913   
## monthnov                               3.58904 1649.52206   0.002  0.99826   
## monthoct                               3.24888 1649.52190   0.002  0.99843   
## monthsep                               2.48100 1649.52238   0.002  0.99880   
## day_of_weekmon                        -0.32486    0.34191  -0.950  0.34205   
## day_of_weekthu                        -0.36382    0.34346  -1.059  0.28947   
## day_of_weektue                        -0.28206    0.35556  -0.793  0.42760   
## day_of_weekwed                         0.41555    0.34392   1.208  0.22695   
## campaign                              -0.09618    0.05337  -1.802  0.07153 . 
## avgTake_pdays                          0.11779    0.03690   3.192  0.00141 **
## previous.catpreviousNOT0              -0.60185    0.36416  -1.653  0.09839 . 
## poutcomenonexistent                         NA         NA      NA       NA   
## poutcomesuccess                       -4.62626    1.97874  -2.338  0.01939 * 
## cons.price.catprice92.8_93            -0.86101    1.68375  -0.511  0.60909   
## cons.price.catprice93.1_93.6          -0.92090    1.25477  -0.734  0.46300   
## cons.price.catprice93.6_93.8          18.18775 1381.79249   0.013  0.98950   
## cons.price.catprice93.8_94.1           1.37891    1.02216   1.349  0.17733   
## cons.price.catprice93_93.1             1.31518 1649.52354   0.001  0.99936   
## cons.price.catprice94.1_94.3           0.30856    0.98463   0.313  0.75399   
## cons.price.catprice94.3_94.6           0.87504    2.19604   0.398  0.69029   
## cons.price.catprice94.6+              14.88576  682.75607   0.022  0.98261   
## cons.conf.catconf-47.5-              -15.32177  682.75571  -0.022  0.98210   
## cons.conf.catconf-47.5_-45             0.19723    1.66450   0.118  0.90568   
## inflation.catinflation3.5+            -2.07187    1.04307  -1.986  0.04700 * 
## repair_spent.catrepair5087.5_5112.5         NA         NA      NA       NA   
## repair_spent.catrepair5165+                 NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.98  on 616  degrees of freedom
## Residual deviance: 595.10  on 560  degrees of freedom
## AIC: 709.1
## 
## Number of Fisher Scoring iterations: 15
pR2(new_datModel_eng)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -297.5478411 -426.9900358  258.8843893    0.3031504    0.3426809    0.4572451

Cross validation

Even though the regression summaries on the training sets suggest that the fourth model (the new, even-outcome dataset with engineered features) is a slightly better fit, I cross-validate the models on the testing datasets to take a closer look at accuracy by comparing ROC, Sensitivity, and Specificity. The results below are for the old kitchen sink, old engineered, new kitchen sink, and new engineered models.

Old Kitchen Sink

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(y ~ ., data = datTest %>% 
                                   dplyr::select(
                                   -y_numeric, -X1) %>%
                 mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit
## Generalized Linear Model 
## 
## 1440 samples
##   19 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 1425, 1425, 1425, 1426, 1427, 1426, ... 
## Resampling results:
## 
##   ROC        Sens   Spec     
##   0.7190064  0.225  0.9783974

Old engineered

ctrl_eng <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit_eng <- train(y ~ ., data = dat_engTest %>% 
                                   dplyr::select(
                                   -y_numeric, -X1) %>%
                 mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl_eng)

cvFit_eng
## Generalized Linear Model 
## 
## 1440 samples
##   18 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 1425, 1425, 1426, 1426, 1426, 1426, ... 
## Resampling results:
## 
##   ROC   Sens  Spec     
##   0.71  0.23  0.9782051

New kitchen sink

new_ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

new_cvFit <- train(y ~ ., data = new_datTest %>% 
                                   dplyr::select(
                                   -y_numeric, -X1) %>%
                 mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = new_ctrl)

new_cvFit
## Generalized Linear Model 
## 
## 330 samples
##  19 predictor
##   2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 328, 326, 326, 328, 328, 327, ... 
## Resampling results:
## 
##   ROC   Sens   Spec
##   0.68  0.555  0.74

New engineered

new_ctrl_eng <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

new_cvFit_eng <- train(y ~ ., data = new_dat_engTest %>% 
                                   dplyr::select(
                                   -y_numeric, -X1) %>%
                 mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = new_ctrl_eng)

new_cvFit_eng
## Generalized Linear Model 
## 
## 330 samples
##  18 predictor
##   2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 327, 328, 328, 327, 326, 327, ... 
## Resampling results:
## 
##   ROC   Sens  Spec 
##   0.64  0.49  0.735

Four models, from Figure 4A to 4D, are compared here. The baseline old kitchen sink model (Figure 4A) uses all features and the original random dataset, while the new engineered model (Figure 4D) uses engineered features and the even-outcome dataset. Keep in mind that the different datasets produce different distributions (Figures 4A-4B vs. Figures 4C-4D). In terms of the Receiver Operating Characteristic (ROC) curve, which indicates the model's goodness of fit, my feature engineering unfortunately does not lead to significant differences, even after several different engineering attempts (compare 4A with 4B, and 4C with 4D). For Sensitivity, the rate at which the model correctly predicts true positives, the models built from the even-outcome dataset (Figures 4C and 4D) do much better than those built from the original dataset (Figures 4A and 4B); however, this comes at the cost of Specificity, the rate at which the model correctly predicts true negatives.
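As a quick sanity check of these two definitions (outside the R pipeline), Sensitivity and Specificity can be recomputed by hand from a 2x2 confusion matrix; the minimal Python sketch below uses the old engineered model's test-set counts, which appear in the confusion matrix output later in this report:

```python
# Sensitivity and Specificity from a 2x2 confusion matrix,
# using the old engineered model's test-set counts.
tp, fn = 39, 118   # actual takers: predicted take / predicted no take
tn, fp = 1253, 29  # actual non-takers: predicted no take / predicted take

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(round(sensitivity, 5))  # 0.24841
print(round(specificity, 5))  # 0.97738
```

These match the Sensitivity (0.24841) and Specificity (0.97738) that `caret::confusionMatrix` reports for the old engineered model.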

dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics for Old Kitchen Sink Model",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 4A") +
    plotTheme()

dplyr::select(cvFit_eng$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit_eng$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics for Old Engineered Model",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 4B") +
    plotTheme()

dplyr::select(new_cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(new_cvFit$results[2:4], metric, mean)) %>%  # use new_cvFit, not new_cvFit_eng
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics for New Kitchen Sink Model",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 4C") +
    plotTheme()

dplyr::select(new_cvFit_eng$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(new_cvFit_eng$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics for New Engineered Model",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 4D") +
    plotTheme()

Selecting a Model

Taking into account the regression and cross-validation results, I select the final model from between the old engineered model and the new engineered model, which are both feature-engineered, one built from the original dataset and one from the even-outcome dataset. As shown in Figures 5A and 5B, the advantage of the old engineered model (engineered features, original dataset) is a slightly better goodness of fit (by 0.05). As shown in the code output below, it also has a higher Specificity (by 0.14). The new engineered model (engineered features, even-outcome dataset), however, has a substantially higher Sensitivity (by 0.24). In the HCD case, even though predicting the True Positives right is important, the past record shows that many eligible homeowners reject the credit program, and the feature visualizations (Figures 1A-3B) show no strong predictors for who would take it, at least not with the data provided. The new engineered model's gain in Sensitivity therefore comes at a great cost to the model's overall accuracy. Predicting the non-takers right (Specificity) is also important, so as not to waste resources. For now, since the true positives are no guarantee of enhanced public good and the false positives waste resources, it is best to use a generally balanced and accurate model, avoid wasting marketing resources, and accumulate data to develop a model that better predicts the True Positives in the future. So the old engineered model is chosen for further analysis.

Confusion Matrix

The first output is for the old engineered model and the second output is for the new engineered model.

# old data
dat_engTest <- dat_engTest[!dat_engTest$inflation.cat=="inflation2.5_3.5",]
testProbs <- data.frame(Outcome = as.factor(dat_engTest$y_numeric),
                        Probs = predict(datModel_eng, dat_engTest, type= "response"))
testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1253  118
##          1   29   39
##                                             
##                Accuracy : 0.8978            
##                  95% CI : (0.881, 0.913)    
##     No Information Rate : 0.8909            
##     P-Value [Acc > NIR] : 0.212             
##                                             
##                   Kappa : 0.3005            
##                                             
##  Mcnemar's Test P-Value : 0.0000000000003925
##                                             
##             Sensitivity : 0.24841           
##             Specificity : 0.97738           
##          Pos Pred Value : 0.57353           
##          Neg Pred Value : 0.91393           
##              Prevalence : 0.10910           
##          Detection Rate : 0.02710           
##    Detection Prevalence : 0.04726           
##       Balanced Accuracy : 0.61289           
##                                             
##        'Positive' Class : 1                 
## 
# new data
new_testProbs <- data.frame(Outcome = as.factor(new_dat_engTest$y_numeric),
                        Probs = predict(new_datModel_eng, new_dat_engTest, type= "response"))

new_testProbs <- 
  new_testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(new_testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(new_testProbs$predOutcome, new_testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 145  80
##          1  28  77
##                                           
##                Accuracy : 0.6727          
##                  95% CI : (0.6192, 0.7231)
##     No Information Rate : 0.5242          
##     P-Value [Acc > NIR] : 0.00000003142   
##                                           
##                   Kappa : 0.3337          
##                                           
##  Mcnemar's Test P-Value : 0.00000092255   
##                                           
##             Sensitivity : 0.4904          
##             Specificity : 0.8382          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.6444          
##              Prevalence : 0.4758          
##          Detection Rate : 0.2333          
##    Detection Prevalence : 0.3182          
##       Balanced Accuracy : 0.6643          
##                                           
##        'Positive' Class : 1               
## 

ROC curve

# old data
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Old Engineered Model",
       caption = "Figure 5A")

pROC::auc(testProbs$Outcome, testProbs$Probs) #0.7566
## Area under the curve: 0.7566
# new data
ggplot(new_testProbs, aes(d = as.numeric(new_testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - New Engineered Model",
       caption = "Figure 5B")

pROC::auc(new_testProbs$Outcome, new_testProbs$Probs) #0.7066
## Area under the curve: 0.7066

Cost Benefit (C/B) Analysis

Fiscal C/B equation for each confusion metric

  • True Positive - Correctly predicted that the homeowner would take the credit; marketing resources were allocated, and the credit was taken by 25% of them and not taken by the other 75%. HCD gains extra real estate tax revenue through the new sale premium. For this analysis, the tax rate is set at the 1.1% national average^1. Public good will be enhanced.
    • (-$2850 - $5000 + $10000 * 0.11) * (25% of count) + -$2850 * (75% of count)
  • True Negative - Correctly predicted that the homeowner would not take the credit; no marketing resources were allocated, and no credit was taken. Public good will not be enhanced.
    • $0
  • False Positive - Incorrectly predicted that the homeowner would take the credit; marketing resources were allocated, but no credit was taken. Public good will not be enhanced.
    • -$2850 - $0 = -$2850
  • False Negative - Predicted that the homeowner would not take the credit, but they took it anyway. These are likely homeowners who signed up for reasons unrelated to the marketing campaign, so one could '0 out' this category, assuming its cost/benefit is $0. But even then, each credit still costs HCD the credit amount while gaining the extra tax revenue, and public good is enhanced; so this category should not be '0ed out' when counting dollars.
    • ($0 - $5000 + $10000 * 0.11) * count = -$3900 * count
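As a check on the arithmetic (outside the R pipeline), the equations above can be applied to the 50%-threshold confusion-matrix counts for the old engineered model (TN = 1253, TP = 39, FN = 118, FP = 29), using the multipliers exactly as coded in the report:

```python
# Revenue per confusion-matrix category, mirroring the C/B equations
# above: $2850 marketing cost, $5000 credit, $10000 premium * 0.11,
# and only 25% of predicted takers actually taking the credit.
MARKETING, CREDIT, PREMIUM_TAX = 2850, 5000, 10000 * 0.11

def revenue(category, count):
    if category == "TN":
        return 0
    if category == "TP":  # 25% take the credit, 75% do not
        return ((-MARKETING - CREDIT + PREMIUM_TAX) * (count * 0.25)
                + -MARKETING * (count * 0.75))
    if category == "FN":
        return (-CREDIT + PREMIUM_TAX) * count
    if category == "FP":
        return -MARKETING * count

print(revenue("TP", 39))    # -149175.0
print(revenue("FN", 118))   # -460200.0
print(revenue("FP", 29))    # -82650
```

These reproduce the Revenue column of Table 1 below.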

C/B table with the probability threshold being 50%

Table 1 shows the Cost Benefit Table for when the probability threshold is set at 50%, meaning that when the predicted likelihood for the given homeowner is above 0.5, the model will recognize them as potential credit takers.

# old data
whichThreshold <- 
  iterateThresholds(
     data=testProbs, observedClass = Outcome, predictedProbs = Probs)

whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
                 case_when(
                   Variable == "Count_TN"  ~ 0,
                   Variable == "Count_TP"  ~ (-2850-5000+10000 * 0.11) * (Count * 0.25) + -2850 * (Count * 0.75),
                   Variable == "Count_FN" ~ (-5000+10000 * 0.11) * Count,
                   Variable == "Count_FP" ~ -2850 * Count))

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ 0,
                         Variable == "True_Positive"  ~ (-2850-5000+10000 * 0.11) * Count * 0.25 + -2850 * (Count * 0.75),
                         Variable == "False_Negative" ~ (-5000+10000 * 0.11) * Count,
                         Variable == "False_Positive" ~ -2850 * Count)) %>%
    bind_cols(data.frame(Description = c(
              "We predicted no take and the customer did not take",
              "We predicted take and allocated the resources and the customer took",
              "We predicted no take and the customer took",
              "We predicted take and allocated the resources and the customer did not take")))

kable(cost_benefit_table) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "Table 1")
Variable Count Revenue Description
True_Negative 1253 0 We predicted no take and the customer did not take
True_Positive 39 -149175 We predicted take and allocated the resources and the customer took
False_Negative 118 -460200 We predicted no take and the customer took
False_Positive 29 -82650 We predicted take and allocated the resources and the customer did not take

Table 1

Confusion metric outcomes for each Threshold

Figure 6 illustrates the confusion metric outcomes for each threshold. Note that the revenue is always negative, no matter the threshold. The tradeoff is mainly between the False Negatives and the other two categories (True Positives and False Positives). Ideally, HCD should avoid false positives, since those resources are wasted and no credit is taken (no public good). However, HCD also needs to take the maximization of public good into account, so its bottom line is different.

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue",
       caption = "Figure 6") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

Threshold as a function of Total_Revenue and Total_Count_of_Credits

Different from the private sector, HCD is not profit-driven. Therefore, the threshold should not be picked to avoid the most cost (Figure 7B) but to generate the greatest count of credits (Figures 7A and 7C). This is tricky given the false negatives and the model's limited Sensitivity. At first, I included the count of credits from the false negatives, but this produces an unhealthy result: the optimization relies on the false negatives to get the most credit takers (Figure 7A). With the threshold at 0.95, a homeowner would only be contacted when the predicted likelihood is 95%. This saves a lot of money, of course, but it makes campaigning meaningless. Therefore, even though many credit takers are in fact unrelated to the campaigns, credit takers from false negatives should be 0ed out here: HCD should try its best to get the word out and persuade people to take the credit program, instead of waiting for people to sign up. After the false negatives' outcomes are excluded, the new optimal threshold is 2% (Figure 7C), meaning that a homeowner will be contacted as long as the predicted likelihood is at least 2%. However, this increases the False Positives and costs HCD more resources.

whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(actualTake = ifelse(Variable == "Count_TP", (Count * 0.25),
                        ifelse(Variable == "Count_FN", Count, 0))) %>% 
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(actualTake))

ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Total_Count_of_Credits))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Total_Count_of_Credits)[1,1]))+
    labs(title = "Total Count of Credits By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold, including False Negatives",
         caption = "Figure 7A")

whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(actualTake = ifelse(Variable == "Count_TP", (Count * 0.25), 0)) %>% 
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(actualTake))

ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Total_Revenue))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Total_Revenue)[1,1]))+
    labs(title = "Model Revenues By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold",
         caption = "Figure 7B")

ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Total_Count_of_Credits))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Total_Count_of_Credits)[1,1]))+
    labs(title = "Total Count of Credits By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold, not including false negatives",
         caption = "Figure 7C")

Table of the Total_Revenue and Total_Count_of_Credits

Table 2 shows that lowering the threshold from 50% to 2% increases the count of credit takers by a large amount, but the resources spent also increase by a large amount (with the false negatives 0ed out).

compareThreshold <- 
  whichThreshold_revenue %>%
  filter(Threshold == 0.50 |
         Threshold == 0.02) %>%
  mutate(Model = paste(Threshold*100,"% Threshold", sep=''),
         Total_Revenue = paste("$", Total_Revenue, spe='')) %>%
  select(Model, Total_Revenue, Total_Count_of_Credits)

kable(compareThreshold) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "Table 2")
Model Total_Revenue Total_Count_of_Credits
2% Threshold $ -3998025 38.25
50% Threshold $ -692025 9.75

Table 2

Conclusion

Generally speaking, without any predictive model, HCD would have to allocate its resources to all households in order to reach the same number of credit takers. In the test set there are 1439 households, so the cost of the allocation alone would be $4,101,150; not to mention that, of those contacted and willing to take the credit, only 25% are estimated to actually take it. With the rate of credit-taking this low, HCD would waste far more resources on allocating marketing without actually increasing the number of credit takers (i.e., without improving public good or increasing potential tax revenue). So the model is helpful.
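The blanket-outreach figure can be reproduced from the test-set confusion-matrix counts (a quick check outside the R pipeline):

```python
# No-model baseline: contact every test-set household at $2850 each.
# Household total is the sum of the four confusion-matrix cells.
households = 1253 + 118 + 29 + 39
outreach_cost = households * 2850
print(households)     # 1439
print(outreach_cost)  # 4101150
```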

For this case, whether to adopt the final model, especially in deciding the threshold, depends on HCD's goals and objectives. Specifically, it depends on how many resources they have and are willing to spend on the campaign. Given the past record data and the 25% credit-taking rate, it is probably not worth the huge amount of campaign resources to get only a few more people into the credit program, as shown in Table 2. For now, I would recommend that they put the model on hold and reflect on the low rate of credit-taking. They should also collect better data to build a better model.

Throughout the analysis, I found a couple of issues with the model that stem from the program itself. For example, the Sensitivity of the model is low because there are few credit-taking records. If they have campaigned this much already and gained such a low response, is campaigning still meaningful? There might be other, more effective ways to promote the home repair credit program. Also, predicting the true positives correctly is of limited value, because even when the model does, only 25% of them would sign up eventually. Why did the eligible homeowners who would take the credit not take it in the end? This is pretty frustrating. Finding ways to get more of these homeowners to take the credit is perhaps more effective than adopting the model.

Not only would resolving these issues help promote the credit program; the reasoning behind them would also help build a better model. For instance, further work should be done to reflect on the false negatives, which make me question the features in the model. These are homeowners the model would not have targeted, yet who would take the credit anyway; if their signing up is unrelated to the marketing campaign, what are their actual motivations? To be fair, the false negatives in the historical data were reached by the campaign, so had we reached out to them, they would have signed up. There must therefore be some features missing from the model needed to predict these homeowners correctly. For example, some false negatives might be homeowners who are friends of other homeowners who were contacted; that is a result of the campaign, but it is not reflected in the data. Having features like this documented would help improve the model.

To gain a better response rate, it is important to identify the significant features of those who will take the credit eventually. It is important to know why homeowners actually took the credit program, rather than why they would hypothetically take it. The past record has limited information on key features related to home repair, including but not limited to wage, free time, gender, and whether their homes need repair at all. I would survey the homeowners who took the credit program and ask them how they heard about it and why they took it. I would then add any currently missing features to the previous model. With this amended information, I should be able to build a model with stronger predictors and thus campaign to the right eligible homeowners.
