Unable to address the missing categorical variables

nihal.ojha · August 2, 2020, 2:44pm

Question 1
dataset$Ever_Married
[1] "No" "Yes" "Yes" "Yes" "NA" "Yes" "No" "No" "Yes" "Yes" "No" "No" "No" "Yes"
[15] "Yes" "No" "No" "No" "Yes" "Yes" "Yes" "No" "Yes" "No" NA "Yes" "No" "NA"

The variable shown above is categorical and has over 8608 entries.
I want to replace its values with numeric codes such as 1 and 2, but I am unable to do so.

I tried the code below, but it was unsuccessful:
"dataset$Married= factor(dataset$Ever_Married,
levels= c('No','Yes','NA'),labels=c(1,2,2))"
Please suggest the right code.

Question 2
Can we calculate the decision matrix (sensitivity and specificity) in linear regression, just as in logistic regression?

Please guide.

enixam · August 3, 2020, 1:57am

Question 1

library(forcats)
f <- factor(c("Yes", "No", NA))
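# note: in forcats >= 1.0.0, fct_explicit_na() is deprecated in favour of
# fct_na_value_to_level(f, level = "unknown")
f <- fct_explicit_na(f, "unknown")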
f
#> [1] Yes     No      unknown
#> Levels: No Yes unknown

# stays a factor; only the level labels change
fct_recode(f, 
           "1" = "No",
           "2" = "Yes",
           "2" = "unknown")
#> [1] 2 1 2
#> Levels: 1 2

# completely numerical
dplyr::case_when(
          f == "No" ~ 1,
          f == "Yes" ~ 2,
          f == "unknown" ~ 2)
#> [1] 2 1 2

Created on 2020-08-03 by the reprex package (v0.3.0)
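The vector printed in the question appears to contain both the string "NA" and a real missing value NA, and a factor level 'NA' only matches the string. A minimal base R sketch that handles both cases, assuming missing values should be coded as 2 as in the original attempt (the vector `x` is a made-up sample, not the real data):

```r
# Hypothetical sample mimicking dataset$Ever_Married:
# it contains the literal string "NA" as well as a real NA
x <- c("No", "Yes", "NA", NA, "Yes", "No")

# match() returns the position of each value in c("No", "Yes"),
# i.e. 1 for "No", 2 for "Yes", and NA for anything else
married <- match(x, c("No", "Yes"))

# Assumption taken from the question: everything unmatched
# (the string "NA" and real NA) is coded as 2
married[is.na(married)] <- 2L

married
```

Be aware that this also codes any unexpected value (a typo such as "yes", for instance) as 2, so it is worth checking `table(x, useNA = "ifany")` first.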

Question 2:
I don't think sensitivity and specificity are defined for linear regression. Goodness of fit is usually judged by RMSE or by various types of residual plots.
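For illustration, here is a small sketch of computing the RMSE of a linear model in base R. The built-in `mtcars` data and the `mpg ~ wt` model are only stand-ins chosen for the example, not something from the question:

```r
# Fit a simple linear regression on a built-in dataset (illustrative choice)
fit <- lm(mpg ~ wt, data = mtcars)

# RMSE: square root of the mean squared residual on the training data
rmse <- sqrt(mean(residuals(fit)^2))
rmse
```

Calling `plot(fit)` afterwards draws the standard residual diagnostic plots for the same model.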

nirgrahamuk · August 3, 2020, 10:18am

Sensitivity and specificity are general enough concepts that they can be applied widely to binary classification, regardless of the modelling method used to produce the classification. Depending on context, they may be more or less relevant than other possible statistics. Here is an example:

# The model's purpose is to predict whether a flower is setosa; we make this harder by using only Petal.Length.
library(tidyverse)
(myiris <- iris %>%
  mutate(is_setosa = Species == "setosa") %>%
  select(-Species)
)

(lm1 <- lm(is_setosa ~  Petal.Length, data = myiris))

myiris$pred <- predict(lm1,newdata = myiris)

hist(myiris$pred)
# pick some thresholds
mythresh <- 0:5/10

# To keep the function short, both vectors are turned into factors with levels FALSE/TRUE,
# so the confusion matrix is always 2x2 (even for a threshold that pushes every value into a single class).
# Also, if someone could double-check my mapping from the matrix to the TN/TP/FN/FP definitions, that would help :)
analyse_at_thresh <- function(thr){
  pred_guess <- factor(myiris$pred >= thr, levels = c(FALSE, TRUE))
  actual <- factor(myiris$is_setosa, levels = c(FALSE, TRUE))
  conf <- table(actual, pred_guess)
  TN <- conf[1,1]
  FP <- conf[1,2]
  FN <- conf[2,1]
  TP <- conf[2,2]
  Sensitivity <- TP / (TP+FN)
  Specificity <- TN / (TN+FP)
  #https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Confusion_matrix
  
  list(
    threshold = thr,
    conf_matrix = conf,
    Sensitivity=Sensitivity,
    Specificity=Specificity
    )
}
purrr::map(mythresh,
  ~analyse_at_thresh(.)
)
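The mapping from table cells to TN/FP/FN/TP used in the function can be checked on a tiny hand-made example where the counts are known in advance. This is only a sketch; the vectors `actual` and `pred` are invented for the check (by construction: 2 true positives, 1 false negative, 1 false positive, 3 true negatives):

```r
# Hand-made labels with known counts: TP = 2, FN = 1, FP = 1, TN = 3
actual <- c(TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE)
pred   <- c(TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE)

# Rows are the actual class, columns the predicted class, both ordered FALSE, TRUE
conf <- table(
  actual = factor(actual, levels = c(FALSE, TRUE)),
  pred   = factor(pred,   levels = c(FALSE, TRUE))
)

TN <- conf[1, 1]  # actual FALSE, predicted FALSE
FP <- conf[1, 2]  # actual FALSE, predicted TRUE
FN <- conf[2, 1]  # actual TRUE,  predicted FALSE
TP <- conf[2, 2]  # actual TRUE,  predicted TRUE

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

conf
c(sensitivity = sensitivity, specificity = specificity)
```

With the actual class in the rows and the prediction in the columns, the indices agree with the known counts, so the mapping in `analyse_at_thresh` looks right as long as `table()` is called with the actual values first.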
