dynamically select variables for formula based on binary variables

I'm sure this is pretty easy to do, but I can't work it out at the moment.

I want to build a regression formula for each row, adding a variable as a term whenever its binary (dummy) column is 1 in that row. Something like this, but for many variables. What is a good way of doing this? Ideally this would involve a purrr function.

library(tidyverse)

set.seed(2)
df <- tibble(a = sample(c(1,0), 6, replace = TRUE),
             b = sample(c(1,0), 6, replace = TRUE))


df_desired <- df %>% 
  mutate(vars = if_else(a == 1, "a", ""),
         vars = if_else(b == 1, paste0(vars, " + b"), vars),  # keep earlier terms when b is 0
         vars = str_remove(vars, "^ \\+ "),
         vars = paste("y ~", vars)) 

# A tibble: 6 × 3
      a     b vars       
  <dbl> <dbl> <chr>      
1     1     1 "y ~ a + b"
2     1     1 "y ~ a + b"
3     0     1 "y ~ b"    
4     0     0 "y ~ "     # maybe ignore this row
5     0     1 "y ~ b"    
6     0     1 "y ~ b" 

This is generalised in that it doesn't mention 'a' or 'b' etc. It just assumes that the dataset it's built from is a data.frame whose column names are the dummy variables, and that they are numeric 0/1 values like in your example. I also added a column c to make it more interesting.

library(tidyverse)
library(glue)
set.seed(2)
df <- tibble(a = sample(c(1,0), 6, replace = TRUE),
             b = sample(c(1,0), 6, replace = TRUE),
             c = sample(c(1,0), 6, replace = TRUE))


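# step by step, so you can see how it works
(df2 <- df |>
  rowwise() |>
  mutate(
    # replace each 1 with its column name and each 0 with ""
    across(everything(), ~ ifelse(.x == 1, cur_column(), "")),
    # collect the row's values into one character vector
    vl_1 = list(c_across(everything())),
    # drop the empty strings
    vl_2 = list(Filter(function(x) nchar(x) > 0, vl_1)),
    vars = paste0("y ~ ", paste0(vl_2, collapse = " + "))
  ))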

# the above again but condensed
(df2 <- df |>
  rowwise() |>
  mutate(
    across(everything(), ~ ifelse(.x == 1, cur_column(), "")),
    vars = paste0("y ~ ", paste0(
      Filter(function(x) nchar(x) > 0, c_across(everything())),
      collapse = " + "
    ))
  ))
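
Since the question asked for a purrr function, here is a minimal sketch of the same idea using purrr::pmap_chr(). The helper name build_formula and the small example data frame are made up for illustration; it assumes every column of the data frame is a 0/1 dummy variable.

```r
library(purrr)

# Hypothetical example data: every column is a 0/1 dummy variable
df <- data.frame(a = c(1, 1, 0, 0),
                 b = c(1, 0, 1, 0),
                 c = c(0, 1, 1, 0))

# Hypothetical helper: receives one row as named arguments (a = , b = , c = )
# and keeps the names of the arguments that equal 1
build_formula <- function(...) {
  row <- c(...)
  paste("y ~", paste(names(row)[row == 1], collapse = " + "))
}

# pmap_chr() calls build_formula() once per row and returns a character vector
df$vars <- pmap_chr(df, build_formula)
df
```

As in the original example, a row with no 1s gives the incomplete string "y ~ ", so such rows would need to be filtered out or handled before fitting anything.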

Hi, I use a similar concept with the glm function in logistic regression. Here is sample code showing how the formula is constructed and how the dependent and independent variables are passed to it.

#create data frame
mydata <- data.frame(pp=c(.1, .2, .3, .4, .5, .6, .7, .8, .9, 1, 1, 1.1, 1.3, 1.5, 1.7),
qq=c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2, 2.1, 2.3, 2.5, 2.7),
rr=c(2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3, 3, 2.1, 2.3, 3.5, 3.7),
ss=c(3.1,3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 4, 2.1, 2.3, 4.5, 4.7),
Class=c("benign", "malignant", "benign", "malignant", "benign", "benign", "benign", "benign", "benign", "malignant", "malignant", "malignant", "malignant", "malignant", "malignant"))

#Name the dependent variable and the level that should be coded as 1
mdependvar <- 'Class'
mbinaryOne <- 'malignant'

#convert dependent variable "Class" to a 0/1 factor
mydata[[mdependvar]] <- factor(ifelse(mydata[[mdependvar]] == mbinaryOne, 1, 0), levels = c(0, 1))

#Declare the list of Independent variables
mindependvar <- c('pp','qq','rr','ss')

#Construct the formula: Class ~ pp + qq + rr + ss
#(data = mydata is passed to glm() below, so no "mydata$" prefix is needed)
f1 <- as.formula(paste(mdependvar, "~", paste(mindependvar, collapse = " + ")))

#Apply formula
glm_model <- glm(f1, family = "binomial", data=mydata)

#Summary output
summary(glm_model)

If you need more information on this formula and how to select and pass dependent and independent variables, you may refer to my Logistic Regression Multi-model video on my YouTube channel "Happy Learning-GP".
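
As a possible alternative to pasting strings together, base R's stats::reformulate() builds a formula directly from character vectors. This is only a sketch; the variable names predictors and response are illustrative and simply mirror the columns used in the example above.

```r
# Illustrative names mirroring the columns in the example above
predictors <- c("pp", "qq", "rr", "ss")
response <- "Class"

# reformulate() joins the term labels with "+" and attaches the response
f <- reformulate(predictors, response = response)
f
```

The result is a formula object that can be passed to glm() together with data = mydata. Note that reformulate() needs at least one term label, so a case with no selected predictors would have to be handled separately.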


Great. Thanks for your help @nirgrahamuk and @ganapap1
