Selecting a list of columns from a data frame to create a subset

Hi, I am trying to create a function that takes a list of modified (mutated) columns as one of its inputs.
For example, the sample data is:

dataa<-data.frame(
  aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,2,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,3,2,7,7,2,3,7,2,4)
)

I have added two new mutated columns:

dataa$col3 <- ifelse(dataa$aa == "y",1,0)
dataa$col4 <- ifelse(dataa$col2 == 7,1,0)

Now I am writing a function to which I can pass a list of the mutated columns. Everything else is working fine.

dat1 = dataa
var1 = "col1"  #(Quantitative variable)
grouping_var = list(dataa$col3,dataa$col4)
total_var= TRUE

#fun_1 <- function(dat1,var1,grouping_var,total_var){
  total_col <- ifelse(total_var== TRUE,1,0)
  var1 <- rlang::parse_expr(var1)
  var2 <- dat1[unlist(grouping_var)] # i have tried these methods 
  var2 <- data.frame(sapply(grouping_var,c)) # i have tried this too
  
  dat1 <- dat1 %>% select(!!var1,!!var2) 
# The objective is to create a data frame here by combining the quantitative variable and the categorical variables; more calculations will follow.

I hope everything is explained clearly. Please let me know if anything else is required.

Can you please provide at least one example of a call to your function as well as the output you are hoping to see from it because I'm not quite following what you want.

dat1 <- dat1 %>% select(!!var1,!!var2)
With this line I am expecting a subset data frame with the columns c(col1,col3,col4).

list(dataa$col3,dataa$col4)

This passes a list of the vectors' contents rather than something that identifies which variables within dataa should be used.
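
A minimal sketch of the alternative, assuming the goal really is the subset c(col1, col3, col4): pass the column names as a character vector and index the data frame with them. The function name select_cols and the toy data below are hypothetical, used only for illustration.

```r
# Hypothetical toy data standing in for dataa
toy <- data.frame(aa   = c("q", "y", "d"),
                  col1 = c(1, 2, 3),
                  col2 = c(2, 7, 7))
toy$col3 <- ifelse(toy$aa == "y", 1, 0)
toy$col4 <- ifelse(toy$col2 == 7, 1, 0)

# select_cols is an illustrative name, not part of the original post.
# var1 and grouping_var are both character vectors of column names.
select_cols <- function(dat1, var1, grouping_var) {
  dat1[, c(var1, grouping_var), drop = FALSE]
}

res <- select_cols(toy, "col1", c("col3", "col4"))
names(res)
```

With dplyr loaded, select(dat1, all_of(c(var1, grouping_var))) should give the same columns.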

Again, please provide the output. It's unclear to me how you would like to select the subset.

I also don't understand the purpose of your proposed total_col variable.

So, now that I'm at a computer I had a chance to look through this again.

Pseudocode

What I think you want to do is this:

Starting from a data.frame object: df.
Choose a subset of the variables of df: cols
For each col in cols
     mutate col to be a new indicator variable coded as 1 or 0
     augment df with this new variable
Extract from the augmented df object a particular column (var) as well
as the newly created indicator variables, but only those rows for which
all the indicators have been evaluated to 1.

If that is not correct, let me know and please attempt to clarify for us.

New Data

I've made some new sample data (yours was much too complicated to ingest quickly). I also included some NA values, which will force us to think about how we might handle them later.

set.seed(123)
n <- 6
df <- data.frame(id = letters[seq(n)],
                 x1 = sample(c(NA, "n", "y"), n, TRUE, c(0.3, 0.2, 0.5)),
                 x2 = runif(n))
df
#>   id   x1        x2
#> 1  a    y 0.5281055
#> 2  b <NA> 0.8924190
#> 3  c    y 0.5514350
#> 4  d    n 0.4566147
#> 5  e    n 0.9568333
#> 6  f    y 0.4533342

Solution 1

Mutate

df[["m1"]] <- ifelse(df[["x1"]] == "y", 1, 0)
df[["m2"]] <- ifelse(df[["x2"]] >= 0.4, 1, 0)
df
#>   id   x1        x2 m1 m2
#> 1  a    y 0.5281055  1  1
#> 2  b <NA> 0.8924190 NA  1
#> 3  c    y 0.5514350  1  1
#> 4  d    n 0.4566147  0  1
#> 5  e    n 0.9568333  0  1
#> 6  f    y 0.4533342  1  1

Function

Here (and for the rest of this reply), I am choosing to pass the names of the grouping variables instead of a list of their values. The main reason is that the values are already being passed in through the augmented data.frame, so there is no need to increase the volume of data being passed between functions.
Also, I renamed grouping_var to gv because I am a bad and lazy typist: it's a fairly long variable name, it made some of my lines of code longer than I would prefer, and I am particular like that.

f <- function(dat, var, gv, total_var) {
  # I am still not sure what total_var is supposed to do. Is it a threshold value
  # that is if we had, say 7 indicators and total_var = 5, would we keep the
  # rows which met the conditions of 5 or more indicators? Let me know and
  # I can adjust the functions accordingly.
  dat[apply(dat[gv], 1, function(x) all(x == 1)), c(var, gv)]
}

Results

f(df, "id", c("m1", "m2"))
#>      id m1 m2
#> 1     a  1  1
#> NA <NA> NA NA
#> 3     c  1  1
#> 6     f  1  1
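
The NA row in this output comes from how R treats missing values in a logical index, not from the data itself. Here is a small sketch of that behaviour on hypothetical toy data, with which() as one possible way to drop such rows:

```r
# Hypothetical toy data: row "b" has a missing indicator
toy <- data.frame(id = c("a", "b", "c"),
                  m1 = c(1, NA, 0),
                  m2 = c(1, 1, 1))

# all() returns NA when no element is FALSE but at least one is NA
keep <- apply(toy[c("m1", "m2")], 1, function(x) all(x == 1))
keep

# An NA in a logical index yields a row filled with NAs
with_na <- toy[keep, ]

# which() returns only the positions that are TRUE, so the NA row is dropped
without_na <- toy[which(keep), ]
```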

Solution 2 - Coding to TRUE/FALSE Instead of 1/0

Our function can be slightly simplified if we mutate to TRUE and FALSE (the logical equivalents of 1 and 0).

Mutate

df[["m1"]] <- df[["x1"]] == "y"
df[["m2"]] <- df[["x2"]] >= 0.4

Function

f <- function(dat, var, gv, total_var) {
  dat[apply(dat[gv], 1, all), c(var, gv)]
}

Results

f(df, "id", c("m1", "m2"))
#>      id   m1   m2
#> 1     a TRUE TRUE
#> NA <NA>   NA   NA
#> 3     c TRUE TRUE
#> 6     f TRUE TRUE

Solution 3 - Improvements

Alternatively, we can devise a new function which will mutate, subset, and select for you. We use a default condition "== 1", which will match any mutated dummy variables you create as 1/0 or TRUE/FALSE.

Function

f <- function(dat, vars, gv, conditions = "== 1", na.rm = FALSE) {
  exprs <- mapply(function(a, b) { # 1
                    parse(text = paste(a, b))
                  },
                  gv, conditions,
                  USE.NAMES = FALSE)
  idx <- Reduce(`&`, lapply(exprs, eval, dat)) # 2
  idx <- ifelse(is.na(idx), !na.rm, idx) # 3
  dat[idx, c(vars, gv)] # 4
}

Explanation of Function Details

I'm quite happy with this function, so I am going to take a moment to explain it. The comments # 1, # 2, etc. correspond to the points below.

  1. We are pasting the names of the variables together with a condition they must meet and turning the resulting character vector into an expression. For instance, one expression might be m1 == 1.
  2. For each expr in exprs we are evaluating it in a data.frame environment. This is somewhat similar to attaching a data.frame to the global environment or using the with() function, if you are familiar with those ideas. Since we're already using eval() to evaluate the expression, though, it makes more sense to do it in a data.frame environment. Then we reduce the list output to a single index vector.
  3. Rows whose index is NA are handled here: they are dropped if na.rm = TRUE and kept otherwise.
  4. Finally, we subset our data.frame according to the conditions we set.
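
To make steps 1 to 3 concrete, here is a small walk-through on hypothetical toy data (the names toy, cond, r1 and r2 are only for illustration). It shows the expression built by parse(), its evaluation inside the data frame, and the NA handling:

```r
toy <- data.frame(m1 = c(TRUE, NA, FALSE),
                  m2 = c(TRUE, TRUE, TRUE))

# Step 1: build an expression from a column name and a condition
cond <- parse(text = paste("m1", "== 1"))
cond

# Step 2: evaluate with the data frame's columns in scope, then
# combine the per-column results with `&`
r1  <- eval(cond, toy)                               # TRUE NA FALSE
r2  <- eval(parse(text = paste("m2", "== 1")), toy)  # TRUE TRUE TRUE
idx <- Reduce(`&`, list(r1, r2))                     # TRUE NA FALSE

# Step 3: NA becomes TRUE (row kept) when na.rm is FALSE,
# and FALSE (row dropped) when na.rm is TRUE
na.rm <- FALSE
idx_keep <- ifelse(is.na(idx), !na.rm, idx)          # TRUE TRUE FALSE
```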

Results

f(df, "id", c("m1", "m2"))
#>   id   m1   m2
#> 1  a TRUE TRUE
#> 2  b   NA TRUE
#> 3  c TRUE TRUE
#> 6  f TRUE TRUE
f(df, "id", c("m1", "m2"), na.rm = TRUE)
#>   id   m1   m2
#> 1  a TRUE TRUE
#> 3  c TRUE TRUE
#> 6  f TRUE TRUE
f(df, "id", c("x1", "x2"), c("== \"y\"", ">= 0.4"))
#>   id   x1        x2
#> 1  a    y 0.5281055
#> 2  b <NA> 0.8924190
#> 3  c    y 0.5514350
#> 6  f    y 0.4533342
f(df, "id", c("x1", "x2"), c("== \"y\"", ">= 0.4"), TRUE)
#>   id x1        x2
#> 1  a  y 0.5281055
#> 3  c  y 0.5514350
#> 6  f  y 0.4533342

Created on 2020-09-04 by the reprex package (v0.3.0)

Solution 4 - A Guess About total_var

Here I implement my best guess about what you wanted to use the total_var argument for.

New Data

set.seed(2357)
n <- 10
df <- data.frame(id = letters[seq(n)],
                 x1 = sample(c(NA, "n", "y"), n, TRUE, c(0.3, 0.2, 0.5)),
                 x2 = round(runif(n), 2),
                 x3 = round(rnorm(n), 2))
df
#>    id   x1   x2    x3
#> 1   a    y 0.42 -0.30
#> 2   b <NA> 0.64 -0.01
#> 3   c    y 0.34  1.76
#> 4   d    y 0.60 -0.55
#> 5   e    y 0.28 -1.26
#> 6   f <NA> 0.56  1.19
#> 7   g    y 0.98 -0.21
#> 8   h    y 0.31 -1.76
#> 9   i <NA> 0.91 -1.33
#> 10  j    n 0.51  0.42

Mutate

df[["m1"]] <- ifelse(df[["x1"]] == "y", 1, 0)
df[["m2"]] <- ifelse(df[["x2"]] >= 0.3, 1, 0)
df[["m3"]] <- ifelse(df[["x3"]] < 0, 1, 0)
df
#>    id   x1   x2    x3 m1 m2 m3
#> 1   a    y 0.42 -0.30  1  1  1
#> 2   b <NA> 0.64 -0.01 NA  1  1
#> 3   c    y 0.34  1.76  1  1  0
#> 4   d    y 0.60 -0.55  1  1  1
#> 5   e    y 0.28 -1.26  1  0  1
#> 6   f <NA> 0.56  1.19 NA  1  0
#> 7   g    y 0.98 -0.21  1  1  1
#> 8   h    y 0.31 -1.76  1  1  1
#> 9   i <NA> 0.91 -1.33 NA  1  1
#> 10  j    n 0.51  0.42  0  1  0

New Function

f <- function(dat, vars, gv, conditions = "== 1", min_tot = 1, na.rm = FALSE) {
  exprs <- mapply(function(a, b) {
                    parse(text = paste(a, b))
                  },
                  gv, conditions,
                  USE.NAMES = FALSE)
  idx <- rowSums(vapply(exprs,
                        eval,
                        logical(nrow(dat)), dat),
                 na.rm = TRUE) >= min_tot
  # Keep every row when na.rm is FALSE; otherwise keep only complete cases
  dat[(complete.cases(dat) | !na.rm) & idx, c(vars, gv)]
}
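
The main change from Solution 3 is that rowSums() counts how many conditions each row satisfies, and that count is compared with min_tot. A small sketch of the counting step on a hypothetical logical matrix (hits and n_met are illustrative names):

```r
# One row per observation, one column per condition
hits <- cbind(m1 = c(TRUE, NA, FALSE),
              m2 = c(TRUE, TRUE, TRUE),
              m3 = c(TRUE, TRUE, FALSE))

# TRUE counts as 1 and FALSE as 0; na.rm = TRUE treats NA as "not met"
n_met <- rowSums(hits, na.rm = TRUE)   # 3 2 1

min_tot <- 2
n_met >= min_tot                       # TRUE TRUE FALSE
```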

Results

f(df, "id", c("m1", "m2", "m3"))
#>    id m1 m2 m3
#> 1   a  1  1  1
#> 2   b NA  1  1
#> 3   c  1  1  0
#> 4   d  1  1  1
#> 5   e  1  0  1
#> 6   f NA  1  0
#> 7   g  1  1  1
#> 8   h  1  1  1
#> 9   i NA  1  1
#> 10  j  0  1  0
f(df, "id", c("m1", "m2", "m3"), min_tot = 2)
#>   id m1 m2 m3
#> 1  a  1  1  1
#> 2  b NA  1  1
#> 3  c  1  1  0
#> 4  d  1  1  1
#> 5  e  1  0  1
#> 7  g  1  1  1
#> 8  h  1  1  1
#> 9  i NA  1  1
f(df, "id", c("m1", "m2", "m3"), min_tot = 2, na.rm = TRUE)
#>   id m1 m2 m3
#> 1  a  1  1  1
#> 3  c  1  1  0
#> 4  d  1  1  1
#> 5  e  1  0  1
#> 7  g  1  1  1
#> 8  h  1  1  1
f(df, "id", c("m1", "m2", "m3"), min_tot = 3)
#>   id m1 m2 m3
#> 1  a  1  1  1
#> 4  d  1  1  1
#> 7  g  1  1  1
#> 8  h  1  1  1

Created on 2020-09-04 by the reprex package (v0.3.0)

Thanks, you have given various observations.
I have to modify the data to create cuts and then apply a function, like below:

# Modifying data to create cuts
dataa$col3 <- ifelse(dataa$aa == "y",1,0)
dataa$col4 <- ifelse(dataa$col2 == 7,1,0)

# Applying function like this
fun_1(
dat1 = dataa,
var1 = "col1",
grouping_var = list(dataa$col4,dataa$col4),
total_var= TRUE)

#now my function is like this
#tab_std_cross <- function(dat1,var1,grouping_var,total_var){
  total_col <- ifelse(total_var== TRUE,1,0)
  var1 <- rlang::parse_expr(var1)
  var2 <- dat1[unlist(grouping_var)] # i am trying to select modified columns with original names.
  var2 <- data.frame(sapply(grouping_var,c)) # i have also tried like this
  
  dat1 <- dat1 %>% select(!!var1,!!var2)
# after this line I am expecting a subset with the var1 and grouping_var columns.
  var_lab(dat1[[1]]) <- ""
  var_lab(dat1[[2]]) <- ""
  tab1 <- expss::cro_cpct(total(),dat1[[1]],dat1[[2]])
  tab1 <- as.data.frame(tab1)
#}

I hope this now makes everything clear.
