So, now that I'm at a computer I had a chance to look through this again.
Pseudocode
What I think you want to do is this:
Starting from a data.frame object: df.
Choose a subset of the variables of df: cols
For each col in cols
mutate col to be a new indicator variable coded as 1 or 0
augment df with this new variable
Extract from the augmented df object a particular column (var) as well
as the newly created indicator variables, but only those rows for which
all the indicators have been evaluated to 1.
If that is not correct, let me know and please attempt to clarify for us.
New Data
I've made some new sample data (yours was much to complicated to be able to quickly ingest). I also included some NA
data which will force us to think about how we might handle that later.
set.seed(123)
n <- 6
df <- data.frame(id = letters[seq(n)],
x1 = sample(c(NA, "n", "y"), n, TRUE, c(0.3, 0.2, 0.5)),
x2 = runif(n))
df
#> id x1 x2
#> 1 a y 0.5281055
#> 2 b <NA> 0.8924190
#> 3 c y 0.5514350
#> 4 d n 0.4566147
#> 5 e n 0.9568333
#> 6 f y 0.4533342
Solution 1
Mutate
df[["m1"]] <- ifelse(df[["x1"]] == "y", 1, 0)
df[["m2"]] <- ifelse(df[["x2"]] >= 0.4, 1, 0)
df
#> id x1 x2 m1 m2
#> 1 a y 0.5281055 1 1
#> 2 b <NA> 0.8924190 NA 1
#> 3 c y 0.5514350 1 1
#> 4 d n 0.4566147 0 1
#> 5 e n 0.9568333 0 1
#> 6 f y 0.4533342 1 1
Function
Here (and for the rest of this reply, I am choosing to pass the names of the grouping variables instead of a list of their values. The main reason for this is we are already passing these in through the augmented data.frame, there's no need to increase the volume of data being passed between functions.
Also, I renamed grouping_var
to gv
because I am a bad and lazy typist and it's a fairly long variable name and it made some of my lines of code longer than I would prefer and I am particular like that.
f <- function(dat, var, gv, total_var) {
# I am still not sure what total_var is supposed to do. Is it a threshold value
# that is if we had, say 7 indicators and total_var = 5, would we keep the
# rows which met the conditions of 5 or more indicators? Let me know and
# I can adjust the functions accordingly.
dat[apply(dat[gv], 1, function(x) all(x == 1)), c(var, gv)]
}
Results
f(df, "id", c("m1", "m2"))
#> id m1 m2
#> 1 a 1 1
#> NA <NA> NA NA
#> 3 c 1 1
#> 6 f 1 1
Solution 2 - Coding to TRUE/FALSE Instead of 1/0
Our function can be slightly simplified if we mutate to TRUE and FALSE (the logical equivalents of 1 and 0).
Mutate
df[["m1"]] <- df[["x1"]] == "y"
df[["m2"]] <- df[["x2"]] >= 0.4
Function
f <- function(dat, var, gv, total_var) {
dat[apply(dat[gv], 1, all), c(var, gv)]
}
Results
f(df, "id", c("m1", "m2"))
#> id m1 m2
#> 1 a TRUE TRUE
#> NA <NA> NA NA
#> 3 c TRUE TRUE
#> 6 f TRUE TRUE
Solution 3 - Improvements
Alternately, we can devise a new function which will mutate, subset, and select for you. We use a default condition "== 1
, which will match any mutated dummy vars you create as 1
/0
or TRUE
/FALSE
Function
f <- function(dat, vars, gv, conditions = "== 1", na.rm = FALSE) {
exprs <- mapply(function(a, b) {
parse(text = paste(a, b))
},
gv, conditions,
USE.NAMES = FALSE)
idx <- Reduce(`&`, lapply(exprs, eval, dat)) # 2
idx <- ifelse(is.na(idx), !na.rm, idx) # 3
dat[idx, c(vars, gv)] # 4
}
Explanation of Function Details
I'm quite happy with this function, so I am going to take a moment to explain it. The comments #1, #2, ... etc correspond to the points below,
- We are pasting the names of the variables together with a condition they must meet and turning the resulting character vector into an expression. for instance, one expression might be
m1 == 1
.
- For each
expr
in exprs
we are evaluating it in a data.frame environment. This is somewhat similar to attaching a data.frame to the global environment or using the with()
function if you are familiar with those ideas. Since we're already using eval
though to evaluate the expression, it makes more sense to do it in a data.frame environment. Then we reduce the list output to a single index vector.
- If we want to remove
NA
's at this point we do so here.
- Finally, we subset our data.frame according to the conditions we set.
Results
f(df, "id", c("m1", "m2"))
#> id m1 m2
#> 1 a TRUE TRUE
#> 2 b NA TRUE
#> 3 c TRUE TRUE
#> 6 f TRUE TRUE
f(df, "id", c("m1", "m2"), na.rm = TRUE)
#> id m1 m2
#> 1 a TRUE TRUE
#> 3 c TRUE TRUE
#> 6 f TRUE TRUE
f(df, "id", c("x1", "x2"), c("== \"y\"", ">= 0.4"))
#> id x1 x2
#> 1 a y 0.5281055
#> 2 b <NA> 0.8924190
#> 3 c y 0.5514350
#> 6 f y 0.4533342
f(df, "id", c("x1", "x2"), c("== \"y\"", ">= 0.4"), TRUE)
#> id x1 x2
#> 1 a y 0.5281055
#> 3 c y 0.5514350
#> 6 f y 0.4533342
Created on 2020-09-04 by the reprex package (v0.3.0)
Solution 4 - A Guess About total_var
Here I implement my best guess about what you wanted to use the total_var
argument for.
New Data
set.seed(2357)
n <- 10
df <- data.frame(id = letters[seq(n)],
x1 = sample(c(NA, "n", "y"), n, TRUE, c(0.3, 0.2, 0.5)),
x2 = round(runif(n), 2),
x3 = round(rnorm(n), 2))
df
#> id x1 x2 x3
#> 1 a y 0.42 -0.30
#> 2 b <NA> 0.64 -0.01
#> 3 c y 0.34 1.76
#> 4 d y 0.60 -0.55
#> 5 e y 0.28 -1.26
#> 6 f <NA> 0.56 1.19
#> 7 g y 0.98 -0.21
#> 8 h y 0.31 -1.76
#> 9 i <NA> 0.91 -1.33
#> 10 j n 0.51 0.42
Mutate
df[["m1"]] <- ifelse(df[["x1"]] == "y", 1, 0)
df[["m2"]] <- ifelse(df[["x2"]] >= 0.3, 1, 0)
df[["m3"]] <- ifelse(df[["x3"]] < 0, 1, 0)
df
#> id x1 x2 x3 m1 m2 m3
#> 1 a y 0.42 -0.30 1 1 1
#> 2 b <NA> 0.64 -0.01 NA 1 1
#> 3 c y 0.34 1.76 1 1 0
#> 4 d y 0.60 -0.55 1 1 1
#> 5 e y 0.28 -1.26 1 0 1
#> 6 f <NA> 0.56 1.19 NA 1 0
#> 7 g y 0.98 -0.21 1 1 1
#> 8 h y 0.31 -1.76 1 1 1
#> 9 i <NA> 0.91 -1.33 NA 1 1
#> 10 j n 0.51 0.42 0 1 0
New Function
f <- function(dat, vars, gv, conditions = "== 1", min_tot = 1, na.rm = FALSE) {
exprs <- mapply(function(a, b) {
parse(text = paste(a, b))
},
gv, conditions,
USE.NAMES = FALSE)
idx <- rowSums(vapply(exprs,
eval,
logical(nrow(dat)), dat),
na.rm = TRUE) >= min_tot
dat[complete.cases(dat) + !na.rm & idx, c(vars, gv)]
}
Results
f(df, "id", c("m1", "m2", "m3"))
#> id m1 m2 m3
#> 1 a 1 1 1
#> 2 b NA 1 1
#> 3 c 1 1 0
#> 4 d 1 1 1
#> 5 e 1 0 1
#> 6 f NA 1 0
#> 7 g 1 1 1
#> 8 h 1 1 1
#> 9 i NA 1 1
#> 10 j 0 1 0
f(df, "id", c("m1", "m2", "m3"), min_tot = 2)
#> id m1 m2 m3
#> 1 a 1 1 1
#> 2 b NA 1 1
#> 3 c 1 1 0
#> 4 d 1 1 1
#> 5 e 1 0 1
#> 7 g 1 1 1
#> 8 h 1 1 1
#> 9 i NA 1 1
f(df, "id", c("m1", "m2", "m3"), min_tot = 2, na.rm = TRUE)
#> id m1 m2 m3
#> 1 a 1 1 1
#> 3 c 1 1 0
#> 4 d 1 1 1
#> 5 e 1 0 1
#> 7 g 1 1 1
#> 8 h 1 1 1
f(df, "id", c("m1", "m2", "m3"), min_tot = 3)
#> id m1 m2 m3
#> 1 a 1 1 1
#> 4 d 1 1 1
#> 7 g 1 1 1
#> 8 h 1 1 1
Created on 2020-09-04 by the reprex package (v0.3.0)