Creating new variables if original variables end with specific number

Hi,
I have this simple dummy df:

source <- data.frame(stringsAsFactors=FALSE,
                     þÿ.URN. = c("aaa", "bbb", "ccc", "ddd"),
                     Rec_2_43 = c(1, NA, 3, NA),
                     Staff_43 = c(NA, 1, NA, 1),
                     Rec_1_35 = c(2, NA, 1, NA),
                     Timeliness_2 = c(NA, 1, NA, 1),
                     Staff_2 = c(1, NA, 2, NA)
)

source

Now:

  1. I would like to create a code which creates new variables with names based on whatever is after the last "_".
  2. In each of these new variables a coding should create 1s if source variables contain values >=1

So in this small file variables "43", "35" and "2" should be created. Nevertheless, I would like to use a general code for all other variables (ending with _x) which may appear in my df.
Then:
if any value in variables ending with _43 is >=1, Variable 43 should be 1
if any value in variables ending with _35 is >=1, Variable 35 should be 1
if any value in variables ending with _2 is >=1, Variable 2 should be 1
if any value in variables ending with _x is >=1, Variable x should be 1

as a result I should get something like that:


# A tibble: 4 x 9
  `þÿ"URN"` Rec_2_43 Staff_43 Rec_1_35 Timeliness_2 Staff_2  `43`  `35`   `2`
  <chr>        <dbl>    <dbl>    <dbl>        <dbl>   <dbl> <dbl> <dbl> <dbl>
1 aaa              1       NA        2           NA       1     1     1     1
2 bbb             NA        1       NA            1      NA     1    NA     1
3 ccc              3       NA        1           NA       2     1     1     1
4 ddd             NA        1       NA            1      NA     1    NA     1
> 

Is it easy to do?

source <- data.frame(stringsAsFactors=FALSE,
                    þÿ.URN. = c("aaa", "bbb", "ccc", "ddd","eee","fff"),
                    Rec_2_43 = c(1, NA, 3, NA,NA,0),
                    Staff_43 = c(NA, 1, NA, 1,NA,0),
                    Rec_1_35 = c(2, NA, 1, NA,1,0),
                    Timeliness_2 = c(NA, 1, NA, 1,1,0),
                    Staff_2 = c(1, NA, 2, NA,1,0)
)

library(tidyverse)


process_suffix <- function(df,suffix)
{
vars_to_include <- tidyselect::vars_select(names(df),ends_with(suffix))

quotedform <- paste0(" if_else(any(c(",
                     paste0(vars_to_include,">=1",collapse = ","),
                     "),na.rm = TRUE),1,0)")

mutator <- rlang::parse_expr(quotedform)
result <- df %>% rowwise() %>%
  mutate(!!suffix := !!mutator)
} %>% ungroup()

df1 <- process_suffix(source,"_43")
df2 <- process_suffix(df1,"_35")
df3 <- process_suffix(df2,"_2")
> df3
# A tibble: 6 x 9
  þÿ.URN. Rec_2_43 Staff_43 Rec_1_35 Timeliness_2 Staff_2 `_43` `_35`  `_2`
  <chr>      <dbl>    <dbl>    <dbl>        <dbl>   <dbl> <dbl> <dbl> <dbl>
1 aaa            1       NA        2           NA       1     1     1     1
2 bbb           NA        1       NA            1      NA     1     0     1
3 ccc            3       NA        1           NA       2     1     1     1
4 ddd           NA        1       NA            1      NA     1     0     1
5 eee           NA       NA        1            1       1     0     1     1
6 fff            0        0        0            0       0     0     0     0

This is basically the same as your previous question with an added request, try to make your questions independent or think in the scope of your questions in advance.

There are columns that end with the same numbers (like Rec_2_43 and Staff_43) and you can't have columns with the same name, What do you want to do in those cases? this exemplifies the problem

library(tidyverse)

source <- data.frame(stringsAsFactors=FALSE,
                     þÿ.URN. = c("aaa", "bbb", "ccc", "ddd"),
                     Rec_2_43 = c(1, NA, 3, NA),
                     Staff_43 = c(NA, 1, NA, 1),
                     Rec_1_35 = c(2, NA, 1, NA),
                     Timeliness_2 = c(NA, 1, NA, 1),
                     Staff_2 = c(1, NA, 2, NA)
)

source %>% 
    mutate_at(vars(matches("_\\d+$")),
              list(new = ~ if_else(.x>0,1,.x))) %>% 
    rename_at(vars(ends_with("new")), ~str_extract(., ".{2}\\d+(?=_new)"))
#>   þÿ.URN. Rec_2_43 Staff_43 Rec_1_35 Timeliness_2 Staff_2 2_43 f_43 1_35 s_2
#> 1     aaa        1       NA        2           NA       1    1   NA    1  NA
#> 2     bbb       NA        1       NA            1      NA   NA    1   NA   1
#> 3     ccc        3       NA        1           NA       2    1   NA    1  NA
#> 4     ddd       NA        1       NA            1      NA   NA    1   NA   1
#>   f_2
#> 1   1
#> 2  NA
#> 3   1
#> 4  NA

Created on 2020-02-10 by the reprex package (v0.3.0.9001)

I am sorry about creating two similar questions but the second idea came up later.
Answering your question about variables ending with the same suffix. This is the whole point based on "or" so if any value >0 appears in Rec_2_43 or Staff_43, _43 should be 1.
I can see that nirgrahamuk's solution is what I need.
The only problem is adding manually all stages (df1, df2, df3). In this dummy data I have only variables ending with _43, _35 and _2 but in my real data there are many more variables so I would like my code to pick them up automatically and do the trick with results like df3.
We are closer to the final solution I think.
Can we simply add something to andresrcs's code to have 2_43 and f_43 (s_2 and f_2 etc.) together as _43 (s_2 etc) like in nirgrahamuk's final (df3) solution? The satement is "or" so if value >0 appears in any variable ending with _43 (2_43 and f_43 in this case), new variable _43 should be 1...

sourcedf <- data.frame(stringsAsFactors=FALSE,
                     þÿ.URN. = c("aaa", "bbb", "ccc", "ddd","eee","fff"),
                     Rec_2_43 = c(1, NA, 3, NA,NA,0),
                     Staff_43 = c(NA, 1, NA, 1,NA,0),
                     Rec_1_35 = c(2, NA, 1, NA,1,0),
                     Timeliness_2 = c(NA, 1, NA, 1,1,0),
                     Staff_2 = c(1, NA, 2, NA,1,0)
)

library(tidyverse)

# edited so that ungroup is inside and not outside the function
process_suffix <- function(df,suffix)
{
  vars_to_include <- tidyselect::vars_select(names(df),ends_with(suffix))
  #edited to wrap vars to include in backticks for irregular var names
quotedform <- paste0(" if_else(any(c(",
                       paste0("`",vars_to_include,"`",">=1",collapse = ","),
                       "),na.rm = TRUE),1,0)")
  
  mutator <- rlang::parse_expr(quotedform)
  result <- df %>% rowwise() %>%
    mutate(!!suffix := !!mutator) %>% ungroup()
} 

# get all suffixes after skipping first column
ndf <- names(sourcedf[,-1])
suffix_list <- stringr::str_split(ndf,
                                  "_") %>% 
  map(~tail(.,1)) %>%   unique %>%  unlist

output_df <- sourcedf
for( i in suffix_list) {  output_df <-  process_suffix(output_df,paste0("_",i))}

result will be in output_df

Thank you, I have noticed that some variable names in my real files are in "full name format" such as Ease of Buying_Pos_51. Will this code fork for this type of variables?
When I apply the final part of the code:

output_df <- sourcedf
for( i in suffix_list) {  output_df <-  process_suffix(output_df,paste0("_",i))}

I have following error, which I think might be related to that:

  Error in parse(text = x) : <text>:1:50: unexpected symbol
1:  if_else(any(c(Facilities_50>=1,Staff_50>=1,Ease of
                                                     ^ 
7.
parse(text = x) 
6.
parse_exprs(x) 
5.
rlang::parse_expr(quotedform) 
4.
eval(lhs, parent, parent) 
3.
eval(lhs, parent, parent) 
2.
{
    vars_to_include <- tidyselect::vars_select(names(df), ends_with(suffix))
    quotedform <- paste0(" if_else(any(c(", paste0(vars_to_include, 
        ">=1", collapse = ","), "),na.rm = TRUE),1,0)") ... 
1.
process_suffix(output_df, paste0("_", i)) 

I am sorry for not mentioning that earlier but I wanted to make my dummy file as small as possible and forgot about variable names like "Ease of Buying_Pos_51"...

You could add backticks around varstoinclude in the paste0 before the >=1

I've tried this:

  quotedform <- paste0(" if_else(any(c(",
                       paste0(`vars_to_include`,">=1",collapse = ","),
                       "),na.rm = TRUE),1,0)")

but still the same problem :frowning:

  quotedform <- paste0(" if_else(any(c(",
                       paste0("`",vars_to_include,"`",">=1",collapse = ","),
                       "),na.rm = TRUE),1,0)")
1 Like

Absolutely amazing!!! Thank you :smiley:

I know the task is resolved (amazing!!!) but is is easy to do the same with prefix?
What should be modified to get the same but for Rec_2, Staff, Rec_1 and Timeliness? This would work almost the same way but prefixes would be taken into account (Staff in this example)...
I used this code:

library(tidyverse)

# edited so that ungroup is inside and not outside the function
process_prefix <- function(df,prefix)
{
  vars_to_include <- tidyselect::vars_select(names(df),starts_with(prefix))
  #edited to wrap vars to include in backticks for irregular var names
quotedform <- paste0(" if_else(any(c(",
                       paste0("`",vars_to_include,"`",">=1",collapse = ","),
                       "),na.rm = TRUE),1,0)")
  
  mutator <- rlang::parse_expr(quotedform)
  result <- df %>% rowwise() %>%
    mutate(!!prefix := !!mutator) %>% ungroup()
} 

# get all prefixes after skipping first column
ndf <- names(sourcedf[,-1])
prefix_list <- stringr::str_split(ndf,
                                  "_") %>% 
  map(~tail(.,1)) %>%   unique %>%  unlist

output_df <- sourcedf
for( i in prefix_list) {  output_df <-  process_prefix(output_df,paste0("_",i))}

but I have following error:

 Error in parse(text = x) : attempt to use zero-length variable name 

What am I doing wrong?

I think you should check the contents of prefix_list.
I'm assuming theres suffixes in there. would do to change from tail() to head()

Changing the code to this:

ndf <- names(sourcedf[,-1])
prefix_list <- stringr::str_split(ndf,
                                  "_") %>% 
  map(~head(.,1)) %>%   unique %>%  unlist

did not help :frowning:
still the same error:

 Error in parse(text = x) : attempt to use zero-length variable name 

I don't have any more ideas :thinking:

Can you dput() your ndf so I can look at its contents ?

Thank you for not giving up. Here you go:

> dput(ndf)
c("Rec_2_43", "Staff_43", "Rec_1_35", "Timeliness_2", "Staff_2"
)

In this case I need "Rec_2", "Staff", "Rec_1" and "Timeliness"...

I took your ndf

ndf <- c("Rec_2_43", "Staff_43", "Rec_1_35", "Timeliness_2", "Staff_2")

and ran the next line:

prefix_list <- stringr::str_split(ndf,"_") %>%  map(~head(.,1)) %>%   unique %>%  unlist

which when looked at shows

[1] "Rec"        "Staff"      "Timeliness"

as far as this goes its what I expected. but it seems you have different requirements than we originally identified, as you wish for the prefix to be the first two tokens seperated by underscores from any name.
before I try and provide a solution for this first alteration (lets call it the 'prefix detection step', would you like to try to solve this first step yourself ? I ask because I don't want to rob you of a learning experience

Hurray!
I fixed tat myself:

for( i in prefix_list) {  output_df <-  process_prefix(output_df,paste0(i))}

The only thing I could not fix is having "Rec_1" and "Rec_2" separately but this is too advanced for me as this requires changing this line:

prefix_list <- stringr::str_split(ndf,
                                  "_") %>% 

to split variables' names before the last "_" ...

library(tidyverse)
ndf <- c("Rec_2_43", "Staff_43", "Rec_1_35", "Timeliness_2", "Staff_2")


step1  <- stringr::str_split(ndf, "_")

step2  <- step1  %>% map(~ head(., 2))

step3 <- step2  %>% map(~ paste0(., collapse = "_"))

step4  <- step3 %>%  unique() %>% unlist()


allsteps <- stringr::str_split(ndf, "_") %>%
  map(~ head(., 2)) %>%
  map(~ paste0(., collapse = "_")) %>%
  unique() %>%  unlist()

> step1
[[1]]
[1] "Rec" "2"   "43" 

[[2]]
[1] "Staff" "43"   

[[3]]
[1] "Rec" "1"   "35" 

[[4]]
[1] "Timeliness" "2"         

[[5]]
[1] "Staff" "2"    

> step2
[[1]]
[1] "Rec" "2"  

[[2]]
[1] "Staff" "43"   

[[3]]
[1] "Rec" "1"  

[[4]]
[1] "Timeliness" "2"         

[[5]]
[1] "Staff" "2"    

> step3
[[1]]
[1] "Rec_2"

[[2]]
[1] "Staff_43"

[[3]]
[1] "Rec_1"

[[4]]
[1] "Timeliness_2"

[[5]]
[1] "Staff_2"

> step4
[1] "Rec_2"        "Staff_43"     "Rec_1"        "Timeliness_2" "Staff_2"     
> allsteps
[1] "Rec_2"        "Staff_43"     "Rec_1"        "Timeliness_2" "Staff_2"

Just realised if your input data is inconsistent and sometimes a prefix is the first 2 parts of 3 and others the first part of 2 there will need to be more complicated code to handle the difference. I shared code here that assumed prefix is 2 parts of 3

Absolutely brilliant!
I also found another (perhaps very basic) way. I renamed all variables prior to the process by replacing "_" by "."