Using mutate to create a new column from a list of existing columns that was generated programmatically

ajutla · September 28, 2019, 6:24pm

Hi everyone. I have a question about something that seems like it should be simple, but that I can't seem to wrap my head around.

Say I have a tibble, my_tibble, containing columns that represent the scores a group of people received on 5 items. The columns are named item_1_score, item_2_score, and so on. Together they comprise a rating scale. I want to use mutate to derive a summary_score column by adding those five item-level scores together.

A straightforward way of doing it would be to try my_tibble %>% mutate(summary_score = item_1_score + item_2_score + item_3_score + item_4_score + item_5_score). But the repetition here is a bit silly, especially since the real dataframe I'm trying to do this with has 21 of these scores, not 5.

Another way of doing it might be to try to match the columns I'm interested in with a regex-based approach. However, the dataset also includes columns with names like item_1b_score, which I'm not interested in. Regexes have always felt like black magic to me, and although I'm sure it's possible, I have no idea how to write one that matches item_1_score but not item_1b_score.

So I thought, what if I programmatically generate strings representing the column names I'm interested in, and then get mutate to recognize those strings as variable names? It turns out that getting R to recognize strings as variable names seems to involve black magic of a different sort. I went down a rabbit hole of trying to understand !!, {{, expr, sym and so on, which didn't go anywhere.

Finally, in frustration, I built a string that contains the literal argument to mutate that I want, and then used eval and parse to read it.

This approach technically works, but its ugliness bothers me. I know there has to be a simple, elegant, idiomatic way to do what I want that doesn't involve parsing strings, but I just can't seem to figure out what it might be.

Any help would be appreciated.

Here's a reprex that includes my ugly non-solution:

library(tidyverse)
set.seed(1)

my_tibble <- tibble(
    subject_id = c(1:20),
    item_1_score = sample.int(10, 20, replace=TRUE),
    item_1b_score = sample.int(10, 20, replace=TRUE),
    item_2_score = sample.int(10, 20, replace=TRUE),
    item_3_score = sample.int(10 ,20, replace=TRUE),
    item_4_score = sample.int(10, 20, replace=TRUE),
    item_5_score = sample.int(10, 20, replace=TRUE),
)

construct_column_name <- function(item_number) {
    return(paste0('item_', item_number, '_score'))
}

my_string <-
    1:4 %>%
    map(construct_column_name) %>%
    paste0(' + ') %>%
    paste0(collapse='') %>%
    paste0(construct_column_name(5))

my_tibble <- my_tibble %>%
    mutate(summary_score = eval(parse(text=my_string)))

^{Created on 2019-09-28 by the reprex package (v0.3.0)}

technocrat · September 28, 2019, 6:49pm

Hi, and welcome

A S/O post has a good solution

 my_tibble %>% mutate(summary_score = rowSums(select(.,contains("item"))))

FJCC · September 28, 2019, 6:57pm

I think to avoid using the item_1b_score column, the command is

my_tibble %>% mutate(summary_score = rowSums(select(.,-contains("b_"), -subject_id)))

andresrcs · September 28, 2019, 6:58pm

Actually I think this is more explicit

my_tibble %>% 
    mutate(summary_score = rowSums(select(.,matches("item_\\d{1,2}_score"))))

\\d{1,2} this will match 1 or 2 digits in the middle and no letters

ajutla · September 28, 2019, 7:17pm

I think the solution @andresrcs provided works the best for my use case. select(., -contains("b_"), -subject_id works for the toy example I provided, but unfortunately the real dataframe has a large number of other columns that would match.

Thank you all for your help.

system · October 5, 2019, 7:17pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.