Trouble using filter and str_detect to remove participants

I am conducting a food-related study and would like to remove all of a participant's data if they reported any food-related allergies during the questionnaire part of my study. I am trying to accomplish this using group_by, filter and str_detect.

Unfortunately, the code I have at the moment results in a new table containing only the answers with "gluten". The group_by function also does not behave as I expected: it doesn't affect all of a participant's answers, only the rows that contain "gluten".

Here is the code I have now. I would like all of a participant's answers to be removed if they answered "gluten" anywhere in the questionnaire :)

my_data_raw_quest %>%
  group_by(user_id) %>%
  filter(str_detect(dv, "(G|g)luten"))

Here is the table created from that code.

structure(list(session_id = c(53877, 53891, 54090, 54469, 54929, 
55038, 55061, 55096, 55104, 55108, 55145, 57068, 57074, 57146, 
57276, 57435, 57952, 58817), project_id = c(495, 495, 495, 495, 
495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 
495), quest_name = c("Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic"
), quest_id = c(2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 
2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189), 
    user_id = c(47667, 47681, 47877, 48251, 48705, 48816, 48839, 
    48873, 48881, 48881, 48921, 50663, 50723, 50794, 50924, 51077, 
    51561, 52161), user_sex = c("male", "female", "female", "female", 
    "female", "na", "female", "female", "female", "female", "female", 
    "female", "female", "female", "female", "female", "male", 
    "female"), user_status = c("test", "test", "guest", "guest", 
    "registered", "guest", "guest", "guest", "test", "test", 
    "guest", "registered", "guest", "guest", "guest", "guest", 
    "guest", "test"), user_age = c(59, 40, 35, 38, 53.7, 28, 
    21, 65, 24, 24, 25, 20.8, 38, 44, 32, 34, 44, 20), q_name = c("food allergies", 
    "food allergies", "food allergies", "food allergies", "food allergies", 
    "food allergies", "food allergies", "food allergies", "food allergies", 
    "food allergies", "food allergies", "food allergies", "food allergies", 
    "food allergies", "food allergies", "food allergies", "Other", 
    "food allergies"), q_id = c(92827397, 92827397, 92827397, 
    92827397, 92827397, 92827397, 92827397, 92827397, 92827397, 
    92827397, 92827397, 92827397, 92827397, 92827397, 92827397, 
    92827397, 92831398, 92827397), order = c(4, 4, 4, 4, 4, 4, 
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4), dv = c("Gluten", "Gluten, cumin, paprika, anchovies", 
    "Gluten intolerance", "Dairy, gluten some veg, fruit and nuts", 
    "Gluten", "Gluten", "Gluten intolerant", "Gluten", "No allergies, but intolerant to gluten", 
    "No allergies, but gluten intolerant", "Lactose & gluten", 
    "gluten and dairy intolerance", "Sensitive to gluten and soy", 
    "Gluten", "Gluten", "Gluten", "Locked down with family, sister is gluten free", 
    "I am conscious of what gluten i eat as it sets my eczema off"
    ), starttime = structure(c(1607970136, 1607970692, 1607975785, 
    1607984805, 1608023741, 1608037872, 1608041491, 1608047134, 
    1608048524, 1608048811, 1608055657, 1609950997, 1609951334, 
    1609953692, 1609961095, 1609976350, 1610182572, 1610465355
    ), tzone = "UTC", class = c("POSIXct", "POSIXt")), endtime = structure(c(1607970180, 
    1607970791, 1607975825, 1607984927, 1608023787, 1608037944, 
    1608041525, 1608047239, 1608048613, 1608048856, 1608055709, 
    1609951071, 1609951428, 1609953730, 1609961133, 1609976399, 
    1610182657, 1610465458), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), undergraduate = c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE), NoUni = c(FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Masters = c(FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), 
    Postgraduate = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE), degree = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -18L), groups = structure(list(user_id = c(47667, 
47681, 47877, 48251, 48705, 48816, 48839, 48873, 48881, 48921, 
50663, 50723, 50794, 50924, 51077, 51561, 52161), .rows = structure(list(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9:10, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -17L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Which ones do you want to keep?

> df %>% 
+   ungroup() %>% 
+   distinct(dv)
# A tibble: 12 x 1
   dv                                                          
   <chr>                                                       
 1 Gluten                                                      
 2 Gluten, cumin, paprika, anchovies                           
 3 Gluten intolerance                                          
 4 Dairy, gluten some veg, fruit and nuts                      
 5 Gluten intolerant                                           
 6 No allergies, but intolerant to gluten                      
 7 No allergies, but gluten intolerant                         
 8 Lactose & gluten                                            
 9 gluten and dairy intolerance                                
10 Sensitive to gluten and soy                                 
11 Locked down with family, sister is gluten free              
12 I am conscious of what gluten i eat as it sets my eczema off

Someone on Stack Overflow helped me a lot, and now I have this code, which removes the rows whose answer contains "gluten".

my_data_raw_quest_2 <-
  my_data_raw_quest %>%
  group_by(user_id) %>%
  filter(is.na(dv) | !str_detect(dv, "(G|g)luten"))

Unfortunately, it still does not remove all of the rows belonging to a participant who gave a "gluten"-containing answer; it only removes that one row. So, have I used group_by correctly?
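One possible reading of what is going on, offered as a sketch rather than a definitive diagnosis: inside a grouped filter or mutate, str_detect still returns one TRUE/FALSE/NA per row, so the grouping has no visible effect unless the condition is collapsed to a single value per group, for example with any(). The toy tibble and the column names row_has_gluten and user_has_gluten below are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: user 2 mentions gluten in one of their two answers
toy <- tibble::tibble(
  user_id = c(1, 1, 2, 2),
  dv      = c(NA, "left", "Gluten intolerant", "left")
)

flags <- toy %>%
  group_by(user_id) %>%
  mutate(
    # One value per row: NA for a missing answer, TRUE/FALSE otherwise
    row_has_gluten = str_detect(dv, "(G|g)luten"),
    # One value per user, repeated on each of that user's rows
    user_has_gluten = any(row_has_gluten, na.rm = TRUE)
  ) %>%
  ungroup()

print(flags)
```

Only user_has_gluten is TRUE on both of user 2's rows, so a condition built on it would apply to the participant as a whole rather than to a single answer.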

Hey William, thanks for the reply.

I don't want to keep any of those answers :) What I'm trying to do is remove every row belonging to a participant who gave an answer containing "gluten" to any one question :)

If you want help with code that has to account for multiple ids, and should exclude gluten but preserve non-gluten, it would be best to provide representative data that includes those scenarios...
E.g. I could tell you to use %>% filter(FALSE), which seems to work when the entire example data happens to contain gluten and is worth excluding, but it would clearly fail on a more representative example.

Good point thank you ;D I have made an example below, please let me know if I need to change anything

group_by doesn't seem to have any effect when I use it with filter. This is a simple reproduction of my data:

data = my_data_raw_quest 

user_id     question          dv
1            Allergies?        na     
1            food choice       left
2            Allergies?        yes, I hate gluten  
2            food choice       left        
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

When I run this code, it seems like group_by did not work correctly.

my_data_raw_quest_2 <- 
  my_data_raw_quest %>%
  dplyr::group_by(user_id) %>%
  filter(add = TRUE, is.na(dv)|
    !str_detect(dv, "(G|g)luten"))

This results in the following data set, when instead I would like ALL responses removed from users that answered "gluten" to any question. Notice that line 3 is removed but not line 4.

user_id     question           dv
1            Allergies?        na     
1            food choice       left
2            food choice       left        
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

This is the table I'm trying to achieve:

user_id     question           dv
1            Allergies?        na     
1            food choice       left
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

Hello, you are making progress, but you took a step back, I'm afraid.
This time you shared your example as plain text, whereas originally you had it in a much better form: copy-and-pastable code (i.e. the structure() output). Perhaps go back to dput and replace the above?
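For reference, a small sketch of what that could look like. The data frame example_data below is hypothetical and much smaller than the real data; dput prints R code that anyone can paste into their console to rebuild exactly the same object.

```r
# Hypothetical cut-down example with one gluten user and one non-gluten user
example_data <- data.frame(
  user_id  = c(1, 1, 2, 2),
  question = c("Allergies?", "food choice", "Allergies?", "food choice"),
  dv       = c(NA, "left", "yes, I hate gluten", "left"),
  stringsAsFactors = FALSE
)

# Prints a structure(list(...)) call that can be pasted into a forum post
dput(example_data)
```

A helper would then assign the pasted structure(...) call to a name, such as df, and work with it directly.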
