Check whether each row of tibble contains value and add TRUE/FALSE to new column

I'm using a dataset from COVID-19 World Vaccine Adverse Reactions | Kaggle, specifically the dataset "2021VAERSSYMPTOMS.csv".

The dataset contains the a column with the ID of persons who have been vaccinated (VAERS_ID) and columns with the side effects that these persons experienced from the vaccine (SYMPTOMS1-5).

From this, I have made a new tibble where each VAERS_ID has its own row and multiple columns (>5) containing symptoms.

Using tidyverse, I would now like to add additional columns (or make a new tibble) that has a column with VAERS_ID and columns with the top 20 symptoms (these I have extracted) indicating with TRUE/FALSE whether the patient has the symptom.

So in summary, is there a way, using tidyverse, in which I can check whether each row of a tibble contains a certain value, and then add TRUE/FALSE to a new column with this value name?

Could you supply us with a small example data set? E.g.:

my_data <- tribble(
  ~ID, ~VAR1, ~VAR2,
  "A", 1, 2,
  "B", 3, 4
)

and then how you imagine your result would look? (You can create this manually)

Hi Leon,

Here is a small example data set:

symptoms <- tribble(
   ~VAERS_ID, ~SYMPTOM1, ~SYMPTOM2,
   "A", "Headache", "Fever" ,
   "B", "Stomach pain", "Cough"
)

This is what I would like the result to look like:

  symptoms <- tribble(
     ~VAERS_ID, ~Headache, ~Cough,
     "A", TRUE, FALSE,
     "B", FALSE, TRUE
)

I have tried to use mutate() and case_when() but there are many columns (> 100) so I cannot explicitly write them out in the case_when() statement and I have not been able to find a way to indicate all columns.

It's quite long but I found something that works. Here's an example for three symptoms:

symptoms %>%
  pivot_longer(col = -VAERS_ID, 
               names_to = "symptom num",
               values_to = "symptom",
               values_drop_na = TRUE) %>%
  select(VAERS_ID, symptom) %>%
  mutate(Headache = case_when("Headache" == symptom ~ TRUE),
         Pyrexia = case_when("Pyrexia" == symptom ~ TRUE),
         Chills = case_when("Chills" == symptom ~ TRUE)) %>%
  filter_at(vars(-VAERS_ID), any_vars(. %in% TRUE)) %>%
  replace(., is.na(.), FALSE) %>%
  select(-symptom) %>%
  group_by(VAERS_ID) %>%
  summarize_all(sum) %>%
  view()

I see, l think we can create something a bit more concise, try if this works for you?

symptoms %>%
  pivot_longer(cols = -VAERS_ID) %>%
  mutate(name = TRUE) %>% 
  pivot_wider(id_cols = VAERS_ID,
              names_from = value,
              values_from = name,
              values_fill = FALSE)

Hope it helps :slightly_smiling_face:

Thanks a lot! I see how that would work well :slight_smile:

However, I get the following error and warning when I try to run it:
Error: Can't convert to .
In addition: Warning message:
Values are not uniquely identified; output will contain list-cols.

Hmm... Can you provide a small data set, which generates this error?

Yes, here is an example:

symptoms <- tribble(
  ~VAERS_ID, ~SYMPTOM1, ~SYMPTOM2, ~SYMPTOM3, ~SYMPTOM4, ~SYMPTOM5,
  0916600, "Dysphagia", "Epiglottitis", NA, NA, NA, 
  0916601, "Anxiety", "Dyspnoea", NA, NA, NA, 
  0916602, "Chest discomfort", "Dysphagia", "Pain in extremities", "Visual impairment", NA,      
  0916603, "Dizziness", "Fatigue", "Mobility decreased", NA, NA,
  0916604, "Injection site erythema", "Injection site pruritus", "Injection site swelling", "Injection site warmth", NA,
  0916606, "Pharyngeal swelling", NA, NA, NA, NA,     
  0916607, "Abdominal pain", "Chills", "Sleep disorder", NA, NA,   
  0916608, "Diarrhoea", "Nasal congestion", NA, NA, NA, 
  0916609, "Vaccination site erythema", "Vaccination site pruritus", "Vaccination site swelling", NA, NA,    
  0916610, "Rash", "Urticaria", NA, NA, NA  
)

Thank you very much for trying to help!

Excellent, let us work with that:

> symptoms %>%
+   pivot_longer(cols = -VAERS_ID)
# A tibble: 50 x 3
   VAERS_ID name     value       
      <dbl> <chr>    <chr>       
 1   916600 SYMPTOM1 Dysphagia   
 2   916600 SYMPTOM2 Epiglottitis
 3   916600 SYMPTOM3 NA          
 4   916600 SYMPTOM4 NA          
 5   916600 SYMPTOM5 NA          
 6   916601 SYMPTOM1 Anxiety     
 7   916601 SYMPTOM2 Dyspnoea    
 8   916601 SYMPTOM3 NA          
 9   916601 SYMPTOM4 NA          
10   916601 SYMPTOM5 NA          
# … with 40 more rows

Hmmm... Notice how now we have repeated NA observations for value? Let us see what happens, when we add the mutate() from the previous code:

> symptoms %>%
+   pivot_longer(cols = -VAERS_ID) %>%
+   mutate(name = TRUE)
# A tibble: 50 x 3
   VAERS_ID name  value       
      <dbl> <lgl> <chr>       
 1   916600 TRUE  Dysphagia   
 2   916600 TRUE  Epiglottitis
 3   916600 TRUE  NA          
 4   916600 TRUE  NA          
 5   916600 TRUE  NA          
 6   916601 TRUE  Anxiety     
 7   916601 TRUE  Dyspnoea    
 8   916601 TRUE  NA          
 9   916601 TRUE  NA          
10   916601 TRUE  NA          
# … with 40 more rows

Ok, we see repeats of VAERS_ID-observations... We could do a call to distinct()... Hmm... Let us think about that - Now, we know that we want to make a call to pivot_wider() like so:

pivot_wider(id_cols = VAERS_ID,
            names_from = value,
            values_from = name,
            values_fill = FALSE)

Notice how we want to get names of the variables in the final output from the value variable? Do we really want to end up with a column / variable in our data set called NA? Probably not, so let us get rid of that:

> symptoms %>%
+   pivot_longer(cols = -VAERS_ID) %>%
+   mutate(name = TRUE) %>% 
+   drop_na(value) %>% 
+   pivot_wider(id_cols = VAERS_ID,
+               names_from = value,
+               values_from = name,
+               values_fill = FALSE)
# A tibble: 10 x 26
   VAERS_ID Dysphagia Epiglottitis Anxiety Dyspnoea `Chest discomfo… `Pain in extrem…
      <dbl> <lgl>     <lgl>        <lgl>   <lgl>    <lgl>            <lgl>           
 1   916600 TRUE      TRUE         FALSE   FALSE    FALSE            FALSE           
 2   916601 FALSE     FALSE        TRUE    TRUE     FALSE            FALSE           
 3   916602 TRUE      FALSE        FALSE   FALSE    TRUE             TRUE            
 4   916603 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
 5   916604 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
 6   916606 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
 7   916607 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
 8   916608 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
 9   916609 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
10   916610 FALSE     FALSE        FALSE   FALSE    FALSE            FALSE           
# … with 19 more variables: `Visual impairment` <lgl>, Dizziness <lgl>, Fatigue <lgl>, `Mobility
#   decreased` <lgl>, `Injection site erythema` <lgl>, `Injection site pruritus` <lgl>, `Injection
#   site swelling` <lgl>, `Injection site warmth` <lgl>, `Pharyngeal swelling` <lgl>, `Abdominal
#   pain` <lgl>, Chills <lgl>, `Sleep disorder` <lgl>, Diarrhoea <lgl>, `Nasal congestion` <lgl>,
#   `Vaccination site erythema` <lgl>, `Vaccination site pruritus` <lgl>, `Vaccination site
#   swelling` <lgl>, Rash <lgl>, Urticaria <lgl>

Read and understand the details here, so that you might take them with you on your onwards (bio) data science journey :wink:

...and as always - Hope it helps :slightly_smiling_face:

Excellent! This makes sense.
Thanks again for your help :smiley:

1 Like

Basically... Take home: If you run a pipeline and all-of-the-sudden, you start getting errors, then take the pipeline apart - Run it line by line and identify (and understand) where it goes wrong :+1:

1 Like