Recoding as "True" or "False" depending on contents of other columns

GabriellaS-K · December 22, 2020, 9:12pm

Hi,

I'm hoping someone will be able to help. I have a dataset of fruit and vegetables. I made a new variable Fruit which I want to show FALSE if the other columns do not contain fruits, and TRUE if they do contain a fruit, and NA if it is from the meat column.
Where any fruit in a column overrides any veg or meat, any veg overrides meat
fruit>veg>meat

I hope this makes sense. Example of what I have now, what I want it to look like, and the structure of my data below. I know it will require a little manual work (to code veggies as veg, and fruits as fruit)

So I'd like to go from what I currently have:

   ID lemon sprout lime apple broccoli grape meat FRUIT
1 aaa  <NA>   <NA> <NA> apple broccoli grape <NA>    NA
2 bbb  <NA> sprout <NA>  <NA>     <NA>  <NA> <NA>    NA
3 ccc  <NA>   <NA> <NA>  <NA> broccoli grape <NA>    NA
4 ddd lemon sprout lime  <NA>     <NA>  <NA> <NA>    NA
5 eee lemon   <NA> lime apple     <NA> grape <NA>    NA
6 fff  <NA>   <NA> <NA>  <NA>     <NA>  <NA> meat    NA

to:

   ID lemon sprout lime apple broccoli grape meat FRUIT
1 aaa  <NA>   <NA> <NA> apple broccoli grape <NA>  true
2 bbb  <NA> sprout <NA>  <NA>     <NA>  <NA> <NA> false
3 ccc  <NA>   <NA> <NA>  <NA> broccoli grape <NA>  true
4 ddd lemon sprout lime  <NA>     <NA>  <NA> <NA>  true
5 eee lemon   <NA> lime apple     <NA> grape <NA>  true
6 fff  <NA>   <NA> <NA>  <NA>     <NA>  <NA> meat  <NA>

Structure of my data:

dput(df)
structure(list(ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"
), lemon = c(NA, NA, NA, "lemon", "lemon", NA), sprout = c(NA, 
"sprout", NA, "sprout", NA, NA), lime = c(NA, NA, NA, "lime", 
"lime", NA), apple = c("apple", NA, NA, NA, "apple", NA), broccoli = c("broccoli", 
NA, "broccoli", NA, NA, NA), grape = c("grape", NA, "grape", 
NA, "grape", NA), meat = c(NA, NA, NA, NA, NA, "meat"), FRUIT = c(NA, 
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-6L))

Many thanks

Lief · December 22, 2020, 9:59pm

Your Data

Cleaned up dput:

dat <- 
  structure(list(ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"), 
                 lemon = c(NA, NA, NA, "lemon", "lemon", NA), 
                 sprout = c(NA, "sprout", NA, "sprout", NA, NA), 
                 lime = c(NA, NA, NA, "lime", "lime", NA), 
                 apple = c("apple", NA, NA, NA, "apple", NA), 
                 broccoli = c("broccoli", NA, "broccoli", NA, NA, NA), 
                 grape = c("grape", NA, "grape",  NA, "grape", NA), 
                 meat = c(NA, NA, NA, NA, NA, "meat"), 
                 FRUIT = c(NA, NA, NA, NA, NA, NA)), 
            class = "data.frame", row.names = c(NA, -6L))

Constants to identify your types of items

FRUITS <- c("apple", "grape", "lemon", "lime")
VEGGIE <- c("broccoli", "sprout")
MEAT <- c("meat")

Base R solution

This is a solution to exactly the question posed using base R. We use apply to go row by row for each set of items and count how many non missing items there are. Then we combine that information back into the FRUIT column so that it is TRUE if there are more than 1 fruit, FALSE if more than 1 veggies, and NA if there is more than one Meat, or if none of the prior statements were TRUE. In this setup, a row that has fruit and veggies and meat will still be TRUE. You can change the order of the ifelse statements if that is not the desired outcome.

fruit_rows  <- apply(dat[FRUITS], 1, function(x) sum(!is.na(x)))
veggie_rows <- apply(dat[VEGGIE], 1, function(x) sum(!is.na(x)))
meat_rows   <- apply(dat[MEAT],   1, function(x) sum(!is.na(x)))

dat$FRUIT = ifelse(fruit_rows > 0, TRUE,
                   ifelse(veggie_rows > 0, FALSE, 
                          ifelse(meat_rows > 0, NA, NA)))

Tidy data

Your data would be easier to work with in a tidy format (see here). As it is, there's a lot of missing values and redundancies. Here's some code that re-organizes and cleans up the data.
First create a 'long' dataset that has multiple rows per id, with a record of each item they have.

library(tidyverse)

long <- dat %>% select(-FRUIT) %>% 
  pivot_longer(-ID, values_to = "item", values_drop_na = TRUE) %>% 
  select(-name)

Then, summarize for each id what item types they have. You can replace any with sum below if you want to know how many of each item type they have. From here, you can create your FRUIT column with an if_else statement pretty easily, similar to what was done in the section above.

items_by_id <- long %>% group_by(ID) %>% 
  summarize(hasFruit = any(item %in% FRUITS), 
            hasVeggie = any(item %in% VEGGIE),
            hasMeat = any(item %in% MEAT))

GabriellaS-K · December 23, 2020, 12:54am

This is perfect. Thank you so much!!!

system · December 30, 2020, 12:54am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.