How to scale up this filtering condition using dplyr for large datasets

Rscotty · May 18, 2019, 1:38am

I have tested a solution with a small dataset and it works. However, now I have 100 columns and my solution is not scalable. Can you please help me to achieve the result. Solution based on tidyverse would be lovely!

I want to Tag each row based on following condition:

Tags condition

Tag1 = col1 must contain either one value from the User_list and rest of the columns must not include any values from the User_list
Tag2 = col1 must NOT contain any value from the User_list and rest of the columns must have at least one value from the User_list
Tag3 = col1 must contain either one value from the User_list and rest of the columns must have at least one value from the User_list

final_result
#> # A tibble: 8 x 7
#>   name      col1       col2       col3        Tag1  Tag2  Tag3
#>   <chr>     <chr>      <chr>      <chr>      <dbl> <dbl> <dbl>
#> 1 John      cricket    football   basketball    NA    NA     3
#> 2 Scottish  running    football   <NA>          NA     2    NA
#> 3 Kobe      <NA>       <NA>       basketball    NA     2    NA
#> 4 jordan    baseball   basketball <NA>          NA     2    NA
#> 5 shakira   singing    dancing    acting        NA    NA    NA
#> 6 Chris     <NA>       <NA>       <NA>          NA    NA    NA
#> 7 Tendulkar cricket    <NA>       <NA>           1    NA    NA
#> 8 Axl       basketball guitar     cricket       NA    NA     3

Below is the sample data

library (tidyverse)
sample_data <- tibble::tribble(
        ~name,        ~col1,        ~col2,        ~col3,
       "John",    "cricket",   "football", "basketball",
   "Scottish",    "running",   "football",           NA,
       "Kobe",           NA,           NA, "basketball",
     "jordan",   "baseball", "basketball",           NA,
    "shakira",    "singing",    "dancing",     "acting",
      "Chris",           NA,           NA,           NA,
  "Tendulkar",    "cricket",           NA,           NA,
        "Axl", "basketball",     "guitar",    "cricket"
  )

This is the user list

User_list <- tibble(vars = c("basketball", "football", "cricket"))

This is my codes:

sample_data %>% 
  mutate(Tag1 = case_when ((col1 %in% codes$vars)  & (!col2 %in% codes$vars) & (!col3 %in% codes$vars) ~ 1),
         Tag2 = case_when (!(col1 %in% codes$vars) & ((col2 %in% codes$vars) | (col3 %in% codes$vars)) ~ 2),
         Tag3 = case_when (col1 %in% codes$vars & ((col2 %in% codes$vars) | (col3 %in% codes$vars)) ~ 3))

Above code surprising work! However I have 100 columns. How can I achieve the above result with the same Tags condition which is column 1 is col1 and rest of the columns refers to column 2 to column 100

FJCC · May 18, 2019, 3:23am

I would break up the calculation in two pieces and then join them at the end. I did not calculate the Tags, but it is clear you can do that. The TRUE/FALSE conditions signify whether an element of User_list is present. Remember, also, that TRUE == 1 and FALSE == 0, so summing booleans tells you how many are TRUE.

library (tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
sample_data <- tibble::tribble(
  ~name,        ~col1,        ~col2,        ~col3,
  "John",    "cricket",   "football", "basketball",
  "Scottish",    "running",   "football",           NA,
  "Kobe",           NA,           NA, "basketball",
  "jordan",   "baseball", "basketball",           NA,
  "shakira",    "singing",    "dancing",     "acting",
  "Chris",           NA,           NA,           NA,
  "Tendulkar",    "cricket",           NA,           NA,
  "Axl", "basketball",     "guitar",    "cricket"
)

User_list <- tibble(vars = c("basketball", "football", "cricket"))

Col1 <- sample_data %>% select(name, col1) %>% 
  mutate(Col1UserBool = col1 %in% User_list$vars)

ColNot1 <- sample_data %>% select( -col1) %>%
  gather(key = Column, value = Value, col2:col3) %>% 
  mutate(User_Bool = Value %in% User_list$vars) %>% 
  group_by(name) %>% 
  summarize(ColNot1UserBool = sum(User_Bool) > 0)

JoinCol1ColNot1 <- inner_join(Col1, ColNot1, by = "name")

JoinCol1ColNot1
#> # A tibble: 8 x 4
#>   name      col1       Col1UserBool ColNot1UserBool
#>   <chr>     <chr>      <lgl>        <lgl>          
#> 1 John      cricket    TRUE         TRUE           
#> 2 Scottish  running    FALSE        TRUE           
#> 3 Kobe      <NA>       FALSE        TRUE           
#> 4 jordan    baseball   FALSE        TRUE           
#> 5 shakira   singing    FALSE        FALSE          
#> 6 Chris     <NA>       FALSE        FALSE          
#> 7 Tendulkar cricket    TRUE         FALSE          
#> 8 Axl       basketball TRUE         TRUE

^{Created on 2019-05-17 by the reprex package (v0.2.1)}

Yarnabrina · May 18, 2019, 4:57am

Most probably, this is not what you are looking for, but it works:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

sample_data <- tribble(~name,       ~col1,        ~col2,        ~col3,
                       "John",      "cricket",    "football",   "basketball",
                       "Scottish",  "running",    "football",   NA,
                       "Kobe",      NA,           NA,           "basketball",
                       "jordan",    "baseball",   "basketball", NA,
                       "shakira",   "singing",    "dancing",    "acting",
                       "Chris",     NA,           NA,           NA,
                       "Tendulkar", "cricket",    NA,           NA,
                       "Axl",       "basketball", "guitar",     "cricket")

User_list <- tibble(vars = c("basketball", "football", "cricket"))

sample_data %>%
  mutate(Tag = apply(X = .,
                     MARGIN = 1,
                     FUN = function(t) {
                       case_when(((t[2] %in% User_list$vars) & (!(any(t[-(1:2)] %in% User_list$vars)))) ~ 1,
                                 (!(t[2] %in% User_list$vars) & (any(t[-(1:2)] %in% User_list$vars))) ~ 2,
                                 ((t[2] %in% User_list$vars) & (any(t[-(1:2)] %in% User_list$vars))) ~ 3)
                     }))
#> # A tibble: 8 x 5
#>   name      col1       col2       col3         Tag
#>   <chr>     <chr>      <chr>      <chr>      <dbl>
#> 1 John      cricket    football   basketball     3
#> 2 Scottish  running    football   <NA>           2
#> 3 Kobe      <NA>       <NA>       basketball     2
#> 4 jordan    baseball   basketball <NA>           2
#> 5 shakira   singing    dancing    acting        NA
#> 6 Chris     <NA>       <NA>       <NA>          NA
#> 7 Tendulkar cricket    <NA>       <NA>           1
#> 8 Axl       basketball guitar     cricket        3

Adding another solution:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

sample_data <- tribble(~name,       ~col1,        ~col2,        ~col3,
                       "John",      "cricket",    "football",   "basketball",
                       "Scottish",  "running",    "football",   NA,
                       "Kobe",      NA,           NA,           "basketball",
                       "jordan",    "baseball",   "basketball", NA,
                       "shakira",   "singing",    "dancing",    "acting",
                       "Chris",     NA,           NA,           NA,
                       "Tendulkar", "cricket",    NA,           NA,
                       "Axl",       "basketball", "guitar",     "cricket")

User_list <- tibble(vars = c("basketball", "football", "cricket"))

temp <- function(t)
{
  case_when(((t[2] %in% User_list$vars) & (!(any(t[-(1:2)] %in% User_list$vars)))) ~ 1L,
            (!(t[2] %in% User_list$vars) & (any(t[-(1:2)] %in% User_list$vars))) ~ 2L,
            ((t[2] %in% User_list$vars) & (any(t[-(1:2)] %in% User_list$vars))) ~ 3L)
}

sample_data %>%
  mutate(Tag = pmap_int(.l = .,
                        .f = lift_vd(temp)))
#> # A tibble: 8 x 5
#>   name      col1       col2       col3         Tag
#>   <chr>     <chr>      <chr>      <chr>      <int>
#> 1 John      cricket    football   basketball     3
#> 2 Scottish  running    football   <NA>           2
#> 3 Kobe      <NA>       <NA>       basketball     2
#> 4 jordan    baseball   basketball <NA>           2
#> 5 shakira   singing    dancing    acting        NA
#> 6 Chris     <NA>       <NA>       <NA>          NA
#> 7 Tendulkar cricket    <NA>       <NA>           1
#> 8 Axl       basketball guitar     cricket        3

Rscotty · May 19, 2019, 11:57am

I found your solution to be concise and easy to apply. This will really help my project. Thank you very much.

system · May 26, 2019, 11:57am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.