Filtering out rows from data table based on list of values

vutv · February 13, 2020, 7:08am

Hi all,

I want to filter out rows from a data table (DT) using values from a vector (goodHosp). I'm wondering what the best way to go about doing it.

      DT <- data.table(patients = 1:5, treatment = letters[1:5],
                        hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
                        , response = rnorm(5))

      goodHosp <- c("yyy", "uuu")

My coworker wrote a for loop. It seems to work if the vector only has 1 value but not for multiple values as the below example has 2 values.

      for (i in 1:length(goodHosp)){
        if(i == 1)
          a_DT <- DT[DT$hospital %like% goodHosp[1] ]  
        else
          a_DT <- funion(DT, DT[hospital %like% goodHosp[i] ])
      }

It returns all values as where I would expect it to filters out row #3 where it doesn't matched the value in goodHosp vector. This means that the "else" part is not working correctly.
patients treatment hospital response
1: 1 a .yyy 0.6886801
2: 2 b .yyy.bbb 2.0524934
3: 3 c .zzz 0.8979818
4: 4 d .yyy.www 0.3883533
5: 5 e .uuu 0.5332226

I would like to understand what doesn't work in the for loop. I also want to know if there is a more elegant and effective way to do this by using apply() or purrr that I'm not too familiar with yet.

Thanks in advance!

technocrat · February 13, 2020, 8:33am

Hi, and welcome.

I have to confess that this is my first time out with data.table, so I can only offer a tidy approach

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

DT <- data.table(patients = 1:5, treatment = letters[1:5],
              hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu"),
              response = rnorm(5))
goodHosp <- c(".yyy", ".uuu") # added prefix . to conform to data

DT2 <- lazy_dt(DT)
DT2 %>% filter(hospital %in% goodHosp) -> filtrate
filtrate
#> Source: local data table [?? x 4]
#> Call:   `_DT1`[hospital %in% goodHosp]
#> 
#>   patients treatment hospital response
#>      <int> <chr>     <chr>       <dbl>
#> 1        1 a         .yyy      0.00578
#> 2        5 e         .uuu     -0.648  
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

^{Created on 2020-02-13 by the reprex package (v0.3.0)}

martin.R · February 13, 2020, 2:27pm

Does this give you what you want?

library(data.table)

DT <- data.table(patients = 1:5, treatment = letters[1:5],
                 hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
                 , response = rnorm(5))

goodHosp <- c("yyy", "uuu")
goodHosp_match <- paste0(goodHosp, collapse = "|")

DT[hospital %like% goodHosp_match]

nwerth · February 13, 2020, 2:48pm

This seems like a good case for the outer() function and the stringi package (for fully-vectorized string functions).

From the documentation for outer:

The outer product of the arrays X and Y is the array A with dimension c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] = FUN(X[arrayindex.x], Y[arrayindex.y], ...) .

Which is a fancy way of saying, "do a function over all possible pairs of X and Y."

So we'll use stri_detect_fixed to look for each value of goodHosp in each value of DT[["hospital"]]. This is equivalent to a for-loop, but with R's vectorizing flair.

library(stringi)
is_good_hosp <- outer(DT[["hospital"]], goodHosp, FUN = stri_detect_fixed)
# Names are only for clarity of what's going on
rownames(is_good_hosp) <- DT[["hospital"]]
colnames(is_good_hosp) <- goodHosp
is_good_hosp
#            yyy   uuu
# .yyy      TRUE FALSE
# .yyy.bbb  TRUE FALSE
# .zzz     FALSE FALSE
# .yyy.www  TRUE FALSE
# .uuu     FALSE  TRUE

DT[rowSums(is_good_hosp) > 0]
#    patients treatment hospital   response
# 1:        1         a     .yyy  1.1666976
# 2:        2         b .yyy.bbb -0.5488551
# 3:        4         d .yyy.www -0.1502280
# 4:        5         e     .uuu -1.1591103

vutv · February 13, 2020, 6:15pm

Hi there,

Using %in% will not work as where I need to be able to filter in row #2 and #4 as well. Below is the expected result that I want:

#> patients treatment hospital response
#>1: 1 a .yyy 0.6886801
#>2: 2 b .yyy.bbb 2.0524934
#>4: 4 d .yyy.www 0.3883533
#>5: 5 e .uuu 0.5332226

vutv · February 13, 2020, 6:17pm

Hi Martin,

Yay, this does give me exactly what I want. I have a gut feeling that we might be making it more complicated using a for loop and your code proves it to be right. Can you explain to me a bit more on why this works instead of a for loop?

Thank you so much,

vutv · February 13, 2020, 6:25pm

Wow, I would never thought of it this way. This is a fancier way of doing it and still give the desired result. However, it returns only the hospital column as where I would want to return the whole data table. I am still very new to R. Since I am not fully understand all the benefits of vector, list, data frame and others, this solution makes it more complicated for me at the moment.

martin.R · February 13, 2020, 6:33pm

In this case you wanted to use the %like% function over all elements of a vector of patterns, but only one regex pattern is permitted as input. I simply converted the vector to a single string with | as the OR operator, which is a simple but effective trick in cases like this.

If you really want to loop over the vector, then this should work:

for (i in seq_along(goodHosp)){
  if(i == 1)
    a_DT <- DT[hospital %like% goodHosp[1] ]  
  else
    a_DT <- funion(a_DT, DT[hospital %like% goodHosp[i] ])
}

I really wouldn't recommend it though.

technocrat · February 13, 2020, 6:35pm

@nwerth has the better solution. Another possibility is to split your hospital field into hospital and unit columns and then (if you are in a data frame, at least) use mutate(good = ifelse(hospital %in% good, 1,0))

Dawie · February 14, 2020, 5:49am

The purrr version of the solution would look like this:

purrr::map_df(goodHosp, ~ DT[str_detect(hospital, .x),])

Results:
patients treatment hospital response
1: 1 a .yyy 1.0884662
2: 2 b .yyy.bbb -0.3726247
3: 4 d .yyy.www 1.8042033
4: 5 e .uuu 0.6081744

Hope that solves your problem?

vutv · February 14, 2020, 6:35pm

Hi Dawie,

This is perfect. I was reading and watching Youtube videos about purrr and apply() in the past couple days but couldn't apply the new knowledge to this particular example yet. People mentioned that purrr is the way to go because it is more efficient, easier to understand and take less coding.

Thank you so much!

martin.R · February 14, 2020, 6:54pm

You need to be aware that you may be mixing different paradigms. Your initial question related to data.table, but purrr is from the tidyverse family of packages.

You can mix them but I don't think it's a particularly good idea unless you know why you are doing so. I see plenty of examples of copy/pasting snippets and combining them which tends to lead to poor code.

system · February 21, 2020, 6:54pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.