Filtering out rows from data table based on list of values

Hi all,

I want to filter out rows from a data table (DT) using values from a vector (goodHosp). I'm wondering what the best way to go about doing it.

      DT <- data.table(patients = 1:5, treatment = letters[1:5],
                        hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
                        , response = rnorm(5))

      goodHosp <- c("yyy", "uuu")

My coworker wrote a for loop. It seems to work if the vector only has 1 value but not for multiple values as the below example has 2 values.

      for (i in 1:length(goodHosp)){
        if(i == 1)
          a_DT <- DT[DT$hospital %like% goodHosp[1] ]  
        else
          a_DT <- funion(DT, DT[hospital %like% goodHosp[i] ])
      }

It returns all values as where I would expect it to filters out row #3 where it doesn't matched the value in goodHosp vector. This means that the "else" part is not working correctly.
patients treatment hospital response
1: 1 a .yyy 0.6886801
2: 2 b .yyy.bbb 2.0524934
3: 3 c .zzz 0.8979818
4: 4 d .yyy.www 0.3883533
5: 5 e .uuu 0.5332226

I would like to understand what doesn't work in the for loop. I also want to know if there is a more elegant and effective way to do this by using apply() or purrr that I'm not too familiar with yet.

Thanks in advance!

1 Like

Hi, and welcome.

I have to confess that this is my first time out with data.table, so I can only offer a tidy approach

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

DT <- data.table(patients = 1:5, treatment = letters[1:5],
              hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu"),
              response = rnorm(5))
goodHosp <- c(".yyy", ".uuu") # added prefix . to conform to data

DT2 <- lazy_dt(DT)
DT2 %>% filter(hospital %in% goodHosp) -> filtrate
filtrate
#> Source: local data table [?? x 4]
#> Call:   `_DT1`[hospital %in% goodHosp]
#> 
#>   patients treatment hospital response
#>      <int> <chr>     <chr>       <dbl>
#> 1        1 a         .yyy      0.00578
#> 2        5 e         .uuu     -0.648  
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Created on 2020-02-13 by the reprex package (v0.3.0)

Does this give you what you want?

library(data.table)

DT <- data.table(patients = 1:5, treatment = letters[1:5],
                 hospital = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
                 , response = rnorm(5))

goodHosp <- c("yyy", "uuu")
goodHosp_match <- paste0(goodHosp, collapse = "|")

DT[hospital %like% goodHosp_match]
1 Like

This seems like a good case for the outer() function and the stringi package (for fully-vectorized string functions).

From the documentation for outer:

The outer product of the arrays X and Y is the array A with dimension c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] = FUN(X[arrayindex.x], Y[arrayindex.y], ...) .

Which is a fancy way of saying, "do a function over all possible pairs of X and Y."

So we'll use stri_detect_fixed to look for each value of goodHosp in each value of DT[["hospital"]]. This is equivalent to a for-loop, but with R's vectorizing flair.

library(stringi)
is_good_hosp <- outer(DT[["hospital"]], goodHosp, FUN = stri_detect_fixed)
# Names are only for clarity of what's going on
rownames(is_good_hosp) <- DT[["hospital"]]
colnames(is_good_hosp) <- goodHosp
is_good_hosp
#            yyy   uuu
# .yyy      TRUE FALSE
# .yyy.bbb  TRUE FALSE
# .zzz     FALSE FALSE
# .yyy.www  TRUE FALSE
# .uuu     FALSE  TRUE

DT[rowSums(is_good_hosp) > 0]
#    patients treatment hospital   response
# 1:        1         a     .yyy  1.1666976
# 2:        2         b .yyy.bbb -0.5488551
# 3:        4         d .yyy.www -0.1502280
# 4:        5         e     .uuu -1.1591103
1 Like

Hi there,

Using %in% will not work as where I need to be able to filter in row #2 and #4 as well. Below is the expected result that I want:

#> patients treatment hospital response
#>1: 1 a .yyy 0.6886801
#>2: 2 b .yyy.bbb 2.0524934
#>4: 4 d .yyy.www 0.3883533
#>5: 5 e .uuu 0.5332226

1 Like

Hi Martin,

Yay, this does give me exactly what I want. I have a gut feeling that we might be making it more complicated using a for loop and your code proves it to be right. Can you explain to me a bit more on why this works instead of a for loop?

Thank you so much,

Wow, I would never thought of it this way. This is a fancier way of doing it and still give the desired result. However, it returns only the hospital column as where I would want to return the whole data table. I am still very new to R. Since I am not fully understand all the benefits of vector, list, data frame and others, this solution makes it more complicated for me at the moment.

In this case you wanted to use the %like% function over all elements of a vector of patterns, but only one regex pattern is permitted as input. I simply converted the vector to a single string with | as the OR operator, which is a simple but effective trick in cases like this.

If you really want to loop over the vector, then this should work:

for (i in seq_along(goodHosp)){
  if(i == 1)
    a_DT <- DT[hospital %like% goodHosp[1] ]  
  else
    a_DT <- funion(a_DT, DT[hospital %like% goodHosp[i] ])
}

I really wouldn't recommend it though.

1 Like

@nwerth has the better solution. Another possibility is to split your hospital field into hospital and unit columns and then (if you are in a data frame, at least) use mutate(good = ifelse(hospital %in% good, 1,0))

The purrr version of the solution would look like this:

purrr::map_df(goodHosp, ~ DT[str_detect(hospital, .x),])

Results:
patients treatment hospital response
1: 1 a .yyy 1.0884662
2: 2 b .yyy.bbb -0.3726247
3: 4 d .yyy.www 1.8042033
4: 5 e .uuu 0.6081744

Hope that solves your problem?

1 Like

Hi Dawie,

This is perfect. I was reading and watching Youtube videos about purrr and apply() in the past couple days but couldn't apply the new knowledge to this particular example yet. People mentioned that purrr is the way to go because it is more efficient, easier to understand and take less coding.

Thank you so much!

You need to be aware that you may be mixing different paradigms. Your initial question related to data.table, but purrr is from the tidyverse family of packages.

You can mix them but I don't think it's a particularly good idea unless you know why you are doing so. I see plenty of examples of copy/pasting snippets and combining them which tends to lead to poor code.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.