My coworker wrote a for loop. It seems to work when the vector has only one value, but not when it has several, as in the example below, which has two.
for (i in 1:length(goodHosp)){
if(i == 1)
a_DT <- DT[DT$hospital %like% goodHosp[1] ]
else
a_DT <- funion(DT, DT[hospital %like% goodHosp[i] ])
}
It returns every row, whereas I would expect it to filter out row #3, whose hospital does not match any value in the goodHosp vector. This suggests that the "else" part is not working correctly.
patients treatment hospital response
1: 1 a .yyy 0.6886801
2: 2 b .yyy.bbb 2.0524934
3: 3 c .zzz 0.8979818
4: 4 d .yyy.www 0.3883533
5: 5 e .uuu 0.5332226
I would like to understand what doesn't work in the for loop. I would also like to know whether there is a more elegant and effective way to do this using apply() or purrr, neither of which I'm very familiar with yet.
This seems like a good case for the outer() function and the stringi package (for fully-vectorized string functions).
From the documentation for outer:
The outer product of the arrays X and Y is the array A with dimension c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] = FUN(X[arrayindex.x], Y[arrayindex.y], ...) .
Which is a fancy way of saying, "do a function over all possible pairs of X and Y."
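As a small illustration of that "all possible pairs" behaviour, here is a minimal base R sketch. The vectors x and y are made-up values for illustration only, not data from the question.

```r
# Hypothetical inputs, chosen only to show the shape of the result
x <- c("a", "b")
y <- c("x", "y")

# outer() expands every (x, y) pair into two long vectors and calls FUN once,
# so FUN has to be vectorized; paste0 is.
pairs <- outer(x, y, FUN = paste0)
pairs
#      [,1] [,2]
# [1,] "ax" "ay"
# [2,] "bx" "by"
```

Rows follow x and columns follow y, which is the same layout the hospital matrix further down relies on.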
So we'll use stri_detect_fixed to look for each value of goodHosp in each value of DT[["hospital"]]. This is equivalent to a for-loop, but with R's vectorizing flair.
library(stringi)
is_good_hosp <- outer(DT[["hospital"]], goodHosp, FUN = stri_detect_fixed)
# Names are only for clarity of what's going on
rownames(is_good_hosp) <- DT[["hospital"]]
colnames(is_good_hosp) <- goodHosp
is_good_hosp
# yyy uuu
# .yyy TRUE FALSE
# .yyy.bbb TRUE FALSE
# .zzz FALSE FALSE
# .yyy.www TRUE FALSE
# .uuu FALSE TRUE
DT[rowSums(is_good_hosp) > 0]
# patients treatment hospital response
# 1: 1 a .yyy 1.1666976
# 2: 2 b .yyy.bbb -0.5488551
# 3: 4 d .yyy.www -0.1502280
# 4: 5 e .uuu -1.1591103
Yay, this gives me exactly what I want. I had a gut feeling that we were making it more complicated by using a for loop, and your code proves it. Can you explain a bit more why this works when the for loop doesn't?
Wow, I would never have thought of it this way. This is a fancier way of doing it and still gives the desired result. However, it returns only the hospital column, whereas I want to return the whole data table. I am still very new to R. Since I don't yet fully understand the benefits of vectors, lists, data frames and so on, this solution is more complicated for me at the moment.
In this case you wanted to use the %like% function over all elements of a vector of patterns, but only one regex pattern is permitted as input. I simply converted the vector to a single string with | as the OR operator, which is a simple but effective trick in cases like this.
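A minimal sketch of that trick, for reference. The sample DT is reconstructed from the table in the question (the response column is left out), and goodHosp is inferred from the column names of the outer() output earlier in the thread, so treat both as assumptions.

```r
library(data.table)

# Sample data reconstructed from the question (assumption)
DT <- data.table(
  patients  = 1:5,
  treatment = c("a", "b", "c", "d", "e"),
  hospital  = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
)
goodHosp <- c("yyy", "uuu")  # inferred from the thread (assumption)

# Collapse the vector into one regular expression: "yyy|uuu"
pattern <- paste(goodHosp, collapse = "|")

# %like% takes a single pattern, so the | alternation does the "OR" for us
a_DT <- DT[hospital %like% pattern]
a_DT
```

Because the collapsed string is treated as a regular expression, values containing metacharacters such as "." would need escaping before being pasted together.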
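If you really want to loop over the vector, then this should work. Your original else branch called funion() on DT itself rather than on the accumulated a_DT, so each pass overwrote a_DT with the full table plus the new matches, which is why every row came back: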
for (i in seq_along(goodHosp)){
if(i == 1)
a_DT <- DT[hospital %like% goodHosp[1] ]
else
a_DT <- funion(a_DT, DT[hospital %like% goodHosp[i] ])
}
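Since the question asked about apply-style alternatives, the same accumulation can also be sketched without an explicit loop. This is only one possible way to write it; the sample DT is reconstructed from the question (without the response column) and goodHosp is inferred from the thread, so both are assumptions.

```r
library(data.table)

# Sample data reconstructed from the question (assumption)
DT <- data.table(
  patients  = 1:5,
  treatment = c("a", "b", "c", "d", "e"),
  hospital  = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
)
goodHosp <- c("yyy", "uuu")  # inferred from the thread (assumption)

# One filtered data.table per pattern
matches <- lapply(goodHosp, function(p) DT[hospital %like% p])

# Fold the list together with funion(), which drops duplicate rows
a_DT <- Reduce(funion, matches)
a_DT
```

Reduce() plays the role of the running a_DT in the loop: each step unions the result so far with the next set of matches.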
@nwerth has the better solution. Another possibility is to split your hospital field into hospital and unit columns and then (if you are using a data frame, at least) use mutate(good = ifelse(hospital %in% goodHosp, 1, 0)).
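A rough sketch of the split-then-match idea, written in data.table syntax rather than dplyr's mutate() to stay in one paradigm. It assumes the text before the second dot is the hospital and anything after it is the unit; the column names hosp_name and unit are hypothetical, and the sample DT and goodHosp are reconstructed from the thread.

```r
library(data.table)

# Sample data reconstructed from the question (assumption)
DT <- data.table(
  patients  = 1:5,
  treatment = c("a", "b", "c", "d", "e"),
  hospital  = c(".yyy", ".yyy.bbb", ".zzz", ".yyy.www", ".uuu")
)
goodHosp <- c("yyy", "uuu")  # inferred from the thread (assumption)

# Drop the leading dot, then split on the remaining literal dot.
# Rows without a unit get NA in the second column.
DT[, c("hosp_name", "unit") := tstrsplit(sub("^\\.", "", hospital), ".", fixed = TRUE)]

# Exact matching with %in% instead of pattern matching
DT[, good := as.integer(hosp_name %in% goodHosp)]
DT[good == 1L]
```

With separate columns the match is exact, so a hospital whose name merely contains "yyy" would no longer be picked up by accident.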
This is perfect. I have been reading and watching YouTube videos about purrr and apply() over the past couple of days but couldn't apply the new knowledge to this particular example yet. People say that purrr is the way to go because it is more efficient, easier to understand and takes less code.
You need to be aware that you may be mixing different paradigms. Your initial question related to data.table, but purrr is from the tidyverse family of packages.
You can mix them, but I don't think it's a particularly good idea unless you know why you are doing so. I see plenty of examples of people copy/pasting snippets and combining them, which tends to lead to poor code.