Doing multiple filtering without using any loop

Hello! I'm new to R, sorry if I don't explain this well!
I am trying the following:
In table_target, I have the start and end columns. I am trying to check which table_base rows (toFind column) fall between these start and end values. At the end I want a tibble where each row holds one result, with the matching table_base and table_target rows together.

My solution, shown below, uses a loop. However, the loop takes too long. Is there another way?

table_target
# A tibble: 732,854 x 2
   start   end
1 10001 10468
2 10469 11447
3 11504 11675
4 11678 11780
5 15265 15355
6 16713 16749
7 18907 19048
8 19948 20405
9 20531 20679
10 21949 22075

# ... with 732,844 more rows

table_base
# A tibble: 47,773 x 1
  toFind
   <dbl>

1 91194674
2 230560793
3 5937253
4 166958439
5 214170376
6 43831041
7 51034865
8 200011786
9 170490434
10 20960010

# ... with 47,763 more rows

My solution was:

library(dplyr)

# Tibble to collect the matches
results <- tibble()

# This is just to show a progress status
progress <- seq(1, nrow(table_target), 1000)

# Loop start!
for (i in 1:nrow(table_target)) {

  # Get the respective start and end values
  start_number <- as.numeric(table_target$start[i])
  end_number <- as.numeric(table_target$end[i])

  # Filter table_base$toFind with the start and end values
  searching <- table_base %>% filter(toFind >= start_number & toFind <= end_number)

  # Test the result and, if there is a match, add it to the results tibble
  if (nrow(searching) != 0) {
    results <- results %>% bind_rows(searching)
  }

  # Just show the progress
  if (i %in% progress) {
    cat("\014")
    cat("\n", round(i / nrow(table_target) * 100), "%")
  }

}

Thanks!

I would use the fuzzyjoin package.

library(fuzzyjoin)
DF1 = data.frame(A = c(0, 100, 200, 300), B = c(99, 199, 299, 399))
DF1
#>     A   B
#> 1   0  99
#> 2 100 199
#> 3 200 299
#> 4 300 399
DF2 <- data.frame(C = c(234, 19, 102))
DF2
#>     C
#> 1 234
#> 2  19
#> 3 102

DF3 <- fuzzy_inner_join(DF1, DF2, by = c("A" = "C", "B" = "C"), match_fun = list(`<=`, `>=`))
DF3
#>     A   B   C
#> 1   0  99  19
#> 2 100 199 102
#> 3 200 299 234

Created on 2019-08-30 by the reprex package (v0.2.1)


I REALLY love your suggestion, but it consumes a lot of memory and I can't run it.

test <- fuzzy_inner_join(table_base, table_target, by = c("toFind" = "start", "toFind" = "end"), match_fun = list(`<=`, `>=`))
Error: cannot allocate vector of size 151.1 Gb

You have the logical operators backwards; your code is filtering toFind <= start & toFind >= end.
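
For illustration, a corrected call might look like this (same by specification, operators flipped; just a sketch, not tested on your data):

test <- fuzzy_inner_join(
  table_base, table_target,
  by = c("toFind" = "start", "toFind" = "end"),
  match_fun = list(`>=`, `<=`)   # toFind >= start, toFind <= end
)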


Thanks! But it still consumes a lot of memory. I tried reducing the tables for a test, but it ended up slower than the loop...

test <- fuzzy_inner_join(as.data.frame(table_base), as.data.frame(table_target), by = c("toFind" = "end", "toFind" = "start"), match_fun = list(`<=`, `>=`))
Error: cannot allocate vector of size 151.1 Gb

Perhaps it does not matter, but I think the functions within match_fun should be within backticks.

match_fun = list(`<=`, `>=`)

Yes. Without the backticks, it's an error:

match_fun = list("<=", ">=")
Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...) :
could not find function "mf"

Before that, I had already tried everything with backticks.

Here's a script to do that:

The key is map_dfr, which loops over the toFind values and applies a function to each one.
The function tests whether that value lies between start and end for each range, and map_dfr returns a data frame with all the results combined by rows.

library(tidyverse)

DF1 = data.frame(start = runif(1000, min = 1000, max = 9999)) %>% 
  mutate(end = start + runif(1000, min = 100, max = 999))

     start       end
1 9364.815 10184.800
2 6560.811  7521.991
3 8422.298  9179.108
4 9088.313  9274.061
5 4865.341  5725.937
6 7239.143  7922.612


DF2 <- data.frame(toFind = runif(100, min = 1000, max = 9999))

    toFind
1 1370.765
2 6881.400
3 7265.235
4 5341.615
5 5405.194
6 6282.752

-- Mapping function --
Note that I flipped the test: toFind >= start_number is equivalent to start <= find (the filter below uses strict inequalities, so exact boundary matches are excluded).

find_targets <- function(find) {
  result <- DF1 %>% filter(start < find & end > find)
  result$toFind <- find
  return(result)
}
DF3 <- map_dfr(DF2$toFind, find_targets)

     start      end   toFind
1 1021.448 2018.737 1370.765
2 1315.437 1480.758 1370.765
3 1155.297 1512.022 1370.765
4 1270.936 1798.401 1370.765
5 1275.343 2127.847 1370.765
6 1090.721 1709.477 1370.765


I love this function, but for some reason it's giving an error:

find_targets <- function(find) {
  result <- table_target %>% filter(start < find & end > find)
  result$toFind <- find
  return(result)
}

test <- map_dfr(table_base$toFind, find_targets)

Error in `$<-.data.frame`(`*tmp*`, toFind, value = 91194674) :
  replacement has 1 row, data has 0

When I plugged in your numbers, 732,854 for target and 47,773 for base, my Mac ran out of space.

I changed the data.frame to data.table and it processed 1/10 of that (73,285 and 4,777), generating 30 million rows. So it may or may not work with your numbers. Maybe your actual numbers yield fewer combinations of base values between start and end. You could also split your data and run it in parallel on a larger machine.

Anyway, here's the data.table code:

library(tidyverse)
library(data.table)

DF1 <- data.table(start = runif(73285L, min = 1000, max = 9999))
DF1[, end := start + runif(1L, min = 100, max = 999)]

DF2 <- data.table(toFind = runif(4777L, min = 1000, max = 9999))

find_targets <- function(find) {
  result <- DF1[start < find & end > find]
  result$toFind <- find
  return(result)
}

DF3 <- map_dfr(DF2$toFind, find_targets)
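
As an aside, data.table can also express this kind of range lookup as a non-equi join, which avoids looping over the toFind values altogether. Here is a minimal sketch; ranges and finds are just small stand-ins for table_target and table_base, not your actual data:

library(data.table)

# stand-ins for table_target (ranges) and table_base (values to find)
ranges <- data.table(start = c(0, 100, 200, 300), end = c(99, 199, 299, 399))
finds  <- data.table(toFind = c(234, 19, 102))

# for each toFind, keep the ranges with start <= toFind and end >= toFind;
# x.start / x.end recover the range boundaries, i.toFind the searched value
res <- ranges[finds,
              on = .(start <= toFind, end >= toFind),
              nomatch = 0L,
              .(start = x.start, end = x.end, toFind = i.toFind)]

On data this size, a non-equi join is usually much faster than building the full cross product, though I haven't benchmarked it against your tables.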


It could be that there's no result, so adding a column to an empty result gives an error. You could add a test for an empty result before adding the $toFind column.


Thank you! I will work on these tips this weekend and will definitely report back to everyone!

So, from the beginning I was working with tibbles, so I had no problem with data.frame speed or memory consumption. I was able to solve:

Error in `$<-.data.frame`(`*tmp*`, toFind, value = 91194674) :
  replacement has 1 row, data has 0

by adding an "if" to prevent it from adding anything to the results when no match was found. I ran my data without memory issues, and it was at least 3 minutes faster than my original loop:

find_targets <- function(find) {
  result <- table_target %>% filter(start <= find & end >= find)

  if (nrow(result) > 0) {
    result$find <- find
  }
  return(result)
}

a <- Sys.time()
test <- map_dfr(table_base$toFind, find_targets)
print(Sys.time() - a)
Time difference of 7.743048 mins
nrow(test) #matches
[1] 7524

I will now try to put this into parallel processing to accelerate it further, although I have no idea yet how to do it!
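
In case it helps, here is a minimal sketch of one way to parallelise that map_dfr call with the furrr package (assuming furrr and future are installed; find_targets() is the function defined above):

library(future)
library(furrr)

plan(multisession)  # spread the work across several R sessions

a <- Sys.time()
test <- future_map_dfr(table_base$toFind, find_targets)
print(Sys.time() - a)

Whether this actually speeds things up depends on how expensive each filter is relative to the cost of shipping table_target to the worker sessions.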

I suggest a script for the question that can run under limited resources.

library(tidyverse)

df1 <- data.frame(start = seq(1, 7328540, 10), end = seq(9, 7328540, 10))
df2 <- data.frame(toFind = sample(seq(2, 7328540, 10), 47773))

# Keep every "toFind" row plus the "range" row that immediately precedes it
filterFun <- function(x) {
  ret <- c()
  for (i in seq_along(x[-1])) {
    if (x[i] == "range" && x[i + 1] == "toFind") {
      ret[i] <- TRUE
    } else if (x[i] == "toFind") {
      ret[i] <- TRUE
    } else {
      ret[i] <- FALSE
    }
  }
  ret[length(x)] <- ifelse(x[length(x)] == "toFind", TRUE, FALSE)
  ret
}

# Stack the ranges and the toFind values, sort them by start, keep each toFind
# together with the range just before it, then pair them up column-wise.
# (bind_cols() renames the duplicated columns; "start1" matches the dplyr
# version used here.)
df1 %>%
  mutate(type = "range") %>%
  bind_rows(df2 %>%
              rename(start = toFind) %>%
              mutate(end = NA_integer_, type = "toFind")) %>%
  arrange(start) %>%
  filter(filterFun(type)) %>%
  {
    bind_cols(filter(., type == "range"), filter(., type == "toFind"))
  } %>%
  rename(toFind = start1) %>%
  select(start, end, toFind) %>%
  filter(toFind <= end)

Hope the script can help.


Thank you, I will try it too!!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.