Create a vector in a LOOP with more conditions

I have a problem a little complicate. I have two database BD and BD2 . For every row in BD I want to search in entire BD2 and obtain some info: Sum, Mean, Sd etc. With Sum I succeeded to make it work, but for Mean, Sd, Median I want to obtain the vector and after that apply these functions.

To be clearer, here is the code. For sum it worked. But I want now to save in a vector the values that met the conditions and after calculate Mean , Sd , Median . I tried to make the vector out of the base BD but nothing work or is something trickie and I can't figure it out.

for (i in 1:dim(BD)[1])
  
{
  
  for (j in 1:dim(BD2)[1])
    
  {
    
    if((BD$Start.Date[i]<BD2$X_TIMESTAMP[j]) && (BD2$X_TIMESTAMP[j]<BD$End.Date[i]) && (BD$Linea[i]==BD2$Linea[j]))
      
    { 
      vl = BD2$X_VALUE[j]
      BD$vec[i] = paste(BD$vec[i],vl,sep="")
      #vect = as.numeric(BD$vec[i])
      BD$Sum[i] = BD$Sum[i]+ BD2$X_VALUE[j]
      #BD$Average[i] = mean(vect)
      
      
    }
    
  }
  
}

I tried to create vec before the for sentence. But still it doesn't work. The error is always the same: Error in $<-.data.frame(*tmp*, "vec", value = list(NULL, NULL, NULL, : replacement has 47 rows, data has 530

EDIT: I added a photo with details. In fact, I don't necessarily want to retain the vector. I jus want to be able to calculate Mean, Sd etc.

Thank you

EDIT 2: For a reproducible example:
BD2

X_TIMESTAMP X_VALUE Linea
23/05/2020 05:40 106 1
23/05/2020 05:40 100 3
23/05/2020 05:41 82 1
23/05/2020 05:41 101 3
23/05/2020 05:42 109 1
23/05/2020 05:42 100 3
23/05/2020 05:43 104 1
23/05/2020 05:43 56 3
23/05/2020 05:44 104 1
23/05/2020 05:44 56 3
23/05/2020 05:45 56 1
23/05/2020 05:45 104 3
23/05/2020 05:46 0 1

BD

Start.Date End.Date Linea Vector Sum Mean
23/05/2020 05:40 23/05/2020 05:43 1 82;109 191 95.5
24/05/2020 05:41 24/05/2020 05:45 3 100;56;56 212 70.66

HI,

I think there are indeed some more elegant solutions to this issue, but you'll need to provide me a bit extra info. Can you show me a sample of both databases and tell me which columns are linked (foreign keys) and which values need to be calculated (i.e. provide an example of input and output).

To help us understand, read the reprex guide. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

Good luck,
PJ

Hi. I hope now it's ok. I added more info. Thank you!

One way to do this is to do a fuzzy join between BD and BD2.
I edited the date values in BD for linea = 3 to line up with values in BD2.

``` r
library(fuzzyjoin)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

BD <- read.csv("~/R/Play/BD.csv", stringsAsFactors = FALSE)  
BD2 <- read.csv("~/R/Play/BD2.csv", stringsAsFactors = FALSE)
BD$Start.Date <- dmy_hm(BD$Start.Date)
BD$End.Date <- dmy_hm(BD$End.Date)
BD2$X_TIMESTAMP <- dmy_hm(BD2$X_TIMESTAMP)
BDjoin <- fuzzy_inner_join(BD, BD2, 
                           by = c("Start.Date" = "X_TIMESTAMP", 
                                  "End.Date" = "X_TIMESTAMP",
                                  "Linea" = "Linea"),
                          match_fun = list(`<`, `>`, `==`))
BDjoin
#>            Start.Date            End.Date Linea.x         X_TIMESTAMP X_VALUE
#> 1 2020-05-23 05:40:00 2020-05-23 05:43:00       1 2020-05-23 05:41:00      82
#> 2 2020-05-23 05:40:00 2020-05-23 05:43:00       1 2020-05-23 05:42:00     109
#> 3 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:42:00     100
#> 4 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:43:00      56
#> 5 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:44:00      56
#>   Linea.y
#> 1       1
#> 2       1
#> 3       3
#> 4       3
#> 5       3
BDstats <- BDjoin %>% group_by(Start.Date, End.Date, Linea.x) %>% 
  summarize(SUM = sum(X_VALUE), MED = median(X_VALUE), MEAN = mean(X_VALUE), SD = sd(X_VALUE))
BDstats
#> # A tibble: 2 x 7
#> # Groups:   Start.Date, End.Date [2]
#>   Start.Date          End.Date            Linea.x   SUM   MED  MEAN    SD
#>   <dttm>              <dttm>                <int> <int> <dbl> <dbl> <dbl>
#> 1 2020-05-23 05:40:00 2020-05-23 05:43:00       1   191  95.5  95.5  19.1
#> 2 2020-05-23 05:41:00 2020-05-23 05:45:00       3   212  56    70.7  25.4

Created on 2020-07-30 by the reprex package (v0.3.0)

1 Like

Thank you very much. I succeeded to fix the for sentence but it takes a lot to run. I have in BD = 530 rows and BD2 = 700k. I tried your code, but I don't know why when I make group by it reduce my database at only 13 observation, even if the start.date it doesn't have a value repeated.

What is the result of

nrow(unique(BDjoin[, c("Start.Date", "End.Date", "Linea.x")]))

13L. Smth happened when I apllied dmy_hm(BD$Start.Date). In my original code I have BD$Start.Date = strptime(as.character(BD$Start.Date),format="%d.%m.%Y %H:%M") because in the original Database the date are like this: 14.05.2020 21:12. I tried with strptime but I have an error when I try to run fuzzy_inner_join. I think it's better to create a unique key in BD to be sure that nothing will reduce the size. Did you ever create smth like this? I don't know, a random no.. Thank you

Your code using strptime() should work fine. Use that if you prefer it over dmy_hm().

Is it possible that there really are only 13 cases where Start.Date, End.Date and Linea all meet the requirements? Any rows in BD that do not have matching data in BD2 will be dropped from BDjoin. Note that the match conditions are < and > not <= and >=.
Do you expect most rows in BD to have matching data in BD2? Can you find any rows in BD that do have matches in BD2 but do not appear in BDstats?

I think you are right. A last question, if you know please: I have for other database the conditions: Start.Data+2 min < X_TIMESTAMP < End.Date - 10 min. Do you know how to write this in match_fun in the fuzzy_inner_join sentence? Thank you.

I have never written my own functions for fuzzy joins but I think this will work.

TwoMin <- function(x, y) x + 120 < y
TenMin <- function (x, y) x - 600 > y
BDjoin <- fuzzy_inner_join(BD, BD2, 
                           by = c("Start.Date" = "X_TIMESTAMP", 
                                  "End.Date" = "X_TIMESTAMP",
                                  "Linea" = "Linea"),
                          match_fun = list(TwoMin, TenMin, `==`))

I just created others 2 columns, but your solution seems to be good also. Unfortunately, when I try to run fuzzy_inner_join with my original database (530 BD and 120k BD2) I get this error: Error: std::bad_alloc. I closed the R Session and open again, but still doesn't work.

Your system might not have enough memory. Will it run if you use just the first 10 or 100 rows of BD?

Oh, yes. For 1000 it ran. Meaning I had the entire database BD (530) and just 1000 from BD2

Since you are calculating by group in BD, it would be best to run subsets of BD and use all of BD2.

You are perfectly right. Thank you very much.