Create a vector in a LOOP with more conditions

Mariana136 · July 30, 2020, 10:22am

I have a problem a little complicate. I have two database BD and BD2 . For every row in BD I want to search in entire BD2 and obtain some info: Sum, Mean, Sd etc. With Sum I succeeded to make it work, but for Mean, Sd, Median I want to obtain the vector and after that apply these functions.

To be clearer, here is the code. For sum it worked. But I want now to save in a vector the values that met the conditions and after calculate Mean , Sd , Median . I tried to make the vector out of the base BD but nothing work or is something trickie and I can't figure it out.

for (i in 1:dim(BD)[1])
  
{
  
  for (j in 1:dim(BD2)[1])
    
  {
    
    if((BD$Start.Date[i]<BD2$X_TIMESTAMP[j]) && (BD2$X_TIMESTAMP[j]<BD$End.Date[i]) && (BD$Linea[i]==BD2$Linea[j]))
      
    { 
      vl = BD2$X_VALUE[j]
      BD$vec[i] = paste(BD$vec[i],vl,sep="")
      #vect = as.numeric(BD$vec[i])
      BD$Sum[i] = BD$Sum[i]+ BD2$X_VALUE[j]
      #BD$Average[i] = mean(vect)
      
      
    }
    
  }
  
}

I tried to create vec before the for sentence. But still it doesn't work. The error is always the same: Error in $<-.data.frame(*tmp*, "vec", value = list(NULL, NULL, NULL, : replacement has 47 rows, data has 530

EDIT: I added a photo with details. In fact, I don't necessarily want to retain the vector. I jus want to be able to calculate Mean, Sd etc.

Thank you

EDIT 2: For a reproducible example:
BD2

X_TIMESTAMP	X_VALUE	Linea
23/05/2020 05:40	106	1
23/05/2020 05:40	100	3
23/05/2020 05:41	82	1
23/05/2020 05:41	101	3
23/05/2020 05:42	109	1
23/05/2020 05:42	100	3
23/05/2020 05:43	104	1
23/05/2020 05:43	56	3
23/05/2020 05:44	104	1
23/05/2020 05:44	56	3
23/05/2020 05:45	56	1
23/05/2020 05:45	104	3
23/05/2020 05:46	0	1

BD

Start.Date	End.Date	Linea	Vector	Sum	Mean
23/05/2020 05:40	23/05/2020 05:43	1	82;109	191	95.5
24/05/2020 05:41	24/05/2020 05:45	3	100;56;56	212	70.66

pieterjanvc · July 30, 2020, 11:50am

HI,

I think there are indeed some more elegant solutions to this issue, but you'll need to provide me a bit extra info. Can you show me a sample of both databases and tell me which columns are linked (foreign keys) and which values need to be calculated (i.e. provide an example of input and output).

To help us understand, read the reprex guide. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Good luck,
PJ

Mariana136 · July 30, 2020, 12:43pm

Hi. I hope now it's ok. I added more info. Thank you!

FJCC · July 30, 2020, 1:46pm

One way to do this is to do a fuzzy join between BD and BD2.
I edited the date values in BD for linea = 3 to line up with values in BD2.

``` r
library(fuzzyjoin)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

BD <- read.csv("~/R/Play/BD.csv", stringsAsFactors = FALSE)  
BD2 <- read.csv("~/R/Play/BD2.csv", stringsAsFactors = FALSE)
BD$Start.Date <- dmy_hm(BD$Start.Date)
BD$End.Date <- dmy_hm(BD$End.Date)
BD2$X_TIMESTAMP <- dmy_hm(BD2$X_TIMESTAMP)
BDjoin <- fuzzy_inner_join(BD, BD2, 
                           by = c("Start.Date" = "X_TIMESTAMP", 
                                  "End.Date" = "X_TIMESTAMP",
                                  "Linea" = "Linea"),
                          match_fun = list(`<`, `>`, `==`))
BDjoin
#>            Start.Date            End.Date Linea.x         X_TIMESTAMP X_VALUE
#> 1 2020-05-23 05:40:00 2020-05-23 05:43:00       1 2020-05-23 05:41:00      82
#> 2 2020-05-23 05:40:00 2020-05-23 05:43:00       1 2020-05-23 05:42:00     109
#> 3 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:42:00     100
#> 4 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:43:00      56
#> 5 2020-05-23 05:41:00 2020-05-23 05:45:00       3 2020-05-23 05:44:00      56
#>   Linea.y
#> 1       1
#> 2       1
#> 3       3
#> 4       3
#> 5       3
BDstats <- BDjoin %>% group_by(Start.Date, End.Date, Linea.x) %>% 
  summarize(SUM = sum(X_VALUE), MED = median(X_VALUE), MEAN = mean(X_VALUE), SD = sd(X_VALUE))
BDstats
#> # A tibble: 2 x 7
#> # Groups:   Start.Date, End.Date [2]
#>   Start.Date          End.Date            Linea.x   SUM   MED  MEAN    SD
#>   <dttm>              <dttm>                <int> <int> <dbl> <dbl> <dbl>
#> 1 2020-05-23 05:40:00 2020-05-23 05:43:00       1   191  95.5  95.5  19.1
#> 2 2020-05-23 05:41:00 2020-05-23 05:45:00       3   212  56    70.7  25.4

^{Created on 2020-07-30 by the reprex package (v0.3.0)}

Mariana136 · July 30, 2020, 2:15pm

Thank you very much. I succeeded to fix the for sentence but it takes a lot to run. I have in BD = 530 rows and BD2 = 700k. I tried your code, but I don't know why when I make group by it reduce my database at only 13 observation, even if the start.date it doesn't have a value repeated.

FJCC · July 30, 2020, 2:21pm

What is the result of

nrow(unique(BDjoin[, c("Start.Date", "End.Date", "Linea.x")]))

Mariana136 · July 30, 2020, 2:34pm

13L. Smth happened when I apllied dmy_hm(BD$Start.Date). In my original code I have BD$Start.Date = strptime(as.character(BD$Start.Date),format="%d.%m.%Y %H:%M") because in the original Database the date are like this: 14.05.2020 21:12. I tried with strptime but I have an error when I try to run fuzzy_inner_join. I think it's better to create a unique key in BD to be sure that nothing will reduce the size. Did you ever create smth like this? I don't know, a random no.. Thank you

FJCC · July 30, 2020, 2:52pm

Your code using strptime() should work fine. Use that if you prefer it over dmy_hm().

Is it possible that there really are only 13 cases where Start.Date, End.Date and Linea all meet the requirements? Any rows in BD that do not have matching data in BD2 will be dropped from BDjoin. Note that the match conditions are < and > not <= and >=.
Do you expect most rows in BD to have matching data in BD2? Can you find any rows in BD that do have matches in BD2 but do not appear in BDstats?

Mariana136 · July 30, 2020, 4:09pm

I think you are right. A last question, if you know please: I have for other database the conditions: Start.Data+2 min < X_TIMESTAMP < End.Date - 10 min. Do you know how to write this in match_fun in the fuzzy_inner_join sentence? Thank you.

FJCC · July 30, 2020, 4:33pm

I have never written my own functions for fuzzy joins but I think this will work.

TwoMin <- function(x, y) x + 120 < y
TenMin <- function (x, y) x - 600 > y
BDjoin <- fuzzy_inner_join(BD, BD2, 
                           by = c("Start.Date" = "X_TIMESTAMP", 
                                  "End.Date" = "X_TIMESTAMP",
                                  "Linea" = "Linea"),
                          match_fun = list(TwoMin, TenMin, `==`))

Mariana136 · July 30, 2020, 4:50pm

I just created others 2 columns, but your solution seems to be good also. Unfortunately, when I try to run fuzzy_inner_join with my original database (530 BD and 120k BD2) I get this error: Error: std::bad_alloc. I closed the R Session and open again, but still doesn't work.

FJCC · July 30, 2020, 5:17pm

Your system might not have enough memory. Will it run if you use just the first 10 or 100 rows of BD?

Mariana136 · July 30, 2020, 5:20pm

Oh, yes. For 1000 it ran. Meaning I had the entire database BD (530) and just 1000 from BD2

FJCC · July 30, 2020, 5:25pm

Since you are calculating by group in BD, it would be best to run subsets of BD and use all of BD2.

Mariana136 · July 30, 2020, 5:33pm

You are perfectly right. Thank you very much.

system · August 20, 2020, 5:33pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.