Rows selection using specific coordinates-based condition

In my created dataset:

# Packages
library(tidyverse)
library(DBI)

# Create myds
ID<-c(1,2,3,4,5)
x<-c(-51.6422,-51.6351,-51.6351,-51.6351,-51.6422)
y<-c(-30.1412,-30.1206,-30.1206,-30.1207,-30.1412)
date<-c("2021-11-01","2021-11-01","2021-11-15", "2021-11-15","2021-11-15")
myds<-data.frame(ID=ID,x=x,y=y,date=as.Date(date))
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "myds", myds)

# Take the table created inside `RSQLite::SQLite()` 
dataset <- dplyr::tbl(con, "myds")

# Rows selection by the rule
sel_vars <- dataset %>% dplyr::??? %>% collect()

??? Here I'd like to make the rows selection and create a new object (sel_vars) with two conditions:

  1. Take the row just only if x and y with no repetition (duplicates) in coordinates, but when I have coincident or duplicates coordinates:
  2. I take the coincident coordinates too just only if I have 10 meters neighbourhood row increase in any direction eg. horizontally, vertically, or diagonally (0.00015 or -0.00015 unit in x or y, like chess king's movement) during the time (date).

My desirable output is:

    sel_vars
        ID x       y        date
    [1] 2 -51.6351 -30.1206 2021-11-01
    [2] 3 -51.6351 -30.1206 2021-11-15
    [3] 4 -51.6351 -30.1207 2021-11-15

Please, any help with it?

Hello there,

I don't really get your problem. What, in that dataset do you count as a duplicate? Are these multiple entries per date or multiple dates per x/y combination? If the former, I don't quite get how you reach your desired output using the rules you described above. Could you clarify what you mean?

However, I have the feeling, you may have use for the following:

# offset a variable by a given lag, use inside mutate
dplyr::lag()

# with this you can calculate differences between the "old" and new x and y coordniate
dataset %>% 
  dplyr::mutate(x_lag = lag(x,1),
                y_lag = lag(y,1)) %>% 
  dplyr::mutate(dx = abs(x-x_lag),
                dy = abs(y-y_lag)) %>%

# you can filter the rows that statisfy your 2. conditions then using if_any()
 
  dplyr::filter(if_any(c("dx","dy"), ~.x > 0.00015)) %>%
  collect()

Maybe this will already help you, although I don't think that's exactly what you wanted.

Best,
Valentin

Thanks very much @valentingar , in my case duplicate, is the coordinate that appears again in date and date+1. But I select this coordinate too if in date+1 it appears a neighbourhood coordinate and neighbour is a new coordinate at the distance 0.00015 or -0.00015 unit in x or y in any direction.
In my desirable output sel_vars in the example, I select ID =2 and ID=3 , despite the same coordinate between two dates, because in the date 2021-11-15 appears a ID= 4 neighbour. In inverse, I don't select ID= 1 and ID =5 because I have the same coordinate and no neighbour coordinate.
And your solution don't solve yet:

dataset %>% 
  dplyr::mutate(x_lag = lag(x,1),
                y_lag = lag(y,1)) %>% 
  dplyr::mutate(dx = abs(x-x_lag),
                dy = abs(y-y_lag)) %>% 
  dplyr::filter(if_any(c("dx","dy"), ~.x > 0.00015)) %>%
  collect()
# A tibble: 2 x 8
# Groups:   date [2]
     ID     x     y  date x_lag y_lag      dx     dy
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>
1     2 -51.6 -30.1 18932 -51.6 -30.1 0.00710 0.0206
2     5 -51.6 -30.1 18946 -51.6 -30.1 0.00710 0.0205

and not:

  ID          x          y    date
  <dbl> <dbl> <dbl> <dbl> 
    1 2 -51.6351 -30.1206 2021-11-01
    2 3 -51.6351 -30.1206 2021-11-15
    3 4 -51.6351 -30.1207 2021-11-15

Yes, i think I got it.
Basically, you want the following rows:

  1. all rows at date where there is no exact coordinate match at date + 1
  2. all rows that do not satisfy the 1. condition but have a "neighbor" at date + 1. A neighbor is another coordinate at date + 1 that has an x or y coordinate with an absolute difference of less than 0.00015.

Is this correct?

If yes, then this is not a very straight-forward filter(), I believe. I am going to give it a try, though I am not sure I will be able to help here.

Okay I gave this some thought and got to this not very great but working(?) solution. There is definitely not dplyr-like function for this and this is the best I got so far. Notice, that you will have to collect() everything, because this is non-standard filtering.

Basically we define two functions exists_duplicate() and exists_neighbor(), that check for a given x, y and date if ther is a duplicate or a neighbor in the next or current date. We then apply this function rowwise to each row of your dataset. Note that . in the respective function call passes the whole dataset to the function that is then filtered to the correct date to check for neighbors/duplicates. This also means, that this will be not very efficient so that I doubt you can easily run this on very large datasets.

I am also not completely sure if this solves your problem correctly but might get you started.

Best,
Valentin

# Rows selection by the rule

# function for duplicate detection
exists_duplicate <- function(x,y,date,dates,df,dir = 1){
  
  #dir is either -1 for the previous, or 1 for the next date
  if (!dir %in% c(-1,1)){
    stop("dir must be -1 or 1")
  }
  
  # choose either next or previous date depending on dir
  next_date <- dates[which(date == dates) + dir] 
  
  # exception for dir == -1 and date == min(dates)
  if (length(next_date) == 0){
    return(NA)
  } 
  # exception for dir == 1 and date == max(dates)
  if (is.na(next_date)){
    return(NA)
  }
  
  n_dupl <-
  df %>% 
    filter(date == !!next_date,
           x == !!x,
           y == !!y) %>%
    nrow()
  
  n_dupl > 0 # if there are any duplicates, returns TRUE
}

# function for neighbor detection
exists_neighbor <- function(x,y,date,dates,df,dir = 1){
  
  
  # dir is either 1 for the next date or 0 for the current date
  if(!(dir %in% c(1,0))){
    stop("dir must be 1 or 0")
  }
  
  next_date <- dates[which(date == dates) + dir] 
  
  # exception for dir == 1 and date == max(dates)
  if (is.na(next_date)){
    return(NA)
  }
  
  n_neighbor <-
    df %>% 
    filter(date == !!next_date,
           abs(x - !! x) <= 0.00015,
           abs(y - !! y) <= 0.00015,
           !(x == !!x & y == !!y)
    ) %>%
    nrow()
  
  n_neighbor > 0  # if there are any neighbors, returns TRUE
}

data_collected <- dataset %>% 
  collect() 

dates <- dataset %>%
  pull(date) %>%
  unique() %>%
  sort()

data_neighbor <- 
  data_collected %>%
  rowwise() %>%
  mutate(has_duplicate = exists_duplicate(x, y, date, !!dates, ., dir = 1)) %>%
  mutate(has_neighbor =   exists_neighbor(x, y, date, !!dates, ., dir = 1)) %>%
  mutate(is_duplicate =  exists_duplicate(x, y, date, !!dates, ., dir = -1)) %>%
  mutate(is_neighbor =    exists_neighbor(x, y, date, !!dates, ., dir = 0))

data_neighbor

data_neighbor %>%
  filter(has_duplicate == FALSE | # if there are no duplicates
         has_neighbor == TRUE |   # if there is a neighbor at date == date + 1
         is_neighbor == TRUE      # if there is a neighbor at date == date
         
  )

1 Like

Thank you very much @valentingar, you solve a big problem with a nice solution!!!!

You're welcome. But do check the results carefully, as I am not 100 % sure this is what you wanted. Just realised there was a mistake in my reply as well :sweat_smile::

  mutate(is_duplicate =  exists_duplicate(x, y, date, !!dates, ., dir = 1))

should be

  mutate(is_duplicate =  exists_duplicate(x, y, date, !!dates, ., dir = -1))

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.