Mean value of nearest neighbours

Rapha · September 1, 2022, 1:10pm

Hey there,

I am a little confused and could use some help. I have a dataset similar to my reprex, where id is the id of a district and nb4 is a list of the ids of its four nearest neighbours.

library(tidyverse)

nb = list()
for (i in 1:10) {
  nb[[i]] = sample.int(10,4)
}

repex <- tibble(id = seq(1,10,1),
                value = runif(10, min = 100, max = 200),
                nb4 = nb)

Now I want to determine the mean value of the nearest neighbours for each observation.
In my head it looks something like this, but it only works individually.

# does not work
repex %>% 
  mutate(mean_nb4 = mean(repex %>% filter(id %in% repex$nb4[id]) %>% pull(value)))

# does work individually
mean(repex %>% filter(id %in% repex$nb4[[1]]) %>% pull(value))
mean(repex %>% filter(id %in% repex$nb4[[2]]) %>% pull(value))

Thanks a lot in advance

dvetsch75 · September 1, 2022, 1:56pm

Here is one really easy way to do things, and one less intuitive way. I've made the reprex data frame much bigger so that differences in the benchmarks are easier to see, but as you can see, grouping rowwise and unnesting the list are pretty close in terms of speed.

library(tidyverse)
library(microbenchmark)

nb = list()
for (i in 1:1000000) {
    nb[[i]] = sample.int(10,4)
}

reprex_df <- tibble(id = seq(1,1000000,1),
                    value = runif(1000000, min = 100, max = 200),
                    nb4 = nb)


# Method 1
microbenchmark(times = 25, {
    reprex_df %>% 
        rowwise() %>% 
        mutate(
            mean_nb4 = mean(nb4)
        )
})
#> Unit: seconds
#>                                                              expr      min
#>  {     reprex_df %>% rowwise() %>% mutate(mean_nb4 = mean(nb4)) } 10.42397
#>        lq     mean   median       uq      max neval
#>  11.04515 11.80023 11.58646 12.03795 15.98523    25

# Method 2
microbenchmark(times = 25, {
    reprex_df %>% 
        group_by(id, value) %>% 
        unnest(nb4) %>% 
        summarize(mean_nb4 = mean(nb4))
})

#> Unit: seconds
#>                                                                                           expr
#>  {     reprex_df %>% group_by(id, value) %>% unnest(nb4) %>% summarize(mean_nb4 = mean(nb4)) }
#>       min       lq     mean   median       uq      max neval
#>  11.16424 11.32398 12.03699 11.75246 12.13938 14.63995    25

Created on 2022-09-01 by the reprex package (v1.0.0)

Rapha · September 1, 2022, 2:22pm

Thanks @dvetsch75,

but nb4 gives the id of the 4 nearest neighbors, not their value.

I managed to solve my problem with a loop, but I am not happy with this approach. Any idea how to achieve this the tidy way?

mean.nb4 <- vector()
for (i in repex$id) {
  mean.nb4 <- c(mean.nb4,mean(repex %>% filter(id %in% repex$nb4[[i]]) %>% pull(value)))
}

dvetsch75 · September 1, 2022, 3:21pm

Ahhh my fault - I misunderstood the problem. This is a bit easier - just unnest your list column, join the dataframe to itself (I subset the columns to make things a bit cleaner) by the ids, then you can just summarize for the mean.

library(tidyverse)
nb = list()
for (i in 1:10) {
    nb[[i]] = sample.int(10,4)
}

reprex_df <- tibble(id = seq(1,10,1),
                    value = runif(10, min = 100, max = 200),
                    nb4 = nb)


reprex_df %>% 
    group_by(id, value) %>% 
    unnest(nb4) %>% 
    inner_join(
        reprex_df %>% 
            select(
                id,
                value
            ),
        by = c('nb4' = 'id')
    ) %>% 
    group_by(id, value.x) %>% 
    summarize(
        mean_nn_val = mean(value.y)
    )
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 10 x 3
#> # Groups:   id [10]
#>       id value.x mean_nn_val
#>    <dbl>   <dbl>       <dbl>
#>  1     1    125.        145.
#>  2     2    114.        139.
#>  3     3    135.        157.
#>  4     4    153.        156.
#>  5     5    147.        144.
#>  6     6    121.        134.
#>  7     7    190.        146.
#>  8     8    125.        131.
#>  9     9    172.        130.
#> 10    10    165.        134.

Created on 2022-09-01 by the reprex package (v1.0.0)
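
If the self-join feels heavy, a row-by-row lookup might also work. This is only a minimal sketch, not something from the thread: it assumes the ids in nb4 all appear in the id column, the seed is arbitrary, and mean_nn_val just reuses the column name from the join version. Inside mutate(), the function passed to map_dbl() sees the full id and value columns, so match() can translate each neighbour id into a row position.

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumed seed, only to make the sketch reproducible

# Same shape as the example data: ten districts, four neighbour ids each
reprex_df <- tibble(
    id    = 1:10,
    value = runif(10, min = 100, max = 200),
    nb4   = map(1:10, function(i) sample.int(10, 4))
)

# For each row, look up the neighbours' values by id and average them
result <- reprex_df %>%
    mutate(
        mean_nn_val = map_dbl(nb4, function(ids) mean(value[match(ids, id)]))
    )

result
```

This keeps every other column and the nb4 list column untouched, so no unnest/nest round trip is needed.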

Rapha · September 5, 2022, 2:09pm

Thanks @dvetsch75, that helped a lot. I only changed summarize to mutate to keep the other variables in the dataset. Since I only have one value for each id, I skipped the group_by(value).

repex <- repex %>% 
       group_by(id) %>% 
       tidyr::unnest(nb4) %>% 
       inner_join(repex %>% select(id, value), by = c('nb4' = 'id')) %>%
       mutate(mean_value_nb4 = mean(value.y)) %>% 
       select(-value.y) %>% 
       rename(value = value.x) %>% 
       tidyr::nest(nb4 = nb4) %>% 
       distinct()
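
A quick way to gain some confidence in the join-based version is to compare it against the explicit loop on the same data. The following is a hedged sketch under a few assumptions: the seed is arbitrary, the names loop_means and join_means are made up for illustration, and the suffix "_nb" is just one way to tell the neighbour's value apart from the district's own.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumed seed, only for reproducibility

repex <- tibble(
    id    = 1:10,
    value = runif(10, min = 100, max = 200),
    nb4   = lapply(1:10, function(i) sample.int(10, 4))
)

# Reference result: one mean per row, computed as in the loop
loop_means <- vapply(
    repex$nb4,
    function(ids) mean(repex$value[repex$id %in% ids]),
    numeric(1)
)

# Join-based result: one row per id, ordered by id
join_means <- repex %>%
    unnest(nb4) %>%
    inner_join(select(repex, id, value), by = c("nb4" = "id"),
               suffix = c("", "_nb")) %>%
    group_by(id) %>%
    summarize(mean_value_nb4 = mean(value_nb))

# Stops with an error if the two approaches disagree
stopifnot(isTRUE(all.equal(loop_means, join_means$mean_value_nb4)))
```

The comparison relies on the ids being sorted and unique, which holds for this toy data but may not for a real dataset.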
