How to replace nested loops and conditions with purrr's map?

I have been thinking on how to replace nested loops with nested conditionals with map but without success. I'm aware of the discussions on SO (https://stackoverflow.com/questions/48847613/purrr-map-equivalent-of-nested-for-loop and https://stackoverflow.com/questions/52031380/replacing-the-for-loop-by-the-map-function-to-speed-up?noredirect=1&lq=1) but neither of these proved to be useful for my case.

I have two dataset with different lenghts. For downstream purposes I want to include a unique group id from one dataset to the other. However, one dataset contains data from time periods (df_1), the other is annual frequency (df_2).

My solution so far is to loop over both dataset (the nested loops are neccesary due to the difference in lenghts) check if the countries are the same and within those countries check if the annual data falls between a specific period. If yes, than add the group id to the df_2

My problem with the map approach (or *apply for that matter) is that I don't know how to express the nested loop and the conditions together.

For a mwe see below.

library(dplyr)
library(tidyr)


# data
df_1 <- tibble(
  start = rep(seq(1990, 1994, 4), each = 2),
  end = start + 4,
  countryname = rep(c("SWE", "NOR"), 2),
  group_id = rep(seq(1:2), each = 2)
)

df_2 <- tibble(
  year = rep(seq(1989, 1999, 1), each = 2),
  countryname = rep(c("SWE", "NOR"), 11),
  value = rep(seq(100, 110, 1), each = 2),
  group_id = NA_real_
)  




for (i in (1:nrow(df_1))) {
  for (j in (1:nrow(df_2))) {
    if (df_1[i, "countryname"] == df_2[j, "countryname"]) {
      if (df_2[j, "year"] >= df_1[i, "start"] & df_2[j, "year"] <= df_1[i, "end"]) {
        df_2[j, "group_id"] <- df_1[i, "group_id"]
      } 
    }
  }
}

Created on 2021-01-12 by the reprex package (v0.3.0)

I have a solution that doesn't do any looping or mapping. I take df_1 and expand it to make it longer and have a column for the year.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)


# data
df_1 <- tibble(
  start = rep(seq(1990, 1994, 4), each = 2),
  end = start + 4,
  countryname = rep(c("SWE", "NOR"), 2),
  group_id = rep(seq(1:2), each = 2)
)

df_2 <- tibble(
  year = rep(seq(1989, 1999, 1), each = 2),
  countryname = rep(c("SWE", "NOR"), 11),
  value = rep(seq(100, 110, 1), each = 2),
  group_id = NA_real_
)  


df_1_long <- df_1 %>%
  rowwise() %>%
  mutate(year=list(start:end)) %>%
  ungroup() %>%
  unnest(year) %>%
  select(-start, -end)

df_2_update <- df_2 %>%
  select(-group_id) %>%
  left_join(df_1_long, by=c("year", "countryname"))

df_2_update
#> # A tibble: 24 x 4
#>     year countryname value group_id
#>    <dbl> <chr>       <dbl>    <int>
#>  1  1989 SWE           100       NA
#>  2  1989 NOR           100       NA
#>  3  1990 SWE           101        1
#>  4  1990 NOR           101        1
#>  5  1991 SWE           102        1
#>  6  1991 NOR           102        1
#>  7  1992 SWE           103        1
#>  8  1992 NOR           103        1
#>  9  1993 SWE           104        1
#> 10  1993 NOR           104        1
#> # ... with 14 more rows

Created on 2021-01-12 by the reprex package (v0.3.0)

1 Like

I was also experimenting with joins, the problem is that on the cases where the periods overlap (one ends and the other begins) the join will duplicate rows. In this case, df_2_update has 24 rows (1994 duplicates) and the loop approach preserves row number.

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)


# data
df_1 <- tibble(
  start = rep(seq(1990, 1994, 4), each = 2),
  end = start + 4,
  countryname = rep(c("SWE", "NOR"), 2),
  group_id = rep(seq(1:2), each = 2)
)

df_2 <- tibble(
  year = rep(seq(1989, 1999, 1), each = 2),
  countryname = rep(c("SWE", "NOR"), 11),
  value = rep(seq(100, 110, 1), each = 2),
  group_id = NA_real_
)  



# original approach
for (i in (1:nrow(df_1))) {
  for (j in (1:nrow(df_2))) {
    if (df_1[i, "countryname"] == df_2[j, "countryname"]) {
      if (df_2[j, "year"] >= df_1[i, "start"] & df_2[j, "year"] <= df_1[i, "end"]) {
        df_2[j, "group_id"] <- df_1[i, "group_id"]
      } 
    }
  }
}


# proposed alternative
df_1_long <- df_1 %>%
  rowwise() %>%
  mutate(year=list(start:end)) %>%
  ungroup() %>%
  unnest(year) %>%
  select(-start, -end)

df_2_update <- df_2 %>%
  select(-group_id) %>%
  left_join(df_1_long, by=c("year", "countryname"))


# results
df_2 %>% 
  group_by(year) %>% 
  summarise(n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 11 x 2
#>     year `n()`
#>    <dbl> <int>
#>  1  1989     2
#>  2  1990     2
#>  3  1991     2
#>  4  1992     2
#>  5  1993     2
#>  6  1994     2
#>  7  1995     2
#>  8  1996     2
#>  9  1997     2
#> 10  1998     2
#> 11  1999     2

df_2_update %>% 
  group_by(year) %>% 
  summarise(n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 11 x 2
#>     year `n()`
#>    <dbl> <int>
#>  1  1989     2
#>  2  1990     2
#>  3  1991     2
#>  4  1992     2
#>  5  1993     2
#>  6  1994     4
#>  7  1995     2
#>  8  1996     2
#>  9  1997     2
#> 10  1998     2
#> 11  1999     2

Created on 2021-01-12 by the reprex package (v0.3.0)

the overlap can be addressed by adding a bit more to the df_1 processing, an additional group by and summarise

# proposed alternative
df_1_long <- df_1 %>%
  rowwise() %>%
  mutate(year=list(start:end)) %>%
  ungroup() %>%
  unnest(year) %>%
  select(-start, -end)  %>%
  group_by(countryname,year) %>%
  summarise(group_id = min(group_id))
1 Like

Thanks for the fix, and the initial approach to use joins!

Out of curiosity, how would one do this with map if at all?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.