Map a custom function onto a list with a split

Piranha · January 30, 2018, 9:57pm

Hello. This is my first post here, so hopefully I am doing this correctly. I am a self-taught R learner and have been using the tidyverse for a few months.

I am having trouble applying a custom function to a list created with split(). My data is very similar to the modified version of the gapminder data shown below.

Here is the data:

library(gapminder)
library(tidyverse)


df <- gapminder %>%
      filter(country %in% c("Afghanistan", "Belgium", "Cameroon")) %>%
      select(year, country, lifeExp, gdpPercap, pop) %>%
      # df does not exist yet inside its own definition, so set the constants directly
      mutate(month = "01", day = "01") %>%
      mutate(year = as.character(year)) %>%
      unite(col = "date", year, month, day, sep = "-") %>%
      mutate(date = as.Date(date)) %>%
      mutate(country = as.character(country))

df_split <- split(df, df$country)

I want to apply the following user-defined function to the data in each element of the split:

PctChange <- function(x){
  ((x - lag(x,1)) / lag(x,1)) * 100
}

I was able to figure out how to apply the function to a single element of the split:

Afg_pct <- modify_if(df_split$Afghanistan, is.numeric, PctChange)


But I am really struggling to figure out how to apply this to each and every level of split in the data.
Due to the time-series nature of the data, it seems to me that this kind of a split/nested/grouped list should theoretically be useful. However, I am really stuck here.

Is there a simple solution to this that I am not thinking of?

Alternatively, should I look into reshaping the data so that the calculations are easier?

Any help would be hugely appreciated!! Thanks in advance.

cderv · January 30, 2018, 10:30pm

Hi,

Here is a solution that uses nested data frames, which tidyr (included in the tidyverse) makes possible.
With nested data, you get list-columns that you can iterate over with purrr's functions.

library(gapminder)
library(tidyverse)


df<- gapminder %>%
  filter(country %in% c("Afghanistan", "Belgium", "Cameroon")) %>%
  select(year, country, lifeExp, gdpPercap, pop) %>%
  mutate(date = as.Date(paste(year, "01", "01", sep = "-"))) %>%
  mutate(country = as.character(country)) %>%
  select(country, date, everything())

PctChange <- function(x){
  ((x - lag(x,1)) / lag(x,1)) * 100
}

# you can work with nested tibble
df2 <- df%>%
  nest(-country) %>%
  mutate(data_pctchange = map(data, ~ mutate_if(.x, is.numeric, PctChange)))

# You get your original data and the new one, results of mutate_if
df2
#> # A tibble: 3 x 3
#>   country     data              data_pctchange   
#>   <chr>       <list>            <list>           
#> 1 Afghanistan <tibble [12 x 5]> <tibble [12 x 5]>
#> 2 Belgium     <tibble [12 x 5]> <tibble [12 x 5]>
#> 3 Cameroon    <tibble [12 x 5]> <tibble [12 x 5]>

# You can get one or the other using unnest
df2 %>%
  unnest(data_pctchange)
#> # A tibble: 36 x 6
#>    country     date         year lifeExp gdpPercap    pop
#>    <chr>       <date>      <dbl>   <dbl>     <dbl>  <dbl>
#>  1 Afghanistan 1952-01-01 NA      NA         NA     NA   
#>  2 Afghanistan 1957-01-01  0.256   5.32       5.31   9.68
#>  3 Afghanistan 1962-01-01  0.255   5.49       3.93  11.1 
#>  4 Afghanistan 1967-01-01  0.255   6.32     - 1.98  12.4 
#>  5 Afghanistan 1972-01-01  0.254   6.08     -11.5   13.4 
#>  6 Afghanistan 1977-01-01  0.254   6.51       6.23  13.8 
#>  7 Afghanistan 1982-01-01  0.253   3.68      24.4  -13.4 
#>  8 Afghanistan 1987-01-01  0.252   2.43     -12.8    7.66
#>  9 Afghanistan 1992-01-01  0.252   2.09     -23.8   17.7 
#> 10 Afghanistan 1997-01-01  0.251   0.214    - 2.16  36.2 
#> # ... with 26 more rows

Created on 2018-01-30 by the reprex package (v0.1.1.9000).
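For readers who want to keep the split() list from the original question, the per-country step can also be mapped over every element of the list. The following is only a minimal sketch under stated assumptions: `toy` is hypothetical stand-in data (not the gapminder subset), and `dplyr::lag` is named explicitly so that `stats::lag` is never picked up by accident.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the filtered gapminder data
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  lifeExp = c(50, 55, 66, 70, 77, 77)
)

PctChange <- function(x) {
  ((x - dplyr::lag(x, 1)) / dplyr::lag(x, 1)) * 100
}

# One data frame per country, as in the question
toy_split <- split(toy, toy$country)

# Apply the single-country step from the question to every element of the list
toy_pct <- map(toy_split, ~ modify_if(.x, is.numeric, PctChange))

# Optionally stack the pieces back into one data frame
toy_pct_df <- bind_rows(toy_pct)
```

The result `toy_pct` is still a named list with one tibble per country, so it fits a workflow that needs the pieces kept separate; `bind_rows()` is only needed if a single table is wanted at the end.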

Piranha · January 30, 2018, 10:39pm

Oh yay! This looks like a great solution! I will play around with this for a bit and see if it works with the rest of my workflow.

mfherman · January 30, 2018, 11:00pm

If you don't need to split your data or use list-columns for other reasons in your analysis, another option is using the simpler group_by(country):

  library(gapminder)
  library(tidyverse)
  
  PctChange <- function(x){
    ((x - lag(x,1)) / lag(x,1)) * 100
  }
  
  df <- gapminder %>%
    filter(country %in% c("Afghanistan", "Belgium", "Cameroon")) %>%
    select(year, country, lifeExp, gdpPercap, pop) %>%
    mutate(date = as.Date(paste(year, "01", "01", sep = "-"))) %>%
    mutate(country = as.character(country)) %>%
    select(country, date, everything())
  
  df %>%
    group_by(country) %>%
    mutate_if(is.numeric, PctChange)
#> # A tibble: 36 x 6
#> # Groups:   country [3]
#>    country     date         year lifeExp gdpPercap    pop
#>    <chr>       <date>      <dbl>   <dbl>     <dbl>  <dbl>
#>  1 Afghanistan 1952-01-01 NA      NA         NA     NA   
#>  2 Afghanistan 1957-01-01  0.256   5.32       5.31   9.68
#>  3 Afghanistan 1962-01-01  0.255   5.49       3.93  11.1 
#>  4 Afghanistan 1967-01-01  0.255   6.32     - 1.98  12.4 
#>  5 Afghanistan 1972-01-01  0.254   6.08     -11.5   13.4 
#>  6 Afghanistan 1977-01-01  0.254   6.51       6.23  13.8 
#>  7 Afghanistan 1982-01-01  0.253   3.68      24.4  -13.4 
#>  8 Afghanistan 1987-01-01  0.252   2.43     -12.8    7.66
#>  9 Afghanistan 1992-01-01  0.252   2.09     -23.8   17.7 
#> 10 Afghanistan 1997-01-01  0.251   0.214    - 2.16  36.2 
#> # ... with 26 more rows

Created on 2018-01-30 by the reprex package (v0.1.1.9000).
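Note that in the printed output above, year is also turned into a percent change, because it is a numeric column. As a hedged sketch, assuming dplyr 1.0.0 or later (where `across()` supersedes `mutate_if()`), the same grouped calculation can skip year explicitly. The `toy` data here is hypothetical and only stands in for the gapminder subset.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered gapminder data
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952L, 1957L, 1962L), times = 2),
  lifeExp = c(50, 55, 66, 70, 77, 77)
)

PctChange <- function(x) {
  ((x - dplyr::lag(x, 1)) / dplyr::lag(x, 1)) * 100
}

toy_pct <- toy %>%
  group_by(country) %>%
  # Transform every numeric column except year; the grouping column is skipped automatically
  mutate(across(where(is.numeric) & !year, PctChange)) %>%
  ungroup()
```

Here year keeps its original values while lifeExp becomes the within-country percent change, with NA in the first row of each country.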

Piranha · January 31, 2018, 2:19pm

Thank you! This definitely solves my problem. I had tried group_by before, but for some reason, I could not get it to work. I must have made a mistake somewhere in the pipe process.

cderv · January 31, 2018, 3:01pm

group_by is clearly the right solution in this case.
I showed you nested tibbles as another option for split-and-combine operations, with the advantage of staying within a tidy data workflow.
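To illustrate why the grouping matters for a lag-based calculation, here is a small sketch on hypothetical toy data (`toy` is an assumption, not the gapminder subset). Without `group_by()`, the first row of one country is compared with the last row of the previous one.

```r
library(dplyr)

# Hypothetical toy data: two countries stacked in one table
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  lifeExp = c(50, 55, 66, 70, 77, 77)
)

PctChange <- function(x) {
  ((x - dplyr::lag(x, 1)) / dplyr::lag(x, 1)) * 100
}

# Ungrouped: the first row of country B is compared with the last row of A
ungrouped <- toy %>%
  mutate(pct = PctChange(lifeExp))

# Grouped: lag() restarts within each country
grouped <- toy %>%
  group_by(country) %>%
  mutate(pct = PctChange(lifeExp)) %>%
  ungroup()

is.na(ungrouped$pct)  # only the very first row is NA
is.na(grouped$pct)    # the first row of every country is NA
```

Checking where the NA values fall is a quick way to confirm that a grouped pipe is behaving as intended.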

I hope these solutions fit into your workflow without difficulty!

And thanks a lot for your clean example code! A reprex like yours makes it much easier to help you. Thanks.