Filtering of cases in each dataframe within the nested tibble - nest


#1

I was wondering whether there is a way to filter cases in each data frame within a nested tibble. Say I have a tibble with a bunch of observations. I want to treat each sampling date independently and divide the observations for each date into training and test sets. A reprex looks like this:

# A tibble with four columns: Date, Sample, Training, value
suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'dplyr' was built under R version 3.5.1
df <- tibble (Date = c(rep("Day1", 10),
                         rep("Day2", 10),
                         rep("Day3", 10)),
              Sample = 1:30,
              Training = rep(c(rep("YES",3),
                             rep("NO",7)),3),
              value = rnorm(1:30))

df_nested <- df %>% nest (-Date)
df_nested
#> # A tibble: 3 x 2
#>   Date  data             
#>   <chr> <list>           
#> 1 Day1  <tibble [10 × 3]>
#> 2 Day2  <tibble [10 × 3]>
#> 3 Day3  <tibble [10 × 3]>

# My ideal output is to have the data split into training and real data, and have each of them as one list column

# desired output

df %>% filter (Training != "YES") %>% nest (-Date) -> df_test_data

df %>% filter (Training == "YES") %>% nest (-Date) %>% left_join (df_test_data, by = "Date")
#> # A tibble: 3 x 3
#>   Date  data.x           data.y          
#>   <chr> <list>           <list>          
#> 1 Day1  <tibble [3 × 3]> <tibble [7 × 3]>
#> 2 Day2  <tibble [3 × 3]> <tibble [7 × 3]>
#> 3 Day3  <tibble [3 × 3]> <tibble [7 × 3]>

# I was wondering if there is a way of doing that WITHIN the nested tibble - 

df_nested %>% map (data, split ( .,Training))
#> Warning in .f(.x[[i]], ...): data set '.x[[i]]' not found
#> Warning in .f(.x[[i]], ...): data set 'split(., Training)' not found
#> Warning in .f(.x[[i]], ...): data set '.x[[i]]' not found
#> Warning in .f(.x[[i]], ...): data set 'split(., Training)' not found
#> $Date
#> [1] ".x[[i]]"            "split(., Training)"
#> 
#> $data
#> [1] ".x[[i]]"            "split(., Training)"

df_nested %>% mutate (Training = map (data, filter(Training == "YES")))
#> Error in mutate_impl(.data, dots): Evaluation error: object 'Training' not found.

Created on 2018-08-22 by the reprex package (v0.2.0).

I think I am just missing something in the syntax of filter and map.


#2

You need to write anonymous functions in map with the ~ shorthand (roughly equivalent to function(.x)) and explicitly pass the data frame to filter as its first argument (usually unnecessary, because the pipe supplies that first parameter):

library(tidyverse)

df <- tibble(Date = c(rep("Day1", 10), rep("Day2", 10), rep("Day3", 10)),
             Sample = 1:30,
             Training = rep(c(rep("YES", 3), rep("NO", 7)), 3),
             value = rnorm(1:30))

df_nested <- df %>% nest(-Date)

df_nested %>% 
    mutate(train = map(data, ~filter(.x, Training == "YES")))
#> # A tibble: 3 x 3
#>   Date  data              train           
#>   <chr> <list>            <list>          
#> 1 Day1  <tibble [10 × 3]> <tibble [3 × 3]>
#> 2 Day2  <tibble [10 × 3]> <tibble [3 × 3]>
#> 3 Day3  <tibble [10 × 3]> <tibble [3 × 3]>

# equivalent but more verbose
df_nested %>% 
    mutate(train = map(data, function(.x){
        .x %>% filter(Training == "YES")
    }))
#> # A tibble: 3 x 3
#>   Date  data              train           
#>   <chr> <list>            <list>          
#> 1 Day1  <tibble [10 × 3]> <tibble [3 × 3]>
#> 2 Day2  <tibble [10 × 3]> <tibble [3 × 3]>
#> 3 Day3  <tibble [10 × 3]> <tibble [3 × 3]>
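
The split() attempt from the question can be repaired with the same ~ shorthand. The following is only a sketch, not necessarily the best approach: the column names parts, train and test are illustrative choices, and group_by() followed by nest() is used here on the assumption that it behaves the same as nest(-Date) across tidyr versions.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same shape as the example data: 3 training rows and 7 test rows per date
df <- tibble(Date = rep(c("Day1", "Day2", "Day3"), each = 10),
             Sample = 1:30,
             Training = rep(rep(c("YES", "NO"), times = c(3, 7)), times = 3),
             value = rnorm(30))

# One row per date, with the remaining columns nested in `data`
df_nested <- df %>%
    group_by(Date) %>%
    nest() %>%
    ungroup()

df_split <- df_nested %>%
    mutate(parts = map(data, ~split(.x, .x$Training)),  # list(NO = ..., YES = ...) per date
           train = map(parts, "YES"),                   # pull each element out by name
           test  = map(parts, "NO")) %>%
    select(-parts)

df_split
```

Inside the lambda, Training has to be referred to as .x$Training because split() is a base R function and does not look up column names in the data frame the way filter() does.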

You could do this for both testing and training sets, but a simpler option is to not nest the Training column, and spread the results to wide form:

df %>% 
    nest(-Date, -Training) %>% 
    spread(Training, data)
#> # A tibble: 3 x 3
#>   Date  NO               YES             
#>   <chr> <list>           <list>          
#> 1 Day1  <tibble [7 × 2]> <tibble [3 × 2]>
#> 2 Day2  <tibble [7 × 2]> <tibble [3 × 2]>
#> 3 Day3  <tibble [7 × 2]> <tibble [3 × 2]>

You'll probably want to change the YES and NO values to get better column names, but otherwise it will divide your data nicely.
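
As a rough sketch of that renaming, the Training values can be recoded before nesting so that the spread columns get more descriptive names. The labels train and test are just illustrative, and group_by() followed by nest() is assumed to be equivalent to nest(-Date, -Training) here.

```r
library(dplyr)
library(tidyr)

# Same shape as the example data: 3 training rows and 7 test rows per date
df <- tibble(Date = rep(c("Day1", "Day2", "Day3"), each = 10),
             Sample = 1:30,
             Training = rep(rep(c("YES", "NO"), times = c(3, 7)), times = 3),
             value = rnorm(30))

wide <- df %>%
    # Recode first, so the new labels become the column names after spread()
    mutate(Training = if_else(Training == "YES", "train", "test")) %>%
    group_by(Date, Training) %>%
    nest() %>%
    ungroup() %>%
    spread(Training, data)

wide
```

Newer tidyr releases also provide pivot_wider(), which can do the same reshaping in place of spread().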


#3

That's awesome! Thanks, alistaire!