Filter columns using purrr's map() and dplyr's filter()

Z3tt · March 8, 2018, 10:43am

Hi all,

Since yesterday I have been trying to find out why dplyr's filter() works when I name columns explicitly but not when I map over a vector of column names (see the example below).

What puzzles me most is that (a) select() works in the purrr context but filter() doesn't, and (b) no error is thrown; filter() simply returns empty tibbles every time. I tried several ways of unquoting the column name, using noquote() as well as bang-bang (!!), but still got empty tibbles. What am I missing?

Thank you very much and here is a small example with reproducible code:

library(tidyverse)

df <- tibble(a = 1:10, b = round(runif(10)), c = round(runif(10)))

## works
df %>% 
  dplyr::select(a, b) %>% 
  dplyr::filter(b == 1)
#> # A tibble: 3 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1  1.00
#> 2     7  1.00
#> 3    10  1.00

## works
map(c("b", "c"), function(x) df %>% 
  dplyr::select(a, x))
#> [[1]]
#> # A tibble: 10 x 2
#>        a     b
#>    <int> <dbl>
#>  1     1  1.00
#>  2     2  0   
#>  3     3  0   
#>  4     4  0   
#>  5     5  0   
#>  6     6  0   
#>  7     7  1.00
#>  8     8  0   
#>  9     9  0   
#> 10    10  1.00
#> 
#> [[2]]
#> # A tibble: 10 x 2
#>        a     c
#>    <int> <dbl>
#>  1     1  0   
#>  2     2  1.00
#>  3     3  1.00
#>  4     4  1.00
#>  5     5  1.00
#>  6     6  0   
#>  7     7  0   
#>  8     8  0   
#>  9     9  1.00
#> 10    10  1.00

## does not work (no error but returns empty tibbles)
map(c("b", "c"), function(x) df %>% 
      dplyr::select(a, x) %>% 
      dplyr::filter(x == 1))
#> [[1]]
#> # A tibble: 0 x 2
#> # ... with 2 variables: a <int>, b <dbl>
#> 
#> [[2]]
#> # A tibble: 0 x 2
#> # ... with 2 variables: a <int>, c <dbl>

Created on 2018-03-08 by the reprex package (v0.2.0).

danr · March 8, 2018, 11:13am

Thanks for including a reprex in your question.

What is it that you are trying to do? The examples you show that "work" each produce different results.

You have shown us some input, df; we also need to see the code you want to execute and the output you expect.

Quasiquotation is meant to be used on quosures, for example:

suppressPackageStartupMessages(library(tidyverse))

df <- tibble(a = 1:10, b = round(runif(10)), c = round(runif(10)))

# v is a quosure
v <- rlang::quo(c(a, b))
# !!! converts quosure to spliced symbols,
# i.e. symbols separated by commas
dplyr::select(df, !!!v)
#> # A tibble: 10 x 2
#>        a     b
#>    <int> <dbl>
#>  1     1    1.
#>  2     2    0.
#>  3     3    0.
#>  4     4    0.
#>  5     5    0.
#>  6     6    0.
#>  7     7    0.
#>  8     8    1.
#>  9     9    0.
#> 10    10    1.

v <- rlang::quo(c(b, c))
dplyr::select(df, !!!v)
#> # A tibble: 10 x 2
#>        b     c
#>    <dbl> <dbl>
#>  1    1.    1.
#>  2    0.    0.
#>  3    0.    0.
#>  4    0.    1.
#>  5    0.    1.
#>  6    0.    0.
#>  7    0.    1.
#>  8    1.    0.
#>  9    0.    1.
#> 10    1.    0.

# this fails because standard evaluation is
# used on the Collection, c(), arguments. This
# makes R look for the variables a and c in
# the current environment, but they don't exist
# there
dplyr::select(df, !!!c(a, c))
#> Error in quos(...): object 'a' not found

# typically you would use quosures in a function
# implementation

f <- function(df, columns) {
    v <- rlang::enquo(columns)
    dplyr::select(df, !!!v)
}

f(df, c(a, b))
#> # A tibble: 10 x 2
#>        a     b
#>    <int> <dbl>
#>  1     1    1.
#>  2     2    0.
#>  3     3    0.
#>  4     4    0.
#>  5     5    0.
#>  6     6    0.
#>  7     7    0.
#>  8     8    1.
#>  9     9    0.
#> 10    10    1.

Created on 2018-03-08 by the reprex package (v0.2.0).

Z3tt · March 8, 2018, 1:38pm

Hi danr,

Thanks for your feedback. For each column (b, c) I want a vector containing only the 1s. (Simply speaking; in the real application I just keep the row IDs of those rows, which works perfectly.)

The code I want to execute is the one that behaves unexpectedly, i.e. the one using map(). The results of the two "works" examples do indeed differ. The first one shows the result I expect in the end (two tibbles, each containing only 1s), and the second shows that select() works the way I coded it while filter() doesn't (i.e. it is an interim result).

Here I hope (a) to get the result I expect using map(), select() and filter(), and (b) to find an explanation of why filter() fails here.

I will dig into the use of quosures once I am back at the PC.

danr · March 8, 2018, 4:08pm

A prose description of what you want to do is not sufficient. You need to include a reprex that includes:

The input data.
The function you are trying to write, even if it doesn't work.
Usage of the function, even if it doesn't work.
The output you expect the function to produce.

...so that we can reproduce what you are trying to do.

Z3tt · March 8, 2018, 6:11pm

Dear danr,

Sorry, I don't really get what was missing from my reprex... But let me try to clarify my purpose: the input is the df I create at the beginning. What I am trying to write is not a function but a dplyr chain. Here's an updated reprex:

suppressPackageStartupMessages(library(tidyverse))

## create fake input data
df <- tibble(id = 1:10, a = round(runif(10)), b = round(runif(10)))

## filter columns a and b separately and 
## return the ids where the condition is fulfilled
## works
da <- df %>% 
  dplyr::select(id, a) %>% 
  dplyr::filter(a == 1) %>% 
  dplyr::select(id)

db <- df %>% 
  dplyr::select(id, b) %>% 
  dplyr::filter(b == 1) %>% 
  dplyr::select(id)

ids <- list(da, db)

## this is what I expect as result:
## a nested list, one tibble for each column
ids
#> [[1]]
#> # A tibble: 7 x 1
#>      id
#>   <int>
#> 1     1
#> 2     2
#> 3     4
#> 4     5
#> 5     7
#> 6     9
#> 7    10
#> 
#> [[2]]
#> # A tibble: 4 x 1
#>      id
#>   <int>
#> 1     1
#> 2     4
#> 3     5
#> 4     9


## the select function works without any problems when using map
## this is just to show that filter seems to be the problematic line of code
d <- map(c("a", "b"), function(x) df %>% 
  dplyr::select(id, x))

## now the whole chain using map()
ids_map <- map(c("a", "b"), function(x) df %>% 
  dplyr::select(id, x) %>% 
  dplyr::filter(x == 1) %>% 
  dplyr::select(id))

## filter returns empty tibbles now
## this is not what I expect
ids_map
#> [[1]]
#> # A tibble: 0 x 1
#> # ... with 1 variable: id <int>
#> 
#> [[2]]
#> # A tibble: 0 x 1
#> # ... with 1 variable: id <int>

Created on 2018-03-08 by the reprex package (v0.2.0).

The problem does not seem to be a quotation issue (but I might be wrong). Your suggestion using rlang::quo is very handy, but it doesn't help for filtering because the rows returned may differ in number and position from column to column. Getting one tibble with a column for each filter result instead of a nested list would also be fine, though.

I hope this helps!

danr · March 8, 2018, 6:27pm

Just a note on terminology: a dplyr "chain" just uses the infix function %>%.

So for example:

suppressPackageStartupMessages(library(tidyverse))

# function
f1 <- function(x, y) {
    x + y
}

# a "chain"  like this
1 %>% f1(2)
#> [1] 3

# is just a shorthand for calling the %>% function in the traditional way:
`%>%` (1, f1(2))
#> [1] 3

Created on 2018-03-08 by the reprex package (v0.2.0).

Everything you do in R is done with a function.

But it's still not clear to me what you are trying to do. If you just want to extract the columns from a tibble, map will do that:

suppressPackageStartupMessages(library(tidyverse))
df <- tibble(id = 1:10, a = round(runif(10)), b = round(runif(10)))
cols <- map(df, ~.)
cols
#> $id
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $a
#>  [1] 0 1 0 0 0 0 0 1 1 1
#> 
#> $b
#>  [1] 1 1 0 0 1 1 1 0 0 0

Created on 2018-03-08 by the reprex package (v0.2.0).

alistaire · March 8, 2018, 7:24pm

If you reshape to long form, you can filter on a single column instead of iterating over columns:

library(tidyverse)
set.seed(47)    # for reproducible sampling

df <- tibble(id = 1:10, 
             a = round(runif(10)), 
             b = round(runif(10)))

df2 <- df %>% 
    gather(letter, value, a:b) %>% 
    filter(value == 1) 

df2
#> # A tibble: 11 x 3
#>       id letter value
#>    <int> <chr>  <dbl>
#>  1     1 a         1.
#>  2     3 a         1.
#>  3     4 a         1.
#>  4     5 a         1.
#>  5     6 a         1.
#>  6     9 a         1.
#>  7    10 a         1.
#>  8     2 b         1.
#>  9     4 b         1.
#> 10     5 b         1.
#> 11     6 b         1.

Really, you probably want your data to stay in a data frame like this, but you can use split to make a list of vectors:

df2 %>% {split(.$id, .$letter)}
#> $a
#> [1]  1  3  4  5  6  9 10
#> 
#> $b
#> [1] 2 4 5 6

or data frames, depending on how you subset:

df2 %>% {split(.['id'], .$letter)}
#> $a
#> # A tibble: 7 x 1
#>      id
#>   <int>
#> 1     1
#> 2     3
#> 3     4
#> 4     5
#> 5     6
#> 6     9
#> 7    10
#> 
#> $b
#> # A tibble: 4 x 1
#>      id
#>   <int>
#> 1     2
#> 2     4
#> 3     5
#> 4     6

The braces are necessary so the data piped in does not get passed to the first parameter.

Another way to do the same thing is to use nest to make a list column (effectively grouped by letter), which you can then extract:

df2 %>% 
    nest(id) %>% 
    pull(data)
#> [[1]]
#> # A tibble: 7 x 1
#>      id
#>   <int>
#> 1     1
#> 2     3
#> 3     4
#> 4     5
#> 5     6
#> 6     9
#> 7    10
#> 
#> [[2]]
#> # A tibble: 4 x 1
#>      id
#>   <int>
#> 1     2
#> 2     4
#> 3     5
#> 4     6

Again, the data frame form is more useful in the long run.
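
For readers on a newer tidyr: gather() has been superseded by pivot_longer() since tidyr 1.0.0. The following is only a minimal sketch of the same reshape-then-filter idea, assuming tidyr >= 1.0.0 is installed; the object names long and ids_by_letter are made up for illustration and do not come from the code above.

```r
library(dplyr)
library(tidyr)

set.seed(47)  # same seed as above, only for reproducibility

df <- tibble(id = 1:10,
             a = round(runif(10)),
             b = round(runif(10)))

# reshape to long form: one row per (id, letter) pair,
# then keep only the rows whose value is 1
long <- df %>%
    pivot_longer(c(a, b), names_to = "letter", values_to = "value") %>%
    filter(value == 1)

# one vector of ids per original column
ids_by_letter <- split(long$id, long$letter)
ids_by_letter
```

pivot_longer() orders the rows by id rather than by column, so the long table looks different from the gather() result, but split() groups the ids by letter either way.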

Z3tt · March 8, 2018, 7:25pm

Dear danr,

The important step you miss here is the filter() command that just keeps IDs for rows containing a 1 and drops rows containing a 0.

You are returning the full columns. My question is why the select() command picks the columns as expected while filter() doesn't filter the correct rows when using it in a purrr context - and how to solve it.

Sorry for the sloppy use of the term "chain".

Best,

Cédric

danr · March 8, 2018, 7:50pm

filter is working as expected. To debug map you have to break it down into individual iterations to see what is going on.

# breaking map down into iterations
map(c("b", "c"), function(x) df %>% 
      dplyr::select(a, x) %>% 
      dplyr::filter(x == 1))

suppressPackageStartupMessages(library(tidyverse))

df <- tibble(a = 1:10, b = round(runif(10)), c = round(runif(10)))

# this is the first iteration
# first step of iteration

df1 <- df %>% dplyr::select(a, "b")
# produces tibble with columns a and b
df1
#> # A tibble: 10 x 2
#>        a     b
#>    <int> <dbl>
#>  1     1    0.
#>  2     2    1.
#>  3     3    1.
#>  4     4    1.
#>  5     5    1.
#>  6     6    1.
#>  7     7    1.
#>  8     8    1.
#>  9     9    0.
#> 10    10    0.
    
# second step of iteration
dplyr::filter(df1, "b" == 1)
#> # A tibble: 0 x 2
#> # ... with 2 variables: a <int>, b <dbl>

# filter doesn't find anything because the string "b"
# (not the column b) is compared with 1, and "b" is
# never equal to 1

Created on 2018-03-08 by the reprex package (v0.2.0).
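
One possible way to make the string refer to the column is the .data pronoun that dplyr makes available inside verbs such as filter(). This is only a minimal sketch, not necessarily the approach anyone in this thread had in mind: it assumes a dplyr version that provides .data and a tidyselect version that provides all_of() (1.0.0 or later), and the seed and the name res are arbitrary choices for illustration.

```r
library(dplyr)
library(purrr)

set.seed(47)  # arbitrary seed, only so the sketch is reproducible
df <- tibble(a = 1:10, b = round(runif(10)), c = round(runif(10)))

# x holds a column name as a string. .data[[x]] looks that column up
# in the data frame, so the comparison is made against the column's
# values instead of against the string itself.
res <- map(c("b", "c"), function(x) {
  df %>%
    select(a, all_of(x)) %>%    # all_of() marks x as a vector of names
    filter(.data[[x]] == 1)
})
res
```

Each list element should then contain only the rows where the respective column equals 1, instead of an empty tibble.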

Z3tt · March 8, 2018, 10:34pm

Dear alistaire,

This is a great trick/workaround! I still don't get why filter worked when "breaking it down into iterations" but not when using map(). The result of your approach is exactly what I want. I love gather() but never thought of using it in this case... Thank you very much!

alistaire · March 9, 2018, 2:37am

It's possible to get it to work that way, but since you're writing a function for a column name that's a variable, you'd have to write it in rlang syntax:

library(tidyverse)
set.seed(47)

df <- tibble(id = 1:10, 
             a = round(runif(10)), 
             b = round(runif(10)))

c('a', 'b') %>% 
    syms() %>% 
    map(~df %>% 
            filter(!!.x == 1) %>% 
            select(id))
#> [[1]]
#> # A tibble: 7 x 1
#>      id
#>   <int>
#> 1     1
#> 2     3
#> 3     4
#> 4     5
#> 5     6
#> 6     9
#> 7    10
#> 
#> [[2]]
#> # A tibble: 4 x 1
#>      id
#>   <int>
#> 1     2
#> 2     4
#> 3     5
#> 4     6

An alternative is to program in base R, which is simpler, as it doesn't have to deal with NSE:

map(c('a', 'b'), ~df[df[[.x]] == 1, 'id'])

Results are identical.

Z3tt · March 9, 2018, 9:28am

Thank you very much for the several ways to solve the problem. I will dig into them and see which one I like the most - all of them will provide the desired result!

And I definitely need to check out NSE in more detail!

Thank you all again!
