Select columns based on criteria calculated across groups

Dong · April 13, 2018, 8:41pm

Hi all,

I am trying to select the columns that meet certain criteria for its subset. For example, I want to pick out column from p1, p2 where grp 0 min is larger than the overall median:

library(tidyverse)

df <- tribble(
  ~ID, ~grp, ~p1, ~p2,
  1,  0,  1,  3,
  2,  0,  2,  4,
  3,  1,  3,  2,
  4,  1,  3,  2,
  5,  1,  3,  1
)

df %>%
  summarise_at(vars(p1:p2), funs(min(.[grp==0]), median) )

df %>%
summarise_at(vars(p1:p2), funs(min(.[grp==0]) > median(.)) )

the results below

> df %>%
+   summarise_at(vars(p1:p2), funs(min(.[grp==0]), median) )
# A tibble: 1 x 4
  p1_min p2_min p1_median p2_median
   <dbl>  <dbl>     <dbl>     <dbl>
1     1.     3.        3.        2.
> df %>%
+ summarise_at(vars(p1:p2), funs(min(.[grp==0]) > median(.)) )
# A tibble: 1 x 2
  p1    p2   
  <lgl> <lgl>
1 FALSE TRUE

what I want to only keep the columns that is TRUE, i.e.,

> df %>% 
+ select(ID, grp, p2)
# A tibble: 5 x 3
     ID   grp    p2
  <dbl> <dbl> <dbl>
1    1.    0.    3.
2    2.    0.    4.
3    3.    1.    2.
4    4.    1.    2.
5    5.    1.    1.

Any suggestion how to do this? I was thinking select_if, but don't know how to limit the column range to p1:p2

cderv · April 14, 2018, 4:09pm

Keeping the two steps process of identifying column then selecting. Here are two solutions. The second one seems better to me.

library(tidyverse)

df <- tribble(
  ~ID, ~grp, ~p1, ~p2,
  1,  0,  1,  3,
  2,  0,  2,  4,
  3,  1,  3,  2,
  4,  1,  3,  2,
  5,  1,  3,  1
)

# Create a logical vector of columns to keep
to_keep <- df %>%
  summarise_at(vars(p1:p2), funs(min(.[grp==0]) > median(.))) %>%
  as_vector()

# Create a logical vector all TRUE named after each column
selected_vars <- !vector("logical", length(names(df))) %>%
  set_names(nm = names(df))
# Replace by previously calculated column to keep
selected_vars[names(to_keep)] <- to_keep
selected_vars
#>    ID   grp    p1    p2 
#>  TRUE  TRUE FALSE  TRUE

# Filter your df with only column to keep
df %>%
  select_if(selected_vars)
#> # A tibble: 5 x 3
#>      ID   grp    p2
#>   <dbl> <dbl> <dbl>
#> 1    1.    0.    3.
#> 2    2.    0.    4.
#> 3    3.    1.    2.
#> 4    4.    1.    2.
#> 5    5.    1.    1.

One other solution would be to identify column to delete then select them out;

to_delete <- df %>%
  summarise_at(vars(p1:p2), funs(min(.[grp==0]) < median(.))) %>%
  select_if(isTRUE) %>%
  names()

df %>%
  select(-matches(to_delete))
#> # A tibble: 5 x 3
#>      ID   grp    p2
#>   <dbl> <dbl> <dbl>
#> 1    1.    0.    3.
#> 2    2.    0.    4.
#> 3    3.    1.    2.
#> 4    4.    1.    2.
#> 5    5.    1.    1.

Created on 2018-04-14 by the reprex package (v0.2.0).

markdly · April 15, 2018, 2:17am

If the data has to remain in the same layout then I think I'd go with cderv's preferred solution:

Another alternative might be to 'tidy' up the data first with gather and then spread the result. This means the processing can take place in a single block of statements (but does involve an interim change to the data structure).

library(tidyverse)

df <- tribble(
  ~ID, ~grp, ~p1, ~p2,
  1,  0,  1,  3,
  2,  0,  2,  4,
  3,  1,  3,  2,
  4,  1,  3,  2,
  5,  1,  3,  1
)

df %>% 
  gather(px, val, -c(ID, grp)) %>% 
  group_by(px) %>% 
  filter(min(val[grp == 0]) > median(val)) %>% 
  spread(px, val)
#> # A tibble: 5 x 3
#>      ID   grp    p2
#>   <dbl> <dbl> <dbl>
#> 1    1.    0.    3.
#> 2    2.    0.    4.
#> 3    3.    1.    2.
#> 4    4.    1.    2.
#> 5    5.    1.    1.

Created on 2018-04-15 by the reprex package (v0.2.0).

Dong · April 16, 2018, 12:01am

Many thanks for introducing to me these very useful base functions. Below is what I would prefer:

to_keep <- df %>%
  summarise_at(vars(p1:p2), funs(min(.[grp==0]) > median(.)) ) %>% 
  select_if(isTRUE) %>% 
  names()

df %>% 
  select(ID, grp, to_keep)

I was hoping that dplyr grammar allows me to combine select_if and select_at and limit the range of columns for selection and the conditions.

Dong · April 16, 2018, 6:00pm

I like the gather + spread solution. Only concern is the performance. I will have to test it on real case first (1000 X 1500 data frame).