Combining rows that do not match filter

dplyr

#1

Apologies in advance for no reprex - apparently my clipboard is not available

Pretty sure I saw there was a way of combining rows after mutate/summarize
so that in the example below I could filter on pc >20 and end up with rows A, D and 'other' with a value of 23

reprex1

Any help appreciated


#2

Try the github version... the one on CRAN has some issues.

devtools::install_github("tidyverse/reprex")

if this gives you an error saying that devtools is not available then use

install.packages("devtools")


#3

I don't understand what you expect for output. what should have a value of 23 and what is 'other"?

Prose is a poor way to describe output, you should use R to build the output you expect as you did to show the input.

This is how to filter for pc >20

suppressPackageStartupMessages(library(dplyr))

x <- data.frame(a = c("A", "B", "C", "D"), b = c(8, 3, 2, 9))

x %>% mutate(pc = round(100*b/sum(b))) %>%
        filter(pc > 20)
#>   a b pc
#> 1 A 8 36
#> 2 D 9 41

Created on 2018-03-26 by the reprex package (v0.2.0).


#4

This would be the output, I would want to see.

y <-data.frame(a = c("A", "D", "Others"), b = c(8, 9, 5),pc=c(36, 41, 23))

dput(y)
structure(list(a = structure(1:3, .Label = c("A", "D", "Others"
), class = "factor"), b = c(8, 9, 5), pc = c(36, 41, 23)), .Names = c("a",
"b", "pc"), row.names = c(NA, -3L), class = "data.frame")

print(y)
a b pc
1 A 8 36
2 D 9 41
3 Others 5 23

Would have some parameter in filter() e.g Combine = TRUE
I'm pretty sure I am not imagining it but cannot identify in documentation


#5

Thanks for heads up!


#7

I think this is what you are after:

library(dplyr)

x <- tibble(a = c("A", "B", "C", "D"), b = c(8, 3, 2, 9))

x %>% 
  mutate(pc = round(100*b/sum(b)), 
         a = ifelse(pc > 20, a, "Other")) %>% 
  group_by(a) %>% 
  summarise(pc = sum(pc))


#8

It's a little tricky because column a of your data.frame is a factor

suppressPackageStartupMessages(library(dplyr))
x <- data.frame(a = c("A", "B", "C", "D"), b = c(8, 3, 2, 9))

# because column a is a factor "others" needs to be
# added so that value can be used in column a
levels(x$a) <- c(levels(x$a), "others")

x %>%
    # do computation
    mutate( pc = round(100*b/sum(b))) %>%
    # add "others" row
    rbind(data.frame(a = "others", 
                                     b = sum(filter(., pc <= 20)$b),
                                     pc = sum(filter(., pc <= 20)$pc))) %>%
    # keep others row regardless of pc value
    filter(pc >20 | a == "others")
#>        a b pc
#> 1      A 8 36
#> 2      D 9 41
#> 3 others 5 23

Created on 2018-03-26 by the reprex package (v0.2.0).


#9

Another way to do this would be to group by the a variable and use mutate and case_when to create the labels, as following:

suppressPackageStartupMessages(library(dplyr))

x <- data.frame(a = c("A", "B", "C", "D"), b = c(8, 3, 2, 9))

x %>% 
  mutate(pc = round(100 * b / sum(b))) %>% # Compute %
  group_by(a) %>% 
  mutate(type = case_when( # Create variable where
    pc > 20 ~ as.character(`a`), # equal to `a` if `pc`> 20
    TRUE ~ "Other") # otherwise as `Other`
    ) %>% 
  group_by(type) %>% # Group by new variable
  summarise_if(is.numeric, sum) # Compute values
#> # A tibble: 3 x 3
#>   type      b    pc
#>   <chr> <dbl> <dbl>
#> 1 A      8.00  36.0
#> 2 D      9.00  41.0
#> 3 Other  5.00  23.0

Created on 2018-03-26 by the reprex package (v0.2.0).


#10

Thanks for all the suggestions, which have solved my problem

Obviously , for this example, it would have been more sensible to have started with a tibble but you have catered for an issue that might arise


#11

Wondering if when you originally thought you’d seen a way of doing this before, you were perhaps remembering forcats::fct_lump or forcats::fct_other? They don’t quite apply to this case, but fct_other could be useful in the above solution if you have a lot more categories to lump in your real data.


#12

That absolutely was what I meant. :grinning:
I will use that for real-life examples with more categories
I currently only use fct_reorder() for charting purposes but there is a lot more to the forcats package


#13

There’s nothing worse than that Twilight Zone feeling when you swear you remember reading something in the docs and now (cue creepy music) it has vanished! :fearful: