How to keep the removed rows in a separate group using group_by() & filter() in dplyr

yash_yp · September 14, 2020, 11:49am

In the dplyr package, I am using the group_by() function and then applying the filter() function to remove some rows from each group. Now, I want to put the removed rows (the ones left out of the group) into a whole new group.

This is my code-

threshold <- dummy %>%
  group_by(expiry_date, location_code, model, age, Emp_id) %>%
  filter(Date <= as.Date(min(Date) + 2), .preserve = TRUE) %>%
  arrange(expiry_date, location_code, model, age, Emp_id)

It gives me the rows that pass the filter, but I also want to keep the removed rows, in a different group. Please suggest a workaround for this. Thank you!

mara · September 14, 2020, 12:31pm

If there are rows that you want in your dataset, then you shouldn't filter them out (which, in effect, gets rid of them). Depending on what your goal is, there are a number of different ways to approach this. I can't totally tell from your example, but if you're trying to do something to one group and not another, you can use if_else()- or case_when()-type logic.
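As a minimal sketch of the if_else() idea, assuming a toy data frame (the values are made up for illustration, only two of the original grouping columns are used, and the `status` column name is hypothetical), the filter condition can be stored as a label instead of being used to drop rows:

```r
library(dplyr)

# Toy data, invented for illustration
dummy <- tibble(
  location_code = c("A", "A", "A", "B", "B"),
  model         = c("x", "x", "x", "y", "y"),
  Date          = as.Date(c("2020-01-01", "2020-01-02", "2020-01-10",
                            "2020-03-05", "2020-03-20"))
)

flagged <- dummy %>%
  group_by(location_code, model) %>%
  # Same condition as in filter(), but kept as a label so no rows are lost
  mutate(status = if_else(Date <= min(Date) + 2, "kept", "removed")) %>%
  ungroup()

# The two sets of rows can be pulled apart afterwards if needed
kept    <- filter(flagged, status == "kept")
removed <- filter(flagged, status == "removed")
```

Because every row survives, `status` can also be added to a later group_by() call so that the "removed" rows form groups of their own.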

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

yash_yp · September 14, 2020, 12:50pm

Thanks for the quick response. I'll try to explain the problem again.

Let's say I have 4 features:

Creation_date
Age
location_code
device_type

And total rows = 1000
Now, I want to create groups/clusters which have the same values for 3 columns namely - Age, location_code and device_type.

All the groups formed will have all the 4 features.
Now in each group, the values of "Age", "location_code" and "device_type" will be same for each row.

And value of "Creation_date" may or may not be same for each row.

Now, after creating the groups, let's analyze the first group. In this group,
I want to keep a row only if, for that row, the Creation_date is <= min(Creation_date + 2 days).
Rows which do not satisfy this condition I would like to put into a different cluster (or group).

I want to repeat this process for all the groups formed.

I think this explanation might help in understanding the problem.
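One way to sketch the process described above, assuming made-up sample rows and hypothetical helper columns (`base_group`, `within_window`, `cluster`), is to number the original groups and append a marker to the rows that fall outside the two-day window:

```r
library(dplyr)

# Invented sample rows using the four columns described above
devices <- tibble(
  Creation_date = as.Date(c("2020-05-01", "2020-05-03", "2020-05-09",
                            "2020-06-10", "2020-06-11")),
  Age           = c(30, 30, 30, 45, 45),
  location_code = c("N1", "N1", "N1", "S2", "S2"),
  device_type   = c("phone", "phone", "phone", "tablet", "tablet")
)

clustered <- devices %>%
  group_by(Age, location_code, device_type) %>%
  mutate(
    # Number of the original group (requires dplyr >= 1.0.0)
    base_group    = cur_group_id(),
    # TRUE if the row is within 2 days of the group's earliest date
    within_window = Creation_date <= min(Creation_date) + 2,
    # Rows outside the window get their own cluster id, e.g. "1_late"
    cluster       = if_else(within_window,
                            as.character(base_group),
                            paste0(base_group, "_late"))
  ) %>%
  ungroup() %>%
  group_by(cluster)  # regroup on the new cluster id
```

Here the third row is more than two days after the earliest date in its group, so it ends up alone in a "1_late" cluster while the other rows stay in clusters "1" and "2".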

nirgrahamuk · September 14, 2020, 1:19pm

Is there a typo? It seems guaranteed that any positive number would be smaller than itself plus a positive integer, so that is not a useful criterion to split on...

That said, here's a basic example of doing the 'kind of thing'.

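library(tidyverse)

set.seed(42)

# Example data: 1000 rows with a random creation date plus three grouping columns
(exdf <- tibble(
  creation_date = sample(seq.Date(from = as.Date("2019/01/01"), by = "day", length.out = 400),
                         size = 1000,
                         replace = TRUE),
  age = sample.int(3, size = 1000, replace = TRUE) * 20 + 15,
  location = sample(letters[1:5], size = 1000, replace = TRUE),
  device = sample(LETTERS, size = 1000, replace = TRUE)
))

# One row per (age, location, device) group, with the group's mean date and a group number
(group_df <- exdf %>%
    group_by(age, location, device) %>%
    summarise(avg_date = mean(creation_date)) %>%
    ungroup() %>%
    arrange(age, location, device) %>%
    mutate(group_num = row_number()))

# Join the group info back onto every row, then split each group in two:
# rows later than the group's mean date get the suffix "T", the rest get "F"
(df2 <- left_join(group_df, exdf, by = c("age", "location", "device")) %>%
    mutate(greater_than_average = creation_date > avg_date,
           final_group_code = paste0(group_num,
                                     str_sub(greater_than_average, start = 1, end = 1))))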

yash_yp · September 14, 2020, 4:36pm

Hi,

Yes, there was a typo in my previous reply; I have corrected it.
But I understood your logic and it's working fine for me.

Thanks a lot !
