Label the filtered rows of a dataframe

I have filtered some rows of a data frame (df) based on some criteria using group_by and filter function of dplyr. Now I want to label those rows differently in the original dataframe (df). So that I can use if else function based on that label.

# the code tries to find out the middle index of the each group of Transitonyear.
df%>%
  group_by(Transitionyear) %>%
  filter(row_number()==ceiling(n()/2))

Thanks

It is hard to be specific without knowing more about your data but here is a sketch of one approach.

library(dplyr)

df <- data.frame(Transitionyear = rep(2000:2004, each = 4), Dummy = 1:20)
df
#>    Transitionyear Dummy
#> 1            2000     1
#> 2            2000     2
#> 3            2000     3
#> 4            2000     4
#> 5            2001     5
#> 6            2001     6
#> 7            2001     7
#> 8            2001     8
#> 9            2002     9
#> 10           2002    10
#> 11           2002    11
#> 12           2002    12
#> 13           2003    13
#> 14           2003    14
#> 15           2003    15
#> 16           2003    16
#> 17           2004    17
#> 18           2004    18
#> 19           2004    19
#> 20           2004    20
FilteredDF <- df%>%
  group_by(Transitionyear) %>%
  filter(row_number()==ceiling(n()/2)) %>% 
  mutate(Flag = "Y")
FilteredDF
#> # A tibble: 5 x 3
#> # Groups:   Transitionyear [5]
#>   Transitionyear Dummy Flag 
#>            <int> <int> <chr>
#> 1           2000     2 Y    
#> 2           2001     6 Y    
#> 3           2002    10 Y    
#> 4           2003    14 Y    
#> 5           2004    18 Y
df <- left_join(df, FilteredDF, by = c("Transitionyear", "Dummy"))
df
#>    Transitionyear Dummy Flag
#> 1            2000     1 <NA>
#> 2            2000     2    Y
#> 3            2000     3 <NA>
#> 4            2000     4 <NA>
#> 5            2001     5 <NA>
#> 6            2001     6    Y
#> 7            2001     7 <NA>
#> 8            2001     8 <NA>
#> 9            2002     9 <NA>
#> 10           2002    10    Y
#> 11           2002    11 <NA>
#> 12           2002    12 <NA>
#> 13           2003    13 <NA>
#> 14           2003    14    Y
#> 15           2003    15 <NA>
#> 16           2003    16 <NA>
#> 17           2004    17 <NA>
#> 18           2004    18    Y
#> 19           2004    19 <NA>
#> 20           2004    20 <NA>

Created on 2021-09-29 by the reprex package (v0.2.1)

1 Like

In principle, it seems if you have a working filter condition, then you have a working variable definition for a new column in df. I would recommend using mutate after group_by, and then filtering on your new variable, as needed.

This will have advantages over merging the filtered results back into your data as it won't leave missing values for the rows that didn't pass the filter.

library(dplyr)
df <- 
  data.frame(Transitionyear = rep(2000:2004, 
                                  each = 4), 
             Dummy = 1:20)

df %>% 
  group_by(Transitionyear) %>% 
  mutate(Flag = row_number() == ceiling(n() / 2))
#> # A tibble: 20 x 3
#> # Groups:   Transitionyear [5]
#>    Transitionyear Dummy Flag 
#>             <int> <int> <lgl>
#>  1           2000     1 FALSE
#>  2           2000     2 TRUE 
#>  3           2000     3 FALSE
#>  4           2000     4 FALSE
#>  5           2001     5 FALSE
#>  6           2001     6 TRUE 
#>  7           2001     7 FALSE
#>  8           2001     8 FALSE
#>  9           2002     9 FALSE
#> 10           2002    10 TRUE 
#> 11           2002    11 FALSE
#> 12           2002    12 FALSE
#> 13           2003    13 FALSE
#> 14           2003    14 TRUE 
#> 15           2003    15 FALSE
#> 16           2003    16 FALSE
#> 17           2004    17 FALSE
#> 18           2004    18 TRUE 
#> 19           2004    19 FALSE
#> 20           2004    20 FALSE
1 Like