Conditional filtering of rows

NargizB · February 10, 2021, 3:07pm

I want to delete one of duplicate rows when the condition is met.

When "Client" and "year" are the same and "type" is different delete one of the duplicate row (which corresponds to a larger value in "cost"). But is DO NOT filter if all "client", "year" and "type" are the same.

client<- c(1,1,1,2,2,3,3,4, 5,5)
year <- c(2000, 2010, 2000, 2003, 2007, 2009, 2009, 2012, 2017, 2017)
type<-c("A", "A", "B", "B", "C","D","D", "A" , "C", "D")
cost<- c(34,56,78, 45,12,56,67,23,10, 89)
data<- tibble(cient,year, type,value)

client year type cost

1 1 2000 A 34
2 1 2010 A 56
3 1 2000 B 78
4 2 2003 B 45
5 2 2007 C 12
6 3 2009 D 56
7 3 2009 D 67
8 4 2012 A 23
9 5 2017 C 10
10 5 2017 D 89

As a result row # 2 and row #10 should get deleted and the result should look like this:
Cient type value
client year type cost

1 1 2000 A 34
2 1 2000 B 78
3 2 2003 B 45
4 2 2007 C 12
5 3 2009 D 56
6 3 2009 D 67
7 4 2012 A 23
8 5 2017 C 10

I am new to tidyverse and couldn't write a command that will filter a row with condition.

Thanks in advance for your help

lars · February 10, 2021, 3:42pm

Is this, distinct(client, value) , what you're looking for?

library(tidyverse)

data_NB <- tibble::tribble(
        ~client, ~type, ~value,
               "1", "A", "34",
               "1", "A", "56",
               "1", "B", "34",
               "2", "B", "45",
               "2", "C", "12",
               "3", "D", "56",
               "3", "D", "67",
               "4", "A", "23",
               "5", "C", "10",
               "5", "D", "10"
        )

data_NB
#> # A tibble: 10 x 3
#>    client type  value
#>    <chr>  <chr> <chr>
#>  1 1      A     34   
#>  2 1      A     56   
#>  3 1      B     34   
#>  4 2      B     45   
#>  5 2      C     12   
#>  6 3      D     56   
#>  7 3      D     67   
#>  8 4      A     23   
#>  9 5      C     10   
#> 10 5      D     10

data_NB %>% distinct(client, value)
#> # A tibble: 8 x 2
#>   client value
#>   <chr>  <chr>
#> 1 1      34   
#> 2 1      56   
#> 3 2      45   
#> 4 2      12   
#> 5 3      56   
#> 6 3      67   
#> 7 4      23   
#> 8 5      10

^{Created on 2021-02-10 by the reprex package (v1.0.0)}

lars · February 10, 2021, 3:48pm

And with the .keep_all = TRUE argument you get almost exactly what you're looking for. This argument keeps the first row of the found duplicate.

> data_NB %>% distinct(client, value, .keep_all = TRUE)
# A tibble: 8 x 3
  client type  value
  <chr>  <chr> <chr>
1 1      A     34   
2 1      A     56   
3 2      B     45   
4 2      C     12   
5 3      D     56   
6 3      D     67   
7 4      A     23   
8 5      C     10

NargizB · February 10, 2021, 4:43pm

Thanks for your response. I just made a slight change to dataset to make it a bit more clear. But there is one issue your code.

I have two conditions to meet so that filtration gets executed. That is, client & year is the same and type is different. That means if client & year is the same and type is the same as well I don't want to filter.

This scenario only exist in client # 1 and client # 10. One row (one that corresponds to larger value in "cost") in these two cases should be removed in the output. Thanks.

lars · February 10, 2021, 7:47pm

I noticed that you have updated your original post several times and significantly compared to your initial post (on which my first reply was based on) ..
Anyway, after rereading it (as the info provided is not quite consistent) it's still not clear to me what exactly are your requirements.

Assuming this is the dataset:

# A tibble: 10 x 4
   client  year type   cost
    <dbl> <dbl> <chr> <dbl>
 1      1  2000 A        34
 2      1  2010 A        56
 3      1  2000 B        78
 4      2  2003 B        45
 5      2  2007 C        12
 6      3  2009 D        56
 7      3  2009 D        67
 8      4  2012 A        23
 9      5  2017 C        10
10      5  2017 D        89

and the requirement is:

but also

As a result row # 3 and row #10 should get deleted and the result should look like this:

Then what do you want to keep?

first row (so, row 1 i.s.o. row 3)
the row with the highest code: type B (so, row 3 i.s.o. row 1)
the row with the largest cost: cost 78 (so, row 3 i.s.o. row 1)

lars · February 10, 2021, 8:08pm

Hi @NargizB,

I think the following code provides you the flexibility you're looking for. It's all about sorting your dataset correctly and then make use of the .keep_all argument (which only keeps the first row in case of non-distinct values).

See for clients 1 (year 2000) and 5 (year 2017) the differences in results below:

library(tidyverse)

data_NB <- tibble(client = c(1,1,1,2,2,3,3,4, 5,5),
                  year   = c(2000, 2010, 2000, 2003, 2007, 2009, 2009, 2012, 2017, 2017),
                  type   = c("A", "A", "B", "B", "C","D","D", "A" , "C", "D"),
                  cost   = c(34,56,78, 45,12,56,67,23,10, 89)
                  )
data_NB
#> # A tibble: 10 x 4
#>    client  year type   cost
#>     <dbl> <dbl> <chr> <dbl>
#>  1      1  2000 A        34
#>  2      1  2010 A        56
#>  3      1  2000 B        78
#>  4      2  2003 B        45
#>  5      2  2007 C        12
#>  6      3  2009 D        56
#>  7      3  2009 D        67
#>  8      4  2012 A        23
#>  9      5  2017 C        10
#> 10      5  2017 D        89

## data sorted by client, year and type (all in ascending order)
data_NB %>% 
  arrange(client, year, type) %>% 
  distinct(client, year, .keep_all = TRUE)
#> # A tibble: 7 x 4
#>   client  year type   cost
#>    <dbl> <dbl> <chr> <dbl>
#> 1      1  2000 A        34
#> 2      1  2010 A        56
#> 3      2  2003 B        45
#> 4      2  2007 C        12
#> 5      3  2009 D        56
#> 6      4  2012 A        23
#> 7      5  2017 C        10

## data sorted by client, year and cost (where COST IS IN DESCENDING order)
data_NB %>% 
  arrange(client, year, desc(cost)) %>% 
  distinct(client, year, .keep_all = TRUE)
#> # A tibble: 7 x 4
#>   client  year type   cost
#>    <dbl> <dbl> <chr> <dbl>
#> 1      1  2000 B        78
#> 2      1  2010 A        56
#> 3      2  2003 B        45
#> 4      2  2007 C        12
#> 5      3  2009 D        67
#> 6      4  2012 A        23
#> 7      5  2017 D        89

^{Created on 2021-02-10 by the reprex package (v1.0.0)}

HTH
lars

kcarnold · February 10, 2021, 10:06pm

It's not clear if your question is answered, but one tool you can use to specify what happens when multiple rows share some values is grouping: group by the common values, then summarize with min / max / first / whatever you want.

gabriel.de.wit · February 10, 2021, 11:39pm

Or this, which works on unsorted data

data %>%
  group_by(client, year, type) %>%
  mutate(largest_cost_within_type = max(cost)) %>% # I'm not clear, you might want `sum` instead
  group_by(client, year) %>%
  filter(largest_cost_within_type == min(largest_cost_within_type)) %>%
  ungroup() %>%
  select(-largest_cost_within_type)

NargizB · February 11, 2021, 3:30pm

Thanks for catching that. I fixed the output as I need row with larger "cost" be deleted.

NargizB · February 11, 2021, 3:39pm

Hello. so the issue with filtering by distinct function is that it also filters the row when all "client", "year" and "type" are the same whereas my condition is that filter ONLY when "client " & "year " are the same and "type" is different. That means none of row #6 to #7 should be removed. Thanks

lars · February 11, 2021, 4:28pm

Hi Nargiz,

I've created a new reprex below which tries to solve your need by using two intermediate datasets. You can see that I also removed the sorting in descending order of the cost variable.

library(tidyverse)

data_NB <- tibble(client = c(1,1,1,2,2,3,3,4,5,5),
                  year   = c(2000,2010,2000,2003,2007,2009,2009,2012,2017,2017),
                  type   = c("A","A","B","B","C","D","D","A","C","D"),
                  cost   = c(34,56,78,45,12,56,67,23,10,89)
)
# data_NB %>% head()

## keep rows which have multiple hits on client, year and *same* type
data_1 <- data_NB %>% 
  count(client, year, type) %>% 
  filter(n > 1) %>% 
  left_join(data_NB, by = c("client", "year", "type"))
data_1
#> # A tibble: 2 x 5
#>   client  year type      n  cost
#>    <dbl> <dbl> <chr> <int> <dbl>
#> 1      3  2009 D         2    56
#> 2      3  2009 D         2    67

## remove rows which have different type
data_2 <- data_NB %>% 
  arrange(client, year, cost) %>% 
  distinct(client, year, .keep_all = TRUE)
data_2
#> # A tibble: 7 x 4
#>   client  year type   cost
#>    <dbl> <dbl> <chr> <dbl>
#> 1      1  2000 A        34
#> 2      1  2010 A        56
#> 3      2  2003 B        45
#> 4      2  2007 C        12
#> 5      3  2009 D        56
#> 6      4  2012 A        23
#> 7      5  2017 C        10

## merge both intermediate datasets into one.
data_2 %>% 
  full_join(data_1) %>% 
  arrange(client, year, type, cost) %>% 
  select(-n)
#> Joining, by = c("client", "year", "type", "cost")
#> # A tibble: 8 x 4
#>   client  year type   cost
#>    <dbl> <dbl> <chr> <dbl>
#> 1      1  2000 A        34
#> 2      1  2010 A        56
#> 3      2  2003 B        45
#> 4      2  2007 C        12
#> 5      3  2009 D        56
#> 6      3  2009 D        67
#> 7      4  2012 A        23
#> 8      5  2017 C        10

^{Created on 2021-02-11 by the reprex package (v1.0.0)}

Be aware though that on this small example data set the result appear to be what you expect, but that in your dataset the result may differ: what happens when a client have multiple duplicate types in one year?

Anyway, I hope this helps you forward.

NargizB · February 11, 2021, 8:36pm

This was helpful for me to see a new approach of manipulating data by extracting rows that interests me and merging them back later. Thank you for that. However in this particular problem it may not work because as you mentioned there could be 2 cases of same client, year and type and this code will add them both back when I do a final merge. I tested it in this dataset

client<- c(1,1,1,1,2,2,3,3,4, 5,5)

year<-c(2000, 2000, 2000,2000, 2003,2007,2009,2009,2012,2017,2017)

type<-c("A", "A", "B", "B", "B", "C","D","D", "A", "C", "D" )

cost<- c(34,56,78,30, 45,12,56,67,23,10, 89)

data_2<- tibble(client,year, type,cost)

NargizB · February 11, 2021, 8:42pm

Thank you very much for this solution. This is really smart. I tested this solution with more complicated situation where same client, year and type appears more than once. It worked properly by deleting all rows that have same client & year and different type. I tested it in this dataset and it removed row # 3,4,11

client<- c(1,1,1,1,2,2,3,3,4, 5,5)

 year<-c(2000, 2000, 2000,2000, 2003,2007,2009,2009,2012,2017,2017)

type<-c("A", "A", "B", "B", "B", "C","D","D", "A", "C", "D" )

cost<- c(34,56,78,30, 45,12,56,67,23,10, 89)

data_2<- tibble(client,year, type,cost)

 client  year type   cost
    <dbl> <dbl> <chr> <dbl>
 1      1  2000 A        34
 2      1  2000 A        56
 3      1  2000 B        78
 4      1  2000 B        30
 5      2  2003 B        45
 6      2  2007 C        12
 7      3  2009 D        56
 8      3  2009 D        67
 9      4  2012 A        23
10      5  2017 C        10
11      5  2017 D        89

system · February 18, 2021, 8:42pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.