Subtracting values of observations based on their rank position in a given group

Hi everyone,

I am trying to subtract values of different observations of a variable (table), based on their rank variable (rank), which reflects their position within groups of a third variable (color). What I need to do is subtract the values of the first and second ranked table values in each color group. The code I tried deleted all observations from the df. I'm not really sure where to go next, if anyone can help I'd be really grateful.

(d <- head(ggplot2::diamonds))


# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
  
##I then create a rank variable to hold the rank position of table in each color:

d <- d %>% 
  arrange(color, table) %>% 
  group_by(color) %>% 
  mutate(rank = rank(order(table, decreasing = TRUE)))

##The code that deletes all observations in the df:

d <- d %>% filter(rank == 1 & rank == 2) %>%
  mutate(diff = table - lag(table, order_by = rank))

If you want your filtering to work you need to give it a valid condition. See below:

d <- d %>% filter(rank == 1 | rank == 2) %>%
  mutate(diff = table - lag(table, order_by = rank))

No value will have 1 and 2 as rank at the same time so it will filter out everything.

Then in terms of your other code you will need to make a condition that color is only valid when it at least has 2 values associated with that color. Otherwise you will have a problem with some who only have single. You will also have unfortunate overlap so you should always calculate when it should occur.

A bit of code that might help you as well is something like this. You can essentially create a mutate with conditionals. If rank 1 or 2 then it will perform depth - lag(depth) for diff otherwise it will just have 0. You can adjust this accordingly to work for your problem (more than 2 levels etc).


d <- d %>% mutate(
  diff = case_when( 
    rank == 1 | rank == 2 ~ depth - lag(depth),
    TRUE ~ 0
  )
) 

Let us know if this helps :slight_smile:

1 Like

Thank you for your input! The valid condition is something that slipped by, good that you noticed right away. Though I still had a problem when running the code you provided. I got this error message back:

 rlang::last_trace()
<error/dplyr:::mutate_error>
Problem with `mutate()` input `diff`.
x comparação (1) é possível apenas para tipos lista ou atômicos
i Input `diff` is `case_when(...)`.
Backtrace:
     x
  1. +-`%>%`(...)
  2. +-dplyr::mutate(...)
  3. +-dplyr:::mutate.data.frame(...)
  4. | \-dplyr:::mutate_cols(.data, ...)
  5. |   +-base::withCallingHandlers(...)
  6. |   \-mask$eval_all_mutate(dots[[i]])
  7. +-dplyr::case_when(...)
  8. | \-rlang::eval_tidy(pair$lhs, env = default_env)
  9. \-base::.handleSimpleError(...)
 10.   \-dplyr:::h(simpleError(msg, call))

Sorry for the Portuguese in there, but it translates to something like 'comparison (1) is only possible for lists or atomic vectors"

Here is a reprex for you :slight_smile:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3

d <- head(ggplot2::diamonds)



d <- d %>% 
  arrange(color, table) %>% 
  group_by(color) %>% 
  mutate(rank = rank(order(table, decreasing = TRUE)))

##The code that deletes all observations in the df:

#d <- d %>% filter(rank == 1 & rank == 2) %>%
#  mutate(diff = table - lag(table, order_by = rank))


d2 <- d %>% mutate(
  diff = case_when( 
    rank == 1 | rank == 2 ~ depth - lag(depth),
    TRUE ~ 0
  )
) 

d2
#> # A tibble: 6 x 12
#> # Groups:   color [3]
#>   carat cut       color clarity depth table price     x     y     z  rank  diff
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43     3  0   
#> 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31     2 -1.7 
#> 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31     1 -2.90
#> 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63     1 NA   
#> 5 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48     2 NA   
#> 6 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75     1  0.5

Created on 2020-12-03 by the reprex package (v0.3.0)

1 Like

It worked now, I think there was an issue with my rank variable. After some tweaking, it ran properly. Thanks for this!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.