Select Top 5 values and names per row

Hi,

I have a huge dataframe with values of several substances at different sampling sites (row = site, col = substances) and I want to create a new dataframe with all sites as rows but just the TOP 5 substances per row (5 highest value per row) and the respective column name (= substance name).
At the moment I have no idea how to start. Could someone give me a tip?

I used max() and function(x) names(x)[which.max(x)])) before to get the highest value, but is there an equivalent to get the TOP 1- 5?

This is how my df basically looks like:

set.seed(123456)
example <- dplyr::data_frame(
    Site = 1:10,
    Substance_A = sample(0:50, 10, rep = TRUE),
    Substance_B = sample(0:50, 10, rep = TRUE),
    Substance_C = sample(0:50, 10, rep = TRUE)
)

And I should look like this...

set.seed(123456)
TOP <- dplyr::data_frame(
    Site = 1:10,
    TOP1 = sample(0:50, 10, rep = TRUE),
    TOP1_Name = sample(c("a", "b","c"), 10, rep = TRUE),
    TOP2 = sample(0:50, 10, rep = TRUE), 
    TOP2_Name = ""
)

I would very much appreciate any tip or comment!

Thanks
Nele

My first approach would be to use the tidyverse package and convert the input "wide" dataset into an intermediate "long" output dataset. The slice_head() can easily be adjusted to the requested top 5 values.

``` r
 library(tidyverse)
  
  set.seed(123456)
  example <- dplyr::data_frame(
    Site = 1:10,
    Substance_A = sample(0:50, 10, rep = TRUE),
    Substance_B = sample(0:50, 10, rep = TRUE),
    Substance_C = sample(0:50, 10, rep = TRUE)
    )
#> Warning: `data_frame()` is deprecated as of tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
    
  output_long <- 
    example %>%
    pivot_longer(cols = 2:NCOL(.),
                 names_to = "substance",
                 values_to = "value"
                 ) %>% 
    arrange(Site, desc(value)) %>% 
    group_by(Site) %>% 
    slice_head(n = 2)
  
  output_long
#> # A tibble: 20 x 3
#> # Groups:   Site [10]
#>     Site substance   value
#>    <int> <chr>       <int>
#>  1     1 Substance_B    44
#>  2     1 Substance_A    41
#>  3     2 Substance_A    41
#>  4     2 Substance_B    29
#>  5     3 Substance_A    48
#>  6     3 Substance_B    46
#>  7     4 Substance_C    45
#>  8     4 Substance_B    22
#>  9     5 Substance_C    46
#> 10     5 Substance_A    44
#> 11     6 Substance_A    35
#> 12     6 Substance_B    27
#> 13     7 Substance_A    37
#> 14     7 Substance_B    37
#> 15     8 Substance_C    34
#> 16     8 Substance_A     9
#> 17     9 Substance_B    15
#> 18     9 Substance_C    15
#> 19    10 Substance_C    22
#> 20    10 Substance_B    20

Created on 2021-02-03 by the reprex package (v1.0.0)

The above output dataset would be easier to work with.

However, to give it a try to put it back into a wider format, I've added the top_n and top_substance variables and then needed to pivot_wider() and pivot_longer() a couple of times to get the result you're looking for.

  output_long %>% 
    mutate(top_n = str_c("Top_", row_number()),
           top_substance = str_c(top_n, "_Name")
           )  %>%
    pivot_wider(names_from = top_n,
                values_from = value
                ) %>% 
    pivot_wider(names_from = top_substance,
                values_from = substance
                ) %>% 
    select(Site, sort(names(.))) %>% 
    ## pivot_longer of multiple columns require that all values are of same class()
    mutate(across(.cols = everything(), .fns = as.character)) %>% 
    ## get rid of NA values
    pivot_longer(cols = 2:NCOL(.),
                 names_to = "col_name",
                 values_to = "col_value",
                 values_drop_na = TRUE
                 ) %>% 
    pivot_wider(names_from = col_name,
                values_from = col_value
                )
#> # A tibble: 10 x 5
#> # Groups:   Site [10]
#>     Site Top_1 Top_1_Name  Top_2 Top_2_Name 
#>    <int> <chr> <chr>       <chr> <chr>      
#>  1     1 44    Substance_B 41    Substance_A
#>  2     2 41    Substance_A 29    Substance_B
#>  3     3 48    Substance_A 46    Substance_B
#>  4     4 45    Substance_C 22    Substance_B
#>  5     5 46    Substance_C 44    Substance_A
#>  6     6 35    Substance_A 27    Substance_B
#>  7     7 37    Substance_A 37    Substance_B
#>  8     8 34    Substance_C 9     Substance_A
#>  9     9 15    Substance_B 15    Substance_C
#> 10    10 22    Substance_C 20    Substance_B

If there's an easier / quicker way to achieve the requested results then I'm curious what it might look like :sweat_smile:

1 Like

Thanks! That works perfectly!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.