How to select top N (highest value) and create new column with it ?

AbhishekHP · September 3, 2019, 2:41pm

How can we create columns with highest values for each row ?
ref:
https://rdrr.io/cran/dplyr/man/top_n.html

For e.g.

library(tidyverse)

iris %>% glimpse()

# my attempt
x = iris %>% 
  select(-Species) %>%
  gather(measure,values) %>%
# hereafter got stuck
  mutate(top_1 =
                  top_2 = 
                  top3_3 = )

expected_output = iris %>% mutate(top_1 = 1st highest value from the row  (row wise),
                                  top_2 = 2nd highest value from the row  (row wise),
                                  top_3 = 3rd highest value from the row (row wise))

# expected output first 3 rows looks like below:
iris[1:3,] %>% 
mutate(top_1 = c(5.1,4.9,4.7), top_2 = c(3.5,3.0,3.2), top_3 = c(1.4,1.4,1.3))

raytong · September 3, 2019, 3:14pm

Hi AbhishekHP. I will suggest the script as follow and not using top_n function. Top_n will return more number of value than you wanted (top 3) in the tiles situation.

iris %>%
  select(-Species) %>%
  gather(measure, values) %>%
  group_by(measure) %>%
  summarise(data = list(values)) %>%
  rowwise() %>%
  mutate(data = list(data[rev(order(data))][1:3])) %>%
  unnest(data) %>%
  mutate(top = rep(c("top_1", "top_2", "top_3"), length.out = nrow(.))) %>%
  spread(top, data)

AbhishekHP · September 4, 2019, 7:59am

Thanks @raytong for taking time to help me.
Perhaps, was unclear in my description of expected _output.
It will have all the rows in the actual dataset (so used mutate)

# expected output first 3 rows looks like below:
iris[1:3,] %>% 
mutate(top_1 = c(5.1,4.9,4.7), top_2 = c(3.5,3.0,3.2), top_3 = c(1.4,1.4,1.3))

raytong · September 4, 2019, 8:53am

Hi @AbhishekHP. I modified the script and use rowwise and sort.

iris %>% 
    rowwise() %>%
    mutate(rows = list(sort(c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), decreasing = TRUE))) %>%
    mutate(top_1 = rows[1], top_2 = rows[2], top_3 = rows[3]) %>%
    select(-rows)

AbhishekHP · September 4, 2019, 2:10pm

great and simplified solution.

Is it possible to mention the column name as well ?
first 3 rows of expected output looks like :

iris[1:3,] %>% 
    mutate(top_1 = c(5.1,4.9,4.7), top_2 = c(3.5,3.0,3.2), top_3 = c(1.4,1.4,1.3),
           top_1_from = c("Sepal.Length","Sepal.Length","Sepal.Length"),
           top_2_from = c("Sepal.Width","Sepal.Width","Sepal.Width"),
           top_3_from = c("Petal.Length","Petal.Length","Petal.Length"))

AbhishekHP · September 4, 2019, 3:07pm

Actual data has NA values in some column (e.g. Sepal.Width)
Can we sort ignoring NA's ?

raytong · September 4, 2019, 3:09pm

@AbhishekHP. The script are a little complicated.

iris %>% 
  rownames_to_column() %>%
  group_by(rowname, Species) %>%
  nest() %>%
  rowwise() %>%
  mutate(data = list(sort(unlist(as.list(data)), decreasing = TRUE))) %>%
  mutate(top_1 = data[1], top_2 = data[2], top_3 = data[3],
         top_1_from = names(data)[1], top_2_from = names(data)[2], top_3_from = names(data)[3]) %>%
  bind_cols(do.call(bind_rows, .$data))  %>%
  select(-rowname, -data)

Yarnabrina · September 4, 2019, 3:17pm

Have you tried the na.last argument?

Edit

I don't think you'll need na.last, as by default NA values are omitted altogether. I've included it below, though.
I used development version of tidyr in my solution.
I am very bad at reshaping. I'm sure the 3 calls to unnest_wider and several rename's can be done in a much better and elegant way, but still posting in hope that other people will help in getting better with these things.

working, but bad, solution

set.seed(seed = 38914)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(tibble)
library(tidyr)

fake_dataset <- data.frame(a = sample(x = c(1:10, NA),
                                      size = 20,
                                      replace = TRUE),
                           b = sample(x = c(letters[1:10], NA),
                                      size = 20,
                                      replace = TRUE),
                           c = sample(x = c((-10):(-1), NA),
                                      size = 20,
                                      replace = TRUE),
                           d = sample(x = c(letters[17:26], NA),
                                      size = 20,
                                      replace = TRUE),
                           e = sample(x = c((-5):5, NA),
                                      size = 20,
                                      replace = TRUE),
                           f = sample(x = c(letters[5:14], NA),
                                      size = 20,
                                      replace = TRUE),
                           g = sample(x = c((-10):(-1), NA),
                                      size = 20,
                                      replace = TRUE),
                           h = sample(x = c(letters[13:22], NA),
                                      size = 20,
                                      replace = TRUE),
                           i = sample(x = c(1:10, NA),
                                      size = 20,
                                      replace = TRUE),
                           j = sample(x = c(letters[9:18], NA),
                                      size = 20,
                                      replace = TRUE))

fake_dataset %>%
  mutate(temp = pmap(.l = select_if(.tbl = .,
                                    .predicate = is.numeric),
                     .f = list),
         temp = map(.x = temp,
                    .f = ~ sort(x = unlist(x = .x),
                                decreasing = TRUE,
                                na.last = TRUE)[1:3]),
         temp = map(.x = temp,
                    .f = enframe)) %>%
  unnest_wider(col = temp,
               names_sep = "_") %>%
  unnest_wider(col = temp_name,
               names_sep = "_") %>%
  unnest_wider(col = temp_value,
               names_sep = "_") %>%
  rename(top_1_from = temp_name_...1,
         top_2_from = temp_name_...2,
         top_3_from = temp_name_...3,
         top_1 = temp_value_...1,
         top_2 = temp_value_...2,
         top_3 = temp_value_...3)
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> # A tibble: 20 x 16
#>        a b         c d         e f         g h         i j     top_1_from
#>    <int> <fct> <int> <fct> <int> <fct> <int> <fct> <int> <fct> <chr>     
#>  1     4 h        -9 s         5 m        NA t         6 o     i         
#>  2    10 h        -2 y         4 f        -3 v         3 j     a         
#>  3     4 g        -9 t        -4 m        -4 p         9 p     i         
#>  4    10 f       -10 v        -4 h        -2 v         8 j     a         
#>  5     7 j        NA s        NA n        -6 v         6 p     a         
#>  6     9 a        -5 w        -5 m       -10 p        NA o     a         
#>  7    NA g        -8 y        NA j        -2 o         4 i     i         
#>  8     8 i        -3 v        -2 h        -6 s         6 i     a         
#>  9     5 f       -10 y         0 <NA>     -2 u         6 j     i         
#> 10     2 d        -4 y         2 j        -9 u         7 n     i         
#> 11     4 d        NA z        NA g        -1 u         9 r     i         
#> 12     9 a        -4 z        -1 j        -8 <NA>      3 q     a         
#> 13     5 h        -1 w         1 h        -3 <NA>     10 k     i         
#> 14     4 i        -9 w         5 n        -8 v         9 <NA>  i         
#> 15     3 j        -2 <NA>      0 h        NA p         3 k     a         
#> 16     7 c        -7 q        NA k        NA v         9 r     i         
#> 17     2 e        -2 w         0 l        -2 n        NA <NA>  a         
#> 18    10 e        NA t         0 l        -6 o        NA l     a         
#> 19     7 c       -10 v        NA h        -3 q         4 i     a         
#> 20     2 a        -7 t         1 l        -8 s        NA r     a         
#> # … with 5 more variables: top_2_from <chr>, top_3_from <chr>,
#> #   top_1 <int>, top_2 <int>, top_3 <int>

^{Created on 2019-09-05 by the reprex package (v0.3.0)}

AbhishekHP · September 5, 2019, 9:27am

Actually, NA's were inevitable because top 3 values were 100, NA, NA
So, top_2 and top_3 need to be NA

system · September 12, 2019, 9:27am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.