summarise(max) but keep all columns

uvapnut · February 11, 2020, 5:47pm

I am a total beginner and am struggling to work out how to write the code for what I want. For each student_id and test_name, I want to drop the lower test score but keep all of the other variables that I don't need to group by. I can't figure out how to do this: when I use summarise(), the data goes from 21 columns to 3.

Thanks for any help!

mattwarkentin · February 11, 2020, 6:11pm

You probably want to use the combination of group_by() and mutate(). This computes the per-group summary (the max value, for example) and adds it as a new column, without collapsing the rows or dropping the other columns.

For example:

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  mutate(max_score = max(Sepal.Length)) %>% 
  ungroup()
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species max_score
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>       <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa        5.8
#>  2          4.9         3            1.4         0.2 setosa        5.8
#>  3          4.7         3.2          1.3         0.2 setosa        5.8
#>  4          4.6         3.1          1.5         0.2 setosa        5.8
#>  5          5           3.6          1.4         0.2 setosa        5.8
#>  6          5.4         3.9          1.7         0.4 setosa        5.8
#>  7          4.6         3.4          1.4         0.3 setosa        5.8
#>  8          5           3.4          1.5         0.2 setosa        5.8
#>  9          4.4         2.9          1.4         0.2 setosa        5.8
#> 10          4.9         3.1          1.5         0.1 setosa        5.8
#> # … with 140 more rows

Created on 2020-02-11 by the reprex package (v0.3.0)

uvapnut · February 11, 2020, 6:20pm

Thank you! I then used distinct to select only the highest score. I am quite sure that I have sixteen lines of code when three would have worked. Sigh. Work in progress!

mattwarkentin · February 11, 2020, 6:35pm

You may want to use filter() instead (if you're trying to keep the highest score, per student). For example:

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  mutate(max_score = max(Sepal.Length)) %>% 
  ungroup() %>% 
  filter(Sepal.Length == max_score)
#> # A tibble: 3 x 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    max_score
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>          <dbl>
#> 1          5.8         4            1.2         0.2 setosa           5.8
#> 2          7           3.2          4.7         1.4 versicolor       7  
#> 3          7.9         3.8          6.4         2   virginica        7.9
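
Closer to the original question, here is a rough sketch of the same idea on student data. Only student_id and test_name come from the question; the score and teacher columns and all of the values are made up for illustration. It filters inside the groups directly, so the helper max_score column isn't needed. The slice_max() alternative assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: student_id and test_name are from the question;
# score and teacher are invented stand-ins for the other columns.
scores <- tibble(
  student_id = c(1, 1, 2, 2, 2),
  test_name  = c("math", "math", "math", "reading", "reading"),
  score      = c(70, 85, 90, 60, 75),
  teacher    = c("A", "A", "B", "C", "C")
)

# Keep the row(s) with the highest score for each student and test.
# All columns survive; tied top scores are all kept.
best <- scores %>%
  group_by(student_id, test_name) %>%
  filter(score == max(score)) %>%
  ungroup()

# Alternative (assumes dplyr >= 1.0.0): keep exactly one row per group,
# even when the top score is tied.
best_one <- scores %>%
  group_by(student_id, test_name) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Which one fits depends on what a tie should mean: filter() keeps every tied row, while slice_max() with with_ties = FALSE keeps just one of them.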

uvapnut · February 11, 2020, 6:37pm

Ah. Beautiful. This is so much more efficient than my current code!
