group by factor, and remove lowest value

Longshot408 · December 1, 2020, 3:42am

Hi all,
I'm trying to abandon my mass of Excel spreadsheets for grading my college class, but I'm running into a few snags. First off is trying to figure out how to drop the lowest quiz grade. A snippet of de-identified data is below:

gradebook=tibble::tribble(
                           ~name, ~Quiz_number, ~Quiz_score, ~Homework_number, ~Homework_score,
                 "Person A",           1L,          19,               1L,             18L,
                 "Person A",           2L,        13.5,               2L,             10L,
                 "Person A",           3L,        15.5,               3L,              0L,
                 "Person A",           4L,        10.5,               4L,              NA,
                 "Person A",           5L,          NA,               5L,              NA,
                "Person B",           1L,          24,               1L,             19L,
                "Person B",           2L,          19,               2L,             20L,
                "Person B",           3L,          25,               3L,             20L,
                "Person B",           4L,       18.75,               4L,              NA,
                "Person B",           5L,          NA,               5L,              NA,
               "Person C",           1L,          18,               1L,             18L,
               "Person C",           2L,          22,               2L,             18L,
               "Person C",           3L,          22,               3L,             18L,
               "Person C",           4L,          22,               4L,              NA,
               "Person C",           5L,          NA,               5L,              NA,
            "Person D",           1L,          14,               1L,             18L,
            "Person D",           2L,          18,               2L,             17L,
            "Person D",           3L,        16.5,               3L,             16L,
            "Person D",           4L,        14.5,               4L,              NA,
            "Person D",           5L,          NA,               5L,              NA
            )

NA means I'm still waiting on a submission; 0 means the deadline has past and they missed it.

The line gradebook%>%group_by(name)%>%slice_min(Quiz_score) kind of works in that it selects each person's lowest quiz grade, but it does the opposite of what I want in that it keeps that grade and discards the rest. I need a code that groups by name and then removes the row with the lowest value.

AlexisW · December 1, 2020, 5:00am

One way to avoid the problem is to first arrange(), then use slice_head() to keep the first n lines (which are thus the highest). In your example, every student has 5 rows, so you can simply use:

gradebook %>%
  group_by(name) %>%
  arrange(desc(Quiz_score)) %>%
  slice_head(n = 4)

There is a major difference with your approach though: NA always gets sorted as last. So for Person A, slice_min() returns quiz 4 (with 10.5), whereas arrange() gives Quiz 5 as last (with NA). So, to avoid that, we can sort in ascending order, then keep the tail. So the NA will be in the tail:

gradebook %>%
  group_by(name) %>%
  arrange(Quiz_score) %>%
  slice_tail(n = 4)

Also, this doesn't work if each student has a different number of records, in such a case you're better off with an anti-join:

to_exclude <- gradebook %>%
  group_by(name) %>%
  slice_min(Quiz_score)

anti_join(gradebook, to_exclude, by = c("name","Quiz_number"))

Longshot408 · December 1, 2020, 1:46pm

Thanks for the reply! Unfortunately though this seems to make duplicate entries for each quiz; each quiz ends up with four entries per person
Screenshot 2020-12-01 084509

But now I see what you're saying about having a different number of rows for homework vs. quizzes. Perhaps I'll split them into two separate tables then and calculate homework and quiz grades separately

FJCC · December 1, 2020, 2:08pm

You might also consider structuring the table to have columns for Name, Assignment_Type (Quiz or Homework), Assignment_Number, and Score. You could then group_by(Name, Assignment_Type) and use summarize() to calculate the sum(), min(), n(), sum() - min(), and (sum() - min())/n() for each group all in one step.

Longshot408 · December 1, 2020, 2:23pm

That's brilliant! Any tips on how to do that? If I had to guess, it would be a tweak in my pivot commands?

I had to use these to change the original Canvas .csv file's many columns to single summary columns for both quizzes and homework

gradebook=gradebook %>% pivot_longer(cols = starts_with("qui"),
                           names_to="Quiz_number",
                           names_prefix="quiz",
                           values_to="Quiz_scores")

gradebook=gradebook %>% pivot_longer(cols = starts_with("Hom"),
                                     names_to="Homework_number",
                                     names_prefix="homework",
                                     values_to="Homework_scores")

ADernild · December 1, 2020, 2:59pm

Not sure if or how you could do it with pivot_longer, especially as I don't know the structure of Canvas.csv. However building on top of the table you already have, you could do something like this:

# Splitting gradebook
library(tidyverse)
gradebook_quiz <- gradebook %>% 
  select(name, Quiz_number, Quiz_score) %>% 
  transmute(name = name, 
            Assignment_type = "Quiz", 
            number = Quiz_number, 
            score = Quiz_score)

gradebook_homework <- gradebook %>% 
  select(name, Homework_number, Homework_score) %>% 
  transmute(name = name, 
            Assignment_type = "Homework", 
            number = Homework_number, 
            score = Homework_score)

# Joining gradebooks together
gradebook <- gradebook_quiz %>% 
  full_join(gradebook_homework)

This would result in a new tibble with 40 rows and 4 columns

> gradebook
# A tibble: 40 x 4
   name     Assignment_type number score
   <chr>    <chr>            <int> <dbl>
 1 Person A Quiz                 1  19  
 2 Person A Quiz                 2  13.5
 3 Person A Quiz                 3  15.5
 4 Person A Quiz                 4  10.5
 5 Person A Quiz                 5  NA  
 6 Person B Quiz                 1  24  
 7 Person B Quiz                 2  19  
 8 Person B Quiz                 3  25  
 9 Person B Quiz                 4  18.8
10 Person B Quiz                 5  NA  
# … with 30 more rows

In this case you could modify @AlexisW code, sorting in ascending order and keeping the tail:

gradebook %>% 
  group_by(name, Assignment_type) %>% 
  arrange(score) %>% 
  slice_tail(n = 4)

# A tibble: 32 x 4
# Groups:   name, Assignment_type [8]
   name     Assignment_type number score
   <chr>    <chr>            <int> <dbl>
 1 Person A Homework             2  10  
 2 Person A Homework             1  18  
 3 Person A Homework             4  NA  
 4 Person A Homework             5  NA  
 5 Person A Quiz                 2  13.5
 6 Person A Quiz                 3  15.5
 7 Person A Quiz                 1  19  
 8 Person A Quiz                 5  NA  
 9 Person B Homework             2  20  
10 Person B Homework             3  20  
# … with 22 more rows

Hope this helps!

FJCC · December 1, 2020, 4:01pm

I expect there is a way to do this in a more elegant way but sometimes I just use the simple methods.

Qtmp <- gradebook %>% select(name, AssignNum = Quiz_number, Score = Quiz_score) %>% 
  mutate(AssignType = "Quiz")
Htmp <- gradebook %>% select(name, AssignNum = Homework_number, Score = Homework_score) %>% 
  mutate(AssignType = "Homework")
Alldat <- rbind(Qtmp, Htmp)
Stats <- Alldat %>% group_by(name, AssignType) %>% 
  summarize(RawTot = sum(Score, na.rm = TRUE), Min = min(Score, na.rm = TRUE), 
            N = sum(!is.na(Score)),
            AdjTot = RawTot - Min, AdjAvg = AdjTot/(N - 1))

Longshot408 · December 2, 2020, 4:07am

This works great! Only problem is my pivot commands seem to be off and I have to run everything in a very specific order. My opening block is this:

gradebook <- read_csv("grades.csv", col_names=TRUE)%>%
  slice(2:n())%>%
  janitor::clean_names()%>%
  select(-(assignments_current_points:unposted_final_score))%>%
  select(-(sis_login_id:section))%>%
  select(-c(id,quiz_1_1018568)) %>% 
  rename("Quiz 1"="quiz_1_983240",
         "Quiz 2"="quiz_2_983250",
         "Quiz 3"="quiz_3_983247",
         "Quiz 4"="quiz_4_983249",
         "Homework 1"= "homework_1_1011481",
         "Homework 2"="homework_2_1022194",
         "Homework 3"="homework_3_1035866",
         "Homework 4"="homework_4_1048009")


gradebook=gradebook %>% pivot_longer(cols = starts_with("qui"),
                           names_to="Quiz_number",
                           names_prefix="quiz",
                           values_to="Quiz_scores",
                           values_drop_na = FALSE)

gradebook=gradebook %>% pivot_longer(cols = starts_with("Hom"),
                                     names_to="Homework_number",
                                     names_prefix="homework",
                                     values_to="Homework_scores",
                                     values_drop_na = FALSE)

For some reason if I run either one of the pivot commands it works exactly as intended, but running both causes each person to get four rows of every item. To get your code to work I have to import the data, pivot once, run your transmute codes, reimport and use the second pivot, and then finish your codes.
Any idea what is going on with the pivots and how to fix?

ADernild · December 2, 2020, 2:49pm

Looking over the documentation for pivot_longer, it seems, as if you only need to call pivot_longer() one time, to do the whole operation.

When you do gradebook=gradebook %>% pivot_longer(...) sequentially, you first create a new matrix of [n x 4], where n is your observations/rows and 4 is the number of columns being collapsed. The second time you call pivot_longer is where the problem arises, as n, or the number of rows, doesn't correspond to the number of observations (students) anymore, but pivot_longer still returns a new matrix of [n x 4]. Instead you want a matrix of [n x (4 x 2)] where n is the number of students, 4 is the number of grades multiplied by the types of grades i.e., quiz or homework. To conclude calling pivot_longer twice creates a matrix [n x 4^2], whereas specifying both operations in one function call creates a matrix [n x (4 x 2)].

As I don't have your data grades.csv I tried to create some sample data to test this out and it seems to work:

library(tidyverse)

df <- tibble(name = 1:4, 
             `Quiz 1` = 1:4, 
             `Quiz 2` = 1:4, 
             `Quiz 3` = 1:4, 
             `Quiz 4` = 1:4, 
             `Homework 1` = 1:4, 
             `Homework 2` = 1:4, 
             `Homework 3` = 1:4, 
             `Homework 4` = NA)

df

# A tibble: 4 x 9
   name `Quiz 1` `Quiz 2` `Quiz 3` `Quiz 4` `Homework 1` `Homework 2` `Homework 3` `Homework 4`
  <int>    <int>    <int>    <int>    <int>        <int>        <int>        <int> <lgl>       
1     1        1        1        1        1            1            1            1 NA          
2     2        2        2        2        2            2            2            2 NA          
3     3        3        3        3        3            3            3            3 NA          
4     4        4        4        4        4            4            4            4 NA     

df %>% pivot_longer(starts_with(c("Qui", "Hom")),
  names_to = c("Assignment_type", "number"),
  names_sep = " ",
  values_to = "score",
  values_drop_na = FALSE)

# A tibble: 32 x 4
    name Assignment_type number score
   <int> <chr>           <chr>  <int>
 1     1 Quiz            1          1
 2     1 Quiz            2          1
 3     1 Quiz            3          1
 4     1 Quiz            4          1
 5     1 Homework        1          1
 6     1 Homework        2          1
 7     1 Homework        3          1
 8     1 Homework        4         NA
 9     2 Quiz            1          2
10     2 Quiz            2          2
# … with 22 more rows

Hope this makes sense!

Longshot408 · December 2, 2020, 7:05pm

That works beautifully!! Thank you so much. Great explanation too

system · December 9, 2020, 7:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.