Using dplyr distinct on counts of dates with NA values

I know there should be a simple solution to this, but I can't seem to figure this bit out. What I want to is to try to summarize or group by the date in a data frame and having the total values in each count column to be aligned to each of the dates. See code below.


data <- data.frame(letter = c("a", "b", "c", "d", "e"), 
                   date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"), 
                   count = c(1,NA, 3, 2, NA), 
                   count2 = c(2, 1, NA, NA, 2)
                   )

print(data)
#   letter       date count count2
# 1      a 2020-01-01     1      2
# 2      b 2020-01-02    NA      1
# 3      c 2020-01-02     3     NA
# 4      d 2020-01-03     2     NA
# 5      e 2020-01-03    NA      2


data2 <- data%>%
  distinct(date, .keep_all = TRUE) %>%
  arrange(date)

print(data2)
#   letter       date count count2
# 1      a 2020-01-01     1      2
# 2      b 2020-01-02    NA      1
# 3      d 2020-01-03     2     NA

#What it should look like

#   letter       date count count2
# 1      a 2020-01-01     1      2
# 2      b 2020-01-02     3      1
# 3      d 2020-01-03     2      2

It looks like what distinct is doing is selecting the first date row and only using that value in the count fields, but I can't seem to figure out where to get the summation. Thanks for your help.

That seems exactly what you want to do:

data %>%
  group_by(date) %>%
  summarize(count = sum(count, na.rm = TRUE),
            count2 = sum(count2, na.rm = TRUE))
#   date       count count2
#   <chr>      <dbl>  <dbl>
# 1 2020-01-01     1      2
# 2 2020-01-02     3      1
# 3 2020-01-03     2      2

Indeed, distinct(date) will only keep the first row of each distinct date.

1 Like

Thank you so much. I knew I was overlooking something and it had to be an easy solution. Thank you again.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.