Summarizing large dataframe by day, by type, and by specific timestamps. Why is my code dropping the date?

Hi there. I'm trying to summarize a very large dataframe so I can see the total quantity in each type at some specific timestamps, by day. I hope to use this data so I can do a comparison exercise later to see how today's quantity at 16:00 (by category, and also overall) compares to previous day's at 16:00 (etc).

The structure of the dataframe is like this:

df <- structure(list(date = structure(c(19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034), 
                                      class = "Date"),type = c("UKS", "USD", "UKS", "UKS", "USD", "USD", "UKS", "USD", "UKZ", "UKY"), 
                     time = structure(c(28793, 32403, 36003, 43203, 46803, 50404, 50408, 54011, 54014, 58815), units = "secs", class = c("hms", "difftime")), 
                     quantity = c(0.003, 0.007, 0.002, 0.001, 0.03, 0.123, 0.017, 0.019, 0.012, 0.01 ),cumvol = c(0.003, 0.01, 0.012, 0.013, 0.043, 0.166, 0.183, 0.202, 0.214, 0.224)), 
                class = "data.frame", row.names = c(NA, -10L))

And my attempt thus far is this, however I'm getting an error around the group_by(date) line.

library(tidyverse)
library(lubridate)
library(hms)

time_check <- c(13, 16, 18)

df %>% 
  mutate(time_hr = hour(time), .after = time) %>% 
  filter(time_hr %in% time_check) %>% 
  group_by(date, type, time_hr) %>% 
  summarize(cat_total = sum(quantity)) %>% 
  group_by(date) %>% 
  mutate(date_total = sum(cat_total)) %>% 
  ungroup()

The summarize line seems to be dropping the date and then throwing up the error:

Error: Must group by variables found in `.data`.
* Column `date` is not found.

Does anyone advice on how to fix this?

I can't reproduce your error. Seems to run OK for me.

df <- structure(list(date = structure(c(19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034, 19034), 
                                      class = "Date"),type = c("UKS", "USD", "UKS", "UKS", "USD", "USD", "UKS", "USD", "UKZ", "UKY"), 
                     time = structure(c(28793, 32403, 36003, 43203, 46803, 50404, 50408, 54011, 54014, 58815), units = "secs", class = c("hms", "difftime")), 
                     quantity = c(0.003, 0.007, 0.002, 0.001, 0.03, 0.123, 0.017, 0.019, 0.012, 0.01 ),cumvol = c(0.003, 0.01, 0.012, 0.013, 0.043, 0.166, 0.183, 0.202, 0.214, 0.224)), 
                class = "data.frame", row.names = c(NA, -10L))

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(hms)
#> 
#> Attaching package: 'hms'
#> The following object is masked from 'package:lubridate':
#> 
#>     hms

time_check <- c(13, 16, 18)

df %>% 
  mutate(time_hr = hour(time), .after = time) %>% 
  filter(time_hr %in% time_check) %>% 
  group_by(date, type, time_hr) %>% 
  summarize(cat_total = sum(quantity)) %>% 
  group_by(date) %>% 
  mutate(date_total = sum(cat_total)) %>% 
  ungroup()
#> `summarise()` has grouped output by 'date', 'type'. You can override using the `.groups` argument.
#> # A tibble: 2 x 5
#>   date       type  time_hr cat_total date_total
#>   <date>     <chr>   <int>     <dbl>      <dbl>
#> 1 2022-02-11 UKY        16      0.01       0.04
#> 2 2022-02-11 USD        13      0.03       0.04

Created on 2022-02-12 by the reprex package (v2.0.1)

I got an error but then restarted R & RStudio and it seems fine. Could you have something loaded that is interfering?

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hms_1.1.1       lubridate_1.8.0 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.1.1    
 [8] tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5   tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8        cellranger_1.1.0  pillar_1.6.5      compiler_4.1.2    dbplyr_2.1.1      tools_4.1.2      
 [7] jsonlite_1.7.3    lifecycle_1.0.1   gtable_0.3.0      pkgconfig_2.0.3   rlang_1.0.0       reprex_2.0.1     
[13] rstudioapi_0.13   DBI_1.1.1         cli_3.1.1         haven_2.4.3       xml2_1.3.3        withr_2.4.3      
[19] httr_1.4.2        fs_1.5.2          generics_0.1.1    vctrs_0.3.8       grid_4.1.2        tidyselect_1.1.1 
[25] glue_1.6.1        R6_2.5.1          fansi_1.0.2       readxl_1.3.1      tzdb_0.2.0        modelr_0.1.8     
[31] magrittr_2.0.2    backports_1.4.1   scales_1.1.1      ellipsis_0.3.2    rvest_1.0.2       assertthat_0.2.1 
[37] colorspace_2.0-2  utf8_1.2.2        stringi_1.7.6     munsell_0.5.0     broom_0.7.11.9000 crayon_1.4.2     

It's very strange.

If I run the whole codeset, top to bottom I get:

Error in [[<-.data.frame(*tmp*, col, value = 13L) :
replacement has 1 row, data has 0

If I run this block separately:

df %>% 
  mutate(time_hr = hour(time), .after = time) %>% 
  filter(time_hr %in% time_check) %>% 
  group_by(date, type, time_hr) %>% 
  summarize(cat_total = sum(quantity)) %>% 
  group_by(date) %>% 
  mutate(date_total = sum(cat_total)) %>% 
  ungroup()

I get the:

Error: Must group by variables found in `.data`.
* Column `date` is not found.

Weird eh?

It seems that the problem was caused by R defaulting to using:

plyr::summarize

Rather than:

dplyr::summarize

That simple change solved the issue.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.