summarize() - different outputs, but same data

Hi, I have the following issue. When I run the summarize() function from the dplyr package, I get two different results from the same code. See below:

Case 1:

library(dplyr)
library(tibble)
library(tidyr)
library(lubridate)


df <- tibble(
  Date = c("01-01-2005", "01-04-2005", "01-04-2005"), 
  Index = c("01","02","01"), 
  Value = c(1,2,3) 
)  %>%
  mutate(Date = mdy(Date)) 

all_dates = tidyr::full_seq(df$Date,1)

df2 = expand.grid(Date=all_dates,
                  Index=unique(df$Index),
                  Value=0)

df3 = bind_rows(df,df2)  

df3 = group_by(df3,Date,Index) 

df3 = summarize(df3,Value=sum(Value))
print(df3)

  Value
1     6

However, when I restart the R session (Session -> Restart R in RStudio) and run the same code again, I get:

Case 2:

df3 = summarize(df3,Value=sum(Value))
print(df3)

# A tibble: 8 x 3
# Groups:   Date [4]
  Date       Index Value
  <date>     <chr> <dbl>
1 2005-01-01 01        1
2 2005-01-01 02        0
3 2005-01-02 01        0
4 2005-01-02 02        0
5 2005-01-03 01        0
6 2005-01-03 02        0
7 2005-01-04 01        3
8 2005-01-04 02        2

How is this possible? The desired output is the second one, i.e. not a single number but the 8 x 3 table. Thanks for your help!

Your code doesn't provide df2. Also, all_dates is created but never used.

Thanks! Now it should be OK.

After creating df3 and grouping it, you probably ran summarise several times in a row, each time on the output of the previous summarise. By default, summarise drops the last grouping column from the grouping of its output, so each run summarised with one fewer grouping column than the run before. In other words, after creating and grouping df3, you probably ran the following code three times:

df3 = summarize(df3,Value=sum(Value))

Each time, df3 was reassigned to the new result. The code below shows what happened by saving the result of each successive summarise under a new name. Note how summarise prints a message each time it drops the last grouping column. Also note that after running summarise three times, we get the output that surprised you.

library(dplyr)
library(tibble)
library(tidyr)
library(lubridate)

# Create starting data frame --------------------------------------------
df <- tibble(
  Date = c("01-01-2005", "01-04-2005", "01-04-2005"), 
  Index = c("01","02","01"), 
  Value = c(1,2,3) 
)  %>%
  mutate(Date = mdy(Date)) 

all_dates = tidyr::full_seq(df$Date,1)

df2 = expand.grid(Date=all_dates,
                  Index=unique(df$Index),
                  Value=0)

df3 = bind_rows(df,df2)  

df3 = group_by(df3,Date,Index) 
# Run summarise three times --------------------------------------------

# FIRST summarise: After running summarise, df3 is now grouped only by Date
df3 = summarize(df3,Value=sum(Value))
#> `summarise()` regrouping output by 'Date' (override with `.groups` argument)
df3
#> # A tibble: 8 x 3
#> # Groups:   Date [4]
#>   Date       Index Value
#>   <date>     <chr> <dbl>
#> 1 2005-01-01 01        1
#> 2 2005-01-01 02        0
#> 3 2005-01-02 01        0
#> 4 2005-01-02 02        0
#> 5 2005-01-03 01        0
#> 6 2005-01-03 02        0
#> 7 2005-01-04 01        3
#> 8 2005-01-04 02        2

# SECOND summarise: After running summarise, df4 is now not grouped at all
df4 = summarize(df3,Value=sum(Value))
#> `summarise()` ungrouping output (override with `.groups` argument)
df4
#> # A tibble: 4 x 2
#>   Date       Value
#>   <date>     <dbl>
#> 1 2005-01-01     1
#> 2 2005-01-02     0
#> 3 2005-01-03     0
#> 4 2005-01-04     5

# THIRD summarise
df5 = summarize(df4,Value=sum(Value))
df5
#> # A tibble: 1 x 1
#>   Value
#>   <dbl>
#> 1     6
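
One way to make the result independent of how many times the line is run is sketched below. It assumes dplyr 1.0.0 or later (where summarise has the `.groups` argument) and uses a small stand-in tibble called `dat` rather than your actual df3; the names `dat`, `grouped`, `once` and `result` are only for illustration. The sketch checks the grouping with group_vars() after each step and stores the summary under a new name instead of overwriting the grouped input.

```r
library(dplyr)
library(tibble)

# Small stand-in for the combined data (illustrative values, not the real df3)
dat <- tibble(
  Date  = as.Date(c("2005-01-01", "2005-01-01", "2005-01-04", "2005-01-04")),
  Index = c("01", "02", "01", "02"),
  Value = c(1, 0, 3, 2)
)

# Keep the grouped input under its own name and never overwrite it
grouped <- group_by(dat, Date, Index)
group_vars(grouped)

# Default behaviour: the last grouping column (Index) is dropped from the grouping
once <- summarize(grouped, Value = sum(Value))
group_vars(once)

# .groups = "drop" returns an ungrouped tibble; because `grouped` is untouched,
# re-running this line always gives the same one-row-per-Date-and-Index table
result <- summarize(grouped, Value = sum(Value), .groups = "drop")
group_vars(result)
result
```

Note that `.groups = "drop"` on its own would not have prevented the surprise: if the result is assigned back to the same name and the line is run again, the second run still collapses everything to a single row. The part that matters is not overwriting the grouped input.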
