The advantage of the 'pipe' (that is, the %>% construct) is that it is very compact.
That is also its main disadvantage.
To see what is actually happening, break the pipe into its parts again and show the result of each part:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- tibble(
  Date = c("01-01-2005", "01-04-2005", "01-04-2005"),
  Index = c("01", "02", "01"),
  Value = c(1, 2, 3)
) %>%
  mutate(Date = mdy(Date))
print(df)
#> # A tibble: 3 x 3
#> Date Index Value
#> <date> <chr> <dbl>
#> 1 2005-01-01 01 1
#> 2 2005-01-04 02 2
#> 3 2005-01-04 01 3
# all dates between first and last (first and last included)
all_dates = tidyr::full_seq(df$Date, 1)
# df2 will contain all combinations of dates and indices, each with Value 0
df2 = expand.grid(Date = all_dates,
                  Index = unique(df$Index),
                  Value = 0)
print(df2)
#> Date Index Value
#> 1 2005-01-01 01 0
#> 2 2005-01-02 01 0
#> 3 2005-01-03 01 0
#> 4 2005-01-04 01 0
#> 5 2005-01-01 02 0
#> 6 2005-01-02 02 0
#> 7 2005-01-03 02 0
#> 8 2005-01-04 02 0
# make new data.frame with rows from df AND df2 combined
# so all possible combinations and the original data
# Date/Index combinations from original data also occur in df3 with a 0 Value
# Therefore we add original Value and the zero Value per Date/Index combination
# That is done in the steps after the rbind
# NB I assume that you need only one row per Date/Index
df3 = rbind(df,df2)
print(df3)
#> # A tibble: 11 x 3
#> Date Index Value
#> <date> <chr> <dbl>
#> 1 2005-01-01 01 1
#> 2 2005-01-04 02 2
#> 3 2005-01-04 01 3
#> 4 2005-01-01 01 0
#> 5 2005-01-02 01 0
#> 6 2005-01-03 01 0
#> 7 2005-01-04 01 0
#> 8 2005-01-01 02 0
#> 9 2005-01-02 02 0
#> 10 2005-01-03 02 0
#> 11 2005-01-04 02 0
df3 = group_by(df3,Date,Index) # indicate we want to group on Date and Index fields
print(df3)
#> # A tibble: 11 x 3
#> # Groups: Date, Index [8]
#> Date Index Value
#> <date> <chr> <dbl>
#> 1 2005-01-01 01 1
#> 2 2005-01-04 02 2
#> 3 2005-01-04 01 3
#> 4 2005-01-01 01 0
#> 5 2005-01-02 01 0
#> 6 2005-01-03 01 0
#> 7 2005-01-04 01 0
#> 8 2005-01-01 02 0
#> 9 2005-01-02 02 0
#> 10 2005-01-03 02 0
#> 11 2005-01-04 02 0
df3 = summarize(df3,Value=sum(Value)) # sum the Value field over the groups Date and Index
#> `summarise()` regrouping output by 'Date' (override with `.groups` argument)
print(df3)
#> # A tibble: 8 x 3
#> # Groups: Date [4]
#> Date Index Value
#> <date> <chr> <dbl>
#> 1 2005-01-01 01 1
#> 2 2005-01-01 02 0
#> 3 2005-01-02 01 0
#> 4 2005-01-02 02 0
#> 5 2005-01-03 01 0
#> 6 2005-01-03 02 0
#> 7 2005-01-04 01 3
#> 8 2005-01-04 02 2
df3 = ungroup(df3)
print(df3)
#> # A tibble: 8 x 3
#> Date Index Value
#> <date> <chr> <dbl>
#> 1 2005-01-01 01 1
#> 2 2005-01-01 02 0
#> 3 2005-01-02 01 0
#> 4 2005-01-02 02 0
#> 5 2005-01-03 01 0
#> 6 2005-01-03 02 0
#> 7 2005-01-04 01 3
#> 8 2005-01-04 02 2
df4 = summarize(df3,Value=sum(Value)) # NB no grouping here
print(df4) # then we sum Values over all rows
#> # A tibble: 1 x 1
#> Value
#> <dbl>
#> 1 6
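For comparison, the separate steps above can be chained back into a single pipe. This is only a sketch of what such a compact version might look like, not necessarily the pipe from the earlier post. Two things are my own choices here: stringsAsFactors = FALSE in expand.grid (so Index stays a character column) and .groups = "drop" in summarise (which replaces the separate ungroup step).

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Same input data as in the step-by-step version
df <- tibble(
  Date  = c("01-01-2005", "01-04-2005", "01-04-2005"),
  Index = c("01", "02", "01"),
  Value = c(1, 2, 3)
) %>%
  mutate(Date = mdy(Date))

# One pipe: build the zero-filled grid, append it to the original rows,
# then sum Value per Date/Index combination
result <- expand.grid(
  Date  = full_seq(df$Date, 1),
  Index = unique(df$Index),
  Value = 0,
  stringsAsFactors = FALSE  # assumption: keep Index as character
) %>%
  rbind(df, .) %>%
  group_by(Date, Index) %>%
  summarise(Value = sum(Value), .groups = "drop")

print(result)
```

Each line of the pipe corresponds to one of the intermediate objects (df2, df3) printed above, which is why breaking the pipe apart is a convenient way to debug it.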
Thanks for helping me! However, I cannot replicate your solution. I am confused; this has never happened to me in RStudio before. Here is what I see when I copy and run your solution:
df3 = group_by(df3,Date,Index) # indicate we want to group on Date and Index fields
> print(df3)
# A tibble: 11 x 3
# Groups: Date, Index [8]
Date Index Value
<date> <chr> <dbl>
1 2005-01-01 01 1
2 2005-01-04 02 2
3 2005-01-04 01 3
4 2005-01-01 01 0
5 2005-01-02 01 0
6 2005-01-03 01 0
7 2005-01-04 01 0
8 2005-01-01 02 0
9 2005-01-02 02 0
10 2005-01-03 02 0
11 2005-01-04 02 0
> df3 = summarize(df3,Value=sum(Value)) # sum the Value field over the groups Date and Index
> #> `summarise()` regrouping output by 'Date' (override with `.groups` argument)
> print(df3)
Value
1 6
> df3 = ungroup(df3)
> print(df3)
Value
1 6
> df4 = summarize(df3,Value=sum(Value)) # NB no grouping here
> print(df4)
Value
1 6
Everything is OK up to df3 = group_by(df3,Date,Index). Unfortunately, after that we get different outputs. How is that possible?
The only possibility I see: we are not using exactly the same functions/packages.
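One way to check this is to ask R where the summarize it finds actually comes from. The sketch below uses only base R functions (environment, environmentName, find) plus dplyr. It is an assumption on my part, not something confirmed in this thread, that another attached package exporting its own summarize (plyr is a well-known example) is masking the dplyr version.

```r
library(dplyr)

# Namespace of the `summarize` that R currently finds on the search path;
# in a clean session with only dplyr attached this is "dplyr"
print(environmentName(environment(summarize)))

# Every attached package/environment that defines an object called `summarize`
print(find("summarize"))

# Calling the functions with an explicit namespace prefix avoids any masking
df <- tibble(
  Date  = as.Date(c("2005-01-01", "2005-01-04", "2005-01-04")),
  Index = c("01", "02", "01"),
  Value = c(1, 2, 3)
)
out <- dplyr::summarise(
  dplyr::group_by(df, Date, Index),
  Value = sum(Value),
  .groups = "drop"
)
print(out)
```

If find("summarize") lists more than one package, the first one in the list is the one that gets called when no prefix is used.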
The code I showed before was run as a reprex. That means the code runs in a separate environment, comparable to a freshly restarted R session.
I have run it again with an added call to the sessionInfo function. The output (of only that function) is included below.
Can you do the same:
Restart your R session (e.g. in RStudio by clicking Session | Restart R).
Do not run anything yet, but open the code we are discussing in an editor panel.
Add a sessionInfo() line to the end of the code.
Run the code.
Check whether the output is now as expected (if so, something in your previous environment caused the error).
Compare the versions of the packages you used with mine (ignore the packages that appear only in my sessionInfo, because I used more packages while making the reprex).
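The lines to append to the end of the script could look like the sketch below. sessionInfo() is the call requested above; the packageVersion() and search() lines are optional extras I added to make the comparison quicker, and the list of package names is simply the four packages loaded in the code of this thread.

```r
# Versions of the packages used in the code under discussion
pkgs <- c("dplyr", "tibble", "tidyr", "lubridate")
versions <- sapply(pkgs, function(p) as.character(packageVersion(p)))
print(versions)

# Attached packages in search order; anything unexpected here
# (loaded before the restart or from a startup file) is a suspect
print(search())

# Full report: R version, platform, attached and loaded packages
sessionInfo()
```

Run this in the freshly restarted session, so that the output reflects only what the script itself loads.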