How to fill missing dates in dataset?

hi, I have a following problem. I would like to fill missing rows in dataset per each date and each index.

df <- tibble(Date = c("01-01-2005", "01-04-2005", "01-04-2005"), Index = c("01","02","01"), Value = c(1,2,3) )

I tried this:


library(dplyr)
library(tsibble)

df %>%
  mutate(Date = mdy(Date)) %>% 
  as_tsibble(key = Index, index = Date) %>%
  fill_gaps(Value = 0, .full = TRUE) %>%
  arrange(Date, Index)

Unfortunately, result is:

# A tsibble: 4 x 3 [3D]
# Key:       Index [2]
  Date       Index Value
  <date>     <chr> <dbl>
1 2005-01-01 01        1
2 2005-01-01 02        0
3 2005-01-04 01        3
4 2005-01-04 02        2

But I need also dates 2005-01-02 and 2005-01-03. In other words, data should look like:

  Date       Index Value
  <date>     <chr> <dbl>
2005-01-01 01        1
2005-01-01 02        0
2005-01-02 01        0
2005-01-02 02        0
2005-01-03 01        0
2005-01-03 02        0
2005-01-04 01        3
2005-01-04 02        2

How can I fix it please? Thanks a lot!

Seems a use case for tidyr::full_seq?

Thanks for answer! Not sure, how you mean it. Could you give me an example, please? Ideally on my df, as I have provided.

For example like this (I don't know the package tsibble) :

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- tibble(
  Date = c("01-01-2005", "01-04-2005", "01-04-2005"), 
  Index = c("01","02","01"), 
  Value = c(1,2,3) 
  )  %>%
  mutate(Date = mdy(Date)) 

df2 = expand.grid(Date=tidyr::full_seq(df$Date,1),Index=unique(df$Index),Value=0)

df3 = rbind(df,df2) %>%
  group_by(Date,Index) %>%
  summarize(Value=sum(Value)) %>%
  ungroup()
#> `summarise()` regrouping output by 'Date' (override with `.groups` argument)

print(df3)
#> # A tibble: 8 x 3
#>   Date       Index Value
#>   <date>     <chr> <dbl>
#> 1 2005-01-01 01        1
#> 2 2005-01-01 02        0
#> 3 2005-01-02 01        0
#> 4 2005-01-02 02        0
#> 5 2005-01-03 01        0
#> 6 2005-01-03 02        0
#> 7 2005-01-04 01        3
#> 8 2005-01-04 02        2

Created on 2020-06-21 by the reprex package (v0.3.0)

1 Like

What do you mean by `summarize(Value=sum(Value)) ? The output is then just a number, isn't it?

Look:

df <- tibble(
+   Date = c("01-01-2005", "01-04-2005", "01-04-2005"), 
+   Index = c("01","02","01"), 
+   Value = c(1,2,3) 
+ )  %>%
+   mutate(Date = mdy(Date)) 
> 
> df2 = expand.grid(Date=tidyr::full_seq(df$Date,1),Index=unique(df$Index),Value=0)
> 
> df3 = rbind(df,df2) %>%
+   group_by(Date,Index) %>%
+   summarize(Value=sum(Value)) %>%
+   ungroup()
> 
> 
> print(df3)
  Value
1     6

How did you get your 8x3 table df3? Seems not obvious to me. Thanks for help!

The advantage of the 'pipe' (that is the use of the %>% construct) is that it very compact.
And that is also the main disadvantage :grinning:
To see what is actually happening just break the flow of the pipe in its parts again and show the results of each part:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- tibble(
  Date = c("01-01-2005", "01-04-2005", "01-04-2005"), 
  Index = c("01","02","01"), 
  Value = c(1,2,3) 
  )  %>%
  mutate(Date = mdy(Date)) 
print(df)
#> # A tibble: 3 x 3
#>   Date       Index Value
#>   <date>     <chr> <dbl>
#> 1 2005-01-01 01        1
#> 2 2005-01-04 02        2
#> 3 2005-01-04 01        3

# all dates between first and last (inclusive first and last)
all_dates = tidyr::full_seq(df$Date,1)

# df2 will contain all combinations of dates and indices and 0
df2 = expand.grid(Date=all_dates,
                  Index=unique(df$Index),
                  Value=0)
print(df2)
#>         Date Index Value
#> 1 2005-01-01    01     0
#> 2 2005-01-02    01     0
#> 3 2005-01-03    01     0
#> 4 2005-01-04    01     0
#> 5 2005-01-01    02     0
#> 6 2005-01-02    02     0
#> 7 2005-01-03    02     0
#> 8 2005-01-04    02     0

# make new data.frame with rows from df AND df2 combined
# so all possible combinations and the original data
# Date/Index combinations from original data also occur in df3 with a 0 Value
# Therefore we add original Value and the zero Value per Date/Index combination
# That is done in the steps after the rbind 
# NB I assume that you need only one row per Date/Index
df3 = rbind(df,df2)  
print(df3)
#> # A tibble: 11 x 3
#>    Date       Index Value
#>    <date>     <chr> <dbl>
#>  1 2005-01-01 01        1
#>  2 2005-01-04 02        2
#>  3 2005-01-04 01        3
#>  4 2005-01-01 01        0
#>  5 2005-01-02 01        0
#>  6 2005-01-03 01        0
#>  7 2005-01-04 01        0
#>  8 2005-01-01 02        0
#>  9 2005-01-02 02        0
#> 10 2005-01-03 02        0
#> 11 2005-01-04 02        0

df3 = group_by(df3,Date,Index) # indicate we want to group on Date and Index fields
print(df3)
#> # A tibble: 11 x 3
#> # Groups:   Date, Index [8]
#>    Date       Index Value
#>    <date>     <chr> <dbl>
#>  1 2005-01-01 01        1
#>  2 2005-01-04 02        2
#>  3 2005-01-04 01        3
#>  4 2005-01-01 01        0
#>  5 2005-01-02 01        0
#>  6 2005-01-03 01        0
#>  7 2005-01-04 01        0
#>  8 2005-01-01 02        0
#>  9 2005-01-02 02        0
#> 10 2005-01-03 02        0
#> 11 2005-01-04 02        0

df3 = summarize(df3,Value=sum(Value)) # sum the Value field over the groups Date and Index
#> `summarise()` regrouping output by 'Date' (override with `.groups` argument)
print(df3)
#> # A tibble: 8 x 3
#> # Groups:   Date [4]
#>   Date       Index Value
#>   <date>     <chr> <dbl>
#> 1 2005-01-01 01        1
#> 2 2005-01-01 02        0
#> 3 2005-01-02 01        0
#> 4 2005-01-02 02        0
#> 5 2005-01-03 01        0
#> 6 2005-01-03 02        0
#> 7 2005-01-04 01        3
#> 8 2005-01-04 02        2

df3 = ungroup(df3)
print(df3)
#> # A tibble: 8 x 3
#>   Date       Index Value
#>   <date>     <chr> <dbl>
#> 1 2005-01-01 01        1
#> 2 2005-01-01 02        0
#> 3 2005-01-02 01        0
#> 4 2005-01-02 02        0
#> 5 2005-01-03 01        0
#> 6 2005-01-03 02        0
#> 7 2005-01-04 01        3
#> 8 2005-01-04 02        2

df4 = summarize(df3,Value=sum(Value)) # NB no grouping here
print(df4) # then we sum Values over all rows
#> # A tibble: 1 x 1
#>   Value
#>   <dbl>
#> 1     6

Created on 2020-06-21 by the reprex package (v0.3.0)

If this is not clear, then let me know.

1 Like

Thanks for helping me! However, I cannot replicate your solution. I am confused, it has never happened to me in Rstudio before. Look, what I see, when I copy and run your solution:

df3 = group_by(df3,Date,Index) # indicate we want to group on Date and Index fields
> print(df3)
# A tibble: 11 x 3
# Groups:   Date, Index [8]
   Date       Index Value
   <date>     <chr> <dbl>
 1 2005-01-01 01        1
 2 2005-01-04 02        2
 3 2005-01-04 01        3
 4 2005-01-01 01        0
 5 2005-01-02 01        0
 6 2005-01-03 01        0
 7 2005-01-04 01        0
 8 2005-01-01 02        0
 9 2005-01-02 02        0
10 2005-01-03 02        0
11 2005-01-04 02        0
> df3 = summarize(df3,Value=sum(Value)) # sum the Value field over the groups Date and Index
> #> `summarise()` regrouping output by 'Date' (override with `.groups` argument)
> print(df3)
  Value
1     6
> df3 = ungroup(df3)
> print(df3)
  Value
1     6
> df4 = summarize(df3,Value=sum(Value)) # NB no grouping here
> print(df4)
  Value
1     6

Until df3 = group_by(df3,Date,Index) # indicate we want to group on Date and Index fields is everything OK. Unfortunately then we got different outputs. How is that possible? :slightly_frowning_face:

Only possibility I see : we don't use exactly the same functions/packages.
The code I showed before was run in a reprex. That means that the code runs in a separate environment that is comparable with a restart of R.
I have run this again with an added call to the sessionInfo function. The output (of only that function) is included below.
Can you do the same:

  • restart your R session (e.g. in RStudio by clicking Session | Restart R)
  • do not run anything yet but open in an editor panel the code we are discussing
  • add the sessionInfo() line to end of the code
  • run the code
  • check if the output is now as expected (then something in your previous environment caused the error)
  • compare the versions of the packages you used with those of mine (ignore the packages that are only in my sessionInfo because I used more packages while making the reprex)

?

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] lubridate_1.7.8 tidyr_1.1.0     tibble_3.0.1    dplyr_1.0.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6     knitr_1.28       magrittr_1.5     tidyselect_1.1.0
#>  [5] R6_2.4.1         rlang_0.4.6      fansi_0.4.1      stringr_1.4.0   
#>  [9] highr_0.8        tools_4.0.0      xfun_0.13        utf8_1.1.4      
#> [13] cli_2.0.2        htmltools_0.4.0  ellipsis_0.3.0   assertthat_0.2.1
#> [17] yaml_2.2.1       digest_0.6.25    lifecycle_0.2.0  crayon_1.3.4    
#> [21] purrr_0.3.4      vctrs_0.3.0      glue_1.4.1       evaluate_0.14   
#> [25] rmarkdown_2.1    stringi_1.4.6    compiler_4.0.0   pillar_1.4.3    
#> [29] generics_0.0.2   pkgconfig_2.0.3
1 Like

Hi, it finally works. I need to restart the R session before each code running. Thanks a lot for your help!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.