How to subset data to calendar year according to the longest continuous time length

Hi all,
Following the former post here, I have another question.

I summarized the datasets to the longest continuous days of data, such as df.summary. Each dataset (df1, df2, df3, etc.) has four columns: Year, Month, Day, Value (total/day). How to subset each dataset according to the longest continuous time length, but also a whole calendar year-from January 1st to December 31st. For example, df1 should be subsetted to 1/1/2002 to 12/31/2010. The result will be one year less, but it is the desired output I'd prefer to get. Then I want to calculate the total values in each calendar year. I am wondering how to write it in a loop as there are many datasets. Each dataset has the name df1.xlsx, df2.xlsx, etc. Thanks for your help.

df.summary

ID   Length  Start       End
df1   10     1/10/2001  1/9/2011
df2   4      8/13/2001  8/12/2005
df3   13     3/21/1982  8/5/1995

df1

Year  Month Day  Value (total/day)
1985    1       1         10
1985    1       2         12
1985    1       5         11.8
1985    1       6         NA
1985    1       7         21
...
2012    12     30       22
2012    12     31       20

Your sample data is not complete enough to give you a specific solution, but this is how I would approach the task.

First of all, I wouldn't use a loop, I would read all the .xlsx files into a single data frame using something like this

library(tidyverse)
library(readxl)

list_of_files <- list.files(path = "patath\to\your\files",
                            pattern = ".xlsx$",
                            full.names = TRUE)
all_df <- list_of_files %>%
    set_names() %>% 
    map_dfr(read_xlsx, .id = "ID")

Then, I would perform the filtering over all the data at once with something like this example

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

df.summary <- data.frame(stringsAsFactors=FALSE,
                         ID = c("df1", "df2", "df3"),
                         Length = c(10, 4, 13),
                         Start = c("1/10/2001", "8/13/2001", "3/21/1982"),
                         End = c("1/9/2011", "8/12/2005", "8/5/1995")
)

all_df <- data.frame(
    ID = c("df3", "df3", "df3", "df2", "df2", "df1", "df1"),
    Year = c(1985, 1985, 1985, 1985, 1985, 2009, 2009),
    Month = c(1, 1, 1, 1, 1, 12, 12),
    Day = c(1, 2, 5, 6, 7, 30, 31),
    Value_total_day = c(10, 12, 11.8, NA, 21, 22, 20)
)


df.summary %>% 
    mutate_at(vars(Start, End), mdy) %>% 
    mutate(Start = ceiling_date(Start, unit = "years"),
           End = floor_date(End, unit = "years") - 1) %>% 
    fuzzy_left_join(all_df %>% mutate(date = make_date(year = Year, month = Month, day = Day)),
                    by = c("Start" = "date", "End" = "date", "ID" = "ID"),
                    match_fun = c(`<=`, `>=`, `==`)) %>% 
    select(ID = ID.x, Year, Month, Day, Value_total_day)
#>    ID Year Month Day Value_total_day
#> 1 df1 2009    12  30            22.0
#> 2 df1 2009    12  31            20.0
#> 3 df2   NA    NA  NA              NA
#> 4 df3 1985     1   1            10.0
#> 5 df3 1985     1   2            12.0
#> 6 df3 1985     1   5            11.8

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.