How to select a file based on its last modified time?

Basically, we need to select the data from the most recent file.
However, the file names have a few discrepancies, so I am using stringr and rebus to clean them.

Can we use this information to select the most recent file for each month?

Here is a simplified reprex:

library(tidyverse)
library(rebus)

myfiles <- tribble(
  ~files,~last_modified,
  "file_2014_01.csv", "2019-07-17T14:00:20.000Z",
  "file_2014_01 ", "2019-07-17T14:00:21.000Z",
  "file_2014_01.csv", "2019-07-17T13:59:36.000Z",
  "file_2014_01fdn.csv", "2019-07-17T14:00:23.000Z",
  "file_2014_01.csv", "2019-07-17T14:00:11.000Z",
  "file_2014_01.csv", "2019-07-17T14:00:27.000Z", # Most recent
  "äsdfile_2014_03.csv", "2019-06-17T14:00:23.000Z",
  "qwerfile_2014_03 ", "2019-07-15T14:00:21.000Z",
  "file_2014_03.csv", "2019-01-17T13:59:36.000Z",
  "bfffile_2014_03fdn.csv", "2019-06-17T14:00:32.000Z",
  "cvfile_2014_03.csv", "2019-07-14T14:00:11.000Z",
  "uufile_2014_03.csv", "2019-2-17T15:00:23.000Z" # Most recent
)

# Select same months
to_group <- myfiles %>% select(files) %>% unlist() %>%
  str_extract(pattern = one_or_more(DGT) %R% ANY_CHAR %R%
                one_or_more(DGT))

# number of months to choose from
to_group %>% unique()

# How can we use this info to select the file from the myfiles ?

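Here is a solution that uses regular expressions instead of rebus.

Note: The row you marked "# Most recent" for the 2014_03 group (2019-2-17) is not actually the most recent datetime in that group, because it is in February; the latest one is 2019-07-15.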

library(tidyverse)
library(lubridate)

myfiles <- tribble(
  ~files,~last_modified,
  "file_2014_01.csv", "2019-07-17T14:00:20.000Z",
  "file_2014_01 ", "2019-07-17T14:00:21.000Z",
  "file_2014_01.csv", "2019-07-17T13:59:36.000Z",
  "file_2014_01fdn.csv", "2019-07-17T14:00:23.000Z",
  "file_2014_01.csv", "2019-07-17T14:00:11.000Z",
  "file_2014_01.csv", "2019-07-17T14:00:27.000Z", # Most recent
  "äsdfile_2014_03.csv", "2019-06-17T14:00:23.000Z",
  "qwerfile_2014_03 ", "2019-07-15T14:00:21.000Z",
  "file_2014_03.csv", "2019-01-17T13:59:36.000Z",
  "bfffile_2014_03fdn.csv", "2019-06-17T14:00:32.000Z",
  "cvfile_2014_03.csv", "2019-07-14T14:00:11.000Z",
  "uufile_2014_03.csv", "2019-2-17T15:00:23.000Z" # Most recent
)

myfiles %>% 
  mutate(group = str_extract(files, "\\d{4}.\\d{2}"),
         last_modified = ymd_hms(last_modified)) %>% 
  group_by(group) %>% 
  filter(last_modified == max(last_modified))
#> # A tibble: 2 x 3
#> # Groups:   group [2]
#>   files               last_modified       group  
#>   <chr>               <dttm>              <chr>  
#> 1 file_2014_01.csv    2019-07-17 14:00:27 2014_01
#> 2 "qwerfile_2014_03 " 2019-07-15 14:00:21 2014_03

Created on 2019-07-18 by the reprex package (v0.3.0)
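
As a possible variation on the approach above, here is a minimal sketch that uses `slice_max()` instead of `filter()`. It assumes dplyr 1.0.0 or later, which is newer than the versions used in this thread. The `files_clean` column is a hypothetical addition that holds a whitespace-trimmed copy of the name, so the original name stays available if it is needed to read the file. Only four rows of the sample data are used to keep it short.

```r
library(tidyverse)
library(lubridate)

# Four rows taken from the reprex above
myfiles <- tribble(
  ~files, ~last_modified,
  "file_2014_01.csv", "2019-07-17T14:00:20.000Z",
  "file_2014_01.csv", "2019-07-17T14:00:27.000Z",
  "qwerfile_2014_03 ", "2019-07-15T14:00:21.000Z",
  "uufile_2014_03.csv", "2019-2-17T15:00:23.000Z"
)

latest <- myfiles %>%
  mutate(
    files_clean = str_trim(files),                # hypothetical column: name without stray spaces
    group = str_extract(files, "\\d{4}.\\d{2}"),  # year and month, e.g. "2014_01"
    last_modified = ymd_hms(last_modified)        # parse the timestamps as UTC datetimes
  ) %>%
  group_by(group) %>%
  # keep exactly one row per group, even if two files share the same timestamp
  slice_max(last_modified, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```

The difference from `filter(last_modified == max(last_modified))` is in how ties are handled: `filter()` keeps every row that shares the maximum timestamp, while `slice_max(..., with_ties = FALSE)` returns a single row per group.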
