Find and correct filenames based on partial string

HABSaremyjam · August 10, 2022, 8:27pm

Hello !

I have a dataset of file names in which the date is nestled between other characters in a YYYYMMDD format. For example:

CBS20_WI-10005_V1_20200822_A_01, where 20200822 is the date.

There are some filenames, however, that have the date in a MMDDYYYY format. Is there any easy way to find the files that have the MMDDYYYY format and flip them to have a YYYYMMDD format? I have searched Google and am not finding an easy solution.

Here is a subset of my data:

structure(list(SITE_ID = c("NGL20_MI-10162", "NGL20_MI-10162", 
"NGL20_MI-10163", "NGL20_MI-10163", "NGL20_MN-10001", "NGL20_MN-10001", 
"NGL20_MN-10001", "NGL20_MN-10001", "NGL20_MN-10002", "NGL20_MN-10002", 
"NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", 
"NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", 
"NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10003", "NGL20_MN-10003"
), DATE_COLLECTED = structure(c(18430, 18430, 18441, 18441, 18864, 
18864, 18884, 18884, 18869, 18869, 18869, 18869, 18869, 18869, 
18885, 18885, 18885, 18885, 18885, 18885, 18862, 18862), class = "Date"), 
    VISIT_NO = c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 
    2, 2, 2, 2, 2, 1, 1), CREW = c("GL1", "GL1", "GL1", "GL1", 
    "MN3", "MN3", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", 
    "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN3", "MN3"
    ), FILE_NAME = c("NGL20_MI-10162_V1_20200617_A_01", "NGL20_MI-10162_V1_20200617_B_01", 
    "NGL20_MI-10163_V1_20200628_A_01", "NGL20_MI-10163_V1_20200628_B_01", 
    "NGL20_MN-10001_V1_20200825_A_01", "NGL20_MN-10001_V1_20210825_B_01", 
    "NGL20_MN-10001_V2_20210914_A_01", "NGL20_MN-10001_V2_20210914_B_01", 
    "NGL20_MN-10002_V1_08302021_B_01", "NGL20_MN-10002_V1_08302021_B_02", 
    "NGL20_MN-10002_V1_08302021_B_03", "NGL20_MN-10002_V1_20210830_A_01", 
    "NGL20_MN-10002_V1_20210830_A_02", "NGL20_MN-10002_V1_20210830_A_03", 
    "NGL20_MN-10002_V2_09152021_A_01", "NGL20_MN-10002_V2_09152021_A_02", 
    "NGL20_MN-10002_V2_09152021_A_03", "NGL20_MN-10002_V2_09152021_B_01", 
    "NGL20_MN-10002_V2_09152021_B_02", "NGL20_MN-10002_V2_09152021_B_03", 
    "NGL20_MN-10003_V1_08232021_A_01", "NGL20_MN-10003_V1_08232021_B_01"
    ), START_TIME = c("0.0243055555555556", "0.0263888888888889", 
    "0.0208333333333333", "0.0201388888888889", "0.0381944444444444", 
    "0.0381944444444444", "0.0534722222222222", "0.0541666666666667", 
    "0.0131944444444444", "0.0173611111111111", "0.0381944444444444", 
    "0.0243055555555556", "0.0243055555555556", "0.0347222222222222", 
    "0.0118055555555556", "0.0131944444444444", "0.0180555555555556", 
    "0.0152777777777778", "0.0173611111111111", "0.0243055555555556", 
    "0.0902777777777778", "0.0645833333333333"), END_TIME = c("0.102777777777778", 
    "0.0979166666666667", "0.0743055555555556", "0.0729166666666667", 
    "0.122916666666667", "0.122916666666667", "0.125", "0.125", 
    "0.145833333333333", "0.122916666666667", "0.114583333333333", 
    "0.159722222222222", "0.131944444444444", "0.118055555555556", 
    "0.131944444444444", "0.158333333333333", "0.18125", "0.131944444444444", 
    "0.159027777777778", "0.177083333333333", "0.234027777777778", 
    "0.208333333333333"), VIDEO_QUALITY = c("GOOD", "GOOD", "GOOD", 
    "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", 
    "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", 
    "GOOD", "GOOD", "GOOD"), VEGETATION = c("N", "N", "N", "N", 
    "N", "N", "N", "N", "Y", "Y", "Y", "N", "Y", "N", "Y", "Y", 
    "N", "Y", "Y", "Y", "N", "N"), MUSSELS = c("N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N"), MUSSEL_ABUNDANCE = c("N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N"), DOM_SUBSTRATE = c("H", 
    "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", 
    "H", "H", "H", "H", "H", "H", "H", "S", "S"), HUMAN_FEATURES = c("N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N"), FISH = c("N", 
    "N", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "Y", "Y"), GOBIES = c("N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "N", "N"), OTHER_FISH = c("N", 
    "N", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", 
    "N", "N", "N", "N", "N", "N", "N", "Y", "Y"), NOTES = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, "WATER MUST HAVE ENTERED CAMERA CASE, GETS BLURRY", 
    "WATER MUST HAVE ENTERED CAMERA CASE, GETS BLURRY", NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, "1 FISH", "SILVERY LONG FISH, ONLY WATCHED UNTIL 3:30 AS VIDEO FAR EXCEEDS 1 MIN ON SEDIMENT"
    ), COCR_COMMENTS = c("Correct", "Correct", "Correct", "Correct", 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA)), row.names = 586:607, class = "data.frame")

Thanks so much!

FJCC · August 10, 2022, 9:10pm

Here is one method that assumes that valid dates start with a 2 and incorrect dates start with a 1 or a 0. That is, no month number starts with 2.
The select() function at the end is there only to make it easy to compare the original file name and the new file name. You don't want that step in your actual code.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2

DF <- structure(list(SITE_ID = c("NGL20_MI-10162", "NGL20_MI-10162", 
                           "NGL20_MI-10163", "NGL20_MI-10163", "NGL20_MN-10001", "NGL20_MN-10001", 
                           "NGL20_MN-10001", "NGL20_MN-10001", "NGL20_MN-10002", "NGL20_MN-10002", 
                           "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", 
                           "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10002", 
                           "NGL20_MN-10002", "NGL20_MN-10002", "NGL20_MN-10003", "NGL20_MN-10003"
), 
DATE_COLLECTED = structure(c(18430, 18430, 18441, 18441, 18864, 
                             18864, 18884, 18884, 18869, 18869, 18869, 18869, 18869, 18869, 
                             18885, 18885, 18885, 18885, 18885, 18885, 18862, 18862), class = "Date"), 
VISIT_NO = c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 
             2, 2, 2, 2, 2, 1, 1), 
CREW = c("GL1", "GL1", "GL1", "GL1", 
         "MN3", "MN3", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", 
         "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN2", "MN3", "MN3"
), 
FILE_NAME = c("NGL20_MI-10162_V1_20200617_A_01", "NGL20_MI-10162_V1_20200617_B_01", 
              "NGL20_MI-10163_V1_20200628_A_01", "NGL20_MI-10163_V1_20200628_B_01", 
              "NGL20_MN-10001_V1_20200825_A_01", "NGL20_MN-10001_V1_20210825_B_01", 
              "NGL20_MN-10001_V2_20210914_A_01", "NGL20_MN-10001_V2_20210914_B_01", 
              "NGL20_MN-10002_V1_08302021_B_01", "NGL20_MN-10002_V1_08302021_B_02", 
              "NGL20_MN-10002_V1_08302021_B_03", "NGL20_MN-10002_V1_20210830_A_01", 
              "NGL20_MN-10002_V1_20210830_A_02", "NGL20_MN-10002_V1_20210830_A_03", 
              "NGL20_MN-10002_V2_09152021_A_01", "NGL20_MN-10002_V2_09152021_A_02", 
              "NGL20_MN-10002_V2_09152021_A_03", "NGL20_MN-10002_V2_09152021_B_01", 
              "NGL20_MN-10002_V2_09152021_B_02", "NGL20_MN-10002_V2_09152021_B_03", 
              "NGL20_MN-10003_V1_08232021_A_01", "NGL20_MN-10003_V1_08232021_B_01"
), 
START_TIME = c("0.0243055555555556", "0.0263888888888889", 
               "0.0208333333333333", "0.0201388888888889", "0.0381944444444444", 
               "0.0381944444444444", "0.0534722222222222", "0.0541666666666667", 
               "0.0131944444444444", "0.0173611111111111", "0.0381944444444444", 
               "0.0243055555555556", "0.0243055555555556", "0.0347222222222222", 
               "0.0118055555555556", "0.0131944444444444", "0.0180555555555556", 
               "0.0152777777777778", "0.0173611111111111", "0.0243055555555556", 
               "0.0902777777777778", "0.0645833333333333"), 
END_TIME = c("0.102777777777778", 
             "0.0979166666666667", "0.0743055555555556", "0.0729166666666667", 
             "0.122916666666667", "0.122916666666667", "0.125", "0.125", 
             "0.145833333333333", "0.122916666666667", "0.114583333333333", 
             "0.159722222222222", "0.131944444444444", "0.118055555555556", 
             "0.131944444444444", "0.158333333333333", "0.18125", "0.131944444444444", 
             "0.159027777777778", "0.177083333333333", "0.234027777777778", 
             "0.208333333333333"), 
VIDEO_QUALITY = c("GOOD", "GOOD", "GOOD", 
                  "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", 
                  "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", "GOOD", 
                  "GOOD", "GOOD", "GOOD"), 
VEGETATION = c("N", "N", "N", "N", 
               "N", "N", "N", "N", "Y", "Y", "Y", "N", "Y", "N", "Y", "Y", 
               "N", "Y", "Y", "Y", "N", "N"), 
MUSSELS = c("N", "N", "N", 
            "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
            "N", "N", "N", "N", "N", "N", "N"), 
MUSSEL_ABUNDANCE = c("N", 
                     "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
                     "N", "N", "N", "N", "N", "N", "N", "N", "N"), 
DOM_SUBSTRATE = c("H", 
                  "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", 
                  "H", "H", "H", "H", "H", "H", "H", "S", "S"), 
HUMAN_FEATURES = c("N", 
                   "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
                   "N", "N", "N", "N", "N", "N", "N", "N", "N"), 
FISH = c("N", 
         "N", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", 
         "N", "N", "N", "N", "N", "N", "N", "Y", "Y"), 
GOBIES = c("N", 
           "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
           "N", "N", "N", "N", "N", "N", "N", "N", "N"), 
OTHER_FISH = c("N", 
               "N", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", 
               "N", "N", "N", "N", "N", "N", "N", "Y", "Y"), 
NOTES = c(NA, 
          NA, NA, NA, NA, NA, NA, NA, "WATER MUST HAVE ENTERED CAMERA CASE, GETS BLURRY", 
          "WATER MUST HAVE ENTERED CAMERA CASE, GETS BLURRY", NA, NA, 
          NA, NA, NA, NA, NA, NA, NA, NA, "1 FISH", "SILVERY LONG FISH, ONLY WATCHED UNTIL 3:30 AS VIDEO FAR EXCEEDS 1 MIN ON SEDIMENT"
), 
COCR_COMMENTS = c("Correct", "Correct", "Correct", "Correct", 
                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
                  NA, NA, NA)), 
row.names = 586:607, class = "data.frame")

DF <- DF |> mutate(File_Name_new=str_replace(FILE_NAME,
                                                pattern = "([01]\\d)(\\d{2})(\\d{4})",
                                                replacement = "\\3\\1\\2")) |> 
  select(FILE_NAME,File_Name_new)
tail(DF)
#>                           FILE_NAME                   File_Name_new
#> 602 NGL20_MN-10002_V2_09152021_A_03 NGL20_MN-10002_V2_20210915_A_03
#> 603 NGL20_MN-10002_V2_09152021_B_01 NGL20_MN-10002_V2_20210915_B_01
#> 604 NGL20_MN-10002_V2_09152021_B_02 NGL20_MN-10002_V2_20210915_B_02
#> 605 NGL20_MN-10002_V2_09152021_B_03 NGL20_MN-10002_V2_20210915_B_03
#> 606 NGL20_MN-10003_V1_08232021_A_01 NGL20_MN-10003_V1_20210823_A_01
#> 607 NGL20_MN-10003_V1_08232021_B_01 NGL20_MN-10003_V1_20210823_B_01

^{Created on 2022-08-10 by the reprex package (v2.0.1)}

HABSaremyjam · August 11, 2022, 3:22pm

Wow, thank you! Can you explain the pattern and replacement code? For example, what does \d mean, why did you choose the numbers {2} and '{4}'. and what does the syntax \\ accomplish? Thank you SO much!

FJCC · August 11, 2022, 3:48pm

The pattern "([01]\\d)(\\d{2})(\\d{4})" is a regular expression that matches dates in the MMDDYYYY format. Regular expressions have many special characters to match particular kinds of text. In a regular expression, \d matches any numeric digit. In R functions, you have to write it as \\d. Let's not worry about why that is and just accept it.
Two other features of the pattern are [01] and numbers in curly braces. The expression [01] will match either 0 or 1. A number in braces, {n}, means that the preceding text should be matched n times. So, \\d{2} matches two numeric digits.
Finally each part of the pattern enclosed in parentheses forms a group that can be referred to in the replacement argument. Each group is numbered, so the first group, [01]\\d is referred to as \\1.
We can now interpret the pattern "([01]\\d)(\\d{2})(\\d{4})" like this:
Group 1 is formed by 0 or 1 followed by a digit, Group 2 is 2 digits and group 3 is 4 digits. Group 1 is the month, Group 2 is the day and Group 3 is the year.
The replacement argument is \\3\\1\\2, which is just the year month & day groups defined in the pattern.
This method relies on dates in the mmddyyy format having an initial digit of 0 or 1 while dates in the yyyymmdd format never starting with 0 or 1. If you have dates from last century, that will not work.

HABSaremyjam · August 12, 2022, 6:10pm

Wow, thank you for such a thorough explanation. I can definitely accept that. I understand that way better than when I tried to Google it. Thank you so much for your service!

system · August 19, 2022, 6:11pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.