Frustation with messy string data - Need to extract all patterns into new columns

I currently have a column called MDRLOGD_MESSAGE that contains so much information. This one column needs to split in so many ways. Here is a snippet of the data.
df <- data.frame(MDRLOGD_RUN_DATE = Sys.Date() - 1:4,
MDRLOGD_MESSAGE = c('Mapping completed successfully at: 31-MAY-2022 01:44:31 AM','Elapsed time: 2 seconds','Selected: 0 -- Inserted: 0 -- Updated: 0 -- Deleted: 0 -- Errors: 0','Running mapping: DELETE_MST_FACULTY_ATTRIBUTE'))

I would like the separate the numeric information along with its appropriate string information.
df %>%
mutate (Elapsed_Time = ifelse(str_detect(MDRLOGD_MESSAGE, "Elapsed time:"),
str_extract(MDRLOGD_MESSAGE,'(\d\s\b\w+)'), 0), Record_Metric_Name = ifelse(str_detect(MDRLOGD_MESSAGE,"Selected:"), str_extract_all(MDRLOGD_MESSAGE, '(\\b\\w+)'), "NULL"),
Record_Metric_Number = ifelse(str_detect(MDRLOGD_MESSAGE,"Selected:" ),
str_extract_all('\d\s\b\w+&')
Mapping_Run_Date = ifelse(str_detect(MDRLOGD_MESSAGE, "Mapping completed successfully at:"),
str_extract(MDRLOGD_MESSAGE, '?=(:)'),"Null"))

So far, I came up with this to try to manipulate with only successfully splitting up the Elapsed time into its own column. I get an error with the rest: Error in ifelse(., str_detect(MDRLOGD_MESSAGE, "Selected:"), strsplit(MDRLOGD_MESSAGE, :
'list' object cannot be coerced to type 'logical'
Specifically, I would like to separate the 'Selected: 0 -- Inserted: 0 -- Updated: 0 -- Deleted: 0 -- Errors: 0" into their own columns so I can sum up inserted and updated separately by date.

Any help will be appreciated

See if this helps

# row contains a variable, here called target, that contains
# 4 vectors of strings
(target = c('Mapping completed successfully at: 31-MAY-2022 01:44:31 AM','Elapsed time: 2 seconds','Selected: 0 -- Inserted: 0 -- Updated: 0 -- Deleted: 0 -- Errors: 0','Running mapping: DELETE_MST_FACULTY_ATTRIBUTE'))
#> [1] "Mapping completed successfully at: 31-MAY-2022 01:44:31 AM"         
#> [2] "Elapsed time: 2 seconds"                                            
#> [3] "Selected: 0 -- Inserted: 0 -- Updated: 0 -- Deleted: 0 -- Errors: 0"
#> [4] "Running mapping: DELETE_MST_FACULTY_ATTRIBUTE"

# target[1] contains information success/fail and date time
(outcome <- gsub(" at.*$","",target[1]))
#> [1] "Mapping completed successfully"
(finish  <- gsub(".*at: ","",target[1]) |> lubridate::dmy_hms(x = _))
#> [1] "2022-05-31 01:44:31 UTC"
# target[2] contains a numeric value to be extracted
(elapsed <- gsub("[^.0-9]","",target[2]))
#> [1] "2"
# target[3] contains a classification of mappings
split_on_colon <- function(x) strsplit(x,": ")
(results <- strsplit(target[3]," -- "))
#> [[1]]
#> [1] "Selected: 0" "Inserted: 0" "Updated: 0"  "Deleted: 0"  "Errors: 0"
(results <- lapply(results,split_on_colon) |> unlist(x = _))
#>  [1] "Selected" "0"        "Inserted" "0"        "Updated"  "0"       
#>  [7] "Deleted"  "0"        "Errors"   "0"
(result_types <- results[1:length(results) %% 2 != 0])
#> [1] "Selected" "Inserted" "Updated"  "Deleted"  "Errors"
(result_counts <- as.numeric(results[1:length(results) %% 2 == 0]))
#> [1] 0 0 0 0 0
# target[4] contains the mapping used
(mapping <- gsub("^.*: ","",target[4]))
#> [1] "DELETE_MST_FACULTY_ATTRIBUTE"

Created on 2023-01-02 with reprex v2.0.2

Hi @momoandi,
I went a little further than @technocrat and tried to bring all the required output together using dplyr verbs. I also assumed that the run_date was the same for the 4 lines in the dataframe:

suppressPackageStartupMessages(library(tidyverse))

df <- data.frame(MDRLOGD_RUN_DATE = as.Date("2022-06-10"),
                 MDRLOGD_MESSAGE = c('Mapping completed successfully at: 31-MAY-2022 01:44:31 AM',
                                     'Elapsed time: 2 seconds',
                                     'Selected: 0 -- Inserted: 0 -- Updated: 0 -- Deleted: 0 -- Errors: 0',
                                     'Running mapping: DELETE_MST_FACULTY_ATTRIBUTE'))

df %>% 
  mutate(line_index = 1:4) %>% 
  group_by(MDRLOGD_RUN_DATE) %>% 
  pivot_wider(names_from=line_index, values_from = MDRLOGD_MESSAGE) %>% 
  rename(completed = `1`, elapsed = `2`, action = `3`, mapping = `4`) %>% 
  mutate(completed = as.POSIXct(str_remove(completed, "Mapping completed successfully at: "),
                                format="%d-%b-%Y %H:%M:%S %p")) %>% 
  mutate(elapsed = str_remove(elapsed, "Elapsed time: ")) %>%
  mutate(mapping = str_remove(mapping, "Running mapping: ")) %>% 
  separate(action, into=c("selected", "inserted", "updated", "deleted", "errors"), sep="--") %>% 
  map_at(., .at=c("selected", "inserted", "updated", "deleted", "errors"), parse_number) %>% 
  as.data.frame()
#>   MDRLOGD_RUN_DATE           completed   elapsed selected inserted updated
#> 1       2022-06-10 2022-05-31 01:44:31 2 seconds        0        0       0
#>   deleted errors                      mapping
#> 1       0      0 DELETE_MST_FACULTY_ATTRIBUTE

Created on 2023-01-03 with reprex v2.0.2

Hope this helps.

2 Likes

Hi @DavoWW ,

I appreciate this solution because it uses tidyverse which I am familiar with. I will try this solution to see if this works. Yes, you're assumptions are correct. It is the same date for every row. That was my mistake. Thank you so much.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.