I am trying to clean up the "Transaction Description" of my banking history so they can be grouped by store rather than having 15 different vendors like "Amazon.ca #5411", "Amazon.ca HH#523432" and so on.
Using str_sub, I've managed to clean up a large percentage of the them, but a less-than-brute-force, line-by-line (or vendor-by-vendor) solution for the balance is eluding me. The data is in character format, with extra whitespaces, numbers and symbols with no specific format.
SP * MARILLA WALKER PA
CALG CO-OP GAS
NATIONAL OILWEL PAY
ONCE UPON A CHILD CALG
MCDONALD'S #919 _F
OLD COUNTRY MAR
TELUS MOBILITY PREAUTH
BEACON MARTINIZ _F
CHAPTERS 964 _F
CHASIN TAILS _F
The end goal of course is to have something that can work with new data without having to be re-written for that specific vendor each time one shows up in the records.
To pair down the inter-account transactions:
df_clean <- df_raw %>% mutate( Transaction = case_when( Transaction %like% "TFR-" ~ str_sub( Transaction, start=6), TRUE ~ Transaction), Transaction = case_when( Transaction %like% "^TD ATM" ~ str_sub( Transaction, start = 4, end = 10), TRUE ~Transaction))
It does the trick for those items, but doesn't strike me as particularly flexible and not very useful for thousands of records like the above.