Solving case sensitive mistypes

kbzsl · August 7, 2020, 4:35pm

I have to correct many upper/lower case mistypes in a fairly large data set, but simply converting everything to lower or upper case is not feasible because legibility would be lost.
My current solution is a chain of string replacements, one per line, but it is an awful solution (it was built up line by line while I was trying to identify all those errors in the data).

df <- df %>%
  mutate(name = str_replace(name, "(?i)AbcD", "AbcD")) %>% 
  mutate(name = str_replace(name, "(?i)DE", "DE")) %>%
  mutate(name = str_replace(name, "(?i)efghij", "efghij")) %>%   
  ...

Do you have any better, more elegant suggestion? Thank you.

technocrat · August 7, 2020, 8:37pm

Initial letter should be converted to uppercase
Initial letter should be converted to lowercase
Final letter should be converted to uppercase
Any cases involving interior letters

Unless there is a decision rule for classifying the cases, this can be done with a tokenized object and a hash table: for each element in the object, if it is in the hash, replace it with the corrected form stored there; otherwise do nothing. See the {hash} package.
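
A minimal sketch of the tokenize-and-look-up idea, assuming a base R environment as the hash table instead of the {hash} package. The names `fixes` and `fix_tokens` and the dictionary entries are made up for illustration, and the function handles a single string split on spaces:

```r
# Hypothetical hash: key = a mistyped form seen in the data, value = correct form.
fixes <- new.env(hash = TRUE)
fixes[["ABCD"]] <- "AbcD"
fixes[["abcd"]] <- "AbcD"
fixes[["De"]] <- "DE"

# Split one string into tokens, replace tokens found in the hash, rejoin.
fix_tokens <- function(x, fixes) {
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
  fixed <- vapply(tokens, function(tok) {
    # nzchar() guards against empty tokens, which exists() rejects.
    if (nzchar(tok) && exists(tok, envir = fixes, inherits = FALSE)) {
      get(tok, envir = fixes, inherits = FALSE)
    } else {
      tok  # not in the hash: leave unchanged
    }
  }, character(1), USE.NAMES = FALSE)
  paste(fixed, collapse = " ")
}

fix_tokens("correct ABCD correct De", fixes)
```

With exact-match keys like these, every observed mistype needs its own entry.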

kbzsl · August 8, 2020, 8:24am

Thank you. The hash is a good idea and I will delve more deeply into this topic.

The only issue is that matching the keys should be case insensitive, because I want to avoid entering all the different mistypes as dictionary keys. For example: if the correct typing is “cDeF”, I don’t want to specify all the different upper/lower case combinations in the dictionary.
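
One way to get case-insensitive keys, sketched here under the assumption that tokens are separated by single spaces: store each correct form under its lower-cased version and lower-case every token before the lookup. A named character vector stands in for the hash table, and `lookup` and `fix_case` are illustrative names only:

```r
# Only the correct spellings are listed; the keys are derived from them.
keywords <- c("cDeF", "AbcD")
lookup <- setNames(keywords, tolower(keywords))

# Vectorised over x: each string is split, looked up token by token, rejoined.
fix_case <- function(x, lookup) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    hit <- unname(lookup[tolower(tokens)])  # NA where the token is not a keyword
    paste(ifelse(is.na(hit), tokens, hit), collapse = " ")
  }, character(1))
}

fix_case(c("correct CDEF correct", "cdef abcd"), lookup)
```

Because `fix_case()` takes and returns a character vector, it should drop into a pipeline as `mutate(name = fix_case(name, lookup))`.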

And I would prefer a solution that integrates seamlessly into the tidy workflow.

kbzsl · August 8, 2020, 8:56am

Sorry, I forgot to mention that only some words in the original string variables have to be verified and changed; these are the keywords.
For example: “correct ABCD correct correct” should be “correct AbcD correct correct” where “AbcD” is a keyword.

I was thinking of converting everything to lowercase and putting the correctly cased keywords back when creating the reports, but considering the volumes, it is far more manageable to maintain a smaller dictionary covering only the keywords where mistypes were identified.

kbzsl · August 8, 2020, 6:19pm

I ended up with this solution, which corrects the upper/lower case of the keywords only:

library(tidyverse)

df = tibble(name = c("aaa aa aa DFGH aa",
                     "aa dfgh",
                     "QWER aaaa",
                     "a qwer a"))

keywords = c("dFgH",
             "QweR",
             "cVBn")

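# Replace every case-insensitive match of `keyword` in `column` with its correct spelling.
correct_case <- function(df, keyword, column){
  df %>% 
    mutate({{column}} := str_replace_all({{column}}, paste0("(?i)", keyword), keyword))
}

# Apply the correction for each keyword in turn, starting from the original df.
reduce(keywords, correct_case, .init = df, column = name)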
#> # A tibble: 4 x 1
#>   name             
#>   <chr>            
#> 1 aaa aa aa dFgH aa
#> 2 aa dFgH          
#> 3 QweR aaaa        
#> 4 a QweR a
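
One caveat with the pattern above: without word boundaries, a short keyword such as "DE" would also match inside longer words. A possible variant, sketched in base R so it has no package dependencies, wraps each keyword in `\\b`. The helper name `fix_keywords` is made up, and the sketch assumes keywords contain only letters and digits:

```r
# Assumption: keywords are purely alphanumeric, so they can be used unescaped
# both as a regex pattern and as a replacement string.
keywords <- c("dFgH", "QweR", "DE")

# Apply one whole-word, case-insensitive replacement per keyword, in sequence.
fix_keywords <- function(x, keywords) {
  Reduce(function(acc, kw) {
    gsub(paste0("\\b", kw, "\\b"), kw, acc, ignore.case = TRUE, perl = TRUE)
  }, keywords, x)
}

fix_keywords(c("aa DFGH aa", "a qwer a", "model de"), keywords)
```

Since the function maps a character vector to a character vector, it could be called inside `mutate()` in the same way as the `str_replace_all()` version.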
