Merging character variables after their pre-cleaning

Slavek · March 26, 2021, 12:38pm

Hi R masters,
I have the following task.
This is an example of a typical df with comments:

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bbb","ccc",
          "ddd","eee","fff","ggg"),
  Name = c("xxx","xxx","yyy",
           "yyy","yyy","zzzz","abcde"),
  Q1 = c("None.",NA,
         "No comments related to this exercise","Na",
         "N/A","Interesting comment", "abc"),
  P2 = c("Nothing",
         "I have nothing in common","NA",NA,
         "Another comment","....?","xxxx"),
  Z3 = c("Service","All good",
         "aa","I don't know",
         "The final comment about that","Nothing.","na"),
  Q4 = c(2019,2020,2020,2019,
         2020,2021,2021)
)

I am planning to merge all character variables with more than 5 characters (in this example Q1, P2 and Z3) but before merging I need to do some data cleanup by:

Deleting comments specified in the blank_statement
Deleting meaningless comments with less than 3 characters (for example "abc")
Deleting comments with repeated characters (for example "xxxx")

Then I need to merge all these corrected comments into one new variable (all_comments).

I have done this:

blank_statements <- regex("None|None.|No\\scomments|No\\scomments.|NA|Nothing", ignore_case = TRUE)

merged.comments <- source %>%
              mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, "^.{1,5}$")) %>% # Removes comments of 5 characters or fewer
              mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, "(.)\\1{2,}")) %>% # Removes runs of 3 or more repeated characters
              mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, blank_statements)) %>% # Removes blank statements
              mutate(all_comments = paste(Q1,P2,Z3, sep="/"), # Merges comment variables
                     all_comments = str_remove_all(all_comments, "NA"), # Removes NAs
                     all_comments = str_remove_all(all_comments, "[:cntrl:]"), # Removes control characters like \n and \r
                     all_comments = str_replace_all(all_comments, "\\s\\s+", " "), # Collapses extra spaces
                     all_comments = str_replace_all(all_comments, "//+", "/")) # Removes duplicated /
merged.comments  

merged.comments$all_comments <- str_remove(merged.comments$all_comments, "/$") # Removes / in the end
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "^/") # Removes / in the beginning

but:

I don't know how to avoid specifying my character variables in all mutate steps. I believe I could use something like:

mutate_if(~is.character(.) && any(. > 5, na.rm = TRUE) %>%

but I don't know how

All blank statements are removed but even if they are part of sentences ("I have nothing in common" becomes "I have in common" and "The final comment about that" becomes "The fil comment about that"). They should be removed only if they are individual sentences (like "Nothing.")
This way of removing "/"

merged.comments$all_comments <- str_remove(merged.comments$all_comments, "/$") # Removes / in the end
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "^/") # Removes / in the beginning

is not really elegant but I don't know what else may be used

I think blank_statements should be simplified so that "No comment", "No comments" and "No comments." are covered by shorter code. Perhaps regex is the wrong tool and I should use something different?

Can you help?

andresrcs · March 26, 2021, 5:25pm

I think regex is an adequate tool for this task once you learn how to use it. You are supposed to describe a text pattern, not enumerate each possible variation of the text.

library(tidyverse)

source <- data.frame(
    stringsAsFactors = FALSE,
    URN = c("aaa","bbb","ccc",
            "ddd","eee","fff","ggg"),
    Name = c("xxx","xxx","yyy",
             "yyy","yyy","zzzz","abcde"),
    Q1 = c("None.",NA,
           "No comments related to this exercise","Na",
           "N/A","Interesting comment", "abc"),
    P2 = c("Nothing",
           "I have nothing in common","NA",NA,
           "Another comment","....?","xxxx"),
    Z3 = c("Service","All good",
           "aa","I don't know",
           "The final comment about that","Nothing.","na"),
    Q4 = c(2019,2020,2020,2019,
           2020,2021,2021)
)

blank_statements <- regex("^(None.?|No\\scomments?.?|N.?A|Nothing)$", ignore_case = TRUE)


source %>% 
    mutate_if(~is.character(.) & any(nchar(.) > 5, na.rm = TRUE),
              ~str_remove_all(.x, blank_statements))
#>   URN  Name                                   Q1                       P2
#> 1 aaa   xxx                                                              
#> 2 bbb   xxx                                 <NA> I have nothing in common
#> 3 ccc   yyy No comments related to this exercise                         
#> 4 ddd   yyy                                                          <NA>
#> 5 eee   yyy                                               Another comment
#> 6 fff  zzzz                  Interesting comment                    ....?
#> 7 ggg abcde                                  abc                     xxxx
#>                             Z3   Q4
#> 1                      Service 2019
#> 2                     All good 2020
#> 3                           aa 2020
#> 4                 I don't know 2019
#> 5 The final comment about that 2020
#> 6                     Nothing. 2021
#> 7                              2021
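
Note that "Nothing." in row 6 of Z3 survives in the output above: the pattern allows an optional trailing character only after "None" and "No comments", and the unescaped `.` matches any character rather than a literal period. A minimal sketch of a stricter pattern, assuming a trailing period should be optional after every blank statement (the names `blank_strict` and `x` are made up for illustration):

```r
library(stringr)

# Hypothetical stricter variant: "\\." is a literal period, optional after each alternative
blank_strict <- regex("^(none|no\\scomments?|n.?a|nothing)\\.?$", ignore_case = TRUE)

x <- c("None.", "Nothing.", "N/A", "na", "I have nothing in common",
       "No comments related to this exercise")

# The ^ and $ anchors mean only whole-string matches count
detected <- str_detect(x, blank_strict)
detected
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```

The two full sentences are left alone because the anchors require the blank statement to be the entire string.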

Slavek · March 26, 2021, 7:55pm

Thank you. Can you help with other points please?
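
On the third point (trimming the separator), one option is to fold both `str_remove()` calls into a single `str_remove_all()` whose pattern matches a run of "/" at either end of the string. A small sketch on made-up strings (`x` and `trimmed` are illustrative names, not from the data above):

```r
library(stringr)

x <- c("/Service", "Another comment/", "//a/b//", "All good")

# "^/+" matches slashes at the start, "/+$" matches slashes at the end;
# a slash between two comments is left untouched
trimmed <- str_remove_all(x, "^/+|/+$")
trimmed
# returns "Service", "Another comment", "a/b", "All good"
```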

Slavek · March 29, 2021, 10:42am

I think that works:

merged.comments <- source %>% 
  mutate_if(~is.character(.) & any(nchar(.) > 5, na.rm = TRUE),
            ~str_remove_all(.x, blank_statements)) %>% # Removes blank statements
  mutate_if(~is.character(.) & any(nchar(.) > 5, na.rm = TRUE),
            ~str_remove_all(.x, "^.{1,5}$")) %>% # Removes comments of 5 characters or fewer
  mutate_if(~is.character(.) & any(nchar(.) > 5, na.rm = TRUE),
            ~str_remove_all(.x, "(.)\\1{2,}")) %>% # Removes runs of 3 or more repeated characters
  mutate(all_comments = paste(Q1,P2,Z3, sep="/"), # Merges comment variables
         all_comments = str_remove_all(all_comments, "NA"), # Removes NAs
         all_comments = str_remove_all(all_comments, "[:cntrl:]"), # Removes control characters like \n and \r
         all_comments = str_replace_all(all_comments, "\\s\\s+", " "), # Collapses extra spaces
         all_comments = str_replace_all(all_comments, "//+", "/"), # Removes duplicated /
         all_comments = str_remove(all_comments, "/$"), # Removes / at the end
         all_comments = str_remove(all_comments, "^/")) # Removes / at the beginning
merged.comments

Is that right?
The thing missing is this stage:

mutate(all_comments = paste(Q1,P2,Z3, sep="/"), # Merges comment variables

I need to use something like:

~is.character(.) & any(nchar(.) > 5, na.rm = TRUE)

I think the above should apply to the entire code. Can you help?
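
One way to avoid naming Q1, P2 and Z3 in the merge step might be `tidyr::unite()`, which takes tidyselect helpers such as `where()` and can drop missing values with `na.rm = TRUE`. This is a sketch under assumptions, not an answer given in this thread: `is_long_chr`, `df` and `merged` are hypothetical names, and the toy data frame only mimics the structure of the real one.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: TRUE for character columns with at least one value longer than 5 characters
is_long_chr <- function(x) is.character(x) && any(nchar(x) > 5, na.rm = TRUE)

df <- data.frame(
  id = c("a", "b"),
  Q1 = c("Interesting comment", NA),
  P2 = c(NA, "Another comment"),
  Z3 = c("All good", "The final comment"),
  Q4 = c(2020, 2021),
  stringsAsFactors = FALSE
)

# unite() pastes the selected columns; na.rm = TRUE skips NA values,
# remove = FALSE keeps the original columns alongside the new one
merged <- df %>%
  unite("all_comments", where(is_long_chr), sep = "/", na.rm = TRUE, remove = FALSE)

merged$all_comments
# returns "Interesting comment/All good", "Another comment/The final comment"
```

Because missing values are skipped before pasting, no leading, trailing or doubled "/" appears, so the later clean-up of the separator is not needed in this variant.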

Slavek · March 31, 2021, 3:11pm

I have a feeling that the above may be used at the beginning of the code just once, as all manipulations are for string variables with at least 5 characters. "all_comments" also meets this criterion. Then:

 ~str_remove_all(.x, blank_statements))%>% 

 ~str_remove_all(.x, "^.{1,5}$"))%>% 

 ~str_remove_all(.x, "(.)\\1{2,}"))%>% 

mutate(all_comments = paste(??????, sep="/")

...then all other statements related to "all_comments"...

How can I do that?
Can anyone help?
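
A possible shape for the whole pipeline, with the column test applied only once at the start, is sketched below. Several things here are assumptions rather than anything confirmed in the thread: `is_long_chr`, `comment_cols` and `clean_comment` are hypothetical names, the blank-statement pattern is a stricter variant of the one posted above, "repeated characters" is read as "the comment consists of one repeated character", and unwanted comments are turned into NA (instead of having text removed) so that `unite()` can skip them.

```r
library(dplyr)
library(stringr)
library(tidyr)

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  Name = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzzz", "abcde"),
  Q1 = c("None.", NA, "No comments related to this exercise", "Na",
         "N/A", "Interesting comment", "abc"),
  P2 = c("Nothing", "I have nothing in common", "NA", NA,
         "Another comment", "....?", "xxxx"),
  Z3 = c("Service", "All good", "aa", "I don't know",
         "The final comment about that", "Nothing.", "na"),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021)
)

# Apply the column test once and remember the matching names
is_long_chr <- function(x) is.character(x) && any(nchar(x) > 5, na.rm = TRUE)
comment_cols <- source %>% select(where(is_long_chr)) %>% names()

blank_statements <- regex("^(none|no\\scomments?|n.?a|nothing)\\.?$", ignore_case = TRUE)

# Hypothetical cleaner: whole comments that should be dropped become NA
clean_comment <- function(x) {
  x <- str_squish(x)                          # trims and collapses whitespace, including \n and \r
  drop <- str_detect(x, blank_statements) |   # the comment is only a blank statement
    str_length(x) <= 5 |                      # 5 characters or fewer
    str_detect(x, "^(.)\\1+$")                # one character repeated, e.g. "xxxx"
  x[which(drop)] <- NA_character_             # which() ignores NA, so missing values stay missing
  x
}

merged.comments <- source %>%
  mutate(across(all_of(comment_cols), clean_comment)) %>%
  unite("all_comments", all_of(comment_cols), sep = "/", na.rm = TRUE, remove = FALSE)

merged.comments$all_comments
```

On the sample data, row 2 should come out as "I have nothing in common/All good" with the sentence left intact, and row 7, where every comment is dropped, should end up with no text at all.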

Slavek · April 6, 2021, 12:55pm

I still believe that the solution to this question is very simple; I just don't know how to do it...
