Hi R masters,
I have the following task.
This is an example of a typical data frame with comments:
source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  Name = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzzz", "abcde"),
  Q1 = c("None.", NA, "No comments related to this exercise", "Na",
         "N/A", "Interesting comment", "abc"),
  P2 = c("Nothing", "I have nothing in common", "NA", NA,
         "Another comment", "....?", "xxxx"),
  Z3 = c("Service", "All good", "aa", "I don't know",
         "The final comment about that", "Nothing.", "na"),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021)
)
I am planning to merge all character variables that contain values longer than 5 characters (in this example Q1, P2 and Z3), but before merging I need to do some data cleanup by:
- Deleting comments specified in blank_statements
- Deleting meaningless short comments, for example "abc" (my code below drops anything of 5 characters or fewer)
- Deleting comments made up of repeated characters (for example "xxxx")
Then I need to merge all these corrected comments into one new variable (all_comments).
I have done this:
library(dplyr)
library(stringr)

blank_statements <- regex("None|None.|No\\scomments|No\\scomments.|NA|Nothing", ignore_case = TRUE)
merged.comments <- source %>%
  mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, "^.{1,5}$")) %>% # Removes comments of 5 characters or fewer
  mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, "(.)\\1{2,}")) %>% # Removes runs of 3 or more repeated characters
  mutate_at(vars(matches("Q1|P2|Z3")), ~str_remove_all(.x, blank_statements)) %>% # Removes blank statements
  mutate(all_comments = paste(Q1, P2, Z3, sep = "/"), # Merges comment variables
         all_comments = str_remove_all(all_comments, "NA"), # Removes NAs
         all_comments = str_remove_all(all_comments, "[:cntrl:]"), # Removes control characters like \n and \r
         all_comments = str_replace_all(all_comments, "\\s\\s+", " "), # Collapses extra spaces
         all_comments = str_replace_all(all_comments, "//+", "/")) # Collapses duplicated /
merged.comments
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "/$") # Removes / at the end
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "^/") # Removes / at the beginning
but:
- I don't know how to avoid naming my character variables in every mutate step. I believe I could use something like:
mutate_if(~is.character(.) && any(. > 5, na.rm = TRUE) %>%
but I don't know how to make it work
- All blank statements are removed, even when they are only part of a longer sentence ("I have nothing in common" becomes "I have in common", and "The final comment about that" becomes "The fil comment about that"). They should be removed only when they stand alone as a whole comment (like "Nothing.")
- This way of removing "/":
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "/$") # Removes / at the end
merged.comments$all_comments <- str_remove(merged.comments$all_comments, "^/") # Removes / at the beginning
is not really elegant, but I don't know what else could be used
- I think blank_statements could be simplified so that "No comment", "No comments" and "No comments." are covered by one shorter pattern. Perhaps regex is the wrong tool and I should use something different?
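To make the four points above concrete, here is a minimal sketch of the direction I have in mind. It is not a definitive solution. The names is_comment_col, clean_comment and min_chars are hypothetical helpers, not part of any package. The sketch also rests on assumptions of mine: using tidyr is acceptable, a comment is dropped only when the whole string is a blank statement (so "No comments related to this exercise" is kept), a "repeated characters" comment means the entire comment is one character repeated, and anything shorter than 6 characters is meaningless (mirroring "^.{1,5}$" from my attempt).

```r
library(dplyr)
library(stringr)
library(tidyr)

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  Name = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzzz", "abcde"),
  Q1 = c("None.", NA, "No comments related to this exercise", "Na",
         "N/A", "Interesting comment", "abc"),
  P2 = c("Nothing", "I have nothing in common", "NA", NA,
         "Another comment", "....?", "xxxx"),
  Z3 = c("Service", "All good", "aa", "I don't know",
         "The final comment about that", "Nothing.", "na"),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021)
)

# ^ and $ tie the pattern to the whole comment, so "Nothing." matches but
# "I have nothing in common" does not. "comments?" covers singular/plural
# and the trailing \W* covers optional punctuation such as a final dot.
blank_statements <- regex(
  "^\\s*(none|no\\s+comments?|n/?a|nothing)\\W*$",
  ignore_case = TRUE
)

# Hypothetical predicate: character column with at least one value > 5 chars.
is_comment_col <- function(x) {
  is.character(x) && any(str_length(x) > 5, na.rm = TRUE)
}

# Hypothetical helper: turn unwanted comments into NA instead of "".
clean_comment <- function(x, min_chars = 6) {
  x <- str_squish(x)                       # trim and collapse whitespace
  drop <- is.na(x) |
    str_length(x) < min_chars |            # too short to be meaningful
    str_detect(x, "^(.)\\1+$") |           # one character repeated, e.g. "xxxx"
    str_detect(x, blank_statements)        # whole comment is a blank statement
  if_else(drop, NA_character_, x)
}

# Pick the comment columns once, before cleaning changes their contents.
comment_cols <- names(source)[vapply(source, is_comment_col, logical(1))]

merged.comments <- source %>%
  mutate(across(all_of(comment_cols), clean_comment)) %>%
  # na.rm = TRUE skips NA values, so no leading, trailing or doubled "/".
  unite("all_comments", all_of(comment_cols),
        sep = "/", na.rm = TRUE, remove = FALSE)

merged.comments$all_comments
```

Because the unwanted comments become NA rather than empty strings, the separate steps for stripping "NA" and stray "/" should no longer be needed. Rows where every comment is dropped end up with an empty string in all_comments.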
Can you help?