How to remove same characters from the list

Shmi · March 31, 2020, 5:36am

I am comparing two texts: t1 is the model and t2 contains misspellings. I want to remove all the characters that appear in both t1 and t2, which should leave only the misspelled characters. I am struggling to achieve this. Below is an example script that I have been working on.

library(stringr)  # str_split()

t1 <- "This is a test. Weather is fine"
t2 <- "This text is a test. This wuither is fine. This blabalba That "
# split into sentences after each full stop, then lower-case
t1 <- str_split(t1, "(?<=\\.)\\s")
t1 <- lapply(t1, tolower)
t2 <- str_split(t2, "(?<=\\.)\\s")
t2 <- lapply(t2, tolower)

write.table(t1, file = "t1.txt", col.names = FALSE)
write.table(t2, file = "t2.txt", col.names = FALSE)

library(tools)  # Rdiff()
y <- Rdiff('t1.txt', 't2.txt', Log = TRUE)
y <- as.character(y$out)
y <- strsplit(y, "\\s")
commonWords <- intersect(t1, t2)
# removeWords() comes from the tm package
y2 <- removeWords(y, commonWords)

cderv · March 31, 2020, 6:32am

For misspellings, you may be interested in hunspell. It can detect unknown words based on a dictionary.

t1 <- "This is a test. Weather is fine"
t2 <- "This text is a test. This wuither is fine. This blabalba That "

# no misspellings
hunspell::hunspell(t1)
#> [[1]]
#> character(0)
# 2 misspellings
hunspell::hunspell(t2)
#> [[1]]
#> [1] "wuither"  "blabalba"

Created on 2020-03-31 by the reprex package (v0.3.0.9001)

If you want to stick with your approach, you can split the texts into words and find the words they do not have in common:

t1 <- "This is a test. Weather is fine"
t2 <- "This text is a test. This wuither is fine. This blabalba That "

t1 <- stringr::str_split(t1, stringr::boundary("word"))[[1]]
t2 <- stringr::str_split(t2, stringr::boundary("word"))[[1]]

# words in t2 that are not in t1
setdiff(tolower(t2), tolower(t1))
#> [1] "text"     "wuither"  "blabalba" "that"

Hope it helps.

Shmi · March 31, 2020, 7:53am

Thank you cderv for introducing hunspell. The texts I am working on are not in English, so I need to code this without relying on a dictionary. However, hunspell will be useful once I get to work with a dictionary.

I tried running setdiff, and it does not pick up all the misspelled and extra characters present.
After applying Rdiff from the previous code (used to compare line by line), I get this (an example):

[1] "1c1" "<" "\"1\"" "\"this" "is" "a" "test.\"" "---" ">" "\"1\"" "\"this" "text" "is" "a" "test.\""

What I cannot figure out is how to remove "this" "is" "a" "test." from "this" "text" "is" "a" "test.".

So that I could have the output as:
"text"

cderv · March 31, 2020, 8:37am

Just to clarify, do you absolutely need to use Rdiff?

nirgrahamuk · March 31, 2020, 8:53am

I think the first step towards a computer algorithm is to have a human algorithm. It's very unclear what exact steps you (Shmi, as a person) would take to process t2 in light of t1...

Here I assumed that you would step through each letter of t1 and, if it's a letter in t2, delete it from t2 and move on to the next letter of t1.

t1 <- "This is a test. Weather is fine"
t2 <- "This text is a test. This wuither is fine. This blabalba That "
library(tidyverse)

# Walk through the characters of a in order; each one blanks out its next
# match in b (searching forward only). Whatever is left of b is returned.
delete_a_from_b <- function(a, b) {
  # drop spaces, then split into single characters
  a_chars <- str_remove_all(a, " ") %>%
    str_split(boundary("character")) %>% unlist()
  b_chars <- str_remove_all(b, " ") %>%
    str_split(boundary("character")) %>% unlist()
  previous_j <- 1
  for (i in seq_along(a_chars)) {
    # stop once every character of b has been consumed
    if (previous_j > length(b_chars))
      break
    for (j in previous_j:length(b_chars)) {
      if (a_chars[[i]] == b_chars[[j]]) {
        b_chars[[j]] <- ""
        previous_j <- j + 1
        break
      }
    }
  }

  result <- paste0(b_chars, collapse = "")
  print(result)
  result
}

t3 <- delete_a_from_b(t1, t2)
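
For a single pair of words, a different technique from the loop above is the edit-distance alignment that base R's utils::adist can report. This is only a sketch under the assumption that you already know which word in t2 corresponds to which word in t1; the variable names are made up for illustration.

```r
x <- "weather"  # model word
y <- "wuither"  # misspelled word

# counts = TRUE attaches a "trafos" attribute: one letter per edit step,
# M = match, S = substitution, I = insertion, D = deletion
d <- utils::adist(x, y, counts = TRUE)
ops <- strsplit(attr(d, "trafos")[1, 1], "")[[1]]

# M, S and I each consume one character of y; D consumes none
y_pos <- cumsum(ops != "D")
changed <- y_pos[ops %in% c("S", "I")]

# characters of y that were substituted or inserted relative to x
substring(y, changed, changed)
#> [1] "u" "i"
```

This needs no dictionary, but it compares one pair of strings at a time, so the words still have to be paired up first.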

Shmi · March 31, 2020, 8:53am

I am open to different functions.

cderv · April 1, 2020, 6:40am

If you tokenize your sentences into words and then take the difference, you get the desired output:

t1 <- c("this", "is", "a", "test.") 
t2 <- c("this", "text", "is", "a")

# words in t2 that are not in t1
setdiff(tolower(t2), tolower(t1))
#> [1] "text"
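
If the tokens come straight from the Rdiff output quoted earlier in the thread, a minimal base R sketch could split them at the "---" separator and compare the two halves. The vector y below is typed in by hand from that output, and clean() is a hypothetical helper written for this example, so treat both as assumptions about the shape of the real data.

```r
# Tokens copied by hand from the Rdiff output quoted earlier
# (an assumption about its exact shape; real output may differ)
y <- c("1c1", "<", "\"1\"", "\"this", "is", "a", "test.\"",
       "---", ">", "\"1\"", "\"this", "text", "is", "a", "test.\"")

sep <- which(y == "---")
old_side <- y[seq_len(sep - 1)]     # lines from t1.txt
new_side <- y[(sep + 1):length(y)]  # lines from t2.txt

# Hypothetical helper: strip quote characters, then drop the diff markers,
# the "1c1" header, and the row name ("1") that write.table adds
clean <- function(x) {
  x <- gsub("\"", "", x, fixed = TRUE)
  x[!x %in% c("<", ">", "1", "1c1")]
}

setdiff(clean(new_side), clean(old_side))
#> [1] "text"
```

A diff with several changed lines would contain several such blocks, so this only covers the single-block case shown in the example.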

Did I miss something?

Shmi · April 1, 2020, 2:39pm

@cderv, nope, you didn't. I tried splitting the sentences into words and applied setdiff. Everything worked perfectly well except for the second sentence (from the original post).
t1 <- lapply(t1, tolower)
t2 <- lapply(t2, tolower)
y1 <- setdiff(str_split(t2, "\\s")[[1]], str_split(t1, "\\s")[[1]])

#> [1] "text" "wuither" "this" "blabalba" "that"

The missing word "this" from the second sentence did not appear in the output.
Would it be better to run setdiff sentence by sentence rather than on the whole text as I am doing now, since the current code misses some words? I worry that this might be slow when comparing a huge text.

Shmi · April 1, 2020, 2:42pm

Thank you @nirgrahamuk for the insights. They helped me see at what level I should be thinking about and explaining my questions.

cderv · April 1, 2020, 7:33pm

"This" is in both t1 and t2 in your example, so it gets filtered out, I guess. But yes, you're right, I missed something: setdiff only returns the elements of the first vector that are not in the second.

setdiff(c("a", "b"), c("b", "c"))
#> [1] "a"

So it may not be what you want, or you need to run it in both directions.
Something like this:

t1 <- "This is a test. Weather is fine"
t2 <- "This text is a test. This wuither is fine. This blabalba That "

t1 <- tolower(t1)
t2 <- tolower(t2)

s1 <- stringr::str_split(t1, stringr::boundary("word"))[[1]]
s2 <- stringr::str_split(t2, stringr::boundary("word"))[[1]]

purrr::flatten_chr(
  purrr::map2(list(s1, s2), list(s2, s1), setdiff)
)
#> [1] "weather"  "text"     "wuither"  "blabalba" "that"
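
Note that setdiff treats its inputs as sets, so a repeated word such as "this" disappears as soon as it occurs once on the other side. If repeated words matter, a count-aware difference is one option. The sketch below uses base R only; multiset_diff is a hypothetical helper, not a function from stringr or purrr, and the word vectors are typed in by hand from the example texts.

```r
# Hypothetical helper: remove one occurrence from x for every occurrence in y
multiset_diff <- function(x, y) {
  for (w in y) {
    hit <- match(w, x)  # first remaining occurrence of w in x, or NA
    if (!is.na(hit)) x <- x[-hit]
  }
  x
}

s1 <- c("this", "is", "a", "test", "weather", "is", "fine")
s2 <- c("this", "text", "is", "a", "test", "this", "wuither", "is", "fine",
        "this", "blabalba", "that")

multiset_diff(s2, s1)
#> [1] "text"     "this"     "wuither"  "this"     "blabalba" "that"
```

Here the two extra occurrences of "this" in t2 survive because t1 contains the word only once. The helper ignores word order, so it will not tell you which sentence a leftover word came from.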
