Compare 2 word arrays

First of all, I'm new here, so please excuse any "dumb" questions. My problem is the following:

I am working with RStudio at the moment and have to compare two word pools of about 500 words each. I want to compare the given pool to a pool I set up myself to find possible misspellings and the like. I could just reread the pool, but this kind of comparison will happen about 10 times, so I would like an R script to do it for me and, ideally, output the lines in which the wrong words were found.

I am rather new to RStudio, which is why I have not coded anything yet apart from reading in the Excel sheet.

I would be very happy for any help!

Hi alpaypay,
I could not quite follow your question, but I think you are asking for something like this. Is that right?

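x <- c("january", "february", "march")
y <- c("january", "may", "june", "august")

z1 <- x %in% y         # logical vector: which elements of x also occur in y
z2 <- intersect(x, y)  # words present in both vectors
z3 <- setdiff(x, y)    # words in x that are missing from y
z4 <- setdiff(y, x)    # words in y that are missing from x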

Hey theORCAs, first of all, thank you for the quick response!
I just tried your idea and it's going in the right direction, but is it possible to read a complete 500-word Excel list and save it as an array in R?

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Hey andresrcs, thank you for your help.
I am currently trying to implement the previous answer and am working on a reprex, as you asked.
If I get stuck, I will gladly upload my reprex and ask for your help. Thank you very much!

library("readxl")
library("dplyr")
library("tidyverse")
x <- read_excel("Liste-A.xlsx", col_names = TRUE, range = cell_rows(1:16))
y <- read_excel("Liste-B.xlsx", col_names = TRUE, range = cell_rows(1:16))
as.list(x)
as.list(y)
# Data is read in and saved as x and y
# x is the "correct" array of words and y is the array to check
# Wanted: a comparison that checks which words in list y also appear in x

If a word in y is not found in x, because there is likely a misspelling, it should output the line in data_2 where the mistake is found.

This is the beginning of my code, and the read-in works as intended, although I am not sure whether reading the data in as a list is helpful when I want to manipulate the data.

The commented lines are what I want to achieve.

The input data is going to be one column (x) of 500 words in Excel, and this needs to be compared to a self-created column (y) in Excel with 500 words as well. The problem is that I will have to compare about 10 self-created columns (y) to the given one. The words will be the same in every column, but they will be in random order, so a side-by-side comparison won't be possible.

I tried a couple of functions to compare the two columns, e.g. intersect(x,y), x[-match(y,x)], y[(y %in% x)], but somehow it always just outputs the list x.
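One possible reason the attempts above return the whole list: read_excel() returns a data frame (a tibble), not a character vector, so functions such as %in% and match() may end up comparing whole columns instead of single words. A minimal sketch, assuming each sheet holds a single column of words. The two data frames are made-up stand-ins for the Excel sheets, and the column name `word` and all object names are hypothetical:

```r
# Stand-ins for the two Excel sheets (hypothetical data, one column each)
reference <- data.frame(word = c("big", "small", "red", "blue", "tall"),
                        stringsAsFactors = FALSE)
to_check  <- data.frame(word = c("small", "red", "bule", "tlal", "big"),
                        stringsAsFactors = FALSE)

# Pull the single column out as a plain character vector
x <- reference[[1]]
y <- to_check[[1]]

# Words in y that do not occur anywhere in x (order does not matter)
not_found <- setdiff(y, x)

# Positions of those words within y
bad_lines <- which(!(y %in% x))

# Small report: position and offending word
data.frame(line = bad_lines, word = y[bad_lines])
```

If the sheet has a header in its first row, the matching Excel row would presumably be the reported position plus one.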

I hope this is understandable and thank you very much in advance!

We don't have access to your local files, so we can't reproduce your issue and try to help. Can you please share a small part of the data set in a copy-paste friendly format?

In case you don't know how to do it, there are many options, which include:

  1. If you have stored the data set in some R object, the dput function is very handy.

  2. In case the data set is in a spreadsheet, check out the datapasta package. Take a look at this link.

Hey, I tried to follow the instructions and hope this is right:

data <- tibble::tribble(
      ~adj.,
      "big",
    "small",
      "red",
     "blue",
     "tall",
   "little",
    "green",
     "pink",
   "yellow",
    "black",
     "wide",
  "shallow",
    "heavy",
    "light",
  "minimal"
  )
head(data)
#> # A tibble: 6 x 1
#>   adj.  
#>   <chr> 
#> 1 big   
#> 2 small 
#> 3 red   
#> 4 blue  
#> 5 tall  
#> 6 little

Created on 2020-04-13 by the reprex package (v0.3.0)

I hope this is the format you wanted :)
This is my list of given words (x).

The following list is an example list that I want to compare to the one above.

data1 <-tibble::tribble(
    ~adj_1.,
    "small",
      "red",
     "bule",
     "tlal",
   "little",
    "green",
     "pink",
   "yelowl",
    "blakc",
     "wide",
  "shallow",
    "heavy",
    "light",
  "minimal",
      "big"
  )
head(data1)
#> # A tibble: 6 x 1
#>   adj_1.
#>   <chr> 
#> 1 small 
#> 2 red   
#> 3 bule  
#> 4 tlal  
#> 5 little
#> 6 green

Created on 2020-04-13 by the reprex package (v0.3.0)

And this is my list y, which I want to compare to x.

I hope this helps!

I'm not sure I understand your problem but is this what you mean?

library(dplyr)

data <- tibble::tribble(
    ~adj.,
    "big",
    "small",
    "red",
    "blue",
    "tall",
    "little",
    "green",
    "pink",
    "yellow",
    "black",
    "wide",
    "shallow",
    "heavy",
    "light",
    "minimal"
)

data1 <-tibble::tribble(
    ~adj_1.,
    "small",
    "red",
    "bule",
    "tlal",
    "little",
    "green",
    "pink",
    "yelowl",
    "blakc",
    "wide",
    "shallow",
    "heavy",
    "light",
    "minimal",
    "big"
)

data1 %>%
    rowwise() %>%
    filter(!any(data$adj. == adj_1.))
#> Source: local data frame [4 x 1]
#> Groups: <by row>
#> 
#> # A tibble: 4 x 1
#>   adj_1.
#>   <chr> 
#> 1 bule  
#> 2 tlal  
#> 3 yelowl
#> 4 blakc

Created on 2020-04-14 by the reprex package (v0.3.0.9001)
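Since the original question also asked for the lines in which the wrong words occur, the same filter can probably be written without rowwise() and with a position column added. A sketch, assuming dplyr and tibble are installed; the column name `line` and the object name `misspelled` are made up for illustration:

```r
library(dplyr)

# Reference list (x) and list to check (y), as posted above
data <- tibble::tibble(adj. = c("big", "small", "red", "blue", "tall",
    "little", "green", "pink", "yellow", "black", "wide", "shallow",
    "heavy", "light", "minimal"))
data1 <- tibble::tibble(adj_1. = c("small", "red", "bule", "tlal",
    "little", "green", "pink", "yelowl", "blakc", "wide", "shallow",
    "heavy", "light", "minimal", "big"))

misspelled <- data1 %>%
    mutate(line = row_number()) %>%     # remember each word's position
    filter(!(adj_1. %in% data$adj.))    # keep words absent from the reference

misspelled
```

Because %in% is vectorised, the per-row any() check is not needed here, and the `line` column shows where each unmatched word sits in data1.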

Thank you very much, this is exactly what I wanted to achieve!

Now that I am almost done, I have another question, which I guess will be easier to understand.

My first word list, as you can see in the comment above, is going to stay as it is, whereas the second one is going to have a prefix at the beginning of every word.

e.g. v-big, v-small, ...
It is always going to be "v-" in front of the words.
Is it possible to ignore the prefixes for the whole vector of words,

so that I can assign the "cleaned" vector to a new variable and finally compare the two?

library(tidyverse)

data <- tibble::tribble(
    ~adj.,
    "big",
    "small",
    "red",
    "blue",
    "tall",
    "little",
    "green",
    "pink",
    "yellow",
    "black",
    "wide",
    "shallow",
    "heavy",
    "light",
    "minimal"
)

data1 <- tibble::tribble(
        ~adj_1.,
      "v-small",
        "v-red",
       "v-bule",
       "v-tlal",
     "v-little",
      "v-green",
       "v-pink",
     "v-yelowl",
      "v-blakc",
       "v-wide",
    "v-shallow",
      "v-heavy",
      "v-light",
    "v-minimal",
        "v-big"
    )


data1 %>%
    mutate(clean = str_remove(adj_1., "^v-")) %>% 
    rowwise() %>%
    filter(!any(data$adj. == clean))
#> Source: local data frame [4 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 4 x 2
#>   adj_1.   clean 
#>   <chr>    <chr> 
#> 1 v-bule   bule  
#> 2 v-tlal   tlal  
#> 3 v-yelowl yelowl
#> 4 v-blakc  blakc

Created on 2020-04-16 by the reprex package (v0.3.0.9001)
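For the roughly ten lists mentioned earlier in the thread, the prefix removal and the comparison could be wrapped in a small helper and applied to each list in turn. A base R sketch; the helper name `find_unknown`, the list names, and the shortened word vectors are all made up for illustration:

```r
# Shortened reference list (hypothetical subset of the words above)
reference <- c("big", "small", "red", "blue", "tall")

# Hypothetical candidate lists, each carrying the "v-" prefix
candidates <- list(
    list_1 = c("v-small", "v-red", "v-bule", "v-tlal", "v-big"),
    list_2 = c("v-tall", "v-big", "v-blue", "v-red", "v-smal")
)

# Strip a leading "v-" and report the words missing from the reference,
# together with their position in the candidate list
find_unknown <- function(words, reference, prefix = "^v-") {
    clean <- sub(prefix, "", words)
    bad <- which(!(clean %in% reference))
    data.frame(line = bad, word = words[bad], clean = clean[bad],
               stringsAsFactors = FALSE)
}

# One small report per candidate list
results <- lapply(candidates, find_unknown, reference = reference)
results
```

sub() with the anchored pattern "^v-" only removes the prefix at the start of a word, which mirrors the str_remove() call above.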
