Code for comparing X with Y?

niko_bio · February 13, 2022, 9:47am

Hi

I'm a complete R beginner and am trying to write code that compares the values in column X with those in column Y, to see whether they match exactly at each position. For instance:

Column X: apple, apple, apple, apple, pear
Column Y: orange, orange, apple, orange, pear

What would such a command look like? (The perfect matches would be at positions 3 and 5.)

Thx!!

pieterjanvc · February 13, 2022, 12:17pm

Hi,

Welcome to the RStudio community!

This is easy to do with simple logic like this:

#Get the data
myData = data.frame(
  x = c("apple", "apple", "apple", "apple", "pear"),
  y = c("orange", "orange", "apple", "orange", "pear")
)

#Do the comparison and save in new column z
myData$z = myData$x == myData$y
myData
#>       x      y     z
#> 1 apple orange FALSE
#> 2 apple orange FALSE
#> 3 apple  apple  TRUE
#> 4 apple orange FALSE
#> 5  pear   pear  TRUE

#Which indices are identical:
which(myData$z)
#> [1] 3 5
#Same as
which(myData$x == myData$y)
#> [1] 3 5

Created on 2022-02-13 by the reprex package (v2.0.1)

Have fun learning R!

PJ

niko_bio · February 14, 2022, 1:15pm

Thank you!

How about taking this one step further. Let's say I have three boxes, A B and C with different fruits (apples, oranges, and pears). And I want to compare these boxes to box D, E and F to see if I get any matches. However, box D, E and F may contain fruits of different types, and I'm only interested to see if my fruit X in box A/B/C is uniquely represented in box D E F. So:
Box A: only apples
Box B: only oranges
Box C: only pears
And
Box D: apples and oranges
Box E: only apples
Box F: apples, oranges and pears

In this case, only Box A with apples would have a perfect match with Box E, since that's the only other box with only apples in it. How should one think when coding that?

nirgrahamuk · February 14, 2022, 1:21pm

R doesn't have a 'box' concept. I could interpret this in any number of ways, and you would only benefit from my solution if your data resembled my interpretation closely enough.
Why guess? If your data is structured in a particular way, please share a sample of it to avoid doubt and misunderstandings.

FAQ: How to do a minimal reproducible example (reprex) for beginners (Guides & FAQs)

pieterjanvc · February 14, 2022, 1:46pm

Hi again,

It is indeed better to create a more hands-on example if you want us to give you exact answers, as @nirgrahamuk suggested. So please take a look at the guide to learn how to build one.

Meanwhile, here is an overview of a few basic functions that can be used to compare vectors (treated as sets):

#Two sets
x = 1:5
y = 3:7

#Intersection of x and y
intersect(x, y)
#> [1] 3 4 5

#Elements in x not in y
setdiff(x, y)
#> [1] 1 2

#Elements in y not in x
setdiff(y, x)
#> [1] 6 7

#All elements in x and y combined
union(x, y)
#> [1] 1 2 3 4 5 6 7

#Are the elements in x and y identical
# (Ignoring order or repeats)
setequal(x, y)
#> [1] FALSE

#Check for each element in x if it is in y
x %in% y
#> [1] FALSE FALSE  TRUE  TRUE  TRUE

Created on 2022-02-14 by the reprex package (v2.0.1)

Note that the first few operations treat their inputs as sets, which means repeated values are ignored and the order in which elements appear does not matter. The last one, %in%, works element by element instead: it returns one result per element of x, so repeats in x are taken into account.
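
Applied to the fruit boxes described earlier, the set functions could be used roughly as in the sketch below. It assumes each box is a character vector stored in a named list; that layout and the names `mine`, `others`, and `same_contents` are made up for illustration, since the real data had not been shared at this point.

```r
# Hypothetical layout: each box is a character vector in a named list
mine <- list(A = "apple", B = "orange", C = "pear")
others <- list(
  D = c("apple", "orange"),
  E = c("apple", "apple"),            # repeats are ignored by setequal()
  F = c("apple", "orange", "pear")
)

# For every box in `others`, test set equality against every box in `mine`.
# The result is a logical matrix: rows A, B, C and columns D, E, F.
same_contents <- sapply(others, function(o) {
  sapply(mine, function(m) setequal(m, o))
})
same_contents

# Row/column positions of the boxes whose contents match exactly
which(same_contents, arr.ind = TRUE)
```

Only the A/E cell is TRUE here, because E is the only other box that contains nothing but apples.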

Try changing the input to the one below to see what stays the same and what differs:

x = c(1,1,2,3,4,5)
y = c(7,3,4,5,6)

PJ

niko_bio · February 14, 2022, 2:04pm

Ok. I've uploaded a small section of the data here:

https://easyupload.io/3mtgu8 (excel file)

So, what I'm trying to do is check each SampleID (e.g., AVM_360) and see whether its species (B. vasutana) has a perfect match in MatchID (ignore the match% column). In this case, yes, it has a perfect match with another B. vasutana, but it also matches a bunch of other species. So here I would like R to tell me that, yes, it has a perfect match, but other species match as well. In contrast, AVM_363, identified as U. folus, only matches other U. folus.

nirgrahamuk · February 14, 2022, 4:03pm

It is preferable to share a small example of the R object you made from loading your Excel file, rather than the Excel file itself.

Can you please check the guide that was linked?

niko_bio · February 15, 2022, 3:29pm

Hello again and thanks for the suggestions.

Here's what my data looks like. To reiterate, the SampleID "AVM_360" is a single sample consisting of Bibasis vasutana. It is compared to other species found in the database, one of them actually being Bibasis vasutana (top of the MatchID column), but as you can see there are plenty of other hits too. So I'd like to write code that checks whether the "CurrentID" of a "SampleID" has a match in the "MatchID" column (y/n) and, if yes, also checks whether there are other matches that are not equal to the "CurrentID" (which in this case would also be yes).

data.frame(
  stringsAsFactors = FALSE,
  SampleID = c("AVM_360","AVM_360","AVM_360",
               "AVM_360","AVM_360","AVM_360"),
  CurrentID = c("Bibasis vasutana",
                "Bibasis vasutana","Bibasis vasutana","Bibasis vasutana",
                "Bibasis vasutana","Bibasis vasutana"),
  MatchID = c("Bibasis vasutana",
              "Burara vasutana","Burara tuckeri","Burara oedipodea",
              "Burara aquilina","Burara harisa")
)

head(Smalldata)
# A tibble: 6 x 3
  SampleID CurrentID        MatchID
1 AVM_360  Bibasis vasutana Bibasis vasutana
2 AVM_360  Bibasis vasutana Burara vasutana
3 AVM_360  Bibasis vasutana Burara tuckeri
4 AVM_360  Bibasis vasutana Burara oedipodea
5 AVM_360  Bibasis vasutana Burara aquilina
6 AVM_360  Bibasis vasutana Burara harisa

nirgrahamuk · February 15, 2022, 4:05pm

samp1<- data.frame(
  stringsAsFactors = FALSE,
  SampleID = c("AVM_360","AVM_360","AVM_360",
               "AVM_360","AVM_360","AVM_360"),
  CurrentID = c("Bibasis vasutana",
                "Bibasis vasutana","Bibasis vasutana","Bibasis vasutana",
                "Bibasis vasutana","Bibasis vasutana"),
  MatchID = c("Bibasis vasutana",
              "Burara vasutana","Burara tuckeri","Burara oedipodea",
              "Burara aquilina","Burara harisa")
)

#group_by(), summarise() and %>% come from dplyr
library(dplyr)

samp1 %>%
  group_by(SampleID) %>%
  summarise(any_match_check     = any(CurrentID == MatchID),
            any_not_match_check = any(CurrentID != MatchID))
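
The summary has one row per SampleID, so the two flags together separate the cases described earlier in the thread. The sketch below illustrates this with a second sample; the AVM_363 rows and the names `samp2` and `checks` are invented for contrast and are not taken from the real data.

```r
library(dplyr)

# AVM_360 rows come from the thread; the AVM_363 rows are invented for contrast
samp2 <- data.frame(
  stringsAsFactors = FALSE,
  SampleID  = c("AVM_360", "AVM_360", "AVM_360", "AVM_363", "AVM_363"),
  CurrentID = c("Bibasis vasutana", "Bibasis vasutana", "Bibasis vasutana",
                "U. folus", "U. folus"),
  MatchID   = c("Bibasis vasutana", "Burara vasutana", "Burara tuckeri",
                "U. folus", "U. folus")
)

checks <- samp2 %>%
  group_by(SampleID) %>%
  summarise(any_match_check     = any(CurrentID == MatchID),
            any_not_match_check = any(CurrentID != MatchID))
checks
# AVM_360: a match exists and other species match too (TRUE, TRUE)
# AVM_363: a match exists and nothing else matches    (TRUE, FALSE)
```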

niko_bio · February 16, 2022, 1:21pm

Thank you, nirgrahamuk. This is what I'm after. I'm having some problems with NA samples, but I will try to solve that myself first.
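
One common way NA values get in the way of this kind of check: comparing with a missing value gives NA, and any() returns NA rather than FALSE when it finds no TRUE but at least one NA. The sketch below uses made-up vectors (`current_id`, `match_id`); whether dropping missing values with na.rm = TRUE is the right choice for the real data is an assumption.

```r
# Made-up example: the second match ID is missing
current_id <- c("Bibasis vasutana", "Bibasis vasutana")
match_id   <- c("Burara vasutana", NA)

current_id == match_id
#> [1] FALSE    NA

# Without na.rm the missing value propagates into the result
any(current_id == match_id)
#> [1] NA

# With na.rm = TRUE the NA comparisons are dropped before any() is evaluated
any(current_id == match_id, na.rm = TRUE)
#> [1] FALSE
```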

niko_bio · February 16, 2022, 3:55pm

Another question. The following function does what I want, but I also want it to create an object holding the result. If I just write "exfilt99" I get: Error: object 'exfilt99' not found. What do I need to write to make the result show up, so that I can use that object in other operations? Thank you!!

$output_mode
function(x,y){
filt99 <- y>99
exfilt99 <- x[filt99,]
}

nirgrahamuk · February 16, 2022, 4:31pm

Most R functions don't create objects directly in the calling environment; rather, they return objects to the calling environment, where they can be assigned to a name or not. I recommend you follow that model.

#defining
myfunc <- function(x, y){
  filt99 <- y > 99
  #the value of the last expression is what the function returns
  x[filt99, ]
}
#using
exfilt99 <- myfunc(somex, somey)
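
To see that model in action, here is a small sketch with made-up inputs (`scores` and its `pct` column stand in for `somex` and `somey`). Variables created inside the function stay local; only the returned value reaches the caller, where it gets a name through assignment.

```r
# Made-up inputs standing in for somex and somey
scores <- data.frame(id = c("a", "b", "c"), pct = c(99.5, 98.0, 100),
                     stringsAsFactors = FALSE)

myfunc <- function(x, y) {
  filt99 <- y > 99   # local variable, gone once the function returns
  x[filt99, ]        # value of the last expression is returned
}

# The result only gets a name because it is assigned here
exfilt99 <- myfunc(scores, scores$pct)
exfilt99
#>   id   pct
#> 1  a  99.5
#> 3  c 100.0
```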

niko_bio · February 17, 2022, 6:26am

I see. Thank you again!

niko_bio · February 23, 2022, 12:27pm

New question. Here's my dataframe.

data <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  Sampleid = c("AVM_360","AVM_360","AVM_360",
               "AVM_362","AVM_362","AVM_362","AVM_362"),
  Currentid = c("Bibasis vasutana",
                "Bibasis vasutana","Bibasis vasutana","Parnara ganga",
                "Parnara ganga","Parnara ganga","Parnara ganga"),
  `%Match` = c(100, 100, 99.5, 100, 98.6, 97.5, 96.5),
  Matchid = c("Bibasis vasutana",
              "Burara vasutana","Bibasis nikos","Parnara ganga","Parnara batta",
              "Parnara batta","Parnara batta")
)

Here's some code someone helped me write earlier:

(First I do a quick filtering step to keep only the rows with a match of at least 99, i.e. >= 99)

data[,3] <- sapply(data[,3], as.numeric)
ffilter <- function(x, y){
  filt <- y >= 99
  x[filt, ]
}
#backticks are needed because %Match is not a syntactic R name
exfilt <- ffilter(data, data$`%Match`)

And then:

library(dplyr)

em <- exfilt %>%
  group_by(Sampleid) %>%
  summarise(any_match     = any(Currentid == Matchid),
            any_not_match = any(Currentid != Matchid))

This creates a table that shows, for each Sampleid, whether there are any matches and any non-matches between Currentid and Matchid.

If I then write:

which(em$any_match==TRUE & em$any_not_match==TRUE)

I get

[1] 1

which corresponds to the AVM_360 sample.

Now here's what I want to do: in this case, Sampleid AVM_360 has both a correct match and an incorrect match after filtering at 99 or above. I would like to extract that data into a new data frame that looks exactly like the previous one ("data", see above) but contains only the samples that have both a correct and an incorrect match, as described.

Is it possible to do?

And a second question. Is there any way to select specific columns inside these brackets [ ] without using numbers? I.e., let's say I have a column named "%Match" in position 3 and I want to see all row values of that column. I could write data[,3], but is it possible to use the column name instead? I would like to write data[,data$%Match] but that doesn't work.

Thank you!!!
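
Neither question received a reply before the topic closed. The sketch below is one possible approach, not an answer from the thread: it assumes dplyr is available and that the data frame is called `data` with the column names shown above; the name `mixed` is made up. A grouped filter() keeps all rows of every sample that has both a correct and an incorrect match, and a column whose name is not a syntactic R name can be reached with a quoted string or with backticks.

```r
library(dplyr)

data <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  Sampleid  = c("AVM_360", "AVM_360", "AVM_360",
                "AVM_362", "AVM_362", "AVM_362", "AVM_362"),
  Currentid = c("Bibasis vasutana", "Bibasis vasutana", "Bibasis vasutana",
                "Parnara ganga", "Parnara ganga", "Parnara ganga",
                "Parnara ganga"),
  `%Match`  = c(100, 100, 99.5, 100, 98.6, 97.5, 96.5),
  Matchid   = c("Bibasis vasutana", "Burara vasutana", "Bibasis nikos",
                "Parnara ganga", "Parnara batta", "Parnara batta",
                "Parnara batta")
)

# Selecting a column by name instead of by position
data[, "%Match"]   # quoted name inside the brackets
data[["%Match"]]   # same values
data$`%Match`      # backticks are needed after $ for non-syntactic names

# Keep rows with a match of at least 99, then keep whole samples that
# have both a correct and an incorrect match among the remaining rows
mixed <- data %>%
  filter(`%Match` >= 99) %>%
  group_by(Sampleid) %>%
  filter(any(Currentid == Matchid), any(Currentid != Matchid)) %>%
  ungroup()
mixed
```

With this data only the three AVM_360 rows remain, since AVM_362 has a single row left after the 99 cutoff and that row is a correct match. The result is a tibble; wrapping it in as.data.frame() would give a plain data frame again.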

system · March 16, 2022, 12:27pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.