Why is my code not working now when it was working fine before?

A few days ago I posted a question on Stack Overflow (here) and got an answer from @jay.sf. At that time the code worked fine. But now, when I run the same code on my machine, I get the wrong output.

I am using the same data frames and code from that post, but the output below is completely different from both my expected output and the output shown in the post.

                      query                  target weight
 1: (+)-1(10),4-Cadinadiene (+)-1(10),4-Cadinadiene  0.090
 2:                      A1                      A1  0.600
 3:                      A1                      A1  0.600
 4:                      A1                      A1  1.000
 5:                      A1                      A1  1.000
 6:                      A2                      A2  0.500
 7:                      A2                      A2  0.500
 8:                      A2                      A2  1.000
 9:                      A2                      A2  1.000
10:                      A3                      A3  0.750
11:                      A3                      A3  0.750
12:                      A3                      A3  1.000
13:                      A3                      A3  1.000
14:                      A4                      A4  0.880
15:                      A4                      A4  0.880
16:                      A4                      A4  1.000
17:                      A4                      A4  1.000
18:             Falcarinone             Falcarinone  1.000
19:        Leucodelphinidin        Leucodelphinidin  0.876

Version information is given below.

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          6.3                         
year           2020                        
month          02                          
day            29                          
svn rev        77875                       
language       R                           
version.string R version 3.6.3 (2020-02-29)
nickname       Holding the Windsock 

I have restarted RStudio several times but still get the same wrong output.

I can't reproduce your error.

Jay's code, run on your example data, gives a result consistent with the one on Stack Overflow and different from what you have here.

Is it the case that you load only data.table and no other libraries, as in Jay's code?
It would be very unusual for a conflict not to throw an obvious error, but stranger things have happened.
Check that conflicts() doesn't return anything you are using in the code.

@nirgrahamuk I am not getting the same output! I am using the same df_1 and df_2 and running the code from the edited (data.table) part of the answer. Why is this happening?

I am getting the output given in this post. Most probably !duplicated is not working on my PC!

When I run conflicts() I get [1] "body<-" "kronecker". What does it mean?

That's not a problem then.
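
For context, "body<-" and "kronecker" are, as far as I know, the names that the methods package defines on top of base in an ordinary R session, so seeing them is expected. A minimal sketch of how one might check for masking more directly, using only base R (the variable names conf, used and origin are just illustrative):

```r
# List conflicting names, grouped by the search-path entry that holds them.
conf <- conflicts(detail = TRUE)
print(conf)

# Confirm that the functions used in the code still resolve to base R.
# A different namespace would be reported if another package's version
# were found first on the search path.
used <- c("pmin", "pmax", "duplicated", "merge")
origin <- vapply(used, function(f) environmentName(environment(get(f))), character(1))
print(origin)
```

If every entry reports base, masking is unlikely to be the cause and the data itself is the next thing to inspect.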

@nirgrahamuk No idea what is happening! I am loading only library(data.table)!

What do you get with the shorter tables?

df_1 <- structure(list(
  query = c("A1", "A2"), target = c("A2", "A5"),
  weight = c(0.6, 0.5)
), class = "data.frame", row.names = c(NA, -2L))

df_2 <- structure(list(
  query = c("A1", "A2"), target = c("A2", "A5")
), class = "data.frame", row.names = c(NA, -2L))

library(data.table)
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)

I get this:

   query target weight
1:    A1     A2    0.6
2:    A2     A5    0.5

OK, now try the larger tables:

df_1 <- structure(list(query = c("A1", "A2", "A3", "A4", "A5", "(+)-1(10),4-Cadinadiene", 
                                 "Leucodelphinidin", "Lignin", "(2E,7R,11R)-2-Phyten-1-ol", "Falcarinone", 
                                 "A1", "A2", "A3", "Falcarinone", "A4", "A4", "Falcarinone", "A5"
), target = c("A2", "A5", "A1", "A5", "A3", "Falcarinone", "(+)-1(10),4-Cadinadiene", 
              "(2E,7R,11R)-2-Phyten-1-ol", "Leucodelphinidin", "Lignin", "(+)-1(10),4-Cadinadiene", 
              "Lignin", "(2E,7R,11R)-2-Phyten-1-ol", "A6", "Leucodelphinidin", 
              "Leucodelphinidin", "A100", "Falcarinone"), weight = c(0.6, 0.5, 
                                                                     0.75, 0.88, 0.99, 0.09, 0.876, 0.778, 0.55, 1, 1, 1, 1, 1, 1, 
                                                                     1, 1, 1)), class = "data.frame", row.names = c(NA, -18L))

df_2 <- structure(list(query = c("A1", "A2", "A1", "A4", "A3", "(+)-1(10),4-Cadinadiene", 
                                 "Leucodelphinidin", "Lignin-2", "A11", "A2", "A3", "Falcarinone", 
                                 "A4"), target = c("A2", "A5", "A3", "A5", "A5", "Falcarinone", 
                                                   "(+)-1(10),4-Cadinadiene-100", "(2E,7R,11R)-2-Phyten-1-ol", "(+)-1(10),4-Cadinadiene", 
                                                   "Lignin", "(2E,7R,11R)-2-Phyten-1-0l", "A6", "Leucodelphinidin"
                                 )), class = "data.frame", row.names = c(NA, -13L))

This time I get the right output:

                     query           target weight
1: (+)-1(10),4-Cadinadiene      Falcarinone   0.09
2:                      A1               A2   0.60
3:                      A1               A3   0.75
4:                      A2               A5   0.50
5:                      A2           Lignin   1.00
6:                      A3               A5   0.99
7:                      A4               A5   0.88
8:                      A4 Leucodelphinidin   1.00
9:                      A6      Falcarinone   1.00

Thanks, but why is this happening?

@nirgrahamuk When I take the df_1 and df_2 code from my original post I get the wrong output, but when I build the data frames from your code I get the right output!

Strange! Any explanation?

Take df_1 and df_2 from your original post and use dput() to share the result here explicitly, without read.table.
Then we can analyse.

@nirgrahamuk Here are df_1 and df_2, taken directly from my original post:

df_1 <- structure(list(query = structure(c(3L, 4L, 5L, 6L, 7L, 1L, 9L, 
10L, 2L, 8L, 3L, 4L, 5L, 8L, 6L, 6L, 8L, 7L), .Label = c("(+)-1(10),4-Cadinadiene", 
"(2E,7R,11R)-2-Phyten-1-ol", "A1", "A2", "A3", "A4", "A5", "Falcarinone", 
"Leucodelphinidin", "Lignin"), class = "factor"), target = structure(c(5L, 
7L, 3L, 7L, 6L, 9L, 1L, 2L, 10L, 11L, 1L, 11L, 2L, 8L, 10L, 10L, 
4L, 9L), .Label = c("(+)-1(10),4-Cadinadiene", "(2E,7R,11R)-2-Phyten-1-ol", 
"A1", "A100", "A2", "A3", "A5", "A6", "Falcarinone", "Leucodelphinidin", 
"Lignin"), class = "factor"), weight = c(0.6, 0.5, 0.75, 0.88, 
0.99, 0.09, 0.876, 0.778, 0.55, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = "data.frame", row.names = c(NA, 
-18L))
df_2 <- structure(list(query = structure(c(2L, 4L, 2L, 6L, 5L, 1L, 8L, 
9L, 3L, 4L, 5L, 7L, 6L), .Label = c("(+)-1(10),4-Cadinadiene", 
"A1", "A11", "A2", "A3", "A4", "Falcarinone", "Leucodelphinidin", 
"Lignin-2"), class = "factor"), target = structure(c(5L, 7L, 
6L, 7L, 7L, 9L, 2L, 4L, 1L, 11L, 3L, 8L, 10L), .Label = c("(+)-1(10),4-Cadinadiene", 
"(+)-1(10),4-Cadinadiene-100", "(2E,7R,11R)-2-Phyten-1-0l", "(2E,7R,11R)-2-Phyten-1-ol", 
"A2", "A3", "A5", "A6", "Falcarinone", "Leucodelphinidin", "Lignin"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-13L))

After running setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))] I get this table (shown with dput()):

df_1 <- structure(list(query = structure(c(3L, 4L, 5L, 6L, 7L, 1L, 9L, 
10L, 2L, 8L, 3L, 4L, 5L, 8L, 6L, 6L, 8L, 7L), .Label = c("(+)-1(10),4-Cadinadiene", 
"(2E,7R,11R)-2-Phyten-1-ol", "A1", "A2", "A3", "A4", "A5", "Falcarinone", 
"Leucodelphinidin", "Lignin"), class = "factor"), target = structure(c(3L, 
4L, 5L, 6L, 7L, 1L, 9L, 10L, 2L, 8L, 3L, 4L, 5L, 8L, 6L, 6L, 
8L, 7L), .Label = c("(+)-1(10),4-Cadinadiene", "(2E,7R,11R)-2-Phyten-1-ol", 
"A1", "A2", "A3", "A4", "A5", "Falcarinone", "Leucodelphinidin", 
"Lignin"), class = "factor"), weight = c(0.6, 0.5, 0.75, 0.88, 
0.99, 0.09, 0.876, 0.778, 0.55, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("data.table", 
"data.frame"), row.names = c(NA, -18L), .internal.selfref = <pointer: 0x560ad10aa500>)

In my data, query and target are character variables; in yours they are factors.
In R 4.0.0 (coming from R 3.6), the default behaviour for strings in data frame creation changed from converting them to factors to keeping them as character strings, so your read.table technique is vulnerable to that.

You can add stringsAsFactors = FALSE to your read.table call to get the default R 4 behaviour from within your R 3.6 session.

But notice also that dput() is a great way to transfer simple R objects, as it is explicit about such things.
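
To make the difference concrete, here is a minimal sketch in base R. The tiny data frames chr_df and fct_df are made up for illustration and are not the data from the post. The point is that ordering comparisons such as < are not meaningful for unordered factors, which is plausibly why pmin()/pmax() misbehave on factor columns:

```r
# Hypothetical two-row example; chr_df and fct_df are illustrative names.
chr_df <- data.frame(query = c("A1", "A5"), target = c("A2", "A3"),
                     stringsAsFactors = FALSE)
fct_df <- data.frame(query = c("A1", "A5"), target = c("A2", "A3"),
                     stringsAsFactors = TRUE)

# Inspect the column classes: character in one, factor in the other.
print(sapply(chr_df, class))
print(sapply(fct_df, class))

# On character vectors, pmin() compares the strings themselves.
smallest <- pmin(chr_df$query, chr_df$target)
print(smallest)

# On unordered factors, "<" yields NA (with a warning), so ordering is undefined.
fct_cmp <- suppressWarnings(fct_df$query < fct_df$target)
print(fct_cmp)

# Converting factor columns back to character restores the character behaviour.
fct_df[] <- lapply(fct_df, function(col) if (is.factor(col)) as.character(col) else col)
print(sapply(fct_df, class))
```

Running the same sapply(df_1, class) check on the data frames from the original post should show whether the columns were read in as factors.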

@nirgrahamuk thanks for the nice solution.

But the real problem is that these are demo data and I also have real data. I read the real data with read.csv() and get the same kind of wrong result as the output in this post. What can I do in this case?

I believe read.csv can also take a stringsAsFactors argument.
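
A minimal sketch of that call, assuming a CSV with the same three columns as the demo data. The inline csv_text and the name real_df are placeholders standing in for the real file; with a file on disk you would pass its path as the first argument instead of text =.

```r
# Placeholder CSV content standing in for the real file (assumption).
csv_text <- "query,target,weight
A1,A2,0.6
A2,A5,0.5"

# stringsAsFactors = FALSE keeps query and target as character columns,
# regardless of whether the session is R 3.6 or R 4.x.
real_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(real_df)
```

From R 4.0.0 onwards this is already the default for read.csv, so the explicit argument mainly matters on older versions such as 3.6.3.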

You can also use options(stringsAsFactors=FALSE) to turn off factorization of strings in most (all?) functions.
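
A small sketch of the session-wide variant, assuming you want to restore the previous setting afterwards. The names old_opts and opt_df are illustrative:

```r
# Switch off automatic string-to-factor conversion for this session,
# keeping the previous setting so it can be restored later.
old_opts <- options(stringsAsFactors = FALSE)

# Data frames created now keep strings as character columns.
opt_df <- data.frame(query = c("A1", "A2"), target = c("A2", "A5"))
print(sapply(opt_df, class))

# Restore whatever the option was before.
options(old_opts)
```

Because this changes behaviour globally, passing stringsAsFactors = FALSE to the individual read call is arguably easier to reason about in shared scripts.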

Cheers
Steen

Thanks for your suggestion
