Why did my dataset change after running setDT()?

I have 2 CSV files and I want to find the rows they have in common. After reading them in as data frames, I converted them to data.tables and merged them. But somehow my code is not working: I get no common rows at all!

My dataset before running the code:

     nodeA  nodeB           scr
1  ID08918 ID04896                 1
2  ID00402 ID01198                 1
3  ID00182 ID01576                 1
4  ID06413 ID00745                 1
5  ID00215 ID01175                 1
6  ID00448 ID05351                 1
7  ID00860 ID00959 0.996197718631179
8  ID01110 ID01127  0.99604743083004
9  ID00497 ID01192 0.995436766623207
10 ID00877 ID01590 0.993939393939394
11 ID01192 ID01183 0.992202729044834
12 ID00361 ID00570 0.988354430379747
13 ID01045 ID01201  0.98766954377312
14 ID11641 ID00541 0.986875315497224
15 ID11641 ID00570  0.98685540950455
16 ID00458 ID01151 0.986813186813187
17 ID00199 ID01211 0.981416957026713
18 ID00570 ID00309 0.981151299032094
19 ID00541 ID00309 0.978161503301168
20 ID00603 ID06789 0.977272727272727
My code:

library(dplyr)
library(data.table)

df_1 <- read.csv("~/df_1.csv", stringsAsFactors = FALSE)
df_2 <- read.csv("~/df_2.csv", stringsAsFactors = FALSE)

# order each pair so the smaller ID is always in nodeA
setDT(df_1)[, c("nodeA", "nodeB") := list(pmin(nodeA, nodeB), pmax(nodeA, nodeB))]
setDT(df_2)[, c("nodeA", "nodeB") := list(pmin(nodeA, nodeB), pmax(nodeA, nodeB))]
result <- merge(df_1[!duplicated(df_1), ], df_2, allow.cartesian = TRUE)
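Here is a small sketch of what I expect the pmin()/pmax() step to do (toy data with made-up IDs): order each pair alphabetically so that an A–B edge and a B–A edge are counted as the same row.

```r
library(data.table)

# Toy example with hypothetical IDs: normalise each pair alphabetically
toy <- data.table(nodeA = c("ID2", "ID1"), nodeB = c("ID1", "ID3"))
toy[, c("nodeA", "nodeB") := list(pmin(nodeA, nodeB), pmax(nodeA, nodeB))]
# after this, nodeA is c("ID1", "ID1") and nodeB is c("ID2", "ID3")
```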

After running the code, my dataset has changed:

      nodeA   nodeB               scr
 1: ID08918 ID08918                 1
 2: ID00402 ID00402                 1
 3: ID00182 ID00182                 1
 4: ID06413 ID06413                 1
 5: ID00215 ID00215                 1
 6: ID00448 ID00448                 1
 7: ID00860 ID00860 0.996197718631179
 8: ID01110 ID01110  0.99604743083004
 9: ID00497 ID00497 0.995436766623207
10: ID00877 ID00877 0.993939393939394
11: ID01192 ID01192 0.992202729044834
12: ID00361 ID00361 0.988354430379747
13: ID01045 ID01045  0.98766954377312
14: ID11641 ID11641 0.986875315497224
15: ID11641 ID11641  0.98685540950455
16: ID00458 ID00458 0.986813186813187
17: ID00199 ID00199 0.981416957026713
18: ID00570 ID00570 0.981151299032094
19: ID00541 ID00541 0.978161503301168
20: ID00603 ID00603 0.977272727272727

Reproducible dataset
df_1

structure(list(query = structure(c(18L, 5L, 1L, 17L, 3L, 6L, 
12L, 15L, 8L, 13L, 16L, 4L, 14L, 19L, 19L, 7L, 2L, 10L, 9L, 11L
), .Label = c("ID00182", "ID00199", "ID00215", "ID00361", "ID00402", 
"ID00448", "ID00458", "ID00497", "ID00541", "ID00570", "ID00603", 
"ID00860", "ID00877", "ID01045", "ID01110", "ID01192", "ID06413", 
"ID08918", "ID11641"), class = "factor"), target = structure(c(16L, 
11L, 14L, 4L, 8L, 17L, 5L, 6L, 10L, 15L, 9L, 3L, 12L, 2L, 3L, 
7L, 13L, 1L, 1L, 18L), .Label = c("ID00309", "ID00541", "ID00570", 
"ID00745", "ID00959", "ID01127", "ID01151", "ID01175", "ID01183", 
"ID01192", "ID01198", "ID01201", "ID01211", "ID01576", "ID01590", 
"ID04896", "ID05351", "ID06789"), class = "factor"), new_ssp = structure(c(15L, 
15L, 15L, 15L, 15L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 7L, 
6L, 5L, 4L, 3L, 2L, 1L), .Label = c("0.977272727272727", "0.978161503301168", 
"0.981151299032094", "0.981416957026713", "0.986813186813187", 
"0.98685540950455", "0.986875315497224", "0.98766954377312", 
"0.988354430379747", "0.992202729044834", "0.993939393939394", 
"0.995436766623207", "0.99604743083004", "0.996197718631179", 
"1"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

df_2

structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L, 
9L, 3L, 4L, 2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L), .Label = c("ID00309", 
"ID00361", "ID00541", "ID00570", "ID00615", "ID00696", "ID00762", 
"ID01200", "ID05109"), class = "factor"), nodeB = structure(c(8L, 
3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L, 9L, 7L, 4L, 4L, 6L, 9L, 6L, 
7L, 5L, 5L), .Label = c("ID00361", "ID00541", "ID00570", "ID00615", 
"ID00696", "ID01200", "ID05109", "ID11641", "ID11691"), class = "factor"), 
    scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472", 
    "1.90762235378507", "1.94364188077133", "1.95883206119256", 
    "2.08440437841349", "2.26408172709962", "2.3223132020942", 
    "2.46120775935034", "2.49647215035727", "2.50432367561777", 
    "2.57541320006514", "2.65099330092281", "2.75209155741549", 
    "2.93717640337986", "2.99596628688011", "3.21209741517806", 
    "3.21997803385465", "3.48788394772132", "3.81389707587156"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

Note: I am also using dplyr for some things, like %>%. Could dplyr and data.table be conflicting somehow?

You've made a number of mistakes in setting up your example.

The df_1 and df_2 structures you posted have their columns stored as factors, whereas your read.csv() calls use stringsAsFactors = FALSE and would therefore produce character columns. The columns in df_1 are also named query and target, not nodeA and nodeB.

So the data must have been altered somewhere earlier in your script; the example you posted does not reproduce what you ran.
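If the dput() structures really do reflect your data, the fix would be to align the column names and coerce the factors to character before the pmin()/pmax() step. A minimal sketch, using a small stand-in for your posted df_1 (the two-row data here is made up):

```r
library(data.table)

# Minimal stand-in for the posted df_1: factor columns named query/target
df_1 <- data.frame(query  = factor(c("ID08918", "ID00402")),
                   target = factor(c("ID04896", "ID01198")))

setDT(df_1)
setnames(df_1, c("query", "target"), c("nodeA", "nodeB"))

# pmin()/pmax() are not meaningful for unordered factors, so coerce to character
cols <- c("nodeA", "nodeB")
df_1[, (cols) := lapply(.SD, as.character), .SDcols = cols]
df_1[, c("nodeA", "nodeB") := list(pmin(nodeA, nodeB), pmax(nodeA, nodeB))]
```

After this, both tables have character nodeA/nodeB columns with each pair ordered consistently, so merge() can match them.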

I'm not clear on what you are trying to do, but here is some general advice regarding dplyr and data.table: you can mix the two packages, although I don't think that is generally a good idea unless you have a compelling reason. You could also use dtplyr to write dplyr syntax while getting some (but not all) of data.table's performance.
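For example, a minimal dtplyr sketch (the two-row data here is made up): you wrap a data.table in lazy_dt(), write ordinary dplyr verbs, and collect the result at the end.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap a data.table so dplyr verbs are translated to data.table code
dt <- lazy_dt(data.table(nodeA = c("ID1", "ID1"), nodeB = c("ID2", "ID2")))

res <- dt %>%
  distinct() %>%   # translated to unique() on the underlying data.table
  as_tibble()      # collect the result
```

Calling show_query() on the lazy object instead of as_tibble() prints the data.table code that dtplyr generated, which is a good way to learn the translation.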


