Why does this code works in Dplyr but not in data.table

dplyr
datatable

#1
DT<-tibble::tribble(
    ~WORK_DEPT, ~EMP_ID,            ~EMP_NAME, ~PUNCH_OT, ~MAN_OT,    ~OT_DATE, ~PUNCH_1, ~PUNCH_2, ~OT_PUNCH1, ~OT_PUNCH2,             ~DEPT_NAME,
            2L, 100976L,       "LOKESH KUMAR",        4L,      0L, "19-Apr-18",     5.47,    18.17,      13.58,      18.17,             "ACCOUNTS",
            2L, 503053L,        "AJEET SINGH",        5L,      0L, "19-Apr-18",    21.38,     6.07,         NA,      22.38,             "ACCOUNTS",
            4L, 100042L,      "BALWANT SINGH",        3L,      4L, "19-Apr-18",     5.39,    14.34,      13.35,      18.04,             "SECURITY",
            4L, 100761L,         "DHARAM PAL",        4L,      4L, "19-Apr-18",     5.48,    14.38,      13.38,      18.04,             "SECURITY",
            4L, 100797L,     "SATYAVIR SINGH",        4L,      4L, "19-Apr-18",    21.37,     6.03,      17.43,      22.29,             "SECURITY",
            5L, 100139L,       "BALBIR SINGH",        4L,      5L, "19-Apr-18",       NA,    14.35,      13.54,      18.57, "PRESSURE DIE CASTING",
            5L, 100319L,       "PARMOD KUMAR",        4L,      4L, "19-Apr-18",     5.51,    18.23,      14.19,      18.23, "PRESSURE DIE CASTING",
            5L, 100718L,      "JAGNESH KUMAR",        3L,      4L, "19-Apr-18",     5.49,    18.23,      14.19,      18.23, "PRESSURE DIE CASTING",
            5L, 100785L,     "SURENDRA SINGH",        4L,      5L, "19-Apr-18",     5.51,       NA,      14.03,      18.57, "PRESSURE DIE CASTING",
            5L, 100850L,      "JANVESH KUMAR",       12L,      4L, "19-Apr-18",    21.31,     6.23,      17.42,       6.24, "PRESSURE DIE CASTING",
            5L, 100977L, "JITENDRA KUMAR PAL",       12L,      4L, "19-Apr-18",    21.39,     6.23,      17.52,       6.24, "PRESSURE DIE CASTING",
            5L, 203962L,           "RAJKUMAR",       12L,      4L, "19-Apr-18",     21.3,     6.07,         NA,       6.08, "PRESSURE DIE CASTING",
            5L, 204070L,        "LALIT KUMAR",        5L,      0L, "19-Apr-18",    21.38,     6.07,      17.42,      22.54, "PRESSURE DIE CASTING",
            5L, 204280L,     "SURENDRA KUMAR",       12L,      0L, "19-Apr-18",    21.52,     6.06,      17.59,       6.07, "PRESSURE DIE CASTING",
            5L, 204325L,      "KAMLESH DUBEY",        4L,      4L, "19-Apr-18",     5.55,    14.42,      13.49,         NA, "PRESSURE DIE CASTING",
            5L, 204332L,     "VINDESHWAR RAY",        0L,      0L, "19-Apr-18",     5.51,    14.41,      13.46,      14.36, "PRESSURE DIE CASTING",
            5L, 204333L,         "VISHVENDRA",       12L,      4L, "19-Apr-18",    21.36,     6.23,      17.53,       6.24, "PRESSURE DIE CASTING",
            5L, 204386L,          "VIJAY PAL",        4L,      4L, "19-Apr-18",     5.55,    18.01,      14.24,      18.02, "PRESSURE DIE CASTING",
            5L, 204566L,       "TINKU KALITA",        4L,      0L, "19-Apr-18",    21.32,     6.07,      17.51,         NA, "PRESSURE DIE CASTING",
            5L, 204600L,    "SANDEEP JAISWAL",        4L,      4L, "19-Apr-18",     5.46,    17.59,      13.49,      18.01, "PRESSURE DIE CASTING",
            5L, 204607L,   "VISHVENDRA SINGH",        8L,      8L, "19-Apr-18",     5.53,    22.03,      14.13,      22.04, "PRESSURE DIE CASTING"
    )

DT %>% setDT


DT[,':='(OT_DATE=as.POSIXct.numeric(OT_DATE,origin = '1970-01-01'),
                 PUNCH_1=hm(PUNCH_1),
                 PUNCH_2=hm(PUNCH_2),
                 OT_PUNCH1=hm(OT_PUNCH1),
                 OT_PUNCH2=hm(OT_PUNCH))] 


DT %>% 
    mutate(PUNCH_1=hm(PUNCH_1)) %>%  
    mutate(PUNCH_2=hm(PUNCH_2)) %>%  
    mutate(OT_PUNCH1=hm(OT_PUNCH1)) %>%  
    mutate(OT_PUNCH2=hm(OT_PUNCH2))  


I am trying to convert time as normal integer to hours and minutes but the code doesn't work the data.table way but it does in dplyr.

I would like to use the same code in data.table for speed. is there any way I can do that.


#2

no. They are, effectively, two completely different domain specific languages implemented with R APIs around them.


#3

But data.table is again a data frame too... And thus it should work. Should I post this issue to lubricate or data.table


#4

Unless for sure you know there is a bug or you want an enhancement, I would suggest your first port of call with code questions should be StackOverflow. I use it regularly and often get answers within a few minutes


#5

nope. That's not how they work. I recommend you read up on data.table vs. dplyr. You could start with this:


#6

Hadley has this package that makes data.table behave as you would expect in dplyr. But apparently it is slow, so that might not be relevant to you since your appeal in data.table was speed...


#7

Yes.

I need a data.table solution because that's the fastest and it reminds me of basic sql which is easier.

Again it would be implemented in a shiny app so speed does matter


#8

Yeah, as @prosoitos said, if you're looking for data.table speed, you should probably use data.table syntax.


#9

That's exactly what I did. I just tried to change a column in hours and minutes by using lubridate.

I have written both the syntax. But one works and another doesn't.


#10

If the only process being performed is a call to the lubridate hm function, I suspect that will be the time bottleneck, not whether that call is made from inside data.table or dplyr.

The main difference in calculation times between those two frameworks is when you want to do grouping and aggregation calcs among 100K+ groups. That's where data.table is optimized to work really fast. I don't think that wrapping your lubridate call inside data.table will make it any faster.

If you need the conversion done faster, you might want to look at alternatives to lubridate, which has smart parsing and clean syntax, but is probably not the fastest.


#11

One thing I'm noticing is that your data.table code is referencing a column that doesn't exist in your data frame (OT_PUNCH instead of OT_PUNCH2). When you say "it doesn't work", what do you mean? What error message or unexpected result are you seeing?


#12

It does exist and you can simply copy paste the example code I have pasted in question and it should work


#13

Its not the speed I just wanted to do it data.table way because I am more comfortable with it.

But some how it doesn't work and I want to know why??


#14

Your data.table code had typo. This worked for me

DT[, `:=`(OT_DATE = as.Date(OT_DATE, "%d-%b-%y"),
         PUNCH_1 = hm(PUNCH_1),
         PUNCH_2 = hm(PUNCH_2),
         OT_PUNCH1 = hm(OT_PUNCH1),
         OT_PUNCH2 = hm(OT_PUNCH2))][]
    WORK_DEPT EMP_ID           EMP_NAME PUNCH_OT MAN_OT    OT_DATE    PUNCH_1
 1:         2 100976       LOKESH KUMAR        4      0 2018-04-19  5H 47M 0S
 2:         2 503053        AJEET SINGH        5      0 2018-04-19 21H 38M 0S
 3:         4 100042      BALWANT SINGH        3      4 2018-04-19  5H 39M 0S
 4:         4 100761         DHARAM PAL        4      4 2018-04-19  5H 48M 0S
 5:         4 100797     SATYAVIR SINGH        4      4 2018-04-19 21H 37M 0S
 6:         5 100139       BALBIR SINGH        4      5 2018-04-19       <NA>
 7:         5 100319       PARMOD KUMAR        4      4 2018-04-19  5H 51M 0S
 8:         5 100718      JAGNESH KUMAR        3      4 2018-04-19  5H 49M 0S
 9:         5 100785     SURENDRA SINGH        4      5 2018-04-19  5H 51M 0S
10:         5 100850      JANVESH KUMAR       12      4 2018-04-19 21H 31M 0S
11:         5 100977 JITENDRA KUMAR PAL       12      4 2018-04-19 21H 39M 0S
12:         5 203962           RAJKUMAR       12      4 2018-04-19 21H 30M 0S
13:         5 204070        LALIT KUMAR        5      0 2018-04-19 21H 38M 0S
       PUNCH_2  OT_PUNCH1  OT_PUNCH2            DEPT_NAME
 1: 18H 17M 0S 13H 58M 0S 18H 17M 0S             ACCOUNTS
 2:   6H 7M 0S       <NA> 22H 38M 0S             ACCOUNTS
 3: 14H 34M 0S 13H 35M 0S  18H 4M 0S             SECURITY
 4: 14H 38M 0S 13H 38M 0S  18H 4M 0S             SECURITY
 5:   6H 3M 0S 17H 43M 0S 22H 29M 0S             SECURITY
 6: 14H 35M 0S 13H 54M 0S 18H 57M 0S PRESSURE DIE CASTING
 7: 18H 23M 0S 14H 19M 0S 18H 23M 0S PRESSURE DIE CASTING
 8: 18H 23M 0S 14H 19M 0S 18H 23M 0S PRESSURE DIE CASTING
 9:       <NA>  14H 3M 0S 18H 57M 0S PRESSURE DIE CASTING
10:  6H 23M 0S 17H 42M 0S  6H 24M 0S PRESSURE DIE CASTING
11:  6H 23M 0S 17H 52M 0S  6H 24M 0S PRESSURE DIE CASTING
12:   6H 7M 0S       <NA>   6H 8M 0S PRESSURE DIE CASTING
13:   6H 7M 0S 17H 42M 0S 22H 54M 0S PRESSURE DIE CASTING
 [ reached getOption("max.print") -- omitted 9 rows ]

Similarly, with dplyr

DT %>% 
  mutate(OT_DATE = as.Date(OT_DATE, "%d-%b-%y")) %>% 
  mutate(PUNCH_1 = hm(PUNCH_1)) %>%  
  mutate(PUNCH_2 = hm(PUNCH_2)) %>%  
  mutate(OT_PUNCH1 = hm(OT_PUNCH1)) %>%  
  mutate(OT_PUNCH2 = hm(OT_PUNCH2)) 
   WORK_DEPT EMP_ID           EMP_NAME PUNCH_OT MAN_OT    OT_DATE    PUNCH_1    PUNCH_2
1          2 100976       LOKESH KUMAR        4      0 2018-04-19  5H 47M 0S 18H 17M 0S
2          2 503053        AJEET SINGH        5      0 2018-04-19 21H 38M 0S   6H 7M 0S
3          4 100042      BALWANT SINGH        3      4 2018-04-19  5H 39M 0S 14H 34M 0S
4          4 100761         DHARAM PAL        4      4 2018-04-19  5H 48M 0S 14H 38M 0S
5          4 100797     SATYAVIR SINGH        4      4 2018-04-19 21H 37M 0S   6H 3M 0S
6          5 100139       BALBIR SINGH        4      5 2018-04-19       <NA> 14H 35M 0S
7          5 100319       PARMOD KUMAR        4      4 2018-04-19  5H 51M 0S 18H 23M 0S
8          5 100718      JAGNESH KUMAR        3      4 2018-04-19  5H 49M 0S 18H 23M 0S
9          5 100785     SURENDRA SINGH        4      5 2018-04-19  5H 51M 0S       <NA>
10         5 100850      JANVESH KUMAR       12      4 2018-04-19 21H 31M 0S  6H 23M 0S
11         5 100977 JITENDRA KUMAR PAL       12      4 2018-04-19 21H 39M 0S  6H 23M 0S
12         5 203962           RAJKUMAR       12      4 2018-04-19 21H 30M 0S   6H 7M 0S
13         5 204070        LALIT KUMAR        5      0 2018-04-19 21H 38M 0S   6H 7M 0S
    OT_PUNCH1  OT_PUNCH2            DEPT_NAME
1  13H 58M 0S 18H 17M 0S             ACCOUNTS
2        <NA> 22H 38M 0S             ACCOUNTS
3  13H 35M 0S  18H 4M 0S             SECURITY
4  13H 38M 0S  18H 4M 0S             SECURITY
5  17H 43M 0S 22H 29M 0S             SECURITY
6  13H 54M 0S 18H 57M 0S PRESSURE DIE CASTING
7  14H 19M 0S 18H 23M 0S PRESSURE DIE CASTING
8  14H 19M 0S 18H 23M 0S PRESSURE DIE CASTING
9   14H 3M 0S 18H 57M 0S PRESSURE DIE CASTING
10 17H 42M 0S  6H 24M 0S PRESSURE DIE CASTING
11 17H 52M 0S  6H 24M 0S PRESSURE DIE CASTING
12       <NA>   6H 8M 0S PRESSURE DIE CASTING
13 17H 42M 0S 22H 54M 0S PRESSURE DIE CASTING

#15

Yes, the data.table code was throwing errors in two places (but neither one involved lubridate::hm) — character data being passed to as.POSIXct.numeric, plus the typo I mentioned before.

A couple of further thoughts:

  1. In the dplyr version, all the conversions can go in a single call to mutate:
DT %>% mutate(
   OT_DATE = as.Date(OT_DATE, "%d-%b-%y"),
   PUNCH_1 = hm(PUNCH_1),
   PUNCH_2 = hm(PUNCH_2),
   OT_PUNCH1 = hm(OT_PUNCH1),
   OT_PUNCH2 = hm(OT_PUNCH2)
) 
  1. lubridate::hm parses to a period object, so you want to be sure that makes sense for the data. For instance, if PUNCH1 and friends are points in time, rather than periods (e.g., 5.49 = 5:49 AM, not a period lasting 5 hours and 49 minutes), then you may want to combine them with the date (and timezone!) and convert into datetimes, rather than parsing with hm.

#16

Thanks for pointing this silly mistake to me. I thought there might be something wrong going on under the hood. But It's like I didn't check it properly. I am really sorry to bug you for nothing.

Thanks again
@jcblum
@RuReady
@jonspring