converting efficiently between data.table, data.frame and tibble

Hello there,

I have a pretty large dataset (about 200GB) and I need to use a mix of dplyr and data.table functions. Of course, with a data this size, I cannot allow the data to be duplicated in memory when I convert from data.table to tibble.

For instance, the SO post here (https://stackoverflow.com/questions/52431288/r-data-table-how-to-go-from-tibble-to-data-table-to-tibble-back) suggests to simply use :

setattr(x, "class", c("tbl", "tbl_df", "data.frame"))

which seems super hacky to me (but maybe it isn't). Any ideas/suggestions how to proceed?

Thanks!

I am curious what functions you are planning on using? This might be an instance where it makes sense to stay one package / data type, and I wonder if there's alternative functions you can use that can make that happen.

I'm new here, but here is how I would approach it. I imagine you have some sort of data warehouse. Please tell us what that is... But judicious use of the query feature is going to be your best friend. Then use select, filter, and na.omit as appropriate as you pipe the data all the way to the point where it needs to be.

I think dplyr can work on data.frame. data.table inherits from data.frame, or offer in memory conversion with setDF or setDT. So why converting to tibble ?
The only thing you'll loose is tibble printing. But I think every dplyr operation will work.

However, data.table can work by reference so memory efficiently. It is not dplyr logic, so as soon as you'll use dplyr functions, you'll have some copies at the end of the workflow. Or it is what I understand at least.

putting data in database and using dbplyr features allowing dplyr to work with database can be a good solution for such size of data.

Hopes it helps.

1 Like

thanks @cderv . the idea is that I want to use the rolling joins in data.table.
so what you suggests is simply to

  1. use data.table to say, import the data and run the rolling join
  2. use setDF to convert efficiently to data.frame
  3. use native dplyr functions on the data.frame directly.

Is that right? Am I only losing the pretty display or there is some other subtle issue I should be aware of?

Thanks!!

@cderv what is weird is that the output of this pipe seems to be a tibble.. so is there some hidden tibble conversion along the road?

library(dplyr)
library(data.table)
library(tibble)

data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
) %>% setDF() %>% 
  group_by(ID) %>% 
  filter(row_number() == 1)
# A tibble: 3 x 4
# Groups:   ID [3]
  ID        a     b     c
  <chr> <int> <int> <int>
1 b         1     7    13
2 a         4    10    16
3 c         6    12    18

Yes I think there is a conversion because the function used here is group_by that result in a special object know only by dplyr then. It makes sense because the concept of grouped data.frame are one of dplyr.
If you just filter, you keep a df and get no tibble

library(dplyr)
library(data.table)
library(magrittr)

data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
) %>% setDF() %T>% 
  {cat(c("---", class(.), "---"), sep = "\n")} %>%
  filter(row_number() == 1) %T>%
  {cat(c("---", class(.), "---"), sep = "\n")}
#> ---
#> data.frame
#> ---
#> ---
#> data.frame
#> ---
#>   ID a b  c
#> 1  b 1 7 13

Created on 2019-03-04 by the reprex package (v0.2.1)

So you are right, what I suggested is not totally true - or at least not so simple. The conversion seems to append in C++ function from the code . I don't think it copies the all data.frame. I let you look at size and uses dplyr::changes() to look for difference.

Also, I don't think as_tibble will make a full copy if you apply it on a data.table. I think it will just change classes and rownames, there is no reason data are modified. Try and see with dplyr::changes()

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.