Is there any faster alternative to `as_tibble()`?

I have a huge matrix (395,706 rows × 2,689 columns) containing numeric values. I need to convert it to a data frame (tibble) for downstream analysis. What I'm currently doing is:

my_data %<>% as_tibble(rownames = "id")

But this takes an extremely long time (I have waited more than 2 hours and it is still running). Is there a faster way to do this?

Tibbles can only have one format (i.e. character), while data frames let you have many types you can edit with mutate().

As far as I know, a tibble is essentially a data frame with some added features that make it easier to work with. A tibble, like a data frame, can therefore hold columns of different types (e.g., both a character column and a numeric column) and can be edited with dplyr functions such as mutate(). More information about tibbles can be found in the "Tibbles" chapter of R for Data Science.
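
A minimal sketch of that point, assuming the tibble and dplyr packages are installed. The column names and values are made up for illustration.

```r
library(tibble)
library(dplyr)

# A tibble with one character column and one numeric column (toy data)
tb <- tibble(id = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))

# Each column keeps its own type, just as in a base data frame
col_types <- vapply(tb, function(col) class(col)[1], character(1))

# mutate() adds or edits columns on a tibble
tb2 <- mutate(tb, doubled = value * 2)
```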

I'm afraid the claim that tibbles can only hold one type is completely incorrect and should be ignored.

As for the original question, first ask yourself whether you really need to convert the data from a matrix.

If you do, consider using data.table, which is faster and more memory-efficient. However, I don't know how the speed of the initial conversion compares.
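
For reference, a minimal sketch of the matrix-to-data.table conversion, assuming the data.table package is installed. The small matrix `m` is a hypothetical stand-in for the real data, and nothing here measures how the conversion performs at full size.

```r
library(data.table)

# Small stand-in for the real matrix (hypothetical data)
m <- matrix(as.numeric(1:12), nrow = 4,
            dimnames = list(paste0("row", 1:4), paste0("V", 1:3)))

# keep.rownames = "id" stores the row names in a column named "id",
# mirroring as_tibble(rownames = "id")
dt <- as.data.table(m, keep.rownames = "id")
```

Wrapping the call in system.time() on a slice of the real matrix (say, the first 10,000 rows) would give a rough idea of the conversion cost before committing to the full run.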

Building on @martin.R's reply, data.table can modify objects in place (by reference) instead of making copies, whereas base R and the tidyverse generally copy an object when it is modified.
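
A small sketch of what "in place" means here, assuming the data.table package is installed. The toy table is hypothetical, and address() is used only to show that the object is not replaced.

```r
library(data.table)

dt <- data.table(id = c("a", "b", "c"), value = c(1, 2, 3))
addr_before <- address(dt)

# := updates the column by reference; no assignment back to dt is needed
dt[, value := value * 10]
addr_after <- address(dt)

# TRUE if dt is still the same object in memory
same_object <- identical(addr_before, addr_after)
```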

The catch is that data.table requires a different coding style from the tidyverse. A data.table is still a data frame, so it accommodates the same heterogeneous column types that data frames and tibbles do; if speed and memory use matter more to you than tidyverse syntax, consider using data.table.

For one intro to data.table see: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

Where is that matrix coming from?

Are you performing operations on the entire data set all at once?

If not, it might be better to store the "matrix" in a database and only load the data as needed. See https://db.rstudio.com/getting-started/database-queries for some ideas.
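
A rough sketch of that idea, assuming the DBI and RSQLite packages are installed. The in-memory SQLite database, the table name `cells`, and the long-format layout (one row per matrix cell) are all assumptions chosen for illustration, not a recommendation for the real data.

```r
library(DBI)

# In-memory SQLite database; a file path would be used for real data
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical long-format layout: one row per matrix cell
cells <- data.frame(
  id       = rep(c("row1", "row2"), each = 3),
  variable = rep(c("V1", "V2", "V3"), times = 2),
  value    = c(1, 2, 3, 4, 5, 6)
)
dbWriteTable(con, "cells", cells)

# Load only the part needed for the current step
subset_v2 <- dbGetQuery(
  con,
  "SELECT id, value FROM cells WHERE variable = 'V2' ORDER BY id"
)

dbDisconnect(con)
```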

Thank you for sharing the information. The matrix contains intermediate results from my previous analysis, so it is generated in the middle of the R script. Unfortunately, I need to access all the columns of the matrix at once for the operation I'm trying to do. Considering the replies above, I think the best solution in my case is to perform the downstream analysis without converting the matrix into a tibble, although it means I will need to write a fair amount of "ugly" code.
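
For anyone landing here later, a minimal sketch of working on the matrix directly with base R. The small matrix and the operations (row means, a row filter, column centering) are hypothetical stand-ins, not the actual analysis.

```r
# Small stand-in for the real matrix (hypothetical data)
m <- matrix(as.numeric(1:12), nrow = 4,
            dimnames = list(paste0("row", 1:4), paste0("V", 1:3)))

# Row-wise summary across all columns at once
row_means <- rowMeans(m)

# Filter rows by that summary; drop = FALSE keeps the result a matrix
keep <- m[row_means > 6, , drop = FALSE]

# Center every column by subtracting its mean
scaled <- sweep(m, 2, colMeans(m))
```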
