Problems with large datasets

I have an 8 GB .csv dataset that takes about 40 minutes to load, and when I try to select columns R either runs out of memory or crashes.
Is there any way to reduce the size?

CSV is a plain-text format, so it typically compresses well with standard tools such as zip or gzip (.zip/.gz).
readr (part of the tidyverse) can read compressed CSVs directly into memory, so that may be an approach to start with.
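For example (a minimal sketch; "data.csv.gz" stands in for your compressed file), readr decompresses files ending in .gz or .zip transparently:

library(readr)

# read_csv() detects the .gz extension and decompresses on the fly
df <- read_csv("data.csv.gz")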
You would also likely benefit from tooling that leaves the data on disk and streams it into functions as needed; I believe the arrow package supports this, though I haven't had cause to use it myself yet.
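Something along these lines should work with arrow (a sketch only; "data.csv" and the column names are placeholders for your own):

library(arrow)
library(dplyr)

# open_dataset() reads the file's schema but leaves the data on disk
ds <- open_dataset("data.csv", format = "csv")

# Only the selected columns are pulled into memory when collect() runs
result <- ds |>
  select(col_a, col_b) |>
  collect()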


This works up to about 2 billion rows (R's 2^31 - 1 vector-length limit).

library(readr)

# Callback run on each chunk: keep only the rows where gear == 3
f <- function(x, pos) subset(x, gear == 3)

# Read the CSV a few rows at a time, applying the callback to each chunk
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)
#> 
#> ── Column specification ────────────────────────────────────────────────────
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#> # A tibble: 15 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  3  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  5  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#>  6  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#>  7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#>  8  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
#>  9  10.4     8  460    215  3     5.42  17.8     0     0     3     4
#> 10  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
#> 11  21.5     4  120.    97  3.7   2.46  20.0     1     0     3     1
#> 12  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
#> 13  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
#> 14  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
#> 15  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2

Created on 2023-06-20 with reprex v2.0.2

R isn't always the ideal tool for extracting selected columns from large CSV files, because the work has to be done entirely in memory. Here's a shell snippet to extract the first two columns of a CSV file

cut -d ',' -f 1,2 gas.csv > trimmed.csv

and then trimmed.csv can be imported with less demand on RAM.
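If you'd rather stay in R, recent readr versions can also skip columns at parse time via the col_select argument (a sketch, reusing the gas.csv file from the snippet above):

library(readr)

# Only the first two columns are materialised; the rest are skipped at parse time
trimmed <- read_csv("gas.csv", col_select = c(1, 2))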

