I have an 8 GB .csv dataset that takes 40 minutes to load, and when I try to select columns it either loads too much into memory or the session crashes.
Is there any way to reduce the size?
CSV is a plain-text format, so these files typically compress well with standard tools such as zip or gzip (.zip/.gz).
library(readr) from the tidyverse can read compressed CSVs directly into memory; perhaps that's an approach to start with.
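For example, a minimal sketch (data.csv.gz is a placeholder name; as far as I know readr detects .gz/.bz2/.xz/.zip from the extension and decompresses transparently):

library(readr)
# read_csv() recognises the compression from the file extension
# and decompresses on the fly, so no manual unzipping is needed
df <- read_csv("data.csv.gz")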
Beyond that, you would likely benefit from tooling that lets you leave the data on disk and stream it into functions as needed; I believe the arrow package supports this, although I haven't had cause to use it myself yet.
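I haven't tested this myself, but from the docs a sketch would look roughly like this (big_data.csv and the column names are placeholders):

library(arrow)
library(dplyr)

# Register the file as an on-disk dataset; nothing is read into RAM yet
ds <- open_dataset("big_data.csv", format = "csv")

# Only the selected columns and matching rows are materialised by collect()
result <- ds |>
  select(col_a, col_b) |>   # hypothetical column names
  filter(col_a > 0) |>
  collect()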
Another option is readr::read_csv_chunked(), which reads the file in pieces and applies a callback to each chunk, so only the filtered result is kept in memory. This works up to about 2 billion rows.
library(readr)
# Callback run on each chunk: keep only the rows where gear == 3
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
#> # A tibble: 15 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  3  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  5  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#>  6  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#>  7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#>  8  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
#>  9  10.4     8  460    215  3     5.42  17.8     0     0     3     4
#> 10  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
#> 11  21.5     4  120.    97  3.7   2.46  20.0     1     0     3     1
#> 12  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
#> 13  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
#> 14  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
#> 15  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
Created on 2023-06-20 with reprex v2.0.2
R isn't always the ideal tool for extracting selected columns from large CSV files, because all of the work has to be done in memory. A command-line alternative is cut; here's a snippet that extracts the first two columns of a CSV file:
cut -d ',' -f 1,2 gas.csv > trimmed.csv
The resulting trimmed.csv can then be imported with much less demand on RAM.
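For instance (assuming the trimmed file is in the working directory):

library(readr)
# The two-column file is far smaller, so an ordinary read works fine
df <- read_csv("trimmed.csv")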