I am a bit perplexed - I have .rds files, each containing a large data.frame (4 columns by ~2 million rows). I'm doing some quality control, so my process is to read in a file, remove the offending rows, and save the result as a new .rds file so as not to overwrite the raw data.
However, the second "quality control" file (thousands of rows smaller), when saved as an .rds, is larger than the original by a decent amount.
I can replicate this with the 'volcano' dataset:
v = volcano
v = reshape2::melt(v)                    # matrix -> long data.frame (Var1, Var2, value)
saveRDS(object = v, file = "test.rds")
t = readRDS("test.rds")
t = t[which(t$value < 180), ]            # drop the "offending" rows
saveRDS(object = t, file = "test2.rds")
The file test.rds is 6 KB and test2.rds is 17 KB on my computer.
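I see the same numbers when checking from R itself with base R's file.size(), so it isn't just the file browser rounding:

file.size("test.rds")    # on-disk size in bytes, ~6 KB here
file.size("test2.rds")   # ~17 KB, despite having fewer rows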
str() shows that their data types are the same. I tried other ways of indexing as well, but the result is the same (see the sketch below for the kind of alternatives I mean).
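For example, something along these lines (just a sketch of the kind of thing I tried; the output file names here are made up, but the resulting files are likewise bigger than test.rds):

t = readRDS("test.rds")                  # start again from the saved file
t2 = t[t$value < 180, ]                  # logical indexing instead of which()
t3 = subset(t, value < 180)              # base::subset()
saveRDS(object = t2, file = "test2_logical.rds")
saveRDS(object = t3, file = "test2_subset.rds")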
The only difference in this example is that, with the volcano data, the filtered data frame is larger in memory than the original:
> object.size(v)
85904 bytes
> object.size(t)
102488 bytes
Whereas in my actual data the second object.size() is smaller, while the output file size is larger.
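For what it's worth, I can reproduce the gap without touching the disk by serializing and gzip-compressing the objects in memory (a rough check using base R's serialize() and memCompress(); since saveRDS() uses gzip compression by default, these lengths should roughly track the .rds file sizes):

raw_v = serialize(v, connection = NULL)      # uncompressed serialized bytes of v
raw_t = serialize(t, connection = NULL)      # same for the filtered data frame
length(memCompress(raw_v, type = "gzip"))    # roughly tracks the size of test.rds
length(memCompress(raw_t, type = "gzip"))    # roughly tracks test2.rds (larger)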
Hoping this is something silly I overlooked.
Cheers, thanks!