storing large tables as files


#1

Are there any good alternatives to tab/comma-separated plain text files for storing large tables in a language-agnostic way?

One option I have seen is the feather format. However, the future of that seems uncertain (highlighted by a discussion here over a year ago). The last CRAN release is over two years old.


#2

It depends on the purpose.

For storing and managing source data, I find CSV nearly perfect:

  • Files can be viewed and edited in a basic text editor.
  • It follows a very simple format that can allow explicit "character" values; otherwise, software decides how to handle different data types. It's hard to be more language-agnostic than leaving the details up to the languages.
  • Changes are tracked nicely in any version control system.
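To illustrate the point about explicit "character" values, here is a minimal sketch using Python's standard csv module. The column names and rows are made up for the example. With QUOTE_NONNUMERIC, every string is written in quotes, and on reading, anything unquoted is converted to a float, so a value like a ZIP code keeps its leading zero.

```python
import csv
import io

# Hypothetical example table: zip_code must stay text, weight_kg is numeric.
rows = [
    ["id", "zip_code", "weight_kg"],
    ["a1", "02139", 3.5],
    ["b2", "10001", 12.0],
]

# Write: every non-numeric field is wrapped in quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read: quoted fields stay strings, unquoted fields become floats.
reader = csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONNUMERIC)
parsed = list(reader)
```

Other readers (R, spreadsheets, and so on) apply their own type-guessing rules, so the quoting is a hint rather than a guarantee.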

If you're asking about a way to pass intermediate data between processes, then JSON might also be a contender. If you really need efficiency or management beyond simple text, many languages have extensions to work with SQLite databases.
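As a rough sketch of the SQLite route, the snippet below loads a small CSV into an in-memory SQLite database and runs an aggregate query, using only Python's standard library. The table name, column names and data are hypothetical; a real table would need its own schema.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a source data file.
csv_text = "sample,gene,reads\ns1,abc,10\ns1,xyz,4\ns2,abc,7\n"

def load_csv_into_sqlite(conn, text):
    """Create a table and insert every CSV row into it."""
    reader = csv.DictReader(io.StringIO(text))
    conn.execute("CREATE TABLE counts (sample TEXT, gene TEXT, reads INTEGER)")
    # Each row is a dict, so the named placeholders match the CSV header.
    conn.executemany(
        "INSERT INTO counts VALUES (:sample, :gene, :reads)", reader
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
load_csv_into_sqlite(conn, csv_text)

totals = conn.execute(
    "SELECT gene, SUM(reads) FROM counts GROUP BY gene ORDER BY gene"
).fetchall()
```

The resulting database file can then be opened from any language that has an SQLite binding.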


#3

This really depends on what problem you are trying to solve.

For many things CSV files are great. HDF5 is a binary format that's great for many (but not all) things. Other popular plain-text formats are XML and JSON.

I think the issue with Feather is concern over the spec changing. The specific concern is that it might not be a good archive format to keep data in for 5 years, because in 5 years the spec may have changed. That should not stop you from using it for data exchange in the short run. But, yes, it's not the ideal archive format.


#4

I agree CSV is great, but once a file reaches multiple GBs (gzipped), it becomes difficult to both read and write.

On the other hand, if you save the same data as an R object, the experience is much smoother. The downside is that you are then required to use R (I don't have any problem with R, but it should be more generic for easier sharing and long-term storage).


#5

You might consider the fst package. The advantage is very fast read and write speeds and the ability to save with compression. You can also read selected rows and columns of the data, rather than reading in the whole file. It is not language agnostic (it's an R package), but there is now a binding for the Julia language.

Although the package is in active development, the package development site states that future versions will maintain backwards compatibility for reading files saved with previous versions, so perhaps that covers at least intermediate term archiving of data. Plus you can always* use packrat to manage dependencies or roll back to older versions of fst.

* Excluding a zombie apocalypse or similarly disastrous event.


#6

True, but this gets rough if your data is larger than your RAM...

Once you're dealing with multi-GB datasets that may not all fit in RAM, it's worth seriously considering storing the data in a database rather than as files on a filesystem. Or you can get the portability of CSVs and the convenience of an SQL database with a database abstraction on top of CSV, such as Apache Drill. The files are still stored as CSV, but your applications talk SQL to Drill and can connect using ODBC tools.
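For the larger-than-RAM case, one possible approach (a sketch, not a tuned implementation) is to stream a gzipped CSV into SQLite in fixed-size batches, so only one batch is held in memory at a time. The file names, the two-column layout and the batch size are assumptions made for the example, and the demo file is generated on the spot.

```python
import csv
import gzip
import itertools
import os
import sqlite3
import tempfile

def stream_csv_gz_to_sqlite(csv_gz_path, db_path, batch_size=50_000):
    """Copy a gzipped CSV with columns (id, reading) into SQLite batch by batch."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements (id INTEGER, reading REAL)"
    )
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        while True:
            # Pull at most batch_size rows; the rest stays on disk.
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            conn.executemany("INSERT INTO measurements VALUES (?, ?)", batch)
            conn.commit()
    return conn

# Demo: write a small gzipped CSV, then stream it into a database file.
workdir = tempfile.mkdtemp()
csv_gz_path = os.path.join(workdir, "measurements.csv.gz")
with gzip.open(csv_gz_path, "wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "reading"])
    writer.writerows((i, i * 0.5) for i in range(1000))

conn = stream_csv_gz_to_sqlite(
    csv_gz_path, os.path.join(workdir, "measurements.db"), batch_size=100
)
row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(reading) FROM measurements"
).fetchone()
```

After the load, queries run against the database file and only the rows they return need to fit in memory.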

So it really depends on your workflow and what's important to you.


#7

Thank you for the feedback everyone.

How about MatrixMarket format?
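For context, MatrixMarket is a plain-text format aimed at matrices (typically sparse ones) rather than general tables. Below is a minimal sketch of a writer for its coordinate layout with real, general entries; the helper name and the sample data are hypothetical, and the function takes a dict keyed by zero-based (row, column).

```python
def to_matrix_market(entries, n_rows, n_cols):
    """Render {(row, col): value} as MatrixMarket coordinate text."""
    lines = [
        "%%MatrixMarket matrix coordinate real general",  # format banner
        f"{n_rows} {n_cols} {len(entries)}",              # rows, cols, non-zeros
    ]
    for (row, col), value in sorted(entries.items()):
        # The format uses 1-based indices.
        lines.append(f"{row + 1} {col + 1} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical 3 x 2 matrix with two non-zero entries.
sparse = {(0, 0): 1.5, (2, 1): -4.0}
mm_text = to_matrix_market(sparse, n_rows=3, n_cols=2)
```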


#8

If you're looking for a file format that hardly anyone has heard of and where the documentation disappears when the US federal gov is down, then I think this is a great choice.

