How to Read Large JSON file in R?

dhanashreedeshpande · August 30, 2018, 5:53pm

Following R code is reading small JSON file but when I am applying huge JSON data (3 GB, 5,51,367 records, and 341 features), the reading process continues and does not end. My JSON data file is of proper format which is required for stream_in() function. Here is my R code,

library(jsonlite)
main_sample = jsonlite::stream_in(file("sample.json"),pagesize = 100000) # reads line by line, pagesize size is given to break records into chunks
data = jsonlite::flatten(main_sample) # convert into more nested columns
i <- sapply(data, is.list) # columns of list class converted into character
data[i] <- lapply(data[i], as.character)
write.table(data, file = "data.csv", sep = ",",row.names = FALSE, col.names = TRUE) # convert JSON into CSV format.

I want the solution which takes normal time to read the JSON file. Even though I have given pagesize=100000 but still it takes non ending time. I have 8 GB RAM. Can anyone solve this problem of reading huge JSON file in R?

cderv · August 30, 2018, 6:39pm

I don't know specifically about this error. It is not clear in your code if the error comes from reading part or writing part. As I see readLines in the error, I believe it is in fromJSON (but I may be wrong).

On this part, I have a suggestion. Try jsonlite with jsonlite::read_json to see if there is still an error. It is another implementation for parsing JSON.

cderv · August 30, 2018, 7:40pm

is it the exact same error ? Did you try in a clean environment ? It is easier if we have the exact command you run before error and not much. Also it is better with a reprex.

On the web, your error is associated to rjson and big json file.
You can also try to isolate where it comes from exactly (use traceback() to help), and if it is really from the rjson try to contact the developer.

cderv · August 30, 2018, 10:53pm

Can you provide a sample file?

cole · August 31, 2018, 2:55am

If you can provide a bit more information around the structure of your JSON file, that can be helpful as well. One thing I have had success with in the past (since it looks like you are running out of memory), is streaming the file.

This is easiest if every line is a JSON object (that's a standard in some JSON implementations... although I forget what it is called). However, if that is not the case and you know a JSON object usually does not have more than 500 lines, let's say, then it is possible to determine the start / end of a JSON object and split the file up that way. It's a bit more work, but it gets around your limitation

readr has some really great streaming file support. If I get a chance, I would love to scrounge up the example that I did (working with 4 GB of memory and a 10 GB file or something like that... way bigger than I could fit in memory).

I want to say I used jsonlite with readr::read_lines_chunked or something along those lines. tidyjson is also a favorite of mine when parsing JSON data (if the data is very complex), but it is not on CRAN anymore (a working version can be had with devtools::install_github("colearendt/tidyjson")).

EDIT: try jsonlite directly first, as @cderv suggested. Streaming is much more complex and potentially painful, but it gets around a memory limitation if one does exist

cole · August 31, 2018, 12:18pm

This error suggests that the file does not match the format spec for JSON. I would recommend searching for that point in the file using a text editor (or grep) and evaluating whether there is a typo / error in the file that makes it an invalid JSON object.

cderv · August 31, 2018, 5:09pm

This function is from readr right ? So you are not reading your file as json but just line by line, as character vector. If you want to continue, you'll need some package to deal with string, like stringr.

However, I would pursue in the json parsing solution.

In fact, it is possible that your json file is not a 'perfect json' file, that is to say not a valid json structure in a whole but a compilation of valid json. Something like that. This format is called ndjson , and it is possible you big file is that.

To deal with such file, you can use several tools. See jsonlite::stream_in and its example, or the ndjson that sometimes is more efficient. (see ndjson::stream_in and its example)

Also, you can read lines as you have done and reconstruct valid json as strings to parse with jsonlite::fromJSON.

Can you try those options on your big file ?

cderv · August 31, 2018, 5:31pm

You file is very big and this kind to feedback can be useful to jsonlite developer. There may be performance issue that could be solved.
Anyway, ndjson is known to be more efficient, maybe it could take less time.

However, if you want to pursue with read_lines, you can parse what is resulting to format the data as you want to. Just use parsing fonction like jsonlite::fromJSON on small bit of you file or stringr function and regex to extract relevant data.

alistaire · September 1, 2018, 11:19pm

hrbrmstr's book on using Apache Drill with R has a chapter on reading ndjson (assuming that's what it is) into R:

alistaire · September 2, 2018, 12:30am

Conversion from JSON to CSV through R implies two steps: reading and writing. Reading is generally the hard step, and the liked page shows one way to do this in the second half:

library(sergeant)
library(tidyverse)

db <- src_drill("localhost")

scorecards <- tbl(db, "dfs.tmp.`/college-scorecard/*.json.gz`")

To get this to work requires installing and starting Apache Drill first, covered in the first chapters. To be clear, jsonlite::stream_in is a simpler approach; Drill just scales more flexibly.

Once data is read into R, saving it as a CSV is comparatively straightforward, and can be as simple as a call to write.csv, or better, readr::write_csv or data.table::fwrite.

The top of the linked page suggests another possibility: using Drill to both read and write without touching R at all. (You could run the SQL from R if you like.) The example there goes the other direction—from CSV to JSON—but it is certainly capable of going the other direction. It requires writing SQL to invert the following:

0: jdbc:drill:> ALTER SESSION SET `store.format`='json';
0: jdbc:drill:> CREATE TABLE dfs.tmp.`/1996-97` AS SELECT * FROM dfs.root.`/Users/bob/Data/CollegeScorecard_Raw_Data/MERGED1996_97_PP.csv`;

The advantage of this approach is that Drill is quite clever about memory management. You can still subset along the way, if you like.

The disadvantage is that while the APIs are well-designed, this is beyond basic R and requires some system setup. If you can get jsonlite to handle everything and this is a one-time job, Drill is overkill.

cole · September 2, 2018, 12:50am

Just in general (thinking about future viewers), it is better not to edit your question and change the context of the discussion. It is better to copy / paste and make changes in a new post to keep the stream of the discussion readable. (Otherwise all of our responses thus far look crazy / unrelated to your question)

cole · September 2, 2018, 7:07pm

No worries at all! We're glad you're here, and hopeful we can help! Just want to offer suggestions for the future / etc.

Oh, to answer your question, you might try dplyr::bind_rows. I find it to behave a lot more dependably / understandably than rbind! That or take a look at your json_data object and see what might be causing rbind to fail (something about number of columns, etc.)

cderv · September 3, 2018, 5:39am

I believe your data is ndjson and that you should try ndjson as I mentioned in an earlier post. ndjson::stream_in is supposed to be faster than jsonlite alternative. See my earlier post and the website Home. ndjson

cderv · September 3, 2018, 6:36pm

Good to know. I am not sure what to advice for you to go further. Maybe it is possible de parallelize this type of reading by streaming operation. Let's see if others have solutions.

dhanashreedeshpande · September 3, 2018, 9:13pm

library(sparklyr)
library(dplyr)
library(jsonlite)

Sys.setenv(SPARK_HOME="/usr/lib/spark")
# Configure cluster (c3.4xlarge 30G 16core 320disk)
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "7g"
conf$'sparklyr.shell.driver-memory' <- "7g"
conf$spark.executor.cores <- 20
conf$spark.executor.memory <- "7G"
conf$spark.yarn.am.cores  <- 20
conf$spark.yarn.am.memory <- "7G"
conf$spark.executor.instances <- 20
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32

sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc,name="example",path="example.json", header = TRUE, memory = FALSE,
                              overwrite = TRUE) 
sdf_schema_viewer(sample_tbl)

cderv · September 3, 2018, 9:20pm

Good to know! I did not think about sparklyr but right it is a clever option. You should be able to manipulate your data to create the csv file you want. Well done !

cole · September 4, 2018, 1:53am

Very nice! I'm not super familiar with all of the ins and outs, but spark.rstudio.com is a fantastic resource on the topic!