Performance issues when processing a large Json file

Cagideon · January 18, 2019, 2:15pm

Hello,

I'm taking a class on Data Analytics, and I turned in my code that processes a 10k line JSON file and decided that I wanted to learn R at the same time to parse through the supplied JSON file. The way the JSON is formatted is, it's a single .json file but it has 10k individual JSON objects with each object having a nested JSON array that I need to get account data from, account type, number of each account total and the sum of assets in each account type. I can tell my code works with a small data set, but when I process the 10k line file I let it run overnight and still it didn't process. I'm sure my code is just horribly inefficient, but I'm not sure how best to go about tuning it to solve for this. I'd appreciate any feedback on how to tune my code to run more efficiently. Thank you!

library("rjson")
library("dplyr")

parseJsonData <- function (sourceFile, outputFile) 
{
  #Get all total lines in the source file provided
  parsedJson <- readLines(sourceFile)
  
  #Get each amount for each account type
  accountTypeAmounts <- GetAccountTypeAmounts(parsedJson)

  #Clean up old output file
  if(file.exists(outputFile)){
    file.remove(outputFile)
  }
  
  #Loop over each line in the sourceFile, parse the JSON and append to DataFrame
  JsonAcctData <- NULL
  for(i in 1:length(parsedJson)){
    jsonValue <- fromJSON(parsedJson[[i]])
    frame <- data.frame(jsonValue)
    JsonAcctData <- rbind(JsonAcctData, frame)
  }

  #Get account types
  accountType <- select(JsonAcctData, "accountHistory.type")
  #print(accountType)
  
  #Get account type total count
  accountTypeCount <- table(unlist(select(JsonAcctData, "accountHistory.type")))
  
  #Take all dataframes and consolodate them into one
  finalAcctData <- ConsolodateDataFrames(accountType, accountTypeAmounts, accountTypeCount)
  #Write the DataFrame to the output file in CSV format
  write.csv(finalAcctData, file = outputFile)
}

GetAccountTypeAmounts <- function(json) {
  amount <- grep('\"amount\":\"\\d+\\.\\d+\"', json, value = TRUE)
  amount <- as.numeric(gsub('.*amount\":\"(\\d+\\.+\\d+)\".*', '\\1', amount, perl = TRUE))
  type   <- grep('\"type\":\"\\w+\"', json, value = TRUE)
  type   <- gsub('.*type\":\"(\\w+)\".*', '\\1', type, perl = TRUE)
  
  data.frame(type, amount)
}

ConsolodateDataFrames <- function(accountType, accountTypeAmounts, accountTypeCount) {
  
  #Create consolodated dataframe
  consolodatedDataFrame <- data.frame(matrix(ncol = 3, nrow = 0))
  headers <- c("account_type", "the_count", "sum_amount")
  colnames(consolodatedDataFrame) <- headers
  
  types <- list("withdrawal","deposit","invoice","payment")
  
  #Consolodate all dataframes into one
  for(actType in types){
    amtSum <- sum(accountTypeAmounts[accountTypeAmounts$type == actType, "amount"])
    actCount <- as.character(accountTypeCount[actType][1])
    newRow <- data.frame(account_type=actType, the_count=actCount, sum_amount=amtSum)
    consolodatedDataFrame <- rbind(consolodatedDataFrame, newRow)
  }
  
  consolodatedDataFrame
}

Here is a small snippet of data that I was working with, the original file was 10k lines.

{"name":"Test1", "accountHistory":[{"amount":"107.62","date":"2012-02-02T06:00:00.000Z","business":"CompanyA","name":"Home Loan Account 6220","type":"payment","account":"11111111"},{"amount":"650.88","date":"2012-02-02T06:00:00.000Z","business":"CompanyF","name":"Checking Account 9001","type":"payment","account":"123123123"},{"amount":"878.63","date":"2012-02-02T06:00:00.000Z","business":"CompanyG","name":"Money Market Account 8743","type":"deposit","account":"123123123"}]}
{"name":"Test2", "accountHistory":[{"amount":"199.29","date":"2012-02-02T06:00:00.000Z","business":"CompanyB","name":"Savings Account 3580","type":"invoice","account":"12312312"},{"amount":"841.48","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Home Loan Account 5988","type":"payment","account":"123123123"},{"amount":"116.55","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 1794","type":"withdrawal","account":"12312313"}]}
{"name":"Test3", "accountHistory":[{"amount":"602.45","date":"2012-02-02T06:00:00.000Z","business":"CompanyC","name":"Personal Loan Account 9924","type":"deposit","account":"12312312"},{"amount":"377.08","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Investment Account 1002","type":"withdrawal","account":"123123123"},{"amount":"204.44","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 2452","type":"withdrawal","account":"123123123"}]}
{"name":"Test4", "accountHistory":[{"amount":"31.19","date":"2012-02-02T06:00:00.000Z","business":"CompanyD","name":"Auto Loan Account 3893","type":"payment","account":"12312312"},{"amount":"820.25","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Savings Account 6260","type":"payment","account":"123123123"},{"amount":"582.19","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 9826","type":"payment","account":"123123123"}]}
{"name":"Test5", "accountHistory":[{"amount":"340.93","date":"2012-02-02T06:00:00.000Z","business":"CompanyE","name":"Checking Account 5055","type":"withdrawal","account":"12312312"},{"amount":"523.31","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Savings Account 1680","type":"deposit","account":"123123123"},{"amount":"153.38","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Money Market Account 1010","type":"withdrawal","account":"123123123"}]}

Parsing the above JSON creates a CSV value with the below values

"","account_type","the_count","sum_amount"
"1","withdrawal","1",474.37
"2","deposit","1",878.63
"3","invoice","1",0
"4","payment","2",582.19

nwerth · January 18, 2019, 3:36pm

Please post some short example JSON for us to work with, as well as what the final result of the example should look like. For instance,

# mydata.json
{foo: 1, bar: "hello"}
{foo: 9, bar: "goodbye"}

# desired result as a data.frame
  foo       bar
1   1   "hello"
2   9 "goodbye"

Some things I can suggest now:

Replace all select calls (and any wrapping unlist) which only grab a single column with [[-indexing. E.g.,
```
table(unlist(select(JsonAcctData, "accountHistory.type")))
```
can be replaced with
```
table(JsonAcctData[["accountHistory.type"]])
```
"Growing" objects (e.g., rbind in a for loop) is very inefficient. It's better to start by creating an appropriately sized object and then replacing parts. Even better, put off creating the collection until the end. So,
```
JsonAcctData <- NULL
 for(i in 1:length(parsedJson)){
   jsonValue <- fromJSON(parsedJson[[i]])
   frame <- data.frame(jsonValue)
   JsonAcctData <- rbind(JsonAcctData, frame)
 }
```
is much more efficient and simpler to read when written as
```
JsonAcctData <- parsedJson %>%
    lapply(FUN = fromJSON) %>%
    bind_rows()
```

You can also replace

consolodatedDataFrame <- data.frame(matrix(ncol = 3, nrow = 0))
headers <- c("account_type", "the_count", "sum_amount")
colnames(consolodatedDataFrame) <- headers

types <- list("withdrawal","deposit","invoice","payment")

#Consolodate all dataframes into one
for(actType in types){
  amtSum <- sum(accountTypeAmounts[accountTypeAmounts$type == actType, "amount"])
  actCount <- as.character(accountTypeCount[actType][1])
  newRow <- data.frame(account_type=actType, the_count=actCount, sum_amount=amtSum)
  consolodatedDataFrame <- rbind(consolodatedDataFrame, newRow)
}

consolodatedDataFrame

with

types <- c("withdrawal","deposit","invoice","payment")
# Want the first values of each column whose name is in `types`
counts <- as.character(accountTypeCount[1, types])
# Get a vector of sums by type, which has names stating the type
type_sums <- tapply(
  X     = accountTypeAmounts[["amount"]],
  INDEX = accountTypeAmounts[["type"]].
  FUN   = sum
)
# Now build the data.frame
data.frame(
  account_type = types,
  the_count    = counts,
  sum_amount   = type_sums[types]
)

Look into lapply, vapply, and tapply for "vectorizing" operations. Or their equivalents in the purrr package.

Cagideon · January 21, 2019, 2:01am

Thank you for the info! I made those changes but I'm running into a strange error when I try to run my code. I've pasted my code and the exception below. I think it's an issue with passing a list into a table as a parameter for the as.character function, but I can't find any articles online about why this might be wrong.

library("rjson")
library("dplyr")

parseJsonData <- function (sourceFile, outputFile) 
{
  #Get all total lines in the source file provided
  parsedJson <- readLines(sourceFile)
  
  #Get each amount for each account type
  accountTypeAmounts <- GetAccountTypeAmounts(parsedJson)

  #Clean up old output file
  if(file.exists(outputFile)){
    file.remove(outputFile)
  }
  
  #Loop over each line in the sourceFile, parse the JSON and append to DataFrame
  #JsonAcctData <- parsedJson %>%
  #  lapply(FUN = fromJSON) %>%
  #  bind_rows()

  JsonAcctData <- NULL
  for(i in 1:length(parsedJson)){
    jsonValue <- fromJSON(parsedJson[[i]])
    frame <- data.frame(jsonValue)
    JsonAcctData <- rbind(JsonAcctData, frame)
  }
  
  #Get account types
  accountType <- select(JsonAcctData, "accountHistory.type")
  #print(accountType)
  
  #Get account type total count
  accountTypeCount <- table(JsonAcctData[["accountHistory.type"]])
  print(accountTypeCount)
  
  #Take all dataframes and consolodate them into one
  finalAcctData <- ConsolodateDataFrames(accountType, accountTypeAmounts, accountTypeCount)
  #Write the DataFrame to the output file in CSV format
  write.csv(finalAcctData, file = outputFile)
}

GetAccountTypeAmounts <- function(json) {
  amount <- grep('\"amount\":\"\\d+\\.\\d+\"', json, value = TRUE)
  amount <- as.numeric(gsub('.*amount\":\"(\\d+\\.+\\d+)\".*', '\\1', amount, perl = TRUE))
  type   <- grep('\"type\":\"\\w+\"', json, value = TRUE)
  type   <- gsub('.*type\":\"(\\w+)\".*', '\\1', type, perl = TRUE)
  
  data.frame(type, amount)
}

ConsolodateDataFrames <- function(accountType, accountTypeAmounts, accountTypeCount) {
  
  #Constant account types
  types <- list("withdrawal","deposit","invoice","payment")
  counts <- as.character(accountTypeCount[1, types])
  
  type_sums <- tapply(
    X     = accountTypeAmounts[["amount"]],
    INDEX = accountTypeAmounts[["type"]],
    FUN   = sum
  )
  
  consolodatedDataFrame <- data.frame(
    account_type = types,
    the_count    = counts,
    sum_amount   = type_sums[types]
  )
  
  consolodatedDataFrame
}

Output:

 Error in `[.default`(accountTypeCount, types) : 
  invalid subscript type 'list' 
5.NextMethod() 
4.`[.table`(accountTypeCount, types) 
3.accountTypeCount[types] 
2.ConsolodateDataFrames(accountType, accountTypeAmounts, accountTypeCount) 
1.parseJsonData("simpleData.json", "test3.csv")

cderv · January 21, 2019, 2:02pm

I think the format you described correspond to ndjson.

There is some support for that in jsonlite or ndjson . See example here

json_string <- '{"name":"Test1", "accountHistory":[{"amount":"107.62","date":"2012-02-02T06:00:00.000Z","business":"CompanyA","name":"Home Loan Account 6220","type":"payment","account":"11111111"},{"amount":"650.88","date":"2012-02-02T06:00:00.000Z","business":"CompanyF","name":"Checking Account 9001","type":"payment","account":"123123123"},{"amount":"878.63","date":"2012-02-02T06:00:00.000Z","business":"CompanyG","name":"Money Market Account 8743","type":"deposit","account":"123123123"}]}
{"name":"Test2", "accountHistory":[{"amount":"199.29","date":"2012-02-02T06:00:00.000Z","business":"CompanyB","name":"Savings Account 3580","type":"invoice","account":"12312312"},{"amount":"841.48","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Home Loan Account 5988","type":"payment","account":"123123123"},{"amount":"116.55","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 1794","type":"withdrawal","account":"12312313"}]}
{"name":"Test3", "accountHistory":[{"amount":"602.45","date":"2012-02-02T06:00:00.000Z","business":"CompanyC","name":"Personal Loan Account 9924","type":"deposit","account":"12312312"},{"amount":"377.08","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Investment Account 1002","type":"withdrawal","account":"123123123"},{"amount":"204.44","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 2452","type":"withdrawal","account":"123123123"}]}
{"name":"Test4", "accountHistory":[{"amount":"31.19","date":"2012-02-02T06:00:00.000Z","business":"CompanyD","name":"Auto Loan Account 3893","type":"payment","account":"12312312"},{"amount":"820.25","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Savings Account 6260","type":"payment","account":"123123123"},{"amount":"582.19","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 9826","type":"payment","account":"123123123"}]}
{"name":"Test5", "accountHistory":[{"amount":"340.93","date":"2012-02-02T06:00:00.000Z","business":"CompanyE","name":"Checking Account 5055","type":"withdrawal","account":"12312312"},{"amount":"523.31","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Savings Account 1680","type":"deposit","account":"123123123"},{"amount":"153.38","date":"2012-02-02T06:00:00.000Z","business":"Company","name":"Money Market Account 1010","type":"withdrawal","account":"123123123"}]}'

temp_file <- tempfile("dummy.json")
writeLines(json_string, temp_file)
# use jsonlite
jsonlite::stream_in(file(temp_file, "r"))
#> 
 Found 5 records...
 Imported 5 records. Simplifying...
#>    name
#> 1 Test1
#> 2 Test2
#> 3 Test3
#> 4 Test4
#> 5 Test5
#>                                                                                                                                                                                                                                                                   accountHistory
#> 1        107.62, 650.88, 878.63, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, CompanyA, CompanyF, CompanyG, Home Loan Account 6220, Checking Account 9001, Money Market Account 8743, payment, payment, deposit, 11111111, 123123123, 123123123
#> 2            199.29, 841.48, 116.55, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, CompanyB, Company, Company, Savings Account 3580, Home Loan Account 5988, Auto Loan Account 1794, invoice, payment, withdrawal, 12312312, 123123123, 12312313
#> 3 602.45, 377.08, 204.44, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, CompanyC, Company, Company, Personal Loan Account 9924, Investment Account 1002, Auto Loan Account 2452, deposit, withdrawal, withdrawal, 12312312, 123123123, 123123123
#> 4               31.19, 820.25, 582.19, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, CompanyD, Company, Company, Auto Loan Account 3893, Savings Account 6260, Auto Loan Account 9826, payment, payment, payment, 12312312, 123123123, 123123123
#> 5      340.93, 523.31, 153.38, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, 2012-02-02T06:00:00.000Z, CompanyE, Company, Company, Checking Account 5055, Savings Account 1680, Money Market Account 1010, withdrawal, deposit, withdrawal, 12312312, 123123123, 123123123
# or ndjson package directly
ndjson::stream_in(temp_file)
#> Source: local data table [5 x 19]
#> 
#> # A tibble: 5 x 19
#>   accountHistory.~ accountHistory.~ accountHistory.~ accountHistory.~
#>   <chr>            <chr>            <chr>            <chr>           
#> 1 11111111         107.62           CompanyA         2012-02-02T06:0~
#> 2 12312312         199.29           CompanyB         2012-02-02T06:0~
#> 3 12312312         602.45           CompanyC         2012-02-02T06:0~
#> 4 12312312         31.19            CompanyD         2012-02-02T06:0~
#> 5 12312312         340.93           CompanyE         2012-02-02T06:0~
#> # ... with 15 more variables: accountHistory.0.name <chr>,
#> #   accountHistory.0.type <chr>, accountHistory.1.account <chr>,
#> #   accountHistory.1.amount <chr>, accountHistory.1.business <chr>,
#> #   accountHistory.1.date <chr>, accountHistory.1.name <chr>,
#> #   accountHistory.1.type <chr>, accountHistory.2.account <chr>,
#> #   accountHistory.2.amount <chr>, accountHistory.2.business <chr>,
#> #   accountHistory.2.date <chr>, accountHistory.2.name <chr>,
#> #   accountHistory.2.type <chr>, name <chr>

^{Created on 2019-01-21 by the reprex package (v0.2.1)}

About ndjson see website of the https://hrbrmstr.github.io/ndjson/

nwerth · January 24, 2019, 4:59pm

You're correct about that. In my replacement code, types is a character vector. If a vectors values are all the same data type, it's generally better to store them as an "atomic vector" (e.g., numeric, character, logical). Lists are good for storing mixed types.

system · February 14, 2019, 4:59pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.