Problem with json of different column sizes

luiandresgonzalez · June 1, 2021, 8:53pm

I'm trying to access an API, which retrieves some data and stores it in a data frame. The following code should be fully reproducible.

require("httr")
require("jsonlite")
require("tidyverse")

vouches2 <- data.frame()

reproducible_list <- c("0x00d18ca9782be1caef611017c2fbc1a39779a57c", "0x105645ffea02c7c8feaa1a32c100f1a30766d6a9")

for(i in reproducible_list){
  theURL <- paste0("HTTPS://api.poh.dev/profiles/", i, "/vouches")
  r <- GET(theURL)
  message("Getting ", theURL)
  s <- content(r, as = "text", encoding = "UTF-8")
  message("DEBUG content(...) success")
  df <- as.data.frame(fromJSON(s,flatten = TRUE, simplifyDataFrame=FALSE))
  message(names(df))
  message("as.data.frame success")
  # 
  # df_filtered <- df %>%
  #   select(given.eth_address,given.status,given.display_name) %>%
  #   mutate(voucher = i) %>%
  #   mutate(voucher_name = data_filtered$display_name[data_filtered$id == i]) %>%
  #   filter(!is.na(voucher_name)) # removes those not in the list of frequent challengers
  message("DEBUG bind_rows")
  vouches2 <- bind_rows(vouches2, df) 
  message("DEBUG bind_rows DONE")
  Sys.sleep(0.5)
  
}

The first item in the list (0x00d18ca9782be1caef611017c2fbc1a39779a57c) goes well. The problem is that the second item in the list (value 0x105645ffea02c7c8feaa1a32c100f1a30766d6a9) shows this error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 0, 1

I suspect this happens because, for the second value, the "given" element in the JSON data is an empty array.

I'd like some general guidance on how to approach this issue. I believe it has something to do with the handling of empty elements, but I'm not entirely sure.

Thanks!

technocrat · June 1, 2021, 10:00pm

Where does this come from in the example?

luiandresgonzalez · June 1, 2021, 10:59pm

Hmm... I'm not sure I understand the question. That line is there to merge each new batch of data pulled from the API. I made an example with 2 elements, but the actual list has dozens more. I want the output to be a data frame with all the relevant information in each row. That line just takes the result of the previous iterations and appends the new rows to it. Maybe it is not the most "elegant" way to solve that, but it is the procedure I know. Suggestions are welcome!

technocrat · June 1, 2021, 11:04pm

Sorry, I overlooked that. Try pulling a few objects outside the loop (I got an HTTP 400 when I tried) to see exactly which one is returning zero rows.

luiandresgonzalez · June 1, 2021, 11:09pm

That's what I did, and the second element was the one that raised the differing number of rows error.

luiandresgonzalez · June 1, 2021, 11:33pm

This is the item that is generating issues. given is empty.

{"given":[],
 "received":[{"eth_address":"0xc81d370e13a248e55208b52e4a9db9fbd5e01b6b","status":"REGISTERED","vanity_id":4743,"display_name":"Ale","first_name":"Mirian","last_name":"Alejandra","registered":true,"photo":"https://ipfs.kleros.io/ipfs/QmeD8TCcFZ8idYhiesX8EuHdsW1CaXEMTLFQgoJjzCz3mT/20210507-173114.jpg-2.jpg","video":"https://ipfs.kleros.io/ipfs/QmThfU8LShbx5PAseE46mD7f3AuyX8Wcn6ztdAdTEjMVGJ/20210507-172812.mp4","bio":"Love my kids","profile":"https://app.proofofhumanity.id/profile/0xc81d370e13a248e55208b52e4a9db9fbd5e01b6b","registered_time":"2021-05-21T01:09:43.000Z","creation_time":"2021-05-17T12:18:07.000Z"}]}

I'd like to find a way to catch and handle these issues.
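One way to catch such failures, sketched here under assumptions: wrap the conversion in tryCatch() so that a record that cannot be converted is skipped instead of stopping the loop. The helper name safe_as_df and the placeholder lists ok and bad are hypothetical; they only mimic the shape of a response with a non-empty and an empty "given".

```r
# Hypothetical helper: convert a parsed response to a data frame and
# return NULL (with a message) instead of raising an error.
safe_as_df <- function(x) {
  tryCatch(
    as.data.frame(x),
    error = function(e) {
      message("Skipping record: ", conditionMessage(e))
      NULL
    }
  )
}

# Placeholder data, not real API output.
ok  <- list(given = "0xaaa", received = "0xbbb")
bad <- list(given = list(), received = "0xbbb")  # empty "given"

results  <- list(safe_as_df(ok), safe_as_df(bad))
results  <- Filter(Negate(is.null), results)     # drop the skipped records
combined <- do.call(rbind, results)
```

Skipping silently can hide records that only have "received" data, so this is probably most useful as a diagnostic step rather than a final fix.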

technocrat · June 2, 2021, 12:09am

OK, that makes it easier to see. But am I right that given is a variable (column), not a row?

luiandresgonzalez · June 2, 2021, 1:38am

I'm not very familiar with the hierarchical nature of this output. It seems to be a nested category containing other variables (something I would usually see as category.subcategory in a column). See below how the output of the "nice" item looks. More details are in the Swagger UI.

{
  "given": [
    {
      "eth_address": "0x9021346151cab1467982766e417377eaf8323aae",
      "status": "REGISTERED",
      "vanity_id": 4175,
      "display_name": "Katy Daza",
      "first_name": "Katy",
      "last_name": "Daza",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmbRDPVhXdi1PeQ9wAbWEQDAvUxr9quDRiHhsKTd6nmkG2/whatsapp-image-2021-05-04-at-4.44.10-pm.jpeg",
      "video": "https://ipfs.kleros.io/ipfs/QmNbKYXYhahPrHP6rcqbjf9fgjCqTVSpcb3RffiH6Hs7Jj/katy2.mp4",
      "bio": "Environmental Lawyer",
      "profile": "https://app.proofofhumanity.id/profile/0x9021346151cab1467982766e417377eaf8323aae",
      "registered_time": "2021-05-15T14:58:30.000Z",
      "creation_time": "2021-05-04T22:02:03.000Z"
    },
    {
      "eth_address": "0x6beca7fb81c1f7b3f91b212e6830d15fe7bf1012",
      "status": "REGISTERED",
      "vanity_id": 2647,
      "display_name": "CamiloTD",
      "first_name": "Juan Camilo",
      "last_name": "Torres Cepeda",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmQTPz6Z5jjCvPUY1KifEdy6PaXP2zDN4adm6GdH2bXk8C/1598481112063-1-.jfif",
      "video": "https://ipfs.kleros.io/ipfs/QmcejEjb1JSfpR3znNjd55SgLvLZByy8icZi19nsXqK1rM/whatsapp-video-2021-04-26-at-11.35.46-1-.mp4",
      "bio": "Blockchain developer & passionate researcher",
      "profile": "https://app.proofofhumanity.id/profile/0x6beca7fb81c1f7b3f91b212e6830d15fe7bf1012",
      "registered_time": "2021-04-30T08:05:10.000Z",
      "creation_time": "2021-04-21T18:16:51.000Z"
    },
    {
      "eth_address": "0xcc24fde84f1a18cb857f112eeea4a35192063663",
      "status": "REGISTERED",
      "vanity_id": 1548,
      "display_name": "Lauren",
      "first_name": "Lauren",
      "last_name": "Bajin",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmbjLEdaHK1AixCpzA1JMwCH83hMGVTMzRRcsKQLxng1a1/20210112-190216.jpg",
      "video": "https://ipfs.kleros.io/ipfs/Qma8iHKhAsdgQhhbqtLWN9xnBADha6diskR8gnmF2Hfdto/video-2021-04-09-15-09-44.mp4",
      "bio": "Blockchain dAbbler and movement enthusiast",
      "profile": "https://app.proofofhumanity.id/profile/0xcc24fde84f1a18cb857f112eeea4a35192063663",
      "registered_time": "2021-04-22T18:53:20.000Z",
      "creation_time": "2021-04-09T21:37:13.000Z"
    },
    {
      "eth_address": "0x317bbc1927be411cd05615d2ffdf8d320c6c4052",
      "status": "REGISTERED",
      "vanity_id": 2023,
      "display_name": "Carlos Quintero",
      "first_name": "Carlos",
      "last_name": "Quintero",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmeGoecmiJni67AEuNQFzSEHKP1cJngQdHqg3faC6TGWoP/proofofhumanityphoto.jpg",
      "video": "https://ipfs.kleros.io/ipfs/QmcyhkfTLtosQyjX79mH1b1duZojMjywajN6WEp41AnbNC/proofofhumanityvideo.mp4",
      "bio": "I am Software Engineer with great interest in the blockchain",
      "profile": "https://app.proofofhumanity.id/profile/0x317bbc1927be411cd05615d2ffdf8d320c6c4052",
      "registered_time": "2021-04-26T14:13:15.000Z",
      "creation_time": "2021-04-12T19:59:26.000Z"
    },
    {
      "eth_address": "0x7d547666209755fb833f9b37eebea38ebf513abb",
      "status": "REGISTERED",
      "vanity_id": 749,
      "display_name": "Juankbell",
      "first_name": "Juan Carlos",
      "last_name": "Bell Llinas",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmXWsMjBsAPRcm8zFLXHWg9WEcpGTW9KVnRGrHdTytNGSi/img-1207-2.jpg",
      "video": "https://ipfs.kleros.io/ipfs/QmPx1AaChXYB4ef44BeXKb76tXwLrtBfqKN6V2ynxyidDW/poh-juan-bell.m4v",
      "bio": "Political scientist, Mag. in Conflict Management. Ethereum Colombia.",
      "profile": "https://app.proofofhumanity.id/profile/0x7d547666209755fb833f9b37eebea38ebf513abb",
      "registered_time": "2021-04-14T17:50:46.000Z",
      "creation_time": "2021-04-05T21:49:07.000Z"
    }
  ],
  "received": [
    {
      "eth_address": "0xb20a327c9b4da091f454b1ce0e2e4dc5c128b5b4",
      "status": "REGISTERED",
      "vanity_id": 11,
      "display_name": "Merlin Egalite",
      "first_name": "Merlin",
      "last_name": "Egalite",
      "registered": true,
      "photo": "https://ipfs.kleros.io/ipfs/QmcsDzTCPyrDwBAbVVWLxqjmLhsHvuGu7xvc1oiM36cQBs/merlin.JPG",
      "video": "https://ipfs.kleros.io/ipfs/QmbjNPuD85SMfMW3ocUtwbgd1Zk5KExcXPDjj81VDaFwKv/merlin-egalite.mp4",
      "bio": "Smart Contract Hacker",
      "profile": "https://app.proofofhumanity.id/profile/0xb20a327c9b4da091f454b1ce0e2e4dc5c128b5b4",
      "registered_time": "2021-03-11T18:53:58.000Z",
      "creation_time": "2021-03-11T18:53:58.000Z"
    }
  ]
}

nirgrahamuk · June 2, 2021, 11:17am

Replace

  df <- as.data.frame(fromJSON(s,flatten = TRUE, simplifyDataFrame=FALSE))

with

  fj <- fromJSON(s,flatten = TRUE, simplifyDataFrame=FALSE)
  fj2 <- Filter(Negate(purrr::is_empty),fj)
  df <- as.data.frame(fj2)
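For reference, the same filtering step can probably be written in base R with lengths(), avoiding the purrr dependency. This is a minimal sketch; the list fj below is a hand-written stand-in for the parsed response, not real API output.

```r
# Stand-in for fromJSON(s, flatten = TRUE, simplifyDataFrame = FALSE)
# when "given" is an empty array.
fj <- list(
  given    = list(),
  received = list(list(status = "REGISTERED"))
)

# Keep only the elements that contain something.
fj2 <- fj[lengths(fj) > 0]
df  <- as.data.frame(fj2)
```

Both versions drop zero-length elements; the difference is only which package provides the test for emptiness.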

luiandresgonzalez · June 2, 2021, 7:42pm

This worked in the sense that it didn't generate any errors, but it created a super-wide data frame with 116 columns.

nirgrahamuk · June 2, 2021, 9:36pm

Seems like the next part of your journey is understanding your data source, and figuring out how to extract the useful/interesting parts.
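As a rough sketch of what that extraction could look like: with the default simplification in fromJSON, an array of objects becomes a data frame with one row per object, so "given" can be picked out directly and the wanted columns selected. The JSON strings, the helper name extract_given, and the voucher labels are made up for illustration; the column names are taken from the output posted above.

```r
library(jsonlite)

# Shortened, invented records in the same shape as the API output.
json_full  <- '{"given":[{"eth_address":"0xaaa","status":"REGISTERED","display_name":"A"},
                         {"eth_address":"0xbbb","status":"REGISTERED","display_name":"B"}],
                "received":[]}'
json_empty <- '{"given":[],
                "received":[{"eth_address":"0xccc","status":"REGISTERED","display_name":"C"}]}'

# Hypothetical helper: one row per vouch given, or NULL if there are none.
extract_given <- function(json, voucher) {
  parsed <- fromJSON(json)   # default simplification
  given  <- parsed$given
  # An empty array does not come back as a data frame, so guard for it.
  if (!is.data.frame(given) || nrow(given) == 0) return(NULL)
  given$voucher <- voucher
  given[, c("voucher", "eth_address", "status", "display_name")]
}

# rbind() ignores the NULL returned for the record with no vouches.
vouches <- do.call(rbind, list(
  extract_given(json_full,  "voucher_1"),
  extract_given(json_empty, "voucher_2")
))
```

This keeps the result long (one row per vouch) rather than wide, which is closer to what the commented-out select() in the original loop seems to expect.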

luiandresgonzalez · June 2, 2021, 10:45pm

Yes indeed, thank you. I did not intend to ask for the specific code that solves the issue, but rather for some general guidance on how to address it. Where should I focus? Where do you see that I'm misunderstanding the issue?
For example, the second line of the code you suggested adds an intermediate purrr step after parsing the JSON. Isn't there a parameter that would let me keep my original method, which works very well for other queries (with non-empty elements)?

technocrat · June 3, 2021, 6:18am

I've not been able to use any of the example objects to help further. See the FAQ: How to do a minimal reproducible example (reprex) for beginners. For a problem like this, what would help most is a dput() of a JSON file with two simple records, one that parses and one that doesn't.

The idea is to isolate where the problem arises: does the source have poorly formed JSON? If fromJSON parses it, then the difficulty is something later in processing.
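A minimal sketch of that isolation step, using a shortened, invented record in place of the real response: check that the text is well-formed JSON, parse it, and then look at the size of each top-level element before any conversion to a data frame.

```r
library(jsonlite)

# Invented record with the same top-level shape as the failing one.
s <- '{"given":[],"received":[{"status":"REGISTERED"}]}'

# Step 1: is the text well-formed JSON?
is_valid <- validate(s)

# Step 2: does it parse, and what does each top-level element contain?
parsed <- fromJSON(s, simplifyDataFrame = FALSE)
str(parsed, max.level = 1)
sizes <- lengths(parsed)   # number of entries under "given" and "received"
```

If validate() and fromJSON() both succeed and one of the sizes is zero, the JSON itself is fine and the failure comes from the later as.data.frame step.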

Less is more—just the minimum information to illustrate.

nirgrahamuk · June 3, 2021, 8:59am

The code I introduced only removes empty elements (such as NULL) from the list, which as.data.frame doesn't like to make columns out of. If your end result has more columns than you expect, that means some API calls return more non-empty lists than you bargained for.
I made a hopefully illustrative example for you.

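library(dplyr) # for bind_rows()

# A list whose elements are all non-empty
(abc1 <- list(a="a",b="b",c="c"))
(abc1_df <- as.data.frame(abc1))

# The same list, but element c is empty (NULL)
(abc2 <- list(a="a",b="b",c=NULL))
(abc2_df <- as.data.frame(abc2))

# Drop the empty elements before converting
(abc2_fixed <- Filter(Negate(purrr::is_empty), abc2))
(abc2_fixed_df <- as.data.frame(abc2_fixed))

bind_rows(abc1_df,
          abc2_fixed_df)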
