Parsing a JSON file with rjson:: and as.data.frame()

I'm dealing with data that is, well, heavy Unicode. That is, both the column names and the data use characters that are not always Roman, are written in both directions, and sometimes carry more than one diacritical mark on a letter. It's a fake class of 35 students who each have names in a couple dozen of the world's languages. My sort of "research" question is: if you have such data, what are the best ways to represent it (CSV, JSON, sas7bdat, Excel, &c.)?

So I'm looking at the JSON file, which reads in with an error with jsonlite. I'm also trying rjson, but have run into a problem. According to everything I've read, the correct way to process the file is to read it into an object, then call as.data.frame() on that object, to wit:

raw_JSON <- rjson::fromJSON(file = "World Class.json")
rjson_fromJSON <- as.data.frame(raw_JSON)

However, this results in a data frame with a single row and 1400 columns. The JSON file looks fine and reads into other readers fine. After reading into R, raw_JSON appears to have read everything correctly, but I've only verified that by looking at it in RStudio's Environment pane, not programmatically.
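For a programmatic check rather than eyeballing the Environment pane, a sketch along these lines might help. It uses a small hand-built list (`raw_list`, a hypothetical stand-in) shaped the way an array of JSON objects is assumed to come back from rjson: one unnamed element per record, each a named list of fields. The real `raw_JSON` would go in its place.

```r
# Hypothetical stand-in for a parsed, row-oriented JSON array:
# one list element per student, each a named list of fields.
raw_list <- list(
  list(age = 13, sex = "F", name = "Alice"),
  list(age = 15, sex = "F", name = "Amy"),
  list(age = 14, sex = "M", name = "Bo")
)

n_records <- length(raw_list)           # number of top-level elements
n_fields  <- unique(lengths(raw_list))  # fields per element (one value if uniform)

# Do all records carry the same field names, in the same order?
same_keys <- all(vapply(
  raw_list,
  function(rec) identical(names(rec), names(raw_list[[1]])),
  logical(1)
))

str(raw_list, max.level = 1)            # compact view of the top level only

# as.data.frame() treats every top-level element as a set of columns,
# so the records end up side by side in a single row.
smushed <- as.data.frame(raw_list)
dim(smushed)                            # 1 row, n_records * n_fields columns
```

If the number of columns in the smushed result equals records times fields, that points at the shape of the list rather than at its contents.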

I'm at a loss as to what the problem could be. I have asked over at StackOverflow, but the answer I get is "just use jsonlite". I'm not so much looking for a way to read the file in as I am trying to figure out what's going on here. For example, is as.data.frame() not parsing the lists correctly, maybe because of the wild Unicode? I don't think that's correct, but I'm grasping at straws here.

The original data can be found here.

I’ll take a look. In the meantime, you do have UTF-8 set as your default encoding, right?

Hmm. I thought my version of R and RStudio used UTF-8 by default.

As advised by an R search, I went to Global Options > Code > Saving, and changed Default Text Encoding from [Ask] to UTF-8. It doesn't seem to have made a difference.

Just checking that the default hadn’t been overridden.

Just FYI, I get the same behaviors using RJSONIO::fromJSON(). Seems to read fine into R, but it won’t parse in as.data.frame().

When I download it and view it in Notepad++ I see the encoding mentioned as UTF-8-BOM.
When I change the encoding in Notepad++ to UTF8 and save it, I can read it with:

b <- jsonlite::fromJSON("D:/downloads/World Class UTF8.json")  

str(head(b,2)) 
'data.frame':	2 obs. of  35 variables:
 $ age               : int  13 15
 $ sex               : chr  "F" "F"
 $ height (in.)      : int  61 64
 $ weight (lb.)      : int  107 112
 $ height (cm.)      : num  155 163
 $ weight (kg.)      : num  48.6 50.9
 $ Shqip             : chr  "Blerta" "Cyme"
 $ Euskera           : chr  "Ahuña" "Adoniñe"
 $ 中文 (Simplified) : chr  "蔼" "安"
 $ 中文 (Traditional): chr  "佩佩" "姗姗"
 $ Hrvatska          : chr  "Ana" "Ema"
 $ Dansk             : chr  "Anna" "Anne"
 $ English (GB)      : chr  "Aimee" "Amy"
 $ English (US)      : chr  "Alice" "Amy"
 $ فارسى             : chr  "آیسا" "افسانه"
 $ Suomi             : chr  "Aino" "Anneli"
 $ Français          : chr  "Adélaïde" "Adèle"
 $ Gailge            : chr  "Áine" "Bébhinn"
 $ Deutsch           : chr  "Anna" "Antonia"
 $ Ελληνικά          : chr  "Αβηιχα" "Αβροξενα"
 $ Magyar            : chr  "Apollónia" "Barbara"
 $ Íslenska          : chr  "Anna" "Ásta"
 $ India             : chr  "Aanya" "Aaradhya"
 $ Italiano          : chr  "Angelica" "Arianna"
 $ 日本語            : chr  "葵" "亜美"
 $ 한국어            : chr  "영숙" "선영"
 $ Македонија        : chr  "Александра" "Анастасија"
 $ Norsk             : chr  "Berit" "Bjørg"
 $ Português         : chr  "Angélica" "Bárbara"
 $ Русский язык      : chr  "Аделаида" "Анна"
 $ Srbija            : chr  "Aleksandra" "Ana"
 $ Slovenščina       : chr  "Alojzija" "Amalija"
 $ Español           : chr  "Alma" "Angélica"
 $ Svenska           : chr  "Anita" "Anna"
 $ Tiếng Việt        : chr  "Anh Đào" "Anh Thư"
> 

…but you’re using jsonlite::. I very much appreciate your effort, but I’m not just trying to read the file, I’m trying to understand why a well-formed, simply structured list of lists isn’t able to be parsed by as.data.frame(), the method widely recommended.

jsonlite will read it with the BOM, it just gives a warning.

That should be an error. The BOM's invisible bytes are undetectable by the human eye, but linters and jsonlite see them clearly.

As @HanOostdijk saw with Notepad++, changing the encoding converts the file into the plain UTF-8, without the extra leading bytes, that JSON parsers assume they are dealing with.

8. String and Character Issues

8.1. Character Encoding

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

RFC 7159.

Yes, the guy who writes jsonlite:: and I were in contact a few hours ago, and he pointed me to that very section. I didn’t bring it up because I didn’t think it relevant: all three json packages I use read the raw json file in just fine, I just couldn’t get them into a data frame, hence my question.

Implementations should all be permissive or all be strict, or this is how we end up with dashed expectations.

I'm sorry that I didn't include more information about the BOM in my original question. After talking to the jsonlite package author, I opened the file in BBEdit, changed the encoding from UTF-8 with BOM to UTF-8 and tried creating the data frame with the new, non-BOM file, which gave me the same results. That made me assume that the BOM wasn't at issue with this problem. I'll admit that I didn't use a hex editor to remove the BOM from the file and save it, I just trusted that BBEdit would take care of that.
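Short of a hex editor, the first three bytes of a file can be checked from R itself. The sketch below assumes the UTF-8 byte order mark is the byte sequence EF BB BF; `has_utf8_bom()` is a hypothetical helper name, and the two temp files are made up for the demonstration.

```r
# The UTF-8 byte order mark as raw bytes.
utf8_bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, utf8_bom)
}

# Build two small demo files, one with a BOM and one without.
json_bytes  <- charToRaw('[{"age":13}]')
with_bom    <- tempfile(fileext = ".json")
without_bom <- tempfile(fileext = ".json")
writeBin(c(utf8_bom, json_bytes), with_bom)
writeBin(json_bytes, without_bom)

has_utf8_bom(with_bom)
has_utf8_bom(without_bom)

# Base R connections accept encoding = "UTF-8-BOM", which drops the mark
# while reading, so the text can be handed to a parser without it.
con <- file(with_bom, encoding = "UTF-8-BOM")
text_no_bom <- readLines(con, warn = FALSE)
close(con)
```

Running `has_utf8_bom()` on the file saved from the editor would confirm whether the re-save really removed the mark.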

I'm still not convinced that the BOM is the reason that as.data.frame() isn't working as it should. The lists that rjson produces are valid R lists and not too complicated, so I'm not sure what's going on.

I didn't create this solution out of whole cloth: if you Google "read json with rjson", the top two results give the same method I used: Read the file into an object, then call as.data.frame() to convert it to a data frame.

I used a different conversion, dos2unix, which converts Windows Unicode (UTF-16). That means stripping the preceding bytes, a process which appears to involve much subtlety.

implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error

leads me to believe that when an implementation encounters the trouble it can ignore the error, fix it, maybe emit a warning, and move on. But, it can also decide on a hard stop.
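The two policies can be sketched in a few lines. This is only an illustration of the choice the RFC leaves open, not how jsonlite, rjson, or RJSONIO actually behave; `read_json_text()` and its `strict` argument are hypothetical names.

```r
utf8_bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Hypothetical reader: returns the file's JSON text as one string.
# strict = TRUE  -> a leading BOM is a hard stop
# strict = FALSE -> warn, drop the three bytes, and carry on
read_json_text <- function(path, strict = FALSE) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  starts_with_bom <- length(bytes) >= 3L && identical(bytes[1:3], utf8_bom)
  if (starts_with_bom) {
    if (strict) stop("JSON text begins with a byte order mark")
    warning("ignoring byte order mark at start of JSON text")
    bytes <- bytes[-(1:3)]
  }
  txt <- rawToChar(bytes)
  Encoding(txt) <- "UTF-8"  # mark the string as UTF-8
  txt
}

# Demo file with a BOM in front of a tiny JSON object.
path <- tempfile(fileext = ".json")
writeBin(c(utf8_bom, charToRaw('{"ok":true}')), path)

permissive    <- suppressWarnings(read_json_text(path))
strict_result <- tryCatch(
  read_json_text(path, strict = TRUE),
  error = function(e) "rejected"
)
```

The same file gives usable text under one policy and an error under the other, which is the mismatch in expectations described above.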

Politely, and with thanks, could we move from talking about the BOM and move on to why the as.data.frame() call does not work? I examined the byte order mark quite carefully before posing my question.

Could this be a start:

The advertised approach made sense to me, since I saw the following in ?as.data.frame():

“ If a list is supplied, each element is converted to a column in the data frame.”

…which is what I want to do. The json is imported as a list (as far as I can tell) but it appears that it contains rows, not columns. That is, the first element of the list is the info for student 1: a few numeric variables, then the name written in many different languages. Next comes student 2, and so on.

But: this is how JSON for a tidy table would always be rendered. The websites are somehow wrong in recommending as.data.frame(), which works in some situations, but not others?
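A small base R sketch of the two shapes may make the difference concrete. The lists here are hand-built stand-ins for parsed JSON, and `rows_to_df()` is a hypothetical helper, one of several possible ways to stack records. It assumes every record has the same fields and that R is version 4.0 or later (so strings stay character).

```r
# Column-oriented: one element per variable. This is the shape that
# as.data.frame() expects, and it converts directly.
by_column <- list(`height (in.)` = c(61, 64), name = c("Alice", "Amy"))
df_cols <- as.data.frame(by_column, optional = TRUE)

# Row-oriented: one element per record, as in a JSON array of objects.
by_row <- list(
  list(`height (in.)` = 61, name = "Alice"),
  list(`height (in.)` = 64, name = "Amy")
)

# Hypothetical helper: turn each record into a one-row data frame,
# then stack the rows. optional = TRUE keeps column names such as
# "height (in.)" as they are instead of making them syntactic.
rows_to_df <- function(rows) {
  frames <- lapply(rows, as.data.frame, optional = TRUE)
  do.call(rbind, frames)
}

df_rows <- rows_to_df(by_row)
dim(df_cols)
dim(df_rows)
```

Both results have one row per student; only the second list needed reshaping first.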

Here is a little investigation from moi.

library(jsonlite)
library(rjson)

# a hand crafted input 
myjson <- '{"row1": [{"colval1":3,"colval2":4}],
            "row2": [{"colval1":4,"colval2":3}]}'

json_by_row <- jsonlite::fromJSON(myjson)
# smushed
as.data.frame(json_by_row)

#why ?
str(json_by_row)

#oh, it's a list of data frames; so let's vertically stack those frames
#unsmushed
(mydf <- do.call(rbind,json_by_row)) 
#note the implicit rownames

(my_df_better <- tibble::rownames_to_column(mydf))

# let's write it to json to compare
(j_lite <- jsonlite::toJSON(my_df_better))
(j_rjsn <- rjson::toJSON(my_df_better))

# we have two representations ... let's read them in by the two methods 
# this makes 4 combinations ...
(from_j_lite_lite <- jsonlite::fromJSON(j_lite))
(from_j_lite_rjsn <- jsonlite::fromJSON(j_rjsn))
(from_j_rjsn_lite <- rjson::fromJSON(j_lite))
(from_j_rjsn_rjsn <- rjson::fromJSON(j_rjsn))

# can we again make each of these data.frames ? 
#1
from_j_lite_lite # already is
#2
as.data.frame(from_j_lite_rjsn) # easy
#3 
as.data.frame(from_j_rjsn_lite) # smushed so ...
as.data.frame(do.call(rbind,from_j_rjsn_lite)) # unsmushed
#4
as.data.frame(from_j_rjsn_rjsn) # easy

Very nicely done! Très bien!

I am sure you knew this (but I didn't until investigating your answer): There is a data frame version of rbind(), rbind.data.frame(). I can't imagine preferring one over the other, but

do.call(rbind.data.frame,from_j_rjsn_lite) # unsmushed

works identically to your

as.data.frame(do.call(rbind,from_j_rjsn_lite)) # unsmushed
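The two calls print alike, though the column types may differ underneath. A sketch with a hand-built list of records (a stand-in for what rjson returns, assuming named fields throughout) suggests that going through `rbind()` first yields list columns, while `rbind.data.frame()` yields ordinary atomic columns:

```r
# Stand-in for a row-oriented list of records.
rows <- list(
  list(age = 13, name = "Alice"),
  list(age = 15, name = "Amy")
)

# rbind() on plain lists builds a matrix whose cells are list elements;
# as.data.frame() then keeps each column as a list.
via_matrix <- as.data.frame(do.call(rbind, rows))

# rbind.data.frame() fills ordinary vectors column by column.
via_rbind_df <- do.call(rbind.data.frame, rows)

vapply(via_matrix, is.list, logical(1))    # which columns are lists?
vapply(via_rbind_df, is.list, logical(1))

# List columns can be flattened afterwards if atomic columns are wanted.
flattened <- as.data.frame(lapply(via_matrix, unlist))
str(flattened)
```

List columns can trip up later steps such as `mean()` or `write.csv()`, so it seems worth checking with `str()` whichever route is taken.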

Sorry for my confusion—encoding was where I got stuck and didn’t move on.
