Help with splitting JSON column from dataframe

Hi there, first time posting.

I'm looking for some assistance with dealing with a dataframe which contains a JSON column (I hope I am describing it accurately as such, I'm fairly new to dealing with data of this kind). I have some data I have collected from a pilot experiment. One part asks participants several different questions about the experiment. All the questions and responses have then been saved in one single column in the dataframe called "response". The structure for one row is as follows:

{"Moral":"Should contribute money","behaviour":"I wanted to win", "Understanding":"5","Age":"62", "Gender":"Female", "Nationality":"White"}

What I would like, is to be able to split these into individual columns so that each question and response is placed in its own column. In other words, get 6 different columns for each of these responses.

I hope that this makes sense and is clear? Any help for this would be greatly appreciated. Here is a link to the data in question, which I imagine may make things clearer.

Have you taken a look at the jsonlite package? (link to quickstart, below)
https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
It can take key-value pairs and simplify them into a data frame.

No I haven't, thank you so much this seems really helpful!

1 Like

Many thanks for your reply.

Unfortunately, this doesn't seem to have solved my issue. Using fromJSON on specifically the "response" column provided this error

Error: parse error: trailing garbage

I tried this on a different dataframe, one which contains a column with only a single key value pair {Prestige:0} and it did not produce any error, however fromJSON simply converted this to a dataframe with a single column which maintained this format {Prestige:0}.

If you had any further suggestions, I would greatly appreciate it. For example, perhaps I was mistaken and this is not JSON data at all.

Many thanks

Can you provide your reproducible example with your single key value pair example? Having a working example in code is most useful.

You have to work with map because the result is nested columns, see this example:

library(tidyverse)
library(jsonlite)

url <- "https://drive.google.com/uc?export=download&id=11A_WH6xgPltGXUmwpBbIxUnGY9EkuCFp"
sample_data <- read_csv(url)

sample_data %>% 
    select_if(~!all(is.na(.))) %>% 
    mutate(response = map(response, fromJSON)) %>% 
    unnest_wider(col = response)
#> # A tibble: 8 x 14
#>      id creation_time       failed details type  participant_id number question
#>   <dbl> <dttm>              <lgl>  <chr>   <chr>          <dbl>  <dbl> <chr>   
#> 1     1 2019-11-15 17:52:09 FALSE  {}      ques…              4      1 questio…
#> 2     2 2019-11-15 17:52:12 FALSE  {}      ques…              1      1 questio…
#> 3     3 2019-11-15 17:53:47 FALSE  {}      ques…              5      1 questio…
#> 4     4 2019-11-15 17:54:46 FALSE  {}      ques…              3      1 questio…
#> 5     5 2019-11-15 18:07:43 FALSE  {}      ques…              7      1 questio…
#> 6     6 2019-11-15 18:08:25 FALSE  {}      ques…             10      1 questio…
#> 7     7 2019-11-15 18:08:40 FALSE  {}      ques…              8      1 questio…
#> 8     8 2019-11-15 18:09:26 FALSE  {}      ques…              6      1 questio…
#> # … with 6 more variables: moral <chr>, behaviour <chr>, understanding <chr>,
#> #   Age <chr>, Gender <chr>, Nationality <chr>

Created on 2020-01-24 by the reprex package (v0.3.0.9000)

Thank you so much for your reply and solution, that works perfectly for the response column. However I am still having trouble with the remainder of the data. The approach you provided does not work for my other property columns. Here is a reprex for the node data attempting to split property2 (which contains only {Prestige:0} or {Prestige:1}). Alsoproperty5 which is perhaps even more awkward in that the data is in the format:

"{"prestige_list": [5, 5, 8, 7, 5, 6], ....

My desired output in this case would be 6 columns labelled round_ 1 -> round_6 for each of the contents of prestige list, conform list and payoff list (so 18 columns total). I can then use the other functions from tidyverse to tidy this into long format data.

In any case, both functions fail on the third line providing the error:

Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed

library(tidyverse)
library(jsonlite)

url <- "https://drive.google.com/uc?export=download&id=139G1oa5kQOMNgb9gKELpa0H0rgt0AE2q"
node <- read_csv(url)

tidy_property2 <- node %>%
  select_if(~!all(is.na(.))) %>% 
  mutate(property2 = map(property2, fromJSON)) %>%
  unnest_wider(col = property2)

tidy_property5 <- node %>%
  select_if(~!all(is.na(.))) %>% 
  mutate(property5 = map(property5, fromJSON)) %>%
  unnest_wider(col = property5)


Once again, any advice would be greatly appreciated

You have NA rows, you have to get rid of them.

library(tidyverse)
library(jsonlite)

url <- "https://drive.google.com/uc?export=download&id=139G1oa5kQOMNgb9gKELpa0H0rgt0AE2q"
node <- read_csv(url)

node %>%
    drop_na() %>% 
    mutate(property2 = map(property2, fromJSON)) %>%
    unnest_wider(col = property2)
#> # A tibble: 4 x 13
#>      id creation_time       property1 prestige property3 property4 property5
#>   <dbl> <dttm>              <chr>        <int> <chr>     <chr>     <chr>    
#> 1    16 2019-11-15 18:04:19 "{\"scor…        0 "{\"scor… "{\"left… "{\"pres…
#> 2     8 2019-11-15 17:34:27 "{\"scor…        0 "{\"scor… "{\"left… "{\"pres…
#> 3    18 2019-11-15 18:08:24 "{\"scor…        0 "{\"scor… "{\"left… "{\"pres…
#> 4    19 2019-11-15 18:09:03 "{\"scor…        0 "{\"scor… "{\"left… "{\"pres…
#> # … with 6 more variables: failed <lgl>, time_of_death <dttm>, details <chr>,
#> #   type <chr>, network_id <dbl>, participant_id <dbl>

node %>%
    drop_na() %>%
    mutate(property5 = map(property5, fromJSON)) %>%
    unnest_wider(col = property5)
#> # A tibble: 4 x 12
#>      id creation_time       property1 property2 property3 property4 failed
#>   <dbl> <dttm>              <chr>     <chr>     <chr>     <chr>     <lgl> 
#> 1    16 2019-11-15 18:04:19 "{\"scor… "{\"pres… "{\"scor… "{\"left… TRUE  
#> 2     8 2019-11-15 17:34:27 "{\"scor… "{\"pres… "{\"scor… "{\"left… TRUE  
#> 3    18 2019-11-15 18:08:24 "{\"scor… "{\"pres… "{\"scor… "{\"left… TRUE  
#> 4    19 2019-11-15 18:09:03 "{\"scor… "{\"pres… "{\"scor… "{\"left… TRUE  
#> # … with 5 more variables: time_of_death <dttm>, details <chr>, type <chr>,
#> #   network_id <dbl>, participant_id <dbl>

Created on 2020-01-24 by the reprex package (v0.3.0.9000)

1 Like

Thank you so much! I feel a bit stupid with it being that simple. I've had to add a little extra code to select the columns I need to keep (running drop_na() to just drop all incomplete cases loses the useful data, ironically keeping the bits I don't need). But with a little extra wrangling, I've now been able to extract exactly what I need, so once again thank you so much for your help! I really apprecite it.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.