Unnest data from columns

Hi

My problem: I'd like to unnest values from the column into readable data frame.

  • Since I am not familiar with the nested format, I am open to the community expert advice on best way to work with these types of data. I'd also appreciate if you can explain your approach and share any useful resources.
  • I will be using the output for exploratory analysis than develop insights and reporting.
  • I was thinking each unnest values into separate rows but that might be too much since I have thousands of rows. What do you recommend?

About the dataset:

  • The original file is in CSV format with 1000+ rows.
  • The data is stored in nested format in some columns.
  • Column symptom_reasons_for_visit is enclosed with [] but patient_running is not even though both have nested values.
  • The data is possibly collected as survey where consultation_Id represents the person who conducted the survey.

Sample dataset:

df <- tibble::tribble(
       ~consultation_id, ~consultation_week_start,                                                                                                                                                                                                                                                                                                                           ~symptom_reasons_for_visit, ~why_visit,                              ~medications,                                                                                                                                                                             ~global_questions, ~customer_age_range, ~patient_sex,                                                   ~patient_drinking,                                                                                                              ~patient_running,
                  1007L,             "01/11/2021", "[{symp_code=21, time_long={number=null, time_frame=null}, first_occurrence={number=2.000000000, time_frame=TTTT}, time_of_day=[], exacerbating_factors=[{id=31, name=carrying}], relieving_factors=[{id=007, name=headache (migraine)}], trigger_events=[{id=62, name=food and diet (intense exercise)}, {id=98, name=heavy headache (lifting)}]}]",       "[]", "[{id=74770, name=Panadol, problems=[]}]", "[{id=9, type=radio, title=Are you or could you be injured? , answers=[{id=99, name=Not pregnant }]}, {id=876, type=radio, title=Are you currently in pain? , answers=[{id=1807, name=No}]}]",             "20-29",     "FEMALE", "{frequency=RARE, drinks={number=5.000000000, time_frame=MONTHLY}}", "{running_per_day=5, time_long={number=2.000000000, time_frame=DTFY}, status=ON, quit={number=5.000000000, time_frame=DTFY}}"
       )

Thank you so much♥

This illustrates the problem more clearly

DF <- tibble::tribble(
  ~consultation_id, ~consultation_week_start, ~symptom_reasons_for_visit, ~why_visit, ~medications, ~global_questions, ~customer_age_range, ~patient_sex, ~patient_drinking, ~patient_running,
  1007L, "01/11/2021", "[{symp_code=21, time_long={number=null, time_frame=null}, first_occurrence={number=2.000000000, time_frame=TTTT}, time_of_day=[], exacerbating_factors=[{id=31, name=carrying}], relieving_factors=[{id=007, name=headache (migraine)}], trigger_events=[{id=62, name=food and diet (intense exercise)}, {id=98, name=heavy headache (lifting)}]}]", "[]", "[{id=74770, name=Panadol, problems=[]}]", "[{id=9, type=radio, title=Are you or could you be injured? , answers=[{id=99, name=Not pregnant }]}, {id=876, type=radio, title=Are you currently in pain? , answers=[{id=1807, name=No}]}]", "20-29", "FEMALE", "{frequency=RARE, drinks={number=5.000000000, time_frame=MONTHLY}}", "{running_per_day=5, time_long={number=2.000000000, time_frame=DTFY}, status=ON, quit={number=5.000000000, time_frame=DTFY}}"
)

t(DF)
#>                           [,1]                                                                                                                                                                                                                                                                                                                                                
#> consultation_id           "1007"                                                                                                                                                                                                                                                                                                                                              
#> consultation_week_start   "01/11/2021"                                                                                                                                                                                                                                                                                                                                        
#> symptom_reasons_for_visit "[{symp_code=21, time_long={number=null, time_frame=null}, first_occurrence={number=2.000000000, time_frame=TTTT}, time_of_day=[], exacerbating_factors=[{id=31, name=carrying}], relieving_factors=[{id=007, name=headache (migraine)}], trigger_events=[{id=62, name=food and diet (intense exercise)}, {id=98, name=heavy headache (lifting)}]}]"
#> why_visit                 "[]"                                                                                                                                                                                                                                                                                                                                                
#> medications               "[{id=74770, name=Panadol, problems=[]}]"                                                                                                                                                                                                                                                                                                           
#> global_questions          "[{id=9, type=radio, title=Are you or could you be injured? , answers=[{id=99, name=Not pregnant }]}, {id=876, type=radio, title=Are you currently in pain? , answers=[{id=1807, name=No}]}]"                                                                                                                                                       
#> customer_age_range        "20-29"                                                                                                                                                                                                                                                                                                                                             
#> patient_sex               "FEMALE"                                                                                                                                                                                                                                                                                                                                            
#> patient_drinking          "{frequency=RARE, drinks={number=5.000000000, time_frame=MONTHLY}}"                                                                                                                                                                                                                                                                                 
#> patient_running           "{running_per_day=5, time_long={number=2.000000000, time_frame=DTFY}, status=ON, quit={number=5.000000000, time_frame=DTFY}}"

All but the first variables are character (first is quoted solely due to conversion to matrix for purposes of display). It looks as if the data passed through json formatting at one point, but I don't think it would be practicable to backcast it into json and then import using jsonlite.

Absent that, it is a problem in string parsing to extract the desired information from each field into new variables. Before proceeding to how to do that it would be best to create a target data layout. For example

DF[3]
# A tibble: 1 × 1
  symptom_reasons_for_visit                                                                     
  <chr>                                                                                         
[{symp_code=21, time_long={number=null, time_frame=null}, first_occurrence={number=2.00000000

could be

int symp_code
duration how_long # lubridate duration object
int first_occ # unit is unlcear

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.