Flatten the data into rows

Rscotty · December 16, 2021, 6:43pm

Hi,

I would like to flatten the data. The dataset is in csv and the values are stored into one cell separated by comma. My expected output is to put the values into new row which are separated by commas.

Dataset info:
Column value contains information about the symptom of the patients. Each symptom are groped by "{ }" curly brackets. Within each symptom there could be more information stored inside "{ }" curly brackets.

Sample dataset:

library(tidyverse)

sample_data <- tibble::tribble(
  ~Doctor_ID,                                                                                                                                                                                                                                                        ~values,
        "A1", "[{Symptom=21, name=migraine headache, pain={low=2, mid=5, high=10}, dry_eye=[], character=[{id=164, nose_block = yes(very dry)}]}, {Symptom=99, name=back pain, pain={low=5, mid=7, high=10}, dry_eye={}, character=[{id=164, nose_block = yes(very dry)}]}]")

Expected output:

expected_output <- tibble::tribble(
                     ~Doctor_ID,                                          ~Symptoms,
                           "A1",                                       "Symptom=21",
                           "A1",                           "name=migraine headache",
                           "A1",                     "pain={low=2, mid=5, high=10}",
                           "A1",                                       "dry_eye=[]",
                           "A1", "character=[{id=164, nose_block = yes(very dry)}]",
                           "A1",                                       "Symptom=99",
                           "A1",                                   "name=back pain",
                           "A1",         "pain={low=5, mid=7, high=10}, dry_eye={}",
                           "A1", "character=[{id=164, nose_block = yes(very dry)}]"
                     )

Thank you

williaml · December 17, 2021, 5:37am

This is sort of close to what you're after:

sample_data %>% 
  mutate(symptoms = str_split(values, ",")) %>% 
  select(-values) %>% 
  unnest() %>% 
  mutate(symptoms = str_trim(symptoms),
         symptoms = str_replace_all(symptoms, "\\[\\{", ""),
         symptoms = str_replace_all(symptoms, "\\}\\]\\}", "")) 

# A tibble: 16 × 2
   Doctor_ID symptoms                   
   <chr>     <chr>                      
 1 A1        Symptom=21                 
 2 A1        name=migraine headache     
 3 A1        pain={low=2                
 4 A1        mid=5                      
 5 A1        high=10}                   
 6 A1        dry_eye=[]                 
 7 A1        character=id=164           
 8 A1        nose_block = yes(very dry) 
 9 A1        {Symptom=99                
10 A1        name=back pain             
11 A1        pain={low=5                
12 A1        mid=7                      
13 A1        high=10}                   
14 A1        dry_eye={}                 
15 A1        character=id=164           
16 A1        nose_block = yes(very dry)]

technocrat · December 17, 2021, 5:45am

I'd do it slightly differently from @williaml

library(stringr)
sample_data <- tibble::tribble(
  ~Doctor_ID, ~values,
  "A1", "[{Symptom=21, name=migraine headache, pain={low=2, mid=5, high=10}, dry_eye=[], character=[{id=164, nose_block = yes(very dry)}]}, {Symptom=99, name=back pain, pain={low=5, mid=7, high=10}, dry_eye={}, character=[{id=164, nose_block = yes(very dry)}]}]"
)

a <- sample_data$values
s <- strsplit(a,",")[[1]]
d <- rep("A1",length(s))
result <- data.frame(Doctor_ID = d, values = s)
result
#>    Doctor_ID                          values
#> 1         A1                    [{Symptom=21
#> 2         A1          name=migraine headache
#> 3         A1                     pain={low=2
#> 4         A1                           mid=5
#> 5         A1                        high=10}
#> 6         A1                      dry_eye=[]
#> 7         A1              character=[{id=164
#> 8         A1   nose_block = yes(very dry)}]}
#> 9         A1                     {Symptom=99
#> 10        A1                  name=back pain
#> 11        A1                     pain={low=5
#> 12        A1                           mid=7
#> 13        A1                        high=10}
#> 14        A1                      dry_eye={}
#> 15        A1              character=[{id=164
#> 16        A1  nose_block = yes(very dry)}]}]

except that I wouldn't do it at all without understanding the motivation for the transformation. It makes more sense to parse the value field first to split by symptom into separate object and then to parse each of the name, pain, dry_eye, id, etc subfields to end up with a separate variable for each.

Rscotty · December 17, 2021, 10:47am

@williaml and @technocrat my expected output is different and here is the reason.

Splitting into new rows by comma delimiter will not help because the data will not make sense.
Questionnaire related to one symptom are enclosed within the first curly brackets " { } ".
This is questionnaire related to Symptom = 21

{Symptom=21, name=migraine headache, pain={low=2, mid=5, high=10}, dry_eye=[], character=[{id=164, nose_block = yes(very dry)}]}

Within the first curly brackets there are questionnaire with multiple option enclosed by curly brackets " { } ".
Therefore, an output like this will not make sense when working with big dataset.
Row 3 to 5 - Multiple answer for pain are split into row there it will not be usable.

result <- data.frame(Doctor_ID = d, values = s)
result
#>    Doctor_ID                          values
#> 1         A1                    [{Symptom=21
#> 2         A1          name=migraine headache
#> 3         A1                     pain={low=2
#> 4         A1                           mid=5
#> 5         A1                        high=10}
#> 6         A1                      dry_eye=[]
#> 7         A1              character=[{id=164
#> 8         A1   nose_block = yes(very dry)}]}
#> 9         A1                     {Symptom=99
#> 10        A1                  name=back pain
#> 11        A1                     pain={low=5
#> 12        A1                           mid=7
#> 13        A1                        high=10}
#> 14        A1                      dry_eye={}
#> 15        A1              character=[{id=164
#> 16        A1  nose_block = yes(very dry)}]}]

My naïve approach was to have questionnaire with multiple options in the same row. Hence, this was my request. I was unable to get this result
Example from my output - third row for pain is in the same row

expected_output <- tibble::tribble(
                     ~Doctor_ID,                                          ~Symptoms,
                           "A1",                                       "Symptom=21",
                           "A1",                           "name=migraine headache",
                           "A1",                     "pain={low=2, mid=5, high=10}",
                           "A1",                                       "dry_eye=[]",
                           "A1", "character=[{id=164, nose_block = yes(very dry)}]",
                           "A1",                                       "Symptom=99",
                           "A1",                                   "name=back pain",
                           "A1",         "pain={low=5, mid=7, high=10}, dry_eye={}",
                           "A1", "character=[{id=164, nose_block = yes(very dry)}]"
                     )

So my question is
Q1. how do I split the rows in such a way that questions with multiple options are in the same row. E.g. "pain={low=2, mid=5, high=10}".
Q2. @technocrat I didn't get your comment. Do you have any reference or example for this

It makes more sense to parse the value field first to split by symptom into separate object and then to parse each of the name, pain, dry_eye, id , etc subfields to end up with a separate variable for each.

williaml · December 17, 2021, 10:31pm

Maybe something like this then. It works for this, but might not work with real data:

# count for row num below
count_symptoms <- str_count(sample_data$values, "Symptom")


# for join
symptom_row_names <- tibble(row_num = 1:8,
                            symptom_group = c("symptom num",
                                              "symptom name",
                                              rep("pain", 3),
                                              "dry_eye",
                                              "character",
                                              "nose_block"))

sample_data %>% 
  mutate(symptoms = str_split(values, ",")) %>% 
  select(-values) %>% 
  unnest(cols = c(symptoms)) %>% 
  mutate(symptoms = str_trim(symptoms),
         symptoms = str_replace_all(symptoms, "\\[\\{", ""),
         symptoms = str_replace_all(symptoms, "\\}\\]\\}", ""),
         row_num = rep(1:8, count_symptoms),
         symptom_group_num = rep(1:count_symptoms, each = length(symptoms)/count_symptoms)) %>% 
  left_join(symptom_row_names, by = "row_num") %>% 
  group_by(Doctor_ID, symptom_group, symptom_group_num) %>% 
  summarise(symptoms2 = paste(symptoms, collapse = ",")) %>% 
  arrange(Doctor_ID, symptom_group_num, symptoms2)

# A tibble: 12 x 4
# Groups:   Doctor_ID, symptom_group [6]
   Doctor_ID symptom_group symptom_group_num symptoms2                  
   <chr>     <chr>                     <int> <chr>                      
 1 A1        character                     1 character=id=164           
 2 A1        dry_eye                       1 dry_eye=[]                 
 3 A1        symptom name                  1 name=migraine headache     
 4 A1        nose_block                    1 nose_block = yes(very dry) 
 5 A1        pain                          1 pain={low=2,mid=5,high=10} 
 6 A1        symptom num                   1 Symptom=21                 
 7 A1        symptom num                   2 {Symptom=99                
 8 A1        character                     2 character=id=164           
 9 A1        dry_eye                       2 dry_eye={}                 
10 A1        symptom name                  2 name=back pain             
11 A1        nose_block                    2 nose_block = yes(very dry)]
12 A1        pain                          2 pain={low=5,mid=7,high=10}

technocrat · December 17, 2021, 10:47pm

Looks like @williaml and I both took your question too literally.

What the aim should be is a tidy data frame with one observation per row of one or more variables. Again, I'm guessing at your intentions.

refactored <- data.frame(dr = character(), sympton = integer(), name = character(), pain_low = integer(), pain_mid = integer(), pain_high = integer(), id = integer(),nose_block = integer(), severity = character())
refactored
#> [1] dr         sympton    name       pain_low   pain_mid   pain_high  id        
#> [8] nose_block severity  
#> <0 rows> (or 0-length row.names)
str(refactored)
#> 'data.frame':    0 obs. of  9 variables:
#>  $ dr        : chr 
#>  $ sympton   : int 
#>  $ name      : chr 
#>  $ pain_low  : int 
#>  $ pain_mid  : int 
#>  $ pain_high : int 
#>  $ id        : int 
#>  $ nose_block: int 
#>  $ severity  : chr

populating the refactored data frame along this schema is going to require splitting out the two Symptom strings on "}]}" and then within each of those, teasing out the respective fields. Depending on how many records, I'd approach this differently. Up to 100 or so, I'd pick the easy cases to do programmatically and for the rest do them by hand. If I had 1000, I'd definitely write a parser in bison/flex or maybe a bash script with a combination of perl, sed and awk.

nirgrahamuk · December 18, 2021, 6:53pm

my willingness to butcher the values by removing brackets and whitespaces allowed me to convert the example values to actual JSON

library(tidyverse)

sample_data <- tibble::tribble(
  ~Doctor_ID, ~values,
  "A1", "[{Symptom=21, name=migraine headache, pain={low=2, mid=5, high=10}, dry_eye=[], character=[{id=164, nose_block = yes(very dry)}]}, {Symptom=99, name=back pain, pain={low=5, mid=7, high=10}, dry_eye={}, character=[{id=164, nose_block = yes(very dry)}]}]"
)
sample_data$values <- gsub(pattern = "\\(", replacement = "", x = sample_data$values)
sample_data$values <- gsub(pattern = "\\)", replacement = "", x = sample_data$values)
sample_data$values <- gsub(pattern = " ", replacement = "", x = sample_data$values)
sample_data$values <- gsub(pattern = "=", replacement = ":", x = sample_data$values)


sample_data$values <- str_replace_all(sample_data$values,
                                      "\\w+:", function(x) 
                                        paste0('"', substr(x, 
                                                           1, nchar(x) - 1), '":'))
sample_data$values <- str_replace_all(sample_data$values, 
                                      "\\:\\w+", function(x) 
                                        paste0(':"', substr(x, 
                                                            2, nchar(x)), '"'))

library(jsonlite)
fromJSON(sample_data$values)

  Symptom             name pain.low pain.mid pain.high dry_eye       character
1      21 migraineheadache        2        5        10    NULL 164, yesverydry
2      99         backpain        5        7        10    NULL 164, yesverydry

system · December 29, 2021, 11:11am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.