Pivot tables - two steps conversion

Hi,
I have this simple file:

data.frame(stringsAsFactors=FALSE,
         URN = c("10BE022001884", "10BE022001936", "10BE022102325",
                 "10BE023230730", "10BE022102325", "10BE022102346",
                 "10BE022995949", "10BE023335219", "10BE022102340", "10BE022102346",
                 "10BE022104737"),
    Category = c("Recommendation", "Recommendation", "Recommendation",
                 "Recommendation", "Staff", "Staff", "Staff", "Staff",
                 "Timeliness", "Timeliness", "Timeliness"),
       Sent1 = c("+", "+", "-", "0", "-", "+", "0", "0", "+", "-", "-"),
        Cat1 = c(29, 11, 7, 29, 16, 12, 24, 10, 18, 5, 11),
       Sent2 = c("+", "+", "-", "0", "-", "+", "0", "0", "+", "-", "-"),
        Cat2 = c(8, 7, 8, 7, 18, 7, 7, 29, 9, 9, 23),
       Sent3 = c(NA, "+", "-", NA, "-", "-", NA, NA, NA, NA, "-"),
        Cat3 = c(NA, 8, 9, NA, 18, 12, NA, NA, NA, NA, 32),
       Sent4 = c(NA, NA, NA, NA, "-", "-", NA, NA, NA, NA, "-"),
        Cat4 = c(NA, NA, NA, NA, 32, 32, NA, NA, NA, NA, 35),
       Sent5 = c(NA, NA, NA, NA, "-", "-", NA, NA, NA, NA, NA),
        Cat5 = c(NA, NA, NA, NA, 32, 9, NA, NA, NA, NA, NA)
)

My challenge is pivoting this table twice (or once if there is a clever way).

Step 1:
I need only one column called "Cat" (with numbers which are currently in "Cat1" to "Cat 5") and one column Called "Sent" (with +, - or 0). We should have 32 records with duplicated URNs as a result

Step 2:
I need to pivot the result from step one again having 3 separate data frames based on "Sent" (+, - and 0). In each of the data frames I need a table with 3 Categories (Recommendation, Staff, Timeliness) and "Cat" below each of them (numbers) for each URN. We should not have any duplicated URNs in each data frame.

I am sure there is an easy way of doing that but I'm lack of ideas...

Can you help?

Step 1 should look like that:

data.frame(stringsAsFactors=FALSE,
         URN = c("10BE022001884", "10BE022001884", "10BE022001936",
                 "10BE022001936", "10BE022001936", "10BE022102325",
                 "10BE022102325", "10BE022102325", "10BE023230730", "10BE023230730",
                 "10BE022102325", "10BE022102325", "10BE022102325", "10BE022102325",
                 "10BE022102325", "10BE022102346", "10BE022102346", "10BE022102346",
                 "10BE022102346", "10BE022102346", "10BE022995949",
                 "10BE022995949", "10BE023335219", "10BE023335219", "10BE022102340",
                 "10BE022102340", "10BE022102346", "10BE022102346", "10BE022104737",
                 "10BE022104737", "10BE022104737", "10BE022104737"),
    Category = c("Recommendation", "Recommendation", "Recommendation",
                 "Recommendation", "Recommendation", "Recommendation",
                 "Recommendation", "Recommendation", "Recommendation", "Recommendation",
                 "Staff", "Staff", "Staff", "Staff", "Staff", "Staff", "Staff",
                 "Staff", "Staff", "Staff", "Staff", "Staff", "Staff", "Staff",
                 "Timeliness", "Timeliness", "Timeliness", "Timeliness",
                 "Timeliness", "Timeliness", "Timeliness", "Timeliness"),
        Sent = c("+", "+", "+", "+", "+", "-", "-", "-", "0", "0", "-", "-",
                 "-", "-", "-", "+", "+", "-", "-", "-", "0", "0", "0", "0",
                 "+", "+", "-", "-", "-", "-", "-", "-"),
         Cat = c(29, 8, 11, 7, 8, 7, 8, 9, 29, 7, 16, 18, 18, 32, 32, 12, 7,
                 12, 32, 9, 24, 7, 10, 29, 18, 9, 5, 9, 11, 23, 32, 35)
)

I cannot really show Step 2 result but for df with Sent ="-" it should look like that:

data.frame(stringsAsFactors=FALSE,
             ...1 = c(NA, "10BE022102325", "10BE022102346", "10BE022104737"),
   Recommendation = c(7, 1, NA, NA),
             ...3 = c(8, 1, NA, NA),
             ...4 = c(9, 1, NA, NA),
            Staff = c(12, NA, 1, NA),
             ...6 = c(9, NA, 1, NA),
             ...7 = c(16, 1, NA, NA),
             ...8 = c(18, 2, NA, NA),
             ...9 = c(32, 2, 1, NA),
       Timeliness = c(5, NA, 1, NA),
            ...11 = c(9, NA, 1, NA),
            ...12 = c(11, NA, NA, 1),
            ...13 = c(23, NA, NA, 1),
            ...14 = c(32, NA, NA, 1),
            ...15 = c(35, NA, NA, 1)
)


So for each record meeting a criteria I should get 1, otherwise 0 or NA.
For example for URN= 10BE022102325 we should have 1s for Recommendation and Cat 7, 8 and 9then for Staff we should have 1 for Cat 16, 2 for Cat 18 and 2 for Cat 32, nothing for Timeliness...

I know I can use stack function for Step 1 but the df should be based on URNs so for each URN I need one column with "Cat" and one with "Sent"...

My solution to Step 1:

library(reshape2)
dat3 <- reshape(data.frame, direction = "long", idvar=1:2, 
                varying=3:ncol(data.frame), sep = "", timevar="Measure")

but I can't get rid of blank "Measures". I have 5 "Cat" and 5 "Sent" and they are all taken into account even if responses are blank...

This would be a solution for your first step using a more modern tool (I can't help you with the second step because I really don't understand the logic and the output seems untidy or more like a table than a dataframe)

df <- data.frame(stringsAsFactors=FALSE,
           URN = c("10BE022001884", "10BE022001936", "10BE022102325",
                   "10BE023230730", "10BE022102325", "10BE022102346",
                   "10BE022995949", "10BE023335219", "10BE022102340", "10BE022102346",
                   "10BE022104737"),
           Category = c("Recommendation", "Recommendation", "Recommendation",
                        "Recommendation", "Staff", "Staff", "Staff", "Staff",
                        "Timeliness", "Timeliness", "Timeliness"),
           Sent1 = c("+", "+", "-", "0", "-", "+", "0", "0", "+", "-", "-"),
           Cat1 = c(29, 11, 7, 29, 16, 12, 24, 10, 18, 5, 11),
           Sent2 = c("+", "+", "-", "0", "-", "+", "0", "0", "+", "-", "-"),
           Cat2 = c(8, 7, 8, 7, 18, 7, 7, 29, 9, 9, 23),
           Sent3 = c(NA, "+", "-", NA, "-", "-", NA, NA, NA, NA, "-"),
           Cat3 = c(NA, 8, 9, NA, 18, 12, NA, NA, NA, NA, 32),
           Sent4 = c(NA, NA, NA, NA, "-", "-", NA, NA, NA, NA, "-"),
           Cat4 = c(NA, NA, NA, NA, 32, 32, NA, NA, NA, NA, 35),
           Sent5 = c(NA, NA, NA, NA, "-", "-", NA, NA, NA, NA, NA),
           Cat5 = c(NA, NA, NA, NA, 32, 9, NA, NA, NA, NA, NA)
)

library(tidyverse)

df %>% 
    pivot_longer(
        matches("Cat\\d|Sent"),
        names_to = c(".value", "set"),
        names_pattern = "([:alpha:]+)([:digit:]+)"
    ) %>% 
    na.omit() %>% 
    select(-set)
#> # A tibble: 32 x 4
#>    URN           Category       Sent    Cat
#>    <chr>         <chr>          <chr> <dbl>
#>  1 10BE022001884 Recommendation +        29
#>  2 10BE022001884 Recommendation +         8
#>  3 10BE022001936 Recommendation +        11
#>  4 10BE022001936 Recommendation +         7
#>  5 10BE022001936 Recommendation +         8
#>  6 10BE022102325 Recommendation -         7
#>  7 10BE022102325 Recommendation -         8
#>  8 10BE022102325 Recommendation -         9
#>  9 10BE023230730 Recommendation 0        29
#> 10 10BE023230730 Recommendation 0         7
#> # … with 22 more rows

Created on 2019-09-18 by the reprex package (v0.3.0.9000)

1 Like

thank you but I have following error:

Error in pivot_longer(., matches("Cat\\d|Sent"), names_to = c(".value",  : 
  could not find function "pivot_longer"

You have to update the tidyr package

Step 2 is pivoting step 1 so creating a table which may be exported to csv or excel where the first column is "URN", the second column is "Sent" and this is crosstabed with "Category" and "Cat" below the "Category"...

As far as I know, data wrangling tools are not designed to work that way (I might be wrong though), what you are describing is not well suited for a dataframe, although you can get what you want in the form of a table (graphic representation), one possible tool for the job is the gt package but currently is not posible to export the table to csv or excel.

Apart from the cosmetic format for your table (spanner column labels), this would leave the data in the desired form.

library(tidyverse)

df %>% 
    pivot_longer(
        matches("Cat\\d|Sent"),
        names_to = c(".value", "set"),
        names_pattern = "([:alpha:]+)([:digit:]+)"
    ) %>% 
    na.omit() %>%
    filter(Sent == "-") %>%
    select(-set, -Sent) %>%
    count(URN, Category, Cat) %>% 
    pivot_wider(names_from = c(Category, Cat), values_from = n)
#> # A tibble: 3 x 15
#>   URN   Recommendation_7 Recommendation_8 Recommendation_9 Staff_16
#>   <chr>            <int>            <int>            <int>    <int>
#> 1 10BE…                1                1                1        1
#> 2 10BE…               NA               NA               NA       NA
#> 3 10BE…               NA               NA               NA       NA
#> # … with 10 more variables: Staff_18 <int>, Staff_32 <int>, Staff_9 <int>,
#> #   Staff_12 <int>, Timeliness_5 <int>, Timeliness_9 <int>,
#> #   Timeliness_11 <int>, Timeliness_23 <int>, Timeliness_32 <int>,
#> #   Timeliness_35 <int>
1 Like

To exemplify, this would be your desired output as a gt table

library(tidyverse)
library(gt)

new_df <- df %>% 
    pivot_longer(
        matches("Cat\\d|Sent"),
        names_to = c(".value", "set"),
        names_pattern = "([:alpha:]+)([:digit:]+)"
    ) %>% 
    na.omit() %>%
    filter(Sent == "-") %>%
    select(-set, -Sent) %>%
    count(URN, Category, Cat) %>%
    arrange(Category, Cat) %>% 
    pivot_wider(names_from = c(Category, Cat), values_from = n)

new_names <- names(new_df) %>% 
    str_extract("URN|\\d+") %>% 
    set_names(names(new_df))

new_df %>% 
    gt() %>% 
    tab_spanner(
        label = "Recommendation",
        columns = starts_with("Recommendation")
    ) %>%
    tab_spanner(
        label = "Staff",
        columns = starts_with("Staff")
    ) %>% 
    tab_spanner(
        label = "Timeliness",
        columns = starts_with("Timeliness")
    ) %>% 
    cols_label(.list = new_names)

That is absolutely brilliant! The only thing missing which I'm straggling to add is a column with Sent next to URN...

I'm really sorry but I'm having an issue with the code. The real structure of the file is like that (the previous example was slightly different):

data.frame(stringsAsFactors=FALSE,
                           URN = c("10BE012112713", "10BE012112713", "10BE012112713",
                                   "10BE012112713", "10BE012113466"),
                      Language = c("FR", "FR", "FR", "FR", "FR"),
                      QUESTION = c("Facilities", "Recommendation", "Staff", "Choice Dealer",
                                   "Recommendation"),
                           Org = c("Locaux", "Recommandation concession", "Collaborateurs",
                                   "Choix de la concession",
                                   "Recommandation concession"),
                      VERBATIM = c("aaa", "ccc", "ddd", "fff", "ggg"),
                         Sent1 = c("0", "+", "+", NA, "-"),
                          Cat1 = c(998, 32, 7, NA, 10),
                         Sent2 = c(NA, "-", NA, NA, "-"),
                          Cat2 = c(NA, 11, NA, NA, 11),
                         Sent3 = c(NA, "-", NA, NA, "-"),
                          Cat3 = c(NA, 11, NA, NA, 12),
                         Sent4 = c(NA, "-", NA, NA, "-"),
                          Cat4 = c(NA, 32, NA, NA, 12),
                         Sent5 = c(NA, "-", NA, NA, "-"),
                          Cat5 = c(NA, 32, NA, NA, 32),
                         Sent6 = c(NA, NA, NA, NA, "-"),
                          Cat6 = c(NA, NA, NA, NA, 32),
                         Sent7 = c(NA, NA, NA, NA, "-"),
                          Cat7 = c(NA, NA, NA, NA, 32),
                         Sent8 = c(NA, NA, NA, NA, "-"),
                          Cat8 = c(NA, NA, NA, NA, 35),
                         Sent9 = c(NA, NA, NA, NA, "-"),
                          Cat9 = c(NA, NA, NA, NA, 47),
                        Sent10 = c(NA, NA, NA, NA, "-"),
                         Cat10 = c(NA, NA, NA, NA, 9)
                  )

unfortunately, applying this code:

tdat %>% 
  pivot_longer(
    matches("Cat\\d|Sent"),
    names_to = c(".value", "set"),
    names_pattern = "([:alpha:]+)([:digit:]+)"
  ) %>% 
  na.omit() %>%
  filter(Sent == "-") %>%
  select(-set, -Sent) %>%
  count(URN, QUESTION, Cat) %>% 
  pivot_wider(names_from = c(QUESTION, Cat), values_from = n)library(tidyverse)

generates this error:

Error: unexpected symbol in:
"  count(URN, QUESTION, Cat) %>% 
  pivot_wider(names_from = c(QUESTION, Cat), values_from = n)library"

What am I doing wrong? How can I keep "Sent"?

The error tells you the problem.

Take a look at the last line of the code. It looks like you pasted some code before rather than after the library() call.

That is wright! Thank you!!!

Blockquote

Final question to this post (I promise!).
How can I modify the code having only one "Cat" (instead of Cat1, Cat2 etc)? I would like to skip Step one...

Just skip the lines of code for the step one, this modification is pretty trivial so I think is a good exercise for you to understand what the code is doing instead of just copying a black box.

I've tried multiple options for this modified file:

data.frame(stringsAsFactors=FALSE,
                          ...1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                       þÿ.URN. = c("10BE013341890", "10BE013341890", "10BE012002051",
                                   "10BE012002051", "10BE012002051",
                                   "10BE012002051", "10BE012002051", "10BE012002051",
                                   "10BE012002057", "10BE012002057"),
                      QUESTION = c("Staff", "Staff", "Choice Brand", "Choice Dealer",
                                   "Ease of Buying", "Facilities",
                                   "Recommendation", "Staff", "Choice Brand",
                                   "Choice Dealer"),
                      VERBATIM = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
                                   "jjj"),
                          Sent = c("-", "0", "+", "NA", "NA", "NA", "0", "0", "+", "NA"),
                           Cat = c("5", "8", "32", "NA", "NA", "NA", "50", "50", "2", "NA"),
                     StartChar = c("224", "0", "0", "NA", "NA", "NA", "0", "0", "0", "NA"),
                       EndChar = c("41", "77", "76", "NA", "NA", "NA", "25", "23", "96", "NA")
                  )

but none of them seems to work :sweat: