Help with pivot_longer to make questionaire dataset tidy

Hello everyone, I hope you're doing well. I need help transforming a specific dataframe into a long format. The data pertains to a consumer survey for two or more products. The collection is done via an online form, and I import the generated Excel file into R as a dataframe at the end of the survey.

The structure of the dataframe is as follows, with the variables in sequence:

  • id (chr): Consumer identification
  • sample (chr): Identifier of the sample evaluated in the first position
  • question_1 (num): First question about the sample evaluated in the first position
  • question_2 (num): Second question about the sample evaluated in the first position
  • question_3 (chr): Third question about the sample evaluated in the first position
  • sample: (chr) Identifier of the sample evaluated in the second position
  • question_1 (num): First question about the sample evaluated in the second position
  • question_2 (num): Second question about the sample evaluated in the second position
  • question_3 (chr): Third question about the sample evaluated in the second position

The above example dataframe has 9 columns, with some of them having the same column names but different observations because they are different samples. R automatically renames repeated column names (example: sample, sample.1), despite the question name being the same.

The final dataframe that I need in long format would have the following variables in sequence:
id, sample, question_1, question_2, question_3

I tried using tidyr to code with pivot_longer (based on the example from R4DS bellow), but I couldn't solve the issue. Can anyone help me?

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )

Hi @leonardo79 , Could you put a reproducible example of data?

# other way is:
# If you data is data, please paste the result of 
dput(head(data,30)) # for first 30 rows with all columns. #

Hi, @M_AcostaCH ! Thank you for your reply and the link with information. See bellow a small dataset that can be used: (edited to post the complete code that is not working properly)

# Carregando o pacote necessário
library(tidyr)
library(dplyr)

# Criando um exemplo de dataframe
dataset <- data.frame(
 id = c("ID1", "ID2", "ID3"),
 sample = c("Sample1", "Sample2", "Sample1"),
 question_1 = c(5, 4, 7),
 question_2 = c(3, 2, 5),
 question_3 = c("Yes", "No", "No"),
 sample = c("Sample2", "Sample1", "Sample2"),
 question_1 = c(4, 5, 3),
 question_2 = c(2, 3, 4),
 question_3 = c("No", "Yes", "No")
)

# Transformando o dataframe para o formato longo
df_long <- df %>%
 pivot_longer(
   cols = -id,
   names_to = c(".value", "sample_num"),
   names_pattern = "(.*)(\\d)$"
 )

# Visualizando o dataframe transformado
print(df_long)

Im try to help and find this form. A close example like you want:
Maybe a super R user could make better.

dataset <- data.frame(
 id = c("ID1", "ID2", "ID3"),
 sample = c("Sample1", "Sample2", "Sample1"),
 question_1 = c(5, 4, 7),
 question_2 = c(3, 2, 5),
 question_3 = c("Yes", "No", "No"),
 sample = c("Sample2", "Sample1", "Sample2"),
 question_1 = c(4, 5, 3),
 question_2 = c(2, 3, 4),
 question_3 = c("No", "Yes", "No")
)
# I think that the two `sample` columns is difficult for get a Q1, Q2, Q3 columns.

dataset_long <- dataset %>%
  pivot_longer(cols = starts_with("question_"), 
               names_to = "question", 
               values_to = "value") %>%
  mutate(question = gsub("question_", "Q", question)) 

#  id    sample  sample.1 question value
#  <chr> <chr>   <chr>    <chr>    <chr>
# 1 ID1   Sample1 Sample2  Q1       5    
# 2 ID1   Sample1 Sample2  Q2       3    
# 3 ID1   Sample1 Sample2  Q3       Yes  
# 4 ID1   Sample1 Sample2  Q1.1     4    
# 5 ID1   Sample1 Sample2  Q2.1     2    
# 6 ID1   Sample1 Sample2  Q3.1     No   
# 7 ID2   Sample2 Sample1  Q1       4    
# 8 ID2   Sample2 Sample1  Q2       2    
# 9 ID2   Sample2 Sample1  Q3       No   
# 10 ID2   Sample2 Sample1  Q1.1     5    
# 11 ID2   Sample2 Sample1  Q2.1     3    
# 12 ID2   Sample2 Sample1  Q3.1     Yes  
# 13 ID3   Sample1 Sample2  Q1       7    
# 14 ID3   Sample1 Sample2  Q2       5    
# 15 ID3   Sample1 Sample2  Q3       No   
# 16 ID3   Sample1 Sample2  Q1.1     3    
# 17 ID3   Sample1 Sample2  Q2.1     4    
# 18 ID3   Sample1 Sample2  Q3.1     No  

This may do what you want. Note I exported your data.frame as a .csv file and read it back in as a data.table. You can do the same thing with a data.frame, I just find it easier to use {data.table}

suppressMessages(library(data.table))

DT <- fread("leonardo.csv") 

DT1 <- DT[, .( id,  sample, question_1, question_2, question_3)]
DT2 <- DT[, .(id, sample.1, question_1.1, question_2.1, question_3.1)]

names(DT2) <- names(DT1)

DT3 <- rbind(DT1, DT2)

2 Likes

@M_AcostaCH Thank you for your kind answer! But the desired format to the new dataframe needs to have only id, sample, question_1, question_2 and question_3 columns.

Thank you @jrkrideau ! It's what I want (worked for this small example). How can I generalize this answer if I have 30 questions columns and 8 sample columns for example? Do you have any idea?

I'll have to think about this but you can just think of this but you can just keep repeating what I did for the 8 samples but with all 30 variables. :grimacing:

All I did was do

names(DT)

and copy the names into a template

I think it should be fairly easy to come up with a better method but probably will not be able to get to it until tomorrow morning my time (EDST North America).

1 Like

Hi @jrkrideau ! Thank you a lot for your time. I am trying to obtain a better code to use in general situations. If I discover, I'll share here.

Morning (my time)leonardo79

I seem to be having a mental block on an easy way to split up that data set. However slight modification of my earlier method will cut down on a lot of finicky work. See below:

DT <- fread("leonardo.csv") 
DT1 <- DT[, id, sample:question_3]
DT2 <- DT[, id, sample.1 :question_3.1]

You can do this

names(DT2) <- names(DT3) <- names(DT4) <- names(DT1)
DTtotal <- rbind(DT1, DT2, DT3, DT4)

It is not a general solution but it may let you get some work done until I get hit by some inspiration or someone else comes along and shows us what obvious thing I'm missing.

1 Like

For this setup, I've come up with the following solution:

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(
  id = c("ID1", "ID2", "ID3"),
  sample = c("Sample1", "Sample2", "Sample1"),
  question_1 = c(5, 4, 7),
  question_2 = c(3, 2, 5),
  question_3 = c("Yes", "No", "No"),
  sample = c("Sample2", "Sample1", "Sample2"),
  question_1 = c(4, 5, 3),
  question_2 = c(2, 3, 4),
  question_3 = c("No", "Yes", "No")
)

dataset %>%
  mutate(across(where(is.numeric), as.character)) %>%
  pivot_longer(cols = -id, 
               names_to = "question", 
               values_to = "value") %>%
  mutate(
    SampleGp=if_else(stringr::str_detect(question, "\\.1"), "B", "A"),
    question=stringr::str_remove_all(question, "\\.1")
  ) %>%
  pivot_wider(id_cols=c(id, SampleGp), values_from=value, names_from=question) %>%
  mutate(
    sample=readr::parse_number(sample),
    across(c(question_1, question_2), as.numeric)
  ) %>%
  select(-SampleGp)
#> # A tibble: 6 × 5
#>   id    sample question_1 question_2 question_3
#>   <chr>  <dbl>      <dbl>      <dbl> <chr>     
#> 1 ID1        1          5          3 Yes       
#> 2 ID1        2          4          2 No        
#> 3 ID2        2          4          2 No        
#> 4 ID2        1          5          3 Yes       
#> 5 ID3        1          7          5 No        
#> 6 ID3        2          3          4 No

Created on 2024-03-12 with reprex v2.1.0

And this solution would expand to including an arbitrary number of groups though the manual bit still exists converting question_1 and question_2 to numeric at the end.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(
  id = c("ID1", "ID2", "ID3"),
  sample = c("Sample1", "Sample2", "Sample1"),
  question_1 = c(5, 4, 7),
  question_2 = c(3, 2, 5),
  question_3 = c("Yes", "No", "No"),
  sample = c("Sample2", "Sample1", "Sample2"),
  question_1 = c(4, 5, 3),
  question_2 = c(2, 3, 4),
  question_3 = c("No", "Yes", "No")
)

dataset %>%
  mutate(across(where(is.numeric), as.character)) %>%
  pivot_longer(cols = -id, 
               names_to = "question", 
               values_to = "value") %>%
  separate_wider_delim(cols=question, delim=".", names=c("Question", "suffix"), too_few="align_start") %>%
  pivot_wider(id_cols=c(id, suffix), values_from=value, names_from=Question) %>%
  mutate(
    sample=readr::parse_number(sample),
    across(c(question_1, question_2), as.numeric)
  ) %>%
  select(-suffix)
#> # A tibble: 6 × 5
#>   id    sample question_1 question_2 question_3
#>   <chr>  <dbl>      <dbl>      <dbl> <chr>     
#> 1 ID1        1          5          3 Yes       
#> 2 ID1        2          4          2 No        
#> 3 ID2        2          4          2 No        
#> 4 ID2        1          5          3 Yes       
#> 5 ID3        1          7          5 No        
#> 6 ID3        2          3          4 No

Created on 2024-03-12 with reprex v2.1.0

1 Like

Thank you Stephanie @StatSteph !
An excellent option to solve the problem of keeping the data Tidy. I understood the possibility of generalizing the answer. With the insights obtained here, I am trying to create a code to address more general cases. If I succeed, I will share it here. :slight_smile:

Saved! I knew someone would get me out of my dead-end approach!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.