Building a data frame for plotting survey results

erinho · November 3, 2022, 4:14pm

My survey data has columns with headers like item1, item2, item3 and so on. I want to build a data frame for plotting that has all the items and their n, mean and standard deviation. I have tried this one.

tidy_data <- data %>% 
  select(c(contains("item"))) %>% 
  pivot_longer(names_to = "question",
               values_to = "answer",
               cols = all()) %>% 
  group_by(question) %>% 
  summarise(
    n=n(),
    mean=mean(answer),
    sd=sd(answer)
    )

I think this code has more than one error. I don't know if this cols=all argument is correct or do I even need it here. Also, I guess that select part goes wrong. Thanks for your advice!

Here's a data sample

structure(list(item1 = c(1, 1, 1, 1, 2, NA, 1, 1, 1, 1), item2 = c(2, 
5, 2, 2, 2, 2, 2, 5, 19, 6), item3 = c(1, 1, 1, 1, 1, 1, 2, 1, 
1, 2), item4 = c(1, 1, 1, 1, 1, 1, NA, 1, 1, NA), item5 = c(4, 
4, 4, 1, 4, 4, 6, 4, 2, 4)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

FJCC · November 3, 2022, 4:43pm

Your code was close to correct. You do not need the select() function because all of the columns meet the condition of contains("item"). If your real data set has other columns, then you might want to keep the select(), In any case, it does not do any harm. The all() function is not the right one to use. The correct one is everything(). The all() function is used to tell whether a set of logical conditions are all TRUE.
I added the na.rm argument to the calculation of the mean and sd. Otherwise, columns with one or more NA values are reported as NA. I also added a column to your summarize that shows how many responses are not NA.

library(tidyverse)
data <- structure(list(item1 = c(1, 1, 1, 1, 2, NA, 1, 1, 1, 1), 
               item2 = c(2, 5, 2, 2, 2, 2, 2, 5, 19, 6), 
               item3 = c(1, 1, 1, 1, 1, 1, 2, 1, 1, 2), 
               item4 = c(1, 1, 1, 1, 1, 1, NA, 1, 1, NA), 
               item5 = c(4, 4, 4, 1, 4, 4, 6, 4, 2, 4)), 
          row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

tidy_data <- data %>% 
  select(c(contains("item"))) %>% 
  pivot_longer(names_to = "question",
               values_to = "answer",
               cols = everything()) %>% 
  group_by(question) %>% 
  summarise(
    n=n(),
    mean=mean(answer,na.rm=TRUE),
    sd=sd(answer,na.rm=TRUE),
    n_adj = sum(!is.na(answer))
  )
tidy_data
#> # A tibble: 5 × 5
#>   question     n  mean    sd n_adj
#>   <chr>    <int> <dbl> <dbl> <int>
#> 1 item1       10  1.11 0.333     9
#> 2 item2       10  4.7  5.27     10
#> 3 item3       10  1.2  0.422    10
#> 4 item4       10  1    0         8
#> 5 item5       10  3.7  1.34     10

^{Created on 2022-11-03 with reprex v2.0.2}

system · November 10, 2022, 4:44pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.