Function to summarize number of studies by counts of values in columns in df

mckay.todd · June 25, 2019, 10:16pm

Hi, everyone!

I'm trying to write a function that counts, for each column in a data frame, how many studies received each value. In the following example, the first column holds study names, and Item1:Item3 are variables containing codes that a coding team assigned to information in each study. I want to be able to plug in different dfs; each df will always have the same "Study" column, but the number of variables might change from one df to the next.

Can someone help me turn this into a function? Thank you! (Thanks to woodman for getting me going here.)

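library(tidyverse)

# 50 rows in total: each study contributes several coded rows
df <- tibble(
  Study = c( rep("Wash_2009", 5), 
             rep("Zoey_2001", 12),
             rep("Jane_1999", 10),
             rep("Todd_1993", 15),
             rep("Coco_2019", 5),
             rep("Xena_2016", 3) ),
  # Mixing numbers with "NS"/"OT" makes every code a character string
  Item1 = sample( c(1, 2, 3, 4, 5, "NS", "OT"), 50, replace = TRUE),
  Item2 = sample( c(1, 2, 3, 4, 5, "NS", "OT"), 50, replace = TRUE),
  Item3 = sample( c(1, 2, 3, 4, 5, "NS", "OT"), 50, replace = TRUE)
)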

Item1 <- df %>%
  group_by(Study) %>%
  count(Item1) %>%
  group_by(Item1) %>%
  summarise(Studies = n())

Item2 <- df %>%
  group_by(Study) %>%
  count(Item2) %>%
  group_by(Item2) %>%
  summarise(Studies = n())

Item3 <- df %>%
  group_by(Study) %>%
  count(Item3) %>%
  group_by(Item3) %>%
  summarise(Studies = n())

Item1
Item2
Item3

joels · June 25, 2019, 10:27pm

One option is to reshape your data to long format, which is one way to avoid having to know in advance how many columns you're summarizing. For example:

library(tidyverse)

summary1 = df %>% 
  gather(key, value, -Study) %>% 
  group_by(key, value) %>% 
  summarise(n = length(unique(Study)))

summary1

# A tibble: 21 x 3
# Groups:   key [3]
   key   value     n
   <chr> <chr> <int>
 1 Item1 1         5
 2 Item1 2         4
 3 Item1 3         5
 4 Item1 4         3
 5 Item1 5         3
 6 Item1 NS        5
 7 Item1 OT        4
 8 Item2 1         2
 9 Item2 2         5
10 Item2 3         2
# … with 11 more rows
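
Because df is built with sample() and no seed, the numbers above will differ from run to run. As a hand-checkable illustration of the same reshaping, here is a minimal sketch on a small made-up data frame (the names toy and toy_summary and the values are hypothetical, not from the original post). Code "1" appears twice for study A in Item1 but should still count as one study:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical toy data: two studies, two coded items
toy <- tibble(
  Study = c("A", "A", "B", "B"),
  Item1 = c("1", "1", "1", "NS"),
  Item2 = c("2", "OT", "2", "2")
)

toy_summary <- toy %>%
  gather(key, value, -Study) %>%          # long format: one row per Study/item/code
  group_by(key, value) %>%
  summarise(n = length(unique(Study))) %>% # distinct studies per item/code pair
  ungroup()

toy_summary
```

Here Item1 code "1" and Item2 code "2" are each used by both studies (n = 2), while "NS" and "OT" are each used by one study (n = 1).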

If you'd like the results in wide format, you can do:

summary1 %>% spread(key, n)

# A tibble: 7 x 4
  value Item1 Item2 Item3
  <chr> <int> <int> <int>
1 1         5     2     2
2 2         4     5     5
3 3         5     2     5
4 4         3     4     5
5 5         3     4     3
6 NS        5     6     4
7 OT        4     4     4

To turn this into a function:

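# data: a data frame with a "Study" column plus any number of item columns
# wide.output: if TRUE, return one column per item instead of long format
sfnc = function(data, wide.output=FALSE) {
  data = data %>% 
    gather(key, value, -Study) %>%          # stack all item columns
    group_by(key, value) %>% 
    summarise(n = length(unique(Study)))    # distinct studies per item/code

  if(wide.output) {
    data = data %>% spread(key, n)
  }

  data
}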

sfnc(df)

sfnc(df, wide.output=TRUE)
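
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a rough sketch of the same idea with the newer verbs, not a drop-in replacement taken from this thread: the names sfnc_pivot and two_items are hypothetical, and it assumes all item columns share one type (character here), since pivot_longer() will not silently combine incompatible column types. The small data frame has only two item columns, to show that the number of items does not need to be known in advance:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical name; same logic as sfnc(), using tidyr >= 1.0.0 verbs
sfnc_pivot <- function(data, wide.output = FALSE) {
  out <- data %>%
    pivot_longer(-Study, names_to = "key", values_to = "value") %>%
    group_by(key, value) %>%
    summarise(n = n_distinct(Study)) %>%  # distinct studies per item/code
    ungroup()

  if (wide.output) {
    out <- out %>% pivot_wider(names_from = key, values_from = n)
  }

  out
}

# Hypothetical data with two item columns instead of three
two_items <- tibble(
  Study = c("A", "A", "B"),
  Item1 = c("1", "2", "1"),
  Item2 = c("NS", "NS", "3")
)

res_long <- sfnc_pivot(two_items)
res_wide <- sfnc_pivot(two_items, wide.output = TRUE)
```

In the wide result, item/code combinations that never occur (for example code "1" in Item2) come back as NA rather than 0.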
