Coalesce survey responses in a dataframe

phiggins · June 3, 2019, 8:41pm

I have a dataframe of survey responses that includes a column for each response.
For example, for question 1, a multiple choice question with 5 responses from a to e,
I have five mutually exclusive columns, for Q1_a, Q1_b, Q1_c, Q1_d, and Q1_e.
I would like to coalesce these into a single variable, named Q1.
dplyr::coalesce seems exactly right, but only works on vectors, not variables in a dataframe.
Any ideas on workarounds to use coalesce, or more clever ways to do this programmatically?

technocrat · June 3, 2019, 9:22pm

dplyr::coalesce will find the first non-missing value in a vector, including a column in a data frame or tibble.

I'm going to assume that the five existing columns have values of a through e, respectively, and that you want Q1 to contain the value abcde in the same dataframe, which I'll call survey.

The goto operation to create a new column is mutate.

paste("a","b","c","d","e", sep = "")
[1] "abcde"

is the transformation and

survey <- survey %>% mutate(Q1 = paste(Q1_a, Q1_b, Q1_c, Q1_d, Q1_e, sep = "")

will create a column combining the existing columns.

phiggins · June 4, 2019, 2:49am

I thought that this would not work, as all of the non-selected columns are NAs, and pasting gives me a Q1 result of:
"NA NA C NA NA" if the respondent selected C.
However, if I add on a str_extract and a smidgen of simple regex (all I can manage), like
str_trim(str_extract(paste(NA,NA,"C",NA, NA), "\s[A-E]"))
I get
"C".
This works for everything (B,C,D,or E) except when the respondent selects A.
It needs the whitespace in front so it won't select the first A in NA.
Perhaps someone with better regex skills could make this work and turn it into a function.

andresrcs · June 4, 2019, 3:01am

Why not to do something like this?

Q1 <- c(NA, NA, "A", NA, NA)
Q1[!is.na(Q1)]
#> [1] "A"

It would be easier to help you if you could provide a minimal REPRoducible EXample (reprex).

EDIT: Just in case you still want to follow the regex path

library(stringr)
str_extract(paste("A",NA,NA,NA,NA), "(?<!N\\s?)[A-E]")
#> [1] "A"

nwerth · June 4, 2019, 1:43pm

You can use which(x, arr.ind = TRUE) to get the row/column indices where a group of questions is TRUE.

questions <- data.frame(
  Q1_a = c(TRUE,  FALSE, FALSE, TRUE,  FALSE, FALSE, TRUE),
  Q1_b = c(FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE),
  Q1_c = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Q1_d = c(FALSE, FALSE, FALSE, FALSE, TRUE,  TRUE,  FALSE),
  Q1_e = c(FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE)
)

answers <- which(as.matrix(questions), arr.ind = TRUE)
answers
#     row col
# [1,]   1   1
# [2,]   4   1
# [3,]   7   1
# [4,]   2   2
# [5,]   5   4
# [6,]   6   4
# [7,]   3   5

questions[answers[, "row"], "Q1_choice"] <- answers[, "col"]
questions
#    Q1_a  Q1_b  Q1_c  Q1_d  Q1_e Q1_choice
# 1  TRUE FALSE FALSE FALSE FALSE         1
# 2 FALSE  TRUE FALSE FALSE FALSE         2
# 3 FALSE FALSE FALSE FALSE  TRUE         5
# 4  TRUE FALSE FALSE FALSE FALSE         1
# 5 FALSE FALSE FALSE  TRUE FALSE         4
# 6 FALSE FALSE FALSE  TRUE FALSE         4
# 7  TRUE FALSE FALSE FALSE FALSE         1

Of course, you'd have to subset your survey for the question groups, and possibly convert the responses to logical columns.

phiggins · June 4, 2019, 2:21pm

Being brave and trying first reprex.
Here goes:

# build my first reprex
#
# there is a large questionnaire
# each participant can only put one answer each question (or no answer)
# this is an example of responses to Question 45480.
# the possible answers are coded as 1:5
# these are multiple choice Qs, with 4-6 answers (varies by Q)
# some respondents did not answer - they have NA for all 5 variables
# those who did answer will have one and only one non-NA response
# among the 5 variables Q45480_1 to Q45480_5
# there are many Q with responses coded this way
# a solution that results in a reusable function
# ie. coalesceQ(question, numresponses)
# to coalesce the 4-6 multiple choice responses (it varies by question)
# into one variable with responses as factors of 1-4, 1-5, or 1-6
# as appropriate with NAs when the participant did not respond.
# would be ideal.
# hoping all you clever folks have good ideas.
#
participant_id <- 1:10
Q45480_1 <- c(NA, 1,  NA, NA, NA, 1 , NA, NA, NA, NA)
Q45480_2 <- c(2,  NA, NA, NA, NA, NA, NA, 2 , NA, NA)
Q45480_3 <- c(NA, NA, NA, 3 , NA, NA, NA, NA, 3 , NA)
Q45480_4 <- c(NA, NA, 4 , NA, NA, NA, NA, NA, NA, NA)
Q45480_5 <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 5 )
dfr<- data.frame(participant_id, Q45480_1, Q45480_2, 
                 Q45480_3, Q45480_4, Q45480_5)

^{Created on 2019-06-04 by the reprex package (v0.3.0)}

phiggins · June 4, 2019, 4:11pm

This version comes very close, but for some reason adds a row 11. Not sure where it is coming from.

# build my 2nd reprex
#
# there is a large questionnaire
# each participant can only put one answer each question (or no answer)
# this is an example of responses to Question 45480.
# the possible answers are coded as 1:5
# these are multiple choice Qs, with 4-6 answers (varies by Q)
# some respondents did not answer - they have NA for all 5 variables
# those who did answer will have one and only one non-NA response
# among the 5 variables Q45480_1 to Q45480_5
# there are many Q with responses coded this way
# a solution that results in a reusable function
# ie. coalesceQ(question, numresponses)
# to coalesce the 4-6 multiple choice responses (it varies by question)
# into one variable with responses as factors of 1-4, 1-5, or 1-6
# as appropriate with NAs when the participant did not respond.
# would be ideal.
# hoping all you clever folks have good ideas.
#
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
participant_id <- 1:10
Q45480_1 <- c(NA, 1,  NA, NA, NA, 1 , NA, NA, NA, NA)
Q45480_2 <- c(2,  NA, NA, NA, NA, NA, NA, 2 , NA, NA)
Q45480_3 <- c(NA, NA, NA, 3 , NA, NA, NA, NA, 3 , NA)
Q45480_4 <- c(NA, NA, 4 , NA, NA, NA, NA, NA, NA, NA)
Q45480_5 <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 5 )
dfr<- data.frame(participant_id, Q45480_1, Q45480_2,
                 Q45480_3, Q45480_4, Q45480_5)
dfr_sub <- dfr %>% select(starts_with("Q454"))

answers <- which(!is.na(as.matrix(dfr_sub)), arr.ind = TRUE)

dfr_sub2 <- dfr_sub[answers[,"row"], "Qchoice"] <- answers[,"col"]

rbind(dfr_sub,dfr_sub2)
#>    Q45480_1 Q45480_2 Q45480_3 Q45480_4 Q45480_5 Qchoice
#> 1        NA        2       NA       NA       NA       2
#> 2         1       NA       NA       NA       NA       1
#> 3        NA       NA       NA        4       NA       4
#> 4        NA       NA        3       NA       NA       3
#> 5        NA       NA       NA       NA       NA      NA
#> 6         1       NA       NA       NA       NA       1
#> 7        NA       NA       NA       NA       NA      NA
#> 8        NA        2       NA       NA       NA       2
#> 9        NA       NA        3       NA       NA       3
#> 10       NA       NA       NA       NA        5       5
#> 11        1        1        2        2        3       3

^{Created on 2019-06-04 by the reprex package (v0.3.0)}

nwerth · June 4, 2019, 8:59pm

The problem is here:

dfr_sub2 <- dfr_sub[answers[,"row"], "Qchoice"] <- answers[,"col"]

There are two variable assignments in this line, and they're effectively these:

dfr_sub[answers[,"row"], "Qchoice"] <- answers[,"col"]
dfr_sub2 <- answers[,"col"]

I'm not sure why you're adding the coalesced answers back into the subset instead of the main survey dataset. That can be done like this:

dfr[answers[,"row"], "Qchoice"] <- answers[,"col"]

andresrcs · June 4, 2019, 10:10pm

What was your problem with coalesce? this seems to work

dfr <- data.frame(
    participant_id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L),
    Q45480_1 = c(NA, 1, NA, NA, NA, 1, NA, NA, NA, NA),
    Q45480_2 = c(2, NA, NA, NA, NA, NA, NA, 2, NA, NA),
    Q45480_3 = c(NA, NA, NA, 3, NA, NA, NA, NA, 3, NA),
    Q45480_4 = c(NA, NA, 4, NA, NA, NA, NA, NA, NA, NA),
    Q45480_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 5))

library(dplyr)

dfr %>%
    rowwise() %>% 
    mutate(Qchoice = coalesce(Q45480_1, Q45480_2, Q45480_3, Q45480_4, Q45480_5))
#> Source: local data frame [10 x 7]
#> Groups: <by row>
#> 
#> # A tibble: 10 x 7
#>    participant_id Q45480_1 Q45480_2 Q45480_3 Q45480_4 Q45480_5 Qchoice
#>             <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#>  1              1       NA        2       NA       NA       NA       2
#>  2              2        1       NA       NA       NA       NA       1
#>  3              3       NA       NA       NA        4       NA       4
#>  4              4       NA       NA        3       NA       NA       3
#>  5              5       NA       NA       NA       NA       NA      NA
#>  6              6        1       NA       NA       NA       NA       1
#>  7              7       NA       NA       NA       NA       NA      NA
#>  8              8       NA        2       NA       NA       NA       2
#>  9              9       NA       NA        3       NA       NA       3
#> 10             10       NA       NA       NA       NA        5       5

Col · June 5, 2019, 12:37am

Is rowwise() needed here? I didn't think coalesce aggregated

phiggins · June 6, 2019, 11:38am

Mortified here. This does work. I can't reproduce the problem I was having before. I was getting an error message that made it sound like coalesce() would only work on vectors, not variables in dataframes. It is possible that updating packages helped(??) but now this completely works, which is great!

phiggins · June 6, 2019, 11:39am

It does work on my survey dataset without the rowwise() line of code.
thanks!

system · June 27, 2019, 11:39am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.