create cols/var 1) named as a function of specific responses in previous cols/var and 2) that contain new values as function of previous responses positions in an array (best-worst scaling application)

streamline55 · December 2, 2018, 8:33am

The reason for posting here is that I'd prefer a solution that uses packages from 'tidyverse', although I'm aware it may not be fully possible.

If my question is inappropriate for this forum, please disregard/delete it and, please accept my apologies for posting out of place.

The long title would've been "create cols/var 1) named as a function of specific responses in previous cols/var and 2) that contain new values as function of previous responses positions in an array (best-worst scaling application)"

I'm performing some best-worst scaling (maxdiff) analyses and I need to convert the data from the raw responses data file into a data file suitable for BWS analyses.

My responses data file is dfex and the end result should be dfbw.
Basically, the example starting datafile is self-explanatory, i.e., there are two questions q01 and q02, and the respondents indicate which of the available alternatives to these questions are "best" or "worst" ("most useful" and "least useful", here).

The final data file, suited for BWS analysis, should indicate for each question, which alternative, out of four possible per each question, was chosen. For instance, in case 1, question 1, the respondent indicated 1st alternative as the "most useful" and 3rd alternative as the "least useful".

I currently doing this via some seriously stupid manual conversions of scores but I'm certain that much more elegant solutions are available (I'm just not that advanced in R and I'm under a lot of pressure).

Could you, please, point me in the right direction (i.e., how to automate this process and make it more smoothly) and, if possible, please indicate your reasons for choosing using a certain package or particular approach versus another (e.g., a custom function versus a combination of functions in a package/s)?

    # This is my starting data frame
    dfex <- data.frame(stringsAsFactors=FALSE,
          q01_01 = c("most useful", "least useful", "least useful", NA),
          q01_02 = c(NA, "most useful", "most useful", "most useful"),
          q01_03 = c("least useful", NA, NA, "least useful"),
          q01_04 = c(NA, NA, NA, NA),
          q02_01 = c("least useful", "least useful", NA, "least useful"),
          q02_02 = c(NA, NA, NA, NA),
          q02_03 = c(NA, "most useful", "most useful", "most useful"),
          q02_04 = c("most useful", NA, "least useful", NA))

     # which now I'm processing manually
    dfex <- dfex %>% 
      mutate(b01_01 = case_when(q01_01 == "most useful" ~ 1)) %>% 
      mutate(b01_02 = case_when(q01_02 == "most useful" ~ 2)) %>% 
      mutate(b01_03 = case_when(q01_03 == "most useful" ~ 3)) %>% 
      mutate(b01_04 = case_when(q01_04 == "most useful" ~ 4)) %>% 
      mutate(b02_01 = case_when(q02_01 == "most useful" ~ 1)) %>% 
      mutate(b02_02 = case_when(q02_02 == "most useful" ~ 2)) %>% 
      mutate(b02_03 = case_when(q02_03 == "most useful" ~ 3)) %>% 
      mutate(b02_04 = case_when(q02_04 == "most useful" ~ 4)) %>%
      mutate(w01_01 = case_when(q01_01 == "least useful" ~ 1)) %>% 
      mutate(w01_02 = case_when(q01_02 == "least useful" ~ 2)) %>% 
      mutate(w01_03 = case_when(q01_03 == "least useful" ~ 3)) %>% 
      mutate(w01_04 = case_when(q01_04 == "least useful" ~ 4)) %>% 
      mutate(w02_01 = case_when(q02_01 == "least useful" ~ 1)) %>% 
      mutate(w02_02 = case_when(q02_02 == "least useful" ~ 2)) %>% 
      mutate(w02_03 = case_when(q02_03 == "least useful" ~ 3)) %>% 
      mutate(w02_04 = case_when(q02_04 == "least useful" ~ 4))
    
    # create (manually) cols 
    dfex %>% 
      select(c(b01_01:b01_04)) %>% 
      rowSums(., na.rm = T) -> dfex$B1
    dfex %>% 
      select(c(b02_01:b02_04)) %>% 
      rowSums(., na.rm = T) -> dfex$B2
    dfex %>% 
      select(c(w01_01:w01_04)) %>% 
      rowSums(., na.rm = T) -> dfex$W1
    dfex %>% 
      select(c(w02_01:w02_04)) %>% 
      rowSums(., na.rm = T) -> dfex$W2
    
    # getting the desired BW datafile after selecting only cols of interest
    dfbw <- dfex %>% select(B1:W2)

mishabalyasin · December 2, 2018, 10:54am

I think, main point to make here is that whenever you find yourself doing something by hand as in your example, you are probably working with wrong representation of data.
You've mentioned that you want to have tidyverse-style answer and one of the most important things to internalize is that tidyverse is built around a notion of tidy data (you can read about it here - https://en.wikipedia.org/wiki/Tidy_data). That means, that you first need to find a way to convert your input data to the tidy format and then, for example, dplyr verbs become much easier to use.
That all being said, you can solve your particular problem with the following:

suppressPackageStartupMessages(library(tidyverse))
input <- tibble::tibble(q01_01 = c("most useful", "least useful", "least useful", NA),
                   q01_02 = c(NA, "most useful", "most useful", "most useful"),
                   q01_03 = c("least useful", NA, NA, "least useful"),
                   q01_04 = c(NA, NA, NA, NA),
                   q02_01 = c("least useful", "least useful", NA, "least useful"),
                   q02_02 = c(NA, NA, NA, NA),
                   q02_03 = c(NA, "most useful", "most useful", "most useful"),
                   q02_04 = c("most useful", NA, "least useful", NA))

res <- input %>% 
  # item is implicit in input, so making it explicit
  dplyr::mutate(item = dplyr::row_number()) %>%
  # converting to tidy format
  tidyr::gather(key = "question", value = "choice", -item) %>%
  # we need to extract option out since it will be used in case_when
  tidyr::separate(question, into = c("question", "option"), sep = "_") %>%
  dplyr::mutate(option = as.integer(option)) %>%
  dplyr::mutate(best = dplyr::case_when(
    choice == "most useful" ~ option,
    TRUE                    ~ NA_integer_
  )) %>%
  dplyr::mutate(worst = dplyr::case_when(
    choice == "least useful" ~ option,
    TRUE                     ~ NA_integer_
  )) %>%
  # calculating best/worst
  dplyr::group_by(item, question) %>%
  dplyr::summarise(B = sum(best, na.rm = TRUE),
                   W = sum(worst, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  
  # using gather to again convert to tidy format
  tidyr::gather(key = "result", value = "choice", -question, -item) %>%
  
  # converting to the form you expect
  dplyr::mutate(question = dplyr::case_when(
    question == "q01" ~ "1",
    question == "q02" ~ "2"
  )) %>%
  tidyr::unite("united", result, question, sep = "") %>%
  tidyr::spread(key = united, value = choice) %>%
  dplyr::select(-item)

# which now I'm processing manually
dfex <- input %>% 
  mutate(b01_01 = case_when(q01_01 == "most useful" ~ 1)) %>% 
  mutate(b01_02 = case_when(q01_02 == "most useful" ~ 2)) %>% 
  mutate(b01_03 = case_when(q01_03 == "most useful" ~ 3)) %>% 
  mutate(b01_04 = case_when(q01_04 == "most useful" ~ 4)) %>% 
  mutate(b02_01 = case_when(q02_01 == "most useful" ~ 1)) %>% 
  mutate(b02_02 = case_when(q02_02 == "most useful" ~ 2)) %>% 
  mutate(b02_03 = case_when(q02_03 == "most useful" ~ 3)) %>% 
  mutate(b02_04 = case_when(q02_04 == "most useful" ~ 4)) %>%
  mutate(w01_01 = case_when(q01_01 == "least useful" ~ 1)) %>% 
  mutate(w01_02 = case_when(q01_02 == "least useful" ~ 2)) %>% 
  mutate(w01_03 = case_when(q01_03 == "least useful" ~ 3)) %>% 
  mutate(w01_04 = case_when(q01_04 == "least useful" ~ 4)) %>% 
  mutate(w02_01 = case_when(q02_01 == "least useful" ~ 1)) %>% 
  mutate(w02_02 = case_when(q02_02 == "least useful" ~ 2)) %>% 
  mutate(w02_03 = case_when(q02_03 == "least useful" ~ 3)) %>% 
  mutate(w02_04 = case_when(q02_04 == "least useful" ~ 4))

# create (manually) cols 
dfex %>% 
  select(c(b01_01:b01_04)) %>% 
  rowSums(., na.rm = T) -> dfex$B1
dfex %>% 
  select(c(b02_01:b02_04)) %>% 
  rowSums(., na.rm = T) -> dfex$B2
dfex %>% 
  select(c(w01_01:w01_04)) %>% 
  rowSums(., na.rm = T) -> dfex$W1
dfex %>% 
  select(c(w02_01:w02_04)) %>% 
  rowSums(., na.rm = T) -> dfex$W2

# getting the desired BW datafile after selecting only cols of interest
dfbw <- dfex %>% select(B1:W2) %>% dplyr::mutate_all(as.integer)

dplyr::all_equal(dfbw, res)
#> [1] TRUE

^{Created on 2018-12-02 by the reprex package (v0.2.1)}

I've added couple of comments in the pipeline to explain why I did what I did. It might look a bit intimidating, but keep in mind that this pipeline will work for any number of questions and you can probably split into 2-3 logical parts to make it more readable.

streamline55 · December 2, 2018, 1:46pm

Amazing! For me, it is simply amazing how some people seem to be born speaking 'tidyverse'-ean.. I wish I learn it some day enough to come out to play.
i suspected that melting and recasting was the way to go, but my mind got scared along the way...
Thank you again for both your help and your elegant solution, as well as for your guiding advises in the solution, dear Mr. @mishabalyasin :)!

mishabalyasin · December 2, 2018, 2:10pm

That might look this way, but it's most likely because I've shown you only the final solution, not 10 previous attempts that didn't work :).

Main insight that helped me finally solve it was turning implicit item into explicit column in input dataset. That clicked it for me because I've managed to track what is actually happening and what is the desired result.

That's why I've mentioned that converting to tidy data and making sure that all required information is present is the key.

streamline55 · December 2, 2018, 2:22pm

thank you, again! I have no doubt that it required a bit of thinking. It took me several days to give up, understanding that I need many more steps before I could comprehend it. You should be a teacher, if you're not already one. You have a real vocation for spreading knowledge

streamline55 · December 2, 2018, 3:15pm

Just one more thing. In my haste to get an answer, I posted this question in Stack Overflow and in Super User.
I've already acknowledged your answer, there, in comments, but I thought you might want to post an answer. I admit that I might be terribly in the wrong here, because I don't know what the rules for cross posting are or if they forbid it (I couldn't find anything explicit about it, but implicitly, i don't know...). So, here are the links, if you think it's worth it:
https://stackoverflow.com/questions/53578603/create-cols-var-named-as-a-function-of-specific-responses-in-previous-cols-var?noredirect=1#comment94023264_53578603
https://superuser.com/questions/1380150/create-cols-var-named-as-a-function-of-specific-responses-in-previous-cols-var

Thank you, again

PS (I'm at the fourth re-read and i suspect i have quite a few left... )

mishabalyasin · December 2, 2018, 3:27pm

You can read about policy on cross-posting in FAQ:

streamline55 · December 2, 2018, 4:30pm

Got it! I'll deal with it. Thank you for your patience.

system · December 9, 2018, 4:30pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.