Labeling with data map/codebook

I work with survey data that comes in flat .csv files and is sent to us from our vendor with a separate data map/codebook to be able to ensure we label our survey question response punches (integers) properly to match up to the response wording from the survey.

I'm looking for a way to apply these question response labels to the integer punches without manually recoding. We typically handle this recoding in sql, but I'd like to handle this and some other typical tasks and output in a more automated fashion.

What do the codebooks look like? What is their format? Perhaps give a small snippet of the codebook and relevant columns.

Here's something I thought of but there's probably something much more elegant.

# You would import the codebook using read_csv or similar function and not type out
library(tidyverse)
codebook_in <- tibble(question=c("q1", rep(NA_character_, 5), "q2", rep(NA_character_, 3)),
                      wording=c("Wording for q1", rep(NA_character_, 5), "Wording for q2", rep(NA_character_, 3)),
                      punches=c(1:5, 99, 1:3, 99),
                      labels=c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree", "Refused", "Yes", "No", "Not sure", "Refused"))

codebook_in
#> # A tibble: 10 x 4
#>    question wording        punches labels           
#>    <chr>    <chr>            <dbl> <chr>            
#>  1 q1       Wording for q1       1 Strongly disagree
#>  2 <NA>     <NA>                 2 Disagree         
#>  3 <NA>     <NA>                 3 Neutral          
#>  4 <NA>     <NA>                 4 Agree            
#>  5 <NA>     <NA>                 5 Strongly agree   
#>  6 <NA>     <NA>                99 Refused          
#>  7 q2       Wording for q2       1 Yes              
#>  8 <NA>     <NA>                 2 No               
#>  9 <NA>     <NA>                 3 Not sure         
#> 10 <NA>     <NA>                99 Refused

codebook <- codebook_in %>%
  fill(question, wording)

set.seed(200516)
fakedata <- tibble(q1=sample(c(1:5, 99), 1000, replace=TRUE),
                   q2=sample(c(1:3, 99), 1000, replace=TRUE))

fakedata
#> # A tibble: 1,000 x 2
#>       q1    q2
#>    <dbl> <dbl>
#>  1     1     2
#>  2     5     3
#>  3     1     3
#>  4     1     1
#>  5     1     2
#>  6    99     3
#>  7     5    99
#>  8     2     1
#>  9     5     2
#> 10    99     3
#> # … with 990 more rows

newfakedata <- fakedata
for (i in unique(codebook$question)){
  codev <- codebook %>% filter(question==i) 
  newfakedata[, i] <- factor(pull(newfakedata, i), levels=codev$punches, labels=codev$labels)
}

newfakedata
#> # A tibble: 1,000 x 2
#>    q1                q2      
#>    <fct>             <fct>   
#>  1 Strongly disagree No      
#>  2 Strongly agree    Not sure
#>  3 Strongly disagree Not sure
#>  4 Strongly disagree Yes     
#>  5 Strongly disagree No      
#>  6 Refused           Not sure
#>  7 Strongly agree    Refused 
#>  8 Disagree          Yes     
#>  9 Strongly agree    No      
#> 10 Refused           Not sure
#> # … with 990 more rows

Created on 2020-05-16 by the reprex package (v0.3.0)

It looks like you could do this with some reshaping and a join. Here's an example, using @StatSteph's data samples:

library(tidyverse)

fakedata %>% 
  mutate(id=seq(n())) %>% 
  pivot_longer(cols=-id, names_to="question", values_to="punches") %>% 
  left_join(
    codebook_in %>% fill(question, wording)
  ) %>% 
  select(id, question, labels) %>% 
  pivot_wider(names_from=question, values_from=labels)
      id q1                q2      
   <int> <chr>             <chr>   
 1     1 Strongly disagree No      
 2     2 Strongly agree    Not sure
 3     3 Strongly disagree Not sure
 4     4 Strongly disagree Yes     
 5     5 Strongly disagree No      
 6     6 Refused           Not sure
 7     7 Strongly agree    Refused 
 8     8 Disagree          Yes     
 9     9 Strongly agree    No      
10    10 Refused           Not sure
# … with 990 more rows
1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Current format is a csv with the question name, question wording, punches/values, and labels.

Screen Shot 2020-05-16 at 3.29.09 PM

The actual data files have the question name and punches/values. So I would need to join these two matching up the question name to get the labels.

That said, if there's a format for the codebook that would facilitate automated labeling, would definitely be up to reformatting them before feeding them into the r script.