Leveraging labelled data in R - R Views Submission

Category: Entry Point into a Topic
Repo: https://github.com/shannonpileggi/pipinghotdata_distill/tree/master/_posts/2020-12-23-leveraging-labelled-data-in-r


Leveraging labelled data in R: Embracing SPSS, SAS, and Stata data sets with the haven, labelled, and sjlabelled packages

TL;DR

The haven, labelled, and sjlabelled packages can be used to effectively work
with SPSS, SAS, and Stata data sets in R through implementation of the haven_labelled class, which stores variable and value labels. Here are my most used functions for getting started with labelled data:

| Purpose | Function |
|---|---|
| 1. Import SPSS labelled data | haven::read_sav() |
| 2. Create data dictionary | labelled::generate_dictionary() |
| 3. Identify if variable is haven_labelled | haven::is.labelled() |
| 4. Convert haven_labelled variables to numeric | base::as.numeric() |
| 5. Convert haven_labelled variables to factors | haven::as_factor() |
| 6. Convert variable label to variable name | sjlabelled::label_to_colnames() |

Introduction

Labelled data traditionally, though not exclusively, arises in survey data. SAS, SPSS, and Stata have established infrastructures for labelled data, which consists of metadata in the form of variable and value labels. This post is for R users who already have an SPSS (.sav), SAS (.sas7bdat), or Stata (.dta) data file and want to incorporate the labelled data features into their R workflow. With R’s haven, labelled, and sjlabelled packages, you can leverage the inherent data labelling structure in these data sets to work interactively with variable and value labels, making it easier to navigate data while also allowing the user to convert metadata to data. This post discusses general characteristics of labelled data and practical tips for data analysis with labelled data.

YRBSS labelled data

The Youth Risk Behavior Surveillance System (YRBSS) is a publicly available data set from the Centers for Disease Control and Prevention (CDC) that "monitors health-related behaviors that contribute to the leading causes of death and disability among youth and adults." On Aug 9, 2020, I downloaded YRBSS materials from the CDC website. This site has both the 2017 national data (sadc_2017_national.dat) and the SPSS syntax to convert the .dat file to an SPSS labelled data file (2017_sadc_spss_input_program.sps). I do have an SPSS license, and I used the SPSS syntax to convert the .dat file to an SPSS labelled data file (sadc_2017_national.sav). As the .sav data file is not available on the CDC site, you can download the .sav data from my GitHub repo.

Getting started

This material was developed using:

| Software / package | Version |
|---|---|
| R | 4.0.5 |
| RStudio | 1.4.1103 |
| tidyverse | 1.3.1 |
| here | 0.1 |
| haven | 2.3.1 |
| labelled | 2.8.0 |
| sjlabelled | 1.1.7 |

library(tidyverse)  # general use ----
library(here)       # file paths  ----
library(haven)      # import .sav files ----  
library(labelled)   # tools for labelled data ----
library(sjlabelled) # more tools for labelled data ----

Importing labelled data

I use the haven package to import SPSS (.sav) data files.

# import data ----
dat_raw <- haven::read_sav(here::here( "_posts", "2020-12-23-leveraging-labelled-data-in-r", "data", "sadc_2017_national.sav"))

Variables in a data set have a class, which consists of assignments like numeric, character, and factor, among others. When labelled features are present, the haven package assigns a class of haven_labelled. This is important to know as many packages you work with may not have methods for haven_labelled objects.
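If you don't have the .sav file at hand, a small round trip can illustrate the same behavior. This is only a sketch under my own assumptions: the toy data frame and its values are made up, and only the q8 variable label and two of its value labels come from the YRBSS metadata described in this post. It writes a labelled column to a temporary .sav file and reads it back.

```r
library(haven)

# Hypothetical toy data (not the YRBSS file): one labelled column
toy <- tibble::tibble(
  q8 = haven::labelled(
    c(1, 5, 3),
    labels = c(Never = 1, Always = 5),
    label  = "Seat belt use"
  )
)

# Write to a temporary .sav file, then import it the same way as above
tmp <- tempfile(fileext = ".sav")
haven::write_sav(toy, tmp)
toy_in <- haven::read_sav(tmp)

# The imported column carries the haven_labelled class and its labels
class(toy_in$q8)
attr(toy_in$q8, "label")
attr(toy_in$q8, "labels")
```

Stata and SAS files have analogous readers in haven, haven::read_dta() and haven::read_sas(), though I have not compared their labelled output in detail here.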

When I first started working with SPSS data files, I also explored the foreign package, which precedes haven.
Using foreign takes a bit longer than haven, can result in truncation of long character variables, and produces a different labelled data structure compared to haven. I have a strong preference for the haven package.

Creating a data dictionary

A data dictionary contains metadata about your data. The labelled::generate_dictionary function
can be used to create a data dictionary, extracted straight from your data. The usefulness
of the data dictionary depends on the quality of your metadata.

# create data dictionary ----
dictionary <- labelled::generate_dictionary(dat_raw)

The result is a data frame in my R environment with the number of observations equal to number of variables in the original data set. I can interactively explore the dictionary in R to quickly find variables or documentation of interest. For example, I can find all variables related to "weapons" with a search in the viewer pane.
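The same kind of search can also be done in code rather than in the viewer. Here is a minimal sketch, assuming a made-up toy data frame in place of dat_raw (the variable label "Weapon carrying" and the values are hypothetical). It uses labelled::look_for(), which is the function behind generate_dictionary() and accepts search keywords.

```r
library(haven)
library(labelled)

# Hypothetical stand-in for dat_raw; labels and values are illustrative only
toy <- tibble::tibble(
  id  = 1:3,
  q8  = haven::labelled(c(1, 5, 5), labels = c(Never = 1, Always = 5),
                        label = "Seat belt use"),
  q12 = haven::labelled(c(1, 2, 1), labels = c("0 days" = 1, "1 day" = 2),
                        label = "Weapon carrying")
)

# Full dictionary: one row per variable in the data
toy_dictionary <- labelled::generate_dictionary(toy)
toy_dictionary

# Keyword search across variable names and labels (case-insensitive by default)
weapon_vars <- labelled::look_for(toy, "weapon")
weapon_vars$variable
```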

Identifying labelled features

Standard data consists of variables (e.g., country) and values (e.g. US, UK, CA). When working with labelled data, variables and values each have two features. Variables consist of a name and a label; values consist of a code and a label. For example, here are the features of the q8 variable.

| Feature | Assignment |
|---|---|
| Variable name | q8 |
| Variable label | Seat belt use |
| Value codes | 1, 2, 3, 4, 5 |
| Value labels | Never, Rarely, Sometimes, Most of the time, Always |

You can see this information in the data dictionary - here is a snippet of the dictionary for three variables. The value_labels field combines the value codes and value labels.

dictionary %>% 
  dplyr::filter(variable %in% c("q8", "q11", "q12")) %>% 
  dplyr::select(variable, label, value_labels) 

To dive a bit deeper, you can see the class of the q8 variable:

dat_raw %>% 
  dplyr::pull(q8) %>% 
  class(.)

and how the metadata of q8 is stored.

dat_raw %>% 
  dplyr::select(q8) %>% 
  str(.)

You don't need to get into the weeds of this to work effectively with labelled data, but knowing this can help troubleshoot errors.
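If you do want to look under the hood, the labelled features can be pulled out one at a time. This sketch rebuilds a q8-like vector by hand (the three data values are made up; the labels follow the q8 table above) and extracts the variable label and value labels with labelled::var_label() and labelled::val_labels().

```r
library(haven)
library(labelled)

# Hand-built vector mimicking q8; the data values are hypothetical
q8_toy <- haven::labelled(
  c(1, 5, 3),
  labels = c(
    Never = 1, Rarely = 2, Sometimes = 3,
    "Most of the time" = 4, Always = 5
  ),
  label = "Seat belt use"
)

# Variable label: a single string
q8_var_label <- labelled::var_label(q8_toy)

# Value labels: a named vector whose names are labels and whose values are codes
q8_val_labels <- labelled::val_labels(q8_toy)

q8_var_label
q8_val_labels
```

Both pieces of metadata are stored as attributes on the vector, which is what str() is displaying.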

Viewing labelled features

Beyond the dictionary, labelled features can also be seen when working with your data interactively. The console simultaneously prints value codes and labels side by side, with the code first followed by the label in brackets.

dat_raw %>% 
  dplyr::select(q8, q11, q12) 

Sometimes the alignment throws me a bit when I am reading this as the value codes and labels are left aligned, which places the value codes associated with q12 closer to q11.

When viewing the data frame in RStudio, the data frame displays the variable label under the variable name; however, only value codes (and not value labels) are displayed.

Common operations

I primarily use three packages for working with labelled data: haven, labelled, and sjlabelled. These three packages do have some overlap in functionality, in addition to naming schemes that differ but achieve the same objective (e.g., haven::as_factor vs sjlabelled::as_label), or naming schemes that are the same but achieve different objectives (e.g., haven::as_factor vs sjlabelled::as_factor). 😬 To compound confusion, the concept of a label can refer to either variable or value labels. Frequently, plural function names refer to value labels, as in haven::zap_labels or labelled::remove_val_labels.

Here are operations I commonly perform on labelled data:

  1. Evaluate if variable is of class haven_labelled.

    • Why? Troubleshooting, exploring, mutating.

    • Function(s): haven::is.labelled()

  2. Convert haven_labelled variable to numeric value codes.

    • Why? To treat the variable as continuous for analysis. For example, if a 1-7 rating scale imports as labelled and you want to compute a mean.

    • Function(s): base::as.numeric() (strips variable of all metadata); haven::zap_labels() and labelled::remove_val_labels() (remove value labels, retain other metadata)

  3. Convert haven_labelled variable to factor with value labels.

    • Why? To treat the variable as categorical for analysis.

    • Function(s): haven::as_factor(), labelled::to_factor(), sjlabelled::as_label(). As far as I can tell, these three functions have the same result. By default, the factor levels are ordered by value codes.

  4. Convert variable label to variable name.

    • Why? For more informative or readable variable names.

    • Function(s): sjlabelled::label_to_colnames()
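The four operations above can be sketched on a tiny made-up example. The column below is hypothetical (its variable label "Weapon carrying" and the "1 day" value label are my own placeholders, not the YRBSS metadata); the functions are the ones listed above.

```r
library(haven)
library(labelled)
library(sjlabelled)

# Hypothetical one-column data frame with a labelled variable
toy <- tibble::tibble(
  q12 = haven::labelled(
    c(1, 2, 2),
    labels = c("0 days" = 1, "1 day" = 2),
    label  = "Weapon carrying"
  )
)

# 1. Evaluate if the variable is haven_labelled
haven::is.labelled(toy$q12)

# 2. Convert to numeric value codes
codes  <- as.numeric(toy$q12)         # plain numeric vector
zapped <- haven::zap_labels(toy$q12)  # value labels dropped, variable label kept

# 3. Convert to a factor whose levels are the value labels
as_fct <- haven::as_factor(toy$q12)
levels(as_fct)

# 4. Convert the variable label to the variable name
toy_renamed <- sjlabelled::label_to_colnames(toy)
names(toy_renamed)
```

I call every function with its package prefix because loading labelled and sjlabelled together masks some function names.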

Example

For this example, I reduce the data set to 2017 records only and select three variables related to carrying weapons and safety, all of which are measured on the same scale.

# retain info on weapons and safety for 2017 ----
dat_2017 <- dat_raw %>% 
  dplyr::filter(year == 2017) %>% 
  dplyr::select(record, q12, q13, q15) 

# preview data ----
dat_2017

This code produces a bar plot showing the frequencies of the three variables from data as imported, displaying variable names and value codes.

# bar plot 1 ----
dat_2017 %>% 
   pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "days"
  ) %>% 
  count(variable, days) %>% 
  # include factor(days) to correctly show value codes in ggplot ----
  ggplot(aes(x = n, y = factor(days))) +
  facet_wrap(. ~ variable) +
  geom_col() 

Now I add two lines of code to implement two changes - convert the variables to factors and convert the variable labels to variable names. This plot displays variable labels and value labels, producing a more informative figure.

# bar plot 2 ----
dat_2017 %>% 
  # --------------------------------------------------------------
  # change 1: convert haven_labelled variables to factors ----
  mutate_if(haven::is.labelled, haven::as_factor) %>% 
  # change 2: convert variable labels to variable names ----
  sjlabelled::label_to_colnames() %>% 
  # --------------------------------------------------------------
  pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "days"
  ) %>% 
  count(variable, days) %>% 
  # unnecessary to include factor(days) here as was already converted in change 1 ----
  ggplot(aes(x = n, y = days)) +
  facet_wrap(. ~ variable) +
  geom_col() 

Other packages and haven_labelled objects

It is probably safe to assume that most packages you work with don't know how to handle the haven_labelled class - if the package does produce a result, it is likely making an educated guess which may not be in line with your needs.

For example, in using ggplot in the first figure above, I included the line y = factor(days); if instead I had y = days, ggplot yields the following message:

Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.

Treating the days variable as continuous resulted in an uninformative plot (not shown), which was corrected by converting the variable to factor.

What about other packages I use? In skimr 2.1.3 haven_labelled inputs result in value codes treated as numeric values. In gtsummary ≥1.4.0 the value labels of haven_labelled variables are ignored and the underlying values are shown; however, a helpful message is printed with instructions to convert or remove the value labels. In general, you will probably find a mix of messages, warnings, errors, omissions, or guessing when using haven_labelled variables with other packages.
These issues can be resolved by converting the haven_labelled variables to numeric or factor, depending on the context.
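As a rough illustration of converting by context, the sketch below uses a made-up data frame with one rating-style variable and one categorical variable (all names, labels, and values here are hypothetical). The rating is converted to numeric so a mean makes sense, and every remaining haven_labelled column is converted to a factor before being handed to other functions.

```r
library(dplyr)
library(haven)

# Hypothetical data: a 1-7 rating and a categorical item, both labelled
toy <- tibble::tibble(
  record = 1:4,
  rating = haven::labelled(c(1, 7, 4, 4), labels = c(Low = 1, High = 7),
                           label = "Hypothetical 1-7 rating"),
  q12    = haven::labelled(c(1, 2, 1, 1), labels = c("0 days" = 1, "1 day" = 2),
                           label = "Weapon carrying")
)

toy_converted <- toy %>%
  # continuous context: keep the value codes as plain numbers
  mutate(rating = as.numeric(rating)) %>%
  # categorical context: convert any remaining labelled columns to factors
  mutate(across(where(haven::is.labelled), haven::as_factor))

mean(toy_converted$rating)
table(toy_converted$q12)
```

After this step no column is haven_labelled, so downstream packages see only ordinary numeric and factor columns.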

Note: Although gtsummary does not currently support value labels, it does support variable labels! See this tweet for a quick demo, and the Polished summary tables in R with gtsummary blog post for more information.

Workflow for labelled data manipulation

When converting haven_labelled objects to factor or numeric, be intentional about
where the conversion happens in your workflow. The Introduction to labelled vignette by Joseph Larmarange outlines two different approaches:

  1. First convert haven_labelled variables; second perform data manipulation utilizing value labels (if factor).

  2. First perform data manipulation utilizing value codes; second convert haven_labelled variables.

For me this question usually distills down to: for data manipulation, are the value codes or the value labels easier to work with? Sometimes the brevity of the value code helps (e.g., q12 == 1), whereas other times the context of the value label makes the code more readable (e.g., q12 == "0 days"). Note that the placement of the conversion can have downstream effects on your code.
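To make the two approaches concrete, here is a minimal sketch on a made-up data frame (apart from "0 days", the value labels and the variable label are hypothetical). Both versions keep the same records; they differ only in whether the filter is written against the value label or the value code.

```r
library(dplyr)
library(haven)

# Hypothetical data; only the "0 days" label is taken from the post
toy <- tibble::tibble(
  record = 1:4,
  q12 = haven::labelled(
    c(1, 2, 1, 3),
    labels = c("0 days" = 1, "1 day" = 2, "2 or 3 days" = 3),
    label  = "Weapon carrying"
  )
)

# Approach 1: convert first, then manipulate using value labels
approach_1 <- toy %>%
  mutate(q12 = haven::as_factor(q12)) %>%
  filter(q12 == "0 days")

# Approach 2: manipulate using value codes, then convert
approach_2 <- toy %>%
  filter(q12 == 1) %>%
  mutate(q12 = haven::as_factor(q12))

# Same records either way
identical(approach_1$record, approach_2$record)
```

If the conversion in approach 1 were moved after the filter without also changing "0 days" to 1, the comparison would no longer be against the value the column actually stores, which is the kind of downstream effect to watch for.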

Summary

The haven, labelled, and sjlabelled packages create new structures and workflows for labelled data that allow you to harness the power of R while still honoring the valuable metadata framework that exists in SPSS, SAS, and Stata data sets. The functions discussed in this post cover most of my daily needs with labelled data; if you want to do more, next steps might include handling specific types of coded missing data or creating labelled data within R.

Acknowledgments

Thanks to Daniel Sjoberg for the gentle nudge to update this post.


This is a submission to the R Views Call for Documentation. For more information see rviews.rstudio.com.

Hello @shannon.pileggi !

This is a fantastic resource, thank you! I frequently get questions about gtsummary and haven labelled data, and I'll be pointing them here!

As of gtsummary v1.4.0 (released earlier this year), we made some changes to how the haven labelled class is handled.

  1. The variable labels are used.
  2. The value labels are ignored, and the underlying class of the variable is used to determine the summary type and display values.
  3. We print a message when we encounter haven labelled data with instructions on how to convert the class for analysis. We also provide links for further reading.
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.2'

df <-
  tibble::tibble(
    v = labelled::labelled(c(1, 2, 2, 2, 9, 1, 2),
                           c(yes = 1, no = 2, "don't know" = 8, refused = 9))
  )

df %>%
  tbl_summary() %>%
  as_kable()
#> i Column(s) v are class "haven_labelled". This is an intermediate data
#>    structure not meant for analysis. Convert columns with `haven::as_factor()`, 
#>    `labelled::to_factor()`, `labelled::unlabelled()`, and `unclass()`. 
#>    "haven_labelled" value labels are ignored when columns are not converted. 
#>    Failure to convert may have unintended consequences or result in error.
#> * https://haven.tidyverse.org/articles/semantics.html
#> * https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled

| Characteristic | N = 7 |
|---|---|
| v | |
| 1 | 2 (29%) |
| 2 | 4 (57%) |
| 9 | 1 (14%) |

df %>%
  haven::as_factor() %>%
  tbl_summary()%>%
  as_kable()

| Characteristic | N = 7 |
|---|---|
| v | |
| yes | 2 (29%) |
| no | 4 (57%) |
| don’t know | 0 (0%) |
| refused | 1 (14%) |

Created on 2021-09-13 by the reprex package (v2.0.1)

Hi @statistishdan! Thanks for calling out these updates - excellent point that this post was written a few months ago and should be updated prior to submission. I'll get on it!

I make too many updates! (or at least too many to keep up on them all!)

Hi @EconomiCurtis! Can you remind me how to edit this submission? I made some minor tweaks in line with Daniel's comments, and it isn't clear to me how to update the body of the text.

I tried clicking on the pencil here, but I don't think it was letting me in the body of the text.

You can edit your thread via the pencil icon at the bottom of the body of your post.

Please let me know if you have any issues editing this.

Thank you for this question, as it points out an error in my editing instructions! There are in fact two edit pencil icons on the page. The one you refer to only lets you edit the title, tags, and category, whereas the other lets you edit the full post and is confusingly hidden below the post.

Super, thanks for the clarification @EconomiCurtis! I was able to make the edits. :grinning: