The best way to attach labels to numeric variables

ktanizar · July 10, 2018, 6:29am

I need to read in a large dataset in fixed-column format. All variables are numeric and have their corresponding labels. For example, flavor will be read in as 1, 2, 3, 4, 5, ....Each has corresponding descriptions: chocolate, vanilla, strawberry, etc.

What is the most efficient way to do so?

What I've done: - read in the dataset using read_fwf from the readr package. - read in each variable using factor, e.g.:

x$INCOME <- factor(as.numeric(x$INCOME),
          levels = c(
            '1', 
            '2', 
            '3', 
            '4'), 
          labels = c(
            'Under $5,000', 
            '$5,000-$7,999', 
            '$8,000-$9,999', 
            '$10,000+'))

QUESTION: What is the best way to do this when the variables have ~ 100K values and there are many variables to read in? Factor works but the script gets really long and it takes a very long time to read the data in.

Thanks!

mara · July 10, 2018, 10:54am

Using the forcats package you can get the unique factor levels for a set of observations by converting to factor after you've read in the data with as_factor().

Once you've read your data in, you can recode the factors with their labels, if you so choose (see thread below).

Another option would be to use a "lookup table," which is nicely described in this thread (although, personally, I tend not to do that—I don't actually have a good reason, though):

nwerth · July 10, 2018, 1:46pm

You don't need to convert the columns to numeric before converting to factor. The factor() function will convert it back to character, anyway.

This is what I prefer when there are a lot of factor columns and/or levels. To reduce the amount of basic data in the code, you could save each column's lookup table in a separate CSV. This has the benefits of being (1) language agnostic and (2) easy to edit in Excel (it has to be good for something).

For example, suppose you had lookup table file named income_levels.csv like this:

level,label
"1","Under $5,000"
"2","$5,000-$7,999"
"3","$8,000-$9,999"
"4","$10,000+"

Then you can use that in the code like this:

library(readr)

income_levels <- read_csv("income_levels.csv")
x[["INCOME"]] <- factor(x[["INCOME"]], income_levels[["level"]], income_levels[["label"]])

Even when there are a lot of factor columns, I still prefer a separate recode file for each. That makes it easy for me to keep track of things. If I need to edit a file, it should be easy. The code can rearrange it into the "optimal" shape.

library(dplyr)
library(readr)

factor_columns <- c("INCOME", "RACE", "SEX", "EDUCATION")
level_files <- paste0("level_files/", factor_columns, "_levels.csv")
mappings <- lapply(level_files, read_csv)
for (fc in factor_columns) {
  fc_map <- mappings[[fc]]
  x[[fc]] <- factor(x[[fc]], fc_map[["level"]], fc_map[["label"]])
}

Not counting the library calls, that's 7 lines of code for handling four columns. Expanding it to 20 columns will only add two or three extra lines for including them in factor_columns. Sure, you'll need to manually create the level/label mappings, but that's unavoidable.*

* It's totally avoidable if the mappings are defined in an HTML or text file. You can use string-processing to extract and arrange the information you want. Just make sure to save the results instead of any fancy regex code which generates them. It's easier to clean up plain text than finely tune regex. Save the regex part as a separate script.