R: Factor to numeric conversion

simulationtrader · June 17, 2020, 11:59pm

I am working with an ICPSR dataset distributed as an RDA file. The download, which includes the conversion code, is available here: https://www.icpsr.umich.edu/web/ICPSR/studies/36346#. The dataset has about 2,600 variables (demographic, health, etc.) for about 4,000 subjects.

They have provided code to convert the variables from factor to numeric:

"factor_to_numeric_icpsr.R
2012/12/06

Convert R factor variable back to numeric in an ICPSR-produced R data
frame. This works because the original numeric codes were prepended by
ICPSR to the factor levels in the process of converting the original
numeric categorical variable to factor during R data frame generation.

REQUIRES add.value.labels function from prettyR package
http://cran.r-project.org/web/packages/prettyR/index.html

Substitute the actual variable and data frame names for da99999.0001$MYVAR
placeholders in syntax below.

 data frame = da99999.0001
 variable   = MYVAR

Line-by-line comments:

(1) Load prettyR package

(2) Create object (lbls) containing the factor levels for the specified
variable. Sort will be numeric as original codes (zero-padded, if
necessary) were preserved in the factor levels.

(3) Strip original codes from lbls, leaving only the value labels, e.g.,
"(01) STRONGLY DISAGREE" becomes "STRONGLY DISAGREE"

(4) Strip labels from data, leaving only the original codes, e.g.,
"(01) STRONGLY DISAGREE" becomes "1". Then, coerce variable to numeric.

(5) Add value labels, making this a named numeric vector"

library(prettyR)
lbls <- sort(levels(da99999.0001$MYVAR))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
da99999.0001$MYVAR <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", da99999.0001$MYVAR))
da99999.0001$MYVAR <- add.value.labels(da99999.0001$MYVAR, lbls)
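
To see what steps (2) to (4) do, here is a small base R sketch on a made-up factor. The variable names (`x`, `lvls`, `labels_only`, `codes`) and the data are illustrative assumptions, not part of the ICPSR script, and step (5) is left out so the example runs without prettyR.

```r
# Toy factor in the ICPSR "(code) LABEL" style; the values are made up for illustration
x <- factor(c("(01) STRONGLY DISAGREE", "(02) DISAGREE", "(01) STRONGLY DISAGREE"))

# Step 2: take the factor levels, sorted (zero-padded codes sort correctly as text)
lvls <- sort(levels(x))

# Step 3: drop the "(code) " prefix, leaving only the label text
labels_only <- sub("^\\([0-9]+\\) +(.+$)", "\\1", lvls)

# Step 4: keep only the code (leading zeros dropped) and coerce to numeric.
# as.character() is written out here; sub() performs the same coercion itself.
codes <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", as.character(x)))

print(labels_only)
print(codes)
```

The first regular expression captures everything after the parenthesised code, while the second captures the digits inside the parentheses after skipping leading zeros.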

I am trying to figure out a way to do this for ALL the variables in the dataframe, but I haven't been able to. Is there a way to specify all variables instead of a single variable ("MYVAR")? I don't want to have to repeat this procedure for every variable, because there are 2613.

For example, I can successfully use this to convert the single variable "C1PAA2J" from factor to numeric:

class(df$C1PAA2J)
'factor'

lbls <- sort(levels(df$C1PAA2J))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
df$C1PAA2J <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", df$C1PAA2J))
df$C1PAA2J <- add.value.labels(df$C1PAA2J, lbls)

class(df$C1PAA2J)
'numeric'

Great, that works for one single variable! But there are 2613 variables that all need to be converted to numeric. How can I convert all of them to numeric using this syntax?

I tried this:

lbls <- sort(levels(df[,1:2613]))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
df[,1:2613] <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", df[,1:2613]))
df[,1:2613] <- add.value.labels(df[,1:2613], lbls)

But when I do this, I get the warning "NAs introduced by coercion" and ALL values for ALL variables become NA! The dataset is completely wiped and replaced with NA.
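
A likely explanation for the all-NA result, sketched on a made-up two-column data frame (`toy` and the other names are hypothetical): `levels()` and `sub()` expect a single vector, so passing a whole data frame does not work column by column.

```r
# Two toy factor columns in the ICPSR style (made-up data)
toy <- data.frame(
  a = factor(c("(01) YES", "(02) NO")),
  b = factor(c("(02) NO", "(02) NO"))
)

# A data frame has no factor levels of its own, so this is NULL
whole_levels <- levels(toy)

# sub() calls as.character() on the data frame, which produces one deparsed
# string per column rather than one string per cell; none of them matches
# the "(code) LABEL" pattern, so they pass through unchanged
per_column <- sub("^\\(0*([0-9]+)\\).+$", "\\1", toy)

# Coercing those strings to numeric gives NA for every column (with a warning)
coerced <- suppressWarnings(as.numeric(per_column))

print(whole_levels)
print(length(per_column))
print(coerced)
```

Assigning that short vector of NAs back into the data frame is what would wipe every column, which is why the conversion has to be applied one column at a time.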

The code works for individual variables. I'm just having trouble applying this to every variable in my dataframe. Thank you so much.

woodward · June 18, 2020, 2:52am

If you were using tidyverse and dplyr you could do it with mutate_at.

Otherwise you could just write a loop.

Or you could write a function to do it for one column and then use lapply.

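library(prettyR)  # provides add.value.labels()

# Convert one ICPSR-style factor column to a labelled numeric vector
myfunc <- function(x){
  lbls <- sort(levels(x))
  lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
  x <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", x))
  x <- add.value.labels(x, lbls)
  x
}

# Assign into df[] so the result stays a data frame; lapply() alone returns a list
df[] <- lapply(df, myfunc)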
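
One detail worth checking with this approach: `lapply()` returns a plain list, so the result has to be assigned back into the data frame's columns to keep the data frame structure. Below is a minimal base R sketch on made-up data; `toy` and the helper `code_only` are hypothetical, and the helper only extracts the numeric codes (no value labels), so it runs without prettyR.

```r
# Made-up data: two ICPSR-style factor columns plus one column that is already numeric
toy <- data.frame(
  q1  = factor(c("(01) YES", "(02) NO", "(01) YES")),
  q2  = factor(c("(10) OFTEN", "(02) RARELY", "(10) OFTEN")),
  age = c(34, 51, 47)
)

# Hypothetical helper: extracts the numeric code only, without value labels
code_only <- function(x) {
  as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", as.character(x)))
}

# lapply() on its own returns a plain list, not a data frame
as_list <- lapply(toy, code_only)

# Restrict to factor columns and assign into toy[...] to keep the data frame
is_fac <- vapply(toy, is.factor, logical(1))
toy[is_fac] <- lapply(toy[is_fac], code_only)

str(toy)
```

Selecting the factor columns first means columns that are already numeric are left untouched.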

simulationtrader · June 20, 2020, 7:59pm

Dear Woodward,

Thank you so much for your kind and generous reply. I am looking into using mutate_at and will spend time trying this out.

I tried your solution with lapply, and it seems to work for many iterations, but it eventually hits an error:
"Error in names(attr(x, "value.labels")) <- value.labels :
'names' attribute [1] must be the same length as the vector [0]
In addition: Warning message:
In FUN(X[[i]], ...) :"

Thank you so much for your help, I will keep working at it.

woodward · June 21, 2020, 7:41pm

You might need to add some error checking: it seems that not all of the data is formatted as you expected.
For example:

library(prettyR)

myfunc <- function(x){
  lbls <- sort(levels(x))
  lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
  x <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", x))
  x <- add.value.labels(x, lbls)
  print(x)
}

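# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- data.frame(a = "(01) STRONGLY DISAGREE", b = "STRONGLY DISAGREE", stringsAsFactors = TRUE)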

lapply(df, myfunc)
#> [1] 1
#> attr(,"value.labels")
#> STRONGLY DISAGREE 
#>                 1
#> Warning in FUN(X[[i]], ...): NAs introduced by coercion
#> More value labels than values, only the first 0 will be used
#> Error in names(attr(x, "value.labels")) <- value.labels: 'names' attribute [1] must be the same length as the vector [0]

Created on 2020-06-22 by the reprex package (v0.3.0)
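
One way the error checking could look is sketched below. This is an assumption-laden illustration rather than the ICPSR method: `safe_convert` is a hypothetical helper, it maps each factor level directly to its code instead of relying on sort order, and it stores the labels in a `"value.labels"` attribute as a base R stand-in for `add.value.labels()` so that the example runs without prettyR.

```r
# Hypothetical defensive converter: only touches factors whose every level
# looks like "(code) LABEL"; anything else is returned unchanged
safe_convert <- function(x) {
  pattern <- "^\\(0*([0-9]+)\\) +(.+)$"
  if (!is.factor(x)) return(x)
  lv <- levels(x)
  if (length(lv) == 0 || !all(grepl(pattern, lv))) return(x)
  codes <- as.numeric(sub(pattern, "\\1", lv))  # numeric code for each level
  labs  <- sub(pattern, "\\2", lv)              # label text for each level
  out <- codes[as.integer(x)]                   # map each value to its code
  # Base R stand-in for add.value.labels(): a named vector of codes
  attr(out, "value.labels") <- setNames(codes, labs)
  out
}

# Made-up data: column a follows the expected format, column b does not
toy <- data.frame(
  a = factor(c("(01) STRONGLY DISAGREE", "(02) DISAGREE")),
  b = factor(c("STRONGLY DISAGREE", "DISAGREE"))
)
toy[] <- lapply(toy, safe_convert)

# Columns still stored as factors are the ones that need a closer look
skipped <- names(toy)[vapply(toy, is.factor, logical(1))]
print(skipped)
```

Columns that do not match the expected format are left as they were, so they can be listed and inspected by hand instead of stopping the whole run.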
