I am working with an ICPSR dataset in an RDA file. The dataset was downloaded from this link, which also provides the conversion code: https://www.icpsr.umich.edu/web/ICPSR/studies/36346#. It has around 2,600 variables (demographic, health, etc.) for around 4,000 subjects.
They have provided code to convert the variables from factor to numeric:
"factor_to_numeric_icpsr.R
2012/12/06
Convert R factor variable back to numeric in an ICPSR-produced R data
frame. This works because the original numeric codes were prepended by
ICPSR to the factor levels in the process of converting the original
numeric categorical variable to factor during R data frame generation.
REQUIRES add.value.labels function from prettyR package
http://cran.r-project.org/web/packages/prettyR/index.html
Substitute the actual variable and data frame names for da99999.0001$MYVAR
placeholders in syntax below.
data frame = da99999.0001
variable = MYVAR
Line-by-line comments:
(1) Load prettyR package
(2) Create object (lbls) containing the factor levels for the specified
variable. Sort will be numeric as original codes (zero-padded, if
necessary) were preserved in the factor levels.
(3) Strip original codes from lbls, leaving only the value labels, e.g.,
"(01) STRONGLY DISAGREE" becomes "STRONGLY DISAGREE"
(4) Strip labels from data, leaving only the original codes, e.g.,
"(01) STRONGLY DISAGREE" becomes "1". Then, coerce variable to numeric.
(5) Add value labels, making this a named numeric vector"
library(prettyR)
lbls <- sort(levels(da99999.0001$MYVAR))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
da99999.0001$MYVAR <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", da99999.0001$MYVAR))
da99999.0001$MYVAR <- add.value.labels(da99999.0001$MYVAR, lbls)
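To make steps (2) through (4) concrete, here is a minimal sketch on a toy factor. The values are invented for illustration and only mimic the "(code) LABEL" format described in the comments above; step (5) is left out so the snippet runs without prettyR.

```r
# Toy factor in the ICPSR style "(code) LABEL" (values invented for illustration)
x <- factor(c("(01) STRONGLY DISAGREE", "(02) DISAGREE", "(01) STRONGLY DISAGREE"))

# Steps (2)-(3): sorted factor levels with the "(code) " prefix stripped
lbls <- sub("^\\([0-9]+\\) +(.+$)", "\\1", sort(levels(x)))

# Step (4): keep only the code, drop leading zeros, then coerce to numeric
codes <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", x))

print(lbls)
print(codes)
```

Here lbls holds the bare labels and codes holds the original numeric codes, one per observation.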
I am trying to figure out a way to do this for ALL the variables in the dataframe, but I haven't been able to. Is there a way to specify all variables instead of a single variable ("MYVAR")? I don't want to have to repeat this procedure for every variable, because there are 2613.
For example, I can successfully use this to convert the single variable "C1PAA2J" from factor to numeric:
class(df$C1PAA2J)
'factor'
lbls <- sort(levels(df$C1PAA2J))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
df$C1PAA2J <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", df$C1PAA2J))
df$C1PAA2J <- add.value.labels(df$C1PAA2J, lbls)
class(df$C1PAA2J)
'numeric'
Great, that works for a single variable! But all 2613 variables need to be converted to numeric. How can I apply this syntax to all of them at once?
I tried this:
lbls <- sort(levels(df[,1:2613]))
lbls <- (sub("^\\([0-9]+\\) +(.+$)", "\\1", lbls))
df[,1:2613] <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", df[,1:2613]))
df[,1:2613] <- add.value.labels(df[,1:2613], lbls)
But when I do this, I get the warning "NAs introduced by coercion" and ALL values for ALL variables become NA! The entire dataset is wiped.
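A likely explanation, sketched on a toy data frame (the column names and values are invented): levels() and sub() expect a single vector, so passing a whole data frame does not work column by column. levels() finds no levels attribute, and sub() first coerces the data frame with as.character(), which yields one string per column instead of one per cell.

```r
# Toy stand-in for the ICPSR data frame (names and values are invented)
toy <- data.frame(
  q1 = factor(c("(01) STRONGLY DISAGREE", "(02) DISAGREE", "(01) STRONGLY DISAGREE")),
  q2 = factor(c("(0) NO", "(1) YES", "(1) YES"))
)

# A data frame has no factor levels of its own, so this is NULL
no_levels <- is.null(levels(toy))

# sub() coerces the whole data frame with as.character(), giving one string
# per column rather than one string per cell
flattened <- sub("^\\(0*([0-9]+)\\).+$", "\\1", toy)

# Those strings do not match the "(code) LABEL" pattern and are not numbers,
# so the coercion yields only NA (warning suppressed here)
as_num <- suppressWarnings(as.numeric(flattened))
```

Assigning that short vector of NA values back into the data frame would then overwrite the selected columns with NA, which matches the behaviour described above.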
The code works for individual variables. I'm just having trouble applying this to every variable in my dataframe. Thank you so much.
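One possible approach, shown as a sketch rather than a tested solution for the real file, is to wrap the per-variable step in a function and apply it to each factor column with lapply(). The helper name icpsr_factor_to_numeric and the toy data are hypothetical. It assumes every factor level starts with a "(code)" prefix; levels without one would become NA with a warning. Columns that are not factors are left untouched.

```r
# Hypothetical helper: strip the label, keep the numeric code (ICPSR step 4)
icpsr_factor_to_numeric <- function(x) {
  as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", as.character(x)))
  # Optional, as in ICPSR's step (5), if prettyR is installed and the labels
  # have been extracted from levels(x) beforehand:
  #   prettyR::add.value.labels(converted_values, lbls)
}

# Toy stand-in for the real data frame (names and values are invented)
toy <- data.frame(
  id = 1:3,
  q1 = factor(c("(01) STRONGLY DISAGREE", "(02) DISAGREE", "(01) STRONGLY DISAGREE")),
  q2 = factor(c("(0) NO", "(1) YES", NA))
)

# Convert only the factor columns, one column at a time
is_fac <- vapply(toy, is.factor, logical(1))
toy[is_fac] <- lapply(toy[is_fac], icpsr_factor_to_numeric)

str(toy)
```

On the real data, the same two lines would be run with df in place of toy. Selecting columns by is.factor() rather than by position 1:2613 avoids touching variables that are already numeric.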