How to convert factor data to numeric for an entire dataset

In general, I definitely concur with @felberr's advice to avoid factors when you read in data if you don't need them. But if you happen to have factors in your data that you need to convert, just remember that they behave a bit differently from other data types in R. In particular, when you convert factors to numeric variables, you might run into some surprising results, especially if your factor levels look like numbers. For example, this works as expected:

library(tidyverse)

one_three <- factor(1:3) %>% print()
#> [1] 1 2 3
#> Levels: 1 2 3
as.numeric(one_three)
#> [1] 1 2 3

But what if you try the same thing with this factor vector:

four_six <- factor(4:6) %>% print()
#> [1] 4 5 6
#> Levels: 4 5 6
as.numeric(four_six)
#> [1] 1 2 3

That's kind of confusing!
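
The reason is that a factor is stored as integer codes that index into its levels, and as.numeric() hands back those codes rather than the labels. Here is a minimal base R sketch that looks at both pieces (four_six is recreated so the snippet runs on its own):

```r
# A factor is a vector of integer codes plus a "levels" attribute
four_six <- factor(4:6)

# The labels, stored as character strings
levels(four_six)
#> [1] "4" "5" "6"

# The underlying integer codes, which is what as.numeric() returns
as.integer(four_six)
#> [1] 1 2 3
```

With factor(1:3) the codes and the labels happen to coincide, which is why the first example looked fine.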

So, one way around this is to first convert the factor to a character vector, and then to numeric. Doing this takes the factor levels (4, 5, 6) and turns them into a character vector ("4", "5", "6"). From there, you can convert those characters to numbers.

as.numeric(as.character(four_six))
#> [1] 4 5 6
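
One thing to watch for: this only gives sensible numbers when every level actually looks like a number. A small sketch with a made-up factor (mixed is a hypothetical example, not from the data above) shows that any other level becomes NA, normally with a "NAs introduced by coercion" warning, which is suppressed here to keep the output clean:

```r
# Hypothetical factor mixing numeric-looking and text levels
mixed <- factor(c("1", "2", "three"))

# "three" cannot be parsed as a number, so it becomes NA
converted <- suppressWarnings(as.numeric(as.character(mixed)))
converted
#> [1]  1  2 NA
```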

So, all this is a long way of saying that if you want to convert all factor variables in a data frame to numeric variables, this is a pretty good way of doing it:

df <- tibble(a = one_three, b = four_six, c = c("one", "two", "three")) %>% 
  print()
#> # A tibble: 3 x 3
#>   a     b     c    
#>   <fct> <fct> <chr>
#> 1 1     4     one  
#> 2 2     5     two  
#> 3 3     6     three

mutate_if(df, is.factor, ~ as.numeric(as.character(.x)))
#> # A tibble: 3 x 3
#>       a     b c    
#>   <dbl> <dbl> <chr>
#> 1     1     4 one  
#> 2     2     5 two  
#> 3     3     6 three
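
If you are on dplyr 1.0.0 or later, mutate_if() is superseded by across(), so the same conversion can probably be written as below. This is a sketch assuming a recent dplyr; the data frame is rebuilt so the snippet runs on its own, and df_num is just an illustrative name:

```r
library(dplyr)

# Same example data as above, rebuilt so this runs standalone
df <- tibble(
  a = factor(1:3),
  b = factor(4:6),
  c = c("one", "two", "three")
)

# Convert every factor column via character; other columns are left alone
df_num <- mutate(df, across(where(is.factor), ~ as.numeric(as.character(.x))))
```

The character column c is untouched because where(is.factor) only selects the factor columns.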

The Factors chapter of R for Data Science is a pretty good place to start digging deeper into factors.

ADDING:

From the ?factor documentation:

The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

So if you have a very large data frame, doing the following might be a little quicker:

mutate_if(df, is.factor, ~ as.numeric(levels(.x))[.x])
#> # A tibble: 3 x 3
#>       a     b c    
#>   <dbl> <dbl> <chr>
#> 1     1     4 one  
#> 2     2     5 two  
#> 3     3     6 three
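
If you want to convince yourself that the two approaches agree, and see how they compare on your own machine, something like this rough sketch should do. The big vector is made-up test data, and timings will vary, so none are shown:

```r
# Hypothetical large factor with numeric-looking levels
set.seed(1)
big <- factor(sample(c(0.5, 10, 250), size = 1e6, replace = TRUE))

# Convert only the (few) levels, then index by the factor's codes
via_levels <- as.numeric(levels(big))[big]

# Convert every element to character first, then to numeric
via_character <- as.numeric(as.character(big))

identical(via_levels, via_character)
#> [1] TRUE

# Rough timing comparison; results depend on your machine
system.time(as.numeric(levels(big))[big])
system.time(as.numeric(as.character(big)))
```

The levels-based version only has to parse each distinct level once, which is where any speedup would come from.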

Created on 2018-11-05 by the reprex package (v0.2.1)
