Regarding assigning numbers to categorical variables, e.g. the Iris dataset


#1

Hello friends,
I have taken the iris dataset as an example, as the target variable is a categorical variable with 3 categories:

  1. Setosa
  2. Versicolor
  3. Virginica

Do we have to assign a number, like 1 to Setosa, 2 to Versicolor, and 3 to Virginica, and then convert it to a factor variable,
OR
just convert it to a factor variable without assigning any number to each category?

Thanks,
Amod Shirke

#2

No, you do not need to assign numbers to categorical variables in order to convert them to factors. Note that, for the example you give (Species in the iris dataset), the variable is already a factor by default, so I'm converting it to a character and then back to a factor in the reprex below.

library(tidyverse)
str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

iris <- iris %>%
  mutate(Species = as.character(Species))

str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

iris <- iris %>%
  mutate(Species = as.factor(Species))

levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"

Created on 2018-09-09 by the reprex package (v0.2.0.9000).

Above I've used the base R function as.factor(), which doesn't require that you specify your factor levels. It's often a good idea to do so, though. See the forcats package docs, for example, for more detail on working with factors.
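
To make the point about specifying levels concrete, here is a minimal base R sketch. The vector species_chr and the particular level order are made up for illustration and are not part of the reprex above. It shows that as.factor() sorts the levels of a character vector alphabetically, that factor() with an explicit levels argument lets you choose the order yourself, and that R stores the integer codes internally either way, so there is no need to assign numbers by hand.

```r
# Hypothetical character vector standing in for a Species column
species_chr <- c("virginica", "setosa", "versicolor", "setosa")

# as.factor() infers the levels and sorts them alphabetically
f_default <- as.factor(species_chr)
levels(f_default)
#> [1] "setosa"     "versicolor" "virginica"

# factor() with explicit levels fixes the order you want
f_explicit <- factor(
  species_chr,
  levels = c("virginica", "versicolor", "setosa")
)
levels(f_explicit)
#> [1] "virginica"  "versicolor" "setosa"

# The integer codes are assigned by R and follow the level order
as.integer(f_explicit)
#> [1] 1 3 2 3
```

Setting the levels explicitly mainly matters when the order carries meaning, for example for the reference level in a model or for the order of groups in a plot.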