Using dummy variables for categorical data

#1

How do I convert the data below using dummy variables?

Class : chr "no-recurrence-events" "recurrence-events" "recurrence-events" "no-recurrence-events" ...
PostMeno : chr "premeno" "It40" "premeno" "ge40" ...
NodeCaps : chr "no" "yes" "no" "no" "yes" "yes" ...
Breast : chr "left" "right" "left" "right" ...
Quadrant : chr "left_low" "right_up" "central" "left_up" "right_low" ...
Radiation: chr "no" "yes" "no" "yes" ...

Class : has 2 levels ----- "no-recurrence-events" "recurrence-events"
PostMeno : has 3 levels ----- "It40" "premeno" "ge40"
NodeCaps : has 2 levels ----- "no" "yes"
Breast : has 2 levels ----- "left" "right"
Quadrant : has 5 levels ----- "left_low" "right_up" "central" "left_up" "right_low"...
Radiation: has 2 levels ----- "no" "yes"

#2

Check out fct_recode() in the forcats package.

Also, some good info on recoding dummy variables using ifelse() here:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/QuantCore/PH717_MultipleVariableRegression/PH717_MultipleVariableRegression4.html
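
As a rough sketch of the ifelse() approach (the vector below is made up to mirror the Radiation column from the question, and `radiation_yes` is just an illustrative name), a two-level variable can be turned into a single 0/1 dummy like this:

```r
# Hypothetical values mirroring the Radiation column from the question
radiation <- c("no", "yes", "no", "yes")

# 1 where radiation is "yes", 0 otherwise
radiation_yes <- ifelse(radiation == "yes", 1, 0)
radiation_yes
#> [1] 0 1 0 1
```

A variable with k levels would need k - 1 such columns, one per non-reference level.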

There's also a package specifically for creating dummy variables, fastDummies (though I haven't personally used it).

#3

I have seen all this online. I have a data set with 286 rows and 10 columns, and I am looking for a single way to handle all of it. Out of the 10 columns, I want to create dummy variables for 9 of them. Any suggestions on how to do that, please?
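
One possible base R route for "dummy variables for all but one column" is model.matrix(). This is only a sketch on made-up data: the `cancer` data frame, its `Id` column, and the row values are assumptions for illustration, not the real data set.

```r
# Hypothetical stand-in for the real data: three categorical columns plus one
# column (Id) that should be left alone
cancer <- data.frame(
  Id       = 1:6,
  NodeCaps = c("no", "yes", "no", "no", "yes", "yes"),
  Breast   = c("left", "right", "left", "right", "left", "right"),
  Quadrant = c("left_low", "right_up", "central", "left_up", "right_low", "left_low"),
  stringsAsFactors = TRUE
)

# Columns to expand: everything except Id
cat_cols <- setdiff(names(cancer), "Id")

# model.matrix() builds 0/1 indicator columns; with the default treatment
# contrasts the first level of each factor is the reference and gets no column.
# The first column of the result is the intercept, which is dropped here.
dummies <- model.matrix(~ ., data = cancer[cat_cols])[, -1, drop = FALSE]

result <- cbind(cancer["Id"], dummies)
names(result)
```

Each k-level column becomes k - 1 indicator columns (for example `NodeCapsyes`), which is the usual layout for regression models.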

#4

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

#5

Can you please explain what you mean by this? Can you please provide the expected output for a copy-paste friendly sample dataset? As Mara has noted, a reprex will be very helpful.

If you meant something like coding c("A", "B", "A", "A", "B", "C") as c(1, 2, 1, 1, 2, 3), then you can use the as.integer function on a factor. Or, if you want to recode using some other labels, you can use the labels argument of the factor function.
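
A minimal sketch of both options on that same vector (the replacement labels "low", "mid" and "high" are made up for illustration):

```r
x <- c("A", "B", "A", "A", "B", "C")

# as.integer() applied directly to these strings would give NA, so convert to
# a factor first; the integer codes follow the sorted factor levels (A, B, C)
codes <- as.integer(factor(x))
codes
#> [1] 1 2 1 1 2 3

# Hypothetical replacement labels, one per level in sorted order (A, B, C)
relabelled <- factor(x, labels = c("low", "mid", "high"))
as.character(relabelled)
#> [1] "low"  "mid"  "low"  "low"  "mid"  "high"
```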

Here, I'm providing an example, where I've recoded to integers but through the factor function. I'm recoding all columns except one particular column.

set.seed(seed = 28127)

suppressPackageStartupMessages(expr = library(package = "dplyr"))

dataset <- data.frame(Class = sample(x = c("no-recurrence-events", "recurrence-events"),
                                     size = 20,
                                     replace = TRUE),
                      PostMeno = sample(x = c("It40", "premeno", "ge40"),
                                        size = 20,
                                        replace = TRUE),
                      NodeCaps = sample(x = c("no", "yes"),
                                        size = 20,
                                        replace = TRUE),
                      Breast = sample(x = c("left", "right"),
                                      size = 20,
                                      replace = TRUE),
                      Quadrant = sample(x = c("left_low", "right_up", "central", "left_up", "right_low"),
                                        size = 20,
                                        replace = TRUE),
                      Radiation = sample(x = c("no", "yes"),
                                         size = 20,
                                         replace = TRUE))

# factor(y) makes this work for character columns too, where nlevels(y) is 0
dataset %>%
  mutate_at(.vars = vars(-Radiation),
            .funs = function(y) factor(x = y,
                                       labels = seq_len(length.out = nlevels(x = factor(x = y)))))
#>    Class PostMeno NodeCaps Breast Quadrant Radiation
#> 1      1        3        1      1        1       yes
#> 2      1        3        1      1        1        no
#> 3      2        2        1      1        1        no
#> 4      1        2        1      2        5       yes
#> 5      1        1        1      1        5        no
#> 6      1        2        1      2        2       yes
#> 7      1        1        1      2        3       yes
#> 8      1        3        2      2        3        no
#> 9      2        2        1      2        1        no
#> 10     2        1        1      1        3       yes
#> 11     2        1        2      2        2       yes
#> 12     1        2        2      2        5       yes
#> 13     2        1        2      1        4       yes
#> 14     2        2        2      2        5        no
#> 15     2        2        2      2        1       yes
#> 16     1        2        1      2        4        no
#> 17     1        2        2      2        5       yes
#> 18     1        1        1      2        1        no
#> 19     2        3        2      1        1        no
#> 20     1        2        1      1        1       yes

Created on 2019-04-09 by the reprex package (v0.2.1)

Hope this helps.

#6

Also, keep in mind that recoding your factor variables as integers (e.g. 1, 3, 4, 5) introduces an order in your data, which may or may not be desirable for your model. If you want to avoid this, you have to create "one-hot encoded" dummy variables (i.e. only 1 or 0 values). One easy way to do this is with the caret package; see this example.

df <- data.frame(stringsAsFactors = FALSE,
                 age = as.factor(c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")),
                 value = 1:6)

library(caret)

dmy <- dummyVars(" ~ .", data = df)
recoded <- data.frame(predict(dmy, newdata = df))
recoded
#>   age.15.24 age.25.34 age.35.54 age.5.14 age.55.74 age.75. value
#> 1         0         0         0        0         0       1     1
#> 2         0         0         0        0         1       0     2
#> 3         0         0         1        0         0       0     3
#> 4         0         1         0        0         0       0     4
#> 5         1         0         0        0         0       0     5
#> 6         0         0         0        1         0       0     6
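
If you would rather not depend on caret, something similar can be sketched in base R with model.matrix(). This is an alternative under the same toy data, not what dummyVars() does internally, and the column names come out differently (for example `age75+` instead of `age.75.`).

```r
df <- data.frame(age = factor(c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")),
                 value = 1:6)

# Dropping the intercept (- 1) makes model.matrix() keep an indicator column
# for every level of the single factor in the formula
one_hot <- model.matrix(~ age - 1, data = df)

# Put the indicators back next to the untouched numeric column
recoded <- cbind(as.data.frame(one_hot), value = df$value)
names(recoded)
```

Note that with several factors in one formula, only the first factor keeps all of its levels this way; the others still lose their reference level.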

#9

I like your code, but I have a large data set with 286 columns and 10 columns. The name of the data set is "Cancer". Do I replace "x" with "Cancer"? How do I put that into your code? I am new to R.

#10

Thank you for adding this. But I want each age group to be replaced with its mid-range. For example, "55-74" should be replaced with "64.5" and "35-54" with "44.5". How do I write such code?
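
One way this could be approached, as a sketch only: split each label on the dash and average the two bounds. The `midpoint` helper is a made-up name, and returning NA for open-ended groups such as "75+" is an assumption, since they have no upper bound to average.

```r
age <- c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")

# Midpoint of a "low-high" label; open-ended labels such as "75+" have no
# upper bound, so they are returned as NA here (an assumption, adjust as needed)
midpoint <- function(label) {
  bounds <- suppressWarnings(as.numeric(strsplit(label, "-", fixed = TRUE)[[1]]))
  if (length(bounds) == 2 && !anyNA(bounds)) mean(bounds) else NA_real_
}

age_mid <- vapply(age, midpoint, numeric(1), USE.NAMES = FALSE)
age_mid
#> [1]   NA 64.5 44.5 29.5 19.5  9.5
```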

#11

Did you mean that you have a data.frame (or tibble) named Cancer with 10 columns? (I assume you meant 286 rows and 10 columns as you said earlier to Mara. But even if you have 286 columns, the following will work.)

In that case, replace dataset %>% ... in my code with Cancer %>% ....

Edit:

Actually, if it's a tibble, then unless the columns are already factors, you may face errors because of the nlevels function. This doesn't happen with data.frame, which converts strings to factors by default (in R versions before 4.0.0). But if you explicitly change that default, you can face the same errors there too. In these cases, you can use the number of unique elements instead of nlevels.
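
A small base R sketch of that workaround, on a made-up `Cancer` data frame with character columns (the values and the `recode_cols` name are assumptions for illustration):

```r
# Hypothetical character columns, as in a tibble or a data.frame built with
# stringsAsFactors = FALSE
Cancer <- data.frame(
  NodeCaps  = c("no", "yes", "no", "no"),
  Breast    = c("left", "right", "left", "right"),
  Radiation = c("no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

nlevels(Cancer$NodeCaps)          # 0, because a character vector has no levels
length(unique(Cancer$NodeCaps))   # 2

# Recode every column except Radiation, counting unique values instead of levels
recode_cols <- setdiff(names(Cancer), "Radiation")
Cancer[recode_cols] <- lapply(Cancer[recode_cols], function(y) {
  factor(y, labels = seq_len(length(unique(y))))
})
```

If a column can contain NA, unique() counts it while factor() does not, so nlevels(factor(y)) would be the safer count in that case.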

Hope this helps.
