Convert from string to numeric & regression

Hello,

I want to fit a linear regression model, and some of my variables are categorical (string values).

I have two questions:

  1. When I convert from string to numeric, it's not clear to me: do the values I assign to each string matter? For example, if I have c('rich', 'average', 'poor'), does it matter whether I assign 1 to rich, 2 to poor and 3 to average, or any other order?

  2. What is the easiest and fastest way to convert a data frame column from string values to numeric values? For the moment, this is how I do it:

# Named lookup vector mapping each category to the numeric code I chose
functional.list <- c('None' = 0, 'Sal' = 1, 'Sev' = 2, 'Maj2' = 3, 'Maj1' = 4,
                     'Mod' = 5, 'Min2' = 6, 'Min1' = 7, 'Typ' = 8)
# Look up each value by name and store the codes in a new column of df.numeric called Functional.
# as.character() keeps the lookup by name even if the source column is a factor.
df.numeric['Functional'] <- as.numeric(functional.list[as.character(df.fulldata$Functional)])

Thanks for your help!

Don't do that unless you know that rich = 2 * poor. In most cases, you probably want a numeric encoding that doesn't enforce an ordinal structure. There is broad discussion of this here.

The canonical method for doing this is to store your data as a factor instead of character. Then, if you use a modeling function that takes a formula, the appropriate binary numeric encodings are created automatically. Otherwise, you can pre-compute them using model.matrix or, better yet, with a recipe.
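As a minimal sketch of the factor approach, assuming a made-up data frame (the `income` and `wealth` columns are hypothetical and not from your data):

```r
# Hypothetical toy data; the column names are made up for illustration.
df <- data.frame(
  income = c(90, 85, 50, 55, 20, 25),
  wealth = c("rich", "rich", "average", "average", "poor", "poor"),
  stringsAsFactors = FALSE
)

# Store the categories as a factor rather than as hand-assigned integer codes.
df$wealth <- factor(df$wealth, levels = c("poor", "average", "rich"))

# A formula interface builds the 0/1 indicator columns automatically.
fit <- lm(income ~ wealth, data = df)
coef_names <- names(coef(fit))
print(coef_names)

# The same encoding, pre-computed for a function that needs a numeric matrix.
X <- model.matrix(~ wealth, data = df)
print(X)
```

The first factor level ("poor" here) acts as the reference, so the model gets an intercept plus one indicator column for each of the other levels.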


Thanks for your answer.

It’s still not clear to me. I have several questions:

  1. In which cases can I simply convert my categories into the numeric values 1, 2, 3, ..., n without further thought?

  2. When do I have to convert to dummy variables? I tried to understand where the boundary is, but it’s not clear to me.
    By the way, is a dummy variable a transformation of one column into several columns, each containing 0 or 1?

  3. When do I have to convert to numeric values with a defined order?

Thanks again

Not to be snarky, but don't do any data analysis without considering the context.

There are not many cases where a plain 1, 2, ..., n integer encoding makes sense. Unless your data have a natural ordering and you feel/know that a value of 2 should have twice the effect of the category associated with a value of 1 (and so on), don't do this. In the link that I sent, there is this:

For example, when discussing failure modes of a piece of computer hardware, experts would be able to rank the severity of a type of failure on an integer scale. A minor failure might be scored as a “1” while a catastrophic failure mode could be given a score of “10” and so on.
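To make the "twice the effect" point concrete, here is a small sketch on made-up data (the values and column names are assumptions for illustration only). The group means are 10, 12 and 30, so the steps between categories are clearly unequal:

```r
# Hypothetical data: group means are 10, 12 and 30, so the steps are unequal.
df <- data.frame(
  level = rep(c("poor", "average", "rich"), each = 2),
  y     = c(9, 11, 11, 13, 29, 31),
  stringsAsFactors = FALSE
)

# Integer score: a single slope, so every step up is forced to have the same effect.
df$score <- unname(c(poor = 1, average = 2, rich = 3)[df$level])
fit_score <- lm(y ~ score, data = df)

# Factor: each non-reference level gets its own coefficient.
df$level_f <- factor(df$level, levels = c("poor", "average", "rich"))
fit_factor <- lm(y ~ level_f, data = df)

print(coef(fit_score))
print(coef(fit_factor))
```

The factor model estimates a separate difference from "poor" for each level (2 for "average", 20 for "rich"), while the integer score can only fit one common slope per step, which matches neither difference.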

Dummy variables (aka indicator variables) are a set of columns, usually binary 0/1, that replace the original non-numeric source data. An example using a predictor for the days of the week is here.

There are other ways to encode the data that aren't binary, but dummy variables are the most common method.
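As a rough illustration of what those columns look like, here is a sketch with a hypothetical day-of-week predictor (the data are made up; only base R's model.matrix is used):

```r
# Hypothetical day-of-week predictor.
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
df <- data.frame(day = factor(c(days, "Mon", "Sat"), levels = days))

# Default treatment contrasts: an intercept plus one 0/1 column for each
# non-reference level (the first level, "Mon", is the reference).
X_ref <- model.matrix(~ day, data = df)

# Dropping the intercept gives one 0/1 column per level instead.
X_full <- model.matrix(~ day - 1, data = df)

print(colnames(X_ref))
print(X_full)
```

In the second matrix each row has exactly one 1, in the column for that row's day. The first form is what a formula-based model such as lm uses by default, since the reference level is absorbed into the intercept.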

As for your third question, I'm not sure what you mean. Do you mean that the categories have a defined order? Can you give an example?


Thank you,

I have all my answers :smiley:

Have a great day!
