What does stringsAsFactors in R mean?

What does stringsAsFactors in R mean? Could anyone please explain it in detail with some examples? Thank you!

Hi,

You're asking a question that divides the R community and has been a topic of heated debate for a long time.

For a nice read, I suggest this blog:
https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

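In summary, in R versions before 4.0.0, strings are read as factors (i.e. a fixed set of distinct groups) by default; since R 4.0.0 the default is stringsAsFactors = FALSE. Reading strings as factors has two consequences: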

  • Your data is stored more efficiently, because each unique string gets a number and whenever it's used in your data frame you can store its numerical value (which is much smaller in size)
  • Factor levels are set when the data frame is created (or the file is loaded). Only strings present at that time become levels. If you try to assign any other value to that column and it is not one of the existing levels, you'll get a warning and the value is stored as NA. The good thing is that this prevents entering wrong data into a set data frame; the downside is that it's very annoying when you want to alter data frames. (There are ways to add or change levels, but it's often cumbersome.)
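
To make the two points above concrete, here is a minimal base R sketch. The vector name and colour values are made up for illustration; it shows the integer codes behind a factor and what happens when you assign a string that is not an existing level.

```r
# Build a factor from a character vector; the levels are the unique strings.
colours <- factor(c("red", "blue", "red", "green"))

levels(colours)      # the set of allowed strings
as.integer(colours)  # the integer code stored for each element

# Assigning a string that is not a level gives a warning and stores NA.
colours[2] <- "purple"
is.na(colours[2])

# One way to allow the new value: extend the levels first, then assign.
levels(colours) <- c(levels(colours), "purple")
colours[2] <- "purple"
as.character(colours)
```

Extending the levels by hand like this is the kind of extra step that makes factors feel cumbersome when the data keeps changing.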

In short, use stringsAsFactors = FALSE if you're planning to change the set of strings used in your data frame. If the data will not be changed, leaving the strings as factors is fine.

Hope this helps
PJ

One more thing to keep in mind is that while base data import functions (like read.csv and read.table) convert strings to factors by default in R versions before 4.0.0, tidyverse functions (like read_csv from the readr package or read_excel from the readxl package) do not.

With base R functions, to avoid conversion of strings to factors you would do, for example:

x = read.csv("my_file.csv", stringsAsFactors=FALSE)
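
If you want to see the difference without a file on disk, here is a small sketch that reads the same CSV text twice. The column names and values are invented for illustration, and the argument is set explicitly in both calls so the result does not depend on your R version's default.

```r
# Hypothetical CSV content, passed via the text argument instead of a file.
csv_text <- "name,group\nalice,a\nbob,b\ncarol,a"

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$group)  # "factor"
class(as_strings$group)  # "character"
```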

In readr you can just read the file, as there is no stringsAsFactors argument and no automatic conversion of strings to factors:

library(readr)
x = read_csv("my_file.csv")

However, if you wish you can also use the optional col_types argument to specify whether a particular column should be read in as a factor. See the col_types section of the help for read_csv for details.
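
As a rough sketch of the col_types approach, assuming readr 2.0.0 or later (needed for passing literal data with I()) and made-up column names, you could ask for one column as a factor like this:

```r
library(readr)

# Hypothetical CSV content; I() tells read_csv this is literal data, not a path.
csv_text <- "name,group\nalice,a\nbob,b\ncarol,a"

x <- read_csv(
  I(csv_text),
  col_types = cols(
    name  = col_character(),  # keep as plain strings
    group = col_factor()      # levels are taken from the values in the data
  )
)

class(x$group)
levels(x$group)
```

With a real file you would pass the file path instead of I(csv_text); the col_types part stays the same.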
