How to read columns as factors with read_csv, if I don't know the factor levels in advance

ggplot2
readr
read_csv

#1

Hi,

I apologize if this is not the right place for this kind of question; I would be happy to post it somewhere else if you think that’d be more appropriate.

I sometimes need to read large .csv files containing data on various items categorized by a serial number (an integer). Since I want to make nice plots with ggplot2, I’d like the serial numbers to be read as factors, so that I can, for example, color different items differently. However, this doesn’t seem possible with read_csv(): col_types = cols(Serial = col_factor()) requires knowing the levels (i.e., all the serial numbers) before reading. I’m then stuck with one of these options:

  • read the data with read_csv() and then convert columns with mutate_at() (doable, but a bit of a waste)
  • read data with read.csv() (slooow!)
  • read data with data.table::fread (works, but I like much more the tidyverse API)
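
For reference, the first option above might look roughly like the sketch below. The file contents and the column names Serial and value are made up for illustration, and the file is written to a temporary path so the snippet runs on its own.

```r
library(readr)
library(dplyr)

# Hypothetical example data, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(Serial = c(101L, 102L, 101L), value = c(1.5, 2.5, 3.5)), tmp)

# Read Serial as an integer first, then convert it to a factor afterwards.
dat <- read_csv(tmp, col_types = cols(Serial = col_integer(), value = col_double())) %>%
  mutate_at(vars(Serial), factor)

# factor() sorts the unique values to build the levels.
levels(dat$Serial)
```

This works, but the column is parsed once as an integer and then converted in a second pass.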

Why isn’t it possible to read the columns as factors and let read_csv() compute the levels? I guess this comes from @hadley’s aversion to stringsAsFactors = TRUE (which I can totally relate to). However, in my use case I’m explicitly specifying which columns to read as factors. Am I missing something obvious?

PS reproducible example:

write_csv(data.frame(x=1:19, y=letters[1:19]), path = "foo.csv")
bar <- read_csv("foo.csv", col_types = cols(x = col_factor()), col_names = T)
Error in structure(list(...), class = c(paste0("collector_", type), "collector")) : 
  argument "levels" is missing, with no default

#2

You could change x to a factor after reading it in:
bar %>% mutate(x = factor(x))


#3

Actually, that would be really slow for large datasets.

How about:
bar <- read_csv("foo.csv", col_types = cols(x = col_character()), col_names = T)

ggplot2 treats character columns as discrete variables, the same way it treats factors.


#4

This is a good idea! Thanks


#5

You can use col_factor(NULL), as mentioned in ?col_factor:

levels: Character vector providing set of allowed levels. if ‘NULL’,
will generate levels based on the unique values of ‘x’,
ordered by order of appearance in ‘x’.

library(readr)
x <- read_csv("a\nfoo\nbar\nbaz", col_types = cols(col_factor(NULL)))
x
#> # A tibble: 3 x 1
#>   a     
#>   <fctr>
#> 1 foo   
#> 2 bar   
#> 3 baz
levels(x$a)
#> [1] "foo" "bar" "baz"
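
Applied to a named column read from a file, as in the original question, the same idea might look like the sketch below. The data is made up and written to a temporary file; levels = NULL is passed explicitly so the call does not depend on what the default is in the installed readr version.

```r
library(readr)

# Hypothetical example data, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(3L, 1L, 2L, 1L), y = c("a", "b", "c", "d")), tmp)

# levels = NULL asks readr to collect the levels while parsing.
bar <- read_csv(
  tmp,
  col_types = cols(x = col_factor(levels = NULL), y = col_character())
)

# Levels follow order of first appearance in the file, not sorted order.
levels(bar$x)
#> [1] "3" "1" "2"
```

Note that the level order differs from what factor() would give after reading, since factor() sorts the unique values.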

#6

@jimhester great! I misinterpreted the docs: I thought that

levels: Character vector providing set of allowed levels. if NULL, will generate levels based on the unique values of x, ordered by order of appearance in x.

meant that if the levels argument was missing, then col_factor would generate levels automatically. However, since this seemed not to work, I thought there was an issue with the docs…instead it was I who had misinterpreted them. Good! Problem solved :slight_smile: