How to make read_csv smarter at inferring column types?

I have been using read_csv for reading tables and it has worked like a charm so far. The column type inference is very nice, since it saves me from having to do type conversions.

Except for when it behaves a bit too naïvely. Here is an example where some of the date columns are misinferred as dbl:

Warning message:
“One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)”
Rows: 15052 Columns: 286
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
dbl  (285): min_date, max_date, ...
date   (1): disease_start_date

I know what is triggering this: columns like min_date and max_date have NA rows. Only disease_start_date is complete enough for the pattern to be inferred.

Is there anything I can do to make the type inference a bit smarter?
Thanks!

Add the option:

read_csv(..., guess_max = n_max)

You can also specify n_max, which is Inf by default.
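Since n_max defaults to Inf, that amounts to something like the sketch below (read_csv2 because of the semicolon delimiter above; the file name is hypothetical):

library(readr)

# use all rows for type guessing rather than only the default first 1000
databank <- read_csv2("databank.csv", guess_max = Inf)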


That brought some improvement; in fact, as far as the function is concerned, there were no parsing problems this time.

Rows: 15052 Columns: 286
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
dbl  (284):  death_date, max_date, sex, ...
dttm   (1): min_date
date   (1): disease_start_date

But hmmm... death_date and max_date are still being inferred as double.

I tried doing spec() and that seems to trigger an interesting error:

Error in `validate_not_missing()`:
! `truth` is missing and must be supplied.
Traceback:

1. spec(databank)
2. spec.data.frame(databank)
3. metric_summarizer(metric_nm = "spec", metric_fn = spec_vec, data = data, 
 .     truth = !!enquo(truth), estimate = !!enquo(estimate), estimator = estimator, 
 .     na_rm = na_rm, case_weights = !!enquo(case_weights), event_level = event_level)
4. validate_not_missing(truth, "truth")
5. abort(paste0("`", nm, "` ", "is missing and must be supplied."))
6. signal_abort(cnd, .file)

Any idea what might be going on here?

I think you are calling a different spec() function than you think you are.

My best guess is that the error above comes from having called yardstick::spec() instead of readr's, likely because of the order in which your packages were loaded.

Try naming it in full instead, i.e.

readr::spec()
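A quick way to see which spec() is being picked up, sketched here (this assumes both readr and yardstick are attached, and databank is the data frame you read in):

library(readr)
library(yardstick)   # attached after readr, so yardstick::spec() masks readr::spec()

find("spec")
#> [1] "package:yardstick" "package:readr"

readr::spec(databank)   # the fully qualified call returns readr's column specification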

Ok, interesting. I must have missed a conflict warning.

Now I think I am in the thick of it. My diagnosis is that the dbl inference seems to happen when there are substantially more NA entries than actual values.

> sum(is.na(databank$min_date))      # min_date is inferred as a date
[1] 14                               # a mere 14 entries are NA
> sum(is.na(databank$death_date))    # death_date is inferred as dbl
[1] 12964                            # Fortunately mortality is not too high; unfortunately that means many empty rows

I don't suppose there is any option to ignore NA while inferring the type?

My assumption would be that death_date doesn't look like a date.
Have you checked how the input looks?
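One quick way to check, as a sketch (file here is the same path you pass to read_csv):

library(readr)

# peek at the first few raw lines before any parsing happens
read_lines(file, n_max = 5)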

If only! death_date is just as good as min_date but loaded with missing values. Missing, simply put, because the corresponding patients have not died.

 [1] 20161126 20161126       NA       NA       NA       NA       NA       NA
 [9]       NA       NA       NA       NA       NA       NA       NA       NA
[17]       NA       NA       NA       NA       NA       NA       NA       NA
[25]       NA       NA 20030203 20140728       NA       NA       NA       NA

min_date was kept in the same format and inferred just as easily.

Again, is there any way to stop NAs from poisoning the type inference?

I'm still sceptical that NAs have anything to do with it.

library(readr)
parse_guess("2016-11-26") |> str()
# Date[1:1], format: "2016-11-26"
parse_guess("20161126") |> str()
# num 20161126

Is min_date really 8 digits with no punctuation?

Is this sufficient proof?

Good to know about parse_guess though, that might allow for the post-processing I am looking for.

If I cannot find anything else I think it should be possible to target all columns with col_guess, ignore all NAs and infer type from non-NA values.

Hmmm... looks like it might not be that easy after all.

> databank |> select(death_date) |> drop_na() |> mutate(death_date = as.character(death_date)) |> pull(death_date) |> guess_parser()
[1] "double"
> databank |> select(death_date) |> drop_na() |> mutate(death_date = as.character(death_date)) |> pull(death_date) |> parse_guess() |> head()
[1] 20161126 20161126 20030203 20140728 20090404 20090404

And yet, inference works just fine for min_date, which starts out in the same format. So what gives?

Does the format vary?

Sure, let's check:

> databank <- read_csv2(file, guess_max = 10000, col_types = cols(.default = col_character()))  # Forcing character typing on all columns

> databank$min_date
Strings of pattern: '20070510'...

> databank$death_date
Strings of pattern: '20161126', or NA...

Yes, they are definitely the same. Seeing as my issue is mostly with dates, perhaps I can force a conversion to col_date if the variable name contains date in it.

But I still don't understand why the inference isn't able to glean that these two are the same format; it does manage to infer a date type for the first YYYYMMDD column, after all.
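One way to act on the col_date idea at read time, as a sketch (the two-pass header peek, the %Y%m%d format and the endsWith() match are my own assumptions, not something readr does for you):

library(readr)

# Peek at the header only, then force every column whose name ends in "date"
# to col_date("%Y%m%d") while letting readr guess the rest.
header    <- names(read_csv2(file, n_max = 0, col_types = cols(.default = col_character())))
date_cols <- header[endsWith(header, "date")]

date_spec <- setNames(rep(list(col_date(format = "%Y%m%d")), length(date_cols)), date_cols)

databank <- read_csv2(
  file,
  guess_max = 10000,
  col_types = do.call(cols, c(date_spec, list(.default = col_guess())))
)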

Ok, here is what worked:

library(readr)
library(tibble)
library(dplyr)
library(lubridate)

databank <- file |> read_csv2(guess_max = 10000)

# Get all column names
col_names <- names(databank)

# Keep anything that ends with `date` as date_col_names
date_col_names <- col_names |> enframe() |> filter(endsWith(value, "date"))

# Mutate all such variables in databank, applying ymd
databank <- databank |> mutate(across(all_of(date_col_names$value), ymd))


Huzzah!
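For what it's worth, the same conversion can be written in one step with a tidyselect helper (a sketch under the same assumption that every relevant column name ends in "date"):

library(dplyr)
library(lubridate)

# ends_with("date") replaces the names()/enframe()/filter() detour
databank <- databank |>
  mutate(across(ends_with("date"), ymd))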

Curious observation: as_date instead of ymd produces the following:

I really have to wonder why. Oh well, thanks for teaching me about guess_max at least :)
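If it helps, the difference most likely comes down to how the two functions treat a bare number (a small sketch, not output from the thread above):

library(lubridate)

ymd(20161126)      # coerced to character and parsed as YYYYMMDD -> "2016-11-26"
as_date(20161126)  # a numeric is read as days since 1970-01-01, so the result
                   # lands tens of thousands of years in the future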
