How to make read_csv smarter at inferring column types?

I have been using read_csv for reading tables and it has worked like a charm so far. The column type inference is very nice, since it saves me from having to do type conversions.

Except for when it behaves a bit too naïvely. Here is an example where some of the date columns are misinferred as dbl:

Warning message:
“One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)”
Rows: 15052 Columns: 286
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
dbl  (285): min_date, max_date, ...
date   (1): disease_start_date

I know what is triggering this: columns like min_date and max_date have NA rows. Only disease_start_date is complete enough for the pattern to be inferred.

Is there anything I can do to make the type inference a bit smarter?
Thanks!

Add the option:

read_csv(..., guess_max = n_max)

You can also specify n_max, which is Inf by default.
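Since n_max defaults to Inf, that amounts to something like the sketch below (read_csv2 because of the semicolon delimiter above; the file name is hypothetical):

library(readr)

# use all rows for type guessing rather than only the default first 1000
databank <- read_csv2("databank.csv", guess_max = Inf)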


That brought some improvement; in fact, as far as the function is concerned, there were no parsing problems this time.

Rows: 15052 Columns: 286
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
dbl  (284):  death_date, max_date, sex, ...
dttm   (1): min_date
date   (1): disease_start_date

But hmmm... death_date and max_date are still being inferred as double.

I tried doing spec() and that seems to trigger an interesting error:

Error in `validate_not_missing()`:
! `truth` is missing and must be supplied.
Traceback:

1. spec(databank)
2. spec.data.frame(databank)
3. metric_summarizer(metric_nm = "spec", metric_fn = spec_vec, data = data, 
 .     truth = !!enquo(truth), estimate = !!enquo(estimate), estimator = estimator, 
 .     na_rm = na_rm, case_weights = !!enquo(case_weights), event_level = event_level)
4. validate_not_missing(truth, "truth")
5. abort(paste0("`", nm, "` ", "is missing and must be supplied."))
6. signal_abort(cnd, .file)

Any idea what might be going on here?

I think you are calling a different spec() function than you think you are.

My best guess is that the error above comes from having called yardstick::spec() instead of readr's, likely because of the order in which your packages were loaded.

Try naming it in full instead, i.e.

readr::spec()
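A quick way to see which spec() is being picked up, sketched here (this assumes both readr and yardstick are attached, and databank is the data frame you read in):

library(readr)
library(yardstick)   # attached after readr, so yardstick::spec() masks readr::spec()

find("spec")
#> [1] "package:yardstick" "package:readr"

readr::spec(databank)   # the fully qualified call returns readr's column specification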

Ok, interesting. I must have missed a conflict warning.

Now I think I am in the thick of it. My diagnosis is that the dbl inference seems to happen when there are substantially more NA entries than actual values.

> sum(is.na(databank$min_date))      # min_date is inferred as a date
[1] 14                               # a mere 14 entries are NA
> sum(is.na(databank$death_date))    # death_date is inferred as dbl
[1] 12964                            # Fortunately mortality is not too high; unfortunately that means many empty rows

I don't suppose there is any option to ignore NA while inferring the type?

My assumption would be that death_date doesn't look like a date.
Have you checked how the input looks?
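One quick way to check, as a sketch (file here is the same path you pass to read_csv):

library(readr)

# peek at the first few raw lines before any parsing happens
read_lines(file, n_max = 5)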

If only! death_date is just as good as min_date but loaded with missing values. Missing, simply put, because the corresponding patients have not died.

 [1] 20161126 20161126       NA       NA       NA       NA       NA       NA
 [9]       NA       NA       NA       NA       NA       NA       NA       NA
[17]       NA       NA       NA       NA       NA       NA       NA       NA
[25]       NA       NA 20030203 20140728       NA       NA       NA       NA

min_date was kept in the same format and inferred just as easily.

Again, is there any way to stop NAs from poisoning the type inference?

I'm still sceptical that NAs have anything to do with it.

library(readr)
parse_guess("2016-11-26") |> str()
# Date[1:1], format: "2016-11-26"
parse_guess("20161126") |> str()
# num 20161126

Is min_date really 8 digits with no punctuation?

Is this sufficient proof?

Good to know about parse_guess though, that might allow for the post-processing I am looking for.

If I cannot find anything else I think it should be possible to target all columns with col_guess, ignore all NAs and infer type from non-NA values.

Hmmm... looks like it might not be that easy after all.

> databank |> select(death_date) |> drop_na() |> mutate(death_date = as.character(death_date)) |> pull(death_date) |> guess_parser()
[1] "double"
> databank |> select(death_date) |> drop_na() |> mutate(death_date = as.character(death_date)) |> pull(death_date) |> parse_guess() |> head()
[1] 20161126 20161126 20030203 20140728 20090404 20090404

And yet, inference works just fine for min_date, which starts out in the same format. So what gives?

Does the format vary?

Sure, let's check:

> databank <- read_csv2(file, guess_max = 10000, col_types = cols(.default = col_character()))  # Forcing character typing on all columns

> databank$min_date
Strings of pattern: '20070510'...

> databank$death_date
Strings of pattern: '20161126', or NA...

Yes, they are definitely the same. Seeing as my issue is mostly with dates, perhaps I can force a conversion to col_date if the variable name contains date in it.

But I still don't understand why the inference isn't able to glean that these two are the same format; it does manage to infer a date type for the first YYYYMMDD column, after all.
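One way to act on the col_date idea at read time, as a sketch (the two-pass header peek, the %Y%m%d format and the endsWith() match are my own assumptions, not something readr does for you):

library(readr)

# Peek at the header only, then force every column whose name ends in "date"
# to col_date("%Y%m%d") while letting readr guess the rest.
header    <- names(read_csv2(file, n_max = 0, col_types = cols(.default = col_character())))
date_cols <- header[endsWith(header, "date")]

date_spec <- setNames(rep(list(col_date(format = "%Y%m%d")), length(date_cols)), date_cols)

databank <- read_csv2(
  file,
  guess_max = 10000,
  col_types = do.call(cols, c(date_spec, list(.default = col_guess())))
)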

Ok, here is what worked:

library(readr)
library(tibble)
library(dplyr)
library(lubridate)

databank <- file |> read_csv2(guess_max = 10000)

# Get all column names
col_names <- names(databank)

# Keep anything that ends with `date` as date_col_names
date_col_names <- col_names |> enframe() |> filter(endsWith(value, "date"))

# Mutate all such variables in databank, applying ymd
databank <- databank |> mutate(across(all_of(date_col_names$value), ymd))


Huzzah!
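For what it's worth, the same conversion can be written in one step with a tidyselect helper (a sketch under the same assumption that every relevant column name ends in "date"):

library(dplyr)
library(lubridate)

# ends_with("date") replaces the names()/enframe()/filter() detour
databank <- databank |>
  mutate(across(ends_with("date"), ymd))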

Curious observation: as_date instead of ymd produces the following:

I really have to wonder why. Oh well, thanks for teaching me about guess_max at least :)
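If it helps, the difference most likely comes down to how the two functions treat a bare number (a small sketch, not output from the thread above):

library(lubridate)

ymd(20161126)      # coerced to character and parsed as YYYYMMDD -> "2016-11-26"
as_date(20161126)  # a numeric is read as days since 1970-01-01, so the result
                   # lands tens of thousands of years in the future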
