How do I read first columns of csv and ingore the rest?

grosscol · May 12, 2020, 8:00pm

Exposition

There exists a number of data submitters who have various levels of technical skill. In order to accommodate them all, accepting a csv with some number of extraneous columns is permitted. This is specifically an issue with some exports from Excel that have a couple dozen empty columns tacked on.

Problem

The order and meaning of the first columns, which are the ones to be read, are well enforced. There's some variation in the column names, but at least the order and meaning is constant. The rest is a laissez-faire content ranging from blank columns to irrelevant data.

Reading in a csv with extraneous columns results in warning message. This is undesirable as it generates some useless warnings, and those warnings emitted are logged.

Question

Is there a way to specify just read the first few columns and assign them their types without generating the warnings about expected vs actual, and col_types should be the same length as col_names?

Warning: 2 parsing failures.
row col  expected    actual         file
1  -- 3 columns 5 columns literal data
2  -- 3 columns 5 columns literal data

The interim solution has been to suppress warnings, but this is suboptimal as it will also suppress any actual warnings of interest such as "no trailing characters .0" when some flat file editor or settings ~~helpfully~~ determine that the user really wanted floating point representation instead of integers.

Example

library(readr)

# Input data with extra unwanted bits.
extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"
extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# Read input data with extraneous bits using column specification
col_spec <- cols(col_character(), col_integer(), col_character(), .default = col_skip())

readr::read_csv(extra_blanks, col_types = col_spec)
readr::read_csv(extra_garbage, col_types = col_spec)

# Read input data with extraneous bits using cols_only
col_spec <- cols_only(col_character(), col_integer(), col_character())

readr::read_csv(extra_blanks, col_types = col_spec)
readr::read_csv(extra_garbage, col_types = col_spec)

# A tibble: 2 x 3
# foo     bar baz  
# <chr> <int> <chr>
# 1 aaa   100 ccc  
# 2 aaa   200 ccc

cderv · May 12, 2020, 8:26pm

Here some example about how you could read those type of file

library(readr)

# Input data with extra unwanted bits.
extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"
extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# using data.table
data.table::fread(extra_blanks, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc
data.table::fread(extra_garbage, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc

# using vroom
vroom::vroom(extra_blanks, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc
vroom::vroom(extra_garbage, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

# using readr
col_spec <- cols_only(foo = col_character(), 
                      bar = col_integer(), 
                      baz = col_character())

readr::read_csv(extra_blanks, col_types = col_spec)
#> Warning: Missing column names filled in: 'X4' [4], 'X5' [5], 'X6' [6]
#> Warning: 1 parsing failure.
#> row col               expected actual         file
#>   2 bar no trailing characters     .0 literal data
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa      NA ccc
readr::read_csv(extra_garbage, col_types = col_spec)
#> Warning: Duplicated column names deduplicated: 'xtra' => 'xtra_1' [5]
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

^{Created on 2020-05-12 by the reprex package (v0.3.0)}

I believe readr will deal with guessing column names before ignoring them in the cold spec provided. I guess the way it works is that it needs to know the name to apply specs.

this is a rather old issue it seems

github.com/tidyverse/readr

Ignore missing/duplicate names if column is skipped

opened 11:39AM - 09 Jan 17 UTC

cbrnr

feature read :book:

When I read a file with trailing delimiters, `read_csv` spits out a warning that… a missing column name was filled in. Is there a way to tell the function that I want to read in all but the last (empty) column so that the warning message is not produced? I don't know how common such (malformed) CSV files are, but an option to ignore trailing delimiters might be useful. I tried to get it to work with the `col_types` argument, but it seems like all columns are read in at first. See also my question on [StackOverflow](http://stackoverflow.com/questions/41280289/ignore-trailing-delimiters-in-readrread-csv).

Maybe vroom's col_select argument can enter readr world... will see.

Hope it helps!

grosscol · May 13, 2020, 3:20pm

This is informative. In short, there is no readr solution to the issue, and a different package must be used to ingest the data.

Your reply illustrates thjat to get readr to avoid emitting the superfluous warnings, I'd have to know the column names in advance. In this case, that's not possible as they vary a bit. The column types alone appear to be insufficient for readr.

Thanks for the fread and vroom examples. Those will be useful.

system · May 20, 2020, 3:20pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.