How do I read first columns of csv and ingore the rest?

Exposition

There exists a number of data submitters who have various levels of technical skill. In order to accommodate them all, accepting a csv with some number of extraneous columns is permitted. This is specifically an issue with some exports from Excel that have a couple dozen empty columns tacked on.

Problem

The order and meaning of the first columns, which are the ones to be read, are well enforced. There's some variation in the column names, but at least the order and meaning is constant. The rest is a laissez-faire content ranging from blank columns to irrelevant data.

Reading in a csv with extraneous columns results in warning message. This is undesirable as it generates some useless warnings, and those warnings emitted are logged.

Question

Is there a way to specify just read the first few columns and assign them their types without generating the warnings about expected vs actual, and col_types should be the same length as col_names?

Warning: 2 parsing failures.
row col  expected    actual         file
1  -- 3 columns 5 columns literal data
2  -- 3 columns 5 columns literal data

The interim solution has been to suppress warnings, but this is suboptimal as it will also suppress any actual warnings of interest such as "no trailing characters .0" when some flat file editor or settings helpfully determine that the user really wanted floating point representation instead of integers.

Example

library(readr)

# Input data with extra unwanted bits.
extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"
extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# Read input data with extraneous bits using column specification
col_spec <- cols(col_character(), col_integer(), col_character(), .default = col_skip())

readr::read_csv(extra_blanks, col_types = col_spec)
readr::read_csv(extra_garbage, col_types = col_spec)

# Read input data with extraneous bits using cols_only
col_spec <- cols_only(col_character(), col_integer(), col_character())

readr::read_csv(extra_blanks, col_types = col_spec)
readr::read_csv(extra_garbage, col_types = col_spec)

# A tibble: 2 x 3
# foo     bar baz  
# <chr> <int> <chr>
# 1 aaa   100 ccc  
# 2 aaa   200 ccc

Here some example about how you could read those type of file

library(readr)

# Input data with extra unwanted bits.
extra_blanks <- "foo,bar,baz,,,\naaa,100,ccc,,,\naaa,200.0,ccc,,,\n"
extra_garbage <- "foo,bar,baz,xtra,xtra\naaa,100,ccc,xtra,xtra\naaa,200,ccc,xtra,xtra\n"

# using data.table
data.table::fread(extra_blanks, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc
data.table::fread(extra_garbage, select = 1:3)
#>    foo bar baz
#> 1: aaa 100 ccc
#> 2: aaa 200 ccc

# using vroom
vroom::vroom(extra_blanks, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc
vroom::vroom(extra_garbage, col_select = 1:3)
#> Rows: 2
#> Columns: 3
#> Delimiter: ","
#> chr [2]: foo, baz
#> dbl [1]: bar
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <dbl> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

# using readr
col_spec <- cols_only(foo = col_character(), 
                      bar = col_integer(), 
                      baz = col_character())

readr::read_csv(extra_blanks, col_types = col_spec)
#> Warning: Missing column names filled in: 'X4' [4], 'X5' [5], 'X6' [6]
#> Warning: 1 parsing failure.
#> row col               expected actual         file
#>   2 bar no trailing characters     .0 literal data
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa      NA ccc
readr::read_csv(extra_garbage, col_types = col_spec)
#> Warning: Duplicated column names deduplicated: 'xtra' => 'xtra_1' [5]
#> # A tibble: 2 x 3
#>   foo     bar baz  
#>   <chr> <int> <chr>
#> 1 aaa     100 ccc  
#> 2 aaa     200 ccc

Created on 2020-05-12 by the reprex package (v0.3.0)

I believe readr will deal with guessing column names before ignoring them in the cold spec provided. I guess the way it works is that it needs to know the name to apply specs.

this is a rather old issue it seems

Maybe vroom's col_select argument can enter readr world... will see.

Hope it helps!

3 Likes

This is informative. In short, there is no readr solution to the issue, and a different package must be used to ingest the data.

Your reply illustrates thjat to get readr to avoid emitting the superfluous warnings, I'd have to know the column names in advance. In this case, that's not possible as they vary a bit. The column types alone appear to be insufficient for readr.

Thanks for the fread and vroom examples. Those will be useful.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.