read_csv not respecting col_select, can't read problems()

Hi! I'm pretty new to R. Enjoying it immensely.

I'm reading in all CSV files from a directory using map_dfr to apply read_csv over a list of filenames. The CSV files have a varying number of columns. I only want to import 1:7 and to discard 8: onwards where they exist. All files have a column 8, and some files have some text that is parsed as columns 9, 10 etc when I look at the files in Excel. I don't care about any of these columns. Notably, these extra columns don't have headers.

My code is:

df_csv <- map_dfr(
  csvpaths, 
  read_csv, 
  skip_empty_rows = TRUE,
  col_names = TRUE,
  col_select = 1:7,
  col_types = cols_only(
    Project = col_character(),
    Date = col_character(),
    Employee = col_character(),
    Role = col_character(),
    Rate = col_double(),
    Hours = col_double(),
    Amount = col_double()
  ),
  .id = "Source"
)

What's happening:

  • The files are being read in correctly, in that the final df includes the correct information AFAIK.
  • In my RStudio console, read_csv is now outputting what looks like a time (0s). Why? I think this started showing up after my last package update.
  • In my latest test of 15 files, I received 8 warnings (the 0s time is output after the start of the warning message - unsure why?):
Warning messages:                                0s
1: One or more parsing issues, see `problems()` for details 
2: One or more parsing issues, see `problems()` for details 
3: One or more parsing issues, see `problems()` for details 
4: One or more parsing issues, see `problems()` for details 
5: One or more parsing issues, see `problems()` for details 
6: One or more parsing issues, see `problems()` for details 
7: One or more parsing issues, see `problems()` for details 
8: One or more parsing issues, see `problems()` for details 

I'm pretty sure that the warnings relate to instances where a CSV file has more than 8 columns. But!
I can't get any output from problems() so I can't tell what is happening.

> problems()
> 

It doesn't seem to matter if I restrict the map_dfr call to just a single filename; I still can't view any output from problems.

read_csv IS respecting the argument to select only columns 1:7 in the read, but it ISN'T stopping errors from being created from files which have more than 7 columns, which I thought was the purpose of using col_select in the first place.

How can I get these warnings sorted? I previously wrapped this in suppressWarnings but realised it was masking some other real parsing errors I needed to fix, which I've now done.

For the parsing issues, it kind of sounds like read_csv is encountering some items that don't fit the column type.

If type casting is the issue, you can set every column to 'col_character()' and see whether you still get the warnings. Alternatively, you could leave the column types undefined and set the argument 'guess_max' to something quite large. Then, if readr finds a letter in a number column within the rows it inspects, it'll treat the whole column as a character column instead.
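To make those two options concrete, here is a minimal sketch. It assumes readr 2.x is installed; the temporary file and its contents are made up for illustration and are not your data.

```r
library(readr)

# Hypothetical file: "Hours" has a stray letter in the second data row.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Project,Hours", "A,7.5", "B,x8", "C,6"), tmp)

# Option 1: read every column as character so no value can fail conversion.
all_chr <- read_csv(tmp, col_types = cols(.default = col_character()))

# Option 2: let readr guess the types, inspecting up to guess_max rows.
# The non-numeric value makes the guess for "Hours" fall back to character.
guessed <- read_csv(tmp, guess_max = 10000, show_col_types = FALSE)

str(all_chr)
str(guessed)
```

If the warnings persist even with everything read as character, the cause is probably not type conversion.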

I find it helpful to read in each CSV to a list before binding them, so I can inspect each. That would look something like:

df_csv <- csvpaths %>% map(
  ~ read_csv(., 
             skip_empty_rows = TRUE,
             col_names = TRUE,
             col_select = 1:7,
             col_types = cols_only(
               Project = col_character(),
               Date = col_character(),
               Employee = col_character(),
               Role = col_character(),
               Rate = col_double(),
               Hours = col_double(),
               Amount = col_double()
               )))

Then you can check the column types of each one like:

df_csv %>% map(str)

If they look right, then you can bind them:

df_csv <- df_csv %>% bind_rows(.id = "Source") 

Thanks, @Hayward, for responding and for your suggestions!

I've tried the inspection tactic you suggested. All of the files have columns being parsed correctly, in that they each have 7 columns read, and those columns have the correct types. Casting all to 'col_character()' didn't make the parsing warnings go away.

I've now mucked around with many other ways to do this:
My directory has 232 .csv files with sizes ranging from 4-211 rows.

  • I've tested with a for loop that each individual file parses correctly with my column type mapping and selection as in the original post. They all do.
  • I've tried structuring the map slightly differently to pipe the list of filenames to map_dfr(~read_csv(.,[args])) and I still get the parsing errors and note about warnings(), which is empty.
  • I've tried using sapply to apply the read_csv function to the list of filenames, a version of what you've suggested here that is simpler for me to understand. I've used a very simple syntax here:
    sapply(csvpaths, read_csv, col_names = TRUE, col_types = cols(.default = col_character()))
    This works, with no parsing errors. I note that the output is a list of tibbles with variable rows, and 8 columns. ONE FILE has 7 columns! All have the right column names, and are cast to character types, so they should bind correctly, but...
  • taking the output of this sapply and piping to bind_rows throws the same warning about parsing errors, and the warnings reference problems(), which is still empty.

I'm going crazy over this, since I want to receive any parse errors that are meaningful, but getting more than 50 of them every time, and getting the "correct" output in the end, is maddening.

I hadn't encountered problems() before, so I went and took a look at the readr documentation.

Looks like 'read_csv' calls parsing functions of the form 'parse_*()'. If those parsing functions encounter an issue, they populate an attribute that you can access with problems(). However, I don't know where the problems attribute gets stored as you loop through with map.
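Here is a small sketch of what I mean, assuming readr 2.x. As far as I can tell from the docs, the signature is problems(x = .Last.value), so calling it with no argument only looks at whatever was last evaluated at the console, which may explain the empty result. The file below is made up: its last row has one more field than the header, similar to your unnamed extra columns.

```r
library(readr)

# Hypothetical ragged file: the last row has more fields than the header.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Project,Hours", "A,7.5", "B,8,stray text"), tmp)

# The read itself raises the "parsing issues" warning; it is suppressed here
# only because the issues are inspected explicitly on the next lines.
one_file <- suppressWarnings(
  read_csv(tmp, col_types = cols(.default = col_character()))
)

# Pass the data frame itself instead of relying on the default argument.
issues <- problems(one_file)
print(issues)
```

The printed tibble should show which row had an unexpected number of columns.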

It may help to try the function 'stop_for_problems()'. I think it would be called as follows, but I didn't test it to make sure.

df_csv <- csvpaths %>% map(
    ~ stop_for_problems(read_csv(., 
               skip_empty_rows = TRUE,
               col_names = TRUE,
               col_select = 1:7,
               col_types = cols_only(
                   Project = col_character(),
                   Date = col_character(),
                   Employee = col_character(),
                   Role = col_character(),
                   Rate = col_double(),
                   Hours = col_double(),
                   Amount = col_double()
               ))))

It's hard to troubleshoot without a dataset, so I can't find what works and give you an exact answer. However, I'm guessing that you may be able to access a problems attribute with 'problems()' if you pass in the dataframe where the mapping fails. Say it fails on the 10th csv path. Then maybe this would report out problems()?

problems(df_csv[[10]])

If that doesn't work and you can't tell which dataframe has parsing failures, one last thought might be to try reporting on everything from your original read-in-dataframes (meaning prior to attempting to incorporate the function stop_for_problems()):

df_csv %>% map(problems)
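To show how that could be turned into a single report, here is a sketch under some assumptions: readr 2.x, purrr and dplyr are installed, and the two files and the folder are invented for the example. Naming the paths by file name is my own choice, so that the "Source" column says which file each issue came from.

```r
library(readr)
library(purrr)
library(dplyr)

# Two hypothetical files: one clean, one with an extra unnamed field.
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
writeLines(c("Project,Hours", "A,7.5"), file.path(demo_dir, "clean.csv"))
writeLines(c("Project,Hours", "B,8,stray text"), file.path(demo_dir, "ragged.csv"))

csvpaths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
names(csvpaths) <- basename(csvpaths)

# Keep the per-file data frames so each one retains its own problems.
dfs <- map(csvpaths, function(path) {
  suppressWarnings(read_csv(path, col_types = cols(.default = col_character())))
})

# One tibble of issues labelled by file name; clean files contribute no rows.
all_issues <- dfs %>% map(problems) %>% bind_rows(.id = "Source")
print(all_issues)
```

The key point is to call problems() on the individual data frames before binding them, since the bound result does not seem to carry the problems along.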

(Side note: the '~' in my code above just means I'm making a simple function but am too lazy to create a name for it and write out 'function(){ }'. Also, the '.' means 'put here whatever data gets piped in from map'. In this case it's a path.)
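A tiny sketch of that shorthand, assuming only that purrr is installed; the numbers are arbitrary:

```r
library(purrr)

# Formula shorthand: "." stands for the current element.
short <- map(1:3, ~ . + 1)

# The same thing written as an explicit anonymous function.
long <- map(1:3, function(x) x + 1)

identical(short, long)  # TRUE
```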