read_csv (readr) output empty dataframe with expected headers if actual headers don't all match definition

Hi,

I am combining multiple csvs and have an issue whereby some of the older csvs have some different headers from those named in colTypes. I would like to simply gracefully output an empty dataframe/tibble, with the expected headers, when the csv read in doesn't exactly match all headers in my specification.

In the reprex below, ok has the desired headers, not_ok does not and should instead be handled such that it returns an empty dataframe/tibble with the expected headers.

The idea being that the flow of csv combining doesn't fall over with these older files but carries on until all files are handled.

Please can someone advise on how I might go about this? Happy to clarify anything.

I imagine I perhaps need to write a wrapper function that tests for a specific warning and handles accordingly? I will end up with a lot of nested function calls :flushed: though.

I can' t see any arguments within read_csv itself that would help with this scenario.

Here is the reprex:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3

colTypes = cols(
  REPORTING_PERIOD_START = col_character(),
  REPORTING_PERIOD_END = col_character(),
  STATUS = col_character(),
  BREAKDOWN = col_character(),
  PRIMARY_LEVEL = col_character(),
  PRIMARY_LEVEL_DESCRIPTION = col_character(),
  SECONDARY_LEVEL = col_character(),
  SECONDARY_LEVEL_DESCRIPTION = col_character(),
  MEASURE_ID = col_character(),
  MEASURE_NAME = col_character(),
  MEASURE_VALUE = col_character()
)

ok <- 'https://files.digital.nhs.uk/18/C92DD8/MHSDS%20Data_AugPrf_2020.csv'
not_ok <-  "https://files.digital.nhs.uk/97/0C4714/MHSDS%20Monthly_File_Dec_2018_Final.csv"

# works as headers all match...wish to keep as is
keep <- read_csv(ok, col_types = colTypes)

# would like empty df with expected names here instead of warning and actual file content
return_empty <- read_csv(not_ok, col_types = colTypes)
#> Warning: The following named parsers don't match the column names:
#> REPORTING_PERIOD_START, REPORTING_PERIOD_END, MEASURE_ID, MEASURE_NAME,
#> MEASURE_VALUE

Created on 2020-12-02 by the reprex package (v0.3.0)


In case it matters, the csvs are being combined as follows:

combined_csvs <- future_map_dfr(.x = files_df$links, 
                                .f = ~ cleaned_csv2(
                                  read_csv(get_url2(.x, '[title*="MHSDS Data File"],[title*="MHSDS Monthly Data File"]'), col_types = colTypes)      
                                ), seed=TRUE)

In this case, ok or not_ok would be being returned by:

get_url2(.x, '[title*="MHSDS Data File"],[title*="MHSDS Monthly Data File"]')

files_df$links is the input list of files and cleaned_csv2 does some column type conversions.

Would something like this work for you?

library(readr)
colTypes = cols(
  REPORTING_PERIOD_START = col_character(),
  REPORTING_PERIOD_END = col_character(),
  STATUS = col_character(),
  BREAKDOWN = col_character(),
  PRIMARY_LEVEL = col_character(),
  PRIMARY_LEVEL_DESCRIPTION = col_character(),
  SECONDARY_LEVEL = col_character(),
  SECONDARY_LEVEL_DESCRIPTION = col_character(),
  MEASURE_ID = col_character(),
  MEASURE_NAME = col_character(),
  MEASURE_VALUE = col_character()
)

EmptyDF <- data.frame(REPORTING_PERIOD_START=NA,
                      REPORTING_PERIOD_END=NA,
                      STATUS=NA,
                      BREAKDOWN=NA,
                      PRIMARY_LEVEL=NA,
                      PRIMARY_LEVEL_DESCRIPTION=NA,
                      SECONDARY_LEVEL=NA,
                      SECONDARY_LEVEL_DESCRIPTION=NA,
                      MEASURE_ID=NA,
                      MEASURE_NAME=NA ,
                      MEASURE_VALUE=NA)

ok <- 'https://files.digital.nhs.uk/18/C92DD8/MHSDS%20Data_AugPrf_2020.csv'
not_ok <-  "https://files.digital.nhs.uk/97/0C4714/MHSDS%20Monthly_File_Dec_2018_Final.csv"
#Good <- read_csv(ok, col_types = colTypes)

return_empty <- tryCatch(read_csv(not_ok, col_types = colTypes), warning=function(e) EmptyDF)

Can I just check that you want to keep data from the historic columns in the MHSDS files where they match the current?

Also, just checking, are you scraping these directly from the website? I used:

download.file(
  "https://files.digital.nhs.uk/18/C92DD8/MHSDS%20Data_AugPrf_2020.csv",
  destfile = "MHSDS.csv",
  method = "wininet",
  mode = "wb")

ok <- read_csv("MHSDS.csv")

download.file(
  "https://files.digital.nhs.uk/97/0C4714/MHSDS%20Monthly_File_Dec_2018_Final.csv",
  destfile = "MHSDS2018.csv",
  method = "wininet",
  mode = "wb")

not_ok <- read_csv("MHSDS2018.csv")

Hi,

No, I want to reject the file if the headers are not exactly the same (I looked at using cols_only which keeps matched columns) . I decided on an empty dataframe as output in this case so I can continue combining files with my existing code then is.na filter out empties at end.

1 Like

Thank you. I will trying implementing this and testing.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

Thanks for the swift reply. I appreciate your help. This was great. I just needed to alter the NAs so I didn't get a logical versus character comparison error.

Thanks

You can stipulate the type of NA so you can make PRIMARY_LEVEL in to a character NA:

EmptyDF <- data.frame(REPORTING_PERIOD_START=NA,
                      REPORTING_PERIOD_END=NA,
                      STATUS=NA,
                      BREAKDOWN=NA,
                      PRIMARY_LEVEL=NA_character_,
                      PRIMARY_LEVEL_DESCRIPTION=NA,
                      SECONDARY_LEVEL=NA,
                      SECONDARY_LEVEL_DESCRIPTION=NA,
                      MEASURE_ID=NA,
                      MEASURE_NAME=NA ,
                      MEASURE_VALUE=NA)
2 Likes