CSV file with Notes at the end of the file

I'm importing a file from an Open data portal that has Notes at the end embedded in the file.
I know I can use a loop and delete every row after the Notes line but I was wondering if there was a more elegant solution to reading in the file to get it to stop reading at the first blank row? Or when it encounters the Notes.

In this particular case the Notes line is on line 93 and anything after should be deleted from the data set.

#Table 1 - Approved Canada Emergency Wage Subsidy (CEWS) claims by period and province/territory of business address
cews_prov_src = "https://www.canada.ca/content/dam/cra-arc/serv-info/tax/business/topics/cews/statistics/cews_tbl1.csv"

cews_prov_raw = read_csv(url(cews_prov_src), col_names = c("Claim_Period", 
                                                           "Province", 
                                                           "Applications_Approved_YTD", 
                                                           "Number_Eligible_Employees", 
                                                           "Number_Eligible_Leave_with_Pay", 
                                                           "Number_Employees_Supported", 
                                                           "CEWS_Approved_YTD", 
                                                           "Average_CEWS_Per_Employee", 
                                                           "Percent_Approved_Applications_Period"), 
                         skip=2)

I wrote a little function (drop_after_empty_line()), its not terribly elegant, but it works so long as you are certain there is a blank line at the end of each CSV. Feel free to change as needed. It uses a regex to find a line of just commas, signifying an empty line, and then returns the data with that line and all lines after it removed. You can pass through skip or col_names

library(tidyverse)

cews_prov_src = "https://www.canada.ca/content/dam/cra-arc/serv-info/tax/business/topics/cews/statistics/cews_tbl1.csv"

drop_after_empty_line <- function(path, skip = 0, col_names = TRUE) {
  lines <- read_lines(path, skip = skip) 
  
  stop_before <- 
    lines %>% 
    map_lgl(~str_detect(.x, "^\\,+$")) %>% 
    which()
  
  lines[1:stop_before - 1] %>% 
    rbind() %>% 
    read_csv(col_names = col_names)
}

drop_after_empty_line(url(cews_prov_src), skip = 2, col_names = FALSE)
#> # A tibble: 91 x 9
#>    X1              X2                 X3 X4     X5    X6          X7 X8    X9   
#>    <chr>           <chr>           <dbl> <chr>  <chr> <chr>    <dbl> <chr> <chr>
#>  1 P1: March 15 t… Newfoundland a…  2670 30000  -     30000   5.51e7 1838  0.01…
#>  2 P1: March 15 t… Prince Edward …  1010 11930  -     11930   1.88e7 1574  0.00…
#>  3 P1: March 15 t… Nova Scotia      4750 64330  -     64330   1.19e8 1850  0.02…
#>  4 P1: March 15 t… New Brunswick    4060 47540  -     47540   9.06e7 1906  0.01…
#>  5 P1: March 15 t… Quebec          43750 598340 -     598340  1.22e9 2039  0.19…
#>  6 P1: March 15 t… Ontario         87730 13952… -     13952…  2.85e9 2040  0.39…
#>  7 P1: March 15 t… Manitoba         6300 116160 -     116160  2.48e8 2132  0.028
#>  8 P1: March 15 t… Saskatchewan     5940 61920  -     61920   1.24e8 1999  0.02…
#>  9 P1: March 15 t… Alberta         31740 460870 -     460870  1.08e9 2340  0.14…
#> 10 P1: March 15 t… British Columb… 35890 424640 -     424640  8.69e8 2046  0.15…
#> # … with 81 more rows

Created on 2020-10-14 by the reprex package (v0.3.0)

2 Likes

You can try this

#Table 1 - Approved Canada Emergency Wage Subsidy (CEWS) claims by period and province/territory of business address
cews_prov_src = "https://www.canada.ca/content/dam/cra-arc/serv-info/tax/business/topics/cews/statistics/cews_tbl1.csv"

data1 <- read.csv(cews_prov_src, skip = 2, col.names = c("Claim_Period", 
                                                       "Province", 
                                                       "Applications_Approved_YTD", 
                                                       "Number_Eligible_Employees", 
                                                       "Number_Eligible_Leave_with_Pay", 
                                                       "Number_Employees_Supported", 
                                                       "CEWS_Approved_YTD", 
                                                       "Average_CEWS_Per_Employee", 
                                                       "Percent_Approved_Applications_Period"),
                strip.white = TRUE)

data <- data1[1:(grep("^Notes:", data1[, 1]) - 1), ]

or perhaps even better

data <- data1[1:(grep("^Notes:", data1[, 1]) - 2), ]

Hope this helps.