Merge pieces into one column using read_csv

doctor_g · December 15, 2017, 11:34am

I have a file like this

ID, Type, Value
1, 27, Startup
2, 28, Load system
3, 28, Initialise system
4, 10, Processing data
5, 10, Processed data, written result to file
6, 30, Shutdown

You can see the Value column is not quoted and sometimes contains commas. Reading this using read_csv generates:

Parsed with column specification:
cols(
ID = col_integer(),
Type = col_integer(),
Value = col_character()
)
Warning: 1 parsing failure.
row # A tibble: 1 x 5 col row col expected actual file expected actual 1 5 3 columns 4 columns 'eg.csv' file # A tibble: 1 x 5

Warning message:
In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 2)

Ideally there'd be some option to read_csv akin to separate's extra = "merge". The only behaviour I'm aware is available is akin to separates extra = "warn" with the result being:

# A tibble: 6 x 3
     ID  Type             Value
  <int> <int>             <chr>
1     1    27           Startup
2     2    28       Load system
3     3    28 Initialise system
4     4    10   Processing data
5     5    10    Processed data
6     6    30          Shutdown

where the offending row has been truncated.

yonicd · December 15, 2017, 2:05pm

you are using the wrong format then to read in the data since you are using the active delimiter in two forms here.

you can either replace the "fake" comma with another delimiter and switch it back after reading in the data, or you can make the line delimiter something else eg tab and then the comma will be read in correctly.

here is option A (note that this assumes your offending column is the last one)

txt <- ("ID, Type, Value\n1, 27, Startup\n2, 28, Load system\n3, 28, Initialise system\n4, 10, Processing data\n5, 10, Processed data, written result to file\n6, 30, Shutdown")

cat(txt,file = 'temp.csv',sep = '\n')

my_lines <- readLines('temp.csv')

#number of assumed columns in data
colnums <- 3

#locate bad line(s)
idx <- stringi::stri_count(my_lines,regex = ',')>(colnums-1)

#replace delimiter
my_lines[idx] <- stringi::stri_replace_last(my_lines[idx],'XXX',regex=',')

#collapse back
new_txt <- paste0(my_lines,collapse = '\n')

problem_data <- readr::read_csv(new_txt)
stringi::stri_replace_all(problem_data$Value,',',regex='XXX')

[1] "Startup"                               
[2] "Load system"                           
[3] "Initialise system"                     
[4] "Processing data"                       
[5] "Processed data, written result to file"
[6] "Shutdown"

doctor_g · December 15, 2017, 2:16pm

Sorry, I guess I should have been clearer that I was asking if there were a read_csv solution (such as with an argument akin to that of separate). Of course one can read the lines using readLine and parse them that way.

yonicd · December 15, 2017, 2:17pm

then ==> you are using the wrong format then to read in the data since you are using the active delimiter in two forms here.

doctor_g · December 15, 2017, 2:25pm

The reading format is "right". It's the lack of quoting of the text column (that, it turns out, can contain commas) that's the issue. Commas are the column delimiters. So either read_csv (or your comma-separated input function of choice) can say "Okay, I'll separate on the first k commas, then I'll just lump everything together" or else you need to parse each line yourself (as in your readLines approach).

I'm not really interested in rights and wrongs of who's doing what. I'm just asking if read_csv has any way of dealing with this (even if it's creating unnamed additional columns that you can deal with after).

cderv · December 15, 2017, 2:53pm

I don't think read_csv can deal with that currently. Could be an intereing feature to add but need some thought to come up with a solution. Although I don't know if it is something that readr could deal with as it seems like something like a special case. Using delimiter also inside a column without quoting does not seems like a correct text file.

In the meantime, you can play with the tidyverse to come up with a wrapper function as stringr::str_split as the merge functionnality that you are looking for.

txt <- ("ID, Type, Value\n1, 27, Startup\n2, 28, Load system\n3, 28, Initialise system\n4, 10, Processing data\n5, 10, Processed data, written result to file\n6, 30, Shutdown")

library(tidyverse)
tibble(txt = read_lines(txt)) %>%
  separate(txt, into = str_split_fixed(slice(., 1), ",", 3), extra = "merge") %>%
  slice(-1)
#> # A tibble: 6 x 3
#>      ID ` Type`                               ` Value`
#>   <chr>   <chr>                                  <chr>
#> 1     1      27                                Startup
#> 2     2      28                            Load system
#> 3     3      28                      Initialise system
#> 4     4      10                        Processing data
#> 5     5      10 Processed data, written result to file
#> 6     6      30                               Shutdown

Created on 2017-12-15 by the reprex package (v0.1.1.9000).

mara · December 15, 2017, 2:54pm

Given you have non-delimiting commas in the file, I'd think a different readr function would make more sense. I don't know how to add that column on ingest, but maybe take a look at the col_specs section of read_delim.R (which includes read_csv, as it's just a special case of read_delim):

github.com

tidyverse/readr/blob/main/R/read_delim.R#L188


      
          #'   list(rlib_name_repair_verbosity = "quiet"),
          #'   read_csv(I("x,x\n1,2\n3,4"))
          #' )
          #'
          #' # Or use "minimal" to turn off name repair
          #' read_csv(I("x,x\n1,2\n3,4"), name_repair = "minimal")
          #'
          #' # File types ----------------------------------------------------------------
          #' read_csv(I("a,b\n1.0,2.0"))
          #' read_csv2(I("a;b\n1,0;2,0"))
          #' read_tsv(I("a\tb\n1.0\t2.0"))
          #' read_delim(I("a|b\n1.0|2.0"), delim = "|")
          read_delim <- function(file, delim = NULL, quote = '"',
                                 escape_backslash = FALSE, escape_double = TRUE,
                                 col_names = TRUE, col_types = NULL,
                                 col_select = NULL,
                                 id = NULL,
                                 locale = default_locale(),
                                 na = c("", "NA"), quoted_na = TRUE,
                                 comment = "", trim_ws = FALSE,
                                 skip = 0, n_max = Inf, guess_max = min(1000, n_max),

yonicd · December 15, 2017, 2:59pm

txt <- ("ID, Type, Value\n1, 27, Startup\n2, 28, Load system\n3, 28, Initialise system\n4, 10, Processing data\n5, 10, Processed data, written result to file\n6, 30, Shutdown")

cat(txt,file = 'temp.csv',sep='\n')
 
 read.delim('temp.csv', header=FALSE, sep=',', dec = '.', 
            stringsAsFactors=F , quote = "\"" ,   
            fill = TRUE , skip = 2 )
  V1 V2                 V3                      V4
1  2 28        Load system                        
2  3 28  Initialise system                        
3  4 10    Processing data                        
4  5 10     Processed data  written result to file
5  6 30           Shutdown                        
 
 readr::read_delim(file = 'temp.csv',
                   delim=',',
                   skip = 2,
                   col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_integer(),
  X2 = col_character(),
  X3 = col_character()
)
Warning: 1 parsing failure.
row # A tibble: 1 x 5 col     row   col  expected    actual       file expected   <int> <chr>     <chr>     <chr>      <chr> actual 1     4  <NA> 3 columns 4 columns 'temp.csv' file # A tibble: 1 x 5

# A tibble: 5 x 3
     X1    X2                 X3
  <int> <chr>              <chr>
1     2    28        Load system
2     3    28  Initialise system
3     4    10    Processing data
4     5    10     Processed data
5     6    30           Shutdown
Warning message:
In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 2)