Read CSV files with quotation marks within string variables

readr

#1

Hi all!

I'm trying to read a semicolon separated csv file with Swedish politicians from the Swedish Election Authority, using the following code.

library(tidyverse)
candidates <- read_csv2("https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv",
                  locale=locale("sv",encoding="ISO-8859-1")) 

I get a warning about 63 parsing errors and the following warning message:
In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)

I think this is because of how some of the string variables are formatted in the csv file. They are not enclosed in quotation marks, but they can contain quotation marks. The file seems to be read correctly, but is there a way to read it that does not produce any warnings? I assume I should somehow tell R to ignore which characters appear between the semicolons.

Also, how can I see all 63 parsing errors? The output is truncated and I only see 2.

Any help would be greatly appreciated!

Best,
Richard


#2

One of the benefits of using reprex (short for reproducible example) is that it gives the full warning message, which, in this case, includes the answer to your question above! :tada:

See problems(...) for more details. In this case, that's problems(candidates). You're only seeing the first few rows here in the reprex, but you can run it locally and assign the result to a name for further inspection. (All of this is also in the readr function reference for read_delim(), of which read_csv2() is just a special case.)

https://readr.tidyverse.org/reference/read_delim.html
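
As a small illustration of the problems() workflow, here is a sketch on made-up data rather than the election file. The file contents and the column names NAMN and ALDER are assumptions for the example. Passing n = Inf to print() is the usual way to show every row of a tibble instead of only the first ten.

```r
library(readr)

# Hypothetical stand-in for the real file: the ALDER column contains one
# value that cannot be parsed as an integer.
tmp <- tempfile(fileext = ".csv")
writeLines(c("NAMN;ALDER", "Anna;31", "Bo;unknown", "Cia;28"), tmp)

# Forcing ALDER to integer triggers a parsing failure for the second data row.
# The warning is suppressed here only to keep the example output short.
people <- suppressWarnings(
  read_delim(tmp, delim = ";",
             col_types = cols(NAMN = col_character(), ALDER = col_integer()))
)

# problems() returns a tibble of failures; n = Inf prints all of its rows.
probs <- problems(people)
print(probs, n = Inf)
```

For the real data the same pattern would be print(problems(candidates), n = Inf).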

I also ran glimpse() in the reprex below for a peek into what the columns all look like.

library(tidyverse)
candidates <- read_csv2("https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv",
                        locale=locale("sv",encoding="ISO-8859-1"))
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   PARTIKOD = col_double(),
#>   LISTNUMMER = col_double(),
#>   ORDNING = col_double(),
#>   KANDIDATNUMMER = col_double(),
#>   ÅLDER_PÅ_VALDAGEN = col_double(),
#>   ANT_BEST_VALS = col_double(),
#>   VALBAR_PÅ_VALDAGEN = col_logical()
#> )
#> See spec(...) for full column specifications.
#> Warning: 63 parsing failures.
#>   row              col           expected actual                                                                 file
#> 24663 VALSEDELSUPPGIFT delimiter or quote      , 'https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv'
#> 24663 VALSEDELSUPPGIFT delimiter or quote      R 'https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv'
#> 94397 VALSEDELSUPPGIFT delimiter or quote      , 'https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv'
#> 94397 VALSEDELSUPPGIFT delimiter or quote      F 'https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv'
#> 94397 VALSEDELSUPPGIFT delimiter or quote        'https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv'
#> ..... ................ .................. ...... ....................................................................
#> See problems(...) for more details.
problems(candidates)
#> # A tibble: 63 x 5
#>      row col              expected           actual file                  
#>    <int> <chr>            <chr>              <chr>  <chr>                 
#>  1 24663 VALSEDELSUPPGIFT delimiter or quote ,      'https://data.val.se/…
#>  2 24663 VALSEDELSUPPGIFT delimiter or quote R      'https://data.val.se/…
#>  3 94397 VALSEDELSUPPGIFT delimiter or quote ,      'https://data.val.se/…
#>  4 94397 VALSEDELSUPPGIFT delimiter or quote F      'https://data.val.se/…
#>  5 94397 VALSEDELSUPPGIFT delimiter or quote " "    'https://data.val.se/…
#>  6 94397 VALSEDELSUPPGIFT delimiter or quote F      'https://data.val.se/…
#>  7 94397 VALSEDELSUPPGIFT delimiter or quote " "    'https://data.val.se/…
#>  8 94397 VALSEDELSUPPGIFT delimiter or quote F      'https://data.val.se/…
#>  9 94397 VALSEDELSUPPGIFT delimiter or quote " "    'https://data.val.se/…
#> 10 94397 VALSEDELSUPPGIFT delimiter or quote F      'https://data.val.se/…
#> # ... with 53 more rows
glimpse(candidates)
#> Observations: 119,920
#> Variables: 24
#> $ VALTYP             <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R"...
#> $ VALOMRÅDESKOD      <chr> "00", "00", "00", "00", "00", "00", "00", "...
#> $ VALOMRÅDESNAMN     <chr> "HELA LANDET", "HELA LANDET", "HELA LANDET"...
#> $ VALKRETSKOD        <chr> "01", "01", "01", "01", "01", "01", "01", "...
#> $ VALKRETSNAMN       <chr> "Stockholms kommun", "Stockholms kommun", "...
#> $ PARTIBETECKNING    <chr> "Alternativ för Sverige", "Alternativ för S...
#> $ PARTIFÖRKORTNING   <chr> "AfS", "AfS", "AfS", "AfS", "AfS", "AfS", "...
#> $ PARTIKOD           <dbl> 1325, 1325, 1325, 1325, 1325, 1325, 1325, 1...
#> $ VALSEDELSSTATUS    <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S"...
#> $ LISTNUMMER         <dbl> 14843, 14843, 14843, 14843, 14843, 14843, 1...
#> $ ORDNING            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
#> $ ANMKAND            <chr> "J", "J", "J", "J", "J", "J", "J", "J", "J"...
#> $ ANMDELTAGANDE      <chr> "J", "J", "J", "J", "J", "J", "J", "J", "J"...
#> $ SAMTYCKE           <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I"...
#> $ FÖRKLARING         <chr> "J", "J", "J", "J", "J", "J", "J", "J", "J"...
#> $ KANDIDATNUMMER     <dbl> 355149, 355150, 355151, 355152, 355153, 355...
#> $ NAMN               <chr> "Gustav Kasselstrand", "William Hahne", "Je...
#> $ ÅLDER_PÅ_VALDAGEN  <dbl> 31, 26, 28, 28, 29, 58, 32, 33, 52, 31, 53,...
#> $ KÖN                <chr> "M", "M", "K", "M", "M", "K", "M", "M", "K"...
#> $ FOLKBOKFÖRINGSORT  <chr> "Stockholm", "Stockholm", "Nyköping", "Nykö...
#> $ VALSEDELSUPPGIFT   <chr> "ekonom, Stockholm", "militär, Stockholm", ...
#> $ ANT_BEST_VALS      <dbl> 1e+07, 1e+07, 1e+07, 1e+07, 1e+07, 1e+07, 1...
#> $ VALBAR_PÅ_VALDAGEN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ GILTIG             <chr> "J", "J", "J", "J", "J", "J", "J", "J", "J"...

Created on 2018-05-03 by the reprex package (v0.2.0).

As for the quotation mark situation, this issue in the readr GitHub repo has an explanation from Jim.


#3

Thank you for the suggestion to use reprex, and for the link to Jim's answer. Using that information I was able to read the file without any warnings using the following code.

library(tidyverse)

candidates <- read_csv2("https://data.val.se/val/val2018/valsedlar/partier/kandidaturer.skv",
                  locale=locale("sv",encoding="ISO-8859-1"), quote="")
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   PARTIKOD = col_integer(),
#>   LISTNUMMER = col_integer(),
#>   ORDNING = col_integer(),
#>   KANDIDATNUMMER = col_integer(),
#>   ÅLDER_PÅ_VALDAGEN = col_integer(),
#>   ANT_BEST_VALS = col_integer()
#> )
#> See spec(...) for full column specifications.

Created on 2018-05-04 by the reprex package (v0.2.0).
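
To make the effect of quote = "" easier to see, here is a minimal sketch on made-up data (the rows and the name Bo are invented for illustration, not taken from the election file). With quoting disabled, quotation marks inside a field should simply be kept as ordinary characters.

```r
library(readr)

# Made-up semicolon-separated data; the second data row has quotation marks
# inside an unquoted field, similar to the VALSEDELSUPPGIFT column.
tmp <- tempfile(fileext = ".csv")
writeLines(c("NAMN;VALSEDELSUPPGIFT",
             "Anna;ekonom, Stockholm",
             "Bo;snickare \"Bosse\", Falun"), tmp)

# quote = "" tells readr that no character marks quoted fields.
demo <- read_csv2(tmp, quote = "",
                  col_types = cols(.default = col_character()))

# The quotation marks are part of the string value.
demo$VALSEDELSUPPGIFT
```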

However, as indicated in the discussion on GitHub, this solution could cause problems if there are any semicolons in the string variables (but that might cause problems in any case).
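
One way to check for that risk before reading the file is to count the fields on each line with quoting disabled. This is only a sketch on made-up data, using base R's count.fields() rather than anything from readr. For the real file you would first download it to a local path.

```r
# Made-up data: the last row contains a semicolon inside a quoted field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("NAMN;VALSEDELSUPPGIFT",
             "Anna;ekonom, Stockholm",
             "Bo;\"snickare; Falun\""), tmp)

# With quote = "" every semicolon counts as a separator, which mirrors how
# the file is split when quoting is turned off.
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")
table(n_fields)

# Line numbers whose field count differs from the header line.
which(n_fields != n_fields[1])
```

If every line has the same number of fields as the header, stray semicolons are unlikely to be shifting columns.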

All the best,
Richard