Parsing errors with read_csv

Hello, I'm new to R and I'm trying to use the read_csv function. However, it gives me parsing errors. I used the problems() function to view the parsing errors, but I have no idea what the errors mean or how to fix them. I've searched for this issue on RStudio Community but have not been able to find an explanation. Any help would be appreciated. I created a reprex below. Thank you!

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.1
#> Warning: package 'tidyr' was built under R version 3.6.1
#> Warning: package 'dplyr' was built under R version 3.6.1
#> Warning: package 'stringr' was built under R version 3.6.1
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

# set directory and import data
directory <- "H:/Research/Ophtho"
file_path <- file.path(directory, "Input 2019-10-07_1951 All Raw Data.csv")
# file_path is already a single string, so paste0() is not needed
data <- read_csv(file_path)
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   mcn_id = col_character(),
#>   redcap_event_name = col_character(),
#>   first_name = col_character(),
#>   last_name = col_character(),
#>   dob = col_date(format = ""),
#>   mcn = col_character(),
#>   exam_date = col_date(format = ""),
#>   exam_va_od = col_character(),
#>   exam_va_od_pm = col_character(),
#>   exam_va_od_other = col_character(),
#>   exam_va_os = col_character(),
#>   exam_va_os_pm = col_character(),
#>   exam_va_os_other = col_character(),
#>   exam_iop_od = col_character(),
#>   exam_iop_os = col_character(),
#>   exam_cup_disc_od = col_character(),
#>   exam_cup_disc_os = col_character(),
#>   exam_ecc_date = col_date(format = ""),
#>   exam_oct_date = col_date(format = ""),
#>   exam_vf_od = col_character()
#>   # ... with 66 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 24 parsing failures.
#>  row                      col           expected     actual                                                                        file
#> 1155 gdd_suprachoroidal_dt    1/0/T/F/TRUE/FALSE 2003-03-26 'H:/Research/Ophtho/Input 2019-10-07_1951 All Raw Data.csv'
#> 1202 exam_oct_os              a double           x          'H:/Research/Ophtho/Input 2019-10-07_1951 All Raw Data.csv'
#> 1203 cornea_compli_fail_uncer 1/0/T/F/TRUE/FALSE 2          'H:/Research/Ophtho/Input 2019-10-07_1951 All Raw Data.csv'
#> 1288 gdd_suprachoroidal_dt    1/0/T/F/TRUE/FALSE 2001-06-13 'H:/Research/Ophtho/Input 2019-10-07_1951 All Raw Data.csv'
#> 1421 exam_tube_length_od      1/0/T/F/TRUE/FALSE 2          'H:/Research/Ophtho/Input 2019-10-07_1951 All Raw Data.csv'
#> .... ........................ .................. .......... ...........................................................................
#> See problems(...) for more details.

problems(data)
#> # A tibble: 24 x 5
#>      row col        expected    actual                file                 
#>    <int> <chr>      <chr>       <chr>                 <chr>                
#>  1  1155 gdd_supra~ 1/0/T/F/TR~ 2003-03-26            'H:/Research/Ophtho ~
#>  2  1202 exam_oct_~ a double    x                     'H:/Research/Ophtho ~
#>  3  1203 cornea_co~ 1/0/T/F/TR~ 2                     'H:/Research/Ophtho ~
#>  4  1288 gdd_supra~ 1/0/T/F/TR~ 2001-06-13            'H:/Research/Ophtho ~
#>  5  1421 exam_tube~ 1/0/T/F/TR~ 2                     'H:/Research/Ophtho ~
#>  6  1640 gdd_infec~ 1/0/T/F/TR~ 2005-09-19            'H:/Research/Ophtho ~
#>  7  1743 gdd_perce~ 1/0/T/F/TR~ 2005-09-13            'H:/Research/Ophtho ~
#>  8  1862 exam_ecc_~ a double    AC deep with 2+ cell~ 'H:/Research/Ophtho ~
#>  9  2113 exam_oct_~ a double    LP                    'H:/Research/Ophtho ~
#> 10  2165 gdd_infec~ 1/0/T/F/TR~ 2014-01-30            'H:/Research/Ophtho ~
#> # ... with 14 more rows

Created on 2019-10-07 by the reprex package (v0.3.0.9000)

The output of problems(data) says, for example, that in row 1155, in the column whose name starts with gdd_supra, it is expecting to find TRUE/FALSE values (which may also be represented by 1/0 or T/F) but it is finding 2003-03-26.
Similarly, in row 1202, in the column that starts with exam_oct_os, it is expecting a decimal number (that is what double means, basically) but it is finding x.

You need to open your data file in a program that shows you which row you are in (a spreadsheet would work if there are not too many rows) and look at the contents in light of what problems(data) says.
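Before opening the file, it can help to see which columns the failures come from. The following is only a sketch: it uses a small made-up file (the column name gdd_dt and its values are hypothetical stand-ins, not your data) and a deliberately small guess_max so that the type is guessed from empty rows only. With your real data you would apply the last step to problems(data).

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the real file: gdd_dt is empty in the first
# rows and only later contains a date.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,gdd_dt", "1,", "2,", "3,2003-03-26"), tmp)

# guess_max = 2 makes readr guess the column types from the first two
# data rows only, mimicking a column that is empty in the rows readr
# inspects by default.
dat <- read_csv(tmp, guess_max = 2)

# Count the parsing failures per column and expected type.
by_column <- problems(dat) %>%
  count(col, expected, sort = TRUE)
by_column
```

Depending on the readr version, the col field may hold the column name or its position, so treat the exact output as version dependent.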


Thanks for your response! I did open the csv file in Excel. And yes, when I looked at row 1155 and column gdd_suprachoroidal_dt, it showed me 2003-03-26, which is what I expected it to be.

I guess I don't understand why it is expecting to find TRUE/FALSE values. Is there a way to change it? I just want that column to have dates as values. Also, there are multiple other columns in the csv file which also have dates as the values, but they don't end up giving an error. Same for the other columns listed in problems(data) with numbers as values. It seems like maybe some columns in the csv file are corrupted or something like that - how can I fix this?

Note: When I used read.csv(), I didn't come across any parsing problems. However, I'm trying to use read_csv() because most of the R code I'm using was written with read_csv() in mind and doesn't work well with read.csv().

You can set the data type for each column using the col_types argument. Here is the description of that from the help that you can get by typing

?read_csv

in the console.

col_types: One of NULL, a cols() specification, or a string. See vignette("column-types") for more details.
If NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.
If a column specification created by cols(), it must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _ / - to skip the column.

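The second paragraph explains why TRUE/FALSE is expected: the column types are guessed from the first 1000 rows, and a column that is empty in all of those rows is guessed to be logical. The last paragraph, which starts with "Alternatively", explains how you can designate the type of each column using a single string with one character per column. With as many columns as you have, that will be tedious. If the data are corrupted, you will have to fix that in any case.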
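As a rough sketch of two less tedious options, assuming the failures come from columns whose first values appear after the rows readr inspects: you can name only the problem columns inside cols() and leave the rest to be guessed via .default, or you can raise guess_max so more rows are inspected. The file below is a made-up stand-in; the column names are borrowed from the problems() output above, but the values, the choice of col_character() for exam_oct_os, and the guess_max numbers are assumptions for illustration.

```r
library(readr)

# Hypothetical stand-in for the real file.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id,gdd_suprachoroidal_dt,exam_oct_os",
    "1,,0.5",
    "2,,0.7",
    "3,2003-03-26,x"),
  tmp
)

# Option 1: specify only the columns that were guessed wrongly;
# every other column is still guessed.
fixed <- read_csv(
  tmp,
  guess_max = 2,  # kept small here only to mimic late-appearing values
  col_types = cols(
    .default = col_guess(),
    gdd_suprachoroidal_dt = col_date(format = ""),
    exam_oct_os = col_character()  # keeps entries such as "x" as text
  )
)

# Option 2: let readr inspect more rows before guessing the types.
reguessed <- read_csv(tmp, guess_max = 10000)

problems(fixed)
```

Option 2 is slower on large files, and columns that mix numbers with text, like exam_oct_os here, will come back as character, so they may still need cleaning afterwards.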

