Loading multiple CSV files in R - variable types causing error

Hi,

I have a folder of CSV files; each file has around 3K rows/observations and 600 variables.

I've seen various topics about the subject and came up with my own "solution".

I created an auxiliary function that reads a csv and adds the file name as a column, so I can later extract the year, which is part of each file name.

library(tidyverse)
library(fs)
### List CSV files in directory
csv_files <- fs::dir_ls("ACS_DP02_data")

# Aux function call to append file names:
read_plus <- function(flnm) {
  read_csv(flnm) %>%
    mutate(filename = flnm)
}
# Create df from csv files
my_df1 <- csv_files %>% map_df(read_plus)

Initially it seemed to work, as seen below:


It looked right, but then I noticed something odd in observation 1: the row name. I looked at the tail of the df (theoretically the last csv file in the folder), compared it with the tail of that csv file, and they did not match. I also noticed that Id2 came up 9 times, indicating that the tables had been appended one after another into a single df, but with the observation names included as data as well.

So I played around with loading single files and noticed that I should skip the 1st row. With that, the data was imported correctly for individual files. But after adding skip = 1 to read_csv() in the function, I got the error below when it reached the 6th file: basically, it could not combine chr with dbl.

Loading a single file with read_csv() and skip = 1 loaded the data correctly, as seen below:

I therefore concluded that I had to check whether the variables were character and convert them to numeric, which led me to the code below:

library(tidyverse)
library(fs)

### List CSV files in directory
csv_files <- fs::dir_ls("ACS_DP02_data")

# Aux function call to append file names:
read_plus <- function(flnm) {
  read_csv(flnm) %>% 
    mutate_if(is.character, as.numeric) %>% 
    mutate(filename = flnm)
}
# Create df from csv files
my_df1 <- csv_files %>% map_df(read_plus)

It seemed to have worked, but when I checked the data, everything that was character was now NA, like below:

Could anyone shed some light on what's going on? I suspect my mutate_if() is not doing what I think it is... I've been troubleshooting this for quite some time now and I'm frustrated.

Thank you in advance for your time and patience.

It is hard to give you specific advice because your example is not reproducible (it lacks sample data), but I can give you some pointers to help you move forward.

There is no need to implement this yourself. map_df() has been superseded by the more specific map_dfr() and map_dfc() functions, and for your use case map_dfr() has an .id argument that, when set, stores the names of the mapped list in a new column. The code pattern would be as follows:

library(tidyverse)

list_of_files <- list.files(path = "path_to_your_files",
                            pattern = "\\.csv$",
                            full.names = TRUE)
my_df <- list_of_files %>% 
    set_names() %>% 
    map_dfr(read_csv, .id = "filename")
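To try this pattern without the original data, here is a minimal sketch that writes two small CSV files to a temporary folder and reads them back. The folder, file names, and columns are made up for illustration, and the year extraction assumes a four-digit year appears in each file name. The last step shows what I believe is the equivalent idiom in purrr 1.0.0 and later, where map_dfr() is itself superseded in favour of map() followed by list_rbind().

```r
library(tidyverse)

# Hypothetical sample data: two tiny CSV files in a temporary folder
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, value = c(10, 20)), file.path(tmp_dir, "data_2015.csv"))
write_csv(tibble(id = 3:4, value = c(30, 40)), file.path(tmp_dir, "data_2016.csv"))

list_of_files <- list.files(path = tmp_dir,
                            pattern = "\\.csv$",
                            full.names = TRUE)

# set_names() names each element after its own path;
# .id stores that name in a new "filename" column
my_df <- list_of_files %>%
  set_names() %>%
  map_dfr(read_csv, show_col_types = FALSE, .id = "filename") %>%
  # Assumption: the year is the first run of four digits in the file name
  mutate(year = as.integer(str_extract(basename(filename), "\\d{4}")))

# Same row-binding with the newer purrr (>= 1.0.0) functions
my_df2 <- list_of_files %>%
  set_names() %>%
  map(read_csv, show_col_types = FALSE) %>%
  list_rbind(names_to = "filename")
```

Because the names come from set_names(), the filename column holds the full path of each file; wrap it in basename() if you only want the file name itself.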

read_csv() guesses the variable classes but if it is guessing wrong, you can specify the classes using the col_types argument so you get consistent variable classes among iterations. See the documentation:

col_types
One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be imputed from guess_max rows on the input interspersed throughout the file. This is convenient (and fast), but not robust. If the imputation fails, you'll need to increase the guess_max or supply the correct types yourself.

Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column:

  • c = character
  • i = integer
  • n = number
  • d = double
  • l = logical
  • f = factor
  • D = date
  • T = date time
  • t = time
  • ? = guess
  • _ or - = skip

By default, reading a file without a column specification will print a message showing what readr guessed they were. To remove this message, set show_col_types = FALSE or set options(readr.show_col_types = FALSE).
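As a rough illustration of the col_types advice, the sketch below reads every column as character so that all files bind cleanly, and only then converts the columns that are meant to be numeric. The sample files, their two header rows, the column names, and the "(X)" placeholder are all assumptions made up for the demo, not taken from the original data. It also hints at why mutate_if(is.character, as.numeric) probably produced the NAs: as.numeric() returns NA for any string that is not a number, so genuinely textual columns are wiped out.

```r
library(tidyverse)

# Hypothetical sample files: a code row, then a label row, then data.
# In the second file a non-numeric placeholder makes Estimate look like text.
tmp_dir <- file.path(tempdir(), "col_types_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("id,geo,est",
             "Id2,Geography,Estimate",
             "01001,County A,20800",
             "01003,County B,75149"),
           file.path(tmp_dir, "demo_2015.csv"))
writeLines(c("id,geo,est",
             "Id2,Geography,Estimate",
             "01001,County A,(X)",
             "01003,County B,75500"),
           file.path(tmp_dir, "demo_2016.csv"))

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Skip the first header row and read every column as character,
# so each file yields the same column types
read_as_text <- function(path) {
  read_csv(path, skip = 1, col_types = cols(.default = col_character()))
}

combined <- files %>%
  set_names() %>%
  map_dfr(read_as_text, .id = "filename") %>%
  # Convert only the column known to be numeric; "(X)" becomes NA.
  # Geography is left alone: as.numeric("County A") would be NA.
  mutate(Estimate = parse_number(Estimate, na = c("", "NA", "(X)")))
```

Reading Id2 as character also keeps its leading zeros, which a numeric guess would drop. With 600 variables you would likely convert many columns at once, for example with mutate(across(...)) over the columns you know are numeric.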