combining multiple csv files and adding a column

Nushuhr · September 3, 2019, 9:39am

Hi All,

As a coding newbie I am struggling to combine 70 csv files into one. I also need the new file to include an additional (first) column to indicate which original csv file the respective rows came from (i.e. participant number).

I've tried the following, but I think it doesn't work because the original file names don't have subject numbers (i.e. only 4 columns), and I'm not sure how to add these to multiple files...

# needed for reading data
filenames <- list.files("CSV Raw Data", pattern = "stim", full.names = TRUE) 

# needed for identifying subject numbers
subjects <- list.files("CSV Raw Data", pattern = "stim") 

# this next line isn't used, but will be useful for combining the other data types, which don't contain subNum as a column.
testdata <- NULL
for (subj in 1:length(subjects)) { 
  
 # extract participant number from file name
 p <- substr(subjects[subj],7,9) 
 
 # read the data from csv
  psData <- read_csv(filenames[subj], col_types = cols(), col_names = FALSE)

  # combine data array with existing data
  testdata <- rbind(testdata, psData) 

}

colnames(testdata)[1:5] <- c("subj","time","x","y","trial")

write_csv(testdata, "StimAllPs_csv.csv")

This gives me the following error message: "Error in attr(x, "names") <- as.character(value) :
'names' attribute [5] must be the same length as the vector [4]"

Any help would be greatly appreciated!

Thanks,
Nushuhr

valeri · September 3, 2019, 10:36am

I am not sure how the subject file relates to the name of the files you are concatenating, so in my example I just add a column file_id which will contain the file name - you will have to modify this so that it becomes your subj column. From your example the filenames and subjects is the same (apart from a full.names argument)? Shouldn't the patterns you are matching be different?

If you actually provide at least 2 csv's with mock data but the right structure (and the file from which you get the subject/participant numbers), a full reprex can be made.

Maybe you can try something along the lines

file_names <- list.files("CSV Raw Data", pattern = "stim", full.names = TRUE)
final_output <- purrr::map_df(file_names, function(x) {
	
	data <- read_csv(x)
	cbind(file_id = x, data)
	
	})

andresrcs · September 3, 2019, 9:32pm

You can achieve this with purrr:map_df() by specifying the .id parameter, see the example bellow. If you need more specific help, please provide more information about the structure of your data and a sample of your file names.

library(tidyverse)
library(stringr)

list_of_files <- list.files(path = "CSV Raw Data",
                            pattern = "stim",
                            full.names = TRUE)
df <- list_of_files %>%
  setNames(nm = .) %>% 
  map_df(~read_csv(.x, col_types = cols(), col_names = FALSE), .id = "file_name")

Nushuhr · September 4, 2019, 10:29pm

This is awesome, thank you!

I'm so impressed at how simple your code is - and it works!

If it's not too much trouble, could I please ask how the individual steps of the code work, i.e. what each of these functions does?
Specifically, I'm unsure how the functions 'setNames(nm = .)', 'map_df', and '.id' work.

Many thanks again!

Nushuhr · September 4, 2019, 10:36pm

Thank you very much for your quick support! I'm clearly not skilled enough to understand or adapt your code, but in the end I managed to make it work using @andresrcs code.

andresrcs · September 5, 2019, 2:51am

Ok, I'm going to give it a try although I'm not very good at teaching for what I have being told.

First of all I'm using the pipe operator %>% to concatenate commands, so for example, this part list_of_files %>% setNames(nm = .) would be equivalent to setNames(list_of_files, nm = list_of_files) using regular sintax.

The setNames() function adds names to the elements of the character vector i.e list_of_files, see this example

list_of_files <- c("a", "b", "c")
list_of_files
#> [1] "a" "b" "c"

setNames(object = list_of_files, nm = list_of_files)
#>   a   b   c 
#> "a" "b" "c"

The map_df() function iterates over the list of files, executes the read_csv() function for each element of the list and merges the result into a single dataframe.

The .id argument for the map_df() function is set to NA by default, which means no id is going to be added for each element of the list, but if you change it to take the value of any character string ("file_name" in this case) is going to create a column with the specified name, containing an identifier for each element of the list, in this case the name of each element in list_of_files.

Nushuhr · September 5, 2019, 8:46pm

That's great! Thank you very much

jcblum · September 6, 2019, 3:26am

I’d like to propose a couple of tiny additions to @andresrcs’s nice solution (and even nicer explanation ):

since purrr 0.2.3, the new name for the function that binds a list of data frames together by stacking the rows is map_dfr() (the old map_df() still works for now, but it’s no longer easy to find in the main documentation)
purrr has its own set_names(), which is even more pipe-friendly and easy to use for this case: by default, it names the elements of the vector after themselves

df <- list_of_files %>%
  set_names() %>% 
  map_dfr(
    ~ read_csv(.x, col_types = cols(), col_names = FALSE),
    .id = "file_name"
  )

system · September 27, 2019, 3:26am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.