Merge multiple files and add new column "subject"

Hi, I've just started learning RStudio and coding, and I'm having trouble with a few things. I'm trying to merge 20+ files into one data frame and add a column for subject number/ID that corresponds to the rows from each data file.
e.g. all the data rows from file #1 would be labelled "1" in the subject column, etc.

All of the original data files already have a column labelled "subject". However, this column is blank (our experiment didn't output a subject number, but created a column called subject anyway), so there aren't any subject names in any of the original data files.

I tried implementing solutions from this thread, but I received an error that says "Error: file must be a string, raw vector or a connection."

I already read and merged all the data files using purrr:

data = list.files(path = "data", full.names = TRUE) %>%
  map(read_csv) %>%
  reduce(rbind)

Any help is appreciated!

You could do it like this.

library(tidyverse)

df <- list.files(path = "data", full.names = TRUE) %>%
  map_dfr(read_csv, .id = "file_path") %>% 
  group_by(file_path) %>% 
  mutate(subject = group_indices())
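
A note on what `.id` stores: when the input to map_dfr() is an unnamed vector, the new column holds each element's position ("1", "2", ...) as text rather than the file path. If the subject number should come from the file name instead, one option is to name the vector before mapping. This is only a minimal sketch: the temporary folder, the two toy files, and the column name file_stem are made up for illustration, and it assumes comma-separated files named like 01.csv, 02.csv.

```r
library(tidyverse)

# Hypothetical demo folder with two tiny comma-separated files
# (stand-ins for the real data; each has a blank `subject` column).
demo_dir <- file.path(tempdir(), "subject_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(subject = NA, rt = c(310, 295)), file.path(demo_dir, "01.csv"))
write_csv(tibble(subject = NA, rt = c(402, 388, 371)), file.path(demo_dir, "02.csv"))

files <- list.files(path = demo_dir, full.names = TRUE)

df <- files %>%
  # Name each path by its file stem ("01", "02") so `.id` stores that stem
  set_names(tools::file_path_sans_ext(basename(files))) %>%
  map_dfr(read_csv, .id = "file_stem") %>%
  # Overwrite the blank `subject` column with the numeric file stem
  mutate(subject = as.integer(file_stem))
```

The `.id` column is deliberately not called "subject" here, since the files already contain a column with that name.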

Hi, thanks for the quick response.
I tried running the code, but I received this error:
Error: group_indices.default() should only be called in a data context

Can you please post the exact code you ran? I assume you've replaced "data" with the path to the folder where your files are located.

Actually, the path to the folder is called "data" too!

I ran this:
subjdata <- list.files(path = "data", full.names = T) %>%
map_dfr(read_csv, .id = "file_path") %>%
group_by(file_path) %>%
mutate(subject = group_indices())

I'm unable to reproduce your error. Can you please only run list.files(path = "data", full.names = T) and tell me what is the output you see in the console?

list.files(path = "data", full.names = T)
[1] "data/01.txt" "data/02.txt" "data/03.txt" "data/04.txt" "data/05.txt"
[6] "data/06.txt" "data/07.txt" "data/08.txt" "data/09.txt" "data/10.txt"
[11] "data/11.txt" "data/12.txt" "data/13.txt" "data/14.txt" "data/15.txt"
[16] "data/16.txt" "data/17.txt" "data/18.txt" "data/19.txt" "data/20.txt"
[21] "data/21.txt" "data/22.txt" "data/23.txt" "data/24.txt"

read_csv() assumes comma-separated values, so for .txt files that use a different delimiter you should generally use read_delim() instead. Do you get a single data frame as output after running the map_dfr(...) statement?
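
To illustrate the read_delim() point, here is a small sketch. It assumes the files are tab-separated, which is only a guess (open one in a text editor to check); the temporary file and its columns are made up for the example.

```r
library(readr)

# Hypothetical tab-separated file standing in for one of the .txt data files;
# the `subject` field is left empty, as described in the question.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("trial\tsubject\trt", "1\t\t310", "2\t\t295"), demo_file)

# State the delimiter explicitly; "\t" is an assumption about the real files
one_file <- read_delim(demo_file, delim = "\t")
```

In the pipeline above, the same argument can be passed through map_dfr(), e.g. map_dfr(read_delim, delim = "\t", .id = "file_path").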

Ah okay!
I ran up to the map_dfr() and I received a single data frame as output, which includes the new column! It's called "file_name" and gives the number of the file ("1", "2", etc.), which matches how my files are numbered anyway.
Thank you!

Okay cool. I didn't know what your files were named, so the group_indices() would help if your files didn't contain a sequence number. Strange that the rest of the code doesn't work for you though.
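
For anyone reading this on dplyr 1.0.0 or later: calling group_indices() with no arguments is deprecated there, and cur_group_id() is the documented replacement inside mutate(). One caveat, sketched below with made-up data: the `.id` column is text, so the groups sort as "1", "10", "2", and the group number stops matching the file order once there are ten or more files. Converting the text index to an integer avoids that. The column names group_number and subject are just illustrative.

```r
library(dplyr)

# Toy stand-in for the combined data: `.id`-style column stored as text
combined <- tibble(file_path = c("1", "2", "10"), rt = c(310, 295, 402))

numbered <- combined %>%
  group_by(file_path) %>%
  # Group number follows the sorted order of the text keys: "1", "10", "2"
  mutate(group_number = cur_group_id()) %>%
  ungroup() %>%
  # Converting the text index directly keeps the original file order
  mutate(subject = as.integer(file_path))
```

Here file "10" gets group_number 2 while subject is 10, which is why the integer conversion is the safer choice when the files are already numbered in order.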

Thankfully, it worked out conveniently for my case!

Would it have to do with how my files are named/formatted or the type of variable? (I'm not well-versed in R so might be a newbie question!)

Are you sure that the new column is called "file_name"? It should be "file_path" if you used the code I gave. I'd advise you to create a reprex so that we can see exactly what's happening by following this guide.
