insert part of file name as a column value when multiple trials for each subject.

phomaria · September 15, 2020, 3:09pm

Hello,
I am a novice in R and I would like some help. I have an .xlsx data file for each subject with 288 trials each. Before I merge all these excel files together I want to add the subject number which is part of each file name (e.g, subject-103) as a new column next to each trial. So for each of the 288 trials the subject number should be listed.

How can i do it?

Thank you in advance.

mara · September 15, 2020, 3:49pm

You can do this using tidyr::separate(), which allows you to split a character column into multiple columns, and you'd keep the original column (if you wish to do so) be setting the remove argument to FALSE (remove = FALSE).

You could also use substr() or stringr::str_extract() to take the part of the file name (the number) and create a new variable.

Here's an example using substr() to choose the last three characters of each of the filename variables:

    library(tidyverse)
    example <- tibble::tribble(
          ~filename,
      "subject-103",
      "subject-222",
      "subject-042",
      "subject-236"
      )

    example <- example %>%
      mutate(subject_number = substr(filename, 9, 11))

    example
    #> # A tibble: 4 x 2
    #>   filename    subject_number
    #>   <chr>       <chr>         
    #> 1 subject-103 103           
    #> 2 subject-222 222           
    #> 3 subject-042 042           
    #> 4 subject-236 236

^{Created on 2020-09-15 by the reprex package (v0.3.0.9001)}

phomaria · September 16, 2020, 12:59pm

Dear @mara thank you for your response.
Ok, here is the case: i have around 123 subjects and each subject has 288 trials. The trials for each subject are saved in a separate excel file.
So, i have the following questions:

Do i need to import in R-studio all the separate excel files, add the extra column with the subject number (using the script you have suggested) in each file and then merge all the files together?
or
Is there a way to direct the system to the main folder with all the individual excel files and ask to do the same thing as above, that is create a new column adding in each of the 288 rows the subject id based on the file name and then merge the files together once it is done with the previous process?

Thank you in advance,
Maria

mara · September 16, 2020, 1:51pm

You can do this all with a script. There are many ways to do this in R, but the how-to using the purrr package is described very well in this post, the only difference for you is that you are reading in an .xlsx file (I assume) instead of .csv.

You can read and combine them together using purrr::map_dfr() (again, the post has all the details), and save the source (i.e. the filename) by using the .id argument. At that point you'll have all the data in one data frame, and you can extract the subject id from the column you created from .id.

Wendell_Miyaji · September 16, 2020, 2:17pm

I do something like this all the time:

library(tidyverse)
library(readxl)

dir.list <- dir()[grepl("subject", dir())] # if you have your data files is separate directory that can be cleaner, use paste0 to add the directory to the filename

all.data <- dir.list %>% map_df(.f=function(FILE){
data.out <- read_excel(FILE) %>%
mutate(subject = str_sub(FILE, 9,11))
return(data.out)
})

phomaria · September 17, 2020, 1:55pm

Dear @mara and @Wendell_Miyaji thank you for your responses.

phomaria · October 5, 2020, 10:38am

hello again @mara @Wendell_Miyaji and everyone else,
I have tried several things based on your suggestions but I cannot figure out how to get the result I need. So I will try once more to explain better what I want to do.

Description of my data:

I have 156 subjects. Each sub's data are stored in a separate .csv file in a folder stored on my Desktop (a total of 156 csv files).
Each .csv file (which corresponds to an individual) contains 197 rows and 93 columns (some with numeric and others with character data).

What I need to do:

I need to add a column with the subject number in each separate csv file. The subject id can be found in the name of each csv file of each participant e.g. subject -103.
I need to create and save in my Desktop a new file for each subject (e.g., subject-103new) with all the rows and columns that are stored in the original file, and also with the extra column which will contain the sub id.

I was thinking that I should maybe create a loop which will read from each file's individual name, create the new column in each file with the subject id be present in each of the 197 rows and then save an updated file on my desktop.

I will appreciate some assistance of how to actually do it in R-studio.

Thank you in advance,
Maria

mara · October 5, 2020, 12:47pm

If you look at the post on reading in a folder, you'll see that "mapping" is similar to a loop, in that you apply the operation (functions) to whatever you're "mapping" over (in your case, a bunch of files). Just as you can do this to read in files, you can write them out, if you want to keep them separate. Are they stored directly on your desktop? If so, you can choose them based on file extension or name pattern, but it might be easier to put them together in a folder.

For a number of reasons (e.g. what you're trying to do now to all of the files), it's useful to group things together in a single data frame, and then use your grouping variable for calculations, or filter them out as needed. That said, you could keep them in separate files (though the fact that you want the subject number in a column suggests to me that you're probably going to be working with them "together" in some sort of way).

What have you tried, and where are you getting stuck? The best practice here is to figure it out for one file (I think you have the pieces for that now), and then do it for all of them together. Have you successfully gotten that first part down?

andresrcs · October 5, 2020, 8:54pm

From your previous post, I understand that your ultimate goal is to merge all files into a single data frame but keeping track of the origin, if that is still the case, the approach suggested by Mara is the way to go, I'm going to give you some general pattern code so you can have a starting point.

library(tidyverse)
library(stringr)

list_of_files <- list.files(path = "path/to/files",
                            recursive = TRUE,
                            pattern = "\\.csv$",
                            full.names = TRUE)
df <- list_of_files %>%
    set_names() %>%  
    map_df(read_csv, .id = "file_name") %>% 
    mutate(subject_id = str_extract(string = file_name,
                                    pattern = "(?<=\\-)\\d+(?=\\.csv)") #This extracts the id from the file name
           )

system · October 26, 2020, 8:54pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.