Multiple dataframes from dataframe of file IDs

jarle · September 14, 2022, 11:47am

I have a table in excel containing 6-digit csv file IDs in 2 different columns (A, B).

I want to

create dataframes for all column A files, all column B files.
combine all the A dfs into an A_all df, all the B dfs into a B_all df.
merge the A_all, B_all into one DATA df.

I have a function performing #3 but need help on #1 and #2.

I would appreciate some suggestions on how to approach this (preferably without for looping) to get me started

I have the following idea, but am in lack of which functions to use:

1 I start by making a df with 2 columns with file IDs

library (readxl)
file_ids <- read_excel("path/file_ids.xls")

my idea is then to append the file ID number from the file_ids df to the file path

A <- paste("csvfilepath/filename_", file_ids$colA, ".csv", sep="")
B <- paste("csvfilepath/filename_", file_ids$colB, ".csv", sep="")

and create dataframes for each and every A file and each and every B file

A_file_id <- read_csv(A)
B_file_id <- read_csv(B)

2 combine A datasets to an A_all dataset and B datasets to a B_all dataset

A_all <- apply( some full_join/merge function to all the A_file_id dfs
B_all <- apply( some full_join/merge function to all the B_file_id dfs

(Not all A dfs have same number of columns, not all B dfs have same number of columns)

3 merge into one df (due to some missing person_ID)

DATA <- merge(A_all, B_all, by.x="person_ID", by.y="person_ID", all.x = TRUE, all.y = TRUE)

Thanks in advance!

nirgrahamuk · September 14, 2022, 12:24pm

isnt your imagined A_all identical with A_file_id that you made ?

jarle · September 14, 2022, 12:39pm

Thanks! Yes, it is! (then this is an instance that a formulation of the problem solves it, I guess)

Then my question is how to iterate over multiple columns (A, B, C etc.)?

Also: Suggestions when the csv files does not have equal number of columns and I want to keep all columns?

FactOREO · September 14, 2022, 1:08pm

Regarding the columns issue you could use data.table::rbindlist() (documentation) with use.names = TRUE if the colnames of your tables match and fill = TRUE to fill nonmatching columns with NA. This will work if the issue is in the form of "Table 1 has columns A, B, C and D and Table 2 has A, C and D".

If you have completely different columns, but an usuable ID in every DF, you should use full joins (e.g. from dplyr - see here) to keep every entry and every column. But as you haven't provided sample data which reveals the general structure of your tables, it is kind of guessing from my side.

Kind regards

jarle · September 15, 2022, 8:17am

Thanks. I'll look into that later. A first try gave me an error saying
Input is character but should be a plain list of items to be stacked.

Another issue I ran into is non-existing files. I tried if (file.exists() read_csv()
but warning msg say condition has length > 1 and only the first element will be used.

nirgrahamuk · September 15, 2022, 8:41am

ifelse is vectorised, if is not and evaluates a single condition.
but aside from that my advice is that
you should probably take the list of files to be read in; do a step where you check if they exists; and only do the read_csv on the ones that do (i.e. a shortened list)

A_to_do <- A[file.exists(A)]
A_all <- read_csv(A_to_do)

system · October 27, 2022, 8:41am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.