Concatenate csv files selected columns

Hello Everyone,

I am relatively new to Rstudio. I have multiple CSV files and I want to combine them horizontally. The CSV files have two common columns (i.e. Date and Q ). I want to have an output csv for certain period of time using the Date columns as index. if the data is missing for any of dates from the CSV files then it should be filled "NA". See the figure below

I tried this script

library(readr)
#Create a dummy table for combining data

Date <- seq.Date(as.Date("2001-10-15"), as.Date("2001-10-20"), by = "day")
Date<-data.frame(Date)
colnames(Date)<-("Date")
data<-data.frame(matrix(NA,nrow=6,ncol=5))
x<-c("Flow (“Sno”,”Q”,”Volume(cf)”, “area(sf)”))
colnames(data)<-x
Table<-cbind(Date,data)

#import all csv files
list_of_files <- list.files(path = "C:/Users/ /Documents",
recursive = TRUE,
pattern = "\.csv$",
full.names = TRUE)

df <- readr::read_csv(list_of_files, id = "file_name")

To better understand, why does the final output contain the row Sno 9 with the Date 10/20/2001? There is no 10/20/2001 in csv-2.

Yes you are correct, CSV-2 does not have 10/20/2001. The idea is to have final output CSV from 10/15/2001 to 10/20/2001. Therefore, the final out put should have "NA" for missing date in any of the CSV files. For this case, the final output csv have "NA" (Highlighted in yellow) for the missing date.

it seems like you have a constructed Table that is the spine you want to left join additional (csv) data onto.
first of all, readr::read_csv doesnt directly support reading a list of files, it process a file at a time. However, you can of course iterate over the list of files and apply the read_csv repeatedly for each.
library(purrr) and its map() function is a good choice for this.

after this results in a list of data.frames, you then want to iterate again, left joining each to , in the first case Table, otherwise to the result of having joined Table with the previous iterations. Again, purrr is useful for this, it implements a reduce() function

It's not exactly the same result as you gave in your question, but the code below joins table 1 and table 2 on Date and Q. Maybe this helps already.

library(tidyverse)

tab1 <- data.frame(
  stringsAsFactors = FALSE,
       check.names = FALSE,
               Sno = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L),
                      Date = c("10/12/2001",
                               "10/13/2001","10/14/2001","10/15/2001",
                               "10/16/2001","10/17/2001","10/18/2001","10/19/2001",
                               "10/20/2001","10/21/2001"),
                         Q = c(106L,109L,98L,
                               131L,133L,119L,125L,96L,101L,44L),
               `Volume(d)` = c(200L,50L,133L,
                               122L,143L,200L,121L,230L,500L,300L)
        )
        

tab2 <- data.frame(
  stringsAsFactors = FALSE,
       check.names = FALSE,
               Sno = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L),
                      Date = c("10/12/2001",
                               "10/13/2001","10/14/2001","10/15/2001",
                               "10/16/2001","10/17/2001","10/18/2001","10/19/2001"),
                 Q = c(106L, 109L, 98L, 131L, 133L, 119L, 125L, 96L),
                `area(sf)` = c(500L,2000L,
                               450L,800L,900L,1456L,2000L,3000L)
        )


tab_overlap <- tab1 %>% 
  left_join(., tab2,
             by=c("Date", "Q"))
tab_overlap
#>    Sno.x       Date   Q Volume(d) Sno.y area(sf)
#> 1      1 10/12/2001 106       200     1      500
#> 2      2 10/13/2001 109        50     2     2000
#> 3      3 10/14/2001  98       133     3      450
#> 4      4 10/15/2001 131       122     4      800
#> 5      5 10/16/2001 133       143     5      900
#> 6      6 10/17/2001 119       200     6     1456
#> 7      7 10/18/2001 125       121     7     2000
#> 8      8 10/19/2001  96       230     8     3000
#> 9      9 10/20/2001 101       500    NA       NA
#> 10    10 10/21/2001  44       300    NA       NA

Created on 2022-05-11 by the reprex package (v2.0.1)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.