Selecting particular columns from multiple csv files using R Studio?

tailchopper · February 9, 2021, 9:34am

I have large number of csv files (100+) in a directory with same structure and column names. The data is in time series and there are approx 30 columns/file. However i only want to aggregate particular number of columns together while dont care about others.

For example each csv file has particular columns (same column name in each file) say Col A, Col B and Col C. I want to select these columns from each file and add together i.e Col A from file 1,file2.... and Col B from file 1,file2.....

I need help in R coding for this.

csv files look like :

Timestamp , Col A , ColB , ColC ................................
2021/09/02 00:00, 1234 , 1234 , 1234

I found that merge funciton would work best since there could be mismatch in number of rows. Note that Columnn names and column numbers in each file are similar.

martin.R · February 9, 2021, 12:10pm

This is one method:

library(dplyr)
library(purrr)
library(readr)

files <- list.files(path = "your_directory", pattern = "*.csv", full.names = TRUE)
output <- 
  map_dfr(
    files, 
    read_csv, 
    col_types = cols_only(
      Timestamp = col_character(), 
      ColA = col_double(), 
      ColB = col_double(), 
      ColC = col_double()
    )
  ) %>% 
  group_by(Timestamp) %>% 
  summarise(across(everything(), sum, na.rm = TRUE))

You should ensure that the list.files() line captures which files you wish to read and then that the col_types() argument describes which columns to select. Depending on your case, there may be more convenient ways of specifying which columns you wish to select.

tailchopper · February 9, 2021, 1:41pm

Super thanks Martin !. Actually I need to add one colA's together and colB's together seperately (aggregation of columns seperately) then plot monthly trends i.e based on Timestamp, Agrgegated column A, aggregated column B and so on. It might eed some changing after the summarise fucntion ?

tailchopper · February 9, 2021, 1:50pm

I am using the following code. Its running but no output is shown (AND NO errors)

library(dplyr)
library(purrr)
library(readr)
library(ggplot2)
library(tidyverse)

files <- list.files(path = "my directory", pattern = "*.csv", full.names = TRUE)
redflag <-
map_dfr(files,read_csv,
col_types = cols_only(
C = col_character(),
I = col_double(),
J = col_double(),
L = col_double(),
V = col_double(),
W = col_double()
)
) %>%

mutate(month = months.Date(as.Date(C)),
month = factor(month, levels = month.name)) %>%
group_by(month) %>%

summarize(trend.I = sum(I, na.rm = T),
trend.J = sum(J, na.rm = T),
trend.L = sum(L, na.rm = T),
trend.V = sum(V, na.rm = T),
trend.W = sum(W, na.rm = T)) %>%
melt(id.var = 'month') %>%

ggplot(aes(x = month, y = value, fill = variable)) +
geom_bar(aes(x = month, y = value, fill = variable), stat = "identity", position = "dodge") +
theme_classic() + ylab("Value") + xlab("Month") + ggtitle("Monthly trend of data") +
geom_text(aes(label = value), hjust=1, vjust=1, size=3,position = position_dodge(width = 1))

system · March 2, 2021, 1:50pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.