Hi, I'm new to r and have a question about data input. Any suggestion is greatly appreciated.
I am trying to read and merge multiple excel files into a data frame.
Each of the excel file is a time series of a variable, but it is in the following format:
I would like to input 50 of these excel files and create a data frame in the format that each row contains a time series variable, and the columns are the time frame: 2003-12, 2004-01, 2004-02, ...... 2020-05.
I was wondering what should I do to read and merge these files and at the same time turn them from wide format into long format?
Hi, the code below should do everything you need assuming each excel file is the same as your screenshot, the data is in sheet 1 of each file and all the files are in one directory.
I added comments but let me know if you have any questions!
library(tidyverse)
library(readxl)
library(fs)
# path to directory containing excel files
excel_dir <- here::here("excel_dir/")
df_combined_long <-
# list all the excel file paths in the directory (assuming they are xlsx files not xls)
dir_ls(excel_dir, regexp = "[.]xlsx$") %>%
# name the list with the filename portion of the path with no file extension
# this will added as a variable in the final dataframe
set_names(path_ext_remove(path_file(.))) %>%
# read each file and combine together into single df
# set a 'variable' column with the value of the filename created above
map_dfr(read_excel, .id = "variable") %>%
# convert to long format with a month and value column
pivot_longer(cols = c(-variable, -Year), names_to = "month", values_to = "value") %>%
# combine year and month columns into a single date column
unite(date, Year, month, sep = "-") %>%
# convert date column to R date format
mutate(date = parse_date(date, format = "%Y-%b"))
Hi, thank you so much for the detailed comments. I really appreciate it.
But the excel files actually have some headings. Sorry for not clarifying that in my original post.
Ah ok, so you can actually pull out the information in the header using cell references and add it to the data frame when reading each file, then supply a skip = 10 argument to read the main data table starting from row 11.
Like this (assuming the first cell in your screenshot is A1):
library(tidyverse)
library(readxl)
library(fs)
# path to directory containing excel files
excel_dir <- here::here("excel_dir/")
df_combined_long <-
# list all the excel file paths in the directory (assuming they are xlsx files not xls)
dir_ls(excel_dir, regexp = "[.]xlsx$") %>%
# read each file and combine together into single df
map_dfr(~{
series_id <- read_excel(path = .x, range = "B4", col_names = FALSE) %>% pull()
series_label <- read_excel(path = .x, range = "B5", col_names = FALSE) %>% pull()
industry <- read_excel(path = .x, range = "B6", col_names = FALSE) %>% pull()
product <- read_excel(path = .x, range = "B7", col_names = FALSE) %>% pull()
read_excel(.x, skip = 10) %>%
mutate(
series_id = series_id,
series_label = series_label,
industry = industry,
product = product
) %>%
select(series_id, series_label, industry, product, everything())
}) %>%
# convert to long format with a month and value column
pivot_longer(cols = Jan:Dec, names_to = "month", values_to = "value") %>%
# combine year and month columns into a single date column
unite(date, Year, month, sep = "-") %>%
# convert date column to R date format
mutate(date = parse_date(date, format = "%Y-%b"))