Importing multiple excel files and turn them into long format

Hi, I'm new to r and have a question about data input. Any suggestion is greatly appreciated.

I am trying to read and merge multiple excel files into a data frame.
Each of the excel file is a time series of a variable, but it is in the following format:

I would like to input 50 of these excel files and create a data frame in the format that each row contains a time series variable, and the columns are the time frame: 2003-12, 2004-01, 2004-02, ...... 2020-05.

I was wondering what should I do to read and merge these files and at the same time turn them from wide format into long format?

Thank you in advance!

1 Like

Hi, the code below should do everything you need assuming each excel file is the same as your screenshot, the data is in sheet 1 of each file and all the files are in one directory.

I added comments but let me know if you have any questions!

library(tidyverse)
library(readxl)
library(fs)

# path to directory containing excel files
excel_dir <- here::here("excel_dir/")

df_combined_long <- 
  # list all the excel file paths in the directory (assuming they are xlsx files not xls)
  dir_ls(excel_dir, regexp = "[.]xlsx$") %>% 
  # name the list with the filename portion of the path with no file extension
  # this will added as a variable in the final dataframe 
  set_names(path_ext_remove(path_file(.))) %>% 
  # read each file and combine together into single df
  # set a 'variable' column with the value of the filename created above
  map_dfr(read_excel, .id = "variable") %>% 
  # convert to long format with a month and value column
  pivot_longer(cols = c(-variable, -Year), names_to = "month", values_to = "value") %>% 
  # combine year and month columns into a single date column
  unite(date, Year, month, sep = "-") %>% 
  # convert date column to R date format
  mutate(date = parse_date(date, format = "%Y-%b"))
3 Likes

Hi, thank you so much for the detailed comments. I really appreciate it.
But the excel files actually have some headings. Sorry for not clarifying that in my original post.

I guess because of this, I received error message:
Error: Can't subset columns that don't exist. x The column Year doesn't exist.

I was wondering how should I specify the selection range from the excel? (The range is the same for all files.)

Thank you again for your help.

Ah ok, so you can actually pull out the information in the header using cell references and add it to the data frame when reading each file, then supply a skip = 10 argument to read the main data table starting from row 11.

Like this (assuming the first cell in your screenshot is A1):

library(tidyverse)
library(readxl)
library(fs)

# path to directory containing excel files
excel_dir <- here::here("excel_dir/")

df_combined_long <- 
  # list all the excel file paths in the directory (assuming they are xlsx files not xls)
  dir_ls(excel_dir, regexp = "[.]xlsx$") %>% 
  # read each file and combine together into single df
  map_dfr(~{
    series_id <- read_excel(path = .x, range = "B4", col_names = FALSE) %>% pull()
    series_label <- read_excel(path = .x, range = "B5", col_names = FALSE) %>% pull()
    industry <- read_excel(path = .x, range = "B6", col_names = FALSE) %>% pull()
    product <- read_excel(path = .x, range = "B7", col_names = FALSE) %>% pull()
    
    read_excel(.x, skip = 10) %>% 
      mutate(
        series_id = series_id,
        series_label = series_label,
        industry = industry,
        product = product
      ) %>% 
      select(series_id, series_label, industry, product, everything())
  }) %>% 
  # convert to long format with a month and value column
  pivot_longer(cols = Jan:Dec, names_to = "month", values_to = "value") %>% 
  # combine year and month columns into a single date column
  unite(date, Year, month, sep = "-") %>% 
  # convert date column to R date format
  mutate(date = parse_date(date, format = "%Y-%b"))
2 Likes

Wow this worked perfectly. I still need to study it line by line, but thank you very very much!!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.