read_excel + purrr maps - How to deal with empty cells from Excel.

Hello there !

I'm a newbie in R programming, and I need help for a particular case I've just encountered.
I don't know if it was already discussed in the past in the community, if that's the case, don't hesitate to let me know.

As part of an assignment, I'm currently trying to recursively import different cell values from hundreds of Excel workbooks (only one particular sheet in each wb) as an unified data frame in R.Studio (one row = one wb).

All these workbooks were meant to be manually edited, and have the same disposition, however, the data is sparse (no unified columns, no unified rows or whatsoever), with a lot of useless elements to be skipped.

Hence, they must be transcribed with multiple ranges, with one cell (x,y) at a time.

To that end, I originally found a purrr::map2_dfc based function which seemed to work very well at first sight : (Chapter 3 Data Importing/Exporting and interaction with other programmes | DfT R Cookbook, part 3.6.6 "Non-rectangular data - single worksheet - many workbooks").

And indeed ! The first results seem satisfactory.

The method is to set 3 inputs : 1 character (path) and 2 character vector inputs (cells, and col_names) which would be interpreted thanks to readxl::read_excel and purrr::map_dfr.

  • "col_names" is always the same and will act as a header for all the values in each Workbook, for example : "<- c("Title 1", "Title 2", "Title 3")"
  • "cells" directly point to the targeted cells with Excel Coordinates for each wb, for example : "cells <- c("Z1", "A7", "B8")", and are interpreted by read_excel.

"col_names" and "cells" both logically appear as "chr[1:3]". (There is supposed to be much more cells, but it's just for the example).

However, it only works when "cells" has no empty cell in it.

And in case a cell from just one workbook (example : A7) is left empty (which often happen), then "cells" will always fail at effectively aligning itself with "col_names" (dimension issue ?).

Which, in turn, will prevent RStudio from acquiring further values.

From what I've seen so far, it seems the problem may come from read_excel "range =" parameter always shrink-wrapping to only include filled cells (and ignoring all the ones with no values at all),

Which means that for a range of just one empty cell, it will return "0 obs. of 0 variable" instead of "0 obs. of 1 variable" (with a predefined string value such as "NA" for example, which perhaps would have been a preferred outcome in this situation).

I tried to use other "read_excel" parameters such as "na =", but for no avail at the moment.

Is there another way to overcome this ? For example, "forcing" read_excel (or a similar tool) to count empty Excel cells and/or always fixing the tibble dimension to be at least 0x1 for just one cell ?

Thanks in advance !

Here is how it looks like so far :

``` r
library(purrr)
library(tidyverse)
library(openxlsx)
library(fs) 

cells_to_rows <- function(path, cells, col_names){
  purrr::map2_dfc(
    .x = cells
    , .y = col_names
    , ~ readxl::read_excel(
      path = path,
      , range = .x
      , col_names = .y,
    ) 
  )
}

cells <- c("Z1", # Title 1
           "A7", # Title 2
           "B8")  # Title 3

col_names <- c("Title 1", 
               "Title 2", 
               "Title 3")

binding_data <- fs::dir_ls(
  path = "X:/XXX/XXX/XXX/XXX", recurse = TRUE,regex = ".xlsx$")   %>% 
  purrr::map_dfr(
    ~ cells_to_rows(path = .x, cells = cells, col_names = col_names)
    , .id = "path"
  )

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.