Read XLSX from Github

I am having trouble finding how to import an XLSX from a private Github repo. For CSV the below is working, is there a standard way to do the same with XLSX other than downloading the file to my local machine?

library(httr)
library(xlsx)

github_link <- "https://raw.githubusercontent.com/..."
api_key <- "<secret_token>"
req <- GET(github_link, 
           add_headers(Authorization = paste("token", api_key, sep = " ")))

# This works
csv_file <- content(req, type = "text/csv")

# How to do this?
xlsx_file <- loadWorkbook(?????)

As an excel workbook is a binary file, I think you need to download the file before reading. But I may be wrong...

However, you can do this without manualy downloading the file. This can be done in a tempfile and you can add in a wrapper

loadWorkbook_url <- function(url) {
    temp_file <- tempfile(fileext = ".xlsx")
    download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
    loadWorkbook(temp_file)
}

This is what other :package: for dealing with excel in R can do:

  • See openxlsx that can read from a url with read.xlsx, using this technic internaly (see Github source for internal getFile function)

  • in readxl, this is still a feature in thinking

As your file is on a github private repo, you may need to create your custom wrapper using httr::write_disk to write on disk, in the tempfile, the result of your GET that uses the API key.

If I find a demo file in a github repo, I will provide an exemple.

3 Likes

Here's an example file for your use and abuse:

https://github.com/CerebralMastication/Presentations/raw/master/test.xlsx

I played around a little bit and confirmed that your temp file approach is better than anything else I could cook up.

4 Likes

Thank you !

@sgranitz here is an example that should work on your private repo too, using GITHUB_PAT environment variable.


github_link <- "https://github.com/CerebralMastication/Presentations/raw/master/test.xlsx"
library(httr)
temp_file <- tempfile(fileext = ".xlsx")
req <- GET(github_link, 
          # authenticate using GITHUB_PAT
           authenticate(Sys.getenv("GITHUB_PAT"), ""),
          # write result to disk
           write_disk(path = temp_file))
tab <- readxl::read_excel(temp_file)
tab
#> # A tibble: 5 x 1
#>   text 
#>   <chr>
#> 1 what 
#> 2 fresh
#> 3 hell 
#> 4 is   
#> 5 this
unlink(temp_file)

Created on 2018-06-05 by the reprex package (v0.2.0).

You can wrap all this in a function for convenience.

And for the record, it is not the only way to do that I think but it works

4 Likes

Thank you so much!

Only change I made was from readxl::read_excel() to xlsx::loadWorkbook() so that I could preserve all the sheets instead of just the first one.

Make sense! I just use readxl because I don't use xlsx, so don't have it on my computer. :wink:

rio is a useful package that wraps a nice uniform API around lots of I/O utilities covering most any format you'll find. Among other niceties, it will make and delete tempfiles for you automatically, so if you pass it a URL to an Excel file, it will read it with readxl (or openxlsx if you prefer). It also handles file compression very nicely without extra steps.

so you can just do

test <- rio::import('https://github.com/CerebralMastication/Presentations/raw/master/test.xlsx')
test
#>    text
#> 1  what
#> 2 fresh
#> 3  hell
#> 4    is
#> 5  this

I don't believe it can handle authentication, though.

5 Likes