I have an analysis where I've used purrr::map* to repeat the following pattern for a series of dates.
* Well, really furrr::future_map, but I want to discuss basic principles here.
- Query database for data occurring on Day X.
- Crunch some stats for Day X.
- Repeat for all dates and return the calculations by day.
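In sketch form, the pattern looks like this (`query_day()` and `crunch_stats()` are hypothetical stand-ins for the real query and calculation steps):

```r
library(purrr)

# Hypothetical helpers standing in for the real work:
#   query_day(d)    -- pull Day d's rows from the database
#   crunch_stats(x) -- compute that day's summary statistics
calculations <- map(analysis_calendar, function(d) {
  day_data <- query_day(d)  # query database for Day X
  crunch_stats(day_data)    # crunch stats for Day X
})
# `calculations` is a list with one result per date
```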
Since each day's data is quite large and I'm doing this for a full year of dates, the job takes a long time (8-10 hours depending on EC2 instance size). I'd like to use sparklyr to leverage Spark to speed up the computations. I'll show my approach below with code, but as I wrote it out I started to wonder whether there's a better practice for this pattern.
```r
library(tidyverse)
library(sparklyr)

analysis_calendar <- analysis_start_date %>%
  seq.Date(analysis_end_date, by = "day")

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "path/to/driver/RedshiftJDBC42-no-awssdk-1.2.41.1065.jar"

sc <- spark_connect(master = "local", config = config)

jdbc_read_url <- config %>%
  glue::glue_data("jdbc:redshift://{host_url}:{port}/{db_name}") %>%
  as.character()
```
```r
calculations <- analysis_calendar %>%
  map(~ {
    ith_query <- glue::glue_sql(
      "SELECT id, date, col1, col2, col3, col4
       FROM {`schema_code`}.table_5min_data
       WHERE date = {.x}",
      .con = DBI::ANSI()  # use ANSI connection for proper date quoting
    ) %>%
      as.character()

    data_sdf <- sc %>%
      spark_read_jdbc(
        name = "5min_data",
        options = list(
          url      = jdbc_read_url,
          user     = creds$user_name,
          password = creds$password,
          query    = ith_query,
          driver   = "com.amazon.redshift.jdbc42.Driver"
        ),
        memory    = FALSE,
        overwrite = TRUE
      )

    # Just a sample computation
    data_sdf %>%
      summarise(average = mean(col1, na.rm = TRUE)) %>%
      collect()
  })
```
The code above really boils down to purrr::map(R list, ~ Spark calls). It seems to work, but something about this construction just feels wrong to me. Is there a better way to use sparklyr/Spark to handle the repeated database querying and the per-day computations on all of those data frames?
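For reference, here's one untested alternative I can sketch: read the whole year in a single partitioned JDBC scan and let Spark do the per-day grouping, so the looping happens inside Spark rather than in R. The `dbtable` subquery form and the `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` values here are standard Spark JDBC options, but the specific column choice and bounds are guesses, not something I've run against this table:

```r
# Sketch only: one partitioned read of the full year, then group_by(date)
# inside Spark instead of a map() loop in R.
year_query <- glue::glue_sql(
  "SELECT id, date, col1, col2, col3, col4
   FROM {`schema_code`}.table_5min_data
   WHERE date BETWEEN {analysis_start_date} AND {analysis_end_date}",
  .con = DBI::ANSI()
) %>%
  as.character()

year_sdf <- sc %>%
  spark_read_jdbc(
    name = "year_data",
    options = list(
      url      = jdbc_read_url,
      user     = creds$user_name,
      password = creds$password,
      # Spark's JDBC partitioning options require `dbtable` (a subquery
      # alias works) rather than `query`.
      dbtable         = paste0("(", year_query, ") AS yr"),
      partitionColumn = "id",      # guessed partition column
      lowerBound      = "1",       # guessed id range
      upperBound      = "1000000",
      numPartitions   = "16",      # guessed parallelism
      driver          = "com.amazon.redshift.jdbc42.Driver"
    ),
    memory    = FALSE,
    overwrite = TRUE
  )

calculations <- year_sdf %>%
  group_by(date) %>%
  summarise(average = mean(col1, na.rm = TRUE)) %>%
  collect()
```

That keeps the aggregation distributed, but I don't know whether a year's worth of 5-minute data is reasonable to scan in one go here, which is part of what I'm asking.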