Issue with Plumber API performance

Hi, I am having issues with maintaining proper performance with my Plumber API. I have one function, lets call it update, that is designed to receive a bunch of parameters from an Excel macro and create an SQL query out of them, and submit them all to a database by executing that SQL query. The issue is that there will be a very large number of queries being made, all sequentially. The Excel app cannot hand control back to the user while it is waiting for these queries to be processed, which can take upwards of 20 minutes. Not to mention, that during this entire 20 minutes the API is not usable by others.

So, in my search to find a solution, I dug around in the Plumber documentation. In section 5.3, it says "All filters and endpoints should complete very quickly and any long-running or complicated tasks should be done outside of the API process." (emphasis mine). This seems like a potential solution to my problem, but I have no idea how to offload the massive amounts of SQL query executions to something outside of the API process. I don't understand Plumber or the ways that it handles processes well enough to figure this out on my own. My best and only idea is to create a second Plumber API, and have the first API call that one, but that introduces its own problems, like how I would have to be able to send all of my queries over to the second API at once.

My understanding of Plumber is limited and the documentation for it, while very helpful, is incomplete. If I could get some help from someone I would greatly appreciate it. If you need more information about my setup, let me know and I'll be happy to provide. Thanks.

Check out this issue for some inspiration on how to handle polling processes: https://github.com/rstudio/plumber/issues/497

Specifically check out the tokic.R example near the end of https://github.com/rstudio/plumber/issues/497#issue-504880501


Yes, (if possible) complicated tasks should be done outside of the Plumber API.

So for the database queries, is it possible to pre-fetch the data information? This way the request is done 1 time vs N times.

As you said, I would not investigate bringing in another Plumber API.


If you are able to just "refresh" the data arbitrarily (not on demand), please check out the later R package: https://github.com/r-lib/later .

Imagining you have a set of data (dt) that just needs to be refreshed using the update_data function, you could add a bit of code to the top of your Plumber file

dt <- list()
refresh_dt <- function() {
  # set `dt` with output from `update_data()`
  dt <<- update_data()
  # recursively call refresh_dt with a delay of 1 hr
  later::later(
    refresh_dt, 
    1 * 60 * 60 # (1 hr * 60min/hr * 60s/min)
}
# call refresh to init the data
refresh_dt()

Hope one of the options above helps!

Thank you for your detailed response. I do have a few questions to better understand it.

How do I go about performing complicated tasks outside of the Plumber API? Do you just mean trying to do some of the computation inside my Excel macro?
Is there a way for the API to continue doing its computations, but return control of Excel back to the user? Like basically a 'handoff' of all the applicable data is given to the API, the API then tells the macro that it has received it, and can carry out the tasks in the background, without needing the user to wait for it to finish?

So, it would require a lot of restructuring of my Excel macro, but it could be done. I would have to send one massive string of all of the data, and then have the API parse it all out and perform its tasks from there. Is that what you mean by pre-fetch?

Also, I am unsure what you mean by "refresh" the data. In my program, the data is being sent from the Excel sheet using the API URL (each parameter is a parameter in the URL). So, the data doesn't really exist as 'data', from the API's point of view. It's just a bunch of strings.

I hope this provides a bit of clarity. Thanks again.

How do I go about performing complicated tasks outside of the Plumber API?

This means to remove "consistent" parts of your data. If your route handler tries to download data that is consistent for every route request, then it could be moved from the route to outside the route.

get_iris_data <- function() {
  Sys.sleep(1) # artificial slowness
  iris
}

#' @get /iris_slow
function(){
  dt <- get_iris_data()
  c(dt$Sepal.Length, runif(1))
}

# Move constant data calculations out of the route handler
dt <- get_iris_data()
sepal_length <- dt$Sepal.Length
#' @get /iris_faster
function(){
  c(sepal_length, runif(1))
}

Is there a way for the API to continue doing its computations, but return control of Excel back to the user?

Yes, this concept is similar to the approach of Microsoft Flow (from the GH Issue).


Demo implementation: tokic.R example near the end of https://github.com/rstudio/plumber/issues/497#issue-504880501


In my program, the data is being sent from the Excel sheet using the API URL (each parameter is a parameter in the URL). So, the data doesn't really exist as 'data', from the API's point of view. It's just a bunch of strings.

Ah. So is your route handler fitting a model given the supplied parameters? There is no data knowledge kept within the Plumber API?

Why does one route request take 20 minutes to compute? Do you have a watered down example that you can share to paint the picture of your issue? In general, I believe that is something that needs to be reduced.

Ah I see what you mean. Unfortunately, with the current structure of my program, I don't know if it is possible to remove the complicated parts of my data, because the data doesn't exist as a data frame in the R code.

#* Add parameters to DB
#* @param parameter1
#* @param parameter2
# ...
#* @get /addParameters
function(parameter1, parameter2) {
  print("Running addParameters")
  ParametersValues <- list(parameter1, parameter2)
  ParametersNames <- c("Parameter 1 Name", "Parameter 2 Name")
  
  for(j in 1:(length(ParametersValues))) {
    conn <- poolCheckout(pool)
    sql <- sqlInterpolate(
      conn = conn,
      sql = "INSERT INTO database (Parameter, Value) VALUES (?parameter, ?actualVal)",
      parameter = ParametersNames[j],
      actualVal = ParametersValues[[j]][[1]]
    )
    DBI::dbExecute(conn, sql)
    poolReturn(conn)
  }
}

Here is an example of what I'm doing. The Excel macro is using a double for-loop to grab every value in a very large area of cells. It then creates a URL for each row in this area, and performs one GET request to the API per row, sending each value in that row. As I said, the area of cells is large. There are roughly 10,000 cells in it that need to be sent to the API and submitted to the SQL database. The code above is for each row in this grid. There are about 300 rows. So, parameter1 and parameter2 would be values in that row. Then the for-loops loop through each of these values in the row and submit them one-by-one to the database. There will be one row in the database per value in the row, which means there will be 10,000 SQL queries performed every time the macro is run. That's why it takes upwards of 20 minutes, and why I need a way for the API to just say "Ok, I have all the values that need to be submitted" and return control back to Excel.

Your github post was very informative, and it looks like its exactly the functionality that I need. I'll look at the R code in it some more and see if I can implement it into my API.

Ah! So you're needing to do a batch processing after acknowledging receiving the data.

tl;dr reduce communication attempts to a minimum. Transferring data is cheap, connecting is expensive.


First thing I would look into is to reduce the number of requests being made to plumber and to the database.

In a perfect world, a round trip internet request takes about 200ms with 0ms of processing time. So 10000 requests * 0.2 s / request * (1min/60s) = 30+mins of wasted time. So completing your task in 20mins is beating the odds.

Since you already have the information in hand at the time of /addParamters, you could change the script to add multiple values at one time. Similar to

INSERT INTO MyTable
  ( Column1, Column2, Column3 )
VALUES
  ('John', 123, 'Lloyds Office'), 
  ('Jane', 124, 'Lloyds Office'), 
  ('Billy', 125, 'London Office'),
  ('Miranda', 126, 'Bristol Office');

(https://stackoverflow.com/a/452934/591574)

This would reduce the sql queries to 300 (vs 10k). (This could be reduced even further if all information could be sent at once.)

Something like (assuming same length of names/values)

a <- letters[1:10]
b <- 11:20
sql_txt <- paste0(
  "INSERT INTO database\n  (Parameter, Value)\nVALUES\n  ", 
  paste0(
    "(?parameter", seq_along(a), ", ?actualVal", seq_along(b), ")", 
    collapse = ",\n  "
  )
)
cat(sql_txt)
INSERT INTO database
  (Parameter, Value)
VALUES
  (?parameter1, ?actualVal1),
  (?parameter2, ?actualVal2),
  (?parameter3, ?actualVal3),
  (?parameter4, ?actualVal4),
  (?parameter5, ?actualVal5),
  (?parameter6, ?actualVal6),
  (?parameter7, ?actualVal7),
  (?parameter8, ?actualVal8),
  (?parameter9, ?actualVal9),
  (?parameter10, ?actualVal10)
sql_params <- append(
  setNames(as.list(a), paste0("parameter", seq_along(a))),
  setNames(as.list(b), paste0("actualVal", seq_along(b)))
)
str(sql_params)
List of 20
 $ parameter1 : chr "a"
 $ parameter2 : chr "b"
 $ parameter3 : chr "c"
 $ parameter4 : chr "d"
 $ parameter5 : chr "e"
 $ parameter6 : chr "f"
 $ parameter7 : chr "g"
 $ parameter8 : chr "h"
 $ parameter9 : chr "i"
 $ parameter10: chr "j"
 $ actualVal1 : int 11
 $ actualVal2 : int 12
 $ actualVal3 : int 13
 $ actualVal4 : int 14
 $ actualVal5 : int 15
 $ actualVal6 : int 16
 $ actualVal7 : int 17
 $ actualVal8 : int 18
 $ actualVal9 : int 19
 $ actualVal10: int 20
conn <- poolCheckout(pool)
sql <- sqlInterpolate(
  conn = conn,
  sql = sql_txt,
  .dots = sql_params
)
DBI::dbExecut(conn, sql)
poolReturn(conn)

(No for-loop, yay!)

Final product:

#* Add parameters to DB
#* @param parameter1
#* @param parameter2
# ...
#* @get /addParameters
function(parameter1, parameter2) {
  print("Running addParameters")
  # set up params
  ParametersValues <- list(parameter1, parameter2)
  ParametersNames <- c("Parameter 1 Name", "Parameter 2 Name")

  # create GIANT sql command
  sql_txt <- paste0(
    "INSERT INTO database\n  (Parameter, Value)\nVALUES\n  ", 
    paste0(
      "(?parameter", seq_along(ParametersValues), 
      ", ?actualVal", seq_along(ParametersNames), ")", 
      collapse = ",\n  "
    )
  )
  # setup paramters for sql command
  sql_params <- append(
    setNames(
      as.list(ParametersNames), 
      paste0("parameter", seq_along(ParametersNames))),
    setNames(
      as.list(ParametersValues), 
      paste0("actualVal", seq_along(ParametersValues)))
  )

  conn <- poolCheckout(pool)
  sql <- sqlInterpolate(
    conn = conn,
    sql = sql_txt,
    .dots = sql_params
  )
  DBI::dbExecut(conn, sql)
  poolReturn(conn)

  # return the time it completed inserting data
  paste0("Done! ", Sys.time())
}

I'm hoping this will be able to be done without the use of pool as the number of queries will be MUCH less. I'm also hoping the execution time will be MUCH faster, allowing you to do it in real time and not need to use later or callr.


If the queries still take too long, I would look into callr R package. It will allow you to do your data processing in another R session. later will not be of good use for you as it will still block the main thread.

callr has a function r_bg which will call and R background process and execute a function.

Putting it all together, it'd be something like

#* Add parameters to DB
#* @param parameter1
#* @param parameter2
# ...
#* @get /addParameters
function(parameter1, parameter2) {
  print("Running addParameters")
  # set up params
  ParametersValues <- list(parameter1, parameter2)
  ParametersNames <- c("Parameter 1 Name", "Parameter 2 Name")

  # create GIANT sql command
  sql_txt <- paste0(
    "INSERT INTO database\n  (Parameter, Value)\nVALUES\n  ", 
    paste0(
      "(?parameter", seq_along(ParametersValues), 
      ", ?actualVal", seq_along(ParametersNames), ")", 
      collapse = ",\n  "
    )
  )
  # setup paramters for sql command
  sql_params <- append(
    setNames(
      as.list(ParametersNames), 
      paste0("parameter", seq_along(ParametersNames))),
    setNames(
      as.list(ParametersValues), 
      paste0("actualVal", seq_along(ParametersValues)))
  )

  # exec in R background process
  pr <- callr::r_bg(
    function(sql_, sql_params_) {
      # all functions must be namespaced or re-`library()` in this func

      # open conn
      conn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # change for real info
      # setup
      sql <- DBI::sqlInterpolate(
        conn = conn,
        sql = sql_,
        .dots = sql_params_
      )
      # exec
      DBI::dbExecute(conn, sql)
      # close conn
      DBI::dbDisconnect(conn)
    },
    # pass in parameters to external function
    list(
      sql_ = sql_txt,
      sql_params_ = sql_params
    )
  )

  paste0("Done! ", Sys.time())
}

Caveat... This will create a new R process for each request. Creating 300 R processes all at once will break your machine.

End goal: I would instead work hard to send ALL 10k cell info at once and try to make a single sql insert statement in the main R thread (without callr/later etc). If the function takes over 10s (or longer than tolerable), try using callr to submit the statement in a background R process.


Hope this helps!

5 Likes

You have no idea how helpful this was, thank you! I have it working and it now runs in under 90 seconds! I will mark your answer as a solution. But can you answer one more question for me? I will be hosting this Plumber API on R Connect, so with that in mind, would I be able to use your r_bg function, creating 300 processes, and see performance gains? 90 seconds is obviously much better than 20+ minutes, but if I can get it lower I'd love to. Thanks again!

Edit: I just re-read your End Goal. That sounds exactly like what I want, but I don't see how I could send all 10K cells to the API without having 10K parameters declared in and above the function to receive them. I am not a Plumber expert though, so maybe I'm missing something.

Yay! Glad it's working! 7% of the original time. Awesome

Hopefully we can get it even faster if we can do 1 plumber route with 1 sql. We're still making 300 round trip requests. I believe we can get it to 1 round trip.

My goal is < 10 seconds, hopefully ~1 second.


Context

because it's a lot of text above...

@barret
End goal : I would instead work hard to send ALL 10k cell info at once and try to make a single sql insert statement in the main R thread (without callr/later etc). If the function takes over 10s (or longer than tolerable), try using callr to submit the statement in a background R process.

@duncanispro13
Edit: I just re-read your End Goal. That sounds exactly like what I want, but I don't see how I could send all 10K cells to the API without having 10K parameters declared in and above the function to receive them. I am not a Plumber expert though, so maybe I'm missing something.


Use POST not GET

Looking at ...

send all 10K cells to the API without having 10K parameters declared

We should change the plumber route (and your excel code) to use a POST command. This is due to having limits on urls (~2k characters in total to be safe). POST urls can have body values that are in the megabytes without issue (plumber is only really limited by memory within R).

Your plumber route will only need to change the plumber tag to

#' @post /addParameters

VBA + JSON content

Next, we need to update the VBA to POST JSON content. I'll leave that for you.


Combine parameters into a list

Finally, I would send one parameter to the plumber router: parameterList. This parameter list will be an array of all prior parameter1 and parameter2 combinations.

(So giant assumption about your vba code...)
So I would change your for loop from

for every row { 
  get info
  post url 
}

to

make data
for every row { 
  append info to data 
}
post url w/ data

Change data structure of parameter information

I would also change the data structure to send the name and value so that plumber can just pass along the values. This will allow the code to scale

Hopefully the final data structure / body text string will be something like the example below. The title of parameterList is important to distinguish what variable the content goes to.

{
  "parameterList": [
    {"value": "A", "name": "Parameter 1 Name"},
    {"value": "B", "name": "Parameter 2 Name"},
    {"value": "C", "name": "Parameter 2 Name"},
    ...
    {"value": "ZZ", "name": "Parameter 10000 Name"}
  ]
}

POST plumber route

Altering the plumber route to handle parameterList with the updated data structure...

#* Add parameters to DB
#* @param parameterList
# ...
#* @post /addParameters
function(parameterList) {
  print("Running addParameters")
  # set up params
  ParametersValues <- lapply(parameterList, `[[`, "value") # get every item's value
  ParametersNames <- lapply(parameterList, `[[`, "name")  # get every item's name

  # create GIANT sql command
  sql_txt <- paste0(
    "INSERT INTO database\n  (Parameter, Value)\nVALUES\n  ", 
    paste0(
      "(?parameter", seq_along(ParametersValues), 
      ", ?actualVal", seq_along(ParametersNames), ")", 
      collapse = ",\n  "
    )
  )
  # setup paramters for sql command
  sql_params <- append(
    setNames(
      ParametersNames, 
      paste0("parameter", seq_along(ParametersNames))),
    setNames(
      ParametersValues, 
      paste0("actualVal", seq_along(ParametersValues)))
  )

  conn <- poolCheckout(pool)
  sql <- sqlInterpolate(
    conn = conn,
    sql = sql_txt,
    .dots = sql_params
  )
  DBI::dbExecut(conn, sql)
  poolReturn(conn)

  # return the time it completed inserting data
  paste0("Done! ", Sys.time())
}

Notes

The golden parts about this setup:

  • Removes all unnecessary communication
    • Only one request to plumber
    • Only one request to SQL (Keeping the same setup as before, which already worked!)
    • Sending all 10k data cells in one POST command is the fastest way to get all of the data to plumber
  • callr isn't needed as only one round-trip of data communication happens to transfer all of the data
  • Shouldn't block the plumber router as it should be a quick SQL insert

Please let me know how this goes! :smile::smile:

1 Like