Best practice - model API in Plumber

Hello,

I am developing a model and hosting it in Docker as a REST API with Plumber. Let's say the function is calculate_prediction(x1, x2, ...) and it returns a probability y.

However, I had a discussion with the team on how to send the parameters to the function and we are stuck between two alternatives:

  1. Send all the parameters directly to the function as JSON or a query string, so that the function is calculate_prediction(x1, x2, ..., xn), where the x's are all the explanatory variables needed in the model.

  2. Send only the customer ID as the parameter, and have the function pick up the data itself - so the Plumber function would be called as calculate_prediction(id), and inside the function it collects the data for that id from the database and returns the prediction. (Rough Plumber sketches of both options below.)
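For concreteness, here is a minimal sketch of what the two alternatives might look like as Plumber endpoints. The model object, routes, and feature names are just placeholders, not our actual code:

```r
# plumber.R -- rough sketch of the two alternatives; all names are illustrative only
library(plumber)

model <- readRDS("model.rds")  # trained model loaded once at startup

#* Option 1: all explanatory variables arrive in the JSON request body
#* @post /predict
function(req) {
  newdata <- as.data.frame(req$body)  # e.g. {"x1": 1.2, "x2": 0.4, ...}
  predict(model, newdata = newdata, type = "response")
}

#* Option 2: only the customer ID arrives; the API fetches the features itself
#* @get /predict/<id>
function(id) {
  features <- fetch_features(id)  # placeholder for a database lookup
  predict(model, newdata = features, type = "response")
}
```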

Those who do not like option (1) claim that it is very "inefficient" to send large amounts of data this way (currently there are 20 variables, but that could increase), but I don't quite understand why.

Does anyone have an idea what is considered "best practice" here?

To my knowledge it's pretty standard to do option (1) in a more classic IT sense. If you have an IT architecture built on top of many independent microservices (APIs), where each of them does one little thing very well in isolation, only to pass it on to another microservice to perform another thing, you're going to end up passing around really large volumes of data that way. But that's actually completely fine; to my knowledge that's really what's considered best practice in today's IT world.

In our DS department we embrace the same concept, where the API is only responsible for receiving all the data in the request and using that data to make a prediction. We only rely on option (2) when, for instance, there is no IT service yet that can make the request with the entire payload in the form that we need it, OR if it's a Shiny app and the user interacts with the DB through the UI. Coming back to the prediction part, in the request we then accept some kind of an id (uuid), which acts only as a trigger to do the DB querying entirely on our API's side.
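A minimal sketch of that id-as-trigger pattern, assuming a DBI-compatible database; the connection details, table, and column names are made up for illustration:

```r
#* The request carries only an id (uuid); the feature lookup happens API-side
#* @get /predict/<id>
function(id) {
  con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "scoring")  # assumed DB
  on.exit(DBI::dbDisconnect(con))
  features <- DBI::dbGetQuery(
    con,
    "SELECT x1, x2 FROM customer_features WHERE customer_id = $1",
    params = list(id)
  )
  predict(model, newdata = features, type = "response")
}
```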

Long story short, it sort of depends on the entire philosophy that your company has in terms of building up its IT architecture, but option (1) is what is generally preferred because the service is more isolated and faster. Btw: 20 variables is really nowhere close to "large amounts of data" :wink:

2 Likes

Thanks a lot, very insightful! :blush:

Indeed, 20 variables is not much, but the idea is that the API is to be used for both batch and live-scoring, so for the batch part there could (theoretically) be millions of rows at once.
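Roughly, a batch request under option (1) would look something like this (just a sketch; the URL and columns are placeholders), only with millions of rows in the body instead of three:

```r
library(httr)
library(jsonlite)

batch <- data.frame(x1 = runif(3), x2 = runif(3))  # imagine millions of rows here

resp <- POST(
  "http://localhost:8000/predict",         # hypothetical endpoint
  body = toJSON(batch, dataframe = "rows"),
  content_type_json()
)
content(resp)
```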

Oh, ok. In that case you could indeed consider option (2) if that's what's about to happen. We rely on option (1) only for live-scoring; batch is a different discussion altogether.

I see.

Maybe implementing a hybrid could be optimal here - a function where you either send all variables, or send the IDs?

Another thing we have been discussing is where to save the trained ML model (e.g. "model.rds"). Do people usually just save these files directly on GitHub etc., or should they be saved in some database?

Not sure about the hybrid approach; to me it sounds like having two completely separate endpoints.
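To illustrate what I mean, a hybrid route would probably end up branching on the payload; the field names and the fetch_features() helper here are hypothetical:

```r
#* A single "hybrid" route that accepts either an id or the full feature payload
#* @post /predict
function(req) {
  body <- req$body
  if (!is.null(body$id)) {
    features <- fetch_features(body$id)  # assumed DB-lookup helper, not shown
  } else {
    features <- as.data.frame(body)      # full feature payload sent directly
  }
  predict(model, newdata = features, type = "response")
}
```

So it's really two code paths behind one route, which is why I'd lean towards keeping them as two endpoints.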

I think storing it on GitHub or Google Drive solves the problem of storing the model, right? I don't think there's really any point (or even a way) to keep it in a DB. Each subsequent model-building iteration should be a separate folder/directory and voila :slight_smile:
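Something like this, as a rough sketch of the folder-per-iteration idea (the version label, paths, and trained_model object are just examples):

```r
# each training run writes its artefact into its own versioned folder
version <- "2024-05-01_v3"                       # example version tag
dir.create(file.path("models", version), recursive = TRUE, showWarnings = FALSE)
saveRDS(trained_model, file.path("models", version, "model.rds"))

# the API is then pinned to one specific version at startup
model <- readRDS(file.path("models", version, "model.rds"))
```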

1 Like

This repository contains an example of a model deployed as an API that accepts new data as a JSON payload in the request body. As mentioned by @konradino, this approach works well for real-time scoring, but probably isn't optimal for batch processing.

3 Likes
