Shiny dashboard that gives an overview of all API calls made on Connect, timings, trends etc.?

konradino · June 13, 2019, 8:03pm

Hello!

I'm looking for a way to set up a dashboard (it doesn't necessarily need to be Shiny; it could be another tool, though Shiny might be convenient) that would give me an overview of all the API calls made against RStudio Connect and their timings, and show trends, e.g. whether the timings are increasing or decreasing. Is this even possible with the RSC API? Has anyone tried that and can share experiences?

Thanks!

cole · June 13, 2019, 8:08pm

Hey @konradino ! Would you mind clarifying what you mean by:

all the API calls that are made against R Studio Connect

Do you mean all Plumber or TensorFlow API calls made? Or all API calls made to the RStudio Connect Server API? Or all API calls period?

I think that would very much help us understand the context of your question!

konradino · June 13, 2019, 8:10pm

Sure! Sorry for mixing it up a bit. I actually meant all calls made to Plumber and TensorFlow APIs that are deployed on our servers along with their timings and possibly logs.

cole · June 13, 2019, 8:20pm

Ahh nice. Yes, that makes much more sense. Unfortunately, we haven't yet taken on the work to log "instrumentation data" about API performance. This is something we are thinking about, and it is challenging because you don't want to impact the API performance too much by gathering information.

At present, instrumentation data is only captured for Shiny and static assets (RMarkdown, Jupyter, etc.): https://docs.rstudio.com/connect/admin/historical-information.html#historical_events

There is at least one example of how to report on that instrumentation data, an RMarkdown report: sol-eng/connect-usage on GitHub ("Report on RStudio Connect Usage").

It would definitely be helpful to get a picture of what kinds of metrics and information you're looking for, though, so we can pass it along to the engineering team when they get to this feature. Accessing the logs programmatically would be one type of Connect Server API endpoint. The metrics (e.g. response time, request time, etc.) would probably be an instrumentation endpoint. Do you have any other metrics or information in mind that would be helpful?

Unfortunately, today the only options are to either implement this type of logging yourself (in the Plumber or TensorFlow API code) or to put a proxy in front of RStudio Connect that can track response times by request.

konradino · June 13, 2019, 8:34pm

Thanks for your prompt reply @cole!

Regarding the metrics, I think you pretty much nailed the basic ones that we really need: we essentially need to know how many calls are made and when, as well as the timings of responses and requests. Perhaps we could also get a deep-dive on the number of processes serving a given API at a given point in time (compared to requests made) to get a feeling for whether there are sufficient resources? I'm not an expert in the field, so I'm just giving some ideas.

2.1. By implementing it ourselves, do you simply mean including some timing fields in the response body of the API, so that when it's persisted with the main part of the response (e.g. in a DB) we could analyse it?

2.2. What do you mean by that proxy exactly? Would that need to happen for each API individually? Are there common business solutions of that kind that offer some analytics out of the box?

cole · June 14, 2019, 2:40am

Awesome! That is great feedback, thanks!! Do you think "each request" matters, and do you care about the requester? One of the things we had discussed was "aggregating" some of these requests together so the database would not get too large and performance would not be affected too much.

2.1. Right, I basically mean use something in your R code to write timing somewhere (a file, a database, in the response, etc.). Obviously this misses out on part of the request and so is less than ideal in some form or fashion. One random thought you might be interested in is using another API in front of the others as a router... this adds latency, but can be useful in e.g. model A/B testing as described here: https://solutions.rstudio.com/model-management/overview/
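
As a rough illustration of the do-it-yourself option in 2.1, here is a minimal sketch of timing a handler from inside the API code and writing the result somewhere. The thread is about Plumber (R); Python is used here only for illustration, and the names `timed`, `to_csv`, `TIMINGS` and `predict` are hypothetical, not part of Plumber or RStudio Connect. As noted above, this only measures time spent inside the handler, not the full request.

```python
import csv
import io
import time
from functools import wraps


def timed(endpoint, sink):
    """Wrap a handler so each call appends (endpoint, start_epoch, duration_ms) to sink."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            started = time.time()          # wall-clock time of the call
            t0 = time.perf_counter()       # monotonic clock for the duration
            try:
                return handler(*args, **kwargs)
            finally:
                duration_ms = (time.perf_counter() - t0) * 1000.0
                sink.append((endpoint, started, duration_ms))
        return wrapper
    return decorator


def to_csv(records):
    """Serialise the collected timings as CSV text (could equally go to a file or DB)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["path", "requesttime", "duration_ms"])
    writer.writerows(records)
    return buf.getvalue()


TIMINGS = []  # hypothetical in-memory sink; a real setup would persist this


@timed("/myapi", TIMINGS)
def predict(x):
    # Stand-in for the real model call.
    return {"prediction": 2 * x}
```

Calling `predict(21)` returns the normal response and leaves one timing record in `TIMINGS`, which `to_csv(TIMINGS)` turns into text that a dashboard could read later.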

2.2. Yeah, I don't have a clear sense of what this looks like either. I suspect that there are proxy layers (paid and open source) like this out there in the industry. What I envisioned was a single nginx server that does a proxy_pass to Connect. This nginx server would need to be smart enough to know which requests to track performance on, but then would log the request and response. The difference between them is the elapsed time. You could also probably do it with sufficiently verbose nginx logs that include the timestamp.

Basically:

client makes request
client request hits nginx (log event)
request proxied to Connect
Connect serves request and sends response
response hits nginx (log event)
response proxied back to the client

Then you just take the difference between the log events to get the request time. This is massively oversimplified and requires additional standing infrastructure, and there are probably products that do this for you, but it is an option if your timeline is faster than ours for adding such functionality to Connect natively.
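
The "difference between the log events" step can be sketched as follows. This assumes a made-up log layout of `<ISO timestamp> <request id> <phase> <path>`, one line per event; it is not a real nginx log format, and `parse_event` and `request_durations` are hypothetical helpers. In practice nginx can also record the elapsed time itself via the `$request_time` variable in a `log_format`, which may make the pairing step unnecessary.

```python
from datetime import datetime


def parse_event(line):
    """Split a line of the assumed form '<ISO timestamp> <request id> <phase> <path>'."""
    stamp, request_id, phase, path = line.split()
    return datetime.fromisoformat(stamp), request_id, phase, path


def request_durations(lines):
    """Return {request_id: (path, seconds)} for every request that has both events."""
    started = {}
    durations = {}
    for line in lines:
        if not line.strip():
            continue
        when, request_id, phase, path = parse_event(line)
        if phase == "request":
            started[request_id] = when
        elif phase == "response" and request_id in started:
            elapsed = (when - started.pop(request_id)).total_seconds()
            durations[request_id] = (path, elapsed)
    return durations


# Hypothetical sample: two overlapping requests to the same API.
SAMPLE_LOG = """\
2019-06-14T02:40:00.000000 r1 request /myapi
2019-06-14T02:40:00.005000 r2 request /myapi
2019-06-14T02:40:00.010000 r1 response /myapi
2019-06-14T02:40:00.035000 r2 response /myapi
"""

durations = request_durations(SAMPLE_LOG.splitlines())
```

Pairing by a request id rather than by line order matters because requests overlap: here `r2` starts before `r1` has finished.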

The feedback is super helpful, though, so thank you for that! It definitely helps us understand what would help our customers!

konradino · June 14, 2019, 7:31am

At which level were you thinking of aggregating those requests? Could you maybe give an example?

2.1. Right, I basically mean use something in your R code to write timing somewhere (a file, a database, in the response, etc.). Obviously this misses out on part of the request and so is less than ideal in some form or fashion. One random thought you might be interested in is using another API in front of the others as a router... this adds latency, but can be useful in e.g. model A/B testing as described here: https://solutions.rstudio.com/model-management/overview/

Thanks - I'll give these new resources a read. Really good work with the solutions website, I found it very useful already!

2.2. Yeah, I don't have a clear sense of what this looks like either. I suspect that there are proxy layers (paid and open source) like this out there in the industry. What I envisioned was a single nginx server that does a proxy_pass to Connect. This nginx server would need to be smart enough to know which requests to track performance on, but then would log the request and response. The difference between them is the elapsed time. You could also probably do it with sufficiently verbose nginx logs that include the timestamp.

I spoke with our IT and apparently we have two solutions for that: 1) Apigee from Google, which serves as that proxy layer, and 2) Papertrail for persisting logs and doing some visualizations. I'll check those out and let you guys know how it worked out.

cole · June 14, 2019, 4:19pm

We definitely haven't landed on anything or discussed enough to have a firm proposal here. Just an idea. For instance, let's imagine you have an API at /myapi that gets 100 requests/second. Rather than write 100 records to the database that look like:

path,   requesttime,  duration, client
/myapi, 193051351351, 10ms,     cole

We might write one record that looks like:

path,   requesttime,  n,   min, median, max,  mean, client
/myapi, 193051351351, 100, 8ms, 10ms,   30ms, 12ms, cole

Any initial thoughts on that idea? My inner data scientist is always loath to lose the granular data, but we could probably learn from tools that already do this type of monitoring to see how they aggregate, etc. (One record per second is still ~31.5 million records per year!)
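
To make the aggregation idea concrete, here is a minimal sketch that collapses per-request records into one summary row per path, client and time window, using the same columns as the example above. The function name `aggregate`, the `window_s` parameter and the use of seconds for `requesttime` are assumptions for illustration, not a description of anything in Connect. The last line checks the "~31.5 million" figure: one record per second for a 365-day year.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate(records, window_s=1):
    """Group (path, request_time_s, duration_ms, client) records into per-window summaries."""
    groups = defaultdict(list)
    for path, request_time, duration_ms, client in records:
        # Truncate the timestamp to the start of its window.
        window_start = int(request_time // window_s) * window_s
        groups[(path, window_start, client)].append(duration_ms)

    rows = []
    for (path, window_start, client), durations in sorted(groups.items()):
        rows.append({
            "path": path,
            "requesttime": window_start,
            "n": len(durations),
            "min": min(durations),
            "median": median(durations),
            "max": max(durations),
            "mean": mean(durations),
            "client": client,
        })
    return rows


# One summary record per second over a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
```

A wider `window_s` (say 60) trades granularity for a smaller table; the per-window min, median, max and mean are what survive.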

Great to hear!! I will pass it along to those who worked on it!

Exciting stuff! I'm looking forward to hearing how it goes!

konradino · June 25, 2019, 7:55pm

Hi @cole! Apologies for a late reply, this slipped through somehow.

Any initial thoughts on that idea? My inner data scientist always loathes to lose the granular data, but we could probably learn from tools that already do this type of monitoring to see how they aggregate / etc. (one per second is still ~ 31.5 million records per year! )

Haha, I can completely relate to your comment about the loss of granularity! If I were asked about this, my immediate answer would be the same: "of course we need this at the lowest level of granularity", but you made a fair point: the number of records here could grow out of control.

Perhaps there's some middle ground here: keep individual call records for a certain period of time (no clue what that could be, 1-3 months?) for detailed inspection and remove them after that time, but persist aggregated logs to see the long-term trends? That way the user would be able to inspect individual logs on an ongoing basis and still have the full aggregated, historical picture.
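
A minimal sketch of that middle ground, assuming records of the form `(path, request_time_s, duration_ms)` and a 90-day retention window: raw records older than the cutoff are dropped, while a daily summary is kept for every day. The function `compact`, the `keep_days` default and the daily bucket size are hypothetical choices, not an existing Connect feature.

```python
from collections import defaultdict
from statistics import mean

SECONDS_PER_DAY = 24 * 60 * 60


def compact(records, now, keep_days=90):
    """Return (recent_raw_records, daily_summaries).

    Raw records are kept only if they fall within the last `keep_days` days;
    daily summaries are built from all records so long-term trends survive.
    """
    cutoff = now - keep_days * SECONDS_PER_DAY
    recent = [r for r in records if r[1] >= cutoff]

    per_day = defaultdict(list)
    for path, request_time, duration_ms in records:
        day_start = int(request_time // SECONDS_PER_DAY) * SECONDS_PER_DAY
        per_day[(path, day_start)].append(duration_ms)

    summaries = [
        {
            "path": path,
            "day_start": day_start,
            "n": len(durations),
            "min": min(durations),
            "max": max(durations),
            "mean": mean(durations),
        }
        for (path, day_start), durations in sorted(per_day.items())
    ]
    return recent, summaries
```

Run periodically, this keeps the detailed table bounded by the retention window while the summary table grows by at most one row per path per day.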

We're still in the process of setting our logging up, but I'll keep you posted when we land on something.
