How to disconnect spark when idle

Hi,
We are a few people working on an RStudio Server instance, and we use sparklyr to access data and run computations on an HDFS/Spark cluster (dozens of requests per day each).

Our problem is the following: we connect with sparklyr to work, but the connection does not stop when idle, even after an entire weekend. The cluster is not that big, and each connection uses 1 core (we use dynamic allocation) and several GB of RAM that other people cannot use and, more importantly, that night and weekend cron jobs cannot use either.
We have to remember to disconnect manually before leaving and before switching to something else, which is not great, and we easily forget.

I spent some time in the Spark configuration parameters, but since sparklyr uses spark-shell to connect, it does not seem possible to stop the connection when idle (it is difficult for Spark to recognize an idle session), at least to my knowledge.
It should be easier to recognize an idle session inside RStudio (no activity and no running command), and I know idle detection exists there. So I tried adding:

library(sparklyr)
spark_disconnect_all()

to the .Last function in .Rprofile, hoping it would disconnect when RStudio is idle, but it does not seem to work (I have session-timeout-minutes=20 in /etc/rstudio/rsession.conf).
When I manually run:

rstudio-server force-suspend-session (pid)

it correctly stops the Spark session.
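
For reference, the .Last hook described above could be sketched roughly like this. It assumes the hook lives in the user's ~/.Rprofile; the requireNamespace() guard is an addition for illustration so the hook does not fail in sessions where sparklyr is unavailable. R calls .Last() when the session ends through quit(), so this sketch by itself does not make RStudio treat the session as idle.

```r
# ~/.Rprofile (assumed location)
# R invokes .Last() at the end of a session that exits through quit()/q().
.Last <- function() {
  # Guard added for illustration: do nothing if sparklyr is not installed.
  if (requireNamespace("sparklyr", quietly = TRUE)) {
    # Close every Spark connection opened from this R session.
    sparklyr::spark_disconnect_all()
  }
  invisible(NULL)
}
```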

The only workaround I see right now is a cron job that forcefully stops sessions at a fixed hour or day. But it's ugly: it does not detect running computations, so we cannot launch a big computation at night without modifying the cron job first, and it cannot be used safely during the day (it would stop people who may be working).

My questions are:

  • Does an active Spark connection block idle detection and session suspension in RStudio? Why? Can I change that?
  • Can I detect an idle state from outside an RStudio session? Or the time when the last command was run? Or whether a command is currently running?
  • Is there a Spark parameter I missed that automatically disconnects when nothing has been launched for a certain amount of time?

In RStudio Server Pro you can define:

  • session-timeout-minutes
  • session-timeout-kill-hours

I can't say for certain whether this will help in your case, but you may want to experiment with the session-timeout-kill-hours setting:

To configure the amount of idle time to wait before killing and destroying sessions you can use the session-timeout-kill-hours option. This allows you to specify when a session should automatically be cleaned up when it has been idled, allowing you to automatically reclaim temporary disk space used by the sessions, and to stop their processes and children.

This is described in section 5.2.3, "Session timeout kill", of the admin guide.
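
As a rough illustration, the two options could sit together in the session configuration file. This is only a sketch: the 48-hour value is an arbitrary example, and I'm assuming session-timeout-kill-hours goes in the same /etc/rstudio/rsession.conf file that already holds session-timeout-minutes, so check the admin guide for your version.

```
# /etc/rstudio/rsession.conf
# Suspend sessions after 20 idle minutes (value taken from the question above).
session-timeout-minutes=20
# Kill and destroy sessions after this many idle hours (example value, RStudio Server Pro).
session-timeout-kill-hours=48
```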

Hi andrie,
Thanks for the reply.
I don't think that solves my problem, since the documentation says the session's data is lost forever (I don't want to lose unsaved work because I forgot to save a script before going to lunch). That's why I want to suspend the session, not kill it.
Is there really no way to configure idle detection? Or at least to detect it from an external script?

RStudio Server explicitly prevents sessions that use Spark from suspending, since Spark does not support suspending sessions. In general, the R session is not suspended while a connection is active in the Connections pane.

However, you can disable the Connections pane, and thereby allow R sessions to suspend, by setting the connectionObserver option to NULL before connecting to Spark, as in:

library(sparklyr)
# Set the R option (not a variable) before connecting
options(connectionObserver = NULL)
sc <- spark_connect(master = "local")

That said, rather than deactivating the Connections pane, it would be nicer to support opting out of blocking suspension. I've opened this RStudio issue to track progress on that feature request: https://github.com/rstudio/rstudio/issues/4194
