Auto-saving the session environment

(The question is about RStudio but probably goes beyond that, so maybe it should go somewhere more generic...)

At a high level, my question is about automatic saving of the image associated with the session I'm working in --- and also, at the end of this topic, a question about history management.

A situation I have encountered way too many times is that when an R session crashes for one reason or another, I lose all the data in memory --- that is, data stored in data frames, lists, etc. that I generated after the last call to save.image(). This is expected, especially since there is no way to attach to the rsession process (even with a debugger) and access its memory, but it costs me hours afterwards to regenerate all the data since that last call to save.image(); I have roughly a 5 GB .RData file (uncompressed --- I avoid compression for a faster save/load cycle).
The bottom line is that whenever something like that happens --- even if the session is merely "hung" after some command I entered, and I am trying to abort the command that caused the problem --- there is often no way out other than killing the rsession process, even if I keep RStudio alive. The data is then not saved, and the .RData file remains as it was the last time I called save.image() (or closed the project, which is usually weeks before the crash).

I wanted to check with the community whether there is already a working solution that people are using. I searched this forum and googled, but didn't find anything.
To be precise, what I am basically looking for is some kind of automated, periodic saving of an image snapshot. Even if there is no way to achieve that, can I at least force a save while the rsession is still alive but "hung" for hours --- not yet terminated, just no longer accessible from the command line? Yesterday I had a session that was hung for hours; even RStudio couldn't terminate it, and when I killed it externally, it took RStudio another few hours to show me the "R session has terminated" window. I started thinking of complicated solutions --- another background process that shares memory with the rsession, or some kind of client-server setup --- but I really hope I'm just missing something trivial and that there is an easy way to achieve this without going that far. I am working on a Mac, if that is relevant.
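
The closest thing I could come up with so far is a base R task callback that snapshots the workspace periodically. This is only a minimal sketch --- the `autosave_image` name, the output file name, and the 10-minute interval are all arbitrary choices of mine --- and, since task callbacks only fire between top-level commands, it would not help with a session that is already hung:

```r
# Snapshot the workspace at most once every `interval` seconds,
# piggybacking on base R's addTaskCallback(), which runs the callback
# after each completed top-level command.
autosave_image <- local({
  last <- Sys.time()
  interval <- 600  # seconds between snapshots; arbitrary choice
  function(expr, value, ok, visible) {
    if (as.numeric(difftime(Sys.time(), last, units = "secs")) > interval) {
      save.image("autosave.RData")  # placeholder file name
      last <<- Sys.time()
    }
    TRUE  # returning TRUE keeps the callback registered
  }
})
addTaskCallback(autosave_image, name = "autosave")
```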

That leads to my second question, about the RStudio (and rsession) history mechanism:

Recovering from a situation like the one described above is made far less painful by RStudio's exceptionally good history-saving mechanism; it saves me a few hours every time something like that happens and I have to regenerate all the data I lost. It would be really helpful if I could remove commands from the history after running them, when they turn out to be incorrect or cause errors. That way I would not repeat those mistakes when regenerating the data --- and more generally, it is sometimes useful to remove a line from history so that it is not suggested at the command line when you use the history retrieval mechanism.
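
For plain .Rhistory files, something along these lines might work to prune offending commands, though I don't know how (or whether) it interacts with RStudio's own history database --- savehistory()/loadhistory() are base R facilities, and the rm(list = ls()) pattern below is just a hypothetical command to forget:

```r
savehistory("session.Rhistory")                       # flush history to a file
h <- readLines("session.Rhistory")
h <- h[!grepl("rm(list = ls())", h, fixed = TRUE)]    # drop the offending lines
writeLines(h, "session.Rhistory")
loadhistory("session.Rhistory")                       # reload the pruned history
```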

Any advice on these matters will be highly appreciated.
Thanks.


This may not be the answer you want to hear (which is that the IDE can fix this), but I think you should reconsider your workflow. It is dangerous for multiple reasons to get too attached to your current workspace. First, as you point out, it is vulnerable to loss in the case of a crash or hang. Second, it makes it easy to fall into bad habits, where your .Rhistory and .RData files are "real", versus the safer situation where your source code and data sources are "real".

I've written about this mindset shift as part of a blog post, and am developing it further here:

https://whattheyforgot.org/save-source.html

It sounds like a key adjustment for you would be to save very large precious objects (which you can't afford to recreate on demand) as .rds files, which you can reload into sessions as needed.
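
A minimal sketch of that pattern (the object and file names are just placeholders, and compress = FALSE mirrors your preference for faster saves):

```r
big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))  # stand-in for a precious object
saveRDS(big_df, "big_df.rds", compress = FALSE)       # one object per file

# ...later, possibly in a brand-new session:
big_df <- readRDS("big_df.rds")                       # reload under any name you like
```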


Thanks, I appreciate the quick response. I think I understand the approach you are suggesting, and will respond to it here. I was not sure exactly what the difference is between .rds files and the plain .dat files written by the save() command (more on that just below), but I doubt that difference affects the high-level discussion.
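
From a quick look, the difference seems to come down to names: save() stores objects together with their names, and load() restores them under those names, while saveRDS() stores a single object without a name, and readRDS() returns it for you to assign. A toy illustration (file names are arbitrary):

```r
x <- 1:10
save(x, file = "x.dat")   # stores the value *and* the name x
rm(x)
load("x.dat")             # x reappears under its original name

saveRDS(x, "x.rds")       # stores the value only
y <- readRDS("x.rds")     # assign it to any name on the way back
```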

First, to clarify:

  • I use R projects all the time. I almost never open a file that does not belong to an R project, and if there is a file in one project that I need to source in another, I either source it using the full path or create a local soft link in my project's home directory --- all under the assumption that a file is not modified by two projects, which I think is good practice.
  • I do store subsets of the data structures in separate .dat files. In fact, I sometimes create an environment where I store related objects and save the environment as a .dat file (a minimal sketch follows this list). I guess you can view such an environment as just a list that I store, but the two differ in various ways.
  • I understand the problem with putting all of my objects and state in the one file (the .RData file) associated with a project. First, the file becomes huge; second, you often don't need all of the data you created at the same time, so saving it in a hierarchy and loading what you need, when you need it, makes sense and also reduces the memory footprint of the workspace.
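
Here is the sketch of that environment-as-container pattern (all names are placeholders):

```r
# Group related objects in their own environment and save just that
# environment; load() later restores `results` as a whole.
results <- new.env()
results$raw     <- data.frame(id = 1:5, value = rnorm(5))
results$summary <- summary(results$raw$value)
save(results, file = "results.dat")
```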

But, that being said, at least the way I see it, the fact that R is an interpreter that you can use to experiment with various ways of looking at data is a key difference from other programming languages, where you write a program, generate output into a file, then write another variant, and so on. The ability to quickly explore different views of the data "on the fly", query various properties, and save --- even temporarily --- those that might be useful is a very powerful and important aspect of working with R, in my opinion. It is very often the case that I hold 5 different representations of the data: some with additional columns in the data frame, some a shorter summary of it, etc. One classic example is a data frame and a melted variant of it, where the melted form is used practically only for charting, and the regular form by all other functions.
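
A toy version of that two-representation pattern, assuming tidyr::pivot_longer() as the "melt" (reshape2::melt() would be analogous):

```r
library(tidyr)
wide <- data.frame(id = 1:3, a = rnorm(3), b = rnorm(3))
long <- pivot_longer(wide, cols = c(a, b),
                     names_to = "series", values_to = "value")
# `long` feeds the charting code; `wide` serves everything else.
```
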
But this temporary experimenting with various views of the data creates a lot of temporary variables that I use repeatedly while developing solid, well-formed code that will generate and store the data in the form most useful to me. I just don't know upfront what that form will be, and I'm using the interpreter to help me figure it out. At some point I do a "cleanup", but if I had to save into a .dat (or .rds) file every temporary data frame that I create and need for only a day or two, the process of exploring the data and coding against it would be slowed down by a huge factor --- especially when dealing with huge data structures. (Lazy evaluation makes working with multiple representations of huge data frames very efficient.) If I must push every piece of local state into persistent storage (e.g., disk), I lose a lot of the quick exploration and resulting coding efficiency that the R interpreter provides.
The above is also true of code, but to a much lesser extent, as R code often revolves around the data and the most efficient way to explore and manipulate it. Only after a long exploration of different approaches do I converge on the form of the data, and the code to work with it, that I'd like to keep. The interpreter supports a very efficient experimental phase until I reach a state worth saving. And of course I also use git to store anything that is not "half-baked".
So I don't think I disagree with any of the aspects and principles you brought up in the blog, and I have many years of experience designing and writing real, long-lived code in various languages; I have also developed some R libraries. But at least for me, the big difference between R and C, C++, or any other language is the ability to quickly explore and try different approaches, which feeds the development of solid code and a solid data representation.
How does all of that relate to .RData and automatically saving to it? Well, for me .RData is a snapshot of the temporary state of my project and of the ongoing exploration. Probably one of the primary examples is exploring different graphical representations of the data, where different representations require different forms of the data structure (if you want it to be efficient); while I eventually end up with one or two charts, it takes dozens of them during the analysis until you converge on the "right" one. But this temporary state can last for quite a long time, so I do need persistent storage for it, at least on a daily basis.

So I think it all boils down to storing a temporary, experimental state vs. a final, much smaller state that should probably be distributed into separate data files. What I should probably do is set myself a reminder to run save.image() at the end of each day to capture the temporary state; and I do agree with you that I am often "lazy" about cleaning up very old temporary state, and just keep it in some .dat (or .rds) file.

Please let me know if I have misunderstood you or the approach you mention in the blog, or if you think I'm simply "out to lunch" :slight_smile:

Thanks again.

