Why not Ctrl+Z for reversing data changes in a data frame?

Harikrishna · October 8, 2017, 9:53am

I am not sure whether a package already exists that can reverse recent data manipulation changes made to a data frame. If not, can someone explain how difficult it would be to develop one?

e.g.: data <- read.csv("abc.csv")
data[is.na(data)] <- 0

Now I want to reverse the zero assignment for the NAs. Can this be done?

mara · October 8, 2017, 12:44pm

I think you're talking about this as an abstraction, but, in this case, whether or not you could "reverse" that particular transformation (changing NAs to 0) would depend on whether or not there were other observations with 0 that were not NA. If so, they would now be indistinguishable from the formerly-NA 0s in that particular data frame.
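To make that point concrete, here is a minimal sketch using made-up toy data (the `df`, `filled`, and `undone` objects are purely illustrative, not from the original question). It shows that once the NAs are overwritten, a genuine 0 and a former NA can no longer be told apart.

```r
# Hypothetical toy data: column v holds one genuine 0 and one missing value.
df <- data.frame(id = c("a", "b", "c"), v = c(0, NA, 5))

# Overwrite the NAs with 0, as in the question.
filled <- df
filled[is.na(filled)] <- 0

# Rows "a" and "b" are now both 0; nothing in `filled` says which one was NA.
filled$v

# Trying to "undo" by turning every 0 back into NA also hits the genuine 0.
undone <- filled
undone$v[undone$v == 0] <- NA
identical(undone, df)  # FALSE
```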

However, you have your script there, so it's easy enough to "undo." Just execute that first line again, and you've got your raw data back (or your data in whatever form it was in prior to swapping out the NAs for 0).

This is why you'll notice that in reproducible analysis workflows (e.g. this one by Joris Muller) raw data and "produced" or manipulated data are kept separately.

atiretoo · October 8, 2017, 12:48pm

Well, my solution is to simply

data <- read.csv("abc.csv")
data[is.na(data)] <- 0
data <- read.csv("abc.csv")

You'd have to capture the state of the workspace at the start of each expression, and then you could simply reset the current environment to the previous one ... no idea how that would work but it sounds possible. But that could get very expensive in terms of memory?
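A rough sketch of that idea, done by hand rather than automatically for every expression: keep full copies of the object on a stack and pop the latest one to go back. The `make_undo_stack()` helper and its `save`/`undo`/`depth` functions are hypothetical names made up for this illustration, not part of any package.

```r
# Hypothetical helper: a tiny undo stack that keeps full copies of an object.
make_undo_stack <- function() {
  snapshots <- list()
  list(
    # Push a copy of x onto the stack.
    save = function(x) {
      snapshots[[length(snapshots) + 1L]] <<- x
      invisible(x)
    },
    # Pop and return the most recent snapshot.
    undo = function() {
      n <- length(snapshots)
      if (n == 0L) stop("nothing to undo")
      x <- snapshots[[n]]
      snapshots[[n]] <<- NULL
      x
    },
    depth = function() length(snapshots)
  )
}

undo_stack <- make_undo_stack()
df <- data.frame(x = c(1, NA, 3))

undo_stack$save(df)        # snapshot before the risky step
df[is.na(df)] <- 0         # the change we may regret
df <- undo_stack$undo()    # back to the version that still has the NA
```

Every snapshot of a modified object takes up its own memory, so this probably only makes sense for small data or a short history.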

nick · October 8, 2017, 2:23pm

Yes, making a "generic" undo would likely have to save the workspace to memory or disk at every step. While it might be possible, it would slow down execution. If your analysis is small enough that the trade-off in speed would be acceptable, it's also probably fast enough to re-execute if you make a mistake -- which is why you want to do your analysis in a script, rather than at the command line. If you have an intermediate step in your analysis that takes some time to get to (due to a slow step in processing), then you could create an intermediate copy of the data and work with that instead.
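One way the intermediate-copy idea might look in practice is to save the result of the slow step to disk once and reload it whenever an experiment goes wrong. In this sketch `slow_step()`, the toy `raw` data, and the checkpoint file location are all assumptions made for illustration.

```r
# Hypothetical stand-in for a slow processing step.
slow_step <- function(raw) {
  Sys.sleep(0.1)
  transform(raw, y = x * 2)
}

raw <- data.frame(x = 1:5)
checkpoint <- file.path(tempdir(), "after_slow_step.rds")  # assumed location

# Run the slow step only if no checkpoint exists yet.
if (!file.exists(checkpoint)) {
  saveRDS(slow_step(raw), checkpoint)
}
intermediate <- readRDS(checkpoint)

# Experiment on a working copy; if it goes wrong, reload the checkpoint
# instead of rerunning the slow step.
work <- intermediate
work$y[work$y > 4] <- NA
work <- readRDS(checkpoint)
```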

Something like Excel needs an undo because there's no reasonable way to quickly recreate a spreadsheet if you make a mistake, which shouldn't be the case in R.

Harikrishna · October 8, 2017, 4:35pm

Thanks for the answer and it was informative.

I ran into this issue while running large scripts: I frequently ended up in situations where I had to revert my changes, and rerunning the script is very time consuming. I was following the same process of saving the state of the data frame into a new one and then reassigning it to the old one to reverse the changes. However, I was looking for a method that is simpler than that.

How about tracking only the changes (i.e., the row numbers and the changes made) instead of saving the state of the entire data frame, perhaps with a restriction on data frame size?

Frank · October 8, 2017, 7:54pm

In very simple cases, you can create an intermediate object containing the row numbers, but in general, that won't work...

DF = data.frame(id = LETTERS[1:5], v = rep(0:1, 2:3))
w = which(DF$v == 0)
DF$v[w] <- NA

# reversible as...
DF$v[w] <- 0

To properly write the reversal line, you have to keep around the w object and review the line that created it, which is probably not worth the rigmarole compared to simply rerunning the earlier lines to recreate the data.
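For simple cell overwrites, the bookkeeping could be wrapped up so that the old values are stored alongside the row numbers. This is only a sketch: `set_with_log()` and `revert()` are hypothetical helpers invented here, and they cover nothing beyond overwriting values in one column (no row removal, reordering, or type changes).

```r
# Hypothetical helper: overwrite cells and remember what they held before.
set_with_log <- function(df, rows, col, value) {
  log <- list(rows = rows, col = col, old = df[[col]][rows])
  df[[col]][rows] <- value
  list(data = df, log = log)
}

# Hypothetical helper: put the logged old values back.
revert <- function(df, log) {
  df[[log$col]][log$rows] <- log$old
  df
}

DF <- data.frame(id = LETTERS[1:5], v = c(0, NA, 1, 0, 1))

step <- set_with_log(DF, which(is.na(DF$v)), "v", 0)
step$data$v                     # the NA is now 0, next to the genuine 0s

restored <- revert(step$data, step$log)
identical(restored$v, DF$v)     # TRUE
```

Because the log stores the previous values, the genuine 0s are left alone on revert, but every modifying step would have to go through such a wrapper for this to be reliable.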

Also, the philosophy behind tidyverse entails not overwriting your input data in the first place. See for example...

dplyr will never support in-place mutation of data. This is something I feel very strongly about.

--from the dplyr issue tracker

There are tools for tracking changes (e.g., with data.table), but general reversal for any modification is probably too hard a problem.

nick · October 8, 2017, 8:04pm

If possible, I would suggest working on a smaller data set while working through the logic of your code. This should speed up the coding process, and then you can substitute back in your full data set once you have the logic worked out. The main problem with this approach is if you discover that your full data set will run into memory issues that a smaller one won't, but you likely have some feel if that is a potential problem on a given analysis.
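A minimal sketch of that workflow, assuming a made-up `full` data set, an arbitrary subset size of 1000 rows, and a hypothetical `clean()` function standing in for the real logic:

```r
set.seed(42)  # make the toy data and the subset repeatable

# Hypothetical stand-in for a large data set.
full <- data.frame(id = seq_len(100000), value = rnorm(100000))

# Draw a small random subset to develop against (size is an assumption).
dev_rows <- sample(nrow(full), size = 1000)
dev <- full[dev_rows, ]

# Put the logic in a function so the same code runs on either data set.
clean <- function(d) {
  d$value[d$value < 0] <- 0
  d
}

dev_result  <- clean(dev)    # fast iteration while working out the logic
full_result <- clean(full)   # run once the logic is settled
```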

mara · October 8, 2017, 9:21pm

I'm with @nick on that recommendation. Though I haven't tried it out myself, lumberjack is, I believe, a package/method through which you can keep track of what happens to (perhaps) individual data records (see also here).

Harikrishna · October 9, 2017, 4:21am

@mara lumberjack looks interesting. I'll try it out and see if I can do something with it. Thanks for the answer.

mara · October 9, 2017, 3:24pm

It's definitely not gonna give you a Ctrl-Z (nor should it), but worth looking into!

taras · October 9, 2017, 5:07pm

I agree with @mara here: I've always been saving any manipulated object as a new object exactly because I didn't know how else to "undo" the change; and once you adhere to it - there shouldn't be any problem...
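As an illustration of that habit, here is a small base R sketch with made-up data, where each step writes to a new object and the raw data is never touched. The object names are arbitrary.

```r
raw <- data.frame(id = 1:4, v = c(0, NA, 2, NA))  # hypothetical raw data

# Each step reads one object and writes a new one; `raw` is never modified.
no_na  <- transform(raw, v = ifelse(is.na(v), 0, v))
scaled <- transform(no_na, v = v * 10)

# "Undo" is just going back to an earlier object.
anyNA(raw$v)   # TRUE
scaled$v       # 0 0 20 0
```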

I wonder if some of the changes you make "experimenting" even have to go into your script. I mean, feel free to play with the data in the console without assigning it to an object, look at the outputs, and when ready and sure - add to the script.
I think this could work on some 1 or 2 step modifications. Wouldn't work if you go down the rabbit hole right away though...

I totally agree that sometimes it is very time-consuming to re-run the script. However, personally, I find this process to be the best way to ensure that my script is correct and it is going to run again tomorrow. Restarting the session is one of the most frequent things I do.

Reminds me of this thread:

rpodcast · October 15, 2017, 3:54pm

I'd also recommend looking at the archivist package which gives you a version control-like mechanism for any R object. The functions take a little getting used to but I'm starting to use it in analysis projects as well as shiny apps to keep track of user manipulations to various artifacts.

thoughtfulnz · October 15, 2017, 7:58pm

When you think about it, the desire to store partial results through the course of an analysis is literally what is going on when you use Rmarkdown and set a chunk to be cached.

With the knit process, it starts from a completely clean environment, then either runs the code or loads the cache depending on the chunk settings.