One operation affects two datasets

I have a simple R question. This is my code:

gamma <- input_output
gamma <- gamma[,downstr_tot := sum(value), by = .(country, downstr)]

It creates a column "downstr_tot" and adds it to the "gamma" dataset (which it should). However, it also creates that same column in the "input_output" dataset (which it should not). Is that normal?

You're defining a new variable when you use := to define downstr_tot.

Try this:

gamma[,  .(downstr_tot = sum(value)),   .(country, downstr)]
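To make the difference between the two forms concrete, here is a minimal sketch. The table io and its values are made up for illustration and only stand in for the real input_output data; the column names are taken from the question.

```r
library(data.table)

# Hypothetical toy data standing in for input_output
io <- data.table(
  country = c("A", "A", "B"),
  downstr = c("x", "x", "y"),
  value   = c(1, 2, 4)
)

# Aggregation with .(): returns a new table with one row per group
# and leaves io unchanged
totals <- io[, .(downstr_tot = sum(value)), by = .(country, downstr)]

# Assignment with :=: adds the column to io itself, one value per row
io[, downstr_tot := sum(value), by = .(country, downstr)]
```

After this runs, totals should have one row per (country, downstr) group, while io keeps all of its rows and gains the extra column.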

Thank you for your help. However, I do want it to produce a new variable. I just want it to add that variable to the gamma dataset only. Currently it adds it to both datasets.

Ah. I misunderstood. Sorry about that.

But to your actual question: if your code is not assigning anything to input_output, then no, it shouldn't.

Conceptually, this is what it should do.

library(data.table)
iris <- iris     # Original dataset (local copy of the built-in data)
dt <- as.data.table(iris)  # New assignment; as.data.table() makes a copy
dt <- dt[, Sepal.Length_m := mean(Sepal.Length), .(Species)]  # Functionally similar

Thanks again! I agree. The issue seems to be the following: I load input_output with setDT(XXX). When I do this and then create a new copy using <-, it somehow links the two datasets. You can see it here:

iris <- setDT(iris)     # Original dataset
dt <- iris  # New assignment, etc.
dt <- dt[, Sepal.Length_m := mean(Sepal.Length), .(Species)]

This is your problem. From the setDT documentation:

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.

setDT converts lists (both named and unnamed) and data.frames to data.tables by reference. This feature was requested on Stackoverflow.
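In practice this means that, once the object is a data.table, a plain <- only gives the same table a second name, so := through either name changes both. The usual way around it is an explicit copy(). A minimal sketch, assuming made-up toy data in place of the real input_output (the name alias is also just for illustration):

```r
library(data.table)

# Hypothetical toy data standing in for input_output
input_output <- data.frame(
  country = c("A", "A", "B"),
  downstr = c("x", "x", "y"),
  value   = c(1, 2, 4)
)
setDT(input_output)   # converts to data.table in place, no copy

# Plain assignment: both names refer to the same data.table
alias <- input_output
alias[, downstr_tot := sum(value), by = .(country, downstr)]
shared <- "downstr_tot" %in% names(input_output)   # TRUE: the column shows up here too

# Drop the column again (this is also done by reference)
input_output[, downstr_tot := NULL]

# Explicit copy: gamma is now independent of input_output
gamma <- copy(input_output)
gamma[, downstr_tot := sum(value), by = .(country, downstr)]
leaked <- "downstr_tot" %in% names(input_output)   # FALSE: the original is untouched
```

So for the original code, replacing gamma <- input_output with gamma <- copy(input_output) should keep the new column out of input_output.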

Yep, that's it. I was not aware of this issue. Seems to be very important to know. Thanks for helping out!!
