Conceptual reason for transmute

mungojam · November 20, 2017, 2:41pm

Coming from SQL, the transmute function feels a bit redundant when select exists. I keep wanting to be able to say:

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

myTibble %>% 
        select(
            a,
            c = a + 1,
            d = b
        )
#Error in overscope_eval_next(overscope, expr) : object 'a' not found

The d = b works fine, so I am able to rename columns but not the c = a + 1.

Is it for a conceptual reason or a technical one?

With transmute it works fine:

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

myTibble %>% 
        transmute(
            a,
            c = a + 1,
            d = b
        )

danr · November 20, 2017, 3:10pm

select and mutate are doing different things.

select only selects columns and optionally changes their names. It does not change any of the values in the columns.

mutate changes column values.

a + 1 doesn't mean anything in your select call. That is because select expects a column name, and there is no way it can interpret a + 1 as a column name.

In mutate, a + 1 is interpreted as an arithmetic operation on a column. It means get column a and add 1 to it, like this for your tibble:

c(1, 2, 3) + 1

So think of select as SQL SELECT and mutate as SQL UPDATE.
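A minimal sketch of that distinction, using the tibble from the original question. The result names (renamed, added, replaced) are only for illustration:

```r
library(dplyr)

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

# select() picks and optionally renames columns; the values are untouched
renamed <- myTibble %>% select(a, d = b)

# mutate() computes new values and keeps the existing columns
added <- myTibble %>% mutate(c = a + 1)

# transmute() computes new values and keeps only the columns listed
replaced <- myTibble %>% transmute(a, c = a + 1, d = b)
```

Here renamed has columns a and d, added has a, b and c, and replaced has a, c and d.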

mungojam · November 20, 2017, 4:50pm

Thanks @danr, I appreciate the need for mutate and the current restrictions with select. mutate is useful because it lets you change single columns or add new ones. That makes it different enough from select.

What I struggle to see is the conceptual need for transmute. It seems that select could be extended to serve the same purpose.

This has been exacerbated today by hitting limitations in transmute in that it doesn't seem to support everything(). So I can't use it to add a new column and then include all the other columns after it. I'd like to be able to do:

myTibble %>% select(c = a + 1, everything())

but I have to do it in two operations, a mutate, followed by a select to get things in the right order.
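For reference, the two-step version described above would look roughly like this (a sketch on the same example tibble; reordered is just an illustrative name):

```r
library(dplyr)

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

reordered <- myTibble %>%
    mutate(c = a + 1) %>%        # step 1: add the derived column (lands at the end)
    select(c, everything())      # step 2: move it in front of the other columns
```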

mungojam · November 20, 2017, 5:13pm

You gave me a thought with the comparison with SQL update. I think the difference here though is that nothing actually gets altered even with a mutate. All of the operations are functionally pure, so I think it would feel ok to have select doing mutations since they aren't really mutations, just a different map of the columns.

I'd also like for pull to support mutations. I just hit a case where I wanted to pull out the combination of two columns as a vector and I ended up doing mutate followed by pull, which felt unnecessary.
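A sketch of the mutate-then-pull pattern mentioned above. The column name ab and the "-" separator are assumptions made for the example, not details from the original case:

```r
library(dplyr)

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

# Build the combined column, then extract it as a plain vector
combined <- myTibble %>%
    mutate(ab = paste(a, b, sep = "-")) %>%
    pull(ab)

# A one-step base R alternative that evaluates the expression inside the data
combined_base <- with(myTibble, paste(a, b, sep = "-"))
```

Both approaches should give the same character vector; with() avoids creating the intermediate column.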

jakekaupp · November 20, 2017, 5:15pm

I think it may also be for clarity and purpose. For the most part dplyr verbs perform a single function. Sure, you can rename in select and you can create new variables in group_by, but keeping the outcome of the verb close to its literal meaning improves the overall clarity of the package and preserves the intent of writing more readable code.

If the verbs start introducing outcomes that conflict with their definitions, you start introducing avenues for misunderstanding.

danr · November 20, 2017, 5:16pm

You can do something like this:

suppressPackageStartupMessages(library(tidyverse))

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7), c = 2:4)

myTibble %>% 
    transmute(
        c = a + 1,
        d = b,
        e = c
    ) %>% 
    select(e, d, c)
#> # A tibble: 3 x 3
#>       e     d     c
#>   <dbl> <dbl> <dbl>
#> 1     2     5     2
#> 2     3     6     3
#> 3     4     7     4

but of course you have to do two function calls, not one. If you find yourself doing this kind of operation often, you should look at @lionel's paper on Programming with dplyr. Part of it covers how to make functions that wrap dplyr functions in a pipeline. It's a draft right now but still very good.

http://rpubs.com/lionel-/programming-draft

mungojam · November 20, 2017, 5:18pm

Thanks @jakekaupp, I had a feeling that it was something like that. Coming from Select in SQL and in C# too, where you can happily do transforms, it doesn't seem like there is a clash of name with purpose in this case. Select to me is a functional map and that can include transformations.

@danr (update, sorry didn't fully read it) - I'm not trying to do any variable re-use here. I understand why that would need to be done in two steps. I'm questioning the need for transmute as a concept when select could be adapted to serve the same purpose.

tbradley · November 20, 2017, 7:29pm

I would agree with @jakekaupp that adding this could cause sources of confusion about what select is designed for. From everything that I have seen, @hadley advocates for writing functions that do one thing and do that thing well, so it would make sense for these functions to be separated.

As for everything() not working within transmute(), you can always file an issue on the dplyr github page and they may fix that, if the functionality was not excluded intentionally.

mungojam · November 20, 2017, 7:36pm

Thanks, I thought it might be that.

Being a devil's advocate, there are plenty of other exceptions. transmute is one of them as is mutate which can be used to change existing columns as well as adding new ones. In theory adding new columns could be left to add_column with mutate only able to change existing ones.

Issue now raised about transmute with everything(), thanks.

hadley · November 20, 2017, 7:44pm

It's not an issue; transmute() has mutate() semantics; not select() semantics.

mungojam · November 20, 2017, 7:46pm

transmute can also be used to select or re-order columns though, so I end up using select until there's something it can't do, at which point I have to fall through to transmute. Sometimes I separate select and mutate, but sometimes it's clearer with a single transmute.

mara · November 20, 2017, 7:51pm

I think there's a benefit to having them separate if you consider the syntactical (as in reading, not as in programming syntax) advantages for thinking through and sharing analyses. There's a whole lot of lab-bench analysis that can be accomplished without ever really having to think about mapping columns as an abstraction. There's definitely a benefit of understanding these abstractions, but (IMHO) select and mutate evoke conceptually distinct tasks.

mungojam · November 20, 2017, 7:59pm

Good point. I guess in tools like SPSS they tend to be done in separate steps, so I can see why that would be a familiar way of thinking to many from that background.

The way I'm trying to use R and tidyverse at the moment feels a bit more like a general-purpose language, and I know that anybody picking this up will come from other languages where it would feel natural to have derived select columns as well as renamed ones.

I can sense this won't go my way. I understand you are targeting people from different backgrounds.

nick · November 20, 2017, 8:21pm

Supporting the select helpers in transmute would make the syntax more confusing, as a given argument could then refer to a single column or a collection of columns.

@mungojam, am I right that the majority of the need for select helpers (specifically everything()) in transmute would go away for you if you could choose to insert the new columns from mutate on the "left" side of the data frame? That seems like a smaller ask, in that inserting columns at the end/right doesn't seem inherent to the idea of creating a new column.

The implementation would be a little tricky, in that adding an argument to mutate could potentially break existing code that had a column named with the same name as the new argument. The alternative would be a mutate_left or similar.

Actually, you could arguably just define a mutate_left function yourself and stick it in a utils.R file/package, if it's something you regularly do.
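A minimal sketch of what such a helper might look like. mutate_left is hypothetical and not part of dplyr, and this version assumes that only newly created columns should move to the front:

```r
library(dplyr)

# Hypothetical helper (not in dplyr): like mutate(), but columns that did not
# exist before the call are placed at the left of the result.
mutate_left <- function(.data, ...) {
    out <- mutate(.data, ...)
    new_cols <- setdiff(names(out), names(.data))
    # New columns first, then everything else in its existing order
    out[c(new_cols, setdiff(names(out), new_cols))]
}

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))
moved <- myTibble %>% mutate_left(c = a + 1)
```

Columns that are only modified keep their position. Later dplyr releases added .before and .after arguments to mutate() and a relocate() verb, which cover much the same ground if your installed version has them.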

mungojam · November 20, 2017, 8:28pm

Thanks @nick. That's a good point, having a way to add multiple columns at the start like add_column can do for single columns would cover a lot of the times that I end up using transmute.
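For comparison, a sketch of the add_column approach on the example tibble. Note that add_column() does not evaluate its arguments inside the data, so the source column has to be referenced explicitly:

```r
library(tibble)

myTibble <- tibble(a = c(1, 2, 3), b = c(5, 6, 7))

# .before places the new column ahead of column a
withC <- add_column(myTibble, c = myTibble$a + 1, .before = "a")
```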

I think some of it is that I should start making my own s3 classes as a lot of the time I am working with the same types of data and doing similar column mutations to it. I have started making my own packages which is a start and that is hiding much of this data wrangling that I'm doing.

I think that is the case with select too: if you do select(d, e, everything()), the first two arguments each refer to one column, while the last refers to many.

I think my mental difficulty with all this is that it is neither very strict nor very flexible. If it were very strict, then transmute would not exist and people would be forced to use a mixture of select, rename, mutate and add_column or add_columns; select wouldn't allow renaming, and mutate wouldn't allow adding new columns. As it stands, I keep having to try things out to know whether they will work, because in theory they might (like mutating in a select), but in practice they might not.

mungojam · November 20, 2017, 8:45pm

In my head I don't really relate transmute strongly to mutate though the documentation strongly links them. I just see transmute as the only tool that gives me full(ish) flexibility in defining the columns I want in my tibble in whichever order I like and which can include transformed columns or constants if I want.

When I do use it, I tend to put one column on each row of the code to make it reasonably easy to read and I only use it when my transforms are very succinct ones (maybe calls out to transform functions I have written).

nick · November 20, 2017, 9:45pm

Agreed. What I was trying to get across (and still don't really have the right wording for) is that there's some potential ambiguity from a function standpoint. transmute(x = 2*y, everything()) could potentially mean that you create an x column followed by all remaining columns. However, it could also mean that you want to create an x column alongside a column called `everything()` that contains the result of a function of that name that you've defined in your code. I doubt it's an insurmountable issue, but it shows an additional reason why separating the functionality of transmute and select makes sense.

mungojam · November 21, 2017, 7:43am

There's a similar problem with select(). I'd say if you start naming functions the same as things in dplyr then you have this risk everywhere:

myTibble <- tibble(
     a = c(1, 2, 3), 
     b = c(5, 6, 7), 
     c = c(1, 1, 1), 
     d = c(2, 3, 3)
)
everything <- function() {c("b", "d")}
myTibble %>% select(a, everything())

#> # A tibble: 3 x 3
#>       a     b     d
#>   <dbl> <dbl> <dbl>
#> 1     1     5     2
#> 2     2     6     3
#> 3     3     7     3

I didn't realise that you could create columns like that named someFunction(), seems a bit odd. I can't imagine anybody ever uses that feature, but could be wrong.