reference a column in a table after modifying the table with a pipe

lhunsicker · June 8, 2020, 3:38pm

This question must have been asked and answered a hundred times, but I can't figure out what I have to do to make this work. Let's say that I want to find out how many samples of the setosa species of irises in iris have petal widths < 0.2. I can easily do this in two steps. But there must be a way to do this using magrittr pipes. I just can't find the needed magic:

temp1 <- iris %>% filter(Species == 'setosa')
sum(temp1$Petal.Width < 0.2)
[1] 5 # But:
iris %>% filter(Species == 'setosa') %>% sum(Petal.width < 0.2)
Error in function_list[k] : object 'Petal.width' not found
iris %>% filter(Species == 'setosa') %>% sum(.$Petal.width < 0.2)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
iris %>% filter(Species == 'setosa') %>% pull(Petal.width) %>% sum(Petal.width < 0.2)
Error: object 'Petal.width' not found
Run rlang::last_error() to see where the error occurred.

And so forth. Is there a known way to refer to the table (or to the vector after pull) that results from prior modifications of a table using pipes, so that the last function operates on the table at the end of the chain of pipes?
Thanks to any that can tell me how to do this.
Larry Hunsicker

nirgrahamuk · June 8, 2020, 3:56pm

library(tidyverse)
temp1 <- iris %>% filter(Species == 'setosa')
sum(temp1$Petal.Width < 0.2)

# as tibble
iris %>% filter(Species == 'setosa') %>% summarise(n = sum(Petal.Width < 0.2))
# as vector
iris %>% filter(Species == 'setosa') %>% summarise(n = sum(Petal.Width < 0.2)) %>% pull()

lhunsicker · June 9, 2020, 9:56am

Yes. That does it. And thanks! But is there a way, when using a pipe, to refer directly to the object resulting from all the previous manipulations? Your method gives the specific answer but loses the derived table. It would be nice to have something like:
table %>% various manipulations %>% new_function_on_a_column(.$column)
In some manipulations, the "." seems to refer to the object resulting from all the prior manipulations that is to be inserted as the first parameter to the function. I guess that I want something sort of like a "with" that puts the columns of the table into the current scope.

nirgrahamuk · June 9, 2020, 11:03am

I don't understand, because the pipe explicitly passes the object on the left as the first parameter into function calls on the right.
If you want to assign the result of the final manipulation into an object (the same or a new) you use <- as normal.
You can attach(yourtable) to reference the columns of it without yourtable$yourcolumn. Though I write a lot of code and never use that feature.

lhunsicker · June 9, 2020, 12:30pm

Yes. Of course. But there isn't a way to add an indexing selection to the explicitly passed table reference (with brackets or, for columns, "$"). Similarly, I can pull a column, but then I can't just apply a function to the pulled column because the function I want to use expects a vector rather than a table. You have reminded me that I can apply a function to the column by invoking summarise(). But that seems a cumbersome way to do it. I am used to subsetting tables in my function calls (by column or row) using the usual indexing methods (e.g., summary(table[1:20, 1:5]) or table$length). Maybe this is considered to be not "tidy." Maybe I should learn to use filter() and select() to do this selection and then use summarise() to apply the function. But using indexing seems more intuitive to me.
I guess that I am suggesting that it would be nice for the piping method to give me a "name" for the passed object to which I could apply indexing or, in the case of a pulled column, directly use a vector oriented function.
In any case, you have shown me an approach to do this sort of thing without cluttering my environment with a lot of temporary intermediate objects. I do appreciate your suggestions.

Leon · June 9, 2020, 1:19pm

Do you mean something like this:

library("tidyverse")

d <- tibble(x = rnorm(20),
            y = rnorm(20))

d %>%
  lm(y ~ x, data = .) # The '.' here means what was passed by the pipe

d %>% 
  summarise(mu1 = mean(.$x),
            mu2 = mean(x)) # same as above due to non-std eval

I would recommend getting used to non-standard evaluation and avoiding both the $-referral and [x1,x2]-indexing.

Hope it helps
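
One caveat on the two summarise() forms above: they agree only for ungrouped data. The sketch below uses iris and adds a group_by() step that is not part of Leon's example. It suggests why the bare column name is usually the safer habit: `.` is always the whole table handed over by the pipe, while the bare name is evaluated once per group.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarise(overall   = mean(.$Petal.Width),  # '.' is the full table from the pipe
            per_group = mean(Petal.Width))    # bare name: rows of the current group only

by_species
```

The overall column repeats one value (the mean over all 150 rows) for every species, while per_group differs by species.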

nirgrahamuk · June 9, 2020, 1:32pm

Again, I'm confused, because in my example code I showed with the pull() function how a column of a resulting table can be extracted as a vector, which can then be passed to other functions that require vector inputs. Perhaps I should mention that pull() takes parameters, so if you want to pull a specific column you can name it.
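
A minimal sketch of the pull() options mentioned here, assuming dplyr is attached. The variable names (setosa, by_name, by_pos, last_col) are only for illustration. With no argument, pull() takes the last column, which is why the bare pull() after summarise() in the earlier reply returns the single summary value.

```r
library(dplyr)

setosa <- iris %>% filter(Species == "setosa")

by_name  <- setosa %>% pull(Petal.Width)  # column chosen by name
by_pos   <- setosa %>% pull(4)            # column chosen by position (4th column)
last_col <- setosa %>% pull()             # default: the last column, here Species

identical(by_name, by_pos)
```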

lhunsicker · June 9, 2020, 5:01pm

Yes and no. I can pull a column and pass it as a vector to a simple function:

iris %>% filter(Species == 'setosa') %>% pull(Petal.Width) %>% sum()
[1] 12.3
But I can't pass it to a function that has to pass it as a vector to an inner function (in this case a logical test):
iris %>% filter(Species == 'setosa') %>% pull(Petal.Width) %>% sum(< 0.2)
Error: unexpected '<' in "iris %>% filter(Species == 'setosa') %>% pull(Petal.Width) %>% sum(<"

But as I was thinking that what I wanted was something like with(), it occurred to me just to use with(), since the product of what went before would be inserted as the first parameter of the function, and with() would give access to the columns of the final table.
iris %>% filter(Species == 'setosa') %>% with(sum(Petal.Width < 0.2))
[1] 5
This approach seems to work so long as the pipe is passing on a table, since with() requires a table as its first parameter. The disadvantage, of course, is that the last code is not exactly transparent. It would be nice for the pipe to provide a way to refer explicitly to what is passed on, as in Leon's first example.
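
For comparison, magrittr also ships an "exposition" pipe, %$%, which exposes the columns of the left-hand side to the expression on the right, much like piping into with(). It is not mentioned elsewhere in this thread, so treat the sketch below as an aside. It assumes magrittr is attached explicitly, since dplyr re-exports only %>%.

```r
library(dplyr)
library(magrittr)  # %$% comes from magrittr itself

# Piping into with(), as in the post above
n_with <- iris %>% filter(Species == "setosa") %>% with(sum(Petal.Width < 0.2))

# The same computation with the exposition pipe
n_expo <- iris %>% filter(Species == "setosa") %$% sum(Petal.Width < 0.2)

n_with == n_expo
```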

lhunsicker · June 9, 2020, 5:07pm

Thanks, Leon. Does the "data = ." in your first example only work when the called function has an explicit parameter for the data source, as in lm()? I knew that I had seen that expression somewhere.
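
On the "data = ." question: as far as the magrittr documentation describes it, the `.` placeholder is not tied to a data parameter. Whenever `.` appears as a direct argument of the call after the pipe, named or positional, the piped object goes there instead of into the first position. A small sketch with made-up data (d and the fitted objects are illustrative names):

```r
library(magrittr)

d <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# '.' as a named argument: same as lm(y ~ x, data = d)
fit_piped  <- d %>% lm(y ~ x, data = .)
fit_direct <- lm(y ~ x, data = d)
all.equal(coef(fit_piped), coef(fit_direct))

# '.' as a named argument of a function with no 'data' parameter: rep("a", times = 3)
three_a <- 3 %>% rep("a", times = .)

# '.' as a positional argument: round(3.14159, 2)
rounded <- 2 %>% round(3.14159, .)
```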

Leon · June 9, 2020, 5:29pm

These are equivalent:

iris %>%
  filter(Species == 'setosa') %>%
  filter(Petal.Width < 0.2) %>%
  nrow()

iris %>%
  filter(Species == 'setosa') %>%
  filter(Petal.Width < 0.2) %>%
  nrow(.)

It's not entirely clear what you're trying to achieve. I get the feeling that you're thinking along the lines of base, but want to use tidy - I suggest going all in on the latter

Hope it helps

lhunsicker · June 10, 2020, 3:48pm

You're helping me to clarify where I am, and that is good. I quite like piping with magrittr and the "verbs" of dplyr and I use them all the time now. I have not read about "tidy," and I suppose that I should. What I understand now is that the "." following a pipe refers only to the object that has been passed by the pipe, but that it doesn't permit applying any array selection to the object or allow the object to be passed to an internal function within the function following the pipe. I'm pretty sure that this is a scope issue. Let me explain one of my frustrations.

I am primarily a statistician, not a programmer. I deal with a lot of data sets in which columns contain only numbers, but that may need to be referred to in some situations as factors and in others as numbers. (E.g., data where the number represents one of a number of conditions, but where the conditions may be graded monotonically.) So I want neither to change the column from numeric to factor nor to add a new column with the value as a factor. As a convenience, I have created the function facsum(x), defined as summary(as.factor(x)). This works fine on a named table:

facsum <- function(x) summary(as.factor(x))
facsum(iris$Petal.Length)
1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.9 3 3.3 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2
1 1 2 7 13 13 7 4 2 1 2 2 1 1 1 3 5 3 4
4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1
2 4 8 3 5 4 5 4 8 2 2 2 3 6 3 3 2 2 3
6.3 6.4 6.6 6.7 6.9
1 1 1 2 1

But it doesn't work after I have manipulated iris using dplyr and magrittr:

iris %>% filter(Species == 'setosa') %>% facsum(Petal.Length)
Error in facsum(., Petal.Length) : unused argument (Petal.Length)

Adding the with() fixes this (though I acknowledge that this is a real kludge):
iris %>% filter(Species == 'setosa') %>% with(facsum(Petal.Length))
1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.9
1 1 2 7 13 13 7 4 2

The same is true whenever I want to apply any function after the pipe that needs to pass the piped object to an internal function. with() solves this problem.

iris %>% filter(Species == 'setosa') %>% sum(Petal.Length < 0.2)
Error in function_list[k] : object 'Petal.Length' not found
iris %>% filter(Species == 'setosa') %>% with(sum(Petal.Length < 0.2))
[1] 0

I agree that your "tidy" is probably a better way to deal with row and column selection, but it would be nice to be able to use functions that have to call functions after a pipe. Is there a better way than with() to do this?
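
The errors in this post are consistent with magrittr's first-argument rule rather than with a scoping problem: when `.` appears only inside a nested call, the piped object is still inserted as the first argument. The sketch below checks this by comparing error messages. The helper msg() is a hypothetical convenience, not part of any package.

```r
library(dplyr)

facsum <- function(x) summary(as.factor(x))
setosa <- iris %>% filter(Species == "setosa")

# Hypothetical helper: return the error message instead of stopping
msg <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

# The piped call behaves like sum(setosa, setosa$Petal.Length < 0.2)
piped    <- msg(setosa %>% sum(.$Petal.Length < 0.2))
explicit <- msg(sum(setosa, setosa$Petal.Length < 0.2))
identical(piped, explicit)

# Likewise, facsum(Petal.Length) after the pipe becomes facsum(setosa, Petal.Length)
piped_facsum <- msg(setosa %>% facsum(Petal.Length))
piped_facsum
```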

nirgrahamuk · June 10, 2020, 4:03pm

I think the issue is that tidy functions are predominantly designed to be convenient for manipulating data frames; when you go from frames to vectors, you are edging into where base is more convenient.
As far as alternatives to with() go, there is the pull() I've already demonstrated. It's less flexible, as with() gives you access to all columns of your frame while pull() only gets you one column, though in the case that one is all you need, the logical flow is much more readable to a human programmer.

iris %>% filter(Species == 'setosa') %>% with(facsum(Petal.Length))
iris %>% filter(Species == 'setosa') %>%  pull(Petal.Length) %>% facsum

In terms of a pure piping approach to your last puzzle, I agree with Leon that the tidy approach, of doing as much transformation in the data frame before crossing over into vector land, is preferable. (His nrow() example gives the same result as the code you wrote with with(), and is much more conventional.)
Just to nerd out for a second, I can point out that the logic check . < 0.2 makes use of an infix operator, which is itself a pipeable function. So it's possible for a programmer to write

iris %>% filter(Species == 'setosa') %>% pull(Petal.Width) %>% `<`( 0.2) %>% sum

Though I wouldn't do this, as the fully dplyr version is better in my eyes. It can also be shortened, since a single filter() can apply both the species and the petal width conditions at once:

iris %>%
  filter(Species == 'setosa', 
          Petal.Width < 0.2) %>%
  nrow()
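
As a quick check, the approaches discussed so far in the thread can be run side by side. The variable names are illustrative. Each should give the count of 5 from the original post.

```r
library(dplyr)

setosa <- iris %>% filter(Species == "setosa")

n_with      <- setosa %>% with(sum(Petal.Width < 0.2))
n_infix     <- setosa %>% pull(Petal.Width) %>% `<`(0.2) %>% sum()
n_summarise <- setosa %>% summarise(n = sum(Petal.Width < 0.2)) %>% pull(n)
n_filter    <- iris %>%
  filter(Species == "setosa", Petal.Width < 0.2) %>%
  nrow()

c(n_with, n_infix, n_summarise, n_filter)
```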

lhunsicker · June 10, 2020, 4:58pm

Eureka!!!

I finally realized that I was asking a magrittr question, not a dplyr question. So I went to the documentation of the use of '.' in magrittr. The solution is to enclose the post-pipe command in curly braces. Then the . can be used like the name of a table.

iris %>% filter(Species == 'setosa') %>% sum(.$Petal.Length < 0.2)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
iris %>% filter(Species == 'setosa') %>% {sum(.$Petal.Length < 0.2)}
[1] 0
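
A short sketch of what the braces make possible, reusing the facsum() helper from earlier in the thread. Inside { }, magrittr does not insert the piped object as a first argument, so `.` can be indexed, used more than once, or handed to an inner function. The result names are illustrative.

```r
library(dplyr)

facsum <- function(x) summary(as.factor(x))

# '.' is the filtered table: use $ on it inside a nested call
n_small <- iris %>%
  filter(Species == "setosa") %>%
  { sum(.$Petal.Width < 0.2) }

# '.' is a vector after pull(): apply a logical test inside sum()
n_small_vec <- iris %>%
  filter(Species == "setosa") %>%
  pull(Petal.Width) %>%
  { sum(. < 0.2) }

# Bracket indexing on '.' and a user-defined function both work too
first_rows <- iris %>% filter(Species == "setosa") %>% { .[1:20, 1:4] }
counts     <- iris %>% filter(Species == "setosa") %>% { facsum(.$Petal.Length) }
```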

nirgrahamuk · June 10, 2020, 6:22pm

Nice find.
I'd only seen that curly brace used a handful of times, in this way, and had forgotten it, but like everything else it's good to know about.
