How does row_number() work without the 'x' argument?

As an example, using row_number() inside mutate() in the following code sequentially numbers each row -- even without specifying the x argument:

mtcars %>% 
  slice(1:5) %>% 
  select(1:4) %>% 
  mutate(row = row_number())
   mpg cyl disp  hp row
1 21.0   6  160 110   1
2 21.0   6  160 110   2
3 22.8   4  108  93   3
4 21.4   6  258 110   4
5 18.7   8  360 175   5

When I look at the code inside row_number() I see:

function (x) 
rank(x, ties.method = "first", na.last = "keep")
<bytecode: 0x1108b3208>
<environment: namespace:dplyr>

When I try to use that code inside mutate without specifying the x argument, I get an error:

mtcars %>% 
  slice(1:5) %>% 
  select(1:4) %>% 
  mutate(row = rank(ties.method = "first", na.last = "keep"))
Error in mutate_impl(.data, dots) : 
  Evaluation error: argument "x" is missing, with no default.

My question is, how does row_number() work without specifying the x argument? I'd like to be able to add similar functionality to some of my own functions.


I believe this functionality is limited to using row_number inside of single table dplyr verbs, like mutate (see last line of row_number documentation).

As far as implementation goes, I don't see anything purely in R that makes this work, so I'm guessing it's implemented in the C++ code, but I haven't dug into it.

I found some code here:

row_number <- function(x) {
  if (missing(x)){
    seq_len(from_context("..group_size"))
  } else {
    rank(x, ties.method = "first", na.last = "keep")
  }

}
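For what it's worth, the `missing(x)` part of that code is plain base R and can be reused in your own functions. Here is a minimal sketch, assuming a hypothetical helper called `my_row_number()` with a hypothetical `n` argument (neither is part of dplyr). It cannot discover the group size by itself the way `row_number()` does, so the caller has to pass it in:

```r
# Hypothetical helper (not part of dplyr): number rows when `x` is
# missing, otherwise rank `x` the same way row_number(x) does.
my_row_number <- function(x, n = NULL) {
  if (missing(x)) {
    # No vector supplied, so the caller must say how many rows there are.
    if (is.null(n)) {
      stop("supply either `x` or `n`", call. = FALSE)
    }
    seq_len(n)
  } else {
    rank(x, ties.method = "first", na.last = "keep")
  }
}

my_row_number(n = 3)          # sequential numbering for 3 rows
my_row_number(c(10, 30, 20))  # ranks of the supplied values
```

Inside `mutate()` you would probably pass the group size explicitly, for example `mutate(row = my_row_number(n = n()))`, using dplyr's exported `n()`.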

row_number() is part of the class of hybrid evaluation functions in dplyr. When possible, these are evaluated in C++ and in the context of the data frame you are mutating / filtering / etc.

The hybrid implementation of row_number() in particular is defined here:

And I think it gets registered here:

In that second link, you can see all the other hybrid functions!

You can actually check whether an expression is going to use hybrid evaluation or not (at least in dev dplyr):

suppressPackageStartupMessages(library(dplyr)) # 0.8.0.9000

d <- tibble(a = 1:5)

# A cpp call
hybrid_call(d, row_number())
#> <hybrid evaluation>
#>   call      : dplyr::row_number()
#>   C++ class : dplyr::hybrid::internal::RowNumber0<dplyr::NaturalDataFrame>

# An R call
hybrid_call(d, row_number() + 1)
#> <standard evaluation>
#>   call      : row_number() + 1

Created on 2019-01-04 by the reprex package (v0.2.1.9000)

RowNumber0 is defined in that first link to the cpp file where all of the row_number() implementation is.

As @pete mentioned, in the newest dplyr there is also some extra code in row_number() using from_context("..group_size") when x is missing. If you try and call that outright, you will be disappointed:

dplyr:::from_context("..group_size")
# Error: NULL should only be called in a data context

But (and you should not do this) use it inside of a mutate() call where the "context" is correct, and you get real results:

suppressPackageStartupMessages(library(dplyr)) # 0.8.0.9000

d <- tibble(a = 1:5)

# using it in the right context
mutate(d, x = dplyr:::from_context("..group_size"))
#> # A tibble: 5 x 2
#>       a     x
#>   <int> <int>
#> 1     1     5
#> 2     2     5
#> 3     3     5
#> 4     4     5
#> 5     5     5

# it just returns the group size
mtcars %>%
  group_by(cyl) %>%
  mutate(
    group_size = dplyr:::from_context("..group_size")
  ) %>%
  select(cyl, group_size)
#> # A tibble: 32 x 2
#> # Groups:   cyl [3]
#>      cyl group_size
#>    <dbl>      <int>
#>  1     6          7
#>  2     6          7
#>  3     4         11
#>  4     6          7
#>  5     8         14
#>  6     6          7
#>  7     8         14
#>  8     4         11
#>  9     4         11
#> 10     6          7
#> # … with 22 more rows

Created on 2019-01-04 by the reprex package (v0.2.1.9000)

The moral of the story: just let dplyr use these hybrid evaluation functions. There is currently no way to access enough information at the R level to create custom ones for your own use.
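If all you need is the group size rather than a true hybrid function, the exported `n()` is the supported way to get it. A small sketch, assuming dplyr is installed; the column names `row_builtin` and `row_manual` are just for illustration, and this does not make anything hybrid:

```r
library(dplyr)

# Number rows within each cyl group two ways: with row_number() and
# by building the sequence from the group size returned by n().
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  mutate(
    row_builtin = row_number(),
    row_manual  = seq_len(n())
  ) %>%
  ungroup()

# Compare the two columns
all(by_cyl$row_builtin == by_cyl$row_manual)
```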


Yes, you found the new implementation, which is in the dev version and the upcoming 0.8.0 release, but it is not the same as the current CRAN version.


This is why what you found differs from what @brad.cannell sees.

This has changed, possibly as part of a breaking change in behaviour.

Some thoughts about how it works and why you observe this:
some of dplyr's magic comes from something called hybrid evaluation. You'll find some reference to it in the release candidate blog post.


Basically, dplyr executes some code in C++ rather than in R, and it tries to identify function calls that it can replace with a C++ implementation instead of the R one (or at least I think of it that way... :thinking:). So when it identifies row_number() inside mutate() or summarise(), it does not call the R version (at least in versions <= 0.7.6).

There is some C++ code for the row_number dispatch that illustrates this.

The new version has a helper, hybrid_call(), to see some of that dark magic:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.8.0.9000'
mtcars %>% 
  slice(1:5) %>% 
  select(1:4) %>% 
  hybrid_call(mutate(row = row_number()))
#> <standard evaluation>
#>   call      : mutate(row = row_number())

It does not back me up in versions >= 0.8 though, as it says standard evaluation... :man_facepalming:
But as you found, the code changed and the R function row_number() now handles a missing x.
The NEWS for 0.8.0 says:

Hybrid evaluation has been completely redesigned for better performance and stability.

So that may be why the previous behaviour is difficult to explain or illustrate.

I hope this is not too confusing and that it helps in some way.


beat you to it by 1 minute :wink:


Thank you, @davis. I appreciate you taking the time to explain all that.

:open_mouth: I did not see you were writing at the same time! It was close! :smiley:
Your explanation is clearer, and I was close to the correct use of hybrid_call :wink: