General question about lazy evaluation and dplyr

A question has been bugging me. I know R has lazy evaluation, which means function arguments aren't evaluated until they are actually needed: a call creates promises rather than doing the work up front.
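To make the lazy-evaluation part concrete, here is a minimal base R sketch. The names `events`, `log_event`, `lazy_demo` and `ignore_arg` are made up for illustration. It only shows when an argument gets evaluated, which is a separate matter from whether data gets copied.

```r
# Hypothetical helpers, for illustration only.
events <- character()
log_event <- function(msg) events <<- c(events, msg)

lazy_demo <- function(x) {
  log_event("function body started")
  x  # the promise for x is forced here, not at call time
  invisible(NULL)
}

# The argument expression runs only once the body touches x.
lazy_demo({
  log_event("argument evaluated")
  1
})
print(events)

# An argument that is never used is never evaluated, so stop() does not fire.
ignore_arg <- function(x) "x was never touched"
result <- ignore_arg(stop("this error is never raised"))
print(result)
```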

I want to understand how this works with dplyr. Are there tricks to using dplyr that prevent copies of the data frame being made when all we are really doing is selecting rows and columns? I haven't been able to find a guide to this. I made this short reprex that identifies instances where dplyr did or did not make a copy of the data. Does anyone know of a guide?

The context is I am trying to streamline a shiny app which has a sizeable dataframe that I am filtering and sorting.

library(pryr)
library(dplyr)

# data frame with a million doubles
df <- data.frame(x=runif(500000), y=runif(500000))
object_size(df)
#> 8 MB

# rename doesn't make a copy
df2 <- df %>%
    rename(w=x)
object_size(df, df2)
#> 8 MB

# filter does make a copy
df2 <- df %>%
    filter(x>0.5)
object_size(df, df2)
#> 12 MB

# arrange does make a copy
df2 <- df %>%
    arrange(x)
object_size(df, df2)
#> 16 MB

# select doesn't make a copy
df2 <- df %>%
    select(y)
object_size(df, df2)
#> 8 MB

Created on 2019-07-06 by the reprex package (v0.3.0)

You can read a bit more here - https://adv-r.hadley.nz/names-values.html#object-size

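The main idea is that a data frame is a list of column vectors, and a new list can share its unchanged elements with the original instead of copying them. rename and select only change the names or the set of columns, so the existing column vectors are reused and no extra memory is needed. filter and arrange produce columns with different contents (fewer rows, or the same rows in a different order), so new vectors have to be allocated. That is why object_size(df, df2) stays at 8 MB for rename and select but grows for filter and arrange: shared columns are only counted once.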
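One way to check the sharing directly is to compare the memory addresses of the columns. This is a rough sketch and assumes the lobstr package is installed (the reprex above used pryr instead). The names `df_sel`, `df_ren`, `df_fil` and `df_arr` are just for illustration.

```r
library(dplyr)
library(lobstr)  # assumption: lobstr is available; it provides obj_addr()

df <- data.frame(x = runif(500000), y = runif(500000))

# select() and rename() keep the column vectors as they are,
# so the addresses should match those in the original df.
df_sel <- df %>% select(y)
df_ren <- df %>% rename(w = x)
same_after_select <- obj_addr(df$y) == obj_addr(df_sel$y)
same_after_rename <- obj_addr(df$x) == obj_addr(df_ren$w)

# filter() and arrange() have to build new column vectors,
# so the addresses should differ from the original.
df_fil <- df %>% filter(x > 0.5)
df_arr <- df %>% arrange(x)
same_after_filter  <- obj_addr(df$x) == obj_addr(df_fil$x)
same_after_arrange <- obj_addr(df$x) == obj_addr(df_arr$x)

print(c(select = same_after_select, rename = same_after_rename,
        filter = same_after_filter, arrange = same_after_arrange))
```

For the Shiny case this suggests that dropping or renaming columns is cheap, while each filter or arrange step allocates new vectors for the rows it returns.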
