Anyone have any fun R things they've learned recently?

Another random tidbit: tally() is a "summarize" function, meaning it collapses each group to a single row and drops all variables other than your group_by vars (plus the count itself). You can also do grouped mutates in dplyr that behave like the dplyr::add_count above. This saves the join and gives you a lot of flexibility. Here, I do a manual count like add_count, but I also get the mean hp for each cyl group (number of cylinders).

This is actually behavior that I first learned in PROC SQL, if memory serves me correctly, and I was very happy to find it in R as well :smiley:

library(dplyr)

mtcars %>% 
  group_by(cyl) %>%
  mutate(count = n(), avg_hp = mean(hp)) %>%
  select(mpg, cyl, hp, count, avg_hp)
#> # A tibble: 32 x 5
#> # Groups:   cyl [3]
#>      mpg   cyl    hp count avg_hp
#>    <dbl> <dbl> <dbl> <int>  <dbl>
#>  1  21       6   110     7  122. 
#>  2  21       6   110     7  122. 
#>  3  22.8     4    93    11   82.6
#>  4  21.4     6   110     7  122. 
#>  5  18.7     8   175    14  209. 
#>  6  18.1     6   105     7  122. 
#>  7  14.3     8   245    14  209. 
#>  8  24.4     4    62    11   82.6
#>  9  22.8     4    95    11   82.6
#> 10  19.2     6   123     7  122. 
#> # ... with 22 more rows

Created on 2019-01-28 by the reprex package (v0.2.1)
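
To make the contrast between the three approaches concrete, here is a minimal sketch on the same mtcars data. The object names (counts, with_n, with_stats) are only illustrative and are not from the post above.

```r
library(dplyr)

# tally() collapses each group to a single row: one row per cyl value
counts <- mtcars %>%
  group_by(cyl) %>%
  tally()

# add_count() keeps every row and appends the group size as column n
with_n <- mtcars %>%
  add_count(cyl)

# A grouped mutate() also keeps every row, and can add other
# per-group statistics in the same step
with_stats <- mtcars %>%
  group_by(cyl) %>%
  mutate(count = n(), avg_hp = mean(hp)) %>%
  ungroup()

nrow(counts)      # one row per distinct cyl
nrow(with_n)      # same number of rows as mtcars
nrow(with_stats)  # same number of rows as mtcars
```

The ungroup() at the end is optional; it just avoids carrying the grouping into later steps by accident.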

Instead of age - mean(age), you can also use scale(age, scale = FALSE). It's not more compact, but it avoids repeating the variable name :slight_smile:
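
A quick sketch of the equivalence, using a made-up age vector (the values are an assumption for illustration). One thing worth knowing is that scale() returns a one-column matrix rather than a plain vector, so you may want to convert it back.

```r
age <- c(23, 35, 41, 58, 63)  # made-up values for illustration

centered_manual <- age - mean(age)
centered_scale  <- scale(age, scale = FALSE)

# scale() returns a one-column matrix carrying a "scaled:center" attribute,
# so convert it to a plain numeric vector before comparing
all.equal(centered_manual, as.numeric(centered_scale))

is.matrix(centered_scale)
```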

I've begun using between(), which works quite nicely:

> tibble(x = rnorm(10)) %>% filter(x %>% between(-1, 1))
# A tibble: 6 x 1
        x
    <dbl>
1  0.463 
2  0.891 
3 -0.254 
4  0.0976
5 -0.819 
6  0.596 
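
For reference, dplyr's between() is a shortcut for x >= left & x <= right, so both bounds are inclusive. A small sketch with fixed values (chosen here only so the result is reproducible, unlike rnorm()):

```r
library(dplyr)

x <- c(-2, -1, 0, 0.5, 1, 3)  # fixed values instead of rnorm()

# between() includes both endpoints
in_range_between <- between(x, -1, 1)

# The equivalent explicit comparison
in_range_manual <- x >= -1 & x <= 1

identical(in_range_between, in_range_manual)
```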

There are two versions of between: one from the dplyr package (which I assume you're using), and another from the data.table package.

dplyr's version has the benefit of being translatable to SQL by the dbplyr package.

library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mycars <- copy_to(con, mtcars)
mycars %>% filter(between(hp, 10, 30)) %>% show_query()
# <SQL>
# SELECT *
# FROM `mtcars`
# WHERE (`hp` BETWEEN 10.0 AND 30.0)

The data.table version, on the other hand, accepts vectors for the lower and upper bounds. This is useful when the bounds depend on the observation.

campaigns <- data.frame(
  name  = c("Super sale!", "Crazy cuts!", "Delirious deals!"),
  start = as.Date(c("2018-08-01", "2018-12-20", "2019-01-15")),
  end   = as.Date(c("2018-08-06", "2019-01-02", "2019-02-15"))
)

campaigns %>%
  filter(data.table::between(Sys.Date(), start, end))
#               name      start        end
# 1 Delirious deals! 2019-01-15 2019-02-15
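
Because the result above depends on the day the code is run, here is a sketch of the same per-row comparison in base R with a fixed date. check_date is a hypothetical stand-in for Sys.Date(), and the explicit >= / <= comparison is swapped in for data.table::between so the example has no package dependencies.

```r
campaigns <- data.frame(
  name  = c("Super sale!", "Crazy cuts!", "Delirious deals!"),
  start = as.Date(c("2018-08-01", "2018-12-20", "2019-01-15")),
  end   = as.Date(c("2018-08-06", "2019-01-02", "2019-02-15"))
)

# Hypothetical fixed date standing in for Sys.Date()
check_date <- as.Date("2019-01-28")

# Compare the single date against each campaign's own start and end
is_active <- check_date >= campaigns$start & check_date <= campaigns$end

active <- campaigns[is_active, ]
active$name
```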

I have never used data.table. Total dplyr/tidyverse fan :sunglasses:
