If you have many columns to summarize, you can use summarise_all, which will operate on all non-grouping columns. For example:
library(tidyverse)

DF <- tribble(
  ~PNR, ~Passenger_Name, ~Sector_Flown,
  "ABC", "XYZ1", "DEL-NYC",
  "ABC", "XYZ2", "DEL-NYC",
  "ABC", "XYZ3", "DEL-NYC",
  "ABC", "XYZ1", "NYC-DEL",
  "ABC", "XYZ2", "NYC-DEL",
  "ABC", "XYZ3", "NYC-DEL")

DF %>%
  group_by(PNR) %>%
  summarise_all(list(distinct = ~length(unique(.))))
# A tibble: 1 x 3
#   PNR   Passenger_Name_distinct Sector_Flown_distinct
#   <chr>                   <int>                 <int>
# 1 ABC                         3                     2
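Incidentally, dplyr's n_distinct() is a built-in equivalent of length(unique(.)), so the same summary can be written a bit more compactly:

DF %>%
  group_by(PNR) %>%
  summarise_all(list(distinct = n_distinct))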
If you don't want a suffix added on to the summarized column names, you can do:
DF %>%
  group_by(PNR) %>%
  summarise_all(~length(unique(.)))
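This keeps the original column names; for the toy data above the result should look like:

# A tibble: 1 x 3
#   PNR   Passenger_Name Sector_Flown
#   <chr>          <int>        <int>
# 1 ABC                3            2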
You can also apply multiple summary functions, with or without suffixes:
DF %>%
  group_by(PNR) %>%
  summarise_all(list(uniq = ~length(unique(.)), ~mean(nchar(.))))
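Note that in dplyr 1.0 and later, the scoped verbs like summarise_all() are superseded by across(). A rough equivalent of the call above (the mean_chars name is just illustrative):

DF %>%
  group_by(PNR) %>%
  summarise(across(everything(),
                   list(uniq = n_distinct, mean_chars = ~mean(nchar(.)))))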
Given the size of your data, you might want to try the data.table package, which is much faster for grouped operations like this. Although data.table has its own syntax, the dtplyr package lets you keep programming with tidyverse functions while the code is executed by data.table "under the hood". For example:
library(dtplyr)

# Create a lazy data table version of DF
DF.DT = lazy_dt(DF)

DF.DT %>%
  group_by(PNR) %>%
  summarise_all(list(distinct = ~length(unique(.)))) %>%
  as_tibble()
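Note that lazy_dt() defers computation: nothing runs until you collect the result with as_tibble() (or as.data.table()). If you're curious what data.table code dtplyr generates, replace the final as_tibble() with show_query() (the exact translation may vary by dtplyr version):

DF.DT %>%
  group_by(PNR) %>%
  summarise_all(list(distinct = ~length(unique(.)))) %>%
  show_query()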
Now compare the speed of the two approaches on a larger data frame. Note, in the timings at the end of the code, that the dtplyr approach is more than 11x faster on a 100,000-row data frame. (I also tried this on a million-row data frame with 337,000 unique PNRs and saw about the same speedup.)
# Set up fake data: 100,000 rows
set.seed(2)
df2 = tibble(PNR = replicate(1e5, paste(sample(LETTERS, 3), collapse = "")))

df2 = df2 %>%
  group_by(PNR) %>%
  mutate(Passenger_Name = LETTERS[1:n()],
         Sector_Flown = sample(letters, n(), replace = TRUE)) %>%
  ungroup() %>%
  arrange(PNR, Passenger_Name, Sector_Flown)
df2
# A tibble: 100,000 x 3
#    PNR   Passenger_Name Sector_Flown
#    <chr> <chr>          <chr>
#  1 ABC   A              t
#  2 ABC   B              w
#  3 ABC   C              v
#  4 ABC   D              j
#  5 ABC   E              j
#  6 ABC   F              j
#  7 ABD   A              x
#  8 ABD   B              s
#  9 ABD   C              c
# 10 ABD   D              e
# … with 99,990 more rows
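As a sanity check on the fake data: each PNR is three distinct letters, so there are 26 * 25 * 24 = 15,600 possible codes, and with 100,000 draws nearly all of them should show up:

n_distinct(df2$PNR)
# should be close to 15,600 (the exact count depends on the seed)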
library(microbenchmark)

# Create a lazy data table version of df2
df2.dt = lazy_dt(df2)

# Get timings
microbenchmark(
  dplyr = df2 %>%
    group_by(PNR) %>%
    summarise_all(list(distinct = ~length(unique(.)))),
  dtplyr = df2.dt %>%
    group_by(PNR) %>%
    summarise_all(list(distinct = ~length(unique(.)))) %>%
    as_tibble(),
  times = 5, unit = "s"
)
# Unit: seconds
#    expr        min         lq      mean    median        uq       max neval cld
#   dplyr 1.08584555 1.11027390 1.1651085 1.1742366 1.2159225 1.2392638     5   b
#  dtplyr 0.09888661 0.09973629 0.1065941 0.1005101 0.1144581 0.1193795     5  a
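For completeness: if you're comfortable writing data.table directly, the same per-PNR counts are a one-liner using data.table's uniqueN(), skipping the dtplyr translation layer entirely. dtplyr generates essentially this kind of expression, so performance should be similar:

library(data.table)
df2.DT = as.data.table(df2)
df2.DT[, lapply(.SD, uniqueN), by = PNR]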