Taking distinct count over multiple column

hsm.niks · February 22, 2020, 2:51pm

Hi Guys,

I am new to R even though have good exposure on SPSS and SAS. Here I am trying to take distinct count over multiple columns.

My data looks like below:-

PNR Passenger_Name Sector_Flown
"ABC" | "XYZ1 " | "DEL-NYC"
"ABC" | "XYZ2 " | "DEL-NYC"
"ABC" | "XYZ3 " | "DEL-NYC"
"ABC" | "XYZ1 " | "NYC-DEL"
"ABC" | "XYZ2 " | "NYC-DEL"
"ABC" | "XYZ3 " | "NYC-DEL"

Here logic is that same PNR can refer to multiple passenger name as well as multiple sector. I want to create a new file which should look like below:-

PNR Distinct_Passenger_Count Distinct_Sector_Count
ABC 3 2

Above is just an example which shows a family of 3 members of a family flown on a return ticket against PNR no ABC. My data set has more than 20 million rows, so would request something which is easy to handle as well as can be handled on a normal laptop with it's processing capacity.

Requesting your help guys.

FJCC · February 22, 2020, 4:30pm

Try something like the code below will work for you.

library(tibble)
library(dplyr)
DF <- tribble(
  ~PNR, ~Passenger_Name, ~Sector_Flown,
"ABC", "XYZ1", "DEL-NYC",
"ABC", "XYZ2", "DEL-NYC",
"ABC", "XYZ3", "DEL-NYC",
"ABC", "XYZ1", "NYC-DEL",
"ABC", "XYZ2", "NYC-DEL",
"ABC", "XYZ3", "NYC-DEL")
DF
#> # A tibble: 6 x 3
#>   PNR   Passenger_Name Sector_Flown
#>   <chr> <chr>          <chr>       
#> 1 ABC   XYZ1           DEL-NYC     
#> 2 ABC   XYZ2           DEL-NYC     
#> 3 ABC   XYZ3           DEL-NYC     
#> 4 ABC   XYZ1           NYC-DEL     
#> 5 ABC   XYZ2           NYC-DEL     
#> 6 ABC   XYZ3           NYC-DEL
DF2 <- DF %>% group_by(PNR) %>% 
  summarize(Distnc_P = length(unique(Passenger_Name)), 
            Distinc_Sector = length(unique(Sector_Flown)))
DF2
#> # A tibble: 1 x 3
#>   PNR   Distnc_P Distinc_Sector
#>   <chr>    <int>          <int>
#> 1 ABC          3              2

^{Created on 2020-02-22 by the reprex package (v0.3.0)}

joels · February 22, 2020, 5:24pm

If you have many columns to summarize, you can use summarise_all, which will operate on all non-grouping columns. For example:

library(tidyverse)

DF <- tribble(
  ~PNR, ~Passenger_Name, ~Sector_Flown,
  "ABC", "XYZ1", "DEL-NYC",
  "ABC", "XYZ2", "DEL-NYC",
  "ABC", "XYZ3", "DEL-NYC",
  "ABC", "XYZ1", "NYC-DEL",
  "ABC", "XYZ2", "NYC-DEL",
  "ABC", "XYZ3", "NYC-DEL")

DF %>% 
  group_by(PNR) %>% 
  summarise_all(list(distinct=~length(unique(.))))

# A tibble: 1 x 3
#    PNR   Passenger_Name_distinct Sector_Flown_distinct
#  <chr>                     <int>                 <int>
#  1 ABC                         3                     2

If you don't want a suffix added on to the summarized column names, you can do:

DF %>% 
  group_by(PNR) %>% 
  summarise_all(~length(unique(.)))

You can also apply multiple summary functions, with or without suffixes:

DF %>% 
  group_by(PNR) %>% 
  summarise_all(list(uniq=~length(unique(.)), ~mean(nchar(.))))

Given the size of your data, you might want to try using the data.table package, which is much faster. Although data.table has a separate syntax, the dtplyr package allows you to program using tidyverse functions, while the code is executed using data.table "under the hood". For example:

library(dtplyr)

# Create a lazy data table version of DF
DF.DT = lazy_dt(DF)

DF.DT %>% 
  group_by(PNR) %>% 
  summarise_all(list(distinct=~length(unique(.)))) %>% 
  as_tibble()

Now, compare the speed of the two approaches on a larger data frame. Note, in the timings at the end of the code, that the dtplyr approach is more than 11x faster on a 100,000-row data frame. (I also tried this on a million-row data frame with 337,000 unique PNRs and got about the same increase in speed.):

# Set up fake data: 100,000 rows
set.seed(2)
df2 = tibble(PNR=replicate(1e5, paste(sample(LETTERS, 3), collapse="")))
df2 = df2 %>% 
  group_by(PNR) %>% 
  mutate(Passenger_Name=LETTERS[1:n()],
         Sector_Flown=sample(letters, n(), replace=TRUE)) %>% 
  ungroup %>% 
  arrange(PNR, Passenger_Name, Sector_Flown)

df2
# A tibble: 100,000 x 3
# PNR   Passenger_Name Sector_Flown
# <chr> <chr>          <chr>       
# 1 ABC   A              t           
# 2 ABC   B              w           
# 3 ABC   C              v           
# 4 ABC   D              j           
# 5 ABC   E              j           
# 6 ABC   F              j           
# 7 ABD   A              x           
# 8 ABD   B              s           
# 9 ABD   C              c           
# 10 ABD   D              e           
# … with 99,990 more rows

library(microbenchmark)

# Create a lazy data table version of df2
df2.dt = lazy_dt(df2)

# Get timings
microbenchmark(
  dplyr=df2 %>% 
    group_by(PNR) %>% 
    summarise_all(list(distinct=~length(unique(.)))),
  dtplyr=df2.dt %>% 
    group_by(PNR) %>% 
    summarise_all(list(distinct=~length(unique(.)))) %>% 
    as_tibble,
  times=5, unit="s"
)

#Unit: seconds
#   expr        min         lq      mean    median        uq       max neval cld
#  dplyr 1.08584555 1.11027390 1.1651085 1.1742366 1.2159225 1.2392638     5   b
# dtplyr 0.09888661 0.09973629 0.1065941 0.1005101 0.1144581 0.1193795     5  a

dromano · February 22, 2020, 5:41pm

Since you're new to R, I thought I'd also post a solution that illustrates some relevant commands for didactic purposes:

# required for recognition of functions and syntax used below
library(tidyverse) 

# create table
flight_data <- 
  tribble(
    ~ PNR, ~ Passenger_Name, ~ Sector_Flown,
    "ABC" , "XYZ1" , "DEL-NYC",
    "ABC" , "XYZ2" , "DEL-NYC",
    "ABC" , "XYZ3" , "DEL-NYC",
    "ABC" , "XYZ1" , "NYC-DEL",
    "ABC" , "XYZ2" , "NYC-DEL",
    "ABC" , "XYZ3" , "NYC-DEL",
    'DEF', 'mmm', 'nnn',
    'DEF', 'mmm', 'ooo'
    )

# view table
flight_data
#> # A tibble: 8 x 3
#>   PNR   Passenger_Name Sector_Flown
#>   <chr> <chr>          <chr>       
#> 1 ABC   XYZ1           DEL-NYC     
#> 2 ABC   XYZ2           DEL-NYC     
#> 3 ABC   XYZ3           DEL-NYC     
#> 4 ABC   XYZ1           NYC-DEL     
#> 5 ABC   XYZ2           NYC-DEL     
#> 6 ABC   XYZ3           NYC-DEL     
#> 7 DEF   mmm            nnn         
#> 8 DEF   mmm            ooo

# approach that doesn't require joins, illustrated first with passenger data
## count all combinations of PNR and Passenger_Name, store in new column
flight_data %>% 
  add_count(PNR, Passenger_Name, name = 'pass_dup')
#> # A tibble: 8 x 4
#>   PNR   Passenger_Name Sector_Flown pass_dup
#>   <chr> <chr>          <chr>           <int>
#> 1 ABC   XYZ1           DEL-NYC             2
#> 2 ABC   XYZ2           DEL-NYC             2
#> 3 ABC   XYZ3           DEL-NYC             2
#> 4 ABC   XYZ1           NYC-DEL             2
#> 5 ABC   XYZ2           NYC-DEL             2
#> 6 ABC   XYZ3           NYC-DEL             2
#> 7 DEF   mmm            nnn                 2
#> 8 DEF   mmm            ooo                 2

## then group rows by PNR and calculate distinct count
flight_data %>% 
  add_count(PNR, Passenger_Name, name = 'pass_dup') %>% 
  group_by(PNR) %>% 
  summarise(pass_count = sum(1 / pass_dup))
#> # A tibble: 2 x 2
#>   PNR   pass_count
#>   <chr>      <dbl>
#> 1 ABC            3
#> 2 DEF            1

## repeat, but include sector data, too
flight_data %>% 
  add_count(PNR, Passenger_Name, name = 'pass_dup') %>% 
  add_count(PNR, Sector_Flown, name = 'sect_dup')
#> # A tibble: 8 x 5
#>   PNR   Passenger_Name Sector_Flown pass_dup sect_dup
#>   <chr> <chr>          <chr>           <int>    <int>
#> 1 ABC   XYZ1           DEL-NYC             2        3
#> 2 ABC   XYZ2           DEL-NYC             2        3
#> 3 ABC   XYZ3           DEL-NYC             2        3
#> 4 ABC   XYZ1           NYC-DEL             2        3
#> 5 ABC   XYZ2           NYC-DEL             2        3
#> 6 ABC   XYZ3           NYC-DEL             2        3
#> 7 DEF   mmm            nnn                 2        1
#> 8 DEF   mmm            ooo                 2        1

flight_data %>% 
  add_count(PNR, Passenger_Name, name = 'pass_dup') %>% 
  add_count(PNR, Sector_Flown, name = 'sect_dup') %>% 
  group_by(PNR) %>% 
  summarise(
    pass_count = sum(1/pass_dup), 
    sect_count = sum(1/sect_dup)
    ) 
#> # A tibble: 2 x 3
#>   PNR   pass_count sect_count
#>   <chr>      <dbl>      <dbl>
#> 1 ABC            3          2
#> 2 DEF            1          2
  
# approach with join
## identify distinct combinations of PNR and Passenger_Name
flight_data %>% 
  distinct(PNR, Passenger_Name)
#> # A tibble: 4 x 2
#>   PNR   Passenger_Name
#>   <chr> <chr>         
#> 1 ABC   XYZ1          
#> 2 ABC   XYZ2          
#> 3 ABC   XYZ3          
#> 4 DEF   mmm

## calculate passenger counts per PNR
flight_data %>% 
  distinct(PNR, Passenger_Name) %>% 
  group_by(PNR) %>%
  tally(name = 'pass_count')
#> # A tibble: 2 x 2
#>   PNR   pass_count
#>   <chr>      <int>
#> 1 ABC            3
#> 2 DEF            1

## equivalent with 'count()'
flight_data %>% 
  distinct(PNR, Passenger_Name) %>% 
  group_by(PNR) %>%
  count(name = 'pass_count') %>% 
  ungroup()
#> # A tibble: 2 x 2
#>   PNR   pass_count
#>   <chr>      <int>
#> 1 ABC            3
#> 2 DEF            1

## store
pass_counts <- 
  flight_data %>% 
  distinct(PNR, Passenger_Name) %>% 
  group_by(PNR) %>%
  tally(name = 'pass_count')

## repeat with Sector_Flown
sect_counts <- 
  flight_data %>% 
  distinct(PNR, Sector_Flown) %>% 
  group_by(PNR) %>%
  tally(name = 'sect_count')

## join
pass_counts %>% 
  inner_join(sect_counts, by = 'PNR')
#> # A tibble: 2 x 3
#>   PNR   pass_count sect_count
#>   <chr>      <int>      <int>
#> 1 ABC            3          2
#> 2 DEF            1          2

^{Created on 2020-02-22 by the reprex package (v0.3.0)}

hsm.niks · February 22, 2020, 7:51pm

A big thank you to you all guys. Almost all the approaches worked. In the mean time I also found one solution which I wanted to share with you all (not sure whether it will help experts like you guys or not, but still).

I used SQLDF package which allows for SQL syntaxes inside R. I used the script like below and it produces the exact result I wanted.

analysis <- sqldf("SELECT PNR, COUNT(distinct(PNR)) as distinct_PNR, COUNT(distinct(PassengerName)) as distinct_PName, COUNT(distinct(Sector)) as distinct_Sectors FROM FY2019 GROUP BY PNR")

TBH, all the answers above work and wanted to accept all of them as solution but unfortunately I cant select more than one solution. Please dont get mad at me for this.

dromano · February 22, 2020, 8:06pm

That's much simpler! Here's a tidyverse version that mimics your SQL version:

library(tidyverse) 

flight_data <- 
  tribble(
    ~ PNR, ~ Passenger_Name, ~ Sector_Flown,
    "ABC" , "XYZ1" , "DEL-NYC",
    "ABC" , "XYZ2" , "DEL-NYC",
    "ABC" , "XYZ3" , "DEL-NYC",
    "ABC" , "XYZ1" , "NYC-DEL",
    "ABC" , "XYZ2" , "NYC-DEL",
    "ABC" , "XYZ3" , "NYC-DEL",
    'DEF', 'mmm', 'nnn',
    'DEF', 'mmm', 'ooo'
  )

flight_data %>% 
  group_by(PNR) %>% 
  summarise(
    pass_count = length(unique(Passenger_Name)),
    sect_count = length(unique(Sector_Flown))
  )
#> # A tibble: 2 x 3
#>   PNR   pass_count sect_count
#>   <chr>      <int>      <int>
#> 1 ABC            3          2
#> 2 DEF            1          2

^{Created on 2020-02-22 by the reprex package (v0.3.0)}

system · February 29, 2020, 8:06pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.