Creating Frequency Table Using Tidyverse

dplyr

#1

Hello, could someone please help me on how I can create a frequency table based on two variables?

I have a dataset for passenger travel destinations. The first variable is the passenger number and the second is the destinations visited. I’m looking to find what destinations and combinations are most popular.

The original dataset looks like:
tribble(
~passenger_id, ~destination,
1, “China”,
1, “Japan”,
2, “England”,
2, “US”,
3, “Canada”,
3, “China”,
3, “US”,
4, “Japan”
)

I’m hoping for an outcome like:

image

Edited for Canada/China to be 1


#2

Have you looked at the widyr package by @drob? If you think of passenger_ids as strings, I think the Counting and correlating pairs of words with the widyr package section from @drob and @julia’s Text Mining with R might be easily adapted.

Oh, also worth looking at corrr by @drsimonj


I think the figure he uses to illustrate what corrr does is a great way of visualizing the transformation


#3

Any chance the Canada/China association should be 1 instead of 2? If not, I’m not quite following what the methodology should be.

Otherwise, using a join to create all possible combinations along with spread to format in the needed table works well, from what I can tell:

suppressPackageStartupMessages(library(tidyverse))

dat <- tribble(
  ~passenger_id, ~destination,
  1, "China",
  1, "Japan",
  2, "England",
  2, "US",
  3, "Canada",
  3, "China",
  3, "US",
  4, "Japan"
)

transformed_dat <-
  left_join(dat, dat, by = "passenger_id") %>%
  select(Columns = destination.x,
         Rows = destination.y) %>%
  group_by(Columns, Rows) %>%
  summarize(Count = n()) %>%
  spread(Columns, Count, fill = 0)

transformed_dat
#> # A tibble: 5 x 6
#>      Rows Canada China England Japan    US
#> *   <chr>  <dbl> <dbl>   <dbl> <dbl> <dbl>
#> 1  Canada      1     1       0     0     1
#> 2   China      1     2       0     1     1
#> 3 England      0     0       1     0     1
#> 4   Japan      0     1       0     2     0
#> 5      US      1     1       1     0     2

#4

I like your solution, @nick.

I found the example little bit confusing too. I found interesting the task of counting frequency of the pairs, so here is my solution, using widyr, mentioned above (thank you for that, @mara!). So it is not what you need, but it is nice to have this in the same tread.

library(dplyr); library(widyr)
dat %>%
  pairwise_count(destination, passenger_id) %>% 
  spread(item2, n, fill = 0)
# A tibble: 5 x 6
    item1 Canada China England Japan    US
*   <chr>  <dbl> <dbl>   <dbl> <dbl> <dbl>
1  Canada      0     1       0     0     1
2   China      1     0       0     1     1
3 England      0     0       0     0     1
4   Japan      0     1       0     0     0
5      US      1     1       1     0     0

#5

You results are not the same as @nick 's. For example China/China should be 2 but in your solution it is 0. Am I missing something here?


#6

Yes, @danr, as I wrote, it just count pairwise coexisting. @nick’s solution has different diagonal values – country frequency itself.


#7

Thanks, now I see it.


#8

I’m still trying to understand how R manages memory but @nick doesn’t

	select(Columns = destination.x,
				 Rows = destination.y)

end up creating a new table in memory? Cross joins can get to be pretty big fast. Maybe R does something in this case to minimize the memory usage in a pipeline like this?

This script skips the select() step and just changes the names attribute of transformed_dat at the end so it doesn’t make a new table. If you don’t care about the first column name just drop the part that changes the column name.

suppressPackageStartupMessages(library(tidyverse))
dat <- tribble(
    ~passenger_id, ~destination,
    1, "China",
    1, "Japan",
    2, "England",
    2, "US",
    3, "Canada",
    3, "China",
    3, "US",
    4, "Japan"
)
transformed_dat <-
    left_join(dat, dat, by = "passenger_id") %>%
    group_by(destination.x, destination.y) %>%
    summarize(Count = n()) %>%
    spread(destination.x, Count, fill = 0)
ns <- names(transformed_dat)
ns[[1]] <- "Rows"
names(transformed_dat) <- ns
transformed_dat
#> # A tibble: 5 x 6
#>      Rows Canada China England Japan    US
#> *   <chr>  <dbl> <dbl>   <dbl> <dbl> <dbl>
#> 1  Canada      1     1       0     0     1
#> 2   China      1     2       0     1     1
#> 3 England      0     0       1     0     1
#> 4   Japan      0     1       0     2     0
#> 5      US      1     1       1     0     2

#9

at the time of merge, you get two columns called destination (with the appended suffix .x and .y to tell them apart).
select is choosing columns from the dataset at this point in the processing, and renaming the destination.x column as Columns and the destination.y column as Rows.


#10

Per this old github issue, dplyr hasn’t copied a table on select in quite some time (at least it was closed a few years ago):

Playing around with dplyr:::dfloc and select indicates that the table column memory locations don’t change when I use select, so I think it’s a safe operation. Which is good, as I tend to sprinkle select statements in my code fairly liberally to drop columns that are no longer needed and rename ones that have awkward names.


#11

That is good to know but…

So maybe there are some copy on write semantics going on here, but the following example seems to show that in fact a new table is being made…

suppressPackageStartupMessages(library(tidyverse))
tbl1 <- tibble(a = 1:3, b=4:6)
tbl1 %>% select( a1=a, b1=b)
#> # A tibble: 3 x 2
#>      a1    b1
#>   <int> <int>
#> 1     1     4
#> 2     2     5
#> 3     3     6
tbl1
#> # A tibble: 3 x 2
#>       a     b
#>   <int> <int>
#> 1     1     4
#> 2     2     5
#> 3     3     6

I do a select on tbl1 and get a new table with new column names, but when I examine tbl1 after that I see tbl1 still has the old names.

With some clever copy on write semantics you can see behavior like this where behind the scenes both tables are using the same memory for data but different memory for attributes. And some operating systems will do this for you without even asking. For example on Windows memory is copy on write.

I’ll have to dig into that article now and see if I can figure out how to reconcile my little experiment with what the article says.

Thanks,
Dan


#12

Yup a select doesn’t copy the columns, they are in the same memory location before and after… but make a change to one of the tables and the column(s) are the copied to another location in memory. So it is using copy on write semantics. My guess it that is being provided by the operating system maybe with some help from R? But in any case that is good to know… Notice that only one column was moved not both. Thanks for the pointer @nick

suppressPackageStartupMessages(library(tidyverse))
dfloc <- dplyr:::dfloc
tbl1 <- tibble(a = 1:3, b=4:6)
dfloc(tbl1)
#>                a                b 
#> "0x7f83db3efb58" "0x7f83de06b788"
tbl2 <- tbl1 %>% select( a1=a, b1=b)
dfloc(tbl1)
#>                a                b 
#> "0x7f83db3efb58" "0x7f83de06b788"
dfloc(tbl2)
#>               a1               b1 
#> "0x7f83db3efb58" "0x7f83de06b788"
# change one of the tables and somebody's columns get moved
tbl1[[1,1]] <- "xyz"
dfloc(tbl1)
#>                a                b 
#> "0x7f83de0eeec8" "0x7f83de06b788"
dfloc(tbl2)
#>               a1               b1 
#> "0x7f83db3efb58" "0x7f83de06b788"

#13

Thank you @mara and @nick! Both the widyr package and @nick dplyr with spread worked.

@nick - yes, the Canada/China association should be 1 instead of 2. Sorry for the sloppiness. I will change that in the original post.


#14

I think the janitor package will accomplish this for you pretty easily, too.


#15
freq_tibble <- function(data, var1, var2) {
  var1 <- rlang::enquo(var1)
  var2 <- rlang::enquo(var2)

  data %>%
    dplyr::count(!!var1, !!var2) %>%
    tidyr::spread(!!var2, n, fill = 0) %>%
    dplyr::mutate(Total := rowSums(dplyr::select(., -!!var1))) %>%
    dplyr::bind_rows(dplyr::bind_cols(!!rlang::quo_name(var1) := "Total", dplyr::summarize_if(., is.numeric, sum)))
}

#16

Another option for expansion instead of a self-join is a grouped tidyr::expand:

library(tidyverse)

dat <- tibble(passenger_id = c(1, 1, 2, 2, 3, 3, 3, 4), 
              destination = c("China", "Japan", "England", "US", "Canada", "China", "US", "Japan"))

dat %>% 
    group_by(passenger_id) %>% 
    expand(origin = destination, destination) %>%    # connect nodes
    ungroup() %>% select(-passenger_id) %>%    # clean up 
    count(origin, destination) %>%    # aggregate duplicates
    spread(origin, n, fill = 0L)    # to wide form
#> # A tibble: 5 x 6
#>   destination Canada China England Japan    US
#>   <chr>        <int> <int>   <int> <int> <int>
#> 1 Canada           1     1       0     0     1
#> 2 China            1     2       0     1     1
#> 3 England          0     0       1     0     1
#> 4 Japan            0     1       0     2     0
#> 5 US               1     1       1     0     2

or equivalently,

dat %>% 
    group_by(passenger_id) %>% 
    expand(origin = destination, destination) %>% 
    janitor::crosstab(destination, origin)

Bigger picture, this is a graph, so tidygraph may be helpful for further analysis.