Hello, could someone please help me on how I can create a frequency table based on two variables?
I have a dataset for passenger travel destinations. The first variable is the passenger number and the second is the destinations visited. I'm looking to find what destinations and combinations are most popular.
Have you looked at the widyr package by @drob? If you think of passenger_ids as strings, I think the Counting and correlating pairs of words with the widyr package section from @drob and @julia's Text Mining with R might be easily adapted.
Oh, also worth looking at corrr by @drsimonj
I think the figure he uses to illustrate what corrr does is a great way of visualizing the transformation
I found the example little bit confusing too. I found interesting the task of counting frequency of the pairs, so here is my solution, using widyr, mentioned above (thank you for that, @mara!). So it is not what you need, but it is nice to have this in the same tread.
library(dplyr); library(widyr)
dat %>%
pairwise_count(destination, passenger_id) %>%
spread(item2, n, fill = 0)
# A tibble: 5 x 6
item1 Canada China England Japan US
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Canada 0 1 0 0 1
2 China 1 0 0 1 1
3 England 0 0 0 0 1
4 Japan 0 1 0 0 0
5 US 1 1 1 0 0
end up creating a new table in memory? Cross joins can get to be pretty big fast. Maybe R does something in this case to minimize the memory usage in a pipeline like this?
This script skips the select() step and just changes the names attribute of transformed_dat at the end so it doesn't make a new table. If you don't care about the first column name just drop the part that changes the column name.
suppressPackageStartupMessages(library(tidyverse))
dat <- tribble(
~passenger_id, ~destination,
1, "China",
1, "Japan",
2, "England",
2, "US",
3, "Canada",
3, "China",
3, "US",
4, "Japan"
)
transformed_dat <-
left_join(dat, dat, by = "passenger_id") %>%
group_by(destination.x, destination.y) %>%
summarize(Count = n()) %>%
spread(destination.x, Count, fill = 0)
ns <- names(transformed_dat)
ns[[1]] <- "Rows"
names(transformed_dat) <- ns
transformed_dat
#> # A tibble: 5 x 6
#> Rows Canada China England Japan US
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Canada 1 1 0 0 1
#> 2 China 1 2 0 1 1
#> 3 England 0 0 1 0 1
#> 4 Japan 0 1 0 2 0
#> 5 US 1 1 1 0 2
at the time of merge, you get two columns called destination (with the appended suffix .x and .y to tell them apart).
select is choosing columns from the dataset at this point in the processing, and renaming the destination.x column as Columns and the destination.y column as Rows.
Playing around with dplyr:::dfloc and select indicates that the table column memory locations don't change when I use select, so I think it's a safe operation. Which is good, as I tend to sprinkle select statements in my code fairly liberally to drop columns that are no longer needed and rename ones that have awkward names.
I do a select on tbl1 and get a new table with new column names, but when I examine tbl1 after that I see tbl1 still has the old names.
With some clever copy on write semantics you can see behavior like this where behind the scenes both tables are using the same memory for data but different memory for attributes. And some operating systems will do this for you without even asking. For example on Windows memory is copy on write.
I'll have to dig into that article now and see if I can figure out how to reconcile my little experiment with what the article says.
Yup a select doesn't copy the columns, they are in the same memory location before and after... but make a change to one of the tables and the column(s) are the copied to another location in memory. So it is using copy on write semantics. My guess it that is being provided by the operating system maybe with some help from R? But in any case that is good to know.. Notice that only one column was moved not both. Thanks for the pointer @nick
suppressPackageStartupMessages(library(tidyverse))
dfloc <- dplyr:::dfloc
tbl1 <- tibble(a = 1:3, b=4:6)
dfloc(tbl1)
#> a b
#> "0x7f83db3efb58" "0x7f83de06b788"
tbl2 <- tbl1 %>% select( a1=a, b1=b)
dfloc(tbl1)
#> a b
#> "0x7f83db3efb58" "0x7f83de06b788"
dfloc(tbl2)
#> a1 b1
#> "0x7f83db3efb58" "0x7f83de06b788"
# change one of the tables and somebody's columns get moved
tbl1[[1,1]] <- "xyz"
dfloc(tbl1)
#> a b
#> "0x7f83de0eeec8" "0x7f83de06b788"
dfloc(tbl2)
#> a1 b1
#> "0x7f83db3efb58" "0x7f83de06b788"