How to create a data frame for apriori?

kinzia · March 13, 2020, 8:52am

Hello, Im new in rStudio and self-learning with Yelp dataset

Learning how to use Apriori and preparing the data frame for it but no idea how to make it well

the data is something like this:
uid=c("u1","u2","u3","u4","u5")
ubid=c("u1","u1","u1","u1","u2","u2","u5","u5","u5","u5","u5","u3")
tbid=c("b1","b1","b2","b3","b3","b1","b4","b1","b5","b6","b7","b8")
pTable <- data.frame(uid)
dTable <- data.frame(ubid,tbid)

and I want to create this table
--- | v1 | v2 | v3 | v4 | v5
u1 | b1 | b1 | b2 | b3 | NA
u2 | b3 | b1 | NA | NA | NA
u3 | b8 | NA | NA | NA | NA
u4 | NA | NA | NA | NA | NA
u5 | b5 | b6 | b7 | b8 | NA

any idea to create this table?

pieterjanvc · March 13, 2020, 11:40am

Hi,

Welcome to the RStudio community!

Here is my implementation using some nice functions from the Tidyverse:

CODE

library(tidyr)
library(dplyr)

options(stringsAsFactors = F)

uid=c("u1","u2","u3","u4","u5")
ubid=c("u1","u1","u1","u1","u2","u2","u5","u5","u5","u5","u5","u3")
tbid=c("b1","b1","b2","b3","b3","b1","b4","b1","b5","b6","b7","b8")

myTable = data.frame(ubid = ubid, tbid = tbid) %>% 
  group_by(ubid) %>% mutate(order = 1:n()) %>% ungroup() %>% 
  pivot_wider(names_from = order, values_from = tbid) %>% 
  add_row(ubid = setdiff(uid, ubid)) %>% 
  arrange(ubid)

RESULT

# A tibble: 5 x 6
  ubid  `1`   `2`   `3`   `4`   `5`  
  <chr> <chr> <chr> <chr> <chr> <chr>
1 u1    b1    b1    b2    b3    NA   
2 u2    b3    b1    NA    NA    NA   
3 u3    b8    NA    NA    NA    NA   
4 u4    NA    NA    NA    NA    NA   
5 u5    b4    b1    b5    b6    b7

Note: In your table, you messed up the values for u5 according to the provided data

EXPLANATION
Let me explain in a bit more detail:
Note that I set stringsAsFactors = F to prevent errors when manipulating the character data

I first created a data frame with the 2 longest vectors

data.frame(ubid = ubid, tbid = tbid)
 ubid tbid
1    u1   b1
2    u1   b1
3    u1   b2
...

Then I labelled the order of each of the values of tbid as they appear per ubid using the group_by function

data.frame(ubid = ubid, tbid = tbid) %>% 
group_by(ubid) %>% mutate(order = 1:n()) %>% ungroup() 

# A tibble: 12 x 3
   ubid  tbid  order
   <chr> <chr> <int>
 1 u1    b1        1
 2 u1    b1        2
 3 u1    b2        3
 4 u1    b3        4
 5 u2    b3        1
 6 u2    b1        2
...

Next, I used the pivot_wider to spread out the data based on the newly created order values (they become column names). Missing values will be automatically filled in with NA

data.frame(ubid = ubid, tbid = tbid) %>% 
group_by(ubid) %>% mutate(order = 1:n()) %>% ungroup() %>% 
pivot_wider(names_from = order, values_from = tbid)

# A tibble: 4 x 6
  ubid  `1`   `2`   `3`   `4`   `5`  
  <chr> <chr> <chr> <chr> <chr> <chr>
1 u1    b1    b1    b2    b3    NA   
2 u2    b3    b1    NA    NA    NA   
3 u5    b4    b1    b5    b6    b7   
4 u3    b8    NA    NA    NA    NA

Given "u4" had no data, the previous code did not take it into account so we finally end up adding empty rows for all missing ubid (using the setdiff function to find missing ones) and sort the final result. (output is at top).

If you are not familiar with the Tidiverse and its functions, I recommend you check this out:

Hope this helps,
PJ

system · April 3, 2020, 11:40am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.