Finding all Universal Combinations in an Occurrence Matrix

mnty89 · March 21, 2018, 1:37pm

Hi everyone,

I'm a beginning R learner, and I've reached a point where I can't figure out the next step.

I have an occurrence matrix built from a set of sample data. The data represent a series of stores and the items held at each store. There are over 6000 unique store locations and around 400 unique items. The occurrence matrix records whether each item is held at each store. In the sample occurrence matrix, the first column is just the auto-generated row identifier, the second is the store ID number, and the first row is the item list. The matrix is binary: 1 means the store/item combination is valid (the store holds the item), and 0 means it is not.

What I want from this data is a way to set a required range of stores and items (e.g., show me solutions with more than 50 and fewer than 3000 stores, and more than 30 and fewer than 200 items) and then have a script loop through and output the possible universal item/store combinations. By "universal" I mean: only show me lists of unique items and stores where all of those items are held in all of those stores, and where both counts fall within the range I set. The attached sample solution, for example, is a list of items and stores where each item is held at all of those stores. Ideally, every item/store combination that meets the range criteria would be output as a separate data frame.
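To make the requirement concrete, here is a minimal sketch of the check itself on a made-up 3 x 3 matrix. The names `occ` and `is_universal` are hypothetical and only for illustration; the real data would have far more rows and columns.

```r
# Toy binary occurrence matrix (made-up data): rows = stores, columns = items
occ <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("a", "b", "c"))
)

# A store set and an item set are "universal" when every cell
# in the corresponding sub-matrix is 1
is_universal <- function(occ, stores, items) {
  all(occ[stores, items, drop = FALSE] == 1)
}

is_universal(occ, c("s1", "s2", "s3"), c("a", "b"))  # TRUE
is_universal(occ, c("s1", "s2"), c("a", "c"))        # FALSE: s1 does not hold c
```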

Any ideas how to write a script capable of this?

Link to the sample occurrence matrix csv: http://www44.zippyshare.com/v/sHIgseSS/file.html
Link to a sample solution csv: http://www83.zippyshare.com/v/DUIEHopG/file.html

mara · March 21, 2018, 2:23pm

Hi mnty,

Can you share what you have so far?

The best way to do this is with a reprex (short for minimal reproducible example). Take heed of the minimal part here. Even though this is going to involve 6,000 stores and 400 items in the end, it's best to start out small, and with "dummy" data that others can look at too. It'll take less time to iterate through versions (i.e. you'll fail fast), and it will help others help you if they don't have to import a large dataset.
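As a rough illustration of what "small dummy data" could look like for this problem, the sketch below builds a four-store, four-item example in base R. The store and item labels are invented, and base R's `table()` is used here only to keep the example dependency-free; it is not the method used elsewhere in this thread.

```r
# Made-up long-format data: one row per (store, item) pair that exists
toy <- data.frame(
  store = c("s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s4"),
  item  = c("a",  "b",  "c",  "a",  "b",  "d",  "a",  "b",  "c"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert counts to a plain 0/1 integer matrix
# (rows = stores, columns = items)
counts <- table(toy$store, toy$item)
occ <- matrix(
  as.integer(counts > 0),
  nrow = nrow(counts),
  dimnames = list(rownames(counts), colnames(counts))
)
occ
```

A dataset this size can be pasted straight into a post, and any answer can be checked by eye.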

Right now the best way to install reprex is:

    # install.packages("devtools")
    devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page, and take a look at the reprex dos and don'ts.

For pointers specific to the community site, check out the reprex FAQ, linked to below.

FAQ: What's a reproducible example (`reprex`) and how do I create one?

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution.

What's in a Reproducible Example? Parts of a reproducible example:
background information: Describe what you are trying to do. What have you already done?
complete set up: include any library() calls and data to reproduce your issue.
data for a reprex: Here's a discussion on setting up data for a reprex
make it run: include the minimal code required to reproduce your error on the data…

mnty89 · March 21, 2018, 6:50pm

Thanks for your input! Here is the code I have so far. I'm using HapEstXXR, a package that was recommended to me for this application. The results it produces are actually what I want, with two exceptions.
1: It can only handle 15 "stores". I need a solution that scales so I can apply it to the rest of my data.
2: The results list the same "store" multiple times, but ideally I'd want a unique list.

The only parts I omitted from the pasted code are the getwd()/setwd() calls I use to load the sample data file.

Here are the smaller data set and its associated occurrence matrix file.

    #install.packages('HapEstXXR')
    library("HapEstXXR")
    #install.packages("data.table")
    library(data.table)
    
    sample_data <- fread("sample_data_small.csv", header = TRUE)
    # store/item combinations algorithm
    
    # extract the store and item columns as plain vectors
    # ([[ returns a vector; sample_data[ , 1] would return a one-column data.table)
    store <- sample_data[[1]]
    item <- sample_data[[2]]
    
    # occurrence matrix- grid of store items and if they take it or not..
    occurrence_matrix <- dcast(sample_data, store ~ item, 
                               fun.aggregate = length,
                               value.var = "item")
    write.csv(occurrence_matrix, file="OccMat.csv")
    occurrence_matrix
    #occurrence_array <- reshape2::acast(sample_data, store ~ item ~ item), 
    #                            fun.aggregate = length,
    #                            value.var = 'item')
    
    
    # row and column sums show how popular the most popular stores and items are
    # popular items (number of stores holding each item)
    #colSums(occurrence_matrix[ , !"store"])
    ## how popular is the most popular item?
    #max(colSums(occurrence_matrix[ , !"store"]))
    
    # same view
    sample_data[ , uniqueN(store) , by = .(item)][ , max(V1)]
    
    # popular stores (number of items held at each store)
    #rowSums(occurrence_matrix[ , !"store"])
    # item count at the largest store
    #max(rowSums(occurrence_matrix[ , !"store"]))
    # same view
    sample_data[ , uniqueN(item) , by = .(store)][ , max(V1)]
    
    #all possible sets of stores- idk if can handle
    
    
    unique_stores <- sample_data[ , unique(store)]
    all_store_combns <- HapEstXXR::powerset(unique_stores) 
    #  might be too many results?
    #how to trim down ahead of time?
    
    # keep only sets with more than 3 and at most 10 stores
    set_sizes <- lengths(all_store_combns)
    all_store_combns <- all_store_combns[set_sizes > 3 & set_sizes <= 10]
    
    all_store_combns
    
    # name each set by its sorted store IDs; the "_" separator keeps names unambiguous
    names(all_store_combns) <- sapply(all_store_combns, function(i) paste0(sort(i), collapse = "_"))
    
    result <- sapply(all_store_combns, USE.NAMES = TRUE, simplify = FALSE,
                     function(store_set) {
                       # subset down to relevant store subset
                       x <- sample_data[store %in% store_set]
                       # count how many stores each item is represented in
                       x[, cnt := uniqueN(store) , by = item]
                       # NB: if there are no duplicate rows, x[, cnt := .N , by = item]
                       # gives the same counts more efficiently
                       # remove items that aren't present in all stores
                       x1 <- x[ cnt == length(store_set) ]
                       
                       # output
                       x1
                       
                     })
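One possible way around the 15-store limit, offered as a sketch under assumptions rather than a tested solution for the full data, is to search over item sets instead of store sets. Every item added can only shrink the set of stores that hold all of the items, so a branch can be dropped as soon as that store set falls below the minimum. The function name `find_universal` and its arguments are hypothetical, the matrix is toy data, and the search can still grow quickly when many item sets satisfy the store minimum.

```r
# Toy binary occurrence matrix (made-up data): rows = stores, columns = items
occ <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 1,
    1, 1, 1, 0,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("a", "b", "c", "d"))
)

# Hypothetical helper: depth-first search over item sets.
# `stores` always holds the row indices of stores that hold every item in `items`.
find_universal <- function(occ, min_stores, max_stores, min_items, max_items) {
  out <- list()
  n_items <- ncol(occ)

  grow <- function(items, stores, next_j) {
    if (length(items) >= min_items && length(stores) <= max_stores) {
      # Report only "closed" sets: no further item is held by all of `stores`
      shared <- which(colSums(occ[stores, , drop = FALSE]) == length(stores))
      if (length(shared) == length(items)) {
        out[[length(out) + 1]] <<- list(
          stores = rownames(occ)[stores],
          items  = colnames(occ)[items]
        )
      }
    }
    if (length(items) == max_items || next_j > n_items) return(invisible(NULL))
    for (j in next_j:n_items) {
      keep <- stores[occ[stores, j] == 1]
      # Prune: too few stores hold all items once item j is added
      if (length(keep) >= min_stores) grow(c(items, j), keep, j + 1)
    }
  }

  grow(integer(0), seq_len(nrow(occ)), 1)
  out
}

res <- find_universal(occ, min_stores = 2, max_stores = 4,
                      min_items = 2, max_items = 3)
length(res)  # 3 combinations on this toy matrix
```

Each element lists every store once, and `expand.grid(store = res[[1]]$stores, item = res[[1]]$items)` would turn one result into a data frame. Because only closed sets are reported, smaller item or store subsets of a reported combination, which are also universal, are not listed separately.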