How to sample data based on a function with arguments reading values from another dataset

Loffi · February 20, 2023, 4:23pm

Hi all, please assist with potential guidance steps for the following.

I have two datasets, A and B. A contains of three variables: business name, product type and quantity sold. In B we have business name, product type, SKU (unique identifier of the purchase).

I would like to create a function that iterates over A and extracts a (sample) list of SKUs related to quantity for each row in A. For instance if business X offered z type of product 10 times (read 10 sales), would like to select say 7 items randomly. The number is provided as quantity in dataset A. That is assuming my POS does not capture specific SKUs when a sale happens.

I can provide more details and real data if not clear enough.

Thanks in advance.

Sample data can be found in the following link: Sample dataset A & B.xlsx - Google Sheets

technocrat · February 20, 2023, 9:47pm

Representative data would help, please.

Loffi · February 21, 2023, 7:22pm

I have attached as requested. Thanks.

technocrat · February 22, 2023, 8:31am

That's not going to be possible with the data provided.

nirgrahamuk · February 22, 2023, 11:35am

Here is a possible approach to the general task;

library(tidyverse)

frst <- tibble::tribble(
  ~business.name, ~product.type, ~quantity.sold,
             "A",           "x",             3L,
             "B",           "y",             5L,
             "C",           "z",             4L,
             "C",           "x",             2L
  )

scnd <- tibble::tribble(
  ~business.name, ~product.type,       ~SKU,
             "A",           "x", 255353845L,
             "A",           "x", 255356319L,
             "A",           "x", 255362751L,
             "A",           "x", 255370204L,
             "A",           "x", 255454003L,
             "A",           "x", 255458362L,
             "A",           "x", 255593042L,
             "A",           "x", 255636116L,
             "A",           "x", 255717549L,
             "B",           "y", 255844018L,
             "B",           "y", 255909292L,
             "B",           "y", 255915742L,
             "B",           "y", 255919530L,
             "B",           "y", 255923843L,
             "B",           "y", 255933049L,
             "B",           "y", 255936045L,
             "B",           "y", 255941696L,
             "B",           "y", 255946809L,
             "B",           "y", 255949532L,
             "B",           "y", 255966127L,
             "B",           "y", 255997292L,
             "C",           "z", 255998831L,
             "C",           "z", 256001327L,
             "C",           "z", 256003617L,
             "C",           "z", 256005181L,
             "C",           "z", 256011669L,
             "C",           "z", 256013837L,
             "C",           "z", 256017545L,
             "C",           "z", 256019818L,
             "C",           "z", 256036093L,
             "C",           "z", 256040208L,
             "C",           "z", 256051672L,
             "C",           "x", 256138309L,
             "C",           "x", 256161535L,
             "C",           "x", 256189403L,
             "C",           "x", 256227739L,
             "C",           "x", 256236233L,
             "C",           "x", 256240570L,
             "C",           "x", 256245429L,
             "C",           "x", 256662283L,
             "C",           "x", 256671028L,
             "C",           "x", 256672339L,
             "C",           "x", 256678241L,
             "C",           "x", 256683024L,
             "C",           "x", 256718422L,
             "C",           "x", 256779953L
  )


map_dfr(1:nrow(frst), \(x_){ 
  row <- slice(frst,x_)
  target <- row$quantity.sold
  candidates <- row |> left_join(scnd,
                                 multiple="all")
  
  candidates_sample <- slice_sample(candidates,
                                   n = target)
  candidates_sample
  })

Loffi · March 2, 2023, 6:10am

Truly you are a star! Thank you so much. It worked.

Kindly allow me to put the explanation for what the code does so that it may benefit someone in the future that might be facing a similar challenge.

This code is using the map_dfr() function from the purrr package in R to iterate over each row of a data frame called frst. For each row, it selects a specific number of rows from another data frame called scnd based on the value of a column in the frst data frame.

Here is a breakdown of each line of code:

map_dfr(1:nrow(frst), \(x_){ ... }): This is calling the map_dfr() function with two arguments. The first argument is a sequence of integers from 1 to the number of rows in the frst data frame (nrow(frst)). The second argument is an anonymous function (a function without a name) that takes a single argument (x_). The map_dfr() function applies this anonymous function to each integer in the sequence and combines the results into a data frame.

row <- slice(frst,x_): This selects the x_-th row of the frst data frame and assigns it to a variable called row. The slice() function is used to select specific rows from a data frame based on their position.

target <- row$quantity.sold: This extracts the value of the quantity.sold column from the row data frame and assigns it to a variable called target.

candidates <- row |> left_join(scnd, multiple=“all”): This selects all columns from the row data frame and joins them with all columns from the scnd data frame based on a matching column(s). The |> operator is used to pipe the row data frame into the left_join() function. The multiple=“all” argument is used to retain all matches between the two data frames.

candidates_sample <- slice_sample(candidates, n = target): This selects a random sample of target rows from the candidates data frame using the slice_sample() function from the dplyr package.

candidates_sample: This returns the resulting candidates_sample data frame as the output of the anonymous function passed to map_dfr().

Overall, this code is useful for selecting a random sample of rows from a data frame based on the values in another column of the same or a different data frame.

system · March 23, 2023, 6:11am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.