I am working on a problem and can't work out which logic to use for it. The description is below.
Coupon title    Product Code
Red Apple       123,345,456
Green Apple     123,234,456
Let's suppose I have 4 coupon titles (Apple, Apples, Red Apple and Green Apple). One coupon title can have multiple products; say coupon title Apple has 4 products (123, 234, 345, 456).
Now, if any coupon titles share the same product codes, they should be treated as the same coupon title.
For example, coupon titles Apple and Apples have product codes in common:
123, 234 and 345 match, so they are the same.
In my example above, all the coupon titles should be the same, because they share some matching product codes.
Let's say my matching threshold is 90%. (That is, if two titles match at least 90%, I will consider them the same; otherwise not.)
So, if I implement this in R, which logic should I use?
Is there a package that can help me find these matches and also give me the matching percentage (how closely they match)?
Please help me solve this problem.
Any suggestions and advice would be much appreciated.
Could you please turn this into a self-contained reprex (short for minimal reproducible example)?
Even if you're unsure of the logic, at the very least, it will save someone the hassle of having to turn the Coupon/Product Code setup into an R-readable format, which, regardless of how you go about solving the problem, will need to be done.
As @mara said, it would be helpful to have a reprex to get things going, but since you've asked for general advice as well, I think I would have done it using sets.
Each element in the "Product Code" column is a set that you want to compare with every other set in the column. You can therefore use, e.g., purrr::map and pass a function that takes in a vector of sets. Inside this function you use one of base R's set functions (look up ?union in R; that help page covers union, intersect, setdiff, setequal and is.element). Then you can apply your threshold to filter the "Coupon title" column and return the similar elements from there.
Of course, beware that this is an O(n^2) operation, so if your dataset is very big, it might take a while.
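To make the "cross every set with every other set" idea concrete, here is a minimal base R sketch. It only uses the three titles whose codes are spelled out in the question, and the names `skus`, `pairs` and `shared` are made up for illustration. It counts shared codes per pair rather than using purrr, so treat it as one possible way to do the crossing, not the only one.

```r
# Product-code sets, keyed by coupon title (codes taken from the question).
skus <- list(
  "Apple"       = c(123, 234, 345, 456),
  "Red Apple"   = c(123, 345, 456),
  "Green Apple" = c(123, 234, 456)
)

# Every unordered pair of titles: n * (n - 1) / 2 pairs, hence O(n^2).
pairs <- combn(names(skus), 2, simplify = FALSE)

# Number of product codes each pair has in common.
shared <- vapply(
  pairs,
  function(p) length(intersect(skus[[p[1]]], skus[[p[2]]])),
  integer(1)
)
names(shared) <- vapply(pairs, paste, character(1), collapse = " / ")

shared
#> Apple / Red Apple   Apple / Green Apple   Red Apple / Green Apple
#>                 3                     3                         2
```

Turning these counts into a percentage depends on which denominator you pick, which is the definitional question raised next.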
Also you need to be precise about what you mean by the "percent difference" between two sets. I don't think there is a common definition for the percent difference between two sets... maybe there is one, I just don't know.
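For what it's worth, two measures that are commonly used for this are the Jaccard similarity (shared elements divided by all distinct elements) and the overlap coefficient (shared elements divided by the size of the smaller set). The sketch below compares them on one pair from the question. Which one fits the coupon problem is an assumption you would have to make yourself.

```r
a <- c(123, 234, 345, 456)  # "Apple", from the question
b <- c(123, 345, 456)       # "Red Apple", from the question

n_shared <- length(intersect(a, b))

# Jaccard similarity: shared codes / all distinct codes across both sets.
jaccard <- n_shared / length(union(a, b))

# Overlap coefficient: shared codes / size of the smaller set.
overlap <- n_shared / min(length(unique(a)), length(unique(b)))

100 * jaccard  # 75
100 * overlap  # 100
```

The same pair passes a 90% threshold under one definition and fails it under the other, so the choice matters.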
Here is an example of a reprex building your data in a way we can easily work with... but, of course this isn't the only way to do this.
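# show some sample data by constructing it
library(tibble)
pt <- tribble(
  ~product, ~skus,
  "Apple", c(123,234,345,456),
  "Apples", c(123,234,345), # assumed codes: the question only says 123,234,345 match
  "Red Apple", c(123,345,456),
  "Green Apple", c(123,234,456)
)
pt
#> # A tibble: 4 x 2
#>   product     skus
#>   <chr>       <list>
#> 1 Apple       <dbl [4]>
#> 2 Apples      <dbl [3]>
#> 3 Red Apple   <dbl [3]>
#> 4 Green Apple <dbl [3]>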
Using the set operations with a clear definition of what you mean by "percent difference" between sets gives you the basics you need for what you are trying to do.
Here is an example of a function that calculates a "percent difference" between sets. It may not be a sensible "percent difference" for what you are doing but it does give you a template for a way to go about making this calculation.
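A minimal sketch of such a function, assuming "percent difference" means the share of distinct codes that the two sets do not have in common (that is, 100 minus the Jaccard similarity in percent). The name `pct_diff` and that definition are illustrative assumptions, not something taken from a package.

```r
# Hypothetical helper: percentage of distinct codes NOT shared by a and b.
pct_diff <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  all_codes <- union(a, b)
  if (length(all_codes) == 0) return(0)  # two empty sets count as identical
  not_shared <- length(all_codes) - length(intersect(a, b))
  100 * not_shared / length(all_codes)
}

red   <- c(123, 345, 456)  # "Red Apple", from the question
green <- c(123, 234, 456)  # "Green Apple", from the question

pct_diff(red, green)        # 50
100 - pct_diff(red, green)  # 50, i.e. a 50% match

# Apply the 90% matching threshold from the question.
is_same <- (100 - pct_diff(red, green)) >= 90
is_same                     # FALSE
```

Under this definition Red Apple and Green Apple would not be merged at a 90% threshold, so it is worth checking the definition against the results you expect before applying it to the whole table.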