As @mara and @mishabalyasin suggested, you should include a reprex in your questions.
A prose description isn't sufficient; you also need to make a simple reprex that:
- Builds the input data you are using.
- Includes the function you are trying to write, even if it doesn't work.
- Shows usage of the function you are trying to write, even if it doesn't work.
- Builds the output data you want the function to produce.
You can learn more about reprexes here:
Right now there is an issue with the version of reprex that is on CRAN, so you should download it directly from GitHub.
Until CRAN catches up with the latest version, install reprex with:
devtools::install_github("tidyverse/reprex")
The reason we ask for a reprex is that it is the easiest and quickest way for someone to understand the issue you are running into and answer it.
Nearly everyone here who is answering questions is doing it on their own time and really appreciates anything you can do to minimize that time.
You are working with sets, in your case sets of numbers, so you should check out the R docs for its support of set operations.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html
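As a quick illustration of those base R set functions, here is a minimal sketch. The vectors `a` and `b` are made-up SKU lists chosen for this example, not data from the original question.

```r
# Two hypothetical SKU vectors (illustrative values only)
a <- c(123, 234, 345, 456)
b <- c(123, 234, 345)

union(a, b)         # every SKU that appears in either vector
intersect(a, b)     # SKUs common to both vectors
setdiff(a, b)       # SKUs in a but not in b
setdiff(b, a)       # SKUs in b but not in a (empty here)
is.element(456, b)  # is a single SKU a member of b?
```

Note that `setdiff()` is not symmetric, so you usually need to look at both directions when comparing two sets.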
Also you need to be precise about what you mean by the "percent difference" between two sets. I don't think there is a common definition for the percent difference between two sets... maybe there is one, I just don't know.
Here is an example of a reprex building your data in a way we can easily work with, though of course this isn't the only way to do it.
suppressPackageStartupMessages(library(tidyverse))
# show some sample data by constructing it
pt <- tribble(
  ~product,      ~skus,
  "Apple",       c(123,234,345,456),
  "Apples",      c(123,234,345),
  "Red Apple",   c(123,345,456),
  "Green Apple", c(123,234,456),
)
pt
#> # A tibble: 4 x 2
#> product skus
#> <chr> <list>
#> 1 Apple <dbl [4]>
#> 2 Apples <dbl [3]>
#> 3 Red Apple <dbl [3]>
#> 4 Green Apple <dbl [3]>
Created on 2018-03-16 by the reprex package (v0.2.0).
Using the set operations with a clear definition of what you mean by "percent difference" between sets gives you the basics you need for what you are trying to do.
Here is an example of a function that calculates a "percent difference" between sets. It may not be a sensible "percent difference" for what you are doing, but it does give you a template for making this kind of calculation.
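percent_diff <- function(s1, s2) {
  # elements unique to each set
  d1 <- setdiff(s1, s2)
  d2 <- setdiff(s2, s1)
  # each count of unique elements, relative to the size of the other set
  s1percent_diff <- length(d1) / length(s2)
  s2percent_diff <- length(d2) / length(s1)
  # average the two ratios and express as a percentage
  100 * (s1percent_diff + s2percent_diff) / 2
}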
percent_diff(c(2,1,4), c(5,1,3,2))
#> [1] 45.83333
Created on 2018-03-16 by the reprex package (v0.2.0).
Once you have a function that calculates what you want the percent difference to be, you can use @mishabalyasin's suggestion to find the products that are similar to each other.
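For illustration only, here is one way such a pairwise comparison could be sketched in base R. This is an assumption about the general shape of the approach, not necessarily what @mishabalyasin proposed. The names `skus`, `prod_pairs`, `result`, and `threshold` (and its value of 20) are hypothetical, and `percent_diff()` is repeated in condensed form so the block runs on its own.

```r
# Same calculation as percent_diff() above, repeated so this block is self-contained
percent_diff <- function(s1, s2) {
  d1 <- setdiff(s1, s2)
  d2 <- setdiff(s2, s1)
  100 * (length(d1) / length(s2) + length(d2) / length(s1)) / 2
}

# The sample products as a named list of SKU vectors
skus <- list(
  "Apple"       = c(123, 234, 345, 456),
  "Apples"      = c(123, 234, 345),
  "Red Apple"   = c(123, 345, 456),
  "Green Apple" = c(123, 234, 456)
)

# Every unordered pair of product names: a 2 x 6 character matrix
prod_pairs <- combn(names(skus), 2)

# Percent difference for each pair (one value per column of prod_pairs)
pd <- apply(prod_pairs, 2, function(p) percent_diff(skus[[p[1]]], skus[[p[2]]]))

result <- data.frame(
  product_1    = prod_pairs[1, ],
  product_2    = prod_pairs[2, ],
  percent_diff = pd
)

# Hypothetical cutoff: treat pairs at or below this value as "similar"
threshold <- 20
result[result$percent_diff <= threshold, ]
```

With this sample data and this particular definition, the three pairs that involve "Apple" fall under the cutoff, while the remaining pairs do not. A different definition of "percent difference" or a different threshold would change which products count as similar.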