How to find out matching number within some threshold limit


#1

Hi all,
I am stuck on a problem and can't work out which logic I should use for it. Below is the description:

Coupon title                       Product Code
---------------                    ----------------------
Apple                              123,234,345,456
Apples                             123,234,345
Red Apple                          123,345,456
Green Apple                        123,234,456

Let's suppose I have 4 coupon titles (Apple, Apples, Red Apple and Green Apple). One coupon title can have multiple products. For example, the coupon title Apple has 4 products (123, 234, 345, 456).

Now, if coupon titles share the same product codes, then they are the same coupon title.
For example, the coupon titles Apple and Apples have product codes in common:
123, 234 and 345 match, so they are the same.

In my example above, all the coupon titles should be treated as the same, because they share some matching product codes.

Let's say my matching threshold is 90%. (That means if two titles match at least 90%, I will consider them the same, otherwise not.)

So, if I implement this in R, which logic should I use?

Is there a package that can help me find these matches and also give me the matching percentage (how much they match)?

Please help me solve this problem.

Any suggestions and advice are really appreciated.


#2

Could you please turn this into a self-contained reprex (short for minimal reproducible example)?

Even if you're unsure of the logic, at the very least, it will save someone the hassle of having to turn the Coupon/Product Code setup into an R-readable format, which, regardless of how you go about solving the problem, will need to be done.

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.


#3

As @mara said, it would be helpful to have a reprex to get things going, but since you've asked for general advice as well, I think I would have done it using sets.

Each element in the "Product Code" column is a set that you want to compare with every other set in the column. Therefore, you can use, e.g., purrr::map and pass a function that takes in a vector of sets. Inside this function you use one of the base R set functions (e.g., look up ?union in R; the help page covers it and the related functions). Then you can use your threshold to filter the "Coupon title" column and return the similar elements from there.

Of course, beware that this is an O(n^2) operation, so if your dataset is too big, it might take a while.
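
To make the idea a bit more concrete, here is a rough sketch of the pairwise comparison. The `codes` list and the `overlap_pct` helper are names made up for illustration, and dividing by the size of the smaller set is only one assumption about what "matching" could mean, so adjust it to your own definition.

```r
library(purrr)

# Product codes per coupon title, taken from the question
codes <- list(
  "Apple"       = c(123, 234, 345, 456),
  "Apples"      = c(123, 234, 345),
  "Red Apple"   = c(123, 345, 456),
  "Green Apple" = c(123, 234, 456)
)

# Assumed definition of "matching" (hypothetical helper):
# shared codes as a percentage of the smaller set
overlap_pct <- function(a, b) {
  100 * length(intersect(a, b)) / min(length(a), length(b))
}

# All unordered pairs of titles: n * (n - 1) / 2 comparisons
pairs <- as.data.frame(t(combn(names(codes), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("title_a", "title_b")

# Score each pair by looking up both sets by title
pairs$match_pct <- map2_dbl(
  pairs$title_a, pairs$title_b,
  function(a, b) overlap_pct(codes[[a]], codes[[b]])
)

# Keep only the pairs at or above the threshold
threshold <- 90
matches <- pairs[pairs$match_pct >= threshold, ]
matches
```

With this particular definition, only the three pairs that involve Apple reach 90%, because each of the shorter sets is fully contained in Apple's set. A different definition will give different answers.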


#4

As @mara and @mishabalyasin suggested, you should include a reprex in your questions.

A prose description isn't sufficient. You also need to make a simple reprex that:

  1. Builds the input data you are using.
  2. Includes the function you are trying to write, even if it doesn't work.
  3. Shows usage of the function you are trying to write, even if it doesn't work.
  4. Builds the output data you want the function to produce.

You can learn more about reprexes here:

Right now there is an issue with the version of reprex that is on CRAN, so you should download it directly from GitHub.

Until CRAN catches up with the latest version, install reprex with

devtools::install_github("tidyverse/reprex")

The reason we ask for a reprex is that it is the easiest and quickest way for someone to understand the issue you are running into and answer it.

Nearly everyone here who is answering questions is doing it on their own time, and they really appreciate anything you can do to minimize that time.

You are working with sets, in your case sets of numbers, so you should check out the R docs for its support of set operations.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html

Also you need to be precise about what you mean by the "percent difference" between two sets. I don't think there is a common definition for the percent difference between two sets... maybe there is one, I just don't know.
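
For what it's worth, one widely used way to score how similar two sets are is the Jaccard index: the size of the intersection divided by the size of the union. Whether it is the right definition for the coupon problem is an open question, so treat the sketch below as one candidate. The `jaccard_pct` name is made up for illustration.

```r
# Jaccard similarity as a percentage:
# size of the intersection divided by size of the union
jaccard_pct <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)  # undefined for two empty sets
  100 * length(intersect(a, b)) / length(u)
}

# Apple vs Apples: 3 shared codes out of 4 distinct codes overall
jaccard_pct(c(123, 234, 345, 456), c(123, 234, 345))
#> [1] 75

# Apples vs Red Apple: 2 shared codes out of 4 distinct codes overall
jaccard_pct(c(123, 234, 345), c(123, 345, 456))
#> [1] 50
```

Note that under this definition none of the example pairs would reach a 90% threshold, which is a good reason to pin down the definition before picking a cutoff.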

Here is an example of a reprex building your data in a way we can easily work with... but, of course this isn't the only way to do this.

suppressPackageStartupMessages(library(tidyverse))
# show some sample data by constructing it

pt <- tribble(
    ~product,      ~skus,
    "Apple",       c(123, 234, 345, 456),
    "Apples",      c(123, 234, 345),
    "Red Apple",   c(123, 345, 456),
    "Green Apple", c(123, 234, 456)
)

pt
#> # A tibble: 4 x 2
#>   product     skus     
#>   <chr>       <list>   
#> 1 Apple       <dbl [4]>
#> 2 Apples      <dbl [3]>
#> 3 Red Apple   <dbl [3]>
#> 4 Green Apple <dbl [3]>

Created on 2018-03-16 by the reprex package (v0.2.0).

Using the set operations with a clear definition of what you mean by "percent difference" between sets gives you the basics you need for what you are trying to do.

Here is an example of a function that calculates a "percent difference" between sets. It may not be a sensible "percent difference" for what you are doing but it does give you a template for a way to go about making this calculation.

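percent_diff <- function(s1, s2) {
    # elements found in only one of the two sets
    d1 <- setdiff(s1, s2)
    d2 <- setdiff(s2, s1)
    # each count is scaled by the size of the other set
    s1percent_diff <- length(d1) / length(s2)
    s2percent_diff <- length(d2) / length(s1)
    # average the two fractions and express the result as a percentage
    100 * (s1percent_diff + s2percent_diff) / 2
}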

percent_diff(c(2,1,4), c(5,1,3,2))
#> [1] 45.83333

Created on 2018-03-16 by the reprex package (v0.2.0).

Once you have a function that calculates what you want the percent difference to be, you can use @mishabalyasin's suggestion to find the products that are similar to each other.
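
As a rough sketch of how those two pieces could fit together, the block below applies percent_diff to every pair of rows in pt. The `max_diff` cutoff is a hypothetical value that reads "at least 90% matching" as "at most 10% different". That reading is an assumption and may not be what you want.

```r
suppressPackageStartupMessages(library(tidyverse))

# Same sample data as above
pt <- tribble(
    ~product,      ~skus,
    "Apple",       c(123, 234, 345, 456),
    "Apples",      c(123, 234, 345),
    "Red Apple",   c(123, 345, 456),
    "Green Apple", c(123, 234, 456)
)

# Same template function as above
percent_diff <- function(s1, s2) {
    d1 <- setdiff(s1, s2)
    d2 <- setdiff(s2, s1)
    s1percent_diff <- length(d1) / length(s2)
    s2percent_diff <- length(d2) / length(s1)
    100 * (s1percent_diff + s2percent_diff) / 2
}

# Row indices for every unordered pair of products (2 x 6 matrix here)
idx <- combn(nrow(pt), 2)

# One row per pair, with the percent difference between their sku sets
result <- tibble(
    product_a = pt$product[idx[1, ]],
    product_b = pt$product[idx[2, ]],
    diff      = map2_dbl(pt$skus[idx[1, ]], pt$skus[idx[2, ]], percent_diff)
)

# Hypothetical cutoff: keep pairs that differ by at most 10%
max_diff <- 10
similar <- dplyr::filter(result, diff <= max_diff)

arrange(result, diff)
similar
```

With this particular percent_diff, the closest pairs in the sample data still differ by about 16.7%, so `similar` comes back empty at a 10% cutoff. That is a hint that the definition or the threshold needs tuning for your data.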


#5

Thank you very much @danr. Your suggestion is really helpful, and I am working through your advice. However, I am still trying to get reprex to work, and I am getting the following error:

No input provided and clipboard is not available.

If I can fix it, then I will always post the data for my questions as a reprex.

Thank you very much for your valuable comment.


#6

Thank you, sir, for your comment. I am going to treat the product codes as sets and use the map function, following your instructions.

Thanks a lot sir.