De-duplicate text, retain one original

bj_bloom · May 10, 2018, 1:48pm

I'm working on some natural language processing tasks on a large corpus of documents. After some exploratory analysis I've discovered quite a few duplicate documents. I've been using the great textreuse package to do pairwise comparison of documents using the Jaccard similarity score (see this vignette for more information). The problem I am having is more of a data manipulation problem than a text mining problem.

The output of the lsh_compare function is a data frame with the document IDs as the first two columns and the jaccard similarity score (between 0 and 1 in increasing similarity) as the third column. What I want to do is narrow down the set of documents with a score of 1 and remove duplicates, yet retain one original version of the document. Here is a mock set of data that helps explain what I'm trying to do:

text <- tibble::tribble(~id, ~text,
                        "101a", "apples are nice",
                        "102a", "chocolate is great",
                        "103a", "apples are nice",
                        "104a", "apples are nice",
                        "105a", "chocolate and apples are fine",
                        "106a", "peaches are peachy", 
                        "107a", "chocolate is great", 
                        "108a", "apples are nice",
                        "109a", "Pie is the best though", 
                        "110a", "don't forget ice cream")

similarity <- tibble::tribble(~a, ~b, ~score,
                              "101a", "103a", 1,
                              "101a", "104a", 1,
                              "101a", "108a", 1,
                              "104a", "103a", 1,
                              "104a", "108a", 1,
                              "103a", "108a", 1,
                              "102a", "107a", 1)

result <- tibble::tribble(~id, ~text,
                          "101a", "apples are nice",
                          "102a", "chocolate is great",
                          "105a", "chocolate and apples are fine",
                          "106a", "peaches are peachy", 
                          "109a", "Pie is the best though", 
                          "110a", "don't forget ice cream")

For this, text is the original data (and associated document IDs), similarity is the output of the lsh_compare function from textreuse, and result is what I am looking for (note that I don't care about which specific document ID is returned, as long as I get one copy of the original text).

I have found a (perhaps inelegant) way of removing all duplicates, but it's not what I want:

library(dplyr)

distinct_a <- similarity %>% 
  distinct(a) %>% 
  rename(id = a)

distinct_b <- similarity %>%  
  distinct(b) %>% 
  rename(id = b)

dupes <- bind_rows(distinct_a, distinct_b)

(no_dupes <- text %>% anti_join(dupes, by = "id"))

># A tibble: 4 x 2
  id    text                         
  <chr> <chr>                        
1 105a  chocolate and apples are fine
2 106a  peaches are peachy           
3 109a  Pie is the best though       
4 110a  don't forget ice cream

no_dupes removes the duplicates but does not retain one version of the original text, so it goes beyond what I am looking for. I feel like the solution is on the tip of my tongue and I just can't quite get there. I'm sure it involves dplyr and possibly tidyr or a window function, but I'm a bit stuck at this point. As I said, at this point I don't care which document ID is returned, so any solution that removes all duplicates except one original would probably work for my purposes. Any advice would be greatly appreciated.

mishabalyasin · May 10, 2018, 2:24pm

I can get the result you want with this approach:

library(tidyverse)
text <- tibble::tribble(~id, ~text,
                        "101a", "apples are nice",
                        "102a", "chocolate is great",
                        "103a", "apples are nice",
                        "104a", "apples are nice",
                        "105a", "chocolate and apples are fine",
                        "106a", "peaches are peachy", 
                        "107a", "chocolate is great", 
                        "108a", "apples are nice",
                        "109a", "Pie is the best though", 
                        "110a", "don't forget ice cream")

similarity <- tibble::tribble(~a, ~b, ~score,
                              "101a", "103a", 1,
                              "101a", "104a", 1,
                              "101a", "108a", 1,
                              "104a", "103a", 1,
                              "104a", "108a", 1,
                              "103a", "108a", 1,
                              "102a", "107a", 1)

result <- tibble::tribble(~id, ~text,
                          "101a", "apples are nice",
                          "102a", "chocolate is great",
                          "105a", "chocolate and apples are fine",
                          "106a", "peaches are peachy", 
                          "109a", "Pie is the best though", 
                          "110a", "don't forget ice cream")
remove <- similarity[["b"]] %>% unique()

res <- text %>%
  dplyr::filter(!id %in% remove)

identical(result, res)
#> [1] TRUE

Created on 2018-05-10 by the reprex package (v0.2.0).

bj_bloom · May 10, 2018, 3:23pm

@mishabalyasin thanks for that. It does work for this example, which made me realize that I meant to account for one additional potential wrinkle in the data that I didn't provide in the initial example. Your solution depends on the repeated document ID 101a to be in column a of similarity each time. I don't want to assume that. So I have revised my example data (row 3 of similarity_new, which I think is more realistic for the data I will be using.

text <- tibble::tribble(~id, ~text,
                        "101a", "apples are nice",
                        "102a", "chocolate is great",
                        "103a", "apples are nice",
                        "104a", "apples are nice",
                        "105a", "chocolate and apples are fine",
                        "106a", "peaches are peachy", 
                        "107a", "chocolate is great", 
                        "108a", "apples are nice",
                        "109a", "Pie is the best though", 
                        "110a", "don't forget ice cream")

similarity_new <- tibble::tribble(~a, ~b, ~score,
                              "101a", "103a", 1,
                              "101a", "104a", 1,
                              "108a", "101a", 1,
                              "104a", "103a", 1,
                              "104a", "108a", 1,
                              "103a", "108a", 1,
                              "102a", "107a", 1)

result <- tibble::tribble(~id, ~text,
                          "101a", "apples are nice",
                          "102a", "chocolate is great",
                          "105a", "chocolate and apples are fine",
                          "106a", "peaches are peachy", 
                          "109a", "Pie is the best though", 
                          "110a", "don't forget ice cream")

prosoitos · May 11, 2018, 5:01am

This works:

library(tidyverse)

text <- tibble::tribble(~id, ~text,
                        "101a", "apples are nice",
                        "102a", "chocolate is great",
                        "103a", "apples are nice",
                        "104a", "apples are nice",
                        "105a", "chocolate and apples are fine",
                        "106a", "peaches are peachy", 
                        "107a", "chocolate is great", 
                        "108a", "apples are nice",
                        "109a", "Pie is the best though", 
                        "110a", "don't forget ice cream")

similarity_new <- tibble::tribble(~a, ~b, ~score,
                                  "101a", "103a", 1,
                                  "101a", "104a", 1,
                                  "108a", "101a", 1,
                                  "104a", "103a", 1,
                                  "104a", "108a", 1,
                                  "103a", "108a", 1,
                                  "102a", "107a", 1)

## sort the columns independently for each row
ordered <-
  similarity_new %>%
  t() %>%
  as_tibble() %>%
  map_dfc(sort) %>%
  t() %>%
  as_tibble()

text %>% filter(!id %in% ordered$V3)
#> # A tibble: 6 x 2
#>   id    text                         
#>   <chr> <chr>                        
#> 1 101a  apples are nice              
#> 2 102a  chocolate is great           
#> 3 105a  chocolate and apples are fine
#> 4 106a  peaches are peachy           
#> 5 109a  Pie is the best though       
#> 6 110a  don't forget ice cream

prosoitos · May 11, 2018, 5:06am

This is actually unnecessary since you use %in% on the following line (repeats do not change anything).

mishabalyasin · May 11, 2018, 7:10am

You are right, but I like to be obvious

mishabalyasin · May 11, 2018, 7:31am

This may or may not be an overkill for your problem, but in fact, your similarity_new is nothing more than a graph. And in a general case you can have a situation like this: 101a -> 104a -> ... -> 108a. In other words, number of steps between two edges is not known. Therefore, when you think of this problem as a graph, you can identify all of those relationship with a clique. Once you know all the cliques, you can just pick one of them at random/first and it will guarantee that you have all of the results you need:

library(tidyverse)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
library(ggraph)

text <- tibble::tribble(~id, ~text,
                        "101a", "apples are nice",
                        "102a", "chocolate is great",
                        "103a", "apples are nice",
                        "104a", "apples are nice",
                        "105a", "chocolate and apples are fine",
                        "106a", "peaches are peachy", 
                        "107a", "chocolate is great", 
                        "108a", "apples are nice",
                        "109a", "Pie is the best though", 
                        "110a", "don't forget ice cream")

similarity_new <- tibble::tribble(~a, ~b, ~score,
                                  "101a", "103a", 1,
                                  "101a", "104a", 1,
                                  "108a", "101a", 1,
                                  "104a", "103a", 1,
                                  "104a", "108a", 1,
                                  "103a", "108a", 1,
                                  "102a", "107a", 1)

result <- tibble::tribble(~id, ~text,
                          "101a", "apples are nice",
                          "102a", "chocolate is great",
                          "105a", "chocolate and apples are fine",
                          "106a", "peaches are peachy", 
                          "109a", "Pie is the best though", 
                          "110a", "don't forget ice cream")


graph <- igraph::graph_from_data_frame(similarity_new)

ggraph(graph) +
  geom_edge_link() +
  geom_node_point()
#> Using `nicely` as default layout


res <- igraph::max_cliques(graph)
#> Warning in igraph::max_cliques(graph): At ./maximal_cliques_template.h:
#> 203 :Edge directions are ignored for maximal clique calculation

purrr::map_chr(res, function(id){
  id[1] %>% names()
})
#> [1] "102a" "101a"

Created on 2018-05-11 by the reprex package (v0.2.0).

prosoitos · May 11, 2018, 8:45am

The problem of the overlapping pairs creating those chains of unknown lengths gets sorted when the pairs are ordered (our similarity_new dataframe with the two columns of interest ordered for each row, which I called ordered):

Each overlapping pair will have one value in each column (over 2 different rows). So excluding all the values from one or the other column will get rid of them. Only the ends of those chains, which do not overlap with any other value will get conserved. So that is a way to select the unique values despite all the potential repeats in those pairs (or to count the number of distinct values).

I am not explaining it very well... but once the columns of interest are ordered for each row independently, filter(!id %in% (ordered$col)) with col being either of the columns of interest does the job, regardless of the numbers of repeats, or steps in the various chains.

bj_bloom · May 11, 2018, 12:27pm

Brilliant. This seems to be a great solution to this problem. I think if this were simply the case of duplicates (meaning just two), it wouldn't be so hard. But with "n-plicates" (where n>2), it becomes an issue. As someone working in SQL a lot, my inclination was to use a window function. But without a clearly sorted and consistent document ID in one column, it became hard to figure out a grouping variable to partition on.

My actual dataset is over 100,000 comparisons, with max(x) over 500. So I think this will scale fairly easily as well. Thanks for the help!

bj_bloom · May 11, 2018, 12:29pm

@mishabalyasin I think this may be a bit overkill for this particular problem but also would work. Conceptually it's something helpful for me to keep in mind when I encounter a similar problem in the future. I'm using network analysis in some other work I'm doing, so it's good to start thinking about how that way of conceptualizing data can transfer to other domains.

prosoitos · May 11, 2018, 3:47pm

You're welcome. Glad I could help. And it was really fun to brainstorm on this actually