I'm working on some natural language processing tasks on a large corpus of documents. After some exploratory analysis I've discovered quite a few duplicate documents. I've been using the great textreuse package to do pairwise comparison of documents using the Jaccard similarity score (see this vignette for more information). The problem I am having is more of a data manipulation problem than a text mining problem.
The output of the `lsh_compare` function is a data frame with the document IDs in the first two columns and the Jaccard similarity score (between 0 and 1, where higher means more similar) in the third column. What I want to do is take the pairs of documents with a score of 1 and remove the duplicates, while retaining one copy of each duplicated document. Here is a mock set of data that helps explain what I'm trying to do:
text <- tibble::tribble(
  ~id,    ~text,
  "101a", "apples are nice",
  "102a", "chocolate is great",
  "103a", "apples are nice",
  "104a", "apples are nice",
  "105a", "chocolate and apples are fine",
  "106a", "peaches are peachy",
  "107a", "chocolate is great",
  "108a", "apples are nice",
  "109a", "Pie is the best though",
  "110a", "don't forget ice cream"
)

similarity <- tibble::tribble(
  ~a,     ~b,     ~score,
  "101a", "103a", 1,
  "101a", "104a", 1,
  "101a", "108a", 1,
  "104a", "103a", 1,
  "104a", "108a", 1,
  "103a", "108a", 1,
  "102a", "107a", 1
)

result <- tibble::tribble(
  ~id,    ~text,
  "101a", "apples are nice",
  "102a", "chocolate is great",
  "105a", "chocolate and apples are fine",
  "106a", "peaches are peachy",
  "109a", "Pie is the best though",
  "110a", "don't forget ice cream"
)
For this, `text` is the original data (and associated document IDs), `similarity` is the output of the `lsh_compare` function from `textreuse`, and `result` is what I am looking for (note that I don't care about which specific document ID is returned, as long as I get one copy of the original text).
I have found a (perhaps inelegant) way of removing all duplicates, but it's not what I want:
library(dplyr)

distinct_a <- similarity %>%
  distinct(a) %>%
  rename(id = a)

distinct_b <- similarity %>%
  distinct(b) %>%
  rename(id = b)

dupes <- bind_rows(distinct_a, distinct_b)

(no_dupes <- text %>% anti_join(dupes, by = "id"))
#> # A tibble: 4 x 2
#>   id    text
#>   <chr> <chr>
#> 1 105a  chocolate and apples are fine
#> 2 106a  peaches are peachy
#> 3 109a  Pie is the best though
#> 4 110a  don't forget ice cream
`no_dupes` removes the duplicates but does not retain one version of the original text, so it goes beyond what I am looking for. I feel like the solution is on the tip of my tongue and I just can't quite get there. I'm sure it involves `dplyr` and possibly `tidyr` or a window function, but I'm a bit stuck. As I said, I don't care which document ID is returned, so any solution that removes all duplicates except one original would probably work for my purposes. Any advice would be greatly appreciated.
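To make the target concrete, here is a minimal sketch of one direction that seems to reproduce the desired output on the mock data. It rests on two assumptions: a score of exactly 1 really does mean the two documents are duplicates, and within each group of duplicates every pair is listed (as in the mock `similarity` table), so that every document except the one with the smallest ID shows up as the larger ID of some pair. `keep_one` and `result_sketch` are hypothetical names of my own, not part of `textreuse` or `dplyr`.

```r
library(dplyr)

# Mock data repeated so the sketch runs on its own.
text <- tibble::tribble(
  ~id,    ~text,
  "101a", "apples are nice",
  "102a", "chocolate is great",
  "103a", "apples are nice",
  "104a", "apples are nice",
  "105a", "chocolate and apples are fine",
  "106a", "peaches are peachy",
  "107a", "chocolate is great",
  "108a", "apples are nice",
  "109a", "Pie is the best though",
  "110a", "don't forget ice cream"
)

similarity <- tibble::tribble(
  ~a,     ~b,     ~score,
  "101a", "103a", 1,
  "101a", "104a", 1,
  "101a", "108a", 1,
  "104a", "103a", 1,
  "104a", "108a", 1,
  "103a", "108a", 1,
  "102a", "107a", 1
)

# Hypothetical helper: for every perfect-score pair, drop the document with
# the larger ID. The smallest ID in a duplicate group is never the larger
# member of a pair, so it is the copy that survives.
keep_one <- function(text, similarity) {
  to_drop <- similarity %>%
    filter(score == 1, a != b) %>%      # assumption: 1 means duplicate
    transmute(id = pmax(a, b)) %>%      # larger ID of each pair
    distinct()
  anti_join(text, to_drop, by = "id")
}

result_sketch <- keep_one(text, similarity)
```

If the duplicates are always exact string matches, `text %>% distinct(text, .keep_all = TRUE)` would probably do the same job without the similarity table at all, but I'm not sure a Jaccard score of 1 guarantees identical strings, which is why the sketch works from the pairs instead.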