Problem: writing code to remove partial duplicate rows from csv file

ledgreve · December 12, 2022, 10:23am

Hello,

Firstly, I wish to mention and stress that I do not have a lot of experience with R and programming in general.

The situation: I have a csv consisting of annotated data to be used for ABSA and Automatic Aspect Term Extraction. The first column ("col1") contains the name of the annotated file, the second column ("col2") contains the tagged aspect and the third column ("col3") contains the category tag given to the aspect. The columns are separated by tabs (as sometimes the aspect contains a comma).

Rather frequently, the same target words are tagged with different aspects or there is at least some overlap.

Example:

sentence = "It's time for the annual London Book Fair"

To train the system for Automatic Aspect Term Extraction on the annotated data, there can be no overlap.

Goal: I wish to remove those rows where the value for the first column is an exact duplicate and where the value of the second column is at the same time a partial duplicate (third column can be ignored).

failing code: This is the script I wrote using the dplyr package to try and accomplish this, but unfortunately it does not seem to work. I only get a nearly empty csv file consisting of a single line, namely "","col1","col2".

This is my code:

#load package
library(dplyr)
#Next read in the csv file and store it as a dataframe. 
#Columns are separated by tabs and sometimes contain quotes.
df_aspectcategories <- read.csv("C:/Users/…annotations_aspectcategory_tab.csv", sep ="\t", quote = "")
# First I want to filter the data. 
#I want to select only those rows where the value in column 1 is an exact duplicate and where the value of column 2 is simultaneously a partial duplicate
filtered_data <-filter(df_aspectcategories, col1 == duplicated(col1) & col2 %in% duplicated (col2))
#Now I try to use the distinct() function to remove the duplicate rows. 
#This function takes two arguments: the dataframe and a vector indicating which columns to consider when identifying duplicates. 
#I only want them to consider the "col1" and the "col2". 
dedup_aspectcategories <- distinct(filtered_data, col1, col2)
#finally, I want to export this dataframe containing the deduplicated data and store it as a new csv. 
write.csv(dedup_aspectcategories, "C:/Users/…Dedup_aspectcategories.csv")

Any advice (preferably in very simple layman terms) on how to solve this problem or concrete help would be very much appreciated! Thank you in advance!

technocrat · December 12, 2022, 10:17pm

Here are some pieces, because the question is missing a reprex.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(readr)

# rename source file with tsv extension
# and use readr::read_tsv()
dat <- read_tsv("/Users/ro/projects/demo/grist.tsv")
#> Rows: 2 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): col1, col2, col3
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# because file doesn't exist on anyone else's
# system

dat <- data.frame(
  stringsAsFactors = FALSE,
  col1 = c("document_1", "document_1","document_2"),
  col2 = c("London Book Fair", "London","London"),
  col3 = c("EVENT", "LOCATION","foo")
)

dat
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION
#> 3 document_2           London      foo

# count occurrences of docs; those with more
# than one are the rows to check for overlapping
# col2 values

docs <- dat %>% group_by(col1) %>% count()
to_check <- 
  
  to_check <- docs[which(docs$n > 1),]$col1
to_check
#> [1] "document_1"
# single document case; will need to be
# adapted for use to apply to all cases
# where more than a single document has
# multiple occurences, which is the
# example provided in the question

to_check <- dat %>% filter(col1 == to_check)
to_check
#>         col1             col2     col3
#> 1 document_1 London Book Fair    EVENT
#> 2 document_1           London LOCATION

# consider tokenizing col2 to avoid
# problems with inconsistent capitalization

# similarly limited to data presented

str_match_all(c(to_check$col2)[1],c(to_check$col2)[2]) %>% unlist
#>  [1] TRUE
# match found, but no rule provided in 
# question to decide which to keep

The key to solving R challenges like this is to shift focus from how to do something to what to do to get closer from what is at hand to what is desired.

ledgreve · December 13, 2022, 11:58am

Hello, first of all thank you for replying to my post! I've tried to create a reprex of my csv-file (and hope I've succeeded), consisting of the first 15 rows, header included. I wish I could just share the shortened csv file with you, but that is apparently not supported by this website, so here is a somewhat longer dataframe:

> df_aspectcategories <- read.csv("C:/Users/.../reprex.csv", sep ="\t", fileEncoding = "UTF-8")
> df_aspectcategories[1-15]
        col1                                 col2        col3
1  document1                     London Book Fair       EVENT
2  document1                               London    LOCATION
3  document1                              weather     WEATHER
4  document2                          short story        BOOK
5  document2                               Atwood      AUTHOR
6  document3 “there’s never been a novel like it; I think”       QUOTE
7  document3                                novel        BOOK
8  document3                        Hilary Mantel      AUTHOR
9  document3                          'Wolf Hall’        BOOK
10 document4                           Man Booker       EVENT
11 document4                     at the Guildhall    LOCATION
12 document5                          shortlistee      AUTHOR
13 document5                          shortlistee COMPETITION
14 document5                        Women’s Prize       EVENT

The code I shared was the exact same one I used in Rstudio. Would somewhat more expansive dataframe help you? It has a bit of everything, ranging from partial (row 1 &2, row 6 & 7) to exact (row 12 & 13) duplicates, containing quotation marks, semicolon...

And once again thank you!

technocrat · December 13, 2022, 10:31pm

The way show representative data from a data frame is to use cut-and-paste from the console if you don't have a public URL to share, such as a Github gist.

> dput(dat)
structure(list(col1 = c("documen_1", "documen_1"), col2 = c("London Book Fair", 
"London"), col3 = c("EVENT", "LOCATION")), row.names = c(NA, 
-2L), spec = structure(list(cols = list(col1 = structure(list(), class = c("collector_character", 
"collector")), col2 = structure(list(), class = c("collector_character", 
"collector")), col3 = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = "\t"), class = "col_spec"), problems = <pointer: 0x6000001227b0>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

Looking at dat[1:2,] dat[6:7,] and dat[12:13,] only:

The token "London" appears in first pair, along with other tokens
The word "novel" appears in the second pair, along with other tokens
The word "shortlisted" appears in the third pair, without any other tokens

In each case, I have the same question

Which row survives and is it ok to lose the contents of the following column?

mikecrobp · December 14, 2022, 9:25am

Can you define a partial duplicate and specify how you want it to be treated. I have a function to deduplicate based on a column which needs to be unique (have never needed a multi-column version) and it resolves which row to take by taking the largest value of another column. But not sure that is the problem you want to solve.

ledgreve · December 14, 2022, 10:07am

Hello and thank you for the follow-up and your patience!
I apologise! I've run de dput() function you showed me on my shortened dataset and got this outcome, which I hope will be more usefull for you:

> df_aspectcategories <- read.csv("C:/Users/.../reprex.csv", sep ="\t", quote = "", fileEncoding = "UTF-8")
> dput(df_aspectcategories)
structure(list(col1 = c("document1", "document1", "document1", 
"document2", "document2", "document3", "document3", "document3", 
"document3", "document4", "document4", "document5", "document5", 
"document5"), col2 = c("London Book Fair", "London", "weather", 
"short story", "Atwood", "“there’s never been a novel like it”", 
"novel", "Hilary Mantel", "'Wolf Hall’", "Man Booker", "at the Guildhall", 
"shortlistee", "shortlistee", "Women’s Prize"), col3 = c("EVENT", 
"LOCATION", "WEATHER", "BOOK", "AUTHOR", "QUOTE", "BOOK", "AUTHOR", 
"BOOK", "EVENT", "LOCATION", "AUTHOR", "COMPETITION", "EVENT"
)), class = "data.frame", row.names = c(NA, -14L))

As to your questions:
Fristly, it does not matter very much which one of the rows - containing a duplicate col1 value and a partial duplicate col2 value - remains. Perhaps it would be possible (and to keep things consistent and hopefully easier) to only have the first of the duplicate rows remain? So in this example rows no. 2, 7 and 13 would be removed.
I'm not sure whether you mean the contents for col3 in general or for the duplicate rows only. It is not ok to lose the contents of the third column (col3) for the remaining rows, as the tags it contains will be needed to train the AATE learning system. However, the rows that are to be removed can be deleted in full, including the value this specific row has in col3. I hope this clarifies things somewhat?

Thank you for your help!

ledgreve · December 14, 2022, 10:43am

Good morning and thank you for your time!

Firstly, this is an example of what the data looks like:

        col1                                 col2        col3
1  document1                     London Book Fair       EVENT
2  document1                               London    LOCATION
3  document1                              weather     WEATHER
4  document2                          short story        BOOK
5  document2                               Atwood      AUTHOR
6  document3 “there’s never been a novel like it”       QUOTE
7  document3                                novel        BOOK
8  document3                        Hilary Mantel      AUTHOR
9  document3                          'Wolf Hall’        BOOK
10 document4                           Man Booker       EVENT
11 document4                     at the Guildhall    LOCATION
12 document5                          shortlistee      AUTHOR
13 document5                          shortlistee COMPETITION
14 document5                        Women’s Prize       EVENT

And technocrat was kind enough to show me how to give you the data via a cut-and-paste way:

dput(df_aspectcategories)
structure(list(col1 = c("document1", "document1", "document1", 
"document2", "document2", "document3", "document3", "document3", 
"document3", "document4", "document4", "document5", "document5", 
"document5"), col2 = c("London Book Fair", "London", "weather", 
"short story", "Atwood", "“there’s never been a novel like it”", 
"novel", "Hilary Mantel", "'Wolf Hall’", "Man Booker", "at the Guildhall", 
"shortlistee", "shortlistee", "Women’s Prize"), col3 = c("EVENT", 
"LOCATION", "WEATHER", "BOOK", "AUTHOR", "QUOTE", "BOOK", "AUTHOR", 
"BOOK", "EVENT", "LOCATION", "AUTHOR", "COMPETITION", "EVENT"
)), class = "data.frame", row.names = c(NA, -14L))

Now, for my dataset I want to look at those rows that have 1) the exact same document name in the first column, therefore an exact duplicate, and 2) at the same time an exact or partial duplicate in the second column. In this context I take "partial duplicate" to mean any sort of overlap in the value in col2 between ronws where the value in col1 is identical. So the first condition for possible overlap (and thus a partial duplicate) would be that the value in col1 has to be identical. Then I would say there is a partial duplicate if at least one token is the same in the value of these rows (with identical col1-values) in the 2nd column.

Concrete for this example, rows 1-3 have the same value in col1, but there is only a form of overlap (partial duplicate) regarding the value in col2 for row 1 & 2, namely the word "London". So one of them (either row 1 or 2) have to be removed. For rows 4 & 5, they have the same document name, but no overlap in col2, so both can remain. Rows 6, 7, 8 and 9 also have the same document name ("document3"), but only 6 and 7 also have overlap in their col2-value, the word "novel". Either row 6 or 7 thus has to be removed, but 8 and 9 can remain. Rows 10 and 11, share the same document name, but the value in col2 is different, so they can both stay. Lastly, for rows 12, 13 and 14 (all "document5" in col1), the value in rows 12 and 13 is an exact duplicate, so one of them has to go, but row 14 is good to remain.

So if there are several rows (two or more) that have the exact same value in col1 (exact duplicate) and simultaneously have an exact or partial duplicate value in col2, only one of these rows can remain (for example the first one).

Does this answers your question? I hope my explanation is somewhat clear, English is my third language.

nirgrahamuk · December 16, 2022, 10:58am

Here's an attempt; test and use at your own risk:

df_0 <- structure(list(col1 = c(
  "document1", "document1", "document1",
  "document2", "document2", "document3", "document3", "document3",
  "document3", "document4", "document4", "document5", "document5",
  "document5"
), col2 = c(
  "London Book Fair", "London", "weather",
  "short story", "Atwood", "“there’s never been a novel like it”",
  "novel", "Hilary Mantel", "'Wolf Hall’", "Man Booker", "at the Guildhall",
  "shortlistee", "shortlistee", "Women’s Prize"
), col3 = c(
  "EVENT",
  "LOCATION", "WEATHER", "BOOK", "AUTHOR", "QUOTE", "BOOK", "AUTHOR",
  "BOOK", "EVENT", "LOCATION", "AUTHOR", "COMPETITION", "EVENT"
)), class = "data.frame", row.names = c(NA, -14L))

library(tidyverse)

(tokenised_df <- df_0 |>
  mutate(rowname = row_number()) |>
  rowwise() |>
  mutate(col2_tokens = str_split(col2, pattern = " ")) |> ungroup())

(xj_df <- left_join(tokenised_df,
  tokenised_df,
  by = "col1"
) |> rowwise() |>
  filter(!identical(col2.x, col2.y) | !identical(col3.x, col3.y)) |>
  mutate(
    matched_toks = list(intersect(col2_tokens.x, col2_tokens.y)),
    matched_toks_string = paste0(matched_toks, collapse = ", "),
    num_matched_toks = length(matched_toks),
    matched_toks_flag = num_matched_toks > 0
  ))

(slim_xj_df <- select(xj_df,
  rn = rowname.x, col1, col2 = col2.x, matched_toks_string, matched_toks_flag, r2 = rowname.y
) |>
  group_by(rn, col1, col2) |> summarise(
    matched_toks = any(matched_toks_flag),
    matched_toks_string = paste0(unique(matched_toks_string), collapse = "; "),
    r2list = paste0(unique(ifelse(matched_toks, r2, "")), collapse = "; ")
  ) |>
  ungroup())

(split_df_list <- split(slim_xj_df, ~matched_toks))
# FALSE group are kept
# TRUE group need one drop per col1 matched_toks_string pair
split_df_list$`TRUE` <- split_df_list$`TRUE` |>
  group_by(col1) |>
  slice_head(n = 1)
(decision_df <- bind_rows(split_df_list) |> arrange(rn))

(fin_df <- left_join(decision_df |> select(rn),
  tokenised_df |> select(-col2_tokens),
  by = c("rn" = "rowname")
) |> select(-rn))

technocrat · December 16, 2022, 10:47pm

Your English is fine, far better than my other languages. I'm fortunate that English has become a world language and in exchange I feel an obligation to focus on what is being said, rather than how. Because communication is the point.

I've been struggling with the how of continuing to work the problem, and when that happens, I eventually step back and ask if I really understand what the problem actually is.

My problem goes to the part of the classification problem that implements the rule keep the first entry in a document that contains no overlaps with subsequent entries. So,

1  document1                     London Book Fair       EVENT
2  document1                               London    LOCATION

row 1 survives and row 2 is discarded.

I ran into problems with the expanded data in the cases where more than two document entries are present.

library(dplyr)
library(stringr)

dat <- structure(list(col1 = c("document1", "document1", "document1", 
                        "document2", "document2", "document3", "document3", "document3", 
                        "document3", "document4", "document4", "document5", "document5", 
                        "document5"), col2 = c("London Book Fair", "London", "weather", 
                                               "short story", "Atwood", "“there’s never been a novel like it”", 
                                               "novel", "Hilary Mantel", "'Wolf Hall’", "Man Booker", "at the Guildhall", 
                                               "shortlistee", "shortlistee", "Women’s Prize"), col3 = c("EVENT", 
                                                                                                        "LOCATION", "WEATHER", "BOOK", "AUTHOR", "QUOTE", "BOOK", "AUTHOR", 
                                                                                                        "BOOK", "EVENT", "LOCATION", "AUTHOR", "COMPETITION", "EVENT"
                                               )), class = "data.frame", row.names = c(NA, -14L))
docs <- dat %>% group_by(col1) %>% count()
to_check <- docs[which(docs$n > 1),]$col1
# all documents need checking
length(to_check) != unique(dat$col1)
#> [1] TRUE TRUE TRUE TRUE TRUE
# variable number of doc entries to check
dat %>% group_by(col1) %>% count()
#> # A tibble: 5 × 2
#> # Groups:   col1 [5]
#>   col1          n
#>   <chr>     <int>
#> 1 document1     3
#> 2 document2     2
#> 3 document3     4
#> 4 document4     2
#> 5 document5     3

The cases of document1 and document3 illustrate the problem. For simplification let's use a toy data set.

dat <- data.frame(entry = c("first","second","third"),
                  terms = c("alpha baker charlie",
                            "charlie delta echo",
                            "echo foxtrot golf"))
dat
#>    entry               terms
#> 1  first alpha baker charlie
#> 2 second  charlie delta echo
#> 3  third   echo foxtrot golf
combn(dat$terms,2)
#>      [,1]                  [,2]                  [,3]                
#> [1,] "alpha baker charlie" "alpha baker charlie" "charlie delta echo"
#> [2,] "charlie delta echo"  "echo foxtrot golf"   "echo foxtrot golf"

Row 1 has a partial with overlap row 2, and row 1 and row 3 do not overlap, but row 2 and row 3 have a partial overlap. Do we

Keep row 1 and discard row 2 immediately and then keep row 3?
Keep row 1 but use row 2 to knock out row 3 before discarding row 2?

Then there are two more problems

Quotes contain non-substantive parts of speech:

I would like to think that a good time was had by all.
The author makes a spectacle of herself.
THAT WAS A SPECTACLE!

The order in which entries are recorded may determine which entries survive and which die.

This might be better framed as a problem in natural language processing with the words classified as lexical tokens. A solution would proceed by converting the data into a "corpus", "tokenizing" the terms column by stripping punctuation, lower casing and removing "stopwords" such as a, to, that, of. The concept of overlap could be expanded by using synonyms—a smash hit overlaps a terrific success.—and notions of "distance" can be introduced with metrics to decide the degree of similarity as a basis for decision.

Without knowing the goal of the exercise it's hard to say whether the statement of the problem is underspecified (i.e., more subtle methods such as NLP will be required) or overspecified(*i.e., just merge the terms of all a document's entries.

ledgreve · December 21, 2022, 10:45am

Hello @nirgrahamuk and thank you so much for your reply and the script you shared! I tried to run it exactly how it is in Rstudio today . The only thing I changed was replacing the |> to %>%. However, I keep getting an error when I try to run this line:

The error I receive is Error in unique.default(x, nmax = nmax) : unique() applies only to vectors. I tried to solvethis by changing the line to

(split_df_list <- split(slim_xj_df, as.vector(~matched_toks)))

but the error stays the same and I'm not sure what the problem is or how I can change it?

Thank you very for your time and help!

nirgrahamuk · December 21, 2022, 10:58am

I would recommend upgrading your R to a more modern version.
The older way to do

I think would be

(split_df_list <- split(slim_xj_df, slim_xj_df$matched_toks))

ledgreve · December 21, 2022, 2:54pm

@nirgrahamuk Thank you so much, I've run the script again and it worked perfectly; I know it's an exression that is often used far to liberally, but I could hug you right now! You've saved me weeks (at least!) of manually going through all of my data. Thank you very much for your time, effort and help! I am currently on cloud nine.

system · December 28, 2022, 2:54pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.