Creating and populating a SQLite database via R - how to ignore duplicate rows?

I am interested in creating and populating a SQLite database via R. One challenge I've encountered concerns inserting duplicate rows.

As in the following reprex, it's easy to (advertently or not) add duplicate rows:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbCreateTable(con, "iris", iris)
dbAppendTable(con, "iris", iris)
#> [1] 150
dbAppendTable(con, "iris", iris)
#> [1] 150
nrow(dbReadTable(con, "iris"))
#> [1] 300
dbDisconnect(con)

If you manually create a table using DBI::dbSendQuery(), you can specify a primary key; my understanding is that this makes it so that duplicate rows cannot be added (by adding as a constraint that there are no duplicated primary keys):

library(DBI)
library(RSQLite)
library(tidyverse)

iris <- mutate(iris, row = row_number())

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbSendQuery(con,
            "CREATE TABLE iris (
            row INTEGER PRIMARY KEY,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")
#> <SQLiteResult>
#>   SQL  CREATE TABLE iris (
#>             row INTEGER PRIMARY KEY,
#>             `Sepal.Length` REAL,
#>             `Sepal.Width` REAL,
#>             `Petal.Length` REAL,
#>             `Petal.Width` REAL,
#>             Species TEXT);
#>   ROWS Fetched: 0 [complete]
#>        Changed: 0
dbAppendTable(con, "iris", iris)
#> [1] 150
dbAppendTable(con, "iris", iris)
#> Error: UNIQUE constraint failed: iris.row
nrow(dbReadTable(con, "iris"))
#> [1] 150
dbDisconnect(con)

Here - correctly - none of the rows were inserted, because all were duplicates.

However, my question is: Is there a way to only add to a table rows that are unique?

It seems like there is a way to do this in SQLite using an INSERT IGNORE command/part of a query:

if you use the INSERT IGNORE statement, the rows with invalid data that cause the error are ignored and the rows with valid data are inserted into the table

But, neither DBI::dbWriteTable() and DBI::dbAppendTable() (or any other function from the DBI package) appear to support it. In short, I'm looking for something like the following to add 100 new rows - and to ignore the first 50 with that are duplicates (because of the duplicate row ID):

library(DBI)
library(RSQLite)
library(tidyverse)

iris <- mutate(iris, row = row_number())
iris <- mutate(iris, row = 101:250)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbSendQuery(con,
            "CREATE TABLE iris (
            row INTEGER PRIMARY KEY,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")
#> <SQLiteResult>
#>   SQL  CREATE TABLE iris (
#>             row INTEGER PRIMARY KEY,
#>             `Sepal.Length` REAL,
#>             `Sepal.Width` REAL,
#>             `Petal.Length` REAL,
#>             `Petal.Width` REAL,
#>             Species TEXT);
#>   ROWS Fetched: 0 [complete]
#>        Changed: 0
dbAppendTable(con, "iris", iris)
#> [1] 150
dbAppendTable(con, "iris", iris)
#> Error: UNIQUE constraint failed: iris.row
nrow(dbReadTable(con, "iris"))
#> [1] 150
dbDisconnect(con)

An option you may consider - though it may seem like an overkill in your use case - is to use the BI concept known as stage table.

This involves creating two tables:

  • one that will accept your R object, in my case called stage; this should have no constraints (it should accept everything as it comes) and should be wiped clean before regular processing
  • one that will be final, in my case called final this should have a constraint - in my case via unique index called identity_check on field row in final table
  • the contents of the stage layer are flipped over to final via a SQL command; the exact formulation will depend on your dialect - sqlite has insert or ignore, while other dialects may have insert ... on conflict do nothing; these differ in detail but the principle is the same

So consider this code:

library(DBI)
library(RSQLite)
library(dplyr)

# some data (with unique column "row")
iris <- mutate(iris, row = row_number()) %>% 
  relocate(row)

# create database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
 
# create structures (empty)
dbSendQuery(con, 
            "CREATE TABLE stage (
            row INTEGER,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")

dbSendQuery(con,
            "CREATE TABLE final (
            row INTEGER PRIMARY KEY,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")

# index on final table 
dbSendQuery(con,
            "create unique index identity_check on final (row);")

# wipe the stage clean
dbSendQuery(con,
            "delete from stage;")

# insert data to stage
dbAppendTable(con, "stage", iris)

# flip the stage over
dbSendQuery(con,
            "insert or ignore into final select * from stage;")

# try for the second time - this should have no effect
dbSendQuery(con,
            "insert or ignore into final select * from stage;")

# make sure by reading the final & comparing row count with original iris
check <- dbReadTable(con, "final")

if (nrow(check)== nrow(iris))  print("all is well :)")

# clean up
dbDisconnect(con)
2 Likes

yay, this works beautifully. Thanks @jlacko!

I wrapped up the key commands into a little function, insert_and_ignore_duplicates():

library(DBI)
library(RSQLite)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# some data (with unique column "row")
iris <- mutate(iris, row = row_number()) %>% 
  relocate(row)

# create database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# create structures (empty)
dbSendQuery(con, 
            "CREATE TABLE stage (
            row INTEGER,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")
#> <SQLiteResult>
#>   SQL  CREATE TABLE stage (
#>             row INTEGER,
#>             `Sepal.Length` REAL,
#>             `Sepal.Width` REAL,
#>             `Petal.Length` REAL,
#>             `Petal.Width` REAL,
#>             Species TEXT);
#>   ROWS Fetched: 0 [complete]
#>        Changed: 0

dbSendQuery(con,
            "CREATE TABLE final (
            row INTEGER PRIMARY KEY,
            `Sepal.Length` REAL,
            `Sepal.Width` REAL,
            `Petal.Length` REAL,
            `Petal.Width` REAL,
            Species TEXT);")
#> Warning: Closing open result set, pending rows
#> <SQLiteResult>
#>   SQL  CREATE TABLE final (
#>             row INTEGER PRIMARY KEY,
#>             `Sepal.Length` REAL,
#>             `Sepal.Width` REAL,
#>             `Petal.Length` REAL,
#>             `Petal.Width` REAL,
#>             Species TEXT);
#>   ROWS Fetched: 0 [complete]
#>        Changed: 0

dbListTables(con)
#> Warning: Closing open result set, pending rows
#> [1] "final" "stage"

# index on final table 
dbSendQuery(con,
            "create unique index identity_check on final (row);")
#> <SQLiteResult>
#>   SQL  create unique index identity_check on final (row);
#>   ROWS Fetched: 0 [complete]
#>        Changed: 0

insert_and_ignore_duplicates <- function(con, d) {
  # wipe the stage clean
  dbSendQuery(con,
              "delete from stage;")
  
  # insert data to stage
  dbAppendTable(con, "stage", d)
  
  # flip the stage over into final
  dbSendQuery(con,
              "insert or ignore into final select * from stage;")
  
}

# adding new rows, some of which are duplicated
iris_new <- mutate(iris, row = 101:250) %>% 
  relocate(row)

insert_and_ignore_duplicates(con, iris) # iris has rows 100-150
#> Warning: Closing open result set, pending rows

#> Warning: Closing open result set, pending rows
#> Warning: Factors converted to character
#> <SQLiteResult>
#>   SQL  insert or ignore into final select * from stage;
#>   ROWS Fetched: 0 [complete]
#>        Changed: 150
insert_and_ignore_duplicates(con, iris_new) # iris_new has rows 101-250, so 50 are duplicated
#> Warning: Closing open result set, pending rows
#> Warning: Closing open result set, pending rows
#> Warning: Factors converted to character
#> <SQLiteResult>
#>   SQL  insert or ignore into final select * from stage;
#>   ROWS Fetched: 0 [complete]
#>        Changed: 100

dbReadTable(con, "final") %>% 
  as_tibble()
#> Warning: Closing open result set, pending rows
#> # A tibble: 250 x 6
#>      row Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>    <int>        <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1     1          5.1         3.5          1.4         0.2 setosa 
#>  2     2          4.9         3            1.4         0.2 setosa 
#>  3     3          4.7         3.2          1.3         0.2 setosa 
#>  4     4          4.6         3.1          1.5         0.2 setosa 
#>  5     5          5           3.6          1.4         0.2 setosa 
#>  6     6          5.4         3.9          1.7         0.4 setosa 
#>  7     7          4.6         3.4          1.4         0.3 setosa 
#>  8     8          5           3.4          1.5         0.2 setosa 
#>  9     9          4.4         2.9          1.4         0.2 setosa 
#> 10    10          4.9         3.1          1.5         0.1 setosa 
#> # … with 240 more rows

# clean up
dbDisconnect(con)

Created on 2020-10-22 by the reprex package (v0.3.0)

Glad to be of service!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.