DBI/dbplyr: creating a remote table from a remote table (duckdb)

Hi! I am trying to create a remote table from a remote table, without bringing data into my local R environment. This seems straightforward but I'm a little stumped and would appreciate some help.

I have 3 tables in a duckdb (don't think that is super relevant, aside from the methods implemented in dbplyr) database. I would like to join them based on one key column. I would like for all of the computation to happen in the db and not within R. How do I join them within the database? I can get copy_to to work, but I am worried that cannot deal with larger than RAM situations (am currently playing around with tiny things and want to make sure they scale). At least, in 2019: dbplyr_issue Additionally, the documentation for copy_to specifies that it is for small amounts of data, so I do not think it is the appropriate approach.

I am getting errors with dbWriteTable, but I think that is very much due a syntax problem somewhere. Thank you!!

library(duckdb) 
library(dbplyr) 
library(DBI) 

conn_mem <- dbConnect(duckdb::duckdb(), dbdir=":memory:") 

dbWriteTable(conn_mem, "mtcars1", mtcars) 
dbWriteTable(conn_mem, "mtcars2", mtcars) 
dbWriteTable(conn_mem, "mtcars3", mtcars) 

table1 <- tbl(conn_mem, "mtcars1") 
table2 <- tbl(conn_mem, "mtcars2") 
table3 <- tbl(conn_mem, "mtcars3") 

joined_table <- left_join(table1, table2, by = "mpg") %>% left_join(table3, by = "mpg") 

dbWriteTable(duckdb_conn, "mtcars_123", statement = joined_table) 
# Error in (function (classes, fdef, mtable) : 
# unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "missing"' 

dbWriteTable(duckdb_conn, "mtcars_123", value = joined_table) 
# Error in (function (classes, fdef, mtable) : 
# unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "tbl_duckdb_connection"' 

copy_to(duckdb_conn, df = joined_table, name = "mtcars_123")`

Hello @rkb965 ,

I do not see duckdb_conn defined. Did you leave out the part where this is done?

You could raise an issue (question in this case) in
https://github.com/duckdb/duckdb/issues .
Indicate in the title that it is an R related question.

1 Like

Ah, that's what I get for not actually rendering a reprex -- sloppy renaming. I wish that were the only issue, but consistency of naming still results in the same errors.

``` r
library(duckdb) 
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI) 

conn_mem <- dbConnect(duckdb::duckdb(), dbdir=":memory:") 

dbWriteTable(conn_mem, "mtcars1", mtcars) 
dbWriteTable(conn_mem, "mtcars2", mtcars) 
dbWriteTable(conn_mem, "mtcars3", mtcars) 

table1 <- tbl(conn_mem, "mtcars1") 
table2 <- tbl(conn_mem, "mtcars2") 
table3 <- tbl(conn_mem, "mtcars3") 

joined_table <- left_join(table1, table2, by = "mpg") %>% left_join(table3, by = "mpg") 

dbWriteTable(conn_mem, "mtcars_123", statement = joined_table) 
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "missing"'
# Error in (function (classes, fdef, mtable) : 
# unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "missing"' 

dbWriteTable(conn_mem, "mtcars_123", value = joined_table) 
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "tbl_duckdb_connection"'
# Error in (function (classes, fdef, mtable) : 
# unable to find an inherited method for function 'dbWriteTable' for signature '"duckdb_connection", "character", "tbl_duckdb_connection"' 

copy_to(conn_mem, df = joined_table, name = "mtcars_123")

Created on 2023-05-02 with reprex v2.0.2

Thanks for the GitHub issue suggestion. I've poked around a bit over there but it's a bit foreign to me (the SQL and memory details) -- was hoping to have friendly tidyverse-adjacent help. But I'll give it a try if I don't get a {DBI} solution. Thank you!

Hi @rkb965, so dbWriteTable expects a data.frame object in the "value" argument, and joined_table is not one. With this {dplyr} workflow you set up, dplyr::compute will work best. Note that in your real use case you most likely won't usedbdir=":memory:" for your connection so you'll want to use temporary = FALSE inside of compute to make sure the table persists after closing the db connection.

library(duckdb) 
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI) 

conn_mem <- dbConnect(duckdb::duckdb(), dbdir=":memory:") 

dbWriteTable(conn_mem, "mtcars1", mtcars) 
dbWriteTable(conn_mem, "mtcars2", mtcars) 
dbWriteTable(conn_mem, "mtcars3", mtcars) 

table1 <- tbl(conn_mem, "mtcars1") 
table2 <- tbl(conn_mem, "mtcars2") 
table3 <- tbl(conn_mem, "mtcars3") 

joined_table <- left_join(table1, table2, by = "mpg") %>% left_join(table3, by = "mpg") 

# Execute on db using dplyr verbs with `dplyr::compute`
compute(joined_table, name = "mtcars_123", temporary = FALSE)
#> # Source:   table<mtcars_123> [?? x 31]
#> # Database: DuckDB 0.7.1 [root@Darwin 21.6.0:R 4.1.2/:memory:]
#>      mpg cyl.x disp.x  hp.x drat.x  wt.x qsec.x  vs.x  am.x gear.x carb.x cyl.y
#>    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
#>  1  21       6   160    110   3.9   2.62   16.5     0     1      4      4     6
#>  2  21       6   160    110   3.9   2.88   17.0     0     1      4      4     6
#>  3  21.4     6   258    110   3.08  3.22   19.4     1     0      3      1     4
#>  4  18.7     8   360    175   3.15  3.44   17.0     0     0      3      2     8
#>  5  18.1     6   225    105   2.76  3.46   20.2     1     0      3      1     6
#>  6  14.3     8   360    245   3.21  3.57   15.8     0     0      3      4     8
#>  7  24.4     4   147.    62   3.69  3.19   20       1     0      4      2     4
#>  8  19.2     6   168.   123   3.92  3.44   18.3     1     0      4      4     8
#>  9  17.8     6   168.   123   3.92  3.44   18.9     1     0      4      4     6
#> 10  16.4     8   276.   180   3.07  4.07   17.4     0     0      3      3     8
#> # … with more rows, and 19 more variables: disp.y <dbl>, hp.y <dbl>,
#> #   drat.y <dbl>, wt.y <dbl>, qsec.y <dbl>, vs.y <dbl>, am.y <dbl>,
#> #   gear.y <dbl>, carb.y <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
#> #   wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

# Verify table exists
dbListTables(conn = conn_mem)
#> [1] "mtcars1"    "mtcars2"    "mtcars3"    "mtcars_123"

tbl(conn_mem, "mtcars_123") %>% glimpse()
#> Rows: ??
#> Columns: 31
#> Database: DuckDB 0.7.1 [root@Darwin 21.6.0:R 4.1.2/:memory:]
#> $ mpg    <dbl> 21.0, 21.0, 21.4, 18.7, 18.1, 14.3, 24.4, 19.2, 17.8, 16.4, 17.…
#> $ cyl.x  <dbl> 6, 6, 6, 8, 6, 8, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, …
#> $ disp.x <dbl> 160.0, 160.0, 258.0, 360.0, 225.0, 360.0, 146.7, 167.6, 167.6, …
#> $ hp.x   <dbl> 110, 110, 110, 175, 105, 245, 62, 123, 123, 180, 180, 180, 205,…
#> $ drat.x <dbl> 3.90, 3.90, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.07, 3.0…
#> $ wt.x   <dbl> 2.620, 2.875, 3.215, 3.440, 3.460, 3.570, 3.190, 3.440, 3.440, …
#> $ qsec.x <dbl> 16.46, 17.02, 19.44, 17.02, 20.22, 15.84, 20.00, 18.30, 18.90, …
#> $ vs.x   <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, …
#> $ am.x   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
#> $ gear.x <dbl> 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, …
#> $ carb.x <dbl> 4, 4, 1, 2, 1, 4, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, …
#> $ cyl.y  <dbl> 6, 6, 4, 8, 6, 8, 4, 8, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, …
#> $ disp.y <dbl> 160.0, 160.0, 121.0, 360.0, 225.0, 360.0, 146.7, 400.0, 167.6, …
#> $ hp.y   <dbl> 110, 110, 109, 175, 105, 245, 62, 175, 123, 180, 180, 150, 215,…
#> $ drat.y <dbl> 3.90, 3.90, 4.11, 3.15, 2.76, 3.21, 3.69, 3.08, 3.92, 3.07, 3.0…
#> $ wt.y   <dbl> 2.875, 2.875, 2.780, 3.440, 3.460, 3.570, 3.190, 3.845, 3.440, …
#> $ qsec.y <dbl> 17.02, 17.02, 18.60, 17.02, 20.22, 15.84, 20.00, 17.05, 18.90, …
#> $ vs.y   <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, …
#> $ am.y   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
#> $ gear.y <dbl> 4, 4, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 5, 4, 3, 3, 3, …
#> $ carb.y <dbl> 4, 4, 2, 2, 1, 4, 2, 2, 4, 3, 3, 2, 4, 4, 4, 1, 2, 1, 1, 2, 2, …
#> $ cyl    <dbl> 6, 6, 4, 8, 6, 8, 4, 8, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, …
#> $ disp   <dbl> 160.0, 160.0, 121.0, 360.0, 225.0, 360.0, 146.7, 400.0, 167.6, …
#> $ hp     <dbl> 110, 110, 109, 175, 105, 245, 62, 175, 123, 180, 180, 150, 215,…
#> $ drat   <dbl> 3.90, 3.90, 4.11, 3.15, 2.76, 3.21, 3.69, 3.08, 3.92, 3.07, 3.0…
#> $ wt     <dbl> 2.875, 2.875, 2.780, 3.440, 3.460, 3.570, 3.190, 3.845, 3.440, …
#> $ qsec   <dbl> 17.02, 17.02, 18.60, 17.02, 20.22, 15.84, 20.00, 17.05, 18.90, …
#> $ vs     <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, …
#> $ am     <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
#> $ gear   <dbl> 4, 4, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 5, 4, 3, 3, 3, …
#> $ carb   <dbl> 4, 4, 2, 2, 1, 4, 2, 2, 4, 3, 3, 2, 4, 4, 4, 1, 2, 1, 1, 2, 2, …

Created on 2023-05-03 by the reprex package (v2.0.1)

2 Likes

Perfect, thank you so much @michaelbgarcia !

I am indeed using a persistent directory and not :memory: for my real use case, but I do appreciate that you pointed that out. So many of the duckdb examples online use :memory: for short examples so I followed suit but it is an important reminder.

This works perfectly for me -- thank you so much!!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.