RSQLite and DuckDB seem to create in-memory databases, even when I don't want them to

Hey - I'm struggling with this one and couldn't find any obvious Stack Overflow examples answering the question. I'm sure I'm missing an obvious argument step here, but any help would be appreciated!

I have ~60GB of CSV weather data scattered across 366 folders, each containing ~100 files of about 1-2MB - roughly 36,600 files in all. Way too much to keep in memory! So, I thought: let's iterate over all these files, load them into a local DB like duckdb or RSQLite, and query results from there. If the results aren't in memory, we should be OK, right?

So, did something like this:

library(readr)
library(DBI)
library(duckdb)
library(dbplyr)
con <- dbConnect(duckdb::duckdb(), dbname = "Output-Files/WX-DB.duckdb") 
wx_file <- read_csv("Input-Data/directory/20220101/11052.csv")
wx_spec <- spec(wx_file)
wx_colnames <- colnames(wx_file)
copy_to(con, wx_file)
wx_file_remote <- tbl(con, "wx_file")
rows_delete(x = wx_file_remote, y = wx_file, copy = TRUE, unmatched = "ignore")

The theory here is: we instantiate a local database, read in one of the files as a sample to get the spec and column names, copy it over to create the schema, and then delete all the rows so we don't have to skip the sample file in the full load.
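
(Side note for anyone reading later: DBI also has dbCreateTable(), which as I understand it creates an empty table straight from a data frame's columns, so the copy-then-delete dance might not be needed at all. A rough, untested sketch:)

dbCreateTable(con, "wx_file", wx_file)  # empty table; column names/types taken from wx_file, no rows written
wx_file_remote <- tbl(con, "wx_file")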

Unfortunately, something weird happens here - the DB file doesn't grow! If I call:

dbDisconnect(conn = con)
con <- dbConnect(duckdb::duckdb(), dbname = "Output-Files/WX-DB.duckdb")
wx_file_remote <- tbl(con, "wx_file")

I get:

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT *
FROM wx_file AS q02
WHERE (0 = 1)
Error: Catalog Error: Table with name wx_file does not exist!
Did you mean "pg_type"?
LINE 2: FROM wx_file AS q02

Where'd my table go?! I get similar results with RSQLite:

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! no such table: wx_file
Run `rlang::last_error()` to see where the error occurred.

As far as I can tell, even though (I think) I'm asking for the DB not to be created in-memory, it's doing it anyway, and I can't find any consistent way of making it not do that. That seems confirmed by doing the following:

library(purrr)
library(dplyr)
library(stringr)
dir <- dir(path = "Input-Data/directory/")
purrr::walk(dir, ~ {
  files <- dir(path = str_c("Input-Data/directory/", .x, "/", sep = "")) %>% str_c("Input-Data/directory/", .x, "/", ., sep = "")
  tmp <- read_csv(files, col_types = wx_spec, col_names = wx_colnames, num_threads = 1, lazy = TRUE)
  rows_append(wx_file_remote, tmp, in_place = TRUE, copy = TRUE)
  rm(tmp)
  gc()
}, .progress = TRUE)

Which quickly annihilates the relatively little local RAM I have.

What am I missing here? Real "feel like I'm taking crazy pills" situation - I'm sure I'm missing something obvious.

I think you are using dbname as if it were the param you need, dbdir.

Good answer! Gets the same response, though:

con <- dbConnect(duckdb::duckdb(), dbdir = "Output-Files/WX-DB-2.duckdb") #, create = TRUE)
wx_file <- read_csv("Input-Data/omd_data/20220101/11052.csv")
wx_spec <- spec(wx_file)
wx_colnames <- colnames(wx_file)
copy_to(con, wx_file)
wx_file_remote <- tbl(con, "wx_file")
rows_delete(x = wx_file_remote, y = wx_file, copy = TRUE, unmatched = "ignore")
dbDisconnect(conn = con, shutdown = TRUE)
con <- dbConnect(duckdb::duckdb(), dbdir = "Output-Files/WX-DB-2.duckdb")
wx_file_remote <- tbl(con, "wx_file")

Returns:

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT *
FROM wx_file AS q02
WHERE (0 = 1)
Error: Catalog Error: Table with name wx_file does not exist!
Did you mean "pg_type"?
LINE 2: FROM wx_file AS q02
             ^
Run `rlang::last_error()` to see where the error occurred.
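
(For completeness, I also sanity-checked what the re-opened connection can actually see using DBI::dbListTables() - a sketch with the same paths as above:)

con <- dbConnect(duckdb::duckdb(), dbdir = "Output-Files/WX-DB-2.duckdb")
dbListTables(con)  # "wx_file" doesn't appear here, which fits the Catalog Error above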

In the duckdb documentation for R, rather than introducing data via this copy_to approach, they show the following:

To write a R data frame into DuckDB, use the standard DBI function dbWriteTable(). This creates a table in DuckDB and populates it with the data frame contents. For example:

dbWriteTable(con, "iris_table", iris)
res = dbGetQuery(con, "SELECT * FROM iris_table LIMIT 1")
print(res)

Hey, that works! I might post this as an issue with dbplyr on their GitHub.
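
(For anyone landing here later, this is roughly how I'd adapt the loop from my first post to use dbWriteTable() with append = TRUE in place of rows_append() - a sketch only, same placeholder paths, and it assumes the wx_file table already exists in the DB:)

purrr::walk(dir, ~ {
  files <- str_c("Input-Data/directory/", .x, "/",
                 dir(path = str_c("Input-Data/directory/", .x, "/")))
  tmp <- read_csv(files, col_types = wx_spec, col_names = wx_colnames, num_threads = 1)
  dbWriteTable(con, "wx_file", tmp, append = TRUE)  # append straight into the file-backed table
  rm(tmp)
  gc()
}, .progress = TRUE)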

Oho! So, I dug around in the documentation and found out what's happening - the copy_to help page isn't useful, but the dbplyr::copy_to.src_sql one is:

Copy a local data frame to a remote database
Description
This is an implementation of the dplyr copy_to() generic and it is mostly a wrapper around DBI::dbWriteTable().

It is useful for copying small amounts of data to a database for examples, experiments, and joins. By default, it creates temporary tables which are only visible within the current connection to the database.

Usage

## S3 method for class 'src_sql'
copy_to(
  dest,
  df,
  name = deparse(substitute(df)),
  overwrite = FALSE,
  types = NULL,
  temporary = TRUE,
  unique_indexes = NULL,
  indexes = NULL,
  analyze = TRUE,
  ...,
  in_transaction = TRUE
)

So I'd need to set temporary = FALSE - it's not clear at all from the documentation or the examples that that's what's going on. Thank you for helping me work through it.
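
(In case it helps anyone else, this is the change I believe fixes my original snippet - same code as before, just connecting via dbdir and asking copy_to() for a persistent table:)

con <- dbConnect(duckdb::duckdb(), dbdir = "Output-Files/WX-DB.duckdb")
copy_to(con, wx_file, name = "wx_file", temporary = FALSE)  # survives disconnect/reconnect
wx_file_remote <- tbl(con, "wx_file")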

