Upload data from R into multiple, connected postgreSQL tables

dplyr

#1

This doesn’t seem to be a topic in this community yet, but Hadley asked me to post it here, so here we go. :wink:

I have a postgreSQL database to which I connect via R and (so far) RpostgrSQL. I want to upload exercise data, which I’ve split up into multiple tables in my database, but which comes in one big CSV file originally. My basic database structure consists of four tables with different workout aspects identified by serial IDs, which are all connected in a fifth table that links to the serial IDs in the other tables.

I found it quite easy to upload data for this using Python, but since R is focused on uploading one data.frame to one table, I’m not sure what the best approach would be here. My “pythonic” approach is outlined in the example - just not wrapped in a function yet, which I would then call with apply or similar.

Can you recommend any better approaches?

# create example data
exercise <- data.frame(wotype=c("yoga", "running", "cycling"),
                       start=c("25-07-2017  06:21:38", "05-06-2017  08:27:01", "03-06-2017  14:17:22"),
                       name=c("Morning Yoga", "Hannover Running", "Hannover Cycling"),
                       distance=c(0, 4.25, 7.64),
                       calories=c(30, 287, 168),
                       duration=c("10:05", "29:40", "29:38"),
                       avgHR=c(87, 159, 117),
                       maxHR=c(102, 172, 142))

## not run ##
# upload scheme I would use in Python - not fully functional yet

library(RPostgreSQL)

#connect to the database
con <- dbConnect(dbDriver("PostgreSQL"), dbname="summerpain", user="postgres")

# begin transactions
dbBegin(con)

# uploading statements
metrics <- dbSendQuery(con, "INSERT INTO \"Metrics\" (duration, distance) VALUES('10:05', 0) RETURNING id;")
m <- fetch(metrics)

results <- dbSendQuery(con, "INSERT INTO \"Results\" (avgHR, maxHR, calories) VALUES(87, 102, 30) RETURNING id;")
r <- fetch(results)

wotype <- dbSendQuery(con, "INSERT INTO \"WorkoutType\" (wotype) VALUES('yoga') RETURNING id;")
t <- fetch(wotype)

woname <- dbSendQuery(con, "INSERT INTO \"WorkoutName\" (name) VALUES('Morning Yoga') RETURNING id;")
n <- fetch(woname)

# final table
dbcall <- paste("INSERT INTO \"Activity\" (start, wotype, name, metrics, results) VALUES('25-07-2017  06:21:38',",
                 t, n, m, r, ");")
dbSendQuery(con, call)

# commit transactions
dbCommit(con)

# disconnect from database
dbDisconnect(con)

#2

Hi @LilithElina, thank you for posting the question here as well. And you’re right, there’s no easy solution at this time, so I went ahead and opened an Issue in GitHub to look into that (https://github.com/tidyverse/dplyr/issues/3124). Hopefully the workaround that I posted in SO can help a little with what you need to do, I’m posting it here as well for the benefit of the community:

    library(dplyr)
    library(dbplyr)
    library(purrr)
    
    # Setting up a SQLite db with 3 tables
    con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
    copy_to(con, filter(mtcars, cyl == 4), "mtcars1")
    copy_to(con, filter(mtcars, cyl == 6), "mtcars2")
    copy_to(con, filter(mtcars, cyl == 8), "mtcars3")
    
    
    
    # Pre-process the SQL statements
    tables <- c("mtcars1","mtcars2","mtcars3")
    all_results <- tables %>%
      map(~{
        tbl(con, .x) %>%
          summarise(avg_mpg = mean(mpg),
                    records = n()) %>%
          sql_render() 
      })
    
    # Execute the SQL statements, 1st one creates the table
    # subsquent queries are insterted to the table
    first_query <- TRUE
    all_results %>%
      walk(~{
        if(first_query == TRUE){
          first_query <<- FALSE
          db_save_query(con, ., "results")
        } else {
          dbExecute(con, build_sql("INSERT INTO results ", .))
        }
      })
    
    
    tbl(con, "results")
    
    dbDisconnect(con)
      

#3

How often do you do this? Perhaps writing a stored procedure on the database would be a simpler option. This would allow you to write the code to send the data to their respective tables in one place, and simplify the code in R.


#4

Thank you! I haven’t yet found the time to try this, but it looks interesting.


#5

The project I need this for is a little private thing I’m doing in my free time. I’m definitely doing it once and hope to do it a few times more in the future, but that’s not for sure yet (getting the raw data is a pain).
Anyway, it will be done as often as I upload new data into the database, so if there’s a procedure on the database side I could implement, that probably would make sense. I’m a beginner, though, and didn’t even know that was possible. I’ll look into it. :slight_smile: