How can I supply a list nested in a dataframe to `glue_data_sql()`

DaveRGP · February 1, 2018, 3:49pm

I like glue, but I feel like I've missed a detail.

starwars %>% 
  slice(1:2) %>% 
  glue_data_sql("SELECT *
                FROM tbl
                WHERE IN {films}")

This code feeds a list column into the query. It is nearly correct, however the answer still has a proceeding c as the films column is a list column. It also has the contents of the list enquoted and escaped with `("item1", "item2") which doesn't work with redshift at least.

<SQL> SELECT *
FROM tbl
WHERE film IN c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens")
<SQL> SELECT *
FROM tbl
WHERE film IN c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope")

Adding the * to collapse doesn't help in this case.

I've attempted to fix it by string manipulation:

starwars %>% 
  slice(1:2) %>% 
  glue_data_sql("SELECT *
                FROM tbl
                WHERE film IN {films}") %>% 
  str_replace_all("c\\(", "(")

[1] "SELECT *\nFROM tbl\nWHERE film IN (\"Revenge of the Sith\", \"Return of the Jedi\", \"The Empire Strikes Back\", \"A New Hope\", \"The Force Awakens\")"                           
[2] "SELECT *\nFROM tbl\nWHERE film IN (\"Attack of the Clones\", \"The Phantom Menace\", \"Revenge of the Sith\", \"Return of the Jedi\", \"The Empire Strikes Back\", \"A New Hope\")"

I thought this might be an issue about needing to escape brackets properly, but this modification which doesn't reply on using the brackets gives the same output.

starwars %>% 
  slice(1:2) %>% 
  glue_data_sql("SELECT *
                FROM tbl
                WHERE film IN {films}") %>% 
  str_replace_all("IN c", "IN ")

But this seems to change the some details (or at least the print method) and also inputs escapes before the " which confuses me (probably my bad regex skills). Have I missed a detail of glue/listcolumns that will make this work?

stephlocke · February 1, 2018, 5:48pm

This is a fairly gnarly problem right now. Here is a solution (but I might have a better one later)

library(tidyverse)
library(glue)
starwars %>% 
  slice(1:2) %>% 
  unnest(films) %>% 
  group_by( name) %>% 
  glue_data("select * FROM tbl
            WHERE IN ({films})",
            films=glue(collapse(glue("'{films}'"), sep=",")))

DaveRGP · February 1, 2018, 6:26pm

Thanks, it's a lot closer in terms of " and c. My output looks like this:

select * FROM tbl
WHERE IN ('Revenge of the Sith','Return of the Jedi','The Empire Strikes Back','A New Hope','The Force Awakens','Attack of the Clones','The Phantom Menace','Revenge of the Sith','Return of the Jedi','The Empire Strikes Back','A New Hope')

It seems this returns 1 query though, with every possible film. In my irl use case it's important to maintain isolation of the equivalent to films though, where in the first query the contents of WHERE film in ({films}) is specific to the row and it's other parameters, and distinct to the other cells in the list column. i.e. query one in the output here should contain ('Revenge of the Sith','Return of the Jedi','The Empire Strikes Back','A New Hope','The Force Awakens') for Luke Skywalker (5 items), and another query should return ('Attack of the Clones', 'The Phantom Menace', 'Revenge of the Sith', 'Return of the Jedi', 'The Empire Strikes Back', 'A New Hope') for C-3PO (6 items).

stephlocke · February 1, 2018, 6:48pm

oh it's not respecting the group_by! hmm

This time without glue

library(tidyverse)
library(glue)
starwars %>%
  slice(1:2) %>%
  unnest(films) %>%
  group_by_at( vars(everything(),-films)) %>%
  summarise(sql=paste0("select * from tbl where films in (",paste("'",films,"'", sep="",collapse=","),")")) %>% 
  pull(sql)

jimhester · February 1, 2018, 7:43pm

The glue equivalent would be

library(tidyverse)
library(glue)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
starwars %>%
  slice(1:2) %>%
  unnest(films) %>%
  group_by_at( vars(everything(),-films)) %>%
  summarise(sql=glue_sql("select * from tbl where {films*} in films", .con = con)) %>%
  pull(sql)
#> <SQL> select * from tbl where 'Attack of the Clones', 'The Phantom Menace', 'Revenge of the Sith', 'Return of the Jedi', 'The Empire Strikes Back', 'A New Hope' in films
#> <SQL> select * from tbl where 'Revenge of the Sith', 'Return of the Jedi', 'The Empire Strikes Back', 'A New Hope', 'The Force Awakens' in films

DaveRGP · February 2, 2018, 9:12am

Thanks both. Appears unnesting is definitely something I overlooked here, as well as summarise. I think I can apply this to my IRL now.

The IRL is a little more complex than the query above, and the target sequel is more like:

<SQL> SELECT * FROM tbl WHERE character = 'C-3PO' AND film IN ('Attack of the Clones', 'The Phantom Menace', 'Revenge of the Sith', 'Return of the Jedi', 'The Empire Strikes Back', 'A New Hope')
<SQL> SELECT * FROM tbl WHERE character = 'Luke Skywalker' AND film IN ('Revenge of the Sith', 'Return of the Jedi', 'The Empire Strikes Back', 'A New Hope', 'The Force Awakens')

This previously seemed straight forward to implement, but in the context of the new solution needs some tweaking. What I've come up with is:

starwars %>%
  slice(1:2) %>%
  unnest(films) %>%
  group_by_at( vars(everything(),-films)) %>%
  summarise(
    sql=glue_sql("SELECT * FROM tbl WHERE character = {name} AND film IN ({films*})",
                 name = name[1], 
                 .con = con)) %>%
  pull(sql)

Where in each iteration of the name parameter, the function on it's own returns the value of name a number of times equal to the length of film. The modification above takes the first entry of that vector, and supplies just that, not the full vector.

starwars %>%
  slice(1:2) %>%
  unnest(films) %>%
  group_by_at( vars(everything(),-films)) %>%
  summarise(
    sql=glue_sql("SELECT * FROM tbl WHERE character = {name} AND film IN ({films*})",
                 name = unique(name), 
                 .con = con)) %>%
  pull(sql)

Is an alternative, but if things change around it could potentially return a vector of length two, which wouldn't be suitable.

stephlocke · February 2, 2018, 10:00am

That * is nifty!

Change {name} to {unique(name)} to dedupe as opposed to using a vector selection Dave. The integrity of length 1 here is protected by the group by.