dbplyr summarize() without a group_by()

lisalendway · September 27, 2021, 8:55pm

I am trying to run some code that for sure worked last April-ish (2021). It does a summarize() without a group_by() using dbplyr. But it now gives an error. When I look at the SQL generated, it seems it automatically is looking for a grouping variable, but I don't want one - I just want an overall summary. Is there a way around this? Or should I just use SQL for this? Just trying to teach this to students, so it's not that big of a deal. Reprex is below.

library(tidyverse)   # for reading in data, graphing, and cleaning
library(dbplyr)      # for SQL query "cheating" - part of tidyverse but needs to be loaded separately
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(mdsr)        # for accessing some databases - goes with Modern Data Science with R textbook
library(RMySQL)      # for accessing MySQL databases
#> Loading required package: DBI
library(RSQLite)     # for accessing SQLite databases
#> 
#> Attaching package: 'RSQLite'
#> The following object is masked from 'package:RMySQL':
#> 
#>     isIdCurrent

con_air <- dbConnect_scidb("airlines")

overall_smry <-
  tbl(con_air, "flights") %>% 
  summarize(min_year = min(year),
            max_year = max(year),
            num_flights = n(),
            num_carriers = n_distinct(carrier))

overall_smry
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> Error in .local(conn, statement, ...): could not run statement: Unknown column '' in 'field list'

overall_smry %>% 
  show_query()
#> <SQL>
#> SELECT ``, MIN(`year`) AS `min_year`, MAX(`year`) AS `max_year`, COUNT(*) AS `num_flights`, COUNT(DISTINCT `carrier`) AS `num_carriers`
#> FROM `flights`
#> GROUP BY ``

Created on 2021-09-27 by the reprex package(v2.0.0)

edgararuiz · September 27, 2021, 9:16pm

Hi, can you share what it returns when you run class(con_air)? I think it has to do with the type of database dbplyr think it's working with. The output of that command should give us a clue of what dbplyr is seeing.

It works when I use DBI instead of the helper function from mdsr:

library(dplyr)
library(dbplyr)     
library(nycflights13)
library(RSQLite)    
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

tbl_flights <- copy_to(con, flights)


overall_smry <- tbl_flights %>% 
  summarize(min_year = min(year, na.rm = TRUE),
            max_year = max(year, na.rm = TRUE),
            num_flights = n(),
            num_carriers = n_distinct(carrier)
            )

overall_smry
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.36.0 [:memory:]
#>   min_year max_year num_flights num_carriers
#>      <int>    <int>       <int>        <int>
#> 1     2013     2013      336776           16

class(con)
#> [1] "SQLiteConnection"
#> attr(,"package")
#> [1] "RSQLite"

lisalendway · September 28, 2021, 12:00pm

It says

class(con_air)
[1] "MySQLConnection"
attr(,"package")
[1] "RMySQL"

nirgrahamuk · September 28, 2021, 12:49pm

In the short term you can use this workaround

install if you dont have it the RMariaDB package which is a more modern MySQL interface (or so I read / apparently)
make your own dbconnect function which is based on the one within 'mdsr'
like this:

dcon <- function(dbname, ...){
  {
    DBI::dbConnect(RMariaDB::MariaDB(), dbname = dbname, host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com", 
                   user = "mdsr_public", password = "ImhsmflMDSwR", 
                   ...)
  }
}

this can be used instead of con_air <- dbConnect_scidb("airlines")
i.e. after the above definition do

con_air <- dcon("airlines")

In the long term, you can raise an issue over at
https://github.com/mdsr-book/mdsr/issues
perhaps they can alter their code for future package users benefit

lisalendway · September 28, 2021, 9:02pm

Thank you! I think for now I can just skip that tiny chunk of code since the workaround might be a bit much for my students, but thanks for your suggestion. Sorry I didn't realize the problem comes from mdsr - I just assumed it was from dbplyr. Thanks again!

beanumber · September 29, 2021, 2:00pm

We're tracking this as you suggested

You can add a specious grouping variable as a hacky workaround that will get you the answer you want without having to switch DB engines:

PI_Info %>% group_by("a") %>% summarize(total = n())

Source: lazy query [?? x 2]

Database: mysql 5.5.58-0ubuntu0.14.04.1-log @.***:/wai]

"a" total

1 a 20

system · October 6, 2021, 2:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.