dbplyr - lazy query: problem with median function in MSSQL

Morning All,

i'm using and learning dbplyr with a Microsoft SQL Server, but i'm in trouble with a simple task.
I want to calculate some descriptive statistic on a variable on all table, so i have not a grouping variable, but and i receive an error. I don't understand very well how to handle dbplyr's group_by clauses that translate SQL's over() clauses.

My lazy query

tbl_query <- tbl(con, "table")

dbplyr query

tbl_query %>% 
    select(numeric_variable) %>%
    summarise(min = min(numeric_variable),
              mean = mean(numeric_variable),
              median = quantile(numeric_variable, 5),
              max = max(numeric_variable),
              sum = sum(numeric_variable),
    )

the error

Error: <SQL> 'SELECT TOP(11) MIN("numeric_variable") AS "min", AVG("numeric_variable") AS "mean", PERCENTILE_CONT(5.0) WITHIN GROUP (ORDER BY "numeric_variable") AS "median", MAX("numeric_variable") AS "max", SUM("numeric_variable") AS "sum"
FROM (SELECT TOP 100 PERCENT "idconsole", "idconsole" AS "numeric_variable"
FROM "lead") "dbplyr_027"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]The function 'PERCENTILE_CONT' must have an OVER clause. 

the only way is to think in SQL way and calculate all aggregate function as window function and at the end keep only a row? like this

tbl_query %>% 
    select(numeric_variable) %>%
    mutate(min = min(numeric_variable, na.rm = TRUE),
           median = quantile(numeric_variable, 0.5),
           mean = mean(numeric_variable, na.rm = TRUE),
           max = max(numeric_variable, na.rm = TRUE),
           sum = sum(numeric_variable, na.rm = TRUE)) %>%
    ungroup() %>% 
    head(n = 1)

there's a way to give in dbplyr different group_by condition for the different aggregation functions?

thanks in advance.
MC

Hi, I think you need to specify the column to order by, which in this case will be numeric_variable:

tbl_query %>% 
    select(numeric_variable) %>%
    arrange(numeric_variable) %>%
    summarise(min = min(numeric_variable),
              mean = mean(numeric_variable),
              median = quantile(numeric_variable, 5),
              max = max(numeric_variable),
              sum = sum(numeric_variable),
    )

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.