DBI::dbWriteTable is slow for writing to Azure

dbplyr

#1

I have read a few issues on github relating to write speed, but I have not found a workaround for the problem.

I currently have a 2.5MB file I would like to write to Azure. Below is an example of the code I am using.

The problem is the the write time is close to 40mins! What can I do to speed up the write speed?

 # connect to Azure
  con_azure <- DBI::dbConnect(odbc::odbc(), Driver = "SQL Server Native Client 11.0",
                                Server = "name-server.database.windows.net",
                                Database = "work_dw", UID = "user", PWD = "pass")

  # write to data to table
  my_file #2.5MB
  tictoc::tic()
  DBI::dbWriteTable(con_azure, DBI::Id(schema = "my_dev", name = "new_table"), my_file)
  tictoc::toc()
# 2305.56 sec elapsed

#2

Is this question something that is better posted in this community or as an issue on github?


#3

Hi @alex628, there are several factors that may influence the speed. The first thing I'd try is to do a comparison between this and how long it takes using a tool outside R, such as SQL Studio.


#4

Thanks for the suggestion @edgararuiz.

I just gave this a try. The alternative tool I ended up using was SQL Server Import and Export Wizard. I took the same 2.5MB file and saved as a csv then used the Wizard to upload to the Azure database. The write time was 8.32 seconds. This was a huge improvement over the ~40 minutes in R.

Do you have any suggestions on what I can try to speed up write times from R?


#5

Hi @edgararuiz, I was wondering if you knew of any alternative solutions.


#6

Hi @alex628, is there a sample file you can create using R code that you can test and share the code with me to run some tests? Maybe using something like rnorm() ?


#7

Thanks @edgararuiz here are a 3 speed tests.

library(tidyverse)
library(odbc)
library(tictoc)
con_azure <- DBI::dbConnect(odbc::odbc(), Driver = "SQL Server Native Client 11.0", 
Server = "my-server.database.windows.net", 
Database = "my-dw", UID = "me", PWD = "***")

# 7.97 sec elapsed
tic()
df100 <- rnorm(n = 100) %>% as.tibble()  
DBI::dbWriteTable(con_azure, DBI::Id(schema = "MY_DEV", name = "speed_test1"), df100, append = TRUE)  
toc()

# 64.49 sec elapsed
tic()
df1e3 <- rnorm(n = 1e3) %>% as.tibble()  
DBI::dbWriteTable(con_azure, DBI::Id(schema = "MY_DEV", name = "speed_test1"), df1e3, append = TRUE)  
toc()

# 647.22 sec elapsed
tic()
df1e4 <- rnorm(n = 1e4) %>% as.tibble()  
DBI::dbWriteTable(con_azure, DBI::Id(schema = "MY_DEV", name = "speed_test1"), df1e4, append = TRUE)  
toc()

#8

To add another nail in the coffin. I am unable to write tables alternative ways to the Azure database.

If there are alternatives to the following commands that I can try I would love to learn about them.

Also, these examples were taken from the Rstudio website in the schema selection section.

As far as I know there is no efficient way to push a table to an Azure schema from R, but I would love to be proven wrong.

Alternative 1

df1e4 <- rnorm(n = 1e4) %>% as.tibble() 
dplyr::copy_to(con_azure,DBI::Id(schema = "MY_DEV", name = "speed_test1"), df1e4)

Error: is_string(x = name) is not TRUE

Alternative 2

DBI::dbWriteTable(con_azure, df1e4, SQL("MY_DEV.speed_test1"))

Error in SQL("MY_DEV.speed_test1") : could not find function "SQL"


#9

Hi, can you try figuring where the slow down is, and then open an Issue on that package's repo?


#11

Hi @edgararuiz. I would open an issue, but I am struggling to know how to know which package is the issue. I am testing the same thing but with RODBC as opposed to DBI

# 15.04 sec elapsed
tic()
sqlQuery2 <- "drop table [MY_DEV].[speed_test1]"
RODBC::sqlQuery(channel = conn, sqlQuery2)
df100 <- rnorm(n = 100) %>% as.tibble()  
RODBC::sqlSave(channel = conn, dat = df100, tablename = "MY_DEV.speed_test1")
toc()

# 142.03 sec elapsed
tic()
sqlQuery2 <- "drop table [MY_DEV].[speed_test1]"
RODBC::sqlQuery(channel = conn, sqlQuery2)
df1e3 <- rnorm(n = 1e3) %>% as.tibble()
RODBC::sqlSave(channel = conn, dat = df1e3, tablename = "MY_DEV.speed_test1")
toc()

Maybe I can list what I have done to get pointed in the right direction.

SSIS Import Wizard write speeds to Azure are fast (TRUE)
Local (non-Azure) database write speeds as fast (TRUE)
DBI library write speeds to Azure are fast (FALSE)
RODBC library write speeds to Azure are fast (FALSE)

To me it almost seems like an Azure problem. It does not seem specific to library since DBI and RODBC were slow uploading to Azure, but fast uploading to local databases. It does not seem like a connectivity problem since the Import Wizard to Azure was fast.


#12

Hi,

Actually you are changing drivers. One for your local test and one for the remote azure.

Testing with dedicated tool, shows that the connection is likely not an issue.

So you should check what driver the dedicated tool is using and what driver R libraries are using. Could be different ones.

Another layer to investigate, would be the driver profiles used. Default values could be different. Slow upload could be linked to how often records are commited. Other sensitive parameters are types conversion, transactions, etc.

You could try to measure if creating table and inserting records behave different. You can also check if inserting records in 2 batch still sum to the same time.

Regards,

jm


#13

Thanks @jm_t,

So you should check what driver the dedicated tool is using and what driver R libraries are using. Could be different ones.

Thanks for pointing that out. Both the SSIS Import Wizard and the R libraries are using the SQL Server Native Client 11.0 driver.

Slow upload could be linked to how often records are commited.

I think this is possible, but I am not sure how to test it.

You could try to measure if creating table and inserting records behave different. You can also check if inserting records in 2 batch still sum to the same time.

I will check this.


#14

My experiences says this is a VERY high probability. One way to check is to sniff the actual commands being sent over the wire. Every time I have had huge slow downs when loading data it was caused by the write query committing every single row.

Ideas: https://dba.stackexchange.com/questions/63138/spy-sql-queries-generated-by-an-application


#15

@jdlong Thanks, that was very useful. I have to wait to get sysadmin permissions to use the SQL Server Profiler. So, lets assume that it turns out that the write query is committing every single row. How have you worked around this problem?

Is it a setting I would change on the SQL Server side?

Or is there a workaround in R to force queries to be committed in a batch?


#16

I think this is a client issue, not a server side issue. What is your OS where R is running? Is it Windows too?


#17

The OS where R is running is Windows.


#18

Hi,

I have no experience with Azure, but I would assume experiences from AWS could apply.

When using a ‘write db’ from R/DBI my guess is that each record is inserted into the target table using INSERT statements. This will be incredibly slow and in my opinion something that only works with a few hundred records.

The tool/solution will be to temporarily store the output as a file (csv), and then use a COPY or LOAD statement to copy the data into the table. I would assume the tool you ended up with does exactly this.

However, I’d be surprised if there are no R libraries to bulk load data from R and into Azure, since there are for AWS/Redshift. The data flow between R and Redshift would be R -> upload S3 -> COPY Redshift.

I know this is far from a solution, but I thought I’d share some experiences that could serve as pointers.

Best,
Richard


#19

@Arfuzzum thanks for the help. The S3 analogue in Azure is blob storage which can then be copied to the data base using Azure data factory pipelines. I settled on converting my data to json, using jsonlite then pushing to blob storage. I then have a pipeline which copies to json to my Azure sql server. I do this for repetitive tasks. If it is not repeating then I just upload with the SSIS Wizard.

I have used AWS in the past and found it much easier to work with. I hope Azure's tools get better over time.


#20

This is a bcp utility solution which I was happy with. The write speeds were lightning fast.

df100 <- rnorm(n = 100) %>% as.tibble() %>% rename(rnorms = value) %>% round(2)
df100 %>% write.table("df100.csv", row.names = F, sep ="\t") 
shell('sqlcmd.exe -S my_server.windows.net -d database -U username -P password -I -Q "create table [schema].[table](rnorms decimal(30, 4) NOT NULL);"')
shell("bcp [schema].[table] in df100.csv -S my_server.windows.net -d database  -U username -P password  -q -c -t")

I did have some issues if I switched write.table with write_csv. The above bcp command likes tab delim data. I am sure a setting in the bcp command could be changed to allow for comma delim, but this works fine for me.

More bcp documentation can be found here. https://docs.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017


#21

I'm so glad you found a solution. I'm thinking about adding a reference to bpc in db.rstudio.com. Thanks for your tenacity in finding an appropriate way of doing this!