How to improve dbWriteTable speed on Hadoop?

I am trying to write an R data frame (50K rows × 206 columns, with multiple column types) as a table in Hadoop (Cloudera, Impala, Hue) using dbWriteTable. It is exceptionally slow. Even after splitting the data frame into chunks of 1000 rows, I have only managed to write 5000 rows in the last 2 hours.

Is there a way to speed this up?
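For example, would batching each chunk into one multi-row INSERT cut down on per-row round trips? A minimal sketch of what I mean, assuming con is the open DBI/odbc connection and that the target table sandbox_schema.df from the reprex below already exists with matching column types (the helper name append_chunk_sql is just for illustration):

library(DBI)

append_chunk_sql <- function(chunk) {
  # build a single INSERT ... VALUES (...), (...), ... statement for the whole
  # chunk, so each chunk costs one statement instead of one insert per row
  sql <- sqlAppendTable(con, SQL("sandbox_schema.df"), chunk, row.names = FALSE)
  dbExecute(con, sql)
}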

Following is a reprex of my current approach:

library(DBI)
library(purrr)

# con is an existing DBI connection to Impala (e.g. via odbc), created elsewhere

# create a big data frame: 205 columns of mixed integer/character types
type <- sample(c("integer()", "character()"), 205, replace = TRUE)
col  <- paste0("col", 1:205)

create_big_df <- function(x) {
  if (x == "integer()") {
    sample(1:10000, 50000, replace = TRUE)
  } else {
    sample(letters, 50000, replace = TRUE)
  }
}

df <- data.frame(map(type, create_big_df))
colnames(df) <- col

dim(df)
#> [1] 50000   205

# split df into chunks of 100 rows each
split_df <- split(df, cut(1:nrow(df), nrow(df) / 100))

# function to append one chunk to the target table
# (dbAppendTable always appends, so no append argument is needed)
write_tbl_sandbox_1 <- function(df, n) {
  print(n)
  dbAppendTable(con,
                SQL("sandbox_schema.df"),
                df)
}

# append each chunk in turn
iwalk(split_df, write_tbl_sandbox_1)
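Alternatively, would it be better to skip row-wise inserts entirely and bulk-load a staged file? A rough sketch of that route, assuming sandbox_schema.df is (or is recreated as) a comma-delimited text table, that I have write access to HDFS, and with every path below as a placeholder:

library(readr)
library(DBI)

# write the whole data frame out as a headerless CSV
write_csv(df, "/tmp/df.csv", col_names = FALSE)

# copy the file into HDFS (placeholder paths)
system("hdfs dfs -put -f /tmp/df.csv /user/myuser/staging/df.csv")

# move the staged file into the table with a single statement
dbExecute(con, "LOAD DATA INPATH '/user/myuser/staging/df.csv' INTO TABLE sandbox_schema.df")

The idea is that one file copy plus one LOAD DATA statement would replace thousands of individual inserts, which seems to be where the time is going.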
