Writing 1 million rows * 52 columns

What is the best way to write ~1 million rows of data with around 52 columns? I have RStudio Connect and RStudio Pro, and I am pulling the data out of AWS Redshift. I tried rendering the data using datatable (from the DT package), but the connection times out every time.

Hi,

A few questions:

  • What is the goal of this project? Are you just pulling data from AWS and storing it, or are you transforming the data in R?
  • Is the problem loading the data into R when pulling it from AWS, or writing it to a file/database once you have loaded/manipulated it in R?
  • What data types are in the 52 columns? Are they all text or numbers, or is it mixed data?
  • What is the estimated file size? Try pulling a smaller number of rows (e.g. 1000) and see how big that is (see the sketch after this list).
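
For example, a quick way to estimate the size is to pull a small sample and scale it up. A rough sketch, assuming a DBI connection object called con and a table called my_table (both placeholders for your actual Redshift connection and table name):

# Rough size estimate: pull a small sample and scale up
library(DBI)

sample_df <- dbGetQuery(con, "SELECT * FROM my_table LIMIT 1000")

# In-memory size of the sample, in bytes
sample_bytes <- as.numeric(object.size(sample_df))

# Scale up to ~1 million rows and convert to MB
estimated_mb <- sample_bytes * (10^6 / nrow(sample_df)) / 1024^2
estimated_mb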

PJ


Hi,

Thanks for your reply.

  1. Yeah, it’s a very basic use case. I pull the data and do some data wrangling. I should have mentioned this before: I am also joining the data with two static Excel sheets, and then I am supposed to make the data available to users.
  2. The problem is while writing the data. I am looking to use a flexdashboard (or equivalent) to make it available to users, but every time I try rendering the table, I get the timeout issue.
  3. The data types are mixed, mostly text and numbers, but there are some dates too.
  4. Do you mean the file size as a CSV or Excel file, or the memory usage in R?

Manish

Hi,

Let's start with the file size in memory (in R). Remember that if you have an app, the data needs to be sent back and forth between the server and the client, so if users want to download a very large file you might run into trouble there. Also, displaying data with 1 million rows and 52 columns is not very useful. If you want them to see some results, I think you should find ways of filtering the data before displaying it (see the sketch below).
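
To illustrate the filtering idea, here is a rough sketch where the user picks a category and only that subset is rendered. The data set and its category column are made up just for this example:

library(shiny)
library(DT)

# Hypothetical large data set with a category column to filter on
big_df <- data.frame(
    category = sample(LETTERS[1:10], 10^6, replace = TRUE),
    value    = runif(10^6)
)

ui <- fluidPage(
    selectInput("cat", "Category", choices = sort(unique(big_df$category))),
    DT::dataTableOutput("filtered")
)

server <- function(input, output) {
    output$filtered <- DT::renderDataTable({
        # Only the filtered subset is rendered, and DT's default
        # server-side processing pages through it in chunks
        big_df[big_df$category == input$cat, ]
    }, options = list(pageLength = 25))
}

shinyApp(ui, server)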

I'm still not sure where the issue lies exactly, because you first say the issue is with writing the data, and then that the issue is rendering the table. These are two very different issues :)

PJ

Let's start with a base case where the data is local, and see whether the performance is OK and whether your code for this is in good order.
Here is a simple example of showing 1 million × 52:

library(shiny)
library(DT)
library(tibble)  # for as_tibble()

# UI: a single data table output
ui <- fluidPage(
    DT::dataTableOutput("testdtout")
)

# Example data: 1 million rows x 52 columns of random numbers
examplem <- matrix(data = runif(n = 52 * 10^6), nrow = 10^6, ncol = 52)
example_df <- as_tibble(examplem, .name_repair = "unique")  # give the columns unique names

# Server: render the example data frame with DT
server <- function(input, output) {

    output$testdtout <- DT::renderDataTable({
        example_df
    })
}

# Run the application
shinyApp(ui = ui, server = server)

@pieterjanvc no, the issue is with rendering the table. I said I am rendering it using the datatable package, but I'm not writing the data to a file; I'm just trying to render it.

Thanks, let me try this.

This worked pretty fast, in seconds. Could the issue be that I am creating an .Rmd file to publish the results? I was using a flexdashboard .Rmd for data wrangling + publishing.

I don't think the .Rmd would be an issue.
I suggest the next step would be to replace example_df with a data frame that is representative of your typical table (see the sketch below). Can you do that?
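
Something roughly like this could stand in for your real table; the column names and types are just guesses at a mixed text/number/date layout:

library(tibble)

# Dummy data frame mimicking a mixed-type table: adjust to match your real schema
n <- 10^6
example_df <- tibble(
    id      = seq_len(n),
    label   = sample(c("alpha", "beta", "gamma"), n, replace = TRUE),
    amount  = runif(n, 0, 1000),
    created = as.Date("2020-01-01") + sample(0:365, n, replace = TRUE)
)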

No, I am still getting the error. I tried both .R and .Rmd scripts, and also tried after subsetting the data down to only 2000 records. I am getting the error below:

An error has occurred

Unable to connect to worker after 60.00 seconds; startup took too long.

I found a previous thread here with a similar error. It seems to relate to accessing the database rather than to rendering data you have already downloaded successfully. A simple next step might be to add some print statements marking the milestones of your code (a rough example follows); that way you can check the logs and see how far it gets before going wrong.
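
For example, milestones could be as simple as this; the connection and wrangling steps are only sketched as comments, so substitute your real code:

# Milestone logging: these messages show up in the logs on RStudio Connect,
# so you can see how far the report/app gets before the timeout
message("Connecting to Redshift... ", Sys.time())
# con <- DBI::dbConnect(...)

message("Querying data... ", Sys.time())
# raw_df <- DBI::dbGetQuery(con, "SELECT ...")

message("Wrangling / joining with Excel sheets... ", Sys.time())
# wrangled <- ...

message("Rendering table... ", Sys.time())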


Thanks a lot for your help.

Just to update: I pushed all the heavy lifting into another .Rmd file and used the file written out by that .Rmd in my Shiny app. It worked really smoothly. I'm not sure, but maybe it was the data wrangling part that was taking some time and causing the timeout issue.
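
Roughly, the pattern I ended up with looks like this (file and object names are placeholders, not my actual code):

## In the data-prep .Rmd (run separately / scheduled on RStudio Connect):
wrangled <- data.frame(id = 1:10, value = runif(10))  # placeholder for the real Redshift query + Excel joins
saveRDS(wrangled, "wrangled_data.rds")

## In the Shiny app / flexdashboard, just read the prepared file:
library(shiny)
library(DT)

wrangled <- readRDS("wrangled_data.rds")  # fast; no database work at app startup

ui <- fluidPage(DT::dataTableOutput("tbl"))

server <- function(input, output) {
    output$tbl <- DT::renderDataTable(wrangled)
}

shinyApp(ui, server)

Reading a pre-built .rds at startup keeps the app itself lightweight, so the database work and joins never run inside the app's startup window.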

