Ideas for SQL storage for a low budget hobby project

I've got a hobby project that I'm wondering about publishing...

As a hobby project it's got a very low personal budget :slight_smile:

I think I can use shinyapps.io for the front end - I can probably start on the free tier and then stretch to the $100-per-year plan. However, I'm less sure what to do with my data... Currently I've got just over 1GB of SQL data stored in a duckdb instance. It's not going to get any smaller, and some of the queries against it need a fair amount of RAM, so I don't think it can stay bundled with the app in duckdb - it's going to need a dedicated box instead.

I don't particularly care which flavour of SQL is supported - the data is in 7 or 8 tables and is fairly simple in structure.

I'm currently considering the Hobby tier on PlanetScale (Billing - PlanetScale Documentation), but wondering if anyone has any other ideas/suggestions/experience.

Thanks

You could store your data as Parquet files in an S3 bucket; duckdb can then read from S3 directly.
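A minimal sketch of what that looks like from R (the bucket path and region are placeholders; for a private bucket you'd also set the s3_access_key_id/s3_secret_access_key options):

```r
library(DBI)
library(duckdb)

# In-memory DuckDB; the data itself stays on S3
con <- dbConnect(duckdb::duckdb())

# The httpfs extension adds S3/HTTP support
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
dbExecute(con, "SET s3_region = 'us-east-1';")

# DuckDB scans the Parquet files remotely, reading only the
# columns and row groups the query actually touches
dbGetQuery(con, "
  SELECT count(*)
  FROM read_parquet('s3://my-bucket/data/*.parquet')
")
```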


I like the idea - I've also thought about downloading the duckdb files from S3 each time the app starts (sketch at the end of this post)...

However, I'm worried this will lead to slow startup times, plus I think I'll hit the shinyapps.io RAM limits - especially on the 1GB Basic plan, but also on the other plans with their 5GB limits (see RAM limits in Chapter 3 Applications | shinyapps.io user guide).

My experience with duckdb for local development has been fabulous - it's really good - but it does also seem to use RAM quite quickly once I start building queries with joins, groups, CTEs, etc.
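For reference, the download-at-startup version I have in mind is roughly this (a sketch; aws.s3 is just one way to do the fetch, and the bucket/key names are made up):

```r
library(aws.s3)
library(DBI)
library(duckdb)

# Fetch the prebuilt database file once, when the app starts
save_object(
  object = "app-data/project.duckdb",  # hypothetical key
  bucket = "my-bucket",                # hypothetical bucket
  file   = "project.duckdb"
)

# Open it read-only for the lifetime of the session
con <- dbConnect(duckdb::duckdb(), dbdir = "project.duckdb", read_only = TRUE)
```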

My understanding is that duckdb doesn't do that - I believe it loads the data into RAM in chunks, so it will indeed be limited by the available RAM, but it won't crash, it will only get slower. That is the whole point of using an "on-disk" approach to process larger-than-memory data.
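You can also tell DuckDB explicitly how much RAM it may use and where to spill; a sketch:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "project.duckdb")

# Cap DuckDB's working memory; operators that outgrow the cap
# spill to the temp directory instead of exhausting RAM
dbExecute(con, "SET memory_limit = '1GB';")
dbExecute(con, "SET temp_directory = '/tmp/duckdb_spill';")
```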

If I can't find any other solution, then I will try that - but I am worried that it'll still require a long startup time for the app.

The largest table is over 0.5GB in size, so unless I start manually partitioning it, it'll likely take many seconds to transfer to shinyapps.io - 0.5GB is roughly 4 gigabits, so even a full 1Gbit/s link needs 4+ seconds before any overhead - even if I colocate the S3 bucket in AWS US-East to make use of the local gigabit networking there.

Currently looking at PlanetScale - but struggling because I can't see any easy way to migrate the data there... (writing 100M rows remotely would definitely break any hobby-sized budget)

Have you tried? I haven't myself, but I've seen a lot of people claiming they can query GB-sized data files on S3 in a few seconds, as long as the collected result is not that big. The idea is to push computation to the RDBMS as much as possible and collect only summarised data.
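In dplyr/dbplyr terms the pattern would be something like this (a sketch; the S3 path and column names are made up):

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
dbExecute(con, "SET s3_region = 'us-east-1';")

# A lazy reference to the remote Parquet data - nothing is read yet
events <- tbl(con, sql(
  "SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')"
))

# The aggregation executes inside DuckDB against S3;
# only the summarised rows are pulled into R
monthly <- events |>
  count(month) |>
  collect()
```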

No - I haven't tried.

Still at the stage of asking for recommendations. And it's a hobby project, so I've only got so much time to experiment with, and only so much budget to play with.

It would be interesting to see if anyone who has tried it has some numbers they can share :+1: If anyone has links to conversations/blogs/etc. about using GBs of data from S3 in shinyapps.io, I'd be interested to read them.

I think this YouTube video gives a nice example of how it performs - although the test data is way smaller than yours, maybe you could partition.

Thanks

Sadly partitioning is not really going to help, because the nature of the queries I want to enable means they need to see a wide slice of the data :confused:

I could get some of the results I want by sampling data randomly... but I genuinely have 80 million rows in one table that I want to analyse - it's a nice problem to have :slight_smile:
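If I do end up sampling, DuckDB at least makes that cheap to express (sketch with a made-up table name):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "project.duckdb")

# bernoulli sampling: scans the table but returns roughly 1% of rows
sample_df <- dbGetQuery(con, "
  SELECT * FROM big_table USING SAMPLE 1% (bernoulli)
")
```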

How about using AWS Athena to read from S3? Since both the storage and the RDBMS are in the cloud, you should get better performance.
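From R that could look something like this (a sketch assuming the noctua package, which provides a DBI backend for Athena; the bucket and table names are placeholders):

```r
library(DBI)
library(noctua)

# Athena needs an S3 location for its query results
con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://my-bucket/athena-results/",
                 region_name    = "us-east-1")

# The scan runs inside Athena; R only receives the result set
res <- dbGetQuery(con, "SELECT count(*) FROM my_database.events")
```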

I've used Athena in the past... the SQL is very non-standard and the pricing is somewhat hard to predict - but it does work, so it's definitely an option.

I think the main challenge I have here is that duckdb SQL on my laptop has worked exceptionally well... so I'd love to get a server-side replica of that same experience - without too much rewriting and without any budget :slight_smile:

I'm trying the S3 route at the moment - will see how it goes :slight_smile:


Good luck! Please tell us how it goes, this is a very interesting topic.


The project continues....

PlanetScale offered free data upload to start with, so I've been playing with that - it's working well, but I don't think it's a long-term solution (not unless they add other bulk upload/import mechanisms).

Locally I'm still using duckdb and it continues to impress :+1: (although the files got large quickly when I tried to add composite keys)

Still experimenting and playing - but time is a bit short at the moment - I'll update when I have anything more significant to add.


Also linking in How Many Shiny Apps Can You Host for Free? <- not SQL-specific, but an interesting summary. I'd not heard of fly.io before - I might have to up my Docker game to see how it performs!