This is a bit of a general question. I realize there are many different variables here, so I am just looking for some general guidance.
I am trying to do some analysis on a large number of CSV files. In aggregate, these files take up about 70 GB of disk space; the individual files range from ~100 MB to ~7 GB. The data in these files is very similar to the gapminder data, just a lot larger (see screenshot below).
I need to ask my company's IT team for more computing resources - my laptop has only 8 GB of RAM.
- How much computing power should I ask for?
(E.g. do I need 70 GB of RAM to read in 70 GB of data? If not, how much do I need?)
- What types of resources should I ask for (e.g. an AWS cloud instance)?
Here are the types of analysis/processing that I am looking to do:
- Summarize time series entries into coarser periods (e.g. monthly data into quarterly)
- Calculate percentages, ratios, and year-over-year growth rates
- Perform various types of joins on data coming from multiple CSV files
- After joining, nest by grouping variables to create one or more list-columns
- Fit linear regression models on the list-columns (inspired by https://r4ds.had.co.nz/many-models.html; a rough sketch follows this list)
- Create charts with ggplot2; I anticipate building a library of about 1,000 charts
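To make the nesting/modeling/plotting steps more concrete, here is a rough sketch of the kind of workflow I have in mind (tidyverse style, using the gapminder package as a small stand-in for my real data; the column names and the lifeExp ~ year model are just placeholders):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)
library(gapminder)   # stand-in dataset; my real data has a similar shape

# Nest each country's time series into a list-column, then fit one
# linear model per country and build one ggplot per country.
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(
    model = map(data, ~ lm(lifeExp ~ year, data = .x)),
    stats = map(model, glance),
    plot  = map2(data, country,
                 ~ ggplot(.x, aes(year, lifeExp)) +
                     geom_line() +
                     ggtitle(as.character(.y)))
  )

# One row of model summary statistics (R^2, etc.) per country
model_summaries <- by_country %>% unnest(stats)
```

My real workflow would have the same shape, just with many more groups, and I would then loop over the plot column with ggsave() to write out the ~1,000 charts.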
I hope this question is not too vague. Any thoughts/comments would be really helpful!