Using Pre-Aggregated Data - Looking for learning Resources

rstudio
recommendations

#1

Hi

How do I find resources that help me learn to use pre-aggregated data?

I'm new to RStudio and I've been trying to do some basic stuff. I have the R for Data Science book that I'm working through, but I'm struggling with pre-aggregated data, for instance the UK government drug data most conveniently available at openprescribing.net.

I've found the stat = "identity" argument in ggplot, which lets me tell it that the y axis should use my pre-aggregated numbers, but apart from that I struggle. All the training seems to assume data with one row per individual observation. When I search for blogs and write-ups, terms like "aggregate" give me the aggregate() function rather than the stuff I need.

Does anyone know of beginner resources targeted at pre-aggregated data, or could anyone help me with a couple of search terms that might give some results?

For reference my background is SQL, Business Objects and SPSS in the early 1990s.

thanks

Jon


#2

I've run into the same limitation in my work with public health statistics. What to read really depends on what you're trying to do.

If you just want to learn how to manipulate data, then you can use aggregate statistics as a substitute for individual observations. Instead of "patient X" you have "drug Y" (or whatever unit the data is organized by). Your results will probably lack publication-quality statistical rigor, but it's fine for learning.
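
To make that concrete, here is a minimal base R sketch of treating aggregate rows as the unit of analysis. The data frame rx, its column names, and every number in it are invented for illustration; they are not taken from openprescribing.net. It adds a derived column, rolls the rows up to a coarser level (roughly a SQL GROUP BY), and shows why an average of already-aggregated rows usually needs weighting by the counts behind each row.

```r
# Hypothetical pre-aggregated data: one row per practice and drug,
# not one row per prescription. All values are made up.
rx <- data.frame(
  practice = c("A", "A", "B", "B", "C", "C"),
  drug     = c("statin", "opioid", "statin", "opioid", "statin", "opioid"),
  items    = c(120, 30, 200, 50, 80, 20),     # prescriptions issued
  cost     = c(600, 450, 900, 800, 440, 280)  # total cost of those items
)

# A derived column is computed row by row, like a computed column in SQL.
rx$cost_per_item <- rx$cost / rx$items

# Roll up to a coarser level, roughly:
#   SELECT drug, SUM(items), SUM(cost) FROM rx GROUP BY drug
by_drug <- aggregate(cbind(items, cost) ~ drug, data = rx, FUN = sum)
by_drug$cost_per_item <- by_drug$cost / by_drug$items
print(by_drug)

# Averaging per-row ratios treats every practice as equally large;
# weighting by items recovers the overall cost per item instead.
statin <- rx[rx$drug == "statin", ]
unweighted <- mean(statin$cost_per_item)
weighted   <- weighted.mean(statin$cost_per_item, w = statin$items)
print(c(unweighted = unweighted, weighted = weighted))
```

The weighted figure matches the cost_per_item in by_drug for the same drug, which is a handy sanity check whenever you summarise data that has already been summarised once.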

If you want to eventually learn how to do a specific type of analysis, look up resources for doing that analysis in R. There are some great and free online sources for topics such as time series and geospatial analysis. The bookdown gallery is a nice aggregator for these. Don't worry about not knowing enough R to read these; if the text showcases R code, it'll likely start with a primer on R geared towards using relevant data.

If you find any other good resources, please keep us updated. My "to read" bookmark list can still fit a few more links!


#3

I agree with @nwerth that, for your purposes, there's no difference between individual observations, sums of individual observations, means, standard deviations or any other way of aggregating them. In SQL terms, they are all just records.
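
On the plotting side, a short sketch of what that means for ggplot2, assuming the package is installed. The rx_totals data frame and its numbers are invented for illustration. geom_col() is the shorthand for the stat = "identity" approach mentioned in the question, and the weight aesthetic is another option when you want the default counting stat to add up pre-aggregated counts.

```r
library(ggplot2)

# Hypothetical pre-aggregated totals, one record per drug (values made up).
rx_totals <- data.frame(
  drug  = c("statin", "opioid", "antibiotic"),
  items = c(400, 100, 250)
)

# geom_col() takes bar heights straight from the y column.
p_col <- ggplot(rx_totals, aes(x = drug, y = items)) +
  geom_col()

# The same plot spelled out with geom_bar() and stat = "identity".
p_identity <- ggplot(rx_totals, aes(x = drug, y = items)) +
  geom_bar(stat = "identity")

# Alternatively, keep geom_bar()'s default counting stat and let it
# sum the weight aesthetic instead of counting rows.
p_weight <- ggplot(rx_totals, aes(x = drug, weight = items)) +
  geom_bar()

print(p_col)
```

Useful search terms along these lines might be "geom_col", "stat identity" and "weight aesthetic", which tend to surface material about summary tables rather than the aggregate() function.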

One rich data resource is https://www.gapminder.org, which addresses a host of questions that can only be answered with aggregated data. Some googling will lead you to R tutorials that use it as an example.