I work with relatively large datasets: the main tibble is a tidy, long tibble (millions of rows) with three columns: Sample, Hash, and number of occurrences. Each row is an observation. In order to plot, model, etc., I need to group samples according to their metadata, so there is a second tibble with one row per sample (~100 rows and ~10 columns: Sample, Site, Month, ...).
So to obtain, say, a boxplot of the number of unique sequences per sample at each site, I have to join the two tibbles. A typical workflow looks like this:
library(RCurl)
library(tidyverse)

# Read the long tibble (one row per observation) and the per-sample metadata
long_tibble <- read_csv(file = getURL("https://raw.githubusercontent.com/ramongallego/dplyr_question/master/example_long_tible.csv"))
factor_tibble <- read_csv(file = getURL("https://raw.githubusercontent.com/ramongallego/dplyr_question/master/example_factor_tible.csv"))

# Join on the columns the two tibbles share
together <- left_join(long_tibble, factor_tibble)

# Count unique hashes per sample, keeping Site so it can go on the x axis
per_sample <- together %>%
  group_by(sample, Site) %>%
  summarise(nunique = n_distinct(Hash))

ggplot(data = per_sample, aes(x = Site, y = nunique)) +
  geom_boxplot()
That works and gives me the plot I am after. My question is whether it is mandatory to formally join both tibbles, or whether it is possible to group tibble #1 using the grouping information held in tibble #2.
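To make the alternative concrete, here is a minimal sketch of one option: summarise the long tibble on its own first, then join the resulting small per-sample table to the metadata, so the join touches roughly one row per sample instead of millions. The toy tibbles and their values (including the `nReads` column name) are made-up stand-ins for the real data, and this is only an illustration, not necessarily the tidiest answer.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the two real tibbles
long_tibble <- tibble(
  sample = c("s1", "s1", "s1", "s2", "s2", "s3"),
  Hash   = c("a", "b", "b", "a", "c", "a"),
  nReads = c(10, 5, 2, 7, 1, 3)
)
factor_tibble <- tibble(
  sample = c("s1", "s2", "s3"),
  Site   = c("North", "North", "South")
)

# Step 1: collapse the long tibble to one row per sample
per_sample <- long_tibble %>%
  group_by(sample) %>%
  summarise(nunique = n_distinct(Hash))

# Step 2: join the small summary to the metadata
per_sample_site <- left_join(per_sample, factor_tibble, by = "sample")

print(per_sample_site)
```

The resulting `per_sample_site` can be passed straight to `ggplot()` with `aes(x = Site, y = nunique)`. This only helps when the summary does not itself need the metadata columns; if the grouping has to happen at the Site level before summarising, some form of join or lookup is still required.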
This may seem trivial, but I think that with really long tibbles, every new column added will increase the size of the object dramatically, right?
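One way to check that assumption rather than guess is to measure both objects with base R's `object.size()`. The sketch below uses simulated data (the sizes, sample names and the `nReads` column are assumptions, not the real dataset), so the numbers it prints are only indicative.

```r
library(dplyr)
library(tibble)

set.seed(1)
n <- 1e5  # hypothetical row count, far smaller than the real data

# Simulated long tibble: one row per observation
long_tibble <- tibble(
  sample = base::sample(paste0("s", 1:100), n, replace = TRUE),
  Hash   = base::sample(paste0("h", 1:5000), n, replace = TRUE),
  nReads = rpois(n, 5)
)

# Simulated metadata: one row per sample
factor_tibble <- tibble(
  sample = paste0("s", 1:100),
  Site   = rep(c("North", "South"), each = 50)
)

together <- left_join(long_tibble, factor_tibble, by = "sample")

# Compare memory footprint before and after the join
size_long     <- object.size(long_tibble)
size_together <- object.size(together)
print(size_long, units = "MB")
print(size_together, units = "MB")
```

Running the same comparison on the real tibbles would show how much each extra metadata column actually costs.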
So my question is: what is the tidiest way of dealing with relational tibbles?