Relational tibbles - do they need to be joined?

rgallego · June 19, 2018, 5:57am

I work with relatively large datasets: the main tibble is a tidy, long tibble (millions of rows) with three columns: Sample, Hash, number of occurrences. Each row is an observation. In order to plot, model, etc, I need to group samples according to their metadata, and hence there is another tibble with as many rows as samples (~100 rows and ~10 columns: Sample, Site, Month,...).

So, in order to, let's say, obtain a boxplot of the number of unique sequences per sample at each site, I will have to join the two tibbles. A workflow will look like

library(RCurl)
library(tidyverse)

long_tibble<- read_csv(file = getURL("https://raw.githubusercontent.com/ramongallego/dplyr_question/master/example_long_tible.csv") )

factor_tibble <- read_csv (file = getURL("https://raw.githubusercontent.com/ramongallego/dplyr_question/master/example_factor_tible.csv"))

together<- left_join(long_tibble, factor_tibble) 

ggplot (data = together, aes (x= Site, y = nunique)) + 
  geom_boxplot(data = together %>%
 group_by(sample, Site) %>%
 summarise(nunique = n_distinct(Hash)))

That works and gives me the plot I am after. My question is whether it is mandatory to formally join both tibbles, or if it is possible to generate the group_by of tibble #1 with the grouping information of tibble #2.

This may seem trivial but I think with really long tibbles, any new column added should increase the size of the object dramatically, right?

So my question is which is the tidiest way of dealing with relational tibbles

nwerth · June 19, 2018, 2:22pm

If each sample is in exactly one row of factor_tibble, then isn't group_by(sample, Site) the same as group_by(sample)?

In answer to your question, you could keep the tables separate to reduce memory usage and only join the necessary columns as part of a pipeline. That way, it takes much less memory which is (loosely) "freed up" after the pipe's done.

long_tibble %>%
  left_join(factor_tibble[, c("sample", "Site")], by = "sample") %>%
  group_by(sample, Site) %>%
  summarise(nunique = n_distinct(Hash)) %>%
  ggplot(aes(x = Site, y = nunique)) + 
  geom_boxplot()

rgallego · June 19, 2018, 3:28pm

Thanks for that Nathan,

First, fair point on the sample/site group redundancy.

Second, your solution avoids storing the join tibble in a Robject and that's an improvement. I guess what I was hoping for is a way of passing the grouping without actually merging the tibbles, even under the hood.

Thanks again

nwerth · June 19, 2018, 4:36pm

A data.table is really great for making fast lookup tables. Of course, it's still technically a merge, but the package optimizes a lot of stuff.

library(data.table)

key_dt <- factor_tibble %>%
  select(sample, Site) %>%
  as.data.table()

summary_dt <- key_dt[
  long_tibble,
  on = "sample"
][
  ,
  list(nunique = n_distinct(Hash)),
  by = list(sample, Site)
]

ggplot(summary_dt, aes(x = Site, y = nunique)) + 
  geom_boxplot()

You could also look into using a database, like SQLite.

jlacko · June 19, 2018, 6:59pm

There is beauty in normal form of data (... so help me Codd). Hard to improve upon in an abstract way.

Of course not all constraints are as hard today as they were in the late 1960's - e.g. storage costs mean something different now and then.

Still, in your case I would suggest offloading the main and the lookup tibbles into a database. Especially if you expect your work to involve a lot of grouping / subsetting. These are the kind of operations in which relational databases shine.

You will move from an environment of available memory constraint (R) to an environment of available disk space constraint (RDBS). In most use cases the hard drive wins.

rgallego · June 19, 2018, 7:33pm

Thanks for the input Jindra, this is very exciting - would you mind pointing me in the direction of some online resources to start working on this?

Many thanks

Ramón Gallego, PhD

jlacko · June 20, 2018, 8:32am

Thanks for your kind words!

A good starting point for working with databases is https://db.rstudio.com/ - @edgararuiz has a nice webinar in the media section.

About design of relational model you can have a quick web search on star schema - which is the simplest one (and describes nicely your situation - one fact table with one or more dimensions).

To get more fancy you can look up the great Inmon vs. Kimball controversy (hint: it is not much of a controversy, but why spoil a good story?)