"Missing" workflow in tidyverse?

bragks · December 29, 2018, 4:27pm

This question/problem might stem from my limited experience with R and programming in general, but over the last few weeks there have been three posts here with questions that (in my eyes) are quite similar:

How do i use the ranking function Row_number with multiple conditions in R tidyverse

Here is my dummy dataset. ID Order Case Date_created 123456 25800265 1 2018-06-27 07:40:23 123456 25800265 1 2018-06-25 05:29:23 123456 25800265 0 2018-07-26 06:16:28 789454 25906588 1 2018-07-12 05:59:50 789454 25906588 0 2018-07-12 07:41:29 789454 25906588 0 2018-07-10 05:43:45 789454 25906588 0 2018-07-09 05:59:26 789454 25906588 0 2018-07-05 10:39:45 287541 32140567 0 2018-07-12 07:41:29 287541 32140567 0 2018-07-10 05:43:45 287541 32140567 0 2018-07-09 05:59:26 287541 32140567 0 2018-07-05 10:39:45 …

I'm not entirely sure how to explain this using the proper language, so please bear with me...(I'm hoping you'll see the resemblance of the questions my way).

In accord with the principles of tidy data the examples are all related to datasets with one observation per row.
There are repeated measures.
Generalized, the problem is something in the lines of: "For each ID, perform some operation on/return a value from column x, given a condition in column x, or possibly column y".

Given the different answers I'm having some trouble seeing a generalised approach to situations like this. I've come across this several times since I started using R and I've always managed to somehow work around it (usually with tidyr), but every time if feels annoyingly elaborate. Like I'm missing a function or simpler workflow (most likely it's just missing knowledge and not actually missing from the tidyverse..). I came across rap by Romain Francois and the first example with filter looks kind of promising, but I haven't been able to figure out how to properly use it.

What is your standard approach to situations like this? All suggestions and examples are greatly appreciated!

phiggins · January 1, 2019, 2:05pm

I think I am at a similar point. There is a pattern here of:

(1) tall and tidy datasets,
(2) repeated measures for an individual/case,
(3) a need to collapse the data to one row per individual with summary measures that are a bit complicated (not just sum, max, mean, etc.), and sometimes
(4) to join that summary measure back to the original dataset.

I am stumbling through a few examples myself, and I am working this out on the fly with individual examples, but things that seem to regularly help are

group_by individual/case
add_count (by group) or add_tally (overall) for the appropriate denominators where needed (can just do mutate with count/tally if already grouped)
occasionally arrange by time order, then dplyr::lead or lag (with order_by) help to compare data in sequential events to detect change
mutate (sometimes with case_when) to get your needed more complex variable
mutate/row_number within the individual group can come in handy
arrange
filter
optional join back to original dataset

Is that sort of what you are looking for??

bragks · January 2, 2019, 7:11am

Hey! Thank you for taking the time! This sounds pretty familiar, I guess the main challenge pops up at step (3). The op in this thread used `ìfelse()``to work around some of this.

Still though, given his data, I'd be great if you could do something like this:

library(dplyr, warn.conflicts = FALSE)

my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  arrange_if(cc == 2, ...)

Do you think scoped dplyr verbs with custom functions for .predicate and .funs could handle situations like this? I'm finding it really hard to find any good tutorials on scoped verbs.

phiggins · January 4, 2019, 10:50am

I am afraid I don't know much about scoping - still learning a lot about coding.
It would be great if someone who really knows coding could provide a guide to this sort of thing.

I usually work with clinical (medical) data, where patients have multiple repeated encounters, listed sequentially, but my unit of analysis is usually the individual patient.
Figuring this out on my own (probably not the most efficient approach) as I go:

I often use things like

total n denominator

df <- df %>% add_tally() %>% rename(total_n =n)

denominator - n encounters per patient

df <- df %>% add_count(patient_id) %>% rename(num_pt_enc = n)

number the encounters in order within patient

df <- df %>% group_by(patient_id) %>% mutate(pt_enc_no = row_number())

identify first encounter date, merge back to original dataset

df <- df %>% group_by(patient_id) %>% filter(pt_enc_no == 1) %>%
mutate(first_enc_date = enc date) %>% select(patient_id, first_enc_date) %>%
right_join(df)

identify last encounter date, merge back to original dataset

df <- df %>% group_by(patient_id) %>% filter(pt_enc_no == max(pt_enc_no) %>%
mutate(last_enc_date = enc date) %>% select(patient_id, last_enc_date) %>%
right_join(df)

then I can calculate things like time in cohort, events per year, lag to pull previous value and show change over time, make sparklines (or calculate slope) of change over time in a particular value, etc.

bj_bloom · January 4, 2019, 3:16pm

What you're fundamentally talking about are essentially a variety of window functions, which are used extensively in SQL. Discovering window functions a couple of years ago was one of those "aha" or "woah" moments in programming for me. Since I was working a lot with financial transaction data, I had a lot of use cases where I would want to retain the individual-level observations but also needed some sort of aggregate function attached to a dataset. Or I would want to rank order observations by date (or by multiple fields). Or I would have a customer ID that could join to an account table but either a missing transaction ID, or transaction data that was messy and I needed a way of tracking transactions in the order they were received. All of these things can be done using window functions in SQL, and dplyr attempts to convert some of these functions into R. There are just some limitations.

Here are a few good resources that can explain how window functions work in SQL:

Window functions tutorial

MS SQL window functions

The function (expression, [parameters]) OVER (window specification) aspect of window functions in SQL gives you a lot of options in how you can do this type of operation.

dplyr creates a series of functions that mimic SQL in a number of ways. When you connect to a SQL database, in fact, they convert a certain set of R dplyr commands into SQL itself (see the vignette on databases for dplyr). You can even use show_query to see how dplyr converts R syntax into SQL when it is connected to a database.

dplyr in turn has capabilities for window functions. You may or may not have come across this specifically when working with dplyr on the kinds of things you are interested in, but this breaks it down more conceptually for how dplyr takes the concept of window functions from SQL and converts that into R syntax. There is one somewhat important function that dplyr does not provide that you can do more easily in SQL. In this SQL statement:

SELECT empid, ordermonth, qty,
  SUM(qty) OVER(PARTITION BY empid
        ORDER BY ordermonth
        ROWS BETWEEN UNBOUNDED PRECEDING
             AND CURRENT ROW) AS runqty
FROM Sales.EmpOrders;

There is no way of changing the window frame (the "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"). However, as mentioned in the dplyr vignette, there are some other R packages (like RcppRoll that can do this.

If you have your data in a database, I actually prefer staying there and doing complex window functions in SQL. There are a few options that make it easier to do in SQL, and if you're working on large datasets it can be a lot more efficient to keep the data stored in the database (using materialized views, common table expressions, and a series of nested joins before outputting to R) to get your data in a more finalized state for analysis.

All of this can be done in R with a bit of thought. And that's what all of these posts do end up doing using tidyverse packages (primarily dplyr, but occasionally tidyr as well). I do think it helps to understand the concept of window functions, at least, since there are a few other options out there. And if you don't know SQL, this would be a good reason to learn it (I think).

bragks · January 4, 2019, 7:08pm

This! This certainly looks promising, thank you! Reading the vignette from dplyr, not sure if I completely understand the possibilities here, but googling "window functions dplyr" gave me this description: Like SQL, dplyr uses window functions that are used to subset data within a group. I'm preparing my "woah".

I'm surprised this concept doesn't get more attention considering the whole idea of one observation per row and repeated measures. Looking back at R4DS it's briefly mentioned under mutate().

Would you say the examples mentioned in the initial post could be solved with window functions from dplyr?

Learning SQL is on my (endless) list, but I feel like I need to get a better grasp on R first.

phiggins · January 6, 2019, 3:17am

I am also wondering if looking at repeated measures over time with tsibble might be productive
https://pkg.earo.me/tsibble/

I think that there is good value in the
group_by/nest -> extract summary data values -> mutate to add new column
workflow
I am still sorting out when to nest, but I may be underusing this tool,
as illustrated in three examples here:

https://users.dimi.uniud.it/~massimo.franceschet/ds/r4ds/syllabus/make/gapminder/gapminder.html

https://www.datacamp.com/courses/introduction-to-the-tidyverse

system · January 27, 2019, 3:17am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.