What you're fundamentally talking about are essentially a variety of window functions, which are used extensively in SQL. Discovering window functions a couple of years ago was one of those "aha" or "woah" moments in programming for me. Since I was working a lot with financial transaction data, I had a lot of use cases where I would want to retain the individual-level observations but also needed some sort of aggregate function attached to a dataset. Or I would want to rank order observations by date (or by multiple fields). Or I would have a customer ID that could join to an account table but either a missing transaction ID, or transaction data that was messy and I needed a way of tracking transactions in the order they were received. All of these things can be done using window functions in SQL, and dplyr attempts to convert some of these functions into R. There are just some limitations.
Here are a few good resources that can explain how window functions work in SQL:
Window functions tutorial
MS SQL window functions
The function (expression, [parameters]) OVER (window specification) aspect of window functions in SQL gives you a lot of options in how you can do this type of operation.
dplyr creates a series of functions that mimic SQL in a number of ways. When you connect to a SQL database, in fact, they convert a certain set of R dplyr commands into SQL itself (see the vignette on databases for dplyr). You can even use
show_query to see how dplyr converts R syntax into SQL when it is connected to a database.
dplyr in turn has capabilities for window functions. You may or may not have come across this specifically when working with dplyr on the kinds of things you are interested in, but this breaks it down more conceptually for how dplyr takes the concept of window functions from SQL and converts that into R syntax. There is one somewhat important function that dplyr does not provide that you can do more easily in SQL. In this SQL statement:
SELECT empid, ordermonth, qty,
SUM(qty) OVER(PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS runqty
There is no way of changing the window frame (the "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"). However, as mentioned in the dplyr vignette, there are some other R packages (like RcppRoll that can do this.
If you have your data in a database, I actually prefer staying there and doing complex window functions in SQL. There are a few options that make it easier to do in SQL, and if you're working on large datasets it can be a lot more efficient to keep the data stored in the database (using materialized views, common table expressions, and a series of nested joins before outputting to R) to get your data in a more finalized state for analysis.
All of this can be done in R with a bit of thought. And that's what all of these posts do end up doing using tidyverse packages (primarily dplyr, but occasionally tidyr as well). I do think it helps to understand the concept of window functions, at least, since there are a few other options out there. And if you don't know SQL, this would be a good reason to learn it (I think).