Tidyverse Strategies/Packages For Measuring Change Over Time?

dplyr
ggplot2
tidyverse

#1

Hey Everyone,

Long-time follower/learner of the #rstats hashtag and various other places where R folks congregate online to help each other. Disclaimer: I’ve never posted in a forum before, so feel free to let me know if this is the wrong avenue for getting help. I’m having a problem that I can’t solve despite several attempts to find a clear explanation.

Description of Data: I have two very large data sets related to school finance and school districts. One contains 25 years’ worth of actual school finance data for every school district in Texas, comes from the Texas Education Agency, and has over 100 variables. The other has 50 years of public finance data for every unified school district in the United States, comes from the U.S. Census, and has over 100 variables.

My question is: What are the best ways, using the tidyverse, to measure change over time in a specific variable (e.g. change in property values) for every group of observations (school districts), either between two points in time (1995 vs. 2015) or between all points in time from 1995 to 2015? I’m sure someone has good ideas, but I’m relatively new to the world of R and am still learning. :man_student:

My goal is: to visualize change over time for a given variable by school district on a choropleth map of each school district in Texas (for the TEA data) and the United States (for the Census data).

Thanks in advance for any help! I’ll try to contribute more to the community as I am able!

Subset Data Sample

tibble::tribble(
  ~YEAR, ~STATEID,                ~NAME, ~Reported.Enrollment, ~Average.Daily.Attendance..ADA., ~Total.Revenue,
  2012L,   15901L,  "Alamo Heights ISD",                4791L,                        4528.251,      36869389L,
  2012L,   15905L,       "Edgewood ISD",               11822L,                       10585.428,      96552441L,
  2012L,   15910L,     "North East ISD",               67208L,                       63634.238,     484719766L,
  2012L,   15915L,      "Northside ISD",               97204L,                       91223.485,     676759259L,
  2013L,   15901L,  "Alamo Heights ISD",                4805L,                        4557.841,      37688165L,
  2013L,   15905L,       "Edgewood ISD",               11931L,                       10618.405,     101029708L,
  2013L,   15910L,     "North East ISD",               67701L,                       63932.903,     471328861L,
  2013L,   15915L,      "Northside ISD",               99426L,                       93064.652,     693467943L,
  2014L,   15901L,  "Alamo Heights ISD",                4815L,                        4557.153,      39636063L,
  2014L,   15905L,       "Edgewood ISD",               12059L,                       10640.124,     105918783L,
  2014L,   15910L,     "North East ISD",               67986L,                       64308.082,     493326195L,
  2014L,   15915L,      "Northside ISD",              101549L,                       95202.712,     767956155L,
  2015L,   15901L,  "Alamo Heights ISD",                4760L,                        4529.232,      40083066L,
  2015L,   15905L,       "Edgewood ISD",               11726L,                         10625.8,     109926501L,
  2015L,   15910L,     "North East ISD",               67757L,                       63844.451,     511662489L,
  2015L,   15915L,      "Northside ISD",              102950L,                       96475.443,     804214714L,
  2016L,   15901L,  "Alamo Heights ISD",                4808L,                        4588.662,      42354896L,
  2016L,   15905L,       "Edgewood ISD",               11279L,                        9940.369,     112947815L,
  2016L,   15910L,     "North East ISD",               67501L,                       63683.555,     533641434L,
  2016L,   15915L,      "Northside ISD",              104437L,                       97952.067,     860866726L
  )

#2

Though I haven’t done a ton of time series work, the tools by @davis, @mdancho & co at Business Science could be relevant (e.g. timetk, tibbletime).

A couple of posts come to mind:


Hope these help get you going!

Oh, also Kyle E. Walker and Len Kiefer for spatio-temporal visualization stuff:

https://walkerke.github.io/


#3

I think that you might be able to use some of the dplyr functions that work on multiple columns at once (i.e. mutate_all() or mutate_at()).

There are a few ways to tackle this. Assume df is the reproducible data you provided.

  1. Assuming the data is ordered by year, you could group by STATEID (and also NAME so it gets dragged along even though it doesn’t add more groups), then mutate_all() to calculate the difference from year to year for every variable.
    • You can’t see it here, but the new column names all have _diff appended to them.
    • Unlike purrr, the formula that you want to apply to each column is written using . rather than .x. So here, for each column, we take the difference between the column and its first lag.
    • To look at one example, the 14 in the Reported.Enrollment_diff column is the difference 4805 - 4791 between the 2013 and 2012 values of Reported.Enrollment.
library(dplyr)

df %>%
  group_by(STATEID, NAME) %>%
  mutate_all(funs(diff = . - lag(.)))
#> # A tibble: 20 x 10
#> # Groups: STATEID, NAME [4]
#>     YEAR STATEID NAME   Repor… Aver… Total… YEAR… Repo… Average.… Total.R…
#>    <int>   <int> <chr>   <int> <dbl>  <int> <int> <int>     <dbl>    <int>
#>  1  2012   15901 Alamo…   4791  4528 3.69e⁷    NA    NA    NA      NA     
#>  2  2012   15905 Edgew…  11822 10585 9.66e⁷    NA    NA    NA      NA     
#>  3  2012   15910 North…  67208 63634 4.85e⁸    NA    NA    NA      NA     
#>  4  2012   15915 North…  97204 91223 6.77e⁸    NA    NA    NA      NA     
#>  5  2013   15901 Alamo…   4805  4558 3.77e⁷     1    14    29.6     8.19e⁵
#>  6  2013   15905 Edgew…  11931 10618 1.01e⁸     1   109    33.0     4.48e⁶
#>  7  2013   15910 North…  67701 63933 4.71e⁸     1   493   299     - 1.34e⁷
#>  8  2013   15915 North…  99426 93065 6.93e⁸     1  2222  1841       1.67e⁷
#>  9  2014   15901 Alamo…   4815  4557 3.96e⁷     1    10 -   0.688   1.95e⁶
#> 10  2014   15905 Edgew…  12059 10640 1.06e⁸     1   128    21.7     4.89e⁶
#> 11  2014   15910 North…  67986 64308 4.93e⁸     1   285   375       2.20e⁷
#> 12  2014   15915 North… 101549 95203 7.68e⁸     1  2123  2138       7.45e⁷
#> 13  2015   15901 Alamo…   4760  4529 4.01e⁷     1 -  55 -  27.9     4.47e⁵
#> 14  2015   15905 Edgew…  11726 10626 1.10e⁸     1 - 333 -  14.3     4.01e⁶
#> 15  2015   15910 North…  67757 63844 5.12e⁸     1 - 229 - 464       1.83e⁷
#> 16  2015   15915 North… 102950 96475 8.04e⁸     1  1401  1273       3.63e⁷
#> 17  2016   15901 Alamo…   4808  4589 4.24e⁷     1    48    59.4     2.27e⁶
#> 18  2016   15905 Edgew…  11279  9940 1.13e⁸     1 - 447 - 685       3.02e⁶
#> 19  2016   15910 North…  67501 63684 5.34e⁸     1 - 256 - 161       2.20e⁷
#> 20  2016   15915 North… 104437 97952 8.61e⁸     1  1487  1477       5.67e⁷
  2. A slight modification if you don’t want to diff the YEAR variable: use mutate_at() and exclude YEAR with vars(-YEAR), which works like select() for selecting columns. This is my preferred approach.
df %>%
  group_by(STATEID, NAME) %>%
  mutate_at(.vars = vars(-YEAR), funs(diff = . - lag(.)))
#> # A tibble: 20 x 9
#> # Groups: STATEID, NAME [4]
#>     YEAR STATEID NAME   Report… Averag… Total.… Repor… Average.D… Total.R…
#>    <int>   <int> <chr>    <int>   <dbl>   <int>  <int>      <dbl>    <int>
#>  1  2012   15901 Alamo…    4791    4528  3.69e⁷     NA     NA      NA     
#>  2  2012   15905 Edgew…   11822   10585  9.66e⁷     NA     NA      NA     
#>  3  2012   15910 North…   67208   63634  4.85e⁸     NA     NA      NA     
#>  4  2012   15915 North…   97204   91223  6.77e⁸     NA     NA      NA     
#>  5  2013   15901 Alamo…    4805    4558  3.77e⁷     14     29.6     8.19e⁵
#>  6  2013   15905 Edgew…   11931   10618  1.01e⁸    109     33.0     4.48e⁶
#>  7  2013   15910 North…   67701   63933  4.71e⁸    493    299     - 1.34e⁷
#>  8  2013   15915 North…   99426   93065  6.93e⁸   2222   1841       1.67e⁷
#>  9  2014   15901 Alamo…    4815    4557  3.96e⁷     10  -   0.688   1.95e⁶
#> 10  2014   15905 Edgew…   12059   10640  1.06e⁸    128     21.7     4.89e⁶
#> 11  2014   15910 North…   67986   64308  4.93e⁸    285    375       2.20e⁷
#> 12  2014   15915 North…  101549   95203  7.68e⁸   2123   2138       7.45e⁷
#> 13  2015   15901 Alamo…    4760    4529  4.01e⁷  -  55  -  27.9     4.47e⁵
#> 14  2015   15905 Edgew…   11726   10626  1.10e⁸  - 333  -  14.3     4.01e⁶
#> 15  2015   15910 North…   67757   63844  5.12e⁸  - 229  - 464       1.83e⁷
#> 16  2015   15915 North…  102950   96475  8.04e⁸   1401   1273       3.63e⁷
#> 17  2016   15901 Alamo…    4808    4589  4.24e⁷     48     59.4     2.27e⁶
#> 18  2016   15905 Edgew…   11279    9940  1.13e⁸  - 447  - 685       3.02e⁶
#> 19  2016   15910 North…   67501   63684  5.34e⁸  - 256  - 161       2.20e⁷
#> 20  2016   15915 North…  104437   97952  8.61e⁸   1487   1477       5.67e⁷
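As an aside, funs() has since been deprecated; in current dplyr (1.0.0 and later), across() covers both variants above. A minimal sketch of the equivalent, assuming df is the sample data from the question:

```r
library(dplyr)

# Equivalent of mutate_at(vars(-YEAR), funs(diff = . - lag(.))):
# grouping columns are excluded from across() automatically,
# and -YEAR drops the year column from the differencing.
df %>%
  group_by(STATEID, NAME) %>%
  mutate(across(-YEAR, ~ .x - lag(.x), .names = "{.col}_diff")) %>%
  ungroup()
```

The .names argument reproduces the _diff suffix that funs(diff = ...) appended.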
  3. Finally, you could go all out with tidying and say that all the columns you want to take the difference of are really one “variable”, so you can gather() them. If you additionally group by the key column and ensure the years within each group are in order, you can take the difference of the value column to difference everything at once. I don’t particularly like this approach, but figured I would include it.
df %>%
  group_by(STATEID, NAME) %>%
  tidyr::gather(key, value, -YEAR, -STATEID, -NAME) %>%
  group_by(STATEID, NAME, key) %>%
  arrange(YEAR, .by_group = TRUE) %>%
  mutate(value_diff = value - lag(value))
#> # A tibble: 60 x 6
#> # Groups: STATEID, NAME, key [12]
#>     YEAR STATEID NAME              key                     value value_di…
#>    <int>   <int> <chr>             <chr>                   <dbl>     <dbl>
#>  1  2012   15901 Alamo Heights ISD Average.Daily.Attendan…  4528    NA    
#>  2  2013   15901 Alamo Heights ISD Average.Daily.Attendan…  4558    29.6  
#>  3  2014   15901 Alamo Heights ISD Average.Daily.Attendan…  4557   - 0.688
#>  4  2015   15901 Alamo Heights ISD Average.Daily.Attendan…  4529   -27.9  
#>  5  2016   15901 Alamo Heights ISD Average.Daily.Attendan…  4589    59.4  
#>  6  2012   15901 Alamo Heights ISD Reported.Enrollment      4791    NA    
#>  7  2013   15901 Alamo Heights ISD Reported.Enrollment      4805    14.0  
#>  8  2014   15901 Alamo Heights ISD Reported.Enrollment      4815    10.0  
#>  9  2015   15901 Alamo Heights ISD Reported.Enrollment      4760   -55.0  
#> 10  2016   15901 Alamo Heights ISD Reported.Enrollment      4808    48.0  
#> # ... with 50 more rows
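Similarly, gather() has been superseded by pivot_longer() in current tidyr; a sketch of the same long-format approach, again assuming df is the sample data:

```r
library(dplyr)
library(tidyr)

# Reshape all measure columns into key/value pairs, then difference
# the value column within each district-by-variable group.
df %>%
  pivot_longer(-c(YEAR, STATEID, NAME), names_to = "key", values_to = "value") %>%
  group_by(STATEID, NAME, key) %>%
  arrange(YEAR, .by_group = TRUE) %>%
  mutate(value_diff = value - lag(value)) %>%
  ungroup()
```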

#4

For what it’s worth, if you’re interested in change over time, an important metric to consider is the rate of change. Using mutate() along with lag(), you can create a new column that places the former value next to the current value in the same record, then another new column with former/current, which you can plot with points and connecting lines. The first entry in the data set will come back as NA unless you manually supply a value with case_when() (like inserting an estimated value for the year prior to the beginning of the data set). Hope this makes sense.
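That idea can be sketched roughly as follows (a hypothetical revenue_growth column, assuming df is the sample data from the question; the case_when() baseline fill is omitted):

```r
library(dplyr)

# Year-over-year rate of change in Total.Revenue per district.
# The first year in each group has no prior value, so the rate is NA
# unless you supply a baseline (e.g. via case_when()).
df %>%
  group_by(STATEID, NAME) %>%
  arrange(YEAR, .by_group = TRUE) %>%
  mutate(revenue_growth = (Total.Revenue - lag(Total.Revenue)) / lag(Total.Revenue)) %>%
  ungroup()
```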