# Best way to take an average of lagged rows with mutate?

I have a dataset that looks like this:

surface_data <- structure(list(year = 1995:2015, surface_area = c(0.0791523008380127,
0.097225347899475, 0.0822670001840209, 0.0336522849847411, 0.117791995169983,
0.0461216963919067, 0.130880043279724, 0.141475758216248, 0.0529759044537353,
0.105950554455872, 0.109061354341431, 0.124643054853821, 0.125276420983276,
0.105325262088928, 0.147077049232788, 0.125895508190918, 0.0735345832772827,
0.102212300140076, 0.100343940362244, 0.0760356945687866, 0.350289531413268
)), class = "data.frame", row.names = c(NA, -21L))

#>   year surface_area
#> 1 1995   0.07915230
#> 2 1996   0.09722535
#> 3 1997   0.08226700
#> 4 1998   0.03365228
#> 5 1999   0.11779200
#> 6 2000   0.04612170


I'd like to add a column using mutate() that shows the change between rows based on a specified lag (not necessarily the previous row). The idea being:

\displaystyle \frac{(current value) - (lagged value)}{mean(current value, lagged value)}*100

What I've tried unsuccessfully is this:

library(tidyverse)

surface_data %>%
mutate(surface_change = ((surface_area -
lag(x = surface_area, n = 1, order_by = year)) /
mean(c(surface_area,
lag(x = surface_area, n = 1, order_by = year)),
na.rm = T)) * 100) %>%
#>   year surface_area surface_change
#> 1 1995   0.07915230             NA
#> 2 1996   0.09722535       17.21608
#> 3 1997   0.08226700      -14.24907
#> 4 1998   0.03365228      -46.30956
#> 5 1999   0.11779200       80.15008
#> 6 2000   0.04612170      -68.27193


The problem is that the denominator of the above equation is constant:

surface_data %>%
mutate(numerator = surface_area -
lag(x = surface_area, n = 1, order_by = year),
denominator = mean(c(surface_area,
lag(x = surface_area, n = 1, order_by = year)),
na.rm = T),
surface_change = (numerator / denominator) * 100)%>%
#>   year surface_area   numerator denominator surface_change
#> 1 1995   0.07915230          NA   0.1049777             NA
#> 2 1996   0.09722535  0.01807305   0.1049777       17.21608
#> 3 1997   0.08226700 -0.01495835   0.1049777      -14.24907
#> 4 1998   0.03365228 -0.04861472   0.1049777      -46.30956
#> 5 1999   0.11779200  0.08413971   0.1049777       80.15008
#> 6 2000   0.04612170 -0.07167030   0.1049777      -68.27193


Created on 2019-03-19 by the reprex package (v0.2.1)

My questions: Why doesn't mean() behave as intended here? What is the best/simplest way to get the output below? I would like to maintain the ability to specify how many rows back to look for the lagged value. Thanks!

surface_data %>%
mutate(surface_change = ((surface_area -
lag(x = surface_area, n = 1, order_by = year)) /
((surface_area +
lag(x = surface_area, n = 1, order_by = year)) / 2)) * 100) %>%
#>   year surface_area surface_change
#> 1 1995   0.07915230             NA
#> 2 1996   0.09722535       20.49358
#> 3 1997   0.08226700      -16.66739
#> 4 1998   0.03365228      -83.87684
#> 5 1999   0.11779200      111.11639
#> 6 2000   0.04612170      -87.44883


dplyr operates column wise by default and you are trying to operate on a mix of column and row wise, you just have to separate this in two steps.

surface_data <- structure(list(year = 1995:2015, surface_area = c(0.0791523008380127,
0.097225347899475, 0.0822670001840209, 0.0336522849847411, 0.117791995169983,
0.0461216963919067, 0.130880043279724, 0.141475758216248, 0.0529759044537353,
0.105950554455872, 0.109061354341431, 0.124643054853821, 0.125276420983276,
0.105325262088928, 0.147077049232788, 0.125895508190918, 0.0735345832772827,
0.102212300140076, 0.100343940362244, 0.0760356945687866, 0.350289531413268
)), class = "data.frame", row.names = c(NA, -21L))

library(tidyverse)

surface_data %>%
mutate(lag_surface_area = lag(surface_area, n = 1, order_by = year)) %>%
rowwise() %>%
mutate(surface_change = ((surface_area - lag_surface_area) /
mean(c(surface_area, lag_surface_area),na.rm = T)) * 100) %>%
select(-lag_surface_area)
#> Source: local data frame [21 x 3]
#> Groups: <by row>
#>
#> # A tibble: 21 x 3
#>     year surface_area surface_change
#>    <int>        <dbl>          <dbl>
#>  1  1995       0.0792          NA
#>  2  1996       0.0972          20.5
#>  3  1997       0.0823         -16.7
#>  4  1998       0.0337         -83.9
#>  5  1999       0.118          111.
#>  6  2000       0.0461         -87.4
#>  7  2001       0.131           95.8
#>  8  2002       0.141            7.78
#>  9  2003       0.0530         -91.0
#> 10  2004       0.106           66.7
#> # … with 11 more rows


Created on 2019-03-20 by the reprex package (v0.2.1)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.