Best way to take an average of lagged rows with mutate?

I have a dataset that looks like this:

surface_data <- structure(list(year = 1995:2015, surface_area = c(0.0791523008380127, 
                                                  0.097225347899475, 0.0822670001840209, 0.0336522849847411, 0.117791995169983, 
                                                  0.0461216963919067, 0.130880043279724, 0.141475758216248, 0.0529759044537353, 
                                                  0.105950554455872, 0.109061354341431, 0.124643054853821, 0.125276420983276, 
                                                  0.105325262088928, 0.147077049232788, 0.125895508190918, 0.0735345832772827, 
                                                  0.102212300140076, 0.100343940362244, 0.0760356945687866, 0.350289531413268
)), class = "data.frame", row.names = c(NA, -21L))

head(surface_data)
#>   year surface_area
#> 1 1995   0.07915230
#> 2 1996   0.09722535
#> 3 1997   0.08226700
#> 4 1998   0.03365228
#> 5 1999   0.11779200
#> 6 2000   0.04612170

I'd like to add a column using mutate() that shows the change between rows based on a specified lag (not necessarily the previous row). The idea being:

\displaystyle \frac{(current value) - (lagged value)}{mean(current value, lagged value)}*100


What I've tried unsuccessfully is this:

library(tidyverse)

surface_data %>%
  mutate(surface_change = ((surface_area -
                              lag(x = surface_area, n = 1, order_by = year)) /
                             mean(c(surface_area,
                                    lag(x = surface_area, n = 1, order_by = year)),
                                  na.rm = T)) * 100) %>%
  head()
#>   year surface_area surface_change
#> 1 1995   0.07915230             NA
#> 2 1996   0.09722535       17.21608
#> 3 1997   0.08226700      -14.24907
#> 4 1998   0.03365228      -46.30956
#> 5 1999   0.11779200       80.15008
#> 6 2000   0.04612170      -68.27193

The problem is that the denominator of the above equation is constant:

surface_data %>%
  mutate(numerator = surface_area -
           lag(x = surface_area, n = 1, order_by = year),
         denominator = mean(c(surface_area,
                              lag(x = surface_area, n = 1, order_by = year)),
                            na.rm = T),
         surface_change = (numerator / denominator) * 100)%>%
  head()
#>   year surface_area   numerator denominator surface_change
#> 1 1995   0.07915230          NA   0.1049777             NA
#> 2 1996   0.09722535  0.01807305   0.1049777       17.21608
#> 3 1997   0.08226700 -0.01495835   0.1049777      -14.24907
#> 4 1998   0.03365228 -0.04861472   0.1049777      -46.30956
#> 5 1999   0.11779200  0.08413971   0.1049777       80.15008
#> 6 2000   0.04612170 -0.07167030   0.1049777      -68.27193

Created on 2019-03-19 by the reprex package (v0.2.1)

My questions: Why doesn't mean() behave as intended here? What is the best/simplest way to get the output below? I would like to maintain the ability to specify how many rows back to look for the lagged value. Thanks!

surface_data %>%
  mutate(surface_change = ((surface_area -
                              lag(x = surface_area, n = 1, order_by = year)) /
                    ((surface_area +
                        lag(x = surface_area, n = 1, order_by = year)) / 2)) * 100) %>%
  head()
#>   year surface_area surface_change
#> 1 1995   0.07915230             NA
#> 2 1996   0.09722535       20.49358
#> 3 1997   0.08226700      -16.66739
#> 4 1998   0.03365228      -83.87684
#> 5 1999   0.11779200      111.11639
#> 6 2000   0.04612170      -87.44883

dplyr operates column wise by default and you are trying to operate on a mix of column and row wise, you just have to separate this in two steps.

surface_data <- structure(list(year = 1995:2015, surface_area = c(0.0791523008380127, 
                                                                  0.097225347899475, 0.0822670001840209, 0.0336522849847411, 0.117791995169983, 
                                                                  0.0461216963919067, 0.130880043279724, 0.141475758216248, 0.0529759044537353, 
                                                                  0.105950554455872, 0.109061354341431, 0.124643054853821, 0.125276420983276, 
                                                                  0.105325262088928, 0.147077049232788, 0.125895508190918, 0.0735345832772827, 
                                                                  0.102212300140076, 0.100343940362244, 0.0760356945687866, 0.350289531413268
)), class = "data.frame", row.names = c(NA, -21L))

library(tidyverse)

surface_data %>%
    mutate(lag_surface_area = lag(surface_area, n = 1, order_by = year)) %>% 
    rowwise() %>% 
    mutate(surface_change = ((surface_area - lag_surface_area) /
                                 mean(c(surface_area, lag_surface_area),na.rm = T)) * 100) %>% 
    select(-lag_surface_area)
#> Source: local data frame [21 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 21 x 3
#>     year surface_area surface_change
#>    <int>        <dbl>          <dbl>
#>  1  1995       0.0792          NA   
#>  2  1996       0.0972          20.5 
#>  3  1997       0.0823         -16.7 
#>  4  1998       0.0337         -83.9 
#>  5  1999       0.118          111.  
#>  6  2000       0.0461         -87.4 
#>  7  2001       0.131           95.8 
#>  8  2002       0.141            7.78
#>  9  2003       0.0530         -91.0 
#> 10  2004       0.106           66.7 
#> # … with 11 more rows

Created on 2019-03-20 by the reprex package (v0.2.1)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.