# How to create a new data frame using a for loop

Hello,
I wanted to create a new data frame from an existing data frame based on some conditions. The old table is below:

Old table
id year var1 var2 var3
1 2010/2011 10 43 13
1 2011/2012 6 13 14
1 2012/2013 13 21 31
1 2013/2014 17 24 24
2 2010/2011 11 25 26
2 2011/2012 15 31 21
2 2012/2013 21 42 27
2 2013/2014 30 16 33
3 2010/2011 15 32 21
3 2011/2012 31 34 23
3 2012/2013 10 51 24
3 2013/2014 6 19 29

I wanted the new table to look like this

New table should look like
id new_year new_var1 new_var2 new_var3
1 2011 8 28 13.5
1 2012 9.5 17 22.5
1 2013 15 22.5 27.5
2 2011 13 28 23.5
2 2012 18 36.5 24
2 2013 25.5 29 30
3 2011 23 33 22
3 2012 20.5 42.5 23.5
3 2013 8 35 26.5

Each new variable should be the average of the values for two consecutive years of the corresponding variable in the old table, within the same id. Because the last year has no following year to pair with, the new data frame will have one row fewer for each id.
For example, for those with id 1:
new_var1 for the new_year 2011 = (10 + 6)/2, which is the average of the year 2010/2011 and 2011/2012
new_var1 for the new_year 2012 = (6 + 13)/2, which is the average of the year 2011/2012 and 2012/2013
new_var1 for the new_year 2013 = (13 + 17)/2, which is the average of the year 2012/2013 and 2013/2014

This works the same way for the rest of the ids and variables in the data frame.

I was trying to do this using a for loop, but I couldn't figure it out.

Using `dplyr`, many operations can be vectorized. Here you can use `lag(var1)` to get a shifted version of `var1`. So computing the mean on consecutive rows can be done with:

```r
library(dplyr)

# Average each value with the one in the previous row
df %>%
  mutate(new_v1 = (var1 + lag(var1)) / 2)
# A tibble: 11 x 6
#       id year       var1  var2  var3 new_v1
#    <dbl> <chr>     <dbl> <dbl> <dbl>  <dbl>
#  1     1 2010/2011    10    43    13   NA
#  2     1 2011/2012     6    13    14    8
#  3     1 2012/2013    13    21    31    9.5
#  4     1 2013/2014    17    24    24   15
#  5     2 2010/2011    15    31    21   16
#  6     2 2012/2013    21    42    27   18
#  7     2 2013/2014    30    16    33   25.5
#  8     3 2010/2011    15    32    21   22.5
#  9     3 2011/2012    31    34    23   23
# 10     3 2012/2013    10    51    24   20.5
# 11     3 2013/2014     6    19    29    8
```

But you want to compute this separately for each `id`! Actually, `group_by()` lets you do it directly:

```r
# Same computation, but lag() now restarts within each id
df %>%
  group_by(id) %>%
  mutate(new_v1 = (var1 + lag(var1)) / 2)
# A tibble: 11 x 6
# Groups:   id [3]
#       id year       var1  var2  var3 new_v1
#    <dbl> <chr>     <dbl> <dbl> <dbl>  <dbl>
#  1     1 2010/2011    10    43    13   NA
#  2     1 2011/2012     6    13    14    8
#  3     1 2012/2013    13    21    31    9.5
#  4     1 2013/2014    17    24    24   15
#  5     2 2010/2011    15    31    21   NA
#  6     2 2012/2013    21    42    27   18
#  7     2 2013/2014    30    16    33   25.5
#  8     3 2010/2011    15    32    21   NA
#  9     3 2011/2012    31    34    23   23
# 10     3 2012/2013    10    51    24   20.5
# 11     3 2013/2014     6    19    29    8
```

So the hard part is done. You still need to do the same for each variable, remove the unneeded rows (with `filter()` and `is.na()`), and reformat the year (look at `separate()`). Also, don't forget to `ungroup()`, or some of the later operations might give you surprising results!
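As a rough sketch of those remaining steps, assuming the data frame is the 12-row table from the question (typed in by hand here) and that `tidyr` is available for `separate()`. The names `new_df` and `end_year` are only illustrative, and this is one possible way to chain the steps rather than the only one:

```r
library(dplyr)
library(tidyr)

# The table from the question, entered by hand so the example is self-contained
df <- data.frame(
  id   = rep(1:3, each = 4),
  year = rep(c("2010/2011", "2011/2012", "2012/2013", "2013/2014"), times = 3),
  var1 = c(10, 6, 13, 17, 11, 15, 21, 30, 15, 31, 10, 6),
  var2 = c(43, 13, 21, 24, 25, 31, 42, 16, 32, 34, 51, 19),
  var3 = c(13, 14, 31, 24, 26, 21, 27, 33, 21, 23, 24, 29),
  stringsAsFactors = FALSE
)

new_df <- df %>%
  group_by(id) %>%
  # Average each value with the previous row of the same id
  mutate(
    new_var1 = (var1 + lag(var1)) / 2,
    new_var2 = (var2 + lag(var2)) / 2,
    new_var3 = (var3 + lag(var3)) / 2
  ) %>%
  ungroup() %>%
  # The first row of each id has no previous row, so its averages are NA
  filter(!is.na(new_var1)) %>%
  # "2011/2012" -> new_year = 2011, end_year = 2012 (end_year is a throwaway column)
  separate(year, into = c("new_year", "end_year"), sep = "/", convert = TRUE) %>%
  select(id, new_year, new_var1, new_var2, new_var3)

new_df
```

Each average lands on the row of the later year, so keeping the first half of `year` gives the `new_year` labels asked for in the question.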

Many thanks for your quick response and help.
Yes, I got what I was looking for, but is there an option that works on several columns at once?
Doing this separately for each column may take time if there are many columns.

Thanks

Yes, with `across()`. You'll need to define a condition for which columns contain variables: in your example `starts_with("var")` would work, while in real life you might prefer to exclude `id` and `year` instead. You'll also need to define a function that computes the rolling mean.

Could you help me with that (keeping id and year in the data frame)?
Thanks

`across()` ignores grouping variables, so we actually don't need to exclude `id`, only `year`. So this works:

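```r
library(dplyr)

# Mean of each element and the one before it (NA for the first element)
rol_mean <- function(vec) {
  (vec + lag(vec)) / 2
}

# Apply rol_mean to every column except year; id is skipped as the grouping variable
df %>%
  group_by(id) %>%
  mutate(across(-year, rol_mean))
```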
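If you want the result in the shape of the requested new table, something along these lines might work. It is only a sketch: the table is typed in by hand from the question, and the `.names = "new_{.col}"` argument, the `drop_na()` step from `tidyr`, and the `new_df` and `end_year` names are my own choices for illustration.

```r
library(dplyr)
library(tidyr)

# The table from the question, entered by hand so the example is self-contained
df <- data.frame(
  id   = rep(1:3, each = 4),
  year = rep(c("2010/2011", "2011/2012", "2012/2013", "2013/2014"), times = 3),
  var1 = c(10, 6, 13, 17, 11, 15, 21, 30, 15, 31, 10, 6),
  var2 = c(43, 13, 21, 24, 25, 31, 42, 16, 32, 34, 51, 19),
  var3 = c(13, 14, 31, 24, 26, 21, 27, 33, 21, 23, 24, 29),
  stringsAsFactors = FALSE
)

# Mean of each element and the one before it (NA for the first element)
rol_mean <- function(vec) {
  (vec + lag(vec)) / 2
}

new_df <- df %>%
  group_by(id) %>%
  # Write the averages to new_var1, new_var2, ... and keep the originals for now
  mutate(across(starts_with("var"), rol_mean, .names = "new_{.col}")) %>%
  ungroup() %>%
  # Drop the first row of each id, where the averages are NA
  drop_na(starts_with("new_")) %>%
  # Keep the first half of "2011/2012" as new_year; end_year is discarded below
  separate(year, into = c("new_year", "end_year"), sep = "/", convert = TRUE) %>%
  select(id, new_year, starts_with("new_var"))

new_df
```

Selecting with `starts_with("var")` means extra columns named `var4`, `var5`, and so on would be picked up without changing the code.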

You can also see the `dplyr` documentation for more options and details about "tidy selection", the way variables are selected in `dplyr` functions.
