Hello,
I wanted to create a new data frame from an exsisting data frame based on some conditions. The old table is below:
Old table
id
year
var1
var2
var3
1
2010/2011
10
43
13
1
2011/2012
6
13
14
1
2012/2013
13
21
31
1
2013/2014
17
24
24
2
2010/2011
11
25
26
2
2011/2012
15
31
21
2
2012/2013
21
42
27
2
2013/2014
30
16
33
3
2010/2011
15
32
21
3
2011/2012
31
34
23
3
2012/2013
10
51
24
3
2013/2014
6
19
29
I wanted the new table to look like this
New table shold look like
id
new_year
new_var1
new_var2
new_var3
1
2011
8
28
13.5
1
2012
9.5
17
22.5
1
2013
15
22.5
27.5
2
2011
13
28
23.5
2
2012
18
36.5
24
2
2013
25.5
29
30
3
2011
23
33
22
3
2012
20.5
42.5
23.5
3
2013
8
35
26.5
The values of each new variable should be the average of two consecutive year values (except the last year) of the variable in the old table if the same id. In the new data frame, we will have one less row for each id.
For example, for those with id 1:
new_var1 for the new_year 2011 = (10 + 6)/2, which is the average of the year 2010/2011 and 2011/2012
new_var1 for the new_year 2012 = (6 + 13)/2, which is the average of the year 2011/2012 and 2012/2013
new_var1 for the new_year 2013 = (13 + 17)/2, which is the average of the year 2012/2013 and 2013/2014
These works for the rest of ids and variables in the data frame.
I was trying to do using for loop, but I couldn't figure it out.
I really appreciate your help!
Using dplyr, many operations can be vectorized. Here you can use lag(var1) to get a shifted version of var1. So computing the mean on consecutive rows can be done with:
So the hard part is done, you need to do that for each variable, remove the unneeded lines (with filter() and is.na()), and reformat the year (look at separate()). Also, don't forget to ungroup(), or some of the next operations might give you surprising results!
Many thanks for your quick response and help.
Yes, I got what I am looking for, but could there be any option that can work for a couple of columns at once?
Doing separately for each column may take time if there are many columns.
Yes, with across(). You'll need to define a condition about which columns contain variables, in your example starts_with("var") would work, in real life you might prefer to exclude id and year. And you'll also need to define a function that computes the rolling mean.