Trying to sort data frame to calculate differences in a column

I have a dataframe that needs to be sorted by these columns. There is data in the dataframe as well.
|Area name|Area code|Area type|date|
|England|E92000001|Nation|01/07/2020|
|South West|E12000009|Region|01/07/2020|
|South East|E12000008|Region|01/07/2020|
|London|E12000007|Region|01/07/2020|
|East of England|E12000006|Region|01/07/2020|
|West Midlands|E12000005|Region|01/07/2020|
|East Midlands|E12000004|Region|01/07/2020|
|Yorkshire and The Humber|E12000003|Region|01/07/2020|
|North West|E12000002|Region|01/07/2020|
|North East|E12000001|Region|01/07/2020|
|Worcestershire|E10000034|Upper tier local authority|01/07/2020|
|West Sussex|E10000032|Upper tier local authority|01/07/2020|
|Warwickshire|E10000031|Upper tier local authority|01/07/2020|
|Surrey|E10000030|Upper tier local authority|01/07/2020|
|Suffolk|E10000029|Upper tier local authority|01/07/2020|
|Staffordshire|E10000028|Upper tier local authority|01/07/2020|
|Somerset|E10000027|Upper tier local authority|01/07/2020|
|Oxfordshire|E10000025|Upper tier local authority|01/07/2020|

This data charts everyday back to January and contains over 4000 rows.

I have a column of all this data and I want to find out the difference in the column between the dates and the same area.

I need to

  1. sort data. Firstly by name and then by area type as some areas have different area types but are the same place.
  2. have it ordered by date so each day is consecutvie.
  3. calculate the difference between the dates for the same area. But I do not want the code to calculate the difference between different areas, as this would be incorrect.
  4. I have to make put this into a new column but have to make sure that the column length matches the data frame as you cannot find the difference between the first row and no rows.

How should I do it and please could you provide some example code that I can work with.

Tricky bits are: sorting table by multiple columns. making sure that the difference is not calculated between different areas. making sure that a value is applied to the first row of each area as they rows would not have any difference and so R would not be able to plot this.

I want to chart this new column of differences in the values over time. So the difference between the values on the y axis and dates on the axis and to colour the data by area name.

  1. sorting data with multiple columns, you just have to know what the columns are and the desired order.
    use the arrange() function from dplyr (tidyverse). arrange(iris,Petal.Length,Petal.Width) will sort iris by first Petal.Length and then Petal.Width.
  2. thats just one more column to sort / arrange by
  3. its a bit ambigious to me what you mean by difference between dates, I suppose my best guess is that you might intend to find the range between the minimum and maximum date for the area, and might want to represent that as a number of elapsed days.
    To find the differencec between the max and min length of the Petal's in Iris for each species you would do
group_by(iris, Species) %>% summarise(petal_diff = max(Petal.Length) - min(Petal.Length))

Thanks. The arrange function looks good.

The difference between dates.
On the 01/07/2020 has a value e.g. 20
The day before is 30/06/2020 and has a value e.g. 19. The difference is 1. So the value increased by 1 on the 01/07/2020

I want to calculate the difference between this date and the day before. I want to do this for each area name and then each area type.

if you want a daily difference and you have daily data, and values (which I think were absent from your example ? ) then taking the different between the present row value and the previous (via the lag() function) would seem sufficient (if you group by the appropriate columns).
Problem would be if you are missing days, but you could always calculate all days in the range, and insert them as necessary before doing the calculation

How does the lab function work? Would it solve the problem of not calculating the difference between areas

its called lag. it calculates the difference between the present row value and the one before (or further back if you use the params) , and it will respect group_by groups to 'reset' and not mix groups.

you can read about it (as well as any function) by questionmarking the name in your console
?lag

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.