Calculating difference between two columns and multiply

Good day everyone

Suppose that I have these annual datasets (i.e., January 1st to December 31st for multiple years) that show how much energy is being produced by various generators in a certain area. The data looks something like the table below. The table only has four generators (i.e., generators A, B, C and D) and only covers year 2022, but please keep in mind that the actual dataset has A LOT more than four generators and covers multiple years.

Date Generator Type Period Capacity Production
2022-01-01 A Hydro 1 12 10
2022-01-01 A Hydro 2 12 8
2022-01-01 A Hydro 3 12 5
2022-01-01 A Hydro 4 12 11
2022-01-01 B Nuclear 1 10 10
2022-01-01 B Nuclear 2 10 7
2022-01-01 B Nuclear 3 10 4
2022-01-01 B Nuclear 4 10 10
2022-01-01 C Wind 1 15 12
2022-01-01 C Wind 2 15 9
2022-01-01 C Wind 3 15 7
2022-01-01 C Wind 4 15 14
2022-01-01 D Hydro 1 11 8
2022-01-01 D Hydro 2 11 6
2022-01-01 D Hydro 3 11 3
2022-01-01 D Hydro 4 11 9
2022-01-02 A Hydro 1 12 9.5
2022-01-02 A Hydro 2 12 9
2022-01-02 A Hydro 3 12 4
2022-01-02 A Hydro 4 12 11
2022-01-02 B Nuclear 1 10 10
2022-01-02 B Nuclear 2 10 8
2022-01-02 B Nuclear 3 10 4
2022-01-02 B Nuclear 4 10 10
2022-01-02 C Wind 1 15 11
2022-01-02 C Wind 2 15 8
2022-01-02 C Wind 3 15 7
2022-01-02 C Wind 4 15 13
2022-01-02 D Hydro 1 11 7.5
2022-01-02 D Hydro 2 11 7
2022-01-02 D Hydro 3 11 2
2022-01-02 D Hydro 4 11 9
. . . . . .
. . . . . .
. . . . . .
2022-12-31 B Nuclear 4 80 73
2022-12-31 C Wind 1 15 6
2022-12-31 C Wind 2 15 3
2022-12-31 C Wind 3 15 2
2022-12-31 C Wind 4 15 8
2022-12-31 D Hydro 1 11 6
2022-12-31 D Hydro 2 11 6
2022-12-31 D Hydro 3 11 2
2022-12-31 D Hydro 4 11 8

I want to create another column called "Difference" that shows the difference between the capacity and production for each time period and generator. This is quite simple for wind or nuclear generators since I can just do capacity minus production for each time period.

For each hydro unit, however, the process is not that simple due to unique nature of hydro-electric generators. I first have to take the difference between capacity and production in each period and multiply that by generator's capacity minus production for the year (i.e., sum of generator's capacity for the year minus sum of generator's production).

For example, if I wanted to calculate hydro generator A's "difference" on 2022-01-01 at Period 1, I would do:

(Capacity at period 1 - Production at period 1) multiply by (2022 sum of Generator A's Capacity - 2022 sum of Generator A's Production). So it would be: (12-10) multiply by (Z - X), where "Z" = 2022 sum of A's capacity and "X" = 2022 sum of A's production. If I was doing this for 2021 dates, then I would have to use the 2021 sum values (I hope I'm making sense).

Now, I can easily do this for wind or nuclear generators using "mutate" function. But I am not sure how I'm gonna do this for hydro units. So it would be great if anyone could assist me with this! Below is the sample data frame that you can use to answer my question. Please note that the sample data frame below only covers January 1, 2022 to January 2, 2022, but the actual dataset covers January 1 to December 31 for multiple years.

df <- data.frame(date=as.Date(c('1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022'), '%m/%d/%Y'),
                 Generator = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D',
                               'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'),
                 Type = c('Hydro', 'Hydro', 'Hydro', 'Hydro', 
                          "Nuclear", "Nuclear", "Nuclear", "Nuclear",
                          "Wind", "Wind", "Wind", "Wind",
                          'Hydro', 'Hydro', 'Hydro', 'Hydro',
                          'Hydro', 'Hydro', 'Hydro', 'Hydro', 
                          "Nuclear","Nuclear", "Nuclear", "Nuclear", 
                          "Wind", "Wind", "Wind", "Wind",
                          'Hydro', 'Hydro', 'Hydro', 'Hydro'),
                 Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
                            1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                 Capacity = c(12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11,
                              12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11),
                 Production = c(10, 8, 5, 11, 10, 7, 4, 10, 12, 9, 7, 14, 8, 6, 3, 9,
                                9.5, 9, 4, 11, 10, 8, 4, 10, 11, 8, 7, 13, 7.5, 7, 2, 9))

At the end, the final output should look something like:

Date Generator Type Period Capacity Production Difference
2022-01-01 A Hydro 1 12 10 57
2022-01-01 A Hydro 2 12 8 114
2022-01-01 A Hydro 3 12 5 199.5
2022-01-01 A Hydro 4 12 11 28.5
2022-01-01 B Nuclear 1 10 10 0
2022-01-01 B Nuclear 2 10 7 3
2022-01-01 B Nuclear 3 10 4 6
2022-01-01 B Nuclear 4 10 10 0
2022-01-01 C Wind 1 15 12 3
2022-01-01 C Wind 2 15 9 6
2022-01-01 C Wind 3 15 7 8
2022-01-01 C Wind 4 15 14 1
2022-01-01 D Hydro 1 11 8 109.5
2022-01-01 D Hydro 2 11 6 182.5
2022-01-01 D Hydro 3 11 3 292
2022-01-01 D Hydro 4 11 9 73
2022-01-02 A Hydro 1 12 9.5 71.25
2022-01-02 A Hydro 2 12 9 85.5
2022-01-02 A Hydro 3 12 4 228
2022-01-02 A Hydro 4 12 11 28.5
2022-01-02 B Nuclear 1 10 10 0
2022-01-02 B Nuclear 2 10 8 2
2022-01-02 B Nuclear 3 10 4 6
2022-01-02 B Nuclear 4 10 10 0
2022-01-02 C Wind 1 15 11 4
2022-01-02 C Wind 2 15 8 7
2022-01-02 C Wind 3 15 7 8
2022-01-02 C Wind 4 15 13 2
2022-01-02 D Hydro 1 11 7.5 127.75
2022-01-02 D Hydro 2 11 7 146
2022-01-02 D Hydro 3 11 2 328.5
2022-01-02 D Hydro 4 11 9 73

Again, (sorry for being so repetitive), the actual dataset covers multiple years and has more generators, so the codes/solution should be as "general/flexible" as possible and be able to be used for much larger datasets. That is, the solution shouldn't be too "rigid" to the point that it can ONLY be used to address the example above. IF doing this for multiple years is challenging, then we can just focus on a specific year for now.

Thank you

I think this produces the calculation you want.

library(dplyr)
library(lubridate)
df <- data.frame(date=as.Date(c('1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/1/2022', '1/1/2022', '1/1/2022', '1/1/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022',
                                '1/2/2022', '1/2/2022', '1/2/2022', '1/2/2022'), '%m/%d/%Y'),
                 Generator = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D',
                               'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'),
                 Type = c('Hydro', 'Hydro', 'Hydro', 'Hydro', 
                          "Nuclear", "Nuclear", "Nuclear", "Nuclear",
                          "Wind", "Wind", "Wind", "Wind",
                          'Hydro', 'Hydro', 'Hydro', 'Hydro',
                          'Hydro', 'Hydro', 'Hydro', 'Hydro', 
                          "Nuclear","Nuclear", "Nuclear", "Nuclear", 
                          "Wind", "Wind", "Wind", "Wind",
                          'Hydro', 'Hydro', 'Hydro', 'Hydro'),
                 Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
                            1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                 Capacity = c(12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11,
                              12, 12, 12, 12, 10, 10, 10, 10, 15, 15, 15, 15, 11, 11, 11, 11),
                 Production = c(10, 8, 5, 11, 10, 7, 4, 10, 12, 9, 7, 14, 8, 6, 3, 9,
                                9.5, 9, 4, 11, 10, 8, 4, 10, 11, 8, 7, 13, 7.5, 7, 2, 9))

df <- df |> mutate(Year = year(date)) 
HydroSum <- df |> 
  filter(Type=="Hydro") |> 
  group_by(Generator, Year) |> 
  summarize(YearCap = sum(Capacity), YearProd = sum(Production)) |> 
  mutate(Cap_Prod = YearCap - YearProd) |> 
  select(Generator, Year, Cap_Prod)
#> `summarise()` has grouped output by 'Generator'. You can override using the
#> `.groups` argument.
HydroSum
#> # A tibble: 2 × 3
#> # Groups:   Generator [2]
#>   Generator  Year Cap_Prod
#>   <chr>     <dbl>    <dbl>
#> 1 A          2022     28.5
#> 2 D          2022     36.5

df <- left_join(df, HydroSum, by = c("Generator", "Year"))
head(df)
#>         date Generator    Type Period Capacity Production Year Cap_Prod
#> 1 2022-01-01         A   Hydro      1       12         10 2022     28.5
#> 2 2022-01-01         A   Hydro      2       12          8 2022     28.5
#> 3 2022-01-01         A   Hydro      3       12          5 2022     28.5
#> 4 2022-01-01         A   Hydro      4       12         11 2022     28.5
#> 5 2022-01-01         B Nuclear      1       10         10 2022       NA
#> 6 2022-01-01         B Nuclear      2       10          7 2022       NA
df <- df |> mutate(difference = case_when(
  Type == "Hydro" ~ (Capacity - Production) * Cap_Prod,
  TRUE ~ Capacity - Production)
  )
head(df)
#>         date Generator    Type Period Capacity Production Year Cap_Prod
#> 1 2022-01-01         A   Hydro      1       12         10 2022     28.5
#> 2 2022-01-01         A   Hydro      2       12          8 2022     28.5
#> 3 2022-01-01         A   Hydro      3       12          5 2022     28.5
#> 4 2022-01-01         A   Hydro      4       12         11 2022     28.5
#> 5 2022-01-01         B Nuclear      1       10         10 2022       NA
#> 6 2022-01-01         B Nuclear      2       10          7 2022       NA
#>   difference
#> 1       57.0
#> 2      114.0
#> 3      199.5
#> 4       28.5
#> 5        0.0
#> 6        3.0

Created on 2022-10-26 with reprex v2.0.2

1 Like

Thank you so much! This is precisely what I am looking for!

One issue tho, it seems like the actual data that I'm working with contains some missing observations/values that cause the "Difference" column to give "N/A" results. I reckon that I'll have to use "na.rm = TRUE" but do you know where I should put that in? Sorry, I am a newbie to R haha.

You will have to put it inside the sum() calls. :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.