# Merging Two Datasets & Codes to "Validate" Certain Values in a Dataset.

Good day everyone!

I am a "newbie" in R (I am one of those Excel plebs trying to learn R, haha) and was wondering if people could assist me with two problems that I'm trying to solve. I reckon that the first problem is going to be super simple for everyone here, but the second one might be tricky.

Please note that you do not have to answer all questions if you feel unsure about something. Partial answers are perfectly fine since they will still give me some starting points. Again, I am new to R, so please be as descriptive as possible to help me understand and learn.

It should also be noted that the actual datasets I'm working with are a lot bigger and more complex than the example I'm using below. The example only shows 10 units in two regions (region 1 & 2) with four fuel categories (liquid fuel, wind, mixed & solar) and two liquid fuel types (gasoline & diesel), but the actual data has over 100 units across multiple regions that use various types of fuels. Therefore, the codes/solution would have to be as "general/flexible" as possible and should be able to be used for much larger datasets. (I feel like I'm overexplaining here, but the solution shouldn't be too "rigid" to the point that it can ONLY be used to address the examples below).

Problem 1:
I've imported two datasets (Data A and Data B). Each dataset has unique information about ten heat generating units in two regions. Specifically, Data A contains information regarding the location of each unit, general category of fuel it uses, and the fuel cost. Data B has additional information about specific fuel types and fuel fractions for liquid and mixed fuel units. Therefore,

Data A:

Unit ID Region Fuel Fuel Cost (\$)
Unit 1 1 Liquid Fuel 200
Unit 2 2 Wind 0
Unit 3 1 Mixed 168
Unit 4 1 Liquid Fuel 120
Unit 5 2 Liquid Fuel 200
Unit 6 2 Mixed 183.5
Unit 7 1 Solar 0
Unit 8 2 Liquid Fuel 90
Unit 9 1 Liquid Fuel 200
Unit 10 2 Liquid Fuel 90

Data B:

Unit ID Fuel Type Fuel Fraction
Unit 1 Gasoline 1
Unit 3 Gasoline 0.6
Unit 3 Diesel 0.4
Unit 4 Diesel 1
Unit 5 Gasoline 1
Unit 6 Gasoline 0.85
Unit 6 Diesel 0.15
Unit 8 Diesel 1
Unit 9 Gasoline 1
Unit 10 Diesel 1

Note that Data B does not have any information about Unit 2 and Unit 7 in Data A because they use wind and solar, respectively. Also, Unit 3 and 6 have two rows each because they mix different types of fuels (i.e., Unit 3 uses 60% gasoline and 40% diesel).

I wanted to merge these two datasets by Unit ID and create something like this (Data C):

Unit ID Region Fuel Fuel Cost (\$) Fuel Type Fuel Fraction
Unit 1 1 Liquid Fuel 200 Gasoline 1
Unit 2 2 Wind 0
Unit 3 1 Mixed 168 Gasoline 0.6
Unit 3 1 Mixed 168 Diesel 0.4
Unit 4 1 Liquid Fuel 120 Diesel 1
Unit 5 2 Liquid Fuel 200 Gasoline 1
Unit 6 2 Mixed 183.5 Gasoline 0.85
Unit 6 2 Mixed 183.5 Diesel 0.15
Unit 7 1 Solar 0
Unit 8 2 Liquid Fuel 90 Diesel 1
Unit 9 1 Liquid Fuel 200 Gasoline 1
Unit 10 2 Liquid Fuel 90 Diesel 1

In order to do this, I've used the merge function and wrote this:

DataC<- merge(DataA, DataB, by = "Unit ID")

But the result that I got was:

Unit ID Region Fuel Fuel Cost (\$) Fuel Type Fuel Fraction
Unit 1 1 Liquid Fuel 200 Gasoline 1
Unit 3 1 Mixed 168 Gasoline 0.6
Unit 3 1 Mixed 168 Diesel 0.4
Unit 4 1 Liquid Fuel 120 Diesel 1
Unit 5 2 Liquid Fuel 200 Gasoline 1
Unit 6 2 Mixed 183.5 Gasoline 0.85
Unit 6 2 Mixed 183.5 Diesel 0.15
Unit 8 2 Liquid Fuel 90 Diesel 1
Unit 9 1 Liquid Fuel 200 Gasoline 1
Unit 10 2 Liquid Fuel 90 Diesel 1

The issue with this is that the function removes Unit 2 and Unit 7 from the output because Unit 2 and 7 in Data A do not have corresponding/matching Unit IDs in Data B.

So anybody know how to solve this? I feel like the solution would be super simple, but please let me know what you think!

Problem 2:

Now, suppose that we found a way to make Data C looking like the way that I wanted. That is,

Data C:

Unit ID Region Fuel Fuel Cost (\$) Fuel Type Fuel Fraction
Unit 1 1 Liquid Fuel 200 Gasoline 1
Unit 2 2 Wind 0
Unit 3 1 Mixed 168 Gasoline 0.6
Unit 3 1 Mixed 168 Diesel 0.4
Unit 4 1 Liquid Fuel 120 Diesel 1
Unit 5 2 Liquid Fuel 200 Gasoline 1
Unit 6 2 Mixed 183.5 Gasoline 0.85
Unit 6 2 Mixed 183.5 Diesel 0.15
Unit 7 1 Solar 0
Unit 8 2 Liquid Fuel 90 Diesel 1
Unit 9 1 Liquid Fuel 200 Gasoline 1
Unit 10 2 Liquid Fuel 90 Diesel 1

Now, I wanna come up with a chunk of codes that will "validate" the "Fuel Cost (\$)" values for each unit. In other words, I wanna check whether the fuel cost information in Data C match/satisfy certain conditions. These conditions are:

1. Gasoline units spend \$200 for fuel

2. Diesel units in region 1 spend \$120 for fuel

3. Diesel units in region 2 spends \$90 for fuel because diesel is cheaper there

4. For "mixed" units, the fuel costs depend on the fraction of fuels that they use.
For example, Unit 3 uses 60% gasoline and 40% diesel, so the fuel cost is going to be: (\$200 x 60%) + (\$120 x 40%) = \$168.
Similarly, Unit 6 is in region 2 and uses 85% gasoline and 15% diesel, so the fuel cost is going to be: (\$200 x 85%) + (\$90 x 15%) = \$183.5

5. Solar or wind units spends \$0 for fuel (these units have "blank" fuel type and fuel fraction).

So, for example, if the fuel cost for Unit 4 is \$140 instead of \$120 (2nd condition), the code should flag it by saying "No" under "Validation" column. Please see the chart below:

Unit ID Region Fuel Fuel Cost (\$) Fuel Type Fuel Fraction (%) Validation
Unit 1 1 Liquid Fuel 200 Gasoline 1 Yes
Unit 2 2 Wind 0 Yes
Unit 3 1 Mixed 168 Gasoline 0.6 Yes
Unit 3 1 Mixed 168 Diesel 0.4 Yes
Unit 4 1 Liquid Fuel 140 Diesel 1 No
Unit 5 2 Liquid Fuel 200 Gasoline 1 Yes
Unit 6 2 Mixed 183.5 Gasoline 0.85 Yes
Unit 6 2 Mixed 183.5 Diesel 0.15 Yes
Unit 7 1 Solar 0 Yes
Unit 8 2 Liquid Fuel 90 Diesel 1 Yes
Unit 9 1 Liquid Fuel 200 Gasoline 1 Yes
Unit 10 2 Liquid Fuel 90 Diesel 1 Yes

I was thinking maybe I need to use the if...else... function? I tried to figure it out, but I wasn't sure how to approach it due to my lack of experience with R (maybe I am a bit too ambitious with the stuff that I'm trying to do....). Please let me know how I should approach this!

I reckon that the most difficult part of this would be dealing with units that mix different fuels because their fuel costs depend on multiple factors (types of fuel, fuel fraction, and location/region). If it's way too complicated, we can just focus on validating units that do not mix fuels for now.

Thank you for reading my long questions.

For the first question, you have to set the argument all.x of merge() to TRUE.

``````DFA <- data.frame(Unit = 1:3, Value = c(2,4,6))
> DFB <- data.frame(Unit = c(1,3), Value = c(12,16))
> DFA
Unit Value
1    1     2
2    2     4
3    3     6
> DFB
Unit Value
1    1    12
2    3    16
> merge(DFA, DFB, all.x = TRUE, by = "Unit")
Unit Value.x Value.y
1    1       2      12
2    2       4      NA
3    3       6      16
``````

Here is a solution to your Problem 2. Instead of nested ifelse statements, I used case_when. That provides a much cleaner syntax for that kind of logic. The left side consists of a logical test and the right side has values. The first test that evaluates to TRUE provides the value.
The mutate function adds new columns or modifies existing columns. I first calculate the calculated cost as Fuel_Fraction times the CostCheck. I use coalesce() to provide a 0 if Fuel_Fraction is NA. Then for each Unit I sum the calculated cost and name that ValidCost. This will be the sum of one number unless there is mixed fuel. Finally, I compare the ValidCost to the Fuel_Cost, returning TRUE or FALSE.
Note that I changed column names to not have spaces or special characters. I also removed spaces from the unit ID but that was only to ease my translation of your table into a csv file. Rather than posting a table, it is better to define your data as a data frame or tibble and then use the dput() function to post code that others can copy and use to recreate your data quickly.

``````library(dplyr)
DFC
#>    Unit_ID Region        Fuel Fuel_Cost Fuel_Type Fuel_Fraction
#> 1    Unit1      1 Liquid Fuel     200.0  Gasoline          1.00
#> 2    Unit2      2        Wind       0.0                      NA
#> 3    Unit3      1       Mixed     168.0  Gasoline          0.60
#> 4    Unit3      1       Mixed     168.0    Diesel          0.40
#> 5    Unit4      1 Liquid Fuel     120.0    Diesel          1.00
#> 6    Unit5      2 Liquid Fuel     200.0  Gasoline          1.00
#> 7    Unit6      2       Mixed     183.5  Gasoline          0.85
#> 8    Unit6      2       Mixed     183.5    Diesel          0.15
#> 9    Unit7      1       Solar       0.0                      NA
#> 10   Unit8      2 Liquid Fuel      90.0    Diesel          1.00
#> 11   Unit9      1 Liquid Fuel     200.0  Gasoline          1.00
#> 12  Unit10      2  LiquidFuel      90.0    Diesel          1.00
DFD <- DFC |> mutate(CostCheck = case_when(
Fuel %in% c("Wind", "Solar") ~ 0,
Fuel_Type == "Gasoline" ~ 200,
Fuel_Type == "Diesel" & Region == 1 ~ 120,
Fuel_Type == "Diesel" & Region == 2 ~ 90
)) |> mutate(CalcCost = coalesce(Fuel_Fraction,0) * CostCheck) |>
group_by(Unit_ID) |>
mutate(ValidCost = sum(CalcCost),
Validation = Fuel_Cost == ValidCost)
``````

Created on 2022-02-28 by the reprex package (v2.0.1)