Help with wrong value in Calculated Column

ksingh19 · August 24, 2020, 6:43pm

Hi,

I am adding two new columns and for some reason, calculation works on one column but not on other column. For this second column which is not working, the result is always NA. I do have some NAs in two of the columns used for its calculation. The sample here shows all NAs, but in real data, its a combination of values and NAs. I also tried replacing NA with 0 and still I see NA as result. In the below example, NP_O should give value of NP when NP_C and NP_T are NA, but it doesn't work for some reason.

Can you please help? Thank you!

df <- data.frame(
  stringsAsFactors = FALSE,
                                      Date = c("2015/01/01",
                                               "2015/01/01","2015/01/01",
                                               "2015/01/01","2015/02/01"),
                                   Country = c("Brazil",
                                               "Canada","Mexico","USA","Brazil"),
                                         M = c(510875,95736,
                                               319243,1205822.64843412,457566),
                                        NP = c(82280,19147.2,
                                               180710,217251,74068),
              NP_C = c(NA, NA, NA, NA, NA),
              NP_T = c(NA, NA, NA, NA, NA)
                        )
# Calculating P and NP_O
df <- df %>%
  mutate("P" = M - `NP`)%>%
  mutate("NP_O" = `NP` - `NP_C` - `NP_T`)

FJCC · August 24, 2020, 7:16pm

If I convert the NA values to zero, the calculation works correctly, I think. If you want to preserve the original values of NP_C and NP_T, you could make new columns of the mutated zero values and delete them from df after the calculation.

df <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("2015/01/01",
           "2015/01/01","2015/01/01",
           "2015/01/01","2015/02/01"),
  Country = c("Brazil",
              "Canada","Mexico","USA","Brazil"),
  M = c(510875,95736,
        319243,1205822.64843412,457566),
  NP = c(82280,19147.2,
         180710,217251,74068),
  NP_C = c(NA, NA, NA, NA, NA),
  NP_T = c(NA, NA, NA, NA, NA)
)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
# Calculating P and NP_O
df
#>         Date Country       M       NP NP_C NP_T
#> 1 2015/01/01  Brazil  510875  82280.0   NA   NA
#> 2 2015/01/01  Canada   95736  19147.2   NA   NA
#> 3 2015/01/01  Mexico  319243 180710.0   NA   NA
#> 4 2015/01/01     USA 1205823 217251.0   NA   NA
#> 5 2015/02/01  Brazil  457566  74068.0   NA   NA
df <- df %>%
  mutate(P = M - `NP`) %>%
  mutate(NP_C = replace_na(NP_C, 0), NP_T = replace_na(NP_T, 0)) %>% 
  mutate(NP_O = NP - NP_C - NP_T)
df
#>         Date Country       M       NP NP_C NP_T        P     NP_O
#> 1 2015/01/01  Brazil  510875  82280.0    0    0 428595.0  82280.0
#> 2 2015/01/01  Canada   95736  19147.2    0    0  76588.8  19147.2
#> 3 2015/01/01  Mexico  319243 180710.0    0    0 138533.0 180710.0
#> 4 2015/01/01     USA 1205823 217251.0    0    0 988571.6 217251.0
#> 5 2015/02/01  Brazil  457566  74068.0    0    0 383498.0  74068.0

^{Created on 2020-08-24 by the reprex package (v0.3.0)}

AlexisW · August 24, 2020, 7:22pm

Here you are doing a substraction, for the first row 82280 - NA - NA. This is the same as asking R "I had $82280, then I spent some money, then I spent more money, how much do I have left?" the answer is we can't know (i.e. NA).

What you want is to use if_else() and is.na(). Here is code for using the value of NP when NP_C OR NP_T are NA.

df <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("2015/01/01",
           "2015/01/01","2015/01/01",
           "2015/01/01","2015/02/01"),
  Country = c("Brazil",
              "Canada","Mexico","USA","Brazil"),
  M = c(510875,95736,
        319243,1205822.64843412,457566),
  NP = c(82280,19147.2,
         180710,217251,74068),
  NP_C = c(12345, 12345, NA, NA, NA),
  NP_T = c(54321, NA, 54321, NA, NA)
)
# Calculating P and NP_O
df %>%
  mutate("P" = M - `NP`)%>%
  mutate("NP_O" = if_else(is.na(NP_C) | is.na(NP_T), NP, NP - NP_C - NP_T))

Note that I changed some values in NP_C and NP_T to better illustrate.
if_else() has 3 arguments, the condition, what to do if true, and what to do if false. You can build the condition with | (or) and & (and). If any of NP_C or NP_T is NA, then you return NP. You weren't clear on what you wanted to do if they are not NA, i.e. the condition is false, so here I just did teh subtraction.

ksingh19 · August 24, 2020, 7:26pm

Thanks @FJCC! Replacing NA with 0 works now. The only change I made was to include warn.conflicts as FALSE as you have it and it works like a charm.

Thanks!

ksingh19 · August 24, 2020, 7:33pm

Thanks @AlexisW!

Even though replacing NA with 0 worked before, this is such a great insight and a better solution for future problems similar to it. I understand it now why R was creating NAs always and using this understanding in my code now. Thank you so much!

system · August 31, 2020, 7:33pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.