Help with wrong value in Calculated Column

Hi,

I am adding two new columns and for some reason, calculation works on one column but not on other column. For this second column which is not working, the result is always NA. I do have some NAs in two of the columns used for its calculation. The sample here shows all NAs, but in real data, its a combination of values and NAs. I also tried replacing NA with 0 and still I see NA as result. In the below example, NP_O should give value of NP when NP_C and NP_T are NA, but it doesn't work for some reason.

Can you please help? Thank you!

df <- data.frame(
  stringsAsFactors = FALSE,
                                      Date = c("2015/01/01",
                                               "2015/01/01","2015/01/01",
                                               "2015/01/01","2015/02/01"),
                                   Country = c("Brazil",
                                               "Canada","Mexico","USA","Brazil"),
                                         M = c(510875,95736,
                                               319243,1205822.64843412,457566),
                                        NP = c(82280,19147.2,
                                               180710,217251,74068),
              NP_C = c(NA, NA, NA, NA, NA),
              NP_T = c(NA, NA, NA, NA, NA)
                        )
# Calculating P and NP_O
df <- df %>%
  mutate("P" = M - `NP`)%>%
  mutate("NP_O" = `NP` - `NP_C` - `NP_T`)

If I convert the NA values to zero, the calculation works correctly, I think. If you want to preserve the original values of NP_C and NP_T, you could make new columns of the mutated zero values and delete them from df after the calculation.

df <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("2015/01/01",
           "2015/01/01","2015/01/01",
           "2015/01/01","2015/02/01"),
  Country = c("Brazil",
              "Canada","Mexico","USA","Brazil"),
  M = c(510875,95736,
        319243,1205822.64843412,457566),
  NP = c(82280,19147.2,
         180710,217251,74068),
  NP_C = c(NA, NA, NA, NA, NA),
  NP_T = c(NA, NA, NA, NA, NA)
)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
# Calculating P and NP_O
df
#>         Date Country       M       NP NP_C NP_T
#> 1 2015/01/01  Brazil  510875  82280.0   NA   NA
#> 2 2015/01/01  Canada   95736  19147.2   NA   NA
#> 3 2015/01/01  Mexico  319243 180710.0   NA   NA
#> 4 2015/01/01     USA 1205823 217251.0   NA   NA
#> 5 2015/02/01  Brazil  457566  74068.0   NA   NA
df <- df %>%
  mutate(P = M - `NP`) %>%
  mutate(NP_C = replace_na(NP_C, 0), NP_T = replace_na(NP_T, 0)) %>% 
  mutate(NP_O = NP - NP_C - NP_T)
df
#>         Date Country       M       NP NP_C NP_T        P     NP_O
#> 1 2015/01/01  Brazil  510875  82280.0    0    0 428595.0  82280.0
#> 2 2015/01/01  Canada   95736  19147.2    0    0  76588.8  19147.2
#> 3 2015/01/01  Mexico  319243 180710.0    0    0 138533.0 180710.0
#> 4 2015/01/01     USA 1205823 217251.0    0    0 988571.6 217251.0
#> 5 2015/02/01  Brazil  457566  74068.0    0    0 383498.0  74068.0

Created on 2020-08-24 by the reprex package (v0.3.0)

2 Likes

Here you are doing a substraction, for the first row 82280 - NA - NA. This is the same as asking R "I had $82280, then I spent some money, then I spent more money, how much do I have left?" the answer is we can't know (i.e. NA).

What you want is to use if_else() and is.na(). Here is code for using the value of NP when NP_C OR NP_T are NA.

df <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("2015/01/01",
           "2015/01/01","2015/01/01",
           "2015/01/01","2015/02/01"),
  Country = c("Brazil",
              "Canada","Mexico","USA","Brazil"),
  M = c(510875,95736,
        319243,1205822.64843412,457566),
  NP = c(82280,19147.2,
         180710,217251,74068),
  NP_C = c(12345, 12345, NA, NA, NA),
  NP_T = c(54321, NA, 54321, NA, NA)
)
# Calculating P and NP_O
df %>%
  mutate("P" = M - `NP`)%>%
  mutate("NP_O" = if_else(is.na(NP_C) | is.na(NP_T), NP, NP - NP_C - NP_T))

Note that I changed some values in NP_C and NP_T to better illustrate.
if_else() has 3 arguments, the condition, what to do if true, and what to do if false. You can build the condition with | (or) and & (and). If any of NP_C or NP_T is NA, then you return NP. You weren't clear on what you wanted to do if they are not NA, i.e. the condition is false, so here I just did teh subtraction.

1 Like

Thanks @FJCC! Replacing NA with 0 works now. The only change I made was to include warn.conflicts as FALSE as you have it and it works like a charm.

Thanks!

Thanks @AlexisW!

Even though replacing NA with 0 worked before, this is such a great insight and a better solution for future problems similar to it. I understand it now why R was creating NAs always and using this understanding in my code now. Thank you so much!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.