# Sum of missing values

Hello Everyone,

I have three variables which are x1, x2 and x3. These variables have missing values in them. Now I want to generate a new variable x which will basically sum x1, x2 and x3 in the following way:

a. If there is no missing values in an observation of x1, x2 and x3, then x will show the sum of the three variables.
b. If there is only one missing value in either x1 or x2 or x3, then x will average the two existing values and multiplies it by three (Thus I am assuming that the missing value is the average of the two existing values)
c. x will show a missing value (NA) if there are two or three missing values in x1, x2 or x3.

Thank you very much.

Naveed

Hi @Naveed,

Welcome to the RStudio Community!! I would like to help out; however, I have a question first. Do the variables `x1`, `x2` and `x3` each contain a single value? For example, would it be correct to say that we can have a situation in which `x1` is equal to `5`, `x2` is equal to `NA` and `x3` is equal to `12`? I think that's the case from how I understand your post, but I need to make sure.

Hello Gueyenono,

Yes, you are correct. Each contain single value. Below I am giving an example of my data:

x1 x2 x3 x
1 2 3 6
NA NA 8 NA
4 3 NA 10.5
NA 8 3 16.5
2 9 7 18

I hope this will explain my data better.

Thanks a lot again.

Hi Naveed, welcome to the community!

A straightforward crude solution:

``````df <- data.frame(x1 = c(1, NA, 4, NA, 2),
x2 = c(2, NA, 3, 8, 9),
x3 = c(3, 8, NA, 3, 7))

apply(X = df,
MARGIN = 1,
FUN = function(t) {
if (sum(is.na(x = t)) > 1)
{
return(NA)
} else
{
return(3 * mean(x = t,
na.rm = TRUE))
}
})
#>   6.0   NA 10.5 16.5 18.0
``````

Created on 2019-04-02 by the reprex package (v0.2.1)

1 Like

Okay, I understand better now. In this case, you have a data frame (i.e. a table of data) with 3 columns (i.e. `x1`, `x2` and `x3`) and you are trying to create a 4th column using the conditions that you stated in the original post. Let me start by replicating the data that you provided in your example:

``````library(dplyr)

# Create the data frame

my_data <- tibble(
x1 = c(1, NA, 4, NA, 2),
x2 = c(2, NA, NA, 8, 7),
x3 = c(3, 8, NA, 3, 7)
)

my_data

x1    x2    x3
<dbl> <dbl> <dbl>
1     1     2     3
2    NA    NA     8
3     4    NA    NA
4    NA     8     3
5     2     7     7
``````

Alright, now that we have the data, it is time to make the 4th column: `x`. The following code might be a bit intricate if you are not familiar with the `purrr` package, but I could rewrite it in a different way if you need. But, basically:

1. First, I count the number of `NA`s in each row of `my_data`.
2. Then, I implement the right calculation based on whether the number of `NA`s is `0`, `1` or anything else.
``````my_data %>%
mutate(

x = pmap_dbl(list(x1, x2, x3), function(...){

row_values <- unlist(list(...))

number_of_NAs <- sum(is.na(row_values))

map_dbl(number_of_NAs, ~ case_when(
.x == 0 ~ sum(row_values),
.x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
TRUE ~ NA_real_
))

}))

x1    x2    x3     x
<dbl> <dbl> <dbl> <dbl>
1     1     2     3   6
2    NA    NA     8  NA
3     4     3    NA  10.5
4    NA     8     3  16.5
5     2     9     7  18
``````
2 Likes

Here's another one that is dplyr dependent. The only tricky thing is that this `rowSums(!is.na(.))` counts the number of non-null columns in a row.

``````library(dplyr)

# Create the data frame

my_data <- tibble(
x1 = c(1, NA, 4, NA, 2),
x2 = c(2, NA, NA, 8, 7),
x3 = c(3, 8, NA, 3, 7)
)

my_data%>%
mutate(x = case_when(rowSums(!is.na(.)) == 1 ~ NA_real_,
TRUE ~ 3*rowMeans(.,na.rm = TRUE)
)
)``````
2 Likes

Hi Gueyenono,

Thank you very much for your response. Actually I am not familiar in working with purrr package. And somehow it is showing me error when I run the commands that you have given.

If you have time, then can you please explain in a bit detail? I wanted to attach my dataset here (just in case if it helps) but I cant upload excel file here.

Thank you so much.

Hey @Naveed,

A bit of advice for getting the best help possible. If you tried something, which did not work, it is always a good idea to post your code as well as the errors that you got. This will allow everybody to guide you in the right direction Did you install the `dplyr` and purrr` packages? If you did not, then run:

`install.packages(c("dplyr", "purrr"))`

Also, on top of your script, you should have:

``````library(dplyr)
library(purrr)
``````

So your script you look like this:

``````# Load useful packages

library(dplyr)
library(purrr)

# Import the data (instead of importing the data, I generate it)

my_data <- tibble(
x1 = c(1, NA, 4, NA, 2),
x2 = c(2, NA, NA, 8, 7),
x3 = c(3, 8, NA, 3, 7)
)

# Create the x column with the stated conditions

my_data %>%
mutate(

x = pmap_dbl(list(x1, x2, x3), function(...){

row_values <- unlist(list(...))

number_of_NAs <- sum(is.na(row_values))

map_dbl(number_of_NAs, ~ case_when(
.x == 0 ~ sum(row_values),
.x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
TRUE ~ NA_real_
))
}))
``````

Hello Gueyenono,

Thanks for your helpful tips. I have run the following codes. Here I want to create the variable Employment from the variables Employment1, Employment2 and Employmen3.

Lesson3_WBData %>%
mutate(

``````Employment = pmap_dbl(list(Employment1, Employment2, Employmen3), function(...){

row_values <- unlist(list(...))

number_of_NAs <- sum(is.na(row_values))

map_dbl(number_of_NAs, ~ case_when(
.Employment == 0 ~ sum(row_values),
.Employment == 1 ~ mean(row_values, na.rm = TRUE) * 3,
TRUE ~ NA_real_
))
}))
``````

But in the console, I get the following error:

Thanks again Gueyenono

It would also be very helpful if you could make your data accessible to your potential helpers. If the data is too sensitive (e.g. because it's a company's data or something of the sort), it helps to try to recreate a similar dataset and provide the code you used to do so.

Can you share your data? If you are wondering how, you could save your data file in a cloud service (provided you use one) such as Google Drive, Dropbox or OneDrive and give us a link to download it.

Thanks a lot again. I am sure over time you will see more polished posts and replies from me.

Here is the link of my data:

You are very welcome @Naveed. We all are on the long journey of learning Your file is an excel file so I am using the `read_xlsx()` function from the `readxl` package to import it. Also, note that there is a small typo in the column name for `Employment3`. Currently, it says `Employmen3` (without the final `t`).

Alright, here is the code that does the job:

``````# Load useful packages

library(dplyr)
library(purrr)

# Import the dataset and perform the computations

mutate(

Employment = pmap_dbl(list(Employment1, Employment2, Employmen3), function(...){

row_values <- unlist(list(...))

number_of_NAs <- sum(is.na(row_values))

map_dbl(number_of_NAs, ~ case_when(
.x == 0 ~ sum(row_values),
.x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
TRUE ~ NA_real_
))
}))
``````

And here is a snapshot of `my_data` with the relevant columns:

``````my_data %>%
select(contains("Employ")) %>%

Employment1 Employment2 Employmen3 Employment
<dbl>       <dbl>      <dbl>      <dbl>
1    59.122        6.9750     NA         99.146
2    42.108       16.469      41.423    100.00
3    35.806       21.976      42.217     99.999
4    21.606       NA          49.364    106.45
5     0.18600     39.566      60.242     99.994
6     0.80800     34.078      65.113     99.999
``````

I am not sure what your experience in R is but I highly recommend you read (i.e. study) R For Data Science by Hadley Wickham, which you can access for free at the following link:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.