Sum of missing values

Hello Everyone,

Can anybody please help me to code/program the following situation?

I have three variables which are x1, x2 and x3. These variables have missing values in them. Now I want to generate a new variable x which will basically sum x1, x2 and x3 in the following way:

a. If there is no missing values in an observation of x1, x2 and x3, then x will show the sum of the three variables.
b. If there is only one missing value in either x1 or x2 or x3, then x will average the two existing values and multiplies it by three (Thus I am assuming that the missing value is the average of the two existing values)
c. x will show a missing value (NA) if there are two or three missing values in x1, x2 or x3.

Thank you very much.

Naveed

Hi @Naveed,

Welcome to the RStudio Community!! :slight_smile:

I would like to help out; however, I have a question first. Do the variables x1, x2 and x3 each contain a single value? For example, would it be correct to say that we can have a situation in which x1 is equal to 5, x2 is equal to NA and x3 is equal to 12? I think that's the case from how I understand your post, but I need to make sure.

Hello Gueyenono,

Thank you very much for your reply. Really appreciate your intention to help.

Yes, you are correct. Each contain single value. Below I am giving an example of my data:

x1 x2 x3 x
1 2 3 6
NA NA 8 NA
4 3 NA 10.5
NA 8 3 16.5
2 9 7 18

I hope this will explain my data better.

Thanks a lot again.

Hi Naveed, welcome to the community!

A straightforward crude solution:

df <- data.frame(x1 = c(1, NA, 4, NA, 2),
                 x2 = c(2, NA, 3, 8, 9),
                 x3 = c(3, 8, NA, 3, 7))

apply(X = df,
      MARGIN = 1,
      FUN = function(t) {
        if (sum(is.na(x = t)) > 1)
        {
          return(NA)
        } else
        {
          return(3 * mean(x = t,
                          na.rm = TRUE))
        }
      })
#> [1]  6.0   NA 10.5 16.5 18.0

Created on 2019-04-02 by the reprex package (v0.2.1)

1 Like

Okay, I understand better now. In this case, you have a data frame (i.e. a table of data) with 3 columns (i.e. x1, x2 and x3) and you are trying to create a 4th column using the conditions that you stated in the original post. Let me start by replicating the data that you provided in your example:

library(dplyr)

# Create the data frame

my_data <- tibble(
  x1 = c(1, NA, 4, NA, 2),
  x2 = c(2, NA, NA, 8, 7),
  x3 = c(3, 8, NA, 3, 7)
)

my_data

     x1    x2    x3
  <dbl> <dbl> <dbl>
1     1     2     3
2    NA    NA     8
3     4    NA    NA
4    NA     8     3
5     2     7     7

Alright, now that we have the data, it is time to make the 4th column: x. The following code might be a bit intricate if you are not familiar with the purrr package, but I could rewrite it in a different way if you need. But, basically:

  1. First, I count the number of NAs in each row of my_data.
  2. Then, I implement the right calculation based on whether the number of NAs is 0, 1 or anything else.
my_data %>%
  mutate(
    
    x = pmap_dbl(list(x1, x2, x3), function(...){
      
      row_values <- unlist(list(...))
      
      number_of_NAs <- sum(is.na(row_values))
      
      map_dbl(number_of_NAs, ~ case_when(
        .x == 0 ~ sum(row_values),
        .x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
        TRUE ~ NA_real_
      ))
      
    }))

     x1    x2    x3     x
  <dbl> <dbl> <dbl> <dbl>
1     1     2     3   6  
2    NA    NA     8  NA  
3     4     3    NA  10.5
4    NA     8     3  16.5
5     2     9     7  18  
2 Likes

Here's another one that is dplyr dependent. The only tricky thing is that this rowSums(!is.na(.)) counts the number of non-null columns in a row.

library(dplyr)

# Create the data frame

my_data <- tibble(
  x1 = c(1, NA, 4, NA, 2),
  x2 = c(2, NA, NA, 8, 7),
  x3 = c(3, 8, NA, 3, 7)
)

my_data%>%
  mutate(x = case_when(rowSums(!is.na(.)) == 1 ~ NA_real_,
                       TRUE ~ 3*rowMeans(.,na.rm = TRUE)
    )
  )
2 Likes

Hi Gueyenono,

Thank you very much for your response. Actually I am not familiar in working with purrr package. And somehow it is showing me error when I run the commands that you have given.

If you have time, then can you please explain in a bit detail? I wanted to attach my dataset here (just in case if it helps) but I cant upload excel file here.

Thank you so much.

Hey @Naveed,

A bit of advice for getting the best help possible. If you tried something, which did not work, it is always a good idea to post your code as well as the errors that you got. This will allow everybody to guide you in the right direction :slight_smile:

Did you install the dplyr and purrr` packages? If you did not, then run:

install.packages(c("dplyr", "purrr"))

Also, on top of your script, you should have:

library(dplyr)
library(purrr)

So your script you look like this:

# Load useful packages

library(dplyr)
library(purrr)

# Import the data (instead of importing the data, I generate it)

my_data <- tibble(
  x1 = c(1, NA, 4, NA, 2),
  x2 = c(2, NA, NA, 8, 7),
  x3 = c(3, 8, NA, 3, 7)
)

# Create the x column with the stated conditions

my_data %>%
  mutate(
    
    x = pmap_dbl(list(x1, x2, x3), function(...){
      
      row_values <- unlist(list(...))
      
      number_of_NAs <- sum(is.na(row_values))
      
      map_dbl(number_of_NAs, ~ case_when(
        .x == 0 ~ sum(row_values),
        .x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
        TRUE ~ NA_real_
      ))
    }))

Hello Gueyenono,

Thanks for your helpful tips. I have run the following codes. Here I want to create the variable Employment from the variables Employment1, Employment2 and Employmen3.

Lesson3_WBData %>%
mutate(

Employment = pmap_dbl(list(Employment1, Employment2, Employmen3), function(...){
  
  row_values <- unlist(list(...))
  
  number_of_NAs <- sum(is.na(row_values))
  
  map_dbl(number_of_NAs, ~ case_when(
    .Employment == 0 ~ sum(row_values),
    .Employment == 1 ~ mean(row_values, na.rm = TRUE) * 3,
    TRUE ~ NA_real_
  ))
}))

But in the console, I get the following error:

Error: Evaluation error: object '.Employment' not found.

Thanks again Gueyenono

@Naveed,

It would also be very helpful if you could make your data accessible to your potential helpers. If the data is too sensitive (e.g. because it's a company's data or something of the sort), it helps to try to recreate a similar dataset and provide the code you used to do so.

Can you share your data? If you are wondering how, you could save your data file in a cloud service (provided you use one) such as Google Drive, Dropbox or OneDrive and give us a link to download it.

Thanks a lot again. I am sure over time you will see more polished posts and replies from me.

Here is the link of my data:

Please do let me know if the link does not work and data cant be downloaded.

You are very welcome @Naveed. We all are on the long journey of learning :slight_smile:

Your file is an excel file so I am using the read_xlsx() function from the readxl package to import it. Also, note that there is a small typo in the column name for Employment3. Currently, it says Employmen3 (without the final t).

Alright, here is the code that does the job:

# Load useful packages

library(dplyr)
library(purrr)
library(readxl)

# Import the dataset and perform the computations

my_data <- read_xlsx("Lesson3_WorldBank Data.xlsx") %>%
  mutate(
    
    Employment = pmap_dbl(list(Employment1, Employment2, Employmen3), function(...){
      
      row_values <- unlist(list(...))
      
      number_of_NAs <- sum(is.na(row_values))
      
      map_dbl(number_of_NAs, ~ case_when(
        .x == 0 ~ sum(row_values),
        .x == 1 ~ mean(row_values, na.rm = TRUE) * 3,
        TRUE ~ NA_real_
      ))
    }))

And here is a snapshot of my_data with the relevant columns:

my_data %>%
  select(contains("Employ")) %>%
  head()

  Employment1 Employment2 Employmen3 Employment
        <dbl>       <dbl>      <dbl>      <dbl>
1    59.122        6.9750     NA         99.146
2    42.108       16.469      41.423    100.00 
3    35.806       21.976      42.217     99.999
4    21.606       NA          49.364    106.45 
5     0.18600     39.566      60.242     99.994
6     0.80800     34.078      65.113     99.999

I am not sure what your experience in R is but I highly recommend you read (i.e. study) R For Data Science by Hadley Wickham, which you can access for free at the following link:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.