Replacing NAs with random values from a set

lhunsicker · January 31, 2023, 3:24am

A common procedure in preparing data sets for analysis is to impute missing values by replacing the NAs with a random member of the set of non-missing values (only legitimate with a small fraction of missing data!). I tried the following code:

library(dplyr)  # provides %>% and mutate()
df <- data.frame(var = c(1, NA, 2, 3, 4, NA, 6, NA, 9, NA))
df$var
 [1]  1 NA  2  3  4 NA  6 NA  9 NA
df <- df %>% mutate(var = ifelse(!is.na(var),var, sample(var[!is.na(var)],1)))
df$var
 [1] 1 9 2 3 4 9 6 9 9 9

This didn't work: "sample(var[!is.na(var)], 1)" ran only once and chose a single value (9) to fill in all the NAs.
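A minimal base R sketch of why that happens: ifelse() takes its result length from the test, and the "no" argument here is a single number that gets recycled into every NA position. The names x, fill and filled are only for illustration, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so a run is repeatable

x <- c(1, NA, 2, 3, 4, NA, 6, NA, 9, NA)

# sample(..., 1) is evaluated once and yields a single number
fill <- sample(x[!is.na(x)], 1)
length(fill)                 # 1

# ifelse() recycles that one number into every NA position
filled <- ifelse(!is.na(x), x, fill)
unique(filled[is.na(x)])     # a single value, equal to fill
```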

I then worked out the following code:

df <- data.frame(pid = 1:10, var = c( 1,NA, 2,3,4,NA,5,NA,6,NA))
df$var
 [1]  1 NA  2  3  4 NA  5 NA  6 NA
df$var <- replace(df$var, which(is.na(df$var)), sample(df$var[!is.na(df$var)], length(which(is.na(df$var)))))
df$var
 [1] 1 2 2 3 4 4 5 6 6 3

This worked, but it is excessively complicated.
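One way the same idea might be shortened in base R, as a sketch rather than a recommendation: index the NAs with a logical vector and draw one value per NA. The helper name impute_random is hypothetical (it is not from any package). Note that it draws with replacement, which is a deliberate change from the code above: sampling without replacement stops with an error as soon as there are more NAs than observed values.

```r
# Hypothetical helper: fill each NA with an independent random draw
# from the observed (non-missing) values of the same vector.
impute_random <- function(x) {
  miss <- is.na(x)
  obs <- x[!miss]
  if (!any(miss)) return(x)
  if (length(obs) == 0) stop("no observed values to sample from")
  # Sample indices rather than values: sample(obs, n) would treat a
  # single number such as 5 as 1:5, which is not wanted here.
  x[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  x
}

df <- data.frame(pid = 1:10, var = c(1, NA, 2, 3, 4, NA, 5, NA, 6, NA))
df$var <- impute_random(df$var)
df$var
```

The observed values stay where they were; only the four NA positions change from run to run.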

My question is whether there is a way to modify the first code so that the sample function is invoked separately for each NA rather than once for all the NAs?

Thanks to anyone who can suggest a better (simpler) way to do this.
Larry Hunsicker

williaml · January 31, 2023, 4:16am

You could do this, though perhaps this is just as complicated.

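library(purrr)  # map_dbl(); dplyr is assumed to be loaded already

df %>% 
  # map_dbl() returns a plain numeric column; map() would return a list column
  mutate(var2 = map_dbl(var, ~if_else(is.na(.x), as.numeric(sample(df$var[!is.na(df$var)], 1)), .x)))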

 ### var2 shown just for demonstration purposes ----
# var var2
# 1    1    1
# 2   NA    4
# 3    2    2
# 4    3    3
# 5    4    4
# 6   NA    9
# 7    6    6
# 8   NA    2
# 9    9    9
# 10  NA    2

nirgrahamuk · January 31, 2023, 9:49am

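Remove the ", 1". Without a size argument, sample() returns all of the non-missing values in random order instead of a single value, so the NAs are no longer all filled from one draw.
To easily confirm which values replaced the NAs, you can temporarily keep the new column under a different name so that you can compare the entries.

This would be:
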
df2 <- df |> mutate(var2 = ifelse(!is.na(var),
                            var, 
                            sample(var[!is.na(var)]))) 

df2 |> filter(is.na(var))

# e.g.
#    var var2
# 1  NA    3
# 2  NA    1
# 3  NA    3
# 4  NA    4

I tend to prefer syntax that minimises repetition, even for something as small as var, so a minor modification might be to swap sample(var[!is.na(var)]) for sample(na.omit(var)).
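One caveat, with a sketch of a possible variant. With the ", 1" removed, sample() returns the 6 non-missing values in shuffled order, and ifelse() recycles those 6 values across the 10 rows. Rows 2 and 8 therefore both read element 2 of the shuffle and always receive the same value, which is visible in the example output above, where the first and third NA rows both got 3. If independent draws are wanted, one option is to ask for one draw per row, with replacement. This assumes dplyr is installed and R is at least 4.1 for the |> pipe.

```r
library(dplyr)

df <- data.frame(var = c(1, NA, 2, 3, 4, NA, 6, NA, 9, NA))

df2 <- df |>
  mutate(var2 = ifelse(is.na(var),
                       # one independent draw per row; only NA rows use theirs
                       sample(na.omit(var), n(), replace = TRUE),
                       var))

# Show only the rows that were imputed
df2 |> filter(is.na(var))
```

Drawing n() values costs a few unused draws for the non-missing rows, but it keeps the expression a single mutate() call.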

lhunsicker · February 1, 2023, 4:02pm

This works, and I was finally able to understand the edit. The help for sample() is not very transparent about the default value of the size parameter, and your correction was very helpful in my understanding it. I learned a couple of other new tricks that will be helpful, too. Many thanks.
Larry Hunsicker
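For anyone landing here later, a small sketch of the default being discussed: when size is omitted, sample(x) draws length(x) values without replacement, which amounts to a random permutation of x. The vector below is just the non-missing values from the first example.

```r
x <- c(1, 2, 3, 4, 6, 9)

perm <- sample(x)      # size omitted: defaults to length(x)
length(perm)           # 6
sort(perm)             # 1 2 3 4 6 9, the same values in a new order

one <- sample(x, 1)    # size given: a single draw
length(one)            # 1
```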
