Need help replacing the minimum value with NA across columns in my dataframe with dplyr

Radon_my_Nikodym196 · August 12, 2022, 6:54pm

Hello, I need help "de-imputing" my data. I know conceptually what I have to do and have an idea of how it should look, but I'm struggling with the implementation. I have done some reading on Stack Overflow and found this answer particularly helpful: "r - Replace all NA values with the (minimum value/2) value for each column, in large 6000+ column dataset - Stack Overflow" (though "r - Correct syntax for mutate_if - Stack Overflow" wasn't bad either). I'm trying to adapt the top answer there to my code here. The data comes to me already imputed, but for merging purposes I need to "unimpute" it.

Thankfully, the imputation process we use isn't complicated: missing values are imputed with the minimum value observed in their column, so it's just a matter of going across the columns of the dataframe and replacing each column's minimum value with NA. Below is some dummy code that provides a reproducible example of the kind of data I am working with:

library(dplyr)  # provides bind_rows() and %>%

# making a few toy data frames to construct an example of one final merged and imputed df.
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))

names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")

# merging the toy dataframes together.
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)

# creating an imputation function I'll need.
imputeme <- function(x) {
  # Replace each NA with the smallest observed value in x.
  ifelse(is.na(x), min(x, na.rm = TRUE), x)
}

# creating the imputed "data all" file/data set.
toy_data_all_imputed <- apply(toy_data_all, 2, imputeme)

Here is where I'm running into issues. On my "real life" version of toy_data_all_imputed (which is called volNormImputedData), I am trying to run the following code to un-impute it:

# creating the data I need and piping it
volNormImputedData <- volNormData %>% 
  mutate_if(is.numeric, ~replace(., min(.), is.na(.)))

However, when I run this, it doesn't return any errors, but it does return 22 warnings that say "number of items to replace is not a multiple of replacement length".

(If you run the same code on the toy data, with toy_data_all_imputed in place of volNormImputedData and toy_data_all in place of volNormData, you get the same warning message as above plus an error message that says,
Error in mutate(): ! Problem while computing x2 = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ... . Caused by error in x[list] <- values: ! NAs are not allowed in subscripted assignments.)

To try and get around this, I tried to reverse what my imputation function did by creating

Un_imputeme <- function(x){
  value <- ifelse(min(x,na.rm=TRUE),
                  is.na(x),x); value
}

But when I ran this on both my toy_data_all_imputed data set as well as my actual, real life data of volNormImputedData, it didn't work/replace the minimum in each column with NA, so I am now stuck/blocked and could use some help.

I really want to use the dplyr library if I can, for two reasons. First, I'm trying to become proficient with it, and I believe even my little imputation function could easily be written with a few dplyr commands. Second, that first example from Stack Overflow is so close to what I need, and I can't understand why the minimal change I made to it isn't working for my use case. I greatly appreciate the time taken to read this post and help me. Thank you!

-Radon.

DavoWW · August 13, 2022, 5:09am

Hi @Radon_my_Nikodym196,
Thanks for supplying a good example of your problem. You were 90% there.
Note that the solutions below also set each column's genuine minimum to NA, not only the imputed copies of it, because the two are indistinguishable once the data has been imputed. You will have to decide whether this matters for your analysis.
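
To make that caveat concrete, here is a minimal sketch on a single made-up column (the vector `x` and its values are hypothetical, not taken from the data in this thread). It imputes with the minimum, reverses the imputation the same way as the solutions below, and compares the NA counts before and after.

```r
# Hypothetical toy column: two missing values and one genuine minimum (-2).
x <- c(1.5, NA, -2, 0.3, NA)

# Impute with the observed minimum, as described in the question.
x_imputed <- ifelse(is.na(x), min(x, na.rm = TRUE), x)

# Un-impute by blanking every value equal to the column minimum.
x_restored <- ifelse(x_imputed == min(x_imputed, na.rm = TRUE), NA, x_imputed)

sum(is.na(x))           # NAs in the original column
sum(is.na(x_restored))  # NAs after the round trip: one more than before
```

The extra NA is the genuine minimum. The same thing happens to a column that never had missing values: x9 in the output below is present in all three toy data frames, yet it still loses its smallest value.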

suppressPackageStartupMessages(library(tidyverse))

# Making a few toy data frames to construct an example of one final merged and imputed df.
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))

# Deliberately introduce NA values using mis-matched column names
names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")

# merging the toy dataframes together.
# Need to force result to be a dataframe.
toy_data_all <- as.data.frame(bind_rows(toy_df1, toy_df2, toy_df3))

head(toy_data_all)
#>            x1          x2         x3         x4         x5          x6
#> 1  0.98295425 -0.08242873 -0.3577468 -0.8124883  2.1009541  0.36934203
#> 2 -0.19293366  1.65683146  0.9431282 -2.1358237 -0.1480268  0.01517654
#> 3 -1.00176910 -0.11726903  0.2140003  1.1451571 -0.2461528  0.19253270
#> 4  0.60431347 -0.65940289 -0.0780722 -0.3259104  0.6930319 -1.21569047
#> 5 -0.08835004  1.21018336  0.0417432 -1.3386030  1.0936188  0.48476012
#> 6  0.44113246 -0.10734136  0.6133137 -1.3081755 -1.6588221  0.19949441
#>           x7         x8         x9        x10 x11 x13
#> 1  2.1146015 -0.6705700  1.1954711 -0.4035832  NA  NA
#> 2  0.1806518 -0.0849321 -1.7979244 -1.5161697  NA  NA
#> 3 -1.3212254  1.2047925 -0.5358144 -1.1834244  NA  NA
#> 4  0.4856780 -1.7217910  0.8037951  0.6448625  NA  NA
#> 5 -0.4018093  0.7162611 -0.6450060 -0.4653116  NA  NA
#> 6  0.8550276  0.2762728 -0.2093185 -0.9682742  NA  NA
str(toy_data_all)
#> 'data.frame':    30 obs. of  12 variables:
#>  $ x1 : num  0.983 -0.1929 -1.0018 0.6043 -0.0884 ...
#>  $ x2 : num  -0.0824 1.6568 -0.1173 -0.6594 1.2102 ...
#>  $ x3 : num  -0.3577 0.9431 0.214 -0.0781 0.0417 ...
#>  $ x4 : num  -0.812 -2.136 1.145 -0.326 -1.339 ...
#>  $ x5 : num  2.101 -0.148 -0.246 0.693 1.094 ...
#>  $ x6 : num  0.3693 0.0152 0.1925 -1.2157 0.4848 ...
#>  $ x7 : num  2.115 0.181 -1.321 0.486 -0.402 ...
#>  $ x8 : num  -0.6706 -0.0849 1.2048 -1.7218 0.7163 ...
#>  $ x9 : num  1.195 -1.798 -0.536 0.804 -0.645 ...
#>  $ x10: num  -0.404 -1.516 -1.183 0.645 -0.465 ...
#>  $ x11: num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ x13: num  NA NA NA NA NA NA NA NA NA NA ...

# creating an imputation function I'll need.
imputeme <- function(x) {
  # Replace each NA with the smallest observed value in x.
  ifelse(is.na(x), min(x, na.rm = TRUE), x)
}

# creating the imputed data set.
# Need to force result to be a dataframe.
toy_data_all_imputed <- as.data.frame(apply(toy_data_all, 2, imputeme))

head(toy_data_all_imputed)
#>            x1          x2         x3         x4         x5          x6
#> 1  0.98295425 -0.08242873 -0.3577468 -0.8124883  2.1009541  0.36934203
#> 2 -0.19293366  1.65683146  0.9431282 -2.1358237 -0.1480268  0.01517654
#> 3 -1.00176910 -0.11726903  0.2140003  1.1451571 -0.2461528  0.19253270
#> 4  0.60431347 -0.65940289 -0.0780722 -0.3259104  0.6930319 -1.21569047
#> 5 -0.08835004  1.21018336  0.0417432 -1.3386030  1.0936188  0.48476012
#> 6  0.44113246 -0.10734136  0.6133137 -1.3081755 -1.6588221  0.19949441
#>           x7         x8         x9        x10       x11       x13
#> 1  2.1146015 -0.6705700  1.1954711 -0.4035832 -1.887954 -1.447181
#> 2  0.1806518 -0.0849321 -1.7979244 -1.5161697 -1.887954 -1.447181
#> 3 -1.3212254  1.2047925 -0.5358144 -1.1834244 -1.887954 -1.447181
#> 4  0.4856780 -1.7217910  0.8037951  0.6448625 -1.887954 -1.447181
#> 5 -0.4018093  0.7162611 -0.6450060 -0.4653116 -1.887954 -1.447181
#> 6  0.8550276  0.2762728 -0.2093185 -0.9682742 -1.887954 -1.447181
str(toy_data_all_imputed)
#> 'data.frame':    30 obs. of  12 variables:
#>  $ x1 : num  0.983 -0.1929 -1.0018 0.6043 -0.0884 ...
#>  $ x2 : num  -0.0824 1.6568 -0.1173 -0.6594 1.2102 ...
#>  $ x3 : num  -0.3577 0.9431 0.214 -0.0781 0.0417 ...
#>  $ x4 : num  -0.812 -2.136 1.145 -0.326 -1.339 ...
#>  $ x5 : num  2.101 -0.148 -0.246 0.693 1.094 ...
#>  $ x6 : num  0.3693 0.0152 0.1925 -1.2157 0.4848 ...
#>  $ x7 : num  2.115 0.181 -1.321 0.486 -0.402 ...
#>  $ x8 : num  -0.6706 -0.0849 1.2048 -1.7218 0.7163 ...
#>  $ x9 : num  1.195 -1.798 -0.536 0.804 -0.645 ...
#>  $ x10: num  -0.404 -1.516 -1.183 0.645 -0.465 ...
#>  $ x11: num  -1.89 -1.89 -1.89 -1.89 -1.89 ...
#>  $ x13: num  -1.45 -1.45 -1.45 -1.45 -1.45 ...

# Check: the imputed values in x11 and x13 match the column-wise minimums
apply(toy_data_all, 2, min, na.rm=TRUE)
#>        x1        x2        x3        x4        x5        x6        x7        x8 
#> -1.151563 -1.082896 -2.094066 -2.135824 -1.695153 -1.884420 -2.672497 -2.139812 
#>        x9       x10       x11       x13 
#> -1.797924 -1.919782 -1.887954 -1.447181

Un_imputeme <- function(x){
  # Set every value equal to the column minimum back to NA.
  value <- min(x, na.rm=TRUE)
  ifelse(x==value, NA, x)
}

Un_imputeme(toy_data_all_imputed$x4)
#>  [1] -0.81248829          NA  1.14515713 -0.32591036 -1.33860299 -1.30817553
#>  [7]  0.72004965 -1.77745866  1.76281538  0.01315958          NA          NA
#> [13]          NA          NA          NA          NA          NA          NA
#> [19]          NA          NA -1.81997300 -0.47621406 -0.60041151 -1.06902010
#> [25] -0.89004886 -0.71862591 -1.30061246  1.04264666  1.16343785 -0.79814594
#lapply(toy_data_all_imputed, Un_imputeme)

fixed <- data.frame(lapply(toy_data_all_imputed, Un_imputeme))

head(fixed)
#>            x1          x2         x3         x4         x5          x6
#> 1  0.98295425 -0.08242873 -0.3577468 -0.8124883  2.1009541  0.36934203
#> 2 -0.19293366  1.65683146  0.9431282         NA -0.1480268  0.01517654
#> 3 -1.00176910 -0.11726903  0.2140003  1.1451571 -0.2461528  0.19253270
#> 4  0.60431347 -0.65940289 -0.0780722 -0.3259104  0.6930319 -1.21569047
#> 5 -0.08835004  1.21018336  0.0417432 -1.3386030  1.0936188  0.48476012
#> 6  0.44113246 -0.10734136  0.6133137 -1.3081755 -1.6588221  0.19949441
#>           x7         x8         x9        x10 x11 x13
#> 1  2.1146015 -0.6705700  1.1954711 -0.4035832  NA  NA
#> 2  0.1806518 -0.0849321         NA -1.5161697  NA  NA
#> 3 -1.3212254  1.2047925 -0.5358144 -1.1834244  NA  NA
#> 4  0.4856780 -1.7217910  0.8037951  0.6448625  NA  NA
#> 5 -0.4018093  0.7162611 -0.6450060 -0.4653116  NA  NA
#> 6  0.8550276  0.2762728 -0.2093185 -0.9682742  NA  NA

# Use dplyr instead
fixed2 <- toy_data_all_imputed %>% 
  mutate_if(is.numeric, ~ifelse(.==min(.), NA, .))

head(fixed2)
#>            x1          x2         x3         x4         x5          x6
#> 1  0.98295425 -0.08242873 -0.3577468 -0.8124883  2.1009541  0.36934203
#> 2 -0.19293366  1.65683146  0.9431282         NA -0.1480268  0.01517654
#> 3 -1.00176910 -0.11726903  0.2140003  1.1451571 -0.2461528  0.19253270
#> 4  0.60431347 -0.65940289 -0.0780722 -0.3259104  0.6930319 -1.21569047
#> 5 -0.08835004  1.21018336  0.0417432 -1.3386030  1.0936188  0.48476012
#> 6  0.44113246 -0.10734136  0.6133137 -1.3081755 -1.6588221  0.19949441
#>           x7         x8         x9        x10 x11 x13
#> 1  2.1146015 -0.6705700  1.1954711 -0.4035832  NA  NA
#> 2  0.1806518 -0.0849321         NA -1.5161697  NA  NA
#> 3 -1.3212254  1.2047925 -0.5358144 -1.1834244  NA  NA
#> 4  0.4856780 -1.7217910  0.8037951  0.6448625  NA  NA
#> 5 -0.4018093  0.7162611 -0.6450060 -0.4653116  NA  NA
#> 6  0.8550276  0.2762728 -0.2093185 -0.9682742  NA  NA

Created on 2022-08-13 by the reprex package (v2.0.1)
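
As for why the original `replace(., min(.), is.na(.))` attempt misbehaved: base R's `replace(x, list, values)` expects the positions to overwrite as its second argument and the new values as its third. The call therefore used the column minimum as an index and a full-length logical vector as the replacement, which would explain both the "not a multiple of replacement length" warnings and, when the minimum is itself NA, the "NAs are not allowed in subscripted assignments" error. Below is a sketch of the same idea with the arguments in the intended order, written with `across()`, which supersedes `mutate_if()` in dplyr 1.0.0 and later. The data frame `df` and the name `fixed3` are made up for illustration.

```r
library(dplyr)

# Hypothetical imputed data: each column's minimum stands in for former NAs.
df <- data.frame(a = c(3, 1, 1, 2), b = c(-5, 0, 4, -5))

# replace(x, list, values): the logical index marks the minimums,
# and NA is the value written into those positions.
fixed3 <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x == min(.x, na.rm = TRUE), NA)))

fixed3
```

Passing `na.rm = TRUE` to `min()` is a precaution, so that a column which already contains some NA values is still processed instead of yielding an NA minimum.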
