missForest not working

packages
#1

Hi everyone,

I'm new to R and I have a question about the missForest package. I have a data.frame with NA values in some columns.
I use the missForest library to impute the values, but when I run a summary of the new dataset, it shows that there are still some NA values.

The code is the following:

new.credits<-missForest(credits.df)
summary(new.credits$ximp)

(credits.df is a data.frame)

The code never reports an error, but it is not working.

Output:

> new.credits<-missForest(credits.df)
  missForest iteration 1 in progress...done!
  missForest iteration 2 in progress...done!
> summary(new.credits$ximp)
 V1        V2
 ?: 12    ?      : 12 
 a:210    22.67  :  9 
 b:468    20.42  :  7 
          18.83  :  6
          19.17  :  6 
          20.67  :  6
          (Other): 644

When I run a summary of the new dataset, you can see 12 "?" values in V1 and 12 "?" values in V2.
I don't know what I'm doing wrong with the library.

Hope you can help me.
Thank you all!


#2

I can't reproduce your issue with built-in data. Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:


#3

Here is my reprex:

library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#> Loading required package: foreach
#> Loading required package: itertools
#> Loading required package: iterators
df<-data.frame(
          V3 = c(0, 4.46, 0.5, 1.54, 5.625),
          V8 = c(1.25, 3.04, 1.5, 3.75, 1.71),
         V11 = c(1L, 6L, 0L, 5L, 0L),
         V15 = c(0L, 560L, 824L, 3L, 0L),
      header = c(FALSE, FALSE, FALSE, FALSE, FALSE),
          V1 = as.factor(c("b", "a", "a", "b", "b")),
          V2 = as.factor(c("30.83", "58.67", "24.50", "27.83", "20.17")),
          V4 = as.factor(c("u", "u", "u", "u", "u")),
          V5 = as.factor(c("g", "g", "g", "g", "g")),
          V6 = as.factor(c("w", "q", "q", "w", "w")),
          V7 = as.factor(c("v", "h", "h", "v", "v")),
          V9 = as.factor(c("t", "t", "t", "t", "t")),
         V10 = as.factor(c("t", "t", "f", "t", "f")),
         V12 = as.factor(c("f", "f", "f", "t", "f")),
         V13 = as.factor(c("g", "g", "g", "g", "s")),
         V14 = as.factor(c("00202", "00043", "00280", "00100", "00120")),
         V16 = as.factor(c("+", "+", "+", "+", "+"))
)

new.credits<-missForest(df)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
summary(new.credits$ximp)
#>        V3              V8            V11           V15            header 
#>  Min.   :0.000   Min.   :1.25   Min.   :0.0   Min.   :  0.0   Min.   :0  
#>  1st Qu.:0.500   1st Qu.:1.50   1st Qu.:0.0   1st Qu.:  0.0   1st Qu.:0  
#>  Median :1.540   Median :1.71   Median :1.0   Median :  3.0   Median :0  
#>  Mean   :2.425   Mean   :2.25   Mean   :2.4   Mean   :277.4   Mean   :0  
#>  3rd Qu.:4.460   3rd Qu.:3.04   3rd Qu.:5.0   3rd Qu.:560.0   3rd Qu.:0  
#>  Max.   :5.625   Max.   :3.75   Max.   :6.0   Max.   :824.0   Max.   :0  
#>  V1        V2    V4    V5    V6    V7    V9    V10   V12   V13      V14   
#>  a:2   20.17:1   u:5   g:5   q:2   h:2   t:5   f:2   f:4   g:4   00043:1  
#>  b:3   24.50:1               w:3   v:3         t:3   t:1   s:1   00100:1  
#>        27.83:1                                                   00120:1  
#>        30.83:1                                                   00202:1  
#>        58.67:1                                                   00280:1  
#>                                                                           
#>  V16  
#>  +:5  
#>       
#>       
#>       
#>       
#> 

Created on 2019-03-19 by the reprex package (v0.2.1)


#4

How are missing values represented in your sample data? With the '+' symbol?


#5

Andres,

No, the '+' symbol is correct; it is an accepted value.
I now realize that the data.frame I used for the example doesn't contain any NA values. In attribute 'V1' the values are 'a', 'b' or NA. When I run the summary it appears like this:

V1
?: 12
a: 430
b: 278

I really don't know how it appears in the data.frame, but the summary tells me that there are 12 missing values. (I know this because the exercise says that there are 12 NA values.)
I don't know if I'm doing something wrong with the "missForest" library.

In a couple of hours I will try to generate a reprex with some NA values.

Regards.


#6

The NA symbol is "?". If you look at the data.frame (complete data frame):

The missForest library doesn't report an error, but when I run the summary of the result, it says there are 12 NA values (?: 12).
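
A quick way to tell a real NA from a literal "?" is to count both separately. The sketch below uses a made-up five-value factor shaped like V1 in this thread (the values are an assumption for illustration, not the real data). As far as I know, summary() reports true missing values in a row labelled NA's, so a row labelled "?" suggests an ordinary factor level.

```r
# Toy column shaped like V1 in this thread (values are made up for illustration)
v1 <- factor(c("b", "?", "a", "?", "b"))

# Real missing values are detected by is.na(); a literal "?" is not one of them
n_true_na <- sum(is.na(v1))

# Literal "?" entries are ordinary values and show up as a factor level
n_qmark <- sum(v1 == "?")
has_qmark_level <- "?" %in% levels(v1)

print(c(true_NA = n_true_na, question_marks = n_qmark))
print(levels(v1))
```

If the first count is zero and the second is not, the column has nothing for an imputation routine to fill in until the "?" entries are converted to NA.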


#7

You can replace the ? character with NA in your data frame with something like this:

library(missForest)
library(dplyr)

df<-data.frame(stringsAsFactors = FALSE,
               V1 = as.factor(c("b", "?", "a", "?", "b")),
               V3 = c(0, 4.46, 0.5, 1.54, 5.625),
               V2 = as.factor(c("30.83", "?", "24.50", "?", "20.17")),
               V4 = as.factor(c("u", "u", "u", "u", "u")),
               V5 = as.factor(c("g", "g", "g", "g", "g"))
)

new.credits<-missForest(df)
summary(new.credits$ximp)
#>  V1          V3            V2    V4    V5   
#>  ?:2   Min.   :0.000   ?    :2   u:5   g:5  
#>  a:1   1st Qu.:0.500   20.17:1              
#>  b:2   Median :1.540   24.50:1              
#>        Mean   :2.425   30.83:1              
#>        3rd Qu.:4.460                        
#>        Max.   :5.625

df <- df %>% 
    mutate_all(~replace(., .=='?', NA)) %>% 
    mutate_if(is.factor, ~factor(.))

new.credits<-missForest(df)
summary(new.credits$ximp)
#>  V1          V3            V2    V4    V5   
#>  a:3   Min.   :0.000   20.17:1   u:5   g:5  
#>  b:2   1st Qu.:0.500   24.50:3              
#>        Median :1.540   30.83:1              
#>        Mean   :2.425                        
#>        3rd Qu.:4.460                        
#>        Max.   :5.625
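
If you would rather not load dplyr, roughly the same replacement can be sketched in base R. This is only an illustration on a cut-down copy of the toy data frame above, and it assumes the placeholder only ever appears in factor columns.

```r
# Cut-down copy of the toy data frame used above
df <- data.frame(
  V1 = factor(c("b", "?", "a", "?", "b")),
  V3 = c(0, 4.46, 0.5, 1.54, 5.625),
  V2 = factor(c("30.83", "?", "24.50", "?", "20.17"))
)

# For every factor column: turn "?" into NA, then drop the now-unused "?" level.
# Non-factor columns are returned unchanged.
df[] <- lapply(df, function(col) {
  if (!is.factor(col)) return(col)
  col[which(col == "?")] <- NA
  droplevels(col)
})

print(colSums(is.na(df)))
print(levels(df$V1))
```

Dropping the unused level matters here: without it, "?" would remain a possible category even though no row carries it any more.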

But ideally you should do it while reading the data. If you are reading from a .csv file, you can do something like this:

df <- read.csv("your_file.csv", na.strings = c("?", "NA"))
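
To see what na.strings changes, here is a small self-contained sketch that reads the same CSV text twice. The inline text stands in for "your_file.csv" and its values are made up for illustration; stringsAsFactors = TRUE is set explicitly so the result looks like the factor columns in this thread.

```r
# Made-up CSV content standing in for "your_file.csv"
csv_text <- "V1,V2,V3
b,30.83,0
?,?,4.46
a,24.50,0.5
?,?,1.54
b,20.17,5.625"

# Without na.strings, "?" is read as ordinary text and V2 becomes a factor
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# With na.strings, "?" is read as NA and V2 can be parsed as numeric
clean <- read.csv(text = csv_text, na.strings = c("?", "NA"),
                  stringsAsFactors = TRUE)

print(colSums(is.na(raw)))
print(colSums(is.na(clean)))
print(sapply(clean, class))
```

A side effect worth noting: once "?" is treated as missing, a column like V2 is read as a number rather than as a factor with one level per distinct value, which is probably closer to what the exercise intends.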

#8

andresrcs, thank you so much!

You were right! I thought "?" was a representation of an NA value, but it is actually an ordinary character value like "a" or "b". After replacing the "?" character, the missForest library works fine!

Thanks!

