Imputing a variable with mean value

jayant · September 16, 2018, 8:51pm

Good afternoon colleagues,
I am trying to write a function that imputes a numeric variable in a data frame with the mean of that variable. Here is the function I wrote:

impute_mean <- function(data_new,column){
  print(mean(column,na.rm=TRUE))
  data_new$column[is.na(data_new$column)]<- mean(column,na.rm=TRUE)
  return(data_new)
}

Next, I execute this function on the following data frame:

library(tibble)
movie_new <- data.frame(aspect_ratio=c(1.2,2.3,3.4,5.6, NA))
test <- impute_mean(movie_new,movie_new$aspect_ratio)

The error message says the following:

Error in `$<-.data.frame`(`*tmp*`, column, value = numeric(0)) : replacement has [x] rows, data has [y]

I am aware that R can't find a field named column in the data frame data_new, which is why it complains. But that is exactly my goal: I want to define a generic function that takes an arbitrary column name and does the imputation. That is why I plan to keep the function's arguments generic: data_new and column.

Can I kindly get help here? Thanks in advance for the support.

pomchip · September 17, 2018, 12:40am

Hi,

Using base R, the second line of your function should be:

data_new[is.na(data_new[, column]), column]<- mean(data_new[, column], na.rm=TRUE)

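Note that this expects column to be the column's name as a string (e.g. "aspect_ratio"), not the vector movie_new$aspect_ratio. You should probably also enclose this line in an if statement to make sure that 'column' exists in your data.frame and that there are NAs in that column.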
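A minimal sketch of that check, assuming a plain data.frame (not a tibble) and a column name passed as a string. The function name impute_mean_checked is only for illustration:

```r
# Sketch only: `column` is assumed to be a column name given as a string,
# and `data_new` a plain data.frame (with a tibble, data_new[, column]
# returns a tibble rather than a vector, so this would not work as written).
impute_mean_checked <- function(data_new, column) {
  # Stop early if the requested column is not in the data frame
  if (!column %in% names(data_new)) {
    stop("Column '", column, "' not found in data_new")
  }

  # TRUE for every row where the column is missing
  missing_rows <- is.na(data_new[, column])

  # Only modify the data when there is something to impute
  if (any(missing_rows)) {
    data_new[missing_rows, column] <- mean(data_new[, column], na.rm = TRUE)
  }

  data_new
}

movie_new <- data.frame(aspect_ratio = c(1.2, 2.3, 3.4, 5.6, NA))
test <- impute_mean_checked(movie_new, "aspect_ratio")
test$aspect_ratio
#> [1] 1.200 2.300 3.400 5.600 3.125
```

Whether a missing column should be an error or just a warning is a judgment call; stop() is used here so that a typo in the column name does not pass silently.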

nwerth · September 17, 2018, 1:49pm

As a general rule, only use $ for accessing parts of an object during interactive use. It cannot take variable column names, and it will do partial matching if a direct match doesn't exist ^1.

The [[ notation is the best choice for non-interactively selecting table columns. Your second line of code would work if it used data_new[[column]] instead (and replaced the lonely column with data_new[[column]]).

^1 Like all things, there are exceptions, but the easiest solution is to just use [[.
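To make the difference concrete, here is a small sketch on a plain data.frame. The name impute_mean_col is made up for illustration, and the column name is assumed to be passed as a string:

```r
movie_new <- data.frame(aspect_ratio = c(1.2, 2.3, 3.4, 5.6, NA))
column <- "aspect_ratio"

# `$` looks for a column literally called "column"; there is none, so NULL
is.null(movie_new$column)
#> [1] TRUE

# `$` on a data.frame partially matches: "aspect" silently picks up aspect_ratio
identical(movie_new$aspect, movie_new$aspect_ratio)
#> [1] TRUE

# `[[` evaluates its argument, so the string stored in `column` is used
identical(movie_new[[column]], movie_new$aspect_ratio)
#> [1] TRUE

# The original function with `[[` swapped in for `$` (hypothetical name)
impute_mean_col <- function(data_new, column) {
  data_new[[column]][is.na(data_new[[column]])] <-
    mean(data_new[[column]], na.rm = TRUE)
  data_new
}

impute_mean_col(movie_new, column)$aspect_ratio
#> [1] 1.200 2.300 3.400 5.600 3.125
```

The call then becomes impute_mean_col(movie_new, "aspect_ratio") rather than passing movie_new$aspect_ratio.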

ttrodrigz · September 19, 2018, 3:23pm

Hi @jayant,

Here's something that might work for you. For a problem like this, my general thought process is to build something that works on a single vector and then apply it to the column(s) of interest in your dataframe. Let's consider your aspect ratio data. We'll build that vector and inspect what the mean is:

aspect_ratios <- c(1.2, 2.3, 3.4, 5.6, NA)

mean(aspect_ratios, na.rm = TRUE)
#> [1] 3.125

Let's now build a function that will work on this single numeric vector. For this function, we simply calculate the mean of the numeric vector, and replace every NA with the computed mean.

impute_mean <- function(x) {
    
    # if the vector supplied is not numeric...
    # return the original vector with a warning
    if (!is.numeric(x)) {
        warning("x is not numeric, returning original")
        return(x)
    }
    
    # this value will be used to replace NA's
    vector_mean <- mean(x, na.rm = TRUE)
    
    # read as: "replace x with the vector mean where x is NA"
    x[is.na(x)] <- vector_mean
    
    x
    
}

When we apply this function to the vector aspect_ratios, we get the following output (notice the NA has been replaced with the mean):

impute_mean(aspect_ratios)
#> [1] 1.200 2.300 3.400 5.600 3.125

To bring this together in the context of a dataframe, we can now use our impute_mean() function, say, with dplyr::mutate(). I'll build a dataframe with a few extra variables to highlight one more point.

library(dplyr)  # provides tibble(), mutate() and the %>% pipe

movie_new <- tibble(
    aspect_ratio = aspect_ratios,
    movie_mins = c(95, NA, 109, NA, 155),
    revenue = c(64, 15, 41, NA, 34),
    actor = c("Katie", "Tom", NA, "Sam", "Jake")
)

We can apply our function thusly:

movie_new %>%
    mutate(aspect_imputed = impute_mean(aspect_ratio))
#> # A tibble: 5 x 5
#>   aspect_ratio movie_mins revenue actor aspect_imputed
#>          <dbl>      <dbl>   <dbl> <chr>          <dbl>
#> 1          1.2         95      64 Katie           1.2 
#> 2          2.3         NA      15 Tom             2.3 
#> 3          3.4        109      41 <NA>            3.4 
#> 4          5.6         NA      NA Sam             5.6 
#> 5         NA          155      34 Jake            3.12

One more cool thing you can do utilizes the purrr::map() family of functions. Since your data frame is essentially a fancy list of vectors, you can iterate the impute_mean() function over all elements of this fancy list (your columns).

library(purrr)  # provides map_df()

map_df(movie_new, impute_mean)
#> Warning in .f(.x[[i]], ...): x is not numeric, returning original
#> # A tibble: 5 x 4
#>   aspect_ratio movie_mins revenue actor
#>          <dbl>      <dbl>   <dbl> <chr>
#> 1         1.2         95     64   Katie
#> 2         2.3        120.    15   Tom  
#> 3         3.4        109     41   <NA> 
#> 4         5.6        120.    38.5 Sam  
#> 5         3.12       155     34   Jake

Notice how all of the numeric variables have been modified, with every NA replaced by its column's mean. Also notice how our function printed a warning letting us know that we had a character variable (actor), which is returned unchanged (its NA is still there).
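If you would rather not trigger the warning at all, one option is to pick out the numeric columns first and only pass those to the function. This is just a sketch in base R, using a plain data.frame and a compact stand-in for impute_mean() so it runs on its own; the names movies and is_num are made up for the example:

```r
# Compact stand-in for the impute_mean() defined earlier, repeated here
# so the example is self-contained (the warning is left out for brevity)
impute_mean <- function(x) {
  if (!is.numeric(x)) return(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

movies <- data.frame(
  movie_mins = c(95, NA, 109, NA, 155),
  revenue = c(64, 15, 41, NA, 34),
  actor = c("Katie", "Tom", NA, "Sam", "Jake"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
is_num <- vapply(movies, is.numeric, logical(1))

# Replace only those columns; actor is never passed to impute_mean()
movies[is_num] <- lapply(movies[is_num], impute_mean)

movies$revenue
#> [1] 64.0 15.0 41.0 38.5 34.0
```

The actor column keeps its NA, since a mean is not defined for character data.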

This response was likely overkill, but I hope it was informative and was able to help with some concepts which you can apply to different scenarios.

Best,
Tony

jayant · September 20, 2018, 7:19pm

Thanks for the responses, peers. It is great to see more than one way to do things.