Imputing a variable with mean value



Good afternoon colleagues,
I am trying to write a function that imputes the missing values of a numeric variable in a data frame with the mean of that variable. Here is the function I wrote:

impute_mean <- function(data_new,column){
  data_new$column[is.na(data_new$column)]<- mean(column,na.rm=TRUE)
  return(data_new)
}

Next, I run this function on the following data frame:

movie_new <- data.frame(aspect_ratio=c(1.2,2.3,3.4,5.6, NA))
test <- impute_mean(movie_new,movie_new$aspect_ratio)

The error message is:

Error in `$<-.data.frame`(`*tmp*`, column, value = numeric(0)) : replacement has [x] rows, data has [y]

I am aware that it complains because it can't find a field named column in the data frame data_new. But that is exactly my goal: I want to define a generic function that can take an arbitrary column name and do the imputation. That is why I want to keep the function's arguments, data_new and column, generic.

Can I kindly get help here? Thanks in advance for the support.



Using base R, the second line of your function should be:

data_new[is.na(data_new[, column]), column] <- mean(data_new[, column], na.rm=TRUE)

You should probably wrap this line in an if statement to make sure that 'column' exists in your data.frame and that it actually contains NAs.
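
A minimal sketch of that guarded version, assuming the column is passed as a character string. The name impute_mean_col and the choice to stop() when the column is missing are my own assumptions, not something from the original function:

```r
# Hypothetical helper: impute the NAs of one column, named by a string.
impute_mean_col <- function(data_new, column) {
  # Guard 1: the column must exist in the data frame.
  if (!column %in% names(data_new)) {
    stop("column '", column, "' not found in data")
  }
  # Guard 2: only assign if there is something to impute.
  missing_rows <- is.na(data_new[[column]])
  if (any(missing_rows)) {
    data_new[missing_rows, column] <- mean(data_new[[column]], na.rm = TRUE)
  }
  data_new
}

movie_new <- data.frame(aspect_ratio = c(1.2, 2.3, 3.4, 5.6, NA))
test <- impute_mean_col(movie_new, "aspect_ratio")
test$aspect_ratio
```

Note that the call passes "aspect_ratio" as a quoted name rather than movie_new$aspect_ratio.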


As a general rule, only use $ for accessing parts of an object during interactive use. It cannot take variable column names, and it will do partial matching if a direct match doesn't exist ^1.

The [[ notation is the best choice for selecting table columns non-interactively. Your second line of code would work if it used data_new[[column]] in place of data_new$column (and replaced the lone column inside mean() with data_new[[column]]), provided you pass the column name as a string, e.g. "aspect_ratio".

^1 Like all things, there are exceptions, but the easiest solution is to just use [[.
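
To make the difference concrete, here is a small sketch on a made-up data frame (df and column are just illustrative names):

```r
df <- data.frame(aspect_ratio = c(1.2, 2.3, NA))
column <- "aspect_ratio"

# `$` looks for a column literally named "column", so this is NULL.
df$column

# `$` partially matches on a data.frame: "asp" silently finds aspect_ratio.
df$asp

# `[[` evaluates the variable and matches names exactly by default.
df[[column]]
df[["asp"]]  # NULL, no partial matching
```

The NULL from df$column is what the original function ended up indexing into, which is where the zero-length replacement in the error came from.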


Hi @jayant,

Here's something that might work for you. For a problem like this, my general approach is to build something that works on a single vector and then apply it to the column(s) of interest in your data frame. Let's take your aspect ratio data: we'll build that vector and check what its mean is:

aspect_ratios <- c(1.2, 2.3, 3.4, 5.6, NA)

mean(aspect_ratios, na.rm = TRUE)
#> [1] 3.125

Let's now build a function that will work on this single numeric vector. For this function, we simply calculate the mean of the numeric vector, and replace every NA with the computed mean.

impute_mean <- function(x) {
    # if the vector supplied is not numeric...
    # return the original vector with a warning
    if (!is.numeric(x)) {
        warning("x is not numeric, returning original")
        return(x)
    }
    # this value will be used to replace NA's
    vector_mean <- mean(x, na.rm = TRUE)
    # read as: "replace x with the vector mean where x is NA"
    x[is.na(x)] <- vector_mean
    # return the imputed vector
    x
}

When we apply this function to the vector aspect_ratios, we get the following output (notice the NA has been replaced with the mean):

impute_mean(aspect_ratios)
#> [1] 1.200 2.300 3.400 5.600 3.125

To bring this together in the context of a data frame, we can now use our impute_mean() function with, say, dplyr::mutate(). I'll build a data frame with a few extra variables to highlight one more point.

library(dplyr)  # provides tibble(), mutate() and %>%

movie_new <- tibble(
    aspect_ratio = aspect_ratios,
    movie_mins = c(95, NA, 109, NA, 155),
    revenue = c(64, 15, 41, NA, 34),
    actor = c("Katie", "Tom", NA, "Sam", "Jake")
)

We can apply our function like this:

movie_new %>%
    mutate(aspect_imputed = impute_mean(aspect_ratio))
#> # A tibble: 5 x 5
#>   aspect_ratio movie_mins revenue actor aspect_imputed
#>          <dbl>      <dbl>   <dbl> <chr>          <dbl>
#> 1          1.2         95      64 Katie           1.2 
#> 2          2.3         NA      15 Tom             2.3 
#> 3          3.4        109      41 <NA>            3.4 
#> 4          5.6         NA      NA Sam             5.6 
#> 5         NA          155      34 Jake            3.12

One more cool thing you can do uses the purrr::map() family of functions. Since your data frame is essentially a fancy list of vectors, you can iterate the impute_mean() function over all elements of this fancy list (your columns).

library(purrr)
map_df(movie_new, impute_mean)
#> Warning in .f(.x[[i]], ...): x is not numeric, returning original
#> # A tibble: 5 x 4
#>   aspect_ratio movie_mins revenue actor
#>          <dbl>      <dbl>   <dbl> <chr>
#> 1         1.2         95     64   Katie
#> 2         2.3        120.    15   Tom  
#> 3         3.4        109     41   <NA> 
#> 4         5.6        120.    38.5 Sam  
#> 5         3.12       155     34   Jake

Notice how every numeric variable has been modified, with each NA replaced by its column's mean. Also notice how our function raised a warning to let us know that we had a character variable (actor), which was returned unchanged.
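
If you would rather not depend on purrr, the same "data frame as a list of columns" idea can be sketched in base R. This is only one possible variation, not the approach above: it uses a trimmed copy of impute_mean() without the numeric check, a made-up movies data frame, and it selects the numeric columns up front so that no warning is raised for actor:

```r
# Trimmed copy of impute_mean(): assumes x is numeric.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

movies <- data.frame(
  movie_mins = c(95, NA, 109, NA, 155),
  revenue    = c(64, 15, 41, NA, 34),
  actor      = c("Katie", "Tom", NA, "Sam", "Jake")
)

# Find the numeric columns, then impute only those.
is_num <- vapply(movies, is.numeric, logical(1))
movies[is_num] <- lapply(movies[is_num], impute_mean)
movies
```

The character column is never passed to impute_mean(), so its NA stays in place.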

This response was likely overkill, but I hope it was informative and was able to help with some concepts which you can apply to different scenarios.



Thanks for the responses, peers. It is great to see more than one way to do things.