Hi @jayant,
Here's something that might work for you. For a problem like this, my general thought process is to build something that works on a single vector and then apply it to the column(s) of interest in your dataframe. Let's start with your aspect ratio data. We'll build that vector and check its mean:
aspect_ratios <- c(1.2, 2.3, 3.4, 5.6, NA)
mean(aspect_ratios, na.rm = TRUE)
#> [1] 3.125
Let's now build a function that will work on this single numeric vector. For this function, we simply calculate the mean of the numeric vector and replace every NA with the computed mean.
impute_mean <- function(x) {
  # if the vector supplied is not numeric...
  # return the original vector with a warning
  if (!is.numeric(x)) {
    warning("x is not numeric, returning original")
    return(x)
  }
  # this value will be used to replace NA's
  vector_mean <- mean(x, na.rm = TRUE)
  # read as: "replace x with the vector mean where x is NA"
  x[is.na(x)] <- vector_mean
  x
}
When we apply this function to the vector aspect_ratios, we get the following output (notice the NA has been replaced with the mean):
impute_mean(aspect_ratios)
#> [1] 1.200 2.300 3.400 5.600 3.125
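Two edge cases are worth keeping in mind with a helper like this. The snippet below is only a small sketch (the input vectors are made up for illustration, and impute_mean() is repeated so the block runs on its own): if every value is NA there is nothing to average, so mean(x, na.rm = TRUE) returns NaN and the NAs become NaN; and an integer vector comes back as a double vector, because the mean is generally not a whole number.

```r
# Same helper as above, repeated so this block is self-contained
impute_mean <- function(x) {
  if (!is.numeric(x)) {
    warning("x is not numeric, returning original")
    return(x)
  }
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical inputs, chosen only to show the edge cases
all_missing <- impute_mean(c(NA_real_, NA_real_))
all_missing
#> [1] NaN NaN

from_integer <- impute_mean(c(1L, NA, 4L))
from_integer
#> [1] 1.0 2.5 4.0
typeof(from_integer)
#> [1] "double"
```

If the all-NA case can occur in your data, you may want to check for it before imputing.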
To bring this together in the context of a dataframe, we can now use our impute_mean() function with, say, dplyr::mutate(). I'll build a dataframe with a few extra variables to highlight one more point.
# dplyr provides mutate() and %>%, and re-exports tibble()
library(dplyr)
movie_new <- tibble(
  aspect_ratio = aspect_ratios,
  movie_mins = c(95, NA, 109, NA, 155),
  revenue = c(64, 15, 41, NA, 34),
  actor = c("Katie", "Tom", NA, "Sam", "Jake")
)
We can apply our function thusly:
movie_new %>%
  mutate(aspect_imputed = impute_mean(aspect_ratio))
#> # A tibble: 5 x 5
#>   aspect_ratio movie_mins revenue actor aspect_imputed
#>          <dbl>      <dbl>   <dbl> <chr>          <dbl>
#> 1          1.2         95      64 Katie           1.2
#> 2          2.3         NA      15 Tom             2.3
#> 3          3.4        109      41 <NA>            3.4
#> 4          5.6         NA      NA Sam             5.6
#> 5         NA          155      34 Jake            3.12
One more cool thing you can do uses the purrr::map() family of functions. Since your data frame is essentially a fancy list of vectors, you can iterate the impute_mean() function over every element of that list (your columns).
library(purrr)
map_df(movie_new, impute_mean)
#> Warning in .f(.x[[i]], ...): x is not numeric, returning original
#> # A tibble: 5 x 4
#>   aspect_ratio movie_mins revenue actor
#>          <dbl>      <dbl>   <dbl> <chr>
#> 1         1.2         95     64   Katie
#> 2         2.3        120.    15   Tom
#> 3         3.4        109     41   <NA>
#> 4         5.6        120.    38.5 Sam
#> 5         3.12       155     34   Jake
Notice how all of the numeric variables have been modified, with every NA replaced by its column's mean. Also notice that our function issued a warning letting us know that we had a character variable (actor), which was returned unchanged.
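If you'd rather not trigger that warning at all, one possible alternative (a sketch, assuming dplyr 1.0.0 or later, where across() and where() are available) is to hand only the numeric columns to impute_mean(). The name movie_imputed is just something I picked for the result; the helper and data are repeated so the block runs on its own.

```r
library(dplyr)

# Same helper as above, repeated so this block is self-contained
impute_mean <- function(x) {
  if (!is.numeric(x)) {
    warning("x is not numeric, returning original")
    return(x)
  }
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

movie_new <- tibble(
  aspect_ratio = c(1.2, 2.3, 3.4, 5.6, NA),
  movie_mins = c(95, NA, 109, NA, 155),
  revenue = c(64, 15, 41, NA, 34),
  actor = c("Katie", "Tom", NA, "Sam", "Jake")
)

# where(is.numeric) selects only the numeric columns, so the character
# column (actor) never reaches impute_mean() and no warning is raised
movie_imputed <- movie_new %>%
  mutate(across(where(is.numeric), impute_mean))
```

The numeric columns should come out the same as in the map_df() result, with actor left untouched, NA included.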
This response was likely overkill, but I hope it was informative and was able to help with some concepts which you can apply to different scenarios.
Best,
Tony