weighted mean with missing values

flik · June 13, 2019, 10:44am

Dear all,
I'm writing you because I am stuck on this point.
I have a dataframe like this:

Rif year Var1 Var2
Name1 2015 11 15
Name1 2016 15 17
Name1 2017 10 12
Name1 2018 20 5
Name2 2015 26 10
Name2 2016 13 20
Name2 2017 5 7
Name3 2015 16 8
Name4 2015 19 11
Name4 2017 15 17
Name5 2017 21 9
Name5 2018 30 25

So, I am studying a period from 2015 to 2018, but there are missing values on some couples (Rif_year).

I would like to calcute the weighted mean for Var1 and Var2 with the following weights:
2018 = 0.5 -- 2017 = 0.3 -- 2016 = 0.15 -- 2015 = 0.05

For some Rif (Name2, Name3, Name4, Name5), Var1 e Var2 are missing on some years.
I would like to get all years for each Rif. Where I have the missing value, I would like to have row/rows containing Rif, the missing year/years, mode of Var1 and mode of Var2, where modes are computed on that specific Rif.

Hope to have been clear enough.
Thank you for your support!

Yarnabrina · June 13, 2019, 12:00pm

Are you looking for something like this?

dataset <- read.table(text = 'Rif year Var1 Var2
Name1 2015 11 15
Name1 2016 15 17
Name1 2017 10 12
Name1 2018 20 5
Name2 2015 26 10
Name2 2016 13 20
Name2 2017 5 7
Name3 2015 16 8
Name4 2015 19 11
Name4 2017 15 17
Name5 2017 21 9
Name5 2018 30 25',
                      header = TRUE,
                      stringsAsFactors = FALSE)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset %>%
  group_by(Rif) %>%
  summarise_at(.vars = vars(starts_with(match = "Var")),
               .funs = ~ ifelse(test = length(.) == 4,
                                yes = weighted.mean(x = .,
                                                    w = c(0.05, 0.15, 0.3, 0.5)),
                                no = max(., na.rm = TRUE)))
#> # A tibble: 5 x 3
#>   Rif    Var1  Var2
#>   <chr> <dbl> <dbl>
#> 1 Name1  15.8  9.40
#> 2 Name2  26   20   
#> 3 Name3  16    8   
#> 4 Name4  19   17   
#> 5 Name5  30   25

^{Created on 2019-06-13 by the reprex package (v0.3.0)}

flik · June 13, 2019, 12:20pm

Thank you Yarnabrina!
Could you help me to understand this:

dataset %>%
group_by(Rif) %>%
summarise_at(.vars = vars(starts_with(match = "Var")),
.funs = ~ ifelse(test = length(.) == 4,
yes = weighted.mean(x = .,
w = c(0.05, 0.15, 0.3, 0.5)),
no = max(., na.rm = TRUE)))

Thanks

Yarnabrina · June 13, 2019, 12:32pm

Do you want me to explain my code? I'll try, but I'm not good at it.

From what I understand from the question, you need to summarise observations corresponding to each Rif. Hence the 1st line: group_by(Rif).
Next, the relecant records are only in the columns Var1 and Var2. So, I'll summarise observations only in those two columns. To summarise at columns of my preferences (to be specified), I've used summarise_at. Now, how shall I tell R to locate the columns I want? I'll tell it to choose those columns that starts with "Var". Hence the part: vars(starts_with(match = "Var")).
Now comes the main summarisation part. If observations on all four years are available, you'll use mean, otherwise the "mode". So, I first check whether there are 4 or less observations corresponding to each Rif group in this part: test = length(.) == 4. If there are, I'll summarise using weighted mean. Otherwise, I'll take the maximum of the observations using this part no = max(., na.rm = TRUE).

I think it's necessary to mention that for this particular problem, starts_with is unnecessary. You can simply specify the column names yourself. Also, though ifelse leads to correct (provided I understand your question correctly) results, it actually calculates weighted means and modes for both groups (both yes and no vectors are calculated), and chooses a value based on the test vector.

Does this help?

Note

I realised a mistake of my code, only after checking @jlacko's code below. I reported only the maximum value, which is not the definition of mode. But, I used this since among 4 years, it is unlikely that two or more records will be exactly same. If you want to use mode, either use @jlacko's solution, or replace this part no = max(., na.rm = TRUE) with no = sort(.)[which.max(x = table(.))].

jlacko · June 13, 2019, 12:37pm

Also consider this approach:

It works in steps:

creates a data frame of names & modes per name
expands the data for all combination of names & years (with missing values as NA)
replaces the NAs with modes from previously calculated data frame
joins a data frame of weights per year
finally summarise using weighted.mean()

library(tidyverse)

modus <- function(x) { # base R has no modus function :(
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

raw_data <- tibble(name = c("cats", "dogs", "dogs"),
                   year = c(2015, 2015, 2016),
                   value = c(1, 2, 3))

weights <- tibble(year = c(2015, 2016),
                  weight = c(.3, .7))

modes <- raw_data %>% 
  group_by(name) %>% 
  summarise(modus = modus(value))

new_data <- expand(raw_data, name, year) %>% # all combinations
  left_join(raw_data) %>% # cats for year 2016 is missing = NA
  left_join(modes) %>%  # modes per name
  mutate(value = ifelse(is.na(value), modus, value)) %>% # replace missing values by modes
  select(-modus) %>% # the modus field is no longer required
  inner_join(weights) %>% 
  group_by(year) %>% 
  summarise(wmean = weighted.mean(value, weight))

jlacko · June 13, 2019, 1:32pm

Unusual mode indeed

But I agree with your observation that there is unlikely to be a modal value in less than four observations (as four observations is the maximum possible, and there would be no NAs to replace).

Mean might be a better option still, if entirely dropping NAs is not desirable.

system · July 4, 2019, 1:32pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.