How to assign value based on multiple conditions efficiently?

I'm looking for a way to assign weights to survey observations. I have a dataframe with survey data, which looks like this:

set.seed(123)
survey <- data.frame(id=1:30, 
country = sample(letters[1:2], 30, replace = T), 
age = sample(c("young", "older", "old"), 30, replace = T), 
sex = sample(c("Male", "Female"), 30, replace = T))

I also have a census dataset, which looks like this:

census <- data.frame(country = rep(letters[1:2], each = 6),
age = rep(rep(c("young", "older", "old"), each = 2), 2), 
sex = rep(c("Male", "Female"), 6), 
rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6)))

Now, I'd like to calculate the hypothetical frequencies (how many I should have sampled) by multiplying the relative frequencies of the census times the number of observations for each country. I honestly don't know how to start.

Update:
OK, I do know how to start but I'm not sure if I'm on the right track.

This will give me the number of observations in each country:

survey %>%
    group_by(country) %>%
    nest() %>%
    mutate(country_pop=map_dbl(data, nrow)) %>%
    unnest(data)

Update2:

I realize it's not clear what I'm looking for. The final dataset should look something like this:

  id country   age  sex hyp_freq
1  1       a young Male    2.125 # 0.125*17 (0.125 from census data, 17 from number of observations of country a)
2  2       a older Male    2.125 # 0.125*17
3  3       a   old Male    4.250 # 0.25*17
...

So, I want to look up my survey respondent information (age and sex) in the other table to obtain the relative frequency for that age and sex group. Then I want to multiply that frequency by the number of observations per country and save it to my survey dataframe.

I'm not sure exactly what this requires based on the description, but the object below should get you started

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

left_join(survey,census, by = "country")
#>     country age.x  sex.x age.y  sex.y rel_freq
#> 1         a young   Male young   Male    0.125
#> 2         a young   Male young Female    0.125
#> 3         a young   Male older   Male    0.125
#> 4         a young   Male older Female    0.125
#> 5         a young   Male   old   Male    0.250
#> 6         a young   Male   old Female    0.250
#> 7         a older   Male young   Male    0.125
#> 8         a older   Male young Female    0.125
#> 9         a older   Male older   Male    0.125
#> 10        a older   Male older Female    0.125
#> 11        a older   Male   old   Male    0.250
#> 12        a older   Male   old Female    0.250
#> 13        a   old   Male young   Male    0.125
#> 14        a   old   Male young Female    0.125
#> 15        a   old   Male older   Male    0.125
#> 16        a   old   Male older Female    0.125
#> 17        a   old   Male   old   Male    0.250
#> 18        a   old   Male   old Female    0.250
#> 19        b older Female young   Male    0.167
#> 20        b older Female young Female    0.167
#> 21        b older Female older   Male    0.167
#> 22        b older Female older Female    0.167
#> 23        b older Female   old   Male    0.167
#> 24        b older Female   old Female    0.167
#> 25        a young   Male young   Male    0.125
#> 26        a young   Male young Female    0.125
#> 27        a young   Male older   Male    0.125
#> 28        a young   Male older Female    0.125
#> 29        a young   Male   old   Male    0.250
#> 30        a young   Male   old Female    0.250
#> 31        b   old   Male young   Male    0.167
#> 32        b   old   Male young Female    0.167
#> 33        b   old   Male older   Male    0.167
#> 34        b   old   Male older Female    0.167
#> 35        b   old   Male   old   Male    0.167
#> 36        b   old   Male   old Female    0.167
#> 37        b   old   Male young   Male    0.167
#> 38        b   old   Male young Female    0.167
#> 39        b   old   Male older   Male    0.167
#> 40        b   old   Male older Female    0.167
#> 41        b   old   Male   old   Male    0.167
#> 42        b   old   Male   old Female    0.167
#> 43        b young   Male young   Male    0.167
#> 44        b young   Male young Female    0.167
#> 45        b young   Male older   Male    0.167
#> 46        b young   Male older Female    0.167
#> 47        b young   Male   old   Male    0.167
#> 48        b young   Male   old Female    0.167
#> 49        a   old Female young   Male    0.125
#> 50        a   old Female young Female    0.125
#> 51        a   old Female older   Male    0.125
#> 52        a   old Female older Female    0.125
#> 53        a   old Female   old   Male    0.250
#> 54        a   old Female   old Female    0.250
#> 55        a older Female young   Male    0.125
#> 56        a older Female young Female    0.125
#> 57        a older Female older   Male    0.125
#> 58        a older Female older Female    0.125
#> 59        a older Female   old   Male    0.250
#> 60        a older Female   old Female    0.250
#> 61        b young   Male young   Male    0.167
#> 62        b young   Male young Female    0.167
#> 63        b young   Male older   Male    0.167
#> 64        b young   Male older Female    0.167
#> 65        b young   Male   old   Male    0.167
#> 66        b young   Male   old Female    0.167
#> 67        b   old Female young   Male    0.167
#> 68        b   old Female young Female    0.167
#> 69        b   old Female older   Male    0.167
#> 70        b   old Female older Female    0.167
#> 71        b   old Female   old   Male    0.167
#> 72        b   old Female   old Female    0.167
#> 73        b young Female young   Male    0.167
#> 74        b young Female young Female    0.167
#> 75        b young Female older   Male    0.167
#> 76        b young Female older Female    0.167
#> 77        b young Female   old   Male    0.167
#> 78        b young Female   old Female    0.167
#> 79        a young Female young   Male    0.125
#> 80        a young Female young Female    0.125
#> 81        a young Female older   Male    0.125
#> 82        a young Female older Female    0.125
#> 83        a young Female   old   Male    0.250
#> 84        a young Female   old Female    0.250
#> 85        b older Female young   Male    0.167
#> 86        b older Female young Female    0.167
#> 87        b older Female older   Male    0.167
#> 88        b older Female older Female    0.167
#> 89        b older Female   old   Male    0.167
#> 90        b older Female   old Female    0.167
#> 91        a   old   Male young   Male    0.125
#> 92        a   old   Male young Female    0.125
#> 93        a   old   Male older   Male    0.125
#> 94        a   old   Male older Female    0.125
#> 95        a   old   Male   old   Male    0.250
#> 96        a   old   Male   old Female    0.250
#> 97        b   old Female young   Male    0.167
#> 98        b   old Female young Female    0.167
#> 99        b   old Female older   Male    0.167
#> 100       b   old Female older Female    0.167
#> 101       b   old Female   old   Male    0.167
#> 102       b   old Female   old Female    0.167
#> 103       a young Female young   Male    0.125
#> 104       a young Female young Female    0.125
#> 105       a young Female older   Male    0.125
#> 106       a young Female older Female    0.125
#> 107       a young Female   old   Male    0.250
#> 108       a young Female   old Female    0.250
#> 109       a   old Female young   Male    0.125
#> 110       a   old Female young Female    0.125
#> 111       a   old Female older   Male    0.125
#> 112       a   old Female older Female    0.125
#> 113       a   old Female   old   Male    0.250
#> 114       a   old Female   old Female    0.250
#> 115       a young   Male young   Male    0.125
#> 116       a young   Male young Female    0.125
#> 117       a young   Male older   Male    0.125
#> 118       a young   Male older Female    0.125
#> 119       a young   Male   old   Male    0.250
#> 120       a young   Male   old Female    0.250
#> 121       a   old   Male young   Male    0.125
#> 122       a   old   Male young Female    0.125
#> 123       a   old   Male older   Male    0.125
#> 124       a   old   Male older Female    0.125
#> 125       a   old   Male   old   Male    0.250
#> 126       a   old   Male   old Female    0.250
#> 127       b older Female young   Male    0.167
#> 128       b older Female young Female    0.167
#> 129       b older Female older   Male    0.167
#> 130       b older Female older Female    0.167
#> 131       b older Female   old   Male    0.167
#> 132       b older Female   old Female    0.167
#> 133       a young   Male young   Male    0.125
#> 134       a young   Male young Female    0.125
#> 135       a young   Male older   Male    0.125
#> 136       a young   Male older Female    0.125
#> 137       a young   Male   old   Male    0.250
#> 138       a young   Male   old Female    0.250
#> 139       a older Female young   Male    0.125
#> 140       a older Female young Female    0.125
#> 141       a older Female older   Male    0.125
#> 142       a older Female older Female    0.125
#> 143       a older Female   old   Male    0.250
#> 144       a older Female   old Female    0.250
#> 145       a young Female young   Male    0.125
#> 146       a young Female young Female    0.125
#> 147       a young Female older   Male    0.125
#> 148       a young Female older Female    0.125
#> 149       a young Female   old   Male    0.250
#> 150       a young Female   old Female    0.250
#> 151       a young   Male young   Male    0.125
#> 152       a young   Male young Female    0.125
#> 153       a young   Male older   Male    0.125
#> 154       a young   Male older Female    0.125
#> 155       a young   Male   old   Male    0.250
#> 156       a young   Male   old Female    0.250
#> 157       b   old Female young   Male    0.167
#> 158       b   old Female young Female    0.167
#> 159       b   old Female older   Male    0.167
#> 160       b   old Female older Female    0.167
#> 161       b   old Female   old   Male    0.167
#> 162       b   old Female   old Female    0.167
#> 163       b young Female young   Male    0.167
#> 164       b young Female young Female    0.167
#> 165       b young Female older   Male    0.167
#> 166       b young Female older Female    0.167
#> 167       b young Female   old   Male    0.167
#> 168       b young Female   old Female    0.167
#> 169       a older   Male young   Male    0.125
#> 170       a older   Male young Female    0.125
#> 171       a older   Male older   Male    0.125
#> 172       a older   Male older Female    0.125
#> 173       a older   Male   old   Male    0.250
#> 174       a older   Male   old Female    0.250
#> 175       b young   Male young   Male    0.167
#> 176       b young   Male young Female    0.167
#> 177       b young   Male older   Male    0.167
#> 178       b young   Male older Female    0.167
#> 179       b young   Male   old   Male    0.167
#> 180       b young   Male   old Female    0.167

Created on 2020-11-16 by the reprex package (v0.3.0.9001)

Thanks! Unfortunately, this isn't quite what I was looking for. I updated my post to make it clearer. Is it possible to adjust your code to do what I'm looking for?

I hand checked a couple of these, but it's past my nap time and no assurances can be given

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

left_join(survey,census, by = "country") %>%
  select(-sex.y,-age.y) %>% 
  rename(sex = sex.x, age = age.x) %>%
  group_by(country,sex,age, rel_freq) %>% 
  count() %>%
  group_by(country,sex,age) %>%
  summarise(freq = rel_freq * n) %>%
  group_by(country,sex,age) %>%
  summarise(freq = sum(freq))
#> `summarise()` regrouping output by 'country', 'sex', 'age' (override with `.groups` argument)
#> `summarise()` regrouping output by 'country', 'sex' (override with `.groups` argument)
#> # A tibble: 11 x 4
#> # Groups:   country, sex [4]
#>    country sex    age    freq
#>    <chr>   <chr>  <chr> <dbl>
#>  1 a       Female old    2   
#>  2 a       Female older  2   
#>  3 a       Female young  3   
#>  4 a       Male   old    3   
#>  5 a       Male   older  2   
#>  6 a       Male   young  5   
#>  7 b       Female old    3.01
#>  8 b       Female older  3.01
#>  9 b       Female young  2.00
#> 10 b       Male   old    2.00
#> 11 b       Male   young  3.01

Created on 2020-11-16 by the reprex package (v0.3.0.9001)

This seems to work, thanks! Could you perhaps add a few comments to explain the logic?

Oh, good. I was nodding off.

My approach to R is functional: f(x) = y, where the three objects (in R everything is an object) where

x is what is at hand
y is what is desired
f transforms x to y

Objects can be composite—think of data frames. I could have rolled this all into a single function, but the logic would be harder to follow.

So here's the thought process:

  1. Create x by combining the survey and census data into a single data frame, called tab, eliminating the duplicate names and renaming the surviving names.
  2. Create f by %>% piping tab through a series of functions to create y
  3. First, consolidate each combination of country, sex and age and get a count (n)
  4. Regroup the same way with the new column n and multiply by each rel_freq 1.25 * 8, etc.
  5. Regroup again the same way and add the results from step 4

If I were going to be doing this with a number of identically organized but differently named survey and census arguments, I'd do a little hard-wired function like this:

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

get_freq <- function(x, y) {
  left_join(x,y, by = "country") %>%
    select(-sex.y,-age.y) %>% 
    rename(sex = sex.x, age = age.x) %>%
    group_by(country,sex,age, rel_freq) %>% 
    count() %>%
    group_by(country,sex,age) %>%
    summarise(freq = rel_freq * n) %>%
    group_by(country,sex,age) %>%
    summarise(freq = sum(freq))
}

get_freq(survey,census)
#> `summarise()` regrouping output by 'country', 'sex', 'age' (override with `.groups` argument)
#> `summarise()` regrouping output by 'country', 'sex' (override with `.groups` argument)
#> # A tibble: 11 x 4
#> # Groups:   country, sex [4]
#>    country sex    age    freq
#>    <chr>   <chr>  <chr> <dbl>
#>  1 a       Female old    2   
#>  2 a       Female older  2   
#>  3 a       Female young  3   
#>  4 a       Male   old    3   
#>  5 a       Male   older  2   
#>  6 a       Male   young  5   
#>  7 b       Female old    3.01
#>  8 b       Female older  3.01
#>  9 b       Female young  2.00
#> 10 b       Male   old    2.00
#> 11 b       Male   young  3.01

Created on 2020-11-16 by the reprex package (v0.3.0.9001)

Awesome. Thanks for taking the time to explain this, I just learned something :slight_smile:

How would I reassign the freq values to the original survey data?

Again, f(x) = y. This time x is the return value of

x  <-  get_freq(survey,census)

which has freq as a column, along with country, age and sex, which is everything in survey. The only difference is that x has aggregated survey, so it has fewer rows. For a row in survey, what is it that freq is supposed to signify?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.