How to assign value based on multiple conditions efficiently?

TeeTrea · November 16, 2020, 6:59pm

I'm looking for a way to assign weights to survey observations. I have a dataframe with survey data, which looks like this:

set.seed(123)
survey <- data.frame(id=1:30, 
country = sample(letters[1:2], 30, replace = T), 
age = sample(c("young", "older", "old"), 30, replace = T), 
sex = sample(c("Male", "Female"), 30, replace = T))

I also have a census dataset, which looks like this:

census <- data.frame(country = rep(letters[1:2], each = 6),
age = rep(rep(c("young", "older", "old"), each = 2), 2), 
sex = rep(c("Male", "Female"), 6), 
rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6)))

Now, I'd like to calculate the hypothetical frequencies (how many I should have sampled) by multiplying the relative frequencies of the census times the number of observations for each country. I honestly don't know how to start.

Update:
OK, I do know how to start but I'm not sure if I'm on the right track.

This will give me the number of observations in each country:

survey %>%
    group_by(country) %>%
    nest() %>%
    mutate(country_pop=map_dbl(data, nrow)) %>%
    unnest(data)

Update2:

I realize it's not clear what I'm looking for. The final dataset should look something like this:

  id country   age  sex hyp_freq
1  1       a young Male    2.125 # 0.125*17 (0.125 from census data, 17 from number of observations of country a)
2  2       a older Male    2.125 # 0.125*17
3  3       a   old Male    4.250 # 0.25*17
...

So, I want to look up my survey respondent information (age and sex) in the other table to obtain the relative frequency for that age and sex group. Then I want to multiply that frequency by the number of observations per country and save it to my survey dataframe.

technocrat · November 16, 2020, 9:20pm

I'm not sure exactly what this requires based on the description, but the object below should get you started

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

left_join(survey,census, by = "country")
#>     country age.x  sex.x age.y  sex.y rel_freq
#> 1         a young   Male young   Male    0.125
#> 2         a young   Male young Female    0.125
#> 3         a young   Male older   Male    0.125
#> 4         a young   Male older Female    0.125
#> 5         a young   Male   old   Male    0.250
#> 6         a young   Male   old Female    0.250
#> 7         a older   Male young   Male    0.125
#> 8         a older   Male young Female    0.125
#> 9         a older   Male older   Male    0.125
#> 10        a older   Male older Female    0.125
#> 11        a older   Male   old   Male    0.250
#> 12        a older   Male   old Female    0.250
#> 13        a   old   Male young   Male    0.125
#> 14        a   old   Male young Female    0.125
#> 15        a   old   Male older   Male    0.125
#> 16        a   old   Male older Female    0.125
#> 17        a   old   Male   old   Male    0.250
#> 18        a   old   Male   old Female    0.250
#> 19        b older Female young   Male    0.167
#> 20        b older Female young Female    0.167
#> 21        b older Female older   Male    0.167
#> 22        b older Female older Female    0.167
#> 23        b older Female   old   Male    0.167
#> 24        b older Female   old Female    0.167
#> 25        a young   Male young   Male    0.125
#> 26        a young   Male young Female    0.125
#> 27        a young   Male older   Male    0.125
#> 28        a young   Male older Female    0.125
#> 29        a young   Male   old   Male    0.250
#> 30        a young   Male   old Female    0.250
#> 31        b   old   Male young   Male    0.167
#> 32        b   old   Male young Female    0.167
#> 33        b   old   Male older   Male    0.167
#> 34        b   old   Male older Female    0.167
#> 35        b   old   Male   old   Male    0.167
#> 36        b   old   Male   old Female    0.167
#> 37        b   old   Male young   Male    0.167
#> 38        b   old   Male young Female    0.167
#> 39        b   old   Male older   Male    0.167
#> 40        b   old   Male older Female    0.167
#> 41        b   old   Male   old   Male    0.167
#> 42        b   old   Male   old Female    0.167
#> 43        b young   Male young   Male    0.167
#> 44        b young   Male young Female    0.167
#> 45        b young   Male older   Male    0.167
#> 46        b young   Male older Female    0.167
#> 47        b young   Male   old   Male    0.167
#> 48        b young   Male   old Female    0.167
#> 49        a   old Female young   Male    0.125
#> 50        a   old Female young Female    0.125
#> 51        a   old Female older   Male    0.125
#> 52        a   old Female older Female    0.125
#> 53        a   old Female   old   Male    0.250
#> 54        a   old Female   old Female    0.250
#> 55        a older Female young   Male    0.125
#> 56        a older Female young Female    0.125
#> 57        a older Female older   Male    0.125
#> 58        a older Female older Female    0.125
#> 59        a older Female   old   Male    0.250
#> 60        a older Female   old Female    0.250
#> 61        b young   Male young   Male    0.167
#> 62        b young   Male young Female    0.167
#> 63        b young   Male older   Male    0.167
#> 64        b young   Male older Female    0.167
#> 65        b young   Male   old   Male    0.167
#> 66        b young   Male   old Female    0.167
#> 67        b   old Female young   Male    0.167
#> 68        b   old Female young Female    0.167
#> 69        b   old Female older   Male    0.167
#> 70        b   old Female older Female    0.167
#> 71        b   old Female   old   Male    0.167
#> 72        b   old Female   old Female    0.167
#> 73        b young Female young   Male    0.167
#> 74        b young Female young Female    0.167
#> 75        b young Female older   Male    0.167
#> 76        b young Female older Female    0.167
#> 77        b young Female   old   Male    0.167
#> 78        b young Female   old Female    0.167
#> 79        a young Female young   Male    0.125
#> 80        a young Female young Female    0.125
#> 81        a young Female older   Male    0.125
#> 82        a young Female older Female    0.125
#> 83        a young Female   old   Male    0.250
#> 84        a young Female   old Female    0.250
#> 85        b older Female young   Male    0.167
#> 86        b older Female young Female    0.167
#> 87        b older Female older   Male    0.167
#> 88        b older Female older Female    0.167
#> 89        b older Female   old   Male    0.167
#> 90        b older Female   old Female    0.167
#> 91        a   old   Male young   Male    0.125
#> 92        a   old   Male young Female    0.125
#> 93        a   old   Male older   Male    0.125
#> 94        a   old   Male older Female    0.125
#> 95        a   old   Male   old   Male    0.250
#> 96        a   old   Male   old Female    0.250
#> 97        b   old Female young   Male    0.167
#> 98        b   old Female young Female    0.167
#> 99        b   old Female older   Male    0.167
#> 100       b   old Female older Female    0.167
#> 101       b   old Female   old   Male    0.167
#> 102       b   old Female   old Female    0.167
#> 103       a young Female young   Male    0.125
#> 104       a young Female young Female    0.125
#> 105       a young Female older   Male    0.125
#> 106       a young Female older Female    0.125
#> 107       a young Female   old   Male    0.250
#> 108       a young Female   old Female    0.250
#> 109       a   old Female young   Male    0.125
#> 110       a   old Female young Female    0.125
#> 111       a   old Female older   Male    0.125
#> 112       a   old Female older Female    0.125
#> 113       a   old Female   old   Male    0.250
#> 114       a   old Female   old Female    0.250
#> 115       a young   Male young   Male    0.125
#> 116       a young   Male young Female    0.125
#> 117       a young   Male older   Male    0.125
#> 118       a young   Male older Female    0.125
#> 119       a young   Male   old   Male    0.250
#> 120       a young   Male   old Female    0.250
#> 121       a   old   Male young   Male    0.125
#> 122       a   old   Male young Female    0.125
#> 123       a   old   Male older   Male    0.125
#> 124       a   old   Male older Female    0.125
#> 125       a   old   Male   old   Male    0.250
#> 126       a   old   Male   old Female    0.250
#> 127       b older Female young   Male    0.167
#> 128       b older Female young Female    0.167
#> 129       b older Female older   Male    0.167
#> 130       b older Female older Female    0.167
#> 131       b older Female   old   Male    0.167
#> 132       b older Female   old Female    0.167
#> 133       a young   Male young   Male    0.125
#> 134       a young   Male young Female    0.125
#> 135       a young   Male older   Male    0.125
#> 136       a young   Male older Female    0.125
#> 137       a young   Male   old   Male    0.250
#> 138       a young   Male   old Female    0.250
#> 139       a older Female young   Male    0.125
#> 140       a older Female young Female    0.125
#> 141       a older Female older   Male    0.125
#> 142       a older Female older Female    0.125
#> 143       a older Female   old   Male    0.250
#> 144       a older Female   old Female    0.250
#> 145       a young Female young   Male    0.125
#> 146       a young Female young Female    0.125
#> 147       a young Female older   Male    0.125
#> 148       a young Female older Female    0.125
#> 149       a young Female   old   Male    0.250
#> 150       a young Female   old Female    0.250
#> 151       a young   Male young   Male    0.125
#> 152       a young   Male young Female    0.125
#> 153       a young   Male older   Male    0.125
#> 154       a young   Male older Female    0.125
#> 155       a young   Male   old   Male    0.250
#> 156       a young   Male   old Female    0.250
#> 157       b   old Female young   Male    0.167
#> 158       b   old Female young Female    0.167
#> 159       b   old Female older   Male    0.167
#> 160       b   old Female older Female    0.167
#> 161       b   old Female   old   Male    0.167
#> 162       b   old Female   old Female    0.167
#> 163       b young Female young   Male    0.167
#> 164       b young Female young Female    0.167
#> 165       b young Female older   Male    0.167
#> 166       b young Female older Female    0.167
#> 167       b young Female   old   Male    0.167
#> 168       b young Female   old Female    0.167
#> 169       a older   Male young   Male    0.125
#> 170       a older   Male young Female    0.125
#> 171       a older   Male older   Male    0.125
#> 172       a older   Male older Female    0.125
#> 173       a older   Male   old   Male    0.250
#> 174       a older   Male   old Female    0.250
#> 175       b young   Male young   Male    0.167
#> 176       b young   Male young Female    0.167
#> 177       b young   Male older   Male    0.167
#> 178       b young   Male older Female    0.167
#> 179       b young   Male   old   Male    0.167
#> 180       b young   Male   old Female    0.167

^{Created on 2020-11-16 by the reprex package (v0.3.0.9001)}

TeeTrea · November 16, 2020, 9:39pm

Thanks! Unfortunately, this isn't quite what I was looking for. I updated my post to make it clearer. Is it possible to adjust your code to do what I'm looking for?

technocrat · November 16, 2020, 10:16pm

I hand checked a couple of these, but it's past my nap time and no assurances can be given

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

left_join(survey,census, by = "country") %>%
  select(-sex.y,-age.y) %>% 
  rename(sex = sex.x, age = age.x) %>%
  group_by(country,sex,age, rel_freq) %>% 
  count() %>%
  group_by(country,sex,age) %>%
  summarise(freq = rel_freq * n) %>%
  group_by(country,sex,age) %>%
  summarise(freq = sum(freq))
#> `summarise()` regrouping output by 'country', 'sex', 'age' (override with `.groups` argument)
#> `summarise()` regrouping output by 'country', 'sex' (override with `.groups` argument)
#> # A tibble: 11 x 4
#> # Groups:   country, sex [4]
#>    country sex    age    freq
#>    <chr>   <chr>  <chr> <dbl>
#>  1 a       Female old    2   
#>  2 a       Female older  2   
#>  3 a       Female young  3   
#>  4 a       Male   old    3   
#>  5 a       Male   older  2   
#>  6 a       Male   young  5   
#>  7 b       Female old    3.01
#>  8 b       Female older  3.01
#>  9 b       Female young  2.00
#> 10 b       Male   old    2.00
#> 11 b       Male   young  3.01

^{Created on 2020-11-16 by the reprex package (v0.3.0.9001)}

TeeTrea · November 16, 2020, 11:56pm

This seems to work, thanks! Could you perhaps add a few comments to explain the logic?

technocrat · November 17, 2020, 12:44am

Oh, good. I was nodding off.

My approach to R is functional: f(x) = y, where the three objects (in R everything is an object) where

x is what is at hand
y is what is desired
f transforms x to y

Objects can be composite—think of data frames. I could have rolled this all into a single function, but the logic would be harder to follow.

So here's the thought process:

Create x by combining the survey and census data into a single data frame, called tab, eliminating the duplicate names and renaming the surviving names.
Create f by %>% piping tab through a series of functions to create y
First, consolidate each combination of country, sex and age and get a count (n)
Regroup the same way with the new column n and multiply by each rel_freq 1.25 * 8, etc.
Regroup again the same way and add the results from step 4

If I were going to be doing this with a number of identically organized but differently named survey and census arguments, I'd do a little hard-wired function like this:

suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(tidyr)
})
set.seed(123)
survey <- data.frame(
  country = sample(letters[1:2], 30, replace = T),
  age = sample(c("young", "older", "old"), 30, replace = T),
  sex = sample(c("Male", "Female"), 30, replace = T)
)
census <- data.frame(
  country = rep(letters[1:2], each = 6),
  age = rep(rep(c("young", "older", "old"), each = 2), 2),
  sex = rep(c("Male", "Female"), 6),
  rel_freq = c(rep(.125, 4), rep(.25, 2), rep(.167, 6))
)

get_freq <- function(x, y) {
  left_join(x,y, by = "country") %>%
    select(-sex.y,-age.y) %>% 
    rename(sex = sex.x, age = age.x) %>%
    group_by(country,sex,age, rel_freq) %>% 
    count() %>%
    group_by(country,sex,age) %>%
    summarise(freq = rel_freq * n) %>%
    group_by(country,sex,age) %>%
    summarise(freq = sum(freq))
}

get_freq(survey,census)
#> `summarise()` regrouping output by 'country', 'sex', 'age' (override with `.groups` argument)
#> `summarise()` regrouping output by 'country', 'sex' (override with `.groups` argument)
#> # A tibble: 11 x 4
#> # Groups:   country, sex [4]
#>    country sex    age    freq
#>    <chr>   <chr>  <chr> <dbl>
#>  1 a       Female old    2   
#>  2 a       Female older  2   
#>  3 a       Female young  3   
#>  4 a       Male   old    3   
#>  5 a       Male   older  2   
#>  6 a       Male   young  5   
#>  7 b       Female old    3.01
#>  8 b       Female older  3.01
#>  9 b       Female young  2.00
#> 10 b       Male   old    2.00
#> 11 b       Male   young  3.01

^{Created on 2020-11-16 by the reprex package (v0.3.0.9001)}

TeeTrea · November 17, 2020, 12:52am

Awesome. Thanks for taking the time to explain this, I just learned something

How would I reassign the freq values to the original survey data?

technocrat · November 17, 2020, 8:25am

Again, f(x) = y. This time x is the return value of

x  <-  get_freq(survey,census)

which has freq as a column, along with country, age and sex, which is everything in survey. The only difference is that x has aggregated survey, so it has fewer rows. For a row in survey, what is it that freq is supposed to signify?

system · November 24, 2020, 8:26am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.