Calculating rowwise totals and proportions using tidyeval?

tidyeval

#1

I haven’t delved too deep into tidyeval and quasiquotation yet, but I have a case where it seems like it makes sense to use and I need some help to make it work.

Say I have a tibble in wide format where each row is an election district and each column is the number of votes a candidate received. I want to calculate to total votes per district and the proportion of votes each candidate received in each district.

library(tidyverse)

votes <- tribble(
  ~district, ~a, ~b, ~c,
  "dist1", 20L, 5L, 10L,
  "dist2", 25L, 5L, 15L,
  "dist3", 10L, 15L, 15L,
)

votes
#> # A tibble: 3 x 4
#>   district     a     b     c
#>   <chr>    <int> <int> <int>
#> 1 dist1       20     5    10
#> 2 dist2       25     5    15
#> 3 dist3       10    15    15

If I were just calculating it for this set of candidates, I would do something like the following:

(edited to include rowise()) for expected totals)

votes %>%
  rowwise() %>%
  mutate(
    total = sum(a + b + c),
    a_prop = a / total,
    b_prop = b / total,
    c_prop = c / total
  )
#> Source: local data frame [3 x 8]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 8
#>   district     a     b     c total a_prop b_prop c_prop
#>   <chr>    <int> <int> <int> <int>  <dbl>  <dbl>  <dbl>
#> 1 dist1       20     5    10    35  0.571  0.143  0.286
#> 2 dist2       25     5    15    45  0.556  0.111  0.333
#> 3 dist3       10    15    15    40  0.250  0.375  0.375

But where I’m having trouble is if I want to write a function to do this where there could be any number of candidates with any possible name. I think this is a case where I would want to use (...) and := in my function, but I’m not really sure how to go about it unquoting each element of the dots list to create a new column.


#3

I understand where you going with your line of thought, but it is possible to calculate this without tidyeval (or at least get really close to the result you want). Here is my a bit hacky attempt:

candidates <- c("a", "b", "c") # I assume you know name of the candidates in advance
votes %>%
  tidyr::gather(key = "candidate", ... =  -district) %>%
  dplyr::group_by(district) %>%
  dplyr::mutate(prop = value/sum(value)) %>%
  dplyr::select(-value) %>%
  tidyr::spread(candidate, prop, sep = "_") %>%
  dplyr::left_join(votes, by = "district") %>%
  dplyr::ungroup() %>%
  dplyr::mutate(total = rowSums(votes[, candidates]))

#4

Thanks @mishabalyasin, that gets the desired output.

But I’m trying to figure out how to do this if I didn’t know the names of candidates in advance or if I wanted to map this function over a list of tibbles with different numbers and names of candidates. So, that is the case where I’m thinking I may need to generalize with tidyeval.


#5

But you can always get names of candidates, no? For example:

nms <- dplyr::select(votes, -district) %>%
   names()

This way you don’t need to know the names in advance, you can always get them in a format you need programmatically.


#6

One more version that uses tidyeval, but only slightly :slight_smile: :

library(tidyverse)

votes <- tribble(
  ~district, ~a, ~b, ~c,
  "dist1", 20L, 5L, 10L,
  "dist2", 25L, 5L, 15L,
  "dist3", 10L, 15L, 15L,
)

candidates <- votes %>% dplyr::select(-district) %>% names()

prop <- function(df, total = "total", candidate = "a"){
  name <- paste0(candidate, "_prop")
  prop <- df[[candidate]]/df[[total]]
  df %>%
    dplyr::mutate(!!name := prop)
}

sum_columns <- function(df, columns) {
  rowSums(df[, columns])
}

f <- lapply(candidates, function(x){purrr::partial(prop, candidate = x)}) %>% Reduce(f = purrr::compose, x = .)

votes %>%
  dplyr::mutate(total = sum_columns(., candidates)) %>%
  f()

# A tibble: 3 x 8
  district     a     b     c total c_prop b_prop a_prop
  <chr>    <int> <int> <int> <dbl>  <dbl>  <dbl>  <dbl>
1 dist1       20     5    10  35.0  0.286  0.143  0.571
2 dist2       25     5    15  45.0  0.333  0.111  0.556
3 dist3       10    15    15  40.0  0.375  0.375  0.250

In this case output is exactly the same (save for ordering) that you wanted, so can be a viable solution, I think.


#7

This solution works without tidyeval, provided that the data.frame consists only of the district variable and columns with candidate names:

library(tidyverse)

votes <- tribble(
  ~district, ~a, ~b, ~c,
  "dist1", 20L, 5L, 10L,
  "dist2", 25L, 5L, 15L,
  "dist3", 10L, 15L, 15L,
)

candidate_proportion <- function(d) {
  require(tidyverse)
  
  d_gather <- d %>% 
    gather(key = "candidate", value = "vote_count", -district)
  
  d_sum <- d_gather %>% 
    group_by(district) %>% 
    summarise(district_sum = sum(vote_count))
  
  d_prop <- d_gather %>% 
    left_join(d_sum, by = "district") %>% 
    mutate(prop = vote_count / district_sum) %>% 
    select(-vote_count, -district_sum) %>% 
    mutate(candidate = paste0(candidate,"_prop")) %>% 
    spread(key = candidate, value = prop)
  
  d %>% left_join(d_prop, by = "district")
    
}

candidate_proportion(votes)

#8

These are all good solutions! I’m accepting @w.joel.schneider’s as it’s the most intuitive to me. I guess I was overthinking it and some reshaping and grouping can get this done without tidyeval.

Thanks!


#9

The previous solutions are already covering a lot, and I am not pretending to offer a better way.
It is just that When I saw the question, I wanted to play with purrr, tidyeval and list-columns in tibble to see what I could do. Specifically to found a way of doing this manipulation trying to fit with list columns in a unique tibble. This is just to share another workflow.

  1. First preparing the tidy data (1 column = 1 variable)
  2. Preparing each of desired spread format keeping all result in a unique tibble.
  3. Extract what is needed, for example the all synthesis table.
library(tidyverse)

votes <- tribble(
  ~district, ~a, ~b, ~c,
  "dist1", 20L, 5L, 10L,
  "dist2", 25L, 5L, 15L,
  "dist3", 10L, 15L, 15L,
)

# First tidy data

tidy_tab <- votes %>%
  gather("candidates", "votes", -district) %>%
  add_count(district, wt = votes) %>%
  mutate(prop = votes / n) %>%
  rename(total = n)
tidy_tab
#> # A tibble: 9 x 5
#>   district candidates votes total  prop
#>   <chr>    <chr>      <int> <int> <dbl>
#> 1 dist1    a             20    35 0.571
#> 2 dist2    a             25    45 0.556
#> 3 dist3    a             10    40 0.250
#> 4 dist1    b              5    35 0.143
#> 5 dist2    b              5    45 0.111
#> 6 dist3    b             15    40 0.375
#> 7 dist1    c             10    35 0.286
#> 8 dist2    c             15    45 0.333
#> 9 dist3    c             15    40 0.375

# use tidy eval to avoid repetition

spread_custom <- function(df, col) {
  col <- enquo(col)
  select(df, candidates, !! col) %>%
    spread(candidates, !! col)
}

# but all in the same tibble
spread_tab <- tidy_tab %>%
  nest(-district, -total) %>%
  mutate(
    votes = map(data, ~ spread_custom(.x, votes)),
    prop = map(data, ~ spread_custom(.x, prop))
  )
spread_tab
#> # A tibble: 3 x 5
#>   district total data             votes            prop            
#>   <chr>    <int> <list>           <list>           <list>          
#> 1 dist1       35 <tibble [3 x 3]> <tibble [1 x 3]> <tibble [1 x 3]>
#> 2 dist2       45 <tibble [3 x 3]> <tibble [1 x 3]> <tibble [1 x 3]>
#> 3 dist3       40 <tibble [3 x 3]> <tibble [1 x 3]> <tibble [1 x 3]>


# get the tidy format 
spread_tab %>%
  unnest(data)
#> # A tibble: 9 x 5
#>   district total candidates votes  prop
#>   <chr>    <int> <chr>      <int> <dbl>
#> 1 dist1       35 a             20 0.571
#> 2 dist1       35 b              5 0.143
#> 3 dist1       35 c             10 0.286
#> 4 dist2       45 a             25 0.556
#> 5 dist2       45 b              5 0.111
#> 6 dist2       45 c             15 0.333
#> 7 dist3       40 a             10 0.250
#> 8 dist3       40 b             15 0.375
#> 9 dist3       40 c             15 0.375

# get votes
spread_tab %>%
  unnest(votes, .drop = T)
#> # A tibble: 3 x 5
#>   district total     a     b     c
#>   <chr>    <int> <int> <int> <int>
#> 1 dist1       35    20     5    10
#> 2 dist2       45    25     5    15
#> 3 dist3       40    10    15    15

# get the synthesis table
spread_tab %>%
  select(-data) %>%
  unnest(.sep = "_")
#> # A tibble: 3 x 8
#>   district total votes_a votes_b votes_c prop_a prop_b prop_c
#>   <chr>    <int>   <int>   <int>   <int>  <dbl>  <dbl>  <dbl>
#> 1 dist1       35      20       5      10  0.571  0.143  0.286
#> 2 dist2       45      25       5      15  0.556  0.111  0.333
#> 3 dist3       40      10      15      15  0.250  0.375  0.375

Created on 2018-02-04 by the reprex package (v0.1.1.9000).