Gather list name and list elements into a summary data frame

jgarrigan · March 18, 2022, 9:14pm

I wish to run a Benford analysis per group for my data.

I wish to extract the MAD.conformity and n per group and store the output in a data frame where there would be one row per group with 3 columns, 1st column - VendorNum, 2nd column = n and 3rd column = MAD.conformity

I can extract the MAD.conformity metric but when I add extra metrics to my extract I get the following error:

Any help to get the 3 metrics into a data frame would be greatly appreciated

library(dplyr)
library(purrr)
library(benford.analysis)

data(corporate.payment)

vendor_records <- corporate.payment %>% 
  group_by(VendorNum) %>% 
  summarise(records = n()) %>% 
  arrange(desc(records))

top_vendors <- corporate.payment %>% 
  filter(VendorNum %in% c('3630','6661','2001','4984'))

result <- top_vendors %>% 
  split(.$VendorNum) %>% 
  map(~ benford(number.of.digits = 1, discrete = TRUE, sign = "positive", data = .x$Amount)) %>% 
  map_dfr(magrittr::extract, c("MAD.conformity"))

Error:
! Column names `MAD.conformity`, `MAD.conformity`, and `MAD.conformity` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
  * "MAD.conformity" at locations 1, 2, 3, and 4.

ttrodrigz · March 19, 2022, 2:08am

Hi there, this would be my proposed workflow:

Step 1: count() the number of records for VendorNum

corporate.payment <- as_tibble(corporate.payment)

top_vendors <-
    corporate.payment %>%
    count(VendorNum, sort = TRUE) %>%
    slice_max(order_by = n, n = 4)

top_vendors
#> # A tibble: 4 x 2
#>   VendorNum     n
#>   <chr>     <int>
#> 1 3630      13973
#> 2 6661       4947
#> 3 2001       4736
#> 4 4984       4321

Step 2: don't split, work in a nested data frame

results <-
    corporate.payment %>%
    inner_join(top_vendors, by = "VendorNum") %>%
    group_nest(VendorNum, n) %>%
    mutate(
        ben_res = map(
            .x = data,
            .f = ~benford(
                data = .x$Amount,
                number.of.digits = 1,
                discrete = TRUE,
                sign = "positive"
            )       
        ),
        MAD.conformity = map_chr(ben_res, pluck, "MAD.conformity")
    ) %>%
    select(VendorNum, n, MAD.conformity)

results
#> # A tibble: 4 x 3
#>   VendorNum     n MAD.conformity       
#>   <chr>     <int> <chr>                
#> 1 2001       4736 Nonconformity        
#> 2 3630      13973 Acceptable conformity
#> 3 4984       4321 Acceptable conformity
#> 4 6661       4947 Nonconformity

The call to inner_join() will keep columns both from the LHS and RHS of the join, but only the rows where the the keys match, this in effect filters the larger data frame by only the top (4, in this case) vendor records while simultaneously bringing in the counts.

Rather than splitting, group_nest() will allow you operate by-group like split() but in a more organized manner. This will create a new column called data, on to which we can map the benford() function and then use a combination of map_chr() and pluck() to extract the result you wanted.

Hope this helps!

phil_hummel · March 19, 2022, 3:03am

Look at the pattern for Many Models in R4DS. I get the same error if I use purr::map to add columns for both broom:tidy() and broom::glance() because they have duplicate output columns that collide in the grouped and nested output data frame.

jgarrigan · March 19, 2022, 9:34am

@ttrodrigz this is awesome, thank you so much. I need more practice working with lists.

ttrodrigz · March 19, 2022, 8:57pm

You're welcome, glad it helped!

system · March 26, 2022, 8:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.