Suppress low cell counts in output tables and graphs

Hi there,

I'm trying to figure out a general approach to suppressing low cell counts, for example counts of less than 10, in output tables or graphs in R. The tables could be frequency tables of one variable, cross-tabulations or three-way tables. The graphs could be bar graphs, including clustered or stacked bar graphs, or histograms. It's a common requirement in social research.

In Stata, I was able to do this using the following code to generate a counter variable with which to set a logical condition (if) for producing the table or graph. I then dropped the counter from the data set. For example:

    bysort variable1: gen count = _N
    tab variable1 variable2 if count >= 10
    drop count

What would be the equivalent procedure in R using the following example?

    library(dplyr)
    df <- data.frame(
        zone_1 = rpois(10, 5),
        zone_2 = rpois(10, 5),
        zone_3 = rpois(10, 5)
    )
    df

What condition would I need to apply to the following table to remove cell counts of, say, less than 2?

    z2z3 <- table(df$zone_2, df$zone_3)
    z2z3

    1 2 5 6 14
  1 0 1 0 0  0
  3 0 0 0 0  1
  4 0 0 0 2  0
  5 0 1 0 0  0
  7 1 1 0 1  0
  8 0 0 2 0  0

Is there a common approach when using packages such as 'sjPlot'?

Using the same dataset,

    library(sjPlot)

    tab_xtab(df$zone_2,
             df$zone_3,
             var.labels = c("z2", "z3"),
             statistics = "fisher",
             show.row.prc = TRUE)

Where would I insert the condition or how would I transform the data prior to running 'tab_xtab'?

Thanks very much,

Steve

It sounds like you could use the forcats package for this:
https://forcats.tidyverse.org
In particular, look at the functions that are variants of fct_lump():
https://forcats.tidyverse.org/reference/fct_lump.html
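
As a rough sketch of how that might look with the data from the question: convert each zone variable to a factor, lump the levels that occur fewer than `min_n` times into "Other", and then tabulate. The `set.seed()` call and the `min_n` name are my own additions for illustration, so the counts will not match the table posted above.

```r
library(forcats)

set.seed(42)  # assumption: fixed seed so the example is reproducible
df <- data.frame(
  zone_1 = rpois(10, 5),
  zone_2 = rpois(10, 5),
  zone_3 = rpois(10, 5)
)

min_n <- 2  # illustrative threshold; the question mentions both 2 and 10

# Levels seen fewer than min_n times are collapsed into "Other"
z2 <- fct_lump_min(factor(df$zone_2), min = min_n)
z3 <- fct_lump_min(factor(df$zone_3), min = min_n)

# Cross-tabulate the lumped factors
z2z3 <- table(z2, z3)
z2z3
```

Keep in mind that this works on the level counts of each variable separately, so individual cells of the cross-tabulation can still fall below the threshold.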

Thanks, Peter @phiggins. It looks like some of those functions could work from the documentation you supplied. Have you used 'forcats' yourself for this purpose? I assume all variables run with 'forcats' need to be designated as factor variables first.

Yes, that is correct, unless you convert them to factors on the fly inside mutate(), as below:

library(tidyverse)  # also attaches tibble, dplyr and forcats

data <- tribble(
  ~city, ~state,
  "New York",   "NY",
  "Albuquerque", "NM",
  "Los Angeles",   "CA",
  "New York",   "NY",
  "Chicago", "IL",
  "San Francisco",   "CA",
  "New York",   "NY",
  "Chicago", "IL",
  "Los Angeles",   "CA",
  "New York",   "NY",
  "Chicago", "IL",
  "San Francisco",   "CA",
  "New York",   "NY",
  "Chicago", "IL",
  "Los Angeles",   "CA",
  "New York",   "NY",
  "Ann Arbor", "MI",
  "Chicago",   "IL",
  "New York",   "NY",
  "Chicago", "IL",
  "San Francisco",   "CA"
)

# Lump levels that occur fewer than 3 times into "Other"
data %>% 
  mutate(
    city  = fct_lump_min(as.factor(city), min = 3),
    state = fct_lump_min(as.factor(state), min = 3)
  ) %>% 
  print(n = 21)
#> # A tibble: 21 x 2
#>    city          state
#>    <fct>         <fct>
#>  1 New York      NY   
#>  2 Other         Other
#>  3 Los Angeles   CA   
#>  4 New York      NY   
#>  5 Chicago       IL   
#>  6 San Francisco CA   
#>  7 New York      NY   
#>  8 Chicago       IL   
#>  9 Los Angeles   CA   
#> 10 New York      NY   
#> 11 Chicago       IL   
#> 12 San Francisco CA   
#> 13 New York      NY   
#> 14 Chicago       IL   
#> 15 Los Angeles   CA   
#> 16 New York      NY   
#> 17 Other         Other
#> 18 Chicago       IL   
#> 19 New York      NY   
#> 20 Chicago       IL   
#> 21 San Francisco CA

Created on 2020-05-17 by the reprex package (v0.3.0)
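
If the aim is to hide small cells rather than merge categories, one option is to blank out the offending cells after tabulating. This is not something from the posts above, just a minimal base R sketch: the helper name `suppress_small()` and the example vectors are made up for illustration, and zero cells are left visible here, which may or may not suit your disclosure rules.

```r
# Hypothetical helper: replace counts between 1 and min_n - 1 with NA
suppress_small <- function(tab, min_n = 10) {
  tab[tab > 0 & tab < min_n] <- NA
  tab
}

# Made-up data for illustration
x <- c("a", "a", "a", "b", "b", "c", "a", "b", "a", "a")
y <- c("u", "u", "v", "u", "u", "v", "u", "u", "u", "v")

tab <- table(x, y)
masked <- suppress_small(tab, min_n = 2)

# na.print controls how the suppressed cells are displayed
print(masked, na.print = "<2")

# For graphs, the masked counts can be reshaped to long form
# and the suppressed (NA) rows dropped before plotting
plot_df <- subset(as.data.frame(masked), !is.na(Freq))
```

The resulting `plot_df` has one row per remaining cell, so it could be passed to a bar chart function without the suppressed cells ever being drawn.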
