Hist() data distribution: Not quite clear as an output

ggplot2
rstudio

#1

Hello,

I have some sample data:

> cat(wrapr::draw_frame(head(cust_city_distribution,30)))
build_frame(
   "CUST_CITY"       , "frequency", "Percentage_Frequency" |
   "Aberdeen        ",  6L        , 0.01167                |
   "ABERDEEN        ",  2L        , 0.003889               |
   "ABILENE         ", 16L        , 0.03111                |
   "Abingdon        ",  1L        , 0.001944               |
   "ABINGDON        ",  6L        , 0.01167                |
   "ACTON           ", 16L        , 0.03111                |
   "Acworth         ",  2L        , 0.003889               |
   "Ada             ",  9L        , 0.0175                 |
   "ADA             ", 10L        , 0.01944                |
   "ADAIRSVILLE     ",  8L        , 0.01555                |
   "ADRIAN          ", 10L        , 0.01944                |
   "AIEA            ",  4L        , 0.007777               |
   "Aiken           ",  7L        , 0.01361                |
   "AIKEN           ", 43L        , 0.0836                 |
   "AKRON"           , 18L        , 0.035                  |
   "Akron           ",  1L        , 0.001944               |
   "AKRON           ",  5L        , 0.009721               |
   "Alabaster       ",  1L        , 0.001944               |
   "ALABASTER       ",  1L        , 0.001944               |
   "Alamogordo      ",  9L        , 0.0175                 |
   "Albany          ", 16L        , 0.03111                |
   "ALBANY          ", 48L        , 0.09333                |
   "ALBERTVILLE     ", 19L        , 0.03694                |
   "Albion          ",  2L        , 0.003889               |
   "Albuquerque     ", 32L        , 0.06222                |
   "ALBUQUERQUE     ", 14L        , 0.02722                |
   "ALEXANDER CITY  ", 17L        , 0.03305                |
   "Alexandria      ",  2L        , 0.003889               |
   "ALEXANDRIA      ", 19L        , 0.03694                |
   "ALGONQUIN       ",  1L        , 0.001944               )

I am trying to produce a histogram that shows the frequency distribution of the cities, so I can drop the cities that have low frequencies.

I am doing this because I need to run a decision tree analysis, and one of the roadblocks I am running into is that the factor "city" has roughly 3764 levels.


I can produce a histogram, but since the x-axis has so many small values (I think the percentage can go as low as 0.001 or lower), I cannot read the plot well enough to decide which cities to keep in my analysis and which to drop.

Thanks!


#2

If you just want to look at the relative distribution of the rare cities, you could filter out the highest percentage frequency values (or some high quantile of raw frequencies) before plotting.

But I’m not sure what you’re hoping to get out of visual inspection here that you couldn’t get by calculating quantiles. It’s always going to be a judgement call where to draw the line on “too rare”. Have you looked at which cities, and how many, fall into, say, the lowest decile (bottom 10%) of frequency?
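
As a rough sketch of the quantile approach, assuming a data frame shaped like the one in the question: the handful of rows below are copied from the posted sample (city names trimmed for brevity), and the 10% cutoff is only an illustrative choice, not a recommendation.

```r
# Small stand-in for cust_city_distribution, using a few rows from the question
cust_city_distribution <- data.frame(
  CUST_CITY = c("ABILENE", "Abingdon", "AIKEN", "ALBANY",
                "Alabaster", "ALGONQUIN", "Acworth", "Ada"),
  frequency = c(16L, 1L, 43L, 48L, 1L, 1L, 2L, 9L),
  stringsAsFactors = FALSE
)

# Deciles of the raw frequencies
deciles <- quantile(cust_city_distribution$frequency, probs = seq(0, 1, by = 0.1))
print(deciles)

# Illustrative cutoff: the 10th percentile of frequency
cutoff <- quantile(cust_city_distribution$frequency, probs = 0.10, names = FALSE)

# Cities at or below the cutoff, i.e. candidates for "too rare"
rare <- cust_city_distribution[cust_city_distribution$frequency <= cutoff, ]
print(nrow(rare))
print(rare$CUST_CITY)
```

With real data you would run this on the full table and try a few different cutoffs to see how many of the ~3764 levels each one removes.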


#3

Filtering the data works for your purpose, since you're looking at absolute frequencies (of frequencies??), but if you want to visualize the percentages, you would use coord_cartesian() instead, as in

... + coord_cartesian(xlim = c(.25, 3))

I default to that function because it works either way: it only zooms the view, so no data is dropped before the bins are computed.
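
A minimal sketch of what that could look like end to end, assuming the column names from the question. The rows are a few copied from the posted sample, and the `xlim` values here are picked only to suit those rows, so adjust them to your own range.

```r
library(ggplot2)

# A few rows from the question's sample (city names trimmed for brevity)
cust_city_distribution <- data.frame(
  CUST_CITY = c("Aberdeen", "ABERDEEN", "ABILENE", "Abingdon",
                "AIKEN", "ALBANY", "Albuquerque", "ALGONQUIN"),
  Percentage_Frequency = c(0.01167, 0.003889, 0.03111, 0.001944,
                           0.0836, 0.09333, 0.06222, 0.001944),
  stringsAsFactors = FALSE
)

# Histogram over the full range
p <- ggplot(cust_city_distribution, aes(x = Percentage_Frequency)) +
  geom_histogram(bins = 30)

# Zoom in on the low end; the bins are still computed from all rows
p_zoom <- p + coord_cartesian(xlim = c(0, 0.02))
print(p_zoom)
```

By contrast, setting limits through `xlim()` or `scale_x_continuous(limits = ...)` removes out-of-range values before the statistics are computed, which is closer to filtering.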

On the modeling side, instead of dropping data from less common cities, you might consider hard coding the most frequent cities as binary variables. E.g., city_chicago = city=="Chicago".
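
Here is one way that could be sketched in base R. The `customers` data frame is a made-up stand-in for your row-level data, and keeping the top two cities is an arbitrary choice for illustration.

```r
# Hypothetical row-level data: one row per customer
customers <- data.frame(
  CUST_CITY = c("ALBANY", "AIKEN", "ALBANY", "Abingdon",
                "AIKEN", "ALBANY", "ALGONQUIN"),
  stringsAsFactors = FALSE
)

# Count customers per city and keep the names of the most frequent ones
city_counts <- sort(table(customers$CUST_CITY), decreasing = TRUE)
top_cities <- head(names(city_counts), 2)  # how many to keep is a judgement call

# One logical indicator column per frequent city, e.g. city_albany
for (city in top_cities) {
  col_name <- paste0("city_", tolower(city))
  customers[[col_name]] <- customers$CUST_CITY == city
}

print(customers)
```

The tree then sees a few binary predictors instead of one factor with thousands of levels, and no rows are thrown away.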


#4

In addition to what everyone else is saying, this seems like a classic example where you want to transform your variable to make the variation more meaningful (and clear). Why not take the root or the log of the percentage?
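
Something along these lines, as a rough sketch: the values are a few percentages copied from the sample in the question, and base `hist()` is used only to keep the example short. The log is safe here because every city in the table has at least one record, so no percentage is zero.

```r
# A few Percentage_Frequency values from the question's sample
pct <- c(0.001944, 0.003889, 0.01167, 0.0175, 0.03111, 0.0836, 0.09333)

# Log transform spreads out the small values that crowd the left of the plot
log_pct <- log10(pct)
hist(log_pct,
     breaks = 10,
     main = "City frequency on a log scale",
     xlab = "log10(Percentage_Frequency)")

# Square root is a milder alternative
sqrt_pct <- sqrt(pct)
hist(sqrt_pct,
     breaks = 10,
     main = "City frequency on a square-root scale",
     xlab = "sqrt(Percentage_Frequency)")
```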