What would be a good threshold number to convert some categories to "Not Specified"?

This topic is derived from: Exploratory Data Analysis for Big Data with continuous and categorical variables (mixed data types) (optional reading).

I have the following dataset, myds:

str(myds)
'data.frame':   841500 obs. of  30 variables:
 $ score                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received       : int  0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received    : int  3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server      : int  17 17 7 7 7 14 10 7 34 10 ...
 $ age                       : int  17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned             : int  929 655 286 357 571 833 476 414 726 857 ...
 $ registrationYYYY          : Factor w/ 2 levels ...
 $ registrationDateMM        : Factor w/ 9 levels ...
 $ registrationDateDD        : Factor w/ 31 levels ...
 $ registrationDateHH        : Factor w/ 24 levels ...
 $ registrationDateWeekDay   : Factor w/ 7 levels ...
 $ catVar_05                 : Factor w/ 2 levels ...
 $ catVar_06                 : Factor w/ 140 levels ...
 $ catVar_07                 : Factor w/ 21 levels ...
 $ catVar_08                 : Factor w/ 1582 levels ...
 $ catVar_09                 : Factor w/ 70 levels ...
 $ catVar_10                 : Factor w/ 755 levels ...
 $ catVar_11                 : Factor w/ 23 levels ...
 $ catVar_12                 : Factor w/ 129 levels ...
 $ catVar_13                 : Factor w/ 15 levels ...
 $ city                      : Factor w/ 22750 levels ...
 $ state                     : Factor w/ 55 levels ...
 $ zip                       : Factor w/ 26659 levels ...
 $ catVar_17                 : Factor w/ 2 levels ...
 $ catVar_18                 : Factor w/ 2 levels ...
 $ catVar_19                 : Factor w/ 3 levels ...
 $ catVar_20                 : Factor w/ 6 levels ...
 $ catVar_21                 : Factor w/ 2 levels ...
 $ catVar_22                 : Factor w/ 4 levels ...
 $ catVar_23                 : Factor w/ 5 levels ...

Now let’s do an experiment: take the discrete variable catVar_08 and count, for each of its values, how many observations that value appears in. The table below is sorted in decreasing order by number of observations:

varname = "catVar_08"
counts = get_discrete_category_counts(myds, varname)
counts
## # A tibble: 1,571 x 2
## # Groups:   catVar_08 [1,571]
##    catVar_08            count
##    <chr>                <int>
##  1 catVar_08_value_415  83537
##  2 catVar_08_value_463  68244
##  3 catVar_08_value_179  65414
##  4 catVar_08_value_526  59172
##  5 catVar_08_value_195  49275
##  6 catVar_08_value_938  26834
##  7 catVar_08_value_1142 25351
##  8 catVar_08_value_1323 23794
##  9 catVar_08_value_1253 18715
## 10 catVar_08_value_1268 18379
## # ... with 1,561 more rows
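
For completeness, get_discrete_category_counts is just a small counting helper; a minimal sketch of what it could look like (assuming dplyr; the exact implementation is not important for the question):

library(dplyr)

# Sketch: count how many observations carry each value of a discrete
# variable, sorted by decreasing frequency.
get_discrete_category_counts = function(ds, varname) {
  ds %>%
    group_by(across(all_of(varname))) %>%
    summarise(count = n(), .groups = "keep") %>%
    arrange(desc(count))
}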

Let’s check the counts above more deeply. To do that, let’s plot them over different ranges:

plot_discrete_category_counts(myds, varname, 1, 500)

plot_discrete_category_counts(myds, varname, 501, 1000)

plot_discrete_category_counts(myds, varname, 1001)
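
plot_discrete_category_counts is another small helper; a possible sketch (again, the exact implementation is not important) that plots the counts of the categories ranked between two positions:

# Sketch: plot the counts of the categories ranked from `from_index`
# to `to_index` (defaults to the last category when omitted).
plot_discrete_category_counts = function(ds, varname, from_index, to_index = NULL) {
  counts = get_discrete_category_counts(ds, varname)
  if (is.null(to_index)) to_index = nrow(counts)
  idx = from_index:to_index
  plot(idx, counts[["count"]][idx], type = "h",
       xlab = "category rank", ylab = "count",
       main = paste0(varname, " (ranks ", from_index, "-", to_index, ")"))
}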

As we can see in the second plot above, if we keep the first 500 categories (those with the highest number of observations) and replace the rest (1571 - 500 = 1071) with a generic value like Not Specified, then every category we keep has been observed more than roughly 40 times.
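
The replacement itself is straightforward; for instance, forcats::fct_lump_n keeps the n most frequent levels and collapses all the others into one level (a sketch using the top-500 cut chosen above):

library(forcats)

# Keep the 500 most frequent levels of catVar_08 and collapse the
# remaining 1071 levels into a single "Not Specified" level.
myds$catVar_08 = fct_lump_n(myds$catVar_08, n = 500, other_level = "Not Specified")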

Another big advantage of such level reduction is that the computational effort of training our model will be considerably lower, which is very important.

Bear in mind that this dataset has 841,500 observations.

Let’s do a similar experiment with another discrete variable, zip:

varname = "zip"
counts = get_discrete_category_counts(myds, varname)
counts
## # A tibble: 26,458 x 2
## # Groups:   zip [26,458]
##    zip             count
##    <chr>           <int>
##  1 zip_value_18847   428
##  2 zip_value_18895   425
##  3 zip_value_25102   425
##  4 zip_value_2986    422
##  5 zip_value_1842    414
##  6 zip_value_25718   410
##  7 zip_value_3371    397
##  8 zip_value_11638   395
##  9 zip_value_4761    394
## 10 zip_value_6746    391
## # ... with 26,448 more rows

Again, let’s check these counts more deeply by plotting them over different ranges:

plot_discrete_category_counts(myds, varname, 1, 1000)

plot_discrete_category_counts(myds, varname, 1001, 2500)

plot_discrete_category_counts(myds, varname, 2501)

In this case, if we keep the first 2500 categories (those with the highest number of observations) and replace the rest (26458 - 2500 = 23958) with a generic value like Not Specified, then every category we keep has been observed more than roughly 100 times.
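
The same collapse would apply here with the top-2500 cut:

# Keep the 2500 most frequent zip levels, collapse the other 23958.
myds$zip = fct_lump_n(myds$zip, n = 2500, other_level = "Not Specified")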

As in the first experiment, this level reduction considerably lowers the computational effort of training the model on the 841,500 observations.

Highlight:

In the second case we were analyzing the zip code, which is a very common discrete variable, so I would say this is a common use case.

My Question:

What would be a good threshold number or formula X such that, for a given discrete variable, categories/values that have been seen fewer than X times in the dataset are worth removing?

For example, if catVar_08 above has some categories/values that have been seen only 5 times, we should probably not take them into consideration and instead replace them with Not Specified, because the machine learning model will most likely not be able to learn anything useful from categories with so few observations.
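
Checking the impact of a candidate threshold X is cheap, for example:

X = 5
sum(counts[["count"]] < X)                    # levels that would be collapsed
sum(counts[["count"]][counts[["count"]] < X]) # observations affected by the collapse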

One possible signature for such a function would be:

discrete_category_counts = <available above>
num_rows_dataset = <available above>
num_cols_dataset = <available above>
arg_4 = ...
arg_5 = ...
...
get_discrete_var_threshold_index = function(
  discrete_category_counts,
  num_rows_dataset,
  num_cols_dataset,
  arg_4,
  arg_5,
  ...
) {
  index = ... # threshold index up to which we keep the categories with the highest frequencies
  return (index)
}

Any ideas or recommendations about this?

So far I have the following function, but this is only my first try; surely there are much better versions:

get_discrete_var_threshold_index = function(counts) {
  t_obs = 40     # minimum number of observations for a category to be kept
  t_elems = 2500 # maximum number of levels for a discrete variable
  if (max(counts[["count"]]) > t_obs) {
    # Index of the category whose count is closest to t_obs.
    # That count could still be below t_obs; in that case we rectify
    # by decreasing the index by 1.
    index = which.min(abs(counts[["count"]] - t_obs))
    if (counts[["count"]][index] < t_obs && index - 1 >= 1) {
      # The selected category has fewer than t_obs observations, so step
      # back one position; since counts are sorted in decreasing order,
      # the previous category is guaranteed to have more than t_obs.
      index = index - 1
    }
    index = min(index, t_elems) # also respect the maximum-levels threshold
  } else {
    # All categories fall below the standard threshold t_obs.
    # Special case: keep everything except categories with a single observation.
    index_1 = which.min(abs(counts[["count"]] - 1)) # first category with about one observation
    index = min(nrow(counts), t_elems, index_1 - 1)
  }
  return(index)
}
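
For illustration, using this function end to end together with the counting helper might look as follows (forcats::fct_other keeps a given set of levels and collapses everything else):

counts = get_discrete_category_counts(myds, "catVar_08")
index = get_discrete_var_threshold_index(counts)

# Keep the first `index` (most frequent) categories, collapse the rest.
keep_levels = counts[["catVar_08"]][1:index]
myds$catVar_08 = forcats::fct_other(myds$catVar_08,
                                    keep = keep_levels,
                                    other_level = "Not Specified")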

Thanks!
