How to identify columns with the same number of categories?

Nile · November 18, 2022, 6:34pm

Hi,
I have a dataset of over 100 columns with varying numbers of categories, e.g. some are dichotomous, some have 4, and some have 7 categories. My objective is to identify the columns that share the same number of categories so that I can process them further.

Here is some sample data:

df= data.frame(col1= sample(1:4, 10, replace = T), col2= sample(11:14, 10, replace = T), 
               col3= sample(1:38, 10, replace = T), col4= sample(11:49, 10, replace = T), 
               col5= sample(1:64, 10, replace = T) , col6= sample(11:75, 10, replace = T)
               )

df
   col1 col2 col3 col4 col5 col6
1     2   13   22   41   45   72
2     1   13    6   44   59   68
3     3   12   19   22   34   43
4     3   13   19   40   16   18
5     1   12   13   40   56   28
6     1   11   36   20   37   15
7     2   13   31   42   25   61
8     4   11   31   45   33   35
9     3   14    8   40    3   50
10    2   14   22   45   44   22

Please let me know if there is any way to achieve that.

Thank you.

scottyd22 · November 18, 2022, 6:59pm

Below is one way to identify the number of unique values/categories within each column of df.

library(tidyverse)

df= data.frame(col1= sample(1:4, 10, replace = T), col2= sample(11:14, 10, replace = T), 
               col3= sample(1:38, 10, replace = T), col4= sample(11:49, 10, replace = T), 
               col5= sample(1:64, 10, replace = T) , col6= sample(11:75, 10, replace = T)
               )

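# For each column, count its unique values and keep the column name
categories = lapply(seq_along(df), 
                    function(i){
                      d = data.frame(nrow(unique(df[i])))  # df[i] is a one-column data frame
                      names(d) = names(df[i])
                      d
                      }
                    ) %>%
  bind_cols()  # combine the one-cell data frames into a single row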

categories
#>   col1 col2 col3 col4 col5 col6
#> 1    4    4    9   10   10    9

Created on 2022-11-18 with reprex v2.0.2.9000

EconProf · November 18, 2022, 7:36pm

There are usually many ways to do something in R. One alternative uses across() with n_distinct() inside summarise():

library(tidyverse)

df= data.frame(col1= sample(1:4, 10, replace = T), col2= sample(11:14, 10, replace = T), 
               col3= sample(1:38, 10, replace = T), col4= sample(11:49, 10, replace = T), 
               col5= sample(1:64, 10, replace = T) , col6= sample(11:75, 10, replace = T)
)
df
#>    col1 col2 col3 col4 col5 col6
#> 1     4   13    1   16   61   50
#> 2     1   13   18   24    8   55
#> 3     3   11   14   43   62   60
#> 4     4   13   18   20    5   20
#> 5     2   13   18   48   27   74
#> 6     3   13   25   15   13   20
#> 7     1   12   14   38   46   41
#> 8     1   14   25   14   20   29
#> 9     4   12    2   43    3   71
#> 10    1   12    8   38    7   23

categories <- df |> summarise(across(col1:col6, n_distinct))
categories
#>   col1 col2 col3 col4 col5 col6
#> 1    4    4    6    8   10    9

Created on 2022-11-18 with reprex v2.0.2
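
Both replies return the number of distinct values per column. The original question also asks which columns share the same count. One possible way to take that last step, in base R, is to split the column names by their counts. This is only a sketch: the toy data frame below is a fixed, hypothetical example (not the random sample from the thread), and the names counts, groups and binary_cols are illustrative.

```r
# Fixed toy data so the result is reproducible (hypothetical example)
df <- data.frame(
  col1 = c(1, 2, 3, 4, 1, 2),
  col2 = c(11, 12, 13, 14, 11, 12),
  col3 = c(0, 1, 0, 1, 0, 1),
  col4 = c(5, 6, 7, 5, 6, 7),
  col5 = c(1, 0, 1, 1, 0, 0)
)

# Number of distinct values in each column, as a named integer vector
counts <- sapply(df, function(x) length(unique(x)))

# Group the column names by their distinct-value count;
# the list names are the counts as character strings ("2", "3", "4")
groups <- split(names(counts), counts)

# Pull out all columns that have exactly 2 categories
binary_cols <- df[groups[["2"]]]
```

Here groups[["4"]] holds col1 and col2, and binary_cols contains col3 and col5. Note that unique() (like n_distinct() by default) treats NA as a value of its own, so columns with missing data may need the NAs dropped before counting.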
