Summary of categorical data

Hi, when I used the summary() function on a data frame (containing both numerical and categorical variables), the summary of the categorical variables showed only the length, class and mode. I was expecting to see the summary in terms of the levels. Do I need to install a package to see that, or is there some other problem?

I'd need to know what you expect to get from such a function. I'm going to assume you need to compute a level-wise summary of the data using a list of functions (here I use the mean and SD as an example).

You can try with the combination of group_by() and summarise() from the package dplyr. I don't have the code or data you are using but it would look like this:

library(dplyr)

data %>%
    # group observations from the same level of categorical_variable
    group_by(categorical_variable) %>%
    summarise(mn1 = mean(numeric_variable1), # mean of numeric_variable1 for each level of categorical_variable
              std1 = sd(numeric_variable1),
              mn2 = mean(numeric_variable2),
              std2 = sd(numeric_variable2))

You can also group observations from combinations of levels from two categorical variables using group_by(categorical_variable1, categorical_variable2).
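To make the two-variable grouping concrete, here is a minimal sketch on the built-in mtcars data set. Treating cyl and am as the two categorical variables is my own choice for illustration, not something from the original question, and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# cyl (number of cylinders) and am (transmission type) stand in for
# categorical_variable1 and categorical_variable2; this is an assumption
# made only so the example runs on a built-in data set.
by_group <- mtcars %>%
  group_by(cyl, am) %>%              # one group per cyl/am combination
  summarise(
    n        = n(),                  # number of cars in the group
    mean_mpg = mean(mpg),            # mean of a numeric variable per group
    sd_mpg   = sd(mpg),              # standard deviation per group
    .groups  = "drop"                # return an ungrouped result
  )

print(by_group)
```

The result has one row per combination of levels that actually occurs in the data.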

Hope this addresses your problem!

NOTE: Edited to indent the code properly.

If you run summary(iris), you'll see:

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500 

The column Species is a factor (not just a vector of characters) so the summary breaks it down by level.
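A quick way to see the difference is to summarise the same values once as a character vector and once as a factor. The toy data frame below (df, group, value) is made up for illustration.

```r
# Hypothetical toy data; stringsAsFactors = FALSE keeps `group` as character
# regardless of the R version in use.
df <- data.frame(
  group = c("a", "b", "a", "c"),
  value = c(1.5, 2.0, 3.5, 4.0),
  stringsAsFactors = FALSE
)

class(df$group)                         # check what type the column really is

chr_summary <- summary(df$group)        # character: Length, Class, Mode
fct_summary <- summary(factor(df$group))  # factor: one count per level

print(chr_summary)
print(fct_summary)
```

Running class() or str() on your own data frame shows whether the columns you expect to be factors are in fact character vectors.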

I suspect that the data frame you're considering doesn't have factors. Is that part of some tutorial you're following? Previous versions of R (< 4.0.0) used to automatically turn strings into factors. This is no longer the case. So if you're following old-ish code, the behaviour will be a bit different; you'll see vectors of characters where you probably expect factors.

Either use stringsAsFactors = TRUE when building the data frames, or set the global option with options(stringsAsFactors = TRUE) to emulate what would happen in R < 4.0.0.
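Here is a small sketch of the per-data-frame route, plus a way to convert character columns after the fact if the data frame already exists. The names df_a and df_b and their columns are made up for illustration. As far as I know, setting the global option triggers a deprecation warning in R 4.0.0 and later, so the sketch sticks to the explicit argument.

```r
# Option 1: ask for factors when the data frame is built.
df_a <- data.frame(
  group = c("a", "b", "a", "c"),
  value = c(1.5, 2.0, 3.5, 4.0),
  stringsAsFactors = TRUE
)

# Option 2: the data frame already exists with character columns,
# so convert every character column to a factor afterwards.
df_b <- data.frame(
  group = c("a", "b", "a", "c"),
  value = c(1.5, 2.0, 3.5, 4.0),
  stringsAsFactors = FALSE
)
chr_cols <- vapply(df_b, is.character, logical(1))  # which columns are character
df_b[chr_cols] <- lapply(df_b[chr_cols], factor)    # convert only those columns

summary(df_a)  # `group` is now summarised by level
summary(df_b)  # same result as for df_a
```

Functions that read files, such as read.csv(), accept the same stringsAsFactors argument.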

Thank you for the suggestion. I am using R > 4.0.0 but was expecting the strings to automatically turn into factors. After setting the global option to TRUE, it works now!

Thank you for the reply. I was expecting the strings to automatically turn into factors and the summary to be shown accordingly. Anyway, I have followed the suggestion given by ChrisL and it works now.
