Hi Marit,
I've had a similar need and created a function to count and extract the unique values (sorted) for all columns (sorted).
library(tidyverse)
get_unique_values <- function(df) {
cn <- df %>% names() ## column names
out_n <- vector("character", length(cn)) ## number of unique values
out_v <- vector("character", length(cn)) ## unique values
section_width <- 144
# get unique values -------------------------------------------------------
for (i in seq_along(cn)) {
## skip processing when column is of type list ..
if (is.list(df[[i]])) {
out_v[i] <- "<column type: list>"
}
## .. else do process it
if (!is.list(df[[i]])) {
## `sort()` must be done last, while
## numeric values should NOT be converted to characters
## (which is needed for sorting values in factoral columns)
uv <-
df[i] %>%
unique() %>%
## only use `pull()` on data.frame objects
when(
any(class(df) == "data.frame") ~ pull(.),
~ .
) %>%
## conditionally convert column type factor to character, so that
## `sort()` works as expected
when(
any(class(df[[i]]) == "factor") ~ as.character(.),
~ .
) %>%
sort()
out_n[i] <- uv %>% length()
out_v[i] <- uv %>% str_c(collapse = ", ")
}
}
# format output result ----------------------------------------------------
## add padding to column names
cn <-
str_pad(cn,
max(str_length(cn)),
side = "right"
)
## add padding before number of distinct values
out_n <-
str_pad(out_n,
max(str_length(out_n))
)
## create final ouput having: column names|nr. of distinct values|distinct values
out <- str_c(cn, "|", out_n, "|", out_v)
## take into account the required space for the indices
i_width <- str_length(length(cn)) + 5
## remove stuff if output is longer then specified section_width
out <-
if_else((str_length(out) + i_width) > section_width,
## cutoff values
str_c(str_sub(out, end = section_width - i_width), ".."),
out
)
# return result -----------------------------------------------------------
out %>% sort()
}
Created on 2021-01-30 by the reprex package (v1.0.0)
And although I really value the available summary functionalities, sometimes listing only the (sorted) unique values can be quite insightful, see for example:
mpg %>% glimpse()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
#> $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
#> $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
#> $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
#> $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
#> $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
#> $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
#> $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
#> $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
#> $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
#> $ class <chr> "compact", "compact", "compact", "compact", "compact",...
mpg %>% get_unique_values()
#> [1] "class | 7|2seater, compact, midsize, minivan, pickup, subcompact, suv"
#> [2] "cty |21|9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 33, 35"
#> [3] "cyl | 4|4, 5, 6, 8"
#> [4] "displ |35|1.6, 1.8, 1.9, 2, 2.2, 2.4, 2.5, 2.7, 2.8, 3, 3.1, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 4.2, 4.4, 4.6, 4.7, 5, 5.2, 5.3,.."
#> [5] "drv | 3|4, f, r"
#> [6] "fl | 5|c, d, e, p, r"
#> [7] "hwy |27|12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 41, 44"
#> [8] "manufacturer|15|audi, chevrolet, dodge, ford, honda, hyundai, jeep, land rover, lincoln, mercury, nissan, pontiac, subaru, toyota, volksw.."
#> [9] "model |38|4runner 4wd, a4, a4 quattro, a6 quattro, altima, c1500 suburban 2wd, camry, camry solara, caravan 2wd, civic, corolla, co.."
#> [10] "trans |10|auto(av), auto(l3), auto(l4), auto(l5), auto(l6), auto(s4), auto(s5), auto(s6), manual(m5), manual(m6)"
#> [11] "year | 2|1999, 2008"
Created on 2021-01-30 by the reprex package (v1.0.0)
As a matter of fact, I've included the above function in another function show_summaries()
, which I use regularly when working with new data in an R Notebook. This function (which is a bit more robust, see here) makes also use of other available summary functionalities:
-
glimpse()
from the dplyr
package
-
skim()
from the skimr
package
-
describe()
from the Hmisc
package
Maybe you find this useful as well.