Exctract and count unique number/names across columns

Hi!

I have a data sett like this:

FristID FirstName SecondID SecondName
100 Ole 102 Per
105 Gunnar 105 Gunnar
106 May NA NA

I want to count the number of unique ID's across the colums, furtermore, I want to exctract the unqiue names from the name columns. So this would be the output dataframe:

FristID FirstName SecondID SecondName Uniquecounts Uniquenames
100 Ole 102 Per 2 Ole,Per
105 Gunnar 105 Gunnar 1 Gunnar
106 May NA NA 1 May

I have tried the apply function, but I am struggling with the NA's that also appears in my dataset. I do not want to count and extract the NA's.

Does anyone have suggestion on how to solve this?

Regards Marit

suppressPackageStartupMessages({
library(dplyr)
})

  DF <- data.frame(
      FristID = c(100, 105, 106), 
      FirstName = c("Ole", "Gunnar", "May"), 
      SecondID = c(102, 105, NA), 
      SecondName = c("Per", "Gunnar", NA)
      )

DF %>% mutate(
        Uniquecounts = case_when(
          FristID != SecondID            ~ 2,
          is.na(c(FristID == SecondID))  ~ 1,
          !is.na(c(FristID == SecondID)) ~ 1
        ),
        Uniquenames = FirstName)
#>   FristID FirstName SecondID SecondName Uniquecounts Uniquenames
#> 1     100       Ole      102        Per            2         Ole
#> 2     105    Gunnar      105     Gunnar            1      Gunnar
#> 3     106       May       NA       <NA>            1         May

Hi Marit,

I've had a similar need and created a function to count and extract the unique values (sorted) for all columns (sorted).

library(tidyverse)

get_unique_values <- function(df) {
  cn    <- df %>% names()                   ## column names
  out_n <- vector("character", length(cn))  ## number of unique values
  out_v <- vector("character", length(cn))  ## unique values

  section_width <- 144
  
  # get unique values -------------------------------------------------------
  
  for (i in seq_along(cn)) {
    ## skip processing when column is of type list ..
    if (is.list(df[[i]])) { 
      out_v[i] <- "<column type: list>"
    }
    
    ## .. else do process it
    if (!is.list(df[[i]])) {
      ## `sort()` must be done last, while
      ## numeric values should NOT be converted to characters
      ## (which is needed for sorting values in factoral columns)
      uv <- 
        df[i] %>%
        unique() %>%
        ## only use `pull()` on data.frame objects
        when(
          any(class(df) == "data.frame") ~ pull(.),
          ~ .
        ) %>% 
        ## conditionally convert column type factor to character, so that 
        ## `sort()` works as expected
        when(
          any(class(df[[i]]) == "factor") ~ as.character(.),
          ~ .
        ) %>% 
        sort()
      
      out_n[i] <- uv %>% length()
      out_v[i] <- uv %>% str_c(collapse = ", ")
    }
  }
  
  # format output result ----------------------------------------------------
  
  ## add padding to column names
  cn <- 
    str_pad(cn,
            max(str_length(cn)),
            side = "right"
    )
  
  ## add padding before number of distinct values
  out_n <- 
    str_pad(out_n,
            max(str_length(out_n))
    )
  
  ## create final ouput having: column names|nr. of distinct values|distinct values
  out <- str_c(cn, "|", out_n, "|", out_v)
  
  ## take into account the required space for the indices
  i_width <- str_length(length(cn)) + 5
  
  ## remove stuff if output is longer then specified section_width
  out <- 
    if_else((str_length(out) + i_width) > section_width,
            ## cutoff values
            str_c(str_sub(out, end = section_width - i_width), ".."),
            out
    )
  
  # return result -----------------------------------------------------------
  
  out %>% sort()
}

Created on 2021-01-30 by the reprex package (v1.0.0)

And although I really value the available summary functionalities, sometimes listing only the (sorted) unique values can be quite insightful, see for example:

mpg %>% glimpse()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi"...
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro"...
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0,...
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, ...
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, ...
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "a...
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4",...
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17...
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25...
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact",...
mpg %>% get_unique_values()
#>  [1] "class       | 7|2seater, compact, midsize, minivan, pickup, subcompact, suv"                                                                
#>  [2] "cty         |21|9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 33, 35"                                          
#>  [3] "cyl         | 4|4, 5, 6, 8"                                                                                                                 
#>  [4] "displ       |35|1.6, 1.8, 1.9, 2, 2.2, 2.4, 2.5, 2.7, 2.8, 3, 3.1, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 4.2, 4.4, 4.6, 4.7, 5, 5.2, 5.3,.."
#>  [5] "drv         | 3|4, f, r"                                                                                                                    
#>  [6] "fl          | 5|c, d, e, p, r"                                                                                                              
#>  [7] "hwy         |27|12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 41, 44"                 
#>  [8] "manufacturer|15|audi, chevrolet, dodge, ford, honda, hyundai, jeep, land rover, lincoln, mercury, nissan, pontiac, subaru, toyota, volksw.."
#>  [9] "model       |38|4runner 4wd, a4, a4 quattro, a6 quattro, altima, c1500 suburban 2wd, camry, camry solara, caravan 2wd, civic, corolla, co.."
#> [10] "trans       |10|auto(av), auto(l3), auto(l4), auto(l5), auto(l6), auto(s4), auto(s5), auto(s6), manual(m5), manual(m6)"                     
#> [11] "year        | 2|1999, 2008"

Created on 2021-01-30 by the reprex package (v1.0.0)

As a matter of fact, I've included the above function in another function show_summaries(), which I use regularly when working with new data in an R Notebook. This function (which is a bit more robust, see here) makes also use of other available summary functionalities:

  • glimpse() from the dplyr package
  • skim() from the skimr package
  • describe() from the Hmisc package

Maybe you find this useful as well.

Probably more so than the way in which the problem was phrased in the OP, which is what I addressed.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.