Finding primary keys the tidy way

I want to find the primary key (unique row identifier) within a single table, using the tidyverse.

My current method is to use dplyr::add_count() on an accumulating number of column names, and then run filter(..., n > 1) to discover results which yield zero rows.


# Test data set
dupes <- tibble(id = sample(letters[1:5], 20, TRUE),
                date = Sys.Date() - sample(0:3, 20, TRUE),
                category = sample(LETTERS[1:5], 20, TRUE),
                uni = 1:20)

# This is the gist of the idea
dupes %>%
  add_count(id) %>%
  filter(n > 1)

# The tidy way
dupes %>%
  names() %>%
  accumulate(c) %>%
  map(syms) %>%
    dupes %>%
      add_count(!!!.x) %>%
      filter(.data$n > 1) %>%
      nrow() %>%
  }) %>%
  which()  # this tells me how many combinations it takes to find the key

Is there an automagic way of doing this? My method does not create all possible combinations of column names, and is only a rough guess.

For datasets with a large number of columns, running such a script on all possible combinations of column names would take a considerably large amount of time to run.

I strongly suspect that I'm misinterpreting your question. You want a column to serve as a unique row identifier?

tibble::rownames_to_col(YOURDATA, var = "recid")

An odd (yet fun) question.

First it in practice would be much better to know what set of columns were intended to be primary keys.

How about trying the set of all of the columns as primary key and then incrementally seeing if a column can be dropped and maintain the keying property? This finds a non-shrinkable set (but not guaranteed to be minimal size).

If we ignore practical restrictions, this is fun. If there are practical restrictions, then the best solution probably includes hard investigative work (as is sadly the case for most things in practice).

But let's ignore them for now!

The base package has anyDuplicated, which is a function for quickly finding the first non-unique row.

anyDuplicated(dupes[, c("date", "category")])
# [1] 5
dupes[1:5, c("date", "category")]
# # A tibble: 5 x 2
#   date       category
#   <date>     <chr>   
# 1 2019-06-04 D       
# 2 2019-06-02 B       
# 3 2019-06-03 B       
# 4 2019-06-01 A       
# 5 2019-06-03 B

If there is no duplicate, it returns 0. So let's make this into a function that checks if a combination of columns has only unique rows.

only_uniques <- function(cols) {
  first_dupe <- anyDuplicated(dupes[, cols])
  first_dupe == 0

The combn function gives all combinations of n elements from a vector. With lapply, we can do this for all possible combination sizes, from 1 to the number of columns in dupes.

combo_list <- lapply(
  X = seq_along(dupes),
  FUN = combn,
  x = names(dupes),
  simplify = FALSE

For sanity, let's organize the combinations and their validity in a tibble.

combos <- tibble(columns = unlist(combo_list, recursive = FALSE)) %>%
  mutate(valid = vapply(columns, only_uniques, logical(1))) # data.frame printing shows the contents of list columns
#                    columns valid
# 1                       id FALSE
# 2                     date FALSE
# 3                 category FALSE
# 4                      uni  TRUE
# 5                 id, date FALSE
# 6             id, category FALSE
# 7                  id, uni  TRUE
# 8           date, category FALSE
# 9                date, uni  TRUE
# 10           category, uni  TRUE
# 11      id, date, category  TRUE
# 12           id, date, uni  TRUE
# 13       id, category, uni  TRUE
# 14     date, category, uni  TRUE
# 15 id, date, category, uni  TRUE

I need to find the primary key within the data -- if any such key exists.

1 Like

That is a very good idea. Thank you very much! I will apply this to my real data set.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.