 Fast key generation for integer matrix/data frame (add an integer id for each unique row.)

This is what I have

a b c
1  1 1 1
2  1 1 1
3  1 1 1
4  1 2 2
5  1 2 2
6  1 1 1
7  1 3 3
8  2 4 4
9  1 3 3
10 1 2 2

And this is what I need to create from it:

a b c key
1  1 1 1 1
2  1 1 1 1
3  1 1 1 1
4  1 2 2 2
5  1 2 2 2
6  1 1 1 1
7  1 3 3 3
8  2 4 4 4
9  1 3 3 3
10 1 2 2 2

A sorted key would be great, but it's more important that it is fast. E.g. this one would also work.

a b c key
1  1 1 1 4
2  1 1 1 4
3  1 1 1 4
4  1 2 2 2
5  1 2 2 2
6  1 1 1 4
7  1 3 3 3
8  2 4 4 1
9  1 3 3 3
10 1 2 2 2

Hi,

You can do this by creating factors and then converting them into integers like so

library(dplyr)

#Data
df = data.frame(
a = c(1,1,1,1,1,1,1,2,1,1),
b = c(1,1,1,2,2,1,3,4,3,2),
c = c(1,1,1,2,2,1,3,4,3,2)
)

#Base R implementation
df\$key = as.integer(as.factor(paste(df\$a, df\$b, df\$c, sep = "")))

#Tidyverse implementation
df = df %>% mutate(
key = paste(a, b, c, sep = "") %>%
as.factor() %>% as.integer()
)

df
#>    a b c key
#> 1  1 1 1   1
#> 2  1 1 1   1
#> 3  1 1 1   1
#> 4  1 2 2   2
#> 5  1 2 2   2
#> 6  1 1 1   1
#> 7  1 3 3   3
#> 8  2 4 4   4
#> 9  1 3 3   3
#> 10 1 2 2   2

Created on 2021-12-12 by the reprex package (v2.0.1)

You first paste the columns abc into a string, then make a factor of this string, then convert it into and integer.

Hope this helps,
PJ

1 Like

Hi Pieter, thanks for this solution. Unfortunately this one doesn't scale very well as it seems to grow at least linearly with nrow(df). See benchmark below

df_x <- data.frame(a = sample(100000),
b = sample(100000),
c = sample(100000))
df_100    <- df_x[1:100, ]
df_1000   <- df_x[1:1000, ]
df_10000  <- df_x[1:10000, ]
df_100000 <- df_x[1:100000, ]
bench::mark(
u_100    = tidyr::unite(df_100   , col = v_pasted),
u_1000   = tidyr::unite(df_1000  , col = v_pasted),
u_10000  = tidyr::unite(df_10000 , col = v_pasted),
u_100000 = tidyr::unite(df_100000, col = v_pasted),
check = FALSE
)
#> # A tibble: 4 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 u_100       349.7us  525.9us   1547.      3.31KB     2.11   734     1      475ms
#> 2 u_1000       1.75ms   2.01ms    442.     31.44KB     0      222     0      503ms
#> 3 u_10000     20.83ms   22.8ms     41.4   312.69KB     0       21     0      507ms
#> 4 u_100000   201.24ms 206.15ms      4.61    3.05MB     0        3     0      650ms
#> # ... with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
bench::mark(
p_100    = paste(df_100\$a   , df_100\$b   , df_100\$c),
p_1000   = paste(df_1000\$a  , df_1000\$b  , df_1000\$c),
p_10000  = paste(df_10000\$a , df_10000\$b , df_10000\$c),
p_100000 = paste(df_100000\$a, df_100000\$b, df_100000\$c),
check = FALSE
)
#> # A tibble: 4 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 p_100       161.7us  177.9us   4353.      3.31KB     2.27  1919     1      441ms
#> 2 p_1000       1.56ms    1.6ms    567.     31.44KB     0      284     0      501ms
#> 3 p_10000     21.52ms   22.4ms     35.8   312.69KB     0       18     0      502ms
#> 4 p_100000   210.92ms  223.6ms      4.21    3.05MB     0        3     0      713ms
#> # ... with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

This seems intuitive to me, the 'algorithm' requires every row to be looked at. There is no scenario in which you can skip processing a row and get a reliable result for your algorithm, therefore linear scaling is the achievable ideal.

1 Like

I came up with a matrix based method, on my benchmarking it scales the same linearly as expected but is maybe something like a constant 4x faster that the other base or tidy method I attempted.

matrix_approach <- function(df_base){
m1 <- as.matrix(df_base)
cbind(m1,factor(m1[,"a"]*100+m1[,"b"]*10+m1[,"c"]))
}

full code

library(dplyr)

#Data
df_x <- data.frame(a = sample(100000),
b = sample(100000),
c = sample(100000))
df_1000   <- df_x[1:1000, ]
df_10000  <- df_x[1:10000, ]
df_100000 <- df_x[1:100000, ]

#Base R implementation
base_approach <- function(df){
df\$key = as.integer(as.factor(paste(df\$a, df\$b, df\$c, sep = "")))
df
}
#Tidyverse implementation
tidy_approach <- function(df)
{df %>% mutate(
key = paste(a, b, c, sep = "") %>%
as.factor() %>% as.integer()

)}
matrix_approach <- function(df_base){
m1 <- as.matrix(df_base)
cbind(m1,factor(m1[,"a"]*100+m1[,"b"]*10+m1[,"c"]))
}

library(microbenchmark)
microbenchmark(
base_1000 = base_approach(df_1000),
base_10000 = base_approach(df_10000),
base_100000 = base_approach(df_100000),
tidy_1000 = tidy_approach(df_1000),
tidy_10000 = tidy_approach(df_10000),
tidy_100000 = tidy_approach(df_100000),
matrix_1000 = matrix_approach(df_1000),
matrix_10000 = matrix_approach(df_10000),
matrix_100000 = matrix_approach(df_100000)
)

results

Unit: milliseconds
expr      min         lq        mean     median         uq       max neval  cld
base_1000   4.5995    5.37785    6.122804    5.74590    6.24020   15.8473   100 a
base_10000  60.7216   70.26135   86.276763   74.33305   80.35400  436.7080   100  b
base_100000 922.2126 1049.32145 1165.463113 1110.50335 1154.88385 2672.0201   100    d
tidy_1000   6.1317    7.35880    8.448057    7.88565    8.54585   33.5334   100 a
tidy_10000  64.8919   74.73555   86.286060   77.99445   82.90730  260.7543   100  b
tidy_100000 890.3916 1044.33520 1163.625317 1103.43745 1167.05805 2079.9020   100    d
matrix_1000   1.9455    2.57615    3.103023    2.78965    3.32965    8.1076   100 a
matrix_10000  20.1894   25.01425   29.750546   26.98115   29.78145  114.1647   100 a
matrix_100000 220.3871  253.94540  298.793029  273.26120  310.57860  834.4328   100   c

Hi Nir, you'r right. And thanks for the other idea.
Unfortunaley this doesn't give a unique key (see counter example below). I think the coefficients need to depend on the matrix itself. However, for large matrices (rows and cols) with lots of unique values the coefficients might explode very quick.

df <- data.frame(a = c(2,1),
b = c(2,12),
c = c(2,2))
f_m <- function(df){
m1 <- as.matrix(df)
factor(m1[,"a"]*100+m1[,"b"]*10+m1[,"c"])
}
f_m(df)
#>  222 222
#> Levels: 222

when optimising for speed, taking advantage of constraints known in advance on the inputs can be valuable.
I think your original example threw me off, as the integers were small numbers less than 10. So I wrongly assumed we could give each their own space in base 10.
This is obviously wrong, and so must we allow the three values to be any integer value (+/- 2billion) ?
I believe translating from integer to character might be a relatively expensive operation (if we need to paste these together rather than find ways to add them into a common composite number). but it might be necessary if these are large integers and non-trivial to pack

perhaps this for sorted key

matrix_approach2 <- function(df_base){
m1 <- as.matrix(df_base)
m2 <- m1[order(m1[,1],m1[,2],m1[,3]),]
cbind(m1,cumsum(ifelse(m2[,1]==lagm2[,1] & m2[,2]==lagm2[,2] & m2[,3]==lagm2[,3],0,1)))
}

Hi again,

@Tazinho, can I ask what your use case if for this method? I'm intrigued now and always love a problem that needs to work at scale, but sometimes you can find an alternative solution if you know the particular problem at hand.

The provided solutions are still pretty fast for even a large number of rows, but it indeed depends on how many unique values you expect and how many columns there are.

PJ

It's a bit hard to frame. I have several use cases which would benefit from this.

My initial use case was writing sth. very fast for counting strings (I encountered this, as the default for table() is very slow). I found this solution which is very fast for atomics in general.

table_v <- function(x) {
x_u <- if(is.factor(x)) x[!duplicated(x)] else (unique(x))
x_i <- match(x, x_u)  # or fastmatch::fmatch
t_x_i <- tabulate(x_i)
t_x_i <- list2DF(list(value = x_u, count = t_x_i[seq_along(x_u)]))

t_x_i
}

My next use case was counting rows. There are already some solutions like

# This one is slow
dplyr::count(df, a, b, c)

# This one is fast (but I believe it's poosible to speed it up)
library(data.table)
dt[, .(count = .N), by = c("a", "b", "c")]

The idea here is to translate df into a numeric data frame/matrix similar to the reprex above via lapply() and table_v() and then create a unique key.
This will allow very fast counting as one just needs to count the key-column now (again via table_v()). The main bottleneck left however ist creating the key and I am struggling to find a good solution (maybe order() could be, but it seems to resolve ties, which is not what I would want in this case).

The pseudo code for table_df() would look like:

table_df <- function(df) {

# Transform into integer matrix/data frame
df_x <- lapply(df, function(x) {
x_u <- if(is.factor(x)) x[!duplicated(x)] else (unique(x));
match(x, x_u) # or fastmatch::fmatch
}
df_x <- list2DF(df_x)

# Calculate a key
# key <- "some code for key"

# Calculate count
count <- table_v(key)[["count"]]

# Filter unique rows and append count
cbind(
df[!duplicated(key), ],
count
)
)

data.table has its own key creation syntax you could try.

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html

2 Likes

vctrs::vec_group_id() can help with this

df <- data.frame(
a = c(1,1,1,1,1,1,1,2,1,1),
b = c(1,1,1,2,2,1,3,4,3,2),
c = c(1,1,1,2,2,1,3,4,3,2)
)

df\$id <- vctrs::vec_group_id(df)

df
#>    a b c id
#> 1  1 1 1  1
#> 2  1 1 1  1
#> 3  1 1 1  1
#> 4  1 2 2  2
#> 5  1 2 2  2
#> 6  1 1 1  1
#> 7  1 3 3  3
#> 8  2 4 4  4
#> 9  1 3 3  3
#> 10 1 2 2  2
2 Likes

Thanks, that's perfect.

Unfortunately my final method is still little bit slower than data.table in most cases.
For reference.

table_df <- function(df, n = "n") {

# Calculate a key
key <- vctrs::vec_group_id(df)

# Calculate count
x_u <- unique(key)
x_i <- fastmatch::fmatch(key, x_u)
t_x_i <- tabulate(x_i)
count <- t_x_i[seq_along(x_u)]
#
df <- df[!duplicated(key), ]
df[[n]] <- count

df
}

I believe this is superfluous and you can do for the same result

t_x_i <- tabulate(key)

Nice. Then it's:

table_df <- function(df, n = "n") {

# Calculate a key
key <- vctrs::vec_group_id(df)

# Calculate count
count <- tabulate(key)

df <- df[!duplicated(key), ]
df[[n]] <- count

df
}

It's already faster than data.table at least in some cases. The remaining bottlenecks seems to be unique() and data-frame-subsetting.

EDIT: Changed unique() to !duplicated as this is the correct logic.
EDIT 2: The subsetting (assignment) is slow, because it triggers a copy

Oh now that I see what you are doing, it looks like you could also do this with vec_group_loc() and list_sizes(). But do you really just want a dplyr::count()? count() eventually does something pretty similar to the vec_group_loc() approach.

df <- data.frame(
a = c(1,1,1,1,1,1,1,2,1,1),
b = c(1,1,1,2,2,1,3,4,3,2),
c = c(1,1,1,2,2,1,3,4,3,2)
)

table_df <- function(df, n = "n") {
# Calculate a key
key <- vctrs::vec_group_id(df)

# Calculate count
count <- tabulate(key)

df <- df[unique(key), ]
df[[n]] <- count

df
}

table_df2 <- function(df, n = "n") {
result <- vctrs::vec_group_loc(df)
out <- result\$key
out[[n]] <- vctrs::list_sizes(result\$loc)
out
}

table_df(df)
#>   a b c n
#> 1 1 1 1 4
#> 2 1 1 1 3
#> 3 1 1 1 2
#> 4 1 2 2 1
table_df2(df)
#>   a b c n
#> 1 1 1 1 4
#> 2 1 2 2 3
#> 3 1 3 3 2
#> 4 2 4 4 1
dplyr::count(df, a, b, c)
#>   a b c n
#> 1 1 1 1 4
#> 2 1 2 2 3
#> 3 1 3 3 2
#> 4 2 4 4 1

Created on 2021-12-15 by the reprex package (v2.0.1)

1 Like

I think theres a logic error here.
the intention is to retain only the first rows of the unique key, but the concrete effect is to scramble the records. this is retaining the rows whos row numbers match the unique keys but the rownumbers have no logical relationship to the unique keys.

Nice catch. I believe I introduced this mistake above when I changed !duplicated() into unique(). Will edit my posts.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.