Combining vectors with varying names

Problem: I have a list of named vectors. The names are not always the same but I'd like to correctly stack them into a matrix which has a row for each vector and a column for each name.

Note - the input vectors may vary in length and may have differing names. It can be any number of vectors but my example shows 2.

This is not homework. I'm working on fixing a bug in a R package that uses sapply to create the vectors and then make the matrix but that ignores the names and then causes issues. Instead, I thought about using lapply to create the vectors but then still stuck on how to combine them. The package has no dependency on dplyr so I can't use bind_rows.

v1 <- c(5,6)
v2 <- c(7,8)
names(v1) <- c("x1", "x2")
names(v2) <- c("x1", "x3")

results <- list(v1, v2)

# desired output 
(deisred <- structure(c(5, 7, 6, NA, NA, 8), .Dim = 2:3, .Dimnames = list(
   NULL, c("x1", "x2", "x3"))))
#>      x1 x2 x3
#> [1,]  5  6 NA
#> [2,]  7 NA  8

# I can do this with dplyr but want to remove dependency on dplyr
as.matrix(dplyr::bind_rows(results))
#>      x1 x2 x3
#> [1,]  5  6 NA
#> [2,]  7 NA  8

# problem - disregards the name mismatch
do.call(rbind, results) 
#>      x1 x2
#> [1,]  5  6
#> [2,]  7  8

Created on 2022-03-30 by the reprex package (v2.0.1)

This gets to the goal, though it seems rather convoluted.

v1 <- c(5,6)
v2 <- c(7,8)
names(v1) <- c("x1", "x2")
names(v2) <- c("x1", "x3")

results <- list(v1, v2)

# desired output 
(desired <- structure(c(5, 7, 6, NA, NA, 8), .Dim = 2:3, .Dimnames = list(
  NULL, c("x1", "x2", "x3"))))
#>      x1 x2 x3
#> [1,]  5  6 NA
#> [2,]  7 NA  8

UL <- unlist(results)
UL
#> x1 x2 x1 x3 
#>  5  6  7  8
NMs <- unique(names(UL))
MakeRows <- function(Vec, NAMES) {
  tmp <- sapply(NAMES, function(Nm) Vec[Nm],USE.NAMES = FALSE)
  names(tmp) <- NAMES
  return(tmp)
}
OutList <- lapply(results, MakeRows, NAMES = NMs)
do.call(rbind, OutList)
#>      x1 x2 x3
#> [1,]  5  6 NA
#> [2,]  7 NA  8

Created on 2022-03-30 by the reprex package (v2.0.1)

A little simpler:

UL <- unlist(results)
NMs <- unique(names(UL))
RowFunc <- function(Vec){
  tmp <- Vec[NMs]
  names(tmp) <- NMs
  return(tmp)
}
OutList <- lapply(results, RowFunc)
do.call(rbind, OutList)
1 Like

An alternative way is by coercing the vectors into data frames, and then transposing them, and merged them by using merge function, with specifying all.x = TRUE or all.y = TRUE. By this specification, x3 in v2 that has no matching in v1 will be included with NA. A more detailed explanation is available in ?merge.

v1
# x1 x2 
# 5  6 
 v2
# x1 x3 
 # 7  8 

row1 <- merge(t(as.data.frame(v1)),  t(as.data.frame(v2)),  all.x = TRUE);  row1
#     x1 x2 x3
# 1  5  6 NA

row2 <- merge(t(as.data.frame(v1)),  t(as.data.frame(v2)),  all.y = TRUE);  row2
 #   x1 x2 x3
#1  7 NA  8

rbind(row1, row2)
#   x1 x2 x3
#1  5  6 NA
#2  7 NA  8

The result is a data frame. If necessary, it can be coerced into a matrix by using as.matrix (rbind(row1, row2)).

1 Like

I thought this was a very interesting ask, and FJCC's solution is excellent.

Below I made a more complex version of the ask, requiring the merge of 3 such lists, each longer, the idea being to give profiling more of a change to show performance differences.

FJCC's solution I called 'f2_lapply_docall', I wanted to compare with direct matrix building and filling out with a loop. They seem to be roughly on a par, if you have expectations whether you would be merging very many more lists at a time, and what typical length of lists would be you might get more relevant performance metrics than the ones here.

set.seed(42)

v1 <- sample.int(1000,1000,replace=TRUE)
v2 <- sample.int(1000,1000,replace=TRUE)
v3 <- sample.int(1000,1000,replace=TRUE)
names(v1) <- sort(paste0("x",formatC(sample.int(3000,1000,replace=FALSE),width=4,flag="0")))
names(v2) <- sort(paste0("x",formatC(sample.int(3000,1000,replace=FALSE),width=4,flag="0")))
names(v3) <- sort(paste0("x",formatC(sample.int(3000,1000,replace=FALSE),width=4,flag="0")))

results <- list(v1, v2,v3)

desired <-   as.matrix(dplyr::bind_rows(results))

f1_dplyr <- function(r){
  as.matrix(dplyr::bind_rows(r))
}




f2_lapply_docall <- function(r){
  UL <- unlist(r)
  NMs <- unique(names(UL))
  RowFunc <- function(Vec){
    tmp <- Vec[NMs]
    names(tmp) <- NMs
    return(tmp)
  }
  OutList <- lapply(r, RowFunc)
  do.call(rbind, OutList)
}


f3_matrix_forloop <- function(r){
  NMs <- unique(names(unlist(r)))
  lengthr <- length(r)
  newmat <- matrix(nrow=lengthr,
            ncol = length(NMs),dimnames = list(NULL,
                                               NMs))
for(row in seq_len(lengthr)){
  newmat[row ,] <- r[[row]][NMs]
}
  newmat
}

library(bench)
mark(f1_dplyr(results),
     f2_lapply_docall(results),
     f3_matrix_forloop(results),
     time_unit = "ns",
     min_time = 5)

# A tibble: 3 x 13
  expression                       min    median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc  total_time result     memory    
  <bch:expr>                     <dbl>     <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>       <dbl> <list>     <list>    
1 f1_dplyr(results)          14217100. 17419950.      56.1     723KB     17.0   208    63 3706866100. <int[...]> <Rprofmem>
2 f2_lapply_docall(results)    455300.   560800.    1622.      586KB     18.4  7499    85 4623899600. <int[...]> <Rprofmem>
3 f3_matrix_forloop(results)   476100.   609000.    1569.      636KB     19.4  7209    89 4595938200.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.