shorten column names in data frame

Hi, I have a large matrix of data with column names like this:

ABCD-123A-1234-AB1AB1

I would like to shorten these to

ABCD-123A

Unfortunately, the second part (123A) can be 4 or 5 characters long, so I can't cut the names at a fixed length.
Is there a way to use gsub to replace everything from the second - onward with ""? Or is there another solution?

(example_matrix <- structure(c(1, 2), .Dim = 1:2, .Dimnames = list(NULL, c("ABCD-123A-1234-AB1AB1",
                                                                           "ABCD-123AX-1234-AB1AB1"))))

(long_names <- colnames(example_matrix))

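# Split each name at "-", keep the first two pieces, and rejoin them with "-"
# so the result matches the requested form (e.g. "ABCD-123A").
# vapply() returns a plain character vector rather than a list.
(short_names <- vapply(
  X = strsplit(x = long_names, split = "-"),
  FUN = function(x) paste0(head(x, n = 2), collapse = "-"),
  FUN.VALUE = character(1)))
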
colnames(example_matrix) <- short_names

example_matrix

Another option is to use regular expressions, a less readable but more direct approach.

example_matrix <- structure(c(1, 2), .Dim = 1:2, .Dimnames = list(NULL, c("ABCD-123A-1234-AB1AB1", 
                                                                         "ABCD-123AX-1234-AB1AB1")))

colnames(example_matrix) <- regmatches(colnames(example_matrix), regexpr("^.{4}-[^-]{4,5}", colnames(example_matrix)))

example_matrix
#>      ABCD-123A ABCD-123AX
#> [1,]         1          2
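
The pattern above assumes the first part is always exactly four characters. If that may not hold, a more general variant is possible. The following is a minimal sketch, not part of the original reply: it uses `sub()` to keep everything before the second hyphen, whatever the lengths of the first two parts. The vector `long_names` is just the example names from this thread.

```r
long_names <- c("ABCD-123A-1234-AB1AB1", "ABCD-123AX-1234-AB1AB1")

# Capture "<no hyphens>-<no hyphens>" at the start of the string,
# then replace the whole string with just that captured group.
short_names <- sub("^([^-]+-[^-]+).*$", "\\1", long_names)

short_names
```

Names containing fewer than two hyphens are returned unchanged, because either nothing follows the captured group or the pattern does not match at all.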

Created on 2022-07-26 by the reprex package (v2.0.1)


Thanks so much, this worked! Could you break down how it works so I can adapt it for future use? I'm guessing split is where to cut and n = 2 says to split at the 2nd instance, but I'm not sure about the collapse part. If you have a good resource on this instead, that would be great.
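
For anyone adapting the strsplit approach, here is a minimal step-by-step sketch of what each call does. It is not from the original reply, and the intermediate names `pieces`, `first_two` and `joined` are introduced only for illustration.

```r
long_names <- c("ABCD-123A-1234-AB1AB1", "ABCD-123AX-1234-AB1AB1")

# 1. strsplit() cuts every name at every "-" and returns a list,
#    with one character vector of pieces per name.
pieces <- strsplit(long_names, split = "-")
# pieces[[1]] is c("ABCD", "123A", "1234", "AB1AB1")

# 2. head(x, n = 2) keeps the first two pieces. It does not decide
#    where to split; the splitting already happened in step 1.
first_two <- head(pieces[[1]], n = 2)
# first_two is c("ABCD", "123A")

# 3. collapse is the string placed between the pieces when they are
#    glued back into one name: "-" keeps hyphens, "_" gives underscores.
joined <- paste0(first_two, collapse = "-")
# joined is "ABCD-123A"

# The same steps applied to every name at once.
short_names <- vapply(
  pieces,
  function(x) paste0(head(x, n = 2), collapse = "-"),
  character(1)
)
```

Changing `n` keeps more or fewer parts; for example, `n = 3` would keep the first three.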
