How to extract part of a string from columns, paste into a new column

Hello everyone,

I have a data frame as below in the code and would like to get a data frame as shown in the second picture.
I know I can do it in excel with a lot of coping and past, but since my data set is big and missy i thought there must be a way to do it in R.

dat = data.frame(from_1 = c(letters[1:5],rep(NA,10)),  
                 to_1 = c(LETTERS[1:5],rep(NA,10)), 
                 from_2 = c(letters[1:2],rep(NA,13)), 
                 to_2 = c(LETTERS[1:2],rep(NA,13)),  
                 from_3= c(rep(NA,11),letters[7:10]), 
                 to_3 = c(rep(NA,11),LETTERS[7:10]),  
                 from_4= c(rep(NA,8),letters[9:15]), 
                 to_4 = c(rep(NA,8),LETTERS[9:15]))

This is just an example of my dataset; if it were, I would do it as below.

dat$from = letters[1:15]
dat$to = LETTERS[1:15]

I app
pic_1


reciate your help.
Best,
Adr

So, if I understand correctly your data at the start is a data frame containing columns named from_n and columns named to_n. Many of the values are NA.

For each row, you want to select the non-NA value across all the from_n columns where n is the highest, and the non-NA value from the to_n column with the largest n. Am I understanding correctly? On your second picture, rows 6, 7, and 8 have values f, g and h, I don't understand where they come from?

One way to do that is to use dplyr::coalesce() on each row. This function returns the first non-NA element. Since you want the last, you can use rev() before.

Of course, we have to treat to_* and from_* separately. So here is a solution, not a beautiful one, but it works:

dat = data.frame(from_1 = c(letters[1:5],rep(NA,10)),  
                 to_1 = c(LETTERS[1:5],rep(NA,10)), 
                 from_2 = c(letters[1:2],rep(NA,13)), 
                 to_2 = c(LETTERS[1:2],rep(NA,13)),  
                 from_3= c(rep(NA,11),letters[7:10]), 
                 to_3 = c(rep(NA,11),LETTERS[7:10]),  
                 from_4= c(rep(NA,8),letters[9:15]), 
                 to_4 = c(rep(NA,8),LETTERS[9:15]))

dat_from <- dat[, startsWith(colnames(dat), "from_")]
dat_to <- dat[, startsWith(colnames(dat), "from_")]

new_from <- do.call(dplyr::coalesce, rev(dat_from))
new_to <- do.call(dplyr::coalesce, rev(dat_to))

cbind(dat, from = new_from, to = new_to)
#>    from_1 to_1 from_2 to_2 from_3 to_3 from_4 to_4 from   to
#> 1       a    A      a    A   <NA> <NA>   <NA> <NA>    a    a
#> 2       b    B      b    B   <NA> <NA>   <NA> <NA>    b    b
#> 3       c    C   <NA> <NA>   <NA> <NA>   <NA> <NA>    c    c
#> 4       d    D   <NA> <NA>   <NA> <NA>   <NA> <NA>    d    d
#> 5       e    E   <NA> <NA>   <NA> <NA>   <NA> <NA>    e    e
#> 6    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 7    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 8    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 9    <NA> <NA>   <NA> <NA>   <NA> <NA>      i    I    i    i
#> 10   <NA> <NA>   <NA> <NA>   <NA> <NA>      j    J    j    j
#> 11   <NA> <NA>   <NA> <NA>   <NA> <NA>      k    K    k    k
#> 12   <NA> <NA>   <NA> <NA>      g    G      l    L    l    l
#> 13   <NA> <NA>   <NA> <NA>      h    H      m    M    m    m
#> 14   <NA> <NA>   <NA> <NA>      i    I      n    N    n    n
#> 15   <NA> <NA>   <NA> <NA>      j    J      o    O    o    o

Created on 2022-11-07 by the reprex package (v2.0.1)

Although now that I wrote it, coalesce() might not be that clear here. It might be easier to rewrite a new function:

dat = data.frame(from_1 = c(letters[1:5],rep(NA,10)),  
                 to_1 = c(LETTERS[1:5],rep(NA,10)), 
                 from_2 = c(letters[1:2],rep(NA,13)), 
                 to_2 = c(LETTERS[1:2],rep(NA,13)),  
                 from_3= c(rep(NA,11),letters[7:10]), 
                 to_3 = c(rep(NA,11),LETTERS[7:10]),  
                 from_4= c(rep(NA,8),letters[9:15]), 
                 to_4 = c(rep(NA,8),LETTERS[9:15]))



last_non_na <- function(x){
  
  non_na <- which(! is.na(x))
  if(all(! non_na)) return(NA)
  x[max(non_na)]
}


dat_from <- dat[, startsWith(colnames(dat), "from_")]
dat_to <- dat[, startsWith(colnames(dat), "from_")]


new_from <- apply(dat_from, 1, last_non_na)
new_to <- apply(dat_to, 1, last_non_na)


cbind(dat, from = new_from, to = new_to)
#>    from_1 to_1 from_2 to_2 from_3 to_3 from_4 to_4 from   to
#> 1       a    A      a    A   <NA> <NA>   <NA> <NA>    a    a
#> 2       b    B      b    B   <NA> <NA>   <NA> <NA>    b    b
#> 3       c    C   <NA> <NA>   <NA> <NA>   <NA> <NA>    c    c
#> 4       d    D   <NA> <NA>   <NA> <NA>   <NA> <NA>    d    d
#> 5       e    E   <NA> <NA>   <NA> <NA>   <NA> <NA>    e    e
#> 6    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 7    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 8    <NA> <NA>   <NA> <NA>   <NA> <NA>   <NA> <NA> <NA> <NA>
#> 9    <NA> <NA>   <NA> <NA>   <NA> <NA>      i    I    i    i
#> 10   <NA> <NA>   <NA> <NA>   <NA> <NA>      j    J    j    j
#> 11   <NA> <NA>   <NA> <NA>   <NA> <NA>      k    K    k    k
#> 12   <NA> <NA>   <NA> <NA>      g    G      l    L    l    l
#> 13   <NA> <NA>   <NA> <NA>      h    H      m    M    m    m
#> 14   <NA> <NA>   <NA> <NA>      i    I      n    N    n    n
#> 15   <NA> <NA>   <NA> <NA>      j    J      o    O    o    o

Created on 2022-11-07 by the reprex package (v2.0.1)

And here is a tidyverse approach, which is less efficient in this case, but could be easier to expand upon:

dat = data.frame(from_1 = c(letters[1:5],rep(NA,10)),  
                 to_1 = c(LETTERS[1:5],rep(NA,10)), 
                 from_2 = c(letters[1:2],rep(NA,13)), 
                 to_2 = c(LETTERS[1:2],rep(NA,13)),  
                 from_3= c(rep(NA,11),letters[7:10]), 
                 to_3 = c(rep(NA,11),LETTERS[7:10]),  
                 from_4= c(rep(NA,8),letters[9:15]), 
                 to_4 = c(rep(NA,8),LETTERS[9:15]))
library(tidyverse)

last_non_na <- function(x){
  
  non_na <- which(! is.na(x))
  if(all(! non_na)) return(NA)
  x[max(non_na)]
}

dat |>
  mutate(row_number = row_number()) |>
  pivot_longer(-row_number) |>
  separate(name,
           into = c("type", "n")) |>
  group_by(row_number, type) |>
  summarize(last_value = last_non_na(value),
            .groups = 'drop') |>
  pivot_wider(names_from = "type",
              values_from = "last_value")
#> # A tibble: 15 × 3
#>    row_number from  to   
#>         <int> <chr> <chr>
#>  1          1 a     A    
#>  2          2 b     B    
#>  3          3 c     C    
#>  4          4 d     D    
#>  5          5 e     E    
#>  6          6 <NA>  <NA> 
#>  7          7 <NA>  <NA> 
#>  8          8 <NA>  <NA> 
#>  9          9 i     I    
#> 10         10 j     J    
#> 11         11 k     K    
#> 12         12 l     L    
#> 13         13 m     M    
#> 14         14 n     N    
#> 15         15 o     O

Created on 2022-11-07 by the reprex package (v2.0.1)

1 Like

@AlexisW , Thanks, that was super coool!

Maybe a general question, I want to improve my coding skill and coding thinking, would be great if you could suggest to me the top two ways how to do it!
Thanks!

That really depends on you. I learned a lot by reading r4ds and answering questions on this forum, but there are also Youtube videos if that's more your style, the RStudio website also has resources. For coding skill I think it's also useful to do exercises like Project Euler or codewars. I never participated, but Tidy Tuesdays seems like a good way to learn about data analysis...

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.