Tidyr::separate() at second/last occurence of character

agranholm · May 20, 2018, 8:20am

I am working with a dataset that I need to convert from wide form to tall form.
Each row is a single subject. For each subject, the same variable was measured once daily for several months. These data are saved in columns entitled "daily_measure_X", where each part of the name means the following:

"daily_" indicates that this is one of the variables measured daily, and is important, as some of them were also measured at other times, e.g. baseline (and would be titled "baseline_measure").
"measure" refers to the variable name - there's multiple variables, so the same pattern applies for additional variables - thus "measure" is only one of the variables here.
"X" is the day.

The included reprex produces the desired output - however, it is somewhat clumsy and inefficient, as it first separates "daily_measure_X" into three variables, splitting at each underscore, followed by re-uniting the first two new variables.

library(tidyr)
library(dplyr) # For select helper functions

# Input data
example_data <- data.frame(subject_ID = 1:5,
                           daily_measure_1 = 6:10,
                           daily_measure_2 = 11:15,
                           daily_measure_3 = 16:20,
                           daily_measure_4 = 21:25,
                           daily_measure_5 = 26:30)

example_data

#>   subject_ID daily_measure_1 daily_measure_2 daily_measure_3
#> 1          1               6              11              16
#> 2          2               7              12              17
#> 3          3               8              13              18
#> 4          4               9              14              19
#> 5          5              10              15              20
#>   daily_measure_4 daily_measure_5
#> 1              21              26
#> 2              22              27
#> 3              23              28
#> 4              24              29
#> 5              25              30


example_data %>% 
  gather(key = "full_name", value = "value", starts_with("daily_")) %>%
  separate(full_name, into = c("V1", "V2", "day")) %>% 
  unite("variable", V1, V2)

#>    subject_ID      variable day value
#> 1           1 daily_measure   1     6
#> 2           2 daily_measure   1     7
#> 3           3 daily_measure   1     8
#> 4           4 daily_measure   1     9
#> 5           5 daily_measure   1    10
#> 6           1 daily_measure   2    11
#> 7           2 daily_measure   2    12
#> 8           3 daily_measure   2    13
#> 9           4 daily_measure   2    14
#> 10          5 daily_measure   2    15
#> 11          1 daily_measure   3    16
#> 12          2 daily_measure   3    17
#> 13          3 daily_measure   3    18
#> 14          4 daily_measure   3    19
#> 15          5 daily_measure   3    20
#> 16          1 daily_measure   4    21
#> 17          2 daily_measure   4    22
#> 18          3 daily_measure   4    23
#> 19          4 daily_measure   4    24
#> 20          5 daily_measure   4    25
#> 21          1 daily_measure   5    26
#> 22          2 daily_measure   5    27
#> 23          3 daily_measure   5    28
#> 24          4 daily_measure   5    29
#> 25          5 daily_measure   5    30

Another approach would be to use stringr and replace the first "" with something else. Both approaches gets the job done, but I'm interested to hear if anybody could come up with a more elegant solution that splits at the second or ideally last occurence of the "" character?

Maybe it could be done using a regex (to detect the first occurence starting from the end of the string) , but I have no prior experience with this and have not been able to produce a solution using this approach.

Thanks in advance.

mara · May 20, 2018, 11:39am

As you guessed, a regular expression is your best bet here. From the tidyr separate() function reference

Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

RegExr.com has been my go-to for regular expression testing for a while, but Garrick Aden-Buie recently made an RStudio add-in inspired by the very same, RegExplain, which is a great option for keeping your workflow all in one place:
Garrick Aden-Buie - 🕵️‍♂️ RegExplain

Here's how I would do it (which is hacky, but it works); you want to split at the underscore that precedes the day, which is numerical. So, I would use a stringr look around to say, in effect, “split at the underscore that is followed by a digit.”

suppressPackageStartupMessages(library(tidyverse))
example_data <- data.frame(subject_ID = 1:5,
                           daily_measure_1 = 6:10,
                           daily_measure_2 = 11:15,
                           daily_measure_3 = 16:20,
                           daily_measure_4 = 21:25,
                           daily_measure_5 = 26:30)
example_data %>% 
  gather(key = "full_name", value = "value", starts_with("daily_")) %>%
  separate(full_name, c("variable", "day"), sep = "_(?=[:digit:])")
#>    subject_ID      variable day value
#> 1           1 daily_measure   1     6
#> 2           2 daily_measure   1     7
#> 3           3 daily_measure   1     8
#> 4           4 daily_measure   1     9
#> 5           5 daily_measure   1    10
#> 6           1 daily_measure   2    11
#> 7           2 daily_measure   2    12
#> 8           3 daily_measure   2    13
#> 9           4 daily_measure   2    14
#> 10          5 daily_measure   2    15
#> 11          1 daily_measure   3    16
#> 12          2 daily_measure   3    17
#> 13          3 daily_measure   3    18
#> 14          4 daily_measure   3    19
#> 15          5 daily_measure   3    20
#> 16          1 daily_measure   4    21
#> 17          2 daily_measure   4    22
#> 18          3 daily_measure   4    23
#> 19          4 daily_measure   4    24
#> 20          5 daily_measure   4    25
#> 21          1 daily_measure   5    26
#> 22          2 daily_measure   5    27
#> 23          3 daily_measure   5    28
#> 24          4 daily_measure   5    29
#> 25          5 daily_measure   5    30

Created on 2018-05-20 by the reprex package (v0.2.0).

agranholm · May 20, 2018, 12:35pm

This is excellent, and exactly what was needed. Thanks a lot, Mara!

Thanks for the links as well - will need to look further into RegExes for future use.