regex capture groups for pivoting data with `names_pattern` in {tidyr}

Luis · October 4, 2021, 10:53pm

Hi all,

I have some data with multiple variables describing the mean and standard error for different treatments. These 'mean' and 'se' columns are in wide format and I want to pivot them to long format for plotting with error bars.

The variable names have the treatment and the measure (mean or standard error) separated by an underscore (e.g., heat_mean, heat_se).

Here's a reproducible example:

library(dplyr)
library(tidyr)
dat <- tibble(sample=c(rep("A",5),rep("B",5)),
              location=c(rep("Hg",5),rep("Fg",5)),
              heat_mean=rnorm(10),
              heat_se=rnorm(10,0,.1),
              cold_mean=rnorm(10),
              cold_se=rnorm(10,0,.1))

The data would look like this:

sample	location	heat_mean	heat_se	cold_mean	cold_se
A	Hg	1.0515611	-0.0721989	0.3574717	0.1161773
A	Hg	0.2075425	0.0599422	1.3722471	0.0369714
A	Hg	0.5360133	-0.1289325	-0.4205705	-0.1072817
A	Hg	-0.2970411	-0.0501931	2.6781306	-0.1029467
A	Hg	0.3707710	-0.0326119	1.0663721	0.0049086
B	Fg	-0.7290617	-0.0745034	0.4845398	0.1408884
B	Fg	1.8045341	0.0495731	0.6030854	-0.0407545
B	Fg	-0.5688072	0.0031257	-1.1049746	0.0936773
B	Fg	-0.8443316	0.0485147	0.3302154	-0.0919323
B	Fg	2.0630788	-0.0720528	0.2073953	0.0641618

To almost get what I need, I can use the special .value indicator in the names_to argument to pivot_longer and either split the column on the separator with names_sep or use names_pattern and a regular expression with capturing groups to match the strings before and after the underscore.

dat %>% pivot_longer(c(heat_mean:cold_se),
                     names_to = c(".value","treatment"), names_sep = "_")

The result here has the structure I need, but I would have to recode the values in treatment and rename the new columns (Or perhaps edit the variable names first).

sample	location	treatment	heat	cold
A	Hg	mean	1.0515611	0.3574717
A	Hg	se	-0.0721989	0.1161773
A	Hg	mean	0.2075425	1.3722471
A	Hg	se	0.0599422	0.0369714
A	Hg	mean	0.5360133	-0.4205705
A	Hg	se	-0.1289325	-0.1072817

This SO question pretty much describes my situation, but in my case the variables are always "treatment" _ "measure".

I tried to define capture groups with the text after the separator first, but couldn't the matches to work. This is the RE I used: "((?<=_)\\w+$)(\\w+(?=_))". I must be misunderstanding capture groups and how they work from left to right.

Is there a way to define the groups directly with a regexp, or does anyone know how I could easily swap the strings around the separator so that "heat_mean" becomes "mean_heat" and then I can just use the separators?

Thanks!

SamGuay · October 5, 2021, 12:54am

Hi @Luis,

Your question intrigued me, so I tried different things. In the end, the last answer I came up with feels too simple to be true, so I might be missing something atm? I simply interchanged names_to = c(".value","treatment") to names_to = c("treatment",".value"). Is this the structure you wanted?

set.seed(1234)

dat <- tibble(sample=c(rep("A",5),rep("B",5)),
               location=c(rep("Hg",5),rep("Fg",5)),
               heat_mean=rnorm(10),
               heat_se=rnorm(10,0,.1),
               cold_mean=rnorm(10),
               cold_se=rnorm(10,0,.1))

dat %>% pivot_longer(c(heat_mean, heat_se, cold_mean, cold_se),
                     names_to = c("treatment",".value"),
                     names_sep = "_")

# A tibble: 20 x 5
   sample location treatment    mean       se
   <chr>  <chr>    <chr>       <dbl>    <dbl>
 1 A      Hg       heat      -1.21   -0.0477 
 2 A      Hg       cold       0.134   0.110  
 3 A      Hg       heat       0.277  -0.0998 
 4 A      Hg       cold      -0.491  -0.0476 
 5 A      Hg       heat       1.08   -0.0776 
 6 A      Hg       cold      -0.441  -0.0709 
 7 A      Hg       heat      -2.35    0.00645
 8 A      Hg       cold       0.460  -0.0501 
 9 A      Hg       heat       0.429   0.0959 
10 A      Hg       cold      -0.694  -0.163  
11 B      Fg       heat       0.506  -0.0110 
12 B      Fg       cold      -1.45   -0.117  
13 B      Fg       heat      -0.575  -0.0511 
14 B      Fg       cold       0.575  -0.218  
15 B      Fg       heat      -0.547  -0.0911 
16 B      Fg       cold      -1.02   -0.134  
17 B      Fg       heat      -0.564  -0.0837 
18 B      Fg       cold      -0.0151 -0.0294 
19 B      Fg       heat      -0.890   0.242  
20 B      Fg       cold      -0.936  -0.0466

Luis · October 5, 2021, 4:43am

Hi!
Indeed that was the way to go.
Tan Ho has a thread on Twitter about using .values, worth checking out.

This solved my question, although if anyone else wants to chime in on why I can't write capture groups right to left please do.

system · October 12, 2021, 4:43am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.