Using partial strings in variable names to convert data to long form in R

I was hoping I could pick your brains as to how to convert my data to long form in R.

I would like to separate out the condition, Chromophore, and source detector from the variable names in my dataframe. Below I pasted a few examples (but there are many more variables in the data frame). So I need to pick out "Angry" and "Happy" and put those into a variable "Condition", then the Chromophores Hbo , HbR , and Hbt , and finally the source detector pairs "1,1", "1,2" (perhaps searching and pulling for the text before and after the comma because I have 49 combinations of these).

Example variable names:

AngryHRFHbO,1,1 AngryHRFHbR,1,1 AngryHRFHbT,1,1 AngryHRFHbO,2,1 AngryHRFHbR,2,1 AngryHRFHbT,2,1 HappyHRFHbO,4,1 HappyHRFHbR,4,1 HappyHRFHbT,4,1 HappyHRFHbO,2,2 HRFHbR,2,2 HappyHRFHbT,2,2

Thank you for your help! I apologize for the formatting. This is my first time posting so I was not sure as to the best way to present the issue.

All the best, Caroline

Let's create example data (since you didn't provide a reprex):

df <- c("ID", "Time", "AngryHRFHbO,1,1", "AngryHRFHbR,1,1", "HappyHRFHbO,4,1", "HappyHRFHbT,2,2") %>%
  map_dfc(~ runif(5)) %>%
  set_names(c("ID", "Time", "AngryHRFHbO,1,1", "AngryHRFHbR,1,1", "HappyHRFHbO,4,1", "HappyHRFHbT,2,2"))

Now to process these variable names, we need to have them as data in a column, so we need to pivot to the long format. We want to pivot all column names except "ID" and "Time".

df %>%
  pivot_longer(-c(ID, Time), names_to = "variable_name")

You could also select column names that starts_with() "Angry" and "Happy", there are a number of selection helpers available.

Then you can use separate to split on a separator, or extract to use a regular expression.

df %>%
  pivot_longer(-c(ID, Time), names_to = "variable_name") %>%
  extract(variable_name, into=c("condition", "chromophore", "source"), regex = "^([A-Z][a-z]+HRF)(Hb[ROT]),([0-9],[0-9])$")

Rather than how, the threshold question is what?

  1. The data design encoded several attributes into a single variable:
cond <- "Angry"
chro <- "Hbo"
sd1  <- "1"
sd2 <- "1"
oth <- "HRF"

in the first column. for example, which would mean that the combination of those attributes encoded in the third column variable had a value of -4.99E-06 in the first row.

  1. It is not clear how to decompose -4.99E-06 to the respective components.

  2. The design of a solution would also have been aided from beginning with a tidy object

A general rule of thumb is that it is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows, and it is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.

What would the transformed data frame look like as a tidy table with decomposed attributes? How is the numeric value shown in the screenshot related to the combination of attributes?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.