Split data: text to columns

Dear all,

I am coming across multiple tables where several values (mean, low and high) are introduced in the same cell. I want to split this values into 3 columns for each of the columns in the original data, so I can keep all information. However, I only manage to get a long string. I tried with separate(), but I only manage to keep the value outside the parenthesis while the others are removed.

# My attempt
df <- data.frame(stringsAsFactors=FALSE,
                 X2024 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)"),
                 X2529 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)"),
                 X3034 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)")
)


for(j in 1: ncol(df)) {
   df[,j] <- separate(df, j, into = c("mean", "low", "high"), sep = "\n\\(|\\sto\\s", convert = TRUE)
}

Thanks in advance

It would be easier to help you, if you provide some sample of your dataframe (an image of it is not easy to work with), so could you please ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

This is an example making some assumptions about the structure of your actual data

df <- data.frame(stringsAsFactors = FALSE,
                 x = c("1.41\n(1.19, 1.66)", "0.10\n(0.05, 0.15)"))
library(tidyr)
library(stringr)
library(dplyr)

df %>%
    mutate(x = str_remove_all(x, "[()]")) %>% 
    separate(x, into = c("mean", "low", "high"), sep = "\n|,", convert = TRUE)
#>   mean  low high
#> 1 1.41 1.19 1.66
#> 2 0.10 0.05 0.15
3 Likes

Hi @andresrcs,

Thank you for your reply. I tried to follow your steps but I could not make it work. Mutate() was not making any change on my df, while separate() only keeps the first value, without creating new columns with the three different values. I created a loop, because I need to do these for multiple columns in multiple data frames.

Apologies for my first post, I'm quite new here. I have edited my post, so I hope now it is a bit clearer what I am trying to achieve.

Well, this works although I'm sure it must be a more elegant way to do it using some map_ function from the purrr package, hopefully some one else can come up with a better solution.

df <- data.frame(stringsAsFactors=FALSE,
                 X2024 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)"),
                 X2529 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)"),
                 X3034 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                           "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                           "1.289\n(1.109 to 1.491)")
)

library(tidyr)
library(stringr)
library(dplyr)

df <- df %>%
    mutate_all(str_remove_all, pattern = "[()]")

for(j in names(df)) {
    df <- separate(df, j, into = c(paste("mean",j), paste("low", j), paste("high", j)), sep = "\n|\\sto\\s", convert = TRUE)
}
df
#>   mean X2024 low X2024 high X2024 mean X2529 low X2529 high X2529
#> 1      1.391     1.077      1.754      1.391     1.077      1.754
#> 2      1.351     1.012      1.745      1.351     1.012      1.745
#> 3      1.177     1.145      1.208      1.177     1.145      1.208
#> 4      1.059     1.031      1.083      1.059     1.031      1.083
#> 5      1.289     1.109      1.491      1.289     1.109      1.491
#>   mean X3034 low X3034 high X3034
#> 1      1.391     1.077      1.754
#> 2      1.351     1.012      1.745
#> 3      1.177     1.145      1.208
#> 4      1.059     1.031      1.083
#> 5      1.289     1.109      1.491

Created on 2019-03-05 by the reprex package (v0.2.1)

1 Like

Here's how I would approach this problem. The key step is to use tidyr::gather() right away to bring everything into tidy format. This makes it easier to work with the single column containing the strings with mean, min, and max values. Once you have the data in tidy format, you can reshape it into whatever format works best for your output. This reprex gets you from the example data frame to tidy format.

library(dplyr)
library(tidyr)
library(stringr)

df <- tibble(
  X2024 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
            "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
            "1.289\n(1.109 to 1.491)"),
  X2529 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
            "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
            "1.289\n(1.109 to 1.491)"),
  X3034 = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
            "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
            "1.289\n(1.109 to 1.491)")
)

Here’s the full pipeline to generate a tidy-formatted version of the original dataset.

df %>% 
  gather(var, value) %>% 
  separate(value, c("mean", "range"), sep = "\n") %>% 
  mutate(range = str_remove_all(range, "[()]")) %>% 
  separate(range, c("min", "max"), " to ") %>% 
  mutate_at(vars(mean, min, max), as.numeric)
#> # A tibble: 15 x 4
#>    var    mean   min   max
#>    <chr> <dbl> <dbl> <dbl>
#>  1 X2024  1.39  1.08  1.75
#>  2 X2024  1.35  1.01  1.74
#>  3 X2024  1.18  1.14  1.21
#>  4 X2024  1.06  1.03  1.08
#>  5 X2024  1.29  1.11  1.49
#>  6 X2529  1.39  1.08  1.75
#>  7 X2529  1.35  1.01  1.74
#>  8 X2529  1.18  1.14  1.21
#>  9 X2529  1.06  1.03  1.08
#> 10 X2529  1.29  1.11  1.49
#> 11 X3034  1.39  1.08  1.75
#> 12 X3034  1.35  1.01  1.74
#> 13 X3034  1.18  1.14  1.21
#> 14 X3034  1.06  1.03  1.08
#> 15 X3034  1.29  1.11  1.49

Expand the section below to the pipeline step-by-step so that you can see how each transformation changes the data.

Step by step
# gather into tidy format
step_1 <- df %>% 
  gather(var, value) 
#> # A tibble: 15 x 2
#>   var   value                    
#>   <chr> <chr>                    
#> 1 X2024 "1.391\n(1.077 to 1.754)"
#> 2 X2024 "1.351\n(1.012 to 1.745)"
#> 3 X2024 "1.177\n(1.145 to 1.208)"
#> 4 X2024 "1.059\n(1.031 to 1.083)"
#> 5 X2024 "1.289\n(1.109 to 1.491)"
#> # … with 10 more rows
# separate mean from range
step_2 <- step_1 %>% 
  separate(value, c("mean", "range"), sep = "\n") 
#> # A tibble: 15 x 3
#>   var   mean  range           
#>   <chr> <chr> <chr>           
#> 1 X2024 1.391 (1.077 to 1.754)
#> 2 X2024 1.351 (1.012 to 1.745)
#> 3 X2024 1.177 (1.145 to 1.208)
#> 4 X2024 1.059 (1.031 to 1.083)
#> 5 X2024 1.289 (1.109 to 1.491)
#> # … with 10 more rows
# remove parenthesis from range
step_3 <- step_2 %>% 
  mutate(range = str_remove_all(range, "[()]")) 
#> # A tibble: 15 x 3
#>   var   mean  range         
#>   <chr> <chr> <chr>         
#> 1 X2024 1.391 1.077 to 1.754
#> 2 X2024 1.351 1.012 to 1.745
#> 3 X2024 1.177 1.145 to 1.208
#> 4 X2024 1.059 1.031 to 1.083
#> 5 X2024 1.289 1.109 to 1.491
#> # … with 10 more rows
# separate min and max from range
step_4 <- step_3 %>% 
  separate(range, c("min", "max"), " to ") 
#> # A tibble: 15 x 4
#>   var   mean  min   max  
#>   <chr> <chr> <chr> <chr>
#> 1 X2024 1.391 1.077 1.754
#> 2 X2024 1.351 1.012 1.745
#> 3 X2024 1.177 1.145 1.208
#> 4 X2024 1.059 1.031 1.083
#> 5 X2024 1.289 1.109 1.491
#> # … with 10 more rows
# finally, convert strings to numeric
step_4 %>% 
  mutate_at(vars(mean, min, max), as.numeric)
#> # A tibble: 15 x 4
#>   var    mean   min   max
#>   <chr> <dbl> <dbl> <dbl>
#> 1 X2024  1.39  1.08  1.75
#> 2 X2024  1.35  1.01  1.74
#> 3 X2024  1.18  1.14  1.21
#> 4 X2024  1.06  1.03  1.08
#> 5 X2024  1.29  1.11  1.49
#> # … with 10 more rows

Created on 2019-03-05 by the reprex package (v0.2.1)

3 Likes

Based on your initial post before the edits, I'm guessing each row in your data frame corresponds to a variable. In the case that you'd like to keep those grouped together so it looks like the original table, I'm going to expand on @grrrck's solution:

library(tidyr)
library(stringr)
library(dplyr)

df <- data.frame(stringsAsFactors=FALSE,
                 `20_24` = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                             "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                             "1.289\n(1.109 to 1.491)"),
                 `25_29` = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                             "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                             "1.289\n(1.109 to 1.491)"),
                 `30_34` = c("1.391\n(1.077 to 1.754)", "1.351\n(1.012 to 1.745)",
                             "1.177\n(1.145 to 1.208)", "1.059\n(1.031 to 1.083)",
                             "1.289\n(1.109 to 1.491)")
)

df %>% 
  mutate(group = paste0("variable_", row_number())) %>%# add a grouping variable
  gather(var, value, -group) %>% # gather everything but the groups
  separate(value, c("mean", "range"), sep = "\n") %>% 
  mutate(range = str_remove_all(range, "[()]")) %>% 
  separate(range, c("min", "max"), " to ") %>% 
  mutate_at(vars(mean, min, max), as.numeric)  %>% # after creating the 3 separate variables
  gather(key = "statistic", value = "value", -var, -group) %>% # gather them 
  spread(var, value) # and spread the var and value columns

#>         group statistic X20_24 X25_29 X30_34
#> 1  variable_1       max  1.754  1.754  1.754
#> 2  variable_1      mean  1.391  1.391  1.391
#> 3  variable_1       min  1.077  1.077  1.077
#> 4  variable_2       max  1.745  1.745  1.745
#> 5  variable_2      mean  1.351  1.351  1.351
#> 6  variable_2       min  1.012  1.012  1.012
#> 7  variable_3       max  1.208  1.208  1.208
#> 8  variable_3      mean  1.177  1.177  1.177
#> 9  variable_3       min  1.145  1.145  1.145
#> 10 variable_4       max  1.083  1.083  1.083
#> 11 variable_4      mean  1.059  1.059  1.059
#> 12 variable_4       min  1.031  1.031  1.031
#> 13 variable_5       max  1.491  1.491  1.491
#> 14 variable_5      mean  1.289  1.289  1.289
#> 15 variable_5       min  1.109  1.109  1.109

Created on 2019-03-05 by the reprex package (v0.2.1)

This solution does not follow the tidy principles (the X... columns should be their own column, e.g. age_group) but is useful in case you wanted to see a way to achieve it.

1 Like

This solution is fantastic! Not only works in a clear and concise way but also prepared the data in the format I needed (I was planning to apply gather afterwards, don't know why I didn't think about use gather in advance). Thanks!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.