Splitting a vector into different columns if there is a number

cereghetti · May 29, 2020, 12:59pm

Hi there! I'm quite new with R, so this question might seem a bit dumb.
I have the following data frame, where i have at the start some text that would be the name of the variable and then five numbers separated by spaces, than represent different moments of time.

   metric1
1    Total Revenue 112 110 108 100 94
2   Total Revenue 534 523 514 500 485
3      Cost of Revenue 73 72 71 66 62
4 Cost of Revenue 403 394 385 373 361
5         Gross Profit 38 37 37 34 32
6    Gross Profit 131 129 129 126 124

There is any way i can split the column first betweeen text and numbers, and then by number in order to have something like this?

Metric	Year1	Year2	Year3	Year4	Year5
Total Revenue	112	110	108	100	94
Total Revenue	534	523	514	500	485
Cost of Revenue	73	72	71	66	62
Cost of Revenue	403	394	385	373	361
Gross Profit	38	37	37	34	32
Gross Profit	131	129	129	126	124

Thank you for your time!

siddharthprabhu · May 29, 2020, 1:23pm

The tidyr package has great tools for data manipulation tasks such as this. We can achieve this by using the extract() and separate() functions that it contains.

library(tidyr)

data <- tibble(metric1 = c("Total Revenue 112 110 108 100 94",
                           "Total Revenue 534 523 514 500 485",
                           "Cost of Revenue 73 72 71 66 62",
                           "Cost of Revenue 403 394 385 373 361",
                           "Gross Profit 38 37 37 34 32",
                           "Gross Profit 131 129 129 126 124"))

data <- data %>% 
  extract(col = metric1, into = c("Metric", "Numbers"), regex = "(\\D+)(.+)") %>% 
  separate(col = Numbers, into = c("Year1", "Year2", "Year3", "Year4", "Year5"), sep = " ")

print(data)
#> # A tibble: 6 x 6
#>   Metric             Year1 Year2 Year3 Year4 Year5
#>   <chr>              <chr> <chr> <chr> <chr> <chr>
#> 1 "Total Revenue "   112   110   108   100   94   
#> 2 "Total Revenue "   534   523   514   500   485  
#> 3 "Cost of Revenue " 73    72    71    66    62   
#> 4 "Cost of Revenue " 403   394   385   373   361  
#> 5 "Gross Profit "    38    37    37    34    32   
#> 6 "Gross Profit "    131   129   129   126   124

^{Created on 2020-05-29 by the reprex package (v0.3.0)}

cereghetti · May 29, 2020, 1:35pm

What you have done is amazing! Thank you so much!
I know all the tools that you have used execpt these terms (\D+)(.+). I know what they mean but i'm not really used to add them into my code. Whats these arguments name? In order to look for them online and learn about them.

Thank you, you are amazing!

siddharthprabhu · May 29, 2020, 1:43pm

The (\\D+)(.+) is called a regular expression (or regex for short). These are typically used for pattern matching in strings.

\D, + and . are metacharacters in regex that each match a specific pattern. For example, \D matches any non-digit character.

If you'd like to learn more, this website has a excellent interactive tutorial on the basics of regular expressions. Hope you find it useful.

cereghetti · May 29, 2020, 1:52pm

Thank you, Sidd. I really appreciate your time, you are saving me tons of problems.

system · June 5, 2020, 1:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.