group data by name

I need to group the values in a data frame column according to their names.

To illustrate the issue, here is an example. Imagine a data frame with a column "section", which is the column I would like to group:

section <- c("ALICE_!AAAA", "!AAAA_!AAAB", "!AAAB_NADIR", "NADIR_!AAAC", "!AAAC_MANDI")

Here I have three names that represent sections of one line and two that represent sections of another line. What I would like to identify is the line name each section belongs to. In other words, I would like to create another column that groups the sections by their line name, that is:

  • ALICE_!AAAA, !AAAA_!AAAB, !AAAB_NADIR <- Line ALICE_NADIR (I know it's the ALICE_NADIR line because the intermediate names within a line are preceded by an exclamation mark, e.g. !AAAA)
  • NADIR_!AAAC, !AAAC_MANDI <- Line NADIR_MANDI

In the data frame, the sections of a line are listed one after another and are not mixed with sections of other lines. For example, !AAAA_!AAAB is a section of line ALICE_NADIR because it lies between ALICE_!AAAA and !AAAB_NADIR; those two names mark the beginning and end of the line.

What I want RStudio to do is read the section column as I have explained and write, in a new column, the line to which each section belongs. It is important to note that RStudio has to identify the line names itself, as I don't have a list of them, and there are more than 2,000,000 sections and around 56,000 lines to be identified. The logic I was thinking about is something like this: if there is NOT an "!" before the first name of a section, that name is the beginning of the line name. Thereafter, when a section ends with a name without an "!", that name is the end of the line name. Therefore, all sections between the beginning and the end belong to that line.
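The rule described above can be sketched in base R. This is only a minimal illustration on the five example sections, and it assumes that every line opens with a section whose first name has no "!" and that individual names contain no underscore; `starts_line`, `line_id`, `first_name` and `last_name` are names made up for this sketch.

```r
# Example sections from the question (illustrative data only)
section <- c("ALICE_!AAAA", "!AAAA_!AAAB", "!AAAB_NADIR",
             "NADIR_!AAAC", "!AAAC_MANDI")

# A section that does not start with "!" opens a new line
starts_line <- !startsWith(section, "!")

# Running count of line starts: one integer id per line
line_id <- cumsum(starts_line)

# Name before the underscore and name after the underscore
# (assumption: each section contains exactly one underscore)
first_name <- sub("_.*$", "", section)
last_name  <- sub("^.*_", "", section)

# Line name = first name of the first section + last name of the last section
line_start <- tapply(first_name, line_id, function(x) x[1])
line_end   <- tapply(last_name, line_id, function(x) x[length(x)])
line <- paste(line_start, line_end, sep = "_")[line_id]

data.frame(line, section)
```

Because `cumsum()` and `tapply()` work on whole vectors, a sketch like this should not need an explicit loop over the rows, which matters with millions of sections.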

If I understand you correctly, this is what you are trying to do. If not, please provide a REPRoducible EXample (reprex) illustrating your issue.

library(tidyverse)
# Sample data to illustrate the problem
df <- data.frame(stringsAsFactors = FALSE,
                 section  = c("ALICE_!AAAA", "!AAAA_!AAAB", "!AAAB_NADIR",
                              "NADIR_!AAAC", "!AAAC_MANDI")
)

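df %>% 
    # beginning: leading name without "!"; end: trailing name without "!"
    mutate(beginning = str_extract(section, "^[^!]+(?=_)"),
           end = str_extract(section, "(?<=_)[^!]+$")) %>% 
    # carry each beginning down and each end up over the rows in between
    fill(beginning, .direction = "down") %>% 
    fill(end, .direction = "up") %>% 
    # combine both names into the line name
    transmute(line = paste(beginning, end, sep = "_"), section = section)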
#>          line     section
#> 1 ALICE_NADIR ALICE_!AAAA
#> 2 ALICE_NADIR !AAAA_!AAAB
#> 3 ALICE_NADIR !AAAB_NADIR
#> 4 NADIR_MANDI NADIR_!AAAC
#> 5 NADIR_MANDI !AAAC_MANDI

Created on 2019-10-07 by the reprex package (v0.3.0.9000)
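To see what the two regular expressions pick out before the `fill()` steps, here is a small base R sketch that applies the same patterns. `extract_first()` is a hypothetical helper written for this illustration (it is not part of any package); it is meant to mimic returning the first match or `NA`.

```r
section <- c("ALICE_!AAAA", "!AAAA_!AAAB", "!AAAB_NADIR",
             "NADIR_!AAAC", "!AAAC_MANDI")

# Hypothetical helper: first match of a Perl-style regex, NA when there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# "^[^!]+(?=_)": from the start, characters other than "!" up to an underscore
beginning <- extract_first(section, "^[^!]+(?=_)")

# "(?<=_)[^!]+$": after an underscore, characters other than "!" up to the end
end <- extract_first(section, "(?<=_)[^!]+$")

data.frame(section, beginning, end)
```

Only the first section of each line gets a `beginning` and only the last one gets an `end`; every other row is `NA`, which is why the values are then filled downwards and upwards respectively.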

Yes @andresrcs, that is exactly what I needed. However, trying it right now with my data, a message about the number of rows has appeared: "[ reached 'max' / getOption("max.print") -- omitted 2467428 rows ]". How can I solve it?



This is not an error, just a message telling you that the console output was cut off because it reached the `max.print` limit. It doesn't affect the data stored in memory.
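A minimal sketch of how to check this, assuming the result has been assigned to an object (here a made-up data frame called `result` stands in for the real one): count the rows instead of printing them, and look at only the first few.

```r
# Made-up stand-in for the real result (illustrative data only)
result <- data.frame(id = seq_len(200000))

# All rows are stored, even if the console cannot show them all
n_rows <- nrow(result)
n_rows

# Print only the first rows instead of the whole data frame
first_rows <- head(result, 5)
first_rows

# Optionally change the console limit; max.print counts entries, not rows
old <- options(max.print = 5000)
options(old)  # restore the previous setting
```

With millions of rows, inspecting a sample with `head()` is usually more practical than raising `max.print`.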
