How to remove some portion of all rows in a dataset


I have this dataset and I want to delete the red squared portion in all rows. how can I do this? Thank you.

Hi @ridwna,

I would recommend using either gsub or stringr::str_remove. If you are looking to remove that text verbatim, something like this should work:

data$WGS.ID <- stringr::str_remove(data$WGS.ID, "_WGS_processed_downsamp.bam.cip")

If you are wanting to remove everything after the initial ID (i.e. PGDX...), something like this should work:

data$WGS.ID <- stringr::str_remove(data$WGS.ID, "_WGS.+")
1 Like


@FJCC But I got another problem, on this data I want to remove this red squared portion from all rows, how can I do this?

If you want to remove everything starting at the first underscore, you can to this. I invented a tiny data set for the illustration.

DF <- data.frame(WGS.ID = c("PGDX5881P_WGS_blah",
                            "PGDX5882P_WGS_blah"),
                 OtherColumn = 1:2)
DF
#>               WGS.ID OtherColumn
#> 1 PGDX5881P_WGS_blah           1
#> 2 PGDX5882P_WGS_blah           2
DF$WGS.ID <- gsub("([^_]+)_.+", "\\1", DF$WGS.ID)
DF
#>      WGS.ID OtherColumn
#> 1 PGDX5881P           1
#> 2 PGDX5882P           2

Created on 2022-03-18 by the reprex package (v2.0.1)
The regular expression ([^_]+)_.+" means "one or more characters that are not an underscore, an underscore, one or more of any character".
([^_]+) means "one or more characters that are not an underscore" and the parentheses define that as the first group, which will be used later.
The _ is simply an underscore.
.+ means "one or more of any character"
The replacement argument of gsub is \\1, which means "use the first group defined in the regular expression".

1 Like

@FJCC Thank you very much. It worked. :grinning: :grinning: :grinning: I am very happy.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.