remove rows with at least 3 values higher than 5

lluccabus · June 14, 2021, 4:10pm

Hi,
I'm trying to remove the rows of a dataframe where there are at least 7 values higher than 5, I have tried multiple for loops but it doesn't seem to work and I think there should be an easy way to do it, can you help?

The output of head(df) looks like this

CA001 CA002 CA003 CA004 CA005 CA006 CA007 CA008 CA009 CA010
hsa-miR-1185-1-3p     0     0     0    21     0    19     0     5     0     0
hsa-miR-1245b-5p      0     7     0    14     0    15    18    21     9    18
hsa-miR-1252-3p       0     2     0    19     7    15    24     9    11    11
hsa-miR-135a-5p       0     0     0    14    12    22    32     4     5     0
hsa-miR-154-3p        0     2     0     7     0    10    19     8     7     0
hsa-miR-2054          0     7     0     7     0     4     0     0     0    11

Thank you!

EeethB · June 14, 2021, 4:41pm

I can fiddle with this, but it would be really helpful to have some sample data to work with. Could you paste in the output of dput(head(your_data_frame))?

lluccabus · June 14, 2021, 4:49pm

I have updated the question with the head(df), the output of dput(head(df)) is this

structure(list(CA001 = c(90L, 9L, 5695L, 20L, 12706L, 229L),
CA002 = c(48L, 1L, 4220L, 12L, 9815L, 65L), CA003 = c(180L,
18L, 18350L, 70L, 44602L, 410L), CA004 = c(121L, 33L, 55112L,
106L, 116119L, 337L), CA005 = c(91L, 3L, 7260L, 43L, 22934L,
273L), CA006 = c(174L, 23L, 8838L, 54L, 30878L, 503L), CA007 = c(141L,
21L, 6777L, 31L, 11861L, 388L), CA008 = c(113L, 8L, 5528L,
36L, 15533L, 298L), CA009 = c(79L, 17L, 8096L, 55L, 20396L,
326L), CA010 = c(113L, 12L, 13399L, 44L, 23852L, 305L)), row.names = c("hsa-let-7a-2-3p",
"hsa-let-7a-3p", "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p",
"hsa-let-7c-3p"), class = "data.frame")

EeethB · June 14, 2021, 5:12pm

That's very helpful! Working off of that, here's what I came up with:

library(tidyverse, warn.conflicts = TRUE)

my_df <- structure(list(CA001 = c(90L, 9L, 5695L, 20L, 12706L, 229L),
                        CA002 = c(48L, 1L, 4220L, 12L, 9815L, 65L), CA003 = c(180L,
                                                                              18L, 18350L, 70L, 44602L, 410L), CA004 = c(121L, 33L, 55112L,
                                                                                                                         106L, 116119L, 337L), CA005 = c(91L, 3L, 7260L, 43L, 22934L,
                                                                                                                                                         273L), CA006 = c(174L, 23L, 8838L, 54L, 30878L, 503L), CA007 = c(141L,
                                                                                                                                                                                                                          21L, 6777L, 31L, 11861L, 388L), CA008 = c(113L, 8L, 5528L,
                                                                                                                                                                                                                                                                    36L, 15533L, 298L), CA009 = c(79L, 17L, 8096L, 55L, 20396L,
                                                                                                                                                                                                                                                                                                  326L), CA010 = c(113L, 12L, 13399L, 44L, 23852L, 305L)), row.names = c("hsa-let-7a-2-3p",
                                                                                                                                                                                                                                                                                                                                                                         "hsa-let-7a-3p", "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p",
                                                                                                                                                                                                                                                                                                                                                                         "hsa-let-7c-3p"), class = "data.frame")

my_df %>% 
  rowwise() %>% 
  mutate(count_over_100 = sum(c_across(starts_with("CA")) > 100)) %>% 
  filter(count_over_100 < 3)
#> # A tibble: 2 x 11
#> # Rowwise: 
#>   CA001 CA002 CA003 CA004 CA005 CA006 CA007 CA008 CA009 CA010 count_over_100
#>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>          <int>
#> 1     9     1    18    33     3    23    21     8    17    12              0
#> 2    20    12    70   106    43    54    31    36    55    44              1

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

EeethB · June 14, 2021, 5:13pm

If you wanted to be really concise, I think you could do it all in the filter() step, but I prefer the longer version for debugging purposes. Hope this helps!

lluccabus · June 15, 2021, 6:52am

Okay, this works perfectly, thanks!
For future viewers, this approach removes the rownames and appends a new column, I obtained my desired results with this code.

my_df$rows= rownames(my_df)
my_df %>% 
  rowwise() %>% 
  mutate(count_over_100 = sum(c_across(starts_with("CA")) > 100)) %>% 
  filter(count_over_100 < 3)
my_df= as.data.frame(my_df)
rownames(my_df) = my_df$rows
my_df$rows= NULL
my_df$count_over_100 = NULL

system · June 22, 2021, 6:52am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.