remove rows with at least 3 values higher than 5

Hi,
I'm trying to remove the rows of a dataframe where there are at least 7 values higher than 5, I have tried multiple for loops but it doesn't seem to work and I think there should be an easy way to do it, can you help?

The output of head(df) looks like this

CA001 CA002 CA003 CA004 CA005 CA006 CA007 CA008 CA009 CA010
hsa-miR-1185-1-3p     0     0     0    21     0    19     0     5     0     0
hsa-miR-1245b-5p      0     7     0    14     0    15    18    21     9    18
hsa-miR-1252-3p       0     2     0    19     7    15    24     9    11    11
hsa-miR-135a-5p       0     0     0    14    12    22    32     4     5     0
hsa-miR-154-3p        0     2     0     7     0    10    19     8     7     0
hsa-miR-2054          0     7     0     7     0     4     0     0     0    11

Thank you!

1 Like

I can fiddle with this, but it would be really helpful to have some sample data to work with. Could you paste in the output of dput(head(your_data_frame))?

I have updated the question with the head(df), the output of dput(head(df)) is this

structure(list(CA001 = c(90L, 9L, 5695L, 20L, 12706L, 229L),
CA002 = c(48L, 1L, 4220L, 12L, 9815L, 65L), CA003 = c(180L,
18L, 18350L, 70L, 44602L, 410L), CA004 = c(121L, 33L, 55112L,
106L, 116119L, 337L), CA005 = c(91L, 3L, 7260L, 43L, 22934L,
273L), CA006 = c(174L, 23L, 8838L, 54L, 30878L, 503L), CA007 = c(141L,
21L, 6777L, 31L, 11861L, 388L), CA008 = c(113L, 8L, 5528L,
36L, 15533L, 298L), CA009 = c(79L, 17L, 8096L, 55L, 20396L,
326L), CA010 = c(113L, 12L, 13399L, 44L, 23852L, 305L)), row.names = c("hsa-let-7a-2-3p",
"hsa-let-7a-3p", "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p",
"hsa-let-7c-3p"), class = "data.frame")

1 Like

That's very helpful! Working off of that, here's what I came up with:

library(tidyverse, warn.conflicts = TRUE)

my_df <- structure(list(CA001 = c(90L, 9L, 5695L, 20L, 12706L, 229L),
                        CA002 = c(48L, 1L, 4220L, 12L, 9815L, 65L), CA003 = c(180L,
                                                                              18L, 18350L, 70L, 44602L, 410L), CA004 = c(121L, 33L, 55112L,
                                                                                                                         106L, 116119L, 337L), CA005 = c(91L, 3L, 7260L, 43L, 22934L,
                                                                                                                                                         273L), CA006 = c(174L, 23L, 8838L, 54L, 30878L, 503L), CA007 = c(141L,
                                                                                                                                                                                                                          21L, 6777L, 31L, 11861L, 388L), CA008 = c(113L, 8L, 5528L,
                                                                                                                                                                                                                                                                    36L, 15533L, 298L), CA009 = c(79L, 17L, 8096L, 55L, 20396L,
                                                                                                                                                                                                                                                                                                  326L), CA010 = c(113L, 12L, 13399L, 44L, 23852L, 305L)), row.names = c("hsa-let-7a-2-3p",
                                                                                                                                                                                                                                                                                                                                                                         "hsa-let-7a-3p", "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p",
                                                                                                                                                                                                                                                                                                                                                                         "hsa-let-7c-3p"), class = "data.frame")

my_df %>% 
  rowwise() %>% 
  mutate(count_over_100 = sum(c_across(starts_with("CA")) > 100)) %>% 
  filter(count_over_100 < 3)
#> # A tibble: 2 x 11
#> # Rowwise: 
#>   CA001 CA002 CA003 CA004 CA005 CA006 CA007 CA008 CA009 CA010 count_over_100
#>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>          <int>
#> 1     9     1    18    33     3    23    21     8    17    12              0
#> 2    20    12    70   106    43    54    31    36    55    44              1

Created on 2021-06-14 by the reprex package (v2.0.0)

1 Like

If you wanted to be really concise, I think you could do it all in the filter() step, but I prefer the longer version for debugging purposes. Hope this helps! :blush:

Okay, this works perfectly, thanks!
For future viewers, this approach removes the rownames and appends a new column, I obtained my desired results with this code.

my_df$rows= rownames(my_df)
my_df %>% 
  rowwise() %>% 
  mutate(count_over_100 = sum(c_across(starts_with("CA")) > 100)) %>% 
  filter(count_over_100 < 3)
my_df= as.data.frame(my_df)
rownames(my_df) = my_df$rows
my_df$rows= NULL
my_df$count_over_100 = NULL
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.