subset in a dataframe

Hello,
I have a data frame with 28 samples in columns and 56000 parameters in rows.
I would like to keep only the rows that have more than 4 values > 0. I have not managed to do this with apply or subset. Could you please help me?

Welcome to the community!

I'd suggest counting the number of positive values in each row and then filtering on that count.

It'll be something like this:

# Keep rows with at least 4 positive values; use > 4 instead of >= 4 for strictly more than 4.
df_new <- df_old[rowSums(x = df_old > 0, na.rm = TRUE) >= 4,]
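
As a minimal sketch of how this behaves, here is the same idea on a small made-up data frame. The sample names `s1` to `s6`, the row names, and the values are all hypothetical, not your data. The per-row counts are stored first so the strict and non-strict thresholds can be compared:

```r
# Made-up data: 5 parameters (rows) x 6 samples (columns).
df_old <- data.frame(
  s1 = c(1, 0, 2, 0, 5),
  s2 = c(3, 0, 1, 0, 1),
  s3 = c(0, 0, 4, 1, 2),
  s4 = c(2, 1, 7, 0, 0),
  s5 = c(1, 0, NA, 0, 3),
  s6 = c(4, 0, 2, 0, 0),
  row.names = c("p1", "p2", "p3", "p4", "p5")
)

# df_old > 0 is a logical matrix; rowSums() counts the TRUEs per row.
# na.rm = TRUE means missing values are simply not counted as positive.
pos_counts <- rowSums(df_old > 0, na.rm = TRUE)

# Strictly more than 4 positive values.
df_strict <- df_old[pos_counts > 4, , drop = FALSE]

# At least 4 positive values.
df_at_least <- df_old[pos_counts >= 4, , drop = FALSE]
```

On this toy data, `> 4` keeps p1 and p3, while `>= 4` also keeps p5.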

I can't test this because you haven't shared the data. If this doesn't solve the problem, could you please provide a REPRoducible EXample (reprex)? It gives more specifics about your problem and helps others understand what you are facing.

If you don't know how to do it, take a look at this thread:


You might consider converting your data to a long format.

From tidyr::gather

Gather columns into key-value pairs.
Description
Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

Having samples as columns is exactly this situation of "columns that are not variables". Reshaping the data so that every column is a variable makes it much easier to manipulate.

Example

library(tibble)
library(tidyr)
library(dplyr)  # provides group_by(), summarize() and filter() used below

# Data where columns are sample ids.
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2,
  "duq",              1,         0,         3,
  "fiz",              1,         0,         0,
  "buz",             10,        10,         1)

# Make long format data
new_df <- old_df %>% gather(key="id", value="value", -parameter)
new_df
#   A tibble: 15 x 3
#   parameter id       value
#   <chr>     <chr>    <dbl>
# 1 foo       sample_1     1
# 2 baz       sample_1     1
# 3 duq       sample_1     1
# 4 fiz       sample_1     1
# 5 buz       sample_1    10
# 6 foo       sample_2     0
# 7 baz       sample_2     2
# 8 duq       sample_2     0
# 9 fiz       sample_2     0
#10 buz       sample_2    10
#11 foo       sample_3     2
#12 baz       sample_3     2
#13 duq       sample_3     3
#14 fiz       sample_3     0
#15 buz       sample_3     1
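
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A sketch of the equivalent call on a shortened copy of the toy data, assuming a recent tidyr is installed. Note that the rows come out ordered by parameter rather than by sample:

```r
library(tibble)
library(tidyr)

# Shortened copy of the toy data above (first two parameters only).
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2)

# Every column except `parameter` is collapsed into an id/value pair.
long_df <- pivot_longer(
  old_df,
  cols = -parameter,
  names_to = "id",
  values_to = "value"
)
```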

Now you can summarize and filter easily because all of your columns are variables. For example, grouping by the sample id lets you summarize the data per sample and ask questions such as "Which sample ids have more than 4 values greater than zero?" Grouping by parameter instead would count the positive values per row of the original data, which is what the question asks for.

new_df %>% 
  group_by(id) %>% 
  summarize(num_pos_values=sum(value > 0)) %>% 
  filter(num_pos_values > 4)

#   A tibble: 1 x 2
#   id       num_pos_values
#   <chr>             <int>
# 1 sample_1              5
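
To get back to the original question (which parameters, i.e. rows of the wide table, have enough positive values), the same pattern can group by parameter instead of id. This is a sketch, assuming dplyr is installed. The name `min_pos` is hypothetical, and it is set to 2 only because the toy data has 3 samples; with 28 samples you would use 4:

```r
library(tibble)
library(tidyr)
library(dplyr)

old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2,
  "duq",              1,         0,         3,
  "fiz",              1,         0,         0,
  "buz",             10,        10,         1)

# Hypothetical threshold: keep parameters with MORE than min_pos positive values.
min_pos <- 2

# Count positive values per parameter in the long format.
keep <- old_df %>%
  gather(key = "id", value = "value", -parameter) %>%
  group_by(parameter) %>%
  summarize(num_pos_values = sum(value > 0)) %>%
  filter(num_pos_values > min_pos)

# Keep only the matching rows of the original wide table.
filtered_df <- old_df %>% semi_join(keep, by = "parameter")
```

semi_join() keeps the rows of old_df whose parameter appears in keep, so the result stays in the wide layout. Here that leaves baz and buz.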