I have a data frame with 28 samples in columns and 56000 parameters in rows.
I would like to keep only the rows that have more than 4 values > 0. I haven't managed to do it with apply or subset... Could you please help me?
Welcome to the community!
I'd suggest counting the number of positive values in each row, and then filtering on that count.
It'll be something like this:
df_new <- df_old[rowSums(x = df_old > 0, na.rm = TRUE) > 4, ]
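To see what that one-liner does, here is a small sketch on made-up data. The data frame, its column names, and its row names are all hypothetical and stand in for the real 56000 x 28 table. It assumes every column is numeric. Use `> 4` for "strictly more than four" and `>= 4` for "at least four".

```r
# Hypothetical toy data: 3 parameters (rows) x 6 samples (columns).
df_old <- data.frame(
  s1 = c(1, 0, 2),
  s2 = c(3, 0, 1),
  s3 = c(5, 0, 4),
  s4 = c(2, 1, 0),
  s5 = c(7, 0, NA),
  s6 = c(0, 0, 6),
  row.names = c("param_a", "param_b", "param_c")
)

# df_old > 0 is a logical matrix; rowSums() counts the TRUEs in each row.
# na.rm = TRUE means missing values are simply not counted as positive.
n_pos <- rowSums(df_old > 0, na.rm = TRUE)

# Keep rows with strictly more than 4 positive values.
df_new <- df_old[n_pos > 4, , drop = FALSE]
```

Here `param_a` has 5 positive values and is kept, `param_b` has 1, and `param_c` has 4 (the NA is ignored), so `param_c` would only be kept with `>= 4`.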
I can't test this on your data because you haven't shared it. If this doesn't solve the problem, could you please provide a REPRoducible EXample (reprex) of your problem? It gives more specifics about your problem and helps others understand what you are facing.
If you don't know how to do it, take a look at this thread:
You might consider converting your data to a long format.
Gather columns into key-value pairs.
Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.
Having samples as columns is the situation of "columns that are not variables". Fixing that so all columns are variables makes manipulating the data a lot easier.
require('tibble')
require('tidyr')

# Data where columns are sample ids.
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",       1,  0, 2,
  "baz",       1,  2, 2,
  "duq",       1,  0, 3,
  "fiz",       1,  0, 0,
  "buz",      10, 10, 1)

# Make long format data
new_df <- old_df %>%
  gather(key="id", value="value", -parameter)

new_df
# A tibble: 15 x 3
#    parameter id       value
#    <chr>     <chr>    <dbl>
#  1 foo       sample_1     1
#  2 baz       sample_1     1
#  3 duq       sample_1     1
#  4 fiz       sample_1     1
#  5 buz       sample_1    10
#  6 foo       sample_2     0
#  7 baz       sample_2     2
#  8 duq       sample_2     0
#  9 fiz       sample_2     0
# 10 buz       sample_2    10
# 11 foo       sample_3     2
# 12 baz       sample_3     2
# 13 duq       sample_3     3
# 14 fiz       sample_3     0
# 15 buz       sample_3     1
Now you can summarize and filter very easily as all of your columns are variables. In this case, it allows us to group by the sample id. That allows us to summarize data with respect to the sample id and to ask questions with regards to the sample id. E.g. "Which sample ids have more than 4 values greater than zero?"
# group_by(), summarize() and filter() come from dplyr
require('dplyr')

new_df %>%
  group_by(id) %>%
  summarize(num_pos_values=sum(value > 0)) %>%
  filter(num_pos_values > 4)
# A tibble: 1 x 2
#   id       num_pos_values
#   <chr>             <int>
# 1 sample_1              5
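The summary above is grouped by sample id. The original question asks about rows, i.e. parameters, so the same idea grouped by `parameter` might look like the sketch below. The toy data only has 3 samples, so the threshold `min_pos` is set to 2 here as an assumption; with 28 samples it would be 4. `pivot_longer()` is swapped in for `gather()`, which the tidyr documentation marks as superseded, and `semi_join()` is one possible way to get back to the wide layout.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Same toy data as above (hypothetical values, 3 samples per parameter).
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",       1,  0, 2,
  "baz",       1,  2, 2,
  "duq",       1,  0, 3,
  "fiz",       1,  0, 0,
  "buz",      10, 10, 1
)

# Assumption for the toy data; use 4 for the 28-sample case.
min_pos <- 2

# Count positive values per parameter and keep those above the threshold.
keep <- old_df %>%
  pivot_longer(cols = -parameter, names_to = "id", values_to = "value") %>%
  group_by(parameter) %>%
  summarize(num_pos_values = sum(value > 0, na.rm = TRUE)) %>%
  filter(num_pos_values > min_pos)

# Back to the wide layout, restricted to the selected parameters.
wide_kept <- old_df %>%
  semi_join(keep, by = "parameter")
```

With this toy data, only `baz` and `buz` have more than 2 positive values, so `wide_kept` contains those two rows with all three sample columns.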