subset in a dataframe

SimonG · October 31, 2019, 1:38pm

Hello,
I have a data frame with 28 samples in columns and 56000 parameters in rows.
I would like to keep only the rows that have more than 4 values > 0. I haven't managed to do it with apply or subset. Could you please help me?

Yarnabrina · October 31, 2019, 1:59pm

Welcome to the community!

I'd suggest counting the positive values in each row, and then filtering on that count.

It'll be something like this:

# "more than 4" is a strict inequality; use >= 4 if exactly four positives should also pass
df_new <- df_old[rowSums(x = df_old > 0, na.rm = TRUE) > 4, ]
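
A minimal sketch of that row filter on made-up data, so the behaviour can be checked without the real data set. The column names, row names, and values below are invented for illustration, and the strict `> 4` cutoff is an assumption about what "more than 4" means.

```r
# Made-up data: 3 parameters (rows) x 6 samples (columns)
df_old <- data.frame(
  s1 = c(1, 1, 2),
  s2 = c(3, 1, 5),
  s3 = c(0, 0, 1),
  s4 = c(2, 1, 4),
  s5 = c(7, 1, NA),
  s6 = c(1, 0, 3),
  row.names = c("p1", "p2", "p3")
)

# df_old > 0 is a logical matrix; rowSums() counts the TRUEs in each row,
# and na.rm = TRUE skips missing values instead of returning NA
n_pos <- rowSums(df_old > 0, na.rm = TRUE)
n_pos

# Keep rows with strictly more than 4 positive values;
# drop = FALSE keeps a data frame even if only one column were present
df_new <- df_old[n_pos > 4, , drop = FALSE]
rownames(df_new)
```

Row `p2` has exactly four positive values, so it is dropped here but would be kept with `>= 4`.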

I can't test this because you haven't shared any data. If it doesn't solve the problem, can you please provide a REPRoducible EXample (reprex)? It gives more specifics about your problem and helps others understand what you are facing.

If you don't know how to do it, take a look at this thread:

FAQ: How to do a minimal reproducible example (reprex) for beginners (Guides & FAQs)


grosscol · October 31, 2019, 3:06pm

You might consider converting your data to a long format.

From tidyr::gather

Gather columns into key-value pairs.
Description
Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

Having samples as columns is the situation of "columns that are not variables". Fixing that so all columns are variables makes manipulating the data a lot easier.

Example

library(tibble)
library(tidyr)
library(dplyr)  # needed for group_by(), summarize() and filter() used further down

# Data where columns are sample ids.
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2,
  "duq",              1,         0,         3,
  "fiz",              1,         0,         0,
  "buz",             10,        10,         1)

# Make long format data
new_df <- old_df %>% gather(key="id", value="value", -parameter)
new_df
#   A tibble: 15 x 3
#   parameter id       value
#   <chr>     <chr>    <dbl>
# 1 foo       sample_1     1
# 2 baz       sample_1     1
# 3 duq       sample_1     1
# 4 fiz       sample_1     1
# 5 buz       sample_1    10
# 6 foo       sample_2     0
# 7 baz       sample_2     2
# 8 duq       sample_2     0
# 9 fiz       sample_2     0
#10 buz       sample_2    10
#11 foo       sample_3     2
#12 baz       sample_3     2
#13 duq       sample_3     3
#14 fiz       sample_3     0
#15 buz       sample_3     1
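
As a side note, tidyr 1.0.0 and later mark `gather()` as superseded in favour of `pivot_longer()`. The sketch below shows what is assumed to be the equivalent reshaping call on a two-row subset of the example data. The name `long_df` is made up here, and the rows come out grouped by parameter rather than by sample id.

```r
library(tibble)
library(tidyr)

# Two-row subset of the example data, repeated so the block runs on its own
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2)

# cols = -parameter reshapes every column except `parameter`
long_df <- pivot_longer(
  old_df,
  cols = -parameter,
  names_to = "id",
  values_to = "value"
)
long_df
```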

Now you can summarize and filter very easily, because all of your columns are variables. In this case, we group by the sample id, which lets us summarize the data per sample and ask questions with regard to the sample id. E.g. "Which sample ids have more than 4 values greater than zero?" Grouping by parameter instead gives the per-row count asked about in the original question.

new_df %>% 
  group_by(id) %>% 
  summarize(num_pos_values=sum(value > 0)) %>% 
  filter(num_pos_values > 4)

#   A tibble: 1 x 2
#   id       num_pos_values
#   <chr>             <int>
# 1 sample_1              5
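
The original question is about rows (parameters) rather than samples, so here is a hedged sketch of the same idea grouped by `parameter`. The names `min_pos`, `keep`, and `filtered_df` are invented for this example. The threshold is lowered to 2 only because the toy data has three samples; with the real 28 samples it would be 4.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Same toy data as above, repeated so the block runs on its own
old_df <- tribble(
  ~parameter, ~sample_1, ~sample_2, ~sample_3,
  "foo",              1,         0,         2,
  "baz",              1,         2,         2,
  "duq",              1,         0,         3,
  "fiz",              1,         0,         0,
  "buz",             10,        10,         1)

min_pos <- 2  # hypothetical cutoff for the toy data; 4 for the real question

# Count positive values per parameter and keep those above the cutoff
keep <- old_df %>%
  gather(key = "id", value = "value", -parameter) %>%
  group_by(parameter) %>%
  summarize(num_pos_values = sum(value > 0)) %>%
  filter(num_pos_values > min_pos)

# Go back to the wide table, keeping only the selected parameters
filtered_df <- old_df %>% semi_join(keep, by = "parameter")
filtered_df
```

`semi_join()` keeps the original wide layout and row order, so the result has the same shape as the input with the unwanted parameters removed.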
