subsetting data.frame columns by rownames AND values

cook675 · November 13, 2019, 3:07am

OK guys, I tried this one for a while but got nowhere. I have a data frame (df) as follows:

df <- data.frame(
  one = c(2,1,2,0,0,1),
  two = c(4,5,3,0,1,3),
  three = c(1,0,2,0,7,4),
  four = c(3,2,1,0,0,0)
)

row.names(df) <- c('mm1','mm2','mm3', 'GC1', 'GC2', 'GC3')
df

    one two three four
mm1   2   4     1    3
mm2   1   5     0    2
mm3   2   3     2    1
GC1   0   0     0    0
GC2   0   1     7    0
GC3   1   3     4    0

I want to remove every column that has a value greater than 0 in any row whose name matches ^GC. So in this case I would remove columns 1, 2, and 3 (one, two, and three). The result would look like this:

    four
mm1    3
mm2    2
mm3    1
GC1    0
GC2    0
GC3    0

Then, after this is done, I would like to remove all the ^GC rows (which should now be 0 in every remaining column). The final result would look like this:

    four
mm1    3
mm2    2
mm3    1

Maybe there is a simpler way to combine these two steps, but I want to be sure that I eliminate from the dataset any column that has a nonzero value in one of the ^GC rows.

!?!?!?

Thanks!

woodward · November 13, 2019, 3:23am

Instead of rownames, you'd be better off making a new column (say r) with those values. Then split r into the GC part and the number part. Then use dplyr::group_by and dplyr::filter. Sorry I can't give you the exact code at the moment.
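
For reference, a rough sketch of that idea. It assumes tidyr's pivot_longer and pivot_wider are available; the reshaping step is an assumption added here, since group_by and filter act on rows and the data has to be in long form before they can drop a whole column. The names r, prefix, column and value are only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- data.frame(
  one = c(2, 1, 2, 0, 0, 1),
  two = c(4, 5, 3, 0, 1, 3),
  three = c(1, 0, 2, 0, 7, 4),
  four = c(3, 2, 1, 0, 0, 0),
  row.names = c("mm1", "mm2", "mm3", "GC1", "GC2", "GC3")
)

result <- df %>%
  rownames_to_column("r") %>%                    # rownames become a real column
  mutate(prefix = sub("[0-9]+$", "", r)) %>%     # "GC1" -> "GC", "mm2" -> "mm"
  pivot_longer(cols = -c(r, prefix),
               names_to = "column", values_to = "value") %>%
  group_by(column) %>%                           # one group per original column
  filter(!any(prefix == "GC" & value > 0)) %>%   # drop groups with a positive GC value
  ungroup() %>%
  filter(prefix != "GC") %>%                     # then drop the GC rows themselves
  select(-prefix) %>%
  pivot_wider(names_from = column, values_from = value) %>%
  column_to_rownames("r")

result
```

With thousands of columns the long table gets large, so this is more of an illustration of the group_by/filter idea than something tuned for speed.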

cook675 · November 13, 2019, 6:36am

Hi woodward, no worries. I'm not sure I understand exactly what you are saying!

woodward · November 13, 2019, 9:00am

Hm, tricky. You can do it by finding the indices of the rows and columns you want to keep.

df <- data.frame(
  one = c(2,1,2,0,0,1),
  two = c(4,5,3,0,1,3),
  three = c(1,0,2,0,7,4),
  four = c(3,2,1,0,0,0)
)
row.names(df) <- c('mm1','mm2','mm3', 'GC1', 'GC2', 'GC3')

library(stringr)
library(tibble)
library(dplyr)
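i <- which(str_detect(row.names(df), "^GC"))  # indices of the GC rows
j <- which(names(df) %in% c("one", "two", "three", "four"))  # indices of the columns to check
k <- names(which(colSums(df[i, j, drop = FALSE]) == 0))  # names of the checked columns that are all 0 in the GC rows
df %>% 
  rownames_to_column("temp") %>% # save rownames in a column
  slice(-i) %>% # drop the GC rows
  select(all_of(k), "temp") %>% # select by name, so it still works when j is not every column
  column_to_rownames("temp")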
#>     four
#> mm1    3
#> mm2    2
#> mm3    1

Created on 2019-11-13 by the reprex package (v0.3.0)

cook675 · November 13, 2019, 6:13pm

Thanks woodward, I'm testing this right now. However, I'm not sure how to get around this line:

j <- which(names(df) %in% c("one", "two", "three", "four"))  # columns to check

I need to test every column of the matrix. There are more than 5000 of them, and they have random names.

woodward · November 13, 2019, 6:46pm

j is just the indices of the columns you need to check. Set it however you want.

Actually you might be able to do a lot with the subset function.

j <- which(names(df) %in% names(subset(df,,one:four)))

woodward · November 13, 2019, 7:56pm

Yes, it's easier with subset.

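i <- str_detect(row.names(df), "^GC")  # TRUE for the GC rows
j <- names(df) %in% names(subset(df, TRUE, one:four))  # TRUE for the columns to check
k <- colSums(df[i, j, drop = FALSE]) == 0  # TRUE for checked columns that are all 0 in the GC rows
subset(df, !i, names(k)[k])  # drop the GC rows, keep the all-zero columns by name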
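
If every column has to be checked, as with the 5000+ randomly named ones mentioned above, j can probably be dropped altogether. A minimal base R sketch under that assumption, reusing the example df from earlier in the thread; it tests for values greater than 0 directly instead of relying on the column sum being 0, which would only be equivalent when there are no negative values:

```r
df <- data.frame(
  one = c(2, 1, 2, 0, 0, 1),
  two = c(4, 5, 3, 0, 1, 3),
  three = c(1, 0, 2, 0, 7, 4),
  four = c(3, 2, 1, 0, 0, 0),
  row.names = c("mm1", "mm2", "mm3", "GC1", "GC2", "GC3")
)

i <- grepl("^GC", row.names(df))              # TRUE for the GC rows
k <- colSums(df[i, , drop = FALSE] > 0) == 0  # TRUE for columns with no positive GC value
result <- df[!i, k, drop = FALSE]             # drop GC rows, keep only the clean columns
result
```

Plain `[` indexing is used here rather than subset(), partly because subset() evaluates its arguments inside the data frame, which could misfire if one of the random column names happened to be i or k.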

andresrcs · November 13, 2019, 8:35pm

Another option

library(tidyverse)

df <- data.frame(
    one = c(2,1,2,0,0,1),
    two = c(4,5,3,0,1,3),
    three = c(1,0,2,0,7,4),
    four = c(3,2,1,0,0,0), 
    row.names = c('mm1','mm2','mm3', 'GC1', 'GC2', 'GC3')
)

keep_columns <- df %>% 
    rownames_to_column() %>% 
    filter(str_detect(rowname, "^GC")) %>% # keep only the GC rows
    summarise_if(is.numeric, sum) %>% # column totals over the GC rows
    select_if(~.x == 0) %>% # columns whose GC total is 0
    names()

df %>% 
    rownames_to_column() %>% 
    filter(!str_detect(rowname, "^GC")) %>% # drop the GC rows
    column_to_rownames() %>% 
    select(all_of(keep_columns))
#>     four
#> mm1    3
#> mm2    2
#> mm3    1

cook675 · November 15, 2019, 11:34pm

Hey Andres, how would I change this if, instead of a data frame, I had an object with a matrix inside it?

andresrcs · November 15, 2019, 11:53pm

If you want to apply the tidyverse-based solution, you would have to convert the matrix to a data frame and then back to a matrix at the end. There must be a base R solution that works directly with a matrix, but I suspect it would be hard to read and ugly to write (at least for me).
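
A possible base R sketch that works on the matrix directly. The container here is a plain list called obj with an element called counts; both names are made up for illustration, since the thread does not say what kind of object it is (for an S4 object the matrix would be reached with @ or an accessor function instead of $).

```r
# Hypothetical object holding a matrix (names obj and counts are assumptions)
m <- matrix(
  c(2, 1, 2, 0, 0, 1,   # one
    4, 5, 3, 0, 1, 3,   # two
    1, 0, 2, 0, 7, 4,   # three
    3, 2, 1, 0, 0, 0),  # four
  nrow = 6,
  dimnames = list(
    c("mm1", "mm2", "mm3", "GC1", "GC2", "GC3"),
    c("one", "two", "three", "four")
  )
)
obj <- list(counts = m)

is_gc <- grepl("^GC", rownames(obj$counts))                  # TRUE for the GC rows
keep <- colSums(obj$counts[is_gc, , drop = FALSE] > 0) == 0  # columns with no positive GC value
obj$counts <- obj$counts[!is_gc, keep, drop = FALSE]         # still a matrix, even with one column

obj$counts
```

The drop = FALSE arguments are there so that a single remaining row or column is not silently turned into a plain vector.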
