How to write conditional functions in R to extract high variable features?

mtoufiq · December 28, 2022, 2:01am

Hi,

I am working with a data matrix (20000 features (rows) * 18 samples (columns)). I am trying to create a list that would be filtered based on detection thresholds and variance across samples (e.g. detected in at least 10% of samples, and top 1000 most variable features). rowVars function from matrixStats r package will do this. I am finding a way to identify top most variable features.

Maybe this should work for the following cases?

cpm_with_log2 = data matrix with (20000 features (rows) * 18 samples (columns)

Case 1: (Getting the top 1000 highly variable features) - These are the features most variable across all samples regardless of which samples they are

topVarfeatures <- head(order(rowVars(cpm_with_log2), decreasing = TRUE), 1000)
mat_1  <- cpm_with_log2[ topVarfeatures, ]

Case 2: Features detected in at least 10% of samples

topVarfeatures.v1 <- rowVars( cpm_with_log2) > 18/10 ## (Total samples, n=18)
mat_2 <- cpm_with_log2[topVarfeatures.v1, ]

Case 3: Features filtered based on detection thresholds and variance across samples (e.g. detected in at least 10% of samples, and top 1000 most variable features)

Do you know how I could write the code to include 1000 features detected in at least 10% of samples here?

Thank you,

Mohammed

technocrat · December 28, 2022, 8:43am

library(matrixStats)
# fake some data
feats <- paste0("feat",1:20000)
samps <- paste0("sample",1:18)
x <- matrix(rnorm(18000), nrow = 20000, ncol = 18)
dat <- data.frame(cbind(feats,x))
# find variability and take top 1000
cpm_with_log2 <- data.frame(feature = feats, variability = rowVars(x))
tops <- cpm_with_log2[order(cpm_with_log2$variability,decreasing = TRUE),] |> head(1000)
# features in top 1000
kilos <- which(dat$feats %in% tops$feature)
# top features all exceed a minimum value
result <- dat[kilos,2:19] >= min(tops$variability) 
# present in ~10% of columns
tenperc <- rowSums(result) > 18/10
indexes <- as.numeric(unlist(attributes(tenperc)))
dat[indexes,] |> head()
#>       feats                V2                 V3                 V4
#> 3     feat3   2.5540148834323  0.742622971536156   1.95049739004393
#> 24   feat24  1.71361230297024 -0.154061513949041   1.10474619474461
#> 44   feat44 0.333706957954456 0.0319386025021886   1.02362194043021
#> 77   feat77  2.51048022703522  0.587866337393547 0.0530713948367507
#> 126 feat126  1.95099774502058  -2.52030973516224  -1.56729633409586
#> 167 feat167 0.830896567299789   2.93815696850695   -1.4789218142627
#>                     V5                 V6                 V7                 V8
#> 3   -0.467891498070858   1.40464989469426  -2.17838908783106  0.235366733911987
#> 24   0.977329650516096  -1.37364597779908   1.81162489518565 -0.809173227615358
#> 44    1.10939059230788 -0.373512937429292   2.06019514150243  -2.47906619757508
#> 77  -0.349188695960461    0.1236455719009 0.0936362898716663   -1.0927583140139
#> 126  -1.90336973882101  0.375542727144034  -1.00487809153837  0.333979159171015
#> 167   3.62838079935463  -1.23627670192964 -0.720431018225832   1.49970286602291
#>                     V9                V10               V11                V12
#> 3    -1.67022937520797  -1.37467251628901   2.5540148834323  0.742622971536156
#> 24    2.04761548908748  -1.52397903482752  1.71361230297024 -0.154061513949041
#> 44  -0.675736443275503  -1.63027192796814 0.333706957954456 0.0319386025021886
#> 77   -3.05171910931563 -0.518547964510007  2.51048022703522  0.587866337393547
#> 126 -0.520294094254628  0.839167118483221  1.95099774502058  -2.52030973516224
#> 167   -1.9072392489436 -0.441902362237402 0.830896567299789   2.93815696850695
#>                    V13                V14                V15                V16
#> 3     1.95049739004393 -0.467891498070858   1.40464989469426  -2.17838908783106
#> 24    1.10474619474461  0.977329650516096  -1.37364597779908   1.81162489518565
#> 44    1.02362194043021   1.10939059230788 -0.373512937429292   2.06019514150243
#> 77  0.0530713948367507 -0.349188695960461    0.1236455719009 0.0936362898716663
#> 126  -1.56729633409586  -1.90336973882101  0.375542727144034  -1.00487809153837
#> 167   -1.4789218142627   3.62838079935463  -1.23627670192964 -0.720431018225832
#>                    V17                V18                V19
#> 3    0.235366733911987  -1.67022937520797  -1.37467251628901
#> 24  -0.809173227615358   2.04761548908748  -1.52397903482752
#> 44   -2.47906619757508 -0.675736443275503  -1.63027192796814
#> 77    -1.0927583140139  -3.05171910931563 -0.518547964510007
#> 126  0.333979159171015 -0.520294094254628  0.839167118483221
#> 167   1.49970286602291   -1.9072392489436 -0.441902362237402

^{Created on 2022-12-28 by the reprex package (v2.0.1)}

mtoufiq · December 28, 2022, 3:05pm

@technocrat thank you. This is very helpful.

I am just bit confused about the below steps. Why to consider top features all exceed a minimum value step is needed? Instead why dont we directly consider features present in ~10% of columns after extracting top 1000 features?

### top features all exceed a minimum value
result <- dat[kilos,2:19] >= min(tops$variability) 
### present in ~10% of columns
tenperc <- rowSums(result) > 18/10
indexes <- as.numeric(unlist(attributes(tenperc)))
dat[indexes,] |> head()

technocrat · December 28, 2022, 5:43pm

Features are arranged row wise. Having extracted the top 1000 features, we know which rows have at least one occurrence of the feature. Samples are arranged column wise. Any given column either does, or does not contain an occurrence of the feature. We know only that at least one column does. This represents a binary 1/0 TRUE/FALSE for which we want a test. The test for occurrence is the magnitude of the variance returned by rowVars(). If the variance falls below some value, the feature was not present among the top 1000 features for that column. That value is given by the minimum variance for all features in the top 1000. If a feature has a variance of zero, it was not present at all, and if it has a variance of less than the threshold value, the feature was not among the top 1000 for that column.

result provides a truth table for the entire matrix at a single pass. Because TRUE/FALSE evaluate to 1/0 in the application of sum(), the rowSums total represents the number of occurrences of a feature, and that number can be compared to 18/10. The application of >= returns another logical. If the logical is TRUE, the row is retained, otherwise discarded. This satisfies case 3.

The alternative that I saw was iteration, which is no more direct, more involved and less efficient.

Whenever possible, I try to frame a problem in R as an instance of a truth table such as this, for two reasons. First, it permits vectors, arrays, matrixes, to be treated as a single object amenable to linear algebra. Second, such objects are amenable to functional, rather than procedural programming.

Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x , a desired object,y , and a function, f, that will return a value of y given x as an argument. In other words, school algebra&mdash f(x) = y. Any of the objects can be composites.. This has the considerable advantage of directing focus on what rather than how. If a result is outside the range to be expected theoretically, it means the wrong f was selected, meaning that focus was lost on what the nature of the result was supposed to have been compared to the nature of the return value of f. This is functional programming, why help($f$) has sections on arguments and values.

By contrast, in a procedural language an unreasonable result means some step was applied inappropriately, which requires an examination of each step, a how question. That process is potentially both more involved and requires being able to deduce not only what a reasonable final result looks like but what all intermediate steps look like. This is imperative/procedural programming in which the same general purpose steps are available at each point, and a choice among them must be made without direct consideration of the end result.

mtoufiq · December 28, 2022, 7:37pm

@technocrat thank you for the detailed explanation. Much appreciated. Its's very clear now.

system · January 4, 2023, 7:37pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.