# multiple independent linear regression

Hello Everyone,

I have gene expression values for 10,000 genes (A1--A10,000) across 100 samples (S97---S197). I also have weight values for all 100 samples with some NA values.

I would like to perform linear regression of y (weight) versus 10,000 genes' expression, individually across all 100 samples, with output of the p value for each of the 10,000 genes. The goal is to identify genes that correlate in expression to the weight of the sample.

I also want to drop genes that have zero value for expression across all samples.How can i run independent linear regressions on 10,000 genes, and have the output be the p value in a table?

Some of lines of my input data looks like:

gene S97 S98 S99 S100 S101 S102 S103 S104 S105
weight 1.34 NA 0.50 2.4 0.85 0.93 2.66 1.48 1.08
A1 0.4 0 0 0.3 0 0.1 0 0 0
A2 0 0 0 0 0 0 0 0 0
A3 0.04 0 0 0 0.05 0 0 0 0
A4 0 0 0 0 0 0 0 0 0
A5 0 0 0 0 0 0 0 0 0
A6 0.1 0 0 0 0 0 0 0 0
A7 0.4 0 0 0.1 0.5 0 0 0 0
A8 0 0 0 0 0 0 0 0 0

Is this close to what you want to do?

MAT <- as.matrix(DF[,2:10])

MAT
#>        S97 S98 S99 S100 S101 S102 S103 S104 S105
#>  [1,] 1.34  NA 0.5  2.4 0.85 0.93 2.66 1.48 1.08
#>  [2,] 0.40   0 0.0  0.3 0.00 0.10 0.00 0.00 0.00
#>  [3,] 0.00   0 0.0  0.0 0.00 0.00 0.00 0.00 0.00
#>  [4,] 0.04   0 0.0  0.0 0.05 0.00 0.00 0.00 0.00
#>  [5,] 0.00   0 0.0  0.0 0.00 0.00 0.00 0.00 0.00
#>  [6,] 0.00   0 0.0  0.0 0.00 0.00 0.00 0.00 0.00
#>  [7,] 0.10   0 0.0  0.0 0.00 0.00 0.00 0.00 0.00
#>  [8,] 0.40   0 0.0  0.1 0.50 0.00 0.00 0.00 0.00
#>  [9,] 0.00   0 0.0  0.0 0.00 0.00 0.00 0.00 0.00
#Separate the weight data from the gene expression data
Weights <- MAT[1,]
Expres <- MAT[2:9,]
row.names(Expres) <- DF[2:9, 1]

#Filter out rows that are all zeros
SUMS <- rowSums(Expres)
Expres <- Expres[SUMS > 0,]
Expres
#>     S97 S98 S99 S100 S101 S102 S103 S104 S105
#> A1 0.40   0   0  0.3 0.00  0.1    0    0    0
#> A3 0.04   0   0  0.0 0.05  0.0    0    0    0
#> A6 0.10   0   0  0.0 0.00  0.0    0    0    0
#> A7 0.40   0   0  0.1 0.50  0.0    0    0    0

GetP <- function(x, y) {
FIT <- lm(y ~ x)
FITcoef <- summary(FIT)\$coefficients
FITcoef[2,4]
}

Ps <- apply(X = Expres, MARGIN = 1, FUN = GetP, Weights)
Ps
#>        A1        A3        A6        A7
#> 0.5274029 0.5146351 0.9351755 0.6574236

#Check that the p value reported for the 4th row (Ps[4]) is correct.
summary(lm(Weights ~ Expres[4,]))
#>
#> Call:
#> lm(formula = Weights ~ Expres[4, ])
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.9914 -0.4489 -0.1536  0.3383  1.1686
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   1.4914     0.3398   4.390  0.00462 **
#> Expres[4, ]  -0.6915     1.4829  -0.466  0.65742
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8054 on 6 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.03498,    Adjusted R-squared:  -0.1259
#> F-statistic: 0.2175 on 1 and 6 DF,  p-value: 0.6574

Created on 2020-05-05 by the reprex package (v0.3.0)

Thank you, yes I wanted this. I also want to add gene annotations from a gff file to the results from above lm. My gff looks like:

seqid source type start end score strand phase attributes

1 A1 maker gene 21 22 NA - <NA> ID=G_400;Name=G_400;Note=Similar to L400: cellular protein;ref_id=L400

I want an output like:

gene annotation(from gff) p-value
A1 cellular protein 0.5274029

Thank you!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.