Hello Everyone,
I have gene expression values for 10,000 genes (A1--A10,000) across 100 samples (S97---S197). I also have weight values for all 100 samples with some NA values.
I would like to perform linear regression of y (weight) versus 10,000 genes' expression, individually across all 100 samples, with output of the p value for each of the 10,000 genes. The goal is to identify genes that correlate in expression to the weight of the sample.
I also want to drop genes that have zero value for expression across all samples.How can i run independent linear regressions on 10,000 genes, and have the output be the p value in a table?
Some of lines of my input data looks like:
gene
S97
S98
S99
S100
S101
S102
S103
S104
S105
weight
1.34
NA
0.50
2.4
0.85
0.93
2.66
1.48
1.08
A1
0.4
0
0
0.3
0
0.1
0
0
0
A2
0
0
0
0
0
0
0
0
0
A3
0.04
0
0
0
0.05
0
0
0
0
A4
0
0
0
0
0
0
0
0
0
A5
0
0
0
0
0
0
0
0
0
A6
0.1
0
0
0
0
0
0
0
0
A7
0.4
0
0
0.1
0.5
0
0
0
0
A8
0
0
0
0
0
0
0
0
0
FJCC
May 5, 2020, 9:13pm
2
Is this close to what you want to do?
DF <- read.csv("c:/users/fjcc/Documents/R/Play/Dummy.csv",stringsAsFactors = FALSE)
MAT <- as.matrix(DF[,2:10])
MAT
#> S97 S98 S99 S100 S101 S102 S103 S104 S105
#> [1,] 1.34 NA 0.5 2.4 0.85 0.93 2.66 1.48 1.08
#> [2,] 0.40 0 0.0 0.3 0.00 0.10 0.00 0.00 0.00
#> [3,] 0.00 0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
#> [4,] 0.04 0 0.0 0.0 0.05 0.00 0.00 0.00 0.00
#> [5,] 0.00 0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
#> [6,] 0.00 0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
#> [7,] 0.10 0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
#> [8,] 0.40 0 0.0 0.1 0.50 0.00 0.00 0.00 0.00
#> [9,] 0.00 0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
#Separate the weight data from the gene expression data
Weights <- MAT[1,]
Expres <- MAT[2:9,]
row.names(Expres) <- DF[2:9, 1]
#Filter out rows that are all zeros
SUMS <- rowSums(Expres)
Expres <- Expres[SUMS > 0,]
Expres
#> S97 S98 S99 S100 S101 S102 S103 S104 S105
#> A1 0.40 0 0 0.3 0.00 0.1 0 0 0
#> A3 0.04 0 0 0.0 0.05 0.0 0 0 0
#> A6 0.10 0 0 0.0 0.00 0.0 0 0 0
#> A7 0.40 0 0 0.1 0.50 0.0 0 0 0
GetP <- function(x, y) {
FIT <- lm(y ~ x)
FITcoef <- summary(FIT)$coefficients
FITcoef[2,4]
}
Ps <- apply(X = Expres, MARGIN = 1, FUN = GetP, Weights)
Ps
#> A1 A3 A6 A7
#> 0.5274029 0.5146351 0.9351755 0.6574236
#Check that the p value reported for the 4th row (Ps[4]) is correct.
summary(lm(Weights ~ Expres[4,]))
#>
#> Call:
#> lm(formula = Weights ~ Expres[4, ])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.9914 -0.4489 -0.1536 0.3383 1.1686
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.4914 0.3398 4.390 0.00462 **
#> Expres[4, ] -0.6915 1.4829 -0.466 0.65742
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8054 on 6 degrees of freedom
#> (1 observation deleted due to missingness)
#> Multiple R-squared: 0.03498, Adjusted R-squared: -0.1259
#> F-statistic: 0.2175 on 1 and 6 DF, p-value: 0.6574
Created on 2020-05-05 by the reprex package (v0.3.0)
Thank you, yes I wanted this. I also want to add gene annotations from a gff file to the results from above lm. My gff looks like:
seqid source type start end score strand phase attributes
1 A1 maker gene 21 22 NA - <NA> ID=G_400;Name=G_400;Note=Similar to L400: cellular protein;ref_id=L400
I want an output like:
gene annotation(from gff) p-value
A1 cellular protein 0.5274029
Thank you!
system
Closed
May 27, 2020, 12:17am
4
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.