Performing linear regression on thousands of samples

I have been using the lm() function for linear regression for two variables (y= age of biological sample, x=expression level of Gene Z), where the gene expression was pulled from RNA sequencing data from 15 independent samples at different ages.

I would like to perform linear regression of y (age of sample) versus 10,000 genes' expression, individually, with output of the p value for each of the 10,000 genes. The goal is to identify genes that correlate in expression to the age of the sample.

How can i run independent linear regressions on 10,000 genes, and have the output be the p value in a table?

Hi @Namin! In order to help you a bit more, we're going to need to see what your data looks like. We have an FAQ that can help you build a reproducible example, or reprex, for this purpose.

It can be a little difficult to figure out what data to include in a reprex—especially if you're working with medical or otherwise confidential data, which absolutely shouldn't be shared. This thread has some discussion on that:

We don't really need to know exactly what data you have—we're more interested in how it's structured, what the column names are, whether each group has 1 or many rows, that sort of thing. If you can give a synthetic data set (ie. made up values) with maybe 10--100 rows, that can help us get you started :slight_smile:

x=("1   26.63446813 343.5476693 113.9143214 5.159326571 92.5259267  14.6753602  69.63377007 72.6635179  4.336995776 25.78248208
1   27.84917772 410.6839279 156.827618  3.038745698 102.8030316 8.392984282 66.74670888 122.4478989 4.424381077 18.94417554
1   17.79291187 401.225918  132.0120596 5.653336643 74.25716833 6.57710247  57.96412313 82.61125553 2.937669279 28.70580142
0.75    12.025493   211.2275738 79.57596633 1.359677764 63.42913652 4.656496616 50.60993799 42.59730402 2.621699595 4.293192444
0.75    15.09759214 243.5676761 100.788408  4.905031728 31.22944687 13.31374986 70.54548703 53.05375813 3.456702826 24.79592648
0.75    10.99518759 365.5815502 68.36280144 2.442954109 63.8752457  5.618391969 33.12872755 61.35955426 1.967528222 19.87546521
0.35    13.57029345 142.286295  55.13525425 3.729006944 34.12780422 11.32121897 37.96703369 45.36616601 1.742898364 33.97378959
0.35    14.28886134 113.5807686 56.81533373 2.504964463 15.01072387 7.326952857 46.86635231 45.63442939 2.866376918 18.09791804
0.35    9.618382272 167.4899001 64.22707467 2.530840169 27.07473977 7.774234569 34.46064971 25.9753166  1.214599544 11.34000972
0.1 6.056488668 80.81886076 33.80984315 0.870200763 24.89281161 5.6145562   15.35315855 33.28447643 1.461190997 10.10776685
0.1 3.301004278 73.62796114 28.34967155 1.596771015 25.22354103 7.24606463  28.28392085 22.36714968 2.378194159 9.083528266
0.1 9.334802389 54.16479971 41.2559059  0.74165986  16.92846012 4.302434985 29.28303294 17.86713216 0.934403318 22.35684064
0   2.267438217 24.56104308 21.83538984 0.404828452 10.83567324 2.75711313  15.94095588 17.9180724  0.486364908 3.297162163
0   3.413135124 80.18582039 19.42465402 1.421895962 24.89373647 1.897727664 11.45246564 23.31135812 1.186788448 10.0097013
0   2.85803658  65.73603997 29.9312098  0.889316035 23.88302618 2.127453697 18.54746559 28.22525585 1.413780132 6.646643097   ")
x1=read.table(textConnection(x),header=FALSE)
x.sum=summary(lm(x1$V1~x1$V2))
x.sum
#> 
#> Call:
#> lm(formula = x1$V1 ~ x1$V2)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.23819 -0.11982 -0.05624  0.09271  0.33953 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.068174   0.091122  -0.748    0.468    
#> x1$V2        0.043532   0.006514   6.683 1.51e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1945 on 13 degrees of freedom
#> Multiple R-squared:  0.7746, Adjusted R-squared:  0.7572 
#> F-statistic: 44.66 on 1 and 13 DF,  p-value: 1.509e-05

Created on 2018-11-22 by the reprex package (v0.2.1)

Thanks for the tip. In the data set above, I've included 11 columns. The first column is the age (1, 0.75, 0.35, 0.1, 0) in triplicate. Each subsequent column is gene expression for a particular gene (gene names removed). My full data set has 10,000 columns. The code provides the linear regression coefficients for comparison between age (column 1) and the first gene (column 2), giving me a p val of 1.5^10-5 along with other output. I am just interested in the p value of the linear regression at this time, and coding it in a way such that I can easily obtain it for all of the 10,000 genes in my full data set.

1 Like

Okay, so you're looking to make models between the first column (the response) and each of the subsequent columns (predictors) in turn? I think the tidyverse tools might help us get this done more easily :slightly_smiling_face:

First, I'm gonna reshape the data so that every column other than the first is collapsed into a single column. Then, I'll nest the responses and gene expressions for each predictor group, so we'll have a data frame for each one. Finally, I'll run the regression for each group and extract the R-squared value.

suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'ggplot2' was built under R version 3.5.1
#> Warning: package 'dplyr' was built under R version 3.5.1

# tidy
tidy_data  =
  x1 %>%
  as_data_frame() %>%
  rename(response = V1) %>%
  gather(key = 'gene', value = 'gene_expr', -response) %>%
  print()
#> # A tibble: 150 x 3
#>    response gene  gene_expr
#>       <dbl> <chr>     <dbl>
#>  1     1    V2        26.6 
#>  2     1    V2        27.8 
#>  3     1    V2        17.8 
#>  4     0.75 V2        12.0 
#>  5     0.75 V2        15.1 
#>  6     0.75 V2        11.0 
#>  7     0.35 V2        13.6 
#>  8     0.35 V2        14.3 
#>  9     0.35 V2         9.62
#> 10     0.1  V2         6.06
#> # ... with 140 more rows

# now let's nest and model
tidy_data %>%
  nest(-gene) %>%
  mutate(
    model = map(data, ~ lm(.$response ~ .$gene_expr)),
    rsq = map_dbl(model, ~summary(.)$r.squared))
#> # A tibble: 10 x 4
#>    gene  data              model      rsq
#>    <chr> <list>            <list>   <dbl>
#>  1 V2    <tibble [15 x 2]> <S3: lm> 0.775
#>  2 V3    <tibble [15 x 2]> <S3: lm> 0.914
#>  3 V4    <tibble [15 x 2]> <S3: lm> 0.891
#>  4 V5    <tibble [15 x 2]> <S3: lm> 0.615
#>  5 V6    <tibble [15 x 2]> <S3: lm> 0.785
#>  6 V7    <tibble [15 x 2]> <S3: lm> 0.362
#>  7 V8    <tibble [15 x 2]> <S3: lm> 0.806
#>  8 V9    <tibble [15 x 2]> <S3: lm> 0.746
#>  9 V10   <tibble [15 x 2]> <S3: lm> 0.695
#> 10 V11   <tibble [15 x 2]> <S3: lm> 0.282

Created on 2018-11-23 by the reprex package (v0.2.0).

5 Likes

Thanks for looking into this. So far so good, I think it is close.

The last bit is to output the p value for the linear regression. Currently it is producing the rsq (which is also useful, but not what I need at the moment). I can change "r.squared" to "adj.r.squared" or "sigma" and the code will run, but this wont get me closer to the p value.

Based upon your code, found this: (https://cran.r-project.org/web/packages/broom/vignettes/broom.html) which seems to provide useful information. However, Im having trouble putting it all together to generate the p values.

Take a look here for various methods, including one using broom:

2 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.