Help for using linear mixed model for RNA-seq data in R


I have a gene expression matrix from RNA-seq as

 gene Individual1 Individual2
 Gene1  77.9        89.7
 Gene2  67.5        14.8

From RNA-seq data, I performed normalization which is represented by the values as 77.9 etc. In the matrix, genes are in rows, columns have samples with FPKM values for each gene across the samples. I want to run a linear mixed effect model to calculate p-vlaues for each gene across all the samples. I want to run a linear mixed model like:

Model = lmer(FPKM ~ (1|gene), data=X)

I am getting an error as there is no variable called FPKM. However, the FPKM values are represented in a matrix for each individual in the columns for each gene. I need help to run the model to generate p-values for each gene across the samples. I will appreciate any help. Thank you!

X must have a variable with the name FPKM

Thank you! I have the FPKM matrix as shown above. How can I make a variable named FPKM representing an array of FPKM values.

The matrix should be converted to a data frame, with individuals as rows and genes as columns. The gene name should be used to indicate the FPKM value. The gene is the fixed effect and the individual is the random effect.

Thank you! I have converted the data frame with individuals as rows and genes as columns. Can you please help in using gene name as FPKM value to run the model with gene as fixed effect and individual as random effect.

1 Like

# synthetic data set
basket <- seq(70.1,79.9,.1)
gene <- seq(1:10) %>% paste0("gene",.)
FPKM <- sample(basket,100,replace = TRUE)
data.frame(gene,FPKM) %>% mutate(gene = as.factor(gene)) -> X
#>    gene FPKM
#> 1 gene1 75.4
#> 2 gene2 74.7
#> 3 gene3 72.4
#> 4 gene4 75.0
#> 5 gene5 70.3
#> 6 gene6 73.6
Model = lmer(FPKM ~ (1|gene), data=X)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: FPKM ~ (1 | gene)
#>    Data: X
#> REML criterion at convergence: 490.2234
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  gene     (Intercept) 0.4158  
#>  Residual             2.7858  
#> Number of obs: 100, groups:  gene, 10
#> Fixed Effects:
#> (Intercept)  
#>       74.91

Created on 2020-09-18 by the reprex package (v0.3.0)

Thank you @technocrat! I have a question about basket variable. I understand it represents FPKM values. Can you please tell which order the values are put such as all values for Gene1 then Gene2 (row-wise)? My real dataset is much bigger and I am making a variable representing all FPKM values to use in above code.

basket holds values in 0.1 increments between in the closed interval [70.1,79.9] to be sampled 100 times with replacement to create the column FPKM. gene is constructed from the closed interval [1,10] with the string gene prepended. When bound as a data frame with FPKM the value recycles and the result is as shown.

To convert from the existing structure

X <- data.frame(gene = c("Gene1","Gene2"),
                Individual1 = c(77.9,89.7),
                Individual2 = c(67.5,14.8)
pivot_longer(X,cols = starts_with("I")) %>% select(-name)
#> # A tibble: 4 x 2
#>   gene  value
#>   <chr> <dbl>
#> 1 Gene1  77.9
#> 2 Gene1  67.5
#> 3 Gene2  89.7
#> 4 Gene2  14.8

Created on 2020-09-18 by the reprex package (v0.3.0)

Thank you! I ran the model and I have a question about Individuals. I want to also get the variance values to calculate the heritability's. The current model does not give any variance value. I tried to include Individual as another random factor in the model but it gives an error. Is there a way to include Individual as random factor in model and get variance values to calculate the heritability's further.

With data in the following form

#> Loading required package: Matrix
#> 'data.frame':    144 obs. of  3 variables:
#>  $ diameter: num  27 23 26 23 23 21 27 23 26 23 ...
#>  $ plate   : Factor w/ 24 levels "a","b","c","d",..: 1 1 1 1 1 1 2 2 2 2 ...
#>  $ sample  : Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...

Created on 2020-09-19 by the reprex package (v0.3.0)

fit <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)

Thank you! I converted the data like this:

gene Individual FPKM
1       1               77.9
1       2               89.7
2       1               67.5
2       2               14.8

I tried running this model:
Model <- lmer(FPKM ~ 1+ (1|gene) + (1|Individual), data)

But I get the error:
boundary (singular) fit: see ?isSingular

Is the problem in my data or model. My data is pretty big with 500 individuals and 40,000 genes. I am not sure if that's what causing the error. I do have some genes with all zero values across all individuals. Is there a way to remove those to make the data manageable, if that's what causing the error. Thank you!

Generally, this often happens with 0 or NA values. NA should be used in recording non-observations, and 0 when recording observations of zero.

Specifically, the all-zero case is expected to result in singularities.

X <- data.frame(gene = c("Gene1","Gene2"),
                Individual1 = c(77.9,89.7),
                Individual2 = c(67.5,14.8),
                Individual3 = c(0,0)

X[, colSums(X != 0) > 0]
#>    gene Individual1 Individual2
#> 1 Gene1        77.9        67.5
#> 2 Gene2        89.7        14.8

Created on 2020-09-20 by the reprex package (v0.3.0.9001)

Thank you! However, I still get the same error even after removing rows with all zeros.

Is the object row gene or row individual?

I have genes in rows which have all zeros for some genes across all individuals. So the object is row gene.

X <- data.frame(gene = c("Gene1","Gene2","Gene3"),
                Individual1 = c(77.9,89.7,0),
                Individual2 = c(67.5,14.8,0),
                Individual3 = c(16.7,45.2,0)

X[!rowSums(X == 0),]
#>    gene Individual1 Individual2 Individual3
#> 1 Gene1        77.9        67.5        16.7
#> 2 Gene2        89.7        14.8        45.2

Created on 2020-09-20 by the reprex package (v0.3.0.9001)

Thank you! I still got the same error. Here is the summary of model:
Linear mixed model fit by REML ['lmerMod']
Formula: FPKM ~ 1 + (1 | gene) + (1 | Individual)
Data: data

REML criterion at convergence: 116744602

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-113.42   -0.01    0.00    0.00  405.23 

Random effects:
 Groups   Name        Variance Std.Dev.
 gene     (Intercept) 249092   499.1   
 Individual (Intercept)      0     0.0   
 Residual             164508   405.6   
Number of obs: 7851855, groups:  gene, 29997; Genotype, 500

Fixed effects:
            Estimate Std. Error t value
(Intercept)   35.300      2.946   11.98
convergence code: 0
boundary (singular) fit: see ?isSingular

Here random effect for individual has zero value. Is it possible to fit separate models for each gene. Thank you!



for an explanation of why this occurs.

It is possible to run 40,000 models; it could be challenging to review them.

Hello @jessica1,

If you are going the route of fitting a model for the expression of each separated gene you should take a look at DESeq2 or edgeR. Those packages were created to do exactly what you are seeking.

And they have extensive documentation and vignettes. E.g.:

You can also find help at the Bioconductor Forums

1 Like