Can we use UMAP for RNA-seq data?

shiinfo · September 24, 2022, 3:17am

Hi all,
I want to use UMAP to cluster RNA-seq data. I have an expression matrix file and want to see the clustering pattern of the replicates, because the DEGs show a large number of differences between replicates. I am not familiar with UMAP. My question: is it possible to see a clustering pattern of replicates using a CPM (counts per million) matrix?

My matrix file looks like this:

Gene	1.rep	2.rep	3.rep	4.rep	5.rep
MSTRG.10603.1	0.353527679	0.863557219	0.154658336	0.302840468	1.68378386
MSTRG.12772.1	12.66807516	12.70662765	10.28477935	12.77986775	14.5417697
MSTRG.8334.1	13.78757948	13.0767236	13.37794608	14.65747865	10.86805946
MSTRG.11583.1	35.94198069	37.44137372	24.51334628	33.67586004	34.13489099
MSTRG.4366.1	41.18597459	42.56103437	30.77700889	36.52256043	24.64447286
MSTRG.4203.1	82.07734278	85.4921647	113.1325729	84.85589912	54.95258235
MSTRG.6397.1	4.890466225	5.304708632	5.026395925	4.663743206	4.285995281
MSTRG.785.1	54.08973487	72.72385439	55.36768434	54.45071614	68.26978197
MSTRG.6825.1	534.4160079	471.440559	563.2656603	515.7373169	505.1351581
MSTRG.1448.1	58.86235854	63.16304232	49.49066757	57.96366556	41.17616895
... (up to 10500 genes)

Kindly suggest how I can see the clustering pattern of the 5 replicates. I am sorry for the lame question; I am new to R.

Thank you in advance

AlexisW · September 25, 2022, 7:11pm

Technically, yes, it's perfectly possible. Whether it's the right thing to do... I would say no.

For (bulk) RNA-Seq, the typical packages to use are {DESeq2} (vignette) and {edgeR} (user guide). If you read the linked vignettes, you'll see that both have a step where they plot a PCA or MDS. That is what you want to do here: it takes each replicate as a point in a 10500-dimensional space and reduces the dimensionality so it can be plotted in 2D on a screen. If your experiment worked as expected, the dimensions that are kept correspond to the experimental parameters (treatment, batch, ...), so the samples cluster accordingly.
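To make the idea concrete, here is a minimal sketch of a PCA on the ten rows posted in the question, written in plain Python so it has no dependencies. It is not what DESeq2 or edgeR do internally. The log2(x + 1) transform, the pseudocount, and the power-iteration approach are my assumptions for illustration, and the function names are made up. In R, `prcomp()` on the transposed log matrix does the same job.

```python
import math

# First ten rows of the matrix from the question (genes x replicates).
SAMPLES = ["1.rep", "2.rep", "3.rep", "4.rep", "5.rep"]
EXPR = [
    [0.353527679, 0.863557219, 0.154658336, 0.302840468, 1.68378386],
    [12.66807516, 12.70662765, 10.28477935, 12.77986775, 14.5417697],
    [13.78757948, 13.0767236, 13.37794608, 14.65747865, 10.86805946],
    [35.94198069, 37.44137372, 24.51334628, 33.67586004, 34.13489099],
    [41.18597459, 42.56103437, 30.77700889, 36.52256043, 24.64447286],
    [82.07734278, 85.4921647, 113.1325729, 84.85589912, 54.95258235],
    [4.890466225, 5.304708632, 5.026395925, 4.663743206, 4.285995281],
    [54.08973487, 72.72385439, 55.36768434, 54.45071614, 68.26978197],
    [534.4160079, 471.440559, 563.2656603, 515.7373169, 505.1351581],
    [58.86235854, 63.16304232, 49.49066757, 57.96366556, 41.17616895],
]


def centered_log(matrix, pseudo=1.0):
    """Apply log2(x + pseudo), then subtract each gene's mean across samples."""
    out = []
    for row in matrix:
        logged = [math.log2(v + pseudo) for v in row]
        mean = sum(logged) / len(logged)
        out.append([v - mean for v in logged])
    return out


def sample_gram(matrix):
    """Samples x samples matrix of dot products taken over all genes."""
    n = len(matrix[0])
    return [[sum(r[i] * r[j] for r in matrix) for j in range(n)] for i in range(n)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_components(gram, k=2, iters=1000):
    """Leading k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(gram)
    work = [row[:] for row in gram]
    pairs = []
    for _component in range(k):
        vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
        for _step in range(iters):
            w = matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in w))
            vec = [x / norm for x in w]
        value = sum(a * b for a, b in zip(vec, matvec(work, vec)))
        pairs.append((value, vec))
        # Remove the component just found so the next pass finds the following one.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return pairs


gram = sample_gram(centered_log(EXPR))
total = sum(gram[i][i] for i in range(len(gram)))
pairs = top_components(gram)

# Coordinates of each sample on a PC: eigenvector scaled by sqrt(eigenvalue).
scores = [[math.sqrt(val) * x for x in vec] for val, vec in pairs]
explained = [val / total for val, _vec in pairs]

for name, pc1, pc2 in zip(SAMPLES, scores[0], scores[1]):
    print(f"{name}\tPC1={pc1:+.3f}\tPC2={pc2:+.3f}")
print("fraction of variance explained:", [round(e, 3) for e in explained])
```

Replicates that behave alike end up with similar PC1/PC2 coordinates, and the explained-variance fractions tell you how much of the picture those two axes actually carry. With only ten genes this is a toy; run it on the full matrix for anything you want to interpret.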

So, why does scRNA-Seq use UMAP? Basically because there are too many samples. PCA has some strong constraints on the reduced dimensions it finds: they have to be orthogonal. So, if there is a lot of information spread over many dimensions, PCA will fail to show all of it in the first 2 principal components, and you'd need to look at many more. OTOH, UMAP can "torture" its axes until all the information is in 2D, but in the process the axes become meaningless, so the distances and positions of the clusters are hard to interpret. In summary, if you have thousands of samples (single cells) in hundreds of conditions (clusters), PCA can't show it properly. If you have barely a dozen samples in a couple of conditions (like here), then PCA will capture anything important, and the representation can be interpreted.

Final note: from the format of your matrix, I think you used StringTie to discover and quantify transcripts, and took the "TPM" output. I've never done de novo genome annotation, so I won't comment beyond a warning that novel transcript discovery is a hard problem. Make sure you read up on the available methods, and if it's a species that already has an annotation, you probably want to consider using it directly (see the user guides above for recommendations, e.g. Salmon). But if you want to load that data into edgeR or DESeq2, make sure you use the count output, not TPM or FPKM.
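To illustrate why the count output matters, here is a small sketch of how a CPM value is derived from raw counts. The gene names and counts are made up for the example (they are not from the question), and `counts_per_million` is a hypothetical helper, not a function from either package. edgeR and DESeq2 expect the raw counts and handle normalization themselves.

```python
# Hypothetical raw read counts (genes x samples), invented for illustration.
counts = {
    "geneA": [10, 20, 15],
    "geneB": [990, 1980, 1485],
    "geneC": [0, 0, 500],
}


def counts_per_million(table):
    """Scale each sample's counts by its library size (column total)."""
    n_samples = len(next(iter(table.values())))
    lib_sizes = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in table.items()
    }


cpm = counts_per_million(counts)
for gene, values in cpm.items():
    print(gene, [round(v, 1) for v in values])
```

In this toy table the first two samples end up with the same CPM values even though one was sequenced twice as deeply. That depth information is gone after scaling, which is why the packages want the counts and not an already-normalized matrix.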

All that being said, if you really want to do a UMAP, see the {uwot} and {umap} packages. With 5 replicates a PCA would be much better suited though.
