I am analysing RNA-seq data. I have 45 samples: nine treatments with five replicates each (9 × 5 = 45).
I ran a PCA, and the samples do not separate well by treatment. Variance explained by the first three components:
PC1 = 12 %
PC2 = 10 %
PC3 = 7 %
I ran pairwise differential expression analyses, and I am planning to do a PCA on the DE genes only. I have a normalised expression matrix of all genes. My question is:
To do PCA on selected genes, should I go: normalised matrix of all genes -> correlation matrix -> retrieve DE genes only -> prcomp
OR: normalised matrix of all genes -> retrieve DE genes -> correlation matrix -> prcomp?
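For concreteness, here is a minimal sketch of the second route in base R. Note that prcomp takes the expression matrix itself (samples in rows), not a correlation matrix; with scale. = TRUE it is equivalent to a PCA on the gene-gene correlation matrix, so no separate correlation step is needed. The matrix norm_mat and the gene list de_genes are simulated placeholders, not real data.

```r
set.seed(1)

# Simulated stand-in for a normalised expression matrix (assumption):
# 200 genes in rows, 45 samples in columns
norm_mat <- matrix(rnorm(200 * 45), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("s", 1:45)))

# Hypothetical list of DE genes
de_genes <- paste0("gene", 1:20)

# Subset first, then transpose so that samples are rows and genes are columns
sub_mat <- t(norm_mat[de_genes, ])

# scale. = TRUE standardises each gene, i.e. PCA on the correlation matrix
pca <- prcomp(sub_mat, center = TRUE, scale. = TRUE)

# The explicit correlation-matrix route gives the same eigenvalues
eig <- eigen(cor(sub_mat), symmetric = TRUE)$values
same_eigenvalues <- isTRUE(all.equal(pca$sdev^2, eig))

# Percentage of variance explained by each component
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
```

Here same_eigenvalues comes out TRUE, which is why the order "subset -> prcomp(scale. = TRUE)" covers the "correlation matrix" step implicitly.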
Why would you want to do PCA on DE genes? You already know that these genes are going to separate the treatments, since this is, presumably, the comparison you did to get those DE genes in the first place.
If you see that PCA on normalized (!) values does not yield the expected pattern, this is an indication that you might have batch effects that explain more variability than your conditions of interest. That is valuable information, as you might be able to identify the factor that explains the batch effect (a typical example would be the type of sample instead of the type of treatment).
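One way to check this, sketched in base R on simulated data: run the PCA on all genes and ask how much of PC1 is explained by each known factor. The batch factor, its levels and the size of the shift are assumptions made up for illustration; with real data you would substitute whatever sample annotation you have (extraction date, sample type, lane, and so on).

```r
set.seed(42)

n_genes   <- 500
treatment <- factor(rep(paste0("T", 1:9), each = 5))        # 9 treatments x 5 replicates
batch     <- factor(rep(c("A", "B", "C", "D", "E"), times = 9))  # hypothetical batch label

# Simulated expression: pure noise plus a batch shift on the first 100 genes
expr  <- matrix(rnorm(n_genes * 45), nrow = n_genes)
shift <- c(A = -2, B = -1, C = 0, D = 1, E = 2)[as.character(batch)]
expr[1:100, ] <- sweep(expr[1:100, ], 2, shift, "+")

# PCA with samples in rows
pca     <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# R-squared of a principal component regressed on a factor
r2 <- function(pc, f) summary(lm(pc ~ f))$r.squared

r2_batch <- r2(pca$x[, 1], batch)
r2_treat <- r2(pca$x[, 1], treatment)
```

In this simulation r2_batch is far larger than r2_treat, which is the pattern you would look for when a nuisance factor dominates the first component.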
We are not expecting the samples to separate completely by treatment. Some treatments have a stronger effect than others (this is also something we observe). Based on published work in similar species, we know of some genes that are more strongly associated with treatment effects. We want to see whether these known genes behave similarly in our study, which is why I want to do PCA on selected genes.
So select the genes of interest and do a heatmap with them. You can also do a Venn diagram to see how many of the known genes are found by your study as well.
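The numbers behind such a Venn diagram are just set operations. A minimal base R sketch, with made-up gene identifiers standing in for your known genes and your DE genes:

```r
# Hypothetical gene identifiers, for illustration only
known_genes <- c("geneA", "geneB", "geneC", "geneD", "geneE")
de_genes    <- c("geneB", "geneC", "geneF", "geneG")

shared     <- intersect(known_genes, de_genes)  # known genes recovered by this study
only_known <- setdiff(known_genes, de_genes)    # known genes not found as DE here
only_de    <- setdiff(de_genes, known_genes)    # DE genes not in the known list

# The three region sizes of a two-set Venn diagram
overlap_counts <- c(known_only = length(only_known),
                    shared     = length(shared),
                    de_only    = length(only_de))
```

The vector shared is also what you would use to pick the rows for the heatmap.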
I agree with both Friederike and h.mon here. Following on from h.mon's point, just do the clustering with a heatmap and think about further refining your DEGs via regression modelling and 'gene signature' creation.
The only realm where I have seen PCA used on DEGs is when one wants to develop a new scoring system based on eigenvectors (module eigengenes), as is done in WGCNA, i.e., network analysis.
I am planning to do this, and I have a similar question about it. For the heatmap, when calculating z-scores, should I (1) calculate z-scores on the entire gene set (normalised reads) and then pull out my genes of interest for the heatmap, or (2) take my genes of interest (normalised reads) and then calculate z-scores? Does this change the result?
I don't understand the distinction you're making, so I would just say:
Make a matrix of normalized expression values for all your genes of interest (e.g., your DEGs) and all your samples.
Use, for example, pheatmap, which will allow you to see the effects of row- and column-based z-scores as well as the actual values without z-score transformation (you can set this via the parameter scale, which takes "none", "row" or "column").
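On whether the order matters: a row z-score is computed per gene across samples, so it does not depend on which other genes are in the matrix, whereas a column z-score is computed per sample across genes and therefore does. A small base R sketch on simulated values (norm_mat and genes_of_interest are placeholders) shows both cases:

```r
set.seed(7)

# Simulated stand-in for a normalised matrix (assumption): 100 genes x 45 samples
norm_mat <- matrix(rnorm(100 * 45, mean = 8, sd = 2), nrow = 100,
                   dimnames = list(paste0("gene", 1:100), paste0("s", 1:45)))

# Hypothetical genes of interest
genes_of_interest <- c("gene3", "gene17", "gene42")

# Row-wise z-score: centre and scale each gene across samples
row_z <- function(m) t(scale(t(m)))

z_then_subset <- row_z(norm_mat)[genes_of_interest, ]
subset_then_z <- row_z(norm_mat[genes_of_interest, ])
rows_agree <- isTRUE(all.equal(z_then_subset, subset_then_z,
                               check.attributes = FALSE))

# Column-wise z-score: centre and scale each sample across genes
col_z_then_subset <- scale(norm_mat)[genes_of_interest, ]
subset_then_col_z <- scale(norm_mat[genes_of_interest, ])
cols_agree <- isTRUE(all.equal(col_z_then_subset, subset_then_col_z,
                               check.attributes = FALSE))

# With the pheatmap package installed, the row-scaled heatmap would be:
# pheatmap::pheatmap(norm_mat[genes_of_interest, ], scale = "row")
```

Here rows_agree is TRUE and cols_agree is FALSE, so with row scaling (the usual choice for gene heatmaps) options (1) and (2) give the same picture.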
Thanks everyone. Thanks for your input.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.