Question

RNA-Seq: Ranking Genes Based on Principal Component Analysis

2

11.7 years ago

Sudeep ★ 1.7k

Hi all
Has anybody tried PCA-based gene ranking on read count data, or do you know of any papers on it? I have been searching for a while, and almost all the papers I came across used PCA only for plotting sample separation. What should be taken into consideration when doing PCA-based gene ranking on read count data (i.e. starting from scratch)?

EDIT: I actually meant prioritizing genes based on read counts (expression values) between case and control samples using PCA.

Thank you

pca rna-seq • 9.8k views

ADD COMMENT • link updated 11.7 years ago by Michael 54k • written 11.7 years ago by Sudeep ★ 1.7k

1

What do you mean by "gene ranking"? What are the criteria for ranking?

ADD REPLY • link 11.7 years ago by Arun 2.4k

0

Well, what I actually meant was "gene prioritization" based on expression values, not "ranking" as such. I have edited my post.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

1

Have you done a more traditional differential expression analysis using DESeq or edgeR, for example? This will rank genes based on expression value differences between cases and controls.

ADD REPLY • link 11.7 years ago by Sean Davis 26k

0

Yes, I already have the DEGs from DESeq. I was just curious whether somebody has tried any of the PCA-based approaches, and what the caveats of such an analysis are.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

0

So you want to use PCA for differential expression ranking? I am interested in how this works. Can you link any papers on this approach? Are they just using PCA as some kind of smoothing function?

ADD REPLY • link 11.7 years ago by Damian Kao 16k

1

Here's an old but nice one on time-course analysis using PCA.

ADD REPLY • link 11.7 years ago by Arun 2.4k

1

So, given two datasets, A and B, they perform PCA on dataset A, project B onto A, and use the newly projected coordinates to get differential expression. I am not sure which test they are using for the differential expression, though. Some kind of ANOVA? I guess the advantages of this are: 1) it takes the time-course relationship into account; 2) using only the dominant components acts as a kind of smoothing function, as it de-noises the dataset.
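
A minimal sketch of the "fit on A, project B" idea described in this comment, assuming NumPy and toy random data. It is not the paper's actual procedure: the function names `fit_pca` and `project`, the matrix sizes, and the choice of two components are all hypothetical, and no differential expression test is applied to the projected coordinates.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return (mean, components) for rows-as-observations data via SVD."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(data, mean, components):
    """Express data in the coordinate system defined by the fitted axes."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 genes (rows) x 6 time points (columns).
trend = np.linspace(0.0, 1.0, 6)
dataset_a = rng.normal(size=(50, 1)) * trend + 0.1 * rng.normal(size=(50, 6))
dataset_b = rng.normal(size=(50, 1)) * trend + 0.1 * rng.normal(size=(50, 6))

mean_a, axes_a = fit_pca(dataset_a, n_components=2)
scores_a = project(dataset_a, mean_a, axes_a)
scores_b = project(dataset_b, mean_a, axes_a)  # B uses A's centering and axes

# Mapping the scores back gives a de-noised (rank-2) version of B.
denoised_b = scores_b @ axes_a + mean_a
```

Keeping only the leading components is what gives the smoothing effect: whatever part of B lies outside A's dominant axes is discarded when mapping back.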

ADD REPLY • link 11.7 years ago by Damian Kao 16k

0

You can have a look at this paper for other PCA-based applications. I found sparse PCA and supervised PCA to be quite interesting.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

2

Thanks for the papers. I am actually working with time-course RNA-seq data right now, so this is of interest to me. BTW, I posted some brief code on how to do PCA and visualize it with Python in matplotlib a couple of days ago: http://blog.nextgenetics.net/?e=42

ADD REPLY • link 11.7 years ago by Damian Kao 16k

0

Dk, nice post. However, it would be nice to explain the actual concept behind PCA and its purpose (why use it for time series?) in addition to just the code. I love theory! :)

ADD REPLY • link 11.7 years ago by Arun 2.4k

0

I've actually been working on a post to explain PCA, I just haven't gotten around to finishing it. It's a surprisingly simple concept if you ignore all the crazy maths, which I suck at anyway. :) It essentially just changes the coordinate system's axes (x, y, z, ...) into a series of orthogonal (perpendicular) best-fit lines.
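
To make the "new axes" picture concrete, here is a small sketch assuming NumPy and a made-up 2-D point cloud (all variable names are hypothetical). It computes the principal axes with an SVD and re-expresses the same points in the new coordinate system.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D point cloud stretched along the line y = x.
t = rng.normal(size=200)
points = np.column_stack([t, t + 0.2 * rng.normal(size=200)])

centered = points - points.mean(axis=0)
# Rows of `axes` are the new coordinate axes (principal components).
_, singular_values, axes = np.linalg.svd(centered, full_matrices=False)

rotated = centered @ axes.T          # same points, new coordinate system
variances = singular_values**2 / (len(points) - 1)
```

The first row of `axes` points along the direction of greatest spread (close to the y = x line here), the second is perpendicular to it, and `variances` says how much of the spread each new axis carries.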

ADD REPLY • link 11.7 years ago by Damian Kao 16k

0

Sudeep, I get "content not found"

ADD REPLY • link 11.7 years ago by Arun 2.4k

1

Sorry about that. I was logged in from my institute account with direct access to the manuscript. I have edited the link now, please try again.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

0

Unfortunately, I couldn't find any interesting papers for read count data. As I said in the post, all the papers I saw used PCA just to cluster samples. For microarrays I found a couple of papers, like the one Arun posted in his reply, but I am not sure how the statistics work out for read count data.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

1

It makes sense, because PCA is a tool for either clustering or dimensionality reduction, as far as I understand. So it doesn't make much sense to me to compare replicates of a gene across two conditions using PCA.

ADD REPLY • link 11.7 years ago by Arun 2.4k

score 4 · Answer 1 · 2012-08-07

4

11.7 years ago

Michael 54k

To resolve the unanswered state of this question: in agreement with most of the comments already given, the answer is that, at least in this case, it doesn't make much sense to use PCA for gene ranking. This is because you have a case vs. control setting, which means you have a "2-dimensional" problem, whereas the applications of PCA described in the paper are directed towards time series or other higher-dimensional measurements. You will therefore get at most 2 principal components, and if you wanted to remove one, e.g. for dimension reduction or noise reduction, you would have only one left. That is not good for doing a statistical test where you wish to compare two conditions.
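
A quick sketch of the "at most 2 components" point, assuming NumPy and a made-up table with one summarized column per condition (genes as rows). The numbers and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical summary table: 1000 genes (rows) x 2 conditions (columns),
# e.g. mean log-expression in control and in case.
control = rng.normal(loc=5.0, scale=2.0, size=1000)
case = control + rng.normal(scale=0.5, size=1000)
table = np.column_stack([control, case])

centered = table - table.mean(axis=0)
_, singular_values, axes = np.linalg.svd(centered, full_matrices=False)

n_components = len(singular_values)            # limited by the 2 columns
explained = singular_values**2 / np.sum(singular_values**2)
```

However many genes there are, two columns give two axes. In this toy setup the first axis mostly tracks overall expression level, so dropping the second one would discard most of the case vs. control difference.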

Of course, one could rank the genes by their factor loadings (projection of the data on the first principal axis), but that doesn't seem to have any advantage in a case-control setting. A statistical test has the advantage of providing an estimate of significance (i.e. p-values) and allows you to estimate power, etc. PCA is a totally different technique and doesn't provide these estimates. Unless you can better define the use case and explain why a non-standard method should be applied, I would stick with an established method.
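
For illustration only, ranking genes by their loading on the first principal axis could look roughly like this. It assumes NumPy, samples as rows and genes as columns, and a made-up matrix in which four genes are shifted in the cases; the gene names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
gene_names = [f"gene_{i}" for i in range(20)]      # hypothetical identifiers
# Hypothetical log-expression matrix: 6 samples (rows) x 20 genes (columns),
# first 3 samples are controls, last 3 are cases.
expr = rng.normal(scale=0.3, size=(6, 20))
expr[3:, :4] += 3.0                                 # genes 0-3 shift in cases

centered = expr - expr.mean(axis=0)
_, _, axes = np.linalg.svd(centered, full_matrices=False)

loadings_pc1 = axes[0]                              # one weight per gene
order = np.argsort(-np.abs(loadings_pc1))           # largest |loading| first
ranked_genes = [gene_names[i] for i in order]
```

The ranking puts the shifted genes first here only because the group difference happens to dominate the first axis. The loadings carry no p-value or error estimate, which is the limitation described above.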

You didn't say whether you have replication, but I guess you do. If you wanted to use PCA, you would therefore need to decide at which point in your analysis to summarize the replicates. At that point, however, you are going to lose information about within-group variance. A statistical test such as ANOVA needs the within-group variance in order to compare it with the between-group variance. It is therefore important to keep the within-group variance until the statistical test.
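
A small worked example of why the replicates matter, using only the Python standard library. The two genes and their values are invented, and `one_way_f` is a hypothetical helper that computes the textbook one-way ANOVA F statistic.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of replicate groups."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: replicate scatter around each group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log-expression values for two genes, 3 replicates per condition.
tight_gene = {"control": [5.0, 5.1, 4.9], "case": [7.0, 7.1, 6.9]}
noisy_gene = {"control": [3.0, 5.0, 7.0], "case": [5.0, 7.0, 9.0]}

f_tight = one_way_f(list(tight_gene.values()))
f_noisy = one_way_f(list(noisy_gene.values()))
```

Both genes have the same group means (5 and 7), so they look identical once the replicates are averaged. The F statistics are about 600 and 1.5 respectively, and that difference comes entirely from the within-group scatter.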

ADD COMMENT • link 11.7 years ago by Michael 54k

0

Thank you for this long explanation. I am following the traditional methods for analysis but, as I said in one of the comments, I posted this question just out of curiosity, to see if any work has been done on PCA-based methods.

ADD REPLY • link 11.7 years ago by Sudeep ★ 1.7k

1

Generally speaking, I'd say that the application of PCA will be the same for gene expression data, whether they come from microarrays or RNA-seq. For RNA-seq, PCA should be applicable to various gene-level normalized read counts, e.g. RPKM, FPKM, etc.
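
As a rough sketch of that normalization step, the snippet below computes RPKM from raw counts using the standard definition (reads per kilobase of transcript per million mapped reads). The gene lengths, counts, and library sizes are invented, and the log2(x + 1) transform is an added assumption that is common before PCA but not stated in the comment above.

```python
import math

def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical raw counts: gene -> (length in bp, counts per sample).
genes = {
    "geneA": (2000, [400, 800]),
    "geneB": (1000, [100, 200]),
}
library_sizes = [20_000_000, 40_000_000]    # total mapped reads per sample

# log2(RPKM + 1) table, the kind of matrix one might then hand to PCA.
log_rpkm = {
    name: [math.log2(rpkm(c, length, n) + 1.0)
           for c, n in zip(counts, library_sizes)]
    for name, (length, counts) in genes.items()
}
```

In this example the second sample has twice the counts but also twice the library size, so each gene ends up with the same RPKM in both samples (10 for geneA, 5 for geneB).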

ADD REPLY • link 11.7 years ago by Michael 54k

0

What if you have biological replicates of different parts of the same tissue and would like to use PCA to exclude a biological replicate in which one part of the tissue is contaminated by cells from the other part due to improper handling/microdissection/surgery? Isn't it easier to detect such a contaminated sample using PCA?

ADD REPLY • link 11.5 years ago by psola • 0

0

Easier than what? I have the impression that this is a possible application. Another method is to cluster the samples. Still, there is no big difference between RNA-seq and microarray data.
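
A toy sketch of how PCA on the samples might flag such a replicate, assuming NumPy. The data are simulated, the contamination is crudely modelled as a 50/50 mixture of two expression profiles, and the "furthest from the median on PC1" rule is a hypothetical choice for illustration, not an established outlier test.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical log-expression: 6 replicates (rows) x 200 genes (columns).
tissue_part_1 = rng.normal(loc=5.0, scale=1.0, size=200)
tissue_part_2 = tissue_part_1 + rng.normal(scale=2.0, size=200)
expr = tissue_part_1 + 0.1 * rng.normal(size=(6, 200))
# Replicate 5 is an assumed 50/50 mixture with the other tissue part.
expr[5] = 0.5 * tissue_part_1 + 0.5 * tissue_part_2 + 0.1 * rng.normal(size=200)

centered = expr - expr.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1_scores = u[:, 0] * s[0]                       # one coordinate per sample

# Flag the sample whose PC1 score is furthest from the median of the group.
distance = np.abs(pc1_scores - np.median(pc1_scores))
suspect = int(np.argmax(distance))
```

With real data one would normally look at a scatter plot of the first few components, or at a sample clustering as suggested above, rather than rely on a single automatic rule.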

ADD REPLY • link 11.5 years ago by Michael 54k