Question

Extract further information from deepTools plotPCA

6.8 years ago

Roman Hillje ▴ 90

I'm exploring some ChIPseq data of histone modifications and really enjoy using deepTools since it has so many features and is quite fast. Recently, I used the plotPCA function and received the output you can see below.

I gotta admit that I'm not an expert in PCA, but in the first figure it seems that PC2 is the most powerful when it comes to describing the variability of the samples. However, in the second figure it is the first factor that describes 80% of the variability. I know that the factors are ranked, so naturally the second is lower than the first, but am I missing something, or is the first factor actually plotted as PC2?

Follow-up question: Clearly, the variable plotted as PC2 is extremely important for me as I'm looking at a time-course and the results I see are very encouraging. What I would like to do is extract the regions / bins of the genome that can be used to discriminate the samples. Does anybody have an idea how to do that? I was thinking of the Monocle tool, which does pseudo-time based clustering of single-cell RNAseq data. I'm trying to find a way to feed my ChIPseq data into this tool, but there are some problems with normalization. Perhaps there is an easier way?

Any help is greatly appreciated!

[Figure: plotPCA output]

ChIP-Seq deepTools • 13k views

score 3 · Answer 1 · 2017-07-11

6.8 years ago

Devon Ryan 104k

PC1 is basically "genomic bin" or "feature", which will often have a huge effect. The percentages of variability explained will be included on the axes in deepTools 2.6. Anyway, the scales on the axes are arbitrary in PCA, so you can only judge the magnitude of the effect by the scree plot at the bottom. I'm not familiar with Monocle, so I don't know what sort of input it expects.

But am I wrong if I assume that factor 1 and 2 in the scree plot correspond to PC1 and PC2 in the PCA plot, respectively? I'm just confused since the samples don't seem to be well distinguishable by PC1, and factor 1 in the scree plot leads me to believe that it should have a very strong effect. Or is it simply a question of the PC1 scale in the PCA plot? Sorry if my questions are not very clear. It's not easy to explain my thoughts.

ADD REPLY • link 6.8 years ago by Roman Hillje ▴ 90

Yes, factor 1 and 2 are PC1 and PC2 (I'll change the "factors" label in the next minor release). PC1 is usually not informative, since it's effectively the "genomic bin" effect, which is highly variable, but very similar between the samples. Remember that the size of the effect accounted for by a PC doesn't tell you anything about how informative that will be in discriminating between samples.
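To make the last point concrete, here is a minimal synthetic sketch (NumPy only, made-up data, not deepTools code). It assumes the PCA treats bins as observations and samples as variables, and that every sample shares a strong per-bin effect on top of a weaker group-specific signal. All names and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_per_group = 2000, 3

# Strong per-bin effect shared by all samples (e.g. GC bias, mappability).
bin_effect = rng.normal(0.0, 5.0, size=n_bins)
# Weaker signal that has opposite sign in the two groups.
group_signal = rng.normal(0.0, 1.0, size=n_bins)

group = np.array([0] * n_per_group + [1] * n_per_group)
signs = np.where(group == 0, -1.0, 1.0)

# Matrix of bins x samples: shared effect + group effect + noise.
X = (bin_effect[:, None]
     + signs[None, :] * group_signal[:, None]
     + rng.normal(0.0, 0.3, size=(n_bins, group.size)))

# PCA with samples as variables: eigen-decompose the sample covariance.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)            # samples x samples
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()       # what a scree plot shows
pc1_weights = eigvecs[:, 0]               # one weight per sample
pc2_weights = eigvecs[:, 1]

print("explained variance:", np.round(explained[:3], 3))
print("PC1 weights:", np.round(pc1_weights, 3))
print("PC2 weights:", np.round(pc2_weights, 3))
```

In this toy setup PC1 carries most of the variance, yet all samples get nearly the same PC1 weight, so it cannot separate them. The group difference only shows up on PC2.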

As an aside, I plan to change how PCA is computed in the 2.6 release. I find the current implementation (what matplotlib produces) to be suboptimal and would generally prefer to either subset the matrix or at the very least compute the PCA of the transposed matrix, which would be more similar to what one typically does with RNAseq data. My hope is that the resulting plots will be more informative then.
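For anyone who wants to try this outside of deepTools in the meantime, the following is a rough sketch of that idea, and also one way to get at the bins that drive a given PC (the follow-up question above). It is not the deepTools implementation. The function name `top_bins_for_pc`, its parameters, and the synthetic counts are all made up for illustration. With real data, the bins x samples matrix could come from a per-bin count table such as the one `multiBamSummary --outRawCounts` writes.

```python
import numpy as np

def top_bins_for_pc(counts, pc=0, n_variable=1000, n_top=20):
    """PCA with samples as observations, on the most variable bins.

    counts: array of shape (bins, samples).
    pc: zero-based index of the principal component of interest.
    Returns (indices of the n_top bins with the largest absolute
    loading on that PC, per-sample scores on that PC,
    explained-variance ratios of all PCs).
    """
    logc = np.log2(counts + 1.0)
    # Subset: keep the bins that vary most across samples.
    keep = np.argsort(logc.var(axis=1))[::-1][:n_variable]
    sub = logc[keep].T                    # samples x bins
    sub = sub - sub.mean(axis=0)          # centre each bin
    U, S, Vt = np.linalg.svd(sub, full_matrices=False)
    scores = U * S                        # sample coordinates on each PC
    explained = S**2 / np.sum(S**2)
    loadings = Vt[pc]                     # one loading per kept bin
    top = np.argsort(np.abs(loadings))[::-1][:n_top]
    return keep[top], scores[:, pc], explained

# Synthetic demo: 6 samples, 5000 bins, the first 50 bins are
# 8-fold enriched in samples 3-5 (hypothetical numbers).
rng = np.random.default_rng(1)
base = rng.uniform(50, 500, size=5000)
fold = np.ones((5000, 6))
fold[:50, 3:] = 8.0
counts = rng.poisson(base[:, None] * fold)

top_bins, scores, explained = top_bins_for_pc(counts, pc=0)
print("bins with largest |loading| on PC1:", np.sort(top_bins))
print("sample scores on PC1:", np.round(scores, 2))
```

The returned indices refer to rows of the original count matrix, so they can be mapped back to genomic coordinates. Which PC to inspect, how many bins to keep, and whether a log transform is appropriate all depend on the data.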

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

Thanks for the information! Looking forward to the next release. Since you are working on the PCA feature, maybe you could also implement a color option similar to what other functions of the deepTools kit offer? It would be great if it was possible to provide a list of color names/codes (like you do for the labels) so that it's easier to find members of a group.

With respect to the informational content of the PCA I'm a bit confused. If the size of the effect doesn't tell me anything about how informative it is in discriminating the samples, what does it tell me then? Just to let you know, I'm aware that this could be a conceptual question about PCA and of course I'm not expecting you to explain that to me :)

ADD REPLY • link 6.8 years ago by Roman Hillje ▴ 90

The size of an effect tells you how much it contributes to the variation inside a sample. In ChIPseq that tends to be dominated by genomic position, since you'll have things like GC bias or other random library-prep artefacts that are common between samples.

Feel encouraged to post feature requests to github, since I'll lose track of them otherwise :)

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

That sounds like there is no point in running a PCA on ChIPseq data then. I colored the samples of my experiment based on the group they belong to and honestly I can't think of any explanation as to why they should group together except for the obvious reason: the difference in treatment. Groups in blue and yellow are expected to be very similar, by the way.

[Figure: plotPCA]

ADD REPLY • link 6.8 years ago by Roman Hillje ▴ 90