Question

DESeq2: subset dds or vsd to plot PCA with specific genes only

2

Entering edit mode

5.1 years ago

Cece ▴ 30

Hi All,

I've run DESeq2 on my counts comparing two conditions, treated vs untreated. I have my dds object and have generated my vsd object using the command:

vsd <- vst(dds, blind = FALSE)

then plotted my PCA using:

plotPCA(vsd, intgroup = "condition", ntop = 500)

However, I'd like to produce another plot using just a subset, which is those genes I identified as DE (padj < 0.05). My problem is, when I try to extract a subset from the vsd data using:

de <- rownames(res[res$padj<0.05 & !is.na(res$padj), ])
de_mat <- assay(vsd)[de,]

Then try to use this object to plot the PCA:

plotPCA(de_mat, intgroup = "condition", ntop = 500)

I receive the error message:

Error in (function (classes, fdef, mtable)  : 
 unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

Which makes sense because de_mat is a matrix and vsd is not. Can anyone help me with another way to extract my subset of genes from the dds/ vsd object so I can use the subsequent object to plot a PCA?

I need to confirm, via PCA, that this subset does indeed discriminate between the two groups sufficient to justify my downstream analyses. My PCA using the full gene set shows mixing of the groups and it would be great to be able to justify using a great visual like a PCA for just the DE genes at this point, so that I can show clear separation of the groups.

I greatly appreciate all help in advance. If anyone has ideas on other visuals, I'm open to those as well. I've already covered heatmaps and volcano plots, though; it's just that my PI really likes the idea of the DE genes only PCA.

RNA-Seq R DESeq2 plotPCA subset • 9.8k views

ADD COMMENT • link updated 5.1 years ago by Kevin Blighe 87k • written 5.1 years ago by Cece ▴ 30

score 4 · Answer 1 · 2019-03-25

4

Entering edit mode

5.1 years ago

Kevin Blighe 87k

Hey,

The issue is that plotPCA() is a DESeq2-specific function - it does not work with data-matrices or data-frames. To generate your PCA bi-plot, you will have to do:

plotPCA(vsd, intgroup = "condition", ntop = 500)

...or:

plotPCA(vsd[de,], intgroup = "condition", ntop = 500)

If you want to run PCA analyses on a simple data-matrix or data-frame, then you can use my own package: https://github.com/kevinblighe/PCAtools

Kevin

ADD COMMENT • link 3.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks, Kevin,

Subsetting vsd worked great for me. I've also looked at PCATools, which looks really comprehensive but a little intimidating to me. My counts data is in a text file, with genes as rows and columns as samples:

          sample1   sample2    sample3
gene1    1063       1092       1487
gene2    240        309        329
gene3    47         26         31

My phenotypic data is also a text file, with samples as rows and each column representing a different phenotypic trait:

           Age    Gender    condition
sample2    13     Female    treated
sample2    15     Male      treated
sample3    8      Female    untreated

How do I import this data into PCATools?

Cheers, Cece

ADD REPLY • link 5.1 years ago by Cece ▴ 30

0

Entering edit mode

Hey, you would just have to do:

require(PCAtools)
p <- pca(x = MyData, metadata = MyMetadata, removeVar = 0.1)

However, I have a strict check in place whereby the rownames of MyMetadata have to be the same as the colnames of MyData (same names and in same order).

For this, as per DESeq2's function, the input data should be the transformed counts.

Don't worry if you just want to generate a bi-plot - that is easy with DESeq2's function.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Hey @Kevin, what is the difference between top 500 in vsd and top 500 of the vsd based on the padj values for de?

I thought when you run the first one, it was plotting top 500 genes most variable based on p value? if not, what those it mean ntop=500 in vsd?

plotPCA(vsd, intgroup = "condition", ntop = 500)

ADD REPLY • link 3.4 years ago by ecg1g15 ▴ 30

1

Entering edit mode

ntop = 500 is choosing the top 500 genes based on variance (across all samples). On the other hand, the top DE (differentially expressed) genes are related to your groups that are being compared and are chosen based on p-values and/or fold changes

ADD REPLY • link 3.4 years ago by Kevin Blighe 87k

0

Entering edit mode

ok thanks, I think I understood. However, my groups being compared I said to be between "samples" so I do not understand how this is generating two different PCAs? if both are based on pvalues?

dds <- DESeqDataSetFromMatrix(counts,
                                  colData = coldata,
                                  design=~sample)

ADD REPLY • link 3.4 years ago by ecg1g15 ▴ 30

0

Entering edit mode

What do you mean by "different PCAs"? - bi-plots?

ADD REPLY • link 3.4 years ago by Kevin Blighe 87k

0

Entering edit mode

After I have run the above dds and done the vsd transformation, I can run two PCAs and it generates very different visualisations. This can only be because the ntop 500 on each of them is not the same subset of genes. Therefore my question of why? Those ntop most variable genes of the first one, arent they based on the same pvalue I set on (de)?

plotPCA(vsd, intgroup = "sample", ntop = 500)

plotPCA(vsd[de,], intgroup = "sample", ntop = 500)

ADD REPLY • link 3.4 years ago by ecg1g15 ▴ 30

0

Entering edit mode

No, ntop is, in no way, based on any p-value. ntop selects the top genes based on variance across all samples.

ADD REPLY • link 3.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Ok, sorry, then I might have been misunderstanding the statistics behind it. Then the most variable gene across samples is NOT the one that presents the smallest p-value.

So the top gene would be the one that presents the highest variance to the mean averaged across samples? Whereas the top of (de) is the one which present the most significant variation across samples based on p value? Is this right?

ADD REPLY • link 3.4 years ago by ecg1g15 ▴ 30

3

Entering edit mode

So the top gene would be the one that presents the highest variance to the mean averaged across samples?

It's just the gene that has the highest variance (in R, via var()) across, in this case, the transformed variance-stabilised expression levels.

Whereas the top of (de) is the one which present the most significant variation across samples based on p value?

It's the one that has the lowest p-value after comparing expression levels between your two groups being compared. Granted, this may indirectly be affected by dispersion and, therefore, also, variance. Note, that, in DESeq2, the statistical inferences are made on the normalised counts, not the transformed variance-stabilised or regularised log expression levels.