Hello!

I have RNA-Seq data with 32 samples. I used Salmon to generate a read count matrix. I want to generate a PCA plot to look at the relationship between my samples. How do I use the read count matrix to perform the PCA analysis?

Thanks!

PCA plot from read count matrix from RNA-Seq

4

Entering edit mode

2.5 years ago

rachel.kubik12
•
80

Hello!

I have RNA-Seq data with 32 samples. I used Salmon to generate a read count matrix. I want to generate a PCA plot to look at the relationship between my samples. How do I use the read count matrix to perform the PCA analysis?

Thanks!

42

Entering edit mode

*NB - this is now a Bioconductor R package: https://github.com/kevinblighe/PCAtools*

You should normalise your data prior to performing PCA. In the code below, you'll have to add plot legends yourself, and also colour vectors (passed to the '*col*' parameter).

Then, assuming that you have transcripts as rows and samples as columns:

*NB - in this code, the plots I've shown don't necessarily match the exact code, but the plot type is the same*

**[Edit: also take a look at my definition of PCA here: PCA in a RNA seq analysis]**

```
project.pca <- prcomp(t(MyReadCountMatrix))
summary(project.pca)
#Determine the proportion of variance of each component
#Proportion of variance equals (PC stdev^2) / (sum all PCs stdev^2)
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100
```

```
barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))
```

```
par(cex=1.0, cex.axis=0.8, cex.main=0.8)
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)
pairs(project.pca$x[,6:10], col="black", main="Principal components analysis bi-plot\nPCs 6-10", pch=16)
```

```
par(mar=c(4,4,4,4), mfrow=c(1,3), cex=1.0, cex.main=0.8, cex.axis=0.8)
#Plots scatter plot for PC 1 and 2
plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"))
points(project.pca$x, col="black", pch=16, cex=1)
#Plots scatter plot for PC 1 and 3
plot(project.pca$x[,1], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,1], project.pca$x[,3], col="black", pch=16, cex=1)
#Plots scatter plot for PC 2 and 3
plot(project.pca$x[,2], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,2], project.pca$x[,3], col="black", pch=16, cex=1)
```

```
require(scatterplot3d)
par(mar=c(4,4,4,4), cex=1.0, cex.main=0.8, cex.axis=0.8)
scatterplot3d(project.pca$x[,1:3], angle=-40, main="", color="black", pch=17, xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), zlab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"), grid=FALSE, box=FALSE)
source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))
source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))
```

0

Entering edit mode

Hi Kevin,

I guess we obtain reads counts from HtSeq, and performed DEG analyses using edgeR. I have 23 samples. Previously, I generated a PCA plot using read.count.matrix data following your codes above. I also would like to ask you; what is difference between generating a PCA plot from PC1 to PC5, and PC1 to PC10?

1

Entering edit mode

Hi Mehmet, I have not used EdgeR (I prefer DESeq2), but PCA should be performed on normally-distributed counts, so, your normalised logged counts.

There is no difference between the generation of the pairs plots for PC1-5 and PC6-10. It is just a matter of fitting all of the plots on the same page. Technically, you can generate a pairs plot for PC1-PC100, but you would require a very large plotting window.

0

Entering edit mode

Hi Kevin, Thank you very much.

Should I do anything in my FPKM data prior to PCA plot?

I have run analyses to generate PCA plot (pairs plots). I would like to ask you a few things: 1. What are the dots in boxes? 2.What are the numbers around each boxes?

I have 17 samples in my data, but I have 15 dots in boxes in the plot.

could you please help me about how to interpret results ?

Thank you

0

Entering edit mode

If you have used HTSeq followed by EdgeR, then you should have logged counts(?). How does the distribution appear when you run: `hist(MyData)`

?

Can you post an image of the pairs plot and provide the code that you are using to execute the PCA?

You can upload figures/images here, and then copy/psate the URL here in your response.

Kevin

0

Entering edit mode

Hi Kevin,

I have uploaded two figures.

my commands:

```
project.pca <- prcomp(t(MyReadCountMatrix))
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100
barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))
par(cex=1.0, cex.axis=0.8, cex.main=0.8)
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)
```

0

Entering edit mode

Looks okay to me. In the second one, the title is incorrect though, as it's PC1-10. You could also make the dots smaller here, but it's not that important as these plots are mainly just for QC.

What's the variance explained on PC1 and 2?

Looks fine.

Edit: as you go further down along the PCs, you always find 'odd' plots like the one for PC10. Nothing to worry about, though. I'd worry if I saw that on PC1-3

0

Entering edit mode

Thank you for help.

I do not know variance between PC1 and PC2. When I run commands, I did not get variance value. I have 17 PCs when I type below:

```
Standard deviations:
[1] 3.49034964 1.22304723 1.05526472 0.99765777 0.76043084 0.56623044 0.37994839
[8] 0.24585524 0.23052947 0.17215343 0.08918422 0.07063718 0.06521970 0.05890600
[15] 0.04826593 0.04446027 0.03575958
```

How to interpret the graphs? and what is the meaning of the dots in the boxes? As I have 17 PCs, should I show 17 PCs in the plot?

In some tutorials I saw that people write column names instead of PC1, PC2 .....

Should I use column names also?

There are several type of PCA plots. Which should I use in my paper?

Like this one:

https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot

Sorry for asking too many questions.

Thank you

1

Entering edit mode

The PCA bi-plot at https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot is generated using the plotPCA function in DESeq2. It is usually used when processing RNA-seq data. However, this function is biased because it removes a large proportion of genes of low variance from the dataset prior to performing PCA, which defeats the purpose of conducting the PCA in the first place. My methods perform a completely unbiased PCA.

- If you follow my code, the percentage of overall variance explained
by each principal component will be held in
`project.pca.proportionvariances`

- Each dot is a sample.
- You do not have to plot all PCs from 1-17. Most people only plot PC1 versus PC2. Look at my code at the top of this page under the heading 'Bi-plots'
- PCA is primarily used for QC purposes to show that there are no outliers in the dataset. Looking at your PCA plots, there are no outliers. That is all you need to say. It is possible to go a lot more advanced into PCA, but, I would not recommend it unless you fully understand the PCA process

0

Entering edit mode

Just a quick additional question, I have normalized my counts using DESeq, and then log transformed (log2) my data, but do you do something about the zero's in the datasets when doing the log transformation? Since they will give a infinite negative value. Do you add something to those zero's? Also, do you do any removal of lowly expressed genes, nor not for this kind of analysis? Thanks!!

0

Entering edit mode

Oh, to manage those, you should use the regularised log transformation, the function for which is supplied with DESeq2: https://rdrr.io/bioc/DESeq2/man/rlog.html

To then access the r-logged counts, you use `assay()`

:

```
x <- rlog(dds)
assay(x)
```

PCA is then performed on these r-logged counts.

0

Entering edit mode

Hi Kevin, I'm sorry to be asking more questions, I'm totally new to this. I'm trying to use the code you sent me, but it's giving me errors:

"rlog() may take a long time with 50 or more samples, vst() is a much faster transformation Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers"

I'm wondering if this is based on the way I normalized; I used a script from a colleague from DESeq (not DESeq2), using the following code:

```
conds <- factor(c(colnames(counts)))
cds <- newCountDataSet(counts, conds)
cds <- estimateSizeFactors(cds)
normalized_counts <- counts(cds, normalized=TRUE)
```

Do you have an example of code that could get me from raw counts to logged normalized counts to use in PCA?

If you need more info on my samples please let me know. Thanks!

0

Entering edit mode

Oh, that error is common. When the dataset is large, it is indeed better to run `vst()`

, and then you access these again via the `assay()`

function. These are perfectly fine in place of logged counts.

If you're starting from your own data matrix of raw counts (*cts*), then you just need:

```
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
normalized_counts <- counts(dds, normalized=TRUE)
```

Are you actually testing anything as in differential expression? Here, `condition`

would be, like, the *CaseControl* variable.

Then, for `vst()`

:

```
vsd <- vst(dds, blind=TRUE)
assay(vsd)
```

Note that DESeq2 already has its own PCA implementation: https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#principal-component-plot-of-the-samples

If you have issues running DESeq2 itself, you may consider opening a new question

0

Entering edit mode

Thanks! I'm going to try around with DESeq2 and if I need more help I will open a new question.

Just one more question for now: do you know the MDS plot? I've used it during my analysis in edgeR, and do you know how this compares to a PCA plot? I've understood that also here the distance between the dots in the plot is a measure of distance between the datasets (Euclidean distance) (explanation in R: This function is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the microarray context is used. The distance between each pair of samples (columns) is the root-mean-square deviation (Euclidean distance) for the top top genes. Distances on the plot can be interpreted as leading log2-fold-change, meaning the typical (root-mean-square) log2-fold-change between the samples for the genes that distinguish those samples.)

Would you know how this would compare to PCA?

0

Entering edit mode

Sorry for another reply. I've run the code you gave me and it worked, but I still have the same 'problem' now, in that there is no normal distribution in my data, since the large number of genes that had 0 counts before, now just shifted to around 4 (after normalization and log transformation). Therefore, a histogram of all genes of one of my samples looks like this:

I think the data except for the large bar of unexpressed genes might have a distribution more similar to a normal distribution, but how to deal with this?

0

Entering edit mode

Hey, that does not look like logged or variance-stabilised data. Can you show the code that you used? Regarding MDS and PCA, the mathematical formula behind each is different, but they both ultimately show relationships between samples / variables. Note that MDS can be performed using the PCA transformation.

You may consider a new question for that, as my time is extremely limited today.

0

Entering edit mode

Hi Kevin,

Thanks! That doesn't sound good.. I've started a new post here: How to normalize count data for PCA in R - something goes wrong in case you would like to comment which I would greatly appreciate. I've posted the code there as well.

Thanks for your help so far!

0

Entering edit mode

I want to explore RNAseq data from Expression Atlas (EMBL-EBI database) to get an impression on the similarity/differences of RNA-sequencing samples, i.e. to identify subgroups or outliers. The dataset downloaded from the database was in TPM, so I changed NAs into '0's and multiplied the values by 1E+6 (to get integer values). I want to log transform the data (rlog) but got stuck at:

```
log.data4pca = rlog(data4pca)
```

Which results an error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’

Can anyone help me with that?

Data:

```
Gene ID sample1 sample2 sample3 sample4 sample5
ENSG00000000003 0.6 65.0 30.0 61.0 62.0
ENSG00000000005 1.0 0.4 0.8
ENSG00000000419 87.0 83.0 66.0 98.0 46.0
ENSG00000000457 3.0 11.0 9.0 10.0 5.0
ENSG00000000460 6.0 3.0 3.0 3.0 2.0
ENSG00000000938 308.0 6.0 6.0 12.0 6.0
ENSG00000000971 1.0 61.0 31.0 87.0 738.0
ENSG00000001036 38.0 85.0 74.0 28.0 37.0
ENSG00000001084 25.0 24.0 36.0 41.0 78.0
ENSG00000001167 45.0 25.0 19.0 24.0 13.0
ENSG00000001460 1.0 6.0 3.0 3.0 0.7
ENSG00000001461 5.0 20.0 13.0 17.0 1.0
ENSG00000001497 28.0 15.0 12.0 20.0 8.0
ENSG00000001561 71.0 34.0 27.0 16.0 11.0
ENSG00000001617 2.0 10.0 16.0 80.0 2.0
ENSG00000001626 86.0 62.0 1.0 1.0
ENSG00000001629 12.0 31.0 25.0 32.0 11.0
ENSG00000001630 2.0 4.0 9.0 5.0 4.0
ENSG00000001631 9.0 18.0 19.0 21.0 11.0
ENSG00000002016 4.0 9.0 10.0 6.0 2.0
ENSG00000002079
ENSG00000002330 18.0 30.0 20.0 10.0 9.0
ENSG00000002549 55.0 80.0 90.0 66.0 114.0
ENSG00000002586 46.0 117.0 99.0 160.0 69.0
ENSG00000002587 1.0 3.0 2.0 7.0 0.3
ENSG00000002726 1.0 257.0 248.0 1.0 0.1
ENSG00000002745 0.1 0.2 0.2 0.5
ENSG00000002746 0.1 0.1 0.2
ENSG00000002822 9.0 4.0 4.0 4.0 3.0
ENSG00000002834 123.0 121.0 148.0 51.0 45.0
```

Script:

```
library(PCAtools)
library(DESeq2)
#copy data form Excel
data4pca = read.table(pipe("pbpaste"), sep="\t", header=T)
#extract rownames from first column
rownames(data4pca) = data4pca[,1]
#remove first column with geneIDs
data4pca = data4pca[,-1]
# change NAs into '0's
data4pca[is.na(data4pca)] <- 0
#to get integer values (TPM)
data4pca = data4pca * 1000000
# rlog transform data
log.data4pca = rlog(data4pca)
#principal components analysis (PCA) on log-transformed data
pcadata <- pca(log.data4pca)
```

0

Entering edit mode

Your package (PCAtools) has no rlog

```
> log.data4pca = rlog(data.matrix(data4pca))
Error in rlog(data.matrix(data4pca)) : could not find function "rlog"
```

with loaded DESeq2

```
> log.data4pca = rlog(data.matrix(data4pca))
converting counts to integer mode
Error in validObject(.Object) :
invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
In addition: Warning message:
In mde(x) : NAs introduced by coercion to integer range
```

its strange as I have all NAs converted to '0's, even

```
> log.data4pca = rlog(data.matrix(data4pca + 1))
```

returns the same error

0

Entering edit mode

```
> str(data4pca)
'data.frame': 45946 obs. of 13 variables:
$ bone.marrow : num 6.00e+05 0.00 8.70e+07 3.00e+06 6.00e+06 3.08e+08 1.00e+06 3.80e+07 2.50e+07 4.50e+07 ...
$ colon : num 6.5e+07 1.0e+06 8.3e+07 1.1e+07 3.0e+06 6.0e+06 6.1e+07 8.5e+07 2.4e+07 2.5e+07 ...
$ duodenum : num 3.0e+07 4.0e+05 6.6e+07 9.0e+06 3.0e+06 6.0e+06 3.1e+07 7.4e+07 3.6e+07 1.9e+07 ...
$ esophagus : num 6.1e+07 8.0e+05 9.8e+07 1.0e+07 3.0e+06 1.2e+07 8.7e+07 2.8e+07 4.1e+07 2.4e+07 ...
$ liver : num 6.20e+07 0.00 4.60e+07 5.00e+06 2.00e+06 6.00e+06 7.38e+08 3.70e+07 7.80e+07 1.30e+07 ...
$ lymph.node : num 6.00e+06 1.00e+05 1.32e+08 1.70e+07 7.00e+06 6.30e+07 3.40e+07 3.60e+07 2.60e+07 5.80e+07 ...
$ rectum : num 6.70e+07 1.00e+06 8.30e+07 1.30e+07 5.00e+06 7.00e+06 9.80e+07 1.13e+08 2.70e+07 3.30e+07 ...
$ saliva.secreting.gland: num 5.3e+07 4.0e+05 3.5e+07 6.0e+06 1.0e+06 3.0e+06 1.8e+07 1.7e+07 7.0e+06 1.0e+07 ...
$ small.intestine : num 2.2e+07 4.0e+05 7.8e+07 9.0e+06 3.0e+06 1.0e+07 5.3e+07 7.6e+07 2.9e+07 2.3e+07 ...
$ spleen : num 1.7e+07 5.0e+05 9.7e+07 1.4e+07 5.0e+06 1.8e+08 1.8e+07 6.3e+07 5.0e+07 3.9e+07 ...
$ stomach : num 2.2e+07 0.0 6.3e+07 1.0e+07 2.0e+06 6.0e+06 6.0e+07 7.1e+07 2.7e+07 2.5e+07 ...
$ tonsil : num 1.40e+07 3.00e+05 1.14e+08 1.40e+07 1.00e+07 3.60e+07 2.10e+07 4.10e+07 2.30e+07 4.90e+07 ...
$ vermiform.appendix : num 1.10e+07 2.00e+06 1.00e+08 1.30e+07 8.00e+06 1.72e+08 6.00e+07 7.00e+07 2.10e+07 4.20e+07 ...
```

0

Entering edit mode

Just looks like there is still something wrong with your formatting. Look at this example:

`NA`

```
mat <- round(matrix(rexp(200, rate=.1), ncol=20), 0)
mat[7,9] <- NA
```

`NA`

```
require(DESeq2)
rlog(mat)
Error in DESeqDataSet(se, design = design, ignoreRank) :
NA values are not allowed in the count matrix
```

`NA`

to 0 and re-try```
mat[is.na(mat)] <- 0
rlog(mat)
converting counts to integer mode
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[1,] 1.6303296 2.101263 1.6303296 4.4426980 3.916323 2.5563570 1.6870700 4.2057665 1.4127739 2.7271864 2.5850314 2.5552601 2.286887 2.5540934 2.557778 2.5569548 3.2821198 2.9502635 3.622561 1.703656
[2,] 2.9635154 2.963067 2.9635154 2.4257258 2.943951 3.9320118 4.1476477 2.0566027 2.9753406 2.0970838 2.3144479 3.6813475 2.965651 3.3690668 4.196746 3.4155491 2.9628123 2.0111253 2.481567 2.962925
[3,] 2.2500342 3.188301 3.6284485 2.7554526 3.193896 0.4068467 4.4410790 0.6108451 3.7866725 0.9955659 2.8728750 1.1451716 3.620244 1.1661840 1.239904 3.7693159 1.7485206 0.5350876 2.592244 2.284964
[4,] 4.4256292 3.108021 0.8907872 1.8005166 3.825927 2.8045730 4.6617094 0.7861423 3.1211707 1.7357820 2.7223201 3.1454056 4.921423 1.7119954 1.809768 3.3314205 1.5885089 2.6071480 2.105057 2.414051
[5,] 4.2780751 4.160892 5.4845108 3.3502959 3.350899 3.2308789 3.3498840 3.3493575 3.3491791 3.3481371 3.3482641 1.8285660 4.849273 1.4041449 2.977995 3.2891094 2.7175965 3.3482017 3.349179 3.489831
[6,] 1.3250958 2.979303 3.1284451 4.5200420 3.937701 3.9664888 4.0853001 4.0207723 3.4823841 3.1203092 1.3941709 3.5961874 4.379111 3.1715398 3.914777 4.2228263 3.4001319 4.0778840 5.350684 1.630279
[7,] 1.4466070 1.224425 3.6512460 3.5304046 2.286591 1.9162281 1.5021949 3.6600331 0.5854978 2.3790455 0.6514361 2.5760491 2.643559 1.6948982 2.201923 2.0457115 2.8874726 1.1159874 3.051115 3.568114
[8,] 1.9664247 4.327562 0.7690776 3.9564714 4.720774 2.1731031 3.6682082 1.3603538 3.2485642 2.5102086 2.0997103 2.6595941 3.671308 1.7433200 2.481331 0.5302784 0.4837551 2.9761034 2.263407 1.177239
[9,] 2.6268004 3.554450 3.7418866 0.9981525 1.113274 3.4400201 3.6312819 1.8713298 1.3540877 2.4580801 0.2464204 3.5573662 5.202193 0.8450194 2.100423 3.6015767 2.9490219 1.9511093 3.711696 3.622012
[10,] 0.5877168 2.518681 2.1797479 1.5528766 3.777592 2.8244428 0.6276905 3.2713734 2.6839627 0.9035905 1.6319140 0.6812501 4.749080 -0.1144293 2.163285 3.9095113 1.6715744 1.2102973 1.896780 3.886975
attr(,"betaPriorVar")
[1] 4.796553
attr(,"intercept")
[,1]
[1,] 2.648234
[2,] 2.991484
[3,] 2.311582
[4,] 2.675867
[5,] 3.392713
[6,] 3.485171
[7,] 2.230926
[8,] 2.439339
[9,] 2.628809
[10,] 2.130695
```

To check if there is a `NA`

anywhere in your data, do:

```
anyis.na(mat) == TRUE)
```

It returns TRUE if there is a `NA`

present.

0

Entering edit mode

Hello, you can add a custom legend to any plo using something like:

```
legend("bottomright", bty="n", cex=0.8, title="MyFirstLegend", c("Chinese", "Iraqis", "Iranians", "Syrians", "Russians"), fill=c("red1", "black", "forestgreen", "red2", "darkred"))
```

The first parameter can be any of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", or "center"

Kevin

Loading Similar Posts

Hi Kevin, Thanks! Can you please tell me how can I color two groups, one in green and other in red?

Hey, this is now a Bioconductor package. So, you should be able to colour the groups easily by using that: https://github.com/kevinblighe/PCAtools ?