I have just used PCA tool to create a PC loadings graph in order to identify the genes explaingin the variance for PC3 using the following code:
library(PCAtools)
rld.sm.output <- as.data.frame(assay(rld.sm))
metadata <- coldata
require(org.Hs.eg.db)
mapping <- mapIds(
org.Hs.eg.db,
keys = rownames(rld.sm.output),
column = 'SYMBOL',
keytype = 'ENSEMBL')
rownames(rld.sm.output) <- make.unique(ifelse(is.na(mapping), rownames(rld.sm.output), mapping))
pca.project <- pca(rld.sm.output, metadata = metadata, removeVar = 0.1)
plotloadings(pca.project,
rangeRetain = 0.05,
labSize = 4.0,
title = 'Loadings plot - top 5% contributing',
subtitle = 'PC1, PC2, PC3, PC4, PC5',
caption = 'Top 5% variables',
drawConnectors = TRUE)
My Question:
How can I extract the top loading genes from the PC of interest (PC3 in this case)? When I extract the loadings and then sort by largest to smallest, the genes dont match my loadings graph (apart from the two genes with the highest loading):
write.csv(pca.project$loadings[1:3], file='PCAloadings.csv')
Hi again Kevin,
Thank you for your response it has helped me to understand better. But one thing:
In the csv file I have shared, I have sorted by PC3 loadings by size in column D.
I see PI3 and ENSG00000267303 at the top of column D...
Consequently I do see PI3 and ENSG00000267303 in the PC3 section of my
plotloadings()
graph...but why do I not see PPP1R1B in the graph if the loadings .csv suggests it is also in the top 5%?
Hope that makes sense
plotloadings usually displays a message to terminal when the function is executed. Is the gene listed there? It can be that it falls just outside the top 5% of the range