Entering edit mode
6.2 years ago
lucas.t
•
0
Hi,
I am using MethylKit to analyse gene methylation data and I am having trouble getting the PCASamples function to run. Initial sample files come from Bismark Coverage2Cytosine.
Using R 3.3.3, MethylKit 1.0.0
genome_fasta="~/Mus_musculus.GRCm38.fa"
input_files_base_dir="~"
file_list=file.path(input_files_base_dir,c("Control1.txt","Control2.txt","Control3.txt","Control4.txt","Control5.txt","Control6.txt","Test1.txt","Test2.txt","Test3.txt","Test4.txt","Test5.txt","Test6.txt"))
ids = c("Control1","Control2","Control3","Control4","Control5","Control6","Test1","Test2","Test3","Test4","Test5","Test6")
treatment = c(0,0,0,0,0,0,1,1,1,1,1,1)
methData <- methRead(
location = file_list %>% as.list,
sample.id = ids %>% as.list,
assembly = genome_fasta,
dbtype = NA,
pipeline = "bismarkCytosineReport",
header = FALSE,
context = "CpG",
resolution = "base",
treatment = treatment,
dbdir = getwd(),
mincov = 10
)
meth = unite(methData,destrand=FALSE)
PCAplot <- PCASamples(meth)
pdf(file="PCAPlot.pdf")
plot(PCAplot)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
I checked if I had any NA values in the columns using the following
M <- sapply(meth, function(x) sumis.na(x)))
However, it did not find any NA values. I'm pretty new to MethylKit and R so I'm looking for any ideas or next steps.
Thanks in advance!
What are the contents of PCAplot?
The documentation regarding PCASamples is here, however I'm not sure how to describe it otherwise. str() and head() both return NULL but I assume PCAplot is a plot so that wouldn't work anyways. Due to the size of the initial files I'm working on a server but I don't have access to X11 forwarding so I can't use R Studio and actually see what the plot looks like. Any suggestions on how to get more data on the contents of PCAplot?
Please try base R code for the PCA plot. You can use my code, here: A: PCA plot from read count matrix from RNA-Seq
Just to see if you can actually get the plot generated.
Using the above process to get the variable "meth"
Below is what meth looks like: str(meth)
So, there's the issue: PCA can only be performed on numerical data with absolutely no missing values. You need to remove all textual data / characters. In this case, it looks like the first four columns are just genomic co-ordinates - you should remove these and then try again.
Hope that this makes sense - it should work fine when you remove those.
What type of data does PCA analysis require? The integer columns are actually in groups of three per animal as the following data points: Read Coverage - # of T - # of C. The interesting information is in the ratio of C to T at the different chromosomal coordinates across animals, so would I first calculate that ratio and input that?
PCA is a mathematical transformation that is fundamentally based on variance. Therefore, it just requires a matrix of numbers (no missing values permitted) of greater than 2 X 2 dimensions. Please take a look here for an expanded explanation: A: PCA in a RNA seq analysis
So, it can be performed on anything. Performing it on read coverage and ratios does not make much sense to me. It is typically applied to, for example:
Not sure if that helps?
Yes it does help! And now I have two follow up questions:
Does PCA determine variance by row? When PCA is collecting the data on the initial variance between samples that it will reduce into the set of principal components, this initial variance is calculated by looking at the variance between values that are contained within a row? So not having the gene coordinate information doesn't really matter, PCA will still match the appropriate values to each other to calculate the variance between the samples (columns)
Would PCA work with ratios? In methyl-seq the important information is calculated by the number of cytosines that do not convert into thymine, which is calculated by the number of cytosines / the read count. To use the raw counts of cytosine doesn't really make sense as it wouldn't account for different numbers of reads per gene coordinate.
Thanks for all your help Kevin!
Yes, if we have samples as rows and genes as columns:
In a nutshell, it is summarising the variance (to be more precise, covariance) of the genes across each sample. The resulting principal components (PCs) are vectors that are uncorrelated and contain a single value for each sample. The values for the samples along PC1 will be the strongest; thus, PC1 will explain the largest amount of variance in the entire dataset. PC2 is second-greatest, PC3, third, et cetera.
It is possible to extract the component loadings from each PC, which show the strength of association of each gene to each PC. Genes of high variance across your samples will have high component loadings. If you do a RNA-seq of blood versus some other tissue, then the highest variance genes would most likely be HB (haemoglobin) genes.
If you've got a matrix of numbers, then you can do PCA. I'm just not sure of the standards in methyl-seq to comment further, i.e., I don't know what the performing of PCA on these ratios would mean. If they are the standard measures in methyl-seq, then fair enough!
Most people only use PCA to check for outliers. Those who understand it better know how to dig deeper.
So I've managed to run the pca analysis using your suggested code and it returns the following:
One thing I wanted to check is if I've applied your code properly. I noticed there was a transpose function in your suggested code and the presence of 12 PCAs when I have 12 samples makes little alarms go off.
Yep, you're digging a little too deep here ;)
The maximum number of principal components (PCs, also called eigenvectors) that can be produced is limited by the dimensions of your input matrix. So, the transformation just produces 12 because it's just performing matrix multiplications on your 12-sample data matrix in the background. Not sure if this helps. If you had 45 samples, you'd get 45 eigenvectors.
A little unknown fact: the
prcomp()
function in R, which advertises itself as performing principal components analysis (PCA), is actually performing a data transformation called single value decomposition (SVD). Coincidentally, they end up producing the same values. Other quirks like this exist in statistics, such as conditional logistic regression and Cox proportional hazards, I believe.If you're interested in looking at the formulae for PCA, you could look at the Appendix of my white paper: Haplotype Classification Using Copy Number Variation and Principal Components Analysis (I'm long gone from that institution though)