Question

PCA on sample GO terms

1

8.1 years ago

gilagalad ▴ 10

Hi,

I would like to cluster (or run PCA on) microarray samples across two different platforms. I am afraid that clustering on the genes common to both platforms would be influenced more by the platform (different probes measuring different sequences of the transcripts, and on different scales) than by the treatment effect. As there is generally better consistency in the upregulated processes (enriched GO terms, pathways), I would like to cluster based on GO terms.

Suppose cells are treated with compound A, B, C or D (each done in several replicates). Comparing each treatment to the untreated control yields lists of differentially regulated genes, from which I determine the enriched GO terms (say, for upregulated genes): GO.A, GO.B, GO.C and GO.D. This would be measured on platform 1. Then I would have cells treated with compound E, compared to an untreated control in the same way, to get GO.E. This experiment would be on platform 2. I would like to know how similar the effect of treatment E is to that of A, B, C and D.

One solution that comes to my mind is to first find the common GO terms that are present on both platforms, then compute GO.A, GO.B, GO.C, GO.D and GO.E. The GO terms not significantly changed (upregulated) would get a p-value of 1, so I would have p-values for all of the common GO terms. Then I would do, for example, PCA on the p-values (I think they should be scaled first) and look at the distances among the samples.
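A minimal sketch of the procedure described above, assuming made-up p-values and a hand-picked list of common GO terms (none of these numbers come from real data). As a design choice that is not part of the original proposal, p-values are converted to -log10(p) so that non-significant terms (p = 1) map to 0. Each term is then standardized across samples, and the first two principal components are obtained by power iteration on the sample-by-sample Gram matrix, so the example needs only the standard library. In practice a library PCA routine would normally be used instead.

```python
import math

# Hypothetical enrichment p-values per treatment (A-D: platform 1, E: platform 2).
# A term missing from a treatment's dict was not significantly enriched there.
enriched = {
    "A": {"GO:0006954": 1e-8, "GO:0006955": 1e-5, "GO:0008283": 1e-2},
    "B": {"GO:0006954": 1e-6, "GO:0006955": 1e-4},
    "C": {"GO:0007049": 1e-6, "GO:0008283": 1e-7},
    "D": {"GO:0006915": 1e-3, "GO:0007049": 1e-5},
    "E": {"GO:0006915": 1e-2, "GO:0006954": 1e-7, "GO:0006955": 1e-3},
}
# Assumed list of GO terms that can be tested on both platforms.
common_terms = ["GO:0006915", "GO:0006954", "GO:0006955", "GO:0007049", "GO:0008283"]
samples = sorted(enriched)

# Non-significant terms get p = 1, i.e. a score of -log10(1) = 0.
X = [[-math.log10(enriched[s].get(t, 1.0)) for t in common_terms] for s in samples]


def standardize(matrix):
    """Center each column (GO term) and scale it to unit variance."""
    n = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def pca_scores(Z, k=2, iters=1000):
    """Return (scores, eigenvalues) for the first k principal components."""
    n = len(Z)
    gram = [[sum(a * b for a, b in zip(Z[i], Z[j])) for j in range(n)] for i in range(n)]
    vectors, eigenvalues = [], []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed start vector, for reproducibility
        for _step in range(iters):
            w = [sum(g * x for g, x in zip(row, v)) for row in gram]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * gram[i][j] * v[j] for i in range(n) for j in range(n))
        vectors.append(v)
        eigenvalues.append(lam)
        # Deflate: remove the component just found before looking for the next one.
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    scores = [[math.sqrt(max(lam, 0.0)) * vec[i] for vec, lam in zip(vectors, eigenvalues)]
              for i in range(n)]
    return scores, eigenvalues


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


Z = standardize(X)
scores, eigenvalues = pca_scores(Z)
pc = dict(zip(samples, scores))

# Distance from treatment E to each platform-1 treatment in PC1/PC2 space.
dist_to_E = {s: distance(pc["E"], pc[s]) for s in samples if s != "E"}
for s, d in sorted(dist_to_E.items(), key=lambda item: item[1]):
    print(f"E vs {s}: {d:.2f}")
```

With these invented values, E lands closer to A and B than to C and D, simply because it shares its strongest terms with them.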

Does this make sense? Is there a better way?

Any suggestions appreciated!

Vojta

GO meta analysis microarray PCA clustering • 3.4k views

ADD COMMENT • link updated 6.1 years ago by igor 13k • written 8.1 years ago by gilagalad ▴ 10

1

It's an interesting approach. However, I think the variables used for PCA should in principle be independent of each other. GO terms, on the other hand, are structured hierarchically (as a tree, or strictly speaking a directed acyclic graph), and I am not sure whether this would break the principle of independence.
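To make the dependence concern concrete: under GO's true-path rule, a gene annotated to a term is implicitly annotated to all of that term's ancestors, so a parent term's gene set contains those of its children. The sketch below uses placeholder term names and gene sets (not real GO annotations) and measures the overlap as a Jaccard index.

```python
# Hypothetical direct annotations and parent links (placeholder names, not real GO data).
direct_genes = {
    "parent_term": {"g1"},
    "child_term_1": {"g2", "g3", "g4"},
    "child_term_2": {"g5", "g6"},
    "unrelated_term": {"g7", "g8"},
}
parents = {
    "parent_term": [],
    "child_term_1": ["parent_term"],
    "child_term_2": ["parent_term"],
    "unrelated_term": [],
}


def propagate(direct_genes, parents):
    """Apply the true-path rule: push each term's genes up to all its ancestors."""
    full = {term: set(genes) for term, genes in direct_genes.items()}
    for term, genes in direct_genes.items():
        stack = list(parents[term])
        while stack:
            ancestor = stack.pop()
            full[ancestor] |= genes
            stack.extend(parents[ancestor])
    return full


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


gene_sets = propagate(direct_genes, parents)
for term in ("child_term_1", "child_term_2", "unrelated_term"):
    print(term, round(jaccard(gene_sets[term], gene_sets["parent_term"]), 2))
# child_term_1 0.5
# child_term_2 0.33
# unrelated_term 0.0
```

Terms with overlapping gene sets tend to be enriched together, so in a term-by-sample matrix they behave like correlated variables and a whole branch of the hierarchy can end up weighing more than a single unrelated term.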

ADD REPLY • link 8.0 years ago by Giovanni M Dall'Olio 28k

1

Yes, to me that is one of the concerns: whether it breaks the independence assumption. But then again, is it viable to find DEGs from a 1x1 comparison and then look at GO? If the data are cross-platform, the ideal would be cross-platform normalization followed by finding DEGs across the 4x4 samples. That gives statistically more reliable DEGs, on which GO enrichment can be performed and then represented semantically.

ADD REPLY • link 8.0 years ago by ivivek_ngs ★ 5.2k

0

Thank you for your insight. Do you think using enriched pathways instead of GO would remedy this? Or do you have in mind another way to compare samples based on GO where the tree structure would not be a problem?

ADD REPLY • link 8.0 years ago by gilagalad ▴ 10

0

It all depends on what you want to categorize as pathways. In GO enrichment, the Biological Process (BP) ontology is closely associated with specific pathways, and even Molecular Function (MF) terms can be translated into pathways. So in a way you are asking how enriched your genes are for specific molecular functions or biological processes, and if pathways that support your hypothesis are enriched in any of the BP or MF categories, then bingo, that will help you restrict your gene list. When I refer to pathways, I usually mean pathways in KEGG, Ingenuity or Reactome, but those are more like downstream biological answers that correspond to a specific design.

I guess you are looking for a preliminary approach, so go ahead with GO terms and do either a PCA on them or a correlation plot to see which terms are closely associated. However, if you want a PCA, shouldn't it be done on the enrichment scores rather than the p-values? You could select the significant GO terms by p-value, keep their enrichment scores, use a Venn diagram to find the GO terms common to all samples, and then make a heatmap, PCA or correlation plot of the enrichment scores for those common terms to see how far apart the samples are.
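The distinction between an enrichment score and a p-value can be sketched as follows. "Enrichment score" is taken here to mean fold enrichment, which is an assumption, since tools define the score differently; the p-value is a one-sided hypergeometric test. All counts and term names are hypothetical.

```python
import math


def fold_enrichment(k, n, K, N):
    """(fraction of the DEG list in the term) / (fraction of the background in the term)."""
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K belong to the term."""
    total = math.comb(N, n)
    upper = min(n, K)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


N = 1000  # hypothetical number of annotated background genes
term_size = {"term_1": 50, "term_2": 20, "term_3": 100}  # K per term
# treatment -> (n = number of DEGs, {term: k = DEGs annotated to the term})
treatments = {
    "A": (40, {"term_1": 10, "term_2": 4, "term_3": 5}),
    "E": (60, {"term_1": 12, "term_2": 2, "term_3": 9}),
}

for name, (n, hits) in treatments.items():
    for term, k in hits.items():
        K = term_size[term]
        print(name, term,
              f"fold={fold_enrichment(k, n, K, N):.2f}",
              f"p={hypergeom_pvalue(k, n, K, N):.2g}")
```

In this reading, the p-values decide which terms enter the comparison, while the fold values (possibly log-transformed) would fill the sample-by-term matrix used for the heatmap, PCA or correlation plot.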

ADD REPLY • link 8.0 years ago by ivivek_ngs ★ 5.2k

0

Some links that could be informative:

ADD REPLY • link 8.1 years ago by Tanvir Ahamed ▴ 350

0

Thanks for the links. However, they focus mainly on reducing the datasets to common, expressed genes. This still retains some bias; I read that one should verify that the probes target the same transcript region. In general, though, the processes up- or downregulated on different platforms agree better than the individual genes do (Li et al. 2009).

I may do the GO analysis anyway and compare that manually (see side by side which lists are similar), but I thought there would be some better approach :)

ADD REPLY • link 8.0 years ago by gilagalad ▴ 10

score 2 · Answer 1 · 2016-04-01

2

8.1 years ago

Philipp Bayer 8.3k

I've done something similar with csbl.go (installation works with R 3.2.3, site is a bit outdated)

It groups all genes by GO-terms and by gene expression, here's an example picture I just made with 100 of my genes and 5 conditions:

[Image: heatmap]

The object that csbl.go makes can then be interrogated to check which genes are in which GO-group. Does that help with your question?

ADD COMMENT • link 8.1 years ago by Philipp Bayer 8.3k

0

Philipp, thank you for the suggestion. As I understand it, one needs GO-annotated genes that are the same in all of the samples, so it clusters the samples in a way that is still "gene dependent" at the GO level. Would it be possible to extend that to multiple platforms, so that it would be "gene independent" but "GO dependent"?

ADD REPLY • link 8.0 years ago by gilagalad ▴ 10

0

Aaah I see - so you could have sample (organism) A with gene A and gene B, but sample (organism) B has gene C and gene D, so you want to cluster "purely" by shared GO-terms.

Hmm you could fake an expression level for gene A and gene B for sample B by setting it to 0, and setting it to 0 for gene C and gene D in sample A, but I'm not sure whether that would work out, you may break some key assumptions.
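A small numeric check of the zero-filling idea, with invented expression values. When two samples share no measured genes, every coordinate is non-zero in at most one of them, so their squared Euclidean distance is just the sum of their squared norms and says nothing about how similar the underlying biology is.

```python
# Hypothetical log-expression values; sample_A and sample_B share no genes.
sample_A = {"geneA": 2.0, "geneB": -1.0}
sample_B = {"geneC": 1.5, "geneD": 0.5}

all_genes = sorted(set(sample_A) | set(sample_B))


def zero_filled(sample, genes):
    """Vector over all genes, with 0 for genes the sample does not have."""
    return [sample.get(g, 0.0) for g in genes]


vec_A = zero_filled(sample_A, all_genes)
vec_B = zero_filled(sample_B, all_genes)

squared_distance = sum((a - b) ** 2 for a, b in zip(vec_A, vec_B))
squared_norms = sum(a * a for a in vec_A) + sum(b * b for b in vec_B)

print(vec_A)             # [2.0, -1.0, 0.0, 0.0]
print(vec_B)             # [0.0, 0.0, 1.5, 0.5]
print(squared_distance)  # 7.5
print(squared_norms)     # 7.5
```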

In that case, my suggestion is likely to not work, sorry about that.

ADD REPLY • link 8.0 years ago by Philipp Bayer 8.3k

score 2 · Answer 2 · 2016-04-01

2

8.0 years ago

ivivek_ngs ★ 5.2k

First of all, if you have 2 conditions, treated and untreated, and (I believe) 4 replicates in each, then you should run the differential expression analysis on this 4x4 group to find the list of DEGs and then do GO term enrichment; that will be statistically viable. A 1x1 comparison followed by GO enrichment, which is what I understand from your query, is to me not statistically viable. Since your microarrays come from 2 different platforms, there might be a batch effect, so you can check this paper. You can also use RankProd to find DEGs in microarrays coming from different platforms and to normalize cross-platform errors or biases, or take a look at this thread. Then finally use all 4x4 samples to find DEGs and do GO enrichment, and if there are too many GO terms, you can proceed as described below.

Have you tried REVIGO? It does not do PCA, but it shows how your over-represented GO terms are distributed in semantic space. It offers several forms of representation, so you can check whether it might be of interest. The input is GO terms with p-values or q-values.
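As a sketch of preparing that input: the snippet below turns enrichment results into one "GO ID, p-value" line per term. The result dict, the tab separator and the 0.05 cutoff are illustrative assumptions, so check the tool's own input instructions before pasting.

```python
# Hypothetical enrichment results for one treatment: GO ID -> p-value.
results = {"GO:0006954": 1e-8, "GO:0006955": 1e-5, "GO:0008283": 0.2, "GO:0007049": 0.03}


def revigo_input(results, cutoff=0.05):
    """One 'GO_ID<TAB>p-value' line per significant term, most significant first."""
    kept = sorted((p, term) for term, p in results.items() if p <= cutoff)
    return "\n".join(f"{term}\t{p:g}" for p, term in kept)


print(revigo_input(results))
```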

ADD COMMENT • link 8.0 years ago by ivivek_ngs ★ 5.2k

0

Thanks for your reply. I added to the question description that there are 4 replicates per treatment, so finding DEGs and GO terms for each treatment would be viable. As I understand it, RankProd would be useful for a meta-analysis of DEGs, but this is not what I want. I could reduce the dataset to the "common genes" shared by both platforms, matched by gene name, then run RankProd on conditions DEG.A, DEG.B... DEG.C and do dimension reduction (PCA) of the samples based on the gene ranks. I may try it, but I don't like reducing the gene list to "common genes", which still leaves some bias. REVIGO seems promising for analysis of a single dataset, such as GO term filtering, but I don't see how to extend it to compare different samples.
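For reference, the rank product idea can be sketched in a few lines. This is a simplified illustration with invented fold changes, not the RankProd package's API: it omits the permutation-based significance estimates and does not handle ties. Because only within-comparison ranks are used, differences in measurement scale between platforms do not enter the statistic.

```python
import math

# Hypothetical log2 fold changes (treated vs control) for common genes,
# one dict per replicate comparison.
replicates = [
    {"gene1": 2.1, "gene2": 0.3, "gene3": -1.0, "gene4": 1.2},
    {"gene1": 1.8, "gene2": -0.2, "gene3": -0.7, "gene4": 1.5},
    {"gene1": 2.5, "gene2": 0.9, "gene3": -1.4, "gene4": 0.4},
]


def ranks_descending(values):
    """Rank 1 = largest fold change (ties are not handled in this sketch)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def rank_product(replicates):
    """Geometric mean of each gene's rank across the replicate comparisons."""
    genes = set.intersection(*(set(r) for r in replicates))
    per_rep = [ranks_descending({g: r[g] for g in genes}) for r in replicates]
    return {g: math.prod(rk[g] for rk in per_rep) ** (1 / len(per_rep)) for g in genes}


rp = rank_product(replicates)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 2))
```

A small rank product means the gene is consistently near the top of every comparison; here gene1 ranks first in all three.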

ADD REPLY • link 8.0 years ago by gilagalad ▴ 10

0

Yes, RankProd was originally designed for meta-analysis, where one wants to combine microarrays performed in different labs. You can use it for your samples, since combining microarrays from different platforms is a kind of meta-analysis, but the power of the tool might not be sufficient because you have a very small number of replicates. I believe you should be considering array probes rather than gene lists, since gene lists are skewed and more than one probe may be associated with a single gene. GO enrichment can also be performed on microarray probes; you might have to look at the tools and the kind of input they take. I am just concerned about how powerful the statistical method will be if you compare one sample against another from a different platform. Usually it will not be, which is why we have tools that take cross-platform normalization into consideration.
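If gene-level values are needed despite several probes mapping to one gene, one option is to keep a single representative probe per gene. The sketch below uses invented probe IDs, values and annotation; picking the probe with the highest mean expression is only one heuristic, assumed here for illustration.

```python
from statistics import mean

# Hypothetical probe-level log-expression values across three arrays,
# and a hypothetical probe-to-gene annotation.
probe_values = {
    "probe_1": [7.1, 7.4, 7.0],
    "probe_2": [9.2, 9.0, 9.5],
    "probe_3": [5.5, 5.8, 5.6],
}
probe_to_gene = {"probe_1": "GENE_X", "probe_2": "GENE_X", "probe_3": "GENE_Y"}


def collapse_to_genes(probe_values, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression."""
    best = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene[probe]
        if gene not in best or mean(values) > mean(probe_values[best[gene]]):
            best[gene] = probe
    return {gene: probe_values[probe] for gene, probe in best.items()}


gene_values = collapse_to_genes(probe_values, probe_to_gene)
print(gene_values)
```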

ADD REPLY • link 8.0 years ago by ivivek_ngs ★ 5.2k

score 1 · Answer 3 · 2018-03-21

I think GO-PCA may be a good answer here: https://gopca.readthedocs.io/en/latest/intro.html

GO-PCA is an unsupervised method to explore gene expression data using prior knowledge. Briefly, GO-PCA combines principal component analysis (PCA) with nonparametric GO enrichment analysis in order to define signatures, i.e., small sets of genes that are both strongly correlated and closely functionally related.

The expression profiles of all signatures generated can be conveniently visualized as a heat map. This visualization, referred to as the signature matrix, aims to provide a systematic and easily interpretable view of biologically relevant expression patterns in the data. Together with other GO-PCA visualizations, it can serve as a powerful starting point for exploratory data analysis and hypothesis generation.