Question

Significance of GO similarity

1

7.8 years ago

Nitin ▴ 170

Dear all,

I computed Gene Ontology (GO) semantic similarity as follows.

I took the genes from two different co-expression networks built from a gene expression dataset. For example:

Module 1: 30 genes

Module 2: 20 genes

Once I had them in hand, I used Gene Ontology software to annotate the genes in the biological process category. I then took the resulting GO IDs and used the GOSemSim R package to compute similarities, which range from 0 to 1. This is the data from the experiment. Now I want to test the significance of the GO similarity as follows.

Generate two random gene sets of the same sizes, compute their GO similarity, and finally calculate a p-value by comparing it with the GO similarities obtained from the real data. I can't work out how to do this. Specifically, how do I select random gene sets of the same sizes and calculate the p-values? Can anybody please tell me the exact procedure?

Thanks, Sai

R GO • 2.2k views

ADD COMMENT • link updated 7.8 years ago by karl.stamm 4.1k • written 7.8 years ago by Nitin ▴ 170

score 0 · Answer 1 · 2016-07-11

0

7.8 years ago

karl.stamm 4.1k

Look into 'permutation testing' or the 'null distribution'. The way empirical p-values are computed can be complicated if you don't know all the terminology. Try Wikipedia for null hypothesis testing. What you're trying to show is that the similarity scores you see are greater than would be seen with no association between conditions.

Basically you would see what kind of values GOSemSim generates on nonsense data, and then show that your values are extremely high with respect to those. It is simple enough to test a single similarity score: get a thousand similarity scores from random genes, plot a histogram, and see where your similarity score falls. If it is near an edge, then it is extreme. Compute the percentile with ecdf(); one minus that percentile is the upper-tail p-value for your one similarity score.
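
The single-score case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `null_scores` is filled with made-up numbers standing in for similarity scores computed on random genes, `observed` is a hypothetical score, and `empirical_p_value` is a helper name invented here. In R, the same quantity is roughly `mean(null_scores >= observed)`, or `1 - ecdf(null_scores)(observed)` if ties are ignored.

```python
import random


def empirical_p_value(observed, null_scores):
    """Upper-tail empirical p-value: share of null scores >= observed.

    The +1 in numerator and denominator counts the observed score as one
    member of the null set, so the result is never exactly zero.
    """
    n_as_high = sum(1 for s in null_scores if s >= observed)
    return (n_as_high + 1) / (len(null_scores) + 1)


rng = random.Random(0)

# Made-up stand-in for 1,000 similarity scores from random genes.
null_scores = [rng.betavariate(2, 5) for _ in range(1000)]

# Hypothetical similarity score from the real data.
observed = 0.9

p = empirical_p_value(observed, null_scores)
print(f"empirical p-value: {p:.4f}")
```

With 1,000 random scores the smallest p-value this can report is 1/1001, so the number of random draws sets the resolution of the test.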

However, since you are working with a set of similarity scores, things get messy. You'll have to define significance of a set somehow. Do you mean that an awful lot of similarities are quite high? Maybe the sum of the similarities is quite high? Then you have to repeat this procedure for gene sets of the same size as your experiment, and that can be a lot of computation.

It might be easier to use a theoretical value than an empirical one. The GOSemSim package computes several well-defined metrics, each with formulas available in their publication. Perhaps the scores have known behavior you should use instead of trying to generate a thousand sets of random genes (and if I recall correctly, computing GOSemSim on thousands of sets of genes will be very time consuming).

ADD COMMENT • link 7.8 years ago by karl.stamm 4.1k

0

Hi Karl,

Thanks for the reply. I just found a paper with the following analysis:

Gene Ontology (GO) semantic similarity scores based on GO terms for each pair of genes were computed using the R GOSemSim package. For each of the three GO sub-ontologies (biological process, molecular function and cellular component), the semantic similarity scores were calculated for all gene pairs in a module. To examine the significance of the functional similarity of genes in a module, a randomization test was performed. For a given module, the same number of genes in the module were selected from the 854 genes, and their GO semantic similarities were analyzed. This procedure was performed 1,000 times, and a Kolmogorov-Smirnov test (KS-test) was used to assess whether the GO semantic similarity scores of all gene pairs from the module were significantly higher than that of randomly selected pairs.

I was wondering if I can apply a similar procedure to my work. This question was actually raised by one of the reviewers of my work, who asked: "What's the significance of GO similarity? In particular, what is the chance of getting the significant GO similarity from two random gene sets of the same sizes?" Could you please guide me on how to do this, if possible with an example? I am a newbie at this type of analysis, which is why I am worried about the basics.

Thanks a lot, Sai

ADD REPLY • link 7.8 years ago by Nitin ▴ 170

0

Yeah that sounds just like the procedure I was describing. To see if your set of similarity scores is higher than a random gene module, you have to make a random gene module and perform the calculations 1,000 times. Then you will have a distribution of similarity sets and can see if your similarity set is extreme with respect to those random ones.

I disagree that the Kolmogorov-Smirnov test is the best method for distinguishing the distributions. It is too sensitive and will almost definitely tell you that your set is different from the random ones. Just compute the mean or median and compare those. It's much dumber and will give a more interpretable result. The key here is that you understand the method and the implications. We are trying to find a method that achieves your goal, which was to show that the similarities were 'quite high'.
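
As a rough illustration of the "compare the medians" idea, here is a Python sketch with made-up numbers. `observed_scores` and `random_sets` are hypothetical stand-ins for the pairwise similarity scores of the real module and of 1,000 random modules; nothing here calls GOSemSim, and the uniform distributions are an assumption chosen only to make the example run.

```python
import random
from statistics import median

rng = random.Random(1)

# Hypothetical stand-ins. A 30-gene module has 30 * 29 / 2 = 435 gene pairs.
observed_scores = [rng.uniform(0.5, 1.0) for _ in range(435)]
random_sets = [[rng.uniform(0.0, 1.0) for _ in range(435)] for _ in range(1000)]

# Reduce every set of pairwise scores to a single number.
observed_median = median(observed_scores)
null_medians = sorted(median(scores) for scores in random_sets)

# Count the random modules whose median is at least as high as the real one.
n_as_high = sum(m >= observed_median for m in null_medians)
p_value = (n_as_high + 1) / (len(null_medians) + 1)

print(f"observed median: {observed_median:.3f}")
print(f"random medians range from {null_medians[0]:.3f} to {null_medians[-1]:.3f}")
print(f"empirical p-value: {p_value:.4f}")
```

The appeal of this form is that the result reads directly: the real module's median sits at a known position within the range of medians seen for random modules.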

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k

0

Thanks for the input. So first I should do the following:

Step 1: Select 30 random genes from my dataset and construct a module

Step 2: Perform Gene Ontology analysis for this random module

Step 3: Run the GOSemSim analysis 1000 times?

Could you please confirm these steps?

Thanks, Sai

ADD REPLY • link 7.8 years ago by Nitin ▴ 170

0

Not quite. For a given step 1, steps 2 and 3 give the same result every time, so it doesn't make sense to repeat just step 3. It is step 1 that varies, so you will have to repeat the whole 1-2-3. Hope you have it scripted to run automatically.

And the results will be messy. The random 30 genes will have some random number of GO terms assigned to the group. The similarities are between GO terms. Are you measuring the similarities within the group? Maybe it's time to go back to the start and think about why you're computing GOSemSim at all. It's a metric for comparing two individual GO terms, so when you apply it to a set, you will get a lot of different similarity values.

Your original question had a module of 30 genes and another of 20. I guess you have to distill that to one value somehow (maybe median of the pairwise GOSEMSIM scores). Then you can do the same for new random sets of size 30 and 20 to see if your first value is high or normal.
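
Put together, the whole 1-2-3 loop might look like the Python sketch below. Everything named here is an assumption for illustration: `gene_sim` is a placeholder for whatever annotation-plus-GOSemSim step produces a similarity between two genes, `toy_sim` and the gene names are invented so the example runs on its own, and drawing the two random modules without overlap is a design choice, not something prescribed in this thread.

```python
import random
from statistics import median


def module_pair_score(module_a, module_b, gene_sim):
    """Distill all between-module pairwise similarities to one value (median)."""
    return median(gene_sim(a, b) for a in module_a for b in module_b)


def permutation_p_value(module_a, module_b, gene_pool, gene_sim,
                        n_perm=1000, seed=0):
    """Compare the real module pair with random pairs of the same sizes."""
    rng = random.Random(seed)
    observed = module_pair_score(module_a, module_b, gene_sim)
    n_total = len(module_a) + len(module_b)
    n_as_high = 0
    for _ in range(n_perm):
        # Step 1 is redrawn on every iteration; scoring follows from it.
        drawn = rng.sample(gene_pool, n_total)
        rand_a, rand_b = drawn[:len(module_a)], drawn[len(module_a):]
        if module_pair_score(rand_a, rand_b, gene_sim) >= observed:
            n_as_high += 1
    return observed, (n_as_high + 1) / (n_perm + 1)


# Toy data: 500 invented genes in 10 hidden "families" of 50 genes each.
gene_pool = [f"gene{i}" for i in range(500)]
family = {gene: i // 50 for i, gene in enumerate(gene_pool)}


def toy_sim(a, b):
    """Invented similarity: high within a family, low otherwise."""
    return 0.9 if family[a] == family[b] else 0.2


module_a = gene_pool[:30]    # 30 genes, all from family 0
module_b = gene_pool[30:50]  # 20 genes, also from family 0

observed, p = permutation_p_value(module_a, module_b, gene_pool, toy_sim)
print(f"observed score: {observed}, empirical p-value: {p:.4f}")
```

Because the scoring function is passed in as an argument, the median can be swapped for another summary without touching the permutation loop.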

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k

0

As you suggested, I went back through my files and checked exactly how I computed the GO similarity using the GOSemSim R package. I did the following:

1) I took the genes from module 1 and module 2 and performed GO analysis on each separately.

2) The GO analysis gave me a list of GO IDs for module 1 and a list of GO IDs for module 2.

3) I then put the GO IDs from module 1 in one text file and the GO IDs from module 2 in another.

4) Finally, I ran the GOSemSim R package to compare the GO IDs from module 1 with the GO IDs from module 2. This analysis gave the following result:

GO:0038127 GO:0007173 0.918

GO:0007169 GO:0038127 0.922

GO:0007169 GO:0048010 0.922

GO:0038127 GO:0007169 0.922

GO:0048010 GO:0007169 0.922

GO:0038095 GO:0038093 0.943

GO:0038093 GO:0038094 0.943

GO:0002768 GO:0038093 0.948

GO:0071363 GO:0071363 1

GO:0051270 GO:0051270 1

GO:2000145 GO:2000145 1

...........

Now I want to compute the significance of GO similarity between each GO IDs (see above). Should I randomize the GO IDs and compute the similarity? If yes, could you please suggest a procedure for doing that?

Thanks, Sai

ADD REPLY • link 7.8 years ago by Nitin ▴ 170

0

I'm sorry, this part is still not clear: "Now I want to compute the significance of GO similarity between each GO IDs (see above)."

Your example table above has eleven rows and three columns. Ten unique elements in the left set, and ten in the right should make about 45 similarity scores.

What do you mean by significance??? I can certainly say that GO:2000145 and GO:2000145 are 100% similar.

From your biological question, I guess you want to say that gene module 1 and 2 are significantly similar. So you need to compile the 45 different measures into one overall measure, (maybe mean), then compare that score to the one made by artificial gene module pairs. This would generate the p-value for how similar the original gene modules are.
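
For example, collapsing the rows posted earlier in this thread into one overall measure could look like this in Python. It uses only the eleven rows that were shown (the posted table is truncated after them), so the resulting number is illustrative and not the score for the full modules.

```python
from statistics import mean

# The eleven rows shown earlier in the thread; the full table is longer.
rows = """\
GO:0038127 GO:0007173 0.918
GO:0007169 GO:0038127 0.922
GO:0007169 GO:0048010 0.922
GO:0038127 GO:0007169 0.922
GO:0048010 GO:0007169 0.922
GO:0038095 GO:0038093 0.943
GO:0038093 GO:0038094 0.943
GO:0002768 GO:0038093 0.948
GO:0071363 GO:0071363 1
GO:0051270 GO:0051270 1
GO:2000145 GO:2000145 1
""".splitlines()

# Third whitespace-separated field on each row is the similarity score.
scores = [float(line.split()[2]) for line in rows]
overall = mean(scores)

print(f"{len(scores)} pairs, mean similarity {overall:.3f}")
```

Note that the shown rows include some pairs in both orders and some terms compared with themselves. Whether those should count toward the overall measure is a decision to make once and then apply identically to the random module pairs.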

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k

0

Thanks for the clarification, and sorry for asking so many questions. As I said, I am new to this topic and I am getting confused :(. Now I am getting the picture, and I agree with your following point.

"From your biological question, I guess you want to say that gene module 1 and 2 are significantly similar." Yes, this is the question I want to answer. I can easily compute the mean of the GO similarity scores from my data. My main issue is how to generate the artificial gene module pairs and compute the GO similarity scores for them.

ADD REPLY • link 7.8 years ago by Nitin ▴ 170

0

Problem solving is all about breaking down the steps and keeping an understanding of each part. It looks like we have now defined some process for getting a score from a pair of gene modules. You'll have to automate that so it can easily be run on an arbitrary input. Maybe the R package clusterProfiler can be of help. I use goseq on RNA-seq data to get sets of GO terms in an automatic manner.

The question of how to make random gene modules (lists) is a separate task. It might have another question/answer thread on Biostars, or look for making random selections on Stack Overflow. You could use sample() in R, or "shuf | head" in Unix.

The prerequisite is just to know the possible pool of all genes. That will come from the originating gene expression experiment.

You might find "mean similarity of all pairwise similarities" to be too weak. Maybe it comes to 95% for this set, and 93 to 97% in random sets. Won't know until you run it, so hopefully running it isn't too time consuming and you can tweak the method. An alternative to mean which might be more sensitive is "proportion of all pairwise similarities that are > 0.9".
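
To see why the two statistics can disagree, here is a small Python sketch. The helper name `proportion_above` and both score lists are made up for illustration; the 0.9 cutoff is the one suggested above.

```python
from statistics import mean


def proportion_above(scores, threshold=0.9):
    """Fraction of pairwise similarities strictly greater than threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("no similarity scores given")
    return sum(s > threshold for s in scores) / len(scores)


# Two made-up score sets with the same mean but different upper tails.
even = [0.6, 0.6, 0.6, 0.6]
split = [0.25, 0.25, 0.95, 0.95]

for name, scores in (("even", even), ("split", split)):
    print(f"{name}: mean {mean(scores):.2f}, "
          f"proportion > 0.9 = {proportion_above(scores):.2f}")
```

Both lists average 0.6, yet only the second has any pairs above the cutoff, which is the kind of difference a mean alone would hide.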

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k