Question

Which statistical test should be applied in the association gene list

7.6 years ago

izzy.yichao.cai ▴ 180

Hi all,

I have two kinds of gene lists. One describes the status (Active, Bivalent, Repressive or Quiescent) of each gene, e.g.:

Gene name    Status
A            Active    
B            Bivalent
C            Repressive
...          ...

The other two list predicted tumor suppressor genes (TSG) and oncogenes (OG), e.g.:

TSG
Gene name    Score
A            0.0001
B            100
C            1
...          ...

OG
Gene name    Score
A            0.001
B            1
C            10
...          ...

Then I matched the first gene list against each of the two other lists, to see whether the genes in the first list are TSG or OG (regardless of the score). I get a table like this (the overlap is quite limited):

[Image: table of overlap counts between gene status and the TSG/OG lists]

We can see that among the repressive genes in the first list there are more tumor suppressor genes than oncogenes (13 > 8). If I want to test whether repressive genes are indeed more associated with tumor suppressor genes than with oncogenes, how can I add a statistical test?

The first list has 1769 lines in total (excluding the header); the TSG list has 491 lines; the OG list has 501 lines.

ADD COMMENT • link updated 7.5 years ago by Giovanni M Dall'Olio 28k • written 7.6 years ago by izzy.yichao.cai ▴ 180

You also need the background frequencies: how many tumor suppressor genes and oncogenes are there in total in your genome? Then your problem reduces to the following urn lottery: you put G balls into the basket, N labelled Tum. and M labelled Onc. Now you draw J < G balls from your lottery without putting them back. What is the probability of having n >= 7 labelled Tum. and m >= 6 labelled Onc. in your sample?

ADD REPLY • link 7.6 years ago by Michael 54k
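
The urn model described above is the hypergeometric distribution. A minimal sketch in R for the one-label case (the probability of drawing at least a given number of Tum. balls) follows; the joint Tum./Onc. question would need the multivariate version, which is not shown. All numbers are assumptions for illustration: the thread never gives a genome-wide gene count, so `G` is hypothetical, and `J` and `n_obs` are placeholder values.

```r
# Hypothetical inputs: the thread does not state the genome-wide totals.
G     <- 20000  # assumed total number of genes (balls in the urn)
N     <- 491    # balls labelled Tum. (size of the TSG list in the question)
J     <- 525    # assumed number of balls drawn (genes in the sample)
n_obs <- 13     # assumed number of drawn balls labelled Tum.

# P(X >= n_obs) when drawing J balls without replacement.
# phyper() returns P(X > q) with lower.tail = FALSE, hence n_obs - 1.
p_enrich <- phyper(n_obs - 1, N, G - N, J, lower.tail = FALSE)
print(p_enrich)
```

A small `p_enrich` would suggest more Tum. genes in the sample than the background frequency predicts; the result depends heavily on the assumed `G`.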

Thanks for the reply, Michael! I've edited the question. It may be a bit different from the previous version. I actually use the second type of list (TSG/OG) to annotate the first list (the result list of my analysis).

ADD REPLY • link 7.6 years ago by izzy.yichao.cai ▴ 180

You have 2 categorical lists (Quiescent, Repressive, Bivalent and Active) and (Tumor Suppressor Genes, Oncogenes) and want to compare which factor has more relevance? Then test it with McNemar. See this page for more help.

ADD REPLY • link 7.6 years ago by Lluís R. ★ 1.2k

No, the second list contains three factors: Tumor Suppressors, Oncogenes, and All other genes.

ADD REPLY • link 7.6 years ago by Giovanni M Dall'Olio 28k

What are the scores in your second file?

ADD REPLY • link 7.6 years ago by Giovanni M Dall'Olio 28k

The score in the second type of list is a predicted score (predicted from a large data set of mutation signatures). The higher the score, the more likely the gene is a TSG or OG. But in my case, I am more interested in finding which genes in my gene list are predicted as TSG or OG.

ADD REPLY • link 7.6 years ago by izzy.yichao.cai ▴ 180

If I want to test whether repressive genes are indeed more associated with tumor suppressor genes, how I can add statistical test?

This question is incomplete. Are you asking whether repressive genes are more associated with tumor suppressors than any other gene is, or more associated with tumor suppressors than with oncogenes? I would do a Fisher test or a regression, but first you need to define what you are looking for.

ADD REPLY • link 7.6 years ago by Giovanni M Dall'Olio 28k

I want to see whether repressive genes are more associated with tumor suppressors compared to oncogenes. Thanks for pointing out the incomplete part!

ADD REPLY • link 7.6 years ago by izzy.yichao.cai ▴ 180

Thanks, but notice that this way the "all genes" dataset is not taken into consideration.

ADD REPLY • link 7.6 years ago by Giovanni M Dall'Olio 28k
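
For the comparison stated above (repressive genes and tumor suppressors versus oncogenes, ignoring all other genes), one possible sketch is a Fisher's exact test on a 2x2 table restricted to TSG/OG genes. The counts are taken from the table printed in the answer further down the thread, with the labels as printed there (the asker later suggests TS and Oncogene may be swapped, which would only swap the two columns). The names `tab` and `ft` are illustrative.

```r
# Rows: gene status; columns: predicted class.
# "Other" pools Active, Bivalent and Quiescent from the answer's table.
tab <- matrix(c( 8, 13,   # Repressive: TS, Oncogene
                31, 29),  # Other: TS (17 + 8 + 6), Oncogene (13 + 9 + 7)
              nrow = 2, byrow = TRUE,
              dimnames = list(status = c("Repressive", "Other"),
                              class  = c("TS", "Oncogene")))

ft <- fisher.test(tab)  # exact test of independence on the 2x2 table
print(tab)
print(ft$p.value)
print(ft$estimate)      # conditional estimate of the odds ratio
```

As noted in the comment above, this table leaves out the genes that are neither TSG nor OG, so it answers a narrower question than an enrichment test against all genes.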

score 1 · Answer 1 · 2016-09-28

7.6 years ago

Giovanni M Dall'Olio 28k

Apologies, I wrote this a few days ago but forgot to submit it. Luckily the browser didn't delete it :-)

First of all, let's recreate your data frame:

library(dplyr)
library(tidyr)
# m is a 4x3 matrix of gene counts: rows = gene status, columns = Oncogene, TS, Rest
d = data.frame(
   inheritance=apply(m, 1, function(x) {c(rep("Oncogene", x[1]), rep("TS", x[2]), rep("Rest", x[3]))}) %>% unlist , 
   genefunction = rep(c("Quescent", "Repressive", "Bivalent", "Active"), rowSums(m)))   %>% 
   tbl_df

> d %>% count(inheritance, genefunction)  %>% spread(inheritance, n)
# A tibble: 4 × 4
   genefunction Oncogene  Rest    TS
*        <fctr>    <int> <int> <int>
1        Active       13   546    17
2      Bivalent        9   462     8
3      Quescent        7   257     6
4    Repressive       13   504     8

> d %>% head
# A tibble: 6 × 2
   inheritance genefunction
        <fctr>       <fctr>
1     Oncogene     Quescent
2     Oncogene     Quescent
3     Oncogene     Quescent
4     Oncogene     Quescent
5     Oncogene     Quescent
6     Oncogene     Quescent

At this point we can use a regression to estimate how belonging to a specific class (Active, Bivalent, etc.) relates to the chance of being a TS gene:

> d %>% lm(I(inheritance=="TS")~genefunction-1, data=. ) %>% summary

Call:
lm(formula = I(inheritance == "TS") ~ genefunction - 1, data = .)

Residuals:
     Min       1Q   Median       3Q      Max
-0.02951 -0.02951 -0.01670 -0.01524  0.98476

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
genefunctionActive     0.029514   0.005987   4.930 8.96e-07 ***
genefunctionBivalent   0.016701   0.006565   2.544   0.0110 *
genefunctionQuescent   0.022222   0.008744   2.541   0.0111 *
genefunctionRepressive 0.015238   0.006271   2.430   0.0152 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1437 on 1846 degrees of freedom
Multiple R-squared:  0.02284,   Adjusted R-squared:  0.02072
F-statistic: 10.78 on 4 and 1846 DF,  p-value: 1.214e-08

In this case, Active genes are the only ones significantly more likely to be TS, although the odds ratio is not big.

ADD COMMENT • link 7.6 years ago by Giovanni M Dall'Olio 28k
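
A note on reading the regression above: with `- 1` in the formula, each lm() coefficient is the fraction of TS genes in that class, and its p-value tests that fraction against zero rather than against the other classes. As a hedged alternative sketch (not the method used in the answer), a logistic regression with Repressive as the reference level gives odds ratios between classes directly. Counts are taken from the table printed above; `counts`, `fit` and `odds_ratios` are illustrative names.

```r
# One row per gene status: number of TS genes and number of non-TS genes.
counts <- data.frame(
  status = factor(c("Quescent", "Repressive", "Bivalent", "Active"),
                  levels = c("Repressive", "Quescent", "Bivalent", "Active")),
  ts     = c(6, 8, 8, 17),
  not_ts = c(7 + 257, 13 + 504, 9 + 462, 13 + 546)  # Oncogene + Rest
)

# Binomial GLM on aggregated counts; Repressive is the reference level.
fit <- glm(cbind(ts, not_ts) ~ status, family = binomial, data = counts)

# Odds of being TS in each class relative to Repressive.
odds_ratios <- exp(coef(fit))[-1]
print(summary(fit)$coefficients)
print(odds_ratios)
```

The p-values in the coefficient table then test each class against Repressive, which is closer to the question being asked.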

I think you need to have a 3x2 table:

                 TS   OG   AllOther
Repressive
AllOther

Or even simpler if you just want to see if Repressive is enriched in TS, a 2x2:

                 TS   AllOther
Repressive
AllOther

If you also want to test OG, make another 2x2:

                 OG   AllOther
Repressive
AllOther

A chi-squared test should work. If you test the latter two 2x2 tables, you need to adjust your alpha for multi-testing, e.g. divide your alpha 0.05 by 2.

ADD REPLY • link 7.6 years ago by moxu ▴ 510
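
A minimal sketch of the two 2x2 tests described above, with a Bonferroni adjustment for testing both tables. The counts come from the table printed in the answer, using the labels as printed there; if TS and Oncogene are swapped in that table, the two results simply trade places. `correct = FALSE` turns off the continuity correction, which is a choice made here for illustration.

```r
# Rows: Repressive vs all other statuses; columns: class vs all other genes.
tab_ts <- matrix(c(8, 517, 31, 1294), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Repressive", "AllOther"), c("TS", "AllOther")))
tab_og <- matrix(c(13, 512, 29, 1296), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Repressive", "AllOther"), c("OG", "AllOther")))

chi_ts <- chisq.test(tab_ts, correct = FALSE)
chi_og <- chisq.test(tab_og, correct = FALSE)

# Two tests were run, so adjust the p-values (equivalent to alpha = 0.05 / 2).
p_raw <- c(TS = chi_ts$p.value, OG = chi_og$p.value)
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(c(TS = unname(chi_ts$statistic), OG = unname(chi_og$statistic)))
print(p_adj)
```

With these counts the statistics come out near 1.21 and 0.14, so neither table shows a significant association.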

Thanks!! I think I got your point. Just to point out, you might have assigned the OG numbers to TS in your tibble by mistake.

When I tried to run your code, I got an error about the undefined object 'm' while constructing the data frame. Can you tell me what that is?

Also, if I want to present this statistical significance to an audience, what should I call the test (as one would say Fisher's exact test, chi-square test, ...)?

Thanks again!

ADD REPLY • link 7.5 years ago by izzy.yichao.cai ▴ 180

Chi-squared test. You can do Fisher's exact test, too, but chi-squared is popular and easier for more people to understand.

ADD REPLY • link 7.5 years ago by moxu ▴ 510

Hi,

I tried the chi-squared tests that you mentioned. The results are not significant:

3X2 test: The chi-square statistic is 1.335. The p-value is .51299. The result is not significant at p < .05.

2X2 test:
TSG: The chi-square statistic is 0.1401. The p-value is .708191. The result is not significant at p < .05.
OG: The chi-square statistic is 1.2127. The p-value is .270803. The result is not significant at p < .05.

Neither result is significant, so the tests do not reject independence. Does that mean that repressive status is independent of the gene class, so that we can further conclude that the repressive ratio in TSG does not differ from the repressive ratio in all other genes? Is that correct?

How can I make the case that repressive genes tend to be more associated with TSG but not with OG?

Thanks a lot!

ADD REPLY • link 7.5 years ago by izzy.yichao.cai ▴ 180

score 0 · Answer 2 · 2016-10-03

7.5 years ago

Giovanni M Dall'Olio 28k

Sorry, m was the matrix with the count of genes by category that I used to regenerate your dataset. But in your case you can start directly with the lm() part, as you already have the same data frame.

ADD COMMENT • link 7.5 years ago by Giovanni M Dall'Olio 28k
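
For readers who want to rerun the first answer, here is a sketch of what `m` probably looked like, reconstructed from the count table printed in that answer. The row order and column order are inferred from the `rep()` calls in the original code; the `dimnames` are added for readability and are not in the original.

```r
# Gene counts by category, reconstructed from the printed tibble.
# Rows follow the order used in rep(c("Quescent", "Repressive", ...), rowSums(m)).
m <- matrix(c( 7,  6, 257,
              13,  8, 504,
               9,  8, 462,
              13, 17, 546),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Quescent", "Repressive", "Bivalent", "Active"),
                            c("Oncogene", "TS", "Rest")))

print(rowSums(m))  # genes per status
print(sum(m))      # total number of genes
```

The total of 1850 genes agrees with the regression output in the answer (1846 residual degrees of freedom plus 4 coefficients).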

Thanks!!

I realized that the test you used is linear regression. Though there is a significant result, I think people would tend to be skeptical of it. A linear relationship doesn't sound convincing since it is quite simple and straightforward. When I discussed the linear regression result with my colleagues, they didn't seem to believe it.

I tried the chi-squared test suggested by Mousheng Xu above. I tried it both ways, but I can't get a significant result.

My boss said that it would be hard to find the enrichment I wanted to see in such a limited set of OG/TSG. The significance would be diluted by the large size of the genome. It may not be a good idea to pursue this when we don't get enough overlap between the gene lists.

Thanks for help!

ADD REPLY • link 7.5 years ago by izzy.yichao.cai ▴ 180

If there is no enrichment, then there is no enrichment. It does not have anything to do with the size of the genome. A permutation test is the gold standard if you really want to try. Looking at your table above, about 26% of repressive genes are TSG and about 28% of quiescent genes are TSG; the difference is not that big. Getting non-significant results is not unexpected.

ADD REPLY • link 7.5 years ago by moxu ▴ 510
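
A minimal sketch of the permutation test mentioned above, under stated assumptions: the counts are those printed in the first answer's table (labels as printed there), the test statistic (repressive TS genes minus repressive Oncogene genes) is one possible choice rather than a standard one, and the seed and the number of permutations are arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# One entry per gene: its status and its class, rebuilt from the count table.
status <- rep(c("Quescent", "Repressive", "Bivalent", "Active"),
              c(270, 525, 479, 576))
class <- c(rep(c("Oncogene", "TS", "Rest"), c(7, 6, 257)),
           rep(c("Oncogene", "TS", "Rest"), c(13, 8, 504)),
           rep(c("Oncogene", "TS", "Rest"), c(9, 8, 462)),
           rep(c("Oncogene", "TS", "Rest"), c(13, 17, 546)))

# Statistic: repressive TS genes minus repressive Oncogene genes.
stat <- function(s) {
  sum(s == "Repressive" & class == "TS") -
    sum(s == "Repressive" & class == "Oncogene")
}

obs  <- stat(status)
# Shuffle the status labels to break any link between status and class.
perm <- replicate(2000, stat(sample(status)))

# Two-sided p-value, with +1 so the estimate is never exactly zero.
p_perm <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)
print(obs)
print(p_perm)
```

Shuffling the status labels keeps the class sizes and the number of repressive genes fixed, so `p_perm` estimates how often a difference at least as large as the observed one arises by chance alone.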