Question

Is it OK to select a small subset of expression data for statistical testing?


7.3 years ago

xyliu00 ▴ 10

It is typical to do geneset enrichment analysis where the differentially expressed genes are divided into subgroups and each group is tested for significance.

I saw someone select a dozen or so overexpressed genes and test whether their overexpression is associated with certain phenotypic traits. In this case, they are all membrane-expressed genes.

Further, different combinations of a handful of genes (in triplets or pairs) were tested for association between their expression levels (high vs low) and phenotypes.

But is it OK to hand-pick a small subset from thousands of differentially expressed genes for statistical analysis? Is it statistically sound? What needs to be considered when doing so?

This has probably been discussed before, but I am having trouble finding the right keywords to search.

Thanks!

RNA-Seq gene microarray • 1.8k views

updated 7.3 years ago by Steven Lakin ★ 1.8k • written 7.3 years ago by xyliu00 ▴ 10

score 7 · Answer 1 · 2016-12-19

Adding on to the "statistically sound" part of the question: it depends on what kind of statistics they are calculating.

In classical statistical testing, you as the experimenter are expected to fix your hypothesis, and therefore your sample size, before seeing the data. Selecting a set of overexpressed genes and then running further statistical tests on them is invalid under the classical model, because those particular genes were not known before the hypothesis under consideration was developed. The degrees of freedom used to calculate those p-values would then be determined incorrectly. The same applies to "hunting" for small subsets of unintended but overexpressed genes. This is an often overlooked aspect of classical stats.
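To illustrate the selection problem described above, here is a minimal simulation sketch in Python. The gene count, group size, number of hand-picked genes, and the function name `simulate_selection_bias` are all illustrative assumptions, not values from the question. Every gene is pure noise, yet the genes picked for looking most "overexpressed" tend to pass a naive test when it is run on the same data used to pick them.

```python
import math
import random


def simulate_selection_bias(n_genes=1000, n_per_group=10, n_selected=12, seed=0):
    """Simulate null data, hand-pick the most extreme genes, and test them
    on the same data. All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% cutoff for a standard normal statistic
    z_scores = []
    for _ in range(n_genes):
        # Both groups come from the same distribution: no true effect exists.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        # z statistic assuming a known unit variance in each group
        z_scores.append(diff / math.sqrt(2.0 / n_per_group))

    # Fraction of all genes that look "significant" (the honest error rate).
    overall_rate = sum(abs(z) > z_crit for z in z_scores) / n_genes

    # Hand-pick the genes with the largest apparent differences,
    # then "test" them again on the very same data.
    top = sorted(z_scores, key=abs, reverse=True)[:n_selected]
    selected_rate = sum(abs(z) > z_crit for z in top) / n_selected
    return overall_rate, selected_rate


overall_rate, selected_rate = simulate_selection_bias()
print(f"significant among all genes:         {overall_rate:.3f}")
print(f"significant among hand-picked genes: {selected_rate:.3f}")
```

The point of the sketch is only that the second number is driven by the selection step rather than by any biology. Testing the picked genes against a phenotype in independent data would not suffer from this circularity.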

If you use Bayesian methodology, this requirement doesn't apply, though other aspects of the design could be suspect, such as the choice of prior. The Bayesian posterior distribution can be examined in as many ways or combinations as desired without statistical consequence and without incurring the same "penalties" as in the classical system. Kruschke wrote a nice paper summarizing the differences between the two.

If anyone with more empirical experience has input on how this is typically handled by the field with differential expression studies, I think it would be a good conversation to have here.

score 1 · Answer 2 · 2016-12-19

It all depends on your rationale for choosing those genes.

For example, if you choose genes that are G-protein-coupled receptors and run a functional enrichment analysis, you will find genes that are part of the G-protein-coupled receptor pathways, and therefore your analysis will be futile.

Otherwise, if you have instead identified a set of proteins interacting with your favorite protein X and you want to see what those proteins do, that's OK.

The statistic behind, e.g., GO analyses is a hypergeometric test, which gives the probability of your genes, versus all genes/background, being part of a specific signature. If you "pre-enrich" based on a specific, already known function, you will introduce bias and won't really need to do an enrichment analysis.
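As a rough sketch of the hypergeometric test mentioned above, here is the upper-tail p-value written in plain Python. The function name `hypergeom_enrichment_pvalue` and all the counts (20,000 background genes, 400 annotated to the category, lists of 100 genes) are made-up assumptions for illustration, not numbers from any real annotation.

```python
from math import comb


def hypergeom_enrichment_pvalue(background_size, annotated_in_background,
                                selected_size, annotated_in_selected):
    """Upper-tail hypergeometric p-value: the probability of drawing at least
    `annotated_in_selected` annotated genes when `selected_size` genes are
    sampled without replacement from the background."""
    max_overlap = min(annotated_in_background, selected_size)
    total = comb(background_size, selected_size)
    tail = sum(
        comb(annotated_in_background, i)
        * comb(background_size - annotated_in_background, selected_size - i)
        for i in range(annotated_in_selected, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 400 annotated as GPCR.
# A list of 100 genes containing 2 GPCRs (about what chance would give):
p_unbiased = hypergeom_enrichment_pvalue(20000, 400, 100, 2)

# A list of 100 genes that were chosen *because* they are GPCRs:
p_preselected = hypergeom_enrichment_pvalue(20000, 400, 100, 100)

print(f"list with chance-level overlap: p = {p_unbiased:.3g}")
print(f"list pre-selected by function:  p = {p_preselected:.3g}")
```

The second p-value is vanishingly small, but it only restates the criterion used to build the list, which is the sense in which the enrichment analysis becomes futile.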