Question

Tumor subtype classification when genes of gene set are missing from samples - GSVA approach

5.3 years ago

Pietro ▴ 230

When classifying tumor samples into subtypes by scoring different gene sets against gene expression data, how meaningful are the results when not all genes of a gene set are found in the samples? I believe this may significantly affect the results, yet it is not reported when results are presented, for example with GSVA. This has happened to me quite often, particularly with data from TCGA or ICGC. So far, I have only done this with the GSVA package, using method 'gsva'. Has anyone tested this issue?

Thanks

Note: cross-posting to Bioconductor

RNA-Seq GSVA TCGA geneset subtypes • 1.5k views

ADD COMMENT • link updated 5.3 years ago by Kevin Blighe 87k • written 5.3 years ago by Pietro ▴ 230

score 0 · Answer 1 · 2019-01-09

5.3 years ago

Kevin Blighe 87k

With programs like GSVA, your first job is to do your best to ensure that all of your genes use the same annotation as the gene sets, so that the program can match them. This usually involves converting from one annotation (e.g. Ensembl gene IDs) to another (e.g. HGNC symbols) using something like biomaRt. In practice, it is very rare that 100% of our genes can be used, for a whole variety of reasons. The effect is that statistical power may be lost, but I do not know of any studies that have attempted to quantify how much.

There is another important aspect to consider here: if the gene sets in the program database consist only of protein-coding genes, while your data contains both protein-coding and non-coding genes, then, obviously, many will not match. However, you may still achieve 100% matching on the protein-coding genes alone.

The numbers going into each enrichment analysis should be reported in Supplementary Methods, but I am aware that they are usually not reported, which does not help.
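As a rough illustration of what such a report could look like, here is a minimal Python sketch that counts, for each gene set, how many of its genes are actually present in the expression data. The function name `gene_set_coverage`, the gene set names, and the gene identifiers are all hypothetical placeholders; this is not part of GSVA, just one way to produce the numbers before running it.

```python
def gene_set_coverage(gene_sets, measured_genes):
    """For each gene set, report how many of its genes are present in the data."""
    measured = set(measured_genes)
    report = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        found = genes & measured
        report[name] = {
            "set_size": len(genes),
            "n_found": len(found),
            "fraction_found": len(found) / len(genes) if genes else 0.0,
            "missing": sorted(genes - measured),
        }
    return report


# Hypothetical toy inputs, for illustration only (placeholder identifiers)
gene_sets = {
    "subtype_A": ["GENE_1", "GENE_2", "GENE_3", "LNC_1"],
    "subtype_B": ["GENE_4", "GENE_5"],
}
measured_genes = ["GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6"]

coverage = gene_set_coverage(gene_sets, measured_genes)
for name, row in coverage.items():
    print(name, f'{row["n_found"]}/{row["set_size"]}', "missing:", row["missing"])
```

The same table, computed once on all genes and once on protein-coding genes only, would also show whether the mismatch is mostly a biotype issue.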

Others may have other opinions.

Kevin

ADD COMMENT • link 5.3 years ago by Kevin Blighe 87k

Hi Kevin

I agree with what you say.

I have encountered this problem especially when using custom gene sets created downstream of one pipeline together with an expression dataset obtained with a different pipeline. As you say, whether or not different biotypes are included makes the difference.

I'd like to hear other opinions, and I would like to try testing the loss of statistical power.
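One possible starting point is a small simulation: score a gene set using all of its genes, then re-score it after randomly dropping a fraction of them, and check how well the two scores agree across samples. The Python sketch below is only that, a sketch under loud assumptions: the data are simulated (a shared per-sample signal plus gene-specific noise), and the score is a plain per-sample mean, used as a stand-in and not the GSVA statistic. All names and parameter values are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def set_score(expr, genes):
    """Mean expression of `genes` per sample: a simple stand-in, NOT the GSVA statistic."""
    n_samples = len(next(iter(expr.values())))
    return [statistics.fmean(expr[g][s] for g in genes) for s in range(n_samples)]


def simulate(n_samples=60, set_size=40, fractions=(0.9, 0.7, 0.5, 0.3), n_rep=200, seed=1):
    """Mean correlation between full-set and reduced-set scores, per fraction of genes kept."""
    rng = random.Random(seed)
    # Simulated data: each gene = shared per-sample signal + gene-specific noise
    signal = [rng.gauss(0, 1) for _ in range(n_samples)]
    genes = [f"g{i}" for i in range(set_size)]
    expr = {g: [s + rng.gauss(0, 2) for s in signal] for g in genes}
    full = set_score(expr, genes)
    result = {}
    for frac in fractions:
        k = max(1, round(frac * set_size))  # number of genes kept
        cors = [pearson(full, set_score(expr, rng.sample(genes, k))) for _ in range(n_rep)]
        result[frac] = statistics.fmean(cors)
    return result


agreement = simulate()
for frac, r in agreement.items():
    print(f"{frac:.0%} of genes kept: mean correlation with full-set score = {r:.3f}")
```

With real data, the simulated matrix would be replaced by the actual expression matrix and the stand-in score by the real enrichment scores, dropping genes from the gene set before each run.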

ADD REPLY • link 5.3 years ago by Pietro ▴ 230

In that case, you could start by exploring the relationship between statistical power and the tests commonly used in gene enrichment, i.e., chi-square (χ²), Fisher's exact test, and the hypergeometric test. However, this now branches completely into statistics, in which case I encourage you to pursue the issue on https://stats.stackexchange.com/.
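To make the hypergeometric case concrete, here is a small worked sketch in Python. It computes the upper-tail probability of an overlap at least as large as the one observed, once with a fully matched gene set and once with only 60% of the set matched at the same hit rate. All of the numbers are hypothetical and chosen purely for illustration, the background size is kept fixed for simplicity, and this is an over-representation test, not what GSVA computes.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the gene set."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Hypothetical numbers, for illustration only
N = 20000  # genes in the background
n = 500    # genes in the list of interest

# All 100 genes of the set matched, 10 of them in the list of interest
p_full = hypergeom_upper_tail(k=10, N=N, K=100, n=n)
# Only 60 genes of the set matched, 6 of them in the list (same 10% hit rate)
p_partial = hypergeom_upper_tail(k=6, N=N, K=60, n=n)

print(f"full set:    p = {p_full:.2e}")
print(f"60% matched: p = {p_partial:.2e}")
```

Under these assumed numbers, the same proportion of hits gives a larger p-value when fewer genes of the set can be matched, which is one way to see the loss of power.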

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k