Question

Is it okay to remove uncharacterised transcripts from downstream analysis in RNA-seq?

5.6 years ago

antoinefelden ▴ 60

I work with a reference genome that is only partially annotated, and I'm wondering if it's okay for me to discard uncharacterised genes from my dataset (once I've properly calculated TMM-normalisation factors from all transcripts, including the uncharacterised ones).

I can deal with having lots of uncharacterised genes in the output of a classic DGE analysis (i.e. when looking at the top 100ish most DE genes, I can just acknowledge that a subset of these transcripts are unknown and that's fine). However, I also want to build a gene co-expression network (WGCNA), and I'd like to calculate GO enrichment on the relevant gene modules. But obviously, when a large portion of genes are unknown within a module, their GO terms are also unknown and a GO enrichment analysis doesn't really make sense. To overcome that, I want to discard uncharacterised transcripts and only run the analysis on annotated transcripts.

I'm aware that I could also try to annotate these genes myself, but for several reasons I'd rather not (this genome assembly will be obsolete soon, and - although that's never a good reason - I'm in a big rush to get a first version of this study out).

Here is a simple outline of the pipeline I'm talking about, starting from a gene raw count matrix:

1) Apply TMM normalisation using all transcripts (i.e. true library size)
2) Retrieve only transcripts for which there is a known annotation
3) Run WGCNA on this subset of transcripts only
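
The ordering of the first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the count matrix, gene names and annotation flags are made up, and plain counts-per-million scaling stands in for TMM (with TMM, the per-sample factors would likewise be estimated on the full matrix before any rows are dropped).

```python
import numpy as np

# Toy raw count matrix: rows are genes, columns are samples (made-up numbers).
genes = ["geneA", "geneB", "unk1", "unk2", "geneC"]
counts = np.array([
    [100.0, 200.0],
    [50.0, 80.0],
    [300.0, 500.0],
    [25.0, 20.0],
    [25.0, 200.0],
])

# Hypothetical annotation flags: True where a functional annotation exists.
annotated = np.array([True, True, False, False, True])

# Step 1: scale by library sizes computed from ALL genes.
# Plain CPM is used here as a stand-in for TMM; TMM factors would also be
# estimated on the full matrix and folded into the effective library sizes.
lib_sizes = counts.sum(axis=0)
cpm_all = counts / lib_sizes * 1e6

# Step 2: only now drop the unannotated rows.
cpm_annotated = cpm_all[annotated]
kept_genes = [g for g, keep in zip(genes, annotated) if keep]

# For contrast: filtering first shrinks the library sizes and inflates CPM.
filtered_counts = counts[annotated]
cpm_filter_first = filtered_counts / filtered_counts.sum(axis=0) * 1e6
```

Comparing `cpm_annotated` with `cpm_filter_first` shows why the order matters: the same gene gets a larger value when the library size is recomputed on the filtered matrix.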

RNA-Seq DGE WGCNA • 1.1k views

updated 5.3 years ago by h.mon 35k • written 5.6 years ago by antoinefelden ▴ 60

I think for the enrichment part you can choose just the annotated genes. Even if you have known genes with unknown functions, you can still do GO analysis, and it's fairly acceptable. In my experience, when I put my genes into DAVID for analysis, it doesn't recognize some IDs and discards them. These IDs could be pseudogenes or lncRNAs, which are not part of the DAVID annotation, and the results are still acceptable.

5.6 years ago by piyushjo ▴ 700

score 2 · Answer 1 · 2019-01-21

There are two reasons for not filtering genes when performing co-expression network analysis:

1) when filtering genes, you may change the shape of the network, changing the relation between groups or creating / removing groups.

2) one of the purposes of these analyses is precisely to shed light on the function of unknown genes, by examining how they relate to known genes - by removing unknown genes, you gain no insight into their function.
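
To make the first point concrete, here is a minimal sketch of how dropping one gene can change the similarity between two remaining genes. It assumes the unsigned topological overlap measure that WGCNA uses to build modules; the three-gene adjacency matrix is invented for illustration, and the `tom` helper is a hand-written stand-in, not the WGCNA implementation.

```python
import numpy as np

def tom(adj):
    """Unsigned topological overlap for a symmetric adjacency matrix."""
    a = adj.astype(float).copy()
    np.fill_diagonal(a, 0.0)
    shared = a @ a                    # l_ij: connection strength via shared neighbours
    k = a.sum(axis=1)                 # connectivity of each gene
    min_k = np.minimum.outer(k, k)
    t = (shared + a) / (min_k + 1.0 - a)
    np.fill_diagonal(t, 1.0)
    return t

# Made-up adjacencies: two annotated genes that are weakly connected to each
# other but both strongly connected to one uncharacterised gene.
names = ["known1", "known2", "unk"]
adj = np.array([
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.9],
    [0.9, 0.9, 1.0],
])

tom_full = tom(adj)              # network built with all genes
tom_subset = tom(adj[:2, :2])    # network built after dropping "unk"

overlap_full = tom_full[0, 1]
overlap_subset = tom_subset[0, 1]
```

In this toy case the overlap between `known1` and `known2` is about 0.48 with the uncharacterised gene present and 0.10 without it, so the two annotated genes could end up in the same module in one network and in different modules in the other.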

I think you should perform the WGCNA analysis as recommended by the authors, and for the subsequent GO enrichment, discard modules with too few annotated genes.
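
The last suggestion could look roughly like the sketch below. The module assignments, gene names and both cut-offs are hypothetical; what counts as "too few" annotated genes is a judgment call, and the thresholds here are arbitrary placeholders.

```python
# Hypothetical module assignments, e.g. as exported from a WGCNA run on all genes.
modules = {
    "blue": ["geneA", "geneB", "geneC", "unk1"],
    "brown": ["unk2", "unk3", "unk4", "geneD"],
    "green": ["geneE", "geneF", "unk5", "unk6"],
}

# Hypothetical set of genes that have at least one GO annotation.
annotated = {"geneA", "geneB", "geneC", "geneD", "geneE", "geneF"}

# Arbitrary cut-offs, chosen for illustration only.
MIN_ANNOTATED = 2
MIN_FRACTION = 0.5

def annotation_summary(modules, annotated):
    """Return {module: (number annotated, fraction annotated)}."""
    summary = {}
    for name, members in modules.items():
        n_annot = sum(gene in annotated for gene in members)
        fraction = n_annot / len(members) if members else 0.0
        summary[name] = (n_annot, fraction)
    return summary

summary = annotation_summary(modules, annotated)

# Keep only modules with enough annotated genes for GO enrichment.
testable = [
    name
    for name, (n_annot, fraction) in summary.items()
    if n_annot >= MIN_ANNOTATED and fraction >= MIN_FRACTION
]
```

With these toy inputs, "blue" and "green" pass the cut-offs and "brown" is set aside; the uncharacterised genes still stay in the network itself and are only ignored at the enrichment step.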