I work with a reference genome that is only partially annotated, and I'm wondering if it's okay for me to discard uncharacterised genes from my dataset (once I've properly calculated TMM-normalisation factors from all transcripts, including the uncharacterised ones).
I can deal with having lots of uncharacterised genes in the output of a classic DGE analysis (i.e. when looking at the top 100ish most DE genes, I can just acknowledge that a subset of these transcripts are unknown and that's fine). However, I also want to build a gene co-expression network (WGCNA), and I'd like to calculate GO enrichment on the relevant gene modules. But obviously, when a large portion of genes are unknown within a module, their GO terms are also unknown and a GO enrichment analysis doesn't really make sense. To overcome that, I want to discard uncharacterised transcripts and only run the analysis on annotated transcripts.
I'm aware that I could also try to annotate these genes myself, but for several reasons I'd rather not (this genome assembly will be obsolete soon, and, although that's never a good reason, I'm in a big rush to get a first version of this study out).
Here is a simple outline of the pipeline I'm talking about, starting from a gene raw count matrix:
- Apply TMM normalisation using all transcripts (i.e. true library size)
- Retrieve only transcripts for which there is a known annotation
- Run WGCNA on this subset of transcripts only
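The outline above could be sketched roughly as follows. This is only an illustration in plain Python, not edgeR or WGCNA code: it assumes the TMM normalisation factors have already been computed elsewhere on the full count matrix, and the function name `cpm_then_subset`, the gene IDs and the counts are all made up. The point it shows is that library sizes are taken over every transcript before the uncharacterised ones are dropped.

```python
from typing import Dict, List, Set


def cpm_then_subset(
    counts: Dict[str, List[int]],
    norm_factors: List[float],
    annotated: Set[str],
) -> Dict[str, List[float]]:
    """Scale counts to CPM using full library sizes, then keep annotated genes.

    counts maps a gene ID to its raw counts, one value per sample.
    norm_factors holds one (assumed precomputed) TMM factor per sample.
    """
    n_samples = len(norm_factors)
    # Library sizes are summed over ALL genes, annotated or not.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    # Effective library size = raw library size * normalisation factor.
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    cpm = {
        gene: [row[j] / effective[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }
    # Only now discard the uncharacterised genes.
    return {gene: values for gene, values in cpm.items() if gene in annotated}


# Toy data: two samples, one uncharacterised locus (hypothetical IDs).
raw_counts = {"geneA": [10, 20], "LOC_unknown": [30, 60], "geneB": [60, 120]}
factors = [1.0, 1.0]  # placeholder factors, not real TMM output
subset = cpm_then_subset(raw_counts, factors, {"geneA", "geneB"})
```

The resulting `subset` is what would be passed on to the network step; the unknown locus still contributed to the library sizes but is absent from the output.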
I think for the enrichment part you can choose just these annotated genes. Even if you have known genes with unknown functions, you can still do GO analysis, and that is fairly well accepted. In my experience, when I put my genes into DAVID for analysis, it doesn't recognize some IDs and discards them. These IDs could be pseudogenes or lncRNAs, which are not part of the DAVID annotation, and the results are still acceptable.
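To make the "unrecognized IDs are discarded" point concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test in which both the module and the term are first restricted to the annotated background. It is not how DAVID is implemented (DAVID uses its own modified Fisher's exact test); the function name `enrichment_pvalue` and the gene IDs are hypothetical.

```python
from math import comb
from typing import Iterable


def enrichment_pvalue(
    module: Iterable[str],
    term_genes: Iterable[str],
    background: Iterable[str],
) -> float:
    """P(overlap >= observed) under a hypergeometric null.

    Genes outside the background (e.g. unannotated IDs) are silently dropped.
    """
    bg = set(background)
    mod = set(module) & bg       # module genes that have annotation
    term = set(term_genes) & bg  # background genes carrying the GO term
    N, K, n = len(bg), len(term), len(mod)
    k = len(mod & term)          # observed overlap
    # Upper tail: sum the probabilities of overlaps k, k+1, ..., min(K, n).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Toy example: 10 annotated genes, 4 of them carry the term.
background = [f"g{i}" for i in range(10)]
term = ["g0", "g1", "g2", "g3"]
module = ["g0", "g1", "g2", "LOC_unknown"]  # the unknown ID is ignored
p = enrichment_pvalue(module, term, background)
```

Because the background is the annotated set only, the test asks whether the term is over-represented among the annotated module genes relative to all annotated genes, which is the comparison you end up with once uncharacterised transcripts are removed.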