Question

RNAseq Tximport transcript counts to gene counts

6.0 years ago

s.kyungyong64 ▴ 40

Hi!

I have quant.sf files generated by Salmon by mapping 100 bp single-end Illumina libraries against primary transcripts. The genome assembly for this species is incomplete and is missing some genes of interest, so I am using both transcriptome and genome data for this RNA-seq analysis.

Before passing the quantification data downstream, I need to run Tximport, so I generated a table of transcript IDs and gene IDs from a gff3 file based on the genome annotation. I then realized that many of the primary transcripts are missing from the genome annotation.

I was going to add each of these missing transcript IDs to the table with some arbitrary gene ID. Is this okay? What would be the standard protocol for dealing with this?

Thanks!

RNA-Seq Salmon Tximport • 2.1k views

updated 6.0 years ago by h.mon 35k • written 6.0 years ago by s.kyungyong64 ▴ 40

So I am using both transcriptome and genome data for this RNA-seq.

How are you avoiding double counting entities shared between those two?

6.0 years ago by GenoMax 141k

score 0 · Answer 1 · 2018-05-01

I don't see a simple solution to your problem. Why is the genome missing these genes of interest? Are the genes absent from the reference, or are they present (you can find them with, e.g., blast) but unannotated? Are these genes present in your transcriptome assembly?

The simplest solution: use only the transcriptome. You can use Corset to build a transcript-to-gene map, and you can map the transcripts to the genome and use bedtools intersect, subtract and overlap to annotate the transcripts and to find which annotated genes are present in or absent from your transcriptome, and vice versa.
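Whichever route you take, it helps to see exactly which quantified transcripts have no gene assignment before running Tximport. Below is a minimal Python sketch of that bookkeeping, not a standard protocol. It assumes the gff3 stores transcripts as `mRNA` or `transcript` features with `ID` and `Parent` attributes, and that those IDs match the `Name` column of quant.sf exactly. The function names and the `unannotated_` prefix are made up for illustration. Whether the placeholder genes are appropriate for your analysis is a separate question that the code does not settle.

```python
import csv
import io


def parse_gff3_tx2gene(gff3_text):
    """Return {transcript_id: gene_id} from mRNA/transcript features in GFF3 text."""
    tx2gene = {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only transcript-level features carry the transcript -> gene link.
        if len(cols) != 9 or cols[2] not in ("mRNA", "transcript"):
            continue
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip().split(";")
            if "=" in field
        )
        if "ID" in attrs and "Parent" in attrs:
            tx2gene[attrs["ID"]] = attrs["Parent"]
    return tx2gene


def quant_transcript_ids(quant_sf_text):
    """Return transcript IDs from the Name column of a Salmon quant.sf table."""
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    return [row["Name"] for row in reader]


def complete_tx2gene(tx_ids, tx2gene, prefix="unannotated_"):
    """Pair every quantified transcript with a gene ID.

    Transcripts absent from the annotation get a placeholder gene ID
    (hypothetical convention: prefix + transcript ID) and are also
    returned separately so they can be inspected.
    """
    rows, missing = [], []
    for tx in tx_ids:
        if tx in tx2gene:
            rows.append((tx, tx2gene[tx]))
        else:
            missing.append(tx)
            rows.append((tx, prefix + tx))
    return rows, missing


if __name__ == "__main__":
    # Toy inputs; real files would be read from disk instead.
    gff3 = (
        "##gff-version 3\n"
        "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=geneA\n"
        "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=tx1;Parent=geneA\n"
    )
    quant = (
        "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
        "tx1\t900\t800.0\t10.0\t50.0\n"
        "tx2\t500\t400.0\t5.0\t12.0\n"
    )
    rows, missing = complete_tx2gene(
        quant_transcript_ids(quant), parse_gff3_tx2gene(gff3)
    )
    for tx, gene in rows:
        print(tx, gene, sep="\t")
    print("not in annotation:", missing)
```

The list of missing transcripts is the useful part: those are the sequences to check against the genome (for example with blast, or by mapping and running bedtools intersect) to see whether they are truly absent or just unannotated.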