Question

How to identify RNA-seq mapped reads to reference genome but not identified by featurecounts of Subread package?

0

Entering edit mode

5.4 years ago

bk11 ★ 2.3k

Hi

We have performed RNA-seq of 60 sorted T cells from human and aligned to the reference genome (GRCh38) using HISAT2. About 80% of reads were mapped to the reference genome. However, when featureCounts of Subread package was used to count mapped reads, only 20-25% of mapped reads were counted to be associated with feature provided (exon and gene_id) of the reference gtf file (Homo_sapiens.GRCh38.87.gtf). I do not know what is happening here. I have two questions:

First what can be possible reason for this observation?
Is it possible that we can identify to which region of the reference genome the rest of the reads mapped?

I will appreciate your help.

RNA-Seq featureCounts HISAT2 • 1.9k views

ADD COMMENT • link updated 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 5.4 years ago by bk11 ★ 2.3k

0

Entering edit mode

Another recommendation, check if you used the proper strand selection in feature counts. Try what happens with different strand selectors -s0 -s1 and -s2 in featureCount.

ADD REPLY • link 5.4 years ago by Michael 54k

0

Entering edit mode

When I changed the strand selector from -s2 to -s0 in featureCount, the assigned read count were doubled i.e. from 10-12% to 20-25%. It is still very low exonic reads.

ADD REPLY • link 5.4 years ago by bk11 ★ 2.3k

1

Entering edit mode

That sounds suspiciously low.

Did you include ribosomal RNA genes in the annotation?
double check the annotation version matches the genome
was the gtf file truncated?
check what happens when you count whole transcripts instead of exon, including intronic regions
Indentify intergenic regions of high coverage and inspect them in IGV, etc. what is there?
to do this quickly you could for example 'invert' your gtf file to contain all gaps between exons and genes and run feature counts again

ADD REPLY • link 5.4 years ago by Michael 54k

score 0 · Answer 1 · 2018-12-03

0

Entering edit mode

5.4 years ago

Kristoffer Vitting-Seerup ★ 4.0k

I agree with you - im my experience 20-25% exonic reads are quite low.

Answer 1: There are multiple reasons you can observe a low exonic mapping rate. The main once are DNA contamination and a large number of pre-mRNA, but these are just a few. Another (more technical) is if you quantified your data as strand specific while it is not actually strand specific.

Answer 2: Yes - I would simply use featureCount with the Rsubread R package. Via this R package it is easy to define which regions featureCounts should quantify yourself. I would suggest distinguishing between exonic, intronic and intergenic reads.

Cheers Kristoffer

ADD COMMENT • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Hi, Earlier I defined exon and gene_id features of reference gtf file in the featureCount. It looks like other regions that can be defined are: CDS, five_prime_utr, gene, Selenocysteine, start_codon, stop_codon, three_prime_utr and transcript.

How can I define intronic and intergenic reads?

ADD REPLY • link 5.4 years ago by bk11 ★ 2.3k

1

Entering edit mode

The simplest way is to define introns as the regions between exon and the intergenic regions as those between genes. If you are actually looking for code to do this you should make a new question.

ADD REPLY • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k