How to identify RNA-seq mapped reads to reference genome but not identified by featurecounts of Subread package?
1
0
Entering edit mode
5.4 years ago
bk11 ★ 2.3k

Hi

We have performed RNA-seq of 60 sorted T cells from human and aligned to the reference genome (GRCh38) using HISAT2. About 80% of reads were mapped to the reference genome. However, when featureCounts of Subread package was used to count mapped reads, only 20-25% of mapped reads were counted to be associated with feature provided (exon and gene_id) of the reference gtf file (Homo_sapiens.GRCh38.87.gtf). I do not know what is happening here. I have two questions:

  1. First what can be possible reason for this observation?
  2. Is it possible that we can identify to which region of the reference genome the rest of the reads mapped?

I will appreciate your help.

RNA-Seq featureCounts HISAT2 • 1.9k views
ADD COMMENT
0
Entering edit mode

Another recommendation, check if you used the proper strand selection in feature counts. Try what happens with different strand selectors -s0 -s1 and -s2 in featureCount.

ADD REPLY
0
Entering edit mode

When I changed the strand selector from -s2 to -s0 in featureCount, the assigned read count were doubled i.e. from 10-12% to 20-25%. It is still very low exonic reads.

ADD REPLY
1
Entering edit mode

That sounds suspiciously low.

  • Did you include ribosomal RNA genes in the annotation?
  • double check the annotation version matches the genome
  • was the gtf file truncated?
  • check what happens when you count whole transcripts instead of exon, including intronic regions
  • Indentify intergenic regions of high coverage and inspect them in IGV, etc. what is there?
  • to do this quickly you could for example 'invert' your gtf file to contain all gaps between exons and genes and run feature counts again
ADD REPLY
0
Entering edit mode
5.4 years ago

I agree with you - im my experience 20-25% exonic reads are quite low.

Answer 1: There are multiple reasons you can observe a low exonic mapping rate. The main once are DNA contamination and a large number of pre-mRNA, but these are just a few. Another (more technical) is if you quantified your data as strand specific while it is not actually strand specific.

Answer 2: Yes - I would simply use featureCount with the Rsubread R package. Via this R package it is easy to define which regions featureCounts should quantify yourself. I would suggest distinguishing between exonic, intronic and intergenic reads.

Cheers Kristoffer

ADD COMMENT
0
Entering edit mode

Hi, Earlier I defined exon and gene_id features of reference gtf file in the featureCount. It looks like other regions that can be defined are: CDS, five_prime_utr, gene, Selenocysteine, start_codon, stop_codon, three_prime_utr and transcript.

How can I define intronic and intergenic reads?

ADD REPLY
1
Entering edit mode

The simplest way is to define introns as the regions between exon and the intergenic regions as those between genes. If you are actually looking for code to do this you should make a new question.

ADD REPLY

Login before adding your answer.

Traffic: 2452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6