High number of no feature while using htseq-count on Tophat2 aligned as well as STAR aligned paired-end data
3
0
Entering edit mode
7.9 years ago
candida.vaz ▴ 40

Dear all,

I have paired-end (150bp) data for the chicken Ileum from Illumina TruSeq. I was told that it is first stranded. I removed the adapters through Trimmomatic tool. I then aligned it to the Chicken Genome (Ensembl) galGal4 using STAR aligner as well as Tophat2.

I wanted to quantify the reads using htseq-count, so I sorted the bam file by name.

I used the following parameters for htseq-count :

htseq-count --format='bam' --order='name' --stranded='yes' --minaqual='10' --type='exon' --idattr='gene_id' --samout='S1-RCI001-top-out.sam' S1-RCI001-tophat.sorted.bam genes.gtf > S1-RCI001-tophat-htseq-res.txt

Total number of SAM alignments : 30059627 (when htseq finishes running) On using grep "XF:" on S1-RCI001-top-out.sam I get 56541081 (Not sure which one to use for comparison) Now, the problem is, that I am getting an exceeding amount of "no_features" and maximum of the gene counts are zero

__no_feature    26618544
__ambiguous 1751
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique   3199738

Please help me identify the problem. As far as I know, I used the correct GTF file, sorted the bam file by name, and I am sure that its first stranded so used stranded="yes".

Thanks for your help!

Candida Vaz

htseq-count RNA-Seq next-gen • 5.1k views
ADD COMMENT
0
Entering edit mode

Can you post the output of the following commands?

  • samtools view -H S1-RCI001-tophat.sorted.bam
  • cut -f 1 genes.gtf | sort | uniq

My guess is that you have a mismatch in chromosome names. If the outputs are long, then just post 10 or so lines of each.

ADD REPLY
0
Entering edit mode

Dear Devon,

Here is the output from samtools view:

@HD VN:1.0  SO:queryname
@SQ SN:1            LN:195276750
@SQ SN:10   LN:19911089
@SQ SN:11   LN:19401079
@SQ SN:12   LN:19897011
@SQ SN:13   LN:17760035
@SQ SN:14   LN:15161805
@SQ SN:15   LN:12656803
@SQ SN:16   LN:535270
@SQ SN:17   LN:10454150
@SQ SN:18   LN:11219875

And here is the output from the genes.gtf file 1st column (just top ten lines)

1
10
11
12
13
14
15
16
17
18
ADD REPLY
0
Entering edit mode

OK, then michael.ante's suggestion is likely the correct one. Have a look at the sorted BAM file in IGV. That should give you a better feel for the sort of numbers that htseq-count should be producing.

ADD REPLY
1
Entering edit mode
7.9 years ago
michael.ante ★ 3.8k

Hi Candida,

For TruSeq stranded libraries, use the Htseq-count option 'reverse'. Reads R1 are reflecting the cDNA orientation while the R2 reads are in orientation of the mRNA. If you want to be sure, use RSeQC's infer_experiment.py tool.

Cheers,

Michael

ADD COMMENT
0
Entering edit mode

Hi Michael,

As you mentioned that for TruSeq stranded libraries, one must use the HTseq-count "reverse" option. I was wondering, what the tophat stranded option should be, I have used "fr-firststrand" while mapping with tophat2 as told by the sequence provider. So mapping with tophat2 "firststrand" option and htseq-count "reverse" is fine right? As for the STAR aligner, I guess there is no option to mention strandedness. Thanks, Candida

ADD REPLY
0
Entering edit mode

Hi Candida,

I have to look that up each time. This blog explains it very good.

Cheers,

Michael

ADD REPLY
1
Entering edit mode
7.9 years ago
michael.ante ★ 3.8k

Hi Candida,

in all fields the read count reflects the number of uniquely mapping read - except the "alignment_not_unique" there each alignment is counted (if a read is mapped 5 times, this number will be increased by 5). Thus in order to compute statistics, you may include everything but the last line. For further information, please have a look at HTSeq-count's FAQ section.

Cheers,

Michael

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thanks for the information. I understand that the "alignment_not_unique" should be exempted for computing the statistics.

But to compute the statistics, as for the "yes" option results, to get the percentage of "no_feature" reads should I divide 26618544 by 30059627 (number of SAM alignment pairs processed) or by some other number? If I use 30059627, I get 88.5% no features Similarly, for the "reverse" option results, to get the percentage of "no_feature" reads if I divide 12272882 by 30059627 (number of SAM alignment pairs processed) I still get 41% "no_features".

My next question is how much percentage of "no_features" is acceptable?

Thanks for all your help, Much appreciated. Candida

ADD REPLY
1
Entering edit mode

The no_feature sums up all those reads that are intronic and intergenic - according to your annotation. If you have novel exons, 5' and 3' UTR extension, or novel transcripts, these will contribute to the no_feature. On the other hand, if you have genetic contamination or a lot of retained introns in your sample, these will also contribute.

Thus, you can use the annotated SAM file from HTSeq-count, grep the no_feature reads and analyse them with e.g. the IGV. If you have sufficient clues that most of the reads are novel exons/transcripts etc., you may run cufflinks an your alignments to extend the given annotation (it's called RABT or so).

ADD REPLY
0
Entering edit mode
7.9 years ago
candida.vaz ▴ 40

Hi Michael,

I ran htseq-count using the 'reverse' option and got these results (better than "yes" option): 30059627 SAM alignment pairs processed. __no_feature 12272882 __ambiguous 65301 __too_low_aQual 0 __not_aligned 0 __alignment_not_unique 3199738

To get the percentage of "no-feature" and "alignment-not-unique", should I divide these numbers with this 30059627 (get this number after htseq-count finishes) or with grep "XF:" on the "samout" file 56541081

Best, Candida

ADD COMMENT

Login before adding your answer.

Traffic: 2009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6