Question

Low counts with htseq-count?

0

Entering edit mode

7.3 years ago

longoka ▴ 40

I'm working with single-end Illumina RNA-seq data. After producing BAM files using tophat against hg19, I'm running cufflinks (against hg19 knownGene) and subsequently htseq-count (in order to generate count data for use in DESeq2).

The BAM files are aligning between 35-40M reads per sample (>90% of total reads in each case), and the BAMs look good in terms of alignment to the reference.

However, I am seeing htseq counts in the region of 300,000 to 600,000 reads (representing ~7000 transcripts), far below the total number of reads, and certainly what appear to be visually acceptable when viewing the BAM against the hg19 reference with gene annotations.

Cufflinks is producing FPKM values for ~21,000 transcripts which, in contrast with the htseq-count output, makes me think that htseq-count is missing something, or I am missing something and I have not configured it correctly.

Why are my htseq counts so low?

Any help appreciated.

htseq-count cufflinks hg19 • 4.0k views

ADD COMMENT • link 7.3 years ago by longoka ▴ 40

0

Entering edit mode

An example feature read-out table:

__no_feature    24741285
__ambiguous 507184
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  14618731

ADD REPLY • link 7.3 years ago by longoka ▴ 40

0

Entering edit mode

Would be helpful to add the commands you used for mapping and htseq count.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

I'm using usegalaxy.org right now, so operating completely within the browser:

tophat:

Tool: TopHat
Number: 2
Name:   SA01
Created:    Sat 24 Dec 2016 05:53:50 PM (UTC)
Filesize:   1.1 GB
Dbkey:  hg19
Format: bam
Galaxy Tool ID: toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/2.1.0
Galaxy Tool Version:    2.1.0
Tool Version:   TopHat v2.1.0
Tool Standard Output:   stdout
Tool Standard Error:    stderr
Tool Exit Code: 0
History Content API ID: bbd44e69cb8906b5280e413d5f414089
Job API ID: bbd44e69cb8906b5d402ec009420f395
History API ID: c9a1d88f225bdc8e
UUID:   ead9f1dd-eeb4-4de5-8db5-27f4c439df0b

Input Parameter Value   Note for rerun
Is this single-end or paired-end data?  single  
RNA-Seq FASTQ file  5: FASTQ Groomer on data 1  
Use a built in reference genome or own from your history    indexed 
Select a reference genome   hg19    
TopHat settings to use  preSet  
Specify read group? no  
Job Resource Parameters no  
Dependency  Dependency Type Version
bowtie2 tool_shed_package   2.2.5
tophat  tool_shed_package   2.1.0
Inheritance Chain
SA01
↑
'TopHat on data 5: accepted_hits' in SA01

htseq-count:

Tool: htseq-count
Number: 25
Name:   htseq-count on data 3 and data 1
Created:    Thu 29 Dec 2016 07:07:48 PM (UTC)
Filesize:   ??? bytes
Dbkey:  hg19
Format: tabular
Galaxy Tool ID: toolshed.g2.bx.psu.edu/repos/lparsons/htseq_count/htseq_count/0.6.1galaxy1
Galaxy Tool Version:    0.6.1galaxy1
Tool Version:   None
Tool Standard Output:   stdout
Tool Standard Error:    stderr
Tool Exit Code: None
History Content API ID: bbd44e69cb8906b5b69805bb4395ee28
Job API ID: bbd44e69cb8906b527ec8e191b7a648a
History API ID: 6e51d14a043dddb1
UUID:   2eb9cd05-a6c2-4b30-8369-94a257f2e247

Input Parameter Value   Note for rerun
Aligned SAM/BAM File    1: SA01 
GFF File    3: UCSC Main on Human: refGene (genome) 
Mode    Union   
Stranded    Yes 
Minimum alignment quality   10  
Feature type    transcript  
ID Attribute    gene_id 
Additional BAM Output   False   
Force sorting of SAM/BAM file by NAME   False   
Dependency  Dependency Type Version
htseq   tool_shed_package   0.6.1
numpy   tool_shed_package   1.7.1
samtools    tool_shed_package   0.1.19
pysam   tool_shed_package   0.7.7
Inheritance Chain
htseq-count on data 3 and data 1

ADD REPLY • link updated 7.3 years ago by WouterDeCoster 47k • written 7.3 years ago by longoka ▴ 40

0

Entering edit mode

Is your RNA-seq stranded?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

1

Entering edit mode

rna-seq is stranded, yes

I'll add that when looking at the BAM against hg19, it's clear that a majority of reads are exonic

also, setting strandedness to 'no' results in zero counts

Am I missing the nuance between using the cached hg19 reference for tophat, and the hg19 knownGene refGene gtf file that I used for htseq (which I ported-over from UCSC table browser)? Are these two references incompatible?

ADD REPLY • link 7.3 years ago by longoka ▴ 40

0

Entering edit mode

Based on the previous comment, I rechecked my tophat pipeline; you were correct that the data is unstranded and, hence, my initial setting for htseq (Stranded = Yes) was incorrect. Thanks for pointing me in the right direction.

I also did some digging and found this previous post, that clarified the issue for me a bit more.

ADD REPLY • link 7.3 years ago by longoka ▴ 40

0

Entering edit mode

Well that already explains the low counts since half of the reads were ignored given those were in the opposite direction of the annotation. Good luck with the rest of your project.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k